I&A Tech Solution Architecture Guidelines
Section 3.7 - Global or Country Setup
Section 3.8 - Job Management
Section 3.9 - Data Integration Architecture
Section 4 - Information security
Section 4.1 - Environment and data access management
UDL & BDL Access Management
Section 4.2 - Encrypting Data-At-Rest
Section 4.3 - Encrypting Data-in-Transit
Section 4.4 - Security on SQL Database/DW
Section 4.5 - Data Lake Audit
JML Audit Process
Section 5 - Cost Management
Section 5.1 - High Level Estimate - Questionnaire
Section 5.2 - Cost Optimization Methods
Technical Implementation - Cost Optimization
Section 6 - New Foundation Design - Azure Foundation 2018
Section 6.1 - Express Route – Setup and Details
Section 6.2 - I&A Subscription Design
Section 6.3 - Product migration
Section 7 - New Tool Evaluation
Section 7.1 - Data Share
Section 7.2 - Snowflake
Section 7.3 - Synapse Analytics
Section 6 - New Foundation Design - Azure Foundation 2018
1. Manasa Sampya
2. Hemalatha B
3. Sumi Nair
4. Indira Kotennavar
5. Vishal Gupta
6. Niranjan Waghulde
This section is broken down into the following solution areas:
Logical Architecture
The Logical Reference Architecture defines the applications approved to support the conceptual services.
Physical Architecture
The Physical Reference Architecture defines the products available to support those applications.
The Old Foundation Design was the landscape and network design created when the Unilever Azure journey started.
Because Unilever is constantly improving the networking and security layers, a new foundation setup/design has been
created.
The New Foundation Design includes components similar to those in the Old Foundation Design, but the platform is made
more secure and the networking is improved.
The Azure platform hosted for Information & Analytics covers four main layers.
UDL
UDL is the single shared capability that will act as the backbone for all analytics work in Unilever
All master and transactional data scattered across business systems
Data from true source systems
Data in its native form (its source form), with time series kept beyond the duration that sources will maintain
BDL
To begin with, these are Unilever business functional specific data lakes that slice and dice data and
perform calculations, summarisation and aggregation (all the verbs that describe curating data that is
unique to a business function)
Unilever functional boundaries like CD, CMI, Finance, Supply Chain, R&D, HR, etc.
Sharable KPIs and sharable business logic.
Product
While the business data lakes provide data that is curated to specific business areas, they are not sufficient for
decisions that are cross-business, such as CCBTs or markets, where cross-business processed data is required to deliver
specific KPIs or answer granular questions to drive insights. Products deliver insights via dashboards or data
science models built on the data and assist decision making.
Experiment
Time-bound environment provisioned to prove business use cases, which can be analytical or data
science. Experiments can take data from the UDL and BDL by following the right governance process.
As data, both internal and external, grows exponentially, the traditional methods of storing vast volumes of data
in a database or data warehouse are no longer sustainable or cost effective.
The data lake is a concept that emerged on the back of big data, in which various types of data are stored in their native
form within one system. As with a (water) lake, a data lake is hydrated (populated) via many data streams: master data,
enterprise transactional data, social feeds, and 3rd party data that an enterprise wants to co-mingle with internal data,
irrespective of the shape (structured and unstructured) and volume of the data. The most significant advantage is that there
is no constraint on how big a data lake can be or how many varieties of data we can hydrate into it.
Combining the data lake storage idea with new high-performing cloud processing capabilities makes the data lake a
desirable and cost-effective solution for vast enterprises like Unilever to leverage as a foundation capability for building
complex data, information and analytics solutions.
The Unilever data lake strategy is built on a layered approach to ensure we maintain enterprise scale while giving agility
for businesses to define analytics solutions leveraging the data lake concept. It comprises three layers: Trusted Zone,
TechDebt and Project Specific Landing Zone (PSLZ).
The UDL is hosted in Azure Data Lake Store (ADLS). As of today, the UDL, BDL, TechDebt and PSLZ share the same
ADLS and are organised and managed using different folders.
The UDL will be hosted in one ADLS consisting of Trusted and Tech Debt folders. The PSLZ (project specific landing
zone) will exist as part of each PDS/Product. Although all three layers are logically part of the UDL, the PSLZ
will physically sit as part of the respective PDS.
Datasets which are not from true data sources, or are not in raw form (may be aggregated), are considered for
TechDebt, as they are not compliant with the Data Lake Strategy and will need to be retrofitted to become
compliant (i.e. the source changed to a trusted source, and the data stored following the UDL/BDL strategy).
Technical debt will also be identified, organised, managed and secured by folders.
Datasets which are local and specific to the product and do not need to be shared across products will reside
within a 'Project Specific Landing Zone' folder. Project specific data will be identified, organised, managed and
secured by folders.
Every business function will have a Business Data Lake – CD, R&D, Finance, Supply Chain, HR, MDM and
Marketing.
Data will be read from the UDL, transformed and landed into the BDL using Azure Databricks (ADB); this process
will be orchestrated using Azure Data Factory (ADF).
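As an illustration of this pattern, a minimal Databricks (Scala) sketch is shown below. The mount points, column names and KPI logic are hypothetical; the real paths and schemas are defined by the UDL and BDL teams.

import org.apache.spark.sql.functions._

// Hypothetical UDL (raw) and BDL (curated) ADLS paths - actual mount points are project specific
val udlPath = "/mnt/udl/trusted/supplychain/orders"
val bdlPath = "/mnt/bdl/supplychain/kpi/order_fill_rate"

// Read the raw UDL data, apply the business logic and land the curated result in the BDL
val ordersDF = spark.read.parquet(udlPath)

val fillRateDF = ordersDF
  .groupBy("plant_code")
  .agg(
    sum("delivered_qty").as("delivered_qty"),
    sum("ordered_qty").as("ordered_qty"))
  .withColumn("fill_rate", col("delivered_qty") / col("ordered_qty"))

fillRateDF.write.mode("overwrite").parquet(bdlPath)

In this pattern the notebook contains only the transformation; the ADF pipeline triggers it on schedule and handles dependencies.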
These are used to create shareable KPIs & calculations, summarisations and aggregations of data sets within that
function.
Data is preserved in its curated form along with any required master data.
BDLs will be shared across products that are building analytics capabilities.
I&A will own (or via proxy) the definitions for curated data within BDL
As each function owns its respective BDL, every data set/KPI ingested into the BDL will be governed, managed and
catalogued by the respective BDL team. Data history, retention and granularity will be defined by BDL SMEs based on
the global product requirements.
The business data lakes, while providing data that is curated to specific business areas, will not be sufficient to make
decisions that are cross-business, like CCBTs or markets, where cross-business processed data is required to deliver
specific KPIs or answer granular questions to drive insights. To facilitate granular and meaningful insights, it is
required to develop analytics products where we will deliver insights via dashboards to assist decision making,
respond to natural queries to augment decision making, and offer on-demand trusted insights via pushed data for
use in automated decision making. For optimal and timely delivery of insights, the strategy envisions the concept of
Product specific data stores (PDS). These PDSs will pull (and push where needed) the necessary data from all the
associated BDLs or the UDL. We also envision that a multi-geo rollout of a product may have geo-specific PDSs above
and beyond a core PDS. The only constraint on the PDS is that the data in one PDS is not shareable with
another Product.
In future, if we identify any product specific functionality and data that we can share across other products, we will
promote that functionality along with the data to the respective BDLs so that we avoid duplication.
We will preserve the data processed in this layer in its processed form along with any required master data. I&A will
build, support and act as custodian of these PDSs and the corresponding products with the agility to ensure we deliver
new product capabilities at the pace required for business decision making. I&A will own (or own via proxy) the definitions
for the data elements within the PDS and build the business glossary and data catalogue as the BDLs take shape.
Experimentation Area
The experimentation environment is created for feasibility or quick business use case piloting. The environment is time
bound with a validity of 1-3 months, with the required TDA and business approvals.
Types of environments:
The UDL and BDL will be hosted in Azure Data Lake Store (ADLS). The UDL is the single shared capability that will act as
the backbone for all analytics work in Unilever.
The Unilever data lake strategy is built on a layered approach to ensure we maintain enterprise scale while giving agility
for businesses to define analytics solutions leveraging the data lake concept. It comprises the Trusted Zone,
TechDebt and Project Specific Landing Zone (PSLZ).
The UDL is hosted in Azure Data Lake Store (ADLS). As per the current approach, there is only one ADLS
instance, which is used for all layers of the Universal Data Lake. UDL, TechDebt and PSLZ are separated,
organised and managed using folders. This design is being reviewed to simplify the PSLZ layer and manage it within the
respective PDS. Although the PSLZ is logically part of the UDL, for better management it will be hosted as part of
each PDS layer.
The Trusted zone consists of the unfiltered/full data from trusted/true sources in its most granular format. True
sources like ECC data will be hosted in the Trusted zone. Some external data, like Nielsen or retailer data, will
also be hosted in the Trusted zone.
Datasets which are from untrusted data sources, or are not in raw form (i.e. aggregated/filtered), are considered
for TechDebt, as they are not compliant with the Data Lake Strategy and will need to be retrofitted to become
compliant (i.e. the source changed to a trusted source, and the data stored following the UDL/BDL strategy). Example
sources are Teradata, Merlin etc.
Technical debt will also be identified, organised, managed and secured by folders.
Datasets which are local and specific to the product and do not need to be shared across products will only reside
within a 'Project Specific Landing Zone' folder, which resides at the same level as the UDL, TechDebt and
BDL and on the same ADLS instance.
Project specific data will be identified, organised, managed and secured by folders.
Any data planned for the PSLZ requires the approvals below.
Approval and justification from the Data Owner/Data Expertise to scope it to PSLZ
Reason why the data cannot be added into UDL Trusted/Tech Debt.
Clear commitment from the project team to retrofit to Trusted / Tech Debt layer whenever the data is
available in these layers
ICAA / Security assessment of data set to classify the data into different categories (restricted, confidential,
internal, personal, personal sensitive)
Data in the PSLZ is not sharable in any case. If similar data is required for another project, then that project needs
to bring the data again into its respective PSLZ project folder.
It is imperative that a logical, organised folder structure is in place within the Data Lake, to ensure it remains a Data
Lake and does not become a Data Swamp. The organisation of the UDL follows an industry best practice
framework, providing a logical, easy-to-follow structure for both developers and support staff alike. There are 8 levels
in total, described below:
This structure has been chosen because it is immune to organisational changes. The only time the structure would
need to be amended is if a source is changed or added.
It will be the single point of ingestion for all internal and 3rd party data (structured and unstructured) and will act as
the single shared capability backbone for all analytics work in Unilever.
This layer will hold a complete set of data (i.e. no rows filtered, no columns removed) so there should be no need to
ever go back to the source system to provide the same/similar data. This layer will hold at least as much data as is
required by the functions, and as storage is inexpensive, more could be kept if deemed beneficial.
The standard way Spark handles date partitions is to prefix the date related folders with the date partition it relates
to, for example ‘year=2018’, ‘month=01’, ‘day=01’. As Spark is the primary data processing engine which Unilever
will use, particularly for data ingestion, this prefixing has influenced the naming conventions for the date related
folders in UDL/BDL.
Architecture Standards:
Folder names for dates must be exactly as shown above (i.e. 'yyyy=' rather than 'year='). This is because a
partition name cannot be the same as a column name (it causes issues in Spark processing), and a data set might
have a column called 'Year', which would cause issues.
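For illustration, a minimal Spark (Scala) sketch of a write that follows this folder convention; the dataset path and the load_date column are hypothetical.

import org.apache.spark.sql.functions._

// inputDF is assumed to be an already loaded DataFrame containing a load_date column
// Derive partition columns named yyyy / mm / dd (not year / month / day),
// so the partition names never clash with business columns such as 'Year'
val partitionedDF = inputDF
  .withColumn("yyyy", date_format(col("load_date"), "yyyy"))
  .withColumn("mm", date_format(col("load_date"), "MM"))
  .withColumn("dd", date_format(col("load_date"), "dd"))

// Spark writes folders such as .../yyyy=2018/mm=01/dd=01
partitionedDF.write
  .mode("append")
  .partitionBy("yyyy", "mm", "dd")
  .parquet("/mnt/udl/trusted/ecc/orders")   // hypothetical dataset path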
The BDLs will be hosted in ADLS and there will be one per function – CD, CMI, R&D, Finance, Supply Chain and
Marketing (TBC Master Data or others). Data will be read from the UDL, transformed and landed into the BDL using
Spark processing; this process will be orchestrated using Azure Data Factory. Business functional specific data
lakes:
Are used to create shareable KPIs & calculations, summarisations and aggregations of data sets within that
function
Data is preserved in its curated form along with any required master data.
BDLs will be shared across products that are building analytics capabilities
I&A Technology (or the IT functional platforms within ETS) will develop, support and act as custodian to
these.
I&A will own (or via proxy) the definitions for curated data within BDL
Extensive data cataloguing should be maintained for all KPIs published in the BDL.
BDLs should also catalogue the consumers of the data, i.e. the PDSs consuming the data.
As each function will have its own BDL, each function will determine the retention policy for data within its
BDL. This retention will be supported/challenged by I&A, as proxy for the function, to ensure each has enough data
to meet its requirements.
These retention policies will be published so product owners can request a longer retention policy if needed by their
current/future products.
The folder structure for the Business Data Lake follows the same principles as for the UDL, but has fewer levels.
The structure of the BDL will be:
1. Top Level – Total Company
2. UDL and BDL
3. Functional
4. Process
5. Sub-process
This can be tailored to meet the needs of the function if appropriate.
Data Persistence:
Azure Data Lake Storage (ADLS) is where data is physically stored (persisted) within the BDL.
Azure Data Factory v2 (ADF) will be the primary tool used for the orchestration of data between UDL and BDL.
Data Curation:
Azure Databricks (ADB) is the recommended processing service to transform and enrich the data from the UDL,
applying the business logic required for writing the data back to the BDL.
BDL ARCHITECTURE
Business data lakes provide data that is curated to specific business areas. BDLs are not sufficient to make
decisions that are cross-business, like CCBTs or markets, where cross-business processed data is required to deliver
specific KPIs or answer granular questions to drive insights.
Product specific data stores (PDS) will be developed to facilitate granular and meaningful insights, which deliver
insights via dashboards to assist decision making, respond to natural queries to augment decision making. These
PDSs will pull (and push where needed) necessary data from all the associated BDLs.
1. To avoid duplication of data, PDS ARE NOT allowed to share data from one PDS to another PDS.
2. Data science specific PDS are allowed to write data back to BDL so that it can be used by other PDS
The experiment environment is a time-bound environment provided for quick piloting of a solution. There are two types
of experiment environments:
Data Science Experiment Environment: An experiment environment provided for piloting data science use
cases. There are two types of environments provided here.
IaaS Environment: Azure Data Science VMs are provided here, hosted in a private
network. This is mainly provided to data scientists who are comfortable piloting in a desktop-like
environment. IaaS VMs cannot be productionized. Once the pilot is complete and the use case can be
industrialized, the work has to be moved into a PaaS environment.
PaaS Environment: The PaaS environment consists of only PaaS tools for experimentation. Azure
Databricks and ML Service are the two approved data science tools provided in this environment.
Going with the PaaS environment is suggested to avoid migration effort during industrialization (which
is higher for the IaaS environment).
Analytics Experiment Environment: The analytics experiment environment is mainly used to pilot reporting/
BI applications. This environment consists of only approved PaaS tools.
Approved components: Refer to Section 2 - Approved Components for information on the approved PaaS tools.
Market Environment
The market environment is provided for market use cases where delivery of the product is managed by the respective
market and not by the I&A delivery team. The market development environment is a quick environment provided in order to
start development activities while the required process is being sorted out. Every market use case should be
aligned with the respective I&A functional team to check for duplication of demand and alignment on the delivery process.
The market environment should follow the same process as any product environment. A Dev environment will be
provisioned to start with, and the project can bring its own data to pilot and test. The industrialization process for a market
environment is similar to that of any product environment. All data consumed by the market environment has to flow
through one of the layers of the UDL. The project should follow all the architecture standards and complete the UDL process
in order to get the next set of environments for industrialization.
Approved components: Any approved component listed in Section 2 - Approved Components can be used
in the market environment. The environment owner needs to make sure the tools are used as per the approved purpose.
Any deviation from the standards needs to be corrected by the project team before industrialization.
1. Development Environment: Development environment is for Build activities. Members of the Developer
and Dev-Ops security groups can build and deploy components and code in this environment.
2. QA Environment: The QA environment is a locked-down environment, mainly used to test the application.
Deployment to the QA environment is done using CI/CD pipelines built as part of Azure DevOps.
Deployment and integration testing can be carried out in the QA environment.
3. UAT Environment (optional): The UAT environment is for the business users/SMEs to validate the functionality
/requirements. This environment is also used for performance testing. This is also a locked-down environment
and the only way to deploy code is through CI/CD release pipelines.
4. PPD (Pre-Prod) Environment: The primary purpose of the PPD environment is to provide an area where
hotfixes can be created and tested by members of the DevOps group. A hotfix only applies to an application
that is live in its production environment. Using the PPD environment, the DevOps team can support, fix and test a fix
for issues from the production system.
5. Production Environment: Access to this environment is restricted. All releases must be performed from the
production CI/CD release pipeline against an approved RFC. The release team will perform the release and
make sure all pull requests are complete and in order.
The UDL is the single shared capability that will act as the backbone for all analytics work in Unilever. All master and
transactional data scattered across those 20+ business systems (true sources) is stored here. Data is stored in its
native form (its source form), with time series kept beyond the duration that sources will maintain.
Note: Other than the above 3 Azure PAAS components, no other components are approved in UDL
BDLs are Unilever business functional specific data lakes that slice and dice data and perform calculations,
summarization and aggregation (all the verbs that describe curating data that is unique to a business function).
They are aligned to Unilever functional boundaries like CD, CMI, Finance, Supply Chain, R&D, HR, etc.
Note: Other than the above 3 Azure PAAS components, no other components are approved in BDL
The business data lakes, while providing data that is curated to specific business areas (as defined above), will not
be sufficient to make decisions that are cross-business, like CCBTs or markets, where cross-business processed data
is required to deliver specific KPIs or answer granular questions to drive insights.
To facilitate granular and meaningful insights, ETS will develop analytics products where we will deliver insights via
dashboards to assist decision making, respond to natural queries to augment decision making, and offer on-
demand trusted insights via pushed data for use in automated decision making.
Provisioning Process
Analytics products built on the I&A Azure platform should be aligned with the respective Business Function I&A director.
Once the alignment is done, the product needs to follow the process below for provisioning of the environment.
The Technical Design Authority (TDA), consisting of Azure architects (solution architect, cloud architect, EC, Security,
UDL, Landscape), will review the architecture presented by the solution architect and sign off on the architecture for
provisioning.
Product Industrialization
Reach out to the relevant architect for your business function if you have any questions.
Experiment Environment
Requirements for requesting an Experimentation Environment
The experiment environment is a time-bound environment provided for quick piloting of a solution. The experimentation
environment is created for feasibility or quick business use case piloting. The experiment environment is time bound
with a validity of 1-3 months, with approvals from work level 3.
Self sign-off from the experimentation team is mandatory to make sure all the rules listed by security are followed. Every
individual working in the experiment environment has to go through the security document and sign off on following
the same.
PAAS: Experimentation project should not create any more SQL components apart from what is provisioned
by Landscape.
PAAS: The project should not share SQL service account credentials with anyone other than those approved by the
responsible Unilever PM.
PAAS: The firewall should be "On" for all resources and no IP whitelisting is allowed.
IAAS: It is the responsibility of the user to make sure no new tools that are against Unilever policies are installed
on the IaaS VMs.
Self sign-off holds good only for data sets with a classification of Confidential and below. Usage of Restricted or
Sensitive data sets in the experimentation space has to go through an ISA (Information Security
Assessment).
Restricted and Sensitive data should be encrypted in transit and at rest, in all layers of the architecture.
All cost on Azure is based on usage (pay-as-you-go), i.e. the number of hours the environment is used. Find the
cost guide here
DATA ACCESS:
Work with the I&A Data Expertise team to identify the data in the UDL and BDL and request approval.
UDL Access : Reach out to UDL Dev Ops team to get access on UDL data.
BDL Access : Reach out to respective BDL owner for any data in BDL.
The project team or user has to fill in the Experimentation Environment Request Template and share it with the
TDA team to approve the experimentation environment provisioning.
Market Environment
Overview
The market development environment is created for markets to build products quickly, with or without involving the
I&A standard delivery teams/SI partners. The market environment should be aligned with the respective Business Function
director. The industrialization process for a market environment is similar to that of any product environment.
Market environments are provided to markets for their development activities only when there is a clear
industrialization path defined for the use case; otherwise the team needs to go with an experiment environment.
Market environments need to follow the same process as any development environment. The only
difference is that delivery is not managed by the I&A delivery teams.
The standards/guidelines below will be shared with market teams and need to be taken care of.
Environment Provisioning process
Standard architecture
Standard Components and approved usage
Architecture Guidelines for components
Data Access and approvals
Security guidelines to be followed
Environment guidelines
Industrialization process
Markets are fully responsible for taking care of all the guidelines mentioned. In case the standards are not
met, industrialization of the environment is not allowed.
Access controls are similar to those followed today for project environments
No support will be provided if the team is not using the components as per the approved usage.
The business team needs to work with the I&A team to align on the use case and to identify the solution categorization
as a Market Built solution.
New Environment
Dev Environment - TDA approval mandatory (TDA meetings are conducted only on Tuesdays)
Project BOSCARD and Alignment with I&A delivery director (if not run by I&A Delivery)
Functional Requirement – Use case details
Scoping Initiation Mail – For all identified data with Data Architect and Data SME
ICAA process initiation Mail
Self Sign off by Project on Security Standards
TDA approval from Architect: with Architecture Artifacts - SLA < 5 days (TDA review
happens every Tuesday)
TDA approved architecture will be shared with project with all components and
connections defined. (Jira Entry into TDA project)
Additional Environment – TDA approval Mandatory
Approved ICAA
Approval from Data SME on Scoping, DC and DMR for UDL and BDL
Exceptional approval and retrofit alignment for any PSLZ data.
Gate 1 & Gate 2 Checklist completion.
TDA approval from Architect: with Architecture Artifacts - SLA < 5 Days (Once all the above
is provided) - TDA review happens every Tuesday.
TDA approval for further environments – With Jira Entry into TDA project
Existing Environment:
Additional components for an existing environment – offline approval from the Architect, if the required
component is from the approved list – SLA < 2 days
If the component is from the approved list, the Architect can approve offline.
If the component is new and not from the approved list, then the architecture has to go through TDA
approval.
GATE PROCESS:
STANDARD ARCHITECTURE
Here is the Market based architecture template for Market Led project environment.
Please refer to Section 2 - Approved Components for details about approved components for PDS
Please refer to Section 2 - Approved Components for the right practices for each component.
For development purposes the project can bring its own data, but industrialization is allowed only through the UDL.
Integration patterns with source systems for bringing in any new data have to be aligned with the UDL team and
the architecture team.
Market project team needs to work with security to derive data classification for the data sets brought directly
into project environment. (ICAA)
Restricted and sensitive data needs to be encrypted at all layers (at rest and transit). PII data needs to go
through Data protection Impact Assessment (DPIA) with security.
All components hosted in Azure must have the firewall ON. Access is allowed only for Unilever IDs through
MFA-enabled AD groups.
Market Project team is not allowed to share any data from this environment with any other application or
external system.
Market Project team should not provide any access or credentials of the environment to any user who is not
supposed to view the data.
It is the responsibility of the project to make sure data is accessed by the right users. Make sure the team has gone
through the security guidelines listed in the link below.
In case of any confusion about the security guidelines, the project team should contact either the security or the
solution architecture team.
Read the document Security Guidelines and sign off on the security.
ENVIRONMENT GUIDELINES
INDUSTRIALIZATION PROCESS
ML: b{regionCode}-{platform}-{env}-{ITSG}-unilevercom-ml-{NN}, e.g. bieno-fs-d-54321-unilevercom-ml-01
FS – Cloud
CE – Consumer Engagement
FR – Frontier
CS – CD-SAP and O2
ES – eScience
PL – PLM
GF – Global Finance
WP – Workplace
HR – HR
CF – Corporate Functions
MD – Master Data
CA – Central Analytics
CL – Collaboration
NT – Integration
SE – Security
DV – Devices
NE – Networks
SA – SAP
There are a few standard naming conventions which apply to all elements in Azure Data factory.
A linked service connects data from a source to a destination (sink), so the names of linked services are similar
to those of datasets.
Dataset Naming:
The below table has a column for Linked services and Datasets.
PIPELINE NAMING
PL_[<Capability>]_[<TypeofLoad>]_[<MeaningName>]
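For illustration only, hypothetical names following this pattern could be PL_SC_Incremental_SalesOrders for an incremental Supply Chain load, or PL_Finance_Full_MaterialMaster for a full load.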
Note:
ACTIVITIES NAMING
Note:
FMT is a file management tool (web app) that is used to upload manual files to a blob storage in Azure.
Pre-requisites
The users should have AD group created for their project in the format Sec-Azo-FMTTool-<ProjectName>.
Please note that each project needs to have a unique AD group
Only the team members who are part of the AD group (mentioned in point 1) will have access to upload the files.
The user should provide the DMR/Schema in the attached format to the Unilever Data Architect Team for
approval
The user needs to share the attached DMR & Data Catalogue (in the approved format only) with the
Development/DevOps team for the FMT tool admin to upload it from the backend.
Data files will be validated against the schema/DMR.
1. Please select the "Data Catalogue" tab to input a data catalogue entry. Please note that all the fields are
mandatory.
2. Please make sure there are no empty rows above the header row (data should start from A1)
3. Please note that for every project there is a unique AD group. Kindly input the same. If you do not have the
AD group information for your project, kindly make sure to create an AD group.
4. Please make sure the Project+DataSet+DMR-Entity combination is unique
5. For all naming conventions (except the AD group), kindly do not use a hyphen "-"; consider using an underscore
"_" instead
1. For each Data Catalogue entry, please make sure there is a corresponding DMR/ schema entry
2. Please select the "DMR Or Schema" tab to make the DMR entry. Please note that all the fields are
mandatory.
3. Please make sure that there are no empty rows above the header row (data should start from A1)
4. The supported datatypes are as shown in the table below. Kindly choose from these.
S No | Datatype | Format
3 | char | normal
4 | varchar | normal
5 | nvarchar | normal
6 | date | normal
7 | time | hh:mm:ss
Key Highlights:
The (user) project team needs to create an AD group specific (unique) to that project
Kindly follow this format for the naming convention: "Sec-Azo-FMTTool-<ProjectName>"
Please make sure that all the members who need access to the tool are added to the created AD
group.
The upload page allows the user to upload a file (one at a time) for validating against a pre-defined schema
STEPS TO BE FOLLOWED
If the user is unable to see the respective project/dataset/schema information, please ensure that the
schema (DMR) and data catalogue have been approved by the Unilever Data Architecture team (TDA). Post
the approval, please send the DMR and Data Catalogue to the FMT admin team to load it from the backend
Key Highlights:
The tool accepts only flat files (.txt), Excel files with a single worksheet (xls and xlsx) and delimiter-separated
files (comma, tab and pipe)
Once the file has been uploaded successfully, the user can check the status of the uploaded file here.
TYPES OF STATUS
Waiting for Validation – The tool is busy validating other files & the user file is in queue
Validating – The file has been picked by the tool and is being processed against the selected schema to
validate
Processed – The file has passed all the rules as mentioned in the schema selected and is processed
successfully
Failed- The file has failed for one or more rule(s) defined in the selected schema
Unexpected system error – There is an unexpected system error: the tool is not working
properly, the uploaded file was not found in the Azure Blob container for validation, or the schema against
which the validation has to be performed is not present
STEPS TO BE FOLLOWED
1. By default, the user can view the status of the latest uploaded file on this page (last uploaded shows first in
queue). However, the user can also sort and filter the records as per requirement.
2. The user can sort and filter the records by selecting any of the 7 dropdowns to apply the filter based on
which the upload status of the uploaded file will be shown.
3. Click Apply Filter to see the status of the files uploaded based on the selections made in the top section.
4. View Details mentioned in the status table can be clicked to understand the status better
Filter Fields | Description
Project | Lists all the projects that the user is allowed to access
File Schema Name | The file schema name related to the project which was selected at the time of upload
Source Details | The source of the data. For example: All region, Cordillera etc.
Extraction Frequency | The frequency at which the data is extracted from the source system. For example: weekly, monthly, annually etc.
Upload Status | Lists the status of the file upload process, from which the file has to be selected. For example: Waiting for Validation, Validating etc. The list of upload statuses is provided in Types of Status
These read-only fields are auto-populated once the selections are made in the top section of the Upload Status
Page. These help the user to get a better understanding of the status of the uploaded file.
Status Table
Status Headings | Description
Uploaded On | Shows the server date and time when the file is uploaded
Status | The user can view whether the file is pending validation, has been validated and uploaded successfully, or has failed validation, as shown in Types of Status
Project | The user can view the project for which the file has been uploaded
Source Details | The source system from which the data in the uploaded file is taken, e.g. Cordillera
Extraction Frequency | The frequency at which the data is extracted from the source system. For example: weekly, monthly, annually etc.
Uploaded By | The email id of the user who has uploaded the file
View Details in the upload status page can be clicked to understand the details of the uploaded file.
Waiting for Validation – The tool is busy validating other files & the user file is in queue waiting to be picked
up for validation
Validating – The file has been picked by the tool and is being processed against the selected schema to
validate
Processed – The file has passed all the rules as mentioned in the selected schema and is processed
successfully
Failed- The file has failed validation and errors will be shown with respect to row and column
Unexpected system error – The file could not be validated due to an unexpected system error: the tool is not
working properly, the uploaded file was not found in the Azure Blob container for validation, or the
schema against which the validation has to be performed is not present
Steps to be followed:
By default, the user can view the status for 10 files uploaded by him. However, the user can choose the
number of records that can be viewed at a time and change it using a dropdown to 25, 50 and 100 as well.
The time of upload shows the time of the server during which the file has been uploaded.
The user can view the status of the files which are uploaded only by him, and not by other users.
Only 30 days of history of the files uploaded by users is maintained
Audit Log
The history of the files uploaded can be viewed in this section. User can view all the operations that the uploaded
file is going through before being ingested into UDL.
1. File Upload : The file is uploaded into the tool and waits for validation
2. File Validation: The file is validated against the File Schema selected during the time of upload and checked
for errors.
3. File Processed/ Failed: If errors are found in the file, the file is moved to the Failed Zone and, if the file has
been successfully validated, it is moved to the Processed Zone and thereafter, into UDL.
Steps to be followed:
1. By default, the user can view the logs of the latest uploaded files on this page (last uploaded shows first in
queue). However, the user can also sort and filter the records as per requirement.
2. To sort and filter, the user needs to select any of the 3 dropdowns based on which the logs of the uploaded
file will be shown
3. Click Apply Filter to see the logs against the status of the files uploaded based on the selections made in the
top section
Filter Headings:
Search Filter Fields | Description
Upload Status | Lists the status of the file upload process, from which the file has to be selected. For example: Waiting for Validation, Validating etc. The list of upload statuses is provided in Types of Status

Log Headings | Description
Time | Shows the server date and time when the file is taken for an operation. View File Operations.
Uploaded By | The email id of the user who has uploaded the file
Status | The user can view the status of the file. View the types of status in Types of Status
The user can view 10 files at a time, on a single page. The view can also be changed using a dropdown to
25, 50 and 100 as well.
The time of upload shows the time of the server during which the file is uploaded.
The user can view the status of the files which are uploaded by him
Data will be shown for past 30 days
The user can however search as per the required dates and retrieve the history of uploaded files
High Availability: High availability strategies are intended for handling temporary failure conditions to allow the
system to continue functioning.
Disaster recovery is the process of restoring application functionality in the wake of a catastrophic loss.
An organization's tolerance for reduced functionality during a disaster is a business decision that varies from one
application to the next. It might be acceptable for some applications to be unavailable or to be partially available
with reduced functionality or delayed processing for a period of time. For other applications, any reduced
functionality is unacceptable.
DR rating for each application is calculated based on the business criticality by business or technology owners.
DR & RTO
DR Class | RTO
1 | 12 hours
UDL to be internally treated as SC1/DR1 service though service catalog calls this out as SC3/DR3 service
All other components to be tagged as SC3/DR3 unless there is a separate tagging undertaken by projects as
per business requirement.
The table below summarizes the SLA that Microsoft already offers for each component for business continuity.
Component | SLA
ADLS | 99.9%
HDInsight | 99.9%
LogicApps | 99.9%
Write | 99.9%
The Microsoft SLA translates to downtimes per week, month and year as depicted below. In most cases a 99% SLA
suffices for Unilever's SC3/DR3 service requirement for PaaS components.
SLA | Downtime per week | Downtime per month | Downtime per year
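As a worked illustration (not the original table values): a 99.9% SLA allows 0.1% downtime, i.e. roughly 10 minutes per week, about 44 minutes per month and about 8.8 hours per year, while a 99% SLA allows roughly ten times as much.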
Now let's look at each I&A platform's business continuity design in detail.
BUSINESS CONTINUITY DETAILED LEVEL PLAN FOR UDL & BDL (ADLS GEN1)
BUSINESS CONTINUITY DETAILED LEVEL PLAN FOR UDL & BDL (ADLS GEN2)
Product (HA/DR)
Go with Microsoft SLA with the assumption that the service will be up and running as per the SLA summarized
earlier (Not suggested for SC2 & SC1 applications)
Advantages
Risk
Dependency on Microsoft to bring up component as per SLA. If Microsoft cannot bring up the component
within the agreed SLA, business will be impacted.
No guarantee on the service restoration.
Data lost may or may not be recovered. No guarantee from MS team on data restoration if service goes
down.
Set up a DR environment in the paired region that acts as a passive environment with only data backed up regularly;
during an outage, projects can connect to the DR environment to ensure business continuity.
Advantages
Risk
Minimum additional cost to keep the backup of the data in secondary region.
Advantages
Risk
Additional cost to keep a secondary copy of the data in the second region.
Components | Option 1 (MS SLA) | Option 2 (Active-Passive) | Option 3 (Active-Active)
SQL DW | None | Geo-replication. Spin up a new instance and restore from backup during an outage | User-defined backup every 8 hours. Restore from backup to the DR instance and pause.
AAS | None | Daily backup to Azure Storage. Spin up a new instance and restore from backup during an outage. | Backup to Azure Storage every 8 hours. Restore from the backup to the DR AAS instance daily and pause.
Azure Storage | None | RA-GRS provides read access from the DR instance. | RA-GRS for read availability. Copy data from the storage account in the secondary region to another storage account. Point applications to that storage account for both read and write availability.
ADF, ADB | None | Every release into primary to also have a release to the redundant instance | Every release into primary to also have a release to the redundant instance
Key Vault | None | Every release, if changes are added to Key Vault, they are backed up to Blob and restored from there during an outage | Every release, if changes are added to Key Vault, they are backed up to Blob and restored to a new instance of Key Vault.
€12.65 €37.95
SQLDW | Basic RA-GRS Cost - €10 | Basic RA-GRS Cost - €10 | Compute - €70.33, Storage - €113.99
Azure Storage | Basic RA-GRS Cost - €10 | Basic RA-GRS Cost ~ €10 | Basic RA-GRS Cost ~ €10
KEY ASSUMPTIONS
ADLS - 1TB ; SQLDW -100DWU- 60 hours, 1TB storage ; AAS -S2 - 60 hours – 50GB;
Blob Storage- LRS- GPV2- Std- 50GB; Keyvault – Storage Size – 50GB
Data Quality
Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data
quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making
and planning"
Data Profiling: To examine the data health initially for modelling & integration process design purposes. (Not an
assessment task to improve data quality, more to understand data)
Data Validation
Data Validation: To validate incoming data against a set of pre-defined data quality validation rules as per the data
requirements. In case the validation fails, reject & log the error record; this record will not be processed any
further. (i.e. it does not correct data)
Data Conversion/Transformation: To convert incoming data based on the rule/condition defined based on the
requirement. (i.e. Corrects data)
Numeric Check
Value (greater than, equal to, less than, not equal to, contains, begins with, ends with)
Length (greater than, equal to, less than, not equal to) – e.g. lengths of fields incorrect
Custom Formula
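As an illustration of the reject-and-log behaviour described above, a minimal Databricks (Scala) sketch follows; the landing path, column names and rules are hypothetical and the real rules come from the data requirements.

import org.apache.spark.sql.functions._

// Hypothetical landing data with two illustrative rules: quantity must be numeric, material_code must be 18 characters
val incomingDF = spark.read.parquet("/mnt/udl/raw/ecc/orders")

val validatedDF = incomingDF
  .withColumn("dq_error",
    when(col("quantity").cast("decimal(18,3)").isNull, "quantity is not numeric")
      .when(length(col("material_code")) =!= 18, "material_code length incorrect")
      .otherwise(lit(null)))

// Reject & log failed records; only clean records are processed further (the data itself is not corrected)
val rejectedDF = validatedDF.filter(col("dq_error").isNotNull)
val cleanDF = validatedDF.filter(col("dq_error").isNull).drop("dq_error")

rejectedDF.write.mode("append").parquet("/mnt/udl/raw/ecc/orders_dq_rejects")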
Refer to the high-level flow diagrams for the UDL data validation implementation strategy:
Refer to the high-level flow diagrams for the BDL data validation implementation strategy:
Recommendations:
Data correction should always happen at the source side only, not in the Data Lake, apart from standardization
of date/time, decimal values etc.
To maintain consistency, NULLs/blanks/spaces should be converted to the same value, for example
"UNKNOWN", based on the requirement.
In the BDL only logical checks like RI checks and lookup checks should happen.
Business specific rules should be agreed as part of the functional specification document.
Standard ways of failing the pipeline and communicating to the user/AM team.
Data Profiling is to examine the data health initially for modeling & integration process design purposes. (Not an
assessment task to improve data quality, more to understand data)
For each data column/field generate profile of the dataset or data domain, primarily:
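A minimal Spark (Scala) profiling sketch along these lines is shown below; the dataset path is hypothetical and the metrics (row count, null count, distinct count, min/max) are only typical starting points.

import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.parquet("/mnt/udl/trusted/ecc/orders")   // hypothetical dataset
val totalRows = df.count()

// Build one profile row per column: null count, distinct count, min and max
val profileDF = df.columns.map { c =>
  val stats = df.agg(
    sum(when(col(c).isNull, 1).otherwise(0)).as("null_count"),
    countDistinct(col(c)).as("distinct_count"),
    min(col(c)).cast("string").as("min_value"),
    max(col(c)).cast("string").as("max_value")).first()
  (c, totalRows, stats.getAs[Long]("null_count"), stats.getAs[Long]("distinct_count"),
    stats.getAs[String]("min_value"), stats.getAs[String]("max_value"))
}.toSeq.toDF("column_name", "row_count", "null_count", "distinct_count", "min_value", "max_value")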
Archival is the process of maintaining historical data based on policies like the retention of the data, i.e.
the time until which the data has to be maintained in the archive folder. Once the data crosses the retention period, it
can then be purged. The data copied from the source systems to the Raw folder and then processed further is
archived once it is processed.
Purging
The files archived in the archive folder should adhere to the defined retention policy. Files crossing the defined
retention period should be deleted permanently. The Azure Data Factory pipeline that purges the data from
different folders reads the retention period from a pipeline parameter that needs to be assigned while
triggering it.
The pipeline can be scheduled at any frequency as needed. Ideally the purge pipeline should be
executed daily. On each run, the pipeline performs the following activities.
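As an illustration of such a purge step, a minimal Databricks (Scala) sketch is shown below; it assumes the archive folders follow the yyyy=/mm=/dd= convention described earlier, and the archive path and widget name are hypothetical.

import java.time.LocalDate

// Retention period (in days) is assumed to arrive from the ADF pipeline parameter via a notebook widget
val retentionDays = dbutils.widgets.get("retentionDays").toInt
val cutoff = LocalDate.now().minusDays(retentionDays)

val archiveRoot = "/mnt/udl/archive/ecc/orders"   // hypothetical archive root

// Walk the yyyy=/mm=/dd= folders and permanently delete anything older than the retention period
for (y <- dbutils.fs.ls(archiveRoot); m <- dbutils.fs.ls(y.path); d <- dbutils.fs.ls(m.path)) {
  val folderDate = LocalDate.of(
    y.name.stripPrefix("yyyy=").stripSuffix("/").toInt,
    m.name.stripPrefix("mm=").stripSuffix("/").toInt,
    d.name.stripPrefix("dd=").stripSuffix("/").toInt)
  if (folderDate.isBefore(cutoff)) dbutils.fs.rm(d.path, true)
}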
Refer to the document "Unilever_DataIngestion-LLD_v1.0.docx" to understand the MCS framework. Note that this
document is owned by the UDL team.
Deletion Logic
Ingestion Type
Full Load: Erase the complete contents of one or more tables and reload all data every time.
There is no need to implement deletion logic for a Full Load.
Incremental Load: Apply only ongoing changes to one or more tables based on predefined requirements.
If the source provides deletion flag(s), the same should be used by the downstream systems.
If the source provides the primary key only, then UDL will implement deletion logic using UDL_Flag (I - Insert,
U - Update, D - Delete ONLY).
Primary keys used in UDL/BDL/PDS should be the same as in the source system.
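As an illustration of how a downstream BDL/PDS Delta table might apply this UDL_Flag, a minimal Databricks (Scala) merge sketch follows; the paths and the order_id key are hypothetical.

import io.delta.tables.DeltaTable

// Latest processed UDL changes carrying UDL_Flag (I / U / D) and the source primary key
val changesDF = spark.read.parquet("/mnt/udl/processed/ecc/orders")        // hypothetical path
val bdlTable = DeltaTable.forPath(spark, "/mnt/bdl/supplychain/orders")    // hypothetical BDL Delta table

bdlTable.as("t")
  .merge(changesDF.as("s"), "t.order_id = s.order_id")   // primary key kept the same as the source system
  .whenMatched("s.UDL_Flag = 'D'").delete()              // apply deletions
  .whenMatched("s.UDL_Flag = 'U'").updateAll()           // apply updates
  .whenNotMatched("s.UDL_Flag != 'D'").insertAll()       // insert new records
  .execute()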
Scheduling timing
UDL, BDL, PDS data ingestion schedules should be aligned as per business requirements.
BDL and PDS MUST take into consideration dependency on UDL data ingestion jobs. Please refer Section
3.8 - Job Management
Notes
1. Micro batch deletion logic is not in scope of the batch deletion logic explained above.
a. UDL WILL NOT provide deletion logic for the micro batch landing folder. To implement deletion for
microbatch there are 2 options:
i. UDL will provide deletion logic in the processed parquet in UDL. BDL can use UDL_Flag from the
processed parquet on a 24 hour frequency (rows highlighted in the table below - Table1) to delete
records from microbatch objects in BDL.
ii. The deletion logic needs to be handled in the end applications, as it does not make sense to scan
through the complete history to mark a record as deleted every 15 minutes.
2. UDL/BDL to publish the deletion flag & deletion logic for each source for consumption by downstream
applications.
Table1
Note: Anything not explicitly approved below, i.e. components mentioned as "Approved (case by case basis)",
requires I&A Architect review and approval.
Category | Component | Status | Purpose | Reference
Storage | Data Lake Store (Gen1 & Gen2) | Approved | PAAS tool used for data storage | Section 2.4 - Azure Data Lake Storage (ADLS)
Storage | Blob Storage | Approved | Used for external data ingestion and for component logs | Section 2.3 - Azure BLOB Storage
Database | SQL Data Warehouse (Gen2) | Approved | MPP database for data > 50 GB | Section 2.6 - Azure Data Warehouse / Azure Synapse Analytics
Database | SQL Database | Approved | Approved only for metadata capture and for data of volume < 50 GB | TBA
In Memory | Azure Analysis Services | Approved | Used for faster report response (in-memory limit of 400 GB) | Section 2.5 - Azure Analysis Services
Compute | Databricks | Approved | Compute. Processing engine (aggregation, KPI, business logic) | Section 2.1 - Azure Databricks (ADB)
Ingestion and Scheduling | Azure Data Factory (V2) | Approved | Job scheduling/orchestration in PDS. Integration with different source systems for data ingestion into UDL. | Section 2.2 - Azure Data Factory V2
Security & Access control | Azure Key Vault | Approved (case by case basis) | Credential management. Provided to all projects by default | TBA
Data Science | Databricks | Approved | Data science models using Spark ML, SparkR and PySpark | TBA
Data Science | Azure ML Services | Approved | UI tool and compute for Python, R and data science models. Approved for Visual Interface (ML Studio V2) as well | Section 2.7 - Azure Machine Learning
Others | Logic Apps | Approved (case by case basis) | Approved for alerting, monitoring and job triggering | Section 2.10 - Azure Logic App
Others | Log Analytics | Approved (case by case basis) | Approved for logging and monitoring | Section 2.8 - Azure Monitor & Log Analytics
Others | Batch Account, Functions, Service Bus, Azure Monitor | Approved (case by case basis) | Used case by case, as agreed with the I&A Tech Architect | TBA
New Components | Azure Cache for Redis | Approved (case by case basis) | Provided only as an exceptional approval, considering these components are used for specific cases | Section 2.12 - Azure Cache for Redis
https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits
Overview
Azure Databricks (ADB) will be the processing service to transform and process source data and get it in an
enriched business useful form. Azure Databricks is an Apache Spark-based analytics platform optimized for the
Microsoft Azure cloud services platform. ADB has auto-scaling and auto-termination (like a pause/resume), has a
workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over
traditional Apache Spark.
ADB will be used as the primary processing engine for all forms of data (structured, semi-structured). It will be used to
perform delta processing (with the use of Databricks Delta), data quality checks and enrichment of this data with
business KPIs and other business logic. ADB is 100% based on Spark and is extensible with support for Scala, Java,
R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib).
Architecture Standard: Scala should be used as the scripting language for Spark for DQ, delta processing etc.
PySpark or SparkR should be used for analytics.
Restrict the use of secret scope to the creator. Without this, any user of the workspace will be able to use the
secret scope and read the secrets from KeyVault provided:
The user knows the name of the secret scope
The user knows the secret names in the KeyVault
Collect Databricks diagnostic logs. This option is not available in standard workspaces. The UDL
and SC BDL teams have reportedly started redirecting these logs to Log Analytics.
User access management. Premium workspaces allow you to decide who has admin permissions on the
workspace. Without this, everyone who has access to the workspace is admin
Cluster management. With premium, you can create a cluster and control who has access to it. You can
restrict users who are not admins from creating or editing clusters.
AD credential passthrough. With premium workspaces you can create clusters with passthrough access. You
can also create mountpoints that work with passthrough credentials.
And a few more features that we don't use in Unilever.
For all Databricks Standard Workspaces, Landscape will make sure that the secret scope is deleted after creation
of mountpoints (if any). This will prevent the workspace users from accessing the secret scope and hence the
secrets in the KeyVault.
Any secret that is required at the run time should be stored in Databricks backed secret scope.
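For illustration, a hypothetical Databricks-backed secret scope being read at run time (the scope, key and connection details are assumptions):

// Read a credential at run time from a Databricks-backed secret scope (names are hypothetical)
val sqlPassword = dbutils.secrets.get(scope = "project-db-scope", key = "sqldw-password")

// Secrets are redacted in notebook output; use them only inside connection options
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
  .option("dbtable", "dbo.sample_table")
  .option("user", "svc_project")
  .option("password", sqlPassword)
  .load()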
Until 2019 the users were granted contributor permission on the Databricks instance on Azure portal. Starting April
2020, Landscape runs a script every day that adds/removes users from the workspace based on the users in the
dev/test/support AD user groups maintained by the project teams.
Design Standards
General Guidelines
Workspace Standards
Spark Style Guide
Automated Code Formatting Tools
Variables
Chained Method Calls
Spark SQL
Columns
Immutable Columns
Open Source
User Defined Functions
Custom transformations
Naming conventions
Schema Dependent DataFrame Transformations
Schema Independent DataFrame Transformations
What type of DataFrame transformation should be used
Null
JAR Files
Documentation
Column Functions
DataFrame Transformations
Testing
Cluster Configuration Standards
Cluster Sizing Starting Points
Different Azure Instance Types
Recommended VM Family
Recommended VM Family Series
Choose cluster VMs to match workload class
Arrive at correct cluster size by iterative performance testing
Cluster Tags
To configure cluster tags:
GENERAL GUIDELINES
While reading a multiline input file, use repartition(sc.defaultParallelism * 2) after the read to increase performance,
as option("multiline", "true") will otherwise invoke only one executor all the time (see the sketch after this list).
Follow coding standards (either consistently use uppercase letters or only lowercase letters).
Calibrate the execution times of each command to analyse the notebook performance.
Provide the comment line for commands/cells to highlight the functions.
Utilize the Spark optimization techniques available.
Use the available options to invoke Spark parallelism and try to avoid commands that utilize extensive
memory, CPU time and shuffling.
Avoid dbutils file operations such as dbutils.fs.cp (copy command) where possible.
Use Databricks delta wherever possible.
Consider using val for variable assignment wherever possible instead of using var.
Consider chunking out the data and provide them as input to SPARK.
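A minimal sketch of the first guideline above (multiline reads), with a hypothetical path:

// option("multiline", "true") forces the file onto a single executor, so repartition straight after the read
val rawDF = spark.read
  .option("multiline", "true")
  .json("/mnt/udl/raw/socialfeeds/nested_payload")
  .repartition(sc.defaultParallelism * 2)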
WORKSPACE STANDARDS
UDL to have workspaces specific to each functional area to ensure metastore limits are not reached. In the new
foundation design, UDL is to have workspaces specifically for CD, SC, Finance, HR etc. instead of one single
workspace for all functional areas. Similarly, if projects have too many jobs running, the suggestion is to have
multiple workspaces, as the Hive metastore has a limit of 250 connections.
Scalafmt and scalariform are automated code formatting tools. scalariform's default settings format code similarly to
the Databricks scala-style-guide and are a good place to start. The sbt-scalariform plugin automatically reformats
code upon compile and is the best way to keep code formatted consistently without thinking about it. Here are some
scalariform settings that work well with Spark code.
SbtScalariform.scalariformSettings

ScalariformKeys.preferences := ScalariformKeys.preferences.value
  .setPreference(DoubleIndentConstructorArguments, true)
  .setPreference(SpacesAroundMultiImports, false)
  .setPreference(DanglingCloseParenthesis, Force)
Variables
Variables should use camelCase. Variables that point to DataFrames, Datasets, and RDDs should be suffixed
accordingly to make your code readable:
Variables pointing to DataFrames should be suffixed with DF (following conventions in the Spark
Programming Guide)
peopleDF.createOrReplaceTempView("people")
case Row(key: Int, value: String) => s"Key: $key, Value: $value"
Use col1 and col2 for methods that take two Column arguments.
Use cols for methods that take an arbitrary number of Column arguments.
For methods that take column name String arguments, follow the same pattern and use colName, colName1,
colName2, and colNames as variables.
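For illustration, hypothetical column functions following these argument-naming conventions:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Two Column arguments -> col1, col2
def bothTrue(col1: Column, col2: Column): Column = col1 && col2

// Arbitrary number of Column arguments -> cols
def allNotNull(cols: Column*): Column = cols.map(_.isNotNull).reduce(_ && _)

// Column name String argument -> colName
def trimmedCol(colName: String): Column = trim(col(colName))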
Chained Method Calls
Spark methods are often deeply chained and should be broken up onto multiple lines, for example:
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "schema.tablename")
.option("user", "username")
.option("password", "password")
.save()
Another example of chaining a select, a custom transformation, and a filter (the filter condition is illustrative):
peopleDF
  .select(
    "name",
    "Date of Birth"
  )
  .transform(someCustomTransformation())
  .filter("name is not null")
Spark SQL
spark.sql("""
select
  `first_name`,
  `last_name`,
  `hair_color`
from people
""")
Columns
Columns that contain boolean values should use predicate names like is_nice_person or has_red_hair. Use
snake_case for column names, so it's easier to write SQL code.
You can write (col("is_summer") && col("is_europe")) instead of (col("is_summer") === true && col("is_europe")
=== true). The predicate column names make the concise syntax nice and readable.
Columns should only be nullable if null values are allowed. Code written for nullable columns should always
address null values gracefully.
Use acronyms when needed to keep column names short. Define any acronyms used at the top of the data file, so
other programmers can follow along.
Use the following shorthand notation for columns that perform comparisons.
player_age_gt_20
player_age_gt_15_leq_30
player_age_between_13_19
player_age_eq_45
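For illustration, such comparison columns might be built as follows (playersDF and player_age are assumed names):
import org.apache.spark.sql.functions.col

val playersWithFlagsDF = playersDF
  .withColumn("player_age_gt_20", col("player_age") > 20)
  .withColumn("player_age_between_13_19", col("player_age").between(13, 19))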
Immutable Columns
Custom transformations shouldn't overwrite an existing field in a schema during a transformation. Add a new
column to a DataFrame instead of mutating the data in an existing column.
Suppose you have a DataFrame with name and nickname columns and would like a column that coalesces the
name and nickname columns.
+-----+--------+
| name|nickname|
+-----+--------+
| joe| null|
| null| crazy|
|frank| bull|
+-----+--------+
Don't overwrite the name field and create a DataFrame like this:
+-----+--------+
| name|nickname|
+-----+--------+
| joe| null|
|crazy| crazy|
|frank| bull|
+-----+--------+
Create a new column, so existing columns aren't changed and column immutability is preserved.
+-----+--------+---------+
| name|nickname|name_meow|
+-----+--------+---------+
|  joe|    null|      joe|
| null|   crazy|    crazy|
|frank|    bull|    frank|
+-----+--------+---------+
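A minimal sketch of the recommended approach, using coalesce to add the new column while leaving name untouched:
import org.apache.spark.sql.functions.{coalesce, col}

// name_meow is added as a new column; the original name column is not mutated.
val withNameMeowDF = peopleDF.withColumn("name_meow", coalesce(col("name"), col("nickname")))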
Open Source
You should write generic open source code whenever possible. Open source code is easily reusable (especially
when it's uploaded to Spark Packages / Maven Repository) and forces you to design code without business logic.
The org.apache.spark.sql.functions class provides some great examples of open source functions.
The Dataset and Column classes provide great examples of code that facilitates DataFrame transformations.
User Defined Functions
Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great when built-in SQL functions aren't sufficient, but they should be used sparingly because they're not performant. If you need to write a UDF, make sure to handle the null case, as this is a common cause of errors.
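A hedged sketch of a null-safe UDF (the function and column names are illustrative):
import org.apache.spark.sql.functions.{col, udf}

// Wrapping the input in Option makes the UDF return null instead of throwing on null input.
val upperCaseUdf = udf((s: String) => Option(s).map(_.toUpperCase))
val withUpperDF = peopleDF.withColumn("name_upper", upperCaseUdf(col("name")))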
Custom transformations
Use multiple parameter lists when defining custom transformations, so you can chain your custom transformations
with the Dataset#transform method. You should disregard this advice from the Databricks Scala style guide: "Avoid
using multiple parameter lists. They complicate operator overloading, and can confuse programmers less familiar
with Scala."
You need to use multiple parameter lists to write awesome code like this:
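For illustration, custom transformations defined with multiple parameter lists chain cleanly through Dataset#transform (the function, column and DataFrame names below are illustrative):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// The DataFrame sits in the second parameter list, so withGreeting() is a DataFrame => DataFrame.
def withGreeting()(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello world"))
}

def withAgePlusOne(ageColName: String, resultColName: String)(df: DataFrame): DataFrame = {
  df.withColumn(resultColName, col(ageColName) + 1)
}

val resultDF = peopleDF
  .transform(withGreeting())
  .transform(withAgePlusOne("age", "age_plus_one"))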
Naming conventions
Schema Dependent DataFrame Transformations
Schema dependent DataFrame transformations make assumptions about the underlying DataFrame schema.
Schema dependent DataFrame transformations should explicitly validate DataFrame dependencies to make the
code and error messages more readable.
The following withFullName() DataFrame transformation assumes that the underlying DataFrame has first_name
and last_name columns.
df.withColumn(
  "full_name",
  concat_ws(" ", col("first_name"), col("last_name"))
)
You should use spark-daria to validate the schema requirements of a DataFrame transformation.
validatePresenceOfColumns(df, Seq("first_name", "last_name"))
df.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))
See this blog post for a detailed description on validating DataFrame dependencies.
Schema Independent DataFrame Transformations
Schema independent DataFrame transformations do not depend on the underlying DataFrame schema, as
discussed in this blog post.
def withAgePlusOne(
  ageColName: String,
  resultColName: String
)(df: DataFrame): DataFrame = {
  df.withColumn(resultColName, col(ageColName) + 1)
}
What type of DataFrame transformation should be used
Schema dependent transformations should be used for functions that rely on a large number of columns or functions that are only expected to be run on a certain schema (e.g. a data lake with a schema that doesn't change).
Schema independent transformations should be run for functions that will be run on a variety of DataFrame
schemas.
Null
null should be used in DataFrames for values that are unknown, missing, or irrelevant.
Spark core functions frequently return null and your code can also add null to DataFrames (by returning None or
explicitly returning null).
In general, it's better to keep all null references out of Scala code and use Option[T] instead. Option is a bit slower, so explicit null references may be required for performance-sensitive code. Start with Option and only use explicit null references if Option becomes a performance bottleneck.
The schema for a column should set nullable to false if the column should not take null values.
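For illustration, nullability is declared in the schema definition (column names are illustrative):
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// player_id should never be null; nickname may be.
val playersSchema = StructType(Seq(
  StructField("player_id", IntegerType, nullable = false),
  StructField("nickname", StringType, nullable = true)
))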
JAR Files
spark-testing-base_2.11-2.1.0_0.6.0.jar
Generically:
spark-testing-base_scalaVersion-sparkVersion_projectVersion.jar
If you're using sbt assembly, you can use the following line of code to build a JAR file using the correct naming
conventions.
If you're using sbt package, you can add this code to your build.sbt file to generate a JAR file that follows the
naming conventions.
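A minimal build.sbt sketch following this naming convention (sparkVersion is a value you define for your project, and assemblyJarName requires the sbt-assembly plugin):
val sparkVersion = "2.4.5"  // the Spark version your project targets

// With sbt assembly:
assemblyJarName in assembly :=
  s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion}_${version.value}.jar"

// With sbt package:
artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  s"${artifact.name}_${sv.binary}-${sparkVersion}_${module.revision}.${artifact.extension}"
}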
Documentation
The following documentation guidelines generally follow the documentation style of the Spark source code. For example, here's how the rpad method is documented in the Spark source code.
/**
 * Right-pad the string column with pad to a length of len. If the string column is longer
 * than len, the return value is shortened to len characters.
 *
 * @group string_funcs
 * @since 1.5.0
 */
Here's an example of the Column#equalTo() method documentation that contains an example code snippet.
/**
 * Equality test.
 * {{{
 *   // Scala:
 *   df.filter( col("colA") === col("colB") )
 *
 *   // Java:
 *   df.filter( col("colA").equalTo(col("colB")) );
 * }}}
 *
 * @group expr_ops
 * @since 1.3.0
 */
The @since annotation should be used to document when features are added to the API.
The @note annotation should be used to detail important information about a function, as in the following example.
/**
 * {{{
 * }}}
 *
 * @note The list of columns should match with grouping columns exactly, or empty (means all the
 * grouping columns).
 * @group agg_funcs
 * @since 2.0.0
 */
Column Functions
Column functions should be annotated with the following groups, consistent with the Spark functions that return
Column objects.
/**
 * @group string_funcs
 * @since 2.0.0
 */
DataFrame Transformations
Custom transformations can add/remove rows and columns from a DataFrame. DataFrame transformation
documentation should specify how the custom transformation is modifying the DataFrame and list the name of
columns added to the DataFrame as appropriate.
Testing
Use the spark-fast-tests library for writing DataFrame / Dataset / RDD tests with Spark. spark-testing-base should
be used for streaming tests.
Read this blog post for a gentle introduction to testing Spark code, this blog post on how to design easily testable
Spark code, and this blog post on how to cut the run time of a Spark test suite.
Instance methods should be preceded with a pound sign (e.g. #and) and static methods should be preceded with a
period (e.g. .standardizeName) in the describe block. This follows Ruby testing conventions.
Here is an example of a test for the #and instance method defined in the functions class as follows:
import spark.implicits._

describe("#and") {
  // some code
}

describe(".standardizeName") {
  // some code
}
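A fuller, hedged sketch of such a spec using ScalaTest's FunSpec (the class name, columns and values are illustrative; a plain assertion stands in for the spark-fast-tests comparers):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class FunctionsSpec extends FunSpec {

  lazy val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  describe("#and") {
    it("combines two boolean predicate columns") {
      val sourceDF = Seq((true, true), (true, false)).toDF("is_summer", "is_europe")
      val result = sourceDF
        .withColumn("is_summer_europe", col("is_summer") && col("is_europe"))
        .collect()
        .map(_.getBoolean(2))
      assert(result.sameElements(Array(true, false)))
    }
  }
}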
Projects should use only the below recommended VM family series and VM types for running all their project workloads. Choose based on the workload; a sizing exercise is key to identifying which cluster configuration works best.
Recommended VM Family
Choose cluster VMs to match workload class
Impact: High
To allocate the right amount and type of cluster resource for a job, we need to understand how different types of
jobs demand different types of cluster resources.
Machine Learning - To train machine learning models it is usually necessary to cache all of the data in memory. Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. To size the cluster, take a percentage of the data set, cache it, see how much memory it uses, and extrapolate that to the rest of the data. The Tungsten data serializer optimizes the data in memory, which means you'll need to test the data to see the relative magnitude of compression.
Streaming - You need to make sure that the processing rate is just above the input rate at peak times of the day. Depending on peak input rates, consider compute optimized VMs for the cluster to make sure the processing rate stays higher than your input rate.
ETL - In this case, data size and how fast the job needs to be are the leading indicators. Spark doesn't always require data to be loaded into memory in order to execute transformations, but you'll at the very least need to see how large the task sizes are on shuffles and compare that to the task throughput you'd like. To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network, or local I/O, and go from there. Consider using a general purpose VM for these jobs.
Interactive / Development Workloads - The ability for a cluster to auto scale is most important for these types of jobs. Azure Databricks has a cluster manager and Serverless clusters to optimize the size of the cluster during peak and low times. In this case, taking advantage of Serverless clusters and Autoscaling will be your best friend in managing the cost of the infrastructure.
Arrive at correct cluster size by iterative performance testing
Impact: High
It is impossible to predict the correct cluster size without developing the application because Spark and Azure
Databricks use numerous techniques to improve cluster utilization. The broad approach you should follow for sizing
is:
1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class as explained earlier.
2. After meeting functional requirements, run end to end test on larger representative data while measuring
CPU, memory and I/O used by the cluster at an aggregate level.
3. Optimize cluster to remove bottlenecks found in step 2:
a. CPU bound: add more cores by adding more nodes
b. Network bound: use fewer, bigger SSD-backed machines to reduce network traffic and improve remote read performance
c. Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.
4. Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious bottlenecks have
been addressed.
Performing these steps will help you to arrive at a baseline cluster size which can meet SLA on a subset of data.
Because Spark workloads exhibit linear scaling, you can arrive at the production cluster size easily from here. For
example, if it takes 5 nodes to meet SLA on a 100TB dataset, and the production data is around 1PB, then prod
cluster is likely going to be around 50 nodes in size.
CLUSTER TAGS
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in the organization. One
can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud
resources like VMs and disk volumes.
Cluster tags propagate to these cloud resources along with pool tags and workspace (resource group) tags. Azure
Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId. TDA
recommends adding the below Cluster Tags as a mandatory process for all Databricks clusters in use for a project.
Tag Name: ProjectSupportTeam
Purpose: If there is a group email for your team, please put that here. Otherwise, you can put the name of the person who administers the cluster. For cases where the cluster is managed by DevOps pipelines, put the group email for the DevOps team. If a group email doesn't exist, put the name of the Unilever colleague who is responsible for the cluster.

Tag Name: Purpose
Purpose: Describe the use of the cluster. You can pick some of the following or add new descriptions: Development, Data Analysis, Logging, Historical Data Processing Workloads, Data Processing Workloads, Machine Learning, Testing, etc.
To configure cluster tags: 1. On the cluster configuration page, click the Advanced Options toggle. 2. Click the Tags tab. 3. Add a key-value pair for each custom tag as per the recommended tags above from the TDA team.
Introduction
Planning, deploying, and running Azure Databricks (ADB) at scale requires one to make many architectural
decisions.
While each ADB deployment is unique to an organization's needs we have found that some patterns are common
across most successful ADB projects. Unsurprisingly, these patterns are also in-line with modern Cloud-centric
development best practices.
This short guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks.
We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks
applications, and finally, running Azure Databricks in production.
The audience of this guide are system architects, field engineers, and development teams of customers, Microsoft,
and Databricks. Since the Azure Databricks product goes through fast iteration cycles, we have avoided
recommendations based on roadmap or Private Preview features.
Our recommendations should apply to a typical Fortune 500 enterprise with at least intermediate level of Azure and
Databricks knowledge. We've also classified each recommendation according to its likely impact on solution's
quality attributes. Using the Impact factor, you can weigh the recommendation against other competing choices.
Example: if the impact is classified as “Very High”, the implications of not adopting the best practice can have a
significant impact on your deployment.
As ardent cloud proponents, we value agility and bringing value quickly to our customers. Hence, we’re releasing
the first version somewhat quickly, omitting some important but advanced topics in the interest of time. We will
cover the missing topics and add more details in the next round, while sincerely hoping that this version is still
useful to you.
Azure Databricks (ADB) deployments for very small organizations, PoC applications, or for personal education
hardly require any planning. You can spin up a Workspace using Azure Portal in a matter of minutes, create a
Notebook, and start writing code. Enterprise-grade large scale deployments are a different story altogether. Some
upfront planning is necessary to avoid cost overruns, throttling issues, etc. In particular, you need to understand how to divide and provision workspaces, the relevant workspace and Azure limits, and the networking and security setup, all of which are covered below.
Let’s start with a short Azure Databricks 101 and then discuss some best practices for scalable and secure
deployments.
ADB is a Big Data analytics service. Being a Cloud Optimized managed PaaS offering, it is designed to hide the
underlying distributed systems and networking complexity as much as possible from the end user. It is backed by a
team of support staff who monitor its health, debug tickets filed via Azure, etc. This allows ADB users to focus on
developing value generating apps rather than stressing over infrastructure management.
You can deploy ADB using Azure Portal or using ARM templates. One successful ADB deployment produces
exactly one Workspace, a space where users can log in and author analytics apps. It comprises the file browser,
notebooks, tables, clusters, DBFS storage, etc. More importantly, Workspace is a fundamental isolation unit in
Databricks. All workspaces are expected to be completely isolated from each other -- i.e., we intend that no action
in one workspace should noticeably impact another workspace.
Each workspace is identified by a globally unique 53-bit number, called Workspace ID or Organization ID. The URL
that a customer sees after logging in always uniquely identifies the workspace they are using:
https://regionName.azuredatabricks.net/?o=workspaceId
Azure Databricks uses Azure Active Directory (AAD) as the exclusive Identity Provider and there’s a seamless out
of the box integration between them. Any AAD member belonging to the Owner or Contributor role can deploy
Databricks and is automatically added to the ADB members list upon first login. If a user is not a member of the
Active Directory tenant, they can’t login to the workspace.
Azure Databricks comes with its own user management interface. You can create users and groups in a
workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default
AAD roles have no relationship with groups created inside ADB. ADB also has a special group called Admin, not to
be confused with AAD’s admin.
The first user to login and initialize the workspace is the workspace owner. This person can invite other users to the
workspace, create groups, etc. The ADB logged in user’s identity is provided by AAD, and shows up under the user
menu in Workspace:
With this basic understanding, let's discuss how to plan a typical ADB deployment. We first grapple with the issue of how to divide workspaces and assign them to users and teams.
Though partitioning of workspaces depends on the organization structure and scenarios, it is generally
recommended to partition workspaces based on a related group of people working together collaboratively. This
also helps in streamlining your access control matrix within your workspace (folders, notebooks etc.) and also
across all your resources that the workspace interacts with (storage, related data stores like Azure SQL DB, Azure
SQL DW etc.). This type of division scheme is also known as the Business Unit Subscription design pattern and
aligns well with Databricks chargeback model.
Customers commonly partition workspaces based on teams or departments and arrive at that division naturally. But
it is also important to partition keeping Azure Subscription and ADB Workspace level limits in mind.
Azure Databricks is a multitenant service and to provide fair resource sharing to all regional customers, it imposes
limits on API calls. These limits are expressed at the Workspace level and are due to internal ADB components. For
instance, you can only run up to 150 concurrent jobs in a workspace. Beyond that, ADB will deny your job
submissions. There are also other limits such as max hourly job submissions, etc.
There is a limit of 1000 scheduled jobs that can be seen in the UI.
The maximum number of jobs that a workspace can create in an hour is 1000.
At any time, you cannot have more than 150 jobs simultaneously running in a workspace.
There can be a maximum of 150 notebooks or execution contexts attached to a cluster.
Next, there are Azure limits to consider since ADB deployments are built on top of the Azure infrastructure.
Due to security reasons, we also highly recommend separating the production and dev/stage environments into
separate subscriptions.
It is important to divide your workspaces appropriately across different subscriptions based on your business structure, keeping the Azure limits in mind.
Impact: Low
While you can deploy more than one Workspace in a VNet by keeping the subnets separate, we recommend that
you follow the hub and spoke model and separate each workspace in its own VNet. Recall that a Databricks
Workspace is designed to be a logical isolation unit, and that Azure’s VNets are designed for unconstrained
connectivity among the resources placed inside it. Unfortunately, these two design goals are at odds with each
other, since VMs belonging to two different workspaces in the same VNet can communicate. While this is normally innocuous in our experience, it should be avoided as much as possible.
Impact: High
Because of the address space allocation scheme, the size of the private and public subnets is constrained by the VNet's CIDR:
The allowed values for the enclosing VNet CIDR are from /16 through /24
The private and public subnet masks must be:
Equal
At least two steps down from enclosing VNet CIDR mask
Must be greater than /26
With this info, we can quickly arrive at the table below, showing how many nodes one can use across all clusters for
a given VNet CIDR. It is clear that selection of VNet CIDR has far reaching implications in terms of maximum
cluster size.
[Table: maximum number of nodes across all clusters in the Workspace, by the enclosing VNet CIDR mask where the ADB Workspace is deployed and the allowed (equal) masks on the private and public subnets, assuming the higher subnet mask is chosen]
DO NOT STORE ANY PRODUCTION DATA IN DEFAULT DBFS FOLDERS
Impact: High
This recommendation is driven by security and data availability concerns. Every Workspace comes with a default
DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You
should not store any production data in it, because:
1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete default DBFS
and permanently remove its contents.
2. One can’t restrict access to this default folder and its contents.
Note that this recommendation doesn’t apply to Blob or ADLS folders explicitly mounted as DBFS by the
end user.
ALWAYS HIDE SECRETS IN KEY VAULT AND DO NOT EXPOSE THEM OPENLY IN NOTEBOOKS
Impact: High
It is a significant security risk to expose sensitive data such as access credentials openly in Notebooks or other
places such as job configs, etc. You should instead use a vault to securely store and access them. You can either
use ADB’s internal Key Vault for this purpose or use Azure’s Key Vault (AKV) service.
If using Azure Key Vault, create separate AKV-backed secret scopes and corresponding AKVs to store credentials
pertaining to different data stores. This will help prevent users from accessing credentials that they might not have
access to. Since access controls are applicable to the entire secret scope, users with access to the scope will see
all secrets for the AKV associated with that scope.
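For example, a notebook can read a credential from an AKV-backed secret scope at run time (the scope and key names below are illustrative placeholders):
// Read a credential from a secret scope instead of hard-coding it in the notebook.
val sqlDwPassword = dbutils.secrets.get(scope = "akv-sqldw-scope", key = "sqldw-password")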
After understanding how to provision the workspaces, best practices in networking, etc., let's put on the developer's hat and look at the design choices they typically face.
In this chapter we will address such concerns and provide our recommendations, while also explaining the internals
of Databricks clusters and associated topics. Some of these ideas seem counterintuitive but they will all make
sense if you keep these important design attributes of the ADB service in mind:
1. Cloud Optimized: Azure Databricks is a product built exclusively for cloud environments, like Azure. No on-
prem deployments currently exist. It assumes certain features are provided by the Cloud, is designed
keeping Cloud best practices, and conversely, provides Cloud-friendly
features.
2. Platform/Software as a Service Abstraction: ADB sits somewhere between the PaaS and SaaS ends of the
spectrum, depending on how you use it. In either case ADB is designed to hide infrastructure details as
much as possible so the user can focus on application development. It is
not, for example, an IaaS offering exposing the guts of the OS Kernel to you.
3. Managed Service: ADB guarantees a 99.95% uptime SLA. There’s a large team of dedicated staff members
who monitor various aspects of its health and get alerted when something goes wrong. It is run like an
always-on website and the staff strives to minimize any downtime.
These three attributes make ADB very different than other Spark platforms such as HDP, CDH, Mesos, etc. which
are designed for on-prem datacenters and allow the user complete control over the hardware. The concept of a
cluster is pretty unique in Azure Databricks. Unlike YARN or Mesos clusters which are just a collection of worker
machines waiting for an application to be scheduled on them, clusters in ADB come with a pre-configured Spark
application. ADB submits all subsequent user requests like notebook commands, SQL queries, Java jar jobs, etc. to
this primordial app for execution. This app is called the “Databricks Shell.”
Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator.
When it comes to taxonomy, ADB clusters are divided along notions of “type”, and “mode.” There are two types of
ADB clusters, according to how they are created. Clusters created using UI are called Interactive Clusters,
whereas those created using Databricks API are called Jobs Clusters. Further, each cluster can be of two modes:
Standard and High Concurrency. All clusters in Azure Databricks can automatically scale to match the workload,
called Autoscaling.
Recommended concurrency: 1 (Standard) vs. 10 (High Concurrency)
Impact: Medium
1. Deploy a shared cluster instead of letting each user create their own cluster.
2. Create the shared cluster in High Concurrency mode instead of Standard mode.
3. Configure security on the shared High Concurrency cluster, using one of the following options:
a. Turn on AAD Credential Passthrough if you’re using ADLS
b. Turn on Table Access Control for all other stores
If you’re using ADLS, we currently recommend that you select either Table Access Control or AAD
Credential Passthrough. Do not combine them together.
To understand why, let’s quickly see how interactive workloads are different from batch workloads:
What matters to end users (optimization metric)? Interactive: low execution time, i.e. low individual query latency. Batch: maximizing jobs executed over some time period, i.e. high throughput.
Are the workload's cost demands predictable? Interactive: No - understanding data via interactive exploration requires a multitude of queries that are impossible to predict ahead of time. Batch: Yes - a job's logic is fixed and doesn't change with each run.
Because of these differences, supporting Interactive workloads entails minimizing cost variability and optimizing for
latency over throughput, while providing a secure environment. These goals are satisfied by shared High
Concurrency clusters with Table access controls or AAD Passthrough turned on (in case of ADLS):
1. Minimizing Cost: By forcing users to share an autoscaling cluster you have configured with maximum node
count, rather than say, asking them to create a new one for their use each time they log in, you can control
the total cost easily. The max cost of shared cluster can be calculated by assuming it is running 24X7 at
maximum size with the chosen VMs. You can't achieve this if each user is given free rein to create clusters of arbitrary size and VM types.
2. Optimizing for Latency: Only High Concurrency clusters have features that allow queries from different users to share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process
which keeps disruptive queries in check by automatically pre-empting rogue queries, limiting the maximum
size of output rows returned, etc.
3. Security: Table Access control feature is only available in High Concurrency mode and needs to be turned
on so that users can limit access to their database objects (tables, views, functions...) created on the shared
cluster. In case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature
instead of Table Access Controls.
That said, irrespective of the mode (Standard or High Concurrency), all Azure Databricks clusters use Spark
Standalone cluster resource allocator and hence execute all Java and Scala user code in the same JVM. A shared
cluster model is secure only for SQL or Python programs because:
1. It is possible to isolate each user’s Spark SQL configuration storing sensitive credentials, temporary tables,
etc. in a Spark Session. ADB creates a new Spark Session for each Notebook attached to a High
Concurrency cluster. If you’re running SQL queries, then this isolation model works because there’s no way
to examine JVM’s contents using SQL.
2. Similarly, PySpark runs user queries in a separate process, so ADB can isolate DataFrames and DataSet
operations belonging to different PySpark users.
In contrast a Scala or Java program from one user could easily steal secrets belonging to another user sharing the
same cluster by doing a thread dump. Hence the isolation model of HC clusters, and this recommendation, only
applies to interactive queries expressed in SQL or Python. In practice this is rarely a limitation because Scala and
Java languages are seldom used for interactive exploration. They are mostly used by Data Engineers to build data
pipelines consisting of batch jobs. Those type of scenarios involve batch ETL jobs and are covered by the next
recommendation.
SUPPORT BATCH ETL WORKLOADS WITH SINGLE USER EPHEMERAL STANDARD CLUSTERS
Impact: Medium
Unlike Interactive workloads, logic in batch Jobs is well defined and their cluster resource requirements are known a
priori. Hence to minimize cost, there’s no reason to follow the shared cluster model and we recommend letting each
job create a separate cluster for its execution. Thus, instead of submitting batch ETL jobs to a cluster already
created from ADB’s UI, submit them using the Jobs APIs. These APIs automatically create new clusters to run Jobs
and also terminate them after the run completes. We call this the Ephemeral Job Cluster pattern for running jobs because the cluster's short life is tied to the job lifecycle.
Azure Data Factory uses this pattern as well - each job ends up creating a separate cluster since the underlying call
is made using the Runs-Submit Jobs API.
Just like the previous recommendation, this pattern will achieve general goals of minimizing cost, improving the
target metric (throughput), and enhancing security by:
1. Enhanced Security: ephemeral clusters run only one job at a time, so each executor’s JVM runs code from
only one user. This makes ephemeral clusters more secure than shared clusters for Java and Scala code.
2. Lower Cost: if you run jobs on a cluster created from ADB’s UI, you will be charged at the higher Interactive
DBU rate. The lower Data Engineering DBUs are only available when the lifecycle of job and cluster are
same. This is only achievable using the Jobs APIs to launch jobs on ephemeral
clusters.
3. Better Throughput: cluster’s resources are dedicated to one job only, making the job finish faster than while
running in a shared environment.
For very short duration jobs (< 10 min) the cluster launch time (~ 7 min) adds a significant overhead to total
execution time. Historically this forced users to run short jobs on existing clusters created by UI -- a costlier and less
secure alternative. To fix this, ADB is coming out with a new feature called Warm Pools in Q3 2019 bringing down
cluster launch time to 30 seconds or less.
FAVOR CLUSTER SCOPED INIT SCRIPTS OVER GLOBAL AND NAMED SCRIPTS
Impact: High
Init Scripts provide a way to configure cluster’s nodes and can be used in the following modes:
1. Global: by placing the init script in /databricks/init folder, you force the script’s execution every time any
cluster is created or restarted by users of the workspace.
2. Cluster Named: you can limit the init script to run only for a specific cluster's creation and restarts by placing it in the /databricks/init/<cluster_name> folder.
3. Cluster Scoped: in this mode the init script is not tied to any cluster by its name and its automatic execution is not a consequence of its DBFS location. Rather, you specify the script in the cluster's configuration, either by writing it directly or by providing its location on DBFS. Any location under the DBFS /databricks folder except /databricks/init can be used for this purpose, e.g. /databricks/<my-directory>/set-env-var.sh
You should treat Init scripts with extreme caution because they can easily lead to intractable cluster launch failures.
If you really need them, a) try to use the Cluster Scoped execution mode as much as possible, and, b) write them
directly in the cluster’s configuration rather than placing them on default DBFS and specifying the path. We say this
because:
1. ADB executes the script's body in each cluster node's LXC container before starting the Spark executor or
driver JVM in it -- the processes which ultimately run user code. Thus, a successful cluster launch and
subsequent operation is predicated on all nodal init scripts executing in a timely manner without any errors
and reporting a zero exit code. This process is highly error prone, especially for scripts downloading artifacts
from an external service over unreliable and/or misconfigured networks.
2. Because Global and Cluster Named init scripts execute automatically due to their placement in a special
DBFS location, it is easy to overlook that they could be causing a cluster to not launch. By specifying the Init
script in the Configuration, there’s a higher chance that you’ll consider them while debugging launch failures.
3. As we explained earlier, all folders inside default DBFS are accessible to workspace users. Your init scripts
containing sensitive data can be viewed by everyone if you place them there.
SEND LOGS TO BLOB STORE INSTEAD OF DEFAULT DBFS USING CLUSTER LOG DELIVERY
Impact: Medium
By default, Cluster logs are sent to default DBFS but you should consider sending the logs to a blob store location
using the Cluster Log delivery feature. The Cluster Logs contain logs emitted by user code, as well as Spark
framework’s Driver and Executor logs. Sending them to blob store is recommended over DBFS because:
1. ADB’s automatic 30-day DBFS log purging policy might be too short for certain compliance scenarios. Blob
store is the solution for long term log archival.
2. You can ship logs to other tools only if they are present in your storage account and a resource group
governed by you. The root DBFS, although present in your subscription, is launched inside a Microsoft-Azure
Databricks managed resource group and is protected by a read lock. Because of this lock the logs are only
accessible by privileged Azure Databricks framework code which shows them on UI. Constructing a pipeline
to ship the logs to downstream log analytics tools requires logs to be in a lock-free location first.
Impact: High
A shuffle occurs when we need to move data from one node to another in order to complete a stage. Depending on
the type of transformation you are doing you may cause a shuffle to occur. This happens when all the executors
require access to all of the data in order to accurately perform the action. If the job requires a wide transformation, you can expect it to execute more slowly because all of the partitions need to be shuffled around in order to complete the job, e.g. groupBy and distinct.
You’ve got two control knobs of a shuffle you can use to optimize
spark.conf.set("spark.sql.shuffle.partitions", 10)
These two determine the partition size, which we recommend should be in the Megabytes to 1 Gigabyte range. If
your shuffle partitions are too small, you may be unnecessarily adding more tasks to the stage. But if they are too
big, you may get bottlenecked by the network.
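Only the first knob appears in the snippet above; a minimal sketch of inspecting and adjusting the shuffle partition count (the DataFrame and the value 400 are illustrative) is:
println(spark.conf.get("spark.sql.shuffle.partitions"))  // default is 200
spark.conf.set("spark.sql.shuffle.partitions", 400)      // illustrative value, tune per workload

val aggregatedDF = salesDF.groupBy("region").count()     // wide transformation, triggers a shuffle
println(aggregatedDF.rdd.getNumPartitions)               // reflects the new setting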
Impact: High
Azure Databricks has an optimized Parquet reader, enhanced over the Open Source Spark implementation and it is
the recommended data storage format. In addition, storing data in partitions allows you to take advantage of
partition pruning and data skipping. Most of the time partitions will be on a date field but choose your partitioning
field based on the relevancy to the queries the data is supporting. For example, if you’re always going to be filtering
based on “Region,” then consider partitioning your data by region.
Evenly distributed data across all partitions (date is the most common)
10s of GB per partition (~10 to ~50GB)
Small data sets should not be partitioned
Beware of over partitioning
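A minimal sketch of a partitioned write (the DataFrame, format, path and partition column are illustrative):
import org.apache.spark.sql.SaveMode

eventsDF.write
  .format("delta")
  .mode(SaveMode.Append)
  .partitionBy("event_date")
  .save("/mnt/bdl/events")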
Monitoring
Once you have your clusters setup and your Spark applications running, there is a need to monitor your Azure
Databricks pipelines. These pipelines are rarely executed in isolation and need to be monitored along with a set of
other services. Monitoring falls into four broad areas; for the purposes of this version of the document we will focus on the first, resource utilization, which is the most common ask from customers.
COLLECT RESOURCE UTILIZATION METRICS ACROSS AZURE DATABRICKS CLUSTER IN A LOG ANALYTICS WORKSPACE
Impact: Medium
An important facet of monitoring is understanding the resource utilization across an Azure Databricks cluster. You
can also extend this to understand utilization across all your Azure Databricks clusters in a workspace. This could
be useful in arriving at a cluster size and VM sizes given each VM size does have a set of limits (cores/disk
throughput/network throughput) and could play a role in the performance profile of an Azure Databricks job.
In order to get utilization metrics of the Azure Databricks cluster, we use the Azure Linux diagnostic extension as an
init script into the clusters we want to monitor. Note: This could increase your cluster startup time by a minute.
You can reach out to your respective TDA architect to install the Log Analytics agent on Azure Databricks clusters to collect VM metrics in your Log Analytics workspace.
Querying VM metrics in Log Analytics once you have started the collection as described above:
You can use Log Analytics directly to query the Perf data. Here is an example of a query which charts CPU for the VMs in question for a specific cluster ID. See the Log Analytics overview for further documentation on Log Analytics and query syntax.
Perf
| where TimeGenerated > now() - 7d and TimeGenerated < now() - 6d
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| where _ResourceId contains "databricks-rg-"
| where Computer has "0408-235319-boss755" //clusterID
| project ObjectName , CounterName , InstanceName , TimeGenerated ,
CounterValue , Computer
| summarize avg(CounterValue) by bin(TimeGenerated, 1min),Computer
| render timechart
Delta handling
Databricks Delta is a single data management tool that combines the scale of a data lake, the reliability and
performance of a data warehouse, and the low latency of streaming in a single system.
Delta lets organizations remove complexity by getting the benefits of multiple storage systems in one. By combining
the best attributes of existing systems over scalable, low-cost cloud storage, Delta will enable dramatically simpler
data architectures that let organizations focus on extracting value from their data.
I&A Tech recommends use of Databricks Delta for complex delta operations. Seek advice from your Solution
Architect representative.
Databricks Delta to DW
SQL Data warehouse can read data from Databricks delta Parquet files.
-- Create a db master key if one does not already exist, using your own password.
CREATE MASTER KEY;

-- The external table definition below assumes the [DataLakeStore] data source and
-- [ParquetFileFormat] file format already exist; the table name and column list
-- are illustrative placeholders.
CREATE EXTERNAL TABLE [dbo].[DeltaEvents] (
    [EventId] INT,
    [EventDate] DATE
)
WITH (
    DATA_SOURCE = [DataLakeStore],
    LOCATION = N'/root/delta/events/',
    FILE_FORMAT = [ParquetFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Pools hold warm instances so Clusters (Automated or Interactive) can start blazingly fast.
To reduce cluster start time, you can attach a cluster to a predefined Pool of idle instances.
When attached to a Pool, a cluster allocates its driver and worker nodes from the pool.
If the Pool does not have sufficient idle resources to accommodate the cluster’s request, the Pool expands
by allocating new instances from Azure.
When an attached cluster is terminated, the instances it used are returned to the Pool and can be reused by a different cluster.
Without Pools, fetching instances usually takes 5-10 minutes, so a 1-minute job could take 11 minutes overall.
With Pools, the instances will typically be available in around 10 seconds.
What’s the value of this feature? (e.g. without pools vs. with pools)
Pools make clusters start and scale much faster across all workloads.
https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/create
Minimum Idle Instances
The minimum number of instances the pool keeps idle. These instances do not terminate, regardless of the setting specified in Idle Instance Auto Termination. If a cluster consumes idle instances from the pool, Azure Databricks provisions additional instances to maintain the minimum. Projects could use scripts to reduce this to a lower number, practically zero when not in use, based on their job frequency.
Maximum Capacity
The maximum number of instances that the pool will provision. If set, this value constrains all instances (idle
+ used). If a cluster using the pool requests more instances than this number during autoscaling, the request
will fail with an INSTANCE_POOL_MAX_CAPACITY_FAILURE error. You should restrict the capacity to ensure the project stays within a defined capacity and cost threshold on the subscription.
Idle Instance Auto Termination
The time in minutes that instances above the value set in Minimum Idle Instances can be idle before being terminated by the pool. Ideally set this to 5 or 10 minutes; this ensures we don't keep many idle VMs in the pool and get charged for them.
Instance types
A pool consists of both idle instances kept ready for new clusters and instances in use by running clusters.
All of these instances are of the same instance provider type, selected when creating a pool.
A pool’s instance type cannot be edited. Clusters attached to a pool use the same instance type for the driver
and worker nodes
Based on the workloads you run, look at the recommended cluster configurations and choose the right VM types.
Pool Tags
Ensure you add appropriate tags while creating pools for easier tracking and analysis. Follow the Tagging
guidelines in the design standards page.
Please note that the process described in this section can be used by projects that want to manage the creation, editing and deletion of pools using Databricks notebooks. There are various code samples, and a combination of them can be used to serve your purpose.
This process makes use of a Databricks Token belonging to a user who has admin permissions on
Databricks workspace. This token can be used to do almost anything on the workspace, hence be careful
while using it.
Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use
instances.
When a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s idle
instances.
If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in
order to accommodate the cluster’s request.
When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters
attached to a pool can use that pool’s idle instances.
DOCUMENTATION
Please note we are using Databricks Secrets for the Bearer Token in this notebook.
NOTEBOOK SETUP
import requests
import json
from string import Template

# Workspace URL (illustrative) and auth header. The bearer token is read from
# Databricks Secrets; the scope and key names below are placeholders.
domain = "https://<your-workspace>.azuredatabricks.net/api/2.0"
token = dbutils.secrets.get(scope="admin-scope", key="databricks-token")
headers = {"Authorization": "Bearer " + token}

def log_response(resp):
    """
    Logs the JSON response to the console
    :param resp: JSON response object.
    :return: None.
    """
    print(json.dumps(resp.json(), indent=2))
    return None

class ApiError(Exception):
    """An API Error Exception"""
    def __init__(self, status):
        self.status = status
    def __str__(self):
        return "ApiError: {}".format(self.status)

def list_pools():
    endpoint = "/instance-pools/list"
    resp = requests.get(domain + endpoint, headers=headers)
    if resp.status_code != 200:
        raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.status_code}")
    log_response(resp)

# Create a pool. The payload is a minimal illustration; adjust node type and sizes.
endpoint = "/instance-pools/create"
pool_name = "restapi-pool-prashanth"
pool_id = ""
data = json.dumps({
    "instance_pool_name": pool_name,
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0
})
resp = requests.post(domain + endpoint, data=data, headers=headers)
if resp.status_code != 200:
    raise ApiError(f"Request to: '{endpoint}' -- status code: {resp.status_code}")
else:
    log_response(resp)
    pool_id = resp.json()['instance_pool_id']

# The get, edit and delete calls follow the same pattern against the
# "/instance-pools/get", "/instance-pools/edit" and "/instance-pools/delete" endpoints.
Azure Data Factory v2 (ADF) will be the primary tool used for the orchestration of data into ADLS. Azure Data
Factory is a hybrid data integration service that allows you to create, schedule and orchestrate your ETL/ELT
workflows at scale wherever your data lives, in the cloud or a self-hosted network. Meet your security and
compliance needs while taking advantage of ADF’s extensive capabilities. Azure Data Factory is used to create and
schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. It can process
and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake
Analytics, and Azure Machine Learning
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based ETL and data integration
service that allows you to create data-driven workflows for orchestrating data movement and transforming data at
scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can
ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data
flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
Additionally, you can publish your transformed data to data stores such as Azure SQL Data Warehouse for
business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be
organized into meaningful data stores and data lakes for better business decisions.
Enterprises have data of various types that are located in disparate sources on-premises, in the cloud, structured,
unstructured, and semi-structured, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data and
processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next
step is to move the data as needed to a centralized location for subsequent processing.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud
source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF
mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that
execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, Machine Learning, etc. It is preferred to use Azure Databricks or ADF data flows to transform or enrich the data.
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to
incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data
has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL
Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business
intelligence tools.
Use configuration files to assist in deploying to multiple environments (dev/test/prod). Configuration files are used in
Data Factory projects in Visual Studio. When Data Factory assets are published, Visual Studio uses the content in
the configuration file to replace the specified JSON attribute values before deploying to Azure. A Data Factory
configuration file is a JSON file that provides a name-value pair for each attribute that changes based upon the
environment to which you are deploying. This could include connection strings, usernames, passwords, pipeline
start and end dates, and more. When you publish from Visual Studio, you can choose the appropriate deployment
configuration through the deployment wizard.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from refined
data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in
support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the
Azure portal.
Resource groups might have one or more Azure Data Factory instances (or data factories) based on project
requirement. Azure Data Factory is composed of four key components.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of
work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities
that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.
A pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
Mapping data flows
Create and manage graphs of data transformation logic that can be used to transform any-sized data. Build up a
reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF
pipelines. Data Factory will execute logic on a Spark cluster that spins-up and spins-down based on the
requirement.
Activity
Activities represent a processing step in a pipeline. For example, use a copy activity to copy data from one data
store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight
cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities,
data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you want to
use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for Data
Factory to connect to external resources. A linked service defines the connection to the data source, and a dataset
represents the structure of the data. For example, an Azure Storage-linked service specifies a connection string to
connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the
folder that contains the data.
The credentials of the linked services should be stored in Azure Key Vault (AKV).
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There
are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the
trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments
for the defined parameters are passed during execution from the run context that was created by a trigger or a
pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable, referenceable entity. An activity can reference datasets and
can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data store or
a compute environment. It is also a reusable, referenceable entity.
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a
trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
Variables
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with
parameters to enable passing values between pipelines, data flows, and other activities.
Naming Standards
PIPELINES
ACTIVITY
{ActivityShortName}_{Purpose}
Ex: LK_LogStart where LK=LookUp Activity and LogStart is the Purpose
VARIABLES
Best Practices
Should have a function-specific Azure Data Factory based on data sensitivity
Should not exceed 5000 objects in an ADF instance
Should follow the standard naming conventions for all objects
Use service principal authentication in the linked service to connect to Azure Data Lake Store. Azure Data
Lake Store uses Azure Active Directory for authentication. One can use service principal authentication in an
Azure Data Factory linked service used to connect to Azure Data Lake Store. This alleviates some of the
issues where tokens expired at inopportune times and removes the need to manage another unattended
service account in Azure Active Directory. Creating the service principal can be automated using PowerShell.
Integration Runtime
Azure Integration Runtime (IR) (formerly DMG) is the compute infrastructure used by Azure Data Factory to provide
the following data integration capabilities across different network environments:
Data movement: Move data between data stores in public network and data stores in private network (on-premise
or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and
performant and scalable data transfer.
Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute services such as
Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure
compute environment.
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a
compute service. An integration runtime provides the bridge between the activity and linked Services. It is
referenced by the linked service, and provides the compute environment where the activity either runs on or gets
dispatched from. This way, the activity can be performed in the region closest possible to the target data store or
compute service in the most performant way while meeting security and compliance needs.
IR NAMING STANDARD
For naming, use lower case followed by the resource group (RG) and a unique identifier as shown below; the
standard naming convention for each environment should be followed.
Ex: bieno-da-d-80011-dfgw-05
Azure IR
By default, each data factory has an Azure IR in the backend that supports operations on cloud data stores and
compute services in public network. The location of that Azure IR is auto-resolve. If connectVia property is not
specified in the linked service definition, the default Azure IR is used. You only need to create an Azure IR explicitly
when you want to define the location of the IR, or when you want to virtually group the activity executions on
different IRs for management purposes.
Self Hosted IR
You can use a single self-hosted integration runtime for multiple on-premises data sources. You can also
share it with another data factory within the same Azure Active Directory (Azure AD) tenant.
You can install only one instance of a self-hosted integration runtime on any single machine. If you have two
data factories that need to access on-premises data sources, either use the self-hosted IR sharing feature to
share the self-hosted IR, or install the self-hosted IR on two on-premises computers, one for each data
factory.
The self-hosted integration runtime doesn't need to be on the same machine as the data source. However,
having the self-hosted integration runtime close to the data source reduces the time for the self-hosted
integration runtime to connect to the data source. We recommend that you install the self-hosted integration
runtime on a machine that differs from the one that hosts the on-premises data source. When the self-hosted
integration runtime and data source are on different machines, the self-hosted integration runtime doesn't
compete with the data source for resources.
You can have multiple self-hosted integration run-times on different machines that connect to the same on-
premises data source. For example, if you have two self-hosted integration run-times that serve two data
factories, the same on-premises data source can be registered with both data factories.
If you already have a gateway installed on your computer to serve a Power BI scenario, install a separate
self-hosted integration runtime for Data Factory on another machine.
Use a self-hosted integration runtime to support data integration within an Azure virtual network.
Treat your data source as an on-premises data source that is behind a firewall, even when you use Azure
Express Route. Use the self-hosted integration runtime to connect the service to the data source.
Use the self-hosted integration runtime even if the data store is in the cloud on an Azure Infrastructure as a
Service (IaaS) virtual machine.
SCALE CONSIDERATIONS
Scale out: When processor usage is high and available memory is low on the self-hosted IR, add a new node
to help scale out the load across machines. If activities fail because they time out or the self-hosted IR node
is offline, it helps if you add a node to the gateway.
Scale up: When the processor and available RAM aren't well utilized, but the execution of concurrent jobs
reaches a node's limits, scale up by increasing the number of concurrent jobs that a node can run. You might
also want to scale up when activities time out because the self-hosted IR is overloaded.
To lift and shift existing SSIS workloads, you can create an Azure-SSIS IR to natively execute SSIS packages. Azure-
SSIS IR can be provisioned in either public network or private network. On-premises data access is supported by
joining Azure-SSIS IR to a Virtual Network that is connected to your on-premises network.
Azure-SSIS IR is a fully managed cluster of Azure VMs dedicated to run SSIS packages. You can bring your own
Azure SQL Database or Managed Instance server to host the catalog of SSIS projects/packages (SSISDB) that is
going to be attached to it. You can scale up the power of the compute by specifying node size and scale it out by
specifying the number of nodes in the cluster. You can manage the cost of running your Azure-SSIS Integration
Runtime by stopping and starting it as you see fit.
Performance Tuning
Refer to the following link on Microsoft docs to understand more on ADF Copy performance tuning
https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance
Grandparent Pipelines
Firstly, the grandparent pipeline, the most senior level of our ADF pipelines. Our approach here would be to build
and consider two main operations:
1. Attaching Data Factory Triggers to start our solution execution. Either scheduled or event based. From Logic
Apps or called by PowerShell etc. The grandparent starts the processing.
2. Grouping the execution of our processes, either vertically through the layers of our logical data warehouse or
maybe horizontally from ingestion to output. In each case we need to handle the high level dependencies
within our wider platform.
In the above mock-up pipeline I’ve used extract, transform and load (ETL) as a common example for where we
would want all our data ingestion processes to complete before starting any downstream pipelines.
You might also decide to control the scale and state of the services we are about to invoke. For example, when
working with:
Parent Pipelines
Next, at the parent level we read metadata about the processes that need to run, and the different configurations for
each of those executions. A metadata driven approach is vital to scale out the processing needed for parallel execution.
To support and manage the parallel execution of our child transformations/activities, the Data Factory ForEach
activity helps. Let’s think about these examples, when working with:
Azure SQLDB or Azure SQLDW, how many stored procedures do we want to execute at once?
Azure Databricks, how many notebooks do we want to execute?
Azure Analysis Services, how many models do we want to process at once?
Using this hierarchical structure, we aim to call the first stage ForEach activity which will contain calls to child
pipeline(s).
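As an illustrative sketch only (the child pipeline name PL_Child_Transform and the configList array parameter are hypothetical), a parent-level ForEach that fans out child executions could be shaped like this:
{
    "name": "ForEach Configuration",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@pipeline().parameters.configList",
            "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 4,
        "activities": [
            {
                "name": "Execute Child",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "PL_Child_Transform",
                        "type": "PipelineReference"
                    },
                    "waitOnCompletion": true,
                    "parameters": {
                        "config": {
                            "value": "@item()",
                            "type": "Expression"
                        }
                    }
                }
            }
        ]
    }
}
The batchCount property caps how many child executions run in parallel, which is where the per-service limits discussed above would be applied.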
Child Pipelines
Next, at our child level we handle the actual execution of our data transformations. Plus, the nesting of the ForEach
activities in each parent and child level then gives us the additional scale out processing needed for some services.
At this level we are getting the configurations for each child run passed from the parent level. This is where we will
be running the lowest level transformation operations against the given compute. Logging the outcome at each
stage should also happen at this level.
Infant Pipelines
Our infants contain reusable handlers and calls that could potentially be used at any level in our solution. The best
example of an infant is an ‘Error Handler’ that does a bit more than just call a Stored Procedure.
If created in Data Factory, we might have an ‘Error Handler’ infant/utility containing something like the below.
Azure Key Vault is now a core component of any solution, it should be in place holding the credentials for all our
service interactions. In the case of Data Factory most linked service connections support the obtaining of values
from Key Vault. Wherever possible we should be including this extra layer of security and allowing only Data
Factory to retrieve secrets from Key Vault using its own Managed Identity.
Where design allows it, always try to simplify the number of datasets listed in a Data Factory. In version 1 of the
product a hard coded dataset was required as the input and output for every stage in our processing. Thankfully
those days are in the past. Now we can use a completely metadata driven dataset for dealing with a particular type
of object against a linked service. For example, one dataset of all CSV files from Blob Storage and one dataset for
all SQLDB tables.
At runtime the dynamic content underneath the datasets is created in full, so monitoring is not impacted by making
datasets generic. If anything, debugging becomes easier because of the common/reusable code.
Where generic datasets are used, pass the following values as parameters, typically from the pipeline or resolved at
runtime within the pipeline.
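As a hedged sketch of such a generic dataset (the names DS_Generic_CSV, LS_BlobStore and the folderPath/fileName parameters are placeholders), the definition might look like this:
{
    "name": "DS_Generic_CSV",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_BlobStore",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderPath": { "type": "string" },
            "fileName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "folderPath": {
                    "value": "@dataset().folderPath",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().fileName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}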
Folders and sub-folders are such a great way to organise our Data Factory components, we should all be using
them to help ease of navigation. Be warned though, these folders are only used when working with the Data
Factory portal UI. They are not reflected in the structure of our source code repo.
Adding components to folders is a very simple drag and drop exercise or can be done in bulk if you want to attack
the underlying JSON directly. Subfolders get applied using a forward slash, just like other file paths.
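In the underlying JSON this is simply a folder property on the resource; in the hypothetical fragment below, Ingest/SAP illustrates a folder with a sub-folder:
{
    "name": "PL_Ingest_Sales",
    "properties": {
        "folder": {
            "name": "Ingest/SAP"
        },
        "activities": []
    }
}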
Every Pipeline and Activity within Data Factory has a non-mandatory description field. I want to encourage all of us
to start making better use of it. When writing any other code we typically add comments to offer others our
understanding. I want to see these description fields used in ADF in the same way.
CI/CD LIFECYCLE
Below is a sample overview of the CI/CD lifecycle in an Azure data factory that's configured with Azure Repos Git.
For more information on how to configure a Git repository, see Source control in Azure Data Factory.
1. A development data factory is created and configured with Azure Repos Git. All developers should have
permission to author Data Factory resources like pipelines and datasets.
2. As the developers make changes in their feature branches, they debug their pipeline runs with their most
recent changes.
3. After the developers are satisfied with their changes, they create a pull request from their feature branch to
the master or collaboration branch to get their changes reviewed by peers.
4. After a pull request is approved and changes are merged in the master branch, the changes can be
published to the development factory.
5. When the team is ready to deploy the changes to the test factory and then to the production factory, the team
exports the Resource Manager template from the master branch.
6. The exported Resource Manager template is deployed with different parameter files to the test factory and
the production factory.
Ingestion Patterns
This page delivers a graphical representation of the Ingestion Design Patterns agreed for each different source
system.
Source System / Pattern | Internal / External | File Types | Landing / Source Location | Ingestion Method
Cordillera, U2K2, Fusion | Internal | SAP BW Open Hub File Destination | File | ADF + IR – BW Connector
Manual Mapping Files (S0) | Internal / External | CSV, flat files (Excl. Excel) | FMT | ADF – Direct Blob
Manual Mapping Files (S1) | Internal / External | CSV, flat files (Excl. Excel) | Blob / ADLS Generation 2 | ADF – Direct ADF Connector
Manual Mapping Files (S2) | External | CSV, flat files (Excl. Excel) | External SFTP | ADF – FTP Source
Kantar | – | – | – | ADF – FTP Source
INTRODUCTION
Anaplan is a planning platform that enables organizations to accelerate decision making by connecting data,
people, and plans across the business. Unilever has identified a number of use cases for Anaplan and it is being
adopted in various parts of the business. While Anaplan is good for generating insights and has some nice
graphical capabilities, Power BI is the tool of choice for most of the reporting and visualization requirements across
the business. Interactions with Power BI reports are simpler and the business users are comfortable using the tool.
This page provides an example of how to extract data from Anaplan and copy it in the Azure environment to make it
available for reporting.
Anaplan uses a limited set of file types, which include Excel and CSV. One of the reasons is that these file types can
easily be broken into 'chunks' so that each file is less than 10 MB.
Copy batch data from Anaplan into Azure Data Lake Store
Copy data from Azure Data Lake Store to Anaplan
The requirement is for a tool that would allow data integration between Anaplan and Azure. This would include both
download and upload of data from Anaplan.
API Support
Anaplan supports APIs to export and import data. The latest API version needs an authentication token which can
be obtained by following the Authentication Service API documentation.
Both tools are approved by EA for data movement operations. However, since ADF is much more scalable and
allows code to be promoted easily across environments, the proposal is to use ADF.
HyperConnect
HyperConnect is another out-of-the-box tool provided by Anaplan for data movement operations, but it requires an
IaaS VM to operate. It again isn't easy to maintain software versions or to promote code. IaaS comes with a
number of other maintenance overheads, hence it isn't an optimal solution. HyperConnect doesn't have an approval
from Unilever EA.
Anaplan Connect
Yet another tool offered by Anaplan; it also requires a VM, hence it is not being considered at this stage.
The image below shows an architecture where data from Anaplan is stored in ADLS, processed by Azure
Databricks, modelled using Azure SQL Data Warehouse and Analysis Services, and finally visualized by Power BI.
Azure Data Factory can be used to extract data using APIs provided by Anaplan. This article only focuses on ADF
integration with Anaplan APIs.
Before any file can be downloaded, the exports need to run. We can use ADF pipelines to call API endpoints and
run the exports.
},
"method": "GET",
"headers": {
"Authorization": {
"value": "@concat('AnaplanAuthToken ',
string(activity('Get Auth Token').output.tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
}
}
},
{
"name": "Iterate Exports",
"type": "ForEach",
"dependsOn": [
{
"activity": "List Exports",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('List Exports').output.
exports",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Create Export Task",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "@concat(pipeline().
parameters.AnaplanBaseURL, '/exports/', item().id, '/tasks')",
"type": "Expression"
},
"method": "POST",
"headers": {
"Authorization": {
"value": "@concat
('AnaplanAuthToken ', string(activity('Get Auth Token').output.
tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
},
"body": {
"localeName": "en_GB"
}
}
}
]
}
}
],
"parameters": {
"AnaplanBaseURL": {
"type": "string",
"defaultValue": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15"
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
{
"name": "Anaplan Download files",
"properties": {
"activities": [
{
"name": "Get Auth Token",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://auth.anaplan.com/token
/authenticate",
"method": "POST",
"headers": {
"Authorization": "Basic
dmlzaGFsLmd1cHRhQHVuaWxldmVyLmNvbTpIb2xpZGF5MTIz"
},
"body": "{name:test}"
}
},
{
"name": "List Files",
"type": "WebActivity",
"dependsOn": [
{
"activity": "Get Auth Token",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15/files",
"method": "GET",
"headers": {
"Authorization": {
"value": "@concat('AnaplanAuthToken ',
string(activity('Get Auth Token').output.tokenInfo.tokenValue))",
"type": "Expression"
},
"Content-Type": "application/json"
}
}
},
{
"name": "Iterate File List",
"type": "ForEach",
"dependsOn": [
{
"activity": "List Files",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('List Files').output.files",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Check Chunk Counts",
"type": "IfCondition",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "@greater(item().
chunkCount, 0)",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "Execute Copy Pipeline",
"type": "ExecutePipeline",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"pipeline": {
"referenceName":
"Anaplan Copy",
"type":
"PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"auth": {
"value": "@{concat
('Authorization : AnaplanAuthToken ', string(activity('Get Auth Token').
output.tokenInfo.tokenValue))}",
"type": "Expression"
},
"filename": {
"value": "@item().
name",
"type": "Expression"
},
"id": {
"value": "@item().
id",
"type": "Expression"
},
"chunkCount": {
"value": "@item().
chunkCount",
"type": "Expression"
}
}
}
}
]
}
}
]
}
}
],
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
{
"name": "Anaplan Copy",
"properties": {
"activities": [
{
"name": "Download Each Chunk",
"type": "ForEach",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@range(0,pipeline().parameters.
chunkCount)",
"type": "Expression"
},
"activities": [
{
"name": "Download chunk",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [
{
"name": "Source",
"value": "files/116000000023"
},
{
"name": "Destination",
"value": "root/tmp/AnaplanPoC
/somefile1.xls"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "HttpReadSettings",
"requestMethod": "GET",
"additionalHeaders": {
"value": "@{pipeline().
parameters.auth}",
"type": "Expression"
},
"requestTimeout": ""
},
"formatSettings": {
"type":
"DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type":
"AzureDataLakeStoreWriteSettings"
},
"formatSettings": {
"type":
"DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ""
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "DS_Anaplan_HTTP",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "@pipeline().
parameters.id",
"type": "Expression"
},
"chunk": "0"
}
}
],
"outputs": [
{
"referenceName": "DS_AnaplanSink",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "@pipeline().
parameters.filename",
"type": "Expression"
},
"filepath": {
"value": "@concat('root/tmp
/AnaplanPoC/', formatDateTime(utcnow(), 'yyyyMMddhhmm'), '/')",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"auth": {
"type": "string",
"defaultValue": "Authorization : AnaplanAuthToken
zJIO0MiQw4KiiNMYK5In3A==.bcughHx6b7s5TpXl/dMzWfO+dLO1rPP
/T2oRI38UpZxBbCKbZcTvroQuKB5mQETGe93ZZyX6P
/7bAfAFxPN+Re9Q97+DWlNuqRvKvHlsnP0tA3fc446ZlRT+j86+E2ypBO2nkGdidk3LL
/reqzmBVRKHoko3mss+z3Ou5Z5IB3ZC+I/SWOcRJwBtp4rS3GAZ6NYe/T05qdIXxKM7Vzjx
/6DjSfbHFkAMzf7UGxElZkN9G6i7LyuoFUxe9nYxCRxrZcaC3s4puPPTA0/S0jgZeRTdsKjQ
/ewzs8hQC/mVe7QinjrElPjx3zNJf3Atb5Ntab3TTC0hQnTom
/s5spZaRbaHPPOA02mecFXbBMmOtWOxN9RzXB1HWcislEfcvcZmFT3yzSFDIwmcvsmSYnbI
/FRO8jzjGRkr9lWd0our+05ABLyRcvv2z60Y3JKZASpsLQixwR6/J0kLLC/kgZD1tQ==.
8wlQMs1+WJTrvzuIv0oAx+ZiU/JJ5vRVu52wHJyULhA="
},
"filename": {
"type": "string",
"defaultValue": "ICAT Export.xlsx"
},
"id": {
"type": "string",
"defaultValue": "116000000055"
},
"chunkCount": {
"type": "int",
"defaultValue": 1
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Linked Services
The following linked services are required to connect to the source and destination.
A service account can be used for this connection. For this example, I have used my account on Anaplan. Create an
HTTP linked service as below with basic authentication (the password for basic authentication can be stored within
Azure Key Vault). See the documentation to use a certificate for authentication.
{
"name": "LS_Anaplan",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"annotations": [],
"type": "HttpServer",
"typeProperties": {
"url": "https://api.anaplan.com/2/0/workspaces
/8a81b08e654f3cef0165a5bcd2935f29/models
/2AC252F199AB4C71B69AB49A807BAA15/",
"enableServerCertificateValidation": false,
"authenticationType": "Basic",
"userName": "[email protected]",
"encryptedCredential": "********"
}
}
}
{
"name": "LS_ADLS_DataLake",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"annotations": [],
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "adl://*********.azuredatalakestore.
net",
"servicePrincipalId": "*******-****-****-****-************",
"servicePrincipalKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "LS_KYVA_KeyVault",
"type": "LinkedServiceReference"
},
"secretName": "SPN-PocDevApp-Cloud-dev"
},
"tenant": "*******-****-****-****-************",
"subscriptionId": "*******-****-****-****-************",
"resourceGroupName": "*****************"
}
}
}
Datasets
Source Dataset
Create an HTTP-based dataset for your source connection. Since all the files are going to be delimited text files, the
dataset would look something like the following image.
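The original screenshot is not reproduced here. As a hedged sketch, assuming the Anaplan chunk-download convention of files/{id}/chunks/{chunk} relative to the linked service URL, the source dataset could be shaped as follows:
{
    "name": "DS_Anaplan_HTTP",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_Anaplan",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "id": { "type": "string" },
            "chunk": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": {
                    "value": "@concat('files/', dataset().id, '/chunks/', dataset().chunk)",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}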
Sink Dataset
You would also require a dataset to save the files to the data lake store. It should be an ADLS-based dataset (I've
used ADLS Gen 1 in the example) and the file type should be Delimited Text. It also accepts two parameters, filename and filepath.
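A hedged sketch of that sink dataset, using the filename and filepath parameters passed in by the Anaplan Copy pipeline, is shown below; treat it as indicative rather than the exact definition:
{
    "name": "DS_AnaplanSink",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_ADLS_DataLake",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "filename": { "type": "string" },
            "filepath": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureDataLakeStoreLocation",
                "folderPath": {
                    "value": "@dataset().filepath",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().filename",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}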
SCENARIO 2 - DATA LAKE TO ANAPLAN USING AZURE DATA FACTORY & ANAPLAN API
This requirement hasn’t been prioritised and other options are suggested by I&A Tech as a workaround. Still, if this
becomes a real requirement, this section will be updated.
APPENDIX
References
ADF pipelines are executed based on the occurrence of a trigger. There are three types of triggers, as listed below,
and all are approved for usage.
Schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced
calendar options.
Tumbling window triggers are a type of trigger that fires at a periodic time interval from a specified start time,
while retaining state. Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time
intervals.
Event-based trigger runs pipelines in response to an event, such as the arrival of a file, or the deletion of a
file, in Azure Blob Storage.
There are some applications that use a Logic App to sense the blob event and then trigger the ADF pipeline from a
Logic App action. It is recommended that such pipelines are moved to event-based triggers.
If you have ADF pipelines that require event-based triggering, the architecture below can be used.
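The referenced architecture diagram is not shown here. As a minimal sketch, a blob event trigger could be defined along the following lines, where the container path, storage account scope and pipeline name are placeholders:
{
    "name": "TR_OnFileArrival",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/landing/blobs/source-system/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true,
            "events": [
                "Microsoft.Storage.BlobCreated"
            ],
            "scope": "/subscriptions/<subscription id>/resourceGroups/<resource group>/providers/Microsoft.Storage/storageAccounts/<storage account>"
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "PL_Ingest_File",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}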
Pattern 2: Get Metadata Job Triggers for Source systems which are file based
Schedule a Get Metadata pipeline as the first pipeline to verify the existence of the data in the source system. When
the source file is available, the Get Metadata activity will trigger the child pipeline. Run the Get Metadata job frequently
to avoid any delay in triggering the actual job.
Get Metadata can be used for on-premises file systems, Amazon S3, Google Cloud Storage, Azure Blob Storage,
Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Files, File system, SFTP and FTP.
When using Get Metadata activity against a folder, make sure to have LIST/EXECUTE permission to the given
folder.
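A hedged sketch of this pattern is shown below as a fragment of a pipeline's activities array; the dataset DS_Source_File and child pipeline PL_Load_Source are hypothetical names:
{
    "name": "Check Source File",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "DS_Source_File",
            "type": "DatasetReference"
        },
        "fieldList": [
            "exists"
        ]
    }
},
{
    "name": "If File Exists",
    "type": "IfCondition",
    "dependsOn": [
        {
            "activity": "Check Source File",
            "dependencyConditions": [
                "Succeeded"
            ]
        }
    ],
    "typeProperties": {
        "expression": {
            "value": "@activity('Check Source File').output.exists",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "Run Child Pipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "PL_Load_Source",
                        "type": "PipelineReference"
                    },
                    "waitOnCompletion": true
                }
            }
        ]
    }
}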
Pattern 3: Hybrid architecture where on prem and cloud systems are interacting
Event-based triggers work as follows for AAS refresh, where on-premises SSIS writes a dummy file to Blob storage
to mark the completion of the data load.
JOB DEPENDENCY
Job Dependency can be used based on the below approaches. The first approach is being followed today for UDL.
Poll the Metadata table’s entries which specifies job completion and downstream pipelines can be triggered
accordingly.
Use service bus approach to notify the completion of job and consumers have to subscribe to the service bus.
Azure Data Lake Storage (ADLS) Gen2 can publish events to Azure Event Grid to be processed by subscribers
such as WebHooks, Azure Event Hubs, Azure Functions and Logic Apps. With this capability, individual changes to
files and directories in ADLS Gen2 can automatically be captured and made available to data engineers for creating
rich big data analytics platforms that use event-driven architectures.
OVERVIEW
Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data
engineers to develop graphical data transformation logic without writing code. The resulting data flows are executed
as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can
be engaged via existing Data Factory scheduling, control, flow, and monitoring capabilities.
Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on your
execution cluster for scaled-out data processing. Azure Data Factory handles all the code translation, path
optimization, and execution of your data flow jobs.
Mapping Data Flow follows an extract, load, transform (ELT) approach and works with staging datasets that are all
in Azure. Currently the following datasets can be used in a source transformation:
Settings specific to these connectors are located in the Source options tab.
Azure Data Factory has access to over 90 native connectors. To include data from those other sources in your data
flow, use the Copy Activity to load that data into one of the supported staging areas.
Below is a list of the transformations currently supported in mapping data flow. Click on each transformation to
learn its configuration details.
Transformation | Category | Description
Aggregate | Schema modifier | Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
Alter row | Row modifier | Set insert, delete, update, and upsert policies on rows.
Conditional split | Multiple inputs/outputs | Route rows of data to different streams based on matching conditions.
Derived column | Schema modifier | Generate new columns or modify existing fields using the data flow expression language.
Exists | Multiple inputs/outputs | Check whether your data exists in another source or stream.
Flatten | Schema modifier | Take array values inside hierarchical structures such as JSON and unroll them into individual rows.
New branch | Multiple inputs/outputs | Apply multiple sets of operations and transformations against the same data stream.
Pivot | Schema modifier | An aggregation where one or more grouping columns has its distinct row values transformed into individual columns.
Select | Schema modifier | Alias columns and stream names, and drop or reorder columns.
Sort | Row modifier | Sort incoming rows on the current data stream.
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
Azure BLOB Storage is a service that stores unstructured data in the cloud as objects/blobs. Blob storage can store
any type of text or binary data, such as a document, media file, or application installer. Blob storage is also referred
to as object storage.
As per I&A Tech guidelines, Blob storage should only be used as a transitory storage method, in the scenario
where externally hosted source systems are unable to support the 'pull' of data. Data will be 'pushed' from the
source into a blob store. This ensures access to ADLS is not shared with these 3rd party data providers.
The Containers should be created based on the project requirement and data sensitivity.
Types of Blobs
Block Blobs: Block blobs can store binary media files, documents, or text. A single block blob can store up to
50,000 blocks of 100 MB each, so the total blob size can reach more than 4.75 TB. Most object storage scenarios use block blobs.
Append Blobs: Append blobs are optimized for append operations such as logging scenarios. The difference
between append blobs and block blobs is the storage capacity. Each block in an append blob can store only up to 4 MB
of data; as a result, append blobs are limited to a total size of about 195 GB.
Page Blobs: Page blobs have a storage capacity of about 8 TB, which makes them useful for high reading and
writing scenarios. There are two different page blob categories, Premium and Standard. Standard blobs are used
for average Virtual Machines (VMs) read/write operations. Premium is used for intensive VM operations. Page
blobs are useful for all Azure VM storage disks including the operating system disk. Used for random reads and
writes.
Strong consistency
Multiple redundancy types
Tiered storage – Hot & Cool
LRS: 3 copies, 1 datacenter; ZRS: 6 copies, same or 2 separate regions; GRS: 6 copies, 2 separate regions
Storage Scenarios
Blob Storage is used as the default storage solution for a wide range of Azure services
The list below reviews the essential best practices for controlling and maintaining Blob storage costs and availability.
1. Define the Type of Content - When you upload files to blob storage, usually all files are stored as an
application/octet-stream by default. The problem is that most browsers start to download this type of file
instead of showing it. This is why you have to change the file type when uploading videos or images. To
change the file type, you have to parse each file and update the properties of that file.
2. Define the Cache-Control Header - The HTTP cache-control header allows you to improve availability. In
addition, the header decreases the number of transactions made in each storage control. For example, a
cache-control header in a static website hosted on Azure blob storage can decrease the server traffic loads
by placing the cache on the client-side.
3. Parallel Uploads and Downloads - Uploading large volumes of data to blob storage is time-consuming and
affects the performance of an application. Parallel uploads can improve the upload speed in both Block blobs
and Page blobs. For example, an upload of 70GB can take approximately 1,700 hours. However, a parallel
upload can reduce the time to just 8 hours.
4. Choose the Right Blob Type - Each blob type has its own characteristics. You have to choose the most
suitable type for your needs. Block blobs are suitable for streaming content. You can easily render the blocks
for streaming solutions. Make sure to use parallel uploads for large blocks. Page blobs enable you to read
and write to a particular blob part. As a result, all other files are not affected.
5. Improve Availability and Caching with Snapshots - Blob snapshots increase the availability of Azure
storage by caching the data. Snapshots allow you to have a backup copy of the blob without paying extra.
You can increase the availability of the entire system by creating several snapshots of the same blob and
serving them to customers. Assign snapshots as the default blob for reading operations and leave the
original blob for writing.
6. Enable a Content Delivery Network (CDN) - A content delivery network is a network of servers that can
improve availability and reduce latency by caching content on servers that are close to end-users. When
using CDNs for Blob storage, the network places a blob duplicate closer to the client. Accordingly, each
client is redirected to the closest CDN node of blobs.
Grant limited access to Azure Blob using shared access signatures (SAS)
A shared access signature (SAS) provides secure delegated access to resources in your storage account without
compromising the security of your data. With a SAS, you have granular control over how a client can access your
data. You can control what resources the client may access, what permissions they have on those resources, and
how long the SAS is valid, among other parameters. Unilever internal and external applications can connect to Blob
storage using a SAS token.
User delegation SAS: A user delegation SAS is secured with Azure Active Directory (Azure AD) credentials
and also by the permissions specified for the SAS. A user delegation SAS applies to Blob storage only.
Service SAS: A service SAS is secured with the storage account key. A service SAS delegates access to a
resource in only one of the Azure Storage services: Blob storage, Queue storage, Table storage, or Azure
Files.
Account SAS: An account SAS is secured with the storage account key. An account SAS delegates access
to resources in one or more of the storage services. All of the operations available via a service or user
delegation SAS are also available via an account SAS. Additionally, with the account SAS, you can delegate
access to operations that apply at the level of the service, such as Get/Set Service Properties and Get
Service Stats operations. You can also delegate access to read, write, and delete operations on blob
containers, tables, queues, and file shares that are not permitted with a service SAS.
Always use HTTPS to create or distribute a SAS. If a SAS is passed over HTTP and intercepted, an attacker
performing a man-in-the-middle attack is able to read the SAS and then use it just as the intended user could
have, potentially compromising sensitive data or allowing for data corruption by the malicious user.
Use a user delegation SAS when possible. A user delegation SAS provides superior security to a service
SAS or an account SAS. A user delegation SAS is secured with Azure AD credentials, so that you do not
need to store your account key with your code.
Use near-term expiration times on an ad hoc SAS (service SAS or account SAS). In this way, even if a SAS is
compromised, it's valid only for a short time. This practice is especially important if you cannot reference a
stored access policy. Near-term expiration times also limit the amount of data that can be written to a blob by
limiting the time available to upload to it.
Be careful with SAS start time. If you set the start time for a SAS to now, then due to clock skew (differences
in current time according to different machines), failures may be observed intermittently for the first few
minutes. In general, set the start time to be at least 15 minutes in the past. Or, don't set it at all, which will
make it valid immediately in all cases. The same generally applies to expiry time as well--remember that you
may observe up to 15 minutes of clock skew in either direction on any request. For clients using a REST
version prior to 2012-02-12, the maximum duration for a SAS that does not reference a stored access policy
is 1 hour, and any policies specifying longer term than that will fail.
Be specific with the resource to be accessed. A security best practice is to provide a user with the minimum
required privileges. If a user only needs read access to a single entity, then grant them read access to that
single entity, and not read/write/delete access to all entities. This also helps lessen the damage if a SAS is
compromised because the SAS has less power in the hands of an attacker.
Understand that your account will be billed for any usage, including via a SAS. If you provide write access to
a blob, a user may choose to upload a 200 GB blob. If you've given them read access as well, they may
choose to download it 10 times, incurring 2 TB in egress costs for you. Again, provide limited permissions to
help mitigate the potential actions of malicious users. Use short-lived SAS to reduce this threat.
Validate data written using a SAS. When a client application writes data to your storage account, keep in
mind that there can be problems with that data. If your application requires that data be validated or
authorized before it is ready to use, you should perform this validation after the data is written and before it is
used by your application. This practice also protects against corrupt or malicious data being written to your
account, either by a user who properly acquired the SAS, or by a user exploiting a leaked SAS.
If a SAS is leaked, it can be used by anyone who obtains it, which can potentially compromise your
storage account.
If a SAS provided to a client application expires and the application is unable to retrieve a new SAS
from your service, then the application's functionality may be hindered.
Azure Data Lake Store is the Web implementation of the Hadoop Distributed File System (HDFS). Meaning that
files are split up and distributed across an array of cheap storage. Each file you place into the store is split into
250MB chunks called extents. This enables parallel read and write. For availability and reliability, extents are
replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than
smaller files. If you have a file smaller than 250MB it is going to be allocated to one extent and one vertex (which is
the work load presented to the Azure Data Lake Analytics), whereas a larger file will be split up across many
extents and can be accessed by many vertexes.
The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are
row oriented, such as CSV – are parallelizable, as data does not span extents. Non-splittable formats – files that are
not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelized, as data
spans extents and can only be processed by a single vertex.
In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-
oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the
“Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working
with the structured data in the data lake is very similar to working with SQL databases.
ADLS is the primary storage component for both the UDL and the BDLs.
Recommended practices:
Security Considerations:
Azure Data Lake Storage Gen2 offers POSIX access controls for Azure Active Directory (Azure AD) users, groups,
and service principals. These access controls can be set to existing files and directories. The access controls can
also be used to create default permissions that can be automatically applied to new files or directories.
When working with big data in Data Lake Storage Gen2, it is likely that a service principal is used to allow services
such as Azure Databricks or ADF to work with the data. However, there are cases where individual users need
access to the data as well. In all cases, consider using Azure Active Directory security groups instead of assigning
individual users to directories and files.
Once a security group is assigned permissions, adding or removing users from the group doesn't require any
updates to Data Lake Storage Gen2. This also helps ensure that you do not exceed the maximum number of access
control entries per access control list (ACL). Currently, that number is 32. Each directory can have two types of
ACLs, the access ACL and the default ACL, for a total of 64 access control entries.
For more information about these ACLs, see Access control in Azure Data Lake Storage Gen2.
Data Lake Storage Gen2 supports the option of turning on a firewall and limiting access only to Azure services,
which is recommended to limit the vector of external attacks. Firewall can be enabled on a storage account in the
Azure portal via the Firewall > Enable Firewall (ON) > Allow access to Azure services options. This is
suggested mostly when the consumers of the applications are Azure services.
ADLS provides 3 copies of the data in the same region in order to protect against hardware failures. These copies are
managed internally by Microsoft. In case of a hardware failure within a Microsoft datacenter, Microsoft manages
pointing to one of the working copies of the data; the customer doesn't have a way to identify or access the different
copies of the data.
Apart from the above, Gen2 also comes with other replication options, such as ZRS or GZRS (preview), which improve
HA, while GRS and RA-GRS improve DR. For data resiliency with Data Lake Storage Gen2, it is recommended to
geo-replicate the data using GRS or RA-GRS. RA-GRS makes the secondary copy read-only/accessible. With geo-
replication, Microsoft manages the block-level replication of data internally, without the customer having to worry about it.
There could be a delay in the availability of data in the secondary/paired region; as claimed by Microsoft, this delay
is not more than 15 minutes.
In order to avoid data corruption or accidental deletes, it is suggested to take snapshots of the data periodically.
Currently Gen2 doesn't provide a snapshot feature, but it is on the roadmap. In case a project wants to keep snapshots, it
is suggested to take manual backups of the data periodically into a different ADLS location in the same region.
Some of the features which are on the roadmap, and for which Unilever is actively working with Microsoft to prioritize and get updates, are:
Azure Analysis Services is a fully managed platform as a service (PaaS) that provides enterprise-grade data
models in the cloud. Use advanced mashup and modeling features to combine data from multiple data sources,
define metrics, and secure your data in a single, trusted tabular semantic data model. The data model provides an
easier and faster way for users to browse massive amounts of data for ad hoc data analysis.
Azure Analysis Services is compatible with many great features already in SQL Server Analysis Services Enterprise
Edition. Azure Analysis Services supports tabular models at the 1200 and higher compatibility levels. Tabular
models are relational modeling constructs (model, tables, columns), articulated in tabular metadata object
definitions in Tabular Model Scripting Language (TMSL) and Tabular Object Model (TOM) code. Partitions,
perspectives, row-level security, bi-directional relationships, and translations are all supported*. Multidimensional
models and PowerPivot for SharePoint are not supported in Azure Analysis Services.
Tabular models in both in-memory and DirectQuery modes are supported. In-memory mode (default) tabular
models support multiple data sources. Because model data is highly compressed and cached in-memory, this mode
provides the fastest query response over large amounts of data. It also provides the greatest flexibility for complex
datasets and queries. Partitioning enables incremental loads, increases parallelization, and reduces memory
consumption. Other advanced data modeling features like calculated tables, and all DAX functions are supported.
In-memory models must be refreshed (processed) to update cached data from data sources. With Azure service
principal support, unattended refresh operations using PowerShell, TOM, TMSL and REST offer flexibility in making
sure your model data is always up to date.
DirectQuery mode* leverages the backend relational database for storage and query execution. Extremely large
data sets in single SQL Server, SQL Server Data Warehouse, Azure SQL Database, Azure Synapse Analytics
(SQL Data Warehouse), Oracle, and Teradata data sources are supported. Backend data sets can exceed
available server resource memory. Complex data model refresh scenarios aren't needed. There are also some
restrictions, such as limited data source types, DAX formula limitations, and some advanced data modeling features
aren't supported. Before determining the best mode for you, see Direct Query mode.
Tabular models consist of Tables linked together by Relationships. They work best when your data is modelled
according to the concepts of dimensional modelling.
It provides a semantic layer that sits between your data and your users and gives them:
The ability to query the data without knowing a query language like SQL
Fast query performance
Shared business logic – how the data is joined and aggregated, calculations, security – as well as just
shared data
Every table in AAS can be split up into multiple partitions. Usually this is applied for large tables with millions of
rows.
DAX is the query and calculation language of Tabular models. There are six places that DAX can be used when
designing a model:
1. Calculated columns
2. Calculated tables
3. Calculation groups
4. Measures
5. Detail Rows expressions
6. Security
Design Principles
VertiPaq Architecture
Tabular models in Analysis Services are databases that run in-memory or in DirectQuery mode, connecting to data
directly from back-end relational data sources. By using state-of-the-art compression algorithms and multi-threaded
query processor, the Analysis Services Vertipaq analytics engine delivers fast access to tabular model objects and
data by reporting client applications like Power BI and Excel.
On-Premises Gateways
For Azure AS to connect to on-premises data sources you need to install an on-premises gateway. This is the same
gateway used by Power BI, Flow, Logic Apps and PowerApps. Azure AS can only use gateways configured for the
same Azure region, for performance reasons.
Install the gateway as close as possible to the data source for the best performance
Gateways can be clustered for high availability
It provides extensive logging options for troubleshooting
These guidelines apply to all Azure Analysis Services instances in all environments across Unilever's Azure
subscriptions. They are in place to ensure that business value is delivered and that we use the service optimally at the
same time. These measures allow us to save costs.
Azure Analysis Services is one of the most expensive components on our Azure stack, and pausing it when not in
use can save a massive amount of cost.
The ADF instances in PDS environments come with ready-made pipelines that can be used for suspending the
service. These pipelines can be scheduled or triggered manually.
Please note that regular audits are in place to ascertain that the service is paused by projects.
Make sure that the Analysis Services instance is appropriately sized. If the service tier is too low for your project
requirements, the refresh might fail, especially if many users are accessing the cube at the time of refresh. The ADF for
every PDS project will have ready-made pipelines for scaling the service up and down. If necessary, use these
pipelines to scale up the service before refresh and scale it back down after refresh to save costs.
If the service tier is too high then a lot of capacity will be wasted and the project will incur huge costs.
Azure Analysis Services servers come in three tiers. There are multiple performance levels in each tier.
Developer tier is intended for dev use, but the licence allows for use in production.
How big is your model? Not easy to determine until you deploy
How much will your model grow over time?
There are lots of well-known tricks for reducing AS Tabular memory usage
A full process may mean memory usage almost doubles – but do you need to do a full process?
Processing individual partitions/tables will use less memory
Some unoptimized queries/calculations may result in large memory spikes
Caching takes place for some types of data returned by a query
QPU = Query Processing Unit. It is a relative unit of computing power for querying and processing. As an
illustration, 20 QPUs is roughly equal to 1 pretty fast core. Also note that a server with 200 QPUs will be 2x faster
than one with 100 QPUs.
Deciding on how many QPUs you will need for processing depends on the following:
How often do you need to process? What type of processing will you be doing? What will you be processing?
Not easy to know until later in the development cycle
The more QPUs, the more you can process in parallel
Tables in a model can be processed in parallel
Partitions in a table can be processed in parallel
Many processing operations such as compression are parallelised
Always better to plan so you do not process partitions containing data that has not changed!
Deciding on how many QPUs you will need for querying depends on the following:
How many concurrent users will you have? What types of query will they run? Not easy to determine until
you go into Production
There are two parts of SSAS that deal with queries
All queries start off in the Formula Engine
This is single threaded
More QPUs = more concurrent users running queries
But even then data access might become a bottleneck
The Storage Engine reads and aggregates data
Parallelism is only possible on large tables of > 16 million rows
A chart like the one below gives the usage details; the dotted (blue) line indicates the paused state.
In DirectQuery mode there is no processing needed – the data stays in the source database
DirectQuery needs Standard tier
At query time the Formula Engine is used but there is no Storage Engine activity – queries are sent back to
the data source
Therefore:
You need the bare minimum of memory
QPUs are still important because a lot of activity still takes place in the Formula Engine
An AAS database or Power BI dataset should contain all the data you need for your report
If you need to get data from multiple databases/datasets, you’re in trouble
Until composite models for Live connections arrive, but even then…
Advice: put all your data into one database/dataset until you have a good reason not to
Reasons to create multiple small databases/datasets include easier scale up/out, easier security, easier
development
Development tools
There are quite a few tools that may be used to build and manage a Tabular model database on Azure Analysis
Services. The following sections describe some of them.
Visual Studio is used for Analysis Services development. You need to install an extension called “Analysis Services
projects” to do this.
For new Azure AS projects you should choose the highest compatibility level available
Advantages:
Disadvantages:
Advantages:
Disadvantages:
BISM Normalizer: Visual Studio extension for deploying and comparing Analysis Services databases
DAX Studio: a tool for writing and tuning DAX queries
Vertipaq Analyzer: helps you understand memory usage by tables and columns
Power BI tools include:
Various at https://powerbi.tips/tools/
Power BI Helper: https://powerbihelper.org/
Power BI Sentinel ($): http://www.powerbisentinel.com/
Data Vizioner ($): https://www.datavizioner.com/
Azure Analysis Services security is based on Azure AD. It doesn’t allow using usernames/passwords. It mandates
all users to have a valid Azure AD identity in a tenant in the same subscription as the AAS instances.
The Azure SSAS firewall controls which IP addresses clients can connect from
Access from Power BI is a built-in option
Current bug that Power BI imports don’t work, only Live connections
For everything else you need to supply an IP address range
Applications blocked by the firewall get a 401 Unauthorized error message
All tables in AAS can have row-level security filters applied. The filter takes the form of a DAX expression that
returns a Boolean value – if the expression returns false for a row, that row is not seen by the user. Since filters
move along relationships, from the one side to the many side, filtering a dimension table also filters a fact table.
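As a hedged illustration of how such a filter is declared, the TMSL fragment below creates a role whose members only see one market; the database, role, table and column names are hypothetical:
{
    "createOrReplace": {
        "object": {
            "database": "SalesModel",
            "role": "UK Market Readers"
        },
        "role": {
            "name": "UK Market Readers",
            "modelPermission": "read",
            "tablePermissions": [
                {
                    "name": "Market",
                    "filterExpression": "'Market'[MarketCode] = \"UK\""
                }
            ]
        }
    }
}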
RLS guidelines:
4. OBJECT-LEVEL SECURITY
Remember that an inefficient model can completely slow down a report, even with very small data volumes. We
should try and build the model towards these goals:
Dimensional modelling structures data specifically for analysis as opposed to storage/retrieval. Hence it often
requires denormalizing the data which effectively minimizes joins. Here is Kimball Group’s 4 step process to do this:
2. OPTIMIZE DIMENSIONS
1. Minimize the number of columns. In particular columns with high number of distinct values should be
excluded
2. Reduce cardinality (data type conversions)
a. For example, convert DateTime to Date, or reduce the precision of numeric fields to achieve better
compression ratio and reducing number of unique values.
b. Even if you need the time precision, split the DateTime column in 2 columns - Date and Time and if
possible round the time to the minute
3. Filter out unused dimension values (unless a business scenario requires them)
4. Use integer Surrogate Keys, pre-sort them
a. Azure Analysis Services compresses rows in segments of millions of rows
b. Integers use Run Length Encoding
c. Sorting will maximize compression when encoded, as it reduces the range of values per segment
5. Ordered by Surrogate Key (to maximize Value encoding)
6. Hint for VALUE encoding on numeric columns
7. Hint for disabling hierarchies on Surrogate Keys
3. OPTIMIZING FACTS
1. Minimize the number of columns and exclude the ones not required for any reporting or self-service. In
particular columns with high number of distinct values should be excluded. Usually primary keys for Fact
tables can be excluded.
2. Handle early arriving facts. [Facts without corresponding dimension records]
3. Replace dimension IDs with their surrogate keys
4. Reduce cardinality (data type conversions)
a. For example, convert DateTime to Date, or reduce the precision of numeric fields to achieve better
compression ratio and reducing number of unique values.
b. Even if you need the time precision, split the DateTime column in 2 columns - Date and Time and if
possible round the time to the minute
5. Consider moving calculations in the BDL (source) layer so that results can be used in compression
evaluations
6. Order by less diverse SKs first (to maximize compression)
7. Increase Tabular sample size for deciding Encoding
Bi-directional relationships are undesired because applying filters/slicers traverses many relationships and will be
slower. Also, some filter chains are unlikely to add business value.
Overview
This article describes a standard approach of processing an Analysis Services database using webhooks. As per
the guidelines from TDA, this process should be used by any project that makes use of AAS and requires cube
refreshes. This process replaces the previously approved approaches of using a Batch Account and Function Apps for
cube refresh.
Similar to the AAS pause and resume ADF pipelines, the webhook is standard code maintained by the landscape team.
This allows for standardizing the approach we use for cube refreshes while allowing projects the ability to do full
/partial refreshes. The ADF pipeline uses an Automation account to connect with AAS and process the cube.
How It Works
This approach uses the Tabular Model Scripting Language (TMSL), which allows the AAS cubes to be maintained. When
a new ADF is configured in a PDS environment, the Landscape team will provide two additional pipelines. Please refer
to the following sections for their use.
PL_PROCESS_CUBE
This pipeline has a single activity and a parameter. You may choose to pass the parameter from another pipeline or
you can trigger the pipeline on its own by supplying appropriate value for the parameter. The parameter is called
tmslScript and it accepts TMSL representation of the refresh command that you want to perform on the cube.
The pipeline will finish in a few seconds generating an asynchronous task. When the asynchronous task finishes, it
creates a blob in the project’s storage account in a container that has the same name as the data factory. The
webhook will create the container if it doesn’t exist and it also supports projects with multiple data factories. Name
of the blob will be the same as the pipeline run id. Details of AAS refresh asynchronous task will be available in the
said blob. Please note, the blob appears in the container as soon as the asynchronous task is finished. Any
exceptions raised by AAS refresh command will be contained in blob.
ExecuteTMSL is a web activity and generates a POST request. The ‘body’ of the request is formed of a few pieces
and is generated at run-time. The dynamic content looks like this:
@concat('{"csv":"',pipeline().DataFactory,',',pipeline().RunId,',,PL_CALLBACK",','"object":',pipeline().parameters.
tmslScript,'}')
@concat('{"csv":"',pipeline().DataFactory,',',pipeline().RunId,',,
PL_CALLBACK",','"object":',pipeline().parameters.tmslScript,'}')
Notice that the pipeline parameter tmslScript is passed in the body. The name of the callback pipeline is also
passed in the body.
If this pipeline isn’t already available in your ADF, you can use the following JSON to create it:
{
"name": "PL_PROCESS_CUBE",
"properties": {
"description": "Process AAS Cube using TMSL refresh command.",
"activities": [
{
"name": "ExecuteTMSL",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": "https://s2events.azure-automation.net
/webhooks?token=C2ZOe%2fMXfGJuAkmLiyOk1Or5PsRG5Tn9sqTqEPdg%2bFM%3d",
"method": "POST",
"body": {
"value": "@concat('{\"csv\":\"',pipeline().
DataFactory,',',pipeline().RunId,',,PL_CALLBACK\",','\"object\":',
pipeline().parameters.tmslScript,'}')",
"type": "Expression"
},
"linkedServices": [],
"datasets": []
}
}
],
"parameters": {
"tmslScript": {
"type": "string",
"defaultValue": {
"refresh": {
"type": "dataOnly",
"objects": [
{
"database": "Livewiree"
}
]
}
}
}
},
"annotations": []
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
PL_CALLBACK
An additional pipeline is provided if you want to perform any action upon completion of the asynchronous task. The webhook calls this pipeline and passes two parameters:
exitStatus – a Boolean value indicating whether the refresh was successful
logFileName – a string value representing the name of the blob.
Use of this pipeline is optional and you may decide to use any other pipeline upon completion of PL_PROCESS_CUBE. To do that, replace PL_CALLBACK with the name of your pipeline in the body of the ExecuteTMSL web activity. Please note, your pipeline should accept the exitStatus and logFileName parameters.
If this pipeline isn’t already available in your ADF, you can use the following JSON to create it:
{
"name": "PL_CALLBACK",
"properties": {
"activities": [
{
"name": "Set variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "dummy",
"value": {
"value": "@pipeline().parameters.logFileName",
"type": "Expression"
}
}
}
],
"parameters": {
TMSL SCRIPT
While TMSL can be used for several operations on AAS cubes, this webhook code is restricted to only run ‘refresh’
commands. Any other command supplied as part of the tmslScript parameter will not be accepted and an
appropriate error message will be generated.
Action TMSL
Automatic {"refresh":{"type":"automatic","objects":[{"database":"Livewiree"}]}}
Introduction
Data warehousing is about bringing massive amounts of data from diverse sources into one definitive, trusted source for analytics and reporting. A data warehouse layer represents a single source of truth, in a curated fashion, that Unilever can use to gain insights and make decisions. Azure Synapse is a SQL database that is optimized for analytics and enforces structure, data quality and security. It is a PaaS service that brings together enterprise data warehousing and Big Data analytics. Storage and compute are two separate components, so you don’t have to pay for lots of idle compute when you don’t need it.
Synapse SQL pool refers to the enterprise data warehousing features that are generally available in Azure Synapse.
SQL pool represents a collection of analytic resources that are being provisioned when using Synapse SQL. The
size of SQL pool is determined by Data Warehousing Units (DWU).
Import big data with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance
analytics. As you integrate and analyze, Synapse SQL pool will become the single version of truth your business
can count on for faster and more robust insights.
In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store,
Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex
analysis, Synapse SQL pool uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to
bring the data into Synapse SQL pool tables.
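For illustration, here is a minimal T-SQL sketch of the PolyBase load pattern described above. The external table, external data source, file format and staging table names are hypothetical and assume the data source and file format objects already exist.

-- Sketch only; names are illustrative and assume an external data source
-- (AzureDataLakeStore) and a file format (ParquetFileFormat) already exist.
CREATE EXTERNAL TABLE ext.SalesRaw
(
    SaleId    BIGINT,
    ProductId INT,
    SaleDate  DATE,
    Amount    DECIMAL(18,2)
)
WITH
(
    LOCATION = '/bdl/sales/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = ParquetFileFormat
);

-- CTAS pulls the external data into an internal Synapse SQL pool table in parallel.
CREATE TABLE stg.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM ext.SalesRaw;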
Synapse SQL pool stores data in relational tables with columnar storage. This format significantly reduces the data
storage costs, and improves query performance. Once data is stored, you can run analytics at massive scale.
Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead
of days.
The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain
insights to make well-informed business decisions.
Synapse SQL leverages a scale-out architecture to distribute computational processing of data across multiple
nodes. The unit of scale is an abstraction of compute power that is known as a data warehouse unit. Compute is
separate from storage, which enables you to scale compute independently of the data in your system.
SQL Analytics uses a node-based architecture. Applications connect and issue T-SQL commands to a Control
node, which is the single point of entry for SQL Analytics. The Control node runs the MPP engine, which optimizes
queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.
The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement Service
(DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel
and return accurate results.
With decoupled storage and compute, a Synapse SQL pool can scale compute up or down independently of the stored data, and compute can be paused when it is not needed so that you pay only for storage.
Here are the best practice guidelines for Azure SQL DW.
Data Warehouse is one of the most expensive components on our Azure stack, and pausing it when not in use can save a significant amount of cost.
The ADF instance in PDS environments comes with ready-made pipelines that can be used for suspending the service. These pipelines can be scheduled or triggered manually.
Please note that regular audits are in place to ascertain that the service is paused by projects.
Make sure that the data warehouse is appropriately sized. If the service tier is too low for your project requirements, processes will take longer and, if multiple processes are trying to access the service, some of them might even fail.
On the other hand, if the service tier is too high then a lot of capacity will be wasted and the project will incur huge costs.
Follow the recommendations in each of the following sections to minimise spend as well as achieve better
performance from SQL Data Warehouse.
ACCESS CONTROL
Developers do NOT have access to SQL DW in any environment except DEV environment.
Developers can access SQL DW via SSMS using their Unilever credentials.
To connect to ADLS Gen1 for PolyBase, an SPN MUST be used (a credential sketch follows this list).
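As an illustration of the SPN requirement above, the following is a minimal T-SQL sketch of the PolyBase credential setup. The placeholder values (application id, key, tenant and account names) are hypothetical; in practice they are provisioned by the landscape team.

-- Sketch only; run once per database. Placeholder values are hypothetical.
CREATE MASTER KEY;  -- skip if the database master key already exists

CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = '<spn-application-id>@https://login.microsoftonline.com/<tenant-id>/oauth2/token',
    SECRET = '<spn-key>';

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
(
    TYPE = HADOOP,
    LOCATION = 'adl://<datalake-account>.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);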
STORAGE OPTIONS
Rowstore – Clustered Index
Keeps the data in an indexed / ordered fashion, which is great for reading. These indexes, however, become fragmented over time.
Performance
When you insert data which is ordered on the Cluster Key, new rows are appended to the end of the table, resulting in fast load performance. If the data is not ordered on the Cluster Key, new rows inserted into existing pages result in page splits – poor load performance
Index maintenance is an overhead on data load
Good lookup performance
Ideal for limited range scans & singleton selects (Seeks)
Slower for table scans / partition scans / loading
Rowstore – Heap
Performance
New rows appended to the end of the table – fast load performance
Whole table is / may be read for lookups (Seeks)
Whole table is read for Scans
Bad read performance
DISTRIBUTION
Pick the right distribution for your tables:
Select the proper table distribution (hash, round-robin or replicated)
Select the right hash distribution column and minimize data movement
Replicate dimension tables to reduce data movement
A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When SQL
Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller
queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions. A
data warehouse with maximum compute resources has one distribution per Compute node. A data warehouse with
minimum compute resources has all the distributions on one compute node.
When creating a table in SQL DW, you need to decide if the table will be hash distributed or round-robin distributed.
This decision has implications for query performance. Each of these distributed tables may require data movement
during query processing when joined together. Data movement in an MPP RDBMS is an expensive but
sometimes unavoidable step. The objective of a good data warehouse design in SQL DW is to minimize data
movement.
Start with Round Robin but aspire to a hash-distribution strategy to take advantage of a massively parallel
architecture.
Make sure that common hash keys have the same data format.
Don’t distribute on a varchar column.
Dimension tables with a common hash key to a fact table with frequent join operations can be hash
distributed.
Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
Use sys.dm_pdw_request_steps to analyze the data movement behind queries, and monitor the time that broadcast and shuffle operations take. This is helpful when reviewing your distribution strategy (a DDL sketch follows this list).
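The following T-SQL sketch illustrates the hash-distributed fact / replicated dimension pattern described above; table and column names are illustrative only.

-- Large fact table: hash distribute on a frequently joined, high-cardinality column.
CREATE TABLE dbo.FactSales
(
    ProductKey INT NOT NULL,
    DateKey    INT NOT NULL,
    Quantity   INT NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

-- Small dimension table: replicate a full copy to each Compute node to avoid data movement on joins.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);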
INDEXING
By default, SQL Data Warehouse creates a clustered ColumnStore index when no index options are specified on a
table. Clustered ColumnStore tables offer both the highest level of data compression as well as the best overall
query performance. Clustered ColumnStore is the best place to start when you are unsure of how to index your
table.
The clustered ColumnStore index is more than an index, it is the primary table storage. It achieves high data
compression and a significant improvement in query performance on large data warehousing fact and dimension
tables. Clustered ColumnStore indexes are best suited for analytics queries rather than transactional queries, since
analytics queries tend to perform operations on large ranges of values rather than looking up specific values.
Heap Index
A heap is no index at all; it is recommended only for temporarily landing data, e.g. staging-layer data loads.
Clustered Index
Clustered indexes may outperform clustered ColumnStore tables when a single row needs to be quickly retrieved. For queries where a single row or a very small number of rows must be looked up with extreme speed, consider a clustered index or a non-clustered secondary index.
The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filters on other columns, a non-clustered index can be added to those columns. However, each index added to a table adds both space and processing time to loads. The index options are sketched below.
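A minimal T-SQL sketch of the three storage/index options discussed in this section; the table names are hypothetical and the CTAS source (stg.Sales, from the PolyBase sketch earlier) is assumed to exist.

-- 1. Clustered columnstore index: the default, best for large analytic scans.
CREATE TABLE dbo.FactSales_CCI
WITH (DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stg.Sales;

-- 2. Heap: no index, fastest option for temporary staging loads.
CREATE TABLE dbo.Sales_Staging
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM stg.Sales;

-- 3. Clustered (rowstore) index: best for highly selective single-row lookups.
CREATE TABLE dbo.Sales_Lookup
WITH (DISTRIBUTION = HASH(ProductId), CLUSTERED INDEX (SaleId))
AS SELECT * FROM stg.Sales;

-- Optional non-clustered index to support selective filters on another column.
CREATE INDEX IX_Sales_Lookup_SaleDate ON dbo.Sales_Lookup (SaleDate);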
DATA SKEW
Data skew primarily refers to a non uniform distribution in a dataset. Basically the data warehouse has 60 nodes
where the data is distrubited. If some nodes have more data than others, the workers with more data should work
harder, longer, and need more resources and time to complete their jobs. Detect data skew
Natural Skew
NULL hash key values
Default hash key value
Bad hash key choice
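A minimal sketch of detecting skew for a suspect table; the table name and request id below are illustrative.

-- Rows and space per distribution; a roughly even row count across the 60 distributions means little skew.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");

-- Inspect the data movement steps (broadcast/shuffle) behind a slow request.
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234';  -- request id taken from sys.dm_pdw_exec_requests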
Resolution
The MPP query optimizer relies heavily on statistics to evaluate plans. Out-of-date or non-existent statistics are the most common reason for MPP performance issues. Avoid issues with statistics by creating them on all recommended columns and updating them after every load:
Joins
Predicates
Aggregations
Group By’s
Order By’s
Computations
The good news is that Azure SQL DW now supports automatic creation of column-level statistics; however, statistics still need to be kept up to date after every load.
Queries occupy concurrency slots based on their resource class. The number of concurrent queries depends on the DWU service objective, and the RAM allocated to each query depends on its resource class and the DWU.
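For example, a dedicated load user can be given a larger resource class so that bulk loads get more memory (at the cost of consuming more concurrency slots). The user and resource class names below are illustrative.

-- Assign a larger static resource class to a dedicated load user.
EXEC sp_addrolemember 'staticrc40', 'load_user';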
Large fact tables or historical transaction tables should be stored as hash distributed tables
Partitioning large fact tables benefits data maintenance and query performance
Recommendations for updating statistics:
Frequency of stats updates: conservative – daily
After loading or transforming your data: if the data is less than 1 billion rows, use default sampling (20 percent); with more than 1 billion rows, a 2 percent sample is good
When temporarily landing data on SQL Data Warehouse, a heap table makes the overall process faster
Use minimal logging with bulk loads (INSERT from SELECT) to avoid memory errors
Use a batch size >= 102,400 rows per partition-aligned distribution (see the sketch below)
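A minimal T-SQL sketch of the statistics and minimal-logging load guidance above; object names are illustrative and the 2 percent sample is only intended for very large tables.

-- Create and refresh statistics on columns used in joins, predicates and group-bys.
CREATE STATISTICS stat_FactSales_ProductKey ON dbo.FactSales (ProductKey);
UPDATE STATISTICS dbo.FactSales;                                          -- default sampling
UPDATE STATISTICS dbo.FactSales (stat_FactSales_ProductKey) WITH SAMPLE 2 PERCENT;

-- Minimally logged load pattern: land in a heap staging table, then CTAS into the final table.
CREATE TABLE stg.Sales_Heap
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM ext.SalesRaw;

CREATE TABLE dbo.Sales_Final
WITH (DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stg.Sales_Heap;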
Scale up and down depending on the ELT process and AAS cube generation.
No business reports/dashboards connect to Azure SQL DW.
It cannot be paused because operational data incidents must be analyzed using SQL statements. Data lineage is needed for investigation.
Reserved capacity recommended.
Azure Data Factory & Databricks for the ETL process
Split user reporting workloads:
Direct Query (Azure SQLDW) – Fact Detail
Import (Power BI Datasets) – Fact Aggregates & Dimensions
PowerBI composite models & aggregated tables
Azure SQL Data Warehouse materialized views and result set cache (see the sketch after this list)
Improve performance and query concurrency for queries at the lowest granular level
Reduce the number of pre-calculated combinations at the cube level
Need to find the right aggregation level for a scalable architecture
Incremental refresh vs Full refresh
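As an illustration of the materialized view and result set cache options mentioned in the list above, here is a minimal T-SQL sketch; the database, view and column names are illustrative.

-- Result set caching is enabled at database level (run this while connected to master).
ALTER DATABASE [MyDW] SET RESULT_SET_CACHING ON;

-- Materialized view that pre-aggregates the fact table at the grain most reports query.
CREATE MATERIALIZED VIEW dbo.mvSalesByProductDate
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey, DateKey, SUM(Amount) AS TotalAmount, COUNT_BIG(*) AS RowCnt
FROM dbo.FactSales
GROUP BY ProductKey, DateKey;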
Good Reads
https://techcommunity.microsoft.com/t5/DataCAT/Azure-SQL-Data-Warehouse-loading-patterns-and-strategies/ba-p/305456
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-overview#unsupported-table-features
Azure Machine Learning is a cloud service that you use to train, deploy, automate, and manage machine learning
models, all at the broad scale that the cloud provides.
The following table shows various development environments supported by Azure Machine learning, along with
their pros and cons.
Automated ML
Pros: Automated machine learning automates the process of selecting the best algorithm to use for your specific data, so you can generate a machine learning model quickly.
Cons: Less control on data preparation and hyper-parameter tuning. Available in Enterprise Edition only.
Designer (preview)
Pros: Azure Machine Learning designer lets you visually connect datasets and modules on an interactive canvas to create machine learning models. It enables you to prep data, train, test, deploy, manage, and track machine learning models without writing code. Provides a UI-based, low-code experience.
Cons: Still in preview. Only has an initial set of popular modules.
Azure Notebooks
Pros: Free. Supports more languages than any other platform, including Python 2, Python 3, R, and F#.
Cons: Each project is limited to 4GB memory and 1GB data to prevent abuse.
Machine Learning Studio (Classic)
Pros: Easiest way to get started; includes hundreds of built-in packages and support for custom code. Supports data manipulation, model training and deployment.
Cons: Scale (10GB training data limit). Supports proprietary compute target, CPU only. ML Pipeline is not supported.
Local environment
Pros: Full control of your development environment and dependencies. Run with any build tool, environment, or IDE of your choice.
Cons: Takes longer to get started. Necessary SDK packages must be installed, and an environment must also be installed if you don't already have one.
Azure Databricks
Pros: Ideal for running large-scale intensive machine learning workflows on the scalable Apache Spark platform.
Cons: Overkill for experimental machine learning, or smaller-scale experiments and workflows. Additional cost incurred for Azure Databricks. See pricing details.
The Data Science Virtual Machine (DSVM)
Pros: Similar to the cloud-based notebook VM (Python and the SDK are pre-installed), but with additional popular data science and machine learning tools pre-installed. Easy to scale and combine with other custom tools and workflows.
Cons: A slower getting-started experience compared to the cloud-based notebook VM.
Industrialization
Azure ML jobs need to be published as web services, which can be scheduled from Azure
When you create a Machine Learning component you get the following resources:
Workspace
A workspace is a centralized place to work on all Machine Learning aspects. It is a logical container for your machine learning experiments, compute targets, data stores, machine learning models, docker images, deployed services etc., keeping them all together. It allows you to prepare data for experimentation, train models, compare experimentation results, deploy models and monitor them. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. This is where you create experiments and maintain models.
Storage Account
The storage account is used as the default datastore for the workspace. Jupyter notebooks that are used with your notebook VMs are stored here as well.
Application Insights
Stores monitoring information about the models and services you deploy.
Key Vault
Stores secrets that are used by compute targets and other sensitive information that's needed by the workspace.
In normal machine learning scenarios, you bring data together from disparate places and prepare it for model training. You then train the model and, once you're happy with the output of the model, you deploy it.
This whole process involves a number of steps and interconnected decisions that you make to get the model accuracy that you're looking for. For example, when you're preparing data you may ask yourself how to handle nulls, or whether you have the right number of features in the dataset to reach the accuracy and score that you're looking for. And when you're building and training, which algorithm should you select? Should it be a linear regression or a decision tree, and what are the hyper-parameters that you need to choose for those specific algorithms?
So as you can see there are a number of questions that you may be asking yourself through this process. Without a
tool that can automatically do this for you, you might be iteratively trying a combination of the above to achieve the
score you wanted. This process can be very costly and time consuming especially if you don't really know the data.
Automated machine learning is a way to automate this process. With Automated machine learning you enter your
data, you define your goals, and you apply your constraints. Automated machine learning builds an end to end
pipeline that allows you to build a model and reach an accuracy quickly and effectively. You can get an optimized
model with far fewer iterations and far fewer steps saving you time, money and resources.
Automated machine learning simplifies the process of generating models tuned to the goals and constraints you defined for your experiment, such as how long the experiment should run, which models to allow or deny, how many iterations to run, or an exit score that you may have defined.
Automated machine learning examines your dataset and its characteristics and recommends new pipelines to use to build your machine learning models. This encompasses:
preprocessing steps
feature extraction
feature generation
model selection
hyper-parameter tuning
It also learns from the metadata from your previous iteration to recommend new pipelines to get to your score and
exit criteria much quicker and much sooner. This helps accelerate your machine learning processes with more
efficiency and ease.
Designer (preview)
The visual interface uses your Azure Machine Learning workspace to prepare data, train models and deploy them. There is no programming required; you visually connect datasets and modules to construct your model.
Azure Notebooks is a free hosted service to develop and run Jupyter notebooks in the cloud with no installation.
Please note, however, that each project is limited to 4GB memory and 1GB data to prevent abuse. Legitimate users that exceed these limits see a Captcha challenge to continue running notebooks. Azure Notebooks helps you to get started quickly on prototyping, data science, academic research, or learning to program in Python, giving instant access to a full Anaconda environment with no installation.
To learn more, check out the documentation .
Azure Machine Learning Studio gives you an interactive, visual workspace to easily build, test, and iterate on a
predictive analysis model. You drag-and-drop datasets and analysis modules onto an interactive canvas,
connecting them together to form an experiment, which you run in Machine Learning Studio. To iterate on your
model design, you edit the experiment, save a copy if desired, and run it again. When you're ready, you can convert
your training experiment to a predictive experiment, and then publish it as a web service so that your model
can be accessed by others.
There is no programming required, just visually connecting datasets and modules to construct your predictive
analysis model.
The Azure ML Notebook VM is a cloud-based workstation created specifically for data scientists. Developers and
data scientists can perform every operation supported by the Azure Machine Learning Python SDK using a familiar
Jupyter notebook in a secure, enterprise-ready environment. Notebook VM is secure and easy-to-use,
preconfigured for machine learning, and fully customizable.
Secure – provides AAD login integrated with the AML Workspace, provides access to files stored in the
workspace, implicitly configured for the workspace.
Preconfigured – with Jupyter, JupyterLab, up-to-date AML Python Environment, GPU drivers, Tensorflow,
Pytorch, Scikit learn, etc. (uses DSVM under the hood)
Simple set up – created with a few clicks in the AML workspace portal, managed from within the AML
workspace portal.
Customizable – use CPU or GPU machine types, install your own tools (or drivers), ssh to the machine,
changes persist across restarts.
Local environment
Azure Machine Learning enables you to locally create and run machine learning experiments, create and train
models and much more. This requires installing various tools, such as Visual Studio Code as a development environment and the Azure ML SDK. Using azureml.core.compute.ComputeTarget you can select your local machine as a compute target. For a comprehensive guide on setting up and managing compute targets, see the how-to.
Here's a video to help you get started using Visual Studio code for Machine Learning
DSVMs are Azure Virtual Machine images, pre-installed, configured and tested with several popular tools that are
commonly used for data analytics, machine learning and AI training. They are Pre-configured environments in the
cloud for Data Science and AI Development.
“DSVMs offer the most flexibility to develop ML models, but they are hard to govern and hence Unilever doesn't promote the use of DSVM unless there's a specific business requirement that can't be met by the standard PaaS services described above.”
An Azure ML pipeline performs a complete logical workflow with an ordered sequence of steps. Each step is a
discrete processing action. Pipelines run in the context of an Azure Machine Learning Experiment .
Data preparation including importing, validating and cleaning, munging and transformation, normalization,
and staging
Training configuration including parameterizing arguments, filepaths, and logging / reporting configurations
Training and validating efficiently and repeatably, which might include specifying specific data subsets,
different hardware compute resources, distributed processing, and progress monitoring
Deployment, including versioning, scaling, provisioning, and access control
Azure Notebook VM
Azure Notebook VMs come with the entire ML SDK already installed in your workspace VM, and notebook tutorials
are pre-cloned and ready to run. While it is the easiest way to get started to run ML models, there are some
disadvantages.
Lack of control over your development environment and dependencies. Additional cost incurred for Linux VM (VM
can be stopped when not in use to avoid charges).
MLOps
Machine Learning Operations (MLOps) is based on DevOps principles and practices that increase the efficiency of
workflows. For example, continuous integration, delivery, and deployment.
Create reproducible ML pipelines. Machine Learning pipelines allow you to define repeatable and reusable
steps for your data preparation, training, and scoring processes.
Create reusable software environments for training and deploying models.
Register, package, and deploy models from anywhere. You can also track associated metadata required
to use the model.
Capture the governance data for the end-to-end ML lifecycle. The logged information can include who is
publishing models, why changes were made, and when models were deployed or used in production.
Notify and alert on events in the ML lifecycle. For example, experiment completion, model registration,
model deployment, and data drift detection.
Monitor ML applications for operational and ML-related issues. Compare model inputs between training
and inference, explore model-specific metrics, and provide monitoring and alerts on your ML infrastructure.
Automate the end-to-end ML lifecycle with Azure Machine Learning and Azure Pipelines. Using
pipelines allows you to frequently update models, test new models, and continuously roll out new ML models
alongside your other applications and services.
Azure Monitor provides a single management point for infrastructure-level logs and monitoring for most of your
Azure services. Azure Monitor maximizes the availability and performance of your applications by delivering a
comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises
environments.
The following diagram depicts a high-level view of Azure Monitor. At the center of the diagram are the data stores
for metrics and logs, which are the two fundamental types of data that Azure Monitor uses. On the left side are the
sources of monitoring data that populate these data stores. On the right side are the different functions that Azure
Monitor performs with this collected data such as analysis, alerting, and streaming to external systems.
Azure Monitor can collect data from a variety of sources. You can think of monitoring data for your applications as
occurring in tiers that range from your application to any OS and the services it relies on to the platform itself. Azure
Monitor collects data from each of the following tiers:
Application monitoring data - Data about the performance and functionality of the code you have written,
regardless of its platform.
Guest OS monitoring data - Data about the OS on which your application is running. It might be running in
Azure, in another cloud, or on-premises.
Azure resource monitoring data - Data about the operation of an Azure resource.
Azure subscription monitoring data - Data about the operation and management of an Azure subscription
and data about the health and operation of Azure itself.
Azure tenant monitoring data - Data about the operation of tenant-level Azure services, such as Azure
Active Directory (Azure AD).
As soon as you create an Azure subscription and start adding resources, such as VMs and web apps, Azure
Monitor starts collecting data. Activity logs record when resources are created or modified and metrics tell you how
the resource is performing and the resources that it's consuming. You can also extend the data you're collecting by
enabling diagnostics in your apps and adding agents to collect telemetry data from Linux and Windows or
Application Insights.
Azure Monitor is the place to start for all your near real-time resource metric insights. Many Azure resources will
start outputting metrics automatically once deployed. For example, Azure Web App instances will output compute
and application request metrics. Metrics from Application Insights are also collated here in addition to VM host
diagnostic metrics.
Log Analytics
Centralized logging can help you uncover hidden issues that may be difficult to track down. With Log Analytics you
can query and aggregate data across logs. This cross-source correlation can help you identify issues or
performance problems that may not be evident when looking at logs or metrics individually. The following illustration
shows how Log Analytics acts as a central hub for monitoring data. Log Analytics receives monitoring data from
your Azure resources and makes it available to consumers for analysis or visualization.
You can collate a wide range of data sources, security logs, Azure activity logs, server, network, and application
logs. You can also push on-premises System Center Operations Manager data to Log Analytics in hybrid
deployment scenarios and have Azure SQL Database send diagnostic information directly into Log Analytics for
detailed performance monitoring.
When designing a monitoring strategy, it's important to include every component in the application chain, so you
can correlate events across services and resources. For services that support Azure Monitor, they can be easily
configured to send their data to a Log Analytics workspace. You can also submit custom data to Log Analytics
through the Log Analytics API.
Log Analytics allows you to collect, search and visualize machine data from cloud and on-premises sources.
With this data in Log Analytics, you can query the raw data for troubleshooting, root cause identification, and
auditing purposes. Here are some examples:
Track the performance of your resource (such as a VM, website, or logic app) by plotting its metrics on a
portal chart and pinning that chart to a dashboard.
Get notified of an issue that impacts the performance of your resource when a metric crosses a certain
threshold.
Configure automated actions, such as autoscaling a resource or firing a runbook when a metric crosses a
certain threshold.
Perform advanced analytics or reporting on performance or usage trends of your resource.
Archive the performance or health history of your resource for compliance or auditing purposes
Query the logs
For several known services (SQL Server, Windows Server Active Directory), there are management solutions
readily available that visualize monitoring data and uncover compliance with best practices.
Log Analytics allows you to create queries and interact with other systems based on those queries. The most
common example is an alert. Maybe you want to receive an email when a system runs out of disk space or a best
practice on SQL Server is no longer followed. Log Analytics can send alerts, kick off automation, and even hook
into custom APIs for things like integration with IT service management (ITSM).
Only PAAS Azure Web App component is approved as part of I&A Landscape.
Only Windows based web app is approved as a PAAS component.
Web App should use AAD authentication and MFA for user access control
WAF (Web Application Firewall) is mandatory for Web App as per security best practices
Web App can use SQL Database to keep configuration and user profiling information. For any data requirement please work with the Architect to define the right database for storing the data
MySQL database is not approved for storing data
Web App connection to SQL DB/DW should only be through the MSI method
ASP.NET is the suggested framework
Only two App Service plans are suggested: one for Prod and another for Non-Prod
Deployment of the Web App component should follow the standard Azure DevOps deployment. Docker is not suggested.
Web App can be used to embed Power BI dashboards
Blazor
Blazor allows you to build full stack web-apps using just .NET Core 3.0 and C#.
Until now, .NET has been used to generate server-rendered web apps. The server runs .NET and generates HTML or JSON in response to a browser request. If you wanted to do anything on the client (browser), you could use JavaScript.
Blazor employs a component model. So, unlike MVC, where each “View” is essentially an entire page, which you
get back from the server when you make a request, Blazor deals in components.
A component can represent an entire “page”, or you can render components inside other components.
The component model approach gives you a convenient way to take a mockup or design, break it down into smaller
parts, and build each part separately (as components) before composing them back together. Components also
carry the benefit of being easily re-used.
Hence it is a very productive way of writing and maintaining web app code. It's available with VS 2019 and also in VS Code with the C# extension.
Concerns
Blazor server does most of its processing on the server, making your server (and network resources) the primary
point of failure, and bottleneck when it comes to speed. Every button click, or DOM triggered event gets sent over a
socket connection to the server, which processes the event, figures out what should change in the browser, then
sends a small “diff” up to the browser (via the socket connection).
This article from Microsoft is a useful guide to the kind of performance you can expect.
“In our tests, a single Standard_D1_v2 instance on Azure (1 vCPU, 3.5 GB memory) could handle over 5,000
concurrent users without any degradation in latency”
Based on this, for an average web application it seems the server resources wouldn’t be a major concern, but the
network latency might be a factor.
Blazor WebAssembly will be able to remove this constraint, but it is still in preview and is expected to reach GA by May 2020.
Overview
A web app can connect to an Azure SQL database using Managed Service Identity (MSI), which is a safer way to connect because no username or password is needed in the connection string. This document describes the process of achieving that.
Note: the PoC was performed for projects running on .NET Framework 4.7.2, but this method should be supported by .NET Framework 4.5.2 and 4.6.1 as well.
Setup
A few things need to be set up for this to work. Most of these are done by the landscape team, so projects don't have to worry about them, but they are included here to describe the complete process.
We need to enable MSI for the web app. To do that, navigate to your web app and under settings, click on ‘identity’.
Under the System assigned tab, change the status to On. It will generate the Object ID.
AD GROUPS
Three AD groups in each environment will be in place with admin, write and read access on the Azure SQL Database. Please note that landscape would have already enabled the AD admin on the SQL Server, setting the first AD group as admin for SQL Server.
Depending on your requirement, you can add the MSI as a member of either the reader or the writer AD group. Please reach out to the DevOps team or the landscape team to get it set up.
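For reference, granting such access typically boils down to T-SQL similar to the sketch below, run against the database by the AD admin (normally the landscape/DevOps team). The user/group name is illustrative.

-- Map the web app's managed identity (or the AD group that contains it) to a database user.
CREATE USER [my-web-app] FROM EXTERNAL PROVIDER;

-- Grant the minimum required role, e.g. read-only access.
ALTER ROLE db_datareader ADD MEMBER [my-web-app];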
If your application uses ASP.NET then follow these steps. The steps in this section should be done by the project teams.
Add another section right after <configSections> using the following code:
<SqlAuthenticationProviders>
  <providers>
    <add name="Active Directory Interactive" type="Microsoft.Azure.Services.AppAuthentication.SqlAppAuthenticationProvider, Microsoft.Azure.Services.AppAuthentication" />
  </providers>
</SqlAuthenticationProviders>
"server=tcp:<server-name>.database.windows.net;database=<db-name>;
UID=AnyString;Authentication=Active Directory Interactive"
Replace <server-name> and <db-name> with your server name and database name.
References
https://docs.microsoft.com/en-us/azure/app-service/app-service-web-tutorial-dotnet-sqldatabase
https://docs.microsoft.com/en-us/azure/app-service/app-service-web-tutorial-connect-msi
Introduction
Azure Logic Apps helps you schedule, automate, and orchestrate tasks, business processes, and workflows.
A logic app workflow is initiated based on a trigger; Logic Apps works on a trigger-action model.
Triggers: A trigger is an event that meets specified conditions, for example receiving an email or the creation of a new file/blob in a storage account. A recurrence trigger can be used to start a logic app workflow on a schedule.
Actions: Actions are steps that are executed as a result of trigger. Each action usually maps to an operation
that's defined by a managed connector, custom API, or custom connector.
Approved connectors
Logic Apps supports multiple connectors for triggers and actions. The following connectors are approved for usage within Unilever environments:
1. Azure SQL DB
2. Azure SQL DW
3. Azure Data Lake Storage Gen1
4. Sharepoint
5. O365 email
Any connectors not listed above MUST be approved via I&A TDA.
Azure SQL DW connector
In this pattern, the logic app and logic app API connector are deployed by I&A landscape in the DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure SQL data warehouse logic app connector SHOULD be deployed using SQL database credentials.
Data warehouse credentials SHOULD have limited privileges required to execute workflow. (e.g. read or
read-write access to specific tables.)
Azure SQL data warehouse credentials MUST be stored in product specific azure keyvault.
Product teams WILL NOT have permissions to read secrets, credentials from product specific azure keyvault.
Unilever I&A landscape will deploy logic app using following workflow:
Project team WILL be provided with required templates to run these deployments from azure devops.
At the time of publishing of this document logic app does not support SPN or MSI authentication with Azure
SQL. Project MUST change the connectivity type, when SPN or MSI authentication is available with Azure
SQL connector.
Azure SQL DB connector
In this pattern, the logic app and logic app API connector are deployed by I&A landscape in the DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure SQL connector MUST be deployed using SQL database credentials.
Database credentials SHOULD have limited privileges which are required to execute workflow. (e.g. read or
read-write access to specific tables.)
Azure SQL database credentials MUST be stored in product specific azure keyvault.
Product teams WILL NOT have permissions to read secrets, credentials from product specific azure keyvault.
Unilever I&A landscape will deploy logic app using following workflow:
Project team WILL be provided with required templates to run these deployments from azure devops.
At the time of publishing of this document logic app does not support SPN or MSI authentication with Azure
SQL. Project MUST change the connectivity type, when SPN or MSI authentication is available with Azure
SQL connector.
Azure Data Lake Storage Gen1 connector
In this pattern, the logic app and logic app API connector are deployed by I&A landscape in the DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Azure Data Lake storage gen1 api connector MUST be deployed using SPN credentials.
Project team WILL be provided with required templates to run these deployments from azure devops.
SharePoint connector
In this pattern, the logic app and logic app API connector are deployed by I&A landscape in the DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
Sharepoint api connector SHOULD be deployed using AD user credentials which have required permissions
on sharepoint.
I&A Landscape WILL create sharepoint api connector without credentials.
AD user credentials SHOULD be entered manually by project team.
Project team MUST manage AD user password expiry.
Project team is responsible to document and implement process to manage AD user credential expiry.
Unilever I&A landscape will deploy logic app using following workflow:
Project team WILL be provided with required templates to run these deployments from azure devops.
O365 email connector
In this pattern, the logic app and logic app API connector are deployed by I&A landscape in the DEV environment.
For higher environments, this MUST be deployed via azure devops CI/CD pipeline.
O365 email api connector SHOULD be deployed using AD user credentials which has permissions to send
email.
I&A Landscape WILL create O365 email api connector without credentials.
O365 user credentials SHOULD be entered manually by project team.
Project team MUST manage O365 user password expiry.
Project team is responsible to document and implement process to manage O365 email user credential
expiry.
Unilever I&A landscape will deploy logic app using following workflow:
Project team WILL be provided with required templates to run these deployments from azure devops.
The PowerApps visual for Power BI is approved for interactive Power BI interfaces. This plug-in allows users to take action on business insights from within a Power BI report and observe the impact in the same report.
1. PowerApp to SQL connectivity MUST use Azure AD auth via “set admin” option.
2. Landscape WILL configure azure AD admin using “set admin” configuration.
3. Project SHOULD create new azure AD MFA enabled group that consists of all end-users of the application
along with application specific azure AD login (referred to as service account).
4. AD group created in step 3 SHOULD be provided READ only access on the required tables only. (This
should be added as contained group.)
5. Project SHOULD use service account when creating power app.
6. When users connect to the PowerApp using their own AD account, access to backend data will also be based on the user's own AD account.
1. Write back to SQL from powerapp SHOULD be configured via azure logic app.
2. For details on logic app to SQL connectivity please refer to Section 2.10 - Azure Logic App.
3. PowerApp automation SHOULD be configured to call logic app HTTP request endpoint.
4. Logic app HTTP endpoint SHOULD be configured to use Shared Access Signature (SAS) in the endpoint's
URI.
5. The HTTP request to the logic app follows this format: https://<request-endpoint-URI>?sp=<permissions>&sv=<SAS-version>&sig=<signature>
6. Each URI contains the sp, sv, and sig query parameter as described in this table:
a. sp: Specifies permissions for the permitted HTTP methods to use.
b. sv: Specifies the SAS version to use for generating the signature.
c. sig: Specifies the signature to use for authenticating access to the trigger. This signature is generated
by using the SHA256 algorithm with a secret access key on all the URL paths and properties. Never
exposed or published, this key is kept encrypted and stored with the logic app. Your logic app
authorizes only those triggers that contain a valid signature created with the secret key.
PowerApp Licensing:
Please work with Technology services - Collaboration services team for power app and power app connector
licensing.
Azure Cache for Redis provides an in-memory data store based on the open-source software Redis. Redis is an in-memory data structure store that can be used as a database, cache and message broker.
When used as a cache, Redis improves the performance and scalability of systems that rely heavily on backend
data-stores. Performance is improved by copying frequently accessed data to fast storage located close to the
application. With Azure Cache for Redis, this fast storage is located in-memory instead of being loaded from disk by
a database.
Azure Cache for Redis offers access to a secure, dedicated Redis cache. It is managed by Microsoft, hosted on
Azure, and accessible to any application within or outside of Azure.
Geo-replication is not available in the Basic or Standard tiers; it is available only in the Premium tier.
1. Azure Cache for Redis is approved for usage within the Unilever I&A environment in the cache-aside pattern.
2. It is NOT part of the standard architecture but is approved for usage on a case-by-case basis.
SECURITY GUIDELINES
REFERENCES
Overview
Power BI Service
Live Dashboards
Interactive Reports
Data Visualizations
Mobile Applications
Natural Language query
Type questions in plain language – Power BI Q&A will provide the answers
Q&A intelligently filters, sorts, aggregates, groups and displays data based on the key words in the question
Sharing with Others
Power BI Architecture
Power BI connects to most of the popular databases as explained in the point above.
First, you need to determine if the database can be connected to Power BI. Some databases may need the
corresponding ODBC drivers to be installed.
Any database which is hosted on a server within the company's network is considered an On Premise Data Source.
If you want to use any On Premises data source and have a scheduled data refresh then it needs a bridge called
'On Premise Data Gateway'.
Such solutions need to be discussed with the TDA team to get the necessary approvals for setting up the Data
Gateway.
When sharing data, you always need to assess who should access it and restrict the access accordingly. You can do this by restricting access to reports using Active Directory groups, by restricting access to specific values in the dataset (row-level security), or, if required, via both.
Visit these links and post your questions in chatter for more information.
https://docs.microsoft.com/en-us/power-bi/service-admin-rls
https://docs.microsoft.com/en-us/power-bi/developer/embedded-row-level-security
Also, there might be a level of security implemented in your data source if you are using managed databases of any kind as a backend. However, when you import data into Power BI and publish the report, the data is embedded in it, and you'll be solely responsible for the access restrictions.
User authentication
Data source security
Authentication methods
Data source authentications in Power BI Desktop
Data refresh
Real-time visibility
using the Power BI REST API or with built-in Azure Stream Analytics integration
Live connectivity
data is updated as user interacts with dashboard or report
to existing on-premise sources, e.g. Analysis Services, with auto-refresh
to Azure SQL Database with auto refresh
Automatic and scheduled refresh
regardless of where data lives
SaaS data sources (automatic)
Schedule refreshes for on-premise data sources with Personal Gateway
Scheduled refresh using Power BI Personal Gateway (on-premises sources)
Personal Gateway empowers the business analyst to securely and easily refresh on-premise data
No help from IT required to setup Personal Gateway (on local machine) or schedule refreshes
With Power BI Desktop or Excel and the Personal Gateway, data from a wide range of on-premises sources can be imported and kept up-to-date
The Gateway installs and runs as a service on your computer, using a Windows account you specify.
Data transfer between Power BI and the Gateway is secured through Azure Service Bus
While it's free to publish a report in the Power BI Service, one needs a Pro license to share the report with other users. Any user who accesses these reports also needs a Pro license.
Publishing a report on Power BI Premium has a different licensing policy: only the publisher needs a Pro license, and the users who access the shared reports can be free users.
Power BI Premium is slightly different from the normal Power BI Service in terms of performance and scalability.
While Power BI Service is a capacity shared by all customers of Microsoft who have bought that service,
Power BI Premium is a dedicated capacity bought by an enterprise for its users.
Since it is a dedicated capacity, it can provide better performance and few other advantages such as –
Easier sharing of reports with other users where the users accessing the reports need not have Pro
license
Total size of all models can be much more than the available memory of the capacity
It manages operations by priority to make the most use of available resources. Low memory can result in
eviction
The system makes decisions based on current resource usage
Since Premium capacity is fixed, performance can be significantly influenced by resource scarcity
Through improper management it's possible to get inconsistent, bad performance in Premium even with well-optimized models
Unilever has acquired Power BI Premium capacity which is being shared with a number of projects. To enable users to utilize the full benefits of Premium, we are providing end users and project teams the opportunity to host their solutions on the Premium capacity as part of a temporary arrangement until we go live. We are calling this interim phase the 'Soft Launch'.
For a project that wants to host its workspaces on Premium, the following information is required:
Access to Power BI
Any user with a valid Unilever Id can have access to Power BI. Power BI has two components – Power BI Desktop
and Power BI Service.
Power BI Desktop: This is a free utility which can be downloaded and installed by users on their local desktop
/laptop. It is primarily used for developing Power BI reports which later can be published and shared on the Power
BI Service
Power BI Service: This is a cloud based service offered by Microsoft which allows free access to all users with a
valid Unilever id. Every Unilever Id has a free account on this service where the reports can be published and
shared. This service is accessible through the link https://powerbi.microsoft.com. If a user has a Pro License or is a
member of a premium workspace, they would be able to avail all features of Power BI Service.
Power BI Mobile: Users can connect to your on-premises and cloud data from the Power BI mobile apps. Try
viewing and interacting with your Power BI dashboards and reports on your mobile device — be it iOS (iPad,
iPhone, iPod Touch, or Apple Watch), Android phone or tablet, or Windows 10 device.
Power BI is the preferred tool for reporting as per the Ecosystem 3.0 guidelines. It caters to a multitude of scenarios
and requirements across the business.
However, there are use cases where Power BI currently is not the tool of choice as it does not satisfactorily cater to
the requirements. Most of these requirements fall under Data Science and advanced analysis.
The preferred way forward for these requirements is custom development using managed code (.net etc.)
Listed below are scenarios where Power BI will not be the preferred tool of choice:
The main reasons are that huge datasets are involved and these require a lot of processing. R visuals can be used, but so far the R implementation in Power BI is limited, as called out in the points below:
Data size limitations – data used by the R visual for plotting is limited to 150,000 rows. If more than 150,000
rows are selected, only the top 150,000 rows are used and a message is displayed on the image.
Calculation time limitation – if an R visual calculation exceeds five minutes the execution times out, resulting
in an error.
Relationships – as with other Power BI Desktop visuals, if data fields from different tables with no defined
relationship between them are selected, an error occurs.
R visuals are refreshed upon data updates, filtering, and highlighting. However, the image itself is not
interactive and cannot be the source of cross-filtering.
R visuals respond to highlighting other visuals, but you cannot click on elements in the R visual in order to
cross filter other elements.
Only plots that are plotted to the R default display device are displayed correctly on the canvas. Avoid
explicitly using a different R display device.
Power BI performance
Introduction
Which part is slow?
Tuning the data refresh
Verify that query folding is working
Minimize the data you are loading
Consider performing joins in DAX, not in M
Review your applied steps
Make use of SQL indexes
Tuning the model
Use the Power BI Performance Analyzer
Remove data you don’t need
Avoid iterator functions
Use a star schema
Visualization Rendering
Lean towards aggregation
Filter what is being shown
Testing Performance of Power BI reports
Tools for performance testing
Introduction
Performance tuning Power BI reports requires identifying the bottleneck and using a handful of external
applications. This section covers how to narrow down the performance problem, as well as general best practices.
Data refresh
Model calculations
Visualization rendering
Identifying which one of these is the problem is the first step to improving performance. In most cases, if a report is
slow it’s an issue with step 2, your data model.
Usually you are going to see a slow refresh when you are authoring the report or if a scheduled refresh fails. It's important to tune your data refresh to avoid timeouts and minimize how much data you are loading.
If you are querying a relational database, especially SQL Server or Data Warehouse, then you want to make sure
that query folding is being applied. Query folding is when M code in PowerQuery is pushed down to the source
system, often via a SQL query. One simple way to confirm that query folding is working is to right click on a step and select View Native Query. This will show you the SQL query that will be run against the database. If you have admin privileges on the server, you can also use extended events to monitor the server for queries.
Some transformation steps can break query folding, making any steps after them unfoldable. Finding out which
steps break folding is a matter of trial and error. But simple transformations, such as filtering rows and removing
columns, should be applied early.
If you don’t need certain columns, then remove them. If you don’t need certain rows of data, then filter them out.
This can improve performance when refreshing the data.
If your Power BI file is more than 100MB, there is a good chance you are going to see a slowdown due to the data
size. Once it gets bigger than that it is important to either work on your DAX code, or look into an alternative
querying/hosting method such as DirectQuery or Power BI Premium.
If you need to establish a relationship purely for filtering reasons, such as using a dimension table, then consider
creating the relationship in DAX instead of in PowerQuery. DAX is blazing fast at applying filters, whereas Power
Query can be very slow at applying a join, especially if that join is not being folded down to the SQL level.
Because Power Query is a graphical tool, it can be easy to make changes and then forget about them. For example, sometimes people sort the data during design, but that step is costly and often not required. Make sure such extra steps are not left behind, as they can be terrible for data loading performance.
If your data is in a relational database, then you want to make sure there are indexes to support your queries. If you are using just a few columns, it may be possible to create a covering index that includes all of the columns you need (see the sketch below).
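A minimal sketch of a covering index on a hypothetical SQL Server / Azure SQL DB source table, assuming the report filters on the date and only uses the listed columns:

-- Index keyed on the filter column, with the remaining report columns included,
-- so the query can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Sales_Covering
ON dbo.Sales (SaleDate)
INCLUDE (ProductId, Quantity, Amount);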
When someone says that a Power BI report is slow, it is usually an issue with the DAX modelling. Unfortunately,
that fact isn’t obvious to the user and it can look like the visuals themselves are slow. There is a tool to identify the
difference: the Power BI Performance Analyzer .
If your report is slow, the very first thing you should do is run the Power BI performance analyzer. This will give you
detailed measurements of which visuals are slow as well as how much of that time is spent running DAX and how
much is spent rendering the visual. Additionally, this tool gives you the actual DAX code being run behind the scenes, which you can run manually with DAX Studio.
Because of the way the data is stored in Power BI, the more columns you have the worse compression and
performance you have. Additionally, unnecessary rows can slow things down as well. Two years of data is almost
always going to be faster than 10 years of the same data.
Additionally, avoid columns with a lot of unique values such as primary keys. The more repeated values in a
column, the better the compression because of run-length encoding. Unique columns can actually take up more
space when encoded for Power BI than the source data did.
Iterator functions will calculate a result row by agonizing row, which is not ideal for a columnar data store like DAX.
There are two ways to identify iterator functions. The aggregation functions generally end in an X: SUMX, MAXX,
CONCATENATEX, etc. Additionally, many iterators take in a table as the first parameter and then an expression as
the second parameter. Iterators with simple logic are generally fine, and sometimes are secretly converted to more
efficient forms.
Using a star schema, a transaction table in the center surrounded by lookup tables, has a number of benefits. It
encourages filtering based on the lookup tables and aggregating based on the transaction table. The two things
DAX is best at is filtering and aggregating. A star schema also keeps the relationships simple and easy to
understand.
Visualization Rendering
Sometimes the issue isn’t necessarily the data model but the visuals. I’ve seen this when a user tries to put >20
different things on a page or has a table with thousands of rows.
The DAX engine, VertiPaq, is really good at two things: filtering and aggregations. This means it's ideal for high-level
reporting like KPIs and traditional dashboards, and not good at very detail-heavy, granular reporting. If you have a
table with 10,000 rows and complex measures being calculated for each row, it's going to be really slow. If you need
to show detailed information, take advantage of drill-through pages or report tooltips to pre-filter the data that is being
shown.
Unless you are using simple aggregations, it's not advisable to show all of the data at once. One way to deal with
this is to apply report or page filters to limit how many rows are being rendered at any one time. Another option is
to use drill-through pages and report tooltips to implicitly filter the data being shown.
Limit how many visualizations are on screen. The part of Power BI that renders the visualization is single-threaded
and can be slow sometimes. Whenever possible, try to not have more than 20 elements on screen. Even simple
lines and boxes can slow down rendering a little bit.
Here are the recommended steps for identifying and benchmarking the performance of Power BI reports.
There are a few tools that allow performance testing on Power BI.
One of them is the Performance Analyzer that is part of Power BI Desktop.
There is also a PowerShell-based tool that runs performance tests on Power BI reports by passing dynamic
parameters. You can define how many reports to run in parallel and how many instances of each report. A video
describing the usage is available here.
A joint exercise was carried out with Microsoft to improve the efficiency of Power BI report design. Key
takeaways were:
POWER BI MODELLING
It is important to trace queries at development time to better understand the efficiency of the report and
whether there is scope for improvement. Use Performance Analyzer, DAX Studio or SQL Profiler.
Formulas should be written to consider only the filtered data rather than calculating over the whole data set.
POWER BI UX
Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich search
experience over private, heterogeneous content in web, mobile, and enterprise applications.
(15 million/partition)
1. Azure Search is approved for usage within the Unilever I&A environment to create an index for large undifferentiated
text, image files, or application files such as Office content types on an Azure data source such as Azure
Blob storage.
2. It is NOT part of the standard architecture but is approved for usage on a case-by-case basis.
Security guidelines
1. By default, Azure Search listens on HTTPS port 443. Across the platform, connections to Azure services are
encrypted.
2. Azure Search data MUST be encrypted at rest using Azure Storage Service Encryption.
3. It is recommended to use Microsoft-managed keys for storage service encryption.
4. Azure Search access keys MUST be stored in Key Vault.
5. The web application MUST NOT hardcode Azure Search access keys (see the sketch below).
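A minimal sketch of guidelines 4 and 5, assuming the azure-identity, azure-keyvault-secrets and azure-search-documents Python packages; the vault, search service, index and secret names are placeholders, not names defined in this document:

# Fetch the Azure Search key from Key Vault at run time instead of hardcoding it.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

credential = DefaultAzureCredential()  # e.g. the managed identity of the web application

secret_client = SecretClient(
    vault_url="https://<your-keyvault>.vault.azure.net", credential=credential
)
search_key = secret_client.get_secret("search-query-key").value  # never stored in code or config

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # HTTPS / port 443 by default
    index_name="documents-index",
    credential=AzureKeyCredential(search_key),
)

for result in search_client.search(search_text="invoice"):
    print(result)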
The approved credential-management patterns are summarised below: which component connects to which target, the credential type, when the credential is supplied (deploy time or run time), whether access to Key Vault is required, and whether the pattern is suggested.

Databricks -> ADLS (SPN, Deploy Time, Key Vault access: No): A KeyVault-backed secret scope is used. Credentials are removed so that the user cannot access them. Pattern suggested: Yes.
Databricks -> SQL DW (SQL, Run Time, Key Vault access: Yes): Databricks requires either SQL Server credentials (which cannot be shared for security reasons) or access to Key Vault to read the SQL credentials. Whoever has access to Databricks would be able to read the credentials if notebook execution access is given in production, which is a security risk, hence the pattern is not continued. Pattern not suggested.
AAS -> SQL DW (SQL, Deploy Time, Key Vault access: Yes): CI/CD looks up the SQL credential. Pattern suggested: Yes.
Azure Functions -> AAS (SPN, Run Time, Key Vault access: Yes): Requires access to Key Vault. Read access on Key Vault provides access to all credentials in Key Vault, including SQL DW. Pattern not suggested.
Databricks -> Log Analytics (Key, Run Time, Key Vault access: Yes): A KeyVault-backed secret scope is added to store the Log Analytics related credentials. Pattern suggested: Yes.
Batch -> AAS (Certificate, Run Time, Key Vault access: Yes): The default turnkey in Amsterdam doesn't grant Key Vault access to the SPN and requires manual intervention. Pattern not suggested.
Azure Web App -> SQL DW / DB (MSI, Deploy Time, Key Vault access: No): The Web App uses the Microsoft.Azure.Services.AppAuthentication NuGet package to authenticate with the database, hence it doesn't need to supply credentials at run time. Pattern suggested: Yes.
Power App -> SQL (Azure AD auth (SSO), Run Time, Key Vault access: No): Users need to be added to an MFA-enabled AD group which is given access to the required SQL tables. The Power App connects to the underlying system using Single Sign On, i.e. the credentials of the logged-in user. Pattern suggested: Yes.
Logic App -> ADLS (SPN, Deploy Time, Key Vault access: Yes): Dev deployment is performed by Landscape using an ARM template which references credentials from AKV. Higher-environment deployment is to be performed by the product team using CI/CD. (Landscape automation development is in progress; for immediate requirements deployment will be manual.) Pattern suggested: Yes.
Logic App -> SQL (SQL credentials, Deploy Time, Key Vault access: Yes): Dev deployment is performed by Landscape using an ARM template which references credentials from AKV. Higher-environment deployment is to be performed by the product team using CI/CD. (Landscape automation development is in progress; for immediate requirements deployment will be manual.) Pattern suggested: Yes.
Logic App -> Log Analytics (SPN, Deploy Time, Key Vault access: Yes): Dev deployment is performed by Landscape using an ARM template which references credentials from AKV. Higher-environment deployment is to be performed by the product team using CI/CD. (Landscape automation development is in progress; for immediate requirements deployment will be manual.) Pattern suggested: Yes.
Logic App -> Sharepoint (Sharepoint credentials, Deploy Time, Key Vault access: No): Landscape creates the Logic App and API connector; the project team is responsible for adding the Sharepoint credentials. Pattern suggested: Yes.
Logic App -> O365 (O365 credentials, Deploy Time, Key Vault access: No): Landscape creates the Logic App and API connector; the project team is responsible for adding the O365 credentials. Pattern suggested: Yes.
WebApp -> Azure Cache for Redis (cache access keys, Run Time, Key Vault access: Yes): Application code fetches the keys from Key Vault and uses them to access the cache. Pattern suggested: Yes.
WebApp -> Azure Search (search access keys, Run Time, Key Vault access: Yes): Application code fetches the keys from Key Vault and uses them to access the search API. Pattern suggested: Yes.
ADF + Apache Kafka topics can be used to get the required data into the Data Lakes.
HDInsight Kafka can be used to stream data into the Data Lake and expose it for further processing by
Databricks.
Stream Analytics starts with a source of streaming data. The data can be ingested into Azure from a device
using an Azure Event Hub or IoT Hub. This is the preferred pattern for streaming from IoT or event-stream
sources such as connected devices, services and applications.
Internal Data can be streamed using IoT Hub and Stream Analytics
External Data can be streamed using Event Hubs and Stream Analytics
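For the Event Hubs path, a minimal send-side sketch using the azure-eventhub Python SDK; the connection string, hub name and payload are placeholders:

# Publish an external event to an Event Hub so Stream Analytics (or Databricks) can pick it up.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "sensor-01", "temperature": 21.5})))
    producer.send_batch(batch)  # a downstream Stream Analytics job reads from this hub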
Architectural Patterns
NRT Streaming – Every 15 minutes/One Hour and processed immediately – Only where needed
Lambda – Data fed in both Batch Layer and Speed layer. Speed layer will compute real time views while
Batch layer will compute batch views at regular interval. Combination of both covers all needs
Kappa – Processing data as a stream and then processed data served for queries
Streaming outside lakes (all kinds of streaming source) directly into products (Kappa)
An important question to address today is how easy the data science lifecycle is: getting an environment, the right data
sets, knowledge of scalable tools and technologies, and industrialization for big data. These questions leave our data
scientists dependent on many other supporting teams before the actual work can even start. In a world where a fail-fast
strategy works well, it is not acceptable for strategic decisions to wait a long time for answers.
This section describes the measures taken to make the life of a data scientist easier in the Unilever I&A Azure
environment.
Domain expertise is the knowledge the data scientist brings; the assumption here is that the data scientist already
knows the problem statement and the data required to solve it.
Easy access to raw and cleansed data for problem solving: the Universal Data Lake (UDL) and Business Data
Lake (BDL) have made data availability easier and faster, as all data is available in one place.
Though UDL and BDL hold extensive data, there could be cases where some data sets required by the
data scientist are not in UDL or BDL. In those cases, data scientists can bring their own data into the environment
for a quick pilot while the same data is prioritized for ingestion into UDL and BDL.
The data scientist needs to view the catalogue, identify the data and get access to it with the right approvals.
The main challenge a data scientist encounters today is knowledge of the tools and technologies to be used to derive
the outcome. To overcome this, the Unilever I&A Technology team has worked with Microsoft and the data
science COE team to come up with a list of experimentation scenarios and tools which can be used by data scientists.
Time-bound experimentation environments for quick pilots – look for the scenario that the use case falls into.
Cost: pay only for what you use (number of hours of usage). Pause when not in use.
SCENARIO 1:
User responsibilities
Tools with MFA support: Azure supports access only through a Unilever ID and MFA-enabled tools
Tool installation and licenses have to be taken care of by the Data Scientist/User
Cost involved: no cost on Azure if only data is accessed from UDL/BDL
When to use?
Data size is limited (~1-3 GB, with no complex processing)
Data exists on the user laptop and/or in Azure
SCENARIO 2:
Azure VM Configuration
Standard configuration (Data Science VM)
Standard hardware configurations (N standard options)
Pre-installed (Excel, OneDrive) and internet enabled
Unilever-approved data science tools pre-installed (R, Python and data science tools are available)
Tool installation and licenses have to be taken care of by the Data Scientist/Project team
Cost involved: Azure VMs are costed as PAYG, i.e. pay per usage. A cost monitoring tool will be provided by
I&A Tech to track spend vs. budget. Cost needs to be managed by the users of the environment.
Code sharing: VSTS Git is provided as the source code repository.
When to use? (Talk to the I&A Tech Architecture team to confirm whether an IaaS VM is the right component for the use case)
Compute required is larger than the user laptop but not too complex
Data size is comparatively larger (~10-15 GB, with no complex models) (exception is push-down processing)
Code and data need to be shared between multiple users of the system.
SCENARIO 3:
Tools in Experimentation:
Environment with preloaded tools, but paused
Environment management (PAUSE & RESUME) to be managed by the Data Science team.
Cost involved: Azure cost is PAYG, i.e. based on usage. A cost monitoring tool is provided by I&A
Tech to manage spend vs. budget.
Cost accountability is with the Service Line
When to use?
Data is in Azure
Compute required is larger than the available compute in the user laptop/Azure VM
Parallel processing is required
Data size is larger
Example use case:
Experimentation for a use case which needs to be industrialized/scaled immediately after the
value is proved to end users. For example: Livewire Analytics.
Data scientists are excellent at model development on sample data, with very good results. Once the pilot
is complete, the same solution requires industrialization with end-to-end (E2E) automation.
Knowledge of tools and technology and scaling of the solution was a big problem for data scientists. Unilever
I&A Tech came up with a process to industrialize solutions from the experimentation scenarios using the Azure
cloud tech stack.
Cost: pay only for the compute used for execution. Pause when not in use.
Industrialization effort varies depending on the tools used to build the solution.
SCENARIO 1:
SCENARIO 2:
SCENARIO 3:
Overview
The data distribution layer is responsible for providing data lake data to consumers (internal and external). The
main vision is to provide a 'Data on Demand' model using different layers of granularity (including APIs) such that
data is not tightly coupled to project-specific views. Data is exposed from shareable layers (UDL, BDL).
Unilever has derived multiple ways of providing data based on the hosted platform of consumers.
Data connectivity should be de-coupled using different layers of granularity (including APIs) such that data is
not tightly coupled to project specific views. The source data APIs & connectors should provide maximum re-
use.
Should support event sourcing techniques & data topics hub approach using publish and subscribe style
model for greater agility and to make data available in a timely manner.
Must register data consumption details (API, integration layer consumption) in a central service catalog,
business glossary, or other semantics, so that users can find data, optimize queries, govern data, and
reduce data redundancy.
Data exploration/access: identification of the right dataset to work with is essential before one starts
exploring it with flexible access. A metadata catalog should be implemented to help users discover and
understand relevant data worth analyzing using self-service tools.
Integration Layer
All data requests will go through processes defined for UDL and BDL, which involves Data Owner and WL3
approval.
Integration layer will be managed as part of DevOps Activities. Creation of Pipelines, Containers, SAS
Tokens will be managed and governed by respective DevOps teams.
All access via integration layer will be read-only and subject to approvals.
Products/ Experiments can use this approach to consume data from UDL and BDL. Unilever Azure
hosted, ISA approved platforms can also use this option.
Option 2 : Data copy and staging layer:
This option can be used by external/third-party applications to access data.
I&A UDL/BDL will host an integration layer with Gen2, ADF, Databricks and SQL DB.
Dynamic ADF and Databricks pipelines will be built using metadata in SQL DB to copy UDL and BDL
data to ADLS Gen2.
The integration layer will only retain the latest 3 copies of data. History data will be provided as an ad-hoc
one-time load and will be deleted after incremental data is made available.
Datasets will be segregated and access controlled at container/folder level, with IP whitelisting for
two-factor authentication. External platforms connect either via SPN or through SAS tokens. User
connections will be made through RBAC.
Option 3 : Azure Data Share:
I&A Tech hosts an integration layer with the Azure Data Share component to share data with a destination
Azure Data Share. – Security approval is still in progress and this option can be used only after full
approval by security.
Option 4 : Data Push from UDL, BDL:
A Unilever-hosted integration tool (ADF) can be used to share the data from UDL, BDL or the data copy
and staging layer.
The integration tool (ADF) can be part of UDL or BDL, or separately managed in a separate environment, to
share data from UDL & BDL.
Product data cannot be shared. As of today, I&A doesn't own support for an integration tool to push the
data. It is the decision of the respective platform to support this option; otherwise the tool needs to be managed and
supported by the respective data requester.
Option 5 : Micro Service Layer:
I&A hosts an API layer which allows connection through REST endpoints.
The microservice layer and workflow are built based on the requirement. This is suitable for small data sets <
1 GB. Two-factor authentication will be considered for the microservice layer. (In roadmap for UDL)
Option 1: For push, a Data Factory hosted in the Unilever Azure platform can be used. There is no central "integration
as a service" as of today for UDL and BDL. UDL and BDL can decide to host this layer for pushing data into third-party
systems, or a separate integration layer needs to be created based on the requirement to push the data from
UDL/BDL. The respective platform where the integration tool is hosted has to manage all support activities for this
ADF themselves. (https://docs.microsoft.com/en-us/azure/data-factory/connector-overview)
Option 2: External systems should have a mechanism to connect using an SPN or SAS token. Any integration tool
which supports these options can be used. If external platforms want to script it themselves, the AzCopy tool can be
used with 3-5 lines of code to extract the data on Windows or Linux operating systems.
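As a hedged illustration of that scripting option, the sketch below simply shells out to AzCopy with a SAS-token URL; the storage account, container, folder, token and destination path are placeholders, and AzCopy must already be installed on the machine:

# Pull a dataset from the distribution layer (Blob/ADLS Gen2) using AzCopy and a SAS token.
import subprocess

source = "https://<storageaccount>.blob.core.windows.net/<container>/<dataset-folder>?<sas-token>"
destination = "/data/incoming/dataset"

subprocess.run(
    ["azcopy", "copy", source, destination, "--recursive"],
    check=True,  # fail loudly if the copy does not succeed
)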
Option 3: A custom microservice layer to be built for small data sets < 1 GB. Huge data sets are not supported as
part of this pattern. External applications pull the data using a REST endpoint published in the integration layer. The web
service layer is being built and prioritized by UDL based on requests.
Decision Tree
I&A applications, and applications hosted in the Core Data Ecosystem subscription: (Integration Option 1)
Unilever I&A Landscape will manage the creation of the application and attaching the SPN (service
principal) for data access.
SPN credentials are not shared with any individuals.
SPN credentials are maintained only in Key Vault, which can only be accessed through applications
hosted in Azure. (Key Vault is considered the secure credential management tool, approved by
security.)
Products/experiments hosted within the I&A platform can connect directly to UDL and BDL using the
environment provided by the I&A platform team.
As the applications are hosted in the Unilever platform and credentials are not exchanged in human-readable
format, this is considered a secure approach.
UDL Access : Contact UDL Dev Ops team for Remedy Request Details
BDL Access : Contact BDL Dev OPS team for respective BDL access
DATA CATALOG TO BE MANAGED TO MAKE SURE ALL SHARING DETAILS ARE CAPTURED:
Infrastructure – Landscape
Integration layers will be set up in Dublin (North Europe) – New foundation design
New resource groups will be created as part of UDL and BDLs for Integration Layer.
DevOps teams will build ADF pipelines and Databricks notebooks in Dev environments.
Creating pipelines for datasets and cross-charging will be taken care of by the respective DevOps teams.
Option 1
HA/DR or Business continuity process applied to UDL will apply.
Option 2 : Gen2
Gen2 Geo replication to be enabled for the Distribution Layer.
In case of region disaster, consumers should point to secondary location shared.
Detailed process will be published as part of Business Continuity process.
Option 3 : Data Share
Data Share and Micro Service layer HA/DR is to be planned along with the MS team. Since the design is
still in progress, this needs to be planned.
Security Consideration
I&A Application:
Data owner approval for sharing the data.
Non I&A (Unilever Azure Hosted Application)
Data owner approval for sharing the data.
ISA approval for consumer platform
SPN created and managed at consumer azure platform.
Non-Unilever / External Applications.
Security approval for moving data out of Unilever
Data owner Approval
Legal Approval wherever applicable.
Additional: Exceptional approval from Sri/Phin to share the data with Legacy systems (On Premise)
Restricted
    <Business Function – (data belongs to)>
        DataSetName
            <Global – (as-is) data> / <Country – (when filtered on country)> / <FilterName/Usecase – (when filtered on a specific usecase)>
                <Date/Frequency folder – date when the file is placed in the distribution layer>
                    Actual file
Non-Restricted (same structure as Restricted)

Restricted / Non-Restricted (separate folders for restricted and non-restricted data sets)
    BDL – (data set existing in BDL, only filtered)
        DataSetName/KPI name (meaningful name to identify the source data set from BDL)
            <Global – (as-is) data> / <Country – (when filtered on country)> / <FilterName/Usecase – (when filtered on a specific usecase)>
                <Date/Frequency folder [Daily – (dd-mm-yyyy), Weekly – (WeekID-yyyy), Monthly – (mm-yyyy)]>
                    Actual file (dd-mm-yyyyhh24MI – File name.csv)
    Non-BDL (data set which does not exist in BDL; taken from)
        DataSetName/KPI name (meaningful name to identify the source data set)
            <Country – (when filtered on country)> / <FilterName/Usecase – (when filtered on a specific usecase)>
                <Date/Frequency folder [Daily – (dd-mm-yyyy), Weekly – (WeekID-yyyy), Monthly – (mm-yyyy)]>
                    Actual file (dd-mm-yyyyhh24MI – File name.csv)
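A small sketch of how a pipeline might assemble a distribution-layer path and file name following the convention above; the dataset, country and file names are invented, and the exact segment order should be taken from the structure agreed for the data set:

# Build a daily, country-filtered distribution-layer path per the folder convention above.
from datetime import datetime

def build_distribution_path(classification, dataset, country, run_time, base_name):
    date_folder = run_time.strftime("%d-%m-%Y")                           # Daily - (dd-mm-yyyy)
    file_name = f"{run_time.strftime('%d-%m-%Y%H%M')} - {base_name}.csv"  # dd-mm-yyyyhh24MI - File name.csv
    return f"{classification}/BDL/{dataset}/{country}/{date_folder}/{file_name}"

print(build_distribution_path("Non-Restricted", "SalesKPI", "India",
                              datetime(2020, 5, 14, 16, 30), "sales_extract"))
# Non-Restricted/BDL/SalesKPI/India/14-05-2020/14-05-20201630 - sales_extract.csv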
REQUIREMENTS – PHASE 1
In Scope
Distribution layer to share the data from the data lake with internal and external business applications.
Data will be made available as-is, in its current format, from UDL and BDL into the distribution layer.
Only delta/incremental data is made available as part of the distribution layer.
Consumers will pull the data hosted in the distribution layer; no data push into the consumer platform is scoped.
Data will be shared in flat file format, i.e. CSV format.
A minimum of two-factor authentication is done for authentication of the consumer application.
No restricted data is scoped as part of the data distribution layer.
Consumer applications need to follow a set of processes to get approvals from security, legal and the data
owner before getting access to the data.
Out of Scope
Filtering of data before placing it in the distribution layer.
Converting file formats.
Integration as a service for pushing data into the consumer platform.
REQUIREMENTS – PHASE 2
In Scope
Distribution layer to share the data from the data lake with internal and external business applications.
Small transformations are allowed in order to take care of certain conditions, such as:
Country / Unilever & non-Unilever data filtering to meet legal requirements.
Column/row filtering to reduce the overall egress of data.
Joining of multiple data sets to apply filtering or reduce overall data.
Only delta/incremental data is made available as part of the distribution layer. History data is shared only
on an ad-hoc basis and is available only for a certain period.
Data in delta/parquet format should be converted to flat file format (CSV or pipe separated).
Data encryption in the distribution layer whenever restricted data is to be shared with the application.
Splitting of files into multiple small files on an exceptional basis (logic to add a file number based on the
split).
Configurable or automated data pipeline creation to take care of all the requirements, without any
manual intervention.
Automated workflow creation to take care of the approval process.
Folder format for data sets in the distribution layer.
Consumers will pull the data hosted in the distribution layer; no data push into the consumer platform is scoped.
Data will be shared in flat file format, i.e. CSV format.
A minimum of two-factor authentication is done for authentication of the consumer application.
No restricted data is scoped as part of the data distribution layer.
Consumer applications need to follow a set of processes to get approvals from security, legal and the data
owner before getting access to the data.
Out of Scope
Integration as a service for pushing data into the consumer platform.
TECHNOLOGY STACK:
E2E FLOW:
In order to make sure such insights are shareable with other business use cases, a new pattern has been developed to write
data back from data science products into BDL.
Some example use cases which fall under this category are:
Leveredge IQ: the IQ recommendations generated are useful for much of the decision making done in other
products, but IQ is hosted as a PDS solution.
NRM TPO – optimized promotions and plans are useful insights for other products to consume.
Create a shareable space in the respective functional BDL to ingest the PDS-generated data science insights.
The BDL functional owner is responsible for making sure the right governance is in place for the insights written by PDS
into BDL.
Architecture
Analytical (data science) products may write back insights to the BDLs, but simple shareable KPIs need to be
built in the BDLs.
PHYSICAL ARCHITECTURE
Data science products can ingest data into BDL if the criteria below are met:
The process is applicable only for a Product/PDS categorized as a data science use case, and the insights to be shared
are model outputs.
The product confirms the consumption of only Trusted or Semi-Trusted data from UDL and BDL to generate the
insights. No PSLZ or manual data is used to generate the insights.
BDL owner approval on the points below:
The generated insights are reviewed by the respective Business Lake data owner and approved as shareable.
The BDL catalog is updated with the information below and the data owner has approved the catalog:
Frequency, availability and SLA clearly defined.
Logic used to generate the output documented in the catalog.
A support agreement is in place between BDL and PDS for data issues and fixes.
Agreement on the process for changes. Should include below clauses.
BDL to confirm the changes are valid and approves the changes
All consumers of the data (downstream systems consuming the data from BDL) are informed of
the changes in the data and impact assessment carried out.
Approved release of changes into the product environment and in turn into BDL
Sharing of product-ingested data from BDL with other use cases:
The process for BDL data sharing is to be followed.
An additional document on who is consuming each data set has to be maintained, mainly to
understand the dependency on each data set.
Architectural approval from I&A architecture team
Verification of approval and alignment with respective BDL team.
Technical architecture to write back the data into BDL
Involve BDL SME from the design Phase if the write back is required.
BDL Team will provide the BDL Object / Folder where the write back has to happen. PDS will get access
only to that folder to write back, and PDS cannot create any further objects.
Decision is with BDL owner to approve or reject data write back into BDL.
Retailer data: data from multiple retailers (hundreds of small retailer shops) is available as manual files and
shared with data owners through mail or some other method. It requires one level of data standardization to
bring it to the right format. As long as the format of the data is standard and useful for all consumers, this
standardization can be done at the data prep layer.
Social network data: data from social networks is extracted at different time frames. Validation is required from
the data owner to verify the completeness of the data; if it is not validated, wrong/incomplete data will flow into UDL and
require data cleaning and fixes. To avoid this, the data prep layer will be made available for the data owner
to validate and confirm that the data is ready for ingestion into UDL.
Data Prep Storage is part of the I&A Tech Central Platform. A space is provided to the data provider to ingest the
raw data and make it UDL-ready.
Data Prep Compute, which includes ADF and Databricks, will be provided in the respective provider space, with
access only for the data provider.
Ideally, the data staging or prep layer is used only to validate/consolidate the data for cases where the data is
coming from a social network or a third party (with no automated integration). Modification of data in the Data Prep
layer is allowed in exceptional cases where modification is a must for consumption of the data.
ADF will be used to ingest the data into Data Prep Layer.
Databricks hosted in Data Prep layer will be used to make the data UDL ready. Validation and approved
modification can be done here.
Modification of the data is allowed only when there is a business justification for modification.
If the data is unusable without required modification/ Transformation
Modification/Transformation details are aligned with Data SME/expertise.
Modification/Transformation is automated.
Any data planned in Prep layer has to have a valid DC and DMR agreed and approved with Data Expertise
and Data architect team.
Data validation, i.e. DQ after the ingestion into staging, is the responsibility of the data staging owner. Only validated
data will be ingested into UDL from the UDL_Ready folder.
UDL is the only consumer for Data Prep Layer. Data cannot be shared from Data Prep Layer.
Data prep layer is available only as part of UDL. Below is the resource group structure for the same.
1 Central Resource group (Shared Resource group as part of UDL for Staging Storage )
ADLS Gen2 : Raw and UDL ready data will be available in ADLS
Blob Storage: Landing zone for external and manual data sets.
Resource Group per Data Provider : Each data provider will be given a separate resource group to
manage the data preparation before ingestion into UDL.
Databricks (Read only access)
ADF ( Read and Write)
The reporting layer in the I&A Azure platform consists of tools like Power BI and Excel used by end users, with Azure Analysis
Services and SQL DW used as the backend for them. This pattern makes use of only Azure Analysis Services as the
backend layer for self service. Self service by users needs to go through Multi-Factor Authentication.
Below are the steps for self service from Azure Analysis services
Refresh required data in to Azure Analysis services from SQL DW or from ADLS, based on the architecture.
If the self service data is different / more granular than the data required for published report then it is
suggested to use different AAS/Cubes for self service and published report.
Create multiple AD Groups to provide role based authentication on the Cube/Data. Consult with Business to
align on Number of roles required based on the data in AAS instances.
Enable MFA on the AD groups
Unilever security mandates Multi-Factor Authentication for all public endpoints. Since AAS
doesn't provide inbuilt MFA, the only way to apply MFA is by enabling it on the AD groups.
No users should be directly added into the AAS instance. All access should be provided through one of
the AD groups attached to the AAS instance.
Make sure to enable MFA for all AD groups attached to the AAS instance.
Do not whitelist any IP on the AAS instance.
Once all AD groups attached to the AAS are MFA-enabled, the firewall on the AAS can be turned off for
self service via Excel or Power BI.
The project (Delivery Manager) owns the accountability for making sure all access is controlled
through AD groups and all AD groups are MFA-enabled before disabling the firewall.
Expensive, as all data required for self service needs to be present in AAS. This means the project needs to use
either a huge AAS instance or multiple AAS instances.
Limitation on data size: the maximum data that can be stored in the largest AAS instance (S9) is 400 GB. For global
projects where the granular data is huge, not all data can fit into a cube. The project needs to go with multiple
AAS instances, which increases the duplication of data if the same data is required in multiple instances.
This pattern makes use of both SQL DW and AAS as backend layer for the self service and published report
respectively.
Below are the steps for self service from SQL DW.
Refresh only aggregated data required into Azure Analysis services from SQL DW or from ADLS, based on
the architecture.
All the granular data will reside in SQL DW.
Published reports are served from AAS
For self service end users are supposed to connect directly to SQL DW using MFA and their own Unilever
credentials
Self service can be achieved here via two patterns
Direct query to SQL DW via Power BI reports
Create cubes which is a non persistent layer and connect to SQL DW via the cubes
SQL DW credentials will not be shared with the user.
Create multiple AD Groups to provide role based authentication on the SQL DW tables. Consult with
Business to align on Number of roles required based on the data in SQL DW.
Enable MFA on the AD groups
Unilever security mandates Multi-Factor Authentication for all public endpoints. Since SQL DW
doesn't provide inbuilt MFA, the only way to apply MFA is by enabling it on the AD groups.
No users should be directly added into the SQL DW instance. All access should be provided through one
of the AD groups attached to the SQL DW instance.
Make sure to enable MFA for all AD groups attached to the SQL DW instance.
Do not whitelist any IP on the SQL DW instance.
Once all AD groups attached to the SQL DW are MFA-enabled, the firewall on the SQL DW should be
turned off.
The project (Delivery Manager) owns the accountability for making sure all access is controlled
through AD groups and all AD groups are MFA-enabled before disabling the firewall.
Though this pattern allows self service on the most granular data of any size supported on SQL DW, it can
turn out to be an expensive solution if a large SQL DW is used and kept running 24/7.
It is suggested to go with < 1500 DWU for SQL DW.
Align with end users to keep the SQL DW running only during office hours to minimize the cost.
Limitation on concurrency: SQL DW supports limited concurrency. 2000 DWU supports only 42 concurrent
queries in the small RC (Resource Class). If the product requires a large number of end users to concurrently
connect and do self service, this may not be the right solution.
Performance limitation: performance may not be as good as connecting to AAS. The project team needs to
analyse the performance and align with end users accordingly.
PATTERN 3: SELF SERVICE CONNECTING TO SQL DB (EXCEPTIONAL PATTERN WHERE SQL DB IS USED IN THE E2E
ARCHITECTURE)
This pattern makes use of only SQL DB as the backend layer for both self service and the published report. This
pattern is suggested only when the data size is < 50 GB.
Below are the steps for self service from SQL DB.
The I&A Tech Azure platform supports the below 2 tools for self service:
Power BI Desktop
Excel
Approach in Azure
The I&A Tech Azure platform provides different ways to allow self service.
RDS & Citrix is the common environment provided to host different DevOps tools for the Azure platform. Citrix is
considered a secure environment, as the user is required to have a Unilever ID and go through MFA to log in to RDS
/Citrix.
RDS/Citrix keeps the data secure by restricting users from downloading the data to the user laptop. If
projects host restricted or PII data, the only approved method to connect to Azure components is over RDS/Citrix.
USER LAPTOP:
Accessing Azure using tools from the user laptop has certain restrictions. Tools that connect over a secured port
with MFA authentication are allowed direct connection from the user laptop.
As of now, only Excel and Power BI are the TDA-approved tools allowed for use from the user laptop. Any addition of a
new self-service tool has to go through the evaluation process and TDA approval for usage. Users can
access the Azure platform only through MFA. Projects have to make sure MFA is enabled on the Azure tools before
providing access to the end user.
User onboarding to RDS/Citrix needs to be taken care of by the project as part of the project process, in order to use
RDS for self service.
There is a risk of data being extracted by users onto their laptops and misused. Before providing self-service
access to the data from Excel and Power BI, the project owner needs to make sure the right user is getting access and
no restricted or sensitive data is shared.
The data owner and project owner are responsible/accountable for any data shared through self service with end
users.
Data extraction using Excel can affect performance. If multiple users are downloading data through Excel
over self service, it can cause huge performance issues. Self service should be limited to only the required users,
who are aware of the performance risk of downloading the data.
End users should be trained on the right self-service practices. This can reduce the performance issues.
Projects/end users should be made aware of the risks involved in a user downloading the data onto their laptop
and ensure that Information Security requirements are fulfilled.
2. The window below opens up and, by default, Remedy 8 shows the location updated in inside.unilever.com.
Below is the screenshot for the mandatory fields that should be filled in before submitting the CRQ.
Mandatory fields (Service Categorization, Operational Categorization, Change Reason, Impact, Importance, Priority,
Risk Level)
Note: The CRQ Risk Level should always be Minor-1 for this activity.
On the change reason, once you click "Requirement Driven change", the questionnaire below should be answered.
Backout-Plan :
Development Plan : NA
Implementation Plan :
Note: Both the CRQ and the Task start and end dates should match.
Click on the Tasks tab, click on Relate and a new task will be created; fill in the summary details and assign it to IT-
GL-Active Directory. (Please note the Start and End date of the CRQ should be a minimum of 2 days.)
For example :
SQL DW: One SQL
DW all countries
AAS : One AAS
instance
Releases
    Current: Today releases are done per ITSG, as a code repository is associated with each Project or ITSG. Releases require a controlled approach so that developers do not check in changes to the release branch when a release is planned for a different country. The product team requires control over the release branch and check-ins.
    Option 1: No flexibility in releases, as all countries share the same code base and release branch.
    Option 2: Flexibility – any country deployment or release can be done at any point. Each country will have its own branch and code repository.
    Recommendation: Option 2. If the project team comes up with a process to manage releases and downtime is acceptable to countries, Option 1 can be considered.
Costing (IAAS and PAAS)
    Current: Cost on Azure is Pay As You Go; every component has a per-hour cost associated with it. IAAS components are set up per country to meet the security mandate of two levels of authentication.
    Option 1: Cost can reduce as one environment is used for Dev, QA, PPD & Prod. The number of IAAS VMs required reduces as countries share the environment.
    Option 2: Cost is high, as each country will have its own environment and components.
    Recommendation: Option 1.
Code Management
    Current: The code repository is maintained at ITSG level.
    Option 1: One code base for all countries can become an issue if not managed right. Each country can have its own vendor partner developing the code, with the risk of vendor partners affecting each other's code. There is one release branch, hence releases between the countries need to be managed centrally by the project; the project needs a central team which manages and coordinates releases between the countries. Features and developments must be managed so that one country's changes do not affect another country. This is a big risk unless there is a governance mechanism within the project to manage it.
    Option 2: No issue, as each country has its own code branch. A limitation could be not having a central code branch.
    Recommendation: Option 2.
Infrastructure
    Current: Infrastructure is currently provided per ITSG, as resource groups are created based on ITSG and even costing is done on ITSGs.
    Option 1: If all countries exist in one underlying infrastructure, a master ITSG is required to which the cost and resource groups can be assigned.
    Option 2: All countries will have their own ITSG and resource groups.
    Recommendation: All 3 are feasible.
Data Sharing for Analytics
    Current: Analytics/data science can access any data from UDL or BDL. When it is a product, analytics has to be within the product, as cross connections between products are not allowed.
    Option 1: All data in one place, and analytics which requires cross-country data becomes easier.
    Option 2: Analytics/data science has to sit within the same country setup, so multiple deployments of the same analytics product might be required.
    Recommendation: Option 1.
Access Control
    Current: Access control is defined at the project level. Each project has its own AD groups and end users are assigned to those AD groups.
    Option 1: Row-level access cannot be defined on SQL DW or AAS. Since objects are shared, every country user will get access to all the data available in the underlying AAS model or SQL DW database.
    Option 2: Countries have their own AD groups.
    Recommendation: Option 1 / Option 2, if there is no data restriction between the country users or the right level of controls is applied.
Service
    Current: Service setup is managed per ITSG and per product.
    Option 1: One service, as the ITSG is one.
    Option 2: Each product as a separate service.
    Recommendation: Option 2.
Notes*
If Prod and Non-Prod are set up with different options, for example Prod (Option 2) and Non-Prod
(Option 3), then there will be additional complexity in the release and environment provisioning
processes. The right controls are to be put in place by the project to manage the situation.
Dependencies and multi-level dependencies are not handled very well, as ADF manages only time
schedules.
If a job has to be executed on data arrival, there are a lot of challenges, as the on-file-arrival feature is not
straightforward and depends on the source systems.
If an action needs to be triggered on successful completion of a job or of multiple jobs, there is
no means to achieve that as of now.
If a job misses a scheduled SLA, there is currently no way to notify the users; it is either manual or through a
dashboard.
As a solution to the issues above, and with some added features, there is a new framework which projects are
encouraged to adopt based on their requirements.
This tool will act as a workflow engine for ADF scheduling and monitoring purposes.
This tool can replace the existing schedulers, which are currently not centralized.
This tool also serves as a template for all the ADF pipelines as per schedules.
Modifications/updates can be done on the template instead of on individual ADF pipelines.
Tools used
Framework Design
Framework workflow
The following workflow can be achieved by implementing the steps provided below. It is highly
customisable and can be used for job schedules of varied frequency. It can also be used where a schedule is not in
place and the job needs to be triggered on data arrival. If a project wants to implement a part of the framework rather
than the framework in its entirety, that can be achieved as well.
Framework Jobs
Daily Job Population: Workflow to load the daily executable Jobs into Job_Schedule
Workflow uses Job Master, Job Exceptions and Job Adhoc Run for populating Job_Schedule table
Will run everyday at 00:00 Hour.
An entry has to be created for each execution. (Hourly job will have 24 entries)
Logic to be created for daily, hourly, weekly, monthly , by hour and by minute jobs.
Execution Engine
Execution engine will analyse which job can be executed. Analysis is done based on
Data availability
Dependency completion (using Job Run Status details)
Exclusion list in Job Schedule.
Priority of the job
Schedule of the job.
Execution engine will trigger the workflow; there is no scheduling done on ADF (see the sketch after this list).
Execution engine will also make an entry into Job_Run_Status for each execution, with the Run_ID of the
job.
Execution engine will check the run status and, if the job has failed, will retry it until the retry threshold
on the retry count is reached.
Execution engine will make sure to limit the number of parallel executions as required for the
application.
Job Status Update: Workflow to update the status of the jobs
Workflow will look for Job completion from ADF logs, using the ADF run_id
Updates the Job Status on completion of the job with status, error message etc.
Source Data availability check
Jobs configured based on source data availability can only be run when data is available.
This job will check the availability of data in Source.
Source could be MCS table of UDL
Actual source systems like Blob (Event Trigger), File Share /SFTP etc.
Finding the source data availability is dependent on the source system. Custom logic is
required for each source.
Refer the section on Job Triggering in ADF for more details.
Notification Workflow
Checks for miss in the SLA and sends notification to configured list of users
Send notification for success and failure of job as well
Ad-hoc Job Run: job to look for any ad-hoc job run configuration and add it into Job_Schedule.
Should check for job dependency before adding a job into Job_Schedule.
Should make sure a job is added again into Job_Schedule when:
the existing job is complete (success/failed)
a new job is to be added
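A hedged sketch of the execution-engine step that triggers an ADF pipeline and captures the run ID for Job_Run_Status, using the azure-mgmt-datafactory and azure-identity Python packages. The subscription, resource group, factory and pipeline names are placeholders, and persisting the status row is left as a comment because the framework's tables live in its own database.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

def trigger_job(resource_group, factory_name, pipeline_name, parameters):
    """Start the ADF pipeline for a job that has passed the dependency and schedule checks."""
    run = adf_client.pipelines.create_run(
        resource_group, factory_name, pipeline_name, parameters=parameters
    )
    # The framework would insert a Job_Run_Status row keyed on this run ID here.
    return run.run_id

def get_job_status(resource_group, factory_name, run_id):
    """Return the ADF run status ("InProgress", "Succeeded", "Failed", ...) for retry decisions."""
    return adf_client.pipeline_runs.get(resource_group, factory_name, run_id).status

run_id = trigger_job("<resource-group>", "<data-factory>", "<pipeline-name>", {"load_date": "2020-05-14"})
print(run_id, get_job_status("<resource-group>", "<data-factory>", run_id))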
Improvements (to be looked into):
How to stop the long running jobs
Remedy Integration
Service Bus Integration for Subscription of notification by dependent PDS/Systems.
Extract business run date from the data and capture in the table.
Please find the attached framework data model. If any product wants to implement this framework in part or in its
entirety, then this data model needs to be in place. The team can choose which tables to use based on the use
case, but table structures and column names should be the same as given below.
Metadata table (this table captures the metadata for all jobs in the application):
    Pipeline_Name – name of the ADF pipeline.
    Job_Execution_Start_Datetime – scheduled start time for the pipeline run; populated if the pipeline is scheduled. This is the time from which the pipeline will first be scheduled to run.
Pipeline_Run_Status_Message – error message.
(Used when, due to some data issues at source, the project doesn't want to run a job till it is fixed:)
    Exception_Created_DateTime – exception created date.
Adhoc_Schedule_Created_DateTime
Real-time data integration is the idea of processing information the moment it is obtained. In contrast, batch-based
integration methods involve storing all the data received until a certain amount is collected and then
processing it as a batch.
Batch Integration:
Batch integration supports data in batches, which includes data refreshes hourly, daily, weekly, monthly, yearly and
ad hoc.
Bring history data (data in full) as per the requirement only once.
Bring only incremental data in agreed batches.
Use Databricks Delta for update, delete and insert (see the sketch after this list).
Process the data only for the modified records.
Keep the data in Delta Parquet format for easy updates.
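A minimal PySpark sketch of the Delta point above: merging an incremental batch into an existing Delta table so that only modified records are processed. It assumes a Databricks notebook where spark is already defined; the paths and key column are illustrative.

# Upsert an incremental batch into a Delta table instead of rewriting the full dataset.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/bdl/sales_delta")            # existing Delta table
incremental_df = spark.read.parquet("/mnt/landing/sales_increment")   # only the changed records

(target.alias("t")
    .merge(incremental_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()       # update modified records
    .whenNotMatchedInsertAll()    # insert new records
    .execute())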
Micro-batch integration is used for getting the data more frequently than the batch. At the moment, micro-batch
support is in 15-minute batches (see the streaming sketch after the best practices below).
Bring history data (data in full) as per the requirement only once.
Near real time/micro-batch integration: based on the requirement, data can be consumed by BDL/PDS applications
as shown below.
There are two approaches which can be used to achieve the near real time requirements. This holds good if the
data volume is small and minimal or no transformations are involved.
Approach 2 – Reading the data using the Logic App to establish the dependency on the file arrival.
Best Practices:
Use cluster pools or interactive clusters, based on the size of the data, for micro batches. A job cluster takes
a minimum of 4 minutes to spin up.
Use single-pass processing instead of writing to and reading from the underlying storage.
Avoid multiple layers/stages if the data is to be processed quickly.
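A hedged Structured Streaming sketch of the 15-minute micro-batch cadence on Databricks, processing newly arrived files in a single pass straight into Delta. The source path, schema, checkpoint location and target path are assumptions, and spark is expected to be defined by the notebook.

# 15-minute micro-batch: read newly arrived files and append them to a Delta table in one pass.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])

incoming = (spark.readStream
    .format("json")
    .schema(schema)                            # streaming file sources need an explicit schema
    .load("/mnt/landing/events"))

(incoming.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(processingTime="15 minutes")      # micro-batch cadence from the pattern above
    .start("/mnt/bdl/events_delta"))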
HDInsight Kafka can be used to stream data into the Data Lake and expose it for further processing in UDL
Stream Analytics starts with a source of streaming data. The data can be ingested into Azure from a device
using an Azure Event Hub or IoT Hub. This is the preferred pattern for streaming from IoT or event-stream
sources such as connected devices, services and applications.
Internal Data can be streamed using IoT Hub and Stream Analytics
External Data can be streamed using Event Hubs and Stream Analytics
Architectural Patterns
NRT Streaming – Every 15 minutes/One Hour and processed immediately – Only where needed
Lambda – Data fed in both Batch Layer and Speed layer. Speed layer will compute real time views while
Batch layer will compute batch views at regular interval. Combination of both covers all needs
Kappa – Processing data as a stream and then processed data served for queries
Only additive deltas can be streamed. Any other complex delta mechanism may not be suitable for streaming.
Unpredictable incoming streaming patterns, such as out-of-sequence and late events, can be streamed, but only
with complex logic while processing them further.
Skewed data with unpredictable streams cannot be streamed, as managing latency and throughput will be
difficult.
External lookups while streaming are memory-consuming operations and hence need additional mechanisms,
such as caching reference data, additional memory allocation, etc., to enable effective streaming.
Information Classification
There are different types of information classification standards and each standard requires a different level of protection.
Public Information is information that is available to the general public (i.e. already in the public domain). Public
information may be freely distributed without risk of harm to Unilever.
Internal Information is non-public, proprietary Unilever information that is for undertaking our business processes
and operational activities and where the unauthorized disclosure, modification or destruction is not expected to
have a serious impact to any part of Unilever.
Confidential Information is non-public, proprietary Unilever information where the unauthorized disclosure,
modification or destruction could seriously impact a part of the Unilever organisation (e.g. country, brand, function).
Confidential Personal data is defined as information which can be used to directly or indirectly identify an
individual.
Confidential Sensitive Personal data is any personal data which has the potential to be used for discriminatory,
oppressive, or prejudicial purposes.
Restricted Information is information which the Group Secretary has classified as Restricted Information because it
is highly sensitive to Unilever for commercial, legal and/or regulatory reasons.
Unilever information with a classification of Internal MUST be protected using disk, file or database encryption.
Unilever information with a classification of Confidential or above MUST be protected using disk, file or
database encryption.
Encryption technologies may be applied at the physical or logical storage volume and may be
managed by Unilever or by a 3rd party service provider
For Restricted and Sensitive Personal Data, controls MUST ensure that only authorised individuals
can access data and MUST fail closed.
For Restricted and Sensitive Personal Data, annual review of the effectiveness of these controls
MUST take place
All mass storage devices used to store Internal information or above, MUST be protected from accidental
information disclosure (e.g. due to theft of the device) by use of encryption technology.
The encryption method applied MUST be of AES 256bit or equivalent/higher and MUST either provide
file level protection (if stored on an otherwise unencrypted volume) or encrypt the entire storage
volume.
Data In Transit
Internal information MUST be protected during transmission whenever the source and destination are in different
security resource groups.
Confidential information MUST be protected during transmission whenever the source and destination are in
different security resource groups.
Restricted or Sensitive Personal data MUST be encrypted whenever in transit. User authentication
credentials MUST always be protected regardless of where they are being transmitted.
Acceptable methods for encrypting data in transit include:
Environment Access
Access to resource groups, components and data is controlled using custom Azure roles granted to Azure Active
Directory security groups. Since Landscape creates the environments and developers write and deploy code into their
environments, developers and DevOps personnel do not need Contributor permission on the application resource
groups. Developers and DevOps members cannot provision or edit Azure resources.
There are five custom roles. Two are used to grant access to resource groups; the remaining three control access
to data. For data access, note that the type of storage component does not matter. So, e.g., if you are a member of
the data reader group for a given environment, you will be able to read data in SQL DW, SQL DB, ADLS & Blob.
InA Tech App Reader – InA equivalent to the regular Azure 'Reader' role. Made up of all actions from the standard 'Reader' role plus any permissions from 'Data Factory Contributor' that relate to monitoring and controlling pipelines. It excludes permissions that enable authoring. Security group: SEC-ES-DA-<env>-<ITSG>-app-reader
InA Tech App Contributor – InA equivalent to the regular Azure 'Contributor' role. This is a cut-down version of the regular 'Contributor' role. It does not enable resources to be stood up. Instead it gives 'Reader' on the resource group, plus 'Data Factory Contributor'. Security group: SEC-ES-DA-<env>-<ITSG>-app-reader
InA Tech Data Reader – InA custom role for reading data in any component. Read access to all data via its public end point. Does not include portal access. Security group: SEC-ES-DA-<env>-<ITSG>-data-reader
InA Tech Data Writer – InA custom role for writing data in any component. Write access to all data via its public end point. Does not include portal access. Security group: SEC-ES-DA-<env>-<ITSG>-data-writer
InA Tech Data Owner – InA custom role for reading, writing and controlling access to data. Read and write access to all data via its public end point. Ability to set ADLS folder permissions. Must be granted in addition to 'InA Tech App Reader/Contributor' in order to provide portal access. Security group: SEC-ES-DA-<env>-<ITSG>-data-owner
The functional roles below are used in the Dublin & Amsterdam operating model; a user is part of a single role only.
Developer (New Foundation) – This user group is for the developers. A developer's permissions diminish as you move into each higher environment. Security group: SEC-ES-DA-d-<DevITSG>-azure-developer
Tester (New Foundation) – This user group is for the testers. Testers have read-only access in QA but have no access to the Pre-Prod and Production environments. Security group: SEC-ES-DA-d-<DevITSG>-azure-tester
DevOps (New Foundation) – DevOps users combine the permissions required to develop, test and support the application once it has been released to production. Generally, this means access is granted in all environments. Security group: SEC-ES-DA-p-<ProdITSG>-azure-devops
Developer (Old Foundation) – This user group is for the developers. A developer's permissions diminish as you move into each higher environment. Security group: SEC-ES-DA-d-<DevITSG>-Developer
Tester (Old Foundation) – This user group is for the testers. Testers have read-only access in QA but have no access to the Pre-Prod and Production environments. Security group: SEC-ES-DA-d-<DevITSG>-Tester
DevOps (Old Foundation) – DevOps users combine the permissions required to develop, test and support the application once it has been released to production. Generally, this means access is granted in all environments. Security group: SEC-ES-DA-p-<ProdITSG>-Support
Support Level 1 (Old Foundation) – This group is for the users who provide support during production releases. Users get execute and data-loading permission for the Prod environment. Security group: SEC-ES-DA-p-<ProdITSG>-supportlevel1
Application SPN – The application SPN has data reader and writer permissions but it cannot read Key Vault. Wherever possible, ADF linked services use the application SPN when connecting to data. Name: svc-b-da-<env>-<ITSG>-ina-aadprincipal
Deployment SPN – The deployment SPN has elevated permissions and has full access to data, components and Key Vault. Name: svc-b-da-<env>-<ITSG>-ina-deployment
Role assignments across environments:
SupportLevel1: n/a | Reader, Execute | Reader, Execute | Reader, Execute | Reader, Execute
Developer: InA Tech App Contributor, InA Tech Data Owner | InA Tech App Reader, InA Tech Data Reader | InA Tech App Reader, InA Tech Data Reader | n/a | n/a | InA Tech App Contributor, InA Tech Data Owner
DevOps: InA Tech App Contributor, InA Tech Data Owner | InA Tech App Reader, InA Tech Data Reader | InA Tech App Reader, InA Tech Data Writer | InA Tech App Contributor, InA Tech Data Owner | InA Tech App Reader, InA Tech Data Reader | n/a
Application SPN: InA Tech Data Writer in all environments
Data Scientist: InA Tech App Contributor, InA Tech Data Owner in all environments

Developer: Full access on Dev and Read & Execute access on QA, UAT | Full access on Dev and Read access on QA, UAT
Tester: Read access on Dev and Read & Execute access on QA & UAT | Read access on Dev, QA only
Dev Ops: Full access on all environments | Full access on Dev & Reader access on the rest.
Custom AD group names should follow the operating model environment naming convention with a descriptive
name relevant to the access provided, and are requested and managed by the AD or UAM team using self-service Remedy
offerings (IT, Technical, Active Directory, Groups, …).
For example, a typical custom AD group to grant business user access to a Production Power BI report reading
data from AAS is : SEC-ES-DA-P-<ITSG>-FinancePBIReadProd
UDL and BDL consist of data which is shareable. There are a lot of business use cases hosted within the Unilever
landscape and outside it that require this data. Refer to the "Distribution Strategy Design pattern" for the
different methods and processes involved in sharing the data.
The data access control process for ADLS (UDL & BDL) is a three-step process:
Make sure to give "Read only" access to the consumers of data. Only the respective platform's internal processes should
have Write permission on the data lake.
UDL:
Step 1 : AD groups are to be decided by the Data SMEs/Data owner. This can be based on the points below:
Different source systems
Data access groups can be defined at any level in the folder hierarchy; the data owner should
make an informed decision on how to group the data in order to avoid having to govern a lot of AD groups.
Classification of the data (Restricted, Sensitive, Confidential, Internal)
For example, restricted data access groups are created as separate groups.
The suggestion is to keep separate AD groups for restricted and non-restricted data access, and
for each business function.
Grouping of different commonly used data sets. For example: all supply chain data, which is mostly
internal or confidential.
Based on country-wise restrictions. For example: Indonesia and India finance data is restricted,
whereas the same doesn't apply to other countries like SEAA or Namet.
The SME needs to analyse each data set ingested into UDL for each function and accordingly arrive at
the AD groups for data access control.
Step 2 : Once the AD groups and underlying folders are decided, the same should be forwarded to the Landscape
team to attach them to the respective folders (see the sketch after step 3). No super-user access is given on the ADLS.
Step 3 : The process to add users and SPNs, for read-only purposes, into an AD group has to be approved
/owned by the Data SME.
BDL:
Step 1 is owned by the BDL DevOps team/Data SME. The Data SME needs to decide the different AD groups required to give different levels of access to the BDL data.
For example, for the SC BDL, the team building it needs to liaise with the IDAM or Landscape team to create whatever AD groups they need, based on the level of granularity of access they wish to provide and manage.
The SC BDL team could decide to have one AD group for the whole BDL, one per data set, or something even more granular, depending on how they want to manage access.
Step 2 is owned by the I&A Landscape team (under Jobby), to make sure AD groups are attached to the right base folders based on the output of Step 1. No super-user access is given on the ADLS.
Step 3 is owned by the project DevOps team, to ensure that the people who need access have that access, by adding them into the right AD groups. All access here should be read only.
PDS:
No user access is given on the PDS ADLS. The only connection allowed for users on PDS is to AAS for Power BI self-service, via an MFA-enabled AD group.
This page describes the approved methods for encrypting data-at-rest for the various Azure components used to store data.
Based on the encryption-at-rest options that each component provides, Security will be involved in identifying which components should be used for hosting restricted data.
Component Name: Azure Data Lake Store Gen 1
Methods of Encryption: Microsoft-provided keys (AES 256-bit) or customer keys (AES 256-bit)
Methods of Encryption Supported: Microsoft-provided keys (AES 256-bit); customer-managed key (RSA 2048)
Encryption Key Storage: Microsoft-generated keys are stored in the database boot record; customer-managed keys are stored in Azure Key Vault.
SQL MI
AAS
Coming soon
This page describes the protection and encryption methods used for data in transit.
For data in transit, Data Lake Storage Gen1 uses the industry-standard Transport Layer Security (TLS 1.2) protocol to secure data over the network.
Azure Storage accounts use TLS 1.2 on public HTTPS endpoints. TLS 1.0 and TLS 1.1 are supported for backward compatibility.
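As an illustration, the following Python sketch shows one way a client can insist on TLS 1.2 or higher when calling a storage HTTPS endpoint; the storage account URL is a placeholder and the adapter pattern is a client-side assumption, not part of the platform setup.

import ssl
import requests
from requests.adapters import HTTPAdapter

class Tls12Adapter(HTTPAdapter):
    # Transport adapter that refuses anything older than TLS 1.2.
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", Tls12Adapter())

# Placeholder storage endpoint; every HTTPS call through this session now
# negotiates TLS 1.2 or higher.
response = session.get("https://<storageaccount>.blob.core.windows.net/?comp=list")
print(response.status_code)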
Azure SQL Database Always Encrypted is a data encryption technology that helps protect sensitive data at rest on the server, in transit between client and server, and while the data is in use, ensuring that sensitive data never appears as plaintext inside the database system.
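A minimal client-side sketch, assuming pyodbc and the Microsoft ODBC Driver 17 are available and that the client can reach the column master key: the ColumnEncryption=Enabled keyword turns on Always Encrypted on the client, and Encrypt=yes covers data in transit. Server, database, credential and table names are placeholders.

import pyodbc

conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<database>;"
    "Uid=<user>;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"   # protects data in transit
    "ColumnEncryption=Enabled;"                # enables Always Encrypted on the client
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Columns protected by Always Encrypted are decrypted on the client
    # (given access to the column master key); they never appear as
    # plaintext inside the database engine.
    cursor.execute("SELECT TOP 10 * FROM dbo.SensitiveTable")  # placeholder table
    for row in cursor.fetchall():
        print(row)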
SQL MI
Always Encrypted applies in the same way for SQL Managed Instance, protecting sensitive data at rest, in transit between client and server, and while in use.
Coming soon
To address this security concern, the I&A Platform took a subnet-based approach in the old foundation design. The same approach is improved in the new foundation design, which removes the subnet and reduces the cost spent on the proxy VMs used as IR and OPDG.
Subnet Approach in Old Foundation Design (applies to both SQL DW and SQL DB)
How to Verify
Verify the steps below for each environment:
Check whether you have an IR and OPDG created in your environment. They should be created in the same subnet as the SQL DW.
Check that SQL DW has service endpoints defined for the following:
Subnet service endpoint that allows connection to the IR and OPDG
Subnet service endpoint that allows connection to Workstation VMs
Subnet service endpoint that allows connection to RDS/Citrix
Check that your ADF pipelines connecting to SQL DW use the IR hosted in the same subnet as SQL DW (not the on-premises IR; it is the IaaS IR). The linked service has to use the Azure IaaS IR to connect to SQL DW.
Check that the AAS refresh uses the OPDG to connect to SQL DW. AAS can connect only through the OPDG hosted in the same subnet as SQL DW.
If any of the above is not in place, please work with the respective team to get it corrected. (Details in the first step.)
To secure the SQL DW without hosting it in a subnet, the following points are taken care of as part of the design.
Refer to this link for the Design and Security approval of the non-subnet design for SQL.
Refer to the document Azure Central Lake Security Details to understand and adhere to the security process.
Refer to the document Logging & Monitoring to understand and adhere to the logging and monitoring process.
PENETRATION TESTING
Unilever uses PaaS components such as ADLS, Azure Data Factory, Azure Databricks, Azure Key Vault, Integration Runtime and SQL DB, which are penetration tested by Microsoft.
Refer to the document UDL User Access to understand and adhere to the UDL User Access process.
Note: This process is laid down by the UDL team in discussion with the TDA. It is a living document and will keep being updated based on future demands.
Refer to the document Azure Access Process to understand the Azure Access Management Process SOP.
Note: This process is laid down by the Landscape team in discussion with the TDA. It is a living document and will keep being updated based on future demands.
The purpose of this document is to define a process for on-boarding and off-boarding resources to grant or revoke access permissions for Azure applications in the UDL, BDL, Product and Experiment environments.
The responsibility for on-boarding and for raising requests for an environment and its access remains with the Project Manager and Unilever Delivery Managers.
New Joiner
Mover
Leaver
Remove UL Account (A, R)
Project/Environment Decommission
SUPPORTING DOCUMENTS
Refer to the supporting document "UDL DEVOPS SERVICE REQUEST AND INCIDENT MANAGEMENT" to understand the UDL service request and user access request process (see Link).
Note: the attached document provides details about the UDL process only.
The Azure Landscape team and the respective DevOps teams maintain the tracker with the latest data in the format below.
The required data can be extracted from the Azure or MIM portal.
Set up a process to send a notification email to the Unilever Delivery Manager, Vendor Partner Manager and Business Manager every month to review the active members.
User access should be revoked based on the input from the Unilever Delivery Managers.
Note: these details need to be maintained by the team providing access to the MIM portal.
Data Access
Data access should be managed by the Unilever Delivery Managers who own the respective environments and by the Data Expertise Team.
Data access is governed by the data owners and environment owners. They are responsible for managing and revoking data access permissions.
UDL Access: fill in the data access template and share it with the UDL DevOps team. UDL Data Access Forms can be downloaded from here.
BDL Access: fill in the data access template and share it with the BDL DevOps team. The BDL data access template can be downloaded from here. The BDL DevOps contacts are as below:
Supply Chain: (reach out to BDL DevOps)
New User
Mover
Landscape/DevOps team removes the access (S)
Landscape/DevOps removes the access (A, S)
SPN
Create SPN (A, R)
Landscape/DevOps team removes the access (S)
SUPPORTING DOCUMENTS
The respective DevOps teams maintain the tracker with the latest data in the format below.
The required data can be extracted from the Azure portal.
Set up a process to send a notification email to the Data Expertise team, Unilever Delivery Manager and Vendor Partner Manager every month to review the active members.
User access should be revoked based on the input from the Data Expertise team and Unilever Delivery Managers.
Costing Questionnaire: before an application/product is built, certain planning activities are required to close the functional and non-functional requirements, on-board a build team, secure the budget, etc. In order to secure the budget, a project requires a high-level estimate of its infrastructure, build and support cost based on the functional requirements to be covered. Build and support costs are planned by the delivery teams, but because the infrastructure used here is Azure cloud PaaS services, the architecture team provides high-level estimates of the infrastructure cost. The costing questionnaire consists of a set of questions to gather information on the functional requirements and scope of the project, which helps the architecture team build a draft architecture and provide a high-level estimate of the cost based on it. This is only a high-level estimate derived from the answers given by the project team, not the actual cost; the cost can differ if the project uses the environment in a different manner than claimed in the costing questionnaire.
TDA T-Shirt Calculator: an Excel-based, self-explanatory calculator built to provide a high-level estimate. Projects/delivery teams who know the data size and high-level architecture can use this calculator to derive a high-level cost.
Cost Optimization Recommendations: although Azure cost is pay-as-you-go, the cloud can turn out to be an expensive solution if components are not managed correctly. To prevent projects spending too much on their infrastructure, the architecture team has come up with guidelines and scripts to optimize cost.
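For illustration only (this is not the TDA T-Shirt Calculator), the short sketch below shows how a draft monthly estimate could be derived from a few questionnaire answers; all rates and the split into storage, compute and reporting are hypothetical placeholders.

def high_level_estimate(total_volume_gb, interfaces_per_day, dashboard_users,
                        storage_rate_gb_month=0.04,      # hypothetical rate
                        compute_rate_per_interface=15.0, # hypothetical rate
                        reporting_rate_per_user=5.0):    # hypothetical rate
    # Returns a draft monthly cost split by storage, compute and reporting.
    storage = total_volume_gb * storage_rate_gb_month
    compute = interfaces_per_day * compute_rate_per_interface
    reporting = dashboard_users * reporting_rate_per_user
    return {
        "storage": round(storage, 2),
        "compute": round(compute, 2),
        "reporting": round(reporting, 2),
        "total_per_month": round(storage + compute + reporting, 2),
    }

# Example inputs: 5 TB of data, 12 daily interfaces, 200 dashboard users.
print(high_level_estimate(total_volume_gb=5000, interfaces_per_day=12, dashboard_users=200))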
A template is available to capture the project inputs required for cost estimates (see Appendix D). The subsequent sections outline how to fill in the template, why the information is required and how it is used to support cost estimates.
However, it is important to understand that the more information is made available, the more accurate the estimate will be.
1. Data Sources
Field | Why it is collected and what the information is used for | Needed for Costing?
Source System | The recognized system of record for the data sources required to support the solution. | Required for design
Source System Description | Description of the data source (this needs to be common to the source technical team, business, project and I&A teams). | Required for design
Internal/External | Is the source system on-premise? Is it a system external to UL, managed externally? Is it a system external to UL which will first land data on premise? | Required for design
Data Source / Interface | Technical name of the data source (interface name). | Required for Costing
No of Data Sources / Interfaces | Number of interfaces (extracts) from each source (both external/internal and on-prem/cloud) for the given frequencies. | Required for Costing
Frequency | The refresh frequency: real time, intra-day, daily, weekly, monthly. | Required for Costing
Size (in MB) / Data Volume | The size of the file for each refresh. See the next section for all data volume information required. | Required for Costing
Provide all the details of the data source requirements for the project. Reference: Data Sources tab of the Costing Questionnaire Excel document.
Item | Source System | Source System Description | Internal/External | Transaction/Master Data/Text/Hierarchy | Source/Interface | Frequency | Size (in MB)/data volume
D1
D2
….
Total: this information is used to determine the architecture design patterns and to confirm that the source systems have current design patterns which are fully tested and available. (A) Total number of data sources/interfaces. (B1) Low Frequency = total number daily/weekly/monthly etc. (C) Total incremental volume per month is calculated.
For high-level estimates carried out pre-Gate 1 in the project life cycle, the estimates can be provided at the level of number of data sources, frequency and size by source system and data type (transaction, master, text and hierarchy).
For Gate 1 and subsequent re-estimation points, the data is collected at a granular level: a record for each data source/interface by data type.
List all the volumetric details for the sources considered in the section above.
D1.1 | 1 Year Data Volume | Total volume of data for the most current rolling full year, by data source/interface. | Required for Costing of Storage and Compute
D1.2 | History | Number of years of historical data required, with actual data volumes by year and data source captured. | Required for Costing of Storage and Compute
D1.3 | Total Data Volume for solution go-live (Current + History) | The estimated go-live volume. At Gate 0 a ball-park estimate can be used (a project assumption is acceptable); at Gate 1 items 1.1 and 1.2 need to be completed. | Required for Costing of Storage
D1.4 | Retention required (Current + History + post-go-live retention period) | What is the required archiving strategy? Will data be archived and/or deleted, and if not, for how long will data be kept? E.g. go-live data volumes could be current year + 2 years, but the retention period for the solution post go-live could be 10 years. | Required for Costing of Storage
D1.5 | Data growth expected per Year | The data volumes for each data set will vary. Some data sets grow at a constant rate (i.e. the volume of data by month is static over time); other data sets either grow or reduce in relative size over time. | Required for Costing of Storage and Compute
D1.6 | Data in “Hot” storage | Hot data is accessible to the reporting solution and can be accessed by both developed reporting solutions and self-service. Hot data needs to be defined by the layer in the architecture and for the future retention period outlined in point 1.4. | Required for DR design (details to be added in Phase 2 of Costing)
D1.7 | Data in “Cold” storage | Data can be stored on low-cost infrastructure which can be accessed by data science functions, but not by the front-end dashboard and self-service tools. This data can be retrieved into the reporting solution on request or accessed using I&A data science functions. Retaining data that is only required in exceptional situations in cold storage offers a cost-effective approach to this requirement. | Required for DR design (details to be added in Phase 2 of Costing)
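A minimal sketch of how the volumetric answers above (D1.1–D1.5) roll up into a retained storage volume, assuming nothing is archived or deleted during the retention period; the figures in the example call are illustrative only.

def projected_volume_tb(current_year_tb, history_tb, growth_rate_per_year, retention_years):
    # D1.3: go-live volume is the current year plus history.
    volume = current_year_tb + history_tb
    yearly_intake = current_year_tb
    # D1.4/D1.5: each retained year adds a new (growing) yearly intake,
    # assuming nothing is archived or deleted.
    for _ in range(retention_years):
        yearly_intake *= (1 + growth_rate_per_year)
        volume += yearly_intake
    return round(volume, 2)

# Example: 2 TB in the current year, 4 TB of history, 10% yearly growth, 10-year retention.
print(projected_volume_tb(current_year_tb=2, history_tb=4,
                          growth_rate_per_year=0.10, retention_years=10))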
1. Dashboard Reporting
This section covers all the dashboard reporting requirements of the project.
R2 | Total data volume for dashboard reporting? | If an existing solution and TDEs are available, the total size of the TDE. | Required for Costing of Azure Analysis Services and SQL DW for reporting
R3 | No of Dashboards planned? | Number of actual workbooks exposed to users, with multiple tabs (as applicable). | Required for design and Dev resource planning
R4 | Maximum reports in any Dashboard? | Number of reports to be built in one dashboard. The dashboard with the maximum number of reports indicates the complexity of dashboard creation. | Required for design and Dev resource planning
R5 | Maximum volume of records for any report? | Maximum data that can be fetched across all the dashboards. | Required for design of AAS cubes
R6 | Total dashboard reporting users? | Number of business users with access to run dashboards. | Required for design and Premium capacity
R7 | Total concurrent users accessing the reports? | Number of users accessing the reports at the same time. | Required for Costing
R8 | Maximum acceptable response time? | Maximum response time for availability of report data after users run the dashboards. | Required for Costing
R9 | SLA for report availability? | Is access to the report required 24x7? If not, what are the required availability hours by time zone? Can the reporting performance be turned down outside the defined availability period? | Required for Costing
Note: the current recommendation is that it should not be assumed the reporting solution can be turned off out of hours. Historically this has proved impractical; a similar approach was attempted to manage the on-premise Tableau servers. A hard close would result in:
· Critical out-of-office activities being impacted if reporting solutions are not pre-booked to be kept running.
· The Application Management (AM) team requiring access to the environment to evaluate faults, and application development (AD) and DevOps teams needing access to the environment to deploy enhancements and CRs.
· Additionally, UL is a global organization with report users travelling globally, resulting in out-of-hours requirements.
The I&A-Tech Platform will continue to work with projects and business functions on this requirement and ensure opportunities to reduce costs are taken. However, the recommendation stands that it should not be assumed that the reporting solution can be turned off outside core reporting hours.
S3 | Total data volume for Self-Service Reporting? | Overall size (total volume) of data required for self-service slice-and-dice reporting. | Required for Costing of SQL DW
S4 | Total Self-Service users? | Number of users doing slice-and-dice. | Required for Costing of Power BI Pro licenses
S5 | Total concurrent users accessing the self-service environment? | Number of users doing slice-and-dice at the same time. | Required for Costing of SQL DW
S6 | Maximum acceptable response time? | Maximum response time for availability of report data after users run the dashboards. | Required for Costing
This section covers all the Analytics/Data Science requirements of the project.
A2 | Total data volume considered for data science | Overall size (total volume) of data required for data science use cases. | Required for Costing of Azure ML/HDInsight
A3 | Total Data Science/Analytics users | Number of data scientists running models. | Required for design and Azure ML licenses
A4 | Total concurrent users accessing the analytical platform | Number of users accessing models at the same time. | Required for design
A5 | No of analytical models planned | Number of analytical models. | Required for design and dev resource capacity
A6 | Is the maximum data volume for any analytical model > 10 GB? | If the size is > 10 GB use HDI, else use ML (see the sketch after this table). | Required for design and Costing
A7 | Maximum data volume for any analytical model | Maximum size of the final data set considered across all models. | Required for Costing
A8 | Time frame when the analytical/data science users are expected to use the system (24/7, 12/5) | Number of hours of usage of the environment. | Required for Costing of non-Prod environments, esp. Dev
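A tiny sketch of the A6 sizing rule above: models with more than 10 GB of input data are costed on HDInsight, smaller ones on Azure ML.

def analytics_compute_choice(max_model_volume_gb):
    # A6: anything above 10 GB of model input data is costed on HDInsight,
    # smaller models on Azure ML.
    return "HDInsight" if max_model_volume_gb > 10 else "Azure ML"

for size_gb in (2, 10, 48):
    print(size_gb, "GB ->", analytics_compute_choice(size_gb))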
Reduce the data to 10% in Non-Prod environments for UDL, BDL and PDS, mainly Dev and QA, for all components (ADLS, SQL DW, AAS).
Right-size the environments and components.
AAS and SQL are the biggest contributors to cost. AAS (< S2) and SQL DW (< 400 DWU) should be used in Non-Prod environments.
UAT and PPD environments should be up and running only when UAT or performance testing is being carried out.
For example, consider a sprint of 4 weeks with a release after every 2 sprints. Then:
8 weeks of sprint (Dev environment)
2 weeks of testing (QA environment – 15% of Development cost)
1 week of performance testing (PPD environment – 8% of Development cost)
Monitor the environment to verify that it is up only during this limited period.
Pause all Non-Prod environments during weekends and non-working hours.
Usage of Non-Prod should be minimal or zero during non-office hours.
Monitor, or have an approval process to bring up components in Non-Prod during weekends or non-working hours if the requirement is justified.
Pause all compute environments after processing is complete or when not in use.
Architecture & Landscape have provided a method to pause components when not in use, mainly AAS and SQL DW, using web hooks.
For support activities on SQL DW, use a lower configuration of SQL DW.
Use the Azure component utilization report to see when SQL DW and AAS are being used and when they can be paused. Implement pause and resume accordingly.
Databricks optimization (see the sketch after this list)
Implement a timeout of 15–20 minutes for interactive clusters.
Implement Databricks cluster optimization. Use the right cluster (type and number of nodes) for the job type – small, memory intensive, compute intensive.
Reduce the number of Databricks premium instances if premium features are not used.
Implement Databricks Delta with partitioning to reduce compute costs.
SQL DW optimization: implement data distribution, partitioning and indexing, and enable statistics to improve query performance; use improved loading techniques.
AAS optimization: move only the calculated/aggregated data required for reports. Implement partitions, incremental refresh and calculation groups.
Set budgets and alerts for project costs over a threshold.
Migrate to the new foundation design to completely remove the IR and OPDG for projects. (Applicable to projects in the old foundation design, i.e. Dublin.)
Review and implement the best practices published for each Azure component by TDA in the solution architecture guidelines.
Using Log Analytics and alerts for monitoring and alerting will help reduce manual monitoring costs.
Migrate to ADLS Gen2.
Pause IaaS components using Park My Cloud.
Start/stop VMs on a schedule: Instruction for PMC – User Guide.
Implement reserved instances for Databricks VMs and ADLS – work in progress, as standardization of VM types is happening across products.
Architectural patterns to remove the staging component which doesn’t add value in the E2E model. (In progress)
New design patterns implemented to remove SQL DW if the database is not critical, as Databricks can be used as the compute layer.
AAS refresh directly from ADLS is tested and implemented in a few of the projects.
Shareable cluster for the IaaS components IR & OPDG for non-prod environments.
Implemented
Migrate to the “New Foundation Design” subscription, which no longer needs IR & OPDG due to changes in security controls on SQL DW & resource groups.
Work in progress
Move from HDInsight to Databricks.
Most of the projects have already migrated.
Migrate to ADLS Gen2 (Gen2 cost is ~30% less than the Gen1 storage cost).
Use the Cold, Hot and Archive data layers for ADLS Gen2.
Best practice guidelines and trainings
Train build teams on the right practices for each component.
Review the projects which are the highest contributors to the overall cost of the Azure platform for cost improvement opportunities.
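As a hedged illustration of the Databricks guideline above, the sketch below creates an interactive cluster with a 20-minute auto-termination window through the Databricks Clusters API (2.0); the workspace URL, token, node type and runtime version are placeholders to be replaced with project values.

import requests

DATABRICKS_URL = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                            # placeholder token

cluster_spec = {
    "cluster_name": "interactive-dev",
    "spark_version": "<runtime-version>",   # placeholder Databricks runtime
    "node_type_id": "<vm-size>",            # right-size per workload (small / memory / compute intensive)
    "num_workers": 2,
    "autotermination_minutes": 20,          # idle interactive clusters shut down automatically
}

resp = requests.post(
    f"{DATABRICKS_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # returns the new cluster_id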
In order to save costs on Azure Analysis Services and Azure Synapse (formerly SQL Data Warehouse), Unilever I&A recommends using standard processes that are available to all projects. These standards are developed and maintained by Landscape and make use of an Azure Automation account. Below is the list of what is supported, and further sections describe how projects can implement them:
AAS
A similar implementation is available for processing Azure Analysis Services cubes using TMSL. For details, refer to Web hooks for AAS cube refreshes.
This section covers the use of web hooks for automation and how they are used to reduce costs for projects.
I&A Landscape has access to an Azure Automation account that is used to provide provisioning services to projects. Access to these services is provided via shared web hooks that you can use. Additional services can be added on request. Note that the runbooks that provide these services use a copy of your application’s parameter files held in Landscape’s private storage account. If you edit your Overrides Parameter File and the new values are required by a webhook, you must ask Landscape to copy the updated file to their storage account.
If you look in your parameter file and search for ‘Webhook’ you will see various entries that help identify the resources for your project that can be managed by webhooks.
If you don’t see the above in your parameter files, ask Landscape to regenerate them for your project.
In addition, there are some code samples in your ADF that show how to call them.
These samples are deployed by default in all newly provisioned environments in Amsterdam.
These pipelines only work if you trigger them; running them in debug mode doesn’t work.
WEBHOOK PARAMETERS
In the code samples you will see that the body of each web request is constructed using a dynamic ADF
expression. Two separate formats are supported and note that format1(legacy) is a subset of format2:
Format 1 (Legacy)
@concat(pipeline().DataFactory,',',pipeline().RunId,',','')
This evaluates to a CSV string where the first 4 values have a fixed meaning. Column 4 (optional) is the call-back pipeline: the name of a pipeline in the data factory to trigger once runbook processing is complete.
The purpose of the first 2 columns (the data factory name and the pipeline run ID) is apparent from the expression above. Here is some detail about the remaining columns:
You can control which parameter file override to use. A use case for this could be if you have multiple SQLDW
instances. Since the main parameter file only holds the name of one instance you can use an override file that
contains the name of the second. The webhook will use the values in the override file and so act on the second
SQLDW instance.
Webhooks run asynchronously and so the ADF web task will complete well before the underlying automation
runbook. In a data processing scenario where ADF needs to resume SQLDW before attempting to process its data,
you can use the call back feature to start your data processing pipeline after SQLDW has fully resumed. This
parameter is supported for all webhooks.
Format 2
@concat('{"csv":"',pipeline().DataFactory,',',pipeline().RunId,',,
PL_PROCESS_CUBE_CALLBACK",','"object":',pipeline().parameters.
tmslScript,'}')
This evaluates to a JSON string and the CSV string from format 1 appears in a node called ‘csv’. Another node
called ‘object’ is also created and this can be used for anything. Currently, it is only used for cube processing via a
TMSL script (i.e. a json string).
When calling the webhook that resizes/resumes AAS you must pass the SKU in CSV column 5.
When calling the webhook that resumes a database you must pass the database edition and pricing tier in columns 5 and 6 respectively. In column 5 pass ‘Datawarehouse’ for SQLDW or an empty string for SQLDB.
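To make the column meanings easier to see outside ADF, here is an illustrative Python sketch of the two body formats described above; the data factory name, run ID and TMSL script are placeholders, and the override file name reuses the example given later in this section.

import json

data_factory = "<adf-name>"          # pipeline().DataFactory in ADF
run_id = "<pipeline-run-id>"         # pipeline().RunId in ADF

# Format 1 (legacy): plain CSV - data factory, run ID, optional override
# parameter file (column 3), optional call-back pipeline (column 4).
format1_body = ",".join([data_factory, run_id,
                         "overrides.aastwo.d.80181.json",
                         "PL_PROCESS_CUBE_CALLBACK"])

# Format 2: JSON wrapping the same CSV in a 'csv' node plus an 'object' node,
# currently used to carry a TMSL script for cube processing.
tmsl_script = {"refresh": {"type": "full", "objects": [{"database": "<cube-db>"}]}}
format2_body = json.dumps({
    "csv": ",".join([data_factory, run_id, "", "PL_PROCESS_CUBE_CALLBACK"]),
    "object": tmsl_script,
})

print(format1_body)
print(format2_body)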
If your project requires Azure Analysis Services, you are likely to have one instance per environment – Dev, QA, PROD etc. You will also have an ADF pipeline that allows you to pause AAS instances. These pipelines have no triggers associated with them, i.e. they are not scheduled to run by default. Each project is expected to implement schedules to run these pipelines according to its needs, keeping in mind that AAS is an expensive component and should remain paused when not in use. The pipeline looks like this:
URL – all non-prod URLs are the same; they call the same webhooks. For production, you will have a different URL. If you need it, reach out to the Landscape team to get that URL.
Method – should always be ‘POST’.
Body – if you only have one instance per environment then you don’t need to change the body element. It will automatically identify your instance for the relevant environment based on your parameters file and will pause the instance. If you have multiple AAS instances in your environment, you need to specify which instance to pause; the syntax is slightly different as you have to specify the instance name.
As an example, Body for a project that has 2 instances of AAS per environment would look like:
@concat(pipeline().DataFactory,',',pipeline().RunId,',','overrides.
aastwo.d.80181.json')
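Outside ADF, the same call can be sketched with a plain HTTP POST, which is essentially what the Web activity does; the webhook URL below is a placeholder obtained from the Landscape team and the body matches the two-instance example above.

import requests

WEBHOOK_URL = "https://<automation-webhook-url-from-landscape>"   # placeholder

body = ",".join([
    "<adf-name>",                      # pipeline().DataFactory
    "<pipeline-run-id>",               # pipeline().RunId
    "overrides.aastwo.d.80181.json",   # override file selecting the second AAS instance
])

resp = requests.post(WEBHOOK_URL, data=body)
resp.raise_for_status()
# The webhook runs asynchronously: the HTTP call returns quickly and the
# Automation runbook pauses the instance afterwards.
print("Webhook accepted:", resp.status_code)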
Similarly, you will have an ADF pipeline that allows you to resume AAS instances. These pipelines also have no triggers associated with them, i.e. they are not scheduled to run by default. Each project is expected to implement schedules to run them according to its needs, keeping in mind that AAS is an expensive component and should remain paused when not in use. The pipeline looks like this:
URL – all non-prod URLs are the same; they call the same webhooks. For production, you will have a different URL. If you need it, reach out to the Landscape team to get that URL.
Method – should always be ‘POST’.
Body – if you only have one instance per environment then you don’t need to change the body element. It will automatically identify your instance for the relevant environment based on your parameters file and will resume the instance. If you have multiple AAS instances in your environment, you need to specify which instance to resume; the syntax is slightly different as you have to specify the instance name.
As an example, Body for a project that has 2 instances of AAS per environment would look like:
@concat(pipeline().DataFactory,',',pipeline().RunId,',','overrides.
aastwo.d.80181.json')
Every time a new project environment is created in I&A-managed subscriptions with an instance of Azure Synapse, a pipeline is also created by Landscape in ADF that allows the SQL DW instance to be paused. Similar to the examples above, this pipeline makes use of web hooks and comes without a default trigger. Based on their requirements, projects can trigger this pipeline to pause the SQL DW instance.
The configuration is limited to the same 3 elements: URL, Method and Body. If you have only one instance per environment, you don’t need to make any changes to the above representation of the pipeline. If your environments have multiple instances of DW (rarely the case), you can make use of the 5th parameter in the Body element as in the examples above. Please note that the URLs for non-prod and prod environments are different; projects can reach out to Landscape to get these URLs if needed.
Similar to the above implementations, your ADF will have ready-made pipelines for resuming Azure Synapse instances. Here is a representative example of how it will look.
Regular auditing is in place to identify instances that are left running over weekends and outside business hours.
We are moving from a single-account, single-subscription model to a multi-account, multi-subscription model.
Design Criteria
Access Control
RBAC
Policies
Private Peering
Microsoft Peering
Public Peering (Deprecated)
EXPRESSROUTE Details:
ExpressRoute connections do not go over the public Internet, and offer more reliability, faster speeds, lower latencies and higher security than typical connections over the Internet.
An ExpressRoute circuit represents a logical connection between on-premises infrastructure and Microsoft cloud services through a connectivity provider.
Multiple ExpressRoute circuits can be set up. Each circuit can be in the same or different regions, and can be connected to on-premises networks through different connectivity providers.
ExpressRoute circuits are uniquely identified by a GUID called the service key. The service key is the only piece of information exchanged between the on-premises network, the provider and the Microsoft network.
Each circuit has a fixed bandwidth (50 Mbps, 100 Mbps, 200 Mbps, 500 Mbps, 1 Gbps, 10 Gbps) and is mapped to a connectivity provider and a peering location. Bandwidth can be dynamically scaled without tearing down the network.
The billing model can be picked by the customer: unlimited data, metered data, or the ExpressRoute premium add-on.
ExpressRoute circuits can include two independent peerings: private peering and Microsoft peering. Old ExpressRoute circuits had three peerings: Azure Public, Azure Private and Microsoft. Each peering is a pair of independent BGP sessions, each of them configured redundantly for high availability.
Azure compute services, namely virtual machines (IaaS) and cloud services (PaaS), that are deployed within a virtual network can be connected through the private peering domain.
Private peering is considered to be a trusted extension of the core network into Microsoft Azure.
This establishes bi-directional connectivity between the core network and Azure virtual networks (VNets).
Private peering is achieved between the two virtual networks, i.e. the on-premises private network and the Azure IaaS private network.
Connections to PaaS services which support VNet hosting can be routed through the ExpressRoute circuit.
Connectivity to Microsoft online services (Office 365, Dynamics 365, and Azure PaaS services) occurs through Microsoft peering.
Microsoft peering allows access to Microsoft cloud services only over public IP addresses that are owned by the customer or the connectivity provider.
A route filter needs to be defined to allow connections for a particular region and service (similar configuration to Allow Azure Services). For example, if a route filter is enabled for North Europe, all the IPs of North Europe will be whitelisted.
Connections always originate from the on-premises network and not from the Microsoft network.
Requirements:
/30 public subnet for primary link
/30 public subnet for secondary link
/30 Advertised subnet
ASN
Azure compute services, namely virtual machines (IaaS) and cloud services (PaaS), that are deployed within a virtual network can be connected through the private peering domain.
Any services hosted within a private network can make use of the ExpressRoute circuit for connection to the Unilever network.
Data ingestion from the Unilever Data Center to Azure can be routed through ExpressRoute using private peering, by hosting the Azure IaaS IR within the Azure network and enabling the firewall port to the on-premises network.
End-user connections cannot be made through ExpressRoute unless public or Microsoft peering is enabled.
The I&A platform was set up in the foundation design created in North Europe, i.e. Dublin. With the creation of the new foundation design, I&A has started hosting all new products in the new foundation design, but the existing projects are still to be migrated to it.
The old foundation design is set up in North Europe (Dublin). Three subscriptions are hosted in the old foundation design – Prod, Prod 2 and Prod 3 – and all three are common subscriptions shared with all IT platforms.
Prod is the initial subscription created, which was shared with all platforms. When the subscription limits were reached, the new subscriptions Prod 2 and Prod 3 were created.
I&A Tech is not the owner of any of these subscriptions, as they are shared with multiple platforms.
I&A is dependent on EC to create resource groups or assign permissions, which can be done only with Owner permission.
New subscriptions are created whenever the limit for a subscription is reached.
Networking components (VNet, subnet, IaaS) sit as part of each subscription and are shared between the platforms.
Migration to the new foundation design is migration to a better-managed platform. The new foundation design is available in both the Dublin and Amsterdam regions.
It brings network benefits and fixes security concerns compared with the old “Dublin” design.
The old Dublin design has a single VNet, subnet and subscription architecture, with no segregation between platforms or at the networking level.
A networking design with no segregation between prod and non-prod is discouraged by InfoSec.
Data protection mechanisms are not granular in the old Dublin design since it is a single VNet.
Hub-and-spoke model: each platform has its own subscription and controls its own environment.
Faster provisioning of components, as the I&A technology team can create resource groups and IaaS components – full control without dependency on the EC team; faster delivery.
Subscription limit issues are removed/reduced by segregating subscriptions per platform. Fewer Microsoft issues.
In order to avoid the issues faced in the old foundation design, I&A TDA came up with a new design hosting multiple subscriptions to segregate the UDL, BDL, PDS and Experiment environments.
Hard Limit
250 storage accounts per subscription.
RBAC limit
Number of network calls
Soft Limits
Core limit
Resource limit
A static number of subscriptions cannot be kept; new subscriptions are required as and when the limits are reached.
As per Microsoft, any subscription should ideally have no more than 40 concurrent Databricks clusters, to avoid throttling issues due to the networking calls made.
One of the best practices is to monitor the limits and keep a threshold; when it is reached, new subscriptions are to be created (see the sketch after this list).
Keep threshold alerts at 60% of the hard limit.
Stop provisioning new resources when the subscription reaches 80% of the limit.
The remaining 20% will be used for scaling the existing solutions in the subscription.
Any new applications should be moved to the newly created subscription.
Soft-limit alerts are to be configured at 80%, in order to increase the limit well in time.
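A small sketch of the threshold rule above, applied to the 250-storage-accounts hard limit: alert at 60%, stop provisioning at 80% and keep the remaining 20% for scaling existing solutions.

def subscription_action(current_storage_accounts, hard_limit=250):
    usage = current_storage_accounts / hard_limit
    if usage >= 0.80:
        # The last 20% is reserved for scaling existing solutions only.
        return "STOP provisioning here - place new applications in a new subscription"
    if usage >= 0.60:
        return "ALERT - plan and request a new subscription"
    return "OK"

for count in (120, 160, 210):
    print(count, "storage accounts ->", subscription_action(count))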
UDL and BDLs will be hosted in Dublin. Existing products in the old foundation design will be migrated to the new foundation design in Dublin.
UDL SUBSCRIPTIONS:
BDL SUBSCRIPTIONS:
PDS SUBSCRIPTIONS:
EXPERIMENT SUBSCRIPTION:
This document provides guidance on product migration from the old foundation to the new foundation using the Azure move-resources feature.
This migration method has been tested and works with the standard I&A stack shown below.
As part of the OLD foundation, each product has been deployed using 3 RGs per environment:
APP-RG
DATA-RG
STG-RG
As part of the NEW foundation, each product MUST be deployed using 1 RG per environment. For details on the new foundation design please refer to Section 6 - New Foundation Design - Azure Foundation 2018.
In addition to this, the I&A Landscape and product teams also need to carry out prep-work. This prep-work enables and readies your product for migration to the new foundation.
Product teams SHOULD use the following migration steps as a skeleton plan and draft a detailed plan to complete migration to the new foundation.
Decommission OPDG VM (EC)
Decommission IR VM (EC)
Once the tool is approved for usage within the Azure landscape, it will be moved to the Approved Tools section with design patterns published.
Azure Data Share Preview enables organizations to simply and securely share data with multiple customers and partners. In just a few clicks, you can provision a new Data Share account, add datasets, and invite customers and partners to your data share. Data providers are always in control of the data that they have shared. Azure Data Share makes it simple to manage and monitor what data was shared, when and by whom.
Key Capabilities
· Share data from Azure Storage and Azure Data Lake Store with customers and partners
· Control how frequently your data consumers receive updates to your data
· Allow your customers to pull the latest version of your data as needed, or allow them to automatically receive
incremental changes to your data at an interval defined by you
· Trigger a full or incremental snapshot of a Data Share that an organization has shared with you
· Subscribe to a Data Share to receive the latest copy of the data through incremental snapshot copy
· Accept data shared with you into an Azure Blob Storage or Azure Data Lake Gen2 account
How it Works?
Azure Data Share currently offers snapshot-based sharing and in-place sharing.
In snapshot-based sharing, data moves from the data provider's Azure subscription and lands in the data
consumer's Azure subscription. As a data provider, you provision a data share and invite recipients to the data
share. Data consumers receive an invitation to your data share via e-mail. Once a data consumer accepts the
invitation, they can trigger a full snapshot of the data shared with them. This data is received into the data
consumers storage account. Data consumers can receive regular, incremental updates to the data shared with
them so that they always have the latest version of the data.
Data providers can offer their data consumers incremental updates to the data shared with them through a
snapshot schedule. Snapshot schedules are offered on an hourly or a daily basis. When a data consumer accepts
and configures their data share, they can subscribe to a snapshot schedule. This is beneficial in scenarios where
the shared data is updated on a regular basis, and the data consumer needs the most up-to-date data.
When a data consumer accepts a data share, they are able to receive the data in a data store of their choice. For
example, if the data provider shares data using Azure Blob Storage, the data consumer can receive this data in
Azure Data Lake Store. Similarly, if the data provider shares data from an Azure SQL Data Warehouse, the data
consumer can choose whether they want to receive the data into an Azure Data Lake Store, an Azure SQL
Database or an Azure SQL Data Warehouse. In the case of sharing from SQL-based sources, the data consumer
can also choose whether they receive data in parquet or csv.
With in-place sharing, data providers can share data where it resides without copying it. After the sharing relationship is established through the invitation flow, a symbolic link is created between the data provider's source data store and the data consumer's target data store. The data consumer can read and query the data in real time using their own data store. Changes to the source data store are available to the data consumer immediately. In-place sharing is currently in preview for Azure Data Explorer.
Security
Azure Data Share leverages the underlying security that Azure offers to protect data at rest and in transit. Data is
encrypted at rest, where supported by the underlying data store. Data is also encrypted in transit. Metadata about a
data share is also encrypted at rest and in transit.
Access controls can be set on the Azure Data Share resource level to ensure it is accessed by those that are
authorized.
Azure Data Share leverages Managed Identities for Azure Resources (previously known as MSIs) for automatic
identity management in Azure Active Directory. Managed identities for Azure Resources are leveraged for access to
the data stores that are being used for data sharing. There is no exchange of credentials between a data provider
and a data consumer. For more information, refer to the Managed Identities for Azure Resources page.
Pricing Details
A dataset is the specific data that is to be shared. A dataset can only include resources from one Azure data store.
For example, a dataset can be an Azure Data Lake Storage (“ADLS”) Gen2 file system, an ADLS Gen2 folder, an
ADLS Gen2 file, a blob container, a blob folder, a blob, a SQL table, or a SQL view, etc.
Dataset Snapshot is the operation to move a dataset from its source to a destination.
Snapshot Execution includes the underlying resources to execute movement of a dataset from its source to a
destination.
You may incur network data transfer charges depending where your source and destination are located. Network
prices do not include a preview discount. Refer to the Bandwidth pricing details page for more details.
Currently, the data provider is billed for Dataset Snapshot and Snapshot Execution.
Cloning enables branching of a database. It copies metadata at a point in time so the clone appears like the original from a user perspective. Users only pay for data storage for deviations from the original. This is ideal for testing or if a static copy is required for auditing purposes.
The live sharing feature enables an account to share data with another Snowflake account. The sharer pays for storage; the recipient pays for any compute they carry out. Sharing across regions will be implemented by replicating data across regions and across cloud providers.
All account types include time-travel in order to “undelete” for 1 day (Standard) or 90 days (Enterprise and above).
Concurrency: very good compared to other database solutions available, along with the scalability of the product.
Cost: as the cost is per second, this can turn out to be a cost-optimized solution for large data sets.
Agility: Snowflake requires none of the maintenance associated with other data warehouses; data is loaded and queried without indexing, partitioning, etc. This can enable a more agile working environment.
Evaluation criteria: Unilever engaged with Snowflake to carry out testing of Snowflake’s Cloud Data Warehouse.
Technical problem with the existing solution: currently I&A uses SQL DW and AAS as the source for end-user reporting, which is turning out to be an expensive solution. The main evaluation here is to see whether Snowflake can replace both SQL DW and AAS combined as the compute for end-user reporting.
AAS has a limitation of 400 GB of data per instance (S9 instance), which costs a lot.
Unilever has a business requirement for pre-built reports and self-service from end users using PBI.
Pre-built reports are not granular, hence 400 GB of cache is good enough.
Self-service needs to be done on granular data, with the expectation of good performance. Holding this huge data in AAS is turning out to be very expensive and is not a feasible solution due to the 400 GB limit.
Unilever also looked into the option of using SQL DW for self-service, but there are two issues with it:
SQL DW requires a minimum configuration of 1000 DWU to get good enough performance, but 1000 DWU costs too much when kept up and running 24/7.
SQL DW doesn’t support enough concurrency. Currently only 32 concurrent queries are supported on 1000 DWU Gen2.
With the introduction/evaluation of Snowflake, the idea was to replace both AAS and SQL DW with Snowflake if Snowflake can give similar performance to AAS.
There is no limit on concurrency in Snowflake.
The cost of Snowflake is lower as it is per-second billing and comes with automated scale-up and scale-down (automated pause and resume).
Performance is the main criterion evaluated through this POC.
Data from Azure can be integrated with a Snowflake database using Blob storage and Snowpipe. ADF cannot be used as the orchestration tool for Snowflake, but Snowflake comes with its own scheduling capability.
PBI has a connector to Snowflake, but currently Snowflake requires an OPDG hosted on an IaaS VM to make the connection possible. A non-gateway solution is being built and will be ready very soon.
Note: Microsoft is working on a non-gateway solution to onboard the new Snowflake ODBC driver into Power BI Service.
POC configurations:
Analytics: 278,015,760 rows (example: SELECT SUM(col1) FROM table GROUP BY col2)
The differences between these two warehouses are qualitative and are caused by their design choices:
SQL DW emphasizes flexibility and maturity in working with AAS, Power BI and other Microsoft Azure tools.
Conclusion:
Snowflake is approximately twice as performant as Azure SQL DW for interactive and analytics workloads.
SCENARIO 1 (EXTRACTION OF 1 MILLION ROWS FROM COMPUTE) USING THE NEW ODBC DRIVER:
Workload: 1,000,000 rows loaded into the Power BI table component in the dashboard.
No calculations/aggregations or optimizations were done in the Power BI dashboard.
Conclusion:
Based on the above results it is evident that Power BI performs slightly better with SQL DW as a backend compared to Snowflake, but AAS performs better when the number of concurrent connections is higher.
SCENARIO 2 (CALCULATION AND AGGREGATION ON THE COMPUTE LAYER AND RESULTS EXTRACTED INTO PBI) USING THE NEW ODBC DRIVER
This dashboard involves a join between two tables (278 million records and 1.3 million records).
Materialized views were created in Snowflake on the above tables to derive the data for the dashboard.
All the aggregations were done at the Snowflake end; ~50 records were returned to Power BI.
Conclusion:
Snowflake performance is good when all the processing is pushed down to the underlying compute. SQL DW and AAS give similar results to Snowflake.
Cost comparison:
Cost comparison based on the workload used for the pilot (Interactive & Analytics).
Costing calculation:
Snowflake compute is charged based on the cluster up-time, which can be set while creating the warehouse.
Snowflake storage is charged based on compressed data at $23/TB/month, vs Azure at $110.8/TB/month.
Snowflake is designed to scale up, but more importantly to scale down and suspend instantly and automatically.
[The above calculation is based on the Snowflake cost details provided at the site below.
Reference: https://www.snowflake.com/blog/how-usage-based-pricing-delivers-a-budget-friendly-cloud-data-warehouse/ ]
SQL DW: DW1000c/Gen2 - indicative cost: £10.83/hour; DW2000c/Gen2 - indicative cost: £22.94/hour.
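The figures above can be illustrated with a short sketch; the Snowflake compute rate used below is a hypothetical placeholder (Snowflake's actual credit pricing is not quoted here), while the storage rates and the DW1000c hourly cost come from the pilot notes above.

SNOWFLAKE_STORAGE_PER_TB_MONTH = 23.0     # USD, compressed (pilot notes)
AZURE_SQL_STORAGE_PER_TB_MONTH = 110.8    # USD (pilot notes)
SQLDW_DW1000C_PER_HOUR_GBP = 10.83        # indicative (pilot notes)
SNOWFLAKE_PER_HOUR_EQUIVALENT = 12.0      # hypothetical placeholder rate

def storage_cost(tb, rate_per_tb_month):
    return round(tb * rate_per_tb_month, 2)

def snowflake_compute_cost(active_minutes):
    # Billed per second of warehouse up-time; auto-suspends when idle.
    return round((active_minutes / 60) * SNOWFLAKE_PER_HOUR_EQUIVALENT, 2)

def sqldw_compute_cost(hours_left_running):
    # Accrues cost for as long as the instance stays up, until it is paused.
    return round(hours_left_running * SQLDW_DW1000C_PER_HOUR_GBP, 2)

print("10 TB storage - Snowflake:", storage_cost(10, SNOWFLAKE_STORAGE_PER_TB_MONTH))
print("10 TB storage - SQL DW   :", storage_cost(10, AZURE_SQL_STORAGE_PER_TB_MONTH))
# A 10-minute refresh where SQL DW is left running for the rest of the hour:
print("Compute - Snowflake (10 min active):", snowflake_compute_cost(10))
print("Compute - SQL DW (left up 1 hour)  :", sqldw_compute_cost(1.0))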
Conclusion:
Snowflake is comparatively still cheaper than SQL DW, due to its auto scale-down and auto-suspend capabilities.
Snowflake storage cost is very cheap compared to SQL storage cost as it uses ADLS as storage.
This product has very good credentials regarding performance, concurrency and simplicity. At this point in time (November 2019), Power BI performs slightly better with AAS as a backend compared to Snowflake.
Snowflake is comparatively a lot cheaper than SQL DW for short runs, as charging is per second and only for the compute used. If a job runs continuously for an hour, then SQL DW is 50% cheaper than Snowflake.
AAS cost is very similar to Snowflake's, but AAS has a 400 GB limitation at its highest configuration tier and concurrency limitations, which is not the case with Snowflake.
Snowflake storage cost is very cheap compared to SQL storage cost as it uses ADLS as storage.
Snowflake provides faster queries on top of large data with unlimited concurrency. This could be a decision point, as most self-service reporting is limited at the AAS and SQL DW layer.
Microsoft is working on a non-gateway solution to onboard the new Snowflake ODBC driver into Power BI Service, which is expected to be available in December 2019. Unilever may carry out another set of performance testing once the non-gateway solution is available.
Unilever needs to conduct a similar analysis for a production workload/live application before deciding on moving to Snowflake.
Synapse Analytics is a fully managed analytics service equipped with end-to-end tools, from ingestion to BI reporting, in a single workspace.
ADF – Ingestion
ADLS – Storage
Tools Planned
Power BI
Azure Analysis Services
Azure ML
Benefits to Unilever
Easy management: consolidation of components per project under one workspace. The E2E analytics tools used in Unilever today, such as ADF, ADLS, Spark and SQL DW, will be hosted within a single workspace.
Integrated DevOps environment: an integrated development environment with web development tools. DevOps teams can access and build modules using the web development environment instead of using specific tools such as VSTS, SSMS, etc.
Easy deployment: one place to manage all resources and deploy them from one workspace to another.
Secure environment: greater security and access controls at the workspace level. There is no need to store or share the credentials of each component with users. Data and compute lie in one place, managed through access controls applied at the workspace level. No download and installation of DevOps tools is required.
Centralized monitoring & alerting: a central space to manage and monitor all the jobs for a workspace.
Additional features: on-demand SQL on top of ADLS and on-demand Spark on ADLS. Users can run any query (Spark or SQL) on ADLS.
Workspace
Perform all activities for analytics solution
Secure and manage lifecycle
Pay only for what you need
Data analytics at-scale
Relational and big data processing with Spark and SQL analytics
Serverless and provisioned compute
The Synapse Analytics vision is strong and will be very beneficial for Unilever, but in its current state the product is not completely mature. The only generally available component as of April 2020 is the SQL DW Gen2/Synapse database; many of the features are still in private preview.
Unilever is one of the customers that conducted a review of the product and provided valuable feedback on improvements to it. There are a lot of features missing in the tool, and Unilever is working with Microsoft to review the missing features.
Features are not mature: not all features available in ADF and Spark are present in the Synapse Analytics preview. In its current form, a lot of work is required to have a complete integrated DevOps environment; tools such as AAS, Power BI and ML tools are missing.
Easy migration: one-click migration from the current process to the new integrated platform is missing, which is needed to adopt the tool for existing environments.
Security roles and controls: security controls and roles are not mature enough in Arcadia. The expectation is that connections between components are managed internally and that no credentials need to be shared with users. Though this is the vision, it is not currently available.
Azure DevOps integration: DevOps integration is planned soon but is currently not available.
Note: only the SQL DW database (Synapse database) is approved for usage in Synapse Analytics. Once all the features are reviewed and security-approved, the Architecture team will make the product available in the I&A Landscape.