
SF – Best Practices for Product Build

Prepared by Solution Factory Architecture CoE Team

D&A Solution Factory

08/07/2024

Document Version Control
Version Change Description Author Date

1.0 Final Best practices for Solution Factory products Divya Shah/ Bhavana/Manoj 17-08-2022

1.1 Added new checks Divya/Bhavana 17-11-2022

1.2 Reorganized with documents Divya 09-12-2022

1.3 Included sample data flow (attachment) for AAS Incremental refresh Bhavana 09-01-2023

1.4 Added PBI visuals best practices Suchintya 08-03-2023

1.5 Included details for AAD REST API based solution for extraction of user list from AD group Bhavana 04-04-2023
1.6 Convert data frame to excel format Bhavana 05-04-2023

1.7 Included section to refresh PBI from ADF Manoj 23-05-2023

1.8 Included steps / pre-requisites for PBI Premium onboarding Bhavana 26-05-2023

1.9 Unity Catalog naming conventions included Bhavana 01-03-2024

2.0 Folder structure details included for Unity Catalog Paras 13-03-2024

2.1 Included slide for points of contact Bhavana 20-05-2024

Document List to be followed by a Project Deliverable Reference

Each phase below lists the deliverable documents, their owner, and their location ("Path").

Design – Solution Architect / UX Designer
• SF - Technical Architecture – ProjectName.pptx (Path) – reference guide
• D&A – Branding and UX Guidelines (Path)

Design – Data Team
• SF-Data Model – ProjectName.xlsx (Path) – data modeling standards and guidelines
• Design App FS for PDS (Design App)
• Wireframe to FS Mapping v0.1.xlsx (Path)

Design – Development Team
• Detailed_Dataflow Design_Project name.xlsx (Path) – reference guide; ADF details with SQL and ADB objects
• DMR document: Data_Ingestion_DMR_Template.xlsx (Path)
• SF-Code Build and Review Checklist.xlsx (Path) – standard guidelines to be considered before initiating code build for ETL and PBI

Build & Review – Development Team
• SF-Code deployment guidelines.pptx (Path) – standard guidelines for code deployment
• SF-Code Build and Review Checklist.xlsx (Path) – standard guidelines to be filled in with responses which the SF team will validate
• ApplicationSchedule_project name.xlsx (Path) – schedule of your pipeline; JMF Handbook guide for the scheduler
• PDS-High level design document-Project Name.docx (Path) – end-to-end project document which will mainly contain the links for the individual documents

Test – Development Team
• Performance Analyzer for PBI (Path)
• Performance analyzer for AAS model – BPA rules: https://tabulareditor.com/; explanation on how to use the tool: https://www.elegantbi.com/
Design Best Practices
Area Guidelines

Architecture 1. Refer Solution Architecture guidelines to identify the pattern to be leveraged for the solution Path
Architecture 1. Data sources are clearly defined and links to Azure cloud solution are defined
2. Data volumes are clear with total data retention in PDS with long term view
3. Number of concurrent users
4. What is the data extraction data volume?
5. Threshold limitation defined for architecture like 50 GB cube for x markets, 1 hour for ADB to PBI dataset refresh
6. Manual files/ tactical sources to be approved by Data SMEs
1. Any source not in UDL to be approved by Data SME
2. Manual files to be ingested using FMT to UDL / Techdebt
7. End to end PDS flow to be defined with all Azure services for ETL
8. Semantic layer (cubes) defined for different markets in the same flow
9. Understand self service requirements with respect to NFR
10. Web app represented if required with firewall
11. No Product to Product sourcing
12. MLOPS sources and lineage defined
13. Define if delta loads to be performed from sources
14. Security mechanism for Cubes Semantic layer defined
15. Security for sourcing/storing PII data defined
16. JMF configs for the product defined
17. Right configurations for ADB, SQL, AAS, Webapp. ADB-LTS version
18. Refresh SLA’s to be considered while defining solution
Data Model 1. Star schema with Facts and Dimensions only with Many to one relationship
2. Global data model defined if there are multiple markets
3. Facts do not have descriptive attributes
4. Granularity of Facts is kept to the minimum, e.g. ProductSKU, Brand
5. Monthly/Daily facts are different
6. All dimensions and Facts are supposed to be connected via Big int Surrogate key named as <Col>SK
7. Master Calendar dimension to be consumed from BDL
8. Master Product global hierarchy to be consumed from BDL
9. Every table should have LastModifiedDate attribute
WF Mapping 1. All slicers are connected to dimensions
2. All visuals have KPI defined in cube and X/Y axis connected to dimensions
3. KPI calculation will not be part of review
4. KPI link to the visuals should be reviewed by Business Analyst
Development Best Practices
Area Guidelines

ADLS 1. Folder structure is maintained as per guidelines (refer slide 11)


2. No manual files ingested in PDS without exceptional approval
3. ADLS Gen2 is used
ADF 1. Single pipeline flow as per recommendation for all objects, which is config driven (refer slide 13/14)
2. Parameters configured in an ADLS CSV file
3. Webhooks for PBI dataset/AAS configured
4. One job execution from JMF
Azure 1. Right cluster configuration for Prod and Non-Prod as mentioned on the ADB slides
Databricks 2. Only job clusters to be leveraged
3. Right Flow of activities within notebook(refer slide 15)
4. No hard coding of paths and data. Should come from config file in ADF or parametrized on data model
5. No Unwanted test code that will consume processing time - like count check, select statements etc
6. The objects should be developed with incremental load and NOT TRUNCATE AND LOAD strategy (refer slide 12)
7. All the objects should have a Primary Key Surrogate Key which is a hash key in numeric data type (applicable to objects where natural keys are not numeric).
E.g., for a hash key: xxhash64(concat_ws('||', col1, col2))
8. Date object should have YYYYMMDD converted to integer as the key.
9. All the objects should have a RowSK Surrogate Key which is a hash key over the non-key attributes. Audit columns like RowSK and Primary Key SK should
identify the delta loads and update the LastModifiedDate audit column (refer to the Incremental Loading Strategy slide and the PySpark sketch after this list)
10. All the objects should have Last modified Date audit column
11. All Joins within Fact/dimensions should be on Surrogate Keys
12. All dimensions should have an Unknown record in all Dimensions which will be leveraged for Late arriving facts
13. DQ validations – Null and Duplicate check to be performed
14. Clear cached data (if used) from the notebook, like df, as soon as its job is done: df.unpersist() to clean the df cache and
spark.catalog.uncacheTable(tableName) (only if using Spark cache, although it is not recommended to use Spark cache as delta has auto cache by default)
15. Use Databricks VACUUM to remove historical versions of data that are not required​
16. DO NOT write any data to DBFS root. Even though DBFS root is writeable, store data in mounted object storage (ADLS) rather than DBFS root
17. Avoid large number of small files​in ADLS
18. Avoid using IN /NOT IN clause
19. Avoid Cross join() at any cost
20. Make Sure Correct ADB LTS is leveraged in clusters
21. Avoid use of Spark UDF and native python code.
22. Use partitioning in delta table to reduce shuffle.
23. All clusters should be tagged with a custom tag. Please follow guidelines on slide named ‘Azure Databricks Cluster – Track marketwise utilization/costs’
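The following is a minimal PySpark sketch of guidelines 7 and 9 above. The dataframe, column names (nat_key1, nat_key2, attr1, attr2) and sample values are hypothetical; only the xxhash64/concat_ws pattern comes from the guideline itself.

# Minimal sketch (illustrative only): numeric primary-key SK, RowSK over non-key attributes,
# and a LastModifiedDate audit column for delta detection.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: natural keys (nat_key1, nat_key2) and non-key attributes (attr1, attr2)
df = spark.createDataFrame(
    [("P1", "G1", 100, 120), ("P2", "G2", 150, 200)],
    ["nat_key1", "nat_key2", "attr1", "attr2"],
)

df_with_keys = (
    df
    # Primary-key surrogate: xxhash64 over the natural keys separated by '||'
    .withColumn("KeySK", F.xxhash64(F.concat_ws("||", "nat_key1", "nat_key2")))
    # RowSK: hash over the non-key attributes, used to detect changed rows during delta loads
    .withColumn("RowSK", F.xxhash64(F.concat_ws("||",
                                                F.col("attr1").cast("string"),
                                                F.col("attr2").cast("string"))))
    # Audit column stamped whenever a row is inserted or modified
    .withColumn("LastModifiedDate", F.current_timestamp())
)
df_with_keys.show()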

Development Best Practices
Area Guidelines
Azure SQL 1. Right SQL configuration for Prod and Non Prod
2. No hard coding of paths and data. Should come from config file in ADF or parametrized on data model
3. No Unwanted test code that will consume processing time - like count check, select statements etc
4. The objects should be developed with incremental load and NOT TRUNCATE AND LOAD strategy
5. All Joins within Fact/dimensions should be on Surrogate Keys
6. Views created for reference in semantic layer with only filters and no calculations
PBI Dataset 1. All Facts need to be partitioned on the Date column via the Power Query filter (and heavy dimensions) by configuring native incremental
refresh
2. LastModifiedDate attribute will be used to detect which partition is changing in a cube
3. Add all measures at dataset level, not in the individual reports.
4. Add parameters for source connection to change connection dynamically
5. VPAX is analyzed in Tabular Editor Best practice analyzer BestPracticeAnalyzerRules.json
6. VPAX is analyzed in the PBI COE Performance Analyzer tool
7. Review the lineage of a PBI Dataset
8. Custom Partitions to be created when the data is above 20M per partition
9. Only relevant SK’s are imported on Facts and NO CODE attributes
10. No RowSK and PKSK columns used for delta detection within ADB are imported on Cube
11. All common KPI’s needs to be defined in Cube and NOT in individual PBI report
12. Security implemented on datasets
13. User friendly names applied on dataset
14. Encoding hint for all numeric fields should be set as VALUE and character fields should be set as HASH
15. All fact measures should be DECIMAL data-type (not DOUBLE or FLOAT)
16. Always use the 32-bit integer data type instead of 64-bit for whole-number fields if the min/max value fits within the 32-bit range
17. Must avoid any BI-DIRECTIONAL and MANY-TO-MANY relationships between DIM and FACTS
18. Minimize Power Query transformations. Ensure query folding for native queries
19. Dataset setting as “Large dataset” in PBI for datasets above 10 GB
20. Disable Auto date time
21. There should not be any datetime column. If you need time, it should be a separate column
22. PBI/AAS Time Intelligence Calculation – Time Intelligence calculations for small and medium datasets should be implemented through the built-in
'Calculation Items' capability.
23. Use SQL Profiler traces to assess data loads, load throughput, cost of additional fields and parallelism of partition loading
24. Use the PBI tool from Phil Seamark to analyse the results from SQL Profiler traces and assess the optimal parallelism for data loads
https://dax.tips/2021/02/15/visualise-your-power-bi-refresh/

Development Best Practices
Area Guidelines
AAS Cube 1. All Facts needs to be partitioned on the Date column on the Power Query filter (and heavy dimensions)
2. Only relevant SK’s are imported on Facts and NO CODE attributes
3. No RowSK and PKSK columns used for delta detection imported on Cube
4. All common KPI’s needs to be defined in Cube and NOT in individual PBI report
5. A PartitionConfig file with details of the Period to be considered within each partition
6. The PartitionConfig and Fact is joined within PowerQuery to filter each partition on the respective period. This will enable the AAS Fact to dynamically
partition data
7. The LastModifiedDate attribute will be used to detect which partition is changing in a cube. The RefreshFlag in the PartitionConfig file will be dynamically
changed based on LastModifiedDate
8. The webhooks to refresh AAS cube will receive the TMSL script with only partition names filtered with RefreshFlag as 1
9. When refreshing from CSV files, each partition of the AAS tables should be connected to an individual CSV file/folder. Refer to the AAS refresh slide
10.Security implemented on datasets
11. Refresh from CSV files – Each partition of the AAS tables should be connected to individual CSV files directly by using the M-Query method below to speed up
the data load process. Refer to the example below for one sample fact partition table; similar M-Query transformations should be created for the remaining
partitions pointing to their respective CSV files
=======================================================
Folder level assignment
let
    Source = AzureStorage.DataLake("https://dbstorageda16d901632adls.dfs.core.windows.net/unilever/Fact_1/Jan_CY/"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Hidden Files1", {"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
    #"Invoke Custom Function1" = Table.AddColumn(#"Removed Columns", "Transform File", each #"Transform File"([Content])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Invoke Custom Function1", {"Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1", {{"Period", Int64.Type}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"Changed Type"
================================================================

Development Best Practices
Area Guidelines

AAS Cube ================================================================


File level assignment
let
    Source = AzureStorage.DataLake("https://dbstorageda16d901632adls.dfs.core.windows.net/unilever/SP_Test/Jan_CY.csv"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Hidden Files1", {"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
    #"Invoke Custom Function1" = Table.AddColumn(#"Removed Columns", "Transform File", each #"Transform File"([Content])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Invoke Custom Function1", {"Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1", {{"Period", Int64.Type}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"Changed Type"
================================================================

11. Encoding hint for all numeric fields should be set as VALUE and character fields should be set as HASH
12. All fact measures should be DECIMAL data-type (not DOUBLE or FLOAT)
13. Always use the 32-bit integer data type instead of 64-bit for whole-number fields if the min/max value fits within the 32-bit range
14. Must avoid any BI-DIRECTIONAL and MANY-TO-MANY relationships between DIM and FACTS
15. Minimize Power Query transformations
16. PBI/AAS Time Intelligence Calculation – Time Intelligence calculations for small and medium datasets should be implemented through the built-in
'Calculation Items' capability; this feature is available in PBI datasets as well as in AAS cubes. For large datasets this should be considered on a case-by-case
basis based on the performance impact. Refer to the sample DAX examples below to define Calculation Items for Time Intelligence –
> MTD = CALCULATE ( SELECTEDMEASURE (), DATESMTD ( 'Calendar'[Date] ) )
> QTD = CALCULATE ( SELECTEDMEASURE (), DATESQTD ( 'Calendar'[Date] ) )
> YTD = CALCULATE ( SELECTEDMEASURE (), DATESYTD ( 'Calendar'[Date] ) )
> PY = CALCULATE ( SELECTEDMEASURE (), SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )
> YOY = VAR CY = SELECTEDMEASURE ()
VAR PY = CALCULATE ( SELECTEDMEASURE (), SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )
RETURN CY - PY

Development Best Practices
Area Guidelines

PBI Report 1. Power BI best practices from the Power BI CoE:


https://app.powerbi.com/groups/me/reports/9aa0742d-b669-421e-a37d-a41fe79395cd/ReportSection7f43af771aa953456cb8?experience=power-bi&bookmarkGuid=Bookmark9d861270a32bd1f3d333
2. Advanced performance analyzer results are captured for each PBI report to identify time-taking pages, validating with the PBI CoE for any fixes
3. Visual filters applied only if needed
4. Instead of multiple card visuals, use a matrix/table with a transparent border to show the measures together; it renders faster because it is a single visual for
Power BI
5. No custom visuals to be deployed
6. Reduce usage of bookmarks to the minimum. Use Field Parameters instead
7. No KPI is calculated on PBI report. Need justification for any KPI defined
8. Dax best practices leveraged
1. Do not use CONTAINSSTRING; rather use a variable array, e.g. VAR _FilterMonthPeriods = TREATAS ( {"MTH10-2012", "MTH10-2013", "MTH10-
2014", xxxxxx, "MTH9-2022"}, MarketSales[Period])
Web app 1. Use of KeyVault to hold secrets, credentials. DO NOT USE DefaultAzureCredential for authentication during development from local machine /
VM
2. Secured data in transit across the layers
3. Any open source coding used should be latest version
4. Latest version of .NET framework used
5. Is the Penetration test completed for the project by Infosec team?
6. Implementation of HTTP Strict Transport Security (HSTS) for critical interfaces
7. Ensure no proprietary algorithms used without proper validation
8. Source code must be properly commented on to increase readability
9. Ensure logging mechanism implemented for critical functionality
10. User and Role-based privileges must be verified thoroughly
11. Authentication: Web App should use AAD authentication
12. Authorization: Custom coding is required to implement role-based authorization for a specific Web API call.
Data model (scope for Data modeler)
1. Validate the data lineage captured in DMR at a high level
2. Validate the output of each object – quick checks like duplicates, data types, business transformation rules applied as per PDM.
3. Databricks code review – participate in code review along with the SA internally to validate any complex ETL logic implementations/best practices applied.
4. DAX code review – participate in DAX code review along with the SA internally to validate that logic implementations are done correctly as per the wireframe
mappings for important KPIs.

Development Best Practices
Area Guidelines

Web app 1. Use of KeyVault to hold secrets, credentials. DO NOT USE DefaultAzureCredential for authentication during development from local machine /
VM
2. Secured data in transit across the layers
3. Any open-source code used should be the latest version, e.g. the latest node.js version as of Apr 2023 is <>
4. Latest version of .NET framework used
5. Is the Penetration test completed for the project by Infosec team?
6. Implementation of HTTP Strict Transport Security (HSTS) for critical interfaces
7. Ensure no proprietary algorithms used without proper validation
8. Source code must be properly commented on to increase readability
9. Ensure logging mechanism implemented for critical functionality
10. User and Role-based privileges must be verified thoroughly

Follow the standard guidelines before initiating code build for ETL. Refer to SF-Code Build and Review Checklist.xlsx (Path)

Development Best Practices for Data science / ML Product
Area Guidelines

Infrastructure
1. Change the Databricks environment from premium to standard.
2. Use separate clusters for development and model runs; development clusters which are supposed to run long hours must be small, with a 4-core worker
and driver.
3. Use compute-optimized, delta-accelerated machines for worker and driver (d#ds_v4 and d#ds_v5 series only)
4. In the cluster config use minimum worker = 1 and termination time = 30 min max.
5. Worker config should not be more than d8ds_v4/v5. In case of more compute requirement, set up auto-scaling and keep min worker = 1.
6. Use Databricks runtime version 10.4 LTS or 11.3 LTS
7. DO NOT write any data to DBFS root. Even though DBFS root is writeable, store data in mounted object storage (ADLS) rather than DBFS root
Code Best practices
1. Use Spark-native code while doing data wrangling; if you need to use pandas, use it with pandas_udf()
2. Avoid running the model on a calculated data frame; first write the calculated data frame as a table and then create a fresh data frame on the delta table to be
consumed by the model.
3. Use multiprocessing and parallelism while running the model to reduce the model run time and utilize the cluster at full capacity.
4. Avoid use of UDFs; they come with a very high cost in PySpark. They operate one row at a time and thus suffer from high serialization and invocation
overhead. In other words, they make the data move between the executor JVM and the Python interpreter, resulting in a significant serialization cost.
Furthermore, after calling a Python UDF, Spark will forget how the data was distributed before. For this reason, usage of UDFs in PySpark inevitably reduces
performance compared to UDF implementations in Java or Scala.
5. If you still have to use UDFs, consider using pandas UDFs, which are built on top of Apache Arrow. They promise the ability to define low-overhead, high-
performance UDFs entirely in Python and are supported since version 2.3.
6. Try wrapping your native Python code in pandas_udf() (see the sketch after this list)
7. While using 3rd-party libraries or writing native code, make sure garbage collection (GC) is handled properly; otherwise it can cause out-of-memory issues.
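A small illustrative sketch of points 5–6 above; the price column and uplift factor are hypothetical, and only the Arrow-based pandas_udf pattern itself is the point being shown.

# Illustrative pandas UDF (Arrow-based) wrapping vectorised pandas logic
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 25.0)], ["id", "price"])

@pandas_udf(DoubleType())
def add_vat(price: pd.Series) -> pd.Series:
    # Vectorised pandas logic, executed once per Arrow batch instead of once per row
    return price * 1.2

df.withColumn("price_with_vat", add_vat(F.col("price"))).show()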

Data Wrangling
1. Avoid a large number of small files in ADLS
2. Use Databricks optimize and auto optimize to consolidate small files
3. Use Databricks VACUUM to remove historical versions of data that are not required
4. Avoid the NOT IN clause
5. Clear cached data from the notebook: df.unpersist() to clean the df cache and spark.catalog.uncacheTable(tableName) (only if using Spark cache,
although it is not recommended to use Spark cache as delta has auto cache by default)
6. Use the Ganglia monitor to see cluster usage.
7. Use delta format while reading and writing tables in Databricks.
8. Don't do transformations on a data frame created on CSV files; first create a table from this data frame and then create a new data frame on the delta
table for transformation purposes.
9. Sample code is available in: ML best practices Sample Notebook.dbc

Development Best Practices for Synapse Serverless Pool
Area Guidelines

Code Best practices
• Make sure the storage and the serverless SQL pool are in the same region
• Colocate a client application with the Azure Synapse workspace. Placing a client application and the Azure Synapse workspace in different regions could
cause bigger latency and slower streaming of results.
• Set TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true) for all the delta tables
• Run VACUUM on delta tables
• Optimize the storage layout by using partitioning and keeping the file sizes of the table in the range between 100 MB and 10 GB.
• Don't stress the storage account with other workloads during query execution.
• Convert large CSV and JSON files to Parquet. Serverless SQL pool skips the columns and rows that aren't needed in a query if you're reading Parquet files.
• It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
• Schema should not be inferred. For example, Parquet files don't contain metadata about maximum character column length, so serverless SQL pool infers it
as varchar(8000).
• Use appropriate data types:
  • Use the smallest data size that can accommodate the largest possible value.
  • If the maximum character value length is 30 characters, use a character data type of length 30.
  • If all character column values are of a fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
  • If the maximum integer column value is 500, use smallint because it's the smallest data type that can accommodate this value.
• Use integer-based data types if possible. SORT, JOIN, and GROUP BY operations complete faster on integers than on character data

Product Ingestion/Transformation Design

Azure Data Lake Folder Structure
Follow the below mentioned folder structure for ADLS within the PDS layer.

Landing Layer
• Unilever/<<ProductName>> **/LandingLayer/FinConnect (TransactionSource) */<<Table1>>/yyyy=2021/mm=01/dd=01
• Unilever/<<ProductName>> **/LandingLayer/GMRDR (AnalyticalSource)/<<Table1>> (e.g., ProductCategory)/<<partition name>> (if needed)
• Unilever/<<ProductName>>/ManualFiles */File<>
• Unilever/<<ProductName>>/ArchiveManualFiles/File<>

Staging Layer
• Unilever/<<ProductName>> **/StagingLayer/FinConnect (TransactionSource)/<<Table1>>/yyyy=2021/mm=01/dd=01
• Unilever/<<ProductName>> **/StagingLayer/GMRDR (AnalyticalSource)/<<Table1>> (e.g., ProductCategory)/<<partition name>> (if needed)

Transform Layer
• Unilever/<<ProductName>>/TransformLayer/Dimensions/<<Dimension1>>, <<Dimension2>>
• Unilever/<<ProductName>>/TransformLayer/Facts/<<Fact1>>/yyyy=2021/mm=01/dd=01

ArchiveData
• Unilever/<<ProductName>> **/LandingLayer/ArchiveData/<<Table1>>/yyyymmdd

ErrorLogs
• Unilever/<<ProductName>>/ErrorLogs/<ObjectName1>/<<ObjectName1_Duplicates_YYYYMMDD>>, <<ObjectName1_NULLs_YYYYMMDD>>
  (example: ErrorLogs/Product/Product_Duplicates_20220307, Product_NULLs_20220307)

Semantic Layer (csv format for AAS)
• Unilever/<<ProductName>>/SemanticLayer/Dimensions/<<Dimension1>>
• Unilever/<<ProductName>>/SemanticLayer/Facts/<<Fact1>>/yyyy=2021/mm=01/dd=01
Azure Data Lake Folder Structure – Multiple Markets

Azure Data Lake Folder Structure – Unity Catalog Enabled
Follow the below mentioned folder structure for ADLS within the PDS layer. Use volumes for flat files.

Staging Layer
• Unilever/<<ProductName>> **/StagingLayer/FinConnect (TransactionSource) */<<Table1>>/yyyy=2021/mm=01/dd=01
• Unilever/<<ProductName>> **/StagingLayer/GMRDR (AnalyticalSource)/<<Table1>> (e.g., ProductCategory)/<<partition name>> (if needed)

Transform Layer
• Unilever/<<ProductName>>/TransformLayer/Dimensions/<<Dimension1>>, <<Dimension2>>
• Unilever/<<ProductName>>/TransformLayer/Facts/<<Fact1>>/yyyy=2021/mm=01/dd=01

ArchiveData
• Unilever/<<ProductName>> **/LandingLayer/ArchiveData/<<Table1>>/yyyymmdd

ErrorLogs
• Unilever/<<ProductName>>/ErrorLogs/<ObjectName1>/<<ObjectName1_Duplicates_YYYYMMDD>>, <<ObjectName1_NULLs_YYYYMMDD>>
  (example: ErrorLogs/Product/Product_Duplicates_20220307, Product_NULLs_20220307)

Semantic Layer (csv format for AAS)
• Unilever/<<ProductName>>/SemanticLayer/Dimensions/<<Dimension1>>
• Unilever/<<ProductName>>/SemanticLayer/Facts/<<Fact1>>/yyyy=2021/mm=01/dd=01
Sample 3 part naming convention with Unity Catalog
Name | Type | Path
pds_p4gsky_903681_dev | Catalog | https://dbstorageda12d903681adl2.blob.core.windows.net/unilever
landing_uk | Schema | ./SKY/Market=UKI/LandingLayer/
vol_landing_uk | Volume | ./SKY/Market=UKI/LandingLayer/Files
landed_nielsen | Table | ./SKY/Market=UKI/LandingLayer/Tables/Nielsen
landed_kantar | Table | ./SKY/Market=UKI/LandingLayer/Tables/Kantar
staging_uk | Schema | ./SKY/Market=UKI/StagingLayer/
staged_nielsen | Table | ./SKY/Market=UKI/StagingLayer/Nielsen
transform_uk | Schema | ./SKY/Market=UKI/TransformLayer/
dim_calendar | Table | ./SKY/Market=UKI/TransformLayer/Calendar
dim_product | Table | ./SKY/Market=UKI/TransformLayer/Product
fact_pos | Table | ./SKY/Market=UKI/TransformLayer/PointOfSales

(The catalog, schema and table form the three parts of the name, e.g. pds_p4gsky_903681_dev.transform_uk.dim_calendar.)
Unity Catalog – Naming convention for PDS
Catalog Name : pds_<productname>_<itsg>_<env>
Example, pds_gmi_902772_dev

Schema Name : landing / staging / transform


For Global products having multiple markets, suffix _<marketname>
Example, transform_mx

Table Name : <objectname>


Example, dim_calendar, fact_sales

View Name : vw_<objectname>


Example, vw_dim_calendar, vw_fact_sales
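As an illustration of the conventions above, a hedged sketch of the corresponding Databricks SQL run from a notebook in a Unity Catalog enabled workspace; the object names reuse the examples above and are not prescriptive.

# Hypothetical example only: applying the PDS naming convention in Unity Catalog
spark.sql("CREATE CATALOG IF NOT EXISTS pds_gmi_902772_dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS pds_gmi_902772_dev.transform_mx")
spark.sql("""
    CREATE TABLE IF NOT EXISTS pds_gmi_902772_dev.transform_mx.dim_calendar (
        CalendarSK BIGINT,
        CalendarDate DATE,
        LastModifiedDate TIMESTAMP
    )
""")
spark.sql("""
    CREATE VIEW IF NOT EXISTS pds_gmi_902772_dev.transform_mx.vw_dim_calendar AS
    SELECT * FROM pds_gmi_902772_dev.transform_mx.dim_calendar
""")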

Incremental Loading Strategy
All objects in PDS should be loaded incrementally. Audit columns like ObjectRowSK and Primary Key SK should identify the delta loads and update the Last modified Date audit
column.

Scenario 1 – Upsert
1) Compare tables A & B on KeySK and RowSK; if matching records are found for KeySK but not for RowSK, then UPDATE the matching records from B into A (DELETE > INSERT)
2) Compare tables A & B on KeySK and INSERT unmatched records from B into A

Scenario 2 – Upsert/Delete
1) Compare tables A & B on KeySK and RowSK; if matching records are found for KeySK but not for RowSK, then UPDATE the matching records from B into A (DELETE > INSERT)
2) Compare tables A & B on KeySK and INSERT unmatched records from B into A
3) Compare tables A & B on KeySK and DELETE unmatched records from A

Example – Scenario 1: Incremental process for incremental data (Upsert only)

Existing Processed Data (A)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2 | Mod_Date
44562P1G1 | 44562P1G1100120 | 01-01-2022 | P1 | G1 | 100 | 120 | 30-03-2022
44563P1G1 | 44563P1G1150200 | 02-01-2022 | P1 | G1 | 150 | 200 | 30-03-2022
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200 | 29-03-2022
44564P3G3 | 44564P3G3120100 | 03-01-2022 | P3 | G3 | 120 | 100 | 29-03-2022

New Data from Source (B)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2
44562P1G1 | 44562P1G1200300 | 01-01-2022 | P1 | G1 | 200 | 300
44563P1G1 | 44563P1G1220250 | 02-01-2022 | P1 | G1 | 220 | 250
44564P1G1 | 44564P1G1300350 | 03-01-2022 | P1 | G1 | 300 | 350
44565P4G2 | 44565P4G2225234 | 04-01-2022 | P4 | G2 | 225 | 234

Updated Processed Data (A)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2 | Mod_Date | Flag
44562P1G1 | 44562P1G1200300 | 01-01-2022 | P1 | G1 | 200 | 300 | 01-04-2022 | Modify
44563P1G1 | 44563P1G1220250 | 02-01-2022 | P1 | G1 | 220 | 250 | 01-04-2022 | Modify
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200 | 30-03-2022 |
44564P3G3 | 44564P3G3120100 | 03-01-2022 | P3 | G3 | 120 | 100 | 29-03-2022 |
44564P1G1 | 44564P1G1300350 | 03-01-2022 | P1 | G1 | 300 | 350 | 01-04-2022 | Add
44565P4G2 | 44565P4G2225234 | 04-01-2022 | P4 | G2 | 225 | 234 | 01-04-2022 | Add

Example – Scenario 2: Incremental process for partial full data (Upsert & Delete)

Existing Processed Data (A)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2 | Mod_Date
44562P1G1 | 44562P1G1100120 | 01-01-2022 | P1 | G1 | 100 | 120 | 30-03-2022
44563P1G1 | 44563P1G1150200 | 02-01-2022 | P1 | G1 | 150 | 200 | 30-03-2022
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200 | 29-03-2022
44564P3G3 | 44564P3G3120100 | 03-01-2022 | P3 | G3 | 120 | 100 | 29-03-2022

New Data from Source (B)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2
44562P1G1 | 44562P1G1200300 | 01-01-2022 | P1 | G1 | 200 | 300
44563P1G1 | 44563P1G1220250 | 02-01-2022 | P1 | G1 | 220 | 250
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200
44565P4G2 | 44565P4G2225234 | 04-01-2022 | P4 | G2 | 225 | 234

Updated Processed Data (A)
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2 | Mod_Date | Flag
44562P1G1 | 44562P1G1200300 | 01-01-2022 | P1 | G1 | 200 | 300 | 01-04-2022 | Modify
44563P1G1 | 44563P1G1220250 | 02-01-2022 | P1 | G1 | 220 | 250 | 01-04-2022 | Modify
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200 | 30-03-2022 | Retain
44564P3G3 | 44564P3G3120100 | 03-01-2022 | P3 | G3 | 120 | 100 | 29-03-2022 | Delete
44565P4G2 | 44565P4G2225234 | 04-01-2022 | P4 | G2 | 225 | 234 | 01-04-2022 | Add
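A minimal Delta Lake MERGE sketch of the strategy above (Scenario 2; drop the delete clause for Scenario 1). The table names, the KPI columns shown, and the notebook context (spark available, a recent Databricks runtime for whenNotMatchedBySourceDelete) are assumptions, not the standard SF notebook itself.

# Hedged sketch of the KeySK/RowSK-driven incremental load
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "transform.fact_sales")   # existing processed data (A) - hypothetical name
source_df = spark.table("staging.fact_sales")                # new data from source (B) - hypothetical name

(
    target.alias("A")
    .merge(source_df.alias("B"), "A.KeySK = B.KeySK")
    # KeySK matches but RowSK differs -> the row changed: update it and stamp Mod_Date
    .whenMatchedUpdate(
        condition="A.RowSK <> B.RowSK",
        set={
            "RowSK": "B.RowSK",
            "KPI_1": "B.KPI_1",        # only a few columns shown; include all non-key attributes
            "KPI_2": "B.KPI_2",
            "Mod_Date": "current_date()",
        },
    )
    # KeySK not present in A -> new row: insert it
    .whenNotMatchedInsert(
        values={
            "KeySK": "B.KeySK",
            "RowSK": "B.RowSK",
            "KPI_1": "B.KPI_1",
            "KPI_2": "B.KPI_2",
            "Mod_Date": "current_date()",
        }
    )
    # Scenario 2 only: KeySK present in A but missing from B -> delete it
    # (requires a recent Delta/DBR version; otherwise run a separate anti-join delete)
    .whenNotMatchedBySourceDelete()
    .execute()
)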

ADF Pipeline Design (1/2)
[Pipeline diagram: parameterised pipeline P1 (PL_<APPLICATION_NAME>_PROCESS_<SRC>_DIM_DATA) wrapped by JMF logging activities J1/J2/J3, with activities 1-6 described in the legend below. Activity 3 runs when IngestionFlag = 1 and activity 4b when TransformationFlag = 1; duplicates and NULLs are checked and logged to the data error logs.]

Legend :
P1 – Parameterized pipeline that will be scheduled to execute via JMF. PipelineName & ObjectName will be passed as a parameter from JMF
J1, J2, J3 – These are JMF activities and must be included for pipeline level logging into JMF framework (Refer JMF Table Entries)
Activity Details
1 - Read config values like tablename, source file path, destination file path, notebook name from config file. The config file needs to be stored in ADLS (Sample file, InputParameters attached below)
2 – Filter and load objects from InputParameters file using Filter activity
3 – Check value for IngestionFlag. If IngestionFlag=1, copy data from source into LandingLayer. If IngestionFlag = 0, move to transformation. Follow folder structure defined in subsequent slides
4 – 4a – Check value for TransformationFlag. If TransformationFlag = 0, exit the notebook. If TransformationFlag =1, continue notebook activities.
Error logging : Check for duplicates and null values in surrogate key columns as a first step in databricks notebook.
If duplicates are found, take a distinct and continue processing. Log the duplicate values into error log. Folder structure defined in subsequent slides
If null values are found in key columns, log them into error log. Follow folder structure defined in subsequent slides
4b – Transform the data using logic provided in PDM and store data in transform layer where delta checks are carried on basis of RowSK and PrimarySK (refer Incremental Loading strategy slide)
5 – Archive source data within LandingLayer>>Archive folder. For daily refresh cycle, archive data within date folder, YYYYMMDD format. For monthly refresh cycle, folder name must be YYYYMM format
6 – Purge old data. Retain 7 days of archived data for daily refresh cycle. Retain 3 months of archived data for monthly refresh cycle.

Best Practices:
• Follow naming conventions defined in SAG
• Timeout should be set at activity level and not be default of 7 days
• Include description to pipeline activity
(Attachments: JMF Table Entries, InputParameters)
• Organize data factory components into folders for easy traversal

ADF Pipeline Design (2/2)
[Pipeline diagram: parameterised pipeline P1 (PL_<APPLICATION_NAME>_PROCESS_<SRC>_FACT_DATA) wrapped by JMF logging activities J1/J2/J3, with activities 1-6 described in the legend below. Activity 3 runs when IngestionFlag = 1 and activity 4b when TransformationFlag = 1; duplicates and NULLs are checked and logged to the data error logs.]

Legend :
P1 – Parameterized pipeline that will be scheduled to execute via JMF. PipelineName & ObjectName will be passed as a parameter from JMF
J1, J2, J3 – These are JMF activities and must be included for pipeline level logging into JMF framework (Refer JMF Table Entries)

Activity Details
1 - Read config values like tablename, source file path, destination file path, notebook name from config file. The config file needs to be stored in ADLS (Sample file, InputParameters attached below)
2 – Filter and load objects from InputParameters file using Filter activity
3 – Check value for IngestionFlag. If IngestionFlag = 0, exit the notebook. If IngestionFlag =1, Copy data from source into LandingLayer for the required time period based on where condition. Follow folder structure defined in subsequent
slides
4 – 4a – Check value for TransformationFlag. If TransformationFlag = 0, exit the notebook. If TransformationFlag =1, continue notebook activities.
Error logging : Check for duplicates and null values in surrogate key columns as a first step in databricks notebook.
If duplicates are found, take a distinct and continue processing. Log the duplicate values into error log. Folder structure defined in subsequent slides
If null values are found in key columns, log them into error log. Follow folder structure defined in subsequent slides
4b – Transform the data using logic provided in PDM and store data in transform layer where delta checks are carried on basis of RowSK and PrimarySK and update the LastModifiedDate(refer Incremental Loading strategy slide)
5 – Archive source data within LandingLayer>>Archive folder. For daily refresh cycle, archive data within date folder, YYYYMMDD format. For monthly refresh cycle, folder name must be YYYYMM format
6 – Purge old data. Retain 7 days of archived data for daily refresh cycle. Retain 3 months of archived data for monthly refresh cycle.

Best Practices:
• Follow naming conventions defined in SAG
• Timeout should be set at activity level and not be default of 7 days
• Include description to pipeline activity
• Organize data factory components into folders for easy traversal
(Attachments: JMF Table Entries, InputParameters)
Generic Flow of a Transform Notebook
1. Import required libraries
2. Read various parameters from ADF config file
3. Create a dataframe for required columns for the object
4. Call generic function for DQ checks (these are prebuilt)
5. Check for nulls in PK columns. If nulls found, write into error logs folder defined in previous slide
6. Remove rows for which PKs are null
7. Check for duplicates in PK/ composite key. If found, write into error logs folder defined in architecture deck
8. Remove rows having duplicates and select only distinct values
9. Create dataframe with clean data and include hash columns
10. Incremental load

a. Delete rows available in target (PDS) but not in source

b. Capture delta records by comparing source and target (delta table) for insert and update

c. If change in records found, merge the final table data with the delta captured in 10b
11. Use VACUUM to remove historical versions of data
12. Sample notebooks can be provided
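A hedged skeleton of the flow above, written as a Databricks notebook cell. The widget names, column names, error-log paths and table names are placeholders; the generic DQ functions referenced in step 4 are assumed to exist elsewhere and are inlined here as plain checks.

# Skeleton only - not the standard SF notebook; adjust readers/paths/columns to your object.
from pyspark.sql import functions as F

# 2. Parameters passed from the ADF config (via notebook widgets)
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("target_table", "")
source_path = dbutils.widgets.get("source_path")
target_table = dbutils.widgets.get("target_table")

# 3. Dataframe with only the columns required for the object (assumes delta input; adjust otherwise)
df = spark.read.format("delta").load(source_path).select("nat_key1", "nat_key2", "attr1")

# 5-6. Null check on PK columns: log offending rows, then remove them
nulls = df.filter(F.col("nat_key1").isNull() | F.col("nat_key2").isNull())
nulls.write.mode("append").format("delta").save("/mnt/pds/ErrorLogs/Object1_NULLs")
df = df.dropna(subset=["nat_key1", "nat_key2"])

# 7-8. Duplicate check on the PK/composite key: log duplicates, keep distinct rows
dupes = df.groupBy("nat_key1", "nat_key2").count().filter("count > 1")
dupes.write.mode("append").format("delta").save("/mnt/pds/ErrorLogs/Object1_Duplicates")
df = df.dropDuplicates(["nat_key1", "nat_key2"])

# 9. Clean dataframe with hash (surrogate) columns and the audit column
df = (df
      .withColumn("KeySK", F.xxhash64(F.concat_ws("||", "nat_key1", "nat_key2")))
      .withColumn("RowSK", F.xxhash64(F.col("attr1").cast("string")))
      .withColumn("LastModifiedDate", F.current_timestamp()))

# 10. Incremental load via MERGE (see the Incremental Loading Strategy slide)
df.createOrReplaceTempView("src")
spark.sql(f"""
    MERGE INTO {target_table} AS tgt USING src
      ON tgt.KeySK = src.KeySK
    WHEN MATCHED AND tgt.RowSK <> src.RowSK THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# 11. Remove old history versions (default retention applies unless configured otherwise)
spark.sql(f"VACUUM {target_table}")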

Facts with Different Grain
If the requirement is to merge Facts with different grain, the following design should be followed, which will result in a single fact.

Source Data

Product
ProdSK | SKU | Category | Brand
PT1 | PT1 | C1 | B1
PT2 | PT2 | C1 | B1
PT3 | PT3 | C1 | B1
PT4 | PT4 | C1 | B2
PT5 | PT5 | C1 | B2

Fact
Prod | Value
PT1 | 10
PT2 | 32
PT3 | 323
B2 | 344
C1 | 232

Proposed Data Model Design

Product (the dimension carries rows at SKU, Brand and Category grain)
ProdSK | Category | Brand | SKU
C1 | C1 | |
B1 | C1 | B1 |
B2 | C1 | B2 |
PT1 | C1 | B1 | PT1
PT2 | C1 | B1 | PT2
PT3 | C1 | B1 | PT3
PT4 | C1 | B2 | PT4
PT5 | C1 | B2 | PT5

Fact (keyed to ProdSK at the grain at which the data arrives)
ProdSK | Value
PT1 | 10
PT2 | 32
PT3 | 323
B2 | 344
C1 | 232

Custom Rollup DAX example for Facts with different grain
(Attachments: Custom Rollup Logic.pbix, Microsoft Excel Macro-Enabled Worksheet)

Refresh Power BI dataset with no partition or default incremental-load partition

1. Create ADF pipeline using template in this link PBI_Refresh.zip


2. Get the required parameter values from the Power BI dataset and configure them in JMF:
WorkspaceOrGroupID, PBIDatasetID
3. Add your environment SPN to the Power BI workspace as a user.
4. Whitelist your SPN in the Azure tenant by getting it added to the Azure AD group sf-powerbi-spn (Microsoft Azure).
Reach out to the landscape or SF Architect leads to get it added to the AD group. A rough sketch of the underlying refresh call is shown below.
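For orientation only, the REST call that the pipeline's web activity ultimately makes looks roughly like the Python sketch below; the IDs and token are placeholders, and in the actual solution the call is issued from the ADF template above, not from Python.

# Rough sketch of the Power BI dataset-refresh call (placeholders throughout)
import requests

workspace_id = "<WorkspaceOrGroupID>"   # from the dataset settings
dataset_id = "<PBIDatasetID>"           # from the dataset settings
token = "<SPN access token for https://analysis.windows.net/powerbi/api>"

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes",
    headers={"Authorization": f"Bearer {token}"},
    json={"notifyOption": "NoNotification"},
)
resp.raise_for_status()   # 202 Accepted means the refresh request was queued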

AAD REST API based solution for extraction of user list from AD Group
Problem Statement
Read & store user details from Azure AD for required AD Group. This mapping is required to apply row level security on databricks table/s where role
based security by ADGroup cannot be directly applied (user level details are needed)

Initial solution provided


Execute powershell script that would loop through each ADGroup and read user details within.
This required technical approval to create a batch/automation account in each environment

Alternative suggested
AAD REST API based solution for extraction of user list from group. API call can be made using ADF web activity.

PoC to prove alternate approach


PoC successfully completed by using Graph based API.
Enhanced the code to read users >9k and nested ADGroups
Pre-requisites and pipeline setup are documented in the attached 'Pre-requisites and pipeline' document; a rough sketch of the Graph call is shown below.
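For orientation only, the Graph call (made via the ADF web activity in the actual solution) looks roughly like this; the token, group id and selected fields are placeholders, and paging via @odata.nextLink is what supports groups with more than 9k members.

# Rough sketch: list members of an AD group via Microsoft Graph, following pagination
import requests

token = "<AAD app access token for https://graph.microsoft.com>"   # placeholder
group_id = "<AD group object id>"                                    # placeholder

# For nested AD groups, the /transitiveMembers endpoint can be used instead of /members
url = (f"https://graph.microsoft.com/v1.0/groups/{group_id}/members"
       "?$select=displayName,userPrincipalName&$top=999")
headers = {"Authorization": f"Bearer {token}"}

members = []
while url:
    page = requests.get(url, headers=headers).json()
    members.extend(page.get("value", []))
    url = page.get("@odata.nextLink")      # present while more pages remain

print(f"{len(members)} members extracted")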

Convert data frame to excel format
Option 1 (pip install openpyxl – suggested)

# Step 1: install openpyxl (e.g. %pip install openpyxl in a Databricks notebook) and import the libraries
import pandas as pd
from shutil import copyfile

# Step 2: create the dataframe
df_marks = pd.DataFrame({'name': ['X', 'Y', 'Z', 'M'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})

# Step 3: define the temporary path and the final path to copy to
temp_file = '/tmp/mysheet_using_pip.xlsx'
final = '/dbfs/mnt/PowerBIRefresh/mysheet_using_pip.xlsx'

# Step 4: write the dataframe to Excel
df_marks.to_excel(temp_file, index=False)
print('DataFrame is written successfully to Excel File.')

# Step 5: copy the file to the mount
copyfile(temp_file, final)

Option 2 (install the Spark Excel library)

# Step 1: install the library for Excel files on the cluster as directed in the attached snippet
# Step 2: create a Spark dataframe
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)

# Step 3: write the dataframe to an Excel file using the com.crealytics.spark.excel format
df.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("/mnt/pbirefresh/PowerBIRefresh/mysheet_using_sparklib.xlsx")

Product Consumption Layer Design

Incremental Refresh - AAS
1. Define partitions on the AAS model on Facts/Dimensions based on the transaction date attribute, which can be at Year/Month grain with values such as Month, Month-1, Month-2
OR Year, Year-1, Year-2
2. Define a PartitionConfig file with details of the Period to be considered within each partition defined above. Have a RefreshFlag on each row per partition to identify if it
needs a refresh. (Attachment: AAS Partition Config Sample File)
3. Join the PartitionConfig and Fact in Power Query to filter each partition on the respective period. This will enable the AAS Fact to dynamically partition data
4. The LastModifiedDate attribute will be used to detect which partition is changing in a cube
5. Develop a notebook that will change the RefreshFlag to 1 for all the changed records based on LastModifiedDate (see the sketch at the end of this slide)
6. The webhook to refresh the AAS cube will receive the TMSL script with only the partition names filtered with RefreshFlag = 1
7. Refresh from CSV files – Each partition of the AAS tables should be connected to individual CSV files directly by using the M-Query method below to speed up the data load process.
Refer to the example below for one sample fact partition table; similar M-Query transformations should be created for the remaining partitions pointing to their respective CSV files
let
    Source = #"Folder/C:\Users\xxxx\xxxxx\Data",
    #"File Content" = Source{[#"Folder Path"="C:\Users\xxxx\xxxxx\Data\", Name="Fact-Partition-Jan-22.csv"]}[Content],
    #"Imported CSV" = Csv.Document(#"File Content",[Delimiter=",", Columns=102, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Month", Int64.Type}, {"Slno", Int64.Type}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"Changed Type"

(Sample flow from R&D)

Flow: Define the Partition config file with the partition, the period per partition and the Refresh flag → Join the Partition config and the AAS Fact in Power Query for each partition → Define a notebook to identify which partition needs a refresh and create a TMSL script → Pass the variable as a parameter to the webhook for AAS refresh
WebHook-SSASProcessCube: https://s2events.azure-automation.net/webhooks?token=C2ZOe%2fMXfGJuAkmLiyOk1Or5PsRG5Tn9sqTqEPdg%2bFM%3d
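A hedged PySpark sketch of step 5 above (the notebook that flips the RefreshFlag). The table names, the PeriodStart/PeriodEnd columns and keeping the config as a Delta table are assumptions; if the config is the attached CSV file, read and write it as CSV instead.

# Hedged sketch: flag AAS partitions whose fact rows changed since the last cube refresh
from delta.tables import DeltaTable
from pyspark.sql import functions as F

last_refresh = "2024-01-01 00:00:00"   # placeholder: timestamp of the last successful cube refresh

# Periods whose fact rows were inserted/updated after the last refresh (LastModifiedDate audit column)
changed_periods = (
    spark.table("transform.fact_sales")            # hypothetical fact name
    .filter(F.col("LastModifiedDate") > F.lit(last_refresh))
    .select("Period")
    .distinct()
)

# Partitions whose configured period range covers at least one changed period
config_df = spark.table("ctl.aas_partition_config")    # hypothetical config name
partitions_to_flag = (
    config_df.alias("c")
    .join(
        changed_periods.alias("p"),
        (F.col("p.Period") >= F.col("c.PeriodStart")) & (F.col("p.Period") <= F.col("c.PeriodEnd")),
    )
    .select("c.PartitionName")
    .distinct()
)

# Reset all flags, then raise the flag only where a refresh is needed; the TMSL script passed to
# the webhook is then built from the rows where RefreshFlag = 1
config_tbl = DeltaTable.forName(spark, "ctl.aas_partition_config")
config_tbl.update(set={"RefreshFlag": "0"})
config_tbl.alias("c").merge(
    partitions_to_flag.alias("f"), "c.PartitionName = f.PartitionName"
).whenMatchedUpdate(set={"RefreshFlag": "1"}).execute()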
Incremental Refresh- Power BI (native partitions)
1. Identify partition key column (on the date attribute) in FACT in the Power BI model which can be at Year/Month/Week grain
2. Specify the reserved, case-sensitive names RangeStart and RangeEnd date/time parameters to create expected number of partitions on the dataset
3. Specify the Incremental refresh Policy using Power BI desktop by supplying all the required parameters

- table name

- archiving the data for no. of years/months = data which needs to be cleared

- Incrementally refreshing the data for no. of years/months = data to be retained in the dataset (Incremental refresh happens only on those partitions )
4. Specify the name of the LastModifiedDate column to enable the detect data changes and refresh only impacted partitions
5. After applying all the above steps, the expected number of partitions will get created based on the months/years specification defined in PBI.
6. If a row's date/time is no longer within the refresh period (i.e. outside the refresh period specified for partition refresh), then no partition will be refreshed for it.
7. Ensure Query folding is validated to ensure the filter logic is included in the queries being executed against the data source.
8. However, other data sources may be unable to verify without tracing the queries. If Power BI Desktop is unable to confirm, a warning is shown in the Incremental refresh policy
configuration dialog. If you see such warning and want to verify the necessary query folding is occurring, use the Power Query Diagnostics feature or trace queries by using a tool
supported by the data source, like SQL Profiler.
9. If query folding is not occurring, verify the filter logic is included in the query being passed to the data source. If not, it's likely the query includes a transformation that prevents
folding - effectively defeating the purpose of incremental refresh.

(Attachment: PBI Query Folding & Incremental Refresh)

Flow: Define the RangeStart and RangeEnd parameters for the number of partitions to be defined on a Fact attribute → Set the refresh policy for the object for the range of period to be refreshed → Specify the Last modified Date column to detect delta and refresh only the impacted partitions → Pass the variable as a parameter to the webhook for PBI dataset refresh

Incremental Refresh- Power BI (custom partitions)
1. The ADB partitions will not have the month/year combination in the partition name. Instead, the partition name will be a generic name like P1, P2, etc. for the total number of
partitions needed (24 in case we are partitioning by month for 2 years)
2. The custom partitions made in PBI Dataset will also have the similar partition names. In effect, there will be a 1-1 map between the ADB partitions and PBI dataset. The advantages
of this approach are

• No PBI dataset management needed for new partition creation, dropping partitions

• The PBI dataset can be refreshed using the PBI refresh API through an ADF pipeline which will execute after fact loading

3. A partition manager table within ADB delta will have the actual year month and any other attribute combination which identifies the data being stored in that ADB partition. The
flag ToBeRefreshed identifies if the partition needs to be refreshed. See example below
Fact Name PartitionName Year Month Country ToBeRefreshed
SalesOrderToInvoice P1_VN 2020 04 VN 0
SalesOrderToInvoice P1_ID 2020 04 ID 0
SalesOrderToInvoice P24_VN 2022 03 VN 1

4. When new data arrives for the next month, the data for the oldest partition needs to be overwritten. So we order the partition master on the year and month columns in descending
order and replace the value of the oldest partition with the Apr 2023 data; we also set the refresh flag to 1 for this partition.
Fact Name PartitionName Year Month Country ToBeRefreshed
SalesOrderToInvoice P1_VN 2022 04 VN 1
SalesOrderToInvoice P1_ID 2020 04 ID 0
SalesOrderToInvoice P24_VN 2022 03 VN 0

5. Over time the partitions will be overwritten one at a time but the number of partitions will remain the same.
6. The ADF pipeline will use the PBI refresh APIs to refresh the custom partition based on the toberefreshed flag
7. Please use the attached Word document for more details. A rough sketch of the partition-level refresh call is shown below.
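For orientation only, refreshing just the flagged custom partitions through the Power BI refresh API looks roughly like the sketch below; the IDs, token and flagged-partition list are placeholders, and in the actual design the call is made from the ADF pipeline.

# Rough sketch: refresh only the custom partitions flagged ToBeRefreshed = 1
import requests

workspace_id = "<WorkspaceOrGroupID>"   # placeholder
dataset_id = "<PBIDatasetID>"           # placeholder
token = "<SPN access token>"            # placeholder

# Built from the partition manager table above, e.g. rows where ToBeRefreshed = 1
objects_to_refresh = [{"table": "SalesOrderToInvoice", "partition": "P24_VN"}]

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes",
    headers={"Authorization": f"Bearer {token}"},
    json={"type": "Full", "objects": objects_to_refresh},
)
resp.raise_for_status()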

Dynamic M-Query – Change source file path from config
Sample M-Query steps to change the source file path dynamically from the configuration for each table partition. Refer to the link for a sample AAS model (Link –
MQueryDynamicFile_FolderSelection.zip)
Partition P1 – Point the source to a specific file
let
Source = DataSource,
Container = Source{[Name ="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_File") and ([PartitionID] = "P1")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]) & List.Max(#"AASConfig_GetDataPath"[DateId]),
#"DataFile_GetContent" = Csv.Document(Container{[#"Name" = #"DataPath" & ".txt"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"DataFile_GetHeaders" = Table.PromoteHeaders(#"DataFile_GetContent", [PromoteAllScalars=true]),
#"DataFile_ChangedType" = Table.TransformColumnTypes(DataFile_GetHeaders,{{"DateId", type number}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"DataFile_ChangedType"
------------------------------------------------------------------------------------------------------------------------------------------------------
Partition P2 – Point source to a specific file
let
Source = DataSource,
Container = Source{[Name ="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_File") and ([PartitionID] = "P2")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]) & List.Max(#"AASConfig_GetDataPath"[DateId]),
#"DataFile_GetContent" = Csv.Document(Container{[#"Name" = #"DataPath" & ".txt"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"DataFile_GetHeaders" = Table.PromoteHeaders(#"DataFile_GetContent", [PromoteAllScalars=true]),
#"DataFile_ChangedType" = Table.TransformColumnTypes(DataFile_GetHeaders,{{"DateId", type number}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"DataFile_ChangedType"
Dynamic M-Query – Change source folder path from config
Sample M-Query steps to change the source folder path dynamically from the configuration for each table partition. Refer to the link for a sample AAS model (Link –
MQueryDynamicFile_FolderSelection.zip)

Partition P1 – Point the source to a specific folder to read all files for a specific partition

let
Source = DataSource,
Container = Source{[Name="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_Folder") and ([PartitionID] = "Partition")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]), //& List.Max(#"AASConfig_GetDataPath"[DateId]),
#"Filtered_DataPath" = Table.SelectRows(Container, each Text.StartsWith([Name], #"DataPath")),
#"Filtered_FileExtn" = Table.SelectRows(#"Filtered_DataPath", each ([Extension] = ".txt")),
#"Removed Columns" = Table.RemoveColumns(#"Filtered_FileExtn",{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
#"Filtered Hidden Files" = Table.SelectRows(#"Removed Columns", each [Attributes]?[Hidden]? <> true),
#"Invoke Custom Function" = Table.AddColumn(#"Filtered Hidden Files", "Transform File", each #"Transform File"([Content])),
#"Removed Other Columns" = Table.SelectColumns(#"Invoke Custom Function", {"Transform File"}),
#"Expanded Table Column" = Table.ExpandTableColumn(#"Removed Other Columns", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column",{{"DateId", Int64.Type}, {"Fld_1", Int64.Type}, {"Fld_2", Int64.Type}})
in
#"Changed Type"

Cube Security Design >> Workflow

There are two types of security requirement

1) Column Level Security


Flow: Run a PowerShell script to extract user email ids for the respective AD group (run it daily) → Create a table having the AD groups → Create a config table having the columns which are secured per AD group → Create a DAX calculated measure which will scan the table to mask the column where it is secured
DAX: var kpi_name = "KPI-1"
var access_flag = LOOKUPVALUE(vw_a_User[Result], vw_a_User[UserId], USERPRINCIPALNAME(), vw_a_User[KPI], kpi_name)
return SUM(a_Data[KPI_1]) * access_flag

2) Row Level Security

Flow: Create a config table having AD groups and roles with row filters → Add AD groups per role on the cube → Create a DAX calculation in cube security per table to look up against the config table

Add a new AD group when there is a new requirement

Reference :
https://docs.microsoft.com/en-us/analysis-services/tutorial-tabular-1200/supplemental-lesson-implement-dynamic-security-by-using-row-filters?view=asallproducts-allversions
Cube Security Design

Security on Specification
• Sample Input file
• Import the UserSecuritySpecConfig table into PDS
• Create one role for each ADGroupName and set the following properties:
  - Database permission – Read
  - Add the ADGroup to membership
  - Define a DAX query on the Specification dimension to look up AuthorisationGroupCode for the logged-in user on the UserSecuritySpecConfig [SpecAuthGroupCode] column
• Snapshots below show the difference between the full data and the RLS-applied view.

Security on Product
• Sample Input file (Security Model Tables)
• Create a table (ADGroupUserMapping) and extract all users by ADGroup
• Import the UserSecuritySalesConfig table into PDS and join it with ADGroupUserMapping on ADGroupName
• Create calculated columns for all the volume and value metrics to look up Product[SubDivision2Code] for the logged-in user on UserSecuritySalesConfig [SubDivision2Code] && Value/Volume accessible flag = 'Yes'. If the value is not null, display the measure, else default to 0
• Sample code snippet, ONLY for reference:
  ProfitValueSecuredLKP2 := (SUM(FctSales[ValueMetric]) * LOOKUPVALUE('DimUserSecurity'[Result], 'DimUserSecurity'[UserName], USERNAME(), 'DimUserSecurity'[SubDivisionID], 'DimProduct'[SubDivisionID]))
• Snapshots below display the actual value and the secured values
• NOTE: The actual measure should be hidden in the model and the calculated measure should be displayed. The snippet is an illustration to demonstrate the difference.

Important**: We will have to test the performance of user security with full data
Send Email Using Graph API
Problem Statement

Our current email sending process relies on service accounts with Office 365 licenses, which contradicts
company policy. These accounts are not meant for email use due to organizational and security constraints.

Solution
To resolve this issue, we require an alternative approach that involves utilizing Shared Mailboxes and
granting Microsoft Graph permissions to AzureAD App for sending emails from applications or
automations. This adjustment ensures alignment with company policy while bolstering security measures.

Please find the below attached document for detailed steps to send email using Graph API from ADF or
Logic APP

Process To Send Email Using GraphAPI.docx
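For orientation only, the underlying Graph call from an app registration with the Mail.Send application permission (ideally restricted to the shared mailbox via an application access policy) looks roughly like this; the mailbox address, recipients and token are placeholders, and the attached document remains the authoritative procedure.

# Rough sketch: send mail from a shared mailbox via Microsoft Graph (placeholders throughout)
import requests

token = "<AAD app access token for https://graph.microsoft.com>"
shared_mailbox = "noreply-product@contoso.com"

payload = {
    "message": {
        "subject": "Pipeline status",
        "body": {"contentType": "Text", "content": "Daily refresh completed."},
        "toRecipients": [{"emailAddress": {"address": "team@contoso.com"}}],
    },
    "saveToSentItems": False,
}

resp = requests.post(
    f"https://graph.microsoft.com/v1.0/users/{shared_mailbox}/sendMail",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()   # 202 Accepted on success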

Non Prod Environments

Azure Services Cost Optimization over Non-Prod environments
1. It is important that minimal configuration of services are leveraged over Non-Prod environments unless required for UAT testing
2. Dedicated SQL pool and Azure Analysis Services to be paused when not used on Dev/QA especially during Evenings IST. Automatic pausing should be scheduled. PPD
should be paused throughout and available only during UAT.
3. Databricks should not use Job cluster, rather fixed cluster should be used. Databricks should be using Standard All-purpose Compute DBU, rather than premium.
4. Virtual machines on Dev/QA/PPD should be chosen with very minimum configs. Park my cloud can be leveraged for automatic pausing
5. Only sample data should be carried to validate, rather than full data, on QA/Dev; e.g., if you have 100 products in your full data you can bring only 10 as a sample.
6. Choose small clusters for Databricks on non-prod environments

Service | Dev | QA | PPD
ADLS Gen2 | Basic | Basic | Basic
Data Factory | Basic | Basic | Basic
Databricks | Refer Slide 29 | Refer Slide 29 | Refer Slide 29
Dedicated SQL Pool | Compute Optimized Gen2, DWU 100 x 300 Hours (paused in IST evening) | Compute Optimized Gen2, DWU 400 x 300 Hours (paused in IST evening) | Compute Optimized Gen2, DWU 400 x 300 Hours (paused at all times except during UAT)
Azure Analysis Services | Standard S1 (Hours), 1 instance, 500 hours (paused in IST evening) | Standard S4 (Hours), 1 instance, 500 hours (paused in IST evening) | Standard S4 (Hours), 1 instance, 500 hours (paused in IST evening)
Virtual Machine | Minimum config (paused in IST evening) | Minimum config (paused in IST evening) | Minimum config (paused in IST evening)

Databricks best practices

ADB Load types
To allocate the right amount and type of cluster resource for a job, we need to understand how different
types of jobs demand different types of cluster resources.

Machine Learning - To train machine learning models it is usually required to cache all of the data in memory, so consider using memory-optimized VMs so that the cluster can take advantage of the RAM cache. To size the cluster, take a percentage of the data set, cache it, see how much memory it used, and extrapolate that to the rest of the data. The Tungsten serializer optimizes data in memory, which means you will need to test with your data to see the relative magnitude of compression (see the sketch below).
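As a rough illustration of that sizing approach, the sketch below caches a sample and extrapolates. The path and sample fraction are placeholders, and the cached size itself is read from the Storage tab of the Spark UI (or Ganglia) rather than from code.

# Rough sizing sketch: cache a sample, check its footprint, extrapolate to the full set.
# The path and fraction are placeholders.
sample_fraction = 0.10

df = spark.read.format("delta").load("/mnt/adls/<project>/<table>/")
sample_df = df.sample(fraction=sample_fraction, seed=42).cache()

# Materialise the cache, then read the in-memory size from the Spark UI Storage tab
sample_rows = sample_df.count()
total_rows = df.count()

print(f"Sampled {sample_rows} of {total_rows} rows "
      f"(~{sample_fraction:.0%}); multiply the cached size by "
      f"{total_rows / max(sample_rows, 1):.0f} to estimate full-cache memory.")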

Analytical load - In this case, data size and how fast the job needs to be are the leading indicators. Spark doesn't always require data to be loaded into memory in order to execute transformations, but you will at the very least need to see how large the task sizes are on shuffles and compare that to the task throughput you would like. To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network, or local I/O, and go from there. Consider using a general-purpose VM for these jobs.
Azure Databricks config recommendations (**delta accelerated nodes)

Memory Optimized – for ETL / analytical loads with heavy RAM use due to data movement (inserts, updates, deletes on big tables):
• Dev: Standard tier; auto-termination 30 min; workers min 1 / max 8; driver and workers 4-core VMs (based on job load, select cores from 4 to 72); approved VMs: Eds_v4, Eds_v5, Eads_v5
• QA: Standard tier; auto-termination 30 min; workers min 1 / max 8; driver and workers 4-core VMs; approved VMs: Eds_v4, Eds_v5, Eads_v5
• PPD: Premium tier / job cluster; auto-termination 30 min; workers min 1 / max 8; driver and workers sized on job load (4 to 72 cores); approved VMs: Eds_v4, Eds_v5, Eads_v5
• Prod: Premium tier / job cluster; small load – workers min 1 / max 10; heavy load – auto-termination < 15 min, workers min 9 (based on workload) / max 10; driver and workers sized on job load (4 to 72 cores)*; approved VMs: Eds_v4, Eds_v5

General Purpose – for analytical loads with high CPU due to regression and other compute:
• Dev: Standard tier; auto-termination 30 min; workers min 1 / max 8; driver and workers 4-core VMs; approved VMs: Dds_v5, Dads_v5
• QA: Premium tier; auto-termination 30 min; workers min 1 / max 8; driver and workers 4-core VMs; approved VMs: Dds_v5, Dads_v5
• PPD: Premium tier / job cluster; auto-termination 30 min; workers min 1 / max 8; driver and workers sized on job load (4 to 72 cores); approved VMs: Dds_v5, Dads_v5
• Prod: Premium tier / job cluster; small loads – workers min 1 / max 10; heavy loads – auto-termination < 15 min, workers min 9 (based on workload) / max 10; driver and workers sized on job load (4 to 72 cores)*; approved VMs: Dds_v5

* Refer next slide to identify the best cluster sizing after some development is completed
Azure Databricks Cluster – How to select?
1. Run the end-to-end pipeline with all notebooks in full load.
2. Open the Ganglia UI from the Metrics tab in the cluster logs.
3. Check the overall load of CPU and memory; one of these should be consumed at ~80%.
4. If both are under-used, try reducing your cluster config and run again.
5. For the worker config, check the load on each worker in the nodes section.

Example: if CPU usage is sufficient but memory is touching 100%, the recommendation is to optimize the code to clear cached data so ~30% of RAM can be freed. If that is not possible, increase the worker to the next size.
Databricks Best Practices
A shuffle occurs when we need to move data from one node to another in order to complete a
stage. Depending on the type of transformation you are doing you may cause a shuffle to occur.
This happens when all the executors require seeing all of the data in order to accurately perform
the action. If the Job requires a wide transformation, you can expect the job to execute slower
because all of the partitions need to be shuffled around in order to complete the job. Eg: Group
by, Distinct.

Databricks queries usually go through a very heavy shuffle operation due to the following:
 JOIN()
 DISTINCT()
 GROUPBY()
 ORDERBY()
And technically some actions like count() (very small shuffle )
Resolution
• Don't use wide tables – exclude all unwanted columns from your data.
• Filter records before a join.
• Avoid CROSS JOIN at any cost.
• Avoid deeply nested queries on big tables (views inside views).
• Increase the number of shuffle partitions, e.g. if your shuffle size is 250 GB and you want ~200 MB partitions for the join: (250 * 1024) / 200 = 1280 (see the sketch after this list)
spark.conf.set("spark.sql.shuffle.partitions", 1280)
• It’s good practice to write a few interim tables that will be used by several users or queries on a regular basis. If a dataframe is
created from several table joins and aggregations, then it might make sense to write the dataframe as a managed table
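A small PySpark sketch of the points above (column pruning, filtering before the join, sizing shuffle partitions, broadcasting a small dimension, persisting a reused intermediate). The table and column names are placeholders; only the 1280 figure comes from the worked example.

# Sketch of the shuffle-reduction guidance above; table/column names are placeholders.
from pyspark.sql import functions as F

# Size shuffle partitions for ~200 MB per partition on a ~250 GB shuffle: (250*1024)/200 = 1280
spark.conf.set("spark.sql.shuffle.partitions", 1280)

# Column pruning + early filtering: select only needed columns and filter before the join
sales = (
    spark.table("fct_sales")
         .select("product_id", "market_code", "net_value")
         .filter(F.col("market_code") == "UK")
)

# Broadcast the small dimension to avoid shuffling the large fact table
product = spark.table("dim_product").select("product_id", "brand_code")
result = sales.join(F.broadcast(product), "product_id")

# Persist a frequently reused intermediate result as a managed table
result.write.mode("overwrite").saveAsTable("interim.sales_by_brand")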

Azure Databricks Cluster – Track market-wise utilization/costs
For a multi-market rollout or multi-product solution, business will ask for the Azure cost incurred per market, so that markets pay only for their actual resource utilization. For Databricks, market-wise cost reporting is achieved through custom tags on clusters.

The two types of clusters and the guidelines to add custom tags in each case are as follows:

1) Interactive Cluster
• Clusters should not be shared between markets/products. Each market can have one or more clusters, but they should be used by that market/product alone.
• Add a custom tag onto all clusters: edit the cluster when it is not in use, add a custom tag named cost_tag (lowercase) with its value in uppercase (see Fig 1), and save the changes.
• It takes 24 hours for the custom tag to reflect in Azure Cost Management.

2) Job Cluster
• The Databricks linked service should include the custom tag: Linked Service > Parameters > New > add a parameter with Name = cost_tag and a blank default value (see Fig 2).
• Linked Service > Additional cluster settings > Cluster Custom tags > New > add a custom tag with Name = cost_tag and Value = @linkedService.cost_tag (see Fig 3). Save the changes.
• Data Factory should pass in the value for the parameter cost_tag.

An audit sketch for verifying the tag across clusters follows below.
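To verify that the tag is actually present on every cluster, a hedged audit sketch against the Databricks Clusters REST API is shown below. The workspace URL and token are placeholders, and the custom_tags field is assumed to be populated as returned by the clusters/list endpoint.

# Audit sketch: list clusters and flag any without the cost_tag custom tag.
# Workspace URL and token are placeholders; read the token from a secret scope in practice.
import requests

WORKSPACE_URL = "https://<databricks-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token-or-aad-token>"

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    tags = cluster.get("custom_tags", {}) or {}
    if "cost_tag" not in tags:
        print(f"Missing cost_tag: {cluster['cluster_name']} ({cluster['cluster_id']})")
    else:
        print(f"{cluster['cluster_name']}: cost_tag = {tags['cost_tag']}")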

Databricks Table Size calculations
To calculate the size of Databricks tables (active data only):
%scala
// Returns statistics (including sizeInBytes) for the data at the given path
spark.read.format("delta").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/").queryExecution.analyzed.stats
spark.read.format("parquet").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/").queryExecution.analyzed.stats
spark.read.format("csv").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/file.csv").queryExecution.analyzed.stats
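For Delta tables specifically, an alternative sketch is to read the size from DESCRIBE DETAIL, which reports the size of the current table version without scanning the files; the path below simply reuses the sample location from the snippet above.

# Alternative for Delta tables: DESCRIBE DETAIL returns numFiles and sizeInBytes for the current version.
detail = spark.sql(
    "DESCRIBE DETAIL delta.`dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/`"
).select("numFiles", "sizeInBytes").collect()[0]

print(f"Active files: {detail['numFiles']}, size: {detail['sizeInBytes'] / (1024**3):.2f} GB")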

To calculate the size of the ADLS folder (active data plus soft-deleted / not-yet-vacuumed files):
%py
def recursiveDirSize(path):
    # Recursively sum the size of every file under the given mount path
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            total += recursiveDirSize(file.path)
        else:
            total += file.size
    return total

recursiveDirSize("/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/")

Trainings

Trainings which can help

•Databricks cost optimization


•Databricks performance optimization
•Machine learning with Databricks
•Big data analytics with ADB
•New features, Tips and Tricks

Innovations
1. Is there a way to get a glossary-cum-documentation of Databricks code? >> Unity Catalog
2. Currently there is a timeout of 1 hour between the Databricks and PBI connection; can we get rid of it?
3. For big data analytics, we tried using ADB as DirectQuery to Power BI, which did not work very well. Is there an alternate solution for big data analytics using ADB and PBI?
4. We would like to build a health dashboard showing all the cluster configurations along with worker nodes in a single place for all instances within Unilever, with the cost of each cluster. Can we get some help? >> Overwatch
5. Can we have an optimizer guide for ADB notebooks which points out exactly where code needs optimization?
6. Can we have a data reconciliation tool over ADB?
7. Is there any Graph feature expected?

Do we need any support from Databricks?
8. We can have a monthly connect to review any open issues on projects.

Databricks spot instances
What is a SPOT INSTANCE

• A Spot Instance is spare compute capacity that you can access and utilize at a steeply discounted rate of up to 90 percent compared to pay-as-you-go or dedicated instances.

• Azure’s eviction policy makes Spot VMs well suited for Azure Databricks,
whose clusters are resilient to interruptions for a variety of data and AI use
cases, such as ingestion, ETL, stream processing, AI models, batch scoring
and more.

• Spot instances are recommended to be used for non-prod environment only

Landscape Report for Non-Spot ADB cost
Landscape publishes a report showing Non-Spot usage per product.

Analysis reveals that some Non-Spot VM usage can be attributed to:
1. Job clusters invoked from ADF pipelines not requesting Spot instances; by default, job clusters use fully priced VMs.
2. Azure not providing Spot instances even when they are requested.

The plan is to request Spot instances during both interactive and job cluster creation. If Spot instances are not available and Azure defaults to fully priced VMs, the report above will still show Non-Spot usage.

Enable Spot instances for "All Purpose Cluster"

Please note that the driver is always a full-priced VM. Only the workers can be Spot instances in the case of an All-Purpose Cluster.

Steps to enable SPOT Instances for a Job Cluster
STEP 1: Create a cluster pool with the required instance type as per the highlighted config:
• Idle instances: specify the value as 0. A value greater than 0 ensures that this number of instances stays idle until a job picks them up; once the job completes, the instances return to idle. This saves spin-up time, but idle instances are charged by the instance provider (minus DBU).
• Idle instance auto-termination: specify an appropriate interval. Once a job completes, those instances remain ready to be picked up by the next job until the interval ends.
• Instance type: select the appropriate instance type depending on the use case.
• On-demand/Spot: ensure "All Spot" is selected so that both drivers and workers are of type SPOT. (A scripted alternative is sketched below.)
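For teams that prefer to script the pool creation, a hedged Python sketch using the Databricks Instance Pools API is shown below. The field values mirror the UI settings described above; the workspace URL, token and node type are placeholders, and the azure_attributes/SPOT_AZURE settings should be validated against the current API documentation for your workspace version.

# Sketch: create an instance pool whose instances are requested as Azure Spot VMs.
# Mirrors the UI settings above; URL, token and node type are placeholders.
import requests

WORKSPACE_URL = "https://<databricks-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token-or-aad-token>"

pool_spec = {
    "instance_pool_name": "spot-job-pool",
    "node_type_id": "<approved-node-type>",         # pick from the approved VM list
    "min_idle_instances": 0,                         # 0 = no idle instances kept warm
    "idle_instance_autotermination_minutes": 10,     # how long finished instances stay reusable
    "azure_attributes": {
        "availability": "SPOT_AZURE",                # request Spot capacity for the pool
        "spot_bid_max_price": -1,                    # -1 = pay up to the on-demand price
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
resp.raise_for_status()
print("Created pool:", resp.json()["instance_pool_id"])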

Steps to enable SPOT Instances for a Job Cluster

STEP 2: Edit the ADF linked service to select the cluster type "Existing Instance Pool":
• Select the appropriate (existing) instance pool name from the drop-down.
• Select OK.

Steps to enable SPOT Instances for a Job Cluster
STEP 3: Run the ADF job that uses the Databricks notebook activity, and navigate to the ADB instance pool to verify instance status and attached clusters:
• While the ADF job is running: 2 instances are created and 2 are busy; a new job cluster is assigned and starting.
• After the ADF job completes: the 2 instances stay idle for the interval specified, and there are no active job clusters.

PBI Premium Capacity Onboarding
Request for PBI Premium Workspace Provisioning
1. Product architecture sign-off from the SFD team.
2. Request PPU licenses (1-2) for PBI datasets (large dataset); link mentioned in the Access request slide.
3. SF build team develops the PBI dataset on PPU.
4. Four workspaces to be created:
 PPU > Project name-Market – Dev, Project name-Market – QA
 Premium > Project name-Market – UAT, Project name-Market – Prod
5. Peer review by the Delivery team on the QA workspace:
 1) VPAX analysis using the Best Practice Analyzer feature of Tabular Editor
 2) VPAX analysis using the Performance Analyzer tool
 All feedback from the above tools should be implemented.
6. Review by the SF Design team:
 • VPAX analysis using the Best Practice Analyzer feature of Tabular Editor
 • VPAX analysis using the Performance Analyzer tool
 • Screenshots indicating implementation of Incremental Refresh
 • Screenshot of lineage details
 • Load statistics compliance met for incremental refresh (dataset size, time to load)
7. Based on compliance with the above points, the UAT PBI workspace will be onboarded to Premium.
8. Bug fixes will be carried out on the PPU workspaces – Dev and QA.

Legend (flow owners): SF Design Team, Product Owner, PBI CoE Team, PBI/SA/Build Team, Build Team

Steps to procure a Premium Workspace for new projects (1/2)
1. Clarify why a Premium workspace is needed.
Below are some of the reasons why a Premium workspace is provisioned: Premium capacity features – incremental refresh / data volume greater than 1 GB.

2. Verify the output from VertiPaq Analyzer.
• Obtain the .vpax file from the build team and connect it to Vertipac Analyzer v0.1.pbix.
• Share observations (if any) as below:
 - Call out the parameters or checks indicated in "Red" and ask the build team to justify or fix them.
 - PK/SK checks to be done, which currently are not part of the VPAX tool.
 - Check if "Incremental Refresh" is applied on the datasets; this is a mandatory check.
 - Validate lineage, which is not part of VPAX but needs to be checked for approval.
 - An updated checklist filled in by the development team to further assess whether the best practices are implemented.

Steps to procure a Premium Workspace for new projects (2/2)
3. The build team is to produce the reports below as an output of the Best Practice Analyzer feature of Tabular Editor.

4. Validate all the points mentioned above; once passed, contact the design team to onboard the project onto the SF Prod1 Premium capacity.

5. Prior to final sign-off, ask the build team for the Performance Analyzer tool results (dashboard performance checks) and ensure the SLAs for the dashboards are met.
 - Check there are no redundant KPIs.
Power BI dataset connectivity with Webapp
Reference documentation: Power BI REST API call (Execute Queries); ADLS connection with PBI datasets & Webapp (attached Word document)

Flow:
1. The web application captures the user's favourite selections.
2. The PBI cached dataset (semantic layer) caches the user selections.
3. The URL query filter is provided to the PBI paginated report.
4. The report is exported – the final report has the filters applied.

1) How to connect the web app and the PBI dataset?
a. One option is to use the REST API available at https://docs.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries
Sample request body (a Python call sketch follows the JSON):
{
"queries": [
{
"query": "EVALUATE
FILTER(SUMMARIZE('MarketSales', MarketSales[BrandCode],(MarketSales[ValueInEuroCurrency])), MarketSales[BrandCode] = \"BF0012\")"
}
],
"serializerSettings": {
"includeNulls": true
}
}
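The same ExecuteQueries call can be exercised from any client; a hedged Python sketch is shown below. The dataset ID is a placeholder, and the AAD token is assumed to be acquired by the web app's service principal for the Power BI scope.

# Sketch: run the DAX query above against a dataset via the Power BI ExecuteQueries REST API.
# Dataset ID and app registration details are placeholders.
import msal
import requests

app = msal.ConfidentialClientApplication(
    "<app-client-id>",
    authority="https://login.microsoftonline.com/<tenant-id>",
    client_credential="<app-client-secret>",
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)["access_token"]

dataset_id = "<dataset-id>"
body = {
    "queries": [{
        "query": "EVALUATE FILTER(SUMMARIZE('MarketSales', MarketSales[BrandCode], "
                 "(MarketSales[ValueInEuroCurrency])), MarketSales[BrandCode] = \"BF0012\")"
    }],
    "serializerSettings": {"includeNulls": True},
}

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()
print(resp.json()["results"][0]["tables"][0]["rows"][:5])   # first rows of the result table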
2) Incremental refresh for PBI dataset using ADLS as source - Parquet, ADLS Gen2, ETL, and Incremental Refresh i... - Microsoft Power BI Community

Power BI – powerful feature: Field Parameters

https://docs.microsoft.com/en-us/power-bi/create-reports/power-bi-field-parameters

1. Field parameters allow users to dynamically change the measures or dimensions being analysed within a report.
2. Recommended instead of a large number of bookmarks to change visuals (bookmarks hinder dashboard performance).
3. Also recommended in place of nested SWITCH cases (these also cause performance issues and are not recommended).
4. Check the limitations on DirectQuery*

PBI Visuals – Some best practices
Reduce number of individual visuals
(Example: Category Performance Report)
1. The squares and triangles are individual visuals displayed using formatting; there are 28 visuals in the left-hand report.
2. Instead of multiple card visuals, use a table/matrix for all the measures, and use matrix cell formatting to include the icons – this helps reduce the number of queries even further.
Reduce number of individual visuals – use tables instead of multiple cards, whenever possible.
(Example: Top Exit Pages)
The screenshot shows 6 cards, where we should have two (transparent) tables: one with two columns (top line) and the other with 4 measures (bottom line).

This will have a noticeable impact because we will be running 2 queries instead of 6, accelerating the execution start of the rest of the visuals.
Static elements and custom visuals
1. For visual appeal and to match the Zeplin design, developers sometimes add static elements like boxes, shapes, lines, etc. Instead, create a background image containing all these elements and then add the visuals on top.

2. Remove any custom visuals which are not provided by Microsoft. To be accurate, custom visuals will generally impact performance regardless of the author; if we must use them, they have to be "Power BI Certified", which can be from Microsoft or other authors.
Avoid visual level filters
(Example: Distributor Performance)
• Avoid visual-level filters that use a measure as the filter parameter; instead move the filter condition into the measure calculation itself. In this example, the measure [# of invoices] was itself built on 3 other measures.
• Normal dimensions used as visual filters might even have a positive impact, as they reduce the data queried internally.
• However, by removing the filter, we were getting more rows than expected in the visual – this is because some of the displayed measures were returning values even when the condition [# of invoices] > 0 was NOT met.
• The way to work this out was to change all those measures and force a blank result when [# of invoices] > 0 wasn't met. Those measures were implemented with the following pattern:

alex_#_of_outlets =
VAR _InvCount = [# of Invoices]
RETURN IF ( _InvCount > 0, [# of Outlets], BLANK() )
Visual Level Filter
(Example: Channel Performance – multiple visual filters applied on measures)
Uncheck "Show items with no data"
(Example: Distributor Performance)
The "Show items with no data" feature lets you include data rows and columns that don't contain measure data (blank measure values). Uncheck this option.
DAX errors
(Example: Distributor Performance)
The data model does not have the right relationships.
Zeplin design and PBI visual limitations
• Zeplin Design 1 (with spaces in between the bars) – alternate visual developed on the PBI report: clustered column chart
• Zeplin Design 2 (bars formatted by the difference) – alternate visual: single-colour chart
Zeplin design and PBI visual limitations
• Zeplin Design 3 – alternate visual: single-colour chart
• Zeplin Design 4 (bars embedded with spacing) – alternate visual: stacked column chart
Zeplin design and PBI visual limitations
• Zeplin Design 5 (negative-axis bar chart) – alternate visual: clustered column chart, positive axis only
• Zeplin Design 6 (formatted bars) – alternate visual: horizontal bar chart, single colour
Web app best practices

WebAPP Best Practices
Application Structure – Serverless approach: Use Azure Functions in the most common cases where you need to write custom reminders, notifications and scheduling tasks.
Codebase – Code quality check: Perform code quality analysis and resolve all warnings; on recent .NET versions, code analysis is enabled by default.
Codebase – Class libraries: Keep re-usable components (common validation, common mathematical calculations) in one common class library and use it across the project.
Codebase – Security: Use data encryption algorithms for sensitive data. Some common C# encryption algorithms are attached in Appendix A for reference.
Codebase – Code: Remove unused code and hard-coded values from the code base. Data travelling in the web app that is fetched from the database or any other source should be keyed on the master data primary key ID value, not on text values.
Codebase – Session: Store mostly key-value pair data in session; do not store large (>10 MB) entity objects.
Codebase – Code: Error log messages should include the method name, line number, exception detail message and path.
Codebase – Codebase: Source code must be properly commented to increase readability.
Codebase/Infra – Memory issues: Implement finite loops and a single run instance of the application locally to reduce out-of-memory issues.
DevOps – CI/CD pipeline: Ensure code deployment is done through the CI/CD pipeline in all environments.
Document – Document: Use a README text file to document the steps for new users to follow across the project.
Infra – Web server: Ensure web server scalability and a recommended configuration: processor 2 x 1.6 GHz CPU, RAM 8-16 GB, HDD 1 x 80 GB of free space or more, VM: basic medium VM.
Infra – Infra: Notify users about the site's downtime, including data migration, critical deployments or any infra-level task, as per project requirements.
Logging – Logging: Ensure logging and tracing are implemented at info, debug and error level using Azure log monitoring / App Insights on critical functionality. This is mandatory on production.
Testing – Testing: Use an API testing tool (Swagger or the latest free version of Postman) to test APIs.
Security – Security: Complete a penetration test for priority systems, i.e. PS1 (e.g. if SC1 and DR1 then PS1), for the project by the infosec team (cost included). Refer to Appendix B.
Security – Security: The web app should be accessed through a meaningful domain name with an SSL certificate activated and the standard WAF security policy applied. Steps for reference are attached in Appendix C. This is mandatory before business go-live.
Security – Security: A vulnerability scan should be performed on the web app on prod or a prod-identical environment (no cost). Refer to Appendix E. This is mandatory before business go-live.
Security – Security: Use the Azure Key Vault service to hold secrets and credential values.
Security – Security: Secure data in transit across the layers.
Security – Security: Any open-source components used should be on the latest version.
Security – Security: Use the latest version of the .NET framework.
Security – Security: Implement HTTP Strict Transport Security (HSTS) for critical interfaces.
Security – Security: Ensure no proprietary algorithm is used without proper validation.
Security – Security: User- and role-based privileges must be verified thoroughly. Authentication: the web app should use AAD authentication. Authorization: custom coding is required to implement role-based authorization for a specific Web API call.
Security – Security: Implement external user authentication via AAD. Refer to Appendix D for steps.
Application – API Architecture:
• The front end should not request too much data from the backend; implement pagination at the backend and return fewer than 50 records to the front end in one request.
• For any data-export use case, use web jobs or download APIs that create the file in the backend only and binary-stream the file to the front end or provide a download link.
• If the backend API is hosted on a different server than the front end, communication between the front end and backend must be secured using App Service authentication and authorization and firewall IP restrictions.
• A web app can connect to an Azure SQL database using Managed Service Identity, with no username or password in the connection string (see the next slide).
Webapp to SQL Server Connectivity Best Practices

A web app can connect to an Azure SQL database using Managed Service Identity (MSI). There is no need for a username or password in the connection string.

Setup
1. Enable MSI: navigate to the web app and, under Settings, click on 'Identity'. Under the 'System assigned' tab, change the status to On; this generates the Object ID. Then create a service request for the Landscape team to grant the web app access to SQL, providing all resource groups, SPN and SQL Server details.
Service Catalog Item - Employee Service Center (service-now.com)
2. Modify the application code to consume a connection string without an ID and password:
"server=tcp:<server-name>.database.windows.net;database=<db-name>;Authentication=Active Directory Interactive"
3. Local development machines used to implement the application code should be MSI-enabled DTL machines. For local development, the IP address of the DTL machine must be whitelisted and mentioned in the service request.

NOTE: A POC application is available that connects to SQL Server from C# code using MSI authentication:
WebAppSQLAADAuth.zip
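The attached POC covers the C# path. If a companion Python job (for example a Function or Databricks utility) needs the same passwordless pattern, a hedged sketch using azure-identity and pyodbc is shown below; the server and database names are placeholders.

# Sketch: connect to Azure SQL with an AAD token instead of a username/password.
# Works with managed identity when run on an MSI-enabled host; names are placeholders.
import struct
import pyodbc
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()          # picks up the managed identity on Azure hosts
raw_token = credential.get_token("https://database.windows.net/.default").token

# The ODBC driver expects the token as UTF-16-LE bytes prefixed with their length
token_bytes = raw_token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)
SQL_COPT_SS_ACCESS_TOKEN = 1256                # driver-specific connection attribute

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server-name>.database.windows.net,1433;"
    "Database=<db-name>;Encrypt=yes;",
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct},
)
print(conn.execute("SELECT SUSER_SNAME()").fetchval())   # confirms the identity in use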

WebAppDesign (1/2)

Architecture diagram (front-end and back-end environments): user/browser, front end, REST API, Azure Blob Storage, Redis Cache, SQL Database, Azure Active Directory, Azure DevOps, Azure Monitor, Azure Key Vault, App Insights, Power BI and third-party/external data ingestion systems, numbered 1-16 as described below.

1. A user accesses the public website domain in the browser.
2. The browser pulls static resources and product images from Azure Content Delivery Network.
3. A user browses the website UI to submit requests, with MSI authentication and WAF security.
4. The web app UI calls REST API methods to communicate with the different modules.
5. The content module loads product images from Azure Blob Storage (optional).
6. The user authenticates against Azure Active Directory (Azure AD).
7. The modules pull data from the database.
8. The result from the database is cached in the Redis Cache (optional).
9. Azure Functions send reminders and notifications.
10. Logic Apps connect disparate systems across the cloud (MS Teams notifications).
11. The Azure DevOps CI/CD pipeline publishes code to all environments.
12. Azure Monitor provides the log analysis service.
13. Azure Key Vault helps teams securely store and manage sensitive information such as keys.
14. Access to third-party services.
15. App Insights as Application Performance Management.
16. Power BI as the reporting service.
WebAppDesign (2/2)

Layered view: the Web UI and Web Client Services call the Web API, which goes through the Web Business Access Layer and the Web Data Access Layer; Azure Functions sit alongside for background work.
Domain name, SSL and WAF onboarding (see attached DomainSSLWAFSteps)
1. Build team gets the DNS name for the new web application URL from the business.
2. Build team generates a service request to the Landscape team and gets the CNAME, txt value and root domain value for SSL validation.
3. Build team generates a service request to the DNS team to add the "CNAME", "txt" and "root domain" values and sends an email to the DNS team with the request number.
4. Build team checks with Landscape to validate that these values have been added.
5. Build team raises a request to the Landscape team to purchase the SSL certificate from Microsoft for the production environment.
6. Build team raises a JIRA request to the Akamai CDN team to add the delivery configuration.
7. After configuration, the @Akamai_CDNConfig team raises the request for the DNS txt entry at DNS level.
8. DNS team confirms on email that the txt value has been added.
9. The @Akamai_CDNConfig team completes the CDN property configuration and shares the SAN edge key information and server information / CNAME change.
10. Build team raises a request to the DNS team and shares the CNAME (e.g. e2ebolt.unilever.com) and IP address (e.g. 20.107.224.53).
11. Build team raises a request to the WAF team for WAF onboarding.
12. Build team verifies the URL on the staging environment as per the steps provided by the WAF team.
13. WAF team pushes the changes to production and shares the IP address list in the email.
14. Build team raises a new SR for the Landscape team to whitelist the IP list received from the WAF team.
C# Best Practices (1/2)

Architecture design level guidelines
• Implement a modular and loosely coupled architecture using interfaces and abstract classes.
• Use of generics helps you make reusable classes and functions.
• Separate your application into multiple assemblies: create separate assemblies for the UI, business layer, data access layer, framework, exception handling and logging components.
• Do not access the database from UI pages; use the data access layer to perform all database-related tasks.
• Always use parameterized stored procedures instead of writing inline queries in C# code.
• Try to use design patterns, practices and SOLID principles.
• For repeated code, create a separate utility file or move it to a base class.
• Don't store large objects in Session; storing large or complex objects in session may consume server memory. Destroy or dispose of such session variables after use.
• Don't store large objects in ViewState; this will increase the page load time.
• Always reference third-party DLLs, JavaScript and CSS frameworks through NuGet packages so that you can update to the latest version whenever required and approved by the architect.
• Always reference minified versions of JavaScript or CSS files; this reduces unnecessary overhead on the server.

Code indentation and comments
• Use the default code editor settings provided by Microsoft Visual Studio.
• Write only one statement and declaration per line.
• Add one blank line between each method.
• Use parentheses to make the code easier to understand.
• Use XML comments to describe functions, classes and constructors.
• Use Tab for indentation.
• Use one blank line to separate logical groups of code.
• Use #region and #endregion to group related pieces of code, in this order: private members, private properties, public properties, constructors, event handlers/action methods, private methods, public methods.
• Do not write comments for every line of code and every variable declared.
• Use // or /// for comments; avoid using /* … */.
• If you have to write something complex for any reason, document it very well with sufficient comments.
• If all variables and method names are meaningful, the code will be very readable and will not need many comments.
C# Best Practices (2/2)
Good Programming Practices
Avoid writing long functions. The typical function should have max 40-50 lines of code. If method has more than 50 line of code, you must consider re factoring into separate private methods.
Avoid writing long class files. The typical class file should contain 600-700 lines of code. If the class file has more than 700 line of code, you must create partial class. The partial class combines code into single unit
after compilation.
Don’t have number of classes in single file. Create a separate file for each class.
Avoid the use of var in place of dynamic.
Add a whitespace around operators, like +, -, ==, etc.
Always succeed the keywords if, else, do, while, for and foreach, with opening and closing parentheses, even though the language does not require it.
Method names should be meaningful and not misleading; a meaningful method name needs no extra code comments.
The method / function / controller action should have only a single responsibility (one job). Don't try to combine multiple functionalities into a single function.

Do not hardcode strings or numbers; instead, create separate files for constants and put all constants there, or declare constants at the top of the file and refer to these constants in your code.
While comparing string, convert string variables into Upper or Lower case
Use String.Empty instead of “”
Use enums wherever required. Don’t use numbers or strings to indicate discrete values

The event handler should not contain the code to perform the required action. Instead call another private or public method from the event handler. Keep event handler or action method as clean as possible.

Never hardcode a path or drive name in code. Get the application path programmatically and use relative paths. Use the input/output classes (System.IO) to achieve this.
Always do null check for objects and complex objects before accessing them

Error messages to the end user should be user friendly and self-explanatory, but log the actual exception details using the logger. Create constants for these messages and use them in the application.

Avoid public methods and properties to expose, unless they really need to be accessed from outside the class. Use internal if they are accessed only within the same assembly and use private if used in same class.
Avoid passing many parameters to function. If you have more than 4-5 parameters use class or structure to pass it.
While working with collection be aware of the below points,
While returning collection return empty collection instead of returning null when you have no data to return.
Always use the Any() operator instead of checking the count (i.e. collection.Count > 0) or checking for null.
Use foreach instead of for loop while traversing.
Use IList<T>, IEnumerable<T>, ICollection<T> instead of concrete classes, e.g. List<>.
Use object initializers to simplify object creation.
The using statements should be sorted with framework namespaces first and then application namespaces, in ascending order.
If you are opening database connections, sockets, file stream etc, always close them in the finally block. This will ensure that even if an exception occurs after opening the connection, it will be safely closed in the
finally block.

Simplify your code by using the C# using statement. If you have a try-finally statement in which the only code in the finally block is a call to the Dispose method, use a using statement instead.
Always catch only the specific exception instead of catching generic exception
Use StringBuilder class instead of String when you have to manipulate string objects in a loop. The String object works in a weird way in .NET. Each time you append a string, it is actually discarding the old string
object and recreating a new object, which is a relatively expensive operation.

Web Application TG Connectivity

• A React web app provides the end-user experience (Glossary page, Product detail page), accessible on mobile and browser, with AAD authentication (SSO) on the Azure Web App.
• The web app calls REST APIs developed on the .NET Core 8 framework, which in turn connect to the TigerGraph DB server (PDS / Azure App Service / TG with an auth layer) to fetch the data; the response is returned to the user.
• REST++ authentication uses a token:
 - One secret is associated with a particular user and with the user's privileges for a particular graph.
 - Anyone who has this secret can invoke a special REST endpoint to generate authorization tokens.
 - A new authorization token must be included in the header for each REST endpoint call.
• Store the secret value and certificate provided by the TG team in Key Vault.
• TG to web app connectivity guide: WebAppToTGConnectivity (attached).
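A hedged Python sketch of the token flow described above is shown here for illustration only. The TigerGraph host, port and secret are placeholders, and the exact REST++ endpoint and parameters should be confirmed with the TG team for the installed TigerGraph version.

# Sketch of the REST++ token flow: exchange the graph secret for a short-lived token,
# then send the token as a Bearer header on every REST++ call. Host/secret are placeholders.
import requests

TG_HOST = "https://<tigergraph-host>:9000"
GRAPH_SECRET = "<secret-from-key-vault>"

# Request a token valid for 1 hour (lifetime in seconds)
token_resp = requests.get(
    f"{TG_HOST}/requesttoken",
    params={"secret": GRAPH_SECRET, "lifetime": "3600"},
)
token = token_resp.json()["token"]

# Use the token on subsequent REST++ endpoint calls
result = requests.get(
    f"{TG_HOST}/query/<graph-name>/<installed-query-name>",
    headers={"Authorization": f"Bearer {token}"},
)
print(result.json())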
Web Application Vulnerability scan
Steps:
• Send an email to [email protected] and [email protected] with details of the web app.
• Fill in the attached Excel form and send it back to the same group mentioned above.
• The scan team will take one or two working days to complete the scan and send the PDF report back.
Purpose of the vulnerability scan:
• The scanning tool helps you minimize risks and reduce your attack surface for modern web apps and APIs.
• It identifies weaknesses in the network that attackers could exploit to gain unauthorized access, steal data, or launch attacks.
Notes:
• All Critical and High-risk vulnerabilities need to be fixed before Go-Live, as they represent an increased risk to Unilever and its digital assets.
• Medium and Low vulnerabilities need to be fixed within the stipulated SLAs from the date of Go-Live.
• In case the application is already live or published, the necessary remediation action must be undertaken on top priority.
• ETA to remediate the reported vulnerabilities: Critical – 2 days; High – 5 days; Medium – 60 days; Low – 90 days. Failure to adhere to the SLA will be considered non-compliance and the web app will be taken offline from production.

Points of Contact for Access
Access Requests (Owner: Delivery Team)
Table below provides details on requesting access for different Azure components
Component Link to SR / Document

Create Mountpoint in databricks 1. Product team to raise service now request using the below link
2. Cloud Platform Management: Create Mountpoint

Access to UDL 1. Product team to raise service now request using the below link
2. Universal Data Lake: Data Access
3. Provide databricks SPN details to enable accessing mount paths

Access to BDL 1. Once mount point is created in databricks, Landscape team will request for approval from respective business
owner in an email
2. Product team to raise request with business owner
3. Once approved, product team to provide approval email to Landscape team to enable SPN access
PPU license for Dev team: The attached document (PPU License for Dev Team) has all the details.
Power BI Premium Onboarding Attached document has all the details
1. Build team to create workspaces following naming convention, <projectname>_Qa/Prod and share the workspace
ID and URL along with costing with PBI CoE to convert to Premium capacity Link

Web App configurations: It is mandatory to implement the following for web applications (follow the attached documents – Web application configurations, External Users ACAM process – for detailed implementation):
1. Akamai WAF onboarding
2. Custom DNS name
3. SSL certificate
** Note: WAF onboarding requires an SSL certificate to be procured. In Dev and QA, the certificate can be provisioned by the build team free of cost and is valid for 1 year. For production, the certificate must be purchased at a cost of $69.99/year with yearly auto-renewal.

Access Requests (Owner: Delivery Team)

Table below provides details on requesting access for different Azure components
Component Link to SR / Document

FMT Tool: Unilever points of contact (details in the attached Word document):
1. Balasubramanian, Aishwarya <[email protected]>
2. Zakir, Amanulla <[email protected]>
3. Shaik, Reshmi <[email protected]>
4. [email protected]

Logic App: Details in the attached Word document.
BDL Enriched and DDL layer: Ajey Kartik (attachments: BDL Enriched Layer Overview, BDL Enriched Layer E2E Design)

Design App IA_DL_COE_MetaStore: [email protected]; Mandar Pathak

Teams Tab web app integration: To create the PP account and add it in the Teams Admin Center, contact:
Malviya, Raksha <[email protected]>; Subbappa, Rashmi <[email protected]>
For MS Teams license assignment to PP accounts, contact:
[email protected]
[email protected]
[email protected]
[email protected]>;Janaka Wijesekara <[email protected]>; Iyengar, Srinivasa <[email protected]>;
Preetha J <[email protected]>;[email protected]
MS TEAMS APP deployment team:
[email protected]
[email protected]
[email protected]

Email conversation has been attached here.

Access Requests – Data owners (Owner: Delivery Team)

Functional Area / Data Owner:
CD [email protected]

Supply Chain [email protected]

Marketing [email protected]

HR [email protected]

Finance [email protected]

R&D [email protected]

Master Data [email protected]

Appendix A: Web App String Encryption Algorithm (attached Word document)

Appendix B: WebApp Penetration Testing (attached PowerPoint presentation)

Appendix C: WebApp SSL - WAF Onboarding (attachments: OLD_Domain_SSL_WAF_Steps, NEW_Domain_SSL_WAF_Steps)

Appendix D: WebApp External User Onboarding (attached Word document)

Appendix E: WebApp Vulnerability Scan (attachment: WebAppVulnerabilityScan)
