SF-Best Practices
08/07/2024
Sensitivity: Internal
Document Version Control
Version Change Description Author Date
1.0 Final Best practices for Solution Factory products Divya Shah/ Bhavana/Manoj 17-08-2022
1.3 Included sample data flow (attachment) for AAS Incremental refresh Bhavana 09-01-2023
1.5 Included details for AAD REST API based solution for extraction of user list from group Bhavana 04-04-2023
1.6 Convert data frame to excel format Bhavana 05-04-2023
1.8 Included steps / pre-requisites for PBI Premium onboarding Bhavana 26-05-2023
2.0 Folder structure details included for Unity Catalog Paras 13-03-2024
Document List to be followed by a Project Deliverable Reference
Design Best Practices
Area Guidelines
Architecture 1. Refer to the Solution Architecture guidelines (Path) to identify the pattern to be leveraged for the solution
Architecture 1. Data sources are clearly defined and links to Azure cloud solution are defined
2. Data volumes are clear with total data retention in PDS with long term view
3. Number of concurrent users
4. What is the data extraction data volume?
5. Threshold limitation defined for architecture like 50 GB cube for x markets, 1 hour for ADB to PBI dataset refresh
6. Manual files/ tactical sources to be approved by Data SMEs
1. Any source not in UDL to be approved by Data SME
2. Manual files to be ingested using FMT to UDL / Techdebt
7. End to end PDS flow to be defined with all Azure services for ETL
8. Semantic layer (cubes) defined for different markets in the same flow
9. Understand self service requirements with respect to NFR
10. Web app represented if required with firewall
11. No Product to Product sourcing
12. MLOPS sources and lineage defined
13. Define if delta loads to be performed from sources
14. Security mechanism for Cubes Semantic layer defined
15. Security for sourcing/storing PII data defined
16. JMF configs for the product defined
17. Right configurations for ADB, SQL, AAS, Webapp. ADB-LTS version
18. Refresh SLAs to be considered while defining the solution
Data Model 1. Star schema with Facts and Dimensions only with Many to one relationship
2. Global data model defined if there are multiple markets
3. Facts do not have descriptive attributes
4. Granularity of Facts is kept to the minimum, e.g. ProductSKU, Brand
5. Monthly/Daily facts are different
6. All dimensions and Facts should be connected via a BIGINT surrogate key named <Col>SK
7. Master Calendar dimension to be consumed from BDL
8. Master Product global hierarchy to be consumed from BDL
9. Every table should have LastModifiedDate attribute
WF Mapping 1. All slicers are connected to dimensions
2. All visuals have KPI defined in cube and X/Y axis connected to dimensions
3. KPI calculation will not be part of review
4. KPI link to the visuals should be reviewed by Business Analyst
Development Best Practices
Area Guidelines
Azure SQL 1. Right SQL configuration for Prod and Non Prod
2. No hard coding of paths and data. Should come from config file in ADF or parametrized on data model
3. No unwanted test code that will consume processing time, like count checks, SELECT statements, etc.
4. The objects should be developed with incremental load and NOT TRUNCATE AND LOAD strategy
5. All Joins within Fact/dimensions should be on Surrogate Keys
6. Views created for reference in semantic layer with only filters and no calculations
PBI Dataset 1. All Facts need to be partitioned on the Date column on the Power Query filter (and heavy dimensions) by configuring the native incremental refresh
2. LastModifiedDate attribute will be used to detect which partition is changing in a cube
3. Add all measures at dataset level, not in the individual reports.
4. Add parameters for source connection to change connection dynamically
5. VPAX is analyzed in Tabular Editor Best practice analyzer BestPracticeAnalyzerRules.json
6. VPAX is analyzed in the PBI COE Performance Analyzer tool
7. Review the lineage of a PBI Dataset
8. Custom Partitions to be created when the data is above 20M per partition
9. Only relevant SK’s are imported on Facts and NO CODE attributes
10. No RowSK and PKSK columns used for delta detection within ADB are imported on Cube
11. All common KPI's need to be defined in Cube and NOT in individual PBI reports
12. Security implemented on datasets
13. User friendly names applied on dataset
14. Encoding hint for all numeric fields should be set as VALUE and character fields should be set as HASH
15. All fact measures should be DECIMAL data-type (not DOUBLE or FLOAT)
16. Always use 32-bit integer data-type instead of 64-bit integer for whole-number fields, if the min/max value fits within the 32-bit range
17. Must avoid any BI-DIRECTIONAL and MANY-TO-MANY relationships between DIM and FACTS
18. Minimize Power Query transformations. Ensure query folding for native queries
19. Dataset setting as “Large dataset” in PBI for datasets above 10 GB
20. Disable Auto date time
21. There should not be any datetime column. If you need time, it should be a separate column
22. PBI/AAS Time Intelligence Calculation – Time Intelligence calculations for small and medium datasets should be implemented through the built-in 'Calculation Items' capability
23. Use SQL Profiler traces to assess data loads, load throughput, cost of additional fields and parallelism of partition loading
24. Use PBI Tool from Phil Seamark to analyse the results from SQL Profile traces and assess optimal parallelism for data loads
https://dax.tips/2021/02/15/visualise-your-power-bi-refresh/
AAS Cube 1. All Facts need to be partitioned on the Date column on the Power Query filter (and heavy dimensions)
2. Only relevant SK's are imported on Facts and NO CODE attributes
3. No RowSK and PKSK columns used for delta detection imported on Cube
4. All common KPI's need to be defined in Cube and NOT in individual PBI reports
5. A PartitionConfig file with details of the Period to be considered within each partition
6. The PartitionConfig and Fact are joined within Power Query to filter each partition on the respective period. This will enable the AAS Fact to dynamically partition data
7. The LastModifiedDate attribute will be used to detect which partition is changing in a cube. The RefreshFlag in the PartitionConfig file will be dynamically changed based on LastModifiedDate
8. The webhooks to refresh the AAS cube will receive the TMSL script with only partition names filtered with RefreshFlag as 1
9. When refreshing from CSV files – each partition of the AAS tables should be connected to individual CSV files/folders. Refer the AAS refresh slide
10. Security implemented on datasets
11. Refresh from CSV files – each partition of the AAS tables should be connected to individual CSV files directly by using the below M-Query method to speed up the data load process. Refer the below example for one sample fact partition table; similar M-Query transformations should be created for the remaining partitions pointing to their respective CSV files
=======================================================
Folder level assignment
let
    Source = AzureStorage.DataLake("https://dbstorageda16d901632adls.dfs.core.windows.net/unilever/Fact_1/Jan_CY/"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Hidden Files1",{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
    #"Invoke Custom Function1" = Table.AddColumn(#"Removed Columns", "Transform File", each #"Transform File"([Content])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Invoke Custom Function1", {"Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1",{{"Period", Int64.Type}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"Changed Type"
================================================================
12. Encoding hint for all numeric fields should be set as VALUE and character fields should be set as HASH
13. All fact measures should be DECIMAL data-type (not DOUBLE or FLOAT)
14. Always use 32-bit integer data-type instead of 64-bit integer for whole-number fields, if the min/max value fits within the 32-bit range
15. Must avoid any BI-DIRECTIONAL and MANY-TO-MANY relationships between DIM and FACTS
16. Minimize Power Query transformations
17. PBI/AAS Time Intelligence Calculation – Time Intelligence calculations for small and medium datasets should be implemented through the built-in 'Calculation Items' capability; this feature is available in PBI datasets as well as in AAS cubes. For large datasets this should be considered on a case-by-case basis based on performance impact. Refer to the sample DAX examples below to define Calculation Items for Time Intelligence –
> MTD = CALCULATE ( SELECTEDMEASURE (), DATESMTD ( 'Calendar'[Date] ) )
> QTD = CALCULATE ( SELECTEDMEASURE (), DATESQTD ( 'Calendar'[Date] ) )
> YTD = CALCULATE ( SELECTEDMEASURE (), DATESYTD ( 'Calendar'[Date] ) )
> PY = CALCULATE ( SELECTEDMEASURE (), SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )
> YOY = VAR CY = SELECTEDMEASURE ()
VAR PY = CALCULATE ( SELECTEDMEASURE (), SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )
RETURN CY - PY
Web app 1. Use of KeyVault to hold secrets, credentials. DO NOT USE DefaultAzureCredential for authentication during development from local machine / VM
2. Secured data in transit across the layers
3. Any open source coding used should be latest version. For e.g. node.js latest version as of Apr 2023 is <>
4. Latest version of .NET framework used
5. Is the Penetration test completed for the project by Infosec team?
6. Implementation of HTTP Strict Transport Security (HSTS) for critical interfaces
7. Ensure no proprietary algorithms used without proper validation
8. Source code must be properly commented to increase readability
9. Ensure logging mechanism implemented for critical functionality
10. User and Role-based privileges must be verified thoroughly
Follow the standard guidelines before initiating code build for ETL. Refer to SF-Code Build and Review Checklist.xlsx (Path)
Development Best Practices for Data science / ML Product
Area Guidelines
Development Best Practices for Synapse Serverless Pool
Area Guidelines
• Make sure the storage and serverless SQL pool are in the same region
• Colocate a client application with the Azure Synapse workspace. Placing a client application and the Azure Synapse workspace in different regions could cause bigger latency and slower streaming of results.
• Set TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true) for all the delta tables
• Run VACUUM on delta tables
• Optimize storage layout by using partitioning and keeping file sizes of the table in the range between 100 MB and 10 GB.
• Don't stress the storage account with other workloads during query execution.
• Convert large CSV and JSON files to Parquet. Serverless SQL pool skips the columns and rows that aren't needed in a query if you're reading Parquet files.
• It's better to have equally sized files for a single OPENROWSET path or an external table LOCATION.
• Schema should not be inferred. For example, Parquet files don't contain metadata about maximum character column length, so serverless SQL pool infers it as varchar(8000).
Code best practices
• Use appropriate data types and the smallest data size that can accommodate the largest possible value.
• If the maximum character value length is 30 characters, use a character data type of length 30.
• If all character column values are of a fixed size, use char or nchar. Otherwise, use varchar or nvarchar.
• If the maximum integer column value is 500, use smallint because it's the smallest data type that can accommodate this value.
• Use integer-based data types if possible. SORT, JOIN, and GROUP BY operations complete faster on integers than on character data
Product Ingestion/Transformation Design
Azure Data Lake Folder Structure
Follow the below mentioned folder structure for ADLS within PDS layer
[Folder structure diagram – the Unilever > <<ProductName>> hierarchy is repeated per layer:
• Landing Layer: Unilever > <<ProductName>> > LandingLayer > FinConnect (TransactionSource) > <<Table1>> > File<>, plus ManualFiles > File<>, ErrorLogs and ArchiveData folders
• Staging Layer: Unilever > <<ProductName>> ** > StagingLayer > FinConnect (TransactionSource) * > <<Table1>> > yyyy=2021
• Transform Layer: Unilever > <<ProductName>> > TransformLayer > Dimensions > <<Dimension1>>, <<Dimension2>> > yyyy=2021
• Semantic Layer (csv format for AAS): Unilever > <<ProductName>>]
Azure Data Lake Folder Structure – Unity Catalog Enabled
Follow the below mentioned folder structure for ADLS within PDS layer
Use volumes for flatfiles
Staging Layer: Unilever > <<ProductName>> ** > StagingLayer
• FinConnect (TransactionSource) * > <<Table1>> > yyyy=2021 > mm=01 > dd=01
• GMRDR (AnalyticalSource) > <<Table1>> (e.g., ProductCategory) > <<partition name>> (if needed)
Transform Layer: Unilever > <<ProductName>> > TransformLayer
• Dimensions > <<Dimension1>>, <<Dimension2>>
• Facts > <<Fact1>> > yyyy=2021 > mm=01 > dd=01
Unity Catalog – Naming convention for PDS
Catalog Name : pds_<productname>_<itsg>_<env>
Example, pds_gmi_902772_dev
Incremental Loading Strategy
All objects in PDS should be loaded incrementally. Audit columns like ObjectRowSK and PrimaryKeySK should be used to identify the delta loads and to update the LastModifiedDate audit column.
Scenario 1 – Upsert
1) Compare tables A & B on KeySK and RowSK; if matching records are found for KeySK but not for RowSK, then UPDATE the matching records from B into A (DELETE > INSERT)
2) Compare tables A & B on KeySK and INSERT unmatched records from B into A

Scenario 2 – Upsert/Delete
1) Compare tables A & B on KeySK and RowSK; if matching records are found for KeySK but not for RowSK, then UPDATE the matching records from B into A (DELETE > INSERT)
2) Compare tables A & B on KeySK and INSERT unmatched records from B into A
3) Compare tables A & B on KeySK and DELETE unmatched records from A

Existing Processed Data (A) – the same sample applies to Scenario 1 (incremental data, upsert only) and Scenario 2 (partial full data, upsert & delete):
KeySK | RowSK | Period | Product | Geography | KPI-1 | KPI-2 | Mod_Date
44562P1G1 | 44562P1G1100120 | 01-01-2022 | P1 | G1 | 100 | 120 | 30-03-2022
44563P1G1 | 44563P1G1150200 | 02-01-2022 | P1 | G1 | 150 | 200 | 30-03-2022
44563P2G2 | 44563P2G2150200 | 02-01-2022 | P2 | G2 | 150 | 200 | 29-03-2022
44564P3G3 | 44564P3G3120100 | 03-01-2022 | P3 | G3 | 120 | 100 | 29-03-2022
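The same upsert can be expressed directly as a Delta Lake MERGE in a Databricks notebook. A minimal PySpark sketch of Scenario 1 follows, assuming the target is a Delta table; the table names (pds.target_fact, pds.staging_fact) mirror the example above but are placeholders.

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "pds.target_fact")    # Existing Processed Data (A) - placeholder name
incoming = spark.table("pds.staging_fact")               # incoming delta records (B) - placeholder name

(
    target.alias("a")
    .merge(incoming.alias("b"), "a.KeySK = b.KeySK")
    # KeySK matches but RowSK differs -> the row changed: update it (B also carries the new Mod_Date)
    .whenMatchedUpdateAll(condition="a.RowSK <> b.RowSK")
    # KeySK not present in A -> new row: insert it
    .whenNotMatchedInsertAll()
    .execute()
)
# Scenario 2 would additionally delete rows missing from B; on recent Delta Lake versions this can
# be added with .whenNotMatchedBySourceDelete() before .execute().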
ADF Pipeline Design (1/2)
[Pipeline diagram: P1 – PL_<APPLICATION_NAME>_PROCESS_<SRC>_DIM_DATA, a parameterised pipeline with activities 1–6, JMF activities J1–J3, a 'When IngestionFlag = 1' branch and the transformation notebook step 4b.]
Legend :
P1 – Parameterized pipeline that will be scheduled to execute via JMF. PipelineName & ObjectName will be passed as a parameter from JMF
J1, J2, J3 – These are JMF activities and must be included for pipeline level logging into JMF framework (Refer JMF Table Entries)
Activity Details
1 - Read config values like tablename, source file path, destination file path, notebook name from config file. The config file needs to be stored in ADLS (Sample file, InputParameters attached below)
2 – Filter and load objects from InputParameters file using Filter activity
3 – Check value for IngestionFlag. If IngestionFlag=1, copy data from source into LandingLayer. If IngestionFlag = 0, move to transformation. Follow folder structure defined in subsequent slides
4 – 4a – Check value for TransformationFlag. If TransformationFlag = 0, exit the notebook. If TransformationFlag =1, continue notebook activities.
Error logging : Check for duplicates and null values in surrogate key columns as a first step in databricks notebook.
If duplicates are found, take a distinct and continue processing. Log the duplicate values into error log. Folder structure defined in subsequent slides
If null values are found in key columns, log them into error log. Follow folder structure defined in subsequent slides
4b – Transform the data using logic provided in PDM and store data in transform layer where delta checks are carried on basis of RowSK and PrimarySK (refer Incremental Loading strategy slide)
5 – Archive source data within LandingLayer>>Archive folder. For daily refresh cycle, archive data within date folder, YYYYMMDD format. For monthly refresh cycle, folder name must be YYYYMM format
6 – Purge old data. Retain 7 days of archived data for daily refresh cycle. Retain 3 months of archived data for monthly refresh cycle.
Best Practices:
• Follow naming conventions defined in SAG
• Timeout should be set at activity level and not be default of 7 days
• Include description to pipeline activity
• Organize data factory components into folders for easy traversal
Attachments: JMF Table Entries, InputParameters
ADF Pipeline Design (2/2)
[Pipeline diagram: P1 ** – PL_<APPLICATION_NAME>_PROCESS_<SRC>_FACT_DATA, a parameterised pipeline with activities 1–6, JMF activities J1–J3, a 'When IngestionFlag = 1' branch, a 'When TransformationFlag = 1' branch and the transformation notebook step 4b.]
Legend :
P1 – Parameterized pipeline that will be scheduled to execute via JMF. PipelineName & ObjectName will be passed as a parameter from JMF
J1, J2, J3 – These are JMF activities and must be included for pipeline level logging into JMF framework (Refer JMF Table Entries)
Activity Details
1 - Read config values like tablename, source file path, destination file path, notebook name from config file. The config file needs to be stored in ADLS (Sample file, InputParameters attached below)
2 – Filter and load objects from InputParameters file using Filter activity
3 – Check value for IngestionFlag. If IngestionFlag = 0, exit the notebook. If IngestionFlag = 1, copy data from source into LandingLayer for the required time period based on the where condition. Follow folder structure defined in subsequent slides
4 – 4a – Check value for TransformationFlag. If TransformationFlag = 0, exit the notebook. If TransformationFlag =1, continue notebook activities.
Error logging : Check for duplicates and null values in surrogate key columns as a first step in databricks notebook.
If duplicates are found, take a distinct and continue processing. Log the duplicate values into error log. Folder structure defined in subsequent slides
If null values are found in key columns, log them into error log. Follow folder structure defined in subsequent slides
4b – Transform the data using logic provided in PDM and store data in transform layer where delta checks are carried on basis of RowSK and PrimarySK and update the LastModifiedDate(refer Incremental Loading strategy slide)
5 – Archive source data within LandingLayer>>Archive folder. For daily refresh cycle, archive data within date folder, YYYYMMDD format. For monthly refresh cycle, folder name must be YYYYMM format
6 – Purge old data. Retain 7 days of archived data for daily refresh cycle. Retain 3 months of archived data for monthly refresh cycle.
Best Practices:
• Follow naming conventions defined in SAG
• Timeout should be set at activity level and not be default of 7 days
• Include description to pipeline activity
• Organize data factory components into folders for easy traversal
Attachments: JMF Table Entries, InputParameters
Generic Flow of a Transform Notebook
1. Import required libraries
2. Read various parameters from ADF config file
3. Create a dataframe for required columns for the object
4. Call generic function for DQ checks (these are prebuilt)
5. Check for nulls in PK columns. If nulls found, write into error logs folder defined in previous slide
6. Remove rows for which PKs are null
7. Check for duplicates in PK/ composite key. If found, write into error logs folder defined in architecture deck
8. Remove rows having duplicates and select only distinct values
9. Create dataframe with clean data and include hash columns
10. Incremental load
b. Capture delta records by comparing source and target (delta table) for insert and update
c. If change in records found, merge the final table data with the delta captured in 10b
11. Use VACUUM to remove historical versions of data
12. Sample notebooks can be provided
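As an illustration of steps 4–9, a minimal PySpark sketch of the null/duplicate checks and hash-column creation is shown below; df is the object dataframe, and pk_cols and error_log_path are hypothetical placeholders for the object's key columns and the error-log folder from the architecture deck.

from pyspark.sql import functions as F

pk_cols = ["KeySK"]                                              # placeholder primary/composite key
error_log_path = "/mnt/adls/<ProductName>/ErrorLogs/<Object>/"   # placeholder error-log folder

# Steps 5-6: log rows with null PKs, then remove them
null_filter = " OR ".join(f"{c} IS NULL" for c in pk_cols)
null_rows = df.filter(null_filter)
if null_rows.limit(1).count() > 0:
    null_rows.write.mode("append").format("delta").save(error_log_path + "null_pk")
df = df.filter(f"NOT ({null_filter})")

# Steps 7-8: log duplicate PKs, then keep only distinct rows
dupes = df.groupBy(*pk_cols).count().filter("count > 1")
if dupes.limit(1).count() > 0:
    dupes.write.mode("append").format("delta").save(error_log_path + "duplicate_pk")
df = df.dropDuplicates(pk_cols)

# Step 9: add a hash column (RowSK) over all columns for later delta detection
df = df.withColumn("RowSK", F.sha2(F.concat_ws("||", *df.columns), 256))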
Facts with Different Grain
If the requirement is to merge Facts with different grain, the following design should be followed, resulting in a single fact.
Source Data
Product (dimension): ProdSK | SKU | Category | Brand
PT1 | PT1 | C1 | B1
PT2 | PT2 | C1 | B1
PT3 | PT3 | C1 | B1
PT4 | PT4 | C1 | B2
PT5 | PT5 | C1 | B2
Fact: Prod | Value
PT1 | 10
PT2 | 32
PT3 | 323
B2 | 344
C1 | 232
(the fact values arrive at mixed grain: SKU, Brand and Category)
Refresh Power Bi dataset with no partition or Default Incremental load partition
Parameters: WorkspaceOrGroupID, PBIDatasetID
3. Add your environment SPN to the Power BI workspace as a user.
4. Whitelist your SPN in the Azure tenant by getting it added to the Azure AD group sf-powerbi-spn (Microsoft Azure). Reach out to Landscape or SF Architect leads to get it added to the AD group.
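Once the SPN is whitelisted, the refresh itself can be triggered through the Power BI REST API. A minimal Python sketch is shown below; the tenant/client values, WorkspaceOrGroupID and PBIDatasetID are placeholders, and in practice the same calls are typically issued from an ADF Web activity.

import requests

tenant_id = "<tenant-id>"
client_id, client_secret = "<spn-client-id>", "<spn-secret>"
workspace_id, dataset_id = "<WorkspaceOrGroupID>", "<PBIDatasetID>"

# 1. Get an access token for the Power BI API with the SPN (client credentials flow)
token = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://analysis.windows.net/powerbi/api/.default",
    },
).json()["access_token"]

# 2. Queue a dataset refresh (default / native incremental-refresh partitions)
resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes",
    headers={"Authorization": f"Bearer {token}"},
    json={"notifyOption": "NoNotification"},
)
resp.raise_for_status()   # 202 Accepted means the refresh was queued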
AAD REST API based solution for extraction of user list from AD Group
Problem Statement
Read and store user details from Azure AD for the required AD group. This mapping is required to apply row-level security on Databricks table(s) where role-based security by AD group cannot be directly applied (user-level details are needed).
Alternative suggested
AAD REST API based solution for extraction of user list from group. API call can be made using ADF web activity.
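A minimal Python sketch of the Graph call is shown below, assuming an app registration with the GroupMember.Read.All application permission; the tenant/app values and group id are placeholders, and the same two HTTP calls can be wired into ADF Web activities instead.

import requests

tenant_id, client_id, client_secret = "<tenant-id>", "<client-id>", "<client-secret>"
group_id = "<ad-group-object-id>"

# Client-credentials token for Microsoft Graph
token = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    },
).json()["access_token"]

# List group members, following @odata.nextLink paging
members = []
url = f"https://graph.microsoft.com/v1.0/groups/{group_id}/members?$select=displayName,mail"
while url:
    page = requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()
    members.extend(page.get("value", []))
    url = page.get("@odata.nextLink")

# 'members' can then be written to a Databricks table to drive user-level row security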
Convert data frame to excel format
Option 1 (import via pip install – suggested)

# Step 1 – Install openpyxl using pip and import the required libraries
pip install openpyxl
import pandas as pd
from shutil import copyfile

# Step 2 – create dataframe
df_marks = pd.DataFrame({'name': ['X', 'Y', 'Z', 'M'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})
# (a Spark dataframe can be created similarly, e.g. df = spark.createDataFrame(data=data2, schema=schema))

# Step 3 – Define path to copy
temp_file = '/tmp/mysheet_using_pip.xlsx'
final = '/dbfs/mnt/PowerBIRefresh/mysheet_using_pip.xlsx'

# Step 4 – write dataframe to excel
df_marks.to_excel(temp_file, index=False)
print('DataFrame is written successfully to Excel File.')

# Step 5 – Copy the file
copyfile(temp_file, final)
Product Consumption Layer Design
Incremental Refresh - AAS
1. Define partitions on the AAS model on Facts/Dimensions based on the transaction date attribute, which can be at Year/Month grain with values such as Month, Month-1, Month-2 OR Year, Year-1, Year-2
2. Define a PartitionConfig file with details of the Period to be considered within each partition defined above. Have a RefreshFlag on each row per partition to identify if it needs a refresh. (AAS Partition Config Sample File attached)
3. Join the PartitionConfig and Fact in Power Query to filter each partition on the respective period. This will enable the AAS Fact to dynamically partition data
4. The LastModifiedDate attribute will be used to detect which partition is changing in a cube
5. Develop a notebook that will change the RefreshFlag to 1 for all the changed records based on LastModifiedDate
6. The webhooks to refresh the AAS cube will receive the TMSL script with only partition names filtered with RefreshFlag as 1 (see the sketch after the M-Query example below)
7. Refresh from CSV files – each partition of the AAS tables should be connected to individual CSV files directly by using the below M-Query method to speed up the data load process. Refer the below example for one sample fact partition table; similar M-Query transformations should be created for the remaining partitions pointing to their respective CSV files
let
    Source = #"Folder/C:\Users\xxxx\xxxxx\Data",
    #"File Content" = Source{[#"Folder Path"="C:\Users\xxxx\xxxxx\Data\",Name="Fact-Partition-Jan-22.csv"]}[Content],
    #"Imported CSV" = Csv.Document(#"File Content",[Delimiter=",", Columns=102, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Month", Int64.Type}, {"Slno", Int64.Type}, {"Fld_1", type number}, {"Fld_2", type number}})
in
    #"Changed Type"
(Sample flow from R&D attached)
For the default Power BI incremental refresh policy (no custom partitions), configure the following for each table:
- table name
- archiving the data for no. of years/months = data which needs to be cleared
- incrementally refreshing the data for no. of years/months = data to be retained in the dataset (incremental refresh happens only on those partitions)
4. Specify the name of the LastModifiedDate column to enable detect data changes and refresh only impacted partitions
5. After applying all the above steps, the number of partitions will get created based on the months/years specification defined in PBI.
6. If the rows with date/time are no longer within the refresh period (outside of the refresh period specified for partition refresh), then no partition will get refreshed.
7. Ensure query folding is validated, to ensure the filter logic is included in the queries being executed against the data source.
8. However, other data sources may be unable to verify without tracing the queries. If Power BI Desktop is unable to confirm, a warning is shown in the Incremental refresh policy configuration dialog. If you see such a warning and want to verify the necessary query folding is occurring, use the Power Query Diagnostics feature or trace queries by using a tool supported by the data source, like SQL Profiler.
9. If query folding is not occurring, verify the filter logic is included in the query being passed to the data source. If not, it's likely the query includes a transformation that prevents folding – effectively defeating the purpose of incremental refresh.
Incremental Refresh- Power BI (custom partitions)
1. The ADB partitions will not have the month/year combination in the partition name. Instead, the partition name will be a generic name like P1, P2, etc. for the total number of partitions needed (e.g. 24 in case we are partitioning by month for 2 years)
2. The custom partitions made in the PBI Dataset will also have the same partition names. In effect, there will be a 1-1 map between the ADB partitions and the PBI dataset partitions. The advantages of this approach are
• No PBI dataset management needed for new partition creation, dropping partitions
• The PBI dataset can be refreshed using the PBI refresh API through an ADF pipeline which will execute after fact loading
3. A partition manager table within ADB delta will have the actual year month and any other attribute combination which identifies the data being stored in that ADB partition. The
flag ToBeRefreshed identifies if the partition needs to be refreshed. See example below
Fact Name PartitionName Year Month Country ToBeRefreshed
SalesOrderToInvoice P1_VN 2020 04 VN 0
SalesOrderToInvoice P1_ID 2020 04 ID 0
SalesOrderToInvoice P24_VN 2022 03 VN 1
4. When new data arrives for the next month, the data for the oldest partition needs to be overwritten. So we will order our partition master on the year and month columns in descending order and replace the value of the oldest partition with the Apr 2023 data; we will also set the refresh flag to 1 for this partition.
Fact Name PartitionName Year Month Country ToBeRefreshed
SalesOrderToInvoice P1_VN 2022 04 VN 1
SalesOrderToInvoice P1_ID 2020 04 ID 0
SalesOrderToInvoice P24_VN 2022 03 VN 0
5. Over time the partitions will be overwritten one at a time but the number of partitions will remain the same.
6. The ADF pipeline will use the PBI refresh APIs to refresh the custom partition based on the toberefreshed flag
7. Please use the attached document for more details (attachment: Microsoft Word Document)
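A minimal Python sketch of step 6 is shown below: only the partitions whose ToBeRefreshed flag is 1 are refreshed through the Power BI enhanced refresh API ("objects" list in the request body). The workspace/dataset ids and the access token are placeholders (token acquisition as in the earlier dataset-refresh sketch), and rows_to_refresh would be read from the partition manager table.

import requests

workspace_id, dataset_id = "<WorkspaceOrGroupID>", "<PBIDatasetID>"
access_token = "<spn-access-token>"                     # obtained via the client-credentials flow

rows_to_refresh = [                                     # read from the partition manager delta table
    {"FactName": "SalesOrderToInvoice", "PartitionName": "P24_VN", "ToBeRefreshed": 1},
]

body = {
    "type": "full",
    "objects": [
        {"table": row["FactName"], "partition": row["PartitionName"]}
        for row in rows_to_refresh
        if row["ToBeRefreshed"] == 1
    ],
}

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes",
    headers={"Authorization": f"Bearer {access_token}"},
    json=body,
)
resp.raise_for_status()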
Dynamic M-Query – Change source file path from config
Sample M-Query steps to change the source file path dynamically from the configuration for each table partition. Refer to the link for a sample AAS model (Link –
MQueryDynamicFile_FolderSelection.zip)
Partition P1 – Point the source to a specific file
let
Source = DataSource,
Container = Source{[Name ="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_File") and ([PartitionID] = "P1")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]) & List.Max(#"AASConfig_GetDataPath"[DateId]),
#"DataFile_GetContent" = Csv.Document(Container{[#"Name" = #"DataPath" & ".txt"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"DataFile_GetHeaders" = Table.PromoteHeaders(#"DataFile_GetContent", [PromoteAllScalars=true]),
#"DataFile_ChangedType" = Table.TransformColumnTypes(DataFile_GetHeaders,{{"DateId", type number}, {"Fld_1", type number}, {"Fld_2", type number}})
in
#"DataFile_ChangedType“
------------------------------------------------------------------------------------------------------------------------------------------------------
Partition P2 – Point source to a specific file
let
Source = DataSource,
Container = Source{[Name ="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_File") and ([PartitionID] = "P2")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]) & List.Max(#"AASConfig_GetDataPath"[DateId]),
#"DataFile_GetContent" = Csv.Document(Container{[#"Name" = #"DataPath" & ".txt"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"DataFile_GetHeaders" = Table.PromoteHeaders(#"DataFile_GetContent", [PromoteAllScalars=true]),
#"DataFile_ChangedType" = Table.TransformColumnTypes(DataFile_GetHeaders,{{"DateId", type number}, {"Fld_1", type number}, {"Fld_2", type number}})
in
#"DataFile_ChangedType“
Dynamic M-Query – Change source folder path from config
Sample M-Query steps to change the source folder path dynamically from the configuration for each table partition. Refer to the link for a sample AAS model (Link –
MQueryDynamicFile_FolderSelection.zip)
Partition P1 – Point the source to a specific folder to read all files for a specific partition
let
Source = DataSource,
Container = Source{[Name="unilever"]}[Data],
#"AASConfig_GetContent" = Csv.Document(Container{[#"Name" = "SP_Test/AASConfig.csv"]}[Content],[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.Csv]),
#"AASConfig_GetHeaders" = Table.PromoteHeaders(#"AASConfig_GetContent", [PromoteAllScalars=true]),
#"AASConfig_GetDataPath" = Table.SelectRows(#"AASConfig_GetHeaders", each ([Object] = "Point_To_A_Folder") and ([PartitionID] = "Partition")),
DataPath = List.Max(#"AASConfig_GetDataPath"[DataPath]), //& List.Max(#"AASConfig_GetDataPath"[DateId]),
#"Filtered_DataPath" = Table.SelectRows(Container, each Text.StartsWith([Name], #"DataPath")),
#"Filtered_FileExtn" = Table.SelectRows(#"Filtered_DataPath", each ([Extension] = ".txt")),
#"Removed Columns" = Table.RemoveColumns(#"Filtered_FileExtn",{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
#"Filtered Hidden Files" = Table.SelectRows(#"Removed Columns", each [Attributes]?[Hidden]? <> true),
#"Invoke Custom Function" = Table.AddColumn(#"Filtered Hidden Files", "Transform File", each #"Transform File"([Content])),
#"Removed Other Columns" = Table.SelectColumns(#"Invoke Custom Function", {"Transform File"}),
#"Expanded Table Column" = Table.ExpandTableColumn(#"Removed Other Columns", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File"))),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column",{{"DateId", Int64.Type}, {"Fld_1", Int64.Type}, {"Fld_2", Int64.Type}})
in
#"Changed Type"
Cube Security Design >> Workflow
Reference:
https://docs.microsoft.com/en-us/analysis-services/tutorial-tabular-1200/supplemental-lesson-implement-dynamic-security-by-using-row-filters?view=asallproducts-allversions
Cube Security Design
• Security on Product – Sample Input file
• Security on Specification – Sample Input file
• Security Model Tables
Important**: We will have to test the performance of user security with full data
Send Email Using Graph API
Problem Statement
Our current email sending process relies on service accounts with Office 365 licenses, which contradicts
company policy. These accounts are not meant for email use due to organizational and security constraints.
Solution
To resolve this issue, we require an alternative approach that involves utilizing shared mailboxes and granting Microsoft Graph permissions to an Azure AD app for sending emails from applications or automations. This adjustment ensures alignment with company policy while bolstering security measures.
Please find the attached document for detailed steps to send email using Graph API from ADF or Logic App.
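As an illustration, a minimal Python sketch of the Graph sendMail call from a shared mailbox is shown below; the tenant/app values, mailbox and recipient addresses are placeholders, and the app registration is assumed to hold the Mail.Send application permission (scoped to the shared mailbox via an application access policy where required). The attached document remains the reference for the ADF / Logic App setup.

import requests

tenant_id, client_id, client_secret = "<tenant-id>", "<client-id>", "<client-secret>"
shared_mailbox = "<shared-mailbox-address>"

token = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    },
).json()["access_token"]

message = {
    "message": {
        "subject": "Pipeline refresh completed",
        "body": {"contentType": "Text", "content": "The PDS refresh finished successfully."},
        "toRecipients": [{"emailAddress": {"address": "<recipient-address>"}}],
    },
    "saveToSentItems": "false",
}

resp = requests.post(
    f"https://graph.microsoft.com/v1.0/users/{shared_mailbox}/sendMail",
    headers={"Authorization": f"Bearer {token}"},
    json=message,
)
resp.raise_for_status()   # 202 Accepted on success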
Non Prod Environments
Azure Services Cost Optimization over Non-Prod environments
1. It is important that minimal configuration of services are leveraged over Non-Prod environments unless required for UAT testing
2. Dedicated SQL pool and Azure Analysis Services to be paused when not used on Dev/QA especially during Evenings IST. Automatic pausing should be scheduled. PPD
should be paused throughout and available only during UAT.
3. Databricks should not use Job cluster, rather fixed cluster should be used. Databricks should be using Standard All-purpose Compute DBU, rather than premium.
4. Virtual machines on Dev/QA/PPD should be chosen with very minimum configs. Park my cloud can be leveraged for automatic pausing
5. Only sample data should be used for validation rather than full data on QA/Dev; e.g. if you have 100 products in your full data, you can bring only 10 as a sample.
6. Choose small clusters for Databricks on non-prod environments
Dedicated SQL Pool – Dev: Compute Optimized Gen2, DWU 100 x 300 Hours (paused in IST evening); QA: Compute Optimized Gen2, DWU 400 x 300 Hours (paused in IST evening); PPD: Compute Optimized Gen2, DWU 400 x 300 Hours (paused at all times except during UAT)
Azure Analysis Services – Dev: Standard S1 (Hours), 1 instance, 500 Hours (paused in IST evening); QA: Standard S4 (Hours), 1 instance, 500 Hours (paused in IST evening); PPD: Standard S4 (Hours), 1 instance, 500 Hours (paused in IST evening)
Virtual Machine – Dev/QA/PPD: minimum config (paused in IST evening)
Databricks best practices
ADB Load types
To allocate the right amount and type of cluster resource for a job, we need to understand how different
types of jobs demand different types of cluster resources.
Machine Learning – To train machine learning models it is usually required to cache all of the data in memory. Consider using memory-optimized VMs so that the cluster can take advantage of the RAM cache. To size the cluster, take a % of the data set, cache it, see how much memory it used, and extrapolate that to the rest of the data. The Tungsten data serializer optimizes the data in memory, which means you'll need to test the data to see the relative magnitude of compression.
Analytical load – In this case, data size and deciding how fast a job needs to be will be the leading indicators. Spark doesn't always require data to be loaded into memory in order to execute transformations, but you'll at the very least need to see how large the task sizes are on shuffles and compare that to the task throughput you'd like. To analyze the performance of these jobs, start with basics and check if the job is bound by CPU, network, or local I/O, and go from there. Consider using a general purpose VM for these jobs.
Azure Data bricks config recommendations (**delta accelerated nodes)
Memory Optimized (when serving ETL/Analytical load which has high use of RAM due to data movement; insert/update/deletes for big tables):
• Dev – Standard Tier; Termination min = 30; Min Worker: 1; Max worker: 8; Driver: 4 core VM; Worker: 4 core VM; Approved VMs (based on job load select cores from 4 to 72): Eds_v4, Eds_v5, Eads_v5
• QA – Standard Tier; Termination min = 30; Min Worker: 1; Max worker: 8; Driver: 4 core VM; Worker: 4 core VM; Approved VMs: Eds_v4, Eds_v5, Eads_v5
• PPD – Premium Tier / Job cluster; Termination min = 30; Min Worker: 1; Max worker: 8; Driver & Worker: based on job load select cores from 4 to 72; Approved VMs: Eds_v4, Eds_v5, Eads_v5
• Prod – Premium Tier / Job cluster. Small load: Min Worker: 1; Max worker: 10. Heavy load: Termination min < 15; Min Worker: 9 (based on workload); Max worker: 10. Driver: based on job load select cores from 4 to 72; Worker: based on job load select cores from 4 to 72*; Approved VMs: Eds_v4, Eds_v5

General Purpose (when serving Analytical load with high CPU due to regression and other compute):
• Dev – Standard Tier; Termination min = 30; Min Worker: 1; Max worker: 8; Driver: 4 core VM; Worker: 4 core VM; Approved VMs: Dds_v5, Dads_v5
• QA – Premium Tier; Termination min = 30; Min Worker: 1; Max worker: 8; Driver: 4 core VM; Worker: 4 core VM; Approved VMs: Dds_v5, Dads_v5
• PPD – Premium Tier / Job cluster; Termination min = 30; Min Worker: 1; Max worker: 8; Driver & Worker: based on job load select cores from 4 to 72; Approved VMs: Dds_v5, Dads_v5
• Prod – Premium Tier / Job cluster. Small loads: Min Worker: 1; Max worker: 10. Heavy loads: Termination min < 15; Min Worker: 9 (based on workload); Max worker: 10. Driver & Worker: based on job load select cores from 4 to 72*; Approved VMs: Dds_v5
* Refer next slide to identify best cluster sizing after some development is completed
Azure Databricks Cluster – How to select?
1. Run End to end pipeline with all notebooks in full load.
2. Open the Ganglia UI from the Metrics tab in the cluster logs.
3. Check overall load of CPU and Memory, one of these should be consumed ~80%.
4. If both are underused, try reducing your cluster config and run again
5. For worker config check load on each worker in nodes section.
1. If CPU usage is sufficient but memory is touching 100%, the recommendation is to optimize the code to clear cache so that ~30% of RAM can be freed up. If that is not possible, increase the worker to the next size.
Databricks Best Practices
A shuffle occurs when we need to move data from one node to another in order to complete a
stage. Depending on the type of transformation you are doing you may cause a shuffle to occur.
This happens when all the executors require seeing all of the data in order to accurately perform
the action. If the job requires a wide transformation, you can expect the job to execute slower because all of the partitions need to be shuffled around in order to complete the job, e.g. GROUP BY, DISTINCT.
Databricks queries usually go through a very heavy shuffle operation due to the following:
JOIN()
DISTINCT()
GROUPBY()
ORDERBY()
And technically some actions like count() (very small shuffle )
Resolution
• Don't use wide tables (exclude all unwanted columns from your data).
• Filter records before joins.
• Avoid cross joins at any cost.
• Don't do too many nested queries on big tables (views inside views).
• Increase the number of shuffle partitions, e.g. if your shuffle size is 250 GB and you need 200 MB partitions for joining: (250*1024)/200 = 1280
spark.conf.set("spark.sql.shuffle.partitions", 1280)
• It’s good practice to write a few interim tables that will be used by several users or queries on a regular basis. If a dataframe is
created from several table joins and aggregations, then it might make sense to write the dataframe as a managed table
Azure Databricks Cluster – Track marketwise utilization/costs
For a multi-market rollout or multi-product solution, a question that would come up from business is to specify the Azure cost incurred per market, so that markets pay only as per actual resource utilization. With respect to Databricks, the way to achieve market-wise cost reporting is through the use of custom tags on clusters.
The two types of clusters and the guidelines to add custom tags in each case are as follows (see the sketch below):
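As an illustration, a minimal sketch of the custom_tags block in a cluster definition is shown below; the JSON shape follows the Databricks Clusters API, and the tag keys/values, node type and names are placeholders only.

cluster_spec = {
    "cluster_name": "pds-<ProductName>-etl",
    "spark_version": "13.3.x-scala2.12",            # an LTS Databricks runtime, as recommended earlier
    "node_type_id": "Standard_E8ds_v5",              # from the approved VM list
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "autotermination_minutes": 30,
    "custom_tags": {                                 # surfaced as Azure tags for per-market cost reporting
        "Market": "VN",
        "Product": "<ProductName>",
        "ITSG": "<itsg-id>",
    },
}
# The same tags can be set in the cluster UI (Advanced options > Tags) for all-purpose clusters,
# or in the job / linked-service cluster definition for job clusters.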
Databricks Table Size calculations
To Calculate Size of databricks tables with active data
%scala
spark.read.format("delta").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/").queryExecution.analyzed.stats
spark.read.format("parquet").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/").queryExecution.analyzed.stats
spark.read.format("csv").load("dbfs:/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/file.csv").queryExecution.analyzed.stats
To calculate the size of the ADLS folder (soft-deleted and VACUUM data size along with active data)
%py
def recursiveDirSize(path):
    total = 0
    dir_files = dbutils.fs.ls(path)
    for file in dir_files:
        if file.isDir():
            total += recursiveDirSize(file.path)
        else:
            total += file.size
    return total

recursiveDirSize("/mnt/adls/PowerBIRefresh/PowerBIRefreshTest_100K_4GB/Processed_Parquet/")
Trainings
Innovations
1. Is there a way to get the glossary cum documentation of Databricks code? >> Unity Catalog
2. Currently there is a timeout of 1 hour between Databricks and PBI connection, can we get rid of it?
3. In case of Big data analytics, we tried using ADB as Direct query to Power BI, which did not work
very well. Is there any alternate solution for Big data analytics using ADB and PBI?
4. We were looking forward to come up with a health dashboard to have all the cluster configurations
along with Worker nodes in a single place for all instances within Unilever with the cost of each
cluster. Can we get some help? >> Overwatch
5. Can we have an optimizer guide for ADB notebooks which points out exactly where code needs
optimization ?
6. Can we have some data reconciliation tool over ADB?
7. Is there any feature of Graph expected ?
Databricks spot instances
What is a SPOT INSTANCE
• Azure’s eviction policy makes Spot VMs well suited for Azure Databricks,
whose clusters are resilient to interruptions for a variety of data and AI use
cases, such as ingestion, ETL, stream processing, AI models, batch scoring
and more.
Landscape Report for Non-Spot ADB cost
Landscape publishes a report to see Non spot usage per product.
The plan is to request spot instances during both Interactive and Job cluster creation. If spot instances are not available and Azure defaults to full-priced VMs, the report above will still show Non-Spot usage.
Enable Spot instance for “All Purpose Cluster”
Steps to enable SPOT Instance for Job Cluster
STEP 1 : Create cluster pool with required instance type as per the highlighted config
Steps to enable SPOT Instance for Job Cluster
STEP 2 : Edit ADF linked service to select cluster as “Existing Instance Pool”
• Select appropriate instance pool from the drop down.
• Select OK
Steps to enable SPOT Instance for Job Cluster
STEP 3 : Run ADF Job that uses Databricks notebook activity
• Navigate to ADB Instance pool to verify Instance status and attached clusters.
ADF Job Running – 2 instances created and 2 are busy during the Job Run
PBI Premium Capacity Onboarding
Request for PBI Premium Workspace Provisioning
• Product Architecture sign-off from SFD team
• Request for PPU licenses (1-2) for PBI Datasets (large dataset). Link mentioned in the Access Requests slide
Steps to procure a Premium Workspace for new projects (1/2)
1. Clarify reason why a Premium workspace is needed
Below are some of the reasons why a Premium workspace is provisioned
Premium Capacity features – Incremental Refresh / Data Volume greater than 1 GB
Steps to procure a Premium Workspace for new projects (2/2)
3. Build team to produce the below reports as an output of the Best Practice Analyzer feature, part of Tabular Editor.
4. Validate all the points mentioned above; once passed, contact the design team to onboard the project onto SF Prod1 Premium Capacity.
5. Prior to final sign-off – ask for the Performance Analyser Tool [Dashboard performance checks] results from the build team and ensure the SLAs for the dashboards are met.
-- Check there are no redundant KPIs.
Power BI dataset connectivity with Webapp
Reference documentation: Power BI REST API call (Execute Queries); ADLS connection with PBI Datasets & Webapp (attachment: Microsoft Word Document)
[Flow diagram: (1) Web Application, (2) user favourite selections are cached against the PBI cached dataset / semantic layer, (3) the URL query filter is provided, (4) the exported report is the final report with the filters applied.]
Power BI – Powerful feature: Field Parameters
https://docs.microsoft.com/en-us/power-bi/create-reports/power-bi-field-parameters
1. Field parameters allow users to dynamically change the measures or dimensions being analysed within a report.
2. Recommended usage instead of a large number of bookmarks to change visuals (bookmarks hinder performance of the dashboard).
3. Also recommended to be used in place of nested SWITCH cases (these also cause performance issues and are not recommended).
PBI Visuals – Some Best Practices
Reduce number of individual visuals
Category Performance Report: squares and triangles are individual visuals which are displayed using formatting. There are 28 visuals in the left-hand visual.
Reduce number of individual visuals – use tables instead of multiple cards, whenever possible.
Top Exit Pages: the screenshot shows 6 cards, where we should have two (transparent) tables, one with two columns (top line) and the other with 4 measures (bottom line).
However, by removing the filter, we were getting more rows than expected in the visual – this is because some of the measures displayed were returning values even when the condition [# of invoices]>0 was NOT met.
The way to work this out was to change all those measures and force a result of blank when [# of invoices]>0 wasn't met. I have implemented those measures, all of them with the following pattern:
alex_#_of_outlets =
VAR _InvCount = [# of Invoices]
RETURN IF(_InvCount > 0, [# of Outlets], BLANK())
Visual Level Filter
Channel Performance
Uncheck Show items with no data
Distributor Performance
DAX errors
Distributor Performance
Zeplin design and PBI Visual Limitations
Zeplin Design 1 (with spaces in between the bars) – alternate visual which the build team has developed on the PBI report: clustered column chart
Zeplin design and PBI Visual Limitations
Zeplin Design 3 – alternate visual: single-colour chart
Zeplin Design 4 (bars embedded with spacing) – alternate visual: stacked column chart
Zeplin design and PBI Visual Limitations
Zeplin Design 5 (-ve axis bar chart) – alternate visual: clustered column chart, +ve axis only
Web app best practices
WebAPP Best Practices
Area Section Guideline
Application Structure Serverless Approach Use Azure Functions in most common cases where there is a need to write custom reminders, notifications and scheduling tasks
Codebase Code quality check Perform code quality analysis by removing all warnings; if using a recent .NET version, code analysis is enabled by default.
Codebase Class libraries Use a common class library for all re-usable common components (common validation, common mathematical calculations) in one place and use it across the project
Codebase Security Use a data encryption algorithm for sensitive data. Some common C# encryption algorithms are attached in the Appendix A sheet for reference
Codebase Code Remove unused code and hard-coded values from the code base. Data travelling in the web app which is fetched from the database or any source should be aligned with the master data primary key id value, not the text value.
Codebase Session In session, mostly store key-value pair data and do not store large object (>10 MB) entity data
Codebase Code Error log message format should include method name, line number, exception detail message and path
Codebase Codebase Source code must be properly commented to increase readability
Codebase/Infra Memory issues Implement finite loops and a single run instance of the application locally to reduce out-of-memory issues
Devops CICD pipeline Ensure code deployment is done through the CI/CD pipeline in all environments
Document Document Use a readme text file to document steps for new users to follow across the project
Infra Web server Ensure web server scalability and a recommended configuration: processor: 2 x 1.6 GHz CPU; RAM: 8-16 GB; HDD: 1 x 80 GB of free space or more is recommended; VM: Basic Medium VM
Infra Infra Notify about your site's downtime, including data migration, critical deployment or any infra-level task, as per project requirement
Logging Logging Ensure logging and tracing are implemented at info, debug and error level by using the Azure log monitoring / AppInsights service on critical functionality. It is mandatory on production
Testing Testing Use an API testing tool (Swagger / Postman, free latest version) to test APIs
Security Security Complete penetration test on priority system i.e. PS1 (e.g. if SC1 and DR1 then PS1) for the project by the Infosec team (cost included). Refer Appendix B
Security Security The web app should be accessed via a meaningful domain name with an SSL domain certificate activated and a WAF security standard policy. Steps for reference are attached in Appendix C. It is mandatory before business go-live
A vulnerability scan should be performed on the web app on prod or any prod-identical environment (no cost). Refer Appendix D. It is mandatory before business go-live
Security Security Use key vault azure service to hold secrets and credentials values.
Security Security Secured data in transit across the layer
Security Security Any open source coding used should be latest version
Security Security Latest version of .Net framework used
Security Security Implementation of HTTP strict transport security(HSTS) for critical interfaces
Security Security Ensure no proprietary algorithm used without proper validation
Security Security User and role based privileges must be verified thoroughly
Authentication: Web App should use AAD authentication
Authorization: Custom coding is required to implement role-based authorization for a specific Web API call.
Security Security Implement external user authentication via AAD. Refer Appendix D for steps.
Application API Architecture • The front end should not request too much data from the backend; we must implement pagination at the backend and only return < 50 records to the frontend in one request.
• For any use case of data export, we must use web jobs or download APIs which will do the file creation in the backend only and binary-stream the file to the frontend or give a download link.
• If the backend API is hosted on a different server than the frontend, communication between front end and backend must be secured using App Service authentication and authorization and firewall IP restriction.
Webapp to SQL Server Connectivity Best Practices
A web app can connect to an Azure SQL database using Managed Service Identity. There is no need of username or password in the connection string.
Setup
1. Enable the MSI: We need to enable MSI for the web app. To do that, navigate to your web app and under Settings, click on 'Identity'. Under the System assigned tab, change the status to On. It will generate the Object ID. Create a service request for the Landscape team to give permission to access SQL from the web app by providing all resource groups, SPN and SQL server details.
Service Catalog Item - Employee Service Center (service-now.com)
2. Modify the application code and consume the connection string without id and password.
"server=tcp:<server-name>.database.windows.net;database=<db-name>;Authentication=Active Directory Interactive"
3. Local development machines implementing the application code should be MSI-enabled DTL machines. For local development it is required to whitelist the IP address of the DTL machine, and the same needs to be mentioned in the service request.
NOTE: POC application is available to connect SQL server from c# code by using MSI authentication
WebAppSQLAADAuth.zip
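For illustration only, the same managed-identity connection can be exercised from Python with pyodbc, assuming Microsoft ODBC Driver 17 or later (which supports Authentication=ActiveDirectoryMsi); server and database names are placeholders, and the attached POC remains the reference for the C# implementation.

import pyodbc

conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<server-name>.database.windows.net,1433;"
    "Database=<db-name>;"
    "Authentication=ActiveDirectoryMsi;"        # no username/password stored anywhere
)

with pyodbc.connect(conn_str) as conn:
    row = conn.cursor().execute("SELECT SYSDATETIME()").fetchone()
    print(row[0])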
WebAppDesign (1/2)
[Architecture diagram: User → Front end (Web UI) → API (Web API, Web Business Access Layer, Web Client Services), with Azure Active Directory for authentication, Redis Cache, SQL Database, Azure Blob Storage, Azure Function, Azure Key Vault, Azure Monitor, Azure DevOps and a third-party/external data ingestion system; numbered flows 1–16.]
Domain name, SSL and WAF onboarding
(Steps attachment: DomainSSLWAFSteps. Teams involved: Build team, Landscape team, Akamai CDN team, DNS team, WAF team.)
1. Build team gets the DNS name from business for the new web application URL.
2. Build team generates a service request to the Landscape team to get the CNAME, txt value and root domain value for SSL validation.
3. Build team generates a service request to the DNS team to add the "CNAME", "txt" and "root domain" values and sends an email to the DNS team with the request number.
4. Build team checks with Landscape to validate that these values have been added.
5. Build team raises a request to the Landscape team to purchase the SSL certificate from Microsoft for the production environment.
6. Build team raises a JIRA request to the Akamai CDN team to add the delivery configuration.
7. After configuration, the @Akamai_CDNConfig team raises the request for the DNS txt entry at DNS level.
8. DNS team confirms on the email that the txt value has been added.
9. The @Akamai_CDNConfig team completes the CDN property configuration and shares the SAN edge key and server information/CNAME change.
10. Build team raises a request to the DNS team and shares the CNAME (e.g. e2ebolt.unilever.com) and IP address (e.g. 20.107.224.53).
11. Build team raises a request to the WAF team for WAF onboarding.
12. Build team verifies the URL on the staging environment as per the steps provided by the WAF team.
13. WAF team pushes the changes to production and shares the IP address list in the email.
14. Build team raises a new SR for the Landscape team to whitelist the IP list received from the WAF team.
C# Best Practices (1/2)
C# Best Practices (2/2)
Good Programming Practices
Avoid writing long functions. The typical function should have max 40-50 lines of code. If method has more than 50 line of code, you must consider re factoring into separate private methods.
Avoid writing long class files. The typical class file should contain 600-700 lines of code. If the class file has more than 700 line of code, you must create partial class. The partial class combines code into single unit
after compilation.
Don't put multiple classes in a single file. Create a separate file for each class.
Avoid the use of var in place of dynamic.
Add a whitespace around operators, like +, -, ==, etc.
Always succeed the keywords if, else, do, while, for and foreach, with opening and closing parentheses, even though the language does not require it.
Method names should be meaningful so that they do not mislead. A meaningful method name doesn't need code comments.
The method / function / controller action should have only a single responsibility (one job). Don't try to combine multiple functionalities into a single function.
Do not hardcode strings or numbers; instead, create separate files for constants and put all constants into them, or declare constants at the top of the file and refer to these constants in your code.
While comparing string, convert string variables into Upper or Lower case
Use String.Empty instead of “”
Use enums wherever required. Don’t use numbers or strings to indicate discrete values
The event handler should not contain the code to perform the required action. Instead call another private or public method from the event handler. Keep event handler or action method as clean as possible.
Never hardcode a path or drive name in code. Get the application path programmatically and use relative paths. Use the input/output classes (System.IO) to achieve this.
Always do null check for objects and complex objects before accessing them
Error message to end use should be user friendly and self-explanatory but log the actual exception details using logger. Create constants for this and use them in application.
Avoid exposing public methods and properties unless they really need to be accessed from outside the class. Use internal if they are accessed only within the same assembly and use private if used only in the same class.
Avoid passing many parameters to function. If you have more than 4-5 parameters use class or structure to pass it.
While working with collection be aware of the below points,
While returning collection return empty collection instead of returning null when you have no data to return.
Always use the Any() operator instead of checking the count (i.e. collection.Count > 0) and checking for null
Use foreach instead of for loop while traversing.
Use IList<T>, IEnumerable<T>, ICollection<T> instead of concrete classes, e.g. using List<>
Use object initializers to simplify object creation.
The using statements should be sorted by framework namespaces first and then application namespaces, in ascending order
If you are opening database connections, sockets, file stream etc, always close them in the finally block. This will ensure that even if an exception occurs after opening the connection, it will be safely closed in the
finally block.
Simplify your code by using the C# using statement. If you have a try-finally statement in which the only code in the finally block is a call to the Dispose method, use a using statement instead.
Always catch only the specific exception instead of catching generic exception
Use StringBuilder class instead of String when you have to manipulate string objects in a loop. The String object works in a weird way in .NET. Each time you append a string, it is actually discarding the old string
object and recreating a new object, which is a relatively expensive operation.
Web Application TG Connectivity
PDS
Points of Contact for Access
Access Requests (Owner: Delivery Team)
Table below provides details on requesting access for different Azure components
Component Link to SR / Document
Create Mountpoint in databricks 1. Product team to raise service now request using the below link
2. Cloud Platform Management: Create Mountpoint
Access to UDL 1. Product team to raise service now request using the below link
2. Universal Data Lake: Data Access
3. Provide databricks SPN details to enable accessing mount paths
Access to BDL 1. Once mount point is created in databricks, Landscape team will request for approval from respective business
owner in an email
2. Product team to raise request with business owner
3. Once approved, product team to provide approval email to Landscape team to enable SPN access
PPU license for Dev team Attached document has all the details
Web App configurations It is mandatory to implement following for web applications. Follow attached for detailed implementation
1. Akamai WAF onboarding
2. Custom DNS name
3. SSL certificate
** Note: WAF onboarding requires an SSL certificate to be procured. In Dev and QA, the certificate can be provisioned by the build team, which is free of cost and valid for 1 year. For production, this will need to be procured and will come with a cost of $69.99/year with yearly auto renewal.
(Attachments: Web application configurations, External Users - ACAM process)
(attachment: Microsoft Word Document)
BDL Enriched and DDL layer Ajey Kartik
Teams Tab web app integration – To create a PP account and add it in the Teams Admin Center:
Malviya, Raksha <[email protected]>; Subbappa, Rashmi <[email protected]>
For MS Teams license assignment to PP accounts:
[email protected]
[email protected]
[email protected]
[email protected]; Janaka Wijesekara <[email protected]>; Iyengar, Srinivasa <[email protected]>;
Preetha J <[email protected]>; [email protected]
MS TEAMS APP deployment team:
[email protected]
[email protected]
[email protected]
Access Requests – Data Owners (Owner: Delivery Team)
Marketing [email protected]
Finance [email protected]
Appendix A – Web App String Encryption Algorithm
(Attachments: Microsoft Word Document, Microsoft PowerPoint Presentation, Microsoft Word Document – WebAppVulnarabilityScan)