Data Engineer
End-to-end project with Continuous Integration and Continuous Deployment (CI/CD)
Author: Shanmukh Sattiraju
https://in.linkedin.com/in/shanmukh-sattiraju
End to End flow
Data science
LMS Data
Analysis
Data Factory
Lakehouse
BI Reports Data-Science ML
Datalake
dataframe.write\
    .format("parquet")\
    .save("/data/")

dataframe.write\
    .format("delta")\
    .save("/data/")
Store
Source Report
Computes
Onelake
SaaS Foundation
SKU    Capacity units (CU)    Price
F2     2                      $0.36/hour
F4     4                      $0.72/hour
F8     8                      $1.44/hour
F16    16                     $2.88/hour
F32    32                     $5.76/hour
F64    64                     $11.52/hour
Storage Price
OneLake storage/month    $0.023 per GB
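The prices above can be combined into a rough monthly estimate. The sketch below is only an illustration: the 730 hours per month figure and the function name are assumptions, and it presumes the capacity runs for the whole month at the listed hourly rate.

```python
# Hourly prices per SKU and the storage rate, copied from the tables above (USD).
HOURLY_PRICE_USD = {
    "F2": 0.36, "F4": 0.72, "F8": 1.44,
    "F16": 2.88, "F32": 5.76, "F64": 11.52,
}
ONELAKE_USD_PER_GB_MONTH = 0.023
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def estimate_monthly_cost(sku: str, storage_gb: float,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Estimate compute plus OneLake storage cost for one month (hypothetical helper)."""
    compute = HOURLY_PRICE_USD[sku] * hours
    storage = ONELAKE_USD_PER_GB_MONTH * storage_gb
    return round(compute + storage, 2)


# Example: an F2 capacity running all month with 100 GB stored in OneLake.
print(estimate_monthly_cost("F2", storage_gb=100))
```

Passing a smaller `hours` value models a capacity that is paused for part of the month.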
Item
Workspace
Experience Item
Workspace
Fabric home
Data pipeline
Data
Finance
Engineering
Notebook
Fabric home
Semantic
Model
Power BI Marketing
Report
OneLake
Onelake
SaaS Foundation
Serverless Compute
Onelake
Analysis
Spark SQL KQL Services
Serverless Compute
Capacity Capacity
Role         Can add admins?  Can add members?  Can write data and create items?  Can read data?
Admin        Yes              Yes               Yes                               Yes
Member       No               Yes               Yes                               Yes
Contributor  No               No                Yes                               Yes
Viewer       No               No                No                                Yes
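The role table can be read as a simple lookup. The sketch below transcribes it into a dictionary; the capability keys and helper names are made up for illustration and are not part of any Fabric API.

```python
# Workspace role capabilities, transcribed from the role table above.
# The capability keys are hypothetical names chosen for this example.
ROLE_CAPABILITIES = {
    "Admin":       {"add_admins": True,  "add_members": True,  "write": True,  "read": True},
    "Member":      {"add_admins": False, "add_members": True,  "write": True,  "read": True},
    "Contributor": {"add_admins": False, "add_members": False, "write": True,  "read": True},
    "Viewer":      {"add_admins": False, "add_members": False, "write": False, "read": True},
}


def can(role: str, capability: str) -> bool:
    """Return True when the role has the capability according to the table."""
    return ROLE_CAPABILITIES[role][capability]


def minimum_role(capability: str) -> str:
    """Return the least-privileged role that has the capability."""
    for role in ("Viewer", "Contributor", "Member", "Admin"):
        if can(role, capability):
            return role
    raise ValueError(f"No role grants {capability!r}")


print(minimum_role("write"))  # least-privileged role that can write data
```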
Managed
Unidentified
The Unidentified folder displays any folders or files present in the managed area that lack associated tables.
If a table is created in a format other than Delta, it is automatically saved in the Unidentified folder.
Dataset
Pipeline
Pipeline
Dataset
Pipeline
Pipeline
Pipeline
Pipeline
✓
Azure Data Lake
OneLake
✓
Pipeline Triggers in Fabric
Triggers invoke pipelines in Data Factory
Azure Datalake
Lakehouse
OneLake
Amazon S3
Dataverse
Shortcuts
Container: shortcutfile
SubFolder: Emp Section: Table
Files: Emp1.csv,Emp2.csv,etc
Delta File
Parquet File
Creating a shortcut in Table
Shortcuts in Tables
• In the Tables folder, you can create shortcuts only at the
top level. Shortcuts aren't supported in other
subdirectories of the Tables folder.
• The data must be in Delta/Parquet format so that the lakehouse
automatically synchronizes the metadata and recognizes
the folder as a table.
Shortcuts in Files
• If the data at the shortcut location is organized in subdirectories,
create the shortcut in the Files section.
• If the data is not in Delta/Parquet format, create the shortcut in the Files section.
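The choice between the Tables and Files sections can be sketched as a small check. This is an illustration under stated assumptions, not how Fabric decides: it runs against a local folder standing in for the shortcut location, it treats a folder as a Delta table when it contains a `_delta_log` directory (the Delta Lake transaction log), and the function name is hypothetical.

```python
import tempfile
from pathlib import Path


def suggest_shortcut_section(folder: Path) -> str:
    """Suggest "Tables" when the folder itself looks like a Delta table, else "Files"."""
    # Assumption: a Delta table folder has a _delta_log directory at its top level.
    has_delta_log = (folder / "_delta_log").is_dir()
    return "Tables" if has_delta_log else "Files"


# Build two throwaway folders to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    delta_table = root / "Emp_delta"              # hypothetical Delta table folder
    (delta_table / "_delta_log").mkdir(parents=True)

    csv_folder = root / "Emp"                     # plain folder of CSV files
    csv_folder.mkdir()
    (csv_folder / "Emp1.csv").write_text("id,name\n1,A\n")

    delta_result = suggest_shortcut_section(delta_table)
    csv_result = suggest_shortcut_section(csv_folder)

print(delta_result, csv_result)
```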
Write access
Update data
Read access
Update data
Write access
Gets updated Update data
Update data
✓
Shortcut deletion scenarios
Delete data
✓
Scenario 2: Delete a specific content in ADLS
Delete data
Gets deleted
✓
Delete data
Delete data
✓
Scenario 4: Delete a specific content of delta table in ADLS
Delete data
Gets deleted
Delete data
✓
Scenario 5: Deleting shortcut completely in Lakehouse
Delete shortcut
Shortcut deletion scenarios
Scenario    Section       Action                                             Result
Scenario 1  File          Delete content in Shortcut of Files section        Deletes in Datalake ✓
Scenario 2  File          Delete a specific content in ADLS                  Deletes in Lakehouse ✓
Scenario 3  Table         Delete content in Shortcut of Tables section       Deletes in Datalake ✓
Scenario 4  Table         Delete a specific content of delta table in ADLS   Deletes in Lakehouse ✓
Scenario 5  File & Table  Deleting shortcut completely in Lakehouse          ADLS data will not be deleted
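The five scenarios follow from one idea: a shortcut is a reference to the source data, not a copy. The toy model below illustrates that behavior; the classes are hypothetical stand-ins and do not call any Fabric or ADLS API.

```python
class DataLake:
    """Toy stand-in for an ADLS container: just a set of file names."""

    def __init__(self, files):
        self.files = set(files)


class Lakehouse:
    """Toy lakehouse whose shortcuts hold references to a data lake, not copies."""

    def __init__(self):
        self.shortcuts = {}

    def add_shortcut(self, name, target):
        self.shortcuts[name] = target

    def list_files(self, name):
        return sorted(self.shortcuts[name].files)

    def delete_file(self, name, file_name):
        # Scenarios 1 and 3: deleting through the shortcut deletes at the source.
        self.shortcuts[name].files.discard(file_name)

    def delete_shortcut(self, name):
        # Scenario 5: only the reference goes away; the source data stays.
        del self.shortcuts[name]


adls = DataLake(["Emp1.csv", "Emp2.csv", "Emp3.csv"])
lakehouse = Lakehouse()
lakehouse.add_shortcut("Emp", adls)

lakehouse.delete_file("Emp", "Emp1.csv")   # delete in the lakehouse shortcut
adls.files.discard("Emp2.csv")             # scenarios 2 and 4: delete in ADLS
seen_in_lakehouse = lakehouse.list_files("Emp")

lakehouse.delete_shortcut("Emp")           # scenario 5
print(seen_in_lakehouse, sorted(adls.files))
```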
Synapse Data Engineering
Spark Pools
Starter Pools
Default
compute
Auto scale On
Dynamic Allocation On
Custom Pools
Notebook
• Multi-task:
    • One user can use multiple notebooks with one session
    • Prevents delays due to session creation
• Security:
    • Session sharing is always within a single user boundary
• Cost-effective:
    • Better resource utilization and cost-saving
Fabric Workspace
Identity
Synapse - Choosing Between Spark Notebook vs Spark Job Definition - Microsoft Community Hub
[Figure: two charts labeled HR and IT, vertical axis from 0 to 100, at 25-06-2024 10:01:30]
Stage                  Usage                             Impact
Interactive delay      10 minutes < usage <= 60 minutes  User-requested interactive jobs are delayed 20 seconds at submission.
Interactive rejection  60 minutes < usage <= 24 hours    User-requested interactive jobs are rejected.
Background rejection   Usage > 24 hours                  User-scheduled background jobs are rejected and not executed.
Microsoft Documentation: Understand your Fabric capacity throttling - Microsoft Fabric | Microsoft Learn
Overage protection
[Figures: capacity utilization (0 to 120) plotted over time, starting at 25-06-2024 10:01:30. The panels show overage above 100 and the successive windows of future capacity consumed: CU usage <= 10 mins, 10 mins < CU usage <= 60 mins, 60 mins < CU usage <= 24 hours, and CU usage > 24 hours.]
Stage                  Usage                             Impact
Overage protection     Usage <= 10 minutes               Jobs can consume 10 minutes of future capacity use without throttling.
Interactive delay      10 minutes < usage <= 60 minutes  User-requested interactive jobs are delayed 20 seconds at submission.
Interactive rejection  60 minutes < usage <= 24 hours    User-requested interactive jobs are rejected.
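The throttling stages are a set of thresholds on how much future capacity has already been consumed. A minimal sketch of that mapping, with usage expressed in minutes (the function name is hypothetical, and the background rejection stage is taken from the earlier throttling table):

```python
def throttling_stage(future_usage_minutes: float) -> str:
    """Map minutes of future capacity already consumed to a throttling stage."""
    if future_usage_minutes <= 10:
        return "Overage protection"      # no throttling yet
    if future_usage_minutes <= 60:
        return "Interactive delay"       # interactive jobs delayed 20 seconds
    if future_usage_minutes <= 24 * 60:
        return "Interactive rejection"   # interactive jobs rejected
    return "Background rejection"        # background jobs rejected as well


for minutes in (5, 20, 90, 25 * 60):
    print(minutes, "->", throttling_stage(minutes))
```

The boundaries are inclusive on the upper side, matching the `<=` conditions in the table.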
Lakehouse Warehouse
Data Factory | Synapse Data Engineering | Synapse Data Warehousing | Synapse Data Science | Synapse Real Time Analytics | Power BI | Data Activator
Onelake
SaaS Foundation
Lakehouse vs Warehouse
Data Factory | Synapse Data Engineering | Synapse Data Warehousing | Synapse Data Science | Synapse Real Time Analytics | Power BI | Data Activator
Delta Delta
Lakehouse Onelake Warehouse
SaaS Foundation
Lakehouse Warehouse
Lakehouse
Warehouse
Update data from Lakehouse and Warehouse
UPDATE
CREATE TABLE
    { database_name.schema_name.table_name | schema_name.table_name | table_name }
AS CLONE OF
    { database_name.schema_name.table_name | schema_name.table_name | table_name }
    [AT {point_in_time}] -- 'YYYY-MM-DDThh:mm:ss'
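To make the syntax concrete, the sketch below assembles a clone statement as a string, with an optional point in time formatted as 'YYYY-MM-DDThh:mm:ss'. The helper and the target name `dbo.Employee_Clone` are hypothetical; the helper only builds text and does not validate identifiers or run anything against a warehouse.

```python
from datetime import datetime
from typing import Optional


def build_clone_statement(target: str, source: str,
                          point_in_time: Optional[datetime] = None) -> str:
    """Build a CREATE TABLE ... AS CLONE OF statement (hypothetical helper)."""
    statement = f"CREATE TABLE {target} AS CLONE OF {source}"
    if point_in_time is not None:
        # Format the timestamp as YYYY-MM-DDThh:mm:ss, as in the syntax above.
        timestamp = point_in_time.strftime("%Y-%m-%dT%H:%M:%S")
        statement += f" AT '{timestamp}'"
    return statement + ";"


# Clone of the current state of the table.
print(build_clone_statement("dbo.Employee_Clone", "dbo.Employee"))

# Clone of the table as it was at a past point in time.
print(build_clone_statement("dbo.Employee_Clone", "dbo.Employee",
                            datetime(2024, 6, 25, 10, 1, 30)))
```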
[dbo].[Dept] [dbo].[Employee]
Tables Files
Tables/Folder Files/Folder
Marketing
Capacity Capacity
Workspace Workspace
Workspace Workspace
Role         Can add admins?  Can add members?  Can write data and create items?  Can read data?
Admin        Yes              Yes               Yes                               Yes
Member       No               Yes               Yes                               Yes
Contributor  No               No                Yes                               Yes
Viewer       No               No                No                                Yes
✓ ✓
Connect to SQL Analytics endpoint
✓ ✓
Read data and shortcuts through SQL endpoint
✓ ✓ ✓ ✓
✓ ✓
Read through OneLake API
✓
Read through Spark (Shortcut)
✓ ✓ ✓
Create , Modify tables / views, etc.
✓ ✓ ✓
Shortcuts:
1. For objects internal to Fabric, reading through a shortcut needs additional permission at the shortcut destination
2. ADLS shortcuts use a delegated authorization model
WSP_1 WSP_2
LH_A LH_B
WSP_2
T-SQL
✓
Bob (having Contributor role in WSP_2)
Accessing shortcuts internal to fabric
This behavior is different when the shortcut is accessed from a notebook:
the caller needs to have access at the shortcut destination.
LH_B
Azure Data Lake storage
Alice (having Storage Blob Data Contributor role in ADLS)
WSP_2
T-SQL
✓ ✓
Bob (having Contributor role in WSP_2)
Accessing ADLS shortcuts
Lakehouse permissions
Lakehouse actions Admin Member Contributor Viewer
Items Sharing
Data pipeline
Data flow Gen2
Event Stream
WSP_1
Read, ReadData permissions
No role in workspace
Steve
WH_A
WSP_2
LH_A
WSP_1
Read, ReadData, ReadAll permissions
No role in workspace
Steve
WH_A
✓
WSP_2
LH_A
Read all SQL Endpoint data
• Read data from the SQL analytics endpoint of the Lakehouse.
• User cannot create or modify tables.
• Need GRANT / MODIFY from admins to make changes.
Read all Apache Spark (ReadAll)
• Read Lakehouse data through OneLake APIs and Spark.
• Read Lakehouse data through Lakehouse explorer.
Build reports on the default semantic model (Build)
• Build reports on top of the default semantic model connected to the warehouse.
You must also grant run permission to any user who gets edit permission.
[dbo].[Dept] [dbo].[Employee]
Default: For string data types (char, nchar, varchar, nvarchar, text, ntext), use XXXX, or fewer Xs if the size of the field is fewer than 4 characters.
For binary data types (binary, varbinary, image), use a single byte of ASCII value 0.
Email
Random: A random masking function for use on any numeric type to mask the original value with a random value within a specified range.
Custom String: Masking method that exposes the first and last letters and adds a custom padding string in the middle: prefix,[padding],suffix.
If the original value is too short to complete the entire mask, part of the prefix or suffix isn't exposed.
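The masking rules above can be imitated in a few lines to see what each one produces. This is a plain Python sketch, not the warehouse's masking engine: the function names are made up, the sample phone number is hypothetical, and the handling of values that are too short for the custom string mask (returning only the padding) is an assumption.

```python
import random


def mask_default_string(value: str) -> str:
    """Default mask for strings: XXXX, or fewer Xs for fields under 4 characters."""
    return "X" * min(4, len(value))


def mask_random(low: int, high: int) -> int:
    """Random mask for numbers: any value within the specified range."""
    return random.randint(low, high)


def mask_custom_string(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Custom string mask: keep `prefix` leading and `suffix` trailing characters."""
    if len(value) <= prefix + suffix:
        # Assumption: when the value is too short, expose nothing but the padding.
        return padding
    return value[:prefix] + padding + value[len(value) - suffix:]


print(mask_default_string("Transport"))
print(mask_custom_string("9876543234", 0, "xxxx", 2))  # hypothetical phone number
print(mask_random(1, 10))
```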
default() function
Full masking according to the data types of the designated fields
[dbo].[Dept] [dbo].[Employee]
✓
103 IT $3000 xxxx34
Row level security 104 Transport $4000 xxxx76
Pulls data in real time Translates to source dialect Sends DAX queries
e.g., SQL
Direct Query
Import
Direct Query
Import
parquet
Synapse
Data
Engineering
Synapse
Data Science
Semantic Model
Table1
Table2 User 1
Table3
Table4
User 2 Table5
Table6
Employee Table
ID
Name Show
Subject
Fees
Hide OE1
OE2
What tool is used? | Pipelines, dataflows and notebooks | Dataflows or notebooks | SQL Endpoint or semantic models
Data science
LMS Data
Analysis
Landing
LMS Data
Raw Landing
ProcessingDate = 2024-09-17
ProcessingDate = 2024-09-18
LMS Data
Landing LH_Bronze
ProcessingDate = 2024-09-16
bronze_data
ProcessingDate = 2024-09-17
Landing
INSERT
Landing
Landing
bronze_data silver_data
Data Cleaning
1. Handle duplicates
2. Handle missing or NULL values
    1. Delete rows for missing critical values
    2. Fill rows with default values for other data
3. Standardize date formats
4. Check for logical consistency
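The cleaning steps can be sketched on plain Python records. Everything specific here is an assumption for illustration: the `Enrollment_Date` column, the accepted input date formats, the default score of 0.0, and the 0 to 100 range used as the consistency check. In the project itself these steps would more likely run as Spark transformations in a notebook.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")  # assumed input formats


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None when no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = (row.get("Student_ID"), row.get("Course_ID"))
        if None in key:                      # 2.1 drop rows missing critical values
            continue
        if key in seen:                      # 1. drop duplicates
            continue
        seen.add(key)
        row = dict(row)                      # work on a copy, leave the input untouched
        if row.get("Quiz_Average_Score") is None:
            row["Quiz_Average_Score"] = 0.0  # 2.2 fill a default value
        row["Enrollment_Date"] = standardize_date(row.get("Enrollment_Date") or "")  # 3.
        if not 0 <= row["Quiz_Average_Score"] <= 100:  # 4. logical consistency
            continue
        cleaned.append(row)
    return cleaned


raw = [
    {"Student_ID": "S1", "Course_ID": "C1", "Quiz_Average_Score": 80.0, "Enrollment_Date": "17-09-2024"},
    {"Student_ID": "S1", "Course_ID": "C1", "Quiz_Average_Score": 80.0, "Enrollment_Date": "17-09-2024"},
    {"Student_ID": None, "Course_ID": "C2", "Quiz_Average_Score": 70.0, "Enrollment_Date": "2024-09-16"},
    {"Student_ID": "S2", "Course_ID": "C1", "Quiz_Average_Score": None, "Enrollment_Date": "2024-09-16"},
    {"Student_ID": "S3", "Course_ID": "C1", "Quiz_Average_Score": 140.0, "Enrollment_Date": "2024-09-16"},
]
silver = clean(raw)
print([row["Student_ID"] for row in silver])
```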
Business transformations
Performance_Score = (Quiz_Average_Score * 0.2) + (Assignment_Average_score * 0.2) + (Project_Score * 0.1)
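As a worked example of the Performance_Score formula, the sketch below applies the weights exactly as they appear on the slide. The function name and the sample scores are hypothetical.

```python
def performance_score(quiz_average_score: float,
                      assignment_average_score: float,
                      project_score: float) -> float:
    """Weighted score using the weights shown above (0.2, 0.2 and 0.1)."""
    return (quiz_average_score * 0.2
            + assignment_average_score * 0.2
            + project_score * 0.1)


# Sample scores: 80 * 0.2 + 90 * 0.2 + 70 * 0.1 = 16 + 18 + 7
score = performance_score(80, 90, 70)
print(round(score, 2))
```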
LH_Bronze
LH_Bronze
INSERT
LH_Bronze
LH_Bronze
UPDATE
silver_data
Fact Dimension
Dimension
Student_ID Student_ID
Course_ID Course_ID
dim_student
dim_course
fact_student_performance
DEV DEV
PROD PROD
• Currently, only Git in Azure Repos with the same tenant as the Fabric
tenant is supported.
• If the workspace and Git repo are in two different geographical
regions, the tenant admin must enable cross-geo exports.
• Azure DevOps on-prem isn't supported.
• Sovereign clouds aren't supported.
Owner of the item (if the tenant switch blocks updates for
nonowners)
Azure DevOps
Pull request
Feature Workspace
new
modified
deleted
conflict