Advanced Data Engineering With Databricks
• Name
• Role and team
• Length of experience with Spark and Databricks
• Motivation for attending
• Fun fact or favorite mobile app
Streaming Design Patterns
Auto Load to Multiplex Bronze
Streaming from Multiplex Bronze
Streaming Deduplication
Quality Enforcement
Promoting to Silver
• Type 2: Add a new row for each change and mark the old row as obsolete.
  E.g., recording product price changes over time, which is integral to business logic.
[Example Type 2 SCD table row: 2 | 99 Jump St | Abhi]
Type 2 SCD
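A minimal sketch of a Type 2 update with a Delta Lake MERGE; the table and column names (silver_customers, customer_updates, customer_id, address, current, end_date) are hypothetical:

# Hedged sketch: close out the current row when a tracked attribute changes,
# then append the new version as a fresh row. All names here are hypothetical.
spark.sql("""
  MERGE INTO silver_customers AS t
  USING customer_updates AS u
  ON t.customer_id = u.customer_id AND t.current = true
  WHEN MATCHED AND t.address <> u.address THEN
    UPDATE SET t.current = false, t.end_date = u.effective_date
""")
# The changed records are then appended as new "current" rows in a second write.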
from pyspark.sql.functions import window

# 10-minute windows that slide every 5 minutes; the trigger fires every 5 minutes
windowedDF = (eventsDF
    .groupBy(window("eventTime", "10 minutes", "5 minutes"))
    .count()
    .writeStream
    .trigger(processingTime="5 minutes")
    # an output mode and sink (e.g. .outputMode("complete") ... .start()) complete the query
)
[Diagram: Structured Streaming stateful aggregation. Each micro-batch of n input records (RECORD 1 … RECORD N) is compared against the m records held in state (RECORD 1 … RECORD M), so the per-batch cost is O(n × m); the goal is to limit the state size m.]
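One hedged way to keep the state size m bounded is a watermark, which lets Structured Streaming expire old state; eventsDF and eventTime come from the slides, while the 30-minute threshold and the event_id column are illustrative assumptions:

# Hedged sketch: a watermark bounds how much state streaming deduplication keeps.
dedupedDF = (eventsDF
    .withWatermark("eventTime", "30 minutes")      # state older than 30 min can be dropped
    .dropDuplicates(["event_id", "eventTime"])     # "event_id" is a hypothetical key column
)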
2. Join item category lookup

# Stream-static join: each micro-batch of salesSDF is joined to the items lookup table
itemSalesSDF = (salesSDF
    .join(spark.table("items"), "item_id")
)
Also called the serving layer; gold tables exist at this level.
Stored Views
Materialized Gold Tables
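As a rough illustration of the difference, a hedged sketch with hypothetical names (sales_silver, sales_gold_view, sales_gold):

# Stored view: only the query is saved; it is recomputed on each read.
spark.sql("""
  CREATE VIEW IF NOT EXISTS sales_gold_view AS
  SELECT region, SUM(amount) AS total_sales
  FROM sales_silver
  GROUP BY region
""")

# Materialized gold table: results are written out as a physical Delta table.
(spark.table("sales_silver")
    .groupBy("region")
    .sum("amount")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("sales_gold"))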
GDPR without vault
Creating a Pseudonymized PII Lookup Table
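A hedged sketch of such a lookup table, salting and hashing a natural key; the table, columns, and salt value are hypothetical:

from pyspark.sql.functions import sha2, concat, lit, col

salt = "SOME_SALT"   # illustrative salt only
lookupDF = (spark.table("registered_users")
    .select(
        "user_id",
        sha2(concat(lit(salt), col("user_id").cast("string")), 256).alias("alt_id")  # pseudonymized key
    ))

lookupDF.write.format("delta").mode("overwrite").saveAsTable("user_lookup")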
Notebook
[Diagram: the PROD Data Eng workspace has full access to PROD data; the Analytics workspace has read-only access to PROD data and full access to its derivative data.]
Grant Access to Production Datasets
Assumptions
● End-users need read-only access
● Datasets organized by database
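Under those assumptions, a hedged sketch of a read-only grant; the database name prod_sales and group analysts are hypothetical, and the exact syntax depends on whether legacy table ACLs or Unity Catalog is in use:

# Read-only access for a group on one database (hypothetical names).
spark.sql("GRANT USAGE ON DATABASE prod_sales TO `analysts`")
spark.sql("GRANT SELECT ON DATABASE prod_sales TO `analysts`")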
Deidentified PII Access
Data Lake
AI & Reporting
Data quality
ignoreChanges / ignoreDeletes
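A hedged sketch of using these Delta source options when the upstream table has deletes or updates; the path is hypothetical:

# Stream from a Delta table whose upstream has deletes/updates (hypothetical path).
df = (spark.readStream
    .format("delta")
    .option("ignoreDeletes", "true")     # tolerate deletes made at partition boundaries
    # .option("ignoreChanges", "true")   # broader option: also re-emits rows from rewritten files
    .load("/mnt/silver/sales"))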
CDF
Processing Records from Change Data Feed
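A hedged sketch of reading the change feed; the table name and starting version are hypothetical, and the table must have change data feed enabled:

# Read changes (inserts, updates, deletes) rather than the table's current state.
cdfDF = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)        # hypothetical starting point
    .table("silver_customers"))          # hypothetical table name
# Each row carries _change_type, _commit_version, and _commit_timestamp metadata columns.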
Notebook
Propagating Deletes with Change Data Feed
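A hedged sketch of pushing CDF delete records into a downstream table; all table and column names are hypothetical:

from delta.tables import DeltaTable

# Collect the delete events recorded in the change feed.
deletesDF = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("silver_customers")
    .filter("_change_type = 'delete'"))

# Apply them downstream with a MERGE that deletes matching rows.
(DeltaTable.forName(spark, "gold_customers").alias("t")
    .merge(deletesDF.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete()
    .execute())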
Notebook
Deleting at Partition Boundaries
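A hedged sketch, assuming a bronze table partitioned by topic (names hypothetical); deleting along the partition column lets Delta drop whole data files instead of rewriting them:

# Delete an entire partition's worth of records in one operation.
spark.sql("DELETE FROM bronze_events WHERE topic = 'user_info'")
# VACUUM (after the retention window) then physically removes the dropped files.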
[Diagram: tasks in a Job can run serially (1 → 2 → 3 → 4) or in parallel (1 → {2, 3} → 4); a scheduled Job is composed of tasks, each with its own task + cluster configuration.]
Creating a Multi-Task Job
Task-1: Create Database
Task-4*: Name-Param
Task-5*: Create Task 5
Task-7*: Cleanup

* Make sure to use a different cluster for Tasks #2 & #3
* Add the parameter "name" with some value (e.g. your name)
* Make sure to use the same cluster for Tasks #5, #6 & #7
* After running once with the error, update Task-6 to pass
Promoting Code with Databricks Repos
[Diagram: Databricks Repos syncs with Git and CI/CD systems, which handle versioning, review, and testing.]
Supported Git Providers
https://github.com/databricks-academy/cli-demo
Orchestration with the Databricks CLI