EBOOK: THE BIG BOOK OF DATA ENGINEERING — 3RD EDITION

Contents

Introduction to Data Engineering on Databricks

Guidance and Best Practices
  Databricks Assistant Tips and Tricks for Data Engineers
  Applying Software Development and DevOps Best Practices to Delta Live Table Pipelines
  Unity Catalog Governance in Action: Monitoring, Reporting and Lineage
  Scalable Spark Structured Streaming for REST API Destinations
  A Data Engineer’s Guide to Optimized Streaming With Protobuf and Delta Live Tables
  Design Patterns for Batch Processing in Financial Services
  How to Set Up Your First Federated Lakehouse
  Orchestrating Data Analytics With Databricks Workflows
  Schema Management and Drift Scenarios via Databricks Auto Loader
  From Idea to Code: Building With the Databricks SDK for Python

Ready-to-Use Notebooks and Datasets

Case Studies
  Cox Automotive
  Block
  Trek Bicycle
  Coastal Community Bank
  Powys Teaching Health Board (PTHB)

01 Introduction to Data Engineering on Databricks

A recent MIT Tech Review Report shows that 88% of surveyed organizations are either investing in, adopting or experimenting with generative AI (GenAI) and 71% intend to build their own GenAI models. This increased interest in AI is fueling major investments as AI becomes a differentiating competitive advantage in every industry. As more organizations work to leverage their proprietary data for this purpose, many encounter the same hard truth:

The best GenAI models in the world will not succeed without good data.

This reality emphasizes the importance of building reliable data pipelines that can ingest or stream vast amounts of data efficiently and ensure high data quality. In other words, good data engineering is an essential component of success in every data and AI initiative and especially for GenAI.

Using practical guidance, useful patterns, best practices and real-world examples, this book will provide you with an understanding of how the Databricks Data Intelligence Platform helps data engineers meet the challenges of this new era.

What is data engineering?

Data engineering is the practice of taking raw data from a data source and processing it so it’s stored and organized for a downstream use case such as data analytics, business intelligence (BI) or machine learning (ML) model training. In other words, it’s the process of preparing data so value can be extracted from it.

A useful way of thinking about data engineering is by using the following framework, which includes three main parts:

1. Ingest
Data ingestion is the process of bringing data from one or more data sources into a data platform. These data sources can be files stored on-premises or on cloud storage services, databases, applications and increasingly — data streams that produce real-time events.

2. Transform
Data transformation takes raw ingested data and uses a series of steps (referred to as “transformations”) to filter, standardize, clean and finally aggregate it so it’s stored in a usable way. A popular pattern is the medallion architecture, which defines three stages in the process — Bronze, Silver and Gold.

3. Orchestrate
Data orchestration refers to the way a data pipeline that performs ingestion and transformation is scheduled and monitored, as well as the control of the various pipeline steps and handling failures (e.g., by executing a retry run).
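As a rough, minimal sketch of these three parts in PySpark (not taken from the book; the paths, table and column names are hypothetical, and orchestration is shown only as a comment):

from pyspark.sql import functions as F

# 1. Ingest: land raw JSON events in a Bronze table (assumes a Databricks
#    notebook, where spark is already defined; path and names are hypothetical)
raw = spark.read.json("/Volumes/main/default/raw_events")
raw.write.mode("append").saveAsTable("events_bronze")

# 2. Transform: clean and standardize into a Silver table
silver = (
    spark.table("events_bronze")
    .dropDuplicates(["event_id"])
    .withColumn("event_date", F.to_date("event_ts"))
)
silver.write.mode("overwrite").saveAsTable("events_silver")

# 3. Orchestrate: in practice an orchestrator (for example, Databricks Workflows)
#    schedules these steps, monitors them and retries on failure.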

Challenges of data engineering in the AI era

As previously mentioned, data engineering is key to ensuring reliable data for AI initiatives. Data engineers who build and maintain ETL pipelines and the data infrastructure that underpins analytics and AI workloads face specific challenges in this fast-moving landscape.

■ Handling real-time data: From mobile applications to sensor data on factory floors, more and more data is created and streamed in real time and requires low-latency processing so it can be used in real-time decision-making.
■ Scaling data pipelines reliably: With data coming in large quantities and often in real time, scaling the compute infrastructure that runs data pipelines is challenging, especially when trying to keep costs low and performance high. Running data pipelines reliably, monitoring data pipelines and troubleshooting when failures occur are some of the most important responsibilities of data engineers.
■ Data quality: “Garbage in, garbage out.” High data quality is essential to training high-quality models and gaining actionable insights from data. Ensuring data quality is a key challenge for data engineers.
■ Governance and security: Data governance is becoming a key challenge for organizations that find their data spread across multiple systems with increasingly larger numbers of internal teams looking to access and utilize it for different purposes. Securing and governing data is also an important regulatory concern many organizations face, especially in highly regulated industries.

These challenges stress the importance of choosing the right data platform for navigating new waters in the age of AI. But a data platform in this new age can also go beyond addressing just the challenges of building AI solutions. The right platform can improve the experience and productivity of data practitioners, including data engineers, by infusing intelligence and using AI to assist with daily engineering tasks.

In other words, the new data platform is a data intelligence platform.

The Databricks Data Intelligence Platform


Databricks’ mission is to democratize data and AI, allowing organizations
to use their unique data to build or fine-tune their own machine learning
and generative AI models so they can produce new insights that lead to
business innovation.

The Databricks Data Intelligence Platform is built on lakehouse architecture


to provide an open, unified foundation for all data and governance, and it’s
powered by a Data Intelligence Engine that understands the uniqueness of your
data. With these capabilities at its foundation, the Data Intelligence Platform
lets Databricks customers run a variety of workloads, from business intelligence
and data warehousing to AI and data science.

To get a better understanding of the Databricks Platform, here’s an overview of


the different parts of the architecture as it relates to data engineering.

Data reliability and performance with Delta Lake


To bring openness, reliability and lifecycle management to data lakes, the
Databricks lakehouse architecture is built on the foundation of Delta Lake, an
open source, highly performant storage format that solves challenges around
unstructured/structured data ingestion, the application of data quality, difficulties
with deleting data for compliance and issues with modifying data for change data capture (CDC).
Delta Lake UniForm users can now read Delta tables with Hudi and Iceberg
clients, keeping them in control of their data. In addition, Delta Sharing enables
easy and secure sharing of datasets inside and outside the organization.

Unified governance with Unity Catalog


With Unity Catalog, data engineering and governance teams benefit from an
enterprisewide data catalog with a single interface to manage permissions,
centralize auditing, automatically track data lineage down to the column level
and share data across platforms, clouds and regions.

DatabricksIQ — the Data Intelligence Engine

At the heart of the Data Intelligence Platform lies DatabricksIQ, the engine that uses AI to infuse intelligence throughout the platform. DatabricksIQ is a first-of-its-kind Data Intelligence Engine that uses AI to power all parts of the Databricks Data Intelligence Platform. It uses signals across your entire Databricks environment, including Unity Catalog, dashboards, notebooks, data pipelines and documentation, to create highly specialized and accurate generative AI models that understand your data, your usage patterns and your business terminology.

Reliable data pipelines and real-time stream processing with Delta Live Tables

Delta Live Tables (DLT) is a declarative ETL framework that helps data teams simplify ETL and make it cost-effective for both streaming and batch. Simply define the transformations you want to perform on your data and let DLT pipelines automatically handle task orchestration, cluster management, monitoring, data quality and error management. Engineers can treat their data as code and apply modern software engineering best practices like testing, error handling, monitoring and documentation to deploy reliable pipelines at scale. DLT fully supports both Python and SQL and is tailored to work with both streaming and batch workloads.

Unified data orchestration with Databricks Workflows

Databricks Workflows offers a simple, reliable orchestration solution for data and AI on the Data Intelligence Platform. Databricks Workflows lets you define multi-step workflows to implement ETL pipelines, ML training workflows and more. It offers enhanced control flow capabilities and supports different task types and workflow triggering options. As the platform-native orchestrator, Databricks Workflows also provides advanced observability to monitor and visualize workflow execution, along with alerting capabilities for when issues arise. Databricks Workflows offers serverless compute options so you can leverage smart scaling and efficient task execution.

A rich ecosystem of data solutions


The Data Intelligence Platform is built on open source technologies and uses open standards so leading data solutions can be leveraged with anything you build on the
lakehouse. A large collection of technology partners makes it simple to integrate the technologies you rely on when migrating to Databricks — and you are not
locked into a closed data technology stack.

The Data Intelligence Platform integrates with a large collection of technologies



Why data engineers choose the Data Intelligence Platform


So how does the Data Intelligence Platform help with each of the data engineering challenges discussed earlier?

■ Real-time data stream processing: The Data Intelligence Platform simplifies development and operations by automating the production aspects associated with
building and maintaining real-time data workloads. Delta Live Tables provides a declarative way to define streaming ETL pipelines and Spark Structured Streaming
helps build real-time applications for real-time decision-making.

Figure: A unified set of tools for real-time data processing. Data sources (data warehouses, on-premises systems, SaaS applications, message buses, machine and application logs, application events, and mobile and IoT data) flow into the lakehouse platform, with Workflows for end-to-end orchestration and streaming ETL with Delta Live Tables. This feeds real-time analytics with Databricks SQL (real-time BI apps), real-time machine learning with Databricks ML (real-time AI apps such as predictive maintenance, personalized offers and patient diagnostics) and real-time applications with Spark Structured Streaming (real-time operational apps such as alerts, fraud detection and dynamic pricing), all running on Photon for lightning-fast data processing, with Unity Catalog for data governance and sharing and Delta Lake for open and reliable data storage.



■ Reliable data pipelines at scale: Both Delta Live Tables and Databricks Workflows use smart autoscaling and auto-optimized resource management to handle high-scale workloads. With lakehouse architecture, the high scalability of data lakes is combined with the high reliability of data warehouses, thanks to Delta Lake — the storage format that sits at the foundation of the lakehouse.
■ Data quality: High reliability — starting at the storage level with Delta Lake and coupled with data quality–specific features offered by Delta Live Tables — ensures high data quality. These features include setting data “expectations” to handle corrupt or missing data as well as automatic retries. In addition, both Databricks Workflows and Delta Live Tables provide full observability to data engineers, making issue resolution faster and easier.
■ Unified governance with secured data sharing: Unity Catalog provides a single governance model for the entire platform so every dataset and pipeline are governed in a consistent way. Datasets are discoverable and can be securely shared with internal or external teams using Delta Sharing. In addition, because Unity Catalog is a cross-platform governance solution, it provides valuable lineage information so it’s easy to have a full understanding of how each dataset and table is used downstream and where it originates upstream.

In addition, data engineers using the Data Intelligence Platform benefit from cutting-edge innovations in the form of AI-infused intelligence from DatabricksIQ:

■ AI-powered productivity: Specifically useful for data engineers, DatabricksIQ powers the Databricks Assistant, a context-aware AI assistant that offers a conversational interface to query data, generate code, explain code queries and even fix issues.

Conclusion

As organizations strive to innovate with AI, data engineering is a focal point for success by delivering reliable, real-time data pipelines that make AI possible. With the Data Intelligence Platform, built on the lakehouse architecture and powered by DatabricksIQ, data engineers are set up for success in dealing with the critical challenges posed in the modern data landscape. By leaning on the advanced capabilities of the Data Intelligence Platform, data engineers don’t need to spend as much time managing complex pipelines or dealing with reliability, scalability and data quality issues. Instead, they can focus on innovation and bringing more value to the organization.

FOLLOW PROVEN BEST PRACTICES

In the next section, we describe best practices for data engineering and end-to-end use cases drawn from real-world examples. From data ingestion and real-time processing to orchestration and data federation, you’ll learn how to apply proven patterns and make the best use of the different capabilities of the Data Intelligence Platform.

As you explore the rest of this guide, you can find datasets and code samples in the various Databricks Solution Accelerators, so you can get your hands dirty and start building on the Data Intelligence Platform.

02 Guidance and Best Practices



Databricks Assistant Tips and Tricks for Data Engineers
by Jackie Zhang, Rafi Kurlansik and Richard Tomlinson

The generative AI revolution is transforming the way that teams work, and Databricks Assistant leverages the best of these advancements. It allows you to query data through a conversational interface, making you more productive inside your Databricks Workspace. The Assistant is powered by DatabricksIQ, the Data Intelligence Engine for Databricks, helping to ensure your data is secured and responses are accurate and tailored to the specifics of your enterprise. Databricks Assistant lets you describe your task in natural language to generate, optimize, or debug complex code without interrupting your developer experience.

In this chapter we’ll discuss how to get the most out of your Databricks Assistant and focus on how the Assistant can improve the life of data engineers by eliminating tedium, increasing productivity and immersion, and accelerating time to value. We will follow up with a series of posts focused on different data practitioner personas, so stay tuned for upcoming entries focused on data scientists, SQL analysts, and more.

INGESTION

When working with Databricks as a data engineer, ingesting data into Delta Lake tables is often the first step. Let’s take a look at two examples of how the Assistant helps load data, one from APIs, and one from files in cloud storage. For each, we will share the prompt and results. As mentioned in the 5 tips blog, being specific in a prompt gives the best results, a technique consistently used in this article.

To get data from the datausa.io API and load it into a Delta Lake table with Python, we used the following prompt:

Help me ingest data from this API into a Delta Lake table: https://datausa.io/api/data?drilldowns=Nation&measures=Population
Make sure to use PySpark, and be concise! If the Spark DataFrame columns have any spaces in them, make sure to remove them from the Spark DF.

A similar prompt can be used to ingest JSON files from cloud storage into Delta Lake tables, this time using SQL:

I have JSON files in a UC Volume here: /Volumes/rkurlansik/default/data_science/sales_data.json
Write code to ingest this data into a Delta Lake table. Use SQL only, and be concise!

TRANSFORMING DATA FROM UNSTRUCTURED TO STRUCTURED

Following tidy data principles, any given cell of a table should contain a single observation with a proper data type. Complex strings or nested data structures are often at odds with this principle, and as a result, data engineering work consists of extracting structured data from unstructured data. Let’s explore two examples where the Assistant excels at this task — using regular expressions and exploding nested data structures.

Regular expressions

Regular expressions are a means to extract structured data from messy strings, but figuring out the correct regex takes time and is tedious. In this respect, the Assistant is a boon for all data engineers who struggle with regex.

Consider this example using the Title column from the IMDb dataset. This column contains two distinct observations — film title and release year. With the following prompt, the Assistant identifies an appropriate regular expression to parse the string into multiple columns.

Here is an example of the Title column in our dataset: 1. The Shawshank Redemption (1994). The title name will be between the number and the parentheses, and the release date is between parentheses. Write a function that extracts both the release date and the title name from the Title column in the imdb_raw DataFrame.
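The function the Assistant generates for this prompt varies from run to run; a reasonable version, sketched here for illustration rather than as the Assistant’s exact output, uses regexp_extract and assumes the imdb_raw DataFrame named in the prompt:

from pyspark.sql import functions as F

TITLE_PATTERN = r"\d+\.\s*(.+?)\s*\((\d{4})\)"

def extract_title_and_release_date(df):
    """Split a Title like '1. The Shawshank Redemption (1994)' into name and year."""
    return (
        df.withColumn("title_name", F.regexp_extract("Title", TITLE_PATTERN, 1))
          .withColumn("release_date", F.regexp_extract("Title", TITLE_PATTERN, 2))
    )

imdb_clean = extract_title_and_release_date(imdb_raw)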

Providing an example of the string in our prompt helps the Assistant find the correct result. If you are working with sensitive data, we recommend creating a fake example that follows the same pattern. In any case, now you have one less problem to worry about in your data engineering work.

Nested Structs, Arrays (JSON, XML, etc.)

When ingesting data via API, JSON files in storage, or NoSQL databases, the resulting Spark DataFrames can be deeply nested and tricky to flatten correctly. Take a look at this mock sales data in JSON format:

Data engineers may be asked to flatten the nested array and extract revenue metrics for each product. Normally this task would take significant trial and error — even in a case where the data is relatively straightforward. The Assistant, however, being context-aware of the schemas of DataFrames you have in memory, generates code to get the job done. Using a simple prompt, we get the results we are looking for in seconds.

Write PySpark code to flatten the df and extract revenue for each product and customer
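The generated code depends on the DataFrame’s actual schema; the explode-based flattening sketched below illustrates the kind of result the Assistant produces. It assumes a df with the nested quarters/regions/products structure used by the pandas example later in this chapter, and the revenue column name is ours, not the Assistant’s.

from pyspark.sql import functions as F

# Assumed schema, matching the pandas example later in this chapter:
# {"company": ..., "year": ..., "quarters": [{"quarter": ..., "regions": [
#     {"name": ..., "products": [{"name": ..., "units_sold": ..., "sales": ...}]}]}]}
flat_df = (
    df.select("company", "year", F.explode("quarters").alias("q"))
      .select("company", "year", "q.quarter", F.explode("q.regions").alias("r"))
      .select("company", "year", "quarter",
              F.col("r.name").alias("region_name"),
              F.explode("r.products").alias("p"))
      .select("company", "year", "quarter", "region_name",
              F.col("p.name").alias("product_name"),
              F.col("p.units_sold").alias("units_sold"),
              F.col("p.sales").alias("revenue"))
)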

REFACTORING, DEBUGGING AND OPTIMIZATION

Another scenario data engineers face is rewriting code authored by other team members, either ones that may be more junior or have left the company. In these cases, the Assistant can analyze and explain poorly written code by understanding its context and intent. It can suggest more efficient algorithms, refactor code for better readability, and add comments.

Improving documentation and maintainability

This Python code calculates the total cost of items in an online shopping cart.

def calculate_total(cart_items):
    total = 0
    for i in range(len(cart_items)):
        if cart_items[i]['type'] == 'book':
            discount = cart_items[i]['price'] * 0.05
            total += cart_items[i]['price'] - discount
        else:
            total += cart_items[i]['price']
    return total

cart_items = [{'name': 'Python Programming', 'type': 'book', 'price': 50},
              {'name': 'Laptop', 'type': 'electronics', 'price': 800}]
total_price = calculate_total(cart_items)

The use of conditional blocks in this code makes it hard to read and inefficient at scale. Furthermore, there are no comments to explain what is happening. A good place to begin is to ask the Assistant to explain the code step by step. Once the data engineer understands the code, the Assistant can transform it, making it more performant and readable with the following prompt:

Rewrite this code in a way that is more performant, commented properly, and documented according to Python function documentation standards

The generated example below properly documents the code, and uses generator expressions instead of conditional blocks to improve memory utilization on larger datasets.

def calculate_total(cart_items):
    """
    Calculate the total price of items in the shopping cart.

    Apply a 5% discount to book items. The total is calculated by summing
    the prices, taking into account the discount for books.

    Parameters:
    cart_items (list of dict): A list of dictionaries, where each dictionary
                               represents an item in the cart with keys 'name',
                               'type', and 'price'.

    Returns:
    float: The total price of all items in the cart after discounts.
    """
    # Use a generator expression to iterate over cart items, applying a 5% discount to books.
    return sum(item['price'] * 0.95 if item['type'] == 'book' else item['price']
               for item in cart_items)

# Example usage
cart_items = [{'name': 'Python Programming', 'type': 'book', 'price': 50},
              {'name': 'Laptop', 'type': 'electronics', 'price': 800}]
total_price = calculate_total(cart_items)

Diagnosing errors

Inevitably, data engineers will need to debug. The Assistant eliminates the need to open multiple browser tabs or switch contexts in order to identify the cause of errors in code, and staying focused is a tremendous productivity boost. To understand how this works with the Assistant, let’s create a simple PySpark DataFrame and trigger an error.

In the above example, a typo is introduced when adding a new column to the DataFrame. The zero in “10” is actually the letter “O”, leading to an invalid decimal literal syntax error. The Assistant immediately offers to diagnose the error. It correctly identifies the typo, and suggests corrected code that can be inserted into the editor in the current cell. Diagnosing and correcting errors this way can save hours of time spent debugging.

Transpiling pandas to PySpark

Pandas is one of the most successful data-wrangling libraries in Python and is used by data scientists everywhere. Sticking with our JSON sales data, let’s imagine a situation where a novice data scientist has done their best to flatten the data using pandas. It isn’t pretty, it doesn’t follow best practices, but it produces the correct output:

import pandas as pd
import json

with open("/Volumes/rkurlansik/default/data_science/sales_data.json") as file:
    data = json.load(file)

# Bad practice: Manually initializing an empty DataFrame and using a deeply nested
# for-loop to populate it.
df = pd.DataFrame(columns=['company', 'year', 'quarter', 'region_name', 'product_name',
                           'units_sold', 'product_sales'])

for quarter in data['quarters']:
    for region in quarter['regions']:
        for product in region['products']:
            df = df.append({
                'company': data['company'],
                'year': data['year'],
                'quarter': quarter['quarter'],
                'region_name': region['name'],
                'product_name': product['name'],
                'units_sold': product['units_sold'],
                'product_sales': product['sales']
            }, ignore_index=True)

# Inefficient conversion of columns after data has been appended
df['year'] = df['year'].astype(int)
df['units_sold'] = df['units_sold'].astype(int)
df['product_sales'] = df['product_sales'].astype(int)

# Mixing access styles and modifying the dataframe in-place in an inconsistent manner
df['company'] = df.company.apply(lambda x: x.upper())
df['product_name'] = df['product_name'].str.upper()

By default, Pandas is limited to running on a single machine. The data engineer shouldn’t put this code into production and run it on billions of rows of data until it is converted to PySpark. This conversion process includes ensuring the data engineer understands the code and rewrites it in a way that is maintainable, testable, and performant. The Assistant once again comes up with a better solution in seconds.

Note the generated code includes creating a SparkSession, which isn’t required in Databricks. Sometimes the Assistant, like any LLM, can be wrong or hallucinate. You, the data engineer, are the ultimate author of your code and it is important to review and understand any code generated before proceeding to the next task. If you notice this type of behavior, adjust your prompt accordingly.
WRITING TESTS

One of the most important steps in data engineering is to write tests to ensure
your DataFrame transformation logic is correct, and to potentially catch any
corrupted data flowing through your pipeline. Continuing with our example
from the JSON sales data, the Assistant makes it a breeze to test if any of the
revenue columns are negative - as long as values in the revenue columns are
not less than zero, we can be confident that our data and transformations in this
case are correct.
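A check along those lines can be as simple as the sketch below. This is illustrative rather than the Assistant’s verbatim output, and it assumes the flattened DataFrame and revenue column from the earlier flattening sketch.

from pyspark.sql import functions as F

# Any rows with negative revenue indicate corrupted data or a bug in the transformations
negative_revenue_df = flat_df.filter(F.col("revenue") < 0)

assert negative_revenue_df.count() == 0, "Found rows with negative revenue"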

We can build off this logic by asking the Assistant to incorporate the test into PySpark’s native testing functionality, using the following prompt:

Write a test using assertDataFrameEqual from pyspark.testing.utils to check that an empty DataFrame has the same number of rows as our negative revenue DataFrame.

The Assistant obliges, providing working code to bootstrap our testing efforts.

This example highlights the fact that being specific and adding detail to your prompt yields better results. If we simply ask the Assistant to write tests for us without any detail, our results will exhibit more variability in quality. Being specific and clear in what we are looking for — a test using PySpark modules that builds off the logic it wrote — generally will perform better than assuming the Assistant can correctly guess at our intentions.

GETTING HELP

Beyond a general capability to improve and understand code, the Assistant possesses knowledge of the entire Databricks documentation and Knowledge Base. This information is indexed on a regular basis and made available as additional context for the Assistant via a RAG architecture. This allows users to search for product functionality and configurations without leaving the Databricks Platform.

For example, if you want to know details about the system environment for the version of Databricks Runtime you are using, the Assistant can direct you to the appropriate page in the Databricks documentation.

The Assistant can handle simple, descriptive, and conversational questions, enhancing the user experience in navigating Databricks’ features and resolving issues. It can
even help guide users in filing support tickets! For more details, read the announcement article.

CONCLUSION

The barrier to entry for quality data engineering has been lowered thanks to the power of generative AI with the Databricks Assistant. Whether you are a newcomer
looking for help on how to work with complex data structures or a seasoned veteran who wants regular expressions written for them, the Assistant will improve your
quality of life. Its core competency of understanding, generating, and documenting code boosts productivity for data engineers of all skill levels. To learn more, see the
Databricks documentation on how to get started with the Databricks Assistant today.

Applying Software Development and DevOps Best Practices to Delta Live Table Pipelines
by Alex Ott

Databricks Delta Live Tables (DLT) radically simplifies the development of robust data processing pipelines by decreasing the amount of code that data engineers need to write and maintain. It also reduces the need for data maintenance and infrastructure operations, while enabling users to seamlessly promote code and pipeline configurations between environments. But people still need to perform testing of the code in the pipelines, and we often get questions on how they can do it efficiently.

In this chapter we’ll cover the following items based on our experience working with multiple customers:

■ How to apply DevOps best practices to Delta Live Tables


■ How to structure the DLT pipeline’s code to facilitate unit
and integration testing
■ How to perform unit testing of individual transformations of your
DLT pipeline
■ How to perform integration testing by executing the full DLT pipeline
■ How to promote the DLT assets between stages
■ How to put everything together to form a CI/CD pipeline (with Azure DevOps as an example)

APPLYING DEVOPS PRACTICES TO DLT: THE BIG PICTURE

DevOps practices are aimed at shortening the software development life cycle (SDLC) while providing high quality at the same time. Typically they include these steps:

■ Version control of the source code and infrastructure


■ Code reviews
■ Separation of environments (development/staging/production)
■ Automated testing of individual software components and the whole
product with the unit and integration tests
■ Continuous integration (testing) and continuous deployment of
changes (CI/CD)

All these practices can be applied to Delta Live Tables pipelines as well:

Figure: DLT development workflow



To achieve this we use the following features of the Databricks product portfolio:

■ Databricks Repos provide an interface to different Git services, so we can use them for code versioning, integration with CI/CD systems, and promotion of the code between environments
■ Databricks CLI (or Databricks REST API) to implement CI/CD pipelines
■ Databricks Terraform Provider for deployment of all necessary infrastructure and keeping it up to date

The recommended high-level development workflow of a DLT pipeline is as follows:

1. A developer is developing the DLT code in their own checkout of a Git repository, using a separate Git branch for changes.
2. When code is ready and tested, code is committed to Git and a pull request is created.
3. The CI/CD system reacts to the commit and starts the build pipeline (the CI part of CI/CD) that will update a staging Databricks Repo with the changes, and trigger execution of unit tests.
   a. Optionally, the integration tests could be executed as well, although in some cases this could be done only for some branches, or as a separate pipeline.
4. If all tests are successful and code is reviewed, the changes are merged into the main branch (or a dedicated branch) of the Git repository.
5. Merging of changes into a specific branch (for example, releases) may trigger a release pipeline (the CD part of CI/CD) that will update the Databricks Repo in the production environment, so code changes will take effect when the pipeline runs next time.

As an illustration for the rest of the chapter, we’ll use a very simple DLT pipeline consisting of just two tables, illustrating the typical Bronze/Silver layers of a lakehouse architecture. Complete source code together with deployment instructions is available on GitHub; a minimal sketch of such a two-table pipeline is shown after the note below.

Figure: Example DLT pipeline

Note: DLT provides both SQL and Python APIs; in most of the chapter we focus on the Python implementation, although we can apply most of the best practices also to SQL-based pipelines.
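As a minimal, illustrative sketch (the real example lives in the GitHub repository mentioned above; the dataset path, table and column names here are hypothetical), such a two-table Bronze/Silver pipeline can be as small as this:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw clickstream events ingested into the Bronze layer")
def clickstream_raw():
    # Path is illustrative; point this at your own raw JSON data
    return spark.read.json("/Volumes/main/default/raw_clickstream/")

@dlt.table(comment="Filtered clickstream events in the Silver layer")
def clickstream_filtered():
    return dlt.read("clickstream_raw").filter(F.col("type").isin("link", "redlink"))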

DEVELOPMENT CYCLE WITH DELTA LIVE TABLES

When developing with Delta Live Tables, the typical development process looks as follows:

1. Code is written in the notebook(s).
2. When another piece of code is ready, a user switches to the DLT UI and starts the pipeline. (To make this process faster it’s recommended to run the pipeline in Development mode, so you don’t need to wait for resources again and again.)
3. When a pipeline is finished or has failed because of errors, the user analyzes results, and adds/modifies the code, repeating the process.
4. When code is ready, it’s committed.

For complex pipelines, such a dev cycle could have a significant overhead because the pipeline’s startup could be relatively long for complex pipelines with dozens of tables/views and when there are many libraries attached. For users it would be easier to get very fast feedback by evaluating the individual transformations and testing them with sample data on interactive clusters.

STRUCTURING THE DLT PIPELINE’S CODE

To be able to evaluate individual functions and make them testable it’s very important to have the correct code structure. The usual approach is to define all data transformations as individual functions receiving and returning Spark DataFrames, and call these functions from the DLT pipeline functions that will form the DLT execution graph. The best way to achieve this is to use the files in repos functionality, which allows you to expose Python files as normal Python modules that could be imported into Databricks notebooks or other Python code. DLT natively supports files in repos, allowing Python files to be imported as Python modules (please note that when using files in repos, two entries are added to Python’s sys.path — one for the repo root, and one for the current directory of the caller notebook). With this, we can start to write our code as a separate Python file located in a dedicated folder under the repo root that will be imported as a Python module:

Figure: Source code for a Python package
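The demo repository contains the actual package; as a minimal sketch of the idea (module path, function and column names are illustrative), such a file could contain plain functions that take and return DataFrames so they can be tested without running a DLT pipeline:

# helpers/transformations.py
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def filter_clickstream(df: DataFrame) -> DataFrame:
    """Keep only the clickstream event types we care about."""
    return df.filter(F.col("type").isin("link", "redlink"))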



And the code from this Python package could be used inside the DLT pipeline code:

Figure: Using functions from the Python package in the DLT code

Note that the function in this particular DLT code snippet is very small — all it’s doing is just reading data from the upstream table and applying our transformation defined in the Python module. With this approach we can make DLT code simpler to understand and easier to test locally or using a separate notebook attached to an interactive cluster. Splitting the transformation logic into a separate Python module allows us to interactively test transformations from notebooks, write unit tests for these transformations and also test the whole pipeline (we’ll talk about testing in the next sections).

The final layout of the Databricks Repo, with unit and integration tests, may look as follows:

Figure: Recommended code layout in Databricks Repo

This code structure is especially important for bigger projects that may consist of multiple DLT pipelines sharing common transformations.
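To make the “Using functions from the Python package in the DLT code” figure concrete, the DLT side of the pattern is typically just a thin wrapper around the shared function, roughly like this sketch (table and module names are illustrative and match the earlier sketches):

import dlt
from helpers.transformations import filter_clickstream

@dlt.table(comment="Clickstream events filtered by the shared transformation")
def clickstream_filtered():
    # Read the upstream Bronze table and delegate the logic to the Python module
    return filter_clickstream(dlt.read("clickstream_raw"))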

IMPLEMENTING UNIT TESTS

As mentioned above, splitting transformations into a separate Python module allows us to more easily write unit tests that check the behavior of the individual functions. We have a choice of how we can implement these unit tests:

■ We can define them as Python files that could be executed locally, for example, using pytest. This approach has the following advantages:
  ■ We can develop and test these transformations using an IDE, and, for example, sync the local code with a Databricks repo using the Databricks extension for Visual Studio Code or the dbx sync command if you use another IDE
  ■ Such tests could be executed inside the CI/CD build pipeline without the need to use Databricks resources (although it may depend on whether some Databricks-specific functionality is used or the code could be executed with PySpark)
  ■ We have access to more development-related tools — static code and code coverage analysis, code refactoring tools, interactive debugging, etc.
  ■ We can even package our Python code as a library, and attach it to multiple projects
■ We can define them in notebooks — with this approach:
  ■ We can get feedback faster as we always can run sample code and tests interactively
  ■ We can use additional tools like Nutter to trigger execution of notebooks from the CI/CD build pipeline (or from the local machine) and collect results for reporting

The demo repository contains sample code for both of these approaches — for local execution of the tests, and executing tests as notebooks. The CI pipeline shows both approaches.

Please note that both of these approaches are applicable only to Python code — if you’re implementing your DLT pipelines using SQL, then you need to follow the approach described in the next section.

IMPLEMENTING INTEGRATION TESTS

While unit tests give us assurance that individual transformations are working as they should, we still need to make sure that the whole pipeline also works. Usually this is implemented as an integration test that runs the whole pipeline, but usually it’s executed on a smaller amount of data, and we need to validate execution results. With Delta Live Tables, there are multiple ways to implement integration tests:

■ Implement it as a Databricks Workflow with multiple tasks — similar to what is typically done for non-DLT code
■ Use DLT expectations to check the pipeline’s results

IMPLEMENTING INTEGRATION TESTS WITH DATABRICKS WORKFLOWS

In this case we can implement integration tests with Databricks Workflows with multiple tasks (we can even pass data, such as data location, between tasks using task values). Typically such a workflow consists of the following tasks:

■ Set up data for the DLT pipeline
■ Execute the pipeline on this data
■ Perform validation of produced results

Figure: Implementing integration test with Databricks Workflows

The main drawback of this approach is that it requires writing quite a significant amount of auxiliary code for the setup and validation tasks, plus it requires additional compute resources to execute the setup and validation tasks.

USE DLT EXPECTATIONS TO IMPLEMENT INTEGRATION TESTS

We can implement integration tests for DLT by expanding the DLT pipeline with additional DLT tables that apply DLT expectations to data using the fail operator to fail the pipeline if results don’t match the provided expectations. It’s very easy to implement: just create a separate DLT pipeline that will include additional notebook(s) that define DLT tables with expectations attached to them.

For example, to check that the silver table includes only allowed data in the type column, we can add the following DLT table and attach expectations to it:

@dlt.table(comment="Check type")
@dlt.expect_all_or_fail({"valid type": "type in ('link', 'redlink')",
                         "type is not null": "type is not null"})
def filtered_type_check():
    return dlt.read("clickstream_filtered").select("type")

The resulting DLT pipeline for the integration test may look as follows (we have two additional tables in the execution graph that check that data is valid):

Figure: Implementing integration tests using DLT expectations

This is the recommended approach to performing integration testing of DLT pipelines. With this approach, we don’t need any additional compute resources: everything is executed in the same DLT pipeline, so we get cluster reuse, and all data is logged into the DLT pipeline’s event log that we can use for reporting, etc.

Please refer to the DLT documentation for more examples of using DLT expectations for advanced validations, such as checking uniqueness of rows, checking presence of specific rows in the results, etc. We can also build libraries of DLT expectations as shared Python modules for reuse between different DLT pipelines.

PROMOTING THE DLT ASSETS BETWEEN ENVIRONMENTS

When we’re talking about promotion of changes in the context of DLT, we’re talking about multiple assets:

■ Source code that defines transformations in the pipeline
■ Settings for a specific Delta Live Tables pipeline

The simplest way to promote the code is to use Databricks Repos to work with the code stored in the Git repository. Besides keeping your code versioned, Databricks Repos allows you to easily propagate the code changes to other environments using the Repos REST API or Databricks CLI.

From the beginning, DLT separates code from the pipeline configuration to make it easier to promote between stages by allowing you to specify the schemas, data locations, etc. So we can define a separate DLT configuration for each stage that will use the same code, while allowing you to store data in different locations, use different cluster sizes, etc.

To define pipeline settings we can use the Delta Live Tables REST API or the Databricks CLI’s pipelines command, but it becomes difficult if you need to use instance pools, cluster policies, or other dependencies. In this case the more flexible alternative is the Databricks Terraform Provider’s databricks_pipeline resource, which allows easier handling of dependencies to other resources, and we can use Terraform modules to modularize the Terraform code to make it reusable. The provided code repository contains examples of the Terraform code for deploying the DLT pipelines into multiple environments.

PUTTING EVERYTHING TOGETHER TO FORM A CI/CD PIPELINE

After we have implemented all the individual parts, it’s relatively easy to implement a CI/CD pipeline. The GitHub repository includes a build pipeline for Azure DevOps (other systems could be supported as well — the differences are usually in the file structure). This pipeline has two stages to show the ability to execute different sets of tests depending on the specific event:

■ onPush is executed on push to any Git branch except the releases branch and version tags. This stage only runs and reports unit test results (both local and notebooks).
■ onRelease is executed only on commits to the releases branch, and in addition to the unit tests it will execute a DLT pipeline with the integration test.

Figure: Structure of Azure DevOps build pipeline

Except for the execution of the integration test in the onRelease stage, the structure of both stages is the same — it consists of the following steps:

1. Check out the branch with changes

2. Set up the environment — install Poetry, which is used for Python environment management, and install the required dependencies

3. Update Databricks Repos in the staging environment

4. Execute local unit tests using PySpark

5. Execute the unit tests implemented as Databricks notebooks using Nutter

6. For the releases branch, execute integration tests

7. Collect test results and publish them to Azure DevOps



Results of test execution are reported back to Azure DevOps, so we can track them:

Figure: Reporting the tests execution results

If commits were done to the releases branch and all tests were successful, the release pipeline could be triggered, updating the production Databricks Repo, so changes in the code will be taken into account on the next run of the DLT pipeline.

Figure: Release pipeline to deploy code changes to production DLT pipeline

Try to apply the approaches described in this chapter to your Delta Live Table pipelines! The provided demo repository contains all necessary code together with setup instructions and Terraform code for deployment of everything to Azure DevOps.

Unity Catalog Governance in Action: Monitoring, Reporting and Lineage


by Ari Kaplan and Pearl Ubaru

Databricks Unity Catalog (UC) provides a single unified governance solution for all of a company’s data and AI assets across clouds and data platforms. This chapter
digs deeper into the prior Unity Catalog Governance Value Levers blog to show how the technology itself specifically enables positive business outcomes through
comprehensive data and AI monitoring, reporting, and lineage.

OVERALL CHALLENGES WITH TRADITIONAL NON-UNIFIED GOVERNANCE

The Unity Catalog Governance Value Levers blog discussed the “why” of the organizational importance of governance for information security, access control, usage monitoring, enacting guardrails, and obtaining “single source of truth” insights from data assets. These challenges compound as a company grows, and without Databricks UC, traditional governance solutions no longer adequately meet its needs.

The major challenges discussed included weaker compliance and fractured data privacy controls across multiple vendors; uncontrolled and siloed data and AI swamps; exponentially rising costs; and loss of opportunities, revenue, and collaboration.

HOW DATABRICKS UNITY CATALOG SUPPORTS A UNIFIED VIEW, MONITORING, AND OBSERVABILITY

So, how does this all work from a technical standpoint? UC manages all registered assets across the Databricks Data Intelligence Platform. These assets can be anything within BI, DW, data engineering, data streaming, data science, and ML. This governance model provides access controls, lineage, discovery, monitoring, auditing, and sharing. It also provides metadata management of files, tables, ML models, notebooks, and dashboards. UC gives one single view of your entire end-to-end information, through the Databricks asset catalog, feature store and model registry, lineage capabilities, and metadata tagging for data classifications, as discussed below:

Unified view of the entire data estate

■ Asset catalog: through system tables that contain metadata, you can see all that is contained in your catalog such as schemas, tables, columns, files, models, and more. If you are not familiar with volumes within Databricks, they are used for managing non-tabular datasets. Technically, they are logical volumes of storage to access files in any format: structured, semi-structured, and unstructured.

Catalog Explorer lets you discover and govern all your data and ML models

■ Feature Store and Model Registry: define features used by data scientists within the centralized repository. This is helpful for consistent model training and inference for your entire AI workflow.
■ Lineage capabilities: trust in your data is key for your business to take action in real life. End-to-end transparency into your data is needed for trust in your reports, models, and insights. UC makes this easy through lineage capabilities, providing insights on: What are the raw data sources? Who created it and when? How was data merged and transformed? What is the traceability from the models back to the datasets they are trained on? Lineage shows end-to-end from data to model - both table-level and column-level. You can even query across data sources such as Snowflake and benefit immediately:

Data sources can be across platforms such as Snowflake and Databricks

■ Metadata tagging for data classifications: enrich your data and queries by
providing contextual insights about your data assets. These descriptions
at the column and table level can be manually entered, or automatically
described with GenAI by Databricks Assistant. Below is an example of
descriptions and quantifiable characteristics:

Metadata tagging insights: details on the “features” table

Metadata tagging insights: frequent users, notebooks, queries, joins, billing trends and more

Databricks Assistant uses GenAI to write context-aware descriptions of columns and tables
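Descriptions and tags like those shown above can also be applied in code. The snippet below is a minimal sketch using Unity Catalog SQL from a notebook; the catalog, schema, table and tag names are hypothetical, and tag syntax can vary between Databricks releases.

# Add a table comment and classification tags to a Unity Catalog table
spark.sql("COMMENT ON TABLE main.ml.features IS 'Curated features for churn models'")
spark.sql("""
    ALTER TABLE main.ml.features
    SET TAGS ('classification' = 'pii', 'owner' = 'data-engineering')
""")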

Having one unified view results in:

■ Accelerated innovation: your insights are only as good as your data. Your analysis is only as good as the data you access. So, streamlining your data search drives faster and better generation of business insights, driving innovation.
■ Cost reduction through centralized asset cataloging: lowers license costs (just one vendor solution versus needing many vendors), lowers usage fees, reduces time to market pains, and enables overall operational efficiencies.
■ It’s easier to discover and access all data by reducing data sprawl across several databases, data warehouses, object storage systems, and more.

COMPREHENSIVE DATA AND AI MONITORING AND REPORTING

Databricks Lakehouse Monitoring allows teams to monitor their entire data pipelines — from data and features to ML models — without additional tools and complexity. Powered by Unity Catalog, it lets users uniquely ensure that their data and AI assets are high quality, accurate and reliable through deep insight into the lineage of their data and AI assets. The single, unified approach to monitoring enabled by lakehouse architecture makes it simple to diagnose errors, perform root cause analysis, and find solutions.

How do you ensure trust in your data, ML models, and AI across your entire data pipeline in a single view regardless of where the data resides? Databricks Lakehouse Monitoring is the industry’s only comprehensive solution from data (regardless of where it resides) to insights. It accelerates the discovery of issues, helps determine root causes, and ultimately assists in recommending solutions.

UC provides Lakehouse Monitoring capabilities with both democratized dashboards and granular governance information that can be directly queried through system tables. The democratization of governance extends operational oversight and compliance to non-technical people, allowing a broad variety of teams to monitor all of their pipelines.

Below is a sample dashboard of the results of an ML model including its accuracy over time:

It further shows data integrity of predictions and data drift over time:

And model performance over time, according to a variety of ML metrics such as R2, RMSE, and MAPE:

Lakehouse Monitoring dashboards show data and AI asset quality

It’s one thing to intentionally seek out ML model information when you are looking for answers, but it is a whole other level to get automated proactive alerts on errors, data drift, model failures, or quality issues. Below is an example alert for a potential PII (Personally Identifiable Information) data breach:

Example proactive alert of potential unmasked private data

One more thing — you can assess the impact of issues, do a root cause analysis, and assess the downstream impact with Databricks’ powerful lineage capabilities — from table-level to column-level.

System tables: metadata information for lakehouse observability and ensuring compliance

These underlying tables can be queried through SQL or activity dashboards to provide observability about every asset within the Databricks Data Intelligence Platform. Examples include which users have access to which data objects; billing tables that provide pricing and usage; compute tables that take cluster usage and warehouse events into consideration; and lineage information between columns and tables:

■ Audit tables include information on a wide variety of UC events. UC captures an audit log of actions performed against the metastore, giving administrators access to details about who accessed a given dataset and the actions that they performed.

■ Billing and historical pricing tables include records for all billable usage across the entire account; therefore you can view your account's global usage from whichever region your workspace is in.

■ Table lineage and column lineage tables allow you to programmatically query lineage data to fuel decision making and reports. Table lineage records each read-and-write event on a UC table or path, which might include job runs, notebook runs and dashboards associated with the table. For column lineage, data is captured by reading the column.

■ Node types tables capture the currently available node types with their basic hardware information, outlining the node type name, the number of vCPUs for the instance, and the number of GPUs and memory for the instance. Also in private preview are node_utilization metrics on how much usage each node is leveraging.

■ Query history holds information on all SQL commands, I/O performance, and the number of rows returned.

■ Clusters table contains the full history of cluster configurations over time for all-purpose and job clusters.

■ Predictive optimization tables track the operation history of tables whose data layout is automatically optimized for peak performance and cost efficiency, providing the catalog name, schema name, table name, and operation metrics about compaction and vacuuming.

From the catalog explorer, here are just a few of the system tables, any of which can be viewed for more details:
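As a simple illustration, system tables can be queried like any other table from a notebook or a DB SQL warehouse. The sketch below is illustrative only: it uses the billing usage table as documented for Unity Catalog system tables (system.billing.usage), and the exact tables and columns available to you may vary by cloud and enablement.

# Illustrative sketch: aggregate the last 30 days of DBU usage per workspace and SKU.
# Table and column names follow the documented system.billing.usage schema; adjust as needed.
recent_usage = spark.sql("""
    SELECT workspace_id, sku_name, usage_date, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY workspace_id, sku_name, usage_date
    ORDER BY usage_date
""")
display(recent_usage)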

As an example, drilling down on the "key_column_usage" table, you can see precisely how tables relate to each other via their primary key:

Another example is the "share_recipient_privileges" table, to see who granted which shares to whom:

The example dashboard below shows the number of users, tables, and ML models, the percentage of tables that are monitored, dollars spent on Databricks DBUs over time, and much more:

Governance dashboard showing billing trends, usage, activity and more

What does having a comprehensive data and AI monitoring and reporting tool result in?

■ Reduced risk of non-compliance: better monitoring of internal policies and potential security breaches safeguards your reputation and improves data and AI trust from employees and partners.

■ Improved integrity and trustworthiness of data and AI, with "one source of truth", anomaly detection, and reliability metrics.

Value Levers with Databricks Unity Catalog


If you are looking to learn more about the values Unity Catalog brings to
businesses, the prior Unity Catalog Governance Value Levers blog went
into detail: mitigating risk around compliance; reducing platform complexity
and costs; accelerating innovation; facilitating better internal and external
collaboration; and monetizing the value of data.

CO N C LU S I O N

Governance is key to mitigating risks, ensuring compliance, accelerating


innovation, and reducing costs. Databricks Unity Catalog is unique in the market,
providing a single unified governance solution for all of a company’s data and AI
across clouds and data platforms.

The Databricks UC architecture makes governance seamless: a unified view and discovery of all data assets, one tool for access management, one tool for auditing for enhanced data and AI security, and ultimately platform-independent collaboration that unlocks new business value.

Getting started is easy: UC comes enabled by default if you are a new Databricks customer, and there are no additional costs for premium or enterprise workspaces.

Scalable Spark Structured Streaming for REST API Destinations

by Art Rask and Jay Palaniappan

Spark Structured Streaming is the widely-used open source engine at the foundation of data streaming on the Data Intelligence Platform. It can elegantly handle diverse logical processing at volumes ranging from small-scale ETL to the largest Internet services. This power has led to adoption in many use cases across industries.

Another strength of Structured Streaming is its ability to handle a variety of both sources and sinks (or destinations). In addition to numerous sink types supported natively (incl. Delta, AWS S3, Google GCS, Azure ADLS, Kafka topics, Kinesis streams, and more), Structured Streaming supports a specialized sink that has the ability to perform arbitrary logic on the output of a streaming query: the foreachBatch extension method. With foreachBatch, any output target addressable through Python or Scala code can be the destination for a stream.

In this chapter we will share best practice guidance we've given customers who have asked how they can scalably turn streaming data into calls against a REST API. Routing an incoming stream of data to calls on a REST API is a requirement seen in many integration and data engineering scenarios.

Some practical examples that we often come across are in operational and security analytics workloads. Customers want to ingest and enrich real-time streaming data from sources like Kafka, Event Hubs, and Kinesis and publish it into operational search engines like Elasticsearch, OpenSearch, and Splunk. A key advantage of Spark Structured Streaming is that it allows us to enrich, perform data quality checks, and aggregate (if needed) before data is streamed out into the search engines. This provides customers a high-quality, real-time data pipeline for operational and security analytics.

The most basic representation of this scenario is shown in Figure 1. Here we have an incoming stream of data - it could be a Kafka topic, AWS Kinesis, Azure Event Hub, or any other streaming query source. As messages flow off the stream, we need to make calls to a REST API with some or all of the message data.
Figure 1

In a greenfield environment, there are many technical options to implement


this. Our focus here is on teams that already have streaming pipelines in Spark
for preparing data for machine learning, data warehousing, or other analytics-
focused uses. In this case, the team will already have skills, tooling and DevOps
processes for Spark. Assume the team now has a requirement to route some
data to REST API calls. If they wish to leverage existing skills or avoid re-working
their tool chains, they can use Structured Streaming to get it done.

KEY IMPLEMENTATION TECHNIQUES, AND SOME CODE

A basic code sample is included as Exhibit 1. Before looking at it in detail, we will call out some key techniques for effective implementation.

For a start, you will read the incoming stream as you would any other streaming job. All the interesting parts here are on the output side of the stream. If your data must be transformed in flight before posting to the REST API, do that as you would in any other case. This code snippet reads from a Delta table; as mentioned, there are many other possible sources.

dfSource = (spark.readStream
            .format("delta")
            .table("samples.nyctaxi.trips"))

For directing streamed data to the REST API, take the following approach:

1. Use the foreachBatch extension method to pass incoming micro-batches to a handler method (callRestAPIBatch) which will handle calls to the REST API. This is actually just one line of code.

   streamHandle = (dfSource.writeStream
                   .foreachBatch(callRestAPIBatch)
                   .start())

2. Whenever possible, group multiple rows from the input on each outgoing REST API call. In relative terms, making the API call over HTTP will be a slow part of the process. Your ability to reach high throughput will be dramatically improved if you include multiple messages/records on the body of each API call. Of course, what you can do will be dictated by the target REST API. Some APIs allow a POST body to include many items up to a maximum body size in bytes. Some APIs have a max count of items on the POST body. Determine the max you can fit on a single call for the target API. In your method invoked by foreachBatch, you will have a prep step to transform the micro-batch dataframe into a pre-batched dataframe where each row has the grouped records for one call to the API. This step is also a chance for any last transform of the records to the format expected by the target API. An example is shown in the code sample in Exhibit 1 with the call to a helper function named preBatchRecordsForRestCall.

3. In most cases, to achieve a desired level of throughput, you will want to make calls to the API from parallel tasks. You can control the degree of parallelism by calling repartition on the dataframe of pre-batched data. Call repartition with the number of parallel tasks you want calling the API.

   ### Repartition pre-batched df for parallelism of API calls
   new_df = pre_batched_df.repartition(8)

   It is worth mentioning (or admitting) that using repartition here is a bit of an anti-pattern. Explicit repartitioning with large datasets can have performance implications, especially if it causes a shuffle between nodes on the cluster. In most cases of calling a REST API, the data size of any micro-batch is not massive. So, in practical terms, this technique is unlikely to cause a problem. And, it has a big positive effect on throughput to the API.

4. Execute a dataframe transformation that calls a nested function dedicated to making a single call to the REST API. The input to this function will be one row of pre-batched data. In the sample, the payload column has the data to include on a single call. Call a dataframe action method to invoke execution of the transformation.

   submitted_df = new_df.withColumn("RestAPIResponseCode",\
                                    callRestApiOnce(new_df["payload"])).\
                                    collect()

5. Inside the nested function which will make one API call, use your libraries of choice to issue an HTTP POST against the REST API. This is commonly done with the Requests library, but any library suitable for making the call can be considered. See the callRestApiOnce method in Exhibit 1 for an example.

6. Handle potential errors from the REST API call by using a try/except block or checking the HTTP response code. If the call is unsuccessful, the overall job can be failed by throwing an exception (for job retry or troubleshooting), or individual records can be diverted to a dead letter queue for remediation or later retry.

   if not (response.status_code==200 or response.status_code==201):
       raise Exception("Response status : {} .Response message : {}".\
                       format(str(response.status_code),response.text))

The six elements above should prepare your code for sending streaming data to a REST API, with the ability to scale for throughput and to handle error conditions cleanly. The sample code in Exhibit 1 is an example implementation. Each point stated above is reflected in the full example.
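If failing the whole job is too coarse, one alternative (a sketch only, not part of Exhibit 1) is to divert failed rows to a dead letter table instead of raising. It assumes the helpers from Exhibit 1 are in scope and that the callRestApiOnce UDF is modified to return the HTTP status code rather than throw on failure; the table name rest_api_dead_letter is a placeholder.

import pyspark.sql.functions as F

# Sketch of a foreachBatch handler that routes unsuccessful calls to a dead letter table.
def callRestAPIBatchWithDLQ(df, batchId):
    pre_batched_df = preBatchRecordsForRestCall(df, 10)
    submitted_df = (pre_batched_df
                    .repartition(8)
                    .withColumn("RestAPIResponseCode", callRestApiOnce(F.col("payload"))))

    # Keep only the rows whose call did not return a success code and append them
    # to a dead letter table for remediation or later retry.
    (submitted_df
     .filter(~F.col("RestAPIResponseCode").isin("200", "201"))
     .withColumn("source_batch_id", F.lit(batchId))
     .write.mode("append")
     .saveAsTable("rest_api_dead_letter"))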

from pyspark.sql.functions import *
from pyspark.sql.window import Window
import math
import requests
from requests.adapters import HTTPAdapter

def preBatchRecordsForRestCall(microBatchDf, batchSize):
    batch_count = math.ceil(microBatchDf.count() / batchSize)
    microBatchDf = microBatchDf.withColumn("content", to_json(struct(col("*"))))
    microBatchDf = microBatchDf.withColumn("row_number",
                                           row_number().over(Window.orderBy(lit('A'))))
    microBatchDf = microBatchDf.withColumn("batch_id", col("row_number") % batch_count)
    return microBatchDf.groupBy("batch_id").\
                        agg(concat_ws(",|", collect_list("content")).\
                        alias("payload"))


def callRestAPIBatch(df, batchId):
    restapi_uri = "<REST API URL>"

    @udf("string")
    def callRestApiOnce(x):
        session = requests.Session()
        adapter = HTTPAdapter(max_retries=3)
        session.mount('http://', adapter)
        session.mount('https://', adapter)

        # this code sample calls an unauthenticated REST endpoint; add headers
        # necessary for auth
        headers = {'Authorization': 'abcd'}
        response = session.post(restapi_uri, headers=headers, data=x, verify=False)
        if not (response.status_code == 200 or response.status_code == 201):
            raise Exception("Response status : {} .Response message : {}".\
                            format(str(response.status_code), response.text))

        return str(response.status_code)

    ### Call helper method to transform df to pre-batched df with one row per REST API call
    ### The POST body size and formatting is dictated by the target API; this is an example
    pre_batched_df = preBatchRecordsForRestCall(df, 10)

    ### Repartition pre-batched df for target parallelism of API calls
    new_df = pre_batched_df.repartition(8)

    ### Invoke helper method to call REST API once per row in the pre-batched df
    submitted_df = new_df.withColumn("RestAPIResponseCode",
                                     callRestApiOnce(new_df["payload"])).collect()


dfSource = (spark.readStream
            .format("delta")
            .table("samples.nyctaxi.trips"))

streamHandle = (dfSource.writeStream
                .foreachBatch(callRestAPIBatch)
                .trigger(availableNow=True)
                .start())

Exhibit 1

DESIGN AND OPERATIONAL CONSIDERATIONS

Exactly Once vs At Least Once Guarantees


As a general rule in Structured Streaming, using foreachBatch only provides at-
least-once delivery guarantees. This is in contrast to the exactly-once delivery
guarantee provided when writing to sinks like a Delta table or file sinks. Consider,
for example, a case where 1,000 records arrive on a micro-batch and your code
in foreachBatch begins calling the REST API with the batch. In a hypothetical
failure scenario, let’s say that 900 calls succeed before an error occurs and fails
the job. When the stream restarts, processing will resume by re-processing the
failed batch. Without additional logic in your code, the 900 already-processed
calls will be repeated. It is important that you determine in your design whether
this is acceptable, or whether you need to take additional steps to protect
against duplicate processing.
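For illustration, one simple protective step (a sketch, not part of the original sample) is to record each successfully processed micro-batch ID in a small Delta table and skip any batch that has already been sent. Note that this only guards against a whole micro-batch being replayed after a restart; a mid-batch failure like the 900-of-1,000 scenario above still requires an idempotent API or record-level tracking. The table name rest_api_processed_batches and the handler callRestAPIBatch from Exhibit 1 are assumptions here.

import pyspark.sql.functions as F

# Sketch: skip micro-batches that were already fully processed before a restart.
def callRestAPIBatchIdempotent(df, batchId):
    already_sent = (spark.table("rest_api_processed_batches")
                    .filter(F.col("batch_id") == batchId)
                    .count() > 0)
    if already_sent:
        return

    callRestAPIBatch(df, batchId)  # the handler from Exhibit 1

    # Remember the batch only after every call in it succeeded.
    (spark.createDataFrame([(batchId,)], "batch_id BIGINT")
     .write.mode("append")
     .saveAsTable("rest_api_processed_batches"))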

The general rule when using foreachBatch is that your target sink (REST API in this case) should be idempotent or that you must do additional tracking to account for multiple calls with the same data.

Estimating Cluster Core Count for a Target Throughput

Given these techniques to call a REST API with streaming data, it will quickly become necessary to estimate how many parallel executors/tasks are necessary to achieve your required throughput. And you will need to select a cluster size. The following table shows an example calculation for estimating the number of worker cores to provision in the cluster that will run the stream.

Line H in the table shows the estimated number of worker cores necessary to sustain the target throughput. In the example shown here, you could provision a cluster with two 16-core workers or four 8-core workers, for example. For this type of workload, fewer nodes with more cores per node is preferred.

Line H is also the number that would be put in the repartition call in foreachBatch, as described in item 3 above.

Line G is a rule of thumb to account for other activity on the cluster. Even if your stream is the only job on the cluster, it will not be calling the API 100% of the time. Some time will be spent reading data from the source stream, for example. The value shown here is a good starting point for this factor - you may be able to fine-tune it based on observations of your workload.

Obviously, this calculation only provides an estimated starting point for tuning the size of your cluster. We recommend you start from here and adjust up or down to balance cost and throughput.
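The table itself is not reproduced here, but the arithmetic behind it is straightforward. The following sketch uses purely illustrative numbers (they are not the values from the original table); only the roles of Line G (the utilization rule of thumb) and Line H (the resulting core count) follow the description above.

import math

# Illustrative inputs only; substitute your own measurements.
target_records_per_sec = 50_000   # required end-to-end throughput
records_per_api_call   = 500      # records packed into each POST body
api_call_latency_sec   = 0.20     # measured time for one REST call

calls_per_sec_needed  = target_records_per_sec / records_per_api_call   # 100 calls/sec
parallel_calls_needed = calls_per_sec_needed * api_call_latency_sec     # 20 calls in flight

busy_fraction = 0.85   # "Line G": fraction of time a task actually spends calling the API
worker_cores  = math.ceil(parallel_calls_needed / busy_fraction)        # "Line H": 24 cores

# worker_cores is also the value to pass to repartition() in foreachBatch.
print(worker_cores)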

OTHER FACTORS TO CONSIDER

There are other factors you may need to plan for in your deployment. These are outside the scope of this chapter, but you will need to consider them as part of implementation. Among these are:

1. Authentication requirements of the target API: It is likely that the REST API will require authentication. This is typically done by adding required headers in your code before making the HTTP POST.

2. Potential rate limiting: The target REST API may implement rate limiting which will place a cap on the number of calls you can make to it per second or minute. You will need to ensure you can meet throughput targets within this limit. You'll also want to be ready to handle throttling errors that may occur if the limit is exceeded.

3. Network path required from worker subnet to target API: Obviously, the worker nodes in the host Spark cluster will need to make HTTP calls to the REST API endpoint. You'll need to use the available cloud networking options to configure your environment appropriately.

4. If you control the implementation of the target REST API (e.g., an internal custom service), be sure the design of that service is ready for the load and throughput generated by the streaming workload.

MEASURED THROUGHPUT TO A MOCKED API WITH DIFFERENT NUMBERS OF PARALLEL TASKS

To provide representative data of scaling REST API calls as described here, we ran tests using code very similar to Exhibit 1 against a mocked-up REST API that persisted data in a log.

Results from the test are shown in Table 1. These metrics confirm near-linear scaling as the task count was increased (by changing the partition count using repartition). All tests were run on the same cluster with a single 16-core worker node.

Table 1



REPRESENTATIVE ALL-UP PIPELINE DESIGNS

1. Routing some records in a streaming pipeline to a REST API (in addition to persistent sinks)

   This pattern applies in scenarios where a Spark-based data pipeline already exists for serving analytics or ML use cases. If a requirement emerges to post cleansed or aggregated data to a REST API with low latency, the technique described here can be used.

2. Simple Auto Loader to REST API job

   This pattern is an example of leveraging the diverse range of sources supported by Structured Streaming. Databricks makes it simple to consume incoming near real-time data - for example, using Auto Loader to ingest files arriving in cloud storage. Where Databricks is already used for other use cases, this is an easy way to route new streaming sources to a REST API.

SUMMARY

We have shown here how Structured Streaming can be used to send streamed
data to an arbitrary endpoint - in this case, via HTTP POST to a REST API. This
opens up many possibilities for flexible integration with analytics data pipelines.
However, this is really just one illustration of the power of foreachBatch in Spark
Structured Streaming.

The foreachBatch sink provides the ability to address many endpoint types that
are not among the native sinks. Besides REST APIs, these can include databases
via JDBC, almost any supported Spark connector, or other cloud services that are
addressable via a helper library or API. One example of the latter is pushing data
to certain AWS services using the boto3 library.
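As a quick illustration of that flexibility, the same foreachBatch pattern can write each micro-batch to a relational database through Spark's JDBC writer. The sketch below is generic: the JDBC URL, target table, secret scope and dfSource are placeholders, not details from this chapter's example.

# Minimal sketch of foreachBatch writing micro-batches to a database via JDBC.
def write_batch_to_jdbc(micro_batch_df, batch_id):
    (micro_batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://dbhost:5432/analytics")   # placeholder URL
     .option("dbtable", "public.stream_output")                  # placeholder table
     .option("user", dbutils.secrets.get("my_scope", "db_user"))
     .option("password", dbutils.secrets.get("my_scope", "db_password"))
     .mode("append")
     .save())

streamHandle = (dfSource.writeStream
                .foreachBatch(write_batch_to_jdbc)
                .start())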

This flexibility and scalability enables Structured Streaming to underpin a vast


range of real-time solutions across industries.

If you are a Databricks customer, simply follow the getting started tutorial
to familiarize yourself with Structured Streaming. If you are not an existing
Databricks customer, sign up for a free trial.

A Data Engineer’s Guide to Optimized


Streaming With Protobuf and Delta
Live Tables
by Craig Lukasik

This chapter describes an example use case where events from multiple games stream through Kafka and terminate in Delta tables. The example illustrates how to use Delta Live Tables (DLT) to:

1. Stream from Kafka into a Bronze Delta table.

2. Consume streaming Protobuf messages with schemas managed by the Confluent Schema Registry, handling schema evolution gracefully.

3. Demultiplex (demux) messages into multiple game-specific, append-only Silver Streaming Tables. Demux indicates that a single stream is split or fanned out into separate streams.

4. Create Materialized Views to recalculate aggregate values periodically.

A high-level view of the system architecture is illustrated below.

First, let's look at the Delta Live Tables code for the example and the related pipeline DAG so that we can get a glimpse of the simplicity and power of the DLT framework.



On the left, we see the DLT Python code. On the right, we see the view and the tables created by the code. The bottom cell of the notebook on the left operates on a list of games (GAMES_ARRAY) to dynamically generate the fourteen target tables we see in the DAG.

Before we go deeper into the example code, let's take a step back and review streaming use cases and some streaming payload format options.

STREAMING OVERVIEW

Skip this section if you're familiar with streaming use cases, protobuf, the schema registry, and Delta Live Tables. In this chapter, we'll dive into a range of exciting topics:

■ Common streaming use cases: Uncover the diverse streaming data applications in today's tech landscape.

■ Protocol buffers (Protobuf): Learn why this fast and compact serialization format is a game-changer for data handling.

■ Delta Live Tables (DLT): Discover how DLT pipelines offer a rich, feature-packed platform for your ETL (Extract, Transform, Load) needs.

■ Building a DLT pipeline: A step-by-step guide on creating a DLT pipeline that seamlessly consumes Protobuf values from an Apache Kafka stream.

■ Utilizing the Confluent Schema Registry: Understand how this tool is crucial for decoding binary message payloads effectively.

■ Schema evolution in DLT pipelines: Navigate the complexities of schema evolution within the DLT pipeline framework when streaming protobuf messages with an evolving schema.

COMMON STREAMING USE CASES

The Databricks Data Intelligence Platform is a comprehensive data-to-AI enterprise solution that combines data engineers, analysts, and data scientists on a single platform. Streaming workloads can power near real-time prescriptive and predictive analytics and automatically retrain Machine Learning (ML) models using Databricks built-in MLOps support. The models can be exposed as scalable, serverless REST endpoints, all within the Databricks platform.

The data comprising these streaming workloads may originate from various use cases:

STREAMING DATA | USE CASE
IoT sensors on manufacturing floor equipment | Generating predictive maintenance alerts and preemptive part ordering
Set-top box telemetry | Detecting network instability and dispatching service crews
Player metrics in a game | Calculating leaderboard metrics and detecting cheating

Data in these scenarios is typically streamed through open source messaging systems, which manage the data transfer from producers to consumers. Apache Kafka stands out as a popular choice for handling such payloads. Confluent Kafka and AWS MSK provide robust Kafka solutions for those seeking managed services.

OPTIMIZING THE STREAMING PAYLOAD FORMAT

Databricks provides capabilities that help optimize the AI journey by unifying Business Analysis, Data Science, and Data Analysis activities in a single, governed platform. In your quest to optimize the end-to-end technology stack, a key focus is the serialization format of the message payload. This element is crucial for efficiency and performance. We'll specifically explore an optimized format developed by Google, known as protocol buffers (or "protobuf"), to understand how it enhances the technology stack.

WHAT MAKES PROTOBUF AN OPTIMIZED SERIALIZATION FORMAT?

Google enumerates the advantages of protocol buffers, including compact data storage, fast parsing, availability in many programming languages, and optimized functionality through automatically generated classes.

A key aspect of optimization usually involves using pre-compiled classes in the consumer and producer programs that a developer typically writes. In a nutshell, consumer and producer programs that leverage protobuf are "aware" of a message schema, and the binary payload of a protobuf message benefits from primitive data types and positioning within the binary message, removing the need for field markers or delimiters.

WHY IS PROTOBUF USUALLY PAINFUL TO WORK WITH?

Programs that leverage protobuf must work with classes or modules compiled using protoc (the protobuf compiler). The protoc compiler compiles message definitions into classes in various languages, including Java and Python. To learn more about how protocol buffers work, go here.

DATABRICKS MAKES WORKING WITH PROTOBUF EASY

Starting in Databricks Runtime 12.1, Databricks provides native support for serialization and deserialization between Apache Spark structs and protobuf messages. Protobuf support is implemented as an Apache Spark DataFrame transformation and can be used with Structured Streaming or for batch operations. It optionally integrates with the Confluent Schema Registry (a Databricks-exclusive feature).

Databricks makes it easy to work with protobuf because it handles the protobuf compilation under the hood for the developer. For instance, the data pipeline developer does not have to worry about installing protoc or using it to compile protocol definitions into Python classes.

EXPLORING PAYLOAD FORMATS FOR STREAMING IOT DATA

Before we proceed, it is worth mentioning that JSON or Avro may be suitable alternatives for streaming payloads. These formats offer benefits that, for some use cases, may outweigh protobuf. Let's quickly review these formats.

JSON

JSON is an excellent format for development because it is primarily human-readable. The other formats we'll explore are binary formats, which require tools to inspect the underlying data values. Unlike Avro and protobuf, however, the JSON document is stored as a large string (potentially compressed), meaning more bytes may be used than a value represents. Consider the short int value of 8. A short int requires two bytes. In JSON, you may have a document that looks like the following, and it will require several bytes (~30) for the associated key, quotes, etc.

{
  "my_short": 8
}

When we consider protobuf, we expect 2 bytes plus a few more for the overhead related to the positioning metadata.

JSON SUPPORT IN DATABRICKS

On the positive side, JSON documents have rich benefits when used with Databricks. Databricks Auto Loader can easily transform JSON to a structured DataFrame while also providing built-in support for:

■ Schema inference - when reading JSON into a DataFrame, you can supply a schema so that the target DataFrame or Delta table has the desired schema. Or you can let the engine infer the schema. Alternatively, schema hints can be supplied if you want a balance of those features.

■ Schema evolution - Auto Loader provides options for how a workload should adapt to changes in the schema of incoming files.

Consuming and processing JSON in Databricks is simple. Creating a Spark DataFrame from JSON files can be as simple as this:

df = spark.read.format("json").load("example.json")

AVRO

Avro is an attractive serialization format because it is compact, encompasses schema information in the files themselves, and has built-in support in Databricks that includes schema registry integration. This tutorial, co-authored by Databricks' Angela Chu, walks you through an example that leverages Confluent's Kafka and Schema Registry.

To explore an Avro-based dataset, it is as simple as working with JSON:

df = spark.read.format("avro").load("example.avro")

This datageeks.com article compares Avro and protobuf. It is worth a read if you are on the fence between Avro and protobuf. It describes protobuf as the "fastest amongst all," so if speed outweighs other considerations, such as JSON and Avro's greater simplicity, protobuf may be the best choice for your use case.

EXAMPLE DEMUX PIPELINE

The source code for the end-to-end example is located on GitHub. The example includes a simulator (Producer), a notebook to install the Delta Live Tables pipeline (Install_DLT_Pipeline), and a Python notebook to process the data that is streaming through Kafka (DLT).

SCENARIO

Imagine a scenario where a video gaming company is streaming events from game consoles and phone-based games for a number of the games in its portfolio. Imagine the game event messages have a single schema that evolves (i.e., new fields are periodically added). Lastly, imagine that analysts want the data for each game to land in its own Delta Lake table. Some analysts and BI tools need pre-aggregated data, too.

Using DLT, our pipeline will create 1+2N tables:

■ One table for the raw data (stored in the Bronze view).

■ One Silver Streaming Table for each of the N games, with events streaming through the Bronze table.

■ Each game will also have a Gold Delta table with aggregates based on the associated Silver table.

CODE WALKTHROUGH

BRONZE TABLE DEFINITION

We'll define the Bronze table (bronze_events) as a DLT view by using the @dlt.view annotation:

import pyspark.sql.functions as F
from pyspark.sql.protobuf.functions import from_protobuf

@dlt.view
def bronze_events():
    return (
        spark.readStream.format("kafka")
        .options(**kafka_options)
        .load()
        .withColumn('decoded', from_protobuf(F.col("value"), options = schema_registry_options))
        .selectExpr("decoded.*")
    )

The repo includes the source code that constructs values for kafka_options. These details are needed so the streaming Delta Live Table can consume messages from the Kafka topic and retrieve the schema from the Confluent Schema Registry (via config values in schema_registry_options). This line of code is what manages the deserialization of the protobuf messages:

.withColumn('decoded', from_protobuf(F.col("value"), options = schema_registry_options))
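For orientation, the two option dictionaries might look roughly like the following. The keys shown for the Kafka reader are standard Structured Streaming Kafka options; the schema registry keys and all values are illustrative placeholders, and the exact entries are constructed in the repo from the configured secrets (see the setup section at the end of this chapter).

# Illustrative placeholders only; the repo builds these values from Databricks secrets.
KAFKA_SERVER = "<host:port>"
KAFKA_KEY = "<kafka api key>"
KAFKA_SECRET = "<kafka api secret>"

kafka_options = {
    "kafka.bootstrap.servers": KAFKA_SERVER,
    "subscribe": "<the kafka topic>",
    "startingOffsets": "earliest",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f"required username='{KAFKA_KEY}' password='{KAFKA_SECRET}';"
    ),
}

# Passed to from_protobuf so it can resolve the message schema in the Confluent Schema
# Registry; consult the repo and the Databricks protobuf docs for the exact option names.
schema_registry_options = {
    "schema.registry.subject": "<topic>-value",
    "schema.registry.address": "<schema registry url>",
    "confluent.schema.registry.basic.auth.credentials.source": "USER_INFO",
    "confluent.schema.registry.basic.auth.user.info": "<sr api key>:<sr api secret>",
}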

The simplicity of transforming a DataFrame with a protobuf payload is thanks to this function: from_protobuf (available in Databricks Runtime 12.1 and later). In this chapter, we don't cover to_protobuf, but the ease of use is the same. The schema_registry_options are used by the function to look up the schema from the Confluent Schema Registry.

Delta Live Tables is a declarative ETL framework that simplifies the development of data pipelines. So, suppose you are familiar with Apache Spark Structured Streaming. In that case, you may notice the absence of a checkpointLocation (which is required to track the stream's progress so that the stream can be stopped and started without duplicating or dropping data). The absence of the checkpointLocation is because Delta Live Tables manages this need out-of-the-box for you. Delta Live Tables also has other features that help make developers more agile and provide a common framework for ETL across the enterprise. Delta Live Tables Expectations, used for managing data quality, is one such feature.

SILVER TABLES

The following function creates a Silver Streaming Table for the given game name provided as a parameter:

def build_silver(gname):
    @dlt.table(name=f"silver_{gname}_events")
    def silver_unified():
        return dlt.read_stream("bronze_events").where(F.col("game_name") == gname)

Notice the use of the @dlt.table annotation. Thanks to this annotation, when build_silver is invoked for a given gname, a DLT table will be defined that depends on the source bronze_events table. We know that the tables created by this function will be Streaming Tables because of the use of dlt.read_stream.

GOLD TABLES

The following function creates a Gold Materialized View for the given game name provided as a parameter:

def build_gold(gname):
    @dlt.table(name=f"gold_{gname}_player_agg")
    def gold_unified():
        return (
            dlt.read(f"silver_{gname}_events")
            .groupBy(["gamer_id"])
            .agg(
                F.count("*").alias("session_count"),
                F.min(F.col("event_timestamp")).alias("min_timestamp"),
                F.max(F.col("event_timestamp")).alias("max_timestamp")
            )
        )

We know the resulting table will be a "Materialized View" because of the use of dlt.read. This is a simple Materialized View definition; it simply performs a count of source events along with min and max event times, grouped by gamer_id.

METADATA-DRIVEN TABLES

The previous two sections of this article defined functions for creating Silver
(Streaming) Tables and Gold Materialized Views. The metadata-driven approach
in the example code uses a pipeline input parameter to create N*2 target tables
(one Silver table for each game and one aggregate Gold table for each game).
This code drives the dynamic table creation using the aforementioned
build_silver and build_gold functions:

GAMES_ARRAY = spark.conf.get("games").split(",")

for game in GAMES_ARRAY:
    build_silver(game)
    build_gold(game)

At this point, you might have noticed that much of the control flow code
data engineers often have to write is absent. This is because, as mentioned
above, DLT is a declarative programming framework. It automatically detects
dependencies and manages the pipeline’s execution flow. Here’s the DAG that
DLT creates for the pipeline:

A note about aggregates in a streaming pipeline



For a continuously running stream, calculating some aggregates can be very resource-intensive. Consider a scenario where you must calculate the "median" for a continuous stream of numbers. Every time a new number arrives in the stream, the median calculation will need to explore the entire set of numbers that have ever arrived. In a stream receiving millions of numbers per second, this fact can present a significant challenge if your goal is to provide a destination table for the median of the entire stream of numbers. It becomes impractical to perform such a feat every time a new number arrives. The limits of computation power and persistent storage and network would mean that the stream would continue to grow a backlog much faster than it could perform the calculations.

In a nutshell, it would not work out well if you had such a stream and tried to recalculate the median for the universe of numbers that have ever arrived in the stream. So, what can you do? If you look at the code snippet above, you may notice that this problem is not addressed in the code! Fortunately, as a Delta Live Tables developer, I do not have to worry about it. The declarative framework handles this dilemma by design. DLT addresses this by materializing results only periodically. Furthermore, DLT provides a table property that allows the developer to set an appropriate trigger interval.

REVIEWING THE BENEFITS OF DLT

Governance

Unity Catalog governs the end-to-end pipeline. Thus, permission to target tables can be granted to end-users and service principals needing access across any Databricks workspaces attached to the same metastore.

Lineage

From the Delta Live Tables interface, we can navigate to the Catalog and view lineage.

Click on a table in the DAG. Then click on the "Target table" link.

Click on the "Lineage" tab for the table. Then click on the "See lineage graph" link.

Lineage also provides visibility into other related artifacts, such as notebooks, models, etc.

This lineage helps accelerate team velocity by making it easier to understand how assets in the workspace are related.

HANDS-OFF SCHEMA EVOLUTION

As the source stream's schema evolves, Delta Live Tables will detect the change and the pipeline will restart. To simulate a schema evolution for this example, you would run the Producer notebook a subsequent time but with a larger value for num_versions, as shown on the left. This will generate new data where the schema includes some additional columns. The Producer notebook updates the schema details in the Confluent Schema Registry.

When the schema evolves, you will see a pipeline failure like this one:

If the Delta Live Tables pipeline runs in Production mode, a failure will result in an automatic pipeline restart. The Schema Registry will be contacted upon restart to retrieve the latest schema definitions. Once back up, the stream will continue with a new run:

CONCLUSION

In high-performance IoT systems, optimization extends through every layer of the technology stack, including the payload format of messages in transit. Throughout this chapter, we've delved into the benefits of using an optimized serialization format, protobuf, and demonstrated its integration with Databricks to construct a comprehensive end-to-end demultiplexing pipeline. This approach underlines the importance of selecting the right tools and formats to maximize efficiency and effectiveness in IoT systems.

Instructions for running the example

To run the example code, follow these instructions:

1. In Databricks, clone this repo: https://github.com/craig-db/protobuf-dlt-schema-evolution.

2. Set up the prerequisites (documented below).

3. Follow the instructions in the README notebook included in the repo code.

Prerequisites

1. A Unity Catalog-enabled workspace — this demo uses a Unity Catalog-enabled Delta Live Tables pipeline. Thus, Unity Catalog should be configured for the workspace where you plan to run the demo.

2. As of January 2024, you should use the Preview channel for the Delta Live Tables pipeline. The "Install_DLT_Pipeline" notebook will use the Preview channel when installing the pipeline.

3. Confluent account – this demo uses Confluent Schema Registry and Confluent Kafka.

Secrets to configure

The following Kafka and Schema Registry connection details (and credentials) should be saved as Databricks Secrets and then set within the Secrets notebook that is part of the repo:

■ SR_URL: Schema Registry URL (e.g., https://myschemaregistry.aws.confluent.cloud)

■ SR_API_KEY: Schema Registry API Key

■ SR_API_SECRET: Schema Registry API Secret

■ KAFKA_KEY: Kafka API Key

■ KAFKA_SECRET: Kafka Secret

■ KAFKA_SERVER: Kafka host:port (e.g., mykafka.aws.confluent.cloud:9092)

■ KAFKA_TOPIC: The Kafka Topic

■ TARGET_SCHEMA: The target database where the streaming data will be appended into a Delta table (the destination table is named unified_gold)

■ CHECKPOINT_LOCATION: Some location (e.g., in DBFS) where the checkpoint data for the streams will be stored

Go here to learn how to save secrets to secure sensitive information (e.g., credentials) within the Databricks Workspace: https://docs.databricks.com/security/secrets/index.html.
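Once stored, the secrets can be read back in a notebook with dbutils.secrets. A small illustrative snippet (the scope name protobuf-demo is an assumption; use whatever scope you created, and see the repo's Secrets notebook for the authoritative version):

# "protobuf-demo" is a placeholder scope name; the keys match the list above.
SR_URL = dbutils.secrets.get(scope="protobuf-demo", key="SR_URL")
SR_API_KEY = dbutils.secrets.get(scope="protobuf-demo", key="SR_API_KEY")
SR_API_SECRET = dbutils.secrets.get(scope="protobuf-demo", key="SR_API_SECRET")
KAFKA_SERVER = dbutils.secrets.get(scope="protobuf-demo", key="KAFKA_SERVER")
KAFKA_TOPIC = dbutils.secrets.get(scope="protobuf-demo", key="KAFKA_TOPIC")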

Design Patterns for Batch Processing in Financial Services

by Eon Retief

Financial services institutions (FSIs) around the world are facing unprecedented challenges ranging from market volatility and political uncertainty to changing legislation and regulations. Businesses are forced to accelerate digital transformation programs, automating critical processes to reduce operating costs and improve response times. However, with data typically scattered across multiple systems, accessing the information required to execute on these initiatives tends to be easier said than done.

Architecting an ecosystem of services able to support the plethora of data-driven use cases in this digitally transformed business can, however, seem to be an impossible task. This chapter will focus on one crucial aspect of the modern data stack: batch processing. Although it may seem an outdated paradigm, we'll see why batch processing remains a vital and highly viable component of the data architecture. And we'll see how Databricks can help FSIs navigate some of the crucial challenges faced when building infrastructure to support these scheduled or periodic workflows.

WHY BATCH INGESTION MATTERS

Over the last two decades, the global shift towards an instant society has forced organizations to rethink the operating and engagement model. A digital-first strategy is no longer optional but vital for survival. Customer needs and demands are changing and evolving faster than ever. This desire for instant gratification is driving an increased focus on building capabilities that support real-time processing and decisioning. One might ask whether batch processing is still relevant in this new dynamic world.

While real-time systems and streaming services can help FSIs remain agile in addressing volatile market conditions at the edge, they do not typically meet the requirements of back-office functions. Most business decisions are not reactive but rather require considered, strategic reasoning. By definition, this approach requires a systematic review of aggregate data collected over a period of time. Batch processing in this context still provides the most efficient and cost-effective method for processing large, aggregate volumes of data. Additionally, batch processing can be done offline, reducing operating costs and providing greater control over the end-to-end process.

The world of finance is changing, but across the board, incumbents and startups continue to rely heavily on batch processing to power core business functions. Whether for reporting and risk management or anomaly detection and surveillance, FSIs require batch processing to reduce human error, increase the speed of delivery, and reduce operating costs.

GETTING STARTED

Starting with a 30,000-ft view, most FSIs will have a multitude of data sources scattered across on-premises systems, cloud-based services and even third-party applications. Building a batch ingestion framework that caters to all these connections requires complex engineering and can quickly become a burden on maintenance teams. And that's even before considering things like change data capture (CDC), scheduling, and schema evolution. In this section, we will demonstrate how the Databricks Lakehouse for Financial Services (LFS) and its ecosystem of partners can be leveraged to address these key challenges and greatly simplify the overall architecture.

The Databricks lakehouse architecture was designed to provide a unified platform that supports all analytical and scientific data workloads. Figure 1 shows the reference architecture for a decoupled design that allows easy integration with other platforms that support the modern data ecosystem. The lakehouse makes it easy to construct ingestion and serving layers that operate irrespective of the data's source, volume, velocity, and destination.

Figure 1: Reference architecture of the Lakehouse for Financial Services

To demonstrate the power and efficiency of the LFS, we turn to the world of insurance. We consider the basic reporting requirements for a typical claims workflow. In this scenario, the organization might be interested in the key metrics driven by claims processes. For example:

■ Number of active policies
■ Number of claims
■ Value of claims
■ Total exposure
■ Loss ratio

Additionally, the business might want a view of potentially suspicious claims and a breakdown by incident type and severity. All these metrics are easily calculable from two key sources of data: 1) the book of policies and 2) claims filed by customers. The policy and claims records are typically stored in a combination of enterprise data warehouses (EDWs) and operational databases. The main challenge becomes connecting to these sources and ingesting data into our lakehouse, where we can leverage the power of Databricks to calculate the desired outputs.

Luckily, the flexible design of the LFS makes it easy to leverage best-in-class products from a range of SaaS technologies and tools to handle specific tasks. One possible solution for our claims analytics use case would be to use Fivetran for the batch ingestion plane. Fivetran provides a simple and secure platform for connecting to numerous data sources and delivering data directly to the Databricks lakehouse. Additionally, it offers native support for CDC, schema evolution and workload scheduling. In Figure 2, we show the technical architecture of a practical solution for this use case.



Figure 2: Technical architecture for a simple insurance claims workflow



Once the data is ingested and delivered to the LFS, we can use Delta Live
Tables (DLT) for the entire engineering workflow. DLT provides a simple, scalable
declarative framework for automating complex workflows and enforcing data
quality controls. The outputs from our DLT workflow, our curated and aggregated
assets, can be interrogated using Databricks SQL (DB SQL). DB SQL brings data
warehousing to the LFS to power business-critical analytical workloads. Results
from DB SQL queries can be packaged in easy-to-consume dashboards and
served to business users.

STEP 1: CREATING THE INGESTION LAYER

Setting up an ingestion layer with Fivetran is a two-step process: first, configuring a destination where data will be delivered, and second,
establishing one or more connections with the source systems. The Partner
Connect interface takes care of the first step with a simple, guided interface to
connect Fivetran with a Databricks Warehouse. Fivetran will use the warehouse
to convert raw source data to Delta Tables and store the results in the Databricks
Lakehouse. Figures 3 and 4 show steps from the Partner Connect and Fivetran
interfaces to configure a new destination.

Figure 3: Databricks Partner Connect interface for creating a new connection



For the next step, we move to the Fivetran interface. From here, we can
easily create and configure connections to several different source systems
(please refer to the official documentation for a complete list of all supported
connections). In our example, we consider three sources of data: 1) policy
records stored in an Operational Data Store (ODS) or Enterprise Data Warehouse
(EDW), 2) claims records stored in an operational database, and 3) external data
delivered to blob storage. As such, we require three different connections to be
configured in Fivetran. For each of these, we can follow Fivetran’s simple guided
process to set up a connection with the source system. Figures 5 and 6 show
how to configure new connections to data sources.

Figure 4: Fivetran interface for confirming a new destination

Figure 5: Fivetran interface for selecting a data source type



Figure 6: Fivetran interface for configuring a data source connection

Connections can further be configured once they have been validated. One important option to set is the frequency with which Fivetran will interrogate the source system for new data. In Figure 7, we can see how easy Fivetran has made it to set the sync frequency, with intervals ranging from 5 minutes to 24 hours.

Figure 7: Overview of configuration for a Fivetran connector

Fivetran will immediately interrogate and ingest data from source systems once a connection is validated. Data is stored as Delta tables and can be viewed from within Databricks through the DB SQL Data Explorer. By default, Fivetran will store all data under the Hive metastore. A new schema is created for each new connection, and each schema will contain at least two tables: one containing the data and another with logs from each attempted ingestion cycle (see Figure 8).

Figure 8: Summary of tables created by Fivetran in the Databricks Warehouse for an example connection

Having the data stored in Delta tables is a significant advantage. Delta Lake natively supports granular data versioning, meaning we can time travel through each ingestion cycle (see Figure 9). We can use DB SQL to interrogate specific versions of the data to analyze how the source records evolved.

Figure 9: View of the history showing changes made to the Fivetran audit table

It is important to note that if the source data contains semi-structured or unstructured values, those attributes will be flattened during the conversion process. This means that the results will be stored in grouped text-type columns, and these entities will have to be dissected and unpacked with DLT in the curation process to create separate attributes.

STEP 2: AUTOMATING THE WORKFLOW

With the data in the Databricks Data Intelligence Platform, we can use Delta Live Tables (DLT) to build a simple, automated data engineering workflow. DLT provides a declarative framework for specifying detailed feature engineering steps. Currently, DLT supports APIs for both Python and SQL. In this example, we will use Python APIs to build our workflow.

The most fundamental construct in DLT is the definition of a table. DLT interrogates all table definitions to create a comprehensive workflow for how data should be processed. For instance, in Python, tables are created using function definitions and the `dlt.table` decorator (see example of Python code below). The decorator is used to specify the name of the resulting table, a descriptive comment explaining the purpose of the table, and a collection of table properties.

@dlt.table(
    name = "curated_claims",
    comment = "Curated claim records",
    table_properties = {
        "layer": "silver",
        "pipelines.autoOptimize.managed": "true",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def curate_claims():
    # Read the staged claim records into memory
    staged_claims = dlt.read("staged_claims")
    # Unpack all nested attributes to create a flattened table structure
    curated_claims = unpack_nested(df = staged_claims, schema = schema_claims)
    ...

Instructions for feature engineering are defined inside the function body using standard PySpark APIs and native Python commands. The following example shows how PySpark joins claims records with data from the policies table to create a single, curated view of claims.

    ...

    # Read the curated policy records into memory
    curated_policies = dlt.read("curated_policies")
    # Evaluate the validity of the claim
    curated_claims = curated_claims \
        .alias("a") \
        .join(
            curated_policies.alias("b"),
            on = F.col("a.policy_number") == F.col("b.policy_number"),
            how = "left"
        ) \
        .select([F.col(f"a.{c}") for c in curated_claims.columns] + [F.col(f"b.{c}").alias(f"policy_{c}") for c in ("effective_date", "expiry_date")]) \
        .withColumn(
            # Calculate the number of months between coverage starting and the claim being filed
            "months_since_covered", F.round(F.months_between(F.col("claim_date"), F.col("policy_effective_date")))
        ) \
        .withColumn(
            # Check if the claim was filed before the policy came into effect
            "claim_before_covered", F.when(F.col("claim_date") < F.col("policy_effective_date"), F.lit(1)).otherwise(F.lit(0))
        ) \
        .withColumn(
            # Calculate the number of days between the incident occurring and the claim being filed
            "days_between_incident_and_claim", F.datediff(F.col("claim_date"), F.col("incident_date"))
        )

    # Return the curated dataset
    return curated_claims

One significant advantage of DLT is the ability to specify and enforce data quality standards. We can set expectations for each DLT table with detailed data quality constraints that should be applied to the contents of the table. Currently, DLT supports expectations for three different scenarios:

DECORATOR | DESCRIPTION
expect | Retain records that violate expectations
expect_or_drop | Drop records that violate expectations
expect_or_fail | Halt the execution if any record(s) violate constraints

Expectations can be defined with one or more data quality constraints. Each constraint requires a description and a Python or SQL expression to evaluate. Multiple constraints can be defined using the expect_all, expect_all_or_drop, and expect_all_or_fail decorators. Each decorator expects a Python dictionary where the keys are the constraint descriptions, and the values are the respective expressions. The example below shows multiple data quality constraints for the retain and drop scenarios described above.

@dlt.expect_all({
    "valid_driver_license": "driver_license_issue_date > (current_date() - cast(cast(driver_age AS INT) AS INTERVAL YEAR))",
    "valid_claim_amount": "total_claim_amount > 0",
    "valid_coverage": "months_since_covered > 0",
    "valid_incident_before_claim": "days_between_incident_and_claim > 0"
})
@dlt.expect_all_or_drop({
    "valid_claim_number": "claim_number IS NOT NULL",
    "valid_policy_number": "policy_number IS NOT NULL",
    "valid_claim_date": "claim_date < current_date()",
    "valid_incident_date": "incident_date < current_date()",
    "valid_incident_hour": "incident_hour between 0 and 24",
    "valid_driver_age": "driver_age > 16",
    "valid_effective_date": "policy_effective_date < current_date()",
    "valid_expiry_date": "policy_expiry_date <= current_date()"
})
def curate_claims():
    ...

We can use more than one Databricks Notebook to declare our DLT tables. Assuming we follow the medallion architecture, we can, for example, use different notebooks to define tables comprising the bronze, silver, and gold layers.

The DLT framework can digest instructions defined across multiple notebooks to create a single workflow; all inter-table dependencies and relationships are processed and considered automatically. Figure 10 shows the complete workflow for our claims example. Starting with three source tables, DLT builds a comprehensive pipeline that delivers thirteen tables for business consumption.

Figure 10: Overview of a complete Delta Live Tables (DLT) workflow

Results for each table can be inspected by selecting the desired entity. Figure 11 provides an example of the results of the curated claims table. DLT provides a high-level overview of the results from the data quality controls:

Figure 11: Example of detailed view for a Delta Live Tables (DLT) table entity with the associated data quality report

Results from the data quality expectations can be analyzed further by querying the event log. The event log contains detailed metrics about all expectations defined for the workflow pipeline. The query below provides an example for viewing key metrics from the last pipeline update, including the number of records that passed or failed expectations:

SELECT
  row_expectations.dataset AS dataset,
  row_expectations.name AS expectation,
  SUM(row_expectations.passed_records) AS passing_records,
  SUM(row_expectations.failed_records) AS failing_records
FROM
  (
    SELECT
      explode(
        from_json(
          details:flow_progress:data_quality:expectations,
          "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
        )
      ) row_expectations
    FROM
      event_log_raw
    WHERE
      event_type = 'flow_progress'
      AND origin.update_id = '${latest_update.id}'
  )
GROUP BY
  row_expectations.dataset,
  row_expectations.name;

Again, we can view the complete history of changes made to each DLT table
by looking at the Delta history logs (see Figure 12). It allows us to understand
how tables evolve over time and investigate complete threads of updates
if a pipeline fails.
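
For example, the history of the curated claims table can be pulled with a single statement (the table name is a placeholder for this demo's schema):

DESCRIBE HISTORY insurance_demo_lakehouse.curated_claims;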

Figure 12: View the history of changes made to a resulting Delta Live Tables (DLT) table entity

We can further use change data capture (CDC) to update tables based on changes in the source datasets. DLT CDC supports updating tables with slowly changing dimensions (SCD) types 1 and 2.

We have one of two options for our batch process to trigger the DLT pipeline. We can use the Databricks Auto Loader to incrementally process new data as it arrives in the source tables, or create scheduled jobs that trigger at set times or intervals. In this example, we opted for the latter, with a scheduled job that executes the DLT pipeline every five minutes.

OPERATIONALIZING THE OUTPUTS

The ability to incrementally process data efficiently is only half of the equation. Results from the DLT workflow must be operationalized and delivered to business users. In our example, we can consume outputs from the DLT pipeline through ad hoc analytics or prepackaged insights made available through an interactive dashboard.

AD HOC ANALYTICS

Databricks SQL (or DB SQL) provides an efficient, cost-effective data warehouse on top of the Data Intelligence Platform. It allows us to run our SQL workloads directly against the source data with up to 12x better price/performance than its alternatives.

We can leverage DB SQL to perform specific ad hoc queries against our curated and aggregated tables. We might, for example, run a query against the curated policies table that calculates the total exposure. The DB SQL query editor provides a simple, easy-to-use interface to build and execute such queries (see example below).

SELECT
  round(curr.total_exposure, 0) AS total_exposure,
  round(prev.total_exposure, 0) AS previous_exposure
FROM
  (
    SELECT
      sum(sum_insured) AS total_exposure
    FROM
      insurance_demo_lakehouse.curated_policies
    WHERE
      expiry_date > '{{ date.end }}'
      AND (effective_date <= '{{ date.start }}'
        OR (effective_date BETWEEN '{{ date.start }}' AND '{{ date.end }}'))
  ) curr
JOIN
  (
    SELECT
      ...

We can also use the DB SQL query editor to run queries against different versions
of our Delta tables. For example, we can query a view of the aggregated claims
records for a specific date and time (see example below). We can further use
DB SQL to compare results from different versions to analyze only the changed
records between those states.

SELECT
  *
FROM
  insurance_demo_lakehouse.aggregated_claims_weekly TIMESTAMP AS OF '2022-06-05T17:00:00';
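
To analyze only the records that changed between two states, one option is to combine Delta time travel with a set operation. A minimal sketch, assuming two arbitrary version numbers of the same table:

-- Rows present in version 2 but not in version 1 (version numbers are placeholders)
SELECT * FROM insurance_demo_lakehouse.aggregated_claims_weekly VERSION AS OF 2
EXCEPT
SELECT * FROM insurance_demo_lakehouse.aggregated_claims_weekly VERSION AS OF 1;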

DB SQL offers the option to use a serverless compute engine, eliminating the need to configure, manage or scale cloud infrastructure while maintaining the lowest possible cost. It also integrates with alternative SQL workbenches (e.g., DataGrip), allowing analysts to use their favorite tools to explore the data and generate insights.

BUSINESS INSIGHTS

Finally, we can use DB SQL queries to create rich visualizations on top of our query results. These visualizations can then be packaged and served to end users through interactive dashboards (see Figure 13).

Figure 13: Example operational dashboard built on a set of resulting Delta Live Tables (DLT) table entities

For our use case, we created a dashboard with a collection of key metrics, rolling calculations, high-level breakdowns, and aggregate views. The dashboard provides a complete summary of our claims process at a glance. We also added the option to specify date ranges. DB SQL supports a range of query parameters that can substitute values into a query at runtime. These query parameters can be defined at the dashboard level to ensure all related queries are updated accordingly.

DB SQL integrates with numerous third-party analytical and BI tools like Power BI, Tableau and Looker. Like we did for Fivetran, we can use Partner Connect to link our external platform with DB SQL. This allows analysts to build and serve dashboards in the platforms that the business prefers without sacrificing the performance of DB SQL and the Databricks Lakehouse.

CONCLUSION

As we move into this fast-paced, volatile world of modern finance, batch processing remains a vital part of the data stack, able to hold its own against the features and benefits of streaming and real-time services. We've seen how we can use the Databricks Data Intelligence Platform for Financial Services and its ecosystem of partners to architect a simple, scalable and extensible framework that supports complex batch-processing workloads, with a practical example in insurance claims processing. With Delta Live Tables (DLT) and Databricks SQL (DB SQL), we can build a data platform with an architecture that scales with growing data volumes, is easy to extend to address changing requirements and will withstand the test of time.

To learn more about the sample pipeline described, including the infrastructure setup and configuration used, please refer to this GitHub repository or watch this
demo video.

How to Set Up Your First Federated Lakehouse
by Mike Dobing

Lakehouse Federation in Databricks is a groundbreaking new capability that allows you to query data across external data sources — including Snowflake, Synapse, many others and even Databricks itself — without having to move or copy the data. This is done by using Databricks Unity Catalog, which provides a unified metadata layer for all of your data.

Lakehouse Federation is a game-changer for data teams, as it breaks down the silos that have traditionally kept data locked away in different systems. With Lakehouse Federation, you can finally access all of your data in one place, making it easier to get the insights you need to make better business decisions.

A few of the benefits of using Lakehouse Federation in Databricks are:

■ Improved data access and discovery: Lakehouse Federation makes it easy to find and access the data you need from your database estate. This is especially important for organizations with complex data landscapes.
■ Reduced data silos: Lakehouse Federation can help to break down data silos by providing a unified view of all data across the organization.
■ Improved data governance: Lakehouse Federation can help to improve data governance by providing a single place to manage permissions and access to data from within Databricks.
■ Reduced costs: Lakehouse Federation can help to reduce costs by eliminating the need to move or copy data between different data sources.

If you are looking for a way to improve the way you access and manage your data across your analytics estate, then Lakehouse Federation in Databricks is a top choice.

As always, though, not one solution is a silver bullet for your data integration and querying needs. See below for when Federation is a good fit, and for when you’d prefer to bring your data into your solution and process as part of your lakehouse platform pipelines.

REALITY CHECK

While Lakehouse Federation is a powerful tool, it is not a good fit for all use cases. There are some specific examples of use cases when Lakehouse Federation is not a good choice:

■ Real-time data processing: Lakehouse Federation queries can be slower than queries on data that is stored locally in the lake. Therefore, Lakehouse Federation is not a good choice for applications that require real-time data processing.
■ Complex data transformations: Where you need complex data transformations and processing, or need to ingest and transform vast amounts of data. For probably the large majority of use cases, you will need to apply some kind of ETL/ELT process against your data to make it fit for consumption by end users. In these scenarios, it is still best to apply a medallion-style approach and bring the data in, process it, clean it, then model and serve it so it is performant and fit for consumption by end users.

Therefore, while Lakehouse Federation is a great option for certain use cases as highlighted above, it’s not a silver bullet for all scenarios. Consider it an augmentation of your analytics capability that allows for additional use cases that need agility and direct source access for creating a holistic view of your data estate, all controlled through one governance layer.

SETTING UP YOUR FIRST FEDERATED LAKEHOUSE

With that in mind, let’s get started on setting up your first federated lakehouse in Databricks using Lakehouse Federation.

For this example, we will be using a familiar sample database — AdventureWorks — running on an Azure SQL Database. We will be walking you through how to set up your connection to Azure SQL and how to add it as a foreign catalog inside Databricks.

PREREQUISITES

To set up Lakehouse Federation in Databricks, you will need the following prerequisites:

■ A Unity Catalog-enabled Databricks workspace with Databricks Runtime 13.1 or above and shared or sin...
■ A Databricks Unity Catalog metastore
■ Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems, including any firewall connectivity requirements, such as here
■ The necessary permissions to create connections and foreign catalogs in Databricks Unity Catalog
■ The SQL Warehouse must be Pro or Serverless
■ For this demo — an example database such as AdventureWorks to use as our data source, along with the necessary credentials. Please see this example.

SETUP

Setting up federation is essentially a three-step process, as follows:

■ Set up a connection
■ Set up a foreign catalog
■ Query your data sources

SETTING UP A CONNECTION

We are going to use Azure SQL Database as the test data source, with the sample AdventureWorksLT database already installed and ready to query:

Example query on the source database

We want to add this database as a foreign catalog in Databricks to be able to query it alongside other data sources. To connect to the database, we need a username, password and hostname, obtained from our Azure SQL instance.

With these details ready, we can now go into Databricks and add the connection there as our first step.

First, expand the Catalog view, go to Connections and click “Create Connection”:

To add your new connection, give it a name, choose your connection type and then add the relevant login details for that data source:

CREATE A FOREIGN CATALOG

Test your connection and verify all is well. From there, go back to the Catalog view and go to Create Catalog:

From there, populate the relevant details (choosing Type as “Foreign”), including choosing the connection you created in the first step, and specifying the database you want to add as an external catalog:
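
The same two steps can also be scripted in SQL rather than clicked through in the UI. A rough sketch, in which the host, credentials and object names are placeholders for this demo (in practice, credentials would come from a secret scope):

-- Create the connection to the Azure SQL instance (placeholder host and credentials)
CREATE CONNECTION adventureworks_connection TYPE sqlserver
OPTIONS (
  host 'your-server.database.windows.net',
  port '1433',
  user 'sql_user',
  password 'sql_password'
);

-- Expose the AdventureWorksLT database as a foreign catalog
CREATE FOREIGN CATALOG adventureworks USING CONNECTION adventureworks_connection
OPTIONS (database 'AdventureWorksLT');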

Once added, you have the option of adding the relevant user permissions to the objects here, all governed by Unity Catalog (we skipped this in this article as there are no other users using this database):

Our external catalog is now available for querying as you would any other catalog inside Databricks, bringing our broader data estate into our lakehouse:

QUERYING THE FEDERATED DATA

We can now access our federated Azure SQL Database as normal, straight from our Databricks SQL Warehouse:

And query it as we would any other object:

Or even join it to a local Delta table inside our Unity Catalog:
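As an illustration of both patterns, the queries below assume the foreign catalog created above is named adventureworks and join it to a hypothetical local Delta table of orders:

-- Query the federated AdventureWorksLT table directly
SELECT * FROM adventureworks.saleslt.customer LIMIT 10;

-- Join the federated table to a local Delta table in Unity Catalog (local table is hypothetical)
SELECT c.CustomerID, c.CompanyName, o.order_total
FROM adventureworks.saleslt.customer AS c
JOIN main.sales.local_orders AS o
  ON c.CustomerID = o.customer_id;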


CONCLUSION

What we’ve shown here is just scratching the surface of what Lakehouse
Federation can do with a simple connection and query. By leveraging this
offering, combined with the governance and capabilities of Unity Catalog, you
can extend the range of your lakehouse estate, ensuring consistent permissions
and controls across all of your data sources and thus enabling a plethora of new
use cases and opportunities.

Orchestrating Data Analytics With Databricks Workflows
by Matthew Kuehn

For data-driven enterprises, data analysts play a crucial role in extracting insights from data and presenting it in a meaningful way. However, many analysts might not have the familiarity with data orchestration required to automate their workloads for production. While a handful of ad hoc queries can quickly turn around the right data for a last-minute report, data teams must ensure that various processing, transformation and validation tasks are executed reliably and in the right sequence. Without the proper orchestration in place, data teams lose the ability to monitor pipelines, troubleshoot failures and manage dependencies. As a result, sets of ad hoc queries that initially brought quick-hitting value to the business end up becoming long-term headaches for the analysts who built them.

Pipeline automation and orchestration becomes particularly crucial as the scale of data grows and the complexity of pipelines increases. Traditionally, these responsibilities have fallen on data engineers, but as data analysts begin to develop more assets in the lakehouse, orchestration and automation become a key piece of the puzzle.

For data analysts, the process of querying and visualizing data should be seamless, and that's where the power of modern tools like Databricks Workflows comes into play. In this chapter, we'll explore how data analysts can leverage Databricks Workflows to automate their data processes, enabling them to focus on what they do best — deriving value from data.

THE DATA ANALYST’S WORLD

Data analysts play a vital role in the final stages of the data lifecycle. Positioned at the "last mile," they rely on refined data from upstream pipelines. This could be a table prepared by a data engineer or the output predictions of machine learning models built by data scientists. This refined data, often referred to as the Silver layer in a medallion architecture, serves as the foundation for their work. Data analysts are responsible for aggregating, enriching and shaping this data to answer specific questions for their business, such as:

■ “How many orders were placed for each SKU last week?”
■ “What was monthly revenue for each store last fiscal year?”
■ “Who are our 10 most active users?”

These aggregations and enrichments build out the Gold layer of the medallion architecture. This Gold layer enables easy consumption and reporting for downstream users, typically in a visualization layer. This can take the form of dashboards within Databricks or be seamlessly generated using external tools like Tableau or Power BI via Partner Connect. Regardless of the tech stack, data analysts transform raw data into valuable insights, enabling informed decision-making through structured analysis and visualization techniques.
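
To make the first of those questions concrete, a Gold-layer aggregation for it might look roughly like the query below; the table and column names are hypothetical:

SELECT
  sku,
  COUNT(*) AS orders_placed
FROM
  gold.orders                                   -- hypothetical Gold-layer table
WHERE
  order_date >= date_sub(current_date(), 7)     -- orders placed in the last week
GROUP BY
  sku;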

THE DATA ANALYST’S TOOLKIT ON DATABRICKS

In Databricks, data analysts have a robust toolkit at their fingertips to transform data effectively on the lakehouse. Centered around the Databricks SQL Editor, analysts have a familiar environment for composing ANSI SQL queries, accessing data and exploring table schemas. These queries serve as building blocks for various SQL assets, including visualizations that offer in-line data insights. Dashboards consolidate multiple visualizations, creating a user-friendly interface for comprehensive reporting and data exploration for end users.

Additionally, alerts keep analysts informed about critical dataset changes in real time. Serverless SQL warehouses underpin all of these features and can scale to handle diverse data volumes and query demands. By default, this compute uses Photon, the high-performance Databricks-native vectorized query engine, and is optimized for high-concurrency SQL workloads. Finally, Unity Catalog allows users to easily govern structured and unstructured data, machine learning models, notebooks, dashboards and files in the lakehouse. This cohesive toolkit empowers data analysts to transform raw data into enriched insights seamlessly within the Databricks environment.

ORCHESTRATING THE DATA ANALYST’S TOOLKIT WITH WORKFLOWS

For those new to Databricks, Workflows orchestrates data processing, machine learning and analytics pipelines in the Data Intelligence Platform. Workflows is a fully managed orchestration service integrated with the Databricks Platform, with high reliability and advanced observability capabilities. This allows all users, regardless of persona or background, to easily orchestrate their workloads in production environments.

Authoring Your SQL Tasks

Building your first workflow as a data analyst is extremely simple. Workflows now seamlessly integrates the core tools used by data analysts — queries, alerts and dashboards — within its framework, enhancing its capabilities through the SQL task type. This allows data analysts to build and work with the tools they are already familiar with and then easily bring them into a Workflow as a Task via the UI.

As data analysts begin to chain more SQL tasks together, they can easily define dependencies between them and gain the ability to schedule and automate SQL-based tasks within Databricks Workflows. In the example workflow below, we see this in action:

Imagine that we have received upstream data from our data engineering team that allows us to begin our dashboard refresh process. We can define SQL-centric tasks like the ones below to automate our pipeline:

■ Create_State_Speed_Records: First, we define our refreshed data in our Gold layer with the Query task. This inserts data into a Gold table and then optimizes it for better performance.
■ Data_Available_Alert: Once this data is inserted, imagine we want to notify other data analysts who consume this table that new records have been added. We can do this by creating an alert which will trigger when we have new records added. This will send an alert to our stakeholder group. You can imagine using an alert in a similar fashion for data quality checks to warn users of stale data, null records or other similar situations. For more information on creating your first alert, check out this link.
■ Update_Dashboard_Dataset: It’s worth mentioning that tasks can be defined in parallel if needed. In our example, while our alert is triggering we can also begin refreshing our tailored dataset view that feeds our dashboard in a parallel query.
■ Dashboard_Refresh: Finally, we create a dashboard task type. Once our dataset is ready to go, this will update all previously defined visualizations with the newest data and notify all subscribers upon successful completion. Users can even pass specific parameters to the dashboard while defining the task, which can help generate a default view of the dashboard depending on the end user’s needs.

It is worth noting that this example workflow utilizes queries directly written in the Databricks SQL Editor. A similar pattern can be achieved with SQL code coming from a repository using the File task type. With this task type, users can execute .sql files stored in a Git repository as part of an automated workflow. Each time the pipeline is executed, the latest version from a specific branch will be retrieved and executed.

Although this example is basic, you can begin to see the possibilities of how a data analyst can define dependencies across SQL task types to build a comprehensive analytics pipeline.
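
For analysts who eventually want to script the same DAG instead of assembling it in the UI, a rough sketch using the Databricks SDK for Python is shown below; the query, alert, dashboard and warehouse IDs are placeholders, and the task keys simply mirror the example above:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Every "<...>" value is a placeholder for an existing query, alert, dashboard or SQL warehouse ID.
created = w.jobs.create(
    name="dashboard_refresh_pipeline",
    tasks=[
        jobs.Task(
            task_key="Create_State_Speed_Records",
            sql_task=jobs.SqlTask(
                query=jobs.SqlTaskQuery(query_id="<query-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
        jobs.Task(
            task_key="Data_Available_Alert",
            depends_on=[jobs.TaskDependency(task_key="Create_State_Speed_Records")],
            sql_task=jobs.SqlTask(
                alert=jobs.SqlTaskAlert(alert_id="<alert-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
        jobs.Task(
            task_key="Update_Dashboard_Dataset",
            depends_on=[jobs.TaskDependency(task_key="Create_State_Speed_Records")],
            sql_task=jobs.SqlTask(
                query=jobs.SqlTaskQuery(query_id="<dataset-query-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
        jobs.Task(
            task_key="Dashboard_Refresh",
            depends_on=[
                jobs.TaskDependency(task_key="Data_Available_Alert"),
                jobs.TaskDependency(task_key="Update_Dashboard_Dataset"),
            ],
            sql_task=jobs.SqlTask(
                dashboard=jobs.SqlTaskDashboard(dashboard_id="<dashboard-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
    ],
)
print(f"Created job {created.job_id}")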

MONITORING YOUR PRODUCTION PIPELINES

While authoring is comprehensive within Databricks Workflows, it is only one part of the picture. Equally important is the ability to easily monitor and debug your pipelines once they are built and in production.

Databricks Workflows allows users to monitor individual job runs, offering insights into task outcomes and overall execution times. This visibility helps analysts understand query performance, identify bottlenecks and address issues efficiently. By promptly recognizing tasks that require attention, analysts can ensure seamless data processing and quicker issue resolution.

When it comes to executing a pipeline at the right time, Databricks Workflows


allows users to schedule jobs for execution at specific intervals or trigger
them when certain files arrive. In the above image, we were first manually
triggering this pipeline to test and debug our tasks. Once we got this to
a steady state, we began triggering it every 12 hours to accommodate data refresh needs across time zones. This flexibility accommodates varying
data scenarios, ensuring timely pipeline execution. Whether it's routine
processing or responding to new data batches, analysts can tailor job execution
to match operational requirements.

Late-arriving data can bring a flurry of questions to a data analyst from end
users. Workflows enables analysts and consumers alike to stay informed on
data freshness by setting up notifications for job outcomes such as successful
execution, failure or even a long-running job. These notifications ensure timely
awareness of changes in data processing. By proactively evaluating a pipeline’s status, analysts can take timely measures based on real-time information.

As with all pipelines, failures will inevitably happen. Workflows helps manage this
by allowing analysts to configure job tasks for automatic retries. By automating
retries, analysts can focus on generating insights rather than troubleshooting
intermittent technical issues.

CONCLUSION

In the evolving landscape of data analysis tools, Databricks Workflows bridges the gap between data analysts and the complexities of data orchestration. By automating tasks, ensuring data quality and providing a user-friendly interface, Databricks Workflows empowers analysts to focus on what they excel at — extracting meaningful insights from data. As the concept of the lakehouse continues to unfold, Workflows stands as a pivotal component, promising a unified and efficient data ecosystem for all personas.

GET STARTED

■ Learn more about Databricks Workflows
■ Take a product tour of Databricks Workflows
■ Create your first workflow with this quickstart guide

Schema Management and Drift Scenarios via Databricks Auto Loader
by Garrett Peternel

Data lakes notoriously have had challenges with managing incremental data processing at scale without integrating open table storage format frameworks (e.g., Delta Lake, Apache Iceberg, Apache Hudi). In addition, schema management is difficult with schema-less data and schema-on-read methods. With the power of the Databricks Platform, Delta Lake and Apache Spark provide the essential technologies integrated with Databricks Auto Loader (AL) to consistently and reliably stream and process raw data formats incrementally while maintaining stellar performance and data governance.

AUTO LOADER FEATURES

AL is a boost over Spark Structured Streaming, supporting several additional benefits and solutions including:

■ Databricks Runtime-only Structured Streaming cloudFiles source
■ Schema drift, dynamic inference and evolution support
■ Ingests data via JSON, CSV, PARQUET, AVRO, ORC, TEXT and BINARYFILE input file formats
■ Integration with cloud file notification services (e.g., Amazon SQS/SNS)
■ Optimizes directory list mode scanning performance to discover new files in cloud storage (e.g., AWS, Azure, GCP, DBFS)

For further information please visit the official Databricks Auto Loader documentation.

SCHEMA CHANGE SCENARIOS

In this chapter I will showcase a few examples of how AL handles schema management and drift scenarios, using a public IoT sample dataset with schema modifications to demonstrate solutions. Schema 1 will contain an IoT sample dataset schema with all expected columns and expected data types. Schema 2 will contain unexpected changes to the IoT sample dataset schema with new columns and changed data types. The following variables and paths will be used for this demonstration, along with Databricks Widgets to set your username folder.

%scala
dbutils.widgets.text("dbfs_user_dir", "your_user_name") // widget for account email
val userName = dbutils.widgets.get("dbfs_user_dir")
val rawBasePath = s"dbfs:/user/$userName/raw/"
val repoBasePath = s"dbfs:/user/$userName/repo/"
val jsonSchema1Path = rawBasePath + "iot-schema-1.json"
val jsonSchema2Path = rawBasePath + "iot-schema-2.json"
val repoSchemaPath = repoBasePath + "iot-ddl.json"
dbutils.fs.rm(repoSchemaPath, true) // remove schema repo for demos

SCHEMA 1

%scala
spark.read.json(jsonSchema1Path).printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)

%scala
display(spark.read.json(jsonSchema1Path).limit(10))

SCHEMA 2

%scala
// NEW => device_serial_number_device_type, location
spark.read.json(jsonSchema2Path).printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number_device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: double (nullable = true)
 |-- ip: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- cca2: string (nullable = true)
 |    |-- cca3: string (nullable = true)
 |    |-- cn: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)

%scala
display(spark.read.json(jsonSchema2Path).limit(10))

EXAMPLE 1: SCHEMA TRACKING/MANAGEMENT

AL tracks schema versions, metadata and changes to input data over time via specifying a location directory path. These features are incredibly useful for tracking history of data lineage, and are tightly integrated with the Delta Lake transactional log DESCRIBE HISTORY and Time Travel.

%scala
val rawAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath) // schema history tracking
  .load(jsonSchema1Path)
)

%scala
rawAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: string (nullable = true)
 |-- c02_level: string (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: string (nullable = true)
 |-- humidity: string (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- _rescued_data: string (nullable = true)

%scala
display(rawAlDf.limit(10))

By default (for JSON, CSV and XML file formats) AL infers all column data types as strings, including nested fields.

Here is the directory structure where AL stores schema versions. These files can be read via the Spark DataFrame API.

Schema Repository

%scala
display(dbutils.fs.ls(repoSchemaPath + "/_schemas"))

Schema Metadata

%scala
display(spark.read.json(repoSchemaPath + "/_schemas"))

EXAMPLE 2: SCHEMA HINTS

AL provides hint logic using SQL DDL syntax to enforce and override dynamic schema inference on known single data types, as well as semi-structured complex data types.

%scala
val hintAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.schemaHints", "coordinates STRUCT<latitude:DOUBLE, longitude:DOUBLE>, humidity LONG, temp DOUBLE") // schema ddl hints
  .load(jsonSchema1Path)
)

%scala
hintAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: string (nullable = true)
 |-- c02_level: string (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: string (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- _rescued_data: string (nullable = true)

The schema hints specified in the AL options perform the data type mappings on the respective columns. Hints are useful for applying schema enforcement on portions of the schema where data types are known, while working in tandem with the dynamic schema inference covered in Example 3.

%scala
display(hintAlDf.limit(10))

EXAMPLE 3: DYNAMIC SCHEMA INFERENCE

AL dynamically searches a sample of the dataset to determine nested structure. This avoids costly and slow full dataset scans to infer schema. The following configurations are available to adjust the amount of sample data used on read to discover the initial schema:

1. spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes (default 50 GB)
2. spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles (default 1000 files)

%scala
val inferAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.inferColumnTypes", true) // schema inference
  .load(jsonSchema1Path)
)

%scala
inferAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- _rescued_data: string (nullable = true)

AL saves the initial schema to the schema location path provided. This schema serves as the base version for the stream during incremental processing. Dynamic schema inference is an automated approach to applying schema changes over time.

%scala
display(inferAlDf.limit(10))
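
The two sampling configurations listed above can be tuned before the stream is started. A small sketch with illustrative values:

%scala
// Reduce how much data AL samples when inferring the initial schema.
// The values below are illustrative; the defaults are 50 GB and 1000 files as noted above.
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes", "10gb")
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles", "500")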

EXAMPLE 4: STATIC USER-DEFINED SCHEMA

AL also supports static custom schemas just like Spark Structured Streaming. This eliminates the need for dynamic schema-on-read inference scans, which trigger additional Spark jobs and schema versions. The schema can be retrieved as a DDL string or a JSON payload.

DDL

%scala
inferAlDf.schema.toDDL

String = alarm_status STRING,battery_level BIGINT,c02_level BIGINT,cca2 STRING,cca3 STRING,cn STRING,
coordinates STRUCT<latitude: DOUBLE, longitude: DOUBLE>,date STRING,device_id BIGINT,
device_serial_number STRING,device_type STRING,epoch_time_miliseconds BIGINT,humidity BIGINT,
ip STRING,scale STRING,temp DOUBLE,timestamp STRING,_rescued_data STRING

JSON

%scala
spark.read.json(repoSchemaPath + "/_schemas").select("dataSchemaJson").where("dataSchemaJson is not null").first()

org.apache.spark.sql.Row = [{"type":"struct","fields":[
{"name":"alarm_status","type":"string","nullable":true,"metadata":{}},
{"name":"battery_level","type":"long","nullable":true,"metadata":{}},
{"name":"c02_level","type":"long","nullable":true,"metadata":{}},
{"name":"cca2","type":"string","nullable":true,"metadata":{}},
{"name":"cca3","type":"string","nullable":true,"metadata":{}},
{"name":"cn","type":"string","nullable":true,"metadata":{}},
{"name":"coordinates","type":{"type":"struct","fields":[{"name":"latitude","type":"double","nullable":true,"metadata":{}},{"name":"longitude","type":"double","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},
{"name":"date","type":"string","nullable":true,"metadata":{}},
{"name":"device_id","type":"long","nullable":true,"metadata":{}},
{"name":"device_serial_number","type":"string","nullable":true,"metadata":{}},
{"name":"device_type","type":"string","nullable":true,"metadata":{}},
{"name":"epoch_time_miliseconds","type":"long","nullable":true,"metadata":{}},
{"name":"humidity","type":"long","nullable":true,"metadata":{}},
{"name":"ip","type":"string","nullable":true,"metadata":{}},
{"name":"scale","type":"string","nullable":true,"metadata":{}},
{"name":"temp","type":"double","nullable":true,"metadata":{}},
{"name":"timestamp","type":"string","nullable":true,"metadata":{}}]}]

Here’s an example of how to generate a user-defined StructType (Scala) | StructType (Python) via DDL DataFrame command or JSON queried from the AL schema repository.

%scala
import org.apache.spark.sql.types.{DataType, StructType}

val ddl = """alarm_status STRING, battery_level BIGINT, c02_level BIGINT, cca2 STRING, cca3 STRING,
  cn STRING, coordinates STRUCT<latitude: DOUBLE, longitude: DOUBLE>, date STRING, device_id BIGINT,
  device_serial_number STRING, device_type STRING, epoch_time_miliseconds BIGINT, humidity BIGINT,
  ip STRING, scale STRING, temp DOUBLE, timestamp STRING, _rescued_data STRING"""

val ddlSchema = StructType.fromDDL(ddl)

val json = """{"type":"struct","fields":[
  {"name":"alarm_status","type":"string","nullable":true,"metadata":{}},
  {"name":"battery_level","type":"long","nullable":true,"metadata":{}},
  {"name":"c02_level","type":"long","nullable":true,"metadata":{}},
  {"name":"cca2","type":"string","nullable":true,"metadata":{}},
  {"name":"cca3","type":"string","nullable":true,"metadata":{}},
  {"name":"cn","type":"string","nullable":true,"metadata":{}},
  {"name":"coordinates","type":{"type":"struct","fields":[{"name":"latitude","type":"double","nullable":true,"metadata":{}},{"name":"longitude","type":"double","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},
  {"name":"date","type":"string","nullable":true,"metadata":{}},
  {"name":"device_id","type":"long","nullable":true,"metadata":{}},
  {"name":"device_serial_number","type":"string","nullable":true,"metadata":{}},
  {"name":"device_type","type":"string","nullable":true,"metadata":{}},
  {"name":"epoch_time_miliseconds","type":"long","nullable":true,"metadata":{}},
  {"name":"humidity","type":"long","nullable":true,"metadata":{}},
  {"name":"ip","type":"string","nullable":true,"metadata":{}},
  {"name":"scale","type":"string","nullable":true,"metadata":{}},
  {"name":"temp","type":"double","nullable":true,"metadata":{}},
  {"name":"timestamp","type":"string","nullable":true,"metadata":{}}]}"""

val jsonSchema = DataType.fromJson(json).asInstanceOf[StructType]

%scala
val schemaAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .schema(jsonSchema) // schema structtype definition
  .load(jsonSchema1Path)
)

%scala
schemaAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)

Passing in the schema definition will enforce the stream. AL also provides a schema enforcement option achieving basically the same results as providing a static StructType schema-on-read. This method will be covered in Example 7.

%scala
display(schemaAlDf.limit(10))

EXAMPLE 5: SCHEMA DRIFT

AL stores new columns and data types via the rescue column. This column captures schema changes-on-read. The stream does not fail when schema and data type mismatches are discovered. This is a very impressive feature!

%scala
val driftAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.inferColumnTypes", true)
  .option("cloudFiles.schemaEvolutionMode", "rescue") // schema drift tracking
  .load(rawBasePath + "/*.json")
)

%scala
driftAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- _rescued_data: string (nullable = true)

The rescue column preserves schema drift such as newly appended columns and/or different data types via a JSON string payload. This payload can be parsed via Spark DataFrame or Dataset APIs to analyze schema drift scenarios. The source file path for each individual row is also available in the rescue column to investigate the root cause.

%scala
display(driftAlDf.where("_rescued_data is not null").limit(10))
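
As a sketch of that kind of investigation, the rescued payload can be unpacked with ordinary DataFrame functions. The JSON keys referenced below (the drifted location field and the _file_path entry) are assumptions based on the drifted sample schema:

%scala
// Pull individual fields out of the rescued JSON payload for inspection.
import org.apache.spark.sql.functions.{col, get_json_object}

val rescuedDetailDf = driftAlDf
  .where("_rescued_data is not null")
  .select(
    get_json_object(col("_rescued_data"), "$.location").alias("rescued_location"),   // assumed drifted field
    get_json_object(col("_rescued_data"), "$._file_path").alias("source_file")        // assumed key for the source file path
  )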

EXAMPLE 6: SCHEMA EVOLUTION

AL merges schemas as new columns arrive via schema evolution mode. New schema JSON will be updated and stored as a new version in the specified schema repository location.

%scala
val evolveAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.inferColumnTypes", true)
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") // schema evolution
  .load(rawBasePath + "/*.json")
)

%scala
evolveAlDf.printSchema // original schema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- _rescued_data: string (nullable = true)

%scala
display(evolveAlDf.limit(10)) // stream will fail

AL purposely fails the stream with an UnknownFieldException error when it detects a schema change via dynamic schema inference. The updated schema instance is created as a new version and metadata file in the schema repository location, and will be used against the input data after restarting the stream.

%scala
val evolveAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.inferColumnTypes", true)
  .option("cloudFiles.schemaHints", "humidity DOUBLE")
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns") // schema evolution
  .load(rawBasePath + "/*.json")
)

%scala
evolveAlDf.printSchema // evolved schema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: double (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- device_serial_number_device_type: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- cca2: string (nullable = true)
 |    |-- cca3: string (nullable = true)
 |    |-- cn: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- _rescued_data: string (nullable = true)

AL has evolved the schema to merge the newly acquired data fields.

%scala
display(evolveAlDf.where("device_serial_number_device_type is not null").limit(10))

The newly merged schema transformed by AL is stored in the original schema repository path as version 1, along with the base version 0 schema. This history is valuable for tracking changes to schema over time, as well as quickly retrieving DDL on the fly for schema enforcement.

Schema Repository

%scala
display(dbutils.fs.ls(repoSchemaPath + "/_schemas"))

Schema Metadata

%scala
display(spark.read.json(repoSchemaPath + "/_schemas"))

Schema evolution can be a messy problem if frequent. With AL and Delta Lake it becomes easier and simpler to manage. Adding new columns is relatively straightforward, as AL combined with Delta Lake uses schema evolution to append them to the existing schema. Note: the values for these columns will be NULL for data already processed. The greater challenge occurs when the data types change, because there will be a type mismatch against the data already processed. Currently, the "safest" approach is to perform a complete overwrite of the target Delta table to refresh all data with the changed data type(s). Depending on the data volume, this operation is also relatively straightforward if infrequent. However, if data types are changing daily/weekly then this operation is going to be very costly to reprocess large data volumes. This can be an indication that the business needs to improve their data strategy.

Constantly changing schemas can be a sign of a weak data governance strategy and lack of communication with the data business owners. Ideally, organizations should have some kind of SLA for data acquisition and know the expected schema. Raw data stored in the landing zone should also follow some kind of pre-ETL strategy (e.g., ontology, taxonomy, partitioning) for better incremental loading performance into the lakehouse. Skipping these steps can cause a plethora of data management issues that will negatively impact downstream consumers building data analytics, BI, and AI/ML pipelines and applications. If upstream schema and formatting issues are never addressed, downstream pipelines will consistently break and result in increased cloud storage and compute costs. Garbage in, garbage out.
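
For reference, the full-overwrite approach described above might look roughly like the following; the target table name is a placeholder, and overwriteSchema tells Delta Lake to accept the changed column types:

%scala
// Rebuild the target table so existing rows pick up the changed data types.
val refreshedDf = spark.read.json(rawBasePath + "/*.json")

refreshedDf.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable("iot_bronze") // placeholder table name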

EXAMPLE 7: SCHEMA ENFORCEMENT

AL validates data against the linked schema version stored in the repository location via schema enforcement mode. Schema enforcement is a schema-on-write operation, and only ingested data matching the target Delta Lake schema will be written to output. Any future input schema changes will be ignored, and AL streams will continue working without failure. Schema enforcement is a very powerful feature of AL and Delta Lake. It ensures only clean and trusted data will be inserted into downstream Silver/Gold datasets used for data analytics, BI, and AI/ML pipelines and applications.

%scala
val enforceAlDf = (spark
  .readStream.format("cloudfiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", repoSchemaPath)
  .option("cloudFiles.schemaEvolutionMode", "none") // schema enforcement
  .schema(jsonSchema)
  .load(rawBasePath + "/*.json")
)

%scala
enforceAlDf.printSchema

root
 |-- alarm_status: string (nullable = true)
 |-- battery_level: long (nullable = true)
 |-- c02_level: long (nullable = true)
 |-- cca2: string (nullable = true)
 |-- cca3: string (nullable = true)
 |-- cn: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |-- date: string (nullable = true)
 |-- device_id: long (nullable = true)
 |-- device_serial_number: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- epoch_time_miliseconds: long (nullable = true)
 |-- humidity: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- scale: string (nullable = true)
 |-- temp: double (nullable = true)
 |-- timestamp: string (nullable = true)

Please note the rescue column is no longer available in this example because schema enforcement has been enabled. However, a rescue column can still be configured separately as an AL option if desired. In addition, schema enforcement mode uses the latest schema version in the repository to enforce incoming data. For older versions, set a user-defined schema as explained in Example 4.

%scala
display(enforceAlDf.limit(10))

CONCLUSION

At the end of the day, data issues are inevitable. However, the key is to limit data pollution as much as possible and have methods to detect discrepancies, changes
and history via schema management. Databricks Auto Loader provides many solutions for schema management, as illustrated by the examples in this chapter. Having a
solidified data governance and landing zone strategy will make ingestion and streaming easier and more efficient for loading data into the lakehouse. Whether it is simply
converting raw JSON data incrementally to the Bronze layer as Delta Lake format, or having a repository to store schema metadata, AL makes your job easier. It acts as an
anchor to building a resilient lakehouse architecture that provides reusable, consistent, reliable and performant data throughout the data and AI lifecycle.

HTML notebooks (Spark Scala and Spark Python) with code and both sample datasets can be found at the GitHub repo here.

From Idea to Code: Building With the Databricks SDK for Python
by Kimberly Mahoney

The focus of this chapter is to demystify the Databricks SDK for Python — the authentication process and the components of the SDK — by walking through the start-to-end development process. I'll also be showing how to utilize IntelliSense and the debugger for real-time suggestions in order to reduce the amount of context-switching from the IDE to documentation and code examples.

What is the Databricks SDK for Python . . . and why you should use it

The Databricks Python SDK lets you interact with the Databricks Platform programmatically using Python. It covers the entire Databricks API surface and Databricks REST operations. While you can interact directly with the API via curl or a library like 'requests', there are benefits to utilizing the SDKs such as:

■ Secure and simplified authentication via Databricks client-unified authentication
■ Built-in debug logging with sensitive information automatically redacted
■ Support to wait for long-running operations to finish (kicking off a job, starting a cluster)
■ Standard iterators for paginated APIs (we have multiple pagination types in our APIs!)
■ Retrying on transient errors

There are numerous practical applications, such as building multi-tenant web applications that interact with your ML models or a robust UC migration toolkit like Databricks Labs project UCX. Don’t forget the silent workhorses — those simple utility scripts that are more limited in scope but automate an annoying task such as bulk updating cluster policies, dynamically adding users to groups or simply writing data files to UC Volumes. Implementing these types of scripts is a great way to familiarize yourself with the Python SDK and Databricks APIs.

SCENARIO

Imagine my business is establishing best practices for development and CI/CD on Databricks. We're adopting DABs to help us define and deploy workflows in our development and production workspaces, but in the meantime, we need to audit and clean up our current environments. We have a lot of jobs people created in our dev workspace via the UI. One of the platform admins observed many of these jobs are inadvertently configured to run on a recurring schedule, racking up unintended costs. As part of the cleanup process, we want to identify any scheduled jobs in our development workspace with an option to pause them. We’ll need to figure out:

■ How to install the SDK
■ How to connect to the Databricks workspace
■ How to list all the jobs and examine their attributes
■ How to log the problematic jobs — or a step further, how to call the API to pause their schedule

DEVELOPMENT ENVIRONMENT

Before diving into the code, you need to set up your development environment. I highly recommend using an IDE that has a comprehensive code completion feature as well as a debugger. Code completion features, such as IntelliSense in VS Code, are really helpful when learning new libraries or APIs — they provide useful contextual information, autocompletion, and aid in code navigation. For this chapter, I’ll be using Visual Studio Code so I can also make use of the Databricks Extension as well as Pylance. You’ll also need to install the databricks-sdk (docs). In this chapter, I’m using Poetry + Pyenv. The setup is similar for other tools — just 'poetry add databricks-sdk' or alternatively 'pip install databricks-sdk' in your environment.

AUTHENTICATION

The next step is to authorize access to Databricks so we can work with our workspace. There are several ways to do this, but because I’m using the VS Code Extension, I’ll take advantage of its authentication integration. It’s one of the tools that uses unified client authentication — that just means all these development tools follow the same process and standards for authentication, and if you set up auth for one, you can reuse it among the other tools. I set up both the CLI and VS Code Extension previously, but here is a primer on setting up the CLI and installing the extension. Once you’ve connected successfully, you’ll see a notification banner in the lower right-hand corner and see two hidden files generated in the .databricks folder — project.json and databricks.env (don’t worry, the extension also handles adding these to .gitignore).

For this example, while we’re interactively developing in our IDE, we’ll be using what’s called U2M (user-to-machine) OAuth. We won’t get into the technical details, but OAuth is a secure protocol that handles authorization to resources without passing sensitive user credentials such as PAT or username/password that persist much longer than the one-hour short-lived OAuth token.

OAuth flow for the Databricks Python SDK

WORKSPACECLIENT VS. ACCOUNTCLIENT

The Databricks API is split into two primary categories — account and workspace. They let you manage different parts of Databricks, like user access at the account level or cluster policies in a workspace. The SDK reflects this with two clients that act as our entry points to the SDK — the WorkspaceClient and AccountClient. For our example we'll be working at the workspace level, so I'll be initializing the WorkspaceClient. If you're unsure which client to use, check out the SDK documentation.

Tip: Because we ran the previous steps to authorize access via unified client auth, the SDK will automatically use the necessary Databricks environment variables, so there's no need for extra configurations when setting up your client. All we need are these two lines of code:

■ Initializing our WorkspaceClient

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
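For comparison, account-level automation uses the AccountClient instead. A minimal sketch with placeholder values (the host and account ID below are assumptions you would replace with your own, or supply via environment variables):

from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",  # account console endpoint (placeholder)
    account_id="<your-account-id>",                # placeholder
)

# For example, list the workspaces in the account
for workspace in a.workspaces.list():
    print(workspace.workspace_name)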

MAKING API CALLS AND INTERACTING WITH DATA

The WorkspaceClient we instantiated will allow us to interact with different APIs across the Databricks workspace services. A service is a smaller component of the Databricks Platform — e.g., Jobs, Compute, Model Registry. In our example, we'll need to call the Jobs API in order to retrieve a list of all the jobs in the workspace.

Services accessible via the Python SDK



This is where IntelliSense really comes in handy. Instead of context switching between the IDE and the Documentation page, I can use autocomplete to provide a list of methods as well as examine the method description, the parameters and return types from within the IDE. I know the first step is getting a list of all the jobs in the workspace:
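The ebook shows this step as an IDE screenshot; in code, the call itself is just a couple of lines (using the `w` client initialized above):

# jobs.list() returns an iterator that the SDK pages through for us
all_jobs = w.jobs.list()
for job in all_jobs:
    print(job.job_id)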

As you can see, it returns an iterator over an object called BaseJob. Before we talk about what a BaseJob actually is, it'll be helpful to understand how data is used in the SDK. To interact with data you are sending to and receiving from the API, the Python SDK takes advantage of Python data classes and enums. The main advantage of this approach over passing around dictionaries is improved readability while also minimizing errors through enforced type checks and validations.

You can construct objects with data classes and interact with enums. For example:

■ Creating an Employee via the Employee data class and company departments, using enums for possible department values

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CompanyDepartment(Enum):
    MARKETING = 'MARKETING'
    SALES = 'SALES'
    ENGINEERING = 'ENGINEERING'

@dataclass
class Employee:
    name: str
    email: str
    department: Optional[CompanyDepartment] = None

emp = Employee('Bob', 'bob@example.com', CompanyDepartment.ENGINEERING)

In the Python SDK all of the data classes, enums and APIs belong to the same module for a service located under databricks.sdk.service — e.g., databricks.sdk.service.jobs, databricks.sdk.service.billing, databricks.sdk.service.sql.

For our example, we'll need to loop through all of the jobs and make a decision on whether or not they should be paused. I'll be using a debugger in order to look at a few example jobs and get a better understanding of what a 'BaseJob' looks like. The Databricks VS Code extension comes with a debugger that can be used to troubleshoot code issues interactively on Databricks via Databricks Connect. But, because I do not need to run my code on a cluster, I'll just be using the standard Python debugger. I'll set a breakpoint inside my for loop and use the VS Code Debugger to examine a few examples. A breakpoint allows us to stop code execution and interact with variables during our debugging session. This is preferable over print statements, as you can use the debugging console to interact with the data as well as progress the loop. In this example I'm looking at the settings field and drilling down further in the debugging console to take a look at what an example job schedule looks like:

Inspecting BaseJob in the VS Code debugger

We can see a BaseJob has a few top-level attributes and has a more complex Settings type that contains most of the information we care about. At this point, we have our WorkspaceClient and are iterating over the jobs in our workspace. To flag problematic jobs and potentially take some action, we'll need to better understand job.settings.schedule. We need to figure out how to programmatically identify if a job has a schedule and flag if it's not paused. For this we'll be using another handy utility for code navigation — Go to Definition. I've opted to Open Definition to the Side (⌘K F12) in order to reduce switching to a new window. This will allow us to quickly navigate through the data class definitions without having to switch to a new window or exit our IDE:

As we can see, a BaseJob contains some top-level fields that are common among jobs such as 'job_id' or 'created_time'. A job can also have various settings (JobSettings). These configurations often differ between jobs and encompass aspects like notification settings, tasks, tags and the schedule. We'll be focusing on the schedule field, which is represented by the CronSchedule data class. CronSchedule contains information about the pause status (PauseStatus) of a job. PauseStatus in the SDK is represented as an enum with two possible values — PAUSED and UNPAUSED.
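If you prefer to poke at this outside the debugger, the same drill-down can be sketched in a few lines (this assumes the WorkspaceClient `w` created earlier):

from databricks.sdk.service.jobs import PauseStatus

for job in w.jobs.list():
    schedule = job.settings.schedule
    if schedule:  # only scheduled jobs carry a CronSchedule
        print(job.settings.name, schedule.quartz_cron_expression, schedule.timezone_id)
        print("paused:", schedule.pause_status is PauseStatus.PAUSED)
        break  # stop after the first scheduled job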



Tip: VSCode + Pylance provides code suggestions, and you can enable auto imports in your User Settings or on a per-project basis in Workspace Settings. By default, only top-level symbols are suggested for auto import and suggested code (see original GitHub issue). However, the SDK has nested elements we want to generate suggestions for. We actually need to go down 5 levels — databricks.sdk.service.jobs.<Enum|Dataclass>. In order to take full advantage of these features for the SDK, I added a couple of workspace settings:

■ Selection of the VSCode Workspace settings.json

...
"python.analysis.autoImportCompletions": true,
"python.analysis.indexing": true,
"python.analysis.packageIndexDepths": [
    {
        "name": "databricks",
        "depth": 5,
        "includeAllSymbols": true
    }
]
...

Putting it all together: I broke out the policy logic into its own function for unit testing, added some logging and expanded the example to check for any jobs tagged as an exception to our policy. Now we have:

■ Logging out of policy jobs

import logging

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import CronSchedule, JobSettings, PauseStatus

# Initialize WorkspaceClient
w = WorkspaceClient()


def update_new_settings(job_id, quartz_cron_expression, timezone_id):
    """Update out of policy job schedules to be paused"""
    new_schedule = CronSchedule(
        quartz_cron_expression=quartz_cron_expression,
        timezone_id=timezone_id,
        pause_status=PauseStatus.PAUSED,
    )
    new_settings = JobSettings(schedule=new_schedule)

    logging.info(f"Job id: {job_id}, new_settings: {new_settings}")
    w.jobs.update(job_id, new_settings=new_settings)


def out_of_policy(job_settings: JobSettings):
    """Check if a job is out of policy.
    A job is out of policy if it is unpaused, has a schedule and is not tagged as keep_alive.
    Return True if out of policy, False if in policy.
    """
    tagged = bool(job_settings.tags)
    proper_tags = tagged and "keep_alive" in job_settings.tags
    paused = job_settings.schedule.pause_status is PauseStatus.PAUSED
    return not paused and not proper_tags


all_jobs = w.jobs.list()
for job in all_jobs:
    job_id = job.job_id
    if job.settings.schedule and out_of_policy(job.settings):
        schedule = job.settings.schedule
        logging.info(
            f"Job name: {job.settings.name}, Job id: {job_id}, creator: {job.creator_user_name}, schedule: {schedule}"
        )
        ...
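Because out_of_policy is a pure function over JobSettings, it is straightforward to unit test. A small pytest-style sketch (the module name in the import is hypothetical):

from databricks.sdk.service.jobs import CronSchedule, JobSettings, PauseStatus

from pause_jobs import out_of_policy  # hypothetical module containing the script above


def _scheduled_settings(pause_status, tags):
    # Helper to build JobSettings with a schedule and the given tags
    return JobSettings(
        schedule=CronSchedule(
            quartz_cron_expression="0 0 0 * * ?",
            timezone_id="UTC",
            pause_status=pause_status,
        ),
        tags=tags,
    )


def test_unpaused_untagged_job_is_out_of_policy():
    settings = _scheduled_settings(PauseStatus.UNPAUSED, tags={})
    assert out_of_policy(settings) is True


def test_keep_alive_tagged_job_is_in_policy():
    settings = _scheduled_settings(PauseStatus.UNPAUSED, tags={"keep_alive": "true"})
    assert out_of_policy(settings) is False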

Now we have not only a working example but also a great foundation for building out a generalized job monitoring tool. We're successfully connecting to our workspace, listing all the jobs and analyzing their settings, and, when we're ready, we can simply call our `update_new_settings` function to apply the new paused schedule. It's fairly straightforward to expand this to meet other requirements you may want to set for a workspace — for example, swap job owners to service principals, add tags, edit notifications or audit job permissions. See the example in the GitHub repository.
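When we do want to enforce the policy rather than just log violations, the loop can call update_new_settings directly. A sketch reusing the functions defined above:

for job in w.jobs.list():
    if job.settings.schedule and out_of_policy(job.settings):
        schedule = job.settings.schedule
        # Re-apply the same cron expression and timezone, but with the schedule paused
        update_new_settings(job.job_id, schedule.quartz_cron_expression, schedule.timezone_id)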

SCHEDULING A JOB ON DATABRICKS

You can run your script anywhere, but you may want to schedule scripts that use
the SDK to run as a Databricks Workflow or job on a small single-node cluster.
When running a Python notebook interactively or via automated workflow, you
can take advantage of default Databricks Notebook authentication. If you're
working with the Databricks WorkspaceClient and your cluster meets the
requirements listed in the docs, you can initialize your WorkspaceClient without
needing to specify any other configuration options or environment variables — it
works automatically out of the box.

CONCLUSION

In conclusion, the Databricks SDKs offer limitless potential for a variety of applications. We saw how the Databricks SDK for Python can be used to automate a simple yet crucial maintenance task and also saw an example of an OSS project that uses the Python SDK to integrate with the Databricks Platform. Regardless of the application you want to build, the SDKs streamline development for the Databricks Platform and allow you to focus on your particular use case. The key to quickly mastering a new SDK such as the Databricks Python SDK is setting up a proper development environment. Developing in an IDE allows you to take advantage of features such as a debugger, parameter info and code completion, so you can quickly navigate and familiarize yourself with the codebase. Visual Studio Code is a great choice for this as it provides the above capabilities and you can utilize the VSCode extension for Databricks to benefit from unified authentication.

Any feedback is greatly appreciated and welcome. Please raise any issues in the Python SDK GitHub repository. Happy developing!

ADDITIONAL RESOURCES

■ Databricks SDK for Python Documentation
■ DAIS Presentation: Unlocking the Power of Databricks SDKs
■ How to install Python libraries in your local development environment: How to Create and Use Virtual Environments in Python With Poetry
■ Installing the Databricks Extension for Visual Studio Code

03 Ready-to-Use Notebooks
and Datasets

This section includes several Solution Accelerators — free, ready-to-use examples of data solutions from different industries ranging from retail to manufacturing and healthcare. Each of the following scenarios includes notebooks with code and step-by-step instructions to help you get started. Get hands-on experience with the Databricks Data Intelligence Platform by trying the following for yourself:

■ Digital Twins: Leverage digital twins — virtual representations of devices and objects — to optimize operations and gain insights. Explore the Solution

■ Overall Equipment Effectiveness: Ingest equipment sensor data for metric generation and data-driven decision-making. Explore the Solution

■ Recommendation Engines for Personalization: Improve customers' user experience and conversion with personalized recommendations. Explore the Solution

■ Real-Time Point-of-Sale Analytics: Calculate current inventories for various products across multiple store locations with Delta Live Tables. Explore the Solution

■ Understanding Price Transparency Data: Efficiently ingest large healthcare datasets to create price transparency for better understanding of healthcare costs. Explore the Solution

Additional Solution Accelerators with ready-to-use notebooks can be found here: Databricks Solution Accelerators



04 Case Studies

Cox Automotive — changing the way the world buys, sells and
uses vehicles

“We use Databricks Workflows as our default orchestration tool to perform ETL and enable
automation for about 300 jobs, of which approximately 120 are scheduled to run regularly.”
— Robert Hamlet, Lead Data Engineer, Enterprise Data Services, Cox Automotive

INDUSTRY
Automotive

SOLUTION
Data-Driven ESG, Customer Entity Resolution, Demand Forecasting, Product Matching

PLATFORM
Workflows, Unity Catalog, Delta Sharing, ETL

CLOUD
Azure

Cox Automotive Europe is part of Cox Automotive, the world's largest automotive service organization, and is on a mission to transform the way the world buys, sells, owns and uses vehicles. They work in partnership with automotive manufacturers, fleets and retailers to improve performance and profitability throughout the vehicle lifecycle. Their businesses are organized around their customers' core needs across vehicle solutions, remarketing, funding, retail and mobility. Their brands in Europe include Manheim, Dealer Auction, NextGear Capital, Modix and Codeweavers.

Cox's enterprise data services team recently built a platform to consolidate the company's data and enable their data scientists to create new data-driven products and services more quickly and easily. To enable their small engineering team to unify data and analytics on one platform while enabling orchestration and governance, the enterprise data services team turned to the Databricks Data Intelligence Platform, Workflows, Unity Catalog and Delta Sharing.

EASY ORCHESTRATION AND OBSERVABILITY IMPROVE ABILITY TO DELIVER VALUE

Cox Automotive's enterprise data services team maintains a data platform that primarily serves internal customers spanning across business units, though they also maintain a few data feeds to third parties. The enterprise data services team collects data from multiple internal sources and business units. "We use Databricks Workflows as our default orchestration tool to perform ETL and enable automation for about 300 jobs, of which approximately 120 are scheduled to run regularly," says Robert Hamlet, Lead Data Engineer, Enterprise Data Services, at Cox Automotive.

Jobs may be conducted weekly, daily or hourly. The amount of data processed in production pipelines today is approximately 720GB per day. Scheduled jobs pull from different areas both within and outside of the company. Hamlet uses Databricks Workflows to deliver data to the data science team, to the in-house data reporting team through Tableau, or directly into Power BI. "Databricks Workflows has a great user interface that allows you to quickly schedule any type of workflow, be it a notebook or JAR," says Hamlet. "Parametrization has been especially useful. It gives us clues as to how we can move jobs across environments. Workflows has all the features you would want from an orchestrator."

Hamlet also likes that Workflows provides observability into every workflow run and failure notifications so they can get ahead of issues quickly and troubleshoot before the data science team is impacted. "We use the job notifications feature to send failure notifications to a webhook, which is linked to our Microsoft Teams account," he says. "If we receive an alert, we go into Databricks to see what's going on. It's very useful to be able to peel into the run logs and see what errors occurred. And the Repair Run feature is nice to remove blemishes from your perfect history."

UNITY CATALOG AND DELTA SHARING IMPROVE DATA ACCESS ACROSS TEAMS

Hamlet's team recently began using Unity Catalog to manage data access, improving their existing method, which lacked granularity and was difficult to manage. "With our new workspace, we're trying to use more DevOps principles, infrastructure-as-code and groups wherever possible," he says. "I want to easily manage access to a wide range of data to multiple different groups and entities, and I want it to be as simple as possible for my team to do so. Unity Catalog is the answer to that."

The enterprise data services team also uses Delta Sharing, which natively integrates with Unity Catalog and allows Cox to centrally manage and audit shared data outside the enterprise data services team while ensuring robust security and governance. "Delta Sharing makes it easy to securely share data with business units and subsidiaries without copying or replicating it," says Hamlet. "It enables us to share data without the recipient having an identity in our workspace."

LOOKING AHEAD: INCORPORATING ADDITIONAL DATA INTELLIGENCE PLATFORM CAPABILITIES

Going forward, Hamlet plans to use Delta Live Tables (DLT) to make it easy to build and manage batch and streaming data pipelines that deliver data on the Databricks
Data Intelligence Platform. DLT will help data engineering teams simplify ETL development and management. Eventually, Hamlet may also use Delta Sharing to easily
share data securely with external suppliers and partners while meeting security and compliance needs. “DLT provides us an opportunity to make it simpler for our team.
Scheduling Delta Live Tables will be another place we’ll use Workflows,” he says.

Hamlet is also looking forward to using the data lineage capabilities within Unity Catalog to provide his team with an end-to-end view of how data flows in the lakehouse
for data compliance requirements and impact analysis of data changes. “That’s a feature I'm excited about,” Hamlet says. “Eventually, I hope we get to a point where we
have all our data in the lakehouse, and we get to make better use of the tight integrations with things like data lineage and advanced permissions management.”

Block — building a world-class data platform


Block standardizes on Delta Live Tables to expand
secure economic access for millions

90%: Improvement in development velocity

150: Pipelines being onboarded in addition to the 10 running daily

INDUSTRY
Financial Services

PLATFORM
Delta Live Tables, Data Streaming, Machine Learning, ETL

CLOUD
AWS

Block is a global technology company that champions accessible financial services and prioritizes economic empowerment. Its subsidiaries, including Square, Cash App and TIDAL, are committed to expanding economic access. By utilizing artificial intelligence (AI) and machine learning (ML), Block proactively identifies and prevents fraud, ensuring secure customer transactions in real time. In addition, Block enhances user experiences by delivering personalized recommendations and using identity resolution to gain a comprehensive understanding of customer activities across its diverse services. Internally, Block optimizes operations through automation and predictive analytics, driving efficiency in financial service delivery. Block uses the Data Intelligence Platform to bolster its capabilities, consolidating and streamlining its data, AI and analytics workloads. This strategic move positions Block for the forthcoming automation-driven innovation shift and solidifies its position as a pioneer in AI-driven financial services.

ENABLING CHANGE DATA CAPTURE FOR STREAMING DATA EVENTS ON DELTA LAKE

Block's Data Foundations team is dedicated to helping its internal customers aggregate, orchestrate and publish data at scale across the company's distributed system to support business use cases such as fraud detection, payment risk evaluation and real-time loan decisions. The team needs access to high-quality, low-latency data to enable fast, data-driven decisions and reactions.

Block had been consolidating on Kafka for data ingestion and Delta Lake for data storage. More recently, the company sought to make real-time streaming data available in Delta Lake as Silver (cleansed and conformed) data for analytics and machine learning. It also wanted to support event updates and simple data transformations and enable data quality checks to ensure higher-quality data. To accomplish this, Block considered a few alternatives, including the Confluent-managed Databricks Delta Lake Sink connector, a fully managed solution with low latency. However, that solution did not offer change data capture support and had limited transformation and data quality check support. The team also considered building their own solution with Spark Structured Streaming, which also provided low latency and strong data transformation capabilities. But that solution required the team to maintain significant code to define task workflows and change data capture logic. They'd also have to implement their own data quality checks and maintenance jobs.

LEVERAGING THE LAKEHOUSE TO SYNC KAFKA STREAMS TO DELTA TABLES IN REAL TIME

Rather than redeveloping its data pipelines and applications on new, complex, proprietary and disjointed technology stacks, Block turned to the Data Intelligence Platform and Delta Live Tables (DLT) for change data capture and to enable the development of end-to-end, scalable streaming pipelines and applications. DLT pipelines simply orchestrate the way data flows between Delta tables for ETL jobs, requiring only a few lines of declarative code. It automates much of the operational complexity associated with running ETL pipelines and, as such, comes with preselected smart defaults yet is also tunable, enabling the team to optimize and debug easily. "DLT offers declarative syntax to define a pipeline, and we believed it could greatly improve our development velocity," says Yue Zhang, Staff Software Engineer for the Data Foundations team at Block. "It's also a managed solution, so it manages the maintenance tasks for us, it has data quality support, and it has advanced, efficient autoscaling and Unity Catalog integration."

Today, Block's Data Foundations team ingests events from internal services in Kafka topics. A DLT pipeline consumes those events into a Bronze (raw data) table in real time, and they use the DLT API to apply changes and merge data into a higher-quality Silver table. The Silver table can then be used by other DLT pipelines for model training, to schedule model orchestration, or to define features for a features store. "It's very straightforward to implement and build DLT pipelines," says Zhang.

Following their initial DLT proof of concept, Zhang and his team implemented
CI/CD to make DLT pipelines more accessible to internal Block teams. Different
teams can now manage pipeline implementations and settings in their own repos,
and, once they merge, simply use the Databricks pipelines API to create, update
and delete those pipelines in the CI/CD process.

BOOSTING DEVELOPMENT VELOCITY WITH DLT

Implementing DLT has been a game-changer for Block, enabling it to boost development velocity. "With the adoption of Delta Live Tables, the time required to define and develop a streaming pipeline has gone from days to hours," says Zhang.

The Block Data Foundations team's streaming data architecture with Delta Live Tables pipelines
Using the Python API to define a pipeline requires three steps: a raw (Bronze) table, a Silver table and a merge process. The first step is to define the raw table. Block
consumes events from Kafka, performs some simple transformations, and performance, improved data quality has boosted customer trust, and more
establishes the data quality check and its rule. The goal is to ensure all events efficient autoscaling has improved cost efficiency. Access to fresh data means
have a valid event ID. Block data analysts get more timely signals for analytics and decision-making,
while Unity Catalog integration means they can better streamline and automate
The next step is to define the Silver table or target table, its storage location
data governance processes. “Before we had support for Unity Catalog, we had
and how it is partitioned. With those tables defined, the team then determines
to use a separate process and pipeline to stream data into S3 storage and
the merge logic. Using the DLT API, they simply select APPLY CHANGES INTO. If
a different process to create a data table out of it,” says Zhang. “With Unity
two units have the same event ID, DLT will choose the one with the latest ingest
Catalog integration, we can streamline, create and manage tables from the DLT
timestamp. “That’s all the code you need to write,” says Zhang.
pipeline directly.”
Finally, the team defines basic configuration settings from the DLT UI, such
Block is currently running approximately 10 DLT pipelines daily, with about
as characterizing clusters and whether the pipeline will run in continuous or
two terabytes of data flowing through them, and has another 150 pipelines to
triggered modes.
onboard. “Going forward, we’re excited to see the bigger impacts DLT can offer
us,” adds Zhang.

Trek — global bicycle leader accelerates retail analytics

80%-90%: Acceleration in runtime of retail analytics solution globally

3X: Increase in daily data refreshes on Databricks

1 Week: Reduction in ERP data replication, which now happens in near real time

"How do you scale up analytics without blowing a hole in your technology budget? For us, the clear answer was to run all our workloads on Databricks Data Intelligence Platform and replicate our data in near real-time with Qlik."

— Garrett Baltzer, Software Architect, Data Engineering, Trek Bicycle

INDUSTRY
Retail and Consumer Goods

SOLUTION
Real-Time Point-of-Sale Analytics

PLATFORM
Delta Lake, Databricks SQL, Delta Live Tables, Data Streaming

PARTNER
Qlik

CLOUD
Azure

Trek Bicycle started in a small Wisconsin barn in 1976, but their founders always saw something bigger. Decades later, the company is on a mission to make the world a better place to live and ride. Trek only builds products they love and provides incredible hospitality to customers as they aim to change the world for the better by getting more people on bikes. Frustrated by the rising costs and slow performance of their data warehouse, Trek migrated to Databricks Data Intelligence Platform. The company now uses Qlik to replicate their ERP data to Databricks in near real-time and stores data in Delta Lake tables. With Databricks and Qlik, Trek has dramatically accelerated their retail analytics to provide a better experience for their customers with a unified view of the global business to their data consumers, including business and IT users.

SLOW DATA PROCESSING HINDERS RETAIL ANALYTICS

As Trek Bicycle works to make the world better by encouraging more people to ride bikes, the company keeps a close eye on what's happening in their hundreds of retail stores. But until recently, running analytics on their retail data proved challenging because Trek relied on a data warehouse that couldn't scale cost-effectively.

"The more stores we added, the more information we added to our processes and solutions," explained Garrett Baltzer, Software Architect, Data Engineering, at Trek Bicycle. "Although our data warehouse did scale to support greater data volumes, our processing costs were skyrocketing, and processes were taking far too long. Some of our solutions were taking over 30 hours to produce analytics, which is unacceptable from a business perspective."

Adding to the challenge, Trek's data infrastructure hindered the company's efforts to achieve a global view of their business performance. Slow processing speeds meant Trek could only process data once per day for one region at a time.

"We were processing retail data separately for our North American, European and Asia-Pacific stores, which meant everyone downstream had to wait for actionable insights for different use cases," recalled Advait Raje, Team Lead, Data Engineering, at Trek Bicycle. "We soon made it a priority to migrate to a unified data platform that would produce analytics more quickly and at a lower cost."

DELTA LAKE UNIFIES RETAIL DATA FROM AROUND THE GLOBE

Seeking to modernize their data infrastructure to speed up data processing and unify all their data from global sources, Trek started migrating to the Databricks Data Intelligence Platform in 2019. The company's processing speeds immediately increased. Qlik's integration with the Databricks Data Intelligence Platform helps feed Trek's lakehouse. This replication allows Trek to build a wide range of valuable data products for their sales and customer service teams.

"Qlik enabled us to move relevant ERP data into Databricks where we don't have to worry about scaling vertically because it automatically scales parallel. Since 70 to 80% of our operational data comes from our ERP system, Qlik has made it possible to get far more out of our ERP data without increasing our costs," Baltzer explained.

Trek is now running all their retail analytics workloads in the Databricks Data Intelligence Platform. Today, Trek uses the Databricks Data Intelligence Platform to collect point-of-sale data from nearly 450 stores around the globe. All computation happens on top of Trek's lakehouse. The company runs a semantic layer on top of this lakehouse to power everything from strategic high-level reporting for C-level executives to daily sales and operations reports for individual store employees.

"Databricks Data Intelligence Platform has been a game-changer for Trek," said Raje. "With Qlik Cloud Data Integration on Databricks, it became possible to replicate relevant ERP data to our Databricks in real time, which made it far more accessible for downstream retail analytics. Suddenly, all our data from multiple repositories was available in one place, enabling us to reduce costs and deliver on business needs much more quickly."

Trek's BI and data analysts leverage Databricks SQL, their serverless data warehouse, for ad hoc analysis to answer business questions much more quickly. Internal customers can leverage Power BI connecting directly to Databricks to consume retail analytics data from Gold tables. This ease of analysis helps the company monitor and enhance their Net Promoter Scores. Trek uses Structured Streaming and Auto Loader functionality within Delta Live Tables to transform the data from Bronze to Silver or Gold, according to the medallion architecture.

"Delta Live Tables have greatly accelerated our development velocity," Raje reported. "In the past, we had to use complicated ETL processes to take data from raw to parsed. Today, we just have one simple notebook that does it, and then we use Delta Live Tables to transform the data to Silver or Gold as needed."

DATA INTELLIGENCE PLATFORM ACCELERATES ANALYTICS BY 80% TO 90%

By moving their data processing to the Databricks Data Intelligence Platform and integrating data with Qlik, Trek has dramatically increased the speed of their processing and overall availability of data. Prior to implementing Qlik, they had a custom program that, once a week, on a Sunday, replicated Trek's ERP data from on-premises servers to a data lake using bulk copies. Using Qlik, Trek now replicates relevant data from their ERP system as Delta tables directly in their lakehouse.

"We used to work with stale ERP data all week because replication only happened on Sundays," Raje remarked. "Now we have a nearly up-to-the-minute view of what's going on in our business. That's because Qlik lets us keep replicating through the day, streaming data from ERP into our lakehouse."

Trek's retail analytics solution used to take 48+ hours to produce meaningful results. Today, Trek runs the solution on the Databricks Data Intelligence Platform to get results in six to eight hours — an 80 to 90% improvement, thus allowing daily runs. A complementary retail analytics solution went from 12–14 hours down to under 4–5 hours, thereby enabling the lakehouse to refresh three times per day, compared to only once a day previously.

"Before Databricks, we had to run our retail analytics once a day on North American time, which meant our other regions got their data late," said Raje. "Now, we refresh the lakehouse three times per day, one for each region, and stakeholders receive fresh data in time to drive their decisions. Based on the results we've achieved in the lakehouse, we're taking a Databricks-first approach to all our new projects. We're even migrating many of our on-premises BI solutions to Databricks because we're all-in on the lakehouse."

"Databricks Data Intelligence Platform, along with data replication to Databricks using Qlik, aligns perfectly with our broader cloud-first strategy," said Steve Novoselac, Vice President, IT and Digital, at Trek Bicycle. "This demonstrates confidence in the adoption of this platform at Trek."

Coastal Community Bank — mastering the modern data platform


for exponential growth

< 10: Minutes to securely share large datasets across organizations

99%: Decrease in processing duration (2+ days to 30 min.)

12X: Faster partner onboarding by eliminating sharing complexity

"We've done two years' worth of work here in nine months. Databricks enables access to a single source of truth and our ability to process high volumes of transactions that gives us confidence we can drive our growth as a community bank and a leading banking-as-a-service provider."

— Curt Queyrouze, President, Coastal Community Bank

INDUSTRY
Financial Services

SOLUTION
Financial Crimes Compliance, Customer Profile Scoring, Financial Reconciliation, Credit Risk Reporting, Synthesizing Multiple Data Sources, Data Sharing and Collaboration

PLATFORM
Delta Lake, ETL, Delta Sharing, Data Streaming, Databricks SQL

CLOUD
Azure

Many banks continue to rely on decades-old, mainframe-based platforms to support their back-end operations. But banks that are modernizing their IT infrastructures and integrating the cloud to share data securely and seamlessly are finding they can form an increasingly interconnected financial services landscape. This has created opportunities for community banks, fintechs and brands to collaborate and offer customers more comprehensive and personalized services. Coastal Community Bank is headquartered in Everett, Washington, far from some of the world's largest financial centers. The bank's CCBX division offers banking as a service (BaaS) to financial technology companies and broker-dealers. To provide personalized financial products, better risk oversight, reporting and compliance, Coastal turned to the Databricks Data Intelligence Platform and Delta Sharing, an open protocol for secure data sharing, to enable them to share data with their partners while ensuring compliance in a highly regulated industry.

LEVERAGING TECH AND INNOVATION TO FUTURE-PROOF A COMMUNITY BANK

Coastal Community Bank was founded in 1997 as a traditional brick-and-mortar bank. Over the years, they grew to 14 full-service branches in Washington state offering lending and deposit products to approximately 40,000 customers. In 2018, the bank's leadership broadened their vision and long-term growth objectives, including how to scale and serve customers outside their traditional physical footprint. Coastal leaders took an innovative step and launched a plan to offer BaaS through CCBX, enabling a broad network of virtual partners and allowing the bank to scale much faster and further than they could via their physical branches alone.

Coastal hired Barb MacLean, Senior Vice President and Head of Technology Operations and Implementation, to build the technical foundation required to help support the continued growth of the bank. "Most small community banks have little technology capability of their own and often outsource tech capabilities to a core banking vendor," says MacLean. "We knew that story had to be completely different for us to continue to be an attractive banking-as-a-service partner to outside organizations."

To accomplish their objectives, Coastal would be required to receive and send vast amounts of data in near real-time with their partners, third parties and the variety of systems used across that ecosystem. This proved to be a challenge as most banks and providers still relied on legacy technologies and antiquated processes like once-a-day batch processing. To scale their BaaS offering, Coastal needed a better way to manage and share data. They also required a solution that could scale while ensuring that the highest levels of security, privacy and strict compliance requirements were met. "The list of things we have to do to prove that we can safely and soundly operate as a regulated financial institution is ever-increasing," says MacLean. "As we added more customers and therefore more customer information, we needed to scale safely through automation."

Coastal also needed to accomplish all this with their existing small team. "As a community bank, we can't compete on a people basis, so we have to have technology tools in place that teams can learn easily and deploy quickly," adds MacLean.

TACKLING A COMPLEX DATA ENVIRONMENT WITH DELTA SHARING

With the goal of having a more collaborative approach to community banking and banking as a service, Coastal began their BaaS journey in January 2023 when they chose Cavallo Technologies to help them develop a modern, future-proof data platform to support their stringent customer data sharing and compliance requirements. This included tackling infrastructure challenges, such as data ingestion complexity, speed, data quality and scalability. "We wanted to use our small, nimble team to our advantage and find the right technology to help us move fast and do this right," says MacLean.

"We initially tested several vendors, however learned through those tests we needed a system that could scale for our needs going forward," says MacLean. Though very few members of the team had used Databricks before, Coastal decided to move from a previously known implementation pattern and a data lake–based platform to a lakehouse approach with Databricks. The lakehouse architecture addressed the pain points they experienced using a data lake–based platform, such as trying to sync batch and streaming data. The dynamic nature and changing environments of Coastal's partners required handling changes to data structure and content. The Databricks Data Intelligence Platform provided resiliency and tooling to deal with both data and schema drift cost-effectively at scale.

Coastal continued to evolve and extend their use of Databricks tools, including Auto Loader, Structured Streaming, Delta Live Tables, Unity Catalog and Databricks repos for CI/CD, as they created a robust software engineering practice for data at the bank. Applying software engineering principles to data can often be neglected or ignored by engineering teams, but Coastal knew that it was critical to managing the scale and complexity of the internal and external environment in which they were working. This included having segregated environments for development, testing and production, having technical leaders approve the promotion of code between environments, and including data privacy and security governance.

Coastal also liked that Databricks worked well with Azure out of the box. And because it offered a consolidated toolkit for data transformation and engineering, Databricks helped address any risk concerns. "When you have a highly complex technical environment with a myriad of tools, not only inside your own environment but in our partners' environments that we don't control, a consolidated toolkit reduces complexity and thereby reduces risk," says MacLean.

Initially, MacLean's team evaluated several cloud-native solutions, with the goal of moving away from a 24-hour batch world and into real-time data processing since any incident could have wider reverberations in a highly interconnected financial system. "We have all these places where data is moving in real time. What happens when someone else's system has an outage or goes down in the middle of the day? How do you understand customer and bank exposure as soon as it happens? How do we connect the batch world with the real-time world? We were trapped in a no-man's-land of legacy, batch-driven systems, and partners are too," explains MacLean.

"We wanted to be a part of a community of users, knowing that was the future, and wanted a vendor that was continually innovating," says MacLean. Similarly, MacLean's team evaluated the different platforms for ETL, BI, analytics and data science, including some already in use by the bank. "Engineers want to work with modern tools because it makes their lives easier … working within the century in which you live. We didn't want to Frankenstein things because of a wide toolset," says MacLean. "Reducing complexity in our environment is a key consideration, so using a single platform has a massive positive impact. Databricks is the hands-down winner in apples-to-apples comparisons to other tools like Snowflake and SAS in terms of performance, scalability, flexibility and cost."

MacLean explained that Databricks included everything, such as Auto Loader, repositories, monitoring and telemetry, and cost management. This enabled the bank to benefit from robust software engineering practices so they could scale to serving millions of customers, whether directly or via their partner network. MacLean explained, "We punch above our weight, and our team is extremely small relative to what we're doing, so we wanted to pick the tools that are applicable to any and all scenarios."

IMPROVING TIME TO VALUE AND GROWING THEIR PARTNER NETWORK

In the short time since Coastal launched CCBX, it has become the bank's primary customer acquisition and growth division, enabling them to grow BaaS program fee income by 32.3% year over year. Their use of Databricks has also helped them achieve unprecedented time to value. "We've done two years' worth of work here in nine months," says Curt Queyrouze, President at Coastal.

Almost immediately, Coastal saw exponential improvements in core business functions. "Activities within our risk and compliance team that we need to conduct every few months would take 48 hours to execute with legacy inputs," says MacLean. "Now we can run those in 30 minutes using near real-time data."

Despite managing myriad technology systems, Databricks helps Coastal remove barriers between teams, enabling them to share live data with each other safely and securely in a matter of minutes so the bank can continue to grow quickly through partner acquisition. "The financial services industry is still heavily reliant on legacy, batch-driven systems, and other data is moving in real time and needs to be understood in real time. How do we marry those up?" asks MacLean. "That was one of the fundamental reasons for choosing Databricks. We have not worked with any other tool or technology that allows us to do that well."

The Databricks Data Intelligence Platform has greatly simplified how Coastal and their vast ecosystem of financial service partners securely share data across data platforms, clouds or regions.

CCBX leverages the power and scale of a network of partners. Delta Sharing uses an open source approach to data sharing and enables users to share live data across platforms, clouds and regions with strong security and governance. Using Delta Sharing meant Coastal could manage data effectively even when working with partners and third parties using inflexible legacy technology systems. "The data we were ingesting is difficult to deal with," says MacLean. "How do we harness incoming data from about 20 partners with technology environments that we don't control? The data's never going to be clean. We decided to make dealing with that complexity our strength and take on that burden. That's where we saw the true power of Databricks' capabilities. We couldn't have done this without the tools their platform gives us."

Databricks also enabled Coastal to scale from 40,000 customers (consumers and small-medium businesses in the north Puget Sound region) to approximately 6 million customers served through their partner ecosystem and dramatically increase the speed at which they integrate data from those partners. In one notable case, Coastal was working with a new partner and faced the potential of having to load data on 80,000 customers manually. "We pointed Databricks at it and had 80,000 customers and the various data sources ingested, cleaned and prepared for our business teams to use in two days," says MacLean. "Previously, that would have taken one to two months at least. We could not have done that with any prior existing tool we tried."

With Delta Sharing on Databricks, Coastal now has a vastly simplified, faster and more secure platform for onboarding new partners and their data. "When we want to launch and grow a product with a partner, such as a point-of-sale consumer loan, the owner of the data would need to send massive datasets on tens of thousands of customers. Before, in the traditional data warehouse approach, this would typically take one to two months to ingest new data sources, as the schema of the sent data would need to be changed in order for our systems to read it. But now we point Databricks at it and it's just two days to value," shares MacLean.

While Coastal's data engineers and business users love this improvement in internal productivity, the larger transformation has been how Databricks has enabled Coastal's strategy to focus on building a rich partner network. They now have about 20 partners leveraging different aspects of Coastal's BaaS.

Recently, Coastal's CEO had an ask about a specific dataset. Based on experience from their previous data tools, they brought in a team of 10 data engineers to comb through the data, expecting this to be a multiday or even multi-week effort. But when they actually got into their Databricks Data Intelligence Platform, using data lineage on Unity Catalog, they were able to give a definitive answer that same afternoon. MacLean explains that this is not an anomaly. "Time and time again, we find that even for the most seemingly challenging questions, we can grab a data engineer with no context on the data, point them to a data pipeline and quickly get the answers we need."

The bank’s use of Delta Sharing has also allowed Coastal to achieve success with One, an emerging fintech startup. One wanted to sunset its use of Google BigQuery,
which Coastal was using to ingest One’s data. The two organizations needed to work together to find a solution. Fortunately, One was also using Databricks. “We used
Delta Sharing, and after we gave them a workspace ID, we had tables of data showing up in our Databricks workspace in under 10 minutes,” says MacLean. (To read more
about how Coastal is working with One, read the blog.) MacLean says Coastal is a leader in skills, technology and modern tools for fintech partners.

DATA AND AI FOR GOOD

With a strong data foundation set, MacLean has a larger vision for her team. “Technologies like generative AI open up self-serve capabilities to so many business groups.
For example, as we explore how to reduce financial crimes, if you are taking a day to do an investigation, that doesn’t scale to thousands of transactions that might need
to be investigated,” says MacLean. “How do we move beyond the minimum regulatory requirements on paper around something like anti-money laundering and truly
reduce the impact of bad actors in the financial system?”

For MacLean this is about aligning her organization with Coastal’s larger mission to use finance to do better for all people. Said MacLean, “Where are we doing good in
terms of the application of technology and financial services? It’s not just about optimizing the speed of transactions. We care about doing better on behalf of our fellow
humans with the work that we do.”

Powys Teaching Health Board — improving decision-making


to save lives faster

< 1 year: To modernize data infrastructure

40%: Decrease in time to insight

65%: More productive with Databricks Assistant

"The adoption of Databricks has ensured that we can future-proof our data capabilities. It has transformed and modernized the way we work, and that has a direct impact on the quality of care delivered to our community."

— Jake Hammer, Chief Data Officer, Powys Teaching Health Board (PTHB)

INDUSTRY
Healthcare and Life Sciences

SOLUTION
Forward-Looking Intelligence

PLATFORM
Data Intelligence Platform, Unity Catalog

CLOUD
Azure

The inability to access complete and high-quality data can have a direct impact on a healthcare system's ability to deliver optimal patient outcomes. Powys Teaching Health Board (PTHB), serving the largest county in Wales, is responsible for planning and providing national health services for approximately a quarter of the country. However, roughly 50% of the data they need to help inform patient-centric decisions doesn't occur within Powys and is provided by neighboring organizations in varying formats, slowing their ability to connect data with the quality of patient care. Converting all this data — from patient activity (e.g., appointments) and workforce data (e.g., schedules) — to actionable insights is difficult when it comes in from so many disparate sources. PTHB needed to break down these silos and make it easier for nontechnical teams to access the data. With the Databricks Data Intelligence Platform, PTHB now has a unified view of all their various data streams, empowering healthcare systems to make better decisions that enhance patient care.

SILOS AND SYSTEM STRAIN HINDER DATA-DRIVEN INSIGHTS

The demand for PTHB's services has increased significantly over the years as they've dealt with evolving healthcare needs and population growth. As new patients enter the national healthcare system, so does the rise in data captured about the patient, hospital operations and more. With this rapid influx of data coming from various hospitals and healthcare systems around the country, PTHB's legacy system began to reach its performance and scalability limits, quickly developing data access and ingestion bottlenecks that not only wasted time, but directly impacted patient care. And as the diversity of data rose, their legacy system buckled under the load.

"Our data sat in so many places that it caused major frustrations. Our on-premises SQL warehouse couldn't cope with the scale of our growing data estate," explained Jake Hammer, Chief Data Officer at PTHB. "We needed to move away from manually copying data between places. Finding a platform that would allow us to take advantage of the cloud and was flexible enough to safeguard our data within a single view for all to easily access was critical."

How could PTHB employees make data-driven decisions if the data was hard to find and difficult to understand? Hammer realized that they needed to first modernize their data infrastructure in the cloud and migrate to a platform capable of unifying their data and making it readily available for downstream analytics use cases: from optimizing staff schedules to providing actionable insights for clinicians so they can provide timely and targeted care. Hammer's team estimated that it would take five to 10 years to modernize their tech stack in this way if they were to follow their own processes. But they needed a solution now. Enter Databricks.

IMPROVING DATA DEMOCRATIZATION WITH A UNIFIED PLATFORM

PTHB chose the Databricks Data Intelligence Platform to house all new incoming data, from any source. This includes the data for a large number of low-code apps (e.g., Power Apps) so that Hammer's team can now work with data that was historically kept on paper — making it significantly easier for people to access and analyze the data at scale.

Data governance is also critical, but creating standard processes was difficult before transitioning to Databricks as their core platform. With Unity Catalog, PTHB has a model where all of their security and governance is done only once at the Databricks layer. "The level of auditing in Databricks gives us a high level of assurance. We need to provide different levels of access to many different individuals and systems," added Hammer. "Having a tool that enables us to confidently manage this complex security gives both ourselves and our stakeholders assurance. We can more easily and securely share data with partners."

Deriving actionable insights on data through numerous Power BI dashboards with ease is something PTHB could not do before. "Now HR has the data they need to improve operational efficiency while protecting the bottom line," said Hammer. "They can self-serve any necessary data, and they can see where there are gaps in rosters or inefficiencies in on-the-ground processes. Being able to access the right data at the right time means they can be smarter with rostering, resource management and scheduling."

FEDERATED LAKEHOUSE IMPROVES TEAM EFFICIENCY AND REAL-TIME DATA ENHANCES PATIENT CARE

With the Databricks Data Intelligence Platform, PTHB has taken their first step toward modernization by moving to the cloud in less than a year — a much quicker timeline than their 10-year estimate — and providing a federated lakehouse to unify all their data. Through the lakehouse, they are able to seamlessly connect to their on-premises SQL warehouse and remote BigQuery environment at NHS Wales to create a single view of their data estate. "With the Databricks Platform, and by leveraging features such as Lakehouse Federation to integrate remote data, PTHB data practitioners now work from a single source of truth to improve decision-making and patient outcomes," explained Hammer.

From an operational standpoint, the impact of a modern platform has been significant, with efficiencies skyrocketing to an estimated 40% time savings in building data pipelines for analytics. They also estimate spending 65% less time answering questions from business data users with the help of Databricks Assistant. This AI-powered tool accelerated training for PTHB, helping traditional SQL staff embrace new programming languages and empowering them to be more productive without overreliance on the data engineering team.

A modern stack and newfound efficiencies throughout the data workflow have fueled Hammer's ability to expand into more advanced use cases. "Data science was an area we simply hadn't thought about," explained Hammer. "Now we have the means to explore predictive use cases like forecasting feature utilization."

Most importantly, PTHB has a solution that's future-proof and can tackle any data challenge, which is critical given how they are seeing a rapidly growing environment of APIs, adoption of open standards and new sources of real-time data. "I can trust our platform is future-proof, and I'm probably the only health board to be able to say that in Wales at the moment," said Hammer. "Just like with the data science world, the prediction world was something we never thought was possible. But now we have the technology to do anything we put our minds and data to."
Tens of millions of production workloads run daily on Databricks
Easily ingest and transform batch and streaming data on the Databricks Data Intelligence Platform.
Orchestrate reliable production workflows while Databricks automatically manages your infrastructure
at scale. Increase the productivity of your teams with built-in data quality testing and support for
software development best practices.

Try Databricks free
Get started with a free demo

About Databricks
Databricks is the data and AI company. More than 10,000
organizations worldwide — including Block, Comcast,
Condé Nast, Rivian, Shell and over 60% of the Fortune 500
— rely on the Databricks Data Intelligence Platform to take
control of their data and put it to work with AI. Databricks
is headquartered in San Francisco, with offices around
the globe, and was founded by the original creators of
Lakehouse, Apache Spark™, Delta Lake and MLflow. To learn
more, follow Databricks on LinkedIn, X and Facebook.

© Databricks 2024. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use
