Data Sharing and Collaboration with Delta Sharing
Accelerate Innovation and Generate New Revenue Streams
Ron L’Esteve
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Sharing
and Collaboration with Delta Sharing, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Databricks. See our
statement of editorial independence.
978-1-098-16023-4
[LSI]
Table of Contents
Marketplace Datasets and Notebooks
Solution Accelerators
AI Models
Getting Started
Summary
CHAPTER 1
Harnessing the Power of Data Sharing
External sharing, by contrast, involves entities outside of the organization, including partners, customers,
researchers, or even the public. For instance, a healthcare organiza‐
tion might share anonymized patient data with research institutions
to aid in medical research. With both types of sharing, it’s important
to have clear policies and procedures in place to protect sensitive
information and comply with relevant data protection regulations.
This chapter will explore the power of data sharing and how it can
overcome various challenges to unlock its full potential. The chapter
begins by examining the current landscape of data sharing, in which
different types of data are generated and stored by various actors.
You will learn about the main challenges hindering data sharing,
such as legacy solutions that are not designed for interoperability
and scalability, cloud vendors that create silos and lock-in effects,
and legal and regulatory barriers that limit data access and reuse.
You will also learn about use cases that demonstrate the value and
impact of data sharing in different domains, such as health, edu‐
cation, agriculture, energy, transportation, and societal good. The
chapter will also highlight some of the principles and best practices
that enable effective and responsible data sharing and collaboration.
Finally, you will explore the benefits of data sharing and collabora‐
tion for various stakeholders by demonstrating how data sharing
can improve efficiency, innovation, transparency, accountability,
trust, and participation. The risks and challenges of data sharing
in relation to privacy, security, quality, fairness, and sovereignty will
also be addressed. The chapter will conclude by providing some
recommendations and guidelines for fostering a culture of data
sharing and collaboration that balances the benefits and risks of data
sharing.
Data sharing is a key capability for digital transformation, and data
and analytics leaders who share data internally and externally are
likely to generate measurable economic benefit. The market for data
sharing is also growing rapidly as organizations realize the need
for an ecosystem that enables them to build, sell, and buy data
products. Forbes estimates that by 2030, $3.6 trillion dollars will be
generated through the commercialization of data products. Accord‐
ing to Gartner, chief data officers (CDOs) who have successfully
executed data sharing initiatives in their organizations are 1.7 times
more effective at showing business value and return on investment
from their data analytics strategy. Gartner predicts that by 2024,
most organizations will attempt trust-based data sharing programs,
1 Andrew White, “Our Top Data and Analytics Predictions for 2021,” Gartner Business
Insights, Strategies & Trends for Executives (blog), January 12, 2021.
Secured Platforms
Secured platforms are technologies that provide a safe and secure
environment for organizations to share data with partners or cus‐
tomers while maintaining control over their information. Some
examples of secured platforms include marketplaces and data
exchange platforms offered by major cloud providers, as well as
clean rooms, blockchain, and distributed ledgers. These technol‐
ogies offer benefits such as increased security, transparency, effi‐
ciency, and cost savings. In this section, you will learn about the
following technologies and how they can help organizations to
securely share data and collaborate with partners:
Clean rooms
A data clean room is a secure collaboration environment that
allows two or more participants to leverage data assets for spe‐
cific, mutually agreed upon uses, while guaranteeing enforce‐
ment of strict data access limitations—e.g., not revealing or
exposing their customers’ personal data to other parties.3
3 “Data Clean Rooms: Guidance and Recommended Practices,” IAB Tech Lab, accessed
October 27, 2023, https://oreil.ly/yhrG2.
Challenges
Factors that hinder data sharing include legacy solutions that are
not designed for interoperability and scalability, cloud vendors that
create silos and lock-in effects, legal and regulatory barriers that
limit data access and reuse, lack of standardization in data formats,
issues related to data ownership and control, and the costs and
resources required to share data. Companies may be deterred from
taking advantage of data sharing opportunities by challenges and
risks in such areas as the following:
certifying the origin of the data and its documentation to ensure
its quality.
Trust
Trust refers to the confidence and reliability of data and its
sources. Data sharing may involve collaborating with unknown
or untrusted parties, which may affect trust. For example, data
sharing may face challenges in verifying the identity, reputation,
or credibility of data providers or consumers. Data sharing may
also require establishing mechanisms such as contracts, agree‐
ments, incentives, ratings, reviews, or feedback to ensure trust.
The current challenges of data sharing require careful consideration
and management by data owners and users. Data owners need to
balance the benefits and risks of sharing their data with others while
respecting the rights and interests of the data subjects. Data users
need to assess the quality and trustworthiness of the shared data
while complying with the terms and conditions of the data provid‐
ers. Data sharing also requires adopting appropriate technologies
and platforms that facilitate secure and efficient data sharing while
preserving privacy and quality.
It is important to ensure that data sharing is done responsibly
and securely. This can be achieved through the use of data access
controls, data encryption, data masking, data classification, data
governance policies, and data sharing agreements. Such practices
help ensure that shared data is protected and used ethically and
transparently. Let’s review these practices:
Data access controls
Mechanisms that restrict access to sensitive data to only author‐
ized users, which can help prevent unauthorized access and
misuse of the data.
Data encryption
The process of converting plain text into a coded format that
can be read only by someone with the key to decrypt it, thus
protecting the confidentiality of the data while it is being trans‐
mitted or stored.
Data masking
The process of obscuring sensitive information in a dataset so it
cannot be easily identified, protecting the privacy of individuals
whose information is included in the dataset.
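To make the masking idea concrete, here is a minimal sketch in Python. The field names and masking rules are hypothetical, chosen only to illustrate the technique:

```python
import re

def mask_record(record):
    """Return a copy of a record with sensitive fields obscured."""
    masked = dict(record)
    # Keep only the last four digits of an SSN-like identifier
    masked["ssn"] = re.sub(r"^\d{3}-\d{2}", "***-**", masked["ssn"])
    # Hide the local part of an email address
    user, _, domain = masked["email"].partition("@")
    masked["email"] = user[0] + "***@" + domain
    return masked

record = {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"}
print(mask_record(record))
```

In practice, masking rules are usually driven by data classification policies rather than hard-coded per field.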
Summary
In this chapter, you have learned about the power of data sharing
and how it can transform the way you create, access, use, and share
data. You also learned about the benefits and challenges of data
sharing and collaboration for various stakeholders and society. You
examined the current landscape of data sharing, in which various
actors, such as governments, businesses, academic organizations,
civil society, and individuals, generate and store different types of
data. You also looked at the main models that facilitate broad data
sharing within and across industries, such as vertical platforms,
super platforms, shared infrastructure, and decentralized models.
However, data sharing also requires adopting appropriate technolo‐
gies and platforms that facilitate secure and efficient data sharing
while preserving privacy and quality. Databricks provides the first
open source approach to data sharing and collaboration across data,
analytics, and AI with products that include Delta Sharing, Unity
Catalog, Marketplace, and Clean Rooms. Delta Sharing provides
an open protocol for secure cross-platform live data sharing. It
integrates with Unity Catalog for centralized governance to manage
access control policies. This allows organizations to share live data
across platforms without replicating it. With its Marketplace, users
can discover, evaluate, and access data products—including datasets,
machine learning models, dashboards, and notebooks—from any‐
where without the need to be on the Databricks platform. Clean
Rooms provides a secure environment for businesses to collaborate
with their customers and partners on any cloud in a privacy-safe
way. Participants in the data clean rooms can share and join their
existing data and run complex workloads in any language—Python,
R, SQL, Java, Scala—on the data while maintaining data privacy.
In Chapter 2, you will dive deeper into Delta Sharing and how it
works. You will learn about the features and benefits of Delta Shar‐
ing, such as open cross-platform sharing, avoiding vendor lock-in,
and easily sharing existing data in Delta Lake and Apache Parquet
formats to any data platform. You will also learn how to use Delta
Sharing to share and consume data from various sources and desti‐
nations, such as Databricks, AWS S3, Azure Blob Storage, Google
Cloud Storage, Snowflake, Redshift, BigQuery, Presto, Trino, and
Spark SQL. In addition, you will explore some of the use cases and
best practices of Delta Sharing in different domains and scenarios.
By the end of Chapter 2, you will have a solid understanding of
Delta Sharing and how it can enable effective and responsible data
sharing and collaboration across different platforms and domains.
CHAPTER 2
Understanding Delta Sharing
Delta Sharing sets itself apart from other solutions through its dedi‐
cation to an open exchange of data. “Open” in the context of Delta
Sharing refers primarily to the open source nature of its sharing
protocol, which promotes a wide network of connectors and ensures
superior interoperability across diverse platforms. While Delta Lake
does employ an open datafile format, the key differentiator is the
open source sharing protocol that enables this expansive network.
The term “open exchange” is used here to denote a marketplace that
isn’t limited to a single vendor, in contrast to a “private exchange.”
Despite the democratization of data access, Delta Sharing maintains
stringent security, governance, and auditing mechanisms. Capable
of managing massive datasets, Delta Sharing scales seamlessly,
marking a significant advancement in data sharing and accessibility.
In this chapter, you will learn about the features and capabilities of
Delta Sharing, how Delta Sharing fits into the broader Databricks
ecosystem, the advantages of using Delta Sharing over traditional
data sharing methods, and real-world use cases in which Delta Shar‐
ing can be applied. By the end of the chapter, you will have a solid
understanding of what Databricks Delta Sharing is and how to get
started with data sharing. In addition, you’ll discover how strategic
partnerships with popular enterprise companies such as Oracle and
Cloudflare enhance Delta Sharing. Let’s explore the capabilities of
Databricks Delta Sharing, a technology that facilitates effective col‐
laboration and sets the stage for a business model that thrives on
shared data and enhanced possibilities. In gaining an understanding
of its capabilities, you’ll learn how Delta Sharing can augment data
sharing and teamwork within your organization.
AI Model Sharing
Delta Sharing’s technical capabilities extend beyond data sharing to
include AI model sharing. This feature is particularly beneficial for
organizations that want to leverage machine learning models across
different platforms and teams.
In the context of Delta Sharing, AI models are treated as data
assets and can be included in shares, similar to tables, notebooks,
and volumes. This allows providers to share AI models with recipi‐
ents, facilitating collaboration and knowledge transfer.
The process of sharing AI models is similar to that for sharing other
data assets. In the open source version of Delta Sharing, you would
need to manage your own sharing servers and storage accounts to
share AI models. However, the managed version simplifies this pro‐
cess by providing fine-grained access control and easy management
through Unity Catalog.
Databricks-to-Databricks sharing supports AI model sharing, allow‐
ing you to share models with Databricks users connected to a
different metastore. This is done using a secure sharing identifier,
ensuring that your models are shared securely and only with author‐
ized users.
Delta Sharing’s AI-model-sharing capability enhances the platform’s
versatility, making it a powerful tool for organizations that want to
democratize access to AI models while maintaining robust security
and governance.
Time Travel
Delta Lake time travel is a powerful feature that enables you to
access previous versions of a Delta table based on timestamps or
specific table versions. Its functionality serves practical purposes
such as recreating analyses, reports, or outputs, as well as facilitating
auditing and data validation tasks. In the context of Delta Sharing,
consider a scenario in which you have a shared Delta table con‐
taining sales data. You’re interested in analyzing how sales figures
evolved over the past year. With Delta Lake time travel, you can
query the table as it existed at earlier points in the year and
compare the results across versions.
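Conceptually, a time travel read resolves a timestamp to the latest table version committed at or before it. The sketch below illustrates only that resolution logic with an invented commit history; in Delta Lake itself you would read the table with spark.read.format("delta").option("timestampAsOf", ...):

```python
from datetime import datetime

# Hypothetical commit history: version -> commit timestamp
commits = {
    0: datetime(2024, 1, 1),
    1: datetime(2024, 4, 1),
    2: datetime(2024, 7, 1),
}

def version_as_of(commits, ts):
    """Return the latest version committed at or before ts."""
    eligible = [v for v, t in commits.items() if t <= ts]
    if not eligible:
        raise ValueError("No version exists at or before the given timestamp")
    return max(eligible)

# A query "as of" June 2024 sees version 1, not the later July commit
print(version_as_of(commits, datetime(2024, 6, 15)))  # 1
```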
Schema Evolution
Schema evolution allows you to evolve the schema of a Delta Lake
table over time. You can add new columns to your table, change
the data type of existing columns, or even delete columns without
having to rewrite your entire dataset.
You can enable schema evolution by setting the mergeSchema option
to true when writing data to a Delta table, allowing you to append
data with a different schema to an existing Delta table. For example,
if you have a Delta table with the columns first_name and age and
want to append data that also includes a country column, you can
do so by setting the mergeSchema option to true. The new column
will be added to the Delta table, and any existing data in the table
will have null values for the new column.
You can also enable schema evolution by default by setting the Spark
configuration spark.databricks.delta.schema.autoMerge.enabled to true.
This allows you to append data with different schemas without having
to set the mergeSchema option every time.
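The null-fill behavior can be illustrated in plain Python. This is only a conceptual sketch of what Delta does when mergeSchema is enabled; the column names follow the example above:

```python
def append_with_merge_schema(table_rows, new_rows):
    """Append rows that may carry new columns; existing rows get None for them."""
    all_columns = set()
    for row in table_rows + new_rows:
        all_columns.update(row)
    # Backfill missing columns with None, mirroring Delta's schema evolution
    return [{col: row.get(col) for col in sorted(all_columns)}
            for row in table_rows + new_rows]

existing = [{"first_name": "Ada", "age": 36}]
incoming = [{"first_name": "Grace", "age": 45, "country": "US"}]
print(append_with_merge_schema(existing, incoming))
```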
Schema evolution allows you to easily adapt your data sharing prac‐
tices as your business requirements change. Using schema evolu‐
tion, you can ensure that your shared data remains relevant and up
to date, without going through complex data migration processes.
Partition Filtering
In Delta Sharing, partition filtering allows data providers to share
specific parts of a Delta table with data recipients without making
extra copies, helping you share only the data you need or control
access based on recipient characteristics, such as partitions that
filter by recipient properties, specified when creating or updating
a share. With partition filtering, Delta Sharing allows you to
unlock the full potential of your data lake and collaborate with
your customers and partners on live data without compromising on
security, performance, or flexibility.
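The effect of a recipient-property filter can be sketched as a simple predicate. The table, column, and property names below are invented for illustration; in Databricks SQL this is typically expressed with a PARTITION clause when adding a table to a share:

```python
def rows_for_recipient(rows, partition_column, recipient_properties, property_name):
    """Share only the partitions matching a recipient's property value."""
    allowed = recipient_properties.get(property_name)
    return [r for r in rows if r[partition_column] == allowed]

sales = [
    {"country": "US", "amount": 100},
    {"country": "DE", "amount": 250},
    {"country": "US", "amount": 75},
]
# A recipient tagged with country=US sees only the US partition
recipient = {"country": "US"}
print(rows_for_recipient(sales, "country", recipient, "country"))
```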
Key Differences
The key differences among the four methods of data sharing are
based on the unique features of each method and the specific needs
each caters to. Sharing between Databricks environments (D2D) is
an excellent choice when you want to share data with another Data‐
bricks user, regardless of their account or cloud host. This method
Consuming Data
To access shared data directly from Databricks (see step 6 in Fig‐
ure 2-4), you need to have a Databricks workspace enabled for Unity
Catalog and a unique sharing identifier for your workspace. Once
you provide the identifier to the data provider, they will share data
with you and create a secure connection with your organization.
You can then find the share containing the data you want to access
and create a catalog from it. After granting or requesting access
to the catalog and its objects, you can read the data using various
tools available to Databricks users. You can also preview and clone
notebooks in the share.
Recipients who have received an activation link, similar to the illus‐
tration shown in Figure 2-5, can download the credential file locally
in JSON format. Note that for security purposes, the credential file
can be downloaded only one time, after which the download link is
deactivated. For certain technologies, such as Tableau, in addition to
the URL link, you may need to upload this credential file. For other
technologies, you may need a “bearer token” or other credentials
from this file.
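A downloaded credential file is a small JSON document. The sketch below parses an example profile whose fields follow the open Delta Sharing profile format; the endpoint and token values are placeholders:

```python
import json

# Example contents of a downloaded credential (profile) file
profile_text = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<redacted>",
  "expirationTime": "2025-01-01T00:00:00Z"
}
"""

profile = json.loads(profile_text)
# Clients authenticate requests against the endpoint using the bearer token
print(profile["endpoint"], profile["shareCredentialsVersion"])
```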
Best Practices
In addition to assessing the open source versus managed version
based on your requirements, here are some best practices for secur‐
ing your data sharing with Delta Sharing. By following these best
practices, you can help ensure that your Delta Sharing platform is
secure and that your data remains protected:
Set the appropriate recipient token lifetime for every metastore.
To share data securely using the open sharing model, you need
to manage tokens well. Set the default recipient token lifetime
when you enable Delta Sharing for your Unity Catalog metastore,
and ensure that tokens expire after a defined time period.
Establish a process for rotating credentials.
It is important to establish a process for rotating the credentials
(such as presigned URLs) used to access the data, which helps
ensure that access to your data remains secure and that any
compromised credentials are quickly invalidated.
Configure IP access lists.
You can configure IP access lists to restrict access to your data
based on the IP address of the client, ensuring that only author‐
ized clients can access your data.
Enable audit logging.
Enabling audit logging allows you to track and monitor access
to your shared data while also identifying any unauthorized
access or suspicious activity.
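Two of these practices, token expiry and IP access lists, can be sketched with a few lines of standard-library Python; the lifetimes and addresses are illustrative only:

```python
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network

def token_is_valid(issued_at, lifetime, now):
    """Reject tokens older than the configured recipient token lifetime."""
    return now - issued_at <= lifetime

def ip_is_allowed(client_ip, allowlist):
    """Accept requests only from configured CIDR ranges."""
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in allowlist)

now = datetime(2024, 6, 1, 12, 0)
print(token_is_valid(datetime(2024, 5, 31), timedelta(days=30), now))  # True
print(ip_is_allowed("10.0.0.7", ["10.0.0.0/24", "192.168.1.0/24"]))    # True
print(ip_is_allowed("203.0.113.5", ["10.0.0.0/24"]))                   # False
```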
frameworks without requiring specialized compute patterns, mark‐
ing a significant stride toward agile and efficient data sharing
practices.
Delta Sharing is a powerful tool that can help you harness the
power of data sharing and collaboration. Delta Sharing simplifies
the data sharing process, enhances data collaboration and produc‐
tivity, enables seamless data exchange, and strengthens data security
and governance. You can also benefit from the open source and
open data formats of Delta Lake to make data accessible to everyone.
In Chapter 3, you will delve into the practical aspects of navigating
Databricks Marketplace. The Marketplace offers prebuilt integra‐
tions, data connectors, and AI models to enhance your data-driven
initiatives. You will explore the various use cases and benefits it
brings, enabling you to make informed decisions when utilizing its
resources.
CHAPTER 3
Navigating Databricks Marketplace
Databricks Marketplace is an open marketplace for your data, ana‐
lytics, and AI needs. It aims to address the challenges of data
sharing, which is often hindered by technical, legal, and business
demands, such as platform dependencies, data replication, security
risks, and contractual agreements. Databricks Marketplace is pow‐
ered by Delta Sharing and expands your opportunity to deliver
innovation and advance your analytics and AI initiatives. In this
chapter, you will learn how to navigate Databricks Marketplace and
take advantage of its key benefits. You will explore topics such as
popular use cases and data providers across industries.
You will also learn about the different types of data assets avail‐
able on the Marketplace, including AI models and prebuilt note‐
books and solutions, and how Delta Sharing and Marketplace work
together. By the end of the chapter, you will have a comprehensive
understanding of Databricks Marketplace and will also be able to
apply best practices and tips for sharing data securely and efficiently
with other organizations.
Marketplace Partners
Databricks Marketplace thrives on the contributions of its diverse
partners. These partners, which include data providers, technology
partners, and consulting partners, play a crucial role in enriching
the Marketplace with a wide array of data products and services.
Data Providers
The Marketplace expands your ability to deliver innovation and
advance your analytics and AI initiatives. It allows data consumers
to discover, evaluate, and access more data products from third-
party vendors than ever before. Providers can now commercialize
new offerings and shorten sales cycles by providing value-added
services on top of their data. There are hundreds of providers avail‐
able across more than 15 industry categories. Here are some of
the providers on the Marketplace that can benefit organizations in
healthcare, finance, retail, manufacturing, and other industries:
LiveRamp
Provides a privacy-conscious and configurable collaboration
platform for organizations and external partners to create
audiences, activate data, and access insights. This includes
enhancing customer data with demographic and psychographic
information from AnalyticsIQ, Catalina, Experian, Polk, and
other sources.
AnalyticsIQ
Offers demographic and psychographic data, including datasets
for consumer demographics, lifestyle behaviors, and purchase
intent.
Catalina
Specializes in personalized digital media solutions and provides
datasets for purchase behavior insights.
Experian
Offers datasets such as credit scores and credit reports.
Polk
Provides automotive data solutions, offering a wide range of
data products such as datasets for vehicle registration data.
ShareThis
Offers solutions to enhance digital marketing efforts, foster
social engagement, and drive traffic to websites and digital
properties for businesses.
Technology Partners
Technology partners integrate their solutions with Databricks to
provide complementary capabilities for ETL, data ingestion, busi‐
ness intelligence, machine learning, and governance. These inte‐
grations enable you to leverage the Databricks Data Intelligence
Platform’s reliability and scalability to innovate faster while deriving
valuable data insights.
Consulting Partners
Consulting partners are experts uniquely positioned to help you
strategize, implement, and scale data, analytics, and AI initiatives
with Databricks. They bring technology, industry, and use case
expertise to help you make the most of the Databricks Data Intelli‐
gence Platform. Typical consulting partners include global, regional,
and industry-leading consulting services and technology product
companies.
Solution Accelerators
Solution Accelerators are prebuilt solutions designed to address
common use cases and speed up the development of data and AI
applications. They cover a wide range of domains, from cybersecur‐
ity to healthcare, and are developed by various providers, including
Databricks and its partners.
Solution Accelerators offer a unique blend of data, models, and
code, providing you with a solid starting point for your projects.
They can help you reduce development time, avoid common pitfalls,
and achieve better results.
Table 3-2 provides an overview of some of the Databricks Solution
Accelerators available in the Marketplace.
AI Models
Databricks Marketplace enables easy access to ready-to-use AI
models developed and provided by both Databricks and third parties,
catering to a variety of use cases and domains. For instance, a
provider could build a domain-specific natural language model to
detect healthcare-specific clinical phrases, which businesses could
then apply directly to their own use cases.
Individuals can contribute their AI models to Databricks Market‐
place for utilization by others. Databricks Marketplace serves as
an open platform for the exchange of data assets such as datasets,
notebooks, dashboards, and AI models. Through AI model sharing,
Databricks users can access state-of-the-art models that can be easily
and securely implemented on their data.
To become a contributor of data on the Marketplace, one might
need to participate in a partner program. This helps ensure that the
contributed data products, including AI models, meet certain
criteria and are appropriate for use by others.
To use these models, you need a Databricks workspace that is
enabled for Unity Catalog. You also need to accept the terms and
conditions of the Llama 2 Community License Agreement before
installing the listing.
Once it is installed, you can view detailed information about each
model, including specifications, performance, limitations, and usage
examples.
You can deploy these models directly to Databricks Model Serving
for immediate use. This allows you to create REST endpoints for
your models and serve them with low latency and high scalability.
You can also load the models for fine-tuning or batch inference use
cases using the MLflow API or the Transformers library.
Here’s how you can load the Llama-2-7b-chat-hf model using
MLflow:
import mlflow

# Load version 1 of the registered model as a generic Python function
model_uri = "models:/Llama-2-7b-chat-hf/1"
model = mlflow.pyfunc.load_model(model_uri)
And here’s how you can load the same model using Transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hugging Face repository for the Llama 2 7B chat model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
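Once a model is deployed to Model Serving, its endpoint is queried over HTTPS. The sketch below only constructs such a request without sending it; the workspace URL, endpoint name, and token are placeholders, and the dataframe_records payload shape is one of the scoring formats accepted by MLflow-served models:

```python
import json
import urllib.request

# Placeholders: substitute your workspace URL, endpoint name, and token
endpoint_url = "https://<workspace-url>/serving-endpoints/<endpoint-name>/invocations"
token = "<personal-access-token>"

payload = {"dataframe_records": [{"prompt": "Summarize Delta Sharing in one line."}]}

request = urllib.request.Request(
    endpoint_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would send the request and return the scores
print(request.get_method())  # POST
```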
Getting Started
To start consuming data from the Marketplace, you must first have
a Premium Databricks account and workspace, a Unity Catalog met‐
astore, and the USE MARKETPLACE ASSETS privilege. If your admin
has disabled this privilege, you can request that they grant it to
you or grant you either the CREATE CATALOG or the USE PROVIDER
permissions on the Unity Catalog metastore. If you do not have any
of these privileges, you can still view Marketplace listings but cannot
access data products.
Once you have an account and the relevant permissions, you can
browse or search for the data product you want on Databricks
Marketplace. You can filter listings by provider name, category, cost
(free or paid), or keyword search. Since Databricks Marketplace uses
Delta Sharing to provide security and control over shared data, con‐
sumers can access public data, free sample data, and commercialized
data offerings. In addition to datasets, consumers can leverage addi‐
tional analytical assets such as Databricks notebooks to help kick-
start the data exploration process. Databricks offers a pay-as-you-go
approach with no up-front costs. You pay only for the products
you use at per-second granularity. Some data products are free,
while others may have a cost associated with them. Once you have
access to the data products, you can query them via Data Explorer,
the Databricks CLI, or SQL statements. You can also grant
other users access to the catalog that contains the shared data and
transfer ownership of the catalog or the objects inside it.
For data providers, Databricks Marketplace gives a secure platform
for sharing data products that data scientists and analysts can use
to help their organizations succeed. Providers can share public data,
free sample data, and commercialized data offerings. They can also
share Databricks notebooks and other content to demonstrate use
cases and show how to take full advantage of their data
products.
To create a data product in the Marketplace, you’ll need a Premium
Databricks workspace with access to tools and services to prepare
your data, such as Delta Lake, MLflow, or SQL Analytics. You
can also use external tools or libraries that are compatible with
Databricks. You’ll then need to package your data product using
one of the supported formats, such as Delta Sharing or MLflow
Model Registry. To list your data products on the Marketplace, you’ll
need to apply to be a provider through the Databricks Data Partner
Program and review the Marketplace provider policies. You can
then publish your data product in the Marketplace using the Pub‐
lish Data Product UI in your workspace. You will need to provide
some information about your product, such as name, description,
category, price, terms, and conditions. You will also need to agree to
the Databricks Data Provider Agreement.
Deploying and serving a model from the Marketplace follows a
similar process using Databricks Model Serving, a feature that
allows you to create REST endpoints for your models and serve them
with low latency and high scalability. You can
access Databricks Model Serving from the left sidebar menu in your
workspace. You can then select a model from Catalog Explorer or
MLflow Model Registry and click on Deploy Model. You will then
need to configure some settings for your model deployment, such as
name, version, inference cluster size, and so on. Once your model is
deployed, it is ready to serve requests in real time.
Summary
In this chapter, you learned about the advantages of Databricks
Marketplace for data collaboration. Powered by Delta Sharing, the
Marketplace brings the capability of open cross-platform and live
data sharing without data replication. Additionally, the centralized
governance model provides added security and control. Databricks
Marketplace ensures secure and compliant data sharing between
data providers and consumers. Data providers can control who can
access their data products, set terms and conditions, and monitor
usage and billing. Data consumers can trust that the data products
are verified and validated by Databricks and the providers. The
Marketplace also leverages Databricks’s built-in security and gover‐
nance features, such as encryption, authentication, authorization,
auditing, and data masking. In addition, the Marketplace supports
privacy-preserving technologies, such as differential privacy and
federated learning, to protect sensitive data.
From empowering business intelligence with actionable insights to
fueling data science and machine learning endeavors, you discov‐
ered how data integration, governance, storytelling, and data art
can be enriched through this platform. With a range of custom
datasets, notebooks, and AI models becoming available within the
Marketplace, digital transformation and innovation initiatives can
be pursued within the Databricks platform. In Chapter 4, you will
learn more about safeguarding data privacy with Databricks Clean
Rooms and other essential practices to ensure that sensitive data
remains secure and inaccessible to unauthorized entities.
CHAPTER 4
Safeguarding Data with
Clean Rooms
With the rise of data privacy regulations such as the GDPR and the
CCPA and the increasing demand for external data sources, such
as third-party data providers and data marketplaces, organizations
need a secure, controlled, and private way to collaborate on data
with their customers and partners. However, traditional data sharing
solutions often require data replication and trust-based agreements,
which expose organizations to potential risks of data misuse and
privacy breaches.
The demand for data clean rooms has been growing in various
industries and use cases due to the changing security, compliance,
and privacy landscape, the fragmentation of the data ecosystem,
and the new ways to monetize data. According to Gartner, 80% of
advertisers that spend more than $1 billion annually on media will
use data clean rooms by 2023.1 However, existing solutions often
require data movement and replication, are restricted to SQL, and are
hard to scale.
The Databricks Data Intelligence Platform provides a comprehensive
set of tools to build, serve, and deploy a scalable, flexible, and
interoperable data clean room based on your data privacy and
governance requirements. Key features include secure data sharing
with no replication, full support for running arbitrary workloads and
languages, easy scalability with a guided onboarding experience,
isolated compute, and privacy-safe fine-grained access controls.

1 Interactive Advertising Bureau (IAB), State of Data 2023: Data Clean Rooms and the
Democratization of Data in the Privacy-Centric Ecosystem, January 24, 2023, https://
oreil.ly/OBVgm.
Databricks Clean Rooms enable organizations to share and join
their existing data in a secure, governed, and privacy-safe environ‐
ment. Participants in the Databricks Clean Rooms can perform
analysis on the joined data using common languages such as Python
and SQL without the risk of exposing their data to other partici‐
pants. Participants have full control of their data and can decide
which participants can perform what analysis on their data without
exposing sensitive data, such as personally identifiable information
(PII).
This chapter provides an in-depth look at how Databricks Clean
Rooms work and how they can help organizations guard their data
privacy. The chapter also explores key partnership integrations that
enhance the capabilities of Databricks Clean Rooms and provide
additional benefits for data privacy and security. By using Data‐
bricks Clean Rooms with these partner solutions, organizations
can unlock new insights and opportunities from their data while
preserving data privacy.
Best Practices
In addition to setting up access controls within the clean room
to ensure that only authorized users can access and process data,
you should observe these best practices for using Databricks Clean
Rooms to ensure data privacy and security:
Monitor data usage.
Monitor the usage of data within the clean room to ensure that
the data is being used in compliance with data privacy and
security policies. This can be done using tools such as audit logs
and data usage reports.
Encrypt data.
Encrypt data at rest and in transit to ensure that it is protected
from unauthorized access. This can be done using encryption
tools provided by the Databricks Data Intelligence Platform or
third-party encryption tools.
Implement data retention policies.
Implement data retention policies to ensure that data is not
retained for longer than necessary. This can help to minimize
the risk of data breaches and ensure compliance with data pri‐
vacy regulations.
Regularly review and update security measures.
Regularly review and update security measures to ensure that
they are effective in protecting data privacy and security. This
can include updating access controls, monitoring tools, encryp‐
tion tools, and data retention policies.
Define clear data sharing policies.
Define clear data sharing policies that outline the terms and
conditions under which data can be shared within the clean
room. This can help to ensure that all participants understand
their rights and responsibilities when it comes to data sharing.
Provide training and support.
Provide training and support to clean room participants to help
them understand how to use the clean room effectively. This
can include training on how to share data, how to set up access
controls, and how to run computations on the data within the
clean room.
Leverage partner integrations.
Take advantage of partner integrations such as Habu, Datavant,
LiveRamp, and TransUnion to enhance the capabilities of your
Databricks Clean Room. These partners provide tools and tech‐
nologies that can help you to improve data privacy and security
within the clean room.
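To make the "monitor data usage" practice concrete, here is a minimal sketch that assumes a simplified audit-log record shape (real Databricks audit logs are far richer) and flags clean room access falling outside an approved user and table list:

```python
def flag_unauthorized_access(audit_events, approved_users, approved_tables):
    """Return audit events where a user outside the approved set, or any
    user touching a table outside the approved set, accessed clean room
    data. audit_events is assumed to be a list of dicts with 'user',
    'table', and 'action' keys, a stand-in for real audit-log records.
    """
    return [
        event for event in audit_events
        if event["user"] not in approved_users
        or event["table"] not in approved_tables
    ]

events = [
    {"user": "analyst@partner.com", "table": "joined_sales", "action": "SELECT"},
    {"user": "intruder@example.com", "table": "joined_sales", "action": "SELECT"},
]
flags = flag_unauthorized_access(events, {"analyst@partner.com"}, {"joined_sales"})
print(flags)  # only the intruder@example.com event is flagged
```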
Future Trends
Several trends are likely to shape the evolution of clean room tech‐
nology and the broader landscape of data privacy and collaboration:
Ubiquitous adoption of clean rooms
The adoption of clean room technology is expected to grow
significantly. With cloud hyperscalers introducing clean rooms
(such as AWS Clean Rooms from Amazon) alongside data platform
companies like Snowflake and Databricks, the technical capabil‐
ity to perform a secure, double-blinded join is now accessible
to engineers and product builders familiar with these stacks.
As a result, you can expect to see clean-room-powered data
flows emerging in most data-driven applications across myriad
industries, from advertising to healthcare.
Enhanced walled garden solutions
A wave of new product developments and enhancements
can be expected in the walled garden clean rooms, such as
Google’s Ads Data Hub, Amazon’s Marketing Cloud, and Meta’s
Advanced Analytics. By leveraging advanced federated learning
techniques, multiple data owners could share their data and AI
assets and models without exchanging their data. Instead, they
would share only the model updates or parameters, which are
aggregated and applied to the global model. This way, the data
remains local and private, while the model benefits from the
collective data.
Increased interoperability
As more businesses adopt clean rooms, there will be a growing
need for interoperability among different clean rooms across
different clouds, regions, and platforms, driving further innova‐
tion in clean room technology to ensure seamless collaboration
across different environments.
Greater focus on privacy
With increasing regulations around data privacy and the
upcoming demise of third-party cookies, the scale and breadth
of data sharing are becoming increasingly limited, leading to an
increased focus on privacy-preserving technologies within clean
rooms. Clean rooms could integrate advanced privacy tech‐
niques such as differential privacy, which adds controlled noise
to data queries or outputs to protect the individual privacy of
data records. It ensures that the statistical results of data analysis
do not reveal any information about specific individuals in the
data.
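As a concrete illustration of the technique, the sketch below implements the classic Laplace mechanism for a counting query using only the standard library; the epsilon value and counts are made-up examples:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) via the inverse-CDF method (stdlib only)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one individual changes the result
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(7)
print(noisy_count(1_000, epsilon=0.5))  # the exact count stays hidden in noise
```

Smaller epsilon means stronger privacy but noisier answers, which is the tradeoff clean room operators tune per query.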
Enriched encryption
Clean rooms can benefit from advanced encryption techniques
such as homomorphic encryption, which allows data
owners to perform computations on encrypted data without
decrypting it. This enables data analysis and machine learn‐
ing on encrypted data without compromising data security or
privacy.
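To illustrate the additive property such schemes provide (summing ciphertexts so that decryption yields the sum of the plaintexts), here is a deliberately insecure toy based on modular masking; real homomorphic encryption schemes such as Paillier are far more involved:

```python
import random

MODULUS = 2**61 - 1  # a large prime; the toy works in integers mod MODULUS

def encrypt(value, key):
    """Toy additive masking. NOT secure encryption; illustration only."""
    return (value + key) % MODULUS

def decrypt(ciphertext, key):
    return (ciphertext - key) % MODULUS

# Two parties mask their private values under independent random keys.
k1 = random.randrange(MODULUS)
k2 = random.randrange(MODULUS)
combined = (encrypt(40, k1) + encrypt(2, k2)) % MODULUS

# Summing ciphertexts sums the plaintexts: decrypting with the combined
# key reveals only the total, never either party's individual value.
assert decrypt(combined, (k1 + k2) % MODULUS) == 42
```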
Rise of orchestration
As data clean rooms become more complex and involve more
participants, there will be a growing need for orchestration tools
to manage these environments effectively.
The future of clean room technology looks promising, with several
exciting trends on the horizon. As businesses continue to navigate
the challenges of data privacy and collaboration, clean rooms will
play an increasingly important role in enabling secure and effective
data analysis.
Summary
This chapter provided an understanding of Databricks Clean
Rooms, an innovative technology designed to protect data while
enabling effective collaboration and analysis. The concept of clean
rooms was explained, illustrating how they offer a secure, governed,
and privacy-safe environment for data analysis. You learned about
key partnerships that enhance the capabilities of Databricks Clean
Rooms, along with various industry use cases that demonstrate the
versatility and applicability of clean rooms across sectors. You also
learned about steps for getting started with and implementing
Databricks Clean Rooms, along with best practices for using clean
rooms effectively.
CHAPTER 5
Crafting a Data Collaboration
Strategy
manage and enable collaboration, including open and secure data
sharing, private data exchanges, and privacy-safe clean room envi‐
ronments.
Crafting an effective data collaboration strategy offers numerous
advantages for businesses, which can gain insights, accelerate inno‐
vation, and better position themselves to compete in the digital
economy. It accelerates innovation by providing diverse perspectives
and insights, leading to more inventive and effective solutions. It
enhances data quality by allowing multiple parties to enrich the data
and unlock value by creating new data products and applications
that were not feasible in isolation.
In an increasingly interconnected world, the focus is shifting toward
eliminating organizational and geographical barriers. This approach
fosters a seamless collaboration environment, irrespective of the
location or structure of the teams involved. Tools such as Delta
Sharing, private exchanges, and clean rooms are instrumental in this
process, enabling efficient collaboration across different trust levels.
This is not just about a specific platform or tool but about a broader
perspective on enhancing global collaboration.
Organizations can derive new value and accelerate innovation from
their existing data assets through data collaboration. By sharing and
integrating data, organizations can discover new insights, develop
new services, and generate new revenue streams.
In this chapter, you will learn about different scenarios, technolo‐
gies, architectures, and best practices for data collaboration that are
possible with Databricks Delta Sharing. You will explore how Delta
Sharing’s open source approach can help you overcome the chal‐
lenges and unlock the benefits of data collaboration across different
clouds, platforms, and regions. You will also learn how to design
and implement a data collaboration framework that aligns with your
goals and scope, partners and roles, agreements and rules, platform
and tools, and security and governance. Finally, you will discover how
to measure the success of your data collaboration strategy and how
to manage change and improvement within your organization.
By the end of this chapter, you will have a comprehensive under‐
standing of data collaboration with Databricks Delta Sharing and
how it can transform how you create, access, use, and share data.
You will also be able to apply the concepts and techniques learned in
this chapter to your data collaboration projects and use cases.
Data Monetization
Data monetization is another key scenario in which Delta Sharing
can be leveraged. Companies can monetize their data by sharing it
with external parties, as illustrated in Figure 5-3. For instance, a tele‐
com company can share user behavior data (while ensuring privacy
and compliance) with marketing agencies for targeted advertising.
Similarly, a financial institution can share market trends with invest‐
ment firms for a fee. Delta Sharing ensures the data sharing process
is secure, efficient, and compliant with regulations.
1 Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia, “Lakehouse: A New
Generation of Open Platforms That Unify Data Warehousing and Advanced Analytics,”
in 11th Annual Conference on Innovative Data Systems Research (CIDR ’21), January
11–15, 2021, https://oreil.ly/YFniN.
Data Mesh
A data mesh is a distributed data architecture that organizes data
around business domains rather than around technical functions.
Each domain takes ownership and manages its own data, presenting
it as a data product via standardized APIs and protocols. These
data products are self-describing, discoverable, and interoperable,
facilitating easy access and usage for data consumers. A data mesh
also ensures data governance and quality at the source, making the
data reliable and compliant.
The data mesh model enhances data sharing and collaboration by
enabling domain teams to function as both data providers and con‐
sumers, eliminating the need for a centralized data platform and
team. This model gives data providers the autonomy and motivation
to offer high-quality, relevant, and timely data to their consumers,
who can access and use the data in a self-service manner.
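As a sketch of what "self-describing and discoverable" can mean in practice, the following hypothetical descriptor (data mesh prescribes no single format) captures a data product's domain, owner, schema, and freshness SLA:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Minimal self-describing data product descriptor (hypothetical
    format; data mesh prescribes no single schema)."""
    domain: str
    name: str
    owner: str
    columns: dict               # column name -> type
    sla_freshness_hours: int
    tags: list = field(default_factory=list)

    def qualified_name(self):
        """A discoverable, globally unique identifier for the product."""
        return f"{self.domain}.{self.name}"

orders = DataProduct(
    domain="sales",
    name="daily_orders",
    owner="sales-data-team@example.com",
    columns={"order_id": "string", "amount": "decimal", "order_date": "date"},
    sla_freshness_hours=24,
)
print(orders.qualified_name())  # sales.daily_orders
```

Publishing such descriptors to a shared catalog is what lets consumers find and evaluate a domain's data products without asking the owning team.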
Data mesh integrates with modern data sharing technologies, such
as Databricks Marketplace, Delta Sharing, and Databricks Clean
Rooms.
Ease of Implementation
Delta Sharing is designed for ease of implementation, integrating
seamlessly with existing data infrastructure and requiring minimal
changes to current workflows. It provides a simple REST API that
facilitates seamless data collaboration by allowing data providers
to share their data and recipients to read the shared data. The plat‐
form supports AI and machine learning workflows, enabling direct
access to shared data for advanced data analysis. Furthermore, Delta
Sharing ensures data freshness by allowing real-time data sharing,
making it particularly useful for use cases that require timely analyt‐
ics. This comprehensive and scalable solution reduces the time and
effort required to adopt data collaboration, making it an attractive
tool for both data providers and recipients.
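From the recipient side, the open source delta-sharing Python connector identifies a shared table with a URL of the form `<profile-file>#<share>.<schema>.<table>`. The sketch below builds that identifier from made-up share and table names; the commented lines show how a recipient holding a real profile file would load the data:

```python
def table_url(profile_path, share, schema, table):
    """Build the table identifier used by the delta-sharing connector:
    '<profile-file>#<share>.<schema>.<table>'."""
    return f"{profile_path}#{share}.{schema}.{table}"

url = table_url("config.share", "retail_share", "sales", "transactions")
print(url)  # config.share#retail_share.sales.transactions

# With the connector installed (pip install delta-sharing) and a profile
# file issued by the provider, a recipient could read the table directly:
#
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)  # returns a pandas DataFrame
```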
Provider perspective
1. Identify data for sharing.
Identify the datasets that will be shared during the pilot. These
should be representative of the data that will be shared in the
full-scale implementation.
2. Set up Delta Sharing.
Install and configure Delta Sharing on your data platform. This
step involves setting up the Delta Sharing server and configur‐
ing the necessary permissions and security settings.
3. Prepare data for sharing.
Preparing the identified datasets for sharing involves converting
the data into Delta Lake format (if it is not already in this
format), partitioning the data for efficient sharing, and setting
up the necessary sharing profiles.
4. Test data sharing.
Share the prepared datasets with the pilot users (recipients) and
ensure they can access and use the data as expected.
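Step 3's "sharing profiles" refers to the small credentials file the open Delta Sharing protocol defines for recipients. Here is a minimal sketch of writing one, with placeholder endpoint and token values standing in for the real credentials a provider would issue:

```python
import json
import os
import tempfile

def write_sharing_profile(endpoint, bearer_token, path):
    """Write a Delta Sharing profile file (shareCredentialsVersion 1),
    the open protocol's credentials format. The endpoint and token here
    are placeholders; a real provider issues these to each recipient."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    with open(path, "w") as f:
        json.dump(profile, f, indent=2)
    return profile

profile_path = os.path.join(tempfile.gettempdir(), "recipient.share")
profile = write_sharing_profile(
    endpoint="https://<sharing-server>/delta-sharing/",
    bearer_token="<recipient-token>",
    path=profile_path,
)
```

The provider distributes this file to each pilot recipient, who then points the connector at it to access the shared datasets.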
Full-Scale Rollout
After the technical pilot has successfully been completed, the data
sharing strategy can be scaled up to the full organization level. This
involves expanding the scope and scale of the data sharing, both in
terms of the number and variety of datasets shared and the number
and diversity of users accessing the shared data. The full-scale roll‐
out also involves ensuring the sustainability and governance of the
data sharing, as well as measuring and communicating the value and
impact of the data collaboration. Here are some key steps for the
full-scale rollout:
Summary
Crafting an effective data collaboration strategy is a multifaceted
process. It requires a deep understanding of the technological land‐
scape, the ability to navigate challenges, and the application of data
collaboration best practices. The Databricks Data Intelligence Plat‐
form can help organizations create secure, efficient, and scalable
data collaborations.
Databricks provides an open and secure platform for data collabo‐
ration. Delta Sharing provides an open source approach to data
sharing across clouds, platforms, and regions. Delta Sharing pow‐
ers Databricks Marketplace, which opens up new opportunities for
innovation and monetization, as well as Databricks Clean Rooms,
a privacy-safe collaboration environment for customers and part‐
ners. On the Data Intelligence Platform, all of this is secured and
governed by Unity Catalog, which provides organizations with a
unified governance model for all of their data and AI assets.
CHAPTER 6
Empowering Data Sharing
Excellence
Throughout this book, you have learned that Delta Sharing offers
a secure and controlled environment for sharing data across depart‐
ments and with external partners, eliminating the need for data
replication or movement. It enhances collaboration and provides
deeper insights into data. Key features include Clean Rooms, which
offers secure environments for sharing sensitive data in compliance
with privacy requirements. The Marketplace serves as a hub for
data products, facilitating easy access and eliminating complex pro‐
curement processes. Data catalogs act as a centralized library for
shared data assets, improving data governance and security. Data
quality checks ensure the validity and accuracy of shared data. Data
lineage provides transparency in data sharing practices by tracking
the origin and transformation of data. Data notifications keep users
informed about changes in shared data. Data APIs allow for easy
integration of shared data with other applications. By leveraging
these features within the Lakehouse data platform, organizations can
turn data into actionable insights in a secure, scalable, and real-time
data sharing environment.
Excellence in data sharing can be defined as the ability to share
data in a way that maximizes its value while minimizing risks. This
includes ensuring the quality and accuracy of shared data, protect‐
ing sensitive information, complying with relevant regulations, and
enabling effective collaboration among data users. Excellence is not
just about having advanced technologies; it’s about using these tech‐
nologies to drive meaningful outcomes.
In today’s data-driven world, achieving excellence in data sharing
can provide organizations with a significant competitive advantage.
It can enable them to uncover valuable insights, make informed
decisions, innovate faster, and deliver superior customer experien‐
ces. In this final chapter, you’ll learn about the key components for
data sharing excellence and how to effectively overcome challenges
to achieve this excellence.
1 Amanda Latimore and Sara Whaley, “A Quick Guide to Successful Data Partnerships,”
Johns Hopkins Bloomberg American Health Initiative, February 24, 2021, https://
oreil.ly/wUrJm.
2 Brian Eastwood, “The Case for Building a Data-Sharing Culture in Your Company,”
MIT Sloan School of Management, September 9, 2021, https://oreil.ly/YHW1p.