Dummies Low Code Data Engineering On Databricks
These materials are © 2023 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Low-Code Data Engineering on Databricks For Dummies®,
Prophecy Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2023 by John Wiley & Sons, Inc.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com,
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. Prophecy and the Prophecy logo are registered trademarks
of Prophecy. All other trademarks are the property of their respective owners. John Wiley & Sons,
Inc., is not associated with any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For
Dummies book for your business or organization, please contact our Business Development
Department in the U.S. at 877-409-4177, contact [email protected], or visit www.wiley.com/
go/custompub. For information about licensing the For Dummies brand for products or services,
contact BrandedRights&[email protected].
ISBN: 978-1-394-20592-9 (pbk); ISBN: 978-1-394-20593-6 (ebk)
Publisher’s Acknowledgments
Some of the people who helped bring this book to market include the following:
Project Manager: Carrie Burchfield-Leighton
Acquisitions Editor: Traci Martin
Sr. Managing Editor: Rev Mengle
Client Account Manager: Cynthia Tweed
Table of Contents
INTRODUCTION................................................................................................ 1
About This Book.................................................................................... 2
Icons Used in This Book........................................................................ 2
Beyond the Book................................................................................... 3
CHAPTER 6: Ten Resources for Getting Started............................... 39
Explore Prophecy for Databricks....................................................... 39
Design a Pipeline................................................................................. 40
Test a Pipeline...................................................................................... 40
Track Data Lineage.............................................................................. 40
Prophecy Documentation.................................................................. 40
Data Mesh Whitepaper....................................................................... 41
Lakehouse Architecture Guide.......................................................... 41
Blog Post on Data Engineering.......................................................... 41
Request a Demo.................................................................................. 42
Start a 14-Day Trial.............................................................................. 42
Introduction
Data analytics has undergone revolutionary change. First,
easy-to-use business intelligence tools made analytics on
relational data available to data users across organizations.
Now, machine learning (ML) and artificial intelligence (AI) deliver
value, most often by using unstructured and semi-structured data
to drive industry-changing innovations from recommendation
engines to automating processes.
The fuel that powers analytics and AI is data, and data is captured
in various ways and formats and is delivered by data engineers
for downstream use cases that enable smarter decision making
and data-driven innovations. Many data sources exist, and these
sources involve different types of data, such as customer data or
sensor data generated by Internet of Things (IoT) devices, and
more and more demands are placed on that data to create new
data products and drive business results.
About This Book
Low-Code Data Engineering on Databricks For Dummies, Prophecy
Special Edition, describes the advantages of the open source lake-
house environment, pioneered by Databricks. This environment is
new, introduced in 2020, but its adoption in the enterprise world
is happening at blazing speed. Several sources confirm that more
than half of enterprise IT shops are already using the lakehouse,
with more to follow soon.
The Case Study icon points out stories about companies using
low-code data engineering on the lakehouse to save time, cut
costs, improve productivity, and integrate better with existing
systems.
Beyond the Book
This book can introduce you to new and improved approaches to
data architecture and show you how to make data engineering a
tool your entire data team — from data engineers to data analysts
and data scientists — can take advantage of and contribute to. If
you want resources beyond what this book offers, check out the
following:
IN THIS CHAPTER
»» Observing progress in data engineering
Chapter 1
Delivering Data the
Easy Way
Data has become a hugely powerful resource for organizations, on par with human capital and financial resources. It
may be a well-worn phrase, but “data is the new oil” still
holds a promise that companies can put to good use.
This chapter describes how the world has changed and how new
solutions have emerged to help organizations adapt and get
the most out of these changes. The discipline that has seen the
most change is data engineering; the key technology change has
been the emergence of the data lakehouse, a new kind of data
architecture that combines the flexibility and cost-efficiency of
data lakes with the structure and data management features of
data warehouses.
While data lakehouses are powerful solutions for enabling BI, AI,
and ML use cases, they can also increase the demands on the data
engineering team, causing delays for downstream data users.
Low-code data engineering solutions empower these data users
to more easily access the data to drive analytics that help the
organization accomplish its goals without over-reliance on data
engineering.
With a data lakehouse, organizations bring together data from on-
premises and various cloud storage solutions into a single unify-
ing infrastructure. Structured, semi-structured, and unstructured
data coexist, as in a data lake — but data analytics and data man-
agement capabilities once only found in the data warehouse, with
its carefully curated structured data, are now available in the
lakehouse as well. Organizations get the best of both worlds.
This unmanaged approach to data engineering hasn’t scaled well
to a world with more and more data types powering an ever-
increasing range of use cases on an ever-expanding number of
platforms. This new world includes more use cases that engage
multiple platforms in a rapidly growing number of data pipelines.
Using Prophecy on the Lakehouse
Prophecy offers a low-code data engineering solution that enables
data users to easily create and orchestrate data pipelines on the
lakehouse — the modern, emerging, cloud-native infrastructure
for data management. Prophecy is fully up-to-date in its capa-
bilities for both data users and data engineers. For data users,
Prophecy offers access to all the major types of data repositories.
Users can mix and match from a full menu of data types and build
data pipelines for use in BI, ML, and AI.
Pulling It All Together
The addition of the Prophecy platform helps make the data lake-
house the gold standard for effective data management and
enables the democratization of data across an organization. Com-
panies can move faster and not break things; instead, they can
increase and enhance their use of powerful and flexible open
standards.
IN THIS CHAPTER
»» Identifying current trends
Chapter 2
Updating the Data
Engineering Landscape
Data engineers build systems that bring data in from the
outside world and make it usable within an organization.
They have a wide range of responsibilities, including data
security, data management, data governance, and making sure
that data is usable for many purposes. The two main types of
stakeholders for the work of data engineers are business intelli-
gence (BI) users, who largely use data to help run and grow the
business, and data scientists who develop predictive analytics,
machine learning (ML) models, and artificial intelligence (AI)
applications. Both groups contribute to internal and customer-
facing applications that make a crucial difference in business
success.
Many data engineers deal with both the old and new technology
sets in their daily work, along with the ongoing need to migrate
selected workloads from one to the other. Eventually, the migra-
tion will near completion, but for most organizations, that’s still
years in the future.
about twice as much data as users of previous technologies. As 5G
becomes the norm, applications will grow more powerful, moving
and returning more and more data.
FIGURE 2-1: Growth in data creation from 2010 to 2025 (estimated after 2020).
these strategies are entirely dependent on having skilled engi-
neers to capture the right data and put it to use.
»» The increasing movement of data storage and data process-
ing to the cloud
Enumerating Key Technologies
What are the key technologies that have emerged to leverage all
these trends? Organizations have to ingest, process, and ana-
lyze data. These steps are organized into data pipelines, built
and managed by data engineers, who are also responsible for cleansing, organizing, and securing data, among other important processes.
»» In the 1980s and 1990s, ad hoc scripting was used for ETL. This
get-it-done approach solved problems quickly, but the resulting
scripts were error prone and hard to manage. They were also
often dependent on the individual who wrote them, which
caused problems as people moved on from a company.
»» In the 1990s, new, proprietary approaches brought new
sophistication to ETL, including visual tools for building data
pipelines. ETL became central to a market in data manage-
ment worth billions of dollars a year. Alteryx, Informatica,
IBM, and Oracle have been among the key providers.
These providers initially worked with relational data used by
business analysts. But they’ve increasingly moved to support
unstructured and semi-structured data used by data
scientists for predictive analytics, ML, and AI, too.
In the last 20 years, the cloud has brought two new kinds of solu-
tions, generally used together:
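The fragility of the early ad hoc approach is easy to see in even a tiny sketch (the file layout, table schema, and function name here are hypothetical). Paths and column positions are hard-coded, so a change to the source file silently breaks the load, and only the original author knows the assumptions baked in:

```python
import csv
import sqlite3

def load_orders(csv_path, db_path):
    """Ad hoc ETL: read a CSV export and load it into a reporting table.

    Everything here is hard-coded -- a later change to the export's
    column order or header breaks the load with no warning.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # assumes exactly one header row
        for row in reader:
            # assumes column 0 is the id and column 3 is the amount
            conn.execute("INSERT INTO orders VALUES (?, ?)",
                         (int(row[0]), float(row[3])))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Visual ETL tools replaced exactly this kind of one-off script with managed, documented pipeline definitions.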
Identifying Gaps in Previous Techniques
Initially, organizations tended to keep their data lakes and their
data warehouses separate. The real-time analytics, ML, and AI
teams had their NoSQL databases and Python code; the BI peo-
ple had relational databases and visualization tools like Looker,
Microsoft Power BI, and Tableau.
In the two-tier approach, everything goes into the data lake first.
Real-time analytics, ML, and AI have access to all the organiza-
tion’s data, even if it’s in somewhat rough form.
Then, ETL and other processes are used to prepare some data,
much of it originating in transactional systems and other systems
of record, for the data warehouse. This structured data is used for
reporting, BI, and applications, many of them internal.
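The two-tier flow above can be sketched as two disconnected steps (record shapes and field names here are hypothetical): everything lands in the lake untouched, and a second, entirely separate process reshapes a subset for the warehouse, duplicating both storage and logic:

```python
def land_in_lake(lake, records):
    """Tier 1: append raw records to the data lake as-is."""
    lake.extend(records)
    return lake

def etl_to_warehouse(lake, warehouse):
    """Tier 2: a separate process selects and cleans transactional
    records for the warehouse -- a second copy of the data, maintained
    by a second set of tools."""
    for rec in lake:
        if rec.get("source") == "transactions" and rec.get("amount") is not None:
            warehouse.append({"order_id": rec["id"],
                              "amount": round(rec["amount"], 2)})
    return warehouse
```

Keeping these two tiers consistent is exactly the complexity and cost the lakehouse architecture sets out to eliminate.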
What could possibly go wrong? Well, this two-headed beast, like most two-headed beasts, has some challenges. They include
»» Complexity: Figure 2-2 doesn’t show that both the data lake
and the data warehouse are each often split between
on-premises and cloud components, with multiple ETL and
analytics processes required to try to bridge the gap.
»» Cost: Doing the same thing two different ways often costs
more. The use of multiple platforms with different tool sets
adds to costs.
»» Legacy burdens: Much of the reason for this divided
approach is to allow continued use of older, expensive
standards such as expensive proprietary databases and
ETL tools such as Ab Initio, often running on-premises.
This raises costs and traps key workloads in the less flexible
on-premises world.
»» Lack of data democratization: Complex architectures
are overwhelming to those who want to participate in the
democratization of data. Users end up waiting on over-
worked data engineers to make data spread across multiple
systems accessible to them, and the resulting solutions are
often error-prone and unstable.
The lakehouse architecture was popularized in 2020 with the
launch of Delta Lake by Databricks. Originally created to make
data lakes more reliable, Delta Lake has quickly added data
warehouse-type capabilities to a single repository that can hold
all of a company’s data.
IN THIS CHAPTER
»» Finding challenges in coded solutions
Chapter 3
Using Low-Code Data
Engineering
Data engineers, like software engineers, do their work by
writing code. Software code is powerful and runs effi-
ciently, so it’s the right tool for many jobs. However, orga-
nizations are dealing with data flows that are doubling every
couple of years. They’re also dealing with more complex demands,
such as moving to the cloud and supporting business data users,
all while battling resource constraints and an expanding ecosys-
tem of data sources and tools.
THE DATABRICKS CHALLENGE
For many years, data engineers focused first on structured data and
business intelligence (BI) needs. Advanced analytics, machine learning
(ML), and artificial intelligence (AI), all of which required unstructured
and semi-structured data to solve problems, were left to data scien-
tists to figure out.
The success of Databricks in the cloud has opened up many doors for
organizations, including the ability to meet the challenge posed by the
growth of ML and AI — in particular large language models such as
ChatGPT and their many potential applications.
The problem comes when the need for coded solutions scales
faster than anyone can manage. And this gap between organiza-
tional needs and available resources has grown rapidly in recent
years. Check out Chapter 2 for the key drivers of these challenges.
In addition to the impact of these changes on the entire organiza-
tion, software engineering itself is changing in response to these
and other challenges, adding even more complexity in using code
as the only answer to data engineering needs.
Data engineers today use many tools to build data pipelines and
perform all the supporting tasks needed to make data fully useful
to, and usable by, the organization. Chief among these are the
programming languages Java, Python, Scala, and SQL.
Figure 3-1 shows how coded solutions fit into a data lakehouse
architecture. While the core repository has been unified, with
the lakehouse supporting all kinds of data and the full range of
business needs, the coding area is still a weak point. SQL queries,
which can range up to thousands of lines long and be hard to
maintain, coexist with extract, transform, load (ETL) notebooks
and orchestration scripts.
FIGURE 3-1: Scripted solutions can fall short of data engineering requirements.
The major challenges in using coded solutions for all data engi-
neering challenges in today’s fast-changing data processing and
analytics environment can be summed up as follows:
LAKEHOUSE AS PART OF
THE SOLUTION
The move to lakehouse is part of the solution to the problems that
data engineers face. Currently, data repositories are split across cloud
and on-premises databases. By unifying a large and growing share of
the organization’s data in a single cloud repository that meets both
data science and BI needs, the complexity of the data engineering chal-
lenge is reduced. Pipelines become simpler and code reuse increases.
Low-code tools for ETL put the power of data engineering in the
hands of data users across the organization. Low-code tools also
make life easier for data engineers, in three important ways:
»» The data engineers are spared a lot of small requests that
take time to understand, implement, test, and hand over.
»» The data engineers can focus their time on extending
solutions created with low-code tools where needed to
achieve the best results.
»» The low-code tools also make many routine tasks easier for the
data engineers themselves, speeding their work and reducing
the potential for typing errors and other trivial, but impactful
problems.
Existing low-code ETL tools, however, were created for use with
previous generations of technology. They have some long-standing
faults that limit their usefulness in the data lakehouse environment:
One existing tool worth mentioning in this context is Alteryx, a
flexible low-code data engineering tool that works on a laptop or
desktop computer. Unfortunately, it must pull the entire dataset
onto the machine it’s running on to work with it. If the dataset
doesn’t fit on that computer, as is often the case in today’s world,
or if performance is unacceptably slow, the user is out of luck.
As with BI, data users are plugged into the business needs they’re
trying to meet. They can become skilled in the use of low-code
data engineering tools. They gain the skill needed to get the most
out of them and even, in some cases, to extend them with code.
With the right tools, data users can also quickly come up to speed
on using ML and AI to meet business use cases, just as they’ve
already done with BI.
IN THIS CHAPTER
»» Getting the most from key capabilities
Chapter 4
Using Prophecy
for Modern Data
Engineering
Prophecy stands at the intersection of two trends that, when
combined, can help solve some of the biggest problems in
data management. The first is the move by organizations to
the lakehouse. The second is the introduction of low-code tools to
make data engineering easier across a variety of platforms.
Figure 4-1 shows how Prophecy fits into the data lakehouse archi-
tecture of the modern lakehouse. Data engineering gets a polished
and highly usable toolkit for use with the lakehouse.
FIGURE 4-1: Prophecy offers a data engineering solution for the lakehouse.
Prophecy offers the capabilities that data users need:
FIGURE 4-2: Prophecy users drag and drop Gems to build data pipelines.
»» Create standardized templates in Prophecy’s Framework
Builder.
»» Support Spark Batch and Spark Streaming pipelines.
After a pipeline is built, the user can run the pipeline and review
the state of the data after each step. Abstractions such as sub-
graphs and configurations make it easy to reuse sections of the
pipeline. It’s also easy to develop and run tests for all components
of the pipeline.
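Conceptually, a test for a single pipeline component exercises one transformation in isolation with a small, known input (the step, data shapes, and function names below are hypothetical; in Prophecy such tests are defined visually and run against the generated code):

```python
def dedupe_customers(rows):
    """Pipeline step under test: keep the first record seen per customer id."""
    seen, result = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            result.append(row)
    return result

def test_dedupe_customers():
    rows = [{"id": 1, "name": "Ann"},
            {"id": 1, "name": "Ann (duplicate)"},
            {"id": 2, "name": "Bo"}]
    out = dedupe_customers(rows)
    assert [r["id"] for r in out] == [1, 2]
    assert out[0]["name"] == "Ann"  # first occurrence wins
```

Testing each component this way localizes failures to a single step instead of a whole pipeline run.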
In creating a reporting pipeline, the user can create three Subgraph Gems, each of which incorporates several pipeline steps.
For instance, the user may create IngestBronze, MergeToSilver,
and AggregateToGold Subgraph Gems to handle each step in the
process.
FIGURE 4-3: In Prophecy, the user can create a Subgraph Gem to handle
creating a bronze-level table of raw data.
Additional Gems handle the steps needed to cleanse the source data
(MergeToSilver) and create a business report (AggregateToGold).
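The bronze-silver-gold flow those Subgraph Gems implement can be sketched in plain Python (Prophecy would generate Spark code for each stage; the record shapes and cleansing rules here are hypothetical):

```python
def ingest_bronze(raw_rows):
    """Bronze: capture source rows as-is, tagged with their origin."""
    return [{"source": "sales_feed", **row} for row in raw_rows]

def merge_to_silver(bronze):
    """Silver: cleanse -- drop rows missing an amount, normalize region names."""
    return [{**row, "region": row["region"].strip().upper()}
            for row in bronze if row.get("amount") is not None]

def aggregate_to_gold(silver):
    """Gold: aggregate revenue per region for business reporting."""
    totals = {}
    for row in silver:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals
```

Each stage consumes only the previous stage's output, which is what makes the sections reusable and testable in isolation.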
IN THIS CHAPTER
»» Making data engineering accessible in
healthcare
»» Improving productivity
Chapter 5
Diving into Industry
Use Cases for Prophecy
P
rophecy customers are achieving outstanding results across
a range of customer use cases by modernizing and acceler-
ating extract, transform, load (ETL) processes on the lake-
house. This chapter gives you several examples of successful use
cases.
Data, and efficient processing of and access to that data, are at the core of
what HealthVerity offers. However, the company’s data engineers
were tied up creating and maintaining pipelines. Non-technical
users were often blocked on projects while waiting for data engi-
neering support.
The company’s transaction volumes, which were already high,
were growing rapidly. The organization’s on-premises ETL solu-
tions were at risk of becoming overwhelmed. In addition, the use
of proprietary ETL solutions locked the company out of the rapid
innovation fostered by open source. This included an inability to
integrate and interoperate with other internal systems.
Empowering Business Users in
Asset Management
Waterfall Asset Management, an investment management firm
that operates globally, has more than $11 billion of assets under
management. It makes critical investment decisions for its clients
around the clock. Delivering outstanding portfolio performance
and interacting productively with clients are vital to the continued
success of the business.
Data users with critical requests for data access and transforma-
tion were left to wait for assistance from overburdened data engi-
neers or do without, and were obligated to spend time performing
manual data work. Client service and even portfolio performance
suffered.
The repeatable frameworks provided by Prophecy make it easy for
data pipelines to be standardized and shared across teams, sav-
ing time and reducing errors. Data engineering can now focus on
high-value tasks such as data governance, which prevents poten-
tial problems from reaching end-users.
Finding MVPs in Major League Baseball
The Texas Rangers are an American League baseball team based
in the Dallas-Fort Worth metropolitan area. The team has more
than 250 employees. In 2001, the Moneyball era in major league
baseball began, and today there’s a commitment to player ana-
lytics across the sport. Teams need data for player acquisition
and development as well as for between-game and even in-game
decision making.
IN THIS CHAPTER
»» Designing and testing a pipeline
»» Watching a demo
Chapter 6
Ten Resources for
Getting Started
As adoption of the data lakehouse proceeds, organizations
need to empower data users with the ability to perform
data engineering tasks. Prophecy is a low-code solution
that meets that challenge head on. To get a feel for what Prophecy
can do for your organization and to get started, check out the ten
resources in this chapter.
Design a Pipeline
If you need help designing a pipeline, look no further. Prophecy’s
video gives you an in-depth guided tour showing you how to build
a pipeline from a data source, including how to use and configure
Prophecy Gems.
Test a Pipeline
Do you need to know how to test a pipeline you’ve built in Proph-
ecy? You’ve come to the right place. Check out the following video
to get started: www.prophecy.io/prophecy-university?video=hs4r7qlsxo.
Prophecy Documentation
If you like to read and want to know more in-depth facts about
Prophecy, you can check out Prophecy’s documentation. It pro-
vides a thorough description of the product, including differences
between the Standard and Enterprise versions; concepts; and
low-code approaches to Apache Spark and SQL. Access the docu-
mentation here: docs.prophecy.io.
Data Mesh Whitepaper
The data mesh is a distributed approach to storing and delivering
data. The underlying platform may be owned by the organiza-
tion’s platform team or by domain-specific teams. In either case,
the domain team has responsibility for its own data pipelines.
Request a Demo
If you think Prophecy running on a Databricks data lakehouse
may be the right solution for you, the next step is to request a
demo. Contact Prophecy at www.prophecy.io/contact-us and
request your demo today!
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.