Databricks Migrating From EDW to Data Lakehouse For Dummies
by Stephanie Diamond
These materials are © 2022 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Migrating from a Data Warehouse to a Data Lakehouse
For Dummies®, Databricks Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2022 by John Wiley & Sons, Inc.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as
permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written
permission of the Publisher. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011,
fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com, Making
Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons,
Inc. and/or its affiliates in the United States and other countries, and may not be used without written
permission. Databricks and the Databricks logo are registered trademarks of Databricks. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with
any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For Dummies book
for your business or organization, please contact our Business Development Department in the U.S. at
877-409-4177, contact [email protected], or visit www.wiley.com/go/custompub. For information about
licensing the For Dummies brand for products or services, contact BrandedRights&[email protected].
ISBN: 978-1-119-89472-8 (pbk); ISBN: 978-1-119-89473-5 (ebk). Some blank pages in the print version
may not be included in the ePDF version.
Publisher’s Acknowledgments
Some of the people who helped bring this book to market include the following:
Project Manager: Carrie Burchfield-Leighton
Sr. Managing Editor: Rev Mengle
Acquisitions Editor: Ashley Coffey
Senior Client Account Manager: Matt Cox
Content Refinement Specialist: Tamilmani Varadharaj
Table of Contents
INTRODUCTION
About This Book
Icons Used in This Book
Beyond the Book
CHAPTER 5: Reviewing Why to Migrate to the Lakehouse
Using an Agile Approach
Planning the Migration Journey
The Five Pillars of Migration
Introduction
The data lakehouse is a cloud-native data management platform that
provides a powerful engine for data processing and simple tools for
developers, analysts, data scientists, and business users in an
intuitive user interface (UI). It enables you to quickly build, deploy,
scale, and manage analytical applications in minutes instead of hours
or days. The data lakehouse is an open data architecture that combines
the best of data warehouses and data lakes on one platform.
With the data lakehouse, you can analyze all your data in one place
without first moving it to another system. The key to using this
platform is migrating your current system to the data lakehouse. This
book looks at why and how to migrate from your enterprise data
warehouse (EDW) to the data lakehouse to prepare your organization to
meet the future.
The Tip icon adds information to help you manage processes
faster and easier.
IN THIS CHAPTER
»» Looking at data management challenges
Chapter 1
Recognizing the End of an Era
Business data continues to be one of the most valuable assets
a corporation possesses. As data availability continues to
explode, maximizing, optimizing, and refining enterprise
data are seen as central to a thriving business. But it’s hard for
companies to keep up with the growing volume. As a result, you
need a well-designed data management architecture to help you
minimize risks and reach your financial goals.
The advent of low-cost cloud storage, open-source software, machine
learning (ML), and artificial intelligence (AI) has allowed for a
significant shift in how organizations leverage their data. In
addition, the COVID-19 pandemic forced
companies to adapt to a remote distributed workforce. As a result,
cloud adoption has skyrocketed. The enterprise data warehouse
(EDW) of the past was a closed proprietary system not suited to
accommodate modern data management challenges that include
the ability to
»» Perform ML, data science, and AI, and support other new
sources of data required to make predictions
»» Store audio and video data sets
»» Support streaming for real-time operations
»» Scale in a flexible manner
»» Manage raw data regardless of the format
To understand the scope of the problem, you need to see that
as technologies advanced, many different types of data became
available and companies recognized their significant value. Busi-
nesses realized that they needed a unified place to store and ana-
lyze not only their structured data but also an increasing volume
of semi-structured and unstructured data.
query language (SQL) were deployed because the available data
was hierarchical and stored in database tables. For the longest
time, this method was adequate to create the necessary financial
and other business reports.
»» Data from each source had its own schema, and each
business application used its own schema. This required
extensive and complex extract, transform, load (ETL)
processes to load the data into standardized data models,
only for it to be copied again in different formats by
different business teams.
»» ETL to load the data warehouse required extensive modeling
and months of effort. By the time the data was ready to be
analyzed, the business need had either already been met or
had changed, and the data was often outdated.
»» Scaling became exponentially more expensive.
»» There wasn’t support for data science, ML, and real-time
analytics, or for semi-structured or unstructured data sets.
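The schema problem in the list above can be made concrete with a minimal sketch. It assumes two hypothetical source systems (a CRM and a billing application) that name the same customer fields differently; the field names, mappings, and sample records are all invented for illustration and don't come from any real product. Plain Python stands in for a real ETL tool.

```python
from datetime import date

# Hypothetical field mappings: each source system names the same facts differently.
CRM_MAPPING = {"cust_id": "customer_id", "full_name": "name", "signup": "signup_date"}
BILLING_MAPPING = {"CustomerNo": "customer_id", "CustomerName": "name", "Created": "signup_date"}


def transform(record, mapping):
    """Rename source fields to the standard model and normalize the values."""
    row = {target: record[source] for source, target in mapping.items()}
    row["customer_id"] = int(row["customer_id"])
    row["name"] = row["name"].strip().title()
    row["signup_date"] = date.fromisoformat(row["signup_date"])
    return row


def run_etl(sources):
    """Extract from every (records, mapping) pair, transform, and load into one list."""
    warehouse = []
    for records, mapping in sources:
        for record in records:
            warehouse.append(transform(record, mapping))
    return warehouse


# Invented sample data: the same kind of fact arrives in two different shapes.
crm_records = [{"cust_id": "101", "full_name": " ada lovelace ", "signup": "2021-03-01"}]
billing_records = [{"CustomerNo": 202, "CustomerName": "GRACE HOPPER", "Created": "2020-11-15"}]

standard_rows = run_etl([(crm_records, CRM_MAPPING), (billing_records, BILLING_MAPPING)])
for row in standard_rows:
    print(row)
```

Every new source needs its own mapping and its own cleanup rules before anything can be loaded, which is one reason this kind of work took months in a traditional EDW.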
»» All data was up to date, and it was easy to add new sources of data.
»» You didn’t need to maintain multiple copies of the data set.
»» Large-scale data cleansing and transformations were possible.
»» You could run ad hoc queries against the entire data set.
»» You could easily extract data from the data lakes and send it
to other locations.
»» It supported open-source ML libraries.
Inevitably, some challenges with data lakes also existed. Those
included such things as
IN THIS CHAPTER
»» Looking at the top goals of data and
technology executives
Chapter 2
Prioritizing Your Data and AI Strategy
In today’s competitive environment, it’s not enough to have the
right architecture to support your organization’s data. You also
need a comprehensive strategy that serves all the essential
components of your organization. This strategy should include
leveraging people, business goals, and technology. It’s the key to
long-term business success. Ultimately, the technology should be
an enabler of the strategy and not the other way around.
This chapter looks at the top three strategic goals that data
and technology executives want to achieve and the benefits of
establishing a data culture. You also look at a maturity model to
determine where your organization fits and steps you can take to
progress.
»» Creating new business value from data
»» Reducing risks
»» Controlling costs
You look at each goal in this section.
Data and technology executives want to use that data to get bet-
ter insights to increase business impact. Specifically, they seek
a lower-cost approach that improves the user experience and
increases collaboration across data personas. This goal moves them
away from complex and expensive on-premises enterprise data
warehouse (EDW) architectures.
Reducing risks
Another strategic goal for leaders of organizations is to reduce
several potential risks such as weak data management, failed IT
projects, missing out on innovation due to the lack of advanced
analytics platforms, and the ever-present threat of cyberattacks.
These threats make it imperative to have a consistent way
to store, process, manage, and secure data. However, this goal is
made more complex by the following:
Controlling costs
Leaders must always contend with the need to control costs. Data
warehouses can get very expensive, very fast as the amount of
data they manage grows. On top of that, there are overhead costs
from data center equipment, database administration operations and
maintenance, and many locked-in vendor agreements.
You can spot a company that has a strong data culture in two key
ways:
Does this sound like your organization? If not, check out the later
section “Examining a Data and AI Maturity Curve” for a data
maturity model that can guide your progress.
DEMOCRATIZING YOUR DATA
In partnership with Databricks, MIT Tech Review conducted a global
survey (2021) of 351 chief data officers, chief analytics officers, chief
information officers, and other senior technology executives to deter-
mine how they succeeded (or didn’t) at building a high-performance
data and AI organization.
Among their key findings was the need to democratize the data.
To accomplish this, they recommended the following:
Customers of data platforms want to know
The model is as follows:
IN THIS CHAPTER
»» Understanding the evolution of the
lakehouse
Chapter 3
The Dawn of the Lakehouse
Lessons learned from working with enterprise data warehouses
(EDWs) and data lakes have paved the way for the lakehouse’s
modern cloud-based data architecture. It combines the best
properties and capabilities of both to provide a far more powerful
and flexible data platform than was possible in the past.
Is your organization prepared to operate and maintain a complex
technology stack of data lakes, EDWs, business intelligence (BI),
data science, machine learning (ML), and streaming platforms,
along with the complexity of moving data between them and
managing the different security paradigms of each? Or would you
rather simplify to one lakehouse platform that’s easy to manage,
so you can focus on solving business challenges with data rather
than on complex platform and security management? Consider
migrating to the lakehouse to ensure that you’re prepared for
the new challenges ahead.
Early data warehouses were optimized for analytics but not for
unstructured data. Likewise, data lakes were traditionally used to
store unstructured data but weren’t optimized for analytics. The
result: You had to choose between agility and governance. The
value of the lakehouse architecture is that data teams can now
store all their data on one platform, with the speed and govern-
ance of a data warehouse and the flexibility of a data lake.
Reviewing What a Lakehouse Is
The lakehouse takes the best elements of data warehouses and
data lakes and combines them into a single platform that gives
you the best of both worlds. Operating a lakehouse architecture is
the foundation that enables you to
»» Manage all data use cases from a single source of truth for
all your data.
»» Be more responsive and find new insights faster.
»» Have everyone look at the same version of the data.
»» Simplify existing architectures and security by reducing silos
and the number of systems and tools that you need to
manage.
»» Consolidate and tie your data marts and EDWs together with
unstructured data for enrichment, and create innovative
data products.
»» Perform extract, transform, load (ETL) operations on the
data within the data lakehouse.
Lakehouse architecture is
FIGURE 3-2: The Databricks Lakehouse platform.
IN THIS CHAPTER
»» Looking at the four truths of data leaders
»» Improving technology
Chapter 4
Benefitting from Lakehouse Migration
Extracting knowledge from all the raw data in your organization’s
disparate systems provides you with a tremendous
competitive advantage. If you could go back in time, you
would likely make different decisions about how you manage your
data to support today’s challenges. This chapter looks at how you
can transform your organization by deploying a lakehouse
architecture and shows how companies like Bread and Amgen have
benefitted from making this transition.
They believe that four truths should guide their decisions:
Bread
Bread is a division of Alliance Data Systems. It’s a technology-
driven payments company that integrates with merchants and
partners to personalize payment options for their customers.
Bread’s data warehouse couldn’t handle data growth from gigabytes
to terabytes. It was taking hours to query data. The company also
struggled with switching from batch to streaming data, hindering
its ability to deliver real-time insights and results.
Amgen
Amgen is the world’s largest independent biotech company. Over
the last 40 years, its vast amount of data has helped pioneer new
drug-making processes and develop life-saving medicines. As
the size of its data grew, the company couldn’t weave together
and scale the various aspects of its business. Amgen needed to
expand its cross-functional collaboration to take advantage of
the many perspectives present in its data. To support the digital
transformation journey, Amgen chose to use the Databricks
Lakehouse Platform.
SETTING A PERFORMANCE RECORD
Databricks has been rapidly developing full-blown data warehousing
capabilities directly on data lakes, bringing the best of both worlds in
one data architecture: the data lakehouse. It announced its full suite
of data warehousing capabilities as Databricks SQL in November
2020. The open question since then has been whether an open
architecture based on a lakehouse can provide the performance, speed,
and cost of the classic cloud data warehouses.
IN THIS CHAPTER
»» Taking an agile approach to migration
Chapter 5
Reviewing Why to Migrate to the Lakehouse
Migrating to a new architecture can be a complex process.
As you consider your data warehouse modernization
strategy, plan out several essential migration factors
before proceeding. Understanding the inherent differences in
architectural choices helps you make well-informed decisions
about how best to proceed with your modernization initiatives.
This chapter looks at the value of taking an agile, iterative
approach to migration and suggests how to plan and execute the
migration journey.
yet another Cloud EDW. Lift and shift of the platform
results in carrying the same issues and shortcomings to the
cloud. From a code and application redesign perspective,
have a balanced approach to lift and shift and modernize.
»» Implement a balanced approach to lift and shift and
modernization. Lift and shift the code and modernize it in
the same iteration. Use lift and shift with automated code
converters and immediately modernize to optimal
Databricks patterns.
Lift and shift refers to the movement of an application design
and code from one environment to another without making
massive changes. But don’t wait to modernize and redesign
later — immediately redesign and apply all best practices.
Decide what needs redesign and what code can benefit from
a lift-and-shift approach.
»» Learn what worked and didn’t and iterate. Add on
additional use cases and workloads as you go.
»» Show success in shorter sprints and adapt. This way you
immediately show success to the stakeholders, and the
learnings and feedback help you improve the next iteration
of migration.
A simple lift and shift alone is rarely the answer; if you only lift
and shift, you don’t get these three essential benefits:
Both lift and shift and total re-engineering have pros and cons:
»» Lift and shift: Pro: It’s faster, which is critical if you
have a multi-million data warehouse license renewal coming
up. Con: You may miss the opportunity to re-engineer
and refactor the design and code.
»» Total re-engineering: Pro: It gives you the best quality.
Con: It can take years to complete and comes at a high
upfront cost.
Planning the Migration Journey
When considering a migration journey, carefully plan each step
along the way. The journey can be depicted as a set of steps (see
Figure 5-1). Here’s how each step works:
The sooner you execute your migration, the quicker you can start
to scale your analytics practice, cut costs, and increase overall
team productivity.
FIGURE 5-1: The phases of migration methodology.
IN THIS CHAPTER
»» Future-proofing your investment
Chapter 6
Ten Reasons to Migrate to the Databricks Lakehouse
When you’re making your decision to migrate to the
lakehouse, Databricks gives you ten reasons to choose
the Databricks Lakehouse as your platform. These
reasons are
»» You want the best price/performance. Operating a
lakehouse architecture provides up to 12 times better price/
performance than other cloud data warehouses.
»» You want to easily ingest data from anywhere and
access the freshest data. Databricks SQL works with your
data, no matter where it is. Databricks has Auto Loader
capabilities for seamless file ingestion and many built-in
ingestion partner integrations through Partner Connect,
which provides turnkey capabilities to ingest data from
sources ranging from cloud storage to enterprise applications
such as Salesforce or Marketo. It’s just one click away.
»» You need modern analytics with your tools of choice.
Databricks SQL works seamlessly with the most popular
business intelligence (BI) and SQL tools, such as dbt, Tableau,
Power BI, and Looker. As a result, analysts can use their
favorite tools to discover new business insights on the most
complete and freshest data.
»» You get a first-class SQL development experience.
The Databricks SQL query editor allows analysts to write queries
in a familiar syntax (ANSI SQL) and easily explore data in
place in the lakehouse. Analysts can easily make sense of
query results through a wide variety of rich visualizations
and quickly build and share dashboards with stakeholders.
»» You eliminate infrastructure management. You experience
lower costs and eliminate the need to manage, configure, or
scale cloud infrastructure with serverless SQL compute. This
frees up your data team to do what they do best.
»» You’re able to practice fine-grained governance on the
lakehouse. You can confidently manage and secure data
access on your lakehouse with fine-grained governance. In
addition, you can meet compliance needs with data lineage,
role-based security policies, and table- or column-level tags
for all data assets.
»» You have one source of truth for all your data. Unlike
enterprise data warehouses (EDWs), the Databricks
Lakehouse Platform provides one common storage and data
management framework for all data types on your existing
data lake.
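To illustrate what writing queries in a familiar, standard SQL syntax looks like, here is a minimal sketch of a typical analyst query: a grouped aggregate. It uses Python’s built-in sqlite3 module purely as a local stand-in engine. It is not Databricks SQL, and the orders table and its values are invented for the example.

```python
import sqlite3

# In-memory database used only as a local stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# A standard SQL aggregate: order count and total amount per region.
query = """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY total_amount DESC
"""
results = conn.execute(query).fetchall()
for region, order_count, total_amount in results:
    print(region, order_count, total_amount)
conn.close()
```

Because the query sticks to standard constructs (SELECT, GROUP BY, ORDER BY, and aggregate functions), an analyst should be able to carry the same pattern to other engines that support standard SQL with little or no change.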
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.