Cloud Data Platforms For Dummies 2nd Edition
by David Baum
These materials are © 2022 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Cloud Data Platforms For Dummies®, 2nd Snowflake Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2022 by John Wiley & Sons, Inc., Hoboken, New Jersey
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the
prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com,
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be
used without written permission. Snowflake and the Snowflake logo are trademarks or registered
trademarks of Snowflake Inc. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For
Dummies book for your business or organization, please contact our Business Development
Department in the U.S. at 877-409-4177, contact [email protected], or visit www.wiley.com/go/
custompub. For information about licensing the For Dummies brand for products or services,
contact BrandedRights&[email protected].
ISBN 978-1-119-87548-2 (pbk); ISBN 978-1-119-87549-9 (ebk)
Publisher’s Acknowledgments
Some of the people who helped bring this book to market include the following:
Development Editor: Brian Walls
Project Manager: Jen Bingham
Business Development Representative: Molly Daugherty
Streamlining Data Engineering
Sharing Data Easily and Securely
Developing Data Applications
Advancing Data Science
Introduction
Data analysts, data scientists, data engineers, and data
application developers influence critical functions
throughout the enterprise: sales, finance, supply chain,
and much more. But they often work in isolation and must
contend with a vast landscape of data silos.
»» Standardize on a fully managed, usage-based data platform
that supports multiple data types.
»» Empower your data professionals to extract value from data
in ways not possible before.
»» Take advantage of baked-in data security, governance, and
resiliency that spans regions and clouds.
»» Efficiently access, share, and monetize data without copying
or manually moving data from one environment to another.
»» Implement new and changing architectural patterns such as
a data mesh or a hybrid data warehouse/data lake with a
single, flexible platform.
IN THIS CHAPTER
»» Tracking the cloud data platform’s
history
Chapter 1
Getting Up to Speed with
Cloud Data Platforms
Over the last four decades, the software industry has produced
various solutions for storing, processing, and analyzing
data. These solutions made it possible to work with
traditional forms of data as well as newer data types generated by
websites, mobile devices, Internet of Things (IoT) devices, and
other more recent technologies. Some of the
new solutions were designed to democratize access to data for the
business community, which has gradually moved data and analytics
from the enterprise back office to frontline workers and the
executive suite.
The business world has learned how to put some of this data to
work in productive new ways, but many on-premises and legacy
cloud platforms weren’t architected for the variety and dynam-
ics of today’s data. Nor can those systems help you solve modern
operational needs, such as providing a single experience across
major clouds and securely sharing data globally.
on-premises cousins. However, because they weren’t built from
the ground up for the cloud, they struggled to take full advantage
of the cloud’s near-unlimited scalability and performance.
issues by allowing everyone to rally around a single, sanctioned
copy of the data.
Most importantly, your cloud data platform must take full advan-
tage of the true benefits of the cloud, with an architecture based
on three key elements (see Figure 1-1).
Introducing the Architecture of Cloud
Data Platforms
To best satisfy the requirements of a modern cloud data platform,
the platform should be built on a modern multi-cluster, shared data
architecture, in which compute, storage, and services are separate
and can be scaled independently to leverage all the resources of
the cloud (see the “Essential Architecture” sidebar). This archi-
tecture allows a near-limitless number of users to query the same
data concurrently without degrading performance, even while
other workloads are executing simultaneously, such as running a
batch processing pipeline, training a machine learning model, or
exploring data with ad hoc queries.
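The separation described above can be pictured with a small Python sketch. It is a toy model written for illustration only, not any vendor's API: the class names `SharedStorage` and `ComputeCluster`, the cluster names, and the node counts are all hypothetical. It shows two compute clusters reading one shared copy of the data while each is sized on its own.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    """Single copy of the data, shared by every compute cluster (hypothetical)."""
    tables: dict = field(default_factory=dict)


@dataclass
class ComputeCluster:
    """Hypothetical compute cluster that reads from shared storage."""
    name: str
    nodes: int
    storage: SharedStorage

    def resize(self, nodes: int) -> None:
        # Scaling compute changes only this cluster, not storage or its peers.
        self.nodes = nodes

    def row_count(self, table: str) -> int:
        # Every cluster reads the same stored rows; nothing is copied.
        return len(self.storage.tables[table])


storage = SharedStorage(tables={"sales": [{"id": 1}, {"id": 2}, {"id": 3}]})
bi_cluster = ComputeCluster("bi_dashboards", nodes=2, storage=storage)
ml_cluster = ComputeCluster("ml_training", nodes=4, storage=storage)

# Scale the machine learning workload without touching the BI workload.
ml_cluster.resize(16)

# Storage grows independently of either cluster's size.
storage.tables["sales"].append({"id": 4})
```

In this model, resizing `ml_cluster` leaves `bi_cluster` unchanged, and both clusters see the new row because they reference the same storage object rather than private copies.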
ESSENTIAL ARCHITECTURE
A multi-cluster, shared data architecture includes three layers that are
logically integrated yet scale independently from one another:
FIGURE 1-2: A modern cloud data platform should seamlessly operate across
multiple clouds and apply a consistent set of data management services to
many types of modern data workloads.
»» Advanced prescriptive and predictive analytics: Whereas
traditional analytic systems are reactive and backward-
looking, predictive and prescriptive systems understand the
present state or peer into the future. They recommend a
specific course of action by considering dynamically shifting
variables, such as moment-to-moment sales during a retail
promotion or campaign. Once data scientists identify the
correct algorithms and train the machine learning models,
the systems predict outcomes and prescribe a course of
action on their own — and they get smarter over time.
»» The opportunity to create new data applications: A cloud
data platform should make data application development
more accessible not just for traditional technology compa-
nies but also for any company that sees the opportunity to
offer data-driven products and services to its customers.
»» Support for modern data patterns and paradigms: The
ability to leverage new architectural frameworks beyond
data lakes and data warehouses, such as a hybrid lake-
warehouse or data mesh — a decentralized method of data
management that assigns responsibility for data to the
business teams that are closest to that data. Rather than one
monolithic system under the auspices of a centralized IT
department, a data mesh extends ownership to business
experts from throughout the organization. Each business
team leverages its domain knowledge to create data
pipelines, catalog data, uphold data privacy mandates, and
ensure data quality.
»» Easy, pervasive, and secure data sharing: A cloud data
platform should enable organizations to establish one-to-
one, one-to-many, and many-to-many relationships to share
and exchange data in new and imaginative ways. Secure,
governed access to a single source of data not only makes
internal teams more efficient but also facilitates collaboration
among business partners, customers, and other constituents.
»» The rise of global data networks: In every industry,
immense data-sharing networks, exchanges, and market-
places have emerged, propelling a growing data economy
and motivating business leaders to examine new data
sharing possibilities. A cloud data platform should enable
these networks with almost none of the cost, complex
procurement cycles, and delays that have plagued traditional
exchanges and other types of data sharing.
IN THIS CHAPTER
»» Understanding the problems with
traditional data management
approaches
Chapter 2
Leveraging the
Exponential Growth
and Diversity of Data
First-generation cloud data platforms can’t keep up with the
nonstop creation, acquisition, storage, analysis, and sharing
of today’s diverse data sets. Much of the data is semi-
structured or unstructured, which means it doesn’t fit neatly into
the traditional data warehouse, which first emerged more than
40 years ago. Additionally, some data types, such as images and
audio files, are wholly unstructured and must be maintained as
binary large objects (BLOBs) within an object-based storage sys-
tem that doesn’t conform to traditional data management
practices.
UNDERSTANDING DATA TYPES
Most data can be grouped into three basic categories:
Furthermore, many legacy systems don’t have the architec-
tural flexibility to simultaneously work with structured, semi-
structured, and unstructured data and support the multitude of
other workloads needed to derive value, such as data engineering
pipelines and machine learning models.
Understanding Problems with
“Stitched Together” Platforms
Clearly, the cloud is a boon to data-intensive projects. But not
all cloud data platforms have the same pedigree. Some are built
on a cohesive architecture that takes full advantage of modern
cloud infrastructure and features inherent integration among all
platform services. Others represent “ecosystems” — dozens or
even hundreds of “best of breed” services that weren’t initially
designed to work together.
Each unique activity requires a unique set of tools and may require
copying, extracting, or moving the data. The customer must fig-
ure out how to stitch it all together because these systems don’t
naturally integrate.
SOARING TO NEW HEIGHTS WITH
A CLOUD DATA PLATFORM
The spirit of innovation drives nearly every aspect of the business for
JetBlue, a leading airline carrier based in the U.S. In that spirit, JetBlue’s
data scientists and machine learning engineers use a cloud data platform
because it gives them a one-stop shop for all their data needs.
Airlines run on razor-thin margins. The data science team uses the
cloud data platform to discover cost efficiencies, develop great cus-
tomer experiences, and promote competitive fares, all of which
boosts revenue. Data is available 24/7, which helps JetBlue maintain
business continuity throughout the organization. Dynamic data masking
allows the airline to control access to data based on roles. Near-real-time
reporting enables analysts to build dashboards that allow
the operations team to make decisions as situations occur.
The data science team plans to use the cloud data platform to build
better fuel prediction models. By combining internal data with exter-
nal sources, such as air traffic control and weather, they can develop
reports and run analyses that were not possible with their traditional
data management solution.
JetBlue also uses the cloud data platform to share data with external
partners. In two minutes and with only a few clicks, the data engineer-
ing team can create a secure data sharing infrastructure that formerly
would have taken months of planning and weeks of development.
As JetBlue expands beyond its domestic roots, analysts can use the
knowledge they have gained to craft unique experiences for new cus-
tomers in new locales. As Ben Singleton, director of data science and
analytics at JetBlue, said, “We like to say that we’re a customer service
company that just happens to fly planes. Now it almost seems as
though we’re also a technology company that happens to fly planes.
The cloud data platform is a key part of making that happen.”
democratize access to that data, automate routine data manage-
ment activities, efficiently govern the data, and support a broad
range of data processing and analytics workloads. And they want
to do all this in one place, so they can easily obtain and share all
types of insights from all their data.
IN THIS CHAPTER
»» Taking stock of your data and analytic
needs
Chapter 3
Selecting a Modern,
Easy-to-Use Platform
Organizations outgrow their existing data platforms for a
variety of reasons. In many instances, limitations surface
in response to competitive threats that require the busi-
ness to acquire new types of data and experiment with new data
workloads. For example, a data science team may set out to create
a predictive analytics model that helps the sales team mitigate
customer churn. The success of this sales initiative depends on
the capability to access and iterate over the right data that best
describes customer behavior.
This data arrives as JavaScript Object Notation (JSON) in a semi-
structured format. Analysts want to visualize the analysis of this
data in conjunction with audio transcripts of customer support
calls and some enterprise resource planning (ERP) transactions
stored in a relational database, including historical data about
sales, service, and purchase history.
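A minimal sketch of combining semi-structured JSON with relational data follows, using only Python's standard `json` and `sqlite3` modules as stand-ins for a data platform. The event fields, table names, and values are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Hypothetical semi-structured events; note the third one has no "device" key.
raw_events = [
    '{"customer_id": 1, "event": "page_view", "device": {"type": "mobile"}}',
    '{"customer_id": 2, "event": "cart_add", "device": {"type": "desktop"}}',
    '{"customer_id": 1, "event": "cart_add"}',
]

conn = sqlite3.connect(":memory:")

# Relational side: a stand-in for ERP purchase history.
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)

# Flatten each JSON document into a row, tolerating missing nested fields.
conn.execute(
    "CREATE TABLE events (customer_id INTEGER, event TEXT, device_type TEXT)"
)
for line in raw_events:
    doc = json.loads(line)
    device_type = doc.get("device", {}).get("type")  # None when absent
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (doc["customer_id"], doc["event"], device_type),
    )

# Join event counts with total spend per customer.
rows = conn.execute(
    """
    SELECT e.customer_id, COUNT(*) AS event_count, o.total_spend
    FROM events AS e
    JOIN (
        SELECT customer_id, SUM(total) AS total_spend
        FROM orders
        GROUP BY customer_id
    ) AS o ON o.customer_id = e.customer_id
    GROUP BY e.customer_id, o.total_spend
    ORDER BY e.customer_id
    """
).fetchall()
```

The point of the sketch is that the JSON documents do not share a fixed schema (one lacks the nested `device` object), yet once flattened they can be analyzed next to the relational rows in a single query.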
As you gather additional data and the value of that data grows,
you may want to monetize that data via a data marketplace to turn
it into a strategic business asset. A modern cloud data platform
should provide seamless access to a cloud data marketplace.
What are the advantages of this approach? First, all users have
a single interface for viewing and managing that data. Second,
in addition to the primary data store, the platform allows you to
access, manage, and use data in external tables (read-only tables
that reside in external repositories and can be used for query and
join operations) just as easily as you can access it from the main
platform — and with exceptional performance. Finally, you can
leave data in an existing database or object store yet apply univer-
sal controls. This allows you to simplify your data environment by
standardizing on a single cohesive system.
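As a rough analogy for joining primary data with an external table, the sketch below uses SQLite's `ATTACH DATABASE` from Python's standard `sqlite3` module. This is an assumption-laden stand-in: the attached database merely plays the role of an external repository, the table names and rows are hypothetical, and the sketch does not enforce the read-only behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attach a second, separate in-memory database to stand in for an
# external repository (hypothetical schema name "ext").
conn.execute("ATTACH DATABASE ':memory:' AS ext")

# Primary data store.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Data that lives in the "external" repository.
conn.execute("CREATE TABLE ext.shipments (customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO ext.shipments VALUES (?, ?)",
    [(1, "delivered"), (2, "in_transit")],
)

# One query joins across both stores through a single connection.
joined = conn.execute(
    """
    SELECT c.name, s.status
    FROM main.customers AS c
    JOIN ext.shipments AS s ON s.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()
```

The query addresses both stores through one interface, which is the convenience the paragraph above attributes to external tables.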
To ensure you obtain superior, cloud-built capabilities, ask your
cloud data platform vendor these questions:
However, not all cloud services are created equal. Most cloud ven-
dors claim to offer “managed services,” but you must dig a little
deeper to discover how much automation they actually provide.
Ideally, all aspects of managing, updating, securing, governing,
and administering your data platform should be transparent to
the business community and require no extra effort by your IT
professionals. Furthermore, this level of automation should be
holistic across clouds, regions, and teams, as Chapter 7 describes.
»» Do you need to partition data, tune SQL queries, and
optimize performance, or does the platform handle this
automatically?
The best cloud data platforms are fully managed services: You
click a button, and a database appears. After that, all manage-
ment, administration, scaling, tuning, and data security should
happen automatically in the background.
IN THIS CHAPTER
»» Recognizing how today’s workers
use data
Chapter 4
Accommodating Users,
Workloads, and Access
Patterns
Today, nearly every worker consumes data on some level.
Everybody is a data consumer, but each person has different
data requirements.
Data scientists leverage massive data sets to build, train, and deploy
machine learning models. They consolidate, cleanse, and trans-
form data to fuel their models. To deliver new value and unlock
new business opportunities, they create predictive and prescrip-
tive analytics.
Data engineers build data pipelines and use various tools to popu-
late databases in real time or batch mode and refresh those data-
bases at periodic intervals. They are also responsible for cleansing
data to eliminate duplications, correct inaccuracies, and resolve
inconsistencies, often by incorporating input from analysts and
LOB managers. Finally, data engineers handle data transforma-
tion projects, such as converting data from one format or struc-
ture into another format or structure.
Data architects are tasked with delivering the right tools and infra-
structure to make all these teams productive while helping to
establish and enforce data security and data governance needs.
This is in sharp contrast to legacy data platforms, which are
restricted by a linear data processing architecture. These older
platforms are limited in the scale and number of workloads
they can run in parallel, leading to failed jobs or long wait
times for resources and data-driven insights. Furthermore, because
they’re typically optimized for a particular type of user or workload,
organizations often end up with unique data silos for each
unique situation.
FIGURE 4-1: A cloud data platform should handle any data source and data
workload and serve data consumers of all levels and needs.
Empowering data teams
with a data mesh
A data mesh is a design pattern for organizing data and help-
ing domain teams gain access to that data. The basic premise is
to divide large, monolithic data architectures into smaller func-
tional domains, each managed by a dedicated team. The teams
closest to the data are responsible for developing and managing
the data products they use and that serve the business, including
building and maintaining the data pipelines, implementing gov-
ernance policies, and extending access to others who can benefit
from that access.
• Principle 3: Self-service infrastructure as a platform. A data
mesh eliminates complex technologies and the need for niche
skills. The right cloud data platform supports a consistent set of
tools and capabilities that allow domain teams to build, serve, and
utilize data products without getting bogged down managing
hardware and software or scaling infrastructure.
• Principle 4: Federated governance. Strong access controls and
data protections are implemented by each domain team, mitigating
risks while enforcing data privacy and compliance as new products
are developed for sharing data. These governance policies should
be centrally managed and interoperable across the business.
The best cloud data platforms can connect domain teams across
regions and clouds, as Chapter 7 discusses. Each domain team can
operate locally, running on its preferred cloud or region. Whether
the teams work in SQL, Java, Scala, or Python — or utilize a mix of
languages and techniques — the cloud data platform should easily
support them. They can share data and data products as easily with
a domain team on the other side of the world as they can with a team
in the same office. And the organization can replicate data between
regions and between multiple public clouds to operate without dis-
ruption, ensuring business continuity, allowing for regional data
sovereignty differences, and upholding regulatory protections.
When anchored by a modern cloud data platform, a data mesh
can incorporate many types of data (structured, semi-structured,
and unstructured) and file formats, and support access to external
data for comprehensive coverage of the data landscape. IT teams
don’t need to worry about provisioning, maintenance, upgrades,
or downtime. Domain teams operate as distinct units and can
scale their data products to other teams, requiring no infrastruc-
ture expertise or database tuning.
The cloud data platform also increases Vimeo’s ability to make data-
driven decisions. Ingesting enriched data from no-code sales and
marketing platform Openprise provides valuable insights about
enterprise-level customers. Integrating with customer data platforms
Singular and Simon Data enables a data enrichment process that
helps marketers refine Vimeo’s customer acquisition models. Best of
all, Vimeo’s data platform can support new data-driven initiatives.
IN THIS CHAPTER
»» Broadening analytics initiatives
Chapter 5
Using a Cloud Data
Platform to Support
Diverse Data Workloads
A cloud data platform should maximize the value of your
data. It should bring together modern technologies for
storing, sharing, and analyzing that data; creating modern
data pipelines; building new data applications; and delivering
cutting-edge data science and predictive analytics projects. A
modern cloud data platform can power, scale, automate, and
improve these important workloads.
data warehouses, data lakes, and more. In addition, a cloud data
platform should:
of Things (IoT) systems, streaming data from social media feeds,
JSON event data, and weblog data from Internet and mobile apps.
Developing Data Applications
Today, nearly every company sees the value of leveraging data to
develop new insights and share them with customers and part-
ners, opening up new revenue streams and powering new lines of
business. A cloud data platform masks DevOps complexity, so you
can focus on creating innovative data applications. For example,
a cloud data platform eliminates the need to build infrastructure
and automatically handles provisioning, availability, tuning, data
protection, and other operations. Developers can instantly spin up
dedicated compute resources to support a near-unlimited number
of concurrent users and workloads without requiring a dedicated
engineering team to prepare the data. Operations and quality-
assurance (QA) professionals can utilize DevOps workflows to:
A modern cloud data platform should satisfy the entire data life-
cycle of ML, artificial intelligence, data visualization, predictive/
prescriptive analytics, and application development. It should con-
solidate data in one central location for easy development and flex-
ible accessibility via a wide range of data science notebooks and
AutoML tools. It should also natively support the most popular lan-
guages, including SQL, Java, and others. These capabilities enable
data scientists to develop and deploy new models with less time
spent on data preparation.
IN THIS CHAPTER
»» Establishing a robust and efficient data
sharing architecture
Chapter 6
Sharing and
Collaborating with
Your Data
According to a 2020 Forrester Research report titled “The
Insights Professional’s Guide to External Data Sourcing,”
47 percent of organizations currently commercialize their
data, while 76 percent have launched, or plan to launch, initia-
tives for improving their external data sourcing. A cloud data
platform should revolutionize these endeavors by easily enabling
modern and secure data sharing without requiring organizations
to move or copy the data.
Newer data sharing methods use cloud storage services to stage
data to a central location that authorized consumers can access.
FIGURE 6-2: A modern cloud data platform enables live, governed data to be
shared across clouds and regions without needing to move files across
environments or create unnecessary copies.
FIGURE 6-3: A cloud data platform streamlines data sharing between data
providers and data consumers, even across multiple regions and clouds.
provider handles data transformation, preparation, copying, and
loading, while the marketplace oversees discovery, collabora-
tion, licensing, and auditing. These are onerous tasks for the data
provider, requiring complex data pipelines and constant update
procedures that often leave the consumer with stale data. With
a modern cloud data platform that replaces those manual mar-
ketplace tasks, data providers can share and monetize their data
much more easily.
Some data providers share data. Others also share data services
that put that data to work. For example, an organization might
supplement its internal customer data with third-party data to
better understand the age and income of groups that have pur-
chased from its website. The same organization might subscribe
to a data service that cross-references online purchase behavior
with additional third-party demographic data, enabling a more
personalized understanding of each group or segment.
With modern data sharing technology, a data provider can easily
grant access to the data it wants to share with its intended mar-
ketplace consumers without managing cumbersome data pipe-
lines. End-to-end security, multiparty governance, and metadata
management services are systematically applied, even when the
data consumers span multiple clouds. With updates made auto-
matically, you don’t have to link applications, set up file-sharing
procedures, or frequently upload new data to keep data current.
IN THIS CHAPTER
»» Ensuring worldwide business continuity
Chapter 7
Maximizing Availability
and Business Continuity
with a Cross-Cloud
Strategy
Large organizations commonly rely on multiple on-premises
data repositories while also storing data in one or more pub-
lic clouds. This diverse software-solutions landscape invari-
ably spawns diverse data sets, such as data warehouses populated
with data from enterprise applications, data lakes for exploratory
analysis, and a wide assortment of local databases, data marts,
and operational data stores for local and departmental needs.
Furthermore, each public cloud provider has different levels
of regional presence, and data sovereignty requirements may
require organizations to keep data processing operations within
the regions they serve, leading to even more silos. Each depart-
ment and division within your organization may have unique
requirements. Rather than demand that all business units use the
same cloud provider, a multi-cloud strategy allows each unit to
use the cloud that works best for that unit.
Your cloud data platform should allow you to easily operate data
workloads among multiple clouds and multiple regions within
each cloud, so you can locate data where it makes sense and mix
and match clouds as you see fit.
systems to encrypt data? Will data engineers have to create unique
pipelines? Will data scientists encounter obstacles when building
machine learning models from multiple data sets?
administrators can easily control how the information is pro-
tected and ensure that all data-access constraints are consistently
enforced.
cloud — or restore previous versions of a table or database within
a specified retention period. This strategy ensures that your busi-
ness won’t be disrupted, and you’ll minimize data loss.
Reacting Quickly to New Regulations
Cross-cloud deployment has become increasingly pertinent as
data privacy regulations become more restrictive in Europe and
elsewhere. In some instances, sudden and sweeping changes to
data privacy laws may force you to reconsider which cloud pro-
vider you use.
Each public cloud provider has different levels of regional pres-
ence. A cross-cloud data platform should enable secure access to
data globally while upholding regional data privacy laws. This
allows you to select cloud providers that meet the needs of each
application, business unit, and competitive scenario.
Cloud agnostic doesn’t simply mean storing your data and operat-
ing your workloads in whatever cloud you choose. It also means
standardizing on a single cloud data platform built on a single
code base that operates seamlessly across all the clouds your
organization relies upon.
ENABLING A MULTI-CLOUD STRATEGY
Founded in 1851, financial services company Western Union enables
customers to pay bills, send money, and pick up cash at more than
550,000 agent locations worldwide. To ensure exceptional experi-
ences for more than 250 million customers across retail and digital
channels, Western Union ingests and analyzes large amounts of
transactional data.
IN THIS CHAPTER
»» Outlining the essential elements of cloud
security and data protection
Chapter 8
Leveraging a Secure and Governed Data Platform
Protecting your data and complying with industry and regional
regulations are fundamental to a cloud data platform’s
architecture, implementation, and operation. All aspects of the
service must center on maintaining security, protecting sensitive
information, and complying with industry mandates.
Centralizing control
Good governance is much easier to achieve when all database
objects (data structures such as tables and views used to store and
reference data) are centrally maintained and updated by the data
platform. The data platform should apply fine-grained governance
across all the different objects, not just the database, and those
governance policies should always be replicated with the data.
FIGURE 8-1: Comprehensive data governance is based on these three
fundamental principles.
including the tables where data is stored, the schemas that
describe the database structure, and any virtual extensions to the
database, such as views.
Data access policies should not change the data in the under-
lying table: They should be dynamically applied when the table
is queried. For example, a national sales database can be set up
with row-level access restrictions so sales reps can see only the
account information for their regions.
Without these flexible governance policies, data stewards would
have to copy regional sales information into separate tables to share
data with the pertinent sales regions — one table for the southwest
region, another table for the northwest region, and so on. How-
ever, changes in the base table need to be copied and merged to
all the regional tables, requiring constant administration. A cloud
data platform simplifies this scenario by allowing a sales manager
to maintain data in one base table, to which secure views and other
access policies are applied dynamically at query time.
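As a rough illustration of access rules applied at query time, the following Python sketch filters one shared base table according to the regions a user is allowed to see. The table contents, the `User` class, and the `query_sales` function are hypothetical names invented for this example; a real cloud data platform would enforce such policies inside the database engine rather than in application code.

```python
from dataclasses import dataclass

# Hypothetical base table: a single copy of the national sales data.
SALES = [
    {"account": "Acme Corp", "region": "southwest", "revenue": 120_000},
    {"account": "Birch LLC", "region": "northwest", "revenue": 95_000},
    {"account": "Cedar Inc", "region": "southwest", "revenue": 40_000},
]


@dataclass(frozen=True)
class User:
    name: str
    regions: frozenset  # regions this user is allowed to see


def query_sales(user, table=SALES):
    """Return only the rows the user's policy allows.

    The filter runs when the table is queried; the base table itself
    is never copied into per-region tables and never modified.
    """
    return [row for row in table if row["region"] in user.regions]


rep = User("southwest_rep", frozenset({"southwest"}))
manager = User("national_manager", frozenset({"southwest", "northwest"}))

rep_accounts = [row["account"] for row in query_sales(rep)]
manager_accounts = [row["account"] for row in query_sales(manager)]
print(rep_accounts)      # ['Acme Corp', 'Cedar Inc']
print(manager_accounts)  # ['Acme Corp', 'Birch LLC', 'Cedar Inc']
```

Because every user reads from the same base table, a change to an account record is visible to each permitted user on the next query, with no regional copies to merge.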
A cloud data platform should help you comply with all pertinent
industry regulations and provide security and compliance reports
upon request. Your cloud data platform vendor must demon-
strate that it adequately monitors and responds to threats and
security incidents and has sufficient incident response procedures
in place. Industry-standard attestation reports verify that cloud
vendors use appropriate security controls. Check with your data
platform provider to ensure the reports you need are available.
Common certifications include:
Encrypting data
Encrypting data involves applying an encryption algorithm to
translate readable text (plaintext) into ciphertext, a form of the
original data that is unreadable by a human or computer without
the proper key to decrypt it.
Data should be encrypted in transit and at rest, which means from
the time it leaves your premises, through the Internet or another
network connection, and into the platform. It should be encrypted
when it’s stored on disk, when it’s moved into a staging location,
when it’s placed within a database object, and when it’s cached
within a data repository. Query results should also be encrypted.
The cloud data platform vendor should also protect the decryption
keys that decode your data from ciphertext back to plaintext. The
best cloud vendors deploy AES 256-bit encryption with a
hierarchical key model. This method encrypts the encryption keys
themselves and enforces key rotation, which limits the time any
single key can be used, further strengthening security.
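To make the idea of a key hierarchy with rotation more concrete, here is a minimal Python sketch. It is not any vendor's implementation: the `KeyRing` class, the 30-day limit, and the `toy_wrap` function are assumptions made for illustration, and `toy_wrap` uses a simple XOR only as a stand-in for real AES-based key wrapping, which it does not provide.

```python
import secrets
from dataclasses import dataclass, field

KEY_BYTES = 32  # 256-bit keys, the key size used by AES 256-bit encryption


def toy_wrap(key: bytes, parent: bytes) -> bytes:
    """Placeholder for key wrapping. XOR is NOT secure; it only marks
    the place where a parent key would encrypt a child key."""
    return bytes(a ^ b for a, b in zip(key, parent))


toy_unwrap = toy_wrap  # XOR is its own inverse


@dataclass
class KeyRing:
    root_key: bytes
    max_age_days: int = 30  # hypothetical rotation limit
    # Each entry is (day_created, data_key_wrapped_by_root_key).
    # Older entries are kept so that older data can still be decrypted.
    wrapped_keys: list = field(default_factory=list)

    def active_key(self, today: int) -> bytes:
        """Return the current data key, rotating first if it is too old."""
        too_old = (
            not self.wrapped_keys
            or today - self.wrapped_keys[-1][0] >= self.max_age_days
        )
        if too_old:
            new_key = secrets.token_bytes(KEY_BYTES)
            self.wrapped_keys.append((today, toy_wrap(new_key, self.root_key)))
        return toy_unwrap(self.wrapped_keys[-1][1], self.root_key)


ring = KeyRing(root_key=secrets.token_bytes(KEY_BYTES))
k_day0 = ring.active_key(today=0)
k_day10 = ring.active_key(today=10)  # still within the limit: same key
k_day31 = ring.active_key(today=31)  # limit exceeded: a new key is issued
```

The point of the hierarchy is that data keys are only ever stored in wrapped form, so protecting the root key protects every key beneath it, while rotation caps how much data any one key covers.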
IN THIS CHAPTER
»» Maximizing performance for all types of
usage
Chapter 9
Achieving Optimal Performance in the Cloud
In the cloud, rapid data processing means less resource
consumption and lower costs. Virtually unlimited cloud resources
make it easy to scale vertically and horizontally and to bring in
new teams that can run more types of operations on your data in
parallel without contention. However, you need a flexible data
platform to properly leverage all that compute power, provision
the right amount of resources, and easily process all types of data
for many kinds of workloads. Without these essential ingredients,
you can’t easily put the data to work for your business.
data engineers, and also application developers creating new data
products for your internal stakeholders or external customers.
Additionally, all users should interact with the same data without
contending for resources or experiencing data processing delays.
FIGURE 9-1: A modern cloud data platform should deliver the power, speed,
seamlessness, and versatility of running a near-unlimited number of
concurrent and interconnected data workloads, at practically any data scale.
performance and scale these separate processes at will, without
resource contention. The platform should automatically manage
each workload to maximize throughput and ensure consistent
results, making it possible for thousands of users to simulta-
neously analyze and share the same single copy of data with no
bottlenecks.
A cloud data platform architected first and foremost for the cloud
can automatically provision nearly limitless amounts of compute
power to support virtually any number of users and workloads
without affecting performance.
services. If one team is running a heavy data preparation job while
another is crunching end-of-the-month financial reports, both
teams may experience resource contention and thus poor per-
formance or failed jobs. Adding more resources can be a lengthy
process, requiring new capital expenditures, complex implemen-
tation cycles, and ongoing system maintenance.
The critical issue is this: Can the cloud vendor and its associ-
ated ecosystem of add-on services fulfill all your data manage-
ment and analytic needs cohesively without forcing you to master
unique languages, development techniques, and management
tools? What services are layered on top of the basic cloud infra-
structure to handle data engineering, business analytics, data sci-
ence, and other tasks? How easy is it to integrate and use data for
these various activities?
In many cases, the burden is on you to figure out how to perform
each business task, integrate data, and synthesize the results back
into the platform. Training your team to work synergistically is
no small task, especially for an organization that seeks to maxi-
mize the accessibility and usability of its data.
Some cloud data platform vendors claim that they can run all
types of workloads against one common data repository, but the
data processing engine isn’t accessible to all users, and it doesn’t
dedicate resources to each workload. That means each team must
lobby IT to request resources or fight for compute time with other
groups. Performance degrades as contention increases. Be sure to
ask vendors whether all concurrent workloads execute simulta-
neously without impacting the performance of other workloads
and services. If they can’t, end users may be forced to manage
their resources and learn a specialized set of skills to use the
performance engine.
SUPPORTING DATA ENGINEERING, BUSINESS INTELLIGENCE, AND PREDICTIVE ANALYTICS
When U.K. supermarket giant Sainsbury’s set out to make analytics
more cost-effective and accessible to its employees, the first step was
to consolidate functionally siloed data from multiple operating com-
panies into a modern cloud data platform. In addition to being the
second-largest general merchandise and clothing business in the U.K.,
Sainsbury’s owns a bank and hundreds of grocery stores. The organi-
zation has thousands of employees and millions of customers and
performs billions of transactions each year.
These three systems now publish raw data directly to the cloud data
platform, which populates a dashboard that streams data to the
digital trading teams. Data that was formerly difficult to access from
Sainsbury’s retail stores and other consumer-facing channels is now
readily available. Store managers can obtain standard reports via
cloud-based dashboards that offer visibility into customer needs and
preferences. In addition, data scientists and machine learning engi-
neers are creating new data sets by accessing raw data from a data
lake, which resides within the cloud data platform. The cloud-built
platform separates storage and compute resources, improving
performance and eliminating resource contention for thousands of
users. Queries that used to take six hours in a legacy data warehouse
now run in three seconds in the cloud data platform.
IN THIS CHAPTER
»» Considering your overall requirements
Chapter 10
Five Steps for Getting Started with a Cloud Data Platform
These workloads have unique attributes, but all depend on the
universal principles of availability, reliability, extensibility, dura-
bility, security, governance, and ease of use. Keep these essential
workloads in mind as you ask yourself these questions:
»» Will your existing applications work with the new
platform? Business intelligence solutions, data visualization
tools, data science libraries, and other software development
tools should easily adapt to the new architecture.
»» How are your requirements likely to change in the
future? As you ponder emerging data-driven projects and
future application initiatives, make sure you are positioned
to accommodate new data, technologies, and capabilities
such as Internet of Things (IoT), machine learning, and
artificial intelligence.
Step 4: Calculate TCO and ROI
If you choose a cloud data platform that accommodates all types
of data and that has been designed first and foremost for the
cloud, you should be able to pay for actual usage in per-second
increments and minimize additional costs, such as maintaining
multiple systems and training people to handle diverse data.
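To see why per-second billing matters when you calculate TCO, consider this small Python sketch. The hourly rate, job length, and number of runs are made-up figures chosen only to keep the arithmetic simple, and `usage_cost` is a hypothetical helper, not a vendor pricing formula.

```python
def usage_cost(seconds_used: int, hourly_rate: float) -> float:
    """Cost when compute is billed per second of actual use."""
    return seconds_used * hourly_rate / 3600


HOURLY_RATE = 4.00     # hypothetical price per compute-hour
JOB_SECONDS = 15 * 60  # one job that runs for 15 minutes
RUNS_PER_MONTH = 30

# Billed per second: you pay only for the 15 minutes each job runs.
per_second_total = RUNS_PER_MONTH * usage_cost(JOB_SECONDS, HOURLY_RATE)

# Comparison: the same jobs billed in whole-hour increments.
hourly_total = RUNS_PER_MONTH * HOURLY_RATE

print(per_second_total, hourly_total)  # 30.0 120.0
```

Under these assumed figures, short jobs cost a quarter as much when billed per second, which is the kind of difference worth capturing in a TCO comparison.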
For example, does the new data platform make your organiza-
tion more productive? Does it simplify access to key workloads,
break down data silos, and boost collaboration? Bringing your
data together brings your teams together. Calculate the impact of
standardizing on one centralized system versus struggling with
a patchwork of tools, apps, and data sets. Weigh both measurable,
quantifiable criteria and qualitative enhancements.
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.