Automating The Modern Data Warehouse
A Comprehensive Guide for Optimal Data Management
Steve Swoyer
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Automating the
Modern Data Warehouse, the cover image, and related trade dress are trademarks of
O’Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Oracle. See our statement
of editorial independence.
978-1-098-10283-8
Introduction
For the enterprise, cloud is two things: first, it's a force for potential business and IT transformation; second, it's a mechanism for reducing or controlling costs. Even though enterprises have tended to emphasize the latter at the expense of the former, the most compelling reason to migrate an on-premises data warehouse to the cloud has little to do with reducing costs or taking advantage of cloud's tax-friendly OpEx. In fact, a focus on cost misses what is new and transformative about cloud.
A better, more compelling rationale for cloud migration is to modernize and improve the data warehouse by substantively automating it.1 For one thing, automation of this type—and at this scale—has the potential to eliminate time-consuming, tedious, and rote tasks. For another, it frees up bright, imaginative, creative human technicians—DBAs, ETL developers, business intelligence (BI) developers, architects, etc.—to focus on more challenging problems.2 Another benefit of automation at this scale is that it permits IT and the lines of business to respond more rapidly to changing conditions: it becomes possible and cost-effective to accommodate one-off, seasonal, or unprecedented workloads and use cases.
This gets at the best rationale for migrating—one that is grounded in the logic of data warehouse modernization itself: to transform the business. A focus solely on cost savings sets an organization up to make poor choices in the future. A focus on transforming the business by modernizing the data warehouse, on the other hand, has a conative aspect: it asks, "What should we do differently, better?" This allows companies to better understand their data and improve data-driven decision making—boosting innovation, productivity, and efficiency—all while reducing complexity in the organization.
2 This is a problem that data warehouse automation (DWA) software has focused on for two decades now. Like DWA tools, PaaS data warehouse services expose a combination of guided, self-service capabilities and ML-powered, rule-driven facilities that automate the exploration, discovery, and profiling of data, as well as simplify (and, if practicable, automate) the creation of data models, business rules, metadata, documentation, etc. But the PaaS data warehouse exposes these features in the context of a fully managed service that, under its covers, automates the allocation and/or deallocation of compute, storage, and network resources, the expansion/contraction of data warehouse capacity, the scheduling and balancing of workloads, the remediation of performance problems, etc. This is the stuff of radical difference: DWA tools were originally conceived and designed for on-premises data warehouse systems that, with few exceptions, lacked the tight integration between hardware and software—to say nothing of the ability to use software to define and abstract hardware resources—that is definitive of cloud infrastructure. The PaaS data warehouse is a completely different animal.
Think about it: today, data warehouse architects, BI developers, and other skilled technicians inevitably squander some fixed proportion of their valuable time servicing the technical debt that encumbers all conventional data warehouse systems. DBAs and ETL developers inevitably squander some fixed proportion of their valuable time maintaining the database, data integration, and associated middleware services that constitute the data warehouse system proper. At the same time, an assortment of analytically inclined technicians—among them data scientists, machine learning (ML) engineers, and data engineers—are reliably stymied by the constraints imposed by conventional data warehouse architecture. They cannot get the data they need the way they need it when they need it.

The upshot is that these and other technicians expend valuable time and energy navigating cruft and manipulating software in order to manage or access data instead of doing stuff with data.
3 Some examples include the on-premises private cloud, the public cloud, the virtual private cloud, the high-performance computing (HPC) public cloud, the regional public cloud, and the on-premises public-private cloud.
data center; how, when, and why to exploit the multimodel and multiengine capabilities that many cloud PaaS data warehouse systems now expose; etc. One of this book's most important themes is not actually broken out into a discrete chapter or section but rather occurs and recurs throughout its pages: namely, that the elasticity that is a defining feature of cloud infrastructure makes it possible for businesses to prioritize the one-off, seasonal, or niche use cases that would otherwise be cost-prohibitive or (a function of both cost and human-resource constraints) impracticable in the on-premises enterprise.
CHAPTER 1
Finding Signals in the Midst of Noise
architecture. The answer to this query already "lives" in the data that populates the warehouse's fact and dimension tables; the data warehouse performs operations that join facts to dimensions in real time, creating an analytic model that "answers" the query.

This example also gets at something else: in data warehouse architecture, the role or function of analytics is to answer questions. But today's cutting-edge analytic practices invert this arrangement: they seek to ask questions. What if Z1, Z2, and Z3 attributes are unknown? What if, in fact, their corresponding dimensions don't even exist in the data warehouse? No conceivable star or snowflake schema can link facts to dimensions that do not yet exist. So the business analyst or data scientist must go to source data—assuming it is available—to answer these and similar questions.
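To make this concrete, here is a minimal sketch of the join-at-query-time pattern described above, with an in-memory SQLite database standing in for the warehouse; the schema, table names, and data are invented for illustration.

    # A minimal sketch of how a star schema "answers" a question: an in-memory
    # SQLite database stands in for the warehouse. All names are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, quarter TEXT);
        CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE sales_fact (
            date_key INTEGER REFERENCES date_dim,
            product_key INTEGER REFERENCES product_dim,
            amount REAL
        );
        INSERT INTO date_dim VALUES (1, '2021-Q1'), (2, '2021-Q2');
        INSERT INTO product_dim VALUES (10, 'widgets'), (11, 'gadgets');
        INSERT INTO sales_fact VALUES (1, 10, 500.0), (2, 10, 750.0), (2, 11, 125.0);
    """)

    # Joining facts to dimensions at query time "creates" the analytic answer:
    for row in conn.execute("""
        SELECT d.quarter, p.category, SUM(f.amount)
        FROM sales_fact f
        JOIN date_dim d ON f.date_key = d.date_key
        JOIN product_dim p ON f.product_key = p.product_key
        GROUP BY d.quarter, p.category
    """):
        print(row)

    # The inverse question is the problem: if the attribute of interest lives
    # in no dimension table, no join can surface it, and the analyst must go
    # back to source data.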
Data management is not an end unto itself. That is, the management of data is always adjunct to some other purpose, some other goal, some other use. One of the most important uses of data is in support of business decision making, but decision making is neither a monolithic domain—that is, a problem area in which the stakes, priorities, requirements, and expectations are uniform or reproducible across all relevant instances and applications—nor the sole domain in which data gets used. For example, data is generated and consumed by operational applications that span business processes and that support day-to-day business activities. This type of usage has less to do with decision making than with ensuring the uninterrupted and (with respect to core business processes and workflows) unmediated operation of the business itself. In many if not most cases, this use case does not have an analytic component: data is generated by an application or service, written to a database or serialized in a data structure, requested by and exchanged with other applications and services, and so on.
Another important use for data has to do with monitoring—or, in emerging software architectures, observing—the state and performance of the business itself, along with its constitutive processes, IT systems, and other assets. This use case is usually married to a related task—that of analyzing and interpreting problems or anomalies. Sometimes decision making attends the analysis and interpretation of a problem, sometimes not. But the material fact is that decision support, while an especially critical data use case, is nevertheless just one use case among many.
With this in mind, it is worth looking at what data management does in order to get a sense for what needs to change—not only with respect to the data warehouse in the cloud, but with respect to data management in a hybrid environment that spans both the on-premises enterprise and the cloud.
1 Ludwik A. Teclaff, “What You Have Always Wanted to Know About Riparian Rights,
but Were Afraid to Ask,” Natural Resources Journal 30, no. 1 (Winter 1972): 41.
gets implemented in the cloud first. This is why the cloud is already
the locus of innovation in software development. This is as true of
the cloud data warehouse as it is of any other cloud service.
ML and AI Take to (and Take Off in) the Cloud
The analytic practices of today are a lot more specialized. The term "practices" is significant in this context. Twenty years ago, analytics was a comparatively monolithic practice area: data science did not really exist, ML was a niche area, and even though almost all large organizations employed statisticians and owned licenses for commercial statistical analysis software, the most common analytical practices (a) still worked with relational data and (b) focused on (or were focused by) the data warehouse. You just could not point to the diversity of roles (e.g., data scientist, ML engineer, data engineer, AI engineer, etc.) that is common today. Cutting-edge use cases in the mid-1990s and early 2000s involved the integration of data from geographical information systems (GIS) with customer, product, and spatial data in the warehouse. (The retail vertical was out in front of this.) Data mining was commonly used, and organizations in some verticals (finance, retail) experimented with statistical methods to predict fraud and customer churn. Similarly, rudimentary work on AI—e.g., engines that were used to power rule-driven automation—was becoming common. Organizations were beginning to combine the insights of predictive analytics (i.e., the application of ML functions to data problems) with rule-driven automation to monitor the behaviors of their customers and to trigger actions on the basis of fraud, likely customer churn, or the unavailability of certain products. (The retail sector, again, was out in front with this class of "next-best-offer" analytics.) By the early 2000s, predictive-analytic use cases had become common enough that GIS and certain kinds of ML functions started appearing in the database engines at the core of the data warehouse itself.
Until very recently, then, the data warehouse and its tools effectively circumscribed the domain of analytics in the enterprise. Analytics consumed relational data that lived in the warehouse—or which was discarded by the ETL processes that prepared data for use with the warehouse. Analytic practices focused on and, in most cases, were hosted by the data warehouse, too. As a general rule, analytic roles were also less diverse, sorting into two buckets: business analysts and statisticians. There was a role for self-service, but it mostly focused on spreadsheets; it was unlike today's BI discovery practice.

This is no longer the case. Analytics has diversified. Concomitant with this, the domain of enterprise analytics has escaped or outstripped
aligns with the goal of improving the data warehouse, but (2) is, ipso
facto, an opportunity to reposition the warehouse for a new role and
new use cases.
1 For all practical purposes, the focus of ETL itself has already shifted to cloud—or, more
precisely, to the cloud data lake. Most commercial ETL vendors also market cloud
offerings that support more than one public cloud provider and specific cloud SaaS and
PaaS offerings. Commercial ETL vendors specifically target the cloud data lake use
case, too, typically via provider-specific offerings. And most public cloud providers
offer an array of data integration services suitable for different kinds of use cases or for
users in a breadth of roles or practices—from software developers to ETL developers
and architects to self-service users of different types.
Data Virtualization
Formerly known as "data federation," data virtualization is used to facilitate a single view of data across all cloud and on-premises contexts. Data virtualization exposes a logical, unified view of data that, in the physical world, is cobbled together from disparate, geographically distributed data sources. A decade ago, data virtualization was commonly used to create canonical "business" and "application" views, which could be thought of as analogous to virtual data marts. In the era of the data lake—or of multiple data lakes—data virtualization is used to present canonical views of semistructured and polystructured data, too. It is a proven, useful technology for knitting together disparate data sources.
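To illustrate the basic idea (and only the idea; commercial data virtualization engines add query optimization, caching, and far more sophisticated pushdown), here is a toy sketch in which one logical view spans two physically separate sources. All names are hypothetical.

    # A toy illustration of the data virtualization idea: one logical view over
    # two physically separate sources. Real products add optimizers, caching,
    # and pushdown logic; the names here are hypothetical.
    import sqlite3

    def make_source(rows):
        """Stand up a throwaway SQLite 'source system' with a customers table."""
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
        db.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return db

    crm = make_source([(1, "EMEA"), (2, "APAC")])       # e.g., on-premises CRM
    cloud_app = make_source([(3, "AMER"), (4, "EMEA")]) # e.g., a cloud SaaS app

    def virtual_customers(predicate_sql, params=()):
        """Push the same filter down to each source, then merge the results."""
        query = f"SELECT id, region FROM customers WHERE {predicate_sql}"
        results = []
        for source in (crm, cloud_app):
            results.extend(source.execute(query, params).fetchall())
        return results

    # One logical query; two physical sources answer it.
    print(virtual_customers("region = ?", ("EMEA",)))  # [(1, 'EMEA'), (4, 'EMEA')]

The design point to notice is that the filter travels to the data rather than the data traveling to the filter, which is the data-movement-minimizing behavior described above.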
With respect to the data warehousing use case, data virtualization is much less useful for ad hoc query than for building models (views) that support common or recurring queries. Because data virtualization distributes queries across multiple (often geographically distributed) contexts, it gives priority to minimizing data movement: when possible, its engine pushes data processing up to source systems. (Over time, data is also cached in the data virtualization tier.) Data virtualization exposes a single interface for SQL query; under the covers, however, it uses smart technology to accelerate query processing.
Data Catalogs
Data catalogs permit users to discover, profile, and classify data,
regardless of context. Even though data catalog technologies are typ‐
ically deployed and managed by IT, they are used by self-service
power users: for example, BI discovery users, business analysts, data
scientists, ML engineers. Data catalog technologies help users
discover, label, and procure useful data for analysis. More advanced
data catalog services permit users to manipulate data, even automat‐
ing the profiling and transformation of data for analysis. These
services also incorporate rich metadata management features, along
with (often less rich) capabilities for tracking and capturing data
lineage.
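A rough approximation of what a catalog's automated profiler computes per column might look like the sketch below; real services go much further, but the shape of the output (inferred type, null count, cardinality) is representative. The data and names are invented.

    # A bare-bones approximation of what a data catalog's profiler computes
    # for each column: inferred type, null count, and cardinality.
    from collections import Counter

    def profile(rows, columns):
        """rows: list of tuples; columns: list of column names."""
        report = {}
        for i, name in enumerate(columns):
            values = [r[i] for r in rows]
            non_null = [v for v in values if v is not None]
            types = Counter(type(v).__name__ for v in non_null)
            report[name] = {
                "inferred_type": types.most_common(1)[0][0] if types else "unknown",
                "null_count": len(values) - len(non_null),
                "distinct_values": len(set(non_null)),
            }
        return report

    rows = [(1, "EMEA", 500.0), (2, "APAC", None), (3, "EMEA", 125.0)]
    for col, stats in profile(rows, ["id", "region", "amount"]).items():
        print(col, stats)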
Data catalogs are frequently used in conjunction with data virtualization: an analyst, BI discoverer, or other skilled person discovers useful data in one or more contexts—a cloud data warehouse service; a zone in a data lake; a cloud storage service—and works with IT, business subject-matter experts, and data modelers to expose it via data virtualization. Several PaaS vendors offer more or less useful versions of these technologies. (In the IaaS cloud, data virtualization and data cataloging are either included with the core data warehouse or licensed separately.) Their usefulness is not specific to just the data warehouse: they knit together data that is scattered among disparate contexts.
Graph Databases
The slice of the business world that is captured by traditional BI analytics is relatively narrow. It ignores the data that lives in time-series databases, hierarchical data stores, network databases, document databases (and similar content management systems), to say nothing of the data lake. It ignores the wild profusion of cloud data sources. But deriving context (i.e., linking text-analytic data with time-series data with attribute-value-pair data with relational data) is a hard problem. It involves using a technique known as graph traversal, which is the remit of so-called graph databases.
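Graph traversal itself is easy to illustrate. The sketch below walks an invented adjacency list breadth-first, linking heterogeneous records around a single customer; this is the primitive operation that graph databases optimize at scale, across billions of edges.

    # Breadth-first traversal over a tiny property-graph stand-in: the
    # primitive operation that graph databases optimize. The graph is invented.
    from collections import deque

    # Each edge links heterogeneous records: a customer, a support ticket
    # (document store), a sensor reading (time series), a sales row (relational).
    graph = {
        "customer:42": ["ticket:7", "order:1001"],
        "ticket:7": ["device:abc"],
        "device:abc": ["reading:2021-06-01T12:00"],
        "order:1001": [],
        "reading:2021-06-01T12:00": [],
    }

    def traverse(start):
        """Return every node reachable from `start`, in breadth-first order."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return order

    # Linking relational, document, and time-series context around one customer:
    print(traverse("customer:42"))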
Hybrid Cloud
A hybrid deployment is a data warehouse that spans two distinct
contexts—typically, the on-premises enterprise and the cloud. For
our purposes, there are two kinds of hybrid configurations:
On-premises nonvirtualized data warehouse <> cloud data warehouse
This is a relatively common scenario. One example of this is
when an organization migrates its BC/DR, test-dev, and analytic
discovery workloads to the cloud data warehouse and keeps its
production workloads in a nonvirtualized on-premises system.
On-premises virtualized data warehouse <> cloud data warehouse
This is less common but becoming more so—especially now that most PaaS providers also support on-premises private cloud deployments. Hybrid deployments of this type might focus on hosting certain kinds of workloads (e.g., those that involve sensitive data, those that are especially demanding) in the on-premises virtual data warehouse.
Same-vendor to same-vendor hybrid cloud deployments can be advantageous for several reasons, not least because of portability between cloud contexts: SQL, indexes, procedural code, UDFs, database-specific schemas, and similar assets will often (but not always) transfer without issue from one context to another. Most database-specific skills should transfer, too. For these reasons, providers like to tout an ability to shift workloads between contexts in a homogeneous hybrid cloud.
Multicloud
Multicloud means just what it sounds like: rather than sourcing its cloud services from a single provider, the organization distributes its data warehouse across two or more cloud providers. It is necessary to distinguish between a multicloud strategy and tactical use of multicloud.

Multicloud in its strategic dimension can mix elements of risk management—namely, an emphasis on hedging against the risk of service provider lock-in—with an à la carte shopping experience (different cloud services have different strengths) and with BC/DR planning, too.
In practice, a mix of cloud and on-premises data warehouse systems is not uncommon. This is less an example of a hybrid-multicloud deployment—in which an on-premises data warehouse and two or more cloud systems coexist inter pares—than a hybrid deployment in which the organization makes tactical use of at least one other cloud service. So, for example, an enterprise might host test-dev and BC/DR in one cloud service (usually identical to that of its on-premises provider) and one or more analytic sandboxes (used to support its business analysts) in another. Some enterprises might host their data lake and their data warehouse in one cloud but support several different business use cases in sandboxes in another cloud. Some cloud providers charge for capacity, some for use, some for both.
Selecting a Provider
An organization that has a large on-premises data warehouse system will tread carefully as it migrates this system to a cloud service. "Large" in this context means the data warehouse hosts a mix of different workloads, supports a large number of concurrent users,
These are just a few of the questions that factor into the calculus of
choosing a cloud provider.
An organization with specialized needs should be able to quickly
narrow down its list of providers.
There are a few points worth emphasizing, however:
1 Some cloud services achieve something like an MPP topology by employing a shared
storage substrate and independent compute nodes. That is, all compute nodes share
access to all of the tables in the database, but each processes workloads independently
of the others. This scheme achieves MPP-like compute performance.
2 In most IaaS data warehouse services, as in the on-premises enterprise, the subscriber owns the responsibility of maintaining the virtual operating system, database, and middleware software.
The data warehouse in the cloud does not constitute a radical break with its on-premises predecessor. In the cloud, the role of the data warehouse is the same as it ever was: it remains the authoritative system of record, the ground and guarantor of the veracity of all of the data that is potential grist for business decision making. Even prior to the emergence of data lakes and data science, one critical role of the data warehouse was that of a research lab in which the business performed experiments on itself.
The data-warehouse-in-the-cloud, by contrast, is practically unlimited in terms of its size and prolificacy. It is multiparous—that is, capable of being recreated or replicated as often as needed—in much the same way that the on-premises data warehouse is not. Subscribers can draw upon the reserve capacity of the hyperscale cloud to create very large single-instance data warehouse configurations of dozens or even hundreds of terabytes. They can create, pause, resume, and/or destroy virtual data warehouse instances as needed; better still, instances can be created (or destroyed) in response to programmatic events, such as API calls, or triggered by rules engines. The data warehouse in the cloud is not perfect. As a general rule, on-premises data warehouse systems will require more compute, more memory, and more storage resources if they are to be successfully transplanted into the cloud context. How much more is a function of trial and error.
The cloud data warehouse of today is more performant than that of, say, half a decade ago. A more recent innovation is that of the on-premises public cloud: a PaaS service that is deployed either in a local hosting "zone" (i.e., a local data center) or in the customer's own data center. In this scenario, the cloud provider, rather than the customer, owns and manages the service. Cloud-at-customer gives subscribers another (albeit premium) option to address stubborn performance issues. This has everything to do with the inexorable economics of cloud: cloud is a force not just for driving down costs but for doing so while also improving performance.
In the cloud context, a data scientist can easily design data pipelines that (1) extract useful data from the production data warehouse, be it on- or off-premises; (2) call RESTful APIs to spin up one or more virtual data warehouse instances; (3) load the extracted data into a virtual data warehouse instance; (4) perform one or more specific operations on this data; (5) integrate it with data prepared in another context; (6) move the resulting dataset to a destination repository; and, lastly, (7) destroy the virtual data warehouse instance. This scenario is possible in the on-premises enterprise, to be sure; however, owing to physical, economic, and logistical constraints, it is difficult to implement in practice.
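A sketch of what such a pipeline might look like follows. Every endpoint, parameter, and helper name is hypothetical (real provider APIs differ), but the seven-step shape is the point.

    # A hypothetical sketch of the seven-step ephemeral-warehouse pipeline
    # described above. Every endpoint, parameter, and helper name is invented;
    # consult your provider's API documentation for the real calls.
    import requests

    API = "https://cloud.example.com/api/v1"       # hypothetical provider API
    HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth

    def extract_from_production(sql):
        # Placeholder: in practice this would query the production system.
        return []

    def run_pipeline():
        # (1) Extract useful data from the production warehouse (stubbed here).
        dataset = extract_from_production("SELECT * FROM sales WHERE ...")

        # (2) Spin up an ephemeral virtual warehouse instance via REST.
        resp = requests.post(f"{API}/warehouses", headers=HEADERS,
                             json={"size": "small", "ttl_minutes": 60})
        instance = resp.json()["id"]
        try:
            # (3) Load the extracted data into the new instance.
            requests.put(f"{API}/warehouses/{instance}/tables/staging",
                         headers=HEADERS, json={"rows": dataset})
            # (4) Perform one or more operations on the data.
            requests.post(f"{API}/warehouses/{instance}/query", headers=HEADERS,
                          json={"sql": "CREATE TABLE scored AS SELECT ..."})
            # (5) Integrate with data prepared in another context, then
            # (6) move the result to a destination repository (object storage, say).
            requests.post(f"{API}/warehouses/{instance}/export", headers=HEADERS,
                          json={"table": "scored", "target": "s3://bucket/path"})
        finally:
            # (7) Destroy the instance so it stops accruing charges.
            requests.delete(f"{API}/warehouses/{instance}", headers=HEADERS)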
Extending the data warehouse to the cloud helps simplify disaster recovery/business continuity (DR/BC) planning. Shifting DR/BC from the on-premises data center (or from a leased DR space) to the cloud is one of the most popular migration scenarios: low-hanging fruit, as it were. But the data warehouse in the cloud also transforms an organization's security posture. Cloud service providers—and PaaS providers, especially—tend to be better about enforcing commonsense security safeguards, such as on-by-default data encryption, or password-complexity and aging requirements. All things being equal, the data warehouse in the cloud is a more secure platform than its on-premises kith.
Because the data warehouse in the cloud is still a site of rapid transformation, organizations should not expect to move all of their on-premises workloads to the cloud. Right now, it may be neither cost-effective nor possible to do so. In the overwhelming majority of cases, subscribers will need to allocate additional resources to achieve performance that is at parity with their on-premises data warehouse systems.
The economics of cloud make data warehousing relatively cost-effective, but not all cloud data warehouses are alike. Some are more adept at scaling and managing challenging workloads, as in the case of a data warehouse system that hosts a mix of workloads of different kinds, or a large number of concurrent users. Some cloud data warehouses are "cloudier"—more elastic; more flexible; quicker to start, pause, and resume—than others. These factors also impact performance and availability.
An organization should thoroughly test-drive a cloud data warehouse service prior to inking a contract with a provider. Organizations that have large data warehouse systems should aim to test against a complete copy of their data, if possible; they should also evaluate other factors, such as data loading performance, the reliability and performance of their ETL jobs, even the speed of database inserts. If applicable, they should attempt to account for the cost of workloads that involve data egress to other cloud services—or to on-premises resources. But due diligence of this kind will save adopters time, money, and frustration in the long run.
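Even a factor as narrow as insert speed can be checked with a short harness. In the sketch below, SQLite stands in for the candidate service; in a real evaluation, you would swap in a DB-API connection to the warehouse under test, and the table name is illustrative.

    # A minimal harness for one evaluation factor mentioned above: raw insert
    # speed. SQLite stands in for the candidate warehouse; swap in any DB-API
    # connection to the service under test.
    import sqlite3
    import time

    def time_bulk_insert(conn, n_rows=100_000, batch=10_000):
        """Return rows-per-second throughput for batched inserts."""
        conn.execute("CREATE TABLE IF NOT EXISTS load_test (id INTEGER, val TEXT)")
        rows = [(i, f"value-{i}") for i in range(n_rows)]
        start = time.perf_counter()
        for i in range(0, n_rows, batch):
            conn.executemany("INSERT INTO load_test VALUES (?, ?)", rows[i:i + batch])
            conn.commit()
        elapsed = time.perf_counter() - start
        return n_rows / elapsed

    conn = sqlite3.connect(":memory:")
    print(f"{time_bulk_insert(conn):,.0f} rows/sec")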
Final Thoughts
The data warehouse in the cloud is transformative in another important way. Depending on the provider, the cloud data warehouse is logically adjacent to (or "lives" in the same context as) cloud-based ML, AI, data integration, and developer-oriented services. These services are already quite popular—not only with ML and AI engineers but with software developers, too. In its most recent Magic Quadrant for Cloud AI Developer Services report,1 for example, market-watcher Gartner projected that—by 2023—40% of development teams will use AI services to incorporate AI capabilities into their apps. Five years from now, half of all data science activities will be automated by AI, according to Gartner.
Today, this is a vision, especially when the data warehouse remains tethered to its on-premises launch pad. But transplanting the data warehouse from the on-premises data center into the cloud is an important, a conative, first step. The cloud data warehouse is a new home for old workloads, yes; more important, it is a site for and a focus of new kinds of workloads and new types of analytic development, allowing you to maximize the value from data, understand it better, and drive efficiencies and innovation.
About the Author
Steve Swoyer is a writer, researcher, and analyst with more than 20 years of experience. His research focuses on business intelligence, data warehousing, and analytics, as well as edge issues in data science, machine learning, and artificial intelligence. Steve enjoys researching and writing about emerging trends and potentially transformative ideas in systems design and architecture. As an analyst and researcher, Steve explores trends in distributed systems architecture, cloud native architecture, and other emergent subject areas. He is a recovering philosopher with an abiding interest in ethics, philosophy of science, and the history of ideas.