Oreilly Report High Performance Data Architectures
High-Performance
Data Architectures
How to Maximize Your Business
with a Cloud-Based Database
Joe McKendrick
& Ed Huang
REPORT
978-1-098-15052-5
Table of Contents
4. The Impact of Artificial Intelligence on Databases
    AI for Better Database Performance
    Databases as the Lifeblood of AI
    What Generative AI Brings to the Table
    The New Landscape of SQL Development with AI Innovation
CHAPTER 1
The Database Keeps Evolving, but Can It Keep Up?
There was a time when database choices were fairly limited. Enterprises could work with flat-file databases or, eventually, more adaptable relational databases. These databases worked well for basic transactions and internal corporate operations, but users had to rely on IT teams to craft and generate reports about the state of their business. That all changed about a decade ago, with an explosion of new types of databases, built on the web and cloud, that put more power in the hands of end users and gave database managers more powerful tools to serve fast-changing business needs.
Yet obstacles remain: data silos that keep information from effectively reaching the users and applications that need it, data quality and integrity issues, a lag in adapting to new business realities, scalability issues in an era when data and associated applications are exploding, security issues, privacy requirements, and a lack of talent to maintain data environments.
Businesses can’t afford to sit still, and neither can their databases. Database technology needs to be constantly refreshed. Today’s data teams are under immense pressure, tasked with delivering “data-driven” capabilities to their enterprises. At the same time, they have had a difficult time linking the components of the modern data architecture to the needs of the business. They need to do more with less: smaller budgets, fewer resources, and ever-tighter deadlines. Today’s solutions may be faster and more flexible, but more is needed to align them with the needs of the business.
Next, we’ll take a look at a brief history of how that evolution has unfolded.
NoSQL databases were designed for applications using unstructured or semistructured data, such as text or images, which were not well supported by RDBMSs of the time. Storing such data in a relational database typically meant forcing a complex, nested JSON-style document into a relational data model. NoSQL databases were marketed by their creators as lighter, quicker, and easier to stand up than heavyweight RDBMSs, intended to enable data managers and developers to put applications into production at a faster rate. In addition, these databases were built for web and cloud architectures.
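As an illustration of the document model these databases support, here is a minimal sketch (with hypothetical field names) of a nested record that a document store keeps whole, but that a relational schema would normalize into several joined tables:

```python
# A hypothetical product-catalog record as a single JSON-style document.
# In an RDBMS this would typically be split into joined tables
# (products, variants, reviews); a document store keeps it in one place.
product = {
    "sku": "CAM-1001",
    "name": "Trail Camera",
    "variants": [
        {"color": "green", "price": 129.00},
        {"color": "gray", "price": 119.00},
    ],
    "reviews": [
        {"user": "ava", "rating": 5, "text": "Great battery life."},
    ],
}

# Nested fields are addressed directly, with no joins.
first_variant_price = product["variants"][0]["price"]
```

The same lookup in a normalized relational schema would require joining at least two tables, which is the modeling overhead the document model avoids.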
NoSQL databases come in many flavors, each suited to a particular task: document-oriented, key-value, graph, column-family, and multimodel databases. Each flavor has its own advantages, from greater flexibility in supporting data models to greater visibility into those models. For example, MongoDB gained traction because it provided developers with a simpler abstraction for object data. Others, such as Cassandra and DynamoDB, emerged because traditional relational databases struggled to scale across multiple nodes.
For example, knowledge graphs are designed to capture rich relationships and contextual information. A knowledge graph can map a social network, with relationships and attributes that can be used to derive insights, make inferences, and perform complex queries on the data.
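A toy sketch of the idea, using a hypothetical social graph held in a plain Python dictionary rather than a real graph database, shows how labeled relationships support simple inference:

```python
# A toy social knowledge graph as an adjacency map: each edge carries a
# relationship label, illustrating how graph stores capture context.
graph = {
    "alice": [("knows", "bob"), ("works_with", "carol")],
    "bob": [("knows", "carol")],
    "carol": [],
}

def related(person, label):
    """Return everyone connected to `person` by edges with `label`."""
    return [target for lbl, target in graph.get(person, []) if lbl == label]

def second_degree(person, label):
    """Simple inference: contacts-of-contacts via the same label."""
    out = set()
    for mid in related(person, label):
        out.update(related(mid, label))
    return out
```

Here `second_degree("alice", "knows")` infers that Alice is two “knows” hops from Carol; a production graph database generalizes this kind of traversal to arbitrary depth and much larger data.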
NoSQL databases brought data and insights closer to business decision makers and analysts, helping them to contextualize and frame their available data assets. NoSQL databases have the following characteristics:
Advantages
Easy to deploy rapidly; avoids vendor lock-in; few underlying IT resources required; supports unstructured data; provides visual understanding of data relationships
Challenges
Integrating multiple databases for multiple requirements; limited data consistency; may not support the complex queries, joins, or transactions available in SQL databases
CHAPTER 2
Choosing the Right Database for Your Enterprise
Enterprises adopt new database technologies to improve performance, increase efficiency, and gain a competitive edge. A streamlined data architecture with a cloud native database will help deliver competitive differentiation.
An additional consideration is the CAP theorem: a distributed database can offer consistency, availability, and partition tolerance, but not all three at one time. Thus, data managers often face choices that enable two of these qualities while forgoing the third; in practice, because network partitions cannot be ruled out, the real trade-off is usually between consistency and availability.
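One common way this trade-off surfaces in distributed databases is quorum-based replication. The sketch below is not any particular product’s implementation; it simply shows how requiring a quorum preserves consistency at the cost of availability during a partition:

```python
# A toy quorum check. With N replicas, requiring read quorum R and
# write quorum W such that R + W > N guarantees that reads overlap
# writes (consistency), but a partition that isolates too many
# replicas makes operations unavailable -- the CAP trade-off.
def quorum_ok(reachable, quorum):
    """An operation proceeds only if a quorum of replicas is reachable."""
    return reachable >= quorum

N = 3   # total replicas
W = 2   # write quorum
R = 2   # read quorum; R + W > N => consistent reads

# All replicas reachable: writes succeed.
healthy = quorum_ok(reachable=3, quorum=W)

# A partition leaves only 1 replica reachable: the write must be
# refused (sacrificing availability) to preserve consistency.
partitioned = quorum_ok(reachable=1, quorum=W)
```

Choosing W = 1 instead would keep writes available during the partition, at the price of replicas diverging, which is the other side of the same choice.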
Data teams are now trying to build everything through ETL pipelines and AI and machine learning (ML) pipelines, which raise governance and compliance issues. This requires building a data lake (or, lately, a data mesh or fabric) that also has implications for downstream and upstream processes across the business.
There’s no question that the amount of data now moving through, around, and between organizations can be overwhelming, especially as companies accelerate their digital engagements. This requires a highly agile data environment, and emerging high-performance data architecture solutions, supported by distributed SQL databases with HTAP capabilities, offer such environments that can form the basis of a well-functioning, data-driven enterprise.
A high-performance data architecture enables global, end-to-end data management through converged data platforms. Such an architecture supports the qualities described in the sections that follow.
What a High-Performance Data Architecture Looks Like
A range of architectural patterns, depending on business requirements and resources, can be established that support high-performance data architectures, from a single database to clusters of databases and servers. All can incorporate the latest database technologies and are increasingly leveraged with distributed SQL databases containing HTAP capabilities. While such architectures vary, the goal is to configure data environments to act and deliver as a single database, one with high levels of scalability, resilience and high availability, security, compliance, peak performance, and integration with existing environments.
These key objectives are the essence of a high-performance database
architecture:
Scalability on demand
In today’s rapidly evolving business environments, scalability is a must. Requirements and user workloads may change on a day-to-day, or even hour-to-hour, basis. Distributed SQL databases containing HTAP capabilities, which incorporate the flexibility to access nodes across the network as needed, are ideal for meeting dynamic scalability requirements.
Enhanced security and compliance
As businesses expand their digital footprints, so does the size of their threat surface. Distributed SQL databases containing HTAP capabilities, designed and built for the cloud and digital era, provide an additional layer of security through encryption, access control, and audit trails. Also important is the ability to meet regional or jurisdictional data requirements intended to assure security and privacy.
Resiliency and high availability
A high-performance data architecture needs to guarantee that data and associated analytics will be available to users on a 24/7 basis. Even if there is unexpected downtime within the data environment, end users should experience little more than a brief hiccup. Distributed SQL databases containing HTAP capabilities employ multiple nodes, thereby ensuring that workloads are picked up instantaneously by additional nodes in a cluster in the event of node failure or performance glitches.
Peak performance
Capabilities such as dynamic workload load balancing, caching, and real-time or near real-time processing are key to maintaining performance and a superior user or customer experience. As noted previously, the multiple nodes seen with distributed SQL databases containing HTAP capabilities ensure not only continuous availability but also distributed workloads. Additional features seen with such advanced database environments include in-memory processing, which reduces or eliminates round trips to disk storage, and caching, which reduces demand on primary databases.
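The caching idea can be sketched in a few lines. This is a simplified read-through cache, with in-memory dictionaries standing in for both the cache and the primary database:

```python
# A minimal read-through cache sketch: repeated lookups are served from
# memory, reducing load on the primary database (simulated here by a
# dict plus a call counter).
primary_db = {"user:1": {"name": "Ada"}}
db_reads = 0   # counts round trips to the primary store
cache = {}

def get(key):
    global db_reads
    if key in cache:          # cache hit: no round trip to the primary
        return cache[key]
    db_reads += 1             # cache miss: read from the primary store
    value = primary_db.get(key)
    cache[key] = value
    return value

get("user:1")
get("user:1")
# Two reads, but only one trip to the primary database.
```

Production caches add eviction and invalidation policies, but the core effect is the same: repeated reads stop touching the primary database.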
Single Database
While a single database may seem limiting in the long term for enterprise-scale applications, it is still capable of participating as an engaged citizen within a high-performance data architecture. A distributed SQL database with HTAP capabilities enables proactive database design, featuring caching and optimized SQL queries. A single database has the following characteristics:
Advantages
Centralized control; simplicity; greater data integration; tighter
security
Challenges
Scalability; performance slowdowns; single point of failure;
vendor lock-in; data latency; limited flexibility
Today’s organizations run on data environments that serve a variety of requirements and data types, so operations supported by single databases are increasingly rare. But smaller businesses and startups that may rely on single instances can still strive for a high-performance data architecture, as shown in Figure 3-1.
Figure 3-1. A single database architecture consists of multiple components within a single, centralized platform that correspond to a given application; given the data complexity in today’s data landscape, this approach has become less common
Shared-Disk Architecture
In a shared-disk architecture, data is stored within a common storage array that is accessed by multiple database servers, helping to ensure always-up environments. In this architecture, database environments can scale as required, without each server carrying the overhead of its own physical data storage. A shared-disk architecture has the following characteristics:
Advantages
Data sharing; greater data availability; centralized control; cost control; simplicity; greater data integration; data consolidation; scalability
Challenges
Performance slowdowns due to demand from multiple applications; network issues; single point of failure at the storage level; vendor lock-in; data latency; limited flexibility; complexity
For decades, many organizations have purchased and maintained disk arrays that often were run by specialized storage administrators at centralized locations. This hub-and-spoke approach to data storage is still viable, locks down security, and helps minimize data silos. Increasingly, these approaches are being enhanced by cloud services, as shown in Figure 3-2.
Shared-Nothing Architecture
The preceding architectures, single database and shared-disk, have served enterprise requirements for decades, but a masterless and fully distributed architecture is most appropriate in today’s world. That’s where more distributed architectures, built on shared-nothing, hybrid row/columnar storage, and in-memory databases, will play more pertinent roles.
A shared-nothing database environment consists of data partitioned across independent nodes, each of which can pick up workloads if another fails. These separate nodes function as their own systems, with their own storage and processors. A shared-nothing architecture has the following characteristics:
Advantages
Scalability; flexibility; high availability; parallel processing; load balancing; cost control; data security
Challenges
Network issues; greater complexity; loss of centralized control; greater overhead; maintaining data consistency across nodes; performance slowdowns in favor of maintaining consistency
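The partitioning at the heart of a shared-nothing design can be sketched as simple hash placement; real systems add replication and rebalancing, and the node names here are hypothetical:

```python
import hashlib

# A sketch of shared-nothing partitioning: each key is hashed to one of
# several independent nodes, and each node owns its own slice of data.
NODES = ["node-a", "node-b", "node-c"]

def owner(key):
    """Deterministically map a key to the node that stores it."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every key maps to exactly one node; adding capacity means adding
# nodes and rebalancing partitions, not scaling up a shared server.
placement = {k: owner(k) for k in ["order:17", "order:18", "order:19"]}
```

Because placement is a pure function of the key, any node can route a request to the right owner without consulting a central coordinator, which is what makes the architecture masterless.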
A shared-nothing architecture for databases assures great flexibility for growth, and accommodates new data configurations and technologies coming down the road, as shown in Figure 3-3.
Figure 3-3. In a shared-nothing architecture, all nodes, which include databases and storage, operate independently of one another; this serves to ensure that applications continue to function in the event of node failure, as functions can be picked up by another node in the architecture
In-Memory Processing
In-memory processing can deliver subsecond insights. Rather than applications accessing data from disk drives, relevant data is loaded into, and processed within, computer random-access memory. In-memory processing has the following characteristics:
Advantages
Increased performance; analytics capabilities; scalability; flexibility; faster time to insights
Challenges
Data durability; higher costs; less capacity; system overhead; scalability limitations; complexity
In-memory is an essential feature that is now being built into today’s generation of data solutions, which require subsecond processing for real-time requirements, from addressing customer inquiries to managing machine production, as shown in Figure 3-5.
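A minimal sketch of the in-memory pattern: the working set is read from (simulated) disk once, and subsequent analytical queries run entirely against RAM. The sensor data here is made up for illustration:

```python
import csv
import io
import statistics

# Stand-in for a table on disk (hypothetical sensor readings).
disk_file = io.StringIO("sensor,value\na,10\nb,30\na,20\n")

# One-time load of the working set into memory.
rows = list(csv.DictReader(disk_file))

# Subsequent queries run entirely against RAM, with no further disk
# round trips -- the source of in-memory processing's subsecond speed.
values = [int(r["value"]) for r in rows if r["sensor"] == "a"]
mean_a = statistics.mean(values)
```

The trade-off noted above applies: everything queried this way must fit in (comparatively expensive) memory, and durability requires writing changes back to persistent storage.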
labor-intensive activity: it involves cleaning, extracting, integrating,
cataloging, labeling, and organizing data, and defining and perform‐
ing the many data-related tasks that often lead to frustration among
both data scientists and employees without ‘data’ in their titles.”
The challenge for today’s data managers is to deliver enhanced data capabilities with strained or relatively stagnant budgets. Organizations are sourcing and ingesting more data than ever before, now well into the multiterabyte range, and it needs to be available, on demand, to business users, data scientists, and mission-critical applications. AI changes the equation for today’s databases, helping to autonomously enhance database query development and performance, as well as to manage the day-to-day operation, provisioning, and security of databases.
Emerging methodologies that promote the use of AI in database management include AIOps, in which AI is applied to streamline and automate data operations; DataOps, the application of intelligent collaboration and automation to data pipelines; and DataSecOps, which involves data security operations on cloud native databases.
Applying AI to database functions will free up data engineers, architects, administrators, and scientists to concentrate on bigger and more meaningful tasks beyond day-to-day maintenance, such as digital transformation and innovation, which are essential to operating in today’s hypercompetitive environment.