
Compliments of
High-Performance
Data Architectures
How to Maximize Your Business
with a Cloud-Based Database

Joe McKendrick
& Ed Huang



High-Performance Data Architectures
by Joe McKendrick and Ed Huang
Copyright © 2023 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol,
CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles (http://oreilly.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
[email protected].

Acquisitions Editor: Aaron Black
Development Editor: Jill Leonard
Production Editor: Elizabeth Faerm
Copyeditor: nSight, Inc.
Proofreader: Elizabeth Faerm
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Kate Dullea

August 2023: First Edition

Revision History for the First Edition


2023-08-04: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High-Performance Data Architectures, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the
publisher’s views. While the publisher and the authors have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and PingCAP. See our statement of editorial independence.

978-1-098-15052-5
[LSI]
Table of Contents

1. The Database Keeps Evolving, but Can It Keep Up?
   Pushing the Boundaries of the Database Frontier
   Relational Databases Open Data to the Outside World
   NoSQL Databases Emerge to Offer Lightweight, and Even More Sociable, Alternatives
   Cloud Databases Remove the Limits
   Distributed SQL Databases Containing HTAP Capabilities Spread the Power and Bring It All Together

2. Choosing the Right Database for Your Enterprise
   Many Unique Choices for Many Unique Problems
   How to Choose the Right Database to Solve the Right Problem
   High-Performance Data Architecture: The Power of Simplicity and Scalability

3. Designing a High-Performance Data Architecture
   What a High-Performance Data Architecture Looks Like
   Single Database
   Shared-Disk Architecture
   Shared-Nothing Architecture
   Hybrid Row/Columnar Storage
   In-Memory Processing

4. The Impact of Artificial Intelligence on Databases
   AI for Better Database Performance
   Databases as the Lifeblood of AI
   What Generative AI Brings to the Table
   The New Landscape of SQL Development with AI Innovation
CHAPTER 1
The Database Keeps Evolving,
but Can It Keep Up?

There was a time when database choices were fairly limited. Enterprises could work with flat-file databases, or, eventually, more adaptable relational databases. These databases worked well for basic transactions and internal corporate operations, but users had to rely on IT teams to craft and generate reports about the state of their business. That all changed about a decade ago, with an explosion of new types of databases, built on the web and cloud, that put more power in the hands of end users—and gave database managers more powerful tools to serve fast-changing business needs.

Pushing the Boundaries of the Database Frontier
As databases made the transition to the cloud, they evolved from simple data storage frameworks into essential instruments for delivering greater customer service and a deeper understanding of the business and its environs. Databases evolved with the platforms that arose within the computing world—from mainframes to midrange-class computers to personal computers, from proprietary to open source systems, and ultimately, to the cloud. But the evolution isn’t stopping there.
There are many frontiers still open for the advancement of database solutions, and many issues that still need to be addressed: data silos that keep information from effectively reaching the users and applications that need it, data quality and integrity issues, a lag in adapting to new business realities, scalability problems in an era when data and associated applications are exploding, security issues, privacy requirements, and a lack of talent to maintain data environments.
Businesses can’t afford to sit still, and neither can their databases. Database technology needs to be constantly refreshed. Today’s data teams are under immense pressure—tasked with delivering “data-driven” capabilities to their enterprises. At the same time, they have had a difficult time linking the components of the modern data architecture to the needs of the business. They need to do more with less—smaller budgets, fewer resources, and ever-tighter deadlines. Today’s solutions may be faster and more flexible, but more is needed to align them with the needs of the business.
Next, we’ll take a brief look at how that evolution has unfolded.

Relational Databases Open Data to the Outside World
Back in the days when the mainframe ruled, database designers recognized that users needed a way to declaratively specify what information they needed. The traditional database systems of that era (such as IBM’s Information Management System [IMS] or the indexed sequential access method [ISAM]) required programming skills, but relational database systems made it easier for users to construct queries and execute them efficiently, without knowledge of the intricacies of the underlying servers and systems. The relational database introduced a separation between physical and logical implementations, enabling users to easily understand the nature of the data.
Rather than store data in the flat files known to larger, less flexible
systems, relational databases maintain data in rows and columns,
accessible via Structured Query Language (SQL), which provides
a way to create, modify, and query the data sets stored within.
Through the joining of tables, end users could view relationships
between different data sets, such as sales numbers and regions. SQL
introduced a simpler tool for data queries that allowed a far wider
audience of users to query the database. The result was an unlocking
of data for business intelligence.
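The join described above can be sketched concretely. The tables, names, and figures below are hypothetical, and SQLite’s in-process engine stands in for a full RDBMS, but the query works the same way in any SQL database:

```python
import sqlite3

# Hypothetical schema: regions and their sales, stored in rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales (region_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO regions VALUES (?, ?)", [(1, "North"), (2, "South")])
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (1, 80.0), (2, 50.0)])

# The join relates sales numbers to regions declaratively; the user never
# sees how either table is physically stored.
rows = conn.execute("""
    SELECT r.name, SUM(s.amount)
    FROM regions r JOIN sales s ON s.region_id = r.id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(rows)  # [('North', 200.0), ('South', 50.0)]
```

The user states *what* relationship to compute; the database decides *how*—which is exactly the physical/logical separation the relational model introduced.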


Relational databases quickly evolved into powerful relational database management systems (RDBMSs), offered as all-in-one management and development platforms by major vendors. RDBMSs have also evolved to include new ways to handle data, including object database capability and in-memory processing. Lately, they have been hosted within the cloud, either by their vendors or through platform-as-a-service (PaaS) providers. Cloud-based RDBMSs take advantage of the cloud’s ease of use and scalability to provide expanded capabilities.
Relational databases, supported by SQL tools for query access, introduced a simpler way to discover and leverage data, and opened data to a wide audience of users. This represented a key step to unlocking data to better understand trends among customers and markets.
Relational databases have the following characteristics:
Relational databases have the following characteristics:
Advantages
Cross-platform; enables complex queries, joins, or transactions across multiple data sets
Challenges
Difficult to deploy; difficult to scale; requires extensive IT resources; SQL-based queries difficult to program; vendor lock-in; structured data only

NoSQL Databases Emerge to Offer Lightweight, and Even More Sociable, Alternatives
Relational databases helped in discovering and understanding
trends within the business but were expensive in terms of multiuser
or per-processor licensing, as well as difficult to set up and maintain.
SQL itself required a robust understanding of its structure and
commands. Seeking to avoid the complexity of building SQL-based
queries for relational databases, along with their restrictions, a new
breed of databases emerged: not only SQL (NoSQL) databases. The
first generation of NoSQL databases focused on key-value stores
(Berkeley DB and similar), text searching (Elasticsearch), and later
document stores such as CouchDB and MongoDB.

NoSQL databases were designed for applications using unstructured or semistructured data, such as text or images, which were not well supported by the RDBMSs of the time. Supporting such data in a relational database meant forcing JSON-style nested documents into a very complex relational data model. NoSQL databases were marketed by their creators as lighter, quicker, and easier to stand up than heavy RDBMSs, intended to enable data managers and developers to put applications into production at a faster rate. In addition, these databases were built for the web and cloud architectures.
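To illustrate the contrast, here is a minimal sketch with plain Python dictionaries standing in for a document collection (the product records are hypothetical). In a document store, records in the same collection can carry different, nested fields without any schema migration:

```python
# Each record is a self-describing document, as in a document store such as
# MongoDB or CouchDB; the two records do not need to share a schema.
products = [
    {"sku": "A1", "name": "Lamp", "specs": {"watts": 9, "lumens": 800}},
    {"sku": "B2", "name": "Rug", "specs": {"width_cm": 120}, "tags": ["wool"]},
]

# Query a nested, optional field directly—no joins, no ALTER TABLE.
bright = [p["sku"] for p in products if p.get("specs", {}).get("lumens", 0) > 500]
print(bright)  # ['A1']
```

Modeling the same optional, nested attributes relationally would require extra tables or sparse columns, which is the complexity NoSQL systems set out to avoid.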
NoSQL databases come in many flavors, each suited to a particular task at hand—document-oriented, key-value, graph, column-family, and multimodel databases. Each flavor has its own advantages, from greater flexibility in supporting data models to greater visibility of these models. For example, MongoDB gained traction because it provided developers with a simpler abstraction for object data. Other NoSQL databases, such as Cassandra or DynamoDB, emerged in response to relational databases’ inability to scale across multiple nodes.
For example, knowledge graphs are designed to capture rich relationships and contextual information. A knowledge graph can map a social network, recording relationships and attributes that can be used to derive insights, make inferences, and perform complex queries on the data.
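As a minimal sketch of that idea (hypothetical people and edges, with an adjacency map standing in for a graph database), a traversal can infer who is indirectly connected to whom:

```python
# A toy social graph: who follows whom.
follows = {
    "ana": {"ben", "carl"},
    "ben": {"carl"},
    "carl": {"dana"},
}

def reachable(graph, start):
    """Everyone reachable from `start` by following edges transitively."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable(follows, "ana")))  # ['ben', 'carl', 'dana']
```

A real graph database expresses this kind of multi-hop inference in a query language such as Cypher or Gremlin, with indexes tuned for edge traversal.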
NoSQL databases brought data and insights closer to business decision makers and analysts, helping them to contextualize and frame their available data assets. NoSQL databases have the following characteristics:
Advantages
Easy to deploy on a rapid basis; avoids vendor lock-in; few underlying IT resources required; supports unstructured data; provides visual understanding of data relationships
Challenges
Integrating multiple databases for multiple requirements; limited data consistency; may not support complex queries, joins, or transactions seen with SQL



Cloud Databases Remove the Limits
Then came the cloud. Cloud databases, as their name suggests, run as managed services—hosted by a cloud provider or offered as a cloud-based service by database vendors. Cloud databases have automation built in by providers, as well as tighter integration with the services those providers offer. No hardware purchases are required to build and maintain cloud databases.
The most compelling advantage of cloud databases is their scalability on demand, without the up-front costs of resident servers. They enable end users to automatically spin up new data functions and storage on demand. They also provide backup and failover services, thereby ensuring high availability.
Security is another area where cloud databases may be more robust
than their on-premises counterparts. That’s because security is an
essential part of cloud providers’ culture, and these providers are
better equipped with trained staff and the latest tools than their
client companies.
Cloud databases removed many of the underlying systems and network challenges that impeded the full use of databases, enabling business users and analysts to focus on drawing insights from data rather than on underlying infrastructure. Cloud databases have the following characteristics:
Advantages
Easy to deploy on a rapid basis; highly scalable on demand; little up-front investment; no underlying IT resources required; supports unstructured data; technology is automatically refreshed or updated
Challenges
Cloud vendor lock-in; limited control over features and formats; dependence on the cloud vendor’s business viability; data security; long-term costs



Distributed SQL Databases Containing HTAP Capabilities Spread the Power and Bring It All Together
It’s important to note that all of the previously mentioned database types are still in use, with many organizations using all forms for their various requirements. Every business case is unique, and there is no single “right” approach to leveraging data assets and applications for maximum performance.
Ultimately, the key to building a successful data environment in today’s digital age is bringing together the advantages of the previous generations of databases with the speed and adaptability that data-driven enterprises require. The next evolution of data environments, built on high-performance data architectures, brings together all these advantages, but without the baggage that each successive generation of databases carried. This new generation of databases is built on the advantages that distributed SQL databases provide, combined with the real-time capabilities of hybrid transactional/analytical processing (HTAP) databases.
Distributed SQL databases have been available for a number of years, enabling data to be stored and processed closer to the users or applications requesting it. Managing data closer to where it is needed—across multiple sites or nodes—decreases latency and provides a more modular approach to building for scalability. In addition, in the event of failure of a node, other nodes can pick up the slack, ensuring greater availability.
Distributed SQL databases containing HTAP capabilities enable real-time processing and analysis of data as it is being generated. Combined with the flexibility of distributed SQL databases, they can process both online transactional and analytical workloads within the same system, sharing a single source of truth with no data-pipeline delay in between. This simplifies technology stacks, reduces data silos, and helps companies turn real-time updates into actionable insight that drives growth faster.
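The HTAP idea can be sketched in miniature. SQLite below is a single-node stand-in—not a distributed HTAP system—and the orders are hypothetical; the point is that transactional writes and an analytical aggregate share one live table, with no ETL hop between them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

def record_order(region, amount):
    # Transactional (OLTP) side: row-at-a-time writes as events arrive.
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount))

def revenue_by_region():
    # Analytical (OLAP) side: an aggregate over the same live table, so the
    # answer reflects every transaction committed so far—no warehouse delay.
    return dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))

record_order("EU", 40.0)
record_order("US", 100.0)
record_order("EU", 10.0)
print(sorted(revenue_by_region().items()))  # [('EU', 50.0), ('US', 100.0)]
```

In a real HTAP system such as those this report describes, the transactional and analytical sides run on different storage engines under one SQL surface, but the contract is the same: analytics always sees the freshest committed data.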



A high-performance data architecture is needed to take advantage of the benefits that HTAP databases have to offer. This architecture provides greatly enhanced capabilities for scalability, availability, and performance—combined with the flexibility of integrating with existing databases from any vendor. Working within this architecture, managers and professionals can make faster, more informed decisions based on current data. An architecture built on distributed SQL databases containing HTAP capabilities also enables more efficient and streamlined data processing, which can improve cost efficiency while expediting business operations and decision-making. Especially with today’s generative AI innovations, such as OpenAI’s GPT models, a simplified data architecture will play a far more strategic role than ever before.
The database world continues to rapidly evolve, promising fresh approaches to helping organizations leverage the data that is critical for serving customers, increasing employee productivity, and moving forward with advanced analytic applications. Distributed SQL databases containing HTAP capabilities offer a simplified and consolidated approach for building a real-time, data-driven enterprise.

CHAPTER 2
Choosing the Right Database
for Your Enterprise

As explored in Chapter 1, the database world has seen an explosion of choices and approaches to various data problems—which are also exploding in scope and size. With so many options, database and business teams face a bewildering palette of databases that all serve important purposes.

Many Unique Choices for Many Unique Problems
To succeed in a data-driven economy, enterprises need to embrace the three Vs of data—volume, variety, and velocity. There is an ever-growing need to process larger volumes of data, integrate it from multiple sources and systems, and facilitate real-time data processing and analysis. To manage this, a new generation of databases is emerging, employing advanced technologies such as in-memory databases, distributed databases, and cloud-based analytics platforms.
With these new data environments come potentially very complex data architectures—and with them, demand for appropriate talent and skills, as well as modern data governance and security. A cloud native infrastructure is a clear answer for supporting enterprise business and operations with a much more agile foundation. With the right database delivered via or supported by the cloud, data teams can improve performance, increase efficiency, and gain a competitive edge. A streamlined data architecture with a cloud native database will help deliver competitive differentiation.
An additional consideration is the CAP theorem: a distributed data store cannot simultaneously guarantee all three of consistency, availability, and partition tolerance. Thus, data managers often face choices that will enable two of these qualities while forgoing the third.

How to Choose the Right Database to Solve the Right Problem
The evolution of today’s generation of databases provides opportunities to build truly data-driven enterprises. While databases have often tended to be performance bottlenecks for enterprises in the past, there’s no need for them to lag. Other parts of the technology stack—systems, applications, and interfaces—have evolved to be well integrated, flexible, and highly responsive to enterprise workloads, and databases need to catch up.
The challenge is clear. Despite the best efforts of data managers, architects, and professionals, data teams are constantly tweaking and refactoring their databases to keep pace with data volumes growing faster than the databases they originally deployed can handle.
For example, a company may seek to have information delivered consistently in real time across all customer channels, enabling a call center representative to be fully aware of an e-commerce transaction a customer has just made but wants to change. Or it may be urgent to apply preventive maintenance based on the detection of anomalies in real-time data emanating from a customer’s climate control system. But the databases within such enterprises may be slow, unable to effectively transform data, and incapable of accessing new data sources in a timely manner.
After all, it’s no easy task. Data is pouring in from many sources. Businesses are demanding, on an almost daily basis, greater abilities to achieve insights and deliver personalization. Data managers, engineers, and architects are continually scrambling to keep up, often cobbling together solutions that push existing legacy databases beyond their limits, while throwing out an assortment of fit-for-purpose databases to support new ventures. These cobbled-together environments can be costly, counterproductive, and unsustainable.
There’s too much for data teams to do and not enough resources to do it. They are tasked with moving their organizations to more data-driven initiatives, from customer experience management to artificial intelligence. They are responsible for an increasing number of databases, as well as supporting data streams from a vast variety of sources—from the Internet of Things to partner systems. They are under pressure from end users to deliver information on demand, while assuring trust and quality. In addition, all of this needs to be accomplished with limited resources and skills.
This can’t go on. While there are many databases available to run within today’s enterprises—from traditional relational databases to transactional, time-series, and graph databases, as well as NoSQL databases—these choices can be bewildering. In addition, these databases are tasked with handling a wide array of data types coming from sources inside and outside the enterprise.
Further, data teams need to stop serving merely as technicians tending running databases on a day-to-day basis. Today’s data managers, architects, and professionals are seeing their roles evolve into those of next-generation data teams—charged with advocating for and implementing unified and less complex solutions that serve a diverse range of requirements. They need to select databases that address the complexity of today’s data environments and can serve a wide array of requirements and use cases. They need to work closely with the business to assure value, while providing end-to-end data management and integration between various data platforms. Database managers and professionals need to align their environments with the needs of the business, which means not only providing capabilities to the business but also assuring security and compliance.
The databases that support this new environment need to address
the following:
Cost effectiveness
Traditional database environments typically result in numerous database instances requiring support and maintenance. Some instances may have limited numbers of applications. A large number of instances results in higher costs. Data teams need to be able to oversee increasingly complex data environments without appreciable increases in funding, staffing, or other resources.
Scalability on demand
Databases serve and maintain state for transactions, and maintaining state at scale is a very difficult problem. It requires techniques such as sharding, eventual consistency, and denormalization to reduce the interconnection between different parts of the data in the database. Across the enterprise, however, databases are often the bottleneck for scalability.
High performance within complex and ever-expanding environments
There has been significant growth in microservices. This increases the complexity of data models and of making data available at scale, and it inhibits the availability of data belonging to one microservice for other purposes. That brings a lot of complexity and maintenance requirements, especially in companies with large numbers of microservices, sometimes in the hundreds.
Support for high-end data analytics
In traditional data environments, analytics and transactional data have been separated. For example, analytics may be built into data warehouses. That requires complicated extract, transform, and load (ETL) processes to move data between analytics warehouses and transactional databases. As a result, it’s difficult to perform analytics against fresh data, and it reduces data persistence on the analytics side. AI can be complex and difficult to implement. The primary challenges include data accuracy or accessibility, conflicts within technology teams, fear or distrust of AI, and skills availability.
Security, privacy, and compliance
Database teams need to rely on databases that can be properly secured, with strong encryption, authentication, and access controls designed into these environments. In addition, databases need to be highly transparent with regard to data lineage, to meet compliance auditing mandates.



Integration across multiple domains
Within today’s highly complex data environments that support
both internal and external data, databases must be capable of
seamless integration with any systems with which they come
into contact.
Cloud-based or cloud-friendly
The cloud is an essential part of today’s digital enterprises.
Databases either need to be cloud-borne or capable of readily
integrating with cloud-borne systems and resources.
Business availability and continuity
In an always-on economy, even a glitch of a few seconds may be costly to businesses. Databases need to provide persistence, backup, and restore capabilities that assure the business is up 24 hours a day, every day.
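The sharding technique mentioned under scalability above can be sketched as hash-based routing. The key format and the in-memory dictionaries below are illustrative stand-ins for real storage nodes:

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for storage nodes

def shard_for(key: str) -> int:
    # A stable hash sends a given key to the same shard every time, so
    # state is spread across nodes instead of concentrated in one.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
put("user:99", {"name": "Lin"})
print(get("user:42"))  # {'name': 'Ada'}
```

Production systems layer consistent hashing or range partitioning on top of this idea so that shards can be added or removed without rehashing every key.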
There are a number of database choices available to enterprises, so selecting the right type for the business opportunity or problem at hand may be daunting, especially if corporate budgets are on the line. It’s important to study and understand the business problem at hand and the level of scalability required to address it. A suitable data model—based on how the data is organized—and an appropriate database type need to be outlined. For example, a SQL-based database is best suited for a normalized, relational model of the data. Another important consideration is the ecosystem—vendors, developers, and support—associated with a database system. Of course, cost and return on investment are also key factors in a database decision. Every business opportunity or problem has its own uniqueness, and it’s important to select a database that meets those unique needs.

High-Performance Data Architecture: The Power of Simplicity and Scalability
The objective of making a database choice that meets the requirements of today’s data-driven environments is to achieve a high-performance data architecture. But what does such an architecture look like? Existing databases—such as those from the multiple generations described in Chapter 1 (see the section “Relational Databases Open Data to the Outside World”)—address the business problems arising from a fast-paced digital economy. Addressing them well, however, requires a look at incoming data platforms that are better suited to handling large amounts of both structured and unstructured data, as well as seamlessly interfacing with the cloud.
The traditional approach of throwing databases at individual, siloed problems needs to be rethought, and a more responsive data environment should ensure that enterprises can employ their data assets both tactically and strategically. Such a transformed environment has the following characteristics:

• Data is easily and readily available, on demand, to end users through self-service channels.
• Real-time and analytical data is supported with single, highly integrated environments.
• The data environment is highly automated, with tasks ranging from integration to security performed by the automated database.
• Data is responsive to the enterprise requirements at hand, be it customer experience management or compliance reporting.
• There is semantic interoperability, with data capable of moving between applications and systems on demand.

Data teams are now trying to build everything through ETL pipelines and AI and machine learning (ML) pipelines, which raise governance and compliance issues. This requires building a data lake—or, lately, a data mesh and fabric—that also has implications for downstream and upstream processes across the business.
There’s no question that the amount of data now moving through,
around, and between organizations can be overwhelming, especially
as companies accelerate their digital engagements. This requires a
highly agile data environment, and emerging high-performance data
architecture solutions—supported by distributed SQL databases
with HTAP capabilities—offer such environments that can form the
basis of a well-functioning, data-driven enterprise.
A high-performance data architecture enables global, end-to-end
data management through converged data platforms. Such an
architecture supports the following qualities:



Simplicity
With advanced distributed SQL databases containing HTAP capabilities, data teams can reduce the number of vendors and database types with which they are working. There is less of a requirement for cobbling together and attempting to maintain complex, multidatabase environments. With this simplified architecture, a single database can handle multiple workloads, thereby reducing overall data management costs.
Greater scalability
A high-performance data architecture is capable of scaling on
demand, an important advantage for fast-changing enterprise
requirements. Rather than relying on separate OLAP and OLTP
systems with varying scalability capabilities, distributed SQL
databases containing HTAP capabilities enable scaling both
types of workloads simultaneously.
Rapid data transfer to where it is needed, when it is needed
Distributed SQL databases containing HTAP capabilities eliminate the need for cumbersome ETL processes. They enable end users to run queries on incoming transactional data the moment it arrives, rather than shuttling it into a data warehouse or other data management platform first. In addition, there needs to be a high degree of data mobility between private and public clouds.
Consolidated workloads—within a single environment
Typically, analytical and transactional processing workloads have been handled separately, within different database systems. Distributed SQL databases containing HTAP capabilities enable real-time analysis, and this real-time aspect is the key to the advantages these databases provide. They integrate transactional and analytical data, enabling analysis of transactional data as it enters the organization, thus providing decision makers with a real-time picture of developments and events within their customer bases, markets, or even internal operations. Previously, users relied on analytical insights delivered from a separate system such as a data warehouse, which typically employed data that was more historical in nature and was delayed as it moved from the OLTP to the OLAP environment. Distributed SQL databases containing HTAP capabilities are designed to consolidate these workloads within a single, more manageable environment.



Distributed SQL databases containing HTAP capabilities enable consistency
One of the most vexing challenges long faced by data managers and professionals is the many formats and silos occurring across their enterprises. Data managed with distributed SQL databases with HTAP capabilities remains consistent as it moves across functions.
There are many choices available to organizations seeking greater
simplicity and capabilities. Some previous and current generations
of databases are extremely effective at handling OLTP workloads,
while others are optimized for the handling of OLAP queries.
For those organizations seeking a more integrated approach to both
OLTP and OLAP requirements, a solution such as distributed SQL
databases containing HTAP capabilities offers a simplified and con‐
solidated approach for building a real-time, data-driven enterprise.
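The no-ETL point above can be made concrete with a toy sketch. The in-process SQLite database below merely stands in for a distributed SQL database with HTAP capabilities, and the orders table and figures are invented for illustration; the idea is that an aggregate query sees a transaction the moment it commits, with no intervening pipeline.

```python
import sqlite3

# One store serves both sides: the transactional write and the
# analytical read. No ETL job copies data to a separate warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")

# OLTP side: new orders arrive.
conn.execute("INSERT INTO orders VALUES (1, 'EMEA', 120.0)")
conn.execute("INSERT INTO orders VALUES (2, 'APAC', 75.5)")

# OLAP side: the aggregate sees the new rows immediately.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('APAC', 75.5), ('EMEA', 120.0)]
```

In a real HTAP system the transactional and analytical engines are separate under the hood, but the query surface is unified in this same way.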

16 | Chapter 2: Choosing the Right Database for Your Enterprise


CHAPTER 3
Designing a High-Performance
Data Architecture

A high-performance data architecture is a well-planned road map
for the present and the future. It is more than simply a collection
of databases, servers, and networks. The architecture provides guid‐
ance, directions, and patterns for building a data environment that
serves as the workhorse of data-driven businesses.

What a High-Performance
Data Architecture Looks Like
A range of architectural patterns—depending on business require‐
ments and resources—can be established that support high-
performance data architectures, from a single database to clusters
of databases and servers. All can incorporate the latest database
technologies and are increasingly leveraged with distributed SQL
databases containing HTAP capabilities. While such architectures
vary, the goal is to configure data environments to act and deliver
as a single database, one with high levels of scalability, resilience
and high availability, security, compliance, peak performance, and
integration with existing environments.

These key objectives are the essence of a high-performance database
architecture:
Scalability on demand
In today’s rapidly evolving business environments, scalability is
a must. Requirements and user workloads may change on a day-
to-day, or even hour-to-hour, basis. Distributed SQL databases
containing HTAP capabilities, which incorporate the flexibility
to access nodes across the network as needed, are ideal for
meeting dynamic scalability requirements.
Enhanced security and compliance
As businesses expand their digital footprints, so does the size
of their threat surface. Distributed SQL databases containing
HTAP capabilities—designed and built for the cloud and digital
era—provide an additional layer of security through encryption,
access control, and audit trails. Also important is the ability
to meet regional or jurisdictional data requirements intended
to assure security and privacy.
Resiliency and high availability
A high-performance data architecture needs to guarantee that
data and associated analytics will be available to users on a 24/7
basis. Even if there is unexpected downtime within the data
environment, end users should experience little more than a
brief hiccup. Distributed SQL databases containing HTAP capa‐
bilities employ multiple nodes, thereby ensuring that workloads
are picked up instantaneously by additional nodes in a cluster in
the event of node failure or performance glitches.
Peak performance
Capabilities such as dynamic load balancing, caching, and
real-time or near real-time processing are key to maintain‐
ing performance and a superior user or customer experience.
As noted previously, the multiple nodes seen with distributed
SQL databases containing HTAP capabilities ensure not only
continuous availability but also distributed workloads. Additional
features seen with such advanced database environments include
in-memory processing, which reduces or eliminates round trips
to disk storage, and caching, which reduces demand on primary
databases.

Integration with and support for legacy environments
Data environments don’t evolve overnight with a big bang.
Rather, capabilities are added and enhanced as business require‐
ments and underlying technologies change. This is impor‐
tant for customers migrating use cases from older or legacy
databases to more modern environments, which can be a
complex and time-consuming process. Distributed SQL data‐
bases containing HTAP capabilities can be seamlessly added to
existing environments, while incorporating the latest tools and
platforms essential for the move to a data-driven enterprise.
There are numerous patterns enterprises can adopt, based on
requirements and resources. With distributed SQL databases con‐
taining HTAP capabilities, all of the following configurations
can support high-performance data architectures while also meeting
the criteria mentioned previously.

Single Database
While a single database may seem limiting in the long term for
enterprise-scale applications, it is still capable of participating as
an engaged citizen within a high-performance data architecture. A
distributed SQL database containing HTAP capabilities enables
proactive database design, featuring caching and optimized SQL
queries. A single database has the following
characteristics:
Advantages
Centralized control; simplicity; greater data integration; tighter
security
Challenges
Scalability; performance slowdowns; single point of failure;
vendor lock-in; data latency; limited flexibility
Today’s organizations run on data environments that serve a vari‐
ety of requirements and data types, so operations supported by
single databases are increasingly rare. But smaller businesses and
startups that may rely on single instances can still strive for a high-
performance data architecture, shown in Figure 3-1.

Figure 3-1. A single database architecture consists of multiple compo‐
nents within a single, centralized platform that correspond to a given
application; given the data complexity in today’s data landscape, this
approach has become less common

Shared-Disk Architecture
A shared-disk architecture means that data is stored within a com‐
mon storage array that is accessed by more than one database.
Multiple servers can access this array, helping to ensure always-up
environments. In this architecture, database environments can scale,
as required, without the overhead of physical data storage. A shared-
disk architecture has the following characteristics:
Advantages
Data sharing; greater data availability; centralized control;
cost control; simplicity; greater data integration; data
consolidation; scalability
Challenges
Performance slowdowns due to demand from multiple applica‐
tions; network issues; single point of failure at storage level;
vendor lock-in; data latency; limited flexibility; complexity
For decades, many organizations have purchased and maintained
disk arrays that often were run by specialized storage administrators
at centralized locations. This hub-and-spoke approach to data stor‐
age is still viable, locks down security, and helps minimize data silos.
Increasingly, these approaches are being enhanced by cloud services,
shown in Figure 3-2.

Figure 3-2. A shared-disk architecture features a core-centralized
storage array from which multiple databases and applications
access data

Shared-Nothing Architecture
The preceding architectures—single database and shared-disk archi‐
tecture—have served enterprise requirements for decades, but a
masterless and fully distributed architecture is most appropriate in
today’s world. That’s where more distributed architectures—built
on shared-nothing, hybrid row/columnar storage, and in-memory
databases—will play more pertinent roles.
A shared-nothing database environment consists of data partitioned
across independent nodes that can pick up workloads in the event
of failure of another. These separate nodes function as their own
systems, with their own storage and processors. A shared-nothing
architecture has the following characteristics:
Advantages
Scalability; flexibility; high availability; parallel processing; load
balancing; cost control; data security
Challenges
Network issues; greater complexity; loss of control; maintain‐
ing data consistency across nodes; greater overhead;
performance slowdowns in favor of maintaining consistency
A shared-nothing architecture for databases assures great flexibility
in growth and accommodates new data configurations and technolo‐
gies coming down the road, as shown in Figure 3-3.

Figure 3-3. In a shared-nothing architecture, all nodes, which include
databases and storage, operate independently of one another; this
serves to ensure that applications continue to function in the event
of node failure, as functions can be picked up by another node in
the architecture
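The routing idea behind such an architecture can be sketched in a few lines. The three-node toy cluster and hash routing below are invented for illustration; a production distributed SQL database handles placement, rebalancing, and failover automatically.

```python
import hashlib

# Three independent "nodes", each with its own storage (here, a dict).
nodes = {0: {}, 1: {}, 2: {}}

def node_for(key: str, node_count: int = 3) -> int:
    """Route a key to a node using a stable hash."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % node_count

def put(key: str, value: str) -> None:
    nodes[node_for(key)][key] = value

def get(key: str) -> str:
    return nodes[node_for(key)][key]

for user in ["alice", "bob", "carol", "dave"]:
    put(user, f"profile-{user}")

# Every record lands on exactly one node, and lookups route the same way.
print(get("carol"))  # profile-carol
```

Because no node shares storage or memory with another, losing one node leaves the rest untouched; recovering the lost node's partitions from replicas is the part this sketch omits.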

Hybrid Row/Columnar Storage


Row-based storage maintains data as complete records, while
column-based storage focuses on the individual attributes within
records. Hybrid row/columnar storage combines both approaches
so that storage can be adapted to the demands of each workload.
Row-based storage is more effective for handling transaction-based
workloads. Columnar storage, on the other hand, is more suitable
for analytical workloads, as it organizes data by columns, thereby
accelerating queries: only the specific columns involved need to be
read, and thus less data is transferred from disk. A hybrid
row/columnar storage system has
the following characteristics:

Advantages
Query performance; wider range of use cases; scalability; data
integration; analytics capabilities; flexibility
Challenges
Greater complexity; storage requirements; system overhead
Hybrid row/columnar storage-based data architectures are enabled
through an emerging generation of hybrid databases that provide
the best of both worlds in terms of database functionality, shown in
Figure 3-4.

Figure 3-4. With a hybrid row and columnar storage architecture,
applications can more rapidly access data for analytical purposes,
while maintaining the robustness of row-based storage for
transactional functions
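The trade-off between the two layouts can be sketched with plain Python structures. The tiny sales table below is invented for illustration; the point is that an aggregate over one attribute touches only one column in the columnar form.

```python
# Row layout: each record stored whole. Good for OLTP, where a
# transaction reads or writes an entire record at once.
rows = [
    {"id": 1, "product": "widget", "amount": 10.0},
    {"id": 2, "product": "gadget", "amount": 4.5},
    {"id": 3, "product": "widget", "amount": 7.25},
]

# Columnar layout: each attribute stored together. Good for OLAP,
# where an aggregate scans one column and skips the rest.
columns = {
    "id": [1, 2, 3],
    "product": ["widget", "gadget", "widget"],
    "amount": [10.0, 4.5, 7.25],
}

# Total sales: the columnar form reads only the "amount" values,
# while the row form must walk every complete record.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns == 21.75
```

A hybrid engine keeps both representations, or converts between them, so that each workload reads the layout that suits it.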

In-Memory Processing
In-memory processing can deliver subsecond insights. Rather than
applications accessing data from disk drives, relevant data is loaded
into, and processed within, computer random-access memory. In-
memory processing has the following characteristics:

Advantages
Increased performance; analytics capabilities; scalability;
flexibility; faster time to insights
Challenges
Data durability; higher costs; less capacity; system overhead;
scalability limitations; complexity
In-memory processing is an essential feature now being built into
today’s generation of data solutions, which require subsecond
processing for real-time requirements, from addressing customer
inquiries to managing machine production, as shown in Figure 3-5.

Figure 3-5. In-memory processing shifts data retrieval to a system’s
random-access memory, eliminating the latency of round-trip queries
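The caching side of this picture can be sketched with the standard library. The slow lookup below is only a stand-in for a round trip to a primary database; the function and record format are invented for illustration.

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=1024)
def lookup(customer_id: int) -> str:
    """Stand-in for a round trip to the primary database."""
    calls["count"] += 1
    return f"record-{customer_id}"

# The first access goes to the "database"; repeats come from memory.
assert lookup(42) == "record-42"
assert lookup(42) == "record-42"
assert calls["count"] == 1  # only one simulated round trip
```

In-memory databases take the idea further: the working set itself lives in RAM, so even the first access avoids disk.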

Every organization is unique and relies on distinct, ever-evolving
database configurations to address its needs. There are few limi‐
tations on the types or numbers of databases that can be trans‐
formed into high-performance data architectures. Even smaller
organizations with single databases can achieve such capabilities.

CHAPTER 4
The Impact of Artificial
Intelligence on Databases

Things are moving fast with artificial intelligence, with continuing
growth in areas spanning from embedded intelligence to the
democratization of AI through platforms such as ChatGPT. This is
changing the roles and functionality of databases both from an
operational and service-delivery standpoint, resulting in a mutually
beneficial relationship. First, AI can be employed to boost database
performance, enabling autonomous and near-autonomous opera‐
tions and delivery of data services. Second, databases serve as the
lifeblood of AI and ML, elevating the roles of databases to manage
and provide the right data, at the right time—data that is trustwor‐
thy and of the highest quality.

AI for Better Database Performance


From a performance perspective, AI and ML promise to deliver
significant gains for databases of all types. AI can play a role in
discovering, processing, and searching data sets, delivering rapid
results. According to Thomas Davenport and Thomas Redman,
writing in MIT Sloan Management Review, “Artificial intelligence
is quietly improving the management of data, including its quality,
accessibility, and security.”1 They continue: “Managing data…is a
labor-intensive activity: it involves cleaning, extracting, integrating,
cataloging, labeling, and organizing data, and defining and perform‐
ing the many data-related tasks that often lead to frustration among
both data scientists and employees without ‘data’ in their titles.”

1 Thomas H. Davenport and Thomas C. Redman, “How AI Is Improving Data Management”, MIT Sloan Management Review, December 20, 2022.
The challenge for today’s data managers is to deliver enhanced data
capabilities, with strained or relatively stagnant budgets. Organiza‐
tions are sourcing and ingesting more data than ever before—now
within the multiterabyte and petabyte range—that needs to be avail‐
able, on demand, to business users, data scientists, and mission-
critical applications. AI changes the equation for today’s databases,
helping to autonomously enhance database query development
and performance, as well as managing the day-to-day operation,
provisioning, and security of databases.
Emerging methodologies that promote the use of AI in database
management include AIOps, in which AI is applied to streamline
and automate data operations; DataOps, the application of
intelligent collaboration and automation to data pipelines; and
DataSecOps, which involves data security operations on cloud
native databases.
Applying AI to database functions will free up data engineers, archi‐
tects, administrators, and scientists to concentrate on bigger and
more meaningful tasks beyond day-to-day maintenance, such as
digital transformation and innovation, which are essential to operat‐
ing in today’s hypercompetitive environment.

Databases as the Lifeblood of AI


Without well-managed databases, there can be no AI—it’s as simple
as that. To succeed, AI depends on meaningful and relevant data.
Put another way, a quality data set is the foundation of AI; and AI
models and algorithms are only as good as the data they receive.
Furthermore, organizations depend on databases operating at peak
performance to deliver the timely and relevant data needed for
training data sets and large language models.
Going forward, enterprises and data managers need to identify the
data essential for training models, as well as address a potential lack
of data for sustaining these models. Data feeding AI systems must be
fresh and relevant—often, in real time—for the business problems at
hand. In addition, the data must be of the highest possible quality
and trustworthiness.
Data used by ML models is often “raw” or unstructured, and this
is where content delivery networks may be required as part of
a high-performance data architecture. The data can be simple
time-series data, which is suitable for accumulating and storing in a
database. But training using audio or image data often falls outside
the capabilities of databases—and this is where a content delivery
network, consisting of interconnected servers that cache such assets
close to applications or end users, may be more suitable.
Databases employed to support AI initiatives must also be capable
of managing a wide range of data types, from structured to unstruc‐
tured data. Distributed SQL databases containing HTAP capabilities
fit this need, delivering real-time analytical data of all types when
and where it is needed.

What Generative AI Brings to the Table


Generative AI—delivered with OpenAI’s ChatGPT, Google’s Bard,
or Microsoft’s Bing Chat, among others—promises to upend many
aspects of the database world. From an operational point of view,
generative AI can be used to create code for applications or scripts
that enhance database performance and integration. This enables
database developers, architects, engineers, and administrators to
conduct higher-level tasks and respond more quickly to business
demands.
Generative AI also has the potential to assist in database con‐
figuration, as well as play an assistant role in designing a high-
performance data architecture, drawing on patterns and experiences
either stored locally or from across the network.
From a service-delivery standpoint, today’s databases will be tasked
with maintaining the data employed within large language models
for enterprise-specific instances of generative AI. This data provides
recommendations not only to database teams but also across the
wider business.



The New Landscape of SQL Development
with AI Innovation
The advent of AI means widely expanded capabilities for databases
and those working closely with databases. With AI, simple SQL
queries can be built automatically through natural language process‐
ing prompts, with little or no coding required. Through this process,
an AI-driven SQL interface can also provide recommendations for
queries based on analysis of the backend database.
Generative AI, for instance, has a lot to offer for ad hoc queries
or natural language queries created by nontechnical users. Even
for programmers, AI is proving to be very good at generating
syntactically correct windowing functions, which are tedious for
programmers to create and beyond the capabilities of most busi‐
nesspeople. ML approaches can be used to generate and produce
simple queries for nonexperts: queries that can easily be verified
to produce correct results. AI has already proven able to under‐
stand natural language queries that assist programming on MySQL,
which makes it a preferable protocol due to the availability of more
training data. AI can understand schemas and apply best practices
for SQL. At the same time, AI cannot effectively discriminate
between transactional and analytical types of queries, nor remain
sensitive to cross-shard consistency. This is why AI-assisted
programming requires a more versatile database that’s easy to use
and flexible.
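As a toy illustration of the natural-language-to-SQL idea, the sketch below maps one narrow class of English prompts onto a query template. A real assistant would use a large language model with knowledge of the schema; the `sales` table, its column names, and the prompt pattern here are all invented.

```python
import re

def nl_to_sql(prompt: str) -> str:
    """Translate 'total <measure> by <dimension>' prompts into SQL.

    A toy stand-in for an LLM-backed assistant: it handles only one
    sentence shape, over a hypothetical `sales` table.
    """
    match = re.fullmatch(r"total (\w+) by (\w+)", prompt.strip().lower())
    if not match:
        raise ValueError(f"unsupported prompt: {prompt!r}")
    measure, dimension = match.groups()
    return (
        f"SELECT {dimension}, SUM({measure}) AS total_{measure} "
        f"FROM sales GROUP BY {dimension};"
    )

print(nl_to_sql("total revenue by region"))
# SELECT region, SUM(revenue) AS total_revenue FROM sales GROUP BY region;
```

An LLM-backed assistant replaces the regular expression with a model prompt that includes the schema, but the verification step, checking the generated SQL before running it, remains just as important.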
The emerging architectural approach supports the delivery of
real-time insights and capabilities, leveraging AI. Databases are
forming the foundation of real-time AI, employed in conjunction
with streaming technologies.
AI means new approaches to building and managing databases,
as well as an escalation of the roles of databases themselves. Enter‐
prises need to prepare for, and embrace the power of, AI with
high-performance data architectures that are scalable, able to pro‐
cess mixed workloads, highly available, and capable of delivering
intelligence on demand.



About the Authors
Joe McKendrick is an author, independent researcher, and speaker
exploring innovation, information technology trends, and markets.
He is a regular contributor to Database Trends and Applications,
Forbes, Harvard Business Review, and ZDNET. He served as cochair
of the AI Summit in 2021 and 2022, and on the planning committee
of the IEEE International Conference on Edge Computing and the
International SOA and Cloud Symposium series.
Ed Huang is CTO and cofounder of PingCAP, the company behind
TiDB, one of the most advanced open source, distributed SQL
databases for modern applications. With over a decade of experience
in the technology industry, he’s an expert in distributed systems,
database architecture, and cloud computing. His experience in the
open source community also gives him a unique perspective on the
challenges and opportunities of implementing distributed databases
in large-scale, real-world production environments. Overall, Ed is
passionate about streamlining software architecture and helping
businesses unlock the full potential of their data.
