PingCAP Ebook Modern Distributed Database Fundamentals
Modern Distributed Database Fundamentals
What Organizations Need to Know to Increase Scalability,
Meet Ever-Increasing Data Requirements, and Streamline
Tech Stacks
Table of Contents
Introduction
Scalable by Design
Versatile by Nature
Reliable by Default
Database Modernization
Tech Stack Unification
Operational Data Management
Key Factors
Evaluation Criteria
Best Practices
Introducing TiDB
Origins of TiDB
Inside TiDB’s Distributed SQL Architecture
The Advantages of TiDB
Conclusion
In the 1970s, the first relational database management systems (RDBMS) were developed. These
databases were built on the relational data model, which allowed data to be stored in tables with
relationships between them. This made it easier to manage data, and allowed for more complex
queries to be executed.
During the 1980s and 1990s, relational databases became increasingly popular, with the emergence
of commercial database management systems, such as Oracle, IBM DB2, and Microsoft SQL Server.
These databases were designed to handle large volumes of data and complex queries, and were used
in a wide range of applications, including banking, retail, and healthcare.
As organizations generate and process ever-increasing amounts of data, the need for scalable and
efficient databases is becoming more important. Traditional SQL databases have long been used for
managing and processing data with strong consistency, but they have scalability and performance
limitations. NoSQL databases, on the other hand, are masters at data scalability and performance,
but they tend to fall short when data consistency is a requirement.
Additionally, distributed SQL databases with mixed workload processing capabilities can
combine row and column storage in a single database. This provides a single endpoint for mixed
workloads while guaranteeing strong data consistency. Data can also be collected from
multiple applications and aggregated instantly, allowing real-time queries to be performed on
online operational data.
Distributed SQL databases are becoming increasingly popular as organizations look for ways to
manage and process large volumes of data efficiently. As data continues to grow at an
exponential rate, distributed SQL databases will become even more important for modern
application development.
By the end of this eBook, you’ll have the knowledge and confidence to take the next step in your
cloud-native journey. You’ll also be able to pinpoint precisely what makes distributed SQL a
unique modern distributed database solution for transactional data.
Choosing the right database to power modern applications can be challenging. For starters, as data
volumes grow when using a traditional relational database, performance and scalability radically
degrade. These problems can only be remedied with additional data processing, aggregation, and
integration tools. However, such solutions create greater technical complexity for developers,
poor real-time performance, and higher data storage costs.
[...] separate databases for transactional and analytical workloads, adding even more technical
complexity while opening up major challenges in data reliability and consistency.
With the explosive growth of data and the need for scalable and efficient systems, traditional
relational and NoSQL databases have faced limitations. This has led to the emergence of
distributed SQL databases, revolutionizing how organizations handle their data.
Figure 2. The acceleration of new customer experiences into digital channels is driving the creation of modern
software applications as digital services.
As a result, businesses face the significant challenge of effectively managing and processing this
ever-increasing data. Distributed SQL databases have emerged as a robust solution to address
these escalating data requirements.
As data requirements continue to grow, distributed SQL databases have proven their
effectiveness in handling the challenges posed by this rapid data expansion. Through scalable
data storage, elastic computing power, data partitioning and sharding, data compression and
optimization, and real-time data processing capabilities, these databases empower
organizations to efficiently scale and manage ever-increasing data volumes. By leveraging the
distributed nature of their architecture, distributed SQL databases provide the scalability,
availability, and flexibility required to meet the demands of modern data-driven applications.
Distributed SQL databases have emerged as a powerful solution to address these challenges and
significantly improve application scalability and availability.
Distributed SQL databases leverage their distributed architecture to execute queries in parallel
across multiple nodes. This parallel processing capability allows for faster query execution times,
resulting in improved application performance. By dividing the query workload across the cluster,
distributed SQL databases can harness the collective computational power of the nodes,
effectively reducing the response times for complex queries. This distributed query execution
ensures that modern applications can deliver real-time results to users, enabling them to
interact seamlessly with the application.
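The parallel execution described above can be pictured as a scatter-gather pattern: each node computes a partial result over its own partition, and a coordinator merges the partials. The following is a minimal sketch under stated assumptions, not any particular database's executor. The node names, the data, and the functions `partial_aggregate` and `distributed_average` are hypothetical, and Python threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" holds one partition of order amounts.
NODE_PARTITIONS = {
    "node-1": [120, 80, 45],
    "node-2": [300, 15],
    "node-3": [60, 60, 60, 20],
}

def partial_aggregate(rows):
    """Work done locally on one node: return (sum, count) for its partition."""
    return sum(rows), len(rows)

def distributed_average(partitions):
    """Scatter the aggregate to every node in parallel, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_aggregate, partitions.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

average = distributed_average(NODE_PARTITIONS)
```

Only the small (sum, count) pairs travel back to the coordinator, which is why pushing aggregation down to the nodes tends to reduce both network traffic and response time.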
Efficient data placement is crucial for maximizing application availability. Distributed SQL
databases can intelligently distribute and replicate data across data nodes in multiple
availability zones (AZs), offering high availability and fault tolerance. This means that if a
single node, or fewer than half of the nodes, fail, the system can continue to function, which a
traditional monolithic database running on a single node cannot do. Intelligent data placement
also keeps data close to the nodes that require it, which further supports application
availability.
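The "fewer than half" rule follows from majority quorums: the system stays available as long as a majority of replicas is still reachable. The arithmetic can be sketched as follows. The function names are hypothetical and chosen only for illustration.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerable_failures(replicas: int) -> int:
    """How many replicas can fail while a majority still survives."""
    return replicas - majority(replicas)

def is_available(replicas: int, failed: int) -> bool:
    """True if the surviving replicas can still form a majority."""
    return replicas - failed >= majority(replicas)

three_replica_budget = tolerable_failures(3)   # one failure tolerated
five_replica_budget = tolerable_failures(5)    # two failures tolerated
```

Note that an even replica count adds cost without adding tolerance: four replicas tolerate one failure, the same as three.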
To further enhance application scalability and availability, distributed SQL databases utilize a
disaggregated storage and compute architecture. This architecture separates computing from
storage, so each layer can be deployed separately and scaled independently.
In a disaggregated storage and compute architecture, different functionalities are divided and
allocated to two types of nodes: the Write Node and the Compute Node. This means you can
decide the number of Write Nodes and Compute Nodes to be deployed as needed. Additionally,
you can scale out or scale in the computing or storage capacity online as needed. The scaling
process is transparent to application operations and maintenance staff.
Distributed SQL databases seamlessly integrate with modern application frameworks, enabling
developers to leverage their performance-enhancing features. These databases support popular
frameworks and libraries for application development, such as Spring Boot, Django, or Ruby on
Rails.
By integrating with these frameworks, distributed SQL databases provide a familiar development
environment and enable developers to take advantage of performance optimizations specific to
the database. This integration ensures that modern applications can harness the full potential of
distributed SQL databases and deliver exceptional performance to end users.
Figure 4. An example of a distributed SQL architecture with scalability and reliability for modern transactional apps
coupled with real-time analytics on transactional data.
A significant challenge in the tech stack jungle is dealing with multiple data management
systems. Traditional architectures often involve separate databases for different purposes, such
as relational databases, NoSQL databases, caching systems, and message brokers. This
fragmentation introduces complexities in data modeling, data synchronization, and maintaining
consistency across systems.
Distributed SQL databases consolidate these different data management needs into a single,
unified system. By consolidating data management, organizations can simplify their tech stack,
reduce integration challenges, and streamline their operations.
In today’s data-driven world, organizations face the daunting challenge of managing ever-growing
volumes of data, ensuring reliable access, and accommodating dynamic workloads. To overcome
these challenges, distributed SQL databases have emerged as a powerful solution, offering
scalability, reliability, and versatility. This chapter will explore the fundamental principles that
underpin distributed SQL databases and their significance in modern data management.
Scalable by Design
Scalability is a key advantage of distributed SQL
databases, enabling organizations to efficiently
handle growing data volumes, user demands,
and transactional workloads. In this section,
we’ll explore how distributed SQL databases
are designed to be inherently scalable. We will
delve into their horizontal scalability, automatic
sharding capabilities, distributed transactions,
and concurrency control mechanisms.
Automatic Sharding
Automatic sharding is a vital capability of distributed SQL databases that allows them to partition
data across multiple nodes transparently. Sharding ensures that data is distributed evenly and
managed efficiently in a distributed environment. Key features of automatic sharding include:
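To make transparent partitioning concrete, the following minimal sketch routes keys to range-based shards and splits a shard once it grows past a threshold. It is a toy under stated assumptions: the class `RangeShardedStore`, the threshold of four keys, and the split-at-the-median policy are all hypothetical, and real systems typically split on data size or load and place the resulting shards on different nodes.

```python
import bisect

class RangeShardedStore:
    """Toy key-range sharding: shard i owns keys in [starts[i], starts[i + 1])."""

    def __init__(self, max_keys_per_shard=4):
        self.max_keys = max_keys_per_shard
        self.starts = [""]   # sorted start keys; "" sorts before every string
        self.shards = [{}]   # one dict per shard, parallel to self.starts

    def _locate(self, key):
        # Index of the last shard whose start key is <= key.
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key, value):
        i = self._locate(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_keys:
            self._split(i)

    def get(self, key):
        return self.shards[self._locate(key)].get(key)

    def _split(self, i):
        # Split at the median key so both halves hold roughly equal data.
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in self.shards[i].items() if k < mid}
        right = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i] = left
        self.starts.insert(i + 1, mid)
        self.shards.insert(i + 1, right)

store = RangeShardedStore(max_keys_per_shard=4)
for n in range(10):
    store.put(f"user:{n:03d}", n)
```

The caller only ever uses `put` and `get`. Shard boundaries change underneath it without any change to application code, which is the sense in which sharding is transparent.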
Distributed Transactions
Distributed SQL databases provide support for distributed transactions, allowing organizations
to maintain transactional integrity across multiple nodes. Distributed transactions ensure that a
group of database operations is treated as a single unit, guaranteeing consistency and
durability. Key features of distributed transactions include:
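One common way to treat writes on several nodes as a single unit is two-phase commit. This guide does not name a specific protocol, so the sketch below is an assumption used for illustration only. The `Participant` class and `two_phase_commit` function are hypothetical, and the sketch leaves out the durable logging, timeouts, and coordinator recovery that a production protocol needs.

```python
class Participant:
    """One node holding part of the data touched by a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that must refuse
        self.staged = None            # writes prepared but not yet visible
        self.data = {}                # committed, visible data

    def prepare(self, writes):
        if not self.can_commit:
            return False
        self.staged = dict(writes)
        return True

    def commit(self):
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(writes_by_participant):
    """Phase 1: every participant prepares. Phase 2: all commit, or all abort."""
    prepared = []
    for participant, writes in writes_by_participant:
        if participant.prepare(writes):
            prepared.append(participant)
        else:
            for p in prepared:
                p.abort()
            return False
    for p in prepared:
        p.commit()
    return True

shard_a, shard_b = Participant("shard-a"), Participant("shard-b")
committed = two_phase_commit([(shard_a, {"alice": 50}), (shard_b, {"bob": 150})])
```

If any participant refuses in the first phase, no participant applies its writes, so readers never observe half of a transfer.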
High-Performance Architecture
The high-performance architecture of distributed SQL databases is one of their key strengths,
enabling them to deliver exceptional performance for a variety of data processing tasks. The
following aspects contribute to their high-performance capabilities:
Another aspect of the versatility of distributed SQL databases is their ability to handle mixed
workloads efficiently. Whether it involves processing analytical queries, transactional operations,
or a combination of both, distributed SQL databases excel in accommodating diverse workloads.
Here’s how they achieve this:
The versatility of distributed SQL databases makes them a fundamental component of modern
data management systems. Their high-performance architecture, characterized by distributed
query execution and intelligent data partitioning, enables organizations to achieve optimal
Reliable by Default
Reliability is a foundational characteristic of
distributed SQL databases, ensuring
consistent and dependable data
management in distributed environments. In
this section, we will explore how distributed
SQL databases are designed to be reliable by
default. We will delve into their strong
consistency guarantees, high availability
features, fault tolerance mechanisms, and
disaster recovery capabilities.
Strong Consistency
Distributed SQL databases provide strong consistency guarantees, ensuring that data remains
consistent across all nodes in the distributed system, even in the presence of concurrent
operations. Strong consistency is essential for applications requiring accurate and reliable data
access. Key features contributing to strong consistency include:
High Availability
High availability is a crucial aspect of distributed SQL databases, ensuring that applications
remain accessible and responsive even in the face of node failures or network interruptions.
Distributed SQL databases achieve high availability through various features, including:
Fault Tolerance
Fault tolerance is a critical capability of distributed SQL databases, enabling them to withstand
hardware failures, network issues, or other system failures. Distributed SQL databases implement
fault tolerance through the following mechanisms:
1. Data Replication:
Distributed SQL databases replicate data across multiple nodes, providing redundancy and
safeguarding against data loss. If a node fails, data can be retrieved from replicas, ensuring
that data remains accessible and preserving system functionality.
Disaster Recovery
Disaster recovery (DR) is a critical aspect of distributed SQL databases, ensuring data integrity
and business continuity in the face of catastrophic events. Distributed SQL databases provide
robust DR capabilities through the following mechanisms:
Backup and restore in a distributed SQL database satisfies the following requirements:
a. Backs up cluster data to a DR system with a Recovery Point Objective (RPO) as short
as 5 minutes, reducing data loss in disaster scenarios.
b. Handles operational failures from applications by rolling back data to a time before
the error event.
c. Audits historical data to meet the requirements of judicial supervision.
d. Clones the production environment, which is convenient for troubleshooting,
performance tuning, and simulation testing.
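Requirements (a) and (b) come down to simple timestamp arithmetic. The sketch below assumes, purely for illustration, that the backup system records a checkpoint every five minutes. The function names, the dates, and the checkpoint schedule are hypothetical.

```python
from datetime import datetime, timedelta

def latest_restore_point(checkpoints, error_time):
    """Most recent checkpoint strictly before the error event, or None."""
    earlier = [c for c in checkpoints if c < error_time]
    return max(earlier) if earlier else None

def data_loss_window(checkpoints, disaster_time):
    """Time between the last completed checkpoint and the disaster."""
    done = [c for c in checkpoints if c <= disaster_time]
    return disaster_time - max(done) if done else None

start = datetime(2024, 1, 1, 12, 0)
# Hypothetical schedule: one checkpoint every 5 minutes, 12:00 through 12:25.
checkpoints = [start + timedelta(minutes=5 * i) for i in range(6)]

# (b) An application error at 12:17 is rolled back to the 12:15 checkpoint.
restore_to = latest_restore_point(checkpoints, datetime(2024, 1, 1, 12, 17))

# (a) A disaster at 12:24:30 loses the writes made since the 12:20 checkpoint.
loss = data_loss_window(checkpoints, datetime(2024, 1, 1, 12, 24, 30))
```

With checkpoints five minutes apart, the data loss window can never exceed five minutes, which is what an RPO of five minutes promises.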
As we’ve demonstrated so far throughout this guide, distributed SQL databases have emerged as
a transformative technology. They’re revolutionizing the way organizations manage their data
infrastructure while unlocking new levels of scalability, reliability, and versatility. In this chapter,
we’ll explore several use cases that demonstrate the practical applications and benefits of
distributed SQL databases across different domains and industries. We’ll also provide real-world
examples to support these use cases.
Database Modernization
Organizations are constantly seeking ways to
modernize their data infrastructure to meet the
evolving needs of their applications and users.
Database modernization has become a critical
initiative, as legacy systems often struggle to
cope with the demands of scalability,
performance, and agility required by modern
use cases. Distributed SQL databases have
emerged as a powerful solution to drive this
database modernization journey.
Companies often find themselves managing a complex and disparate tech stack comprising
multiple databases, data processing frameworks, and data integration tools. This fragmented
infrastructure can lead to inefficiencies, increased maintenance costs, and challenges in data
management. Distributed SQL databases offer a compelling solution for tech stack unification,
enabling organizations to streamline their data infrastructure, simplify operations, and achieve
greater efficiency.
A disparate tech stack often arises as organizations adopt various technologies to meet different
data processing requirements. However, this fragmentation can create several challenges:
Moving data between different systems can introduce latency and performance bottlenecks,
negatively impacting the overall system performance.
SaaS Providers
Distributed SQL databases offer a unified solution for SaaS providers to streamline their tech
stack and address the challenges posed by a disparate tech stack.
As organizations embrace the benefits of distributed SQL databases, selecting the right database
becomes crucial for successful implementation. Choosing a distributed SQL database involves
considering several key factors, evaluating various criteria, and following best practices to ensure
the selected database aligns with your organization’s requirements.
In this chapter, we will explore the process of choosing a distributed SQL database, covering key
factors to consider, evaluation criteria to assess, and best practices to follow during the selection
process.
Key Factors
When choosing a distributed SQL database, it is essential to consider the following key factors:
1. Scalability:
Evaluate the database’s scalability capabilities to ensure it can handle the anticipated
growth in data volume and user traffic. Consider factors such as horizontal scaling, automatic
data partitioning, and the ability to add or remove nodes from the cluster seamlessly.
4. Performance:
Assess the performance capabilities of the database, including query execution times,
throughput, and latency. Look for features such as distributed query processing to optimize
performance.
Evaluation Criteria
To effectively evaluate distributed SQL databases, consider the following criteria:
2. Performance Benchmarking:
Conduct performance benchmarking tests to evaluate the database’s performance under
realistic workloads. Compare query execution times, throughput, and latency across different
databases to identify the one that meets your performance expectations.
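When comparing benchmark runs, it helps to reduce each run to the same few numbers. The sketch below shows one way to compute throughput, median latency, and 99th-percentile latency from raw per-query timings. The function `summarize` and the sample measurements are hypothetical. Tail latency is reported alongside the median because averages tend to hide slow outliers.

```python
import statistics

def summarize(latencies_ms, wall_clock_s):
    """Summarize one benchmark run: throughput plus median and tail latency."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "throughput_qps": len(latencies_ms) / wall_clock_s,
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": cuts[98],
    }

# Hypothetical measurements: 100 queries completed in 2 seconds of wall-clock time.
sample_latencies_ms = [10] * 98 + [50, 200]
report = summarize(sample_latencies_ms, wall_clock_s=2.0)
```

Run the same workload, data set, and client concurrency against each candidate database so that the resulting reports are comparable.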
4. Ecosystem Integration:
Assess the database’s compatibility and integration with your existing tech stack and
ecosystem tools. Consider its support for programming languages, frameworks, data
processing platforms, and data streaming frameworks.
Best Practices
Follow these best practices when choosing a distributed SQL database:
1. Clearly Defined Requirements:
Clearly define your organization’s requirements, including scalability needs, performance
expectations, data model support, and high availability requirements. This will help narrow
down the list of suitable distributed SQL databases.
As we’ve shown throughout this eBook, selecting the right database to power modern
applications can be challenging. However, there’s a better option that can evolve alongside
your organization.
Introducing TiDB
TiDB, developed by PingCAP, is one of the most advanced open-source distributed SQL databases,
and it is MySQL compatible. TiDB powers business-critical applications with a streamlined tech
stack, elastic scaling, real-time analytics, and continuous access to data, all in a single
database. With these advanced capabilities, growing companies like yours can focus on the future
without worrying about complex data infrastructure management or tedious application
development cycles.
With over 34,000 GitHub stars and a growing community of contributors, TiDB is also one of the
world’s most adopted open-source distributed SQL databases. The “Ti” in TiDB represents the
symbol for Titanium from the Periodic Table of Elements. Found in nature as an oxide, Titanium
is a powerful chemical element that can produce highly elastic, versatile, and reliable Titanium
metal. With TiDB, some of the world’s largest companies across technology, financial services,
travel, Web3, and gaming are building modern applications as relentlessly powerful as Titanium.
The TiDB SQL Layer separates compute from storage to make scaling simpler, delivering a true
cloud-native architecture.
This stateless MySQL-compatible layer provides uniform data access without regard for sharding
or any other underlying technical implementation, making applications easier to develop and
easier to use.
The Placement Driver (PD) Layer functions just like a full-time DBA, monitoring millions of
shards and performing hundreds of operations per minute. It also handles the scale-in and
scale-out of clusters to meet demand. Additionally, this layer dynamically balances the data load
in real time, mitigates hotspots, and provides for the implementation of customized scheduling
policies.
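The idea of balancing data load across nodes can be illustrated with a simple greedy scheduler. This is not PD's actual algorithm, only a sketch under stated assumptions: the function `rebalance`, the store names, and the shard identifiers are hypothetical, and balance is measured by shard count alone, whereas a real scheduler would also weigh shard size, traffic, and hotspots.

```python
def rebalance(node_shards, max_moves=100):
    """Greedily move one shard at a time from the fullest node to the emptiest."""
    moves = []
    for _ in range(max_moves):
        fullest = max(node_shards, key=lambda n: len(node_shards[n]))
        emptiest = min(node_shards, key=lambda n: len(node_shards[n]))
        if len(node_shards[fullest]) - len(node_shards[emptiest]) <= 1:
            break  # counts differ by at most one shard: balanced enough
        shard = node_shards[fullest].pop()
        node_shards[emptiest].append(shard)
        moves.append((shard, fullest, emptiest))
    return moves

cluster = {
    "store-1": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "store-2": ["r7"],
    "store-3": [],  # newly added node after a scale-out
}
plan = rebalance(cluster)
```

Each entry in `plan` is one shard move. Applying moves one at a time is what lets a cluster absorb a newly added node gradually instead of pausing for a bulk reshuffle.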
Consisting of row and column-based storage engines, the TiDB Storage Layer offers built-in high
availability and strong consistency that can auto-scale to hundreds of nodes and petabytes of
data.
This layer also offers a modern replication mechanism based on the Raft consensus protocol.
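In Raft, a log entry counts as committed once it is stored on a majority of replicas. The following minimal sketch computes the highest such log index from each replica's replicated position. The function name and the sample values are hypothetical, and the sketch omits Raft's additional rule that a leader only commits entries from its own current term this way.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of replicas (leader included)."""
    ranked = sorted(match_index, reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest position is held by at least `majority` replicas.
    return ranked[majority - 1]

# Five replicas: three of them hold entries up to index 5 or beyond.
five_replicas = commit_index([7, 7, 5, 3, 2])

# Three replicas: the slowest follower does not hold back the commit point.
three_replicas = commit_index([10, 9, 4])
```

Because only a majority has to acknowledge a write, a slow or failed minority does not block progress.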
Global organizations, such as those referenced in Chapter 3, are using TiDB for diverse database
solutions from large-scale transactional workloads to recommendation engines, data-intensive
applications, and more.
We then uncovered real-world use cases that demonstrate the practical applications of
distributed SQL databases across various industries. These examples showcased how
organizations have successfully leveraged distributed SQL to modernize their databases, unify
their tech stacks, and efficiently manage operational data.
Distributed SQL databases such as TiDB provide a solid foundation for organizations that want to
evolve their transactional data management strategies. By embracing the modern distributed
database principles mentioned in this guide, organizations can unlock the full potential of their
data, leverage the scalability and reliability of distributed SQL architecture, and propel their
businesses forward in the digital economy.
If you want to learn more about TiDB, have questions about database modernization, or simply
want to better understand your options as you plan for the future, please don’t hesitate to
book a demo with one of our distributed SQL experts.