009 Databases
Understand what a database is and its use cases in system design.
• Problem statement
• Limitations of file storage
• Solution
• Advantages
• How will we explain databases?
Problem statement
Let’s start with a simple question: can we make a software application without
using databases? Let’s suppose we have an application like WhatsApp. People
use our application to communicate with their friends. Now, where and how can
we store information (a list of people’s names and their respective messages)
permanently and retrieve it?
We can use a simple file to store all the records on separate lines and retrieve
them from the same file. But using a file for storage has some limitations.
Solution
The above limitations can be addressed using databases.
Some of the applications where we use database management systems include
banking systems, online shopping stores, and so on. Different organizations
maintain databases of different sizes according to their needs.
Note: According to a source, the World Data Center for Climate (WDCC)
is the largest database in the world. It contains around 220 terabytes of
web data and 6 petabytes of additional data.
Databases differ in terms of their intended use case, the type of information
they hold, and the storage method they employ.
An example of a SQL database (with Users and Product tables) versus a NoSQL document
database: a relational database has a well-defined structure, such as attributes (the
columns of a table), while NoSQL databases, such as document databases, often have an
application-defined structure of data.
Relational databases, like phone books that record contact numbers and
addresses, are organized and have predetermined schemas. Non-relational
databases, like file directories that store anything from a person’s contact
information to shopping preferences, are unstructured, scattered, and feature a
dynamic schema. We’ll discuss their differences and their types in detail in the
next lesson.
Advantages
A proper database is essential for every business or organization. This is because
the database stores all essential information about the organization, such as
personnel records, transactions, salary information, and so on. Following are
some of the reasons why the database is important:
Managing large data: A large amount of data can be easily handled with a
database, which wouldn’t be possible using other tools.
Retrieving accurate data (data consistency): Due to different constraints
in databases, we can retrieve accurate data whenever we want.
Easy updates: It’s quite easy to update data in databases using a data
manipulation language (DML).
Security: Databases ensure the security of the data. A database only allows
authorized users to access data.
Data integrity: Databases ensure data integrity by using different
constraints for data.
Availability: Databases can be replicated (using data replication) on
different servers, which can be concurrently updated. These replicas
ensure availability.
Scalability: Databases are divided (using data partitioning) to manage the
load on a single node. This increases scalability.
• Relational databases
• Why relational databases?
• Flexibility
• Reduced redundancy
• Concurrency
• Integration
• Backup and disaster recovery
• Drawback
• Impedance mismatch
• Why non-relational (NoSQL) databases?
• Types of NoSQL databases
• Key-value database
• Document database
• Graph database
• Columnar database
• Drawbacks of NoSQL databases
• Lack of standardization
• Consistency
• Choose the right database
• Quiz
As we discussed earlier, databases are divided into two types: relational and
non-relational. Let’s discuss these types in detail.
Relational databases
Relational databases adhere to particular schemas before storing the data. The
data stored in relational databases has prior structure. Mostly, this model
organizes data into one or more relations (also called tables), with a unique key
for each tuple (instance). Each entity of the data consists of instances and
attributes, where instances are stored in rows, and the attributes of each
instance are stored in columns. Since each tuple has a unique key, a tuple in one
table can be linked to a tuple in other tables by storing the primary keys in other
tables, generally known as foreign keys.
Structured Query Language (SQL) is used for manipulating the database. This
includes the insertion, deletion, and retrieval of data.
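To make this concrete, here’s a minimal sketch using Python’s built-in sqlite3 module (SQLite is one of the relational databases listed below). The messages table and its columns are illustrative, not taken from any real schema:

```python
import sqlite3

# A throwaway in-memory SQLite database for demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, body TEXT)")

# Insertion
cur.execute("INSERT INTO messages (sender, body) VALUES (?, ?)", ("Alice", "Hello!"))

# Retrieval
cur.execute("SELECT sender, body FROM messages")
print(cur.fetchall())  # [('Alice', 'Hello!')]

# Deletion
cur.execute("DELETE FROM messages WHERE sender = ?", ("Alice",))
conn.commit()
conn.close()
```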
There are various reasons for the popularity and dominance of relational
databases, including simplicity, robustness, flexibility, performance,
scalability, and compatibility in managing generic data.
However, ACID (atomicity, consistency, isolation, durability) is, by design, a
big hammer: it’s generic enough for all problems. If a specific application only
needs to deal with a few anomalies, there’s a window of opportunity to use a
custom solution for higher performance, though at the cost of added complexity.
Popular relational databases include MySQL, Oracle Database, Microsoft SQL
Server, IBM DB2, Postgres, and SQLite.
Flexibility
In the context of SQL, the data definition language (DDL) gives us the
flexibility to modify the database: adding or dropping tables and columns,
renaming tables, and making other changes. DDL even allows us to modify the
schema while other queries are running and the database server is up.
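As a quick illustration of that flexibility, the following sketch (again with SQLite; the users table is invented for this example) adds a column and renames a table on a live connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# DDL lets us evolve the schema in place.
cur.execute("ALTER TABLE users ADD COLUMN email TEXT")   # add a column
cur.execute("ALTER TABLE users RENAME TO customers")     # rename the table

# Confirm the rename took effect.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
print(cur.fetchall())  # [('customers',)]
conn.close()
```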
Reduced redundancy
One of the biggest advantages of relational databases is reduced data
redundancy. The information related to a specific entity appears in one table,
while the data relevant to that entity appears in other tables linked through
foreign keys. This process is called normalization and has the additional
benefit of removing inconsistent dependencies.
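Here’s a small, hypothetical sketch of normalization in practice: customer details live in one table, and orders reference them through a foreign key rather than duplicating them on every order row. The table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

cur.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL
);
""")
cur.execute("INSERT INTO customers VALUES (1, 'John')")
cur.execute("INSERT INTO orders VALUES (501, 1, 99.5)")

# A join reconstructs the full picture without storing the name twice.
cur.execute("""SELECT c.name, o.order_id, o.total
               FROM orders o JOIN customers c USING (customer_id)""")
print(cur.fetchall())  # [('John', 501, 99.5)]
conn.close()
```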
Concurrency
Relational databases handle concurrent access to the same data through
transactions, so concurrent readers and writers see consistent results.
Integration
Different applications can integrate their data by sharing a single database,
so all of them operate on consistent, up-to-date data.
Backup and disaster recovery
Relational databases guarantee that the state of data is consistent at any
time. The export and import operations make backup and restoration easier. Most
cloud-based relational databases perform continuous mirroring to avoid loss of
data and to make the restoration process easier and quicker.
Drawback
Impedance mismatch
Impedance mismatch is the difference between the relational model and
in-memory data structures. The relational model organizes data into a tabular
structure with relations and tuples. SQL operations on this structured data
yield relations aligned with relational algebra. However, the model has a
limitation: the values in a table must be simple and can’t be a structure or a
list. The case is different in memory, where complex data structures can be
stored. To make complex structures compatible with relations, we need a
translation of the data in light of relational algebra. So, the impedance
mismatch requires translation between the two representations, as denoted in
the following figure:
An in-memory order (ID: 501, customer: John, with its line items and payment
details such as card: Amex, CC number: 22255, expiry: 04/2027) maps onto
separate Orders, Customers, Order lines, and Credit cards tables.
A single aggregated value in the view is composed of several rows and tables in the
relational database
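The following Python sketch illustrates the mismatch using the order from the figure above. The line items are invented for the example, since the figure doesn’t show them:

```python
# In memory, an order is one nested structure:
order = {
    "id": 501,
    "customer": "John",
    "line_items": [
        {"product": "book", "qty": 2},   # illustrative items, not from the figure
        {"product": "pen", "qty": 10},
    ],
    "payment": {"card": "Amex", "cc_number": "22255", "expiry": "04/2027"},
}

# In a relational store, the same aggregate must be flattened into several
# tables of simple values, linked back together by the order ID.
orders_rows      = [(order["id"], order["customer"])]
order_line_rows  = [(order["id"], li["product"], li["qty"])
                    for li in order["line_items"]]
credit_card_rows = [(order["id"], order["payment"]["card"],
                     order["payment"]["cc_number"], order["payment"]["expiry"])]

print(orders_rows, order_line_rows, credit_card_rows, sep="\n")
```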
Why non-relational (NoSQL) databases?
Cost: Licenses for many RDBMSs are pretty expensive, while many NoSQL
databases are open source and freely available. Similarly, some RDBMSs
rely on costly proprietary hardware and storage systems, while NoSQL
databases usually use clusters of cheap commodity servers.
NoSQL databases are divided into various categories based on the nature of the
operations and features, including document store, columnar database, key-
value store, and graph database. We’ll discuss each of them along with their use
cases from the system design perspective in the following sections.
The main categories of non-relational databases: key-value store, document,
column-oriented/columnar, and graph.
Key-value database
Key-value databases use key-value methods like hash tables to store data in
key-value pairs, as depicted in the figure below. The key serves as a unique or
primary key, and the values can be anything ranging from simple scalar values
to complex objects. These databases allow easy partitioning and horizontal
scaling of the data. Some popular key-value databases include Amazon DynamoDB,
Redis, and Memcached DB.
Data stored in the form of key-value pair in DynamoDB, where the key is the combination of
two attributes (Product ID and Type)
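The following toy Python class sketches the idea; the put/get names and the sample product are illustrative, not a real client API:

```python
# A toy key-value store: a hash table mapping a composite key to any value,
# echoing the (Product ID, Type) composite key mentioned above.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
# The key is the combination of two attributes (Product ID and Type);
# the value can be anything from a scalar to a complex object.
store.put(("p-100", "book"), {"title": "Toy Story", "price": 4.76})
print(store.get(("p-100", "book")))
```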
Document database
A document database stores data in the form of documents, such as JSON or XML
files.
Use case: Document databases are suitable for unstructured catalog data, like
JSON files or other complex, structured, hierarchical data. For example, in
e-commerce applications, a product can have thousands of attributes, which
would be infeasible to store in a relational database due to the impact on
reading performance. Here comes the role of a document database, which can
efficiently store all of a product’s attributes in a single document for easy
management and faster reading speed. Moreover, it’s also a good option for
content management applications, such as blogs and video platforms, where each
entity required by the application is stored as a single document.
The following example shows data stored in a JSON document. This data is about
a person. Various attributes are stored in the file, including id, name, email, and
so on.
1 { "id": 1001,
2 "name": "Brown",
3 "title": "Mr.",
4 "email": "[email protected]",
5 "cell": "123-465-9999",
6 "likes": [
7 "designing",
8 "cycling",
9 "skiing"],
10 "businesses": [
11 { "name": "ABC co.",
12 "partner": "Vike",
13 "status": "Bankrupt",
14 "date_founded": {
15 "$date": "2021-12-10" } }]}
Graph database
Graph databases use the graph data structure to store data, where nodes
represent entities, and edges show relationships between entities. The
organization of nodes based on relationships leads to interesting patterns
between the nodes. This database allows us to store the data once and then
interpret it differently based on relationships. Popular graph databases include
Neo4J, OrientDB, and InfiniteGraph. Graph data is kept in store files for
persistent storage. Each of the files contains data for a specific part of the graph,
such as nodes, links, properties, and so on.
In the following figure, some data is stored using a graph data structure in nodes
connected to each other via edges representing relationships between nodes.
Each node has some properties, like Name, ID, and Age. The node having ID: 2 has
the Name of James and Age of 29 years.
Two user nodes, ID: 1 (Name: Robert, Age: 25) and ID: 2 (Name: James, Age: 29),
are connected by "Knows" edges (ID: 230 and ID: 231, Since: 2010/11/03). Both
user nodes are also connected to node ID: 3 (Type: Group, Name: Cricket)
through membership edges.
A graph consists of nodes and links. This graph captures entities and their relationships with
each other
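Here’s a minimal in-memory sketch of the figure’s graph using plain Python dictionaries. A real graph database would persist and index this in store files; the "Member" edge label is illustrative, since that part of the figure is garbled:

```python
# Nodes carry properties; labeled edges carry relationship properties.
nodes = {
    1: {"Name": "Robert", "Age": 25},
    2: {"Name": "James", "Age": 29},
    3: {"Type": "Group", "Name": "Cricket"},
}

# (source, target, label, properties)
edges = [
    (1, 2, "Knows", {"Since": "2010/11/03"}),
    (1, 3, "Member", {}),  # membership edge labels are illustrative
    (2, 3, "Member", {}),
]

# Interpreting the same data through relationships: who knows whom?
for src, dst, label, props in edges:
    if label == "Knows":
        print(f"{nodes[src]['Name']} knows {nodes[dst]['Name']} "
              f"since {props['Since']}")
```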
Use case: Graph databases can be used in social applications and provide
interesting facts and figures among different kinds of users and their activities.
The focus of graph databases is to store data and pave the way to drive analyses
and decisions based on relationships between entities. The nature of graph
databases makes them suitable for various applications, such as data regulation
and privacy, machine learning research, financial services-based applications,
and many more.
Columnar database
Columnar databases store data in columns instead of rows. They enable access
to all entries in the database column quickly and efficiently. Popular columnar
databases include Cassandra, HBase, Hypertable, and Amazon Redshift.
Use case: Columnar databases are efficient for a large number of aggregation
and data analytics queries. They drastically reduce the disk I/O requirements
and the amount of data that needs to be loaded from disk. For example, in
applications related to financial institutions, there’s a need to sum financial
transactions over a period of time. Columnar databases make this operation
quicker by reading just the column for the amount of money, ignoring the other
attributes of customers.
The following figure shows an example of a columnar database, where data is
stored in a column-oriented format. This is unlike relational databases, which
store data in a row-oriented fashion:
A row-oriented database stores all the values of a record together, while a
columnar database stores all the values of each column together.
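A small Python sketch of the difference; the sales figures are illustrative. Summing one column from columnar storage touches a single list, while row storage must scan every full row:

```python
# The same five records stored two ways.
rows = [  # (store_key, product_key, sales, cost, profit) -- illustrative values
    (1, 1, 4.76, 1.50, 3.26),
    (2, 2, 14.24, 7.20, 7.04),
    (3, 3, 7.79, 1.76, 7.03),
    (4, 4, 3.60, 2.45, 1.15),
    (5, 5, 4.40, 3.23, 1.17),
]

columns = {
    "store_key":   [r[0] for r in rows],
    "product_key": [r[1] for r in rows],
    "sales":       [r[2] for r in rows],
    "cost":        [r[3] for r in rows],
    "profit":      [r[4] for r in rows],
}

# Row-oriented: read every row, pick out one field.
print(sum(r[2] for r in rows))     # ~34.79
# Column-oriented: read just the one column we need.
print(sum(columns["sales"]))       # ~34.79
```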
Drawbacks of NoSQL databases
Lack of standardization
NoSQL doesn’t follow any specific standard, like how relational databases
follow relational algebra. Porting applications from one type of NoSQL database
to another might be a challenge.
Consistency
NoSQL databases provide weaker consistency guarantees than relational
databases; keeping replicas consistent is largely left to the application.
Choose the right database
A relational database is a sensible choice if the size of the data is
relatively small and can fit on a node, whereas a non-relational database
suits cases where the size of the data to be stored is large.
Note: When NoSQL databases first came into being, they were
drastically different to program and use compared to traditional
databases. However, due to extensive research in academia and industry
over the years, the programmer-facing differences between NoSQL and
traditional stores are blurring. We might use the same SQL constructs to
talk to a NoSQL store and get a similar level of performance and
consistency as a traditional store. Google’s Cloud Spanner is one such
database: it’s geo-replicated, with automatic horizontal sharding and
high-speed global snapshots of data.
Quiz
Test your knowledge of the different types of databases via a quiz.
• Replication
• Synchronous versus asynchronous replication
• Data replication models
• Single leader/primary-secondary replication
• Primary-secondary replication methods
• Statement-based replication
• Write-ahead log (WAL) shipping
• Logical (row-based) log replication
• Multi-leader replication
• Conflict
• Handle conflicts
• Conflict avoidance
• Last-write-wins
• Custom logic
• Multi-leader replication topologies
• Peer-to-peer/leaderless replication
• Quorums
Data is an asset for an organization because it drives the whole business. Data
provides critical business insights into what’s important and what needs to be
changed. Organizations also need to securely save and serve their clients’ data
on demand. Timely access to the required data under varying conditions
(increasing reads and writes, disks and node failures, network and power
outages, and so on) is required to successfully run an online business.
Replication
Replication refers to keeping multiple copies of the data at various nodes
(preferably geographically distributed) to achieve availability, scalability, and
performance. In this lesson, we assume that a single node is enough to hold our
entire data. We won’t use this assumption while discussing the partitioning of
data in multiple nodes. Often, the concepts of replication and partitioning go
together.
However, with many benefits, like availability, replication comes with its
complexities. Replication is relatively simple if the replicated data doesn’t
require frequent changes. The main problem in replication arises when we have
to maintain changes in the replicated data over time.
A user’s requests go to a server whose original database is replicated to
multiple replicas.
Replication in action
Synchronous versus asynchronous replication
In synchronous replication, the primary node waits for acknowledgments from
the secondary nodes before confirming the write to the client. In asynchronous
replication, the primary node reports success as soon as it has written
locally, and the changes propagate to the secondaries in the background.
The advantage of synchronous replication is that all the secondary nodes are
completely up to date with the primary node. However, there’s a disadvantage to
this approach: if one of the secondary nodes doesn’t acknowledge due to failure
or a fault in the network, the primary node can’t acknowledge the client until
it receives a successful acknowledgment from the crashed node. This causes high
latency in the response from the primary node to the client.
In the synchronous case, the primary writes X locally, forwards the write to
both secondaries, and responds with "write complete" only after both
secondaries acknowledge. In the asynchronous case, the primary writes Y
locally, responds with "write complete" immediately, and the changes propagate
to the secondaries afterward.
Primary-secondary data replication model where data is replicated from primary
to secondary
Primary-secondary replication methods
There are three main methods of primary-secondary replication: statement-based
replication, write-ahead log (WAL) shipping, and logical (row-based) log
replication.
Statement-based replication
In the statement-based replication approach, the primary node saves all
statements that it executes, like insert, delete, update, and so on, and sends them
to the secondary nodes to perform. This type of replication was used in MySQL
before version 5.1.
This type of approach seems good, but it has its disadvantages. For example, any
nondeterministic function (such as NOW()) might result in distinct writes on the
follower and leader. Furthermore, if a write statement is dependent on a prior
write, and both of them reach the follower in the wrong order, the outcome on
the follower node will be uncertain.
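Here’s a tiny simulation of that problem, with Python’s random module standing in for a nondeterministic SQL function like NOW(). The statement string and replica lists are illustrative:

```python
import random

# In statement-based replication, each replica executes the statement itself,
# so a nondeterministic function yields different values on leader and follower.
def execute_statement(replica_state, statement):
    if statement == "INSERT created_at = NOW()":
        # Each replica evaluates "NOW()" locally; we simulate that with a
        # random clock reading that differs per replica.
        replica_state.append(random.random())

leader, follower = [], []
statement = "INSERT created_at = NOW()"
execute_statement(leader, statement)
execute_statement(follower, statement)  # replaying the same statement

print(leader == follower)  # almost certainly False: the replicas diverged
```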
Write-ahead log (WAL) shipping
In the write-ahead log (WAL) shipping approach, the primary node saves the
query before executing it in a log file known as a write-ahead log file. It then
uses these logs to copy the data onto the secondary nodes. This is used in
PostgreSQL and Oracle. The problem with WAL is that it only defines data at a
very low level. It’s tightly coupled with the inner structure of the database
engine, which makes upgrading software on the leader and followers
complicated.
Multi-leader replication
As discussed above, single leader replication using asynchronous replication has
a drawback. There’s only one primary node, and all the writes have to go
through it, which limits the performance. In case of failure of the primary node,
the secondary nodes may not have the updated database.
In multi-leader replication, all updates are made to one of the primary nodes,
and changes propagate from each primary node to the other primary nodes.
Conflict
Conflicts arise in multi-leader replication when different leaders concurrently
modify the same data.
Handle conflicts
Conflicts can result in different data at different nodes. These should be handled
efficiently without losing any data. Let’s discuss some of the approaches to
handle conflicts:
Leader 1 executes "Write Y where Z" while leader 2 concurrently executes
"Write X where Z," producing conflicting writes over time.
Conflict of writes
Conflict avoidance
A simple strategy to deal with conflicts is to prevent them from happening in the
first place. Conflicts can be avoided if the application can verify that all writes
for a given record go via the same leader.
However, the conflict may still occur if a user moves to a different location and is
now near a different data center. If that happens, we need to reroute the traffic.
In such scenarios, the conflict avoidance approach fails and results in
concurrent writes.
Last-write-wins
Using their local clock, all nodes assign a timestamp to each update. When a
conflict occurs, the update with the latest timestamp is selected.
This approach can also create difficulty because the clock synchronization across
nodes is challenging in distributed systems. There’s clock skew that can result in
data loss.
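A minimal sketch of last-write-wins resolution; the timestamps and values are invented, and the second entry shows how a skewed clock can silently discard the newer write:

```python
# Last-write-wins: keep the version whose (local) timestamp is latest.
def resolve_lww(versions):
    """versions: list of (timestamp, value) pairs written for the same key."""
    return max(versions, key=lambda v: v[0])[1]

conflicting = [
    (1700000000.120, "X"),  # written on leader 1
    (1700000000.118, "Y"),  # written on leader 2, whose clock runs behind
]
print(resolve_lww(conflicting))  # "X" wins; "Y" is lost despite possibly
                                 # being the logically newer write
```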
Custom logic
In this approach, we can write our own logic to handle conflicts according to the
needs of our application. This custom logic can be executed on both reads and
writes. When the system detects a conflict, it calls our custom conflict handler.
Peer-to-peer/leaderless replication
In primary-secondary replication, the primary node is a bottleneck and a single
point of failure. Moreover, it helps to achieve read scalability but fails in
providing write scalability. The peer-to-peer replication model resolves these
problems by not having a single primary node. All the nodes have equal
weightage and can accept reads and writes requests. Amazon popularized such a
scheme in their DynamoDB data store.
Peer-to-peer data replication model where all nodes accept reads and writes
and communicate their writes to each other
Quorums
Let’s suppose we have three nodes. If at least two out of three nodes are
guaranteed to have acknowledged a successful update, then at most one node can
be missing it. This means that if we read from two nodes, at least one of them
will have the updated version, and our system can continue working.
With n = 3 replicas, a write succeeds on two replicas and fails on one; a
subsequent read of two replicas is guaranteed to see the latest value on at
least one of them.
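The following sketch models this with n = 3, w = 2, and r = 2, satisfying the usual quorum condition w + r > n so that every read set overlaps every write set. The dictionary-backed replicas and function names are illustrative:

```python
replicas = [{}, {}, {}]   # three replica stores
N, W, R = 3, 2, 2         # w + r > n guarantees read/write overlap

def write(key, value, version):
    acks = 0
    for rep in replicas:
        if acks == W:          # pretend the remaining replica is slow or failed
            break
        rep[key] = (version, value)
        acks += 1
    return acks >= W           # success only with a write quorum

def read(key):
    # Consult R replicas and keep the value with the highest version number.
    responses = [rep[key] for rep in replicas[:R] if key in rep]
    return max(responses)[1] if responses else None

write("x", "v1", version=1)
print(read("x"))  # 'v1' -- at least one of the R replicas saw the write
```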
Data Partitioning
At some point, a single node-based database isn’t enough to tackle the load. We
might need to distribute the data over many nodes while still retaining all the
nice properties of relational databases. In practice, it has proved challenging
to provide single-node database-like properties over a distributed database.
Data partitioning (or sharding) enables us to use multiple nodes where each
node manages some part of the whole data. To handle increasing query rates
and data amounts, we strive for balanced partitions and balanced read/write
load.
We’ll discuss different ways to partition data, related challenges, and their
solutions in this lesson.
A database with two partitions (partition 0 and partition 1) to distribute the
data and associated read/write load
Sharding
To divide load among multiple nodes, we need to partition the data by a
phenomenon known as partitioning or sharding. In this approach, we split a
large dataset into smaller chunks of data stored at different nodes on our
network.
The partitioning must be balanced so that each partition receives about the same
amount of data. If partitioning is unbalanced, the majority of queries will fall
into a few partitions. Partitions that are heavily loaded will create a system
bottleneck. The efficacy of partitioning will be harmed because a significant
portion of data retrieval queries will be sent to the nodes that carry the highly
congested partitions. Such partitions are known as hotspots. Generally, we use
the following two ways to shard the data:
Vertical sharding
Horizontal sharding
Vertical sharding
We can put different tables in various database instances, which might be
running on a different physical server. We might break a table into multiple
tables so that some columns are in one table while the rest are in the other. We
should be careful if there are joins between multiple tables. We may like to keep
such tables together on one shard.
Often, vertical sharding is used to increase the speed of data retrieval from a
table consisting of columns with very wide text or a binary large object (blob). In
this case, the column with large text or a blob is split into a different table.
As shown in the figure below, the Employee table is divided into two tables: a
reduced Employee table and an EmployeePicture table. The EmployeePicture table
has just two columns, EmployeeID and Picture, separated from the original
table. Moreover, the primary key EmployeeID of the Employee table is included
in both partitioned tables. This makes data reads and writes easier, and the
reconstruction of the original table can be performed efficiently.
The Employee table with the columns EmployeeID, Name, and Picture is split into
a reduced Employee table (EmployeeID, Name) and an EmployeePicture table
(EmployeeID, Picture).
Vertical partitioning
Horizontal sharding
At times, some tables in the databases become too big and affect read/write
latency. Horizontal sharding or partitioning is used to divide a table into
multiple tables by splitting data row-wise, as shown in the figure in the next
section. Each partition of the original table distributed over database servers is
called a shard. Usually, there are two strategies available:
An Invoice table with the columns Customer_Id, Invoice_Id, and Creation_date
(for example, customer 1’s invoice 5101 created on 01-01-2015, and customer 2’s
invoices 5201 and 5301) is split row-wise across database shards. Each shard
holds the Customer, Invoice, and Invoice_item tables for its own subset of
customers.
Horizontal partitioning
Key-range-based sharding
In key-range-based sharding, each shard holds a continuous range of partition
keys, so we know which shard to consult for a given key. A few practical
points:
There’s a partition key in the Customer mapping table. This table resides on
each shard and stores the partition keys used in the shard. Applications
create a mapping logic between the partition keys and database shards by
reading this table from all shards to make the mapping efficient.
Sometimes, applications use advanced algorithms to determine the location
of a partition key belonging to a specific shard.
Primary keys are unique across all database shards to avoid key collisions
during data migration among shards and when merging data in the online
analytical processing (OLAP) environment.
Advantages
Using the key-range-based sharding method, a range-query-based scheme is
easy to implement: we know precisely where (which node, which shard) to
look for a specific range of keys.
Range queries can be performed using the partitioning keys, which can be
kept in partitions in sorted order. How exactly such sorting happens over
time as new data comes in is implementation specific.
Disadvantages
Range queries can’t be performed using keys other than the partitioning
key.
If keys aren’t selected properly, some nodes may have to store more data
due to an uneven distribution of the traffic.
Hash-based sharding
In the illustration below, we use the hash function f = Value mod n, where n is
the number of nodes (four). We allocate keys to nodes by checking the mod of
each key’s hash value: keys with a mod value of 1 are allocated to node 1, keys
with a mod value of 2 to node 2, and keys with a mod value of 3 to node 3.
Because there’s no key with a mod value of 0, node 0 is left vacant.
f = Value mod 4

Key  Hash value  Node
2    150355825   1
3    266622091   3
4    133825114   2
5    209053885   1
6    159421699   3

Hash-based sharding
Advantages
Keys are uniformly distributed across the nodes.
Disadvantages
We can’t perform range queries with this technique. Keys will be spread
over all partitions.
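The allocation can be reproduced in a few lines of Python using the hash values from the figure:

```python
# Hash-based sharding: node = hash_value mod n, with n = 4 nodes.
hash_values = {2: 150355825, 3: 266622091, 4: 133825114,
               5: 209053885, 6: 159421699}

n = 4
for key, h in hash_values.items():
    print(f"key {key} -> node {h % n}")
# key 2 -> node 1, key 3 -> node 3, key 4 -> node 2,
# key 5 -> node 1, key 6 -> node 3; node 0 stays vacant
```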
Empirically, we can determine how much data each node can serve with
acceptable performance. This helps us find the maximum amount of data that
we would like to keep on any one node. For example, if we find that we can
put a maximum of 50 GB of data on one node, then for a 10 TB database we
have the following:

Database size = 10 TB
Maximum data per node = 50 GB
Minimum number of nodes = 10 TB / 50 GB = 200
Consistent hashing
Consistent hashing assigns each server or item in a distributed hash table a
place on an abstract circle, called a ring, irrespective of the number of servers in
the table. This permits servers and objects to scale without compromising the
system’s overall performance.
Usually, we avoid using the hash of a key for partitioning in the hash mod n
style (we used such a scheme to explain the concept of hashing in simple terms
earlier). The problem with the addition or removal of nodes under hash mod n is
that every node’s partition assignment changes and a lot of data moves. For
example, assume we have hash(key) = 1235. If we have five nodes at the start,
the key will start on node 1 (1235 mod 5 = 0). Now, if a new node is added, the
key would have to be moved to node 6 (1235 mod 6 = 5), and so on. This moving
of keys from one node to another makes rebalancing costly.
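Below is a minimal Python sketch of a hash ring, without the virtual nodes that production systems typically add; md5 is used here only as a stable hash, and the node names are illustrative:

```python
import bisect
import hashlib

def ring_hash(name: str) -> int:
    # Any stable hash works; md5 is used here purely for illustration.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (ring_hash(node), node))

    def lookup(self, key):
        # A key belongs to the first node clockwise from its hash position.
        h = ring_hash(key)
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, h) % len(self._ring)  # wrap around the circle
        return self._ring[i][1]

ring = ConsistentHashRing(["node1", "node2", "node3", "node4", "node5"])
before = {k: ring.lookup(k) for k in map(str, range(20))}
ring.add("node6")  # adding a node moves only the keys that now fall before it
after = {k: ring.lookup(k) for k in map(str, range(20))}
moved = sum(before[k] != after[k] for k in before)
print(f"{moved} of 20 keys moved")  # typically only a small fraction
```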
Fixed number of partitions
In this approach, the number of partitions to be created is fixed at the time when
we set our database up. We create a higher number of partitions than the nodes
and assign these partitions to nodes. So, when a new node is added to the
system, it can take a few partitions from the existing nodes until the partitions
are equally divided.
There’s a downside to this approach. The size of each partition grows with the
total amount of data in the cluster since all the partitions contain a small part of
the total data. If a partition is very small, it will result in too much overhead
because we may have to make a large number of small-sized partitions, each
costing us some overhead. If the partition is very large, rebalancing the nodes
and recovering from node failures will be expensive. It’s very important to
choose the right number of partitions. A fixed number of partitions is used in
Elasticsearch, Riak, and many more.
Dynamic partitioning
In this approach, when the size of a partition reaches the threshold, it’s split
equally into two partitions. One of the two split partitions is assigned to one
node and the other one to another node. In this way, the load is divided equally.
The number of partitions adapts to the overall data amount, which is an
advantage of dynamic partitioning.
Secondary indexes
Secondary indexes can also be partitioned. In the first approach, each
partition is fully independent: it has its own secondary indexes covering just
the documents in that partition and is unconcerned with the data held in other
partitions. If we want to write anything to our database, we only need to
handle the partition that contains the document ID we’re writing. This is also
known as a local index. In the illustration below, there are three partitions,
each with its own index and data. If we want to get all the customer IDs with
the name John, we have to query all partitions.
n "
= "Joh Par
titio
n ame Partition 0 n ed w
cust_ ith h
ash
re
id whe func
tion
c ust_
Get
Ge
tcust
User _id c tion
wh
ere Partition 1 h fun Database
cus has
t_n with
a me ed
=" ti tion
Joh Par
n"
Partition 2
Instead of creating a secondary index for each partition (a local index), we can
make a global index for secondary terms that encompasses data from all
partitions.
In the illustration below, we create indexes on names (the term on which we’re
partitioning) and store all the indexes for names on separate nodes. To get the
cust_id of all the customers named John, we must determine where our term
index is located. Index 0 contains all the customers with names starting with
“A” to “M.” Index 1 contains all the customers with names beginning with
“N” to “Z.” Because John lies in index 0, we fetch the list of cust_id values
with the name John from index 0.
ohn"
Contains names starting from A-M
nam e = "J
cust_
here
us t_id w
Get c
Partition 0
Get cu
ohn" st_id wh
ust_na me = "J ere cu
st_nam
t_id where c e = "Jo
Get cus Index 0 hn"
Partition 1
Index 1
Partition 2
Request routing
We’ve learned how to partition our data. However, one question arises here:
How does a client know which node to connect to while making a request? The
allocation of partitions to nodes varies after rebalancing. If we want to read a
specific key, how do we know which IP address we need to connect to read?
There are a few approaches to request routing:
1. Allow the clients to request any node in the network. If that node doesn’t
contain the requested data, it forwards the request to the node that does
contain the related data.
2. Create a routing tier. All the requests are first forwarded to the routing
tier, which determines the node to connect to in order to fulfill the request.
3. The clients already have the information related to partitioning and which
partition is connected to which node, so they can directly contact the node
that contains the data they need.
ZooKeeper
To track changes in the cluster, many distributed data systems need a separate
management server like ZooKeeper. ZooKeeper keeps track of all the mappings
in the network, and each node connects to ZooKeeper for this information.
Whenever there’s a change in the partitioning, or a node is added or removed,
ZooKeeper gets updated and notifies the routing tier about the change. HBase,
Kafka, and SolrCloud use ZooKeeper.
Conclusion
Partitioning has become standard practice in modern distributed systems.
Because systems hold ever-increasing amounts of data, partitioning the data
makes sense: it speeds up both writes and reads and increases the system’s
availability, scalability, and performance.
The following sections explain the pros and cons of no sharding (a centralized
database) versus sharding (a distributed database).

Centralized database
Advantages
It’s more efficient for businesses with a small amount of data to store, since
the data can reside on a single node.
Disadvantages
A centralized database can slow down, causing high latency for end users,
when the number of queries per second accessing it approaches single-node
limits.

Distributed database
Disadvantages
Sometimes, data is required from multiple sites, which takes more time than
expected.
Store (stored at site A)
Store_key  City           Region
1          New York       East
2          San Francisco  West
3          Atlanta        East
4          Los Angeles    West
5          Chicago        Central

Product (stored at site B)
Product_key  Description    Brand
1            Toy Story      Wolf
2            The Hobbit     Warner Bros.
3            The Batman     Warner Bros.
4            The Juror      MKF Studio
5            Jurassic Park  Universal Picture

Sales (stored at site A)
Store_key  Product_key  Sales  Cost  Profit
1          1            4.76   1.5   3.26
2          2            14.24  7.2   7.04
3          3            7.79   1.76  7.03
4          4            3.6    2.45  1.15
5          5            4.4    3.23  1.17
Let’s assume the distribution of the tables across sites is the following:
The Store table has 10,000 tuples stored at site A.
The Product table has 100,000 tuples stored at site B.
The Sales table has one million tuples stored at site A.
The query performs join operations on the Store, Sales, and Product tables and
retrieves the Store_key values from the table generated as a result of the
joins.
Next, assume every stored tuple is 200 bits long, which is equal to 25 bytes.
Furthermore, the one estimated cardinality of an intermediate result that we
need is that only 10 tuples of the Product table have the brand Wolf.
Parameters assumption
Before processing the query using different approaches, let’s define some
parameters:
a = Total access delay (seconds)
b = Data rate (bits per second)
v = Total data volume (bits)
Now, let’s compute the total communication time, T, according to the following
formula:
T = a + v/b
Possible approaches
Move the Product table to site A and process the query at A:
T = 0.1 + (100,000 × 200) / 50,000,000 = 0.5 seconds
Here, 0.1 seconds is the access delay of the table at site A, and 100,000 is
the number of tuples in the Product table. The size of each tuple is 200 bits,
and 50,000,000 bits per second is the data rate. The figures 200 and 50,000,000
are the same in all of the following calculations.
Restrict Brand at site B to Wolf (called selection) and move the result to site
A:
T = 0.1 + (10 × 200) / 50,000,000 ≈ 0.1 seconds
Here, 0.1 seconds is the access delay of the Product table, and 10 is the
number of tuples with the brand Wolf.
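These figures are easy to verify in a few lines of Python using T = a + v/b:

```python
# Recomputing the communication times with T = a + v / b.
a = 0.1           # access delay, in seconds
b = 50_000_000    # data rate, in bits per second
tuple_size = 200  # bits per tuple

def comm_time(num_tuples):
    return a + (num_tuples * tuple_size) / b

print(comm_time(100_000))  # move the whole Product table:  0.5 seconds
print(comm_time(10))       # move only the 10 Wolf tuples: ~0.1 seconds
```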
When we compare these approaches, the last one provides the least latency
(0.1 seconds). We didn’t calculate filtering at site A because the number of
rows would be much larger, and hence the data volume would be greater than in
the last case (filtering at site B and then fetching the data). This example
shows that careful query optimization is also critical in a distributed
database.
Conclusion
Data distribution (vertical and horizontal sharding) across multiple nodes aims
to improve the following features, considering that the queries are optimized:
Reliability (fault-tolerance)
Performance
Balanced storage capacity and dollar costs
Both centralized and distributed databases have their pros and cons. We should
choose them according to the needs of our application.