NoSQL Module 2 Part 1
Distribution Models
The primary driver of interest in NoSQL has been its ability to run databases on a
large cluster.
As data volumes increase, it becomes more difficult and expensive to scale up—buy a
bigger server to run the database on.
Aggregate orientation fits well with scaling out because the aggregate is a natural unit to
use for distribution.
Depending on your distribution model, you can get a data store that gives you the
ability to handle larger quantities of data, to process greater read or write
traffic, or to remain available in the face of network slowdowns or breakages. These are often
important benefits, but they come at a cost.
Running over a cluster introduces complexity—so it’s not something to do unless the
benefits are compelling.
Replication and sharding are orthogonal techniques: You can use either or both of
them. Replication comes in two forms: master-slave and peer-to-peer.
We will now discuss these techniques starting at the simplest and working up to the more
complex: first single-server, then master-slave replication, then sharding, and finally
peer-to-peer replication.
Single-Server
Is a Database a Server?
As defined by the client-server model, a database server is a server that provides database
services to other programs or computers. Querying relational databases is handled by a
common query language, SQL (Structured Query Language).
The first and simplest distribution option is the one we would most often
recommend—no distribution at all.
Run the database on a single machine that handles all the reads and writes to the data store.
We prefer this option because it eliminates all the complexities that the other options
introduce; it’s easy for operations people to manage and easy for application developers
to reason about.
Although a lot of NoSQL databases are designed around the idea of running on a cluster, it
can make sense to use NoSQL with a single-server distribution model if the data model of
the NoSQL store is more suited to the application.
If your data usage is mostly about processing aggregates, then a single-server document
or key-value store may well be worthwhile because it’s easier on application developers.
A single-server database often has a fixed limit on ingest throughput, since it runs on
a single machine.
The limits could be I/O, memory, storage capacity, processing power, or a combination
of these.
Sharding
Often, a busy data store is busy because different people are accessing different parts of the
dataset. In these circumstances we can support horizontal scalability by putting different
parts of the data onto different servers—a technique called sharding. (While horizontal
scaling refers to adding additional nodes, vertical scaling describes adding more power to
your current machines. For instance, if your server requires more processing power, vertical
scaling would mean upgrading the CPUs; you can also vertically scale the memory, storage,
or network speed.)
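To make the idea concrete, here is a minimal sketch of hash-based sharding. The shard count, the use of a user name as the shard key, and the stable MD5-based routing rule are all assumptions chosen for illustration, not a prescription from any particular database.

```python
# Illustrative sketch only: routing records to shards by hashing a key.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record's key to a shard index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Different users' data lands on (usually) different servers, so their
# traffic is spread across the cluster.
shards = {i: {} for i in range(NUM_SHARDS)}
for user in ("alice", "bob", "carol"):
    shards[shard_for(user)][user] = {"orders": []}
```

Because the hash is stable, every request for the same user is routed to the same shard, which is what makes the aggregate a natural unit of distribution.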
By: Yojana KiranKumar, Asst. Professor, Dept. of BVoc.
Module 2: NoSQL Database, 12-07-22
Sharding is a type of database partitioning that separates large databases into smaller, faster,
more easily managed parts. These smaller parts are called data shards. The word shard
means "a small part of a whole."
Sharding involves splitting and distributing one logical data set across multiple databases
that share nothing and can be deployed across multiple servers. To achieve sharding, the
rows or columns of a larger database table are split into multiple smaller tables.
Once a logical shard is stored on another node, it is known as a physical shard. One physical
shard can hold multiple logical shards. The shards are autonomous and don't share the same
data or computing resources. That's why they exemplify a shared-nothing architecture. At
the same time, the data in all the shards represents a logical data set.
∙ Horizontal sharding. When each new table has the same schema but unique rows, it is
known as horizontal sharding. In this type of sharding, more machines are added to an
existing stack to spread out the load, increase processing speed, and support more
traffic. This method is most effective when queries return a subset of rows that are
often grouped together.
∙ Vertical sharding. When each new table has a schema that is a faithful subset of
the original table's schema, it is known as vertical sharding.
It is effective when queries usually return only a subset of columns of the data.
The following illustrates how new tables look when both horizontal and vertical sharding
are performed on the same original data set.
Horizontal shards

Shard 1
Student ID  Name  Age  Major      Hometown
1           Amy   21   Economics  San Francisco

Shard 2
Student ID  Name  Age  Major    Hometown
2           Jack  20   History  Austin

Vertical shards

Shard 1
Student ID  Name  Age
1           Amy   21
2           Jack  20

Shard 2
Student ID  Major
1           Economics
2           History

Shard 3
Student ID  Hometown
1           San Francisco
2           Austin

Benefits of sharding
Since shards are smaller, faster and easier to manage, they help boost database scalability,
performance and administration.
Horizontal scaling, also known as scaling out, helps create a more flexible
database design, which is especially useful for parallel processing. (Parallel processing is a
method in computing of running two or more processors (CPUs) to handle separate parts
of an overall task; breaking a task up among multiple processors helps reduce the
time needed to run a program.) It provides near-limitless scalability for
intense workloads and big-data requirements.
With horizontal sharding, users can optimally use all the compute resources
across a cluster for every query.
This sharding method also speeds up query resolution, since each machine has to scan
fewer rows when responding to a query.
Vertical sharding increases RAM or storage capacity and improves central processing
unit (CPU) capacity.
Sharded databases also offer higher availability and mitigate the impact of outages
because, during an outage, only those portions of an application that rely on the missing
chunks of data become unusable.
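The row and column splits described above can be sketched in a few lines. This is an illustrative toy only: the student rows are sample data, and the split rules (id parity for the horizontal split, the particular column subsets for the vertical one) are arbitrary choices made for clarity.

```python
# Sample data standing in for the original students table.
students = [
    {"id": 1, "name": "Amy", "age": 21, "major": "Economics", "hometown": "San Francisco"},
    {"id": 2, "name": "Jack", "age": 20, "major": "History", "hometown": "Austin"},
]

# Horizontal sharding: every shard keeps the full schema but only some rows.
horizontal = {
    "shard1": [r for r in students if r["id"] % 2 == 1],
    "shard2": [r for r in students if r["id"] % 2 == 0],
}

# Vertical sharding: every shard keeps all rows but only some columns,
# carrying the id so rows can be re-joined later.
def project(rows, cols):
    return [{c: r[c] for c in cols} for r in rows]

vertical = {
    "shard1": project(students, ["id", "name", "age"]),
    "shard2": project(students, ["id", "major"]),
    "shard3": project(students, ["id", "hometown"]),
}
```

Note that every vertical shard retains the id column; without it, the original rows could not be reassembled.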
Master-Slave Replication
With master-slave distribution, you replicate data across multiple
nodes. One node is designated as the master, or primary.
This master is the authoritative source for the data and is usually responsible for
processing any updates to that data.
Master-slave replication is most helpful for scaling when you have a read-intensive
dataset.
You can scale horizontally to handle more read requests by adding more slave nodes and
ensuring that all read requests are routed to the slaves.
You are still, however, limited by the ability of the master to process updates and its ability
to pass those updates on.
Consequently it isn’t such a good scheme for datasets with heavy write traffic,
although offloading the read traffic will help a bit with handling the write load.
A second advantage of master-slave replication is read resilience: Should the master fail, the
slaves can still handle read requests. Again, this is useful if most of your data access is
reads.
The failure of the master does eliminate the ability to handle writes until either the master
is restored or a new master is appointed.
However, having slaves as replicas of the master does speed up recovery after a failure
of the master, since a slave can be appointed as the new master very quickly.
The ability to appoint a slave to replace a failed master means that master-slave replication
is useful even if you don’t need to scale out.
All read and write traffic can go to the master while the slave acts as a hot backup. In
this case it’s easiest to think of the system as a single-server store with a hot backup.
You get the convenience of the single-server configuration but with greater resilience—
which is particularly handy if you want to be able to handle server failures gracefully.
Masters can be appointed manually or automatically.
Manual appointing typically means that when you configure your cluster, you configure
one node as the master. With automatic appointment, you create a cluster of nodes and
they elect one of themselves to be the master.
Apart from simpler configuration, automatic appointment means that the cluster can
automatically appoint a new master when a master fails, reducing downtime. In order to
get read resilience, you need to ensure that the read and write paths into your application
are different, so that you can handle a failure in the write path and still read.
This includes such things as putting the reads and writes through separate database
connections—a facility that is not often supported by database interaction libraries. As with
any feature, you cannot be sure you have read resilience without good tests that disable
the writes and check that reads still occur.
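The separate read and write paths can be sketched as a small router. This is a minimal sketch in which plain in-memory dicts stand in for database nodes; the class and method names are invented for illustration, and replication is done synchronously only to keep the example short.

```python
import random

class MasterSlaveRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # All updates go through the single authoritative master.
        self.master[key] = value
        # A real master would propagate changes asynchronously; here we
        # copy to each slave synchronously to keep the sketch simple.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Reads are spread across the slaves and keep working even if
        # the master is down, which is the read resilience being tested.
        return random.choice(self.slaves).get(key)

router = MasterSlaveRouter(master={}, slaves=[{}, {}])
router.write("stock:42", 7)
```

A read-resilience test in this spirit would disable writes (take the master away) and assert that `router.read` still returns data from the slaves.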
Replication comes with some alluring benefits, but it also comes with an inevitable
dark side— inconsistency.
You have the danger that different clients, reading different slaves, will see different values
because the changes haven’t all propagated to the slaves. In the worst case, that can
mean that a client cannot read a write it just made.
Even if you use master-slave replication just for hot backup this can be a concern, because
if the master fails, any updates not passed on to the backup are lost.
Peer-to-Peer Replication
Peer-to-peer replication occurs when two or more servers or nodes, each of which can be
a standalone server, replicate data changes between each other.
Data can be modified on any of the nodes so in that sense all nodes are equals or peers.
With a peer-to-peer replication cluster, you can ride over node failures without losing
access to data. Furthermore, you can easily add nodes to improve your performance.
The biggest complication is, again, consistency. When you can write to two different places,
you run the risk that two people will attempt to update the same record at the same
time—a write-write conflict.
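One common way to detect such a conflict is to stamp each record with a version number and reject a write whose version is stale. This is a hypothetical sketch: the `Record` class and the conflict rule are invented for illustration, not the mechanism of any particular peer-to-peer database.

```python
class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def update(record, new_value, read_version):
    """Apply a write only if nobody has written since we read."""
    if read_version != record.version:
        raise RuntimeError("write-write conflict")
    record.value = new_value
    record.version += 1

rec = Record("room available")
v = rec.version                  # both peers read the record at version 0
update(rec, "booked by A", v)    # the first write succeeds
# update(rec, "booked by B", v) would now raise: B must re-read and retry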
If we use both master-slave replication and sharding, this means that we have
multiple masters, but each data item only has a single master.
Depending on your configuration, you may choose a node to be a master for some data
and slaves for others, or you may dedicate nodes for master or slave duties.
Using peer-to-peer replication and sharding is a common strategy for
column-family databases.
In a scenario like this you might have tens or hundreds of nodes in a cluster with
data sharded over them.
A good starting point for peer-to-peer replication is to have a replication factor of 3, so each
shard is present on three nodes.
Should a node fail, then the shards on that node will be built on the other nodes.
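A replication-factor-3 layout can be sketched with a simple placement rule. The parameters here are assumptions (5 nodes, replication factor 3) and the "consecutive nodes on a ring" rule is chosen purely for clarity; real systems use more sophisticated placement.

```python
NUM_NODES = 5
REPLICATION_FACTOR = 3

def placement(shard_id, num_nodes=NUM_NODES, rf=REPLICATION_FACTOR):
    """Return the rf nodes that hold copies of this shard."""
    start = shard_id % num_nodes
    return [(start + i) % num_nodes for i in range(rf)]

# Shard 0 lives on nodes 0, 1 and 2; if node 1 fails, its copy can be
# rebuilt from the replicas still on nodes 0 and 2.
```

With every shard on three nodes, any single node failure still leaves two live copies of each shard from which the lost one can be rebuilt.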
• Sharding distributes different data across multiple servers, so each server acts as the
single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in
multiple places. A system may use either or both techniques.
• Master-slave replication makes one node the authoritative copy that handles writes
while slaves synchronize with the master and may handle reads.
********