
NoSQL

21CS745
Module 2
• Distribution Models:
• Single Server, Sharding, Master-Slave Replication, Peer-to-Peer Replication,
• Consistency:
• Update Consistency, Read Consistency, Relaxing Consistency, The CAP
Theorem, Relaxing Durability, Quorums.
• Version Stamps
• Business and System Transactions, Version Stamps on Multiple Nodes

Textbook 1: Chapters 4, 5, 6


• The primary appeal of NoSQL is its ability to run databases on a large cluster.
• As data volumes increase, it becomes more difficult and expensive to
scale up—buy a bigger server to run the database on.
• A more appealing option is to scale out—run the database on a
cluster of servers.
• Depending on the distribution model, we can get a data store that gives us the ability to handle larger quantities of data, to process greater read or write traffic, or to stay available in the face of network slowdowns or breakages.
• There are two paths to data distribution:
• replication and sharding.
• Replication takes the same data and copies it over multiple nodes
• Sharding puts different data on different nodes.
• Replication and sharding are orthogonal techniques
• We can use either or both of them.
• Replication comes in two forms: master-slave and peer-to-peer
Single Server
• Run the database on a single machine that handles all the reads and
writes to the data store.
• This is useful because it eliminates all the complexities that the other options introduce; it's easy for operations people to manage and easy for application developers to reason about.

• Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store better suits the application.
• Ex: Graph databases
• If the data usage is mostly about processing aggregates, then a single-
server document or key-value store may well be worthwhile because it’s
easier on application developers.
Sharding
• A busy data store is busy because different people are accessing
different parts of the dataset
• The technique of putting different parts of the data onto different servers is called sharding (horizontal scalability).
• Ex:
• In the ideal case, different users all talk to different server nodes.
• Each user only has to talk to one server, and so gets rapid responses from that server.
• The load is balanced out nicely between servers—for example, if we have ten
servers, each one only has to handle 10% of the load.
• We need to ensure that data that’s accessed together is clumped(grouped)
together on the same node and that these clumps are arranged on the nodes
to provide the best data access.
Challenges
• 1. How to clump the data so that one user mostly gets their data from a single server. This can be resolved by using aggregates as the unit of distribution.
• 2. How to arrange the data to improve performance.
But how?
If most accesses of certain aggregates are based on a physical location, the data can be placed close to where it is being accessed.
Ex: If you have orders for someone who lives in Boston, you can place that data in your eastern US data center.
• Load balancing:
• Try to arrange aggregates so they are evenly distributed across the nodes and all nodes get equal amounts of the load.
• This may vary over time, for example if some data tends to be accessed on certain days of the week, so domain-specific rules may be needed.
Auto sharding
• Sharding challenges:
• Sharding as part of application logic:
• Ex: all customers with surnames starting with A to D on one shard and E to G on another. This complicates the programming model, as application code needs to ensure that queries are distributed across the various shards (see the sketch below).
• Rebalancing the sharding then means changing the application code and migrating the data.
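A minimal sketch of what surname-range sharding in application logic might look like (the shard names and ranges here are illustrative, not from the text):

```python
# Illustrative sketch: shard routing implemented in application code.
# The shard map is hypothetical; a real system would hold connections.

SHARD_RANGES = [
    ("A", "D", "shard-1"),   # customers with surnames A-D
    ("E", "G", "shard-2"),   # customers with surnames E-G
    ("H", "Z", "shard-3"),   # everyone else
]

def shard_for_surname(surname: str) -> str:
    """Route a customer record to a shard by the surname's first letter."""
    first = surname[0].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers surname {surname!r}")

print(shard_for_surname("Fowler"))  # -> shard-2
```

Every query in the application must go through this routing step, and rebalancing (say, splitting A-D into two ranges) means editing the table above and migrating the affected data.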
auto-sharding
• Many NoSQL databases offer auto-sharding, where the database
takes on the responsibility of allocating data to shards and ensuring
that data access goes to the right shard.
Benefits of Sharding
• Sharding improves performance:
• It can improve both read and write performance.
• Using replication, particularly with caching, can greatly improve read
performance but does little for applications that have a lot of writes.
• Sharding provides a way to horizontally scale writes.
Master-Slave Replication
• Replicate data across multiple nodes.
• One node is designated as the master, or primary.
• This master is the authoritative source for the data and is usually
responsible for processing any updates to that data.
• The other nodes are slaves, or secondary.
• A replication process synchronizes the slaves with the master
• Helpful for scaling when you have a read-intensive dataset: scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves.
• read resilience:
• If the master fails, the slaves can still handle read requests.
• Again, this is useful if most of your data access is reads.
• The failure of the master does eliminate the ability to handle writes
until either the master is restored or a new master is appointed.
• However, having slaves as replicas of the master does speed up recovery after a failure of the master, since a slave can be appointed as the new master very quickly.
Master appointment
• Masters can be appointed manually or automatically.
• Manual appointing typically means that when you configure your
cluster, you configure one node as the master.
• With automatic appointment, you create a cluster of nodes and they
elect one of themselves to be the master.
• Apart from simpler configuration, automatic appointment means that
the cluster can automatically appoint a new master when a master
fails, reducing downtime.
• In order to get read resilience, we need to ensure that the read and write paths into our application are different, so that we can handle a failure in the write path and still read.
• This includes such things as putting the reads and writes through
separate database connections
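A minimal sketch of separate read and write paths (plain dictionaries stand in for real database connections, and replication itself is not simulated):

```python
# Writes always go to the master connection; reads are spread over
# slave connections, so reads keep working if the write path fails.
import random

class ReplicatedStore:
    def __init__(self, master, slaves):
        self.master = master   # all writes go through this connection
        self.slaves = slaves   # reads are routed to the slaves

    def write(self, key, value):
        self.master[key] = value          # unavailable if the master is down

    def read(self, key):
        # A failed master does not stop reads from succeeding.
        return random.choice(self.slaves).get(key)

master = {}
slaves = [master]   # stand-in: real slaves are synchronized copies of the master
store = ReplicatedStore(master, slaves)
store.write("room:101", "booked")
print(store.read("room:101"))   # -> booked
```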
Peer-to-Peer Replication
• Master-slave replication helps with read scalability but doesn’t help
with scalability of writes.
• It provides resilience against failure of a slave, but not of a master.
• The master is still a bottleneck and a single point of failure.
• These are solved using Peer-to-Peer replication
• No master present
• All the replicas have equal weight, they can all accept writes, and the
loss of any of them doesn’t prevent access to the data store
Peer-to-peer replication has all nodes applying reads and writes to all the data
• With a peer-to-peer replication cluster, you can ride over node
failures without losing access to data.
• Furthermore, you can easily add nodes to improve your performance.
• Problem with peer-to-peer replication:
• Consistency.
• When you can write to two different places, you run the risk that two people
will attempt to update the same record at the same time—a write-write
conflict.
• Inconsistencies on read lead to problems but at least they are relatively
transient.
• Inconsistent writes are forever
Combining Sharding and Replication
• Replication and sharding are strategies that can be combined.
• If we use both master-slave replication and sharding this means that
we have multiple masters, but each data item only has a single
master.
• Depending on your configuration, you may choose a node to be a
master for some data and slaves for others, or you may dedicate
nodes for master or slave duties
• Using peer-to-peer replication and sharding is a common strategy for
column-family databases.
• In a scenario like this you might have tens or hundreds of nodes in a
cluster with data sharded over them.
• A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. If a node fails, the shards on that node will be rebuilt on the other nodes.
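A sketch of what such a placement could look like; the ring-style assignment below (each shard on three consecutive nodes) is illustrative, loosely modeled on column-family stores:

```python
# Place each shard on REPLICATION_FACTOR consecutive nodes of the cluster.
REPLICATION_FACTOR = 3

def place_shards(num_shards: int, nodes: list[str]) -> dict[int, list[str]]:
    """Return a map of shard -> nodes holding a replica of that shard."""
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(REPLICATION_FACTOR)]
        for shard in range(num_shards)
    }

nodes = ["n1", "n2", "n3", "n4", "n5"]
print(place_shards(4, nodes))
# {0: ['n1', 'n2', 'n3'], 1: ['n2', 'n3', 'n4'], 2: ['n3', 'n4', 'n5'], 3: ['n4', 'n5', 'n1']}
```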
• Consistency:
• Update Consistency, Read Consistency, Relaxing Consistency, The CAP
Theorem, Relaxing Durability, Quorums.
• Database consistency is the requirement that data in a database is
valid and accurate, and that changes to the data are made in allowed
ways.
• It's an important part of database management, and it helps ensure
that data is reliable and provides value for decision-making and
business outcomes
• For a database to be consistent, data written to the database must be
valid according to all defined rules, including constraints, cascades,
triggers, or any combination.
• Consistency does not guarantee correctness of the transaction in all
ways an application programmer might expect (that is the
responsibility of application-level code).
• Instead, consistency merely guarantees that programming errors
cannot result in the violation of any defined database constraints
• The biggest challenge in cluster-based NoSQL is consistency
• Relational databases exhibit strong consistency
Update Consistency
• NoSQL databases generally prioritize availability over strong
consistency, and as a result, they have eventual consistency.
• This means that newly written data is not immediately available on
all nodes in the database, but will eventually be made available.
• This is usually within a few milliseconds
write-write conflict
• Two people updating the same data item at the same time.
• Assume ABC and XYZ both decide to update a record containing a phone number, and they initiate their updates to the database at the same time.

• When the writes reach the server, the server will serialize them—
decide to apply one, then the other.
• But how?
• The server first picks ABC's update, then XYZ's.
• ABC's update is applied and then immediately overwritten by XYZ's. In this case ABC's is a lost update.
• This is an example of a consistency failure.
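The lost update can be shown with a tiny sketch (the values are made up):

```python
# Two clients read the same record, then both write back their change.
# Whichever write the server serializes second silently wins.
db = {"phone": "555-0000"}

abc_copy = db["phone"]    # ABC reads 555-0000
xyz_copy = db["phone"]    # XYZ reads 555-0000 at (nearly) the same time

db["phone"] = "555-1111"  # server applies ABC's update first...
db["phone"] = "555-2222"  # ...then XYZ's, which overwrites it

print(db["phone"])        # 555-2222 -- ABC's update is lost
```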
Solution: pessimistic or optimistic
• A pessimistic approach works by preventing conflicts from occurring;
• An optimistic approach lets conflicts occur, but detects them and takes
action.
• For update conflicts, the most common pessimistic approach is to have
write locks, so that in order to change a value you need to acquire a
lock, and the system ensures that only one client can get a lock at a
time.
• In the previous example, ABC (say) acquires the write lock and performs the update. XYZ must wait for the lock and can then decide whether to update or not.
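A minimal sketch of the pessimistic approach, using a thread lock as a stand-in for the distributed write lock a real cluster would need:

```python
import threading

write_lock = threading.Lock()        # stand-in for a distributed lock
record = {"phone": "555-0000"}

def update_phone(new_value: str) -> None:
    with write_lock:                 # only one writer holds the lock at a time
        record["phone"] = new_value  # so no update is silently overwritten

update_phone("555-1111")  # ABC updates; XYZ would block until the lock is free
```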
Optimistic approach: conditional update
• The most common optimistic approach is a conditional update, where any client that does an update tests the value just before updating it to see if it has changed since their last read.
• In our example, ABC's update would succeed but XYZ's would fail.
• The error would let XYZ know to look at the value again and decide whether to attempt a further update.
• Another way to handle a write-write conflict is to save both updates and record that they are in conflict.
• This approach is familiar to many programmers from version control
systems, particularly distributed version control systems
• Replication makes it much more likely to run into write-write conflicts.
• If different nodes have different copies of some data which can be
independently updated, then you’ll get conflicts unless you take specific
measures to avoid them.
• Using a single node as the target for all writes for some data makes it
much easier to maintain update consistency
Read Consistency: inconsistent read, read-write conflict, or logical consistency
• Update consistency is one thing, but it doesn’t guarantee that readers
of that data store will always get consistent responses to their
requests.
• Aggregate-oriented NoSQL databases do not support transactions, but graph databases do support ACID transactions.
• Aggregate-oriented databases do support atomic updates, but only within a single aggregate, not between two aggregates.

• We can avoid running into that inconsistency if the order, the delivery
charge, and the line items are all part of a single order aggregate.
• How does the problem arise?
• Not all data can be put in the same aggregate, so any update that
affects multiple aggregates leaves open a time when clients could
perform an inconsistent read.
• The length of time an inconsistency is present is called the
inconsistency window.
Example on inconsistency
• The hotel reservation system runs on many nodes.
• Martin and Cindy are a couple considering this room; they are discussing it on the phone because Martin is in London and Cindy is in Boston. Meanwhile Pramod, who is in Mumbai, goes and books that last room.
• That updates the replicated room availability, but the update gets to
Boston quicker than it gets to London.
• When Martin and Cindy fire up their browsers to see if the room is
available, Cindy sees it booked and Martin sees it free.
• This is another inconsistent read—but it’s a breach of a different form of
consistency we call replication consistency: ensuring that the same data
item has the same value when read from different replicas
• The updates will propagate fully, and Martin will see the room is fully
booked.
• Therefore this situation is generally referred to as eventually consistent,
meaning that at any time nodes may have replication inconsistencies but, if
there are no further updates, eventually all nodes will be updated to the
same value.

• Data that is out of date is generally referred to as stale, which reminds us that a cache is another form of replication, essentially following the master-slave distribution model.
• Two different updates on the master may be performed in rapid
succession, leaving an inconsistency window of milliseconds.
• But delays in networking could mean that the same inconsistency
window lasts for much longer on a slave.
• We can usually specify the level of consistency we want with individual requests.
• This allows you to use weak consistency most of the time when it
isn’t an issue, but request strong consistency when it is.

• The presence of an inconsistency window means that different people will see different things at the same time.
• If Martin and Cindy are looking at rooms while on a transatlantic call,
it can cause confusion.
What about situations with long inconsistency windows?
• Use read-your-writes consistency, which means that once you've made an update, you're guaranteed to continue seeing that update.
• session consistency: Within a user’s session there is read-your-writes
consistency
Techniques to provide session consistency
• sticky session: a session that’s tied to one node (this is also called
session affinity).
• A sticky session allows you to ensure that as long as you keep read-
your-writes consistency on a node, you’ll get it for sessions too.
• use version stamps and ensure every interaction with the data store
includes the latest version stamp seen by a session.
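A sketch of the version-stamp technique for session consistency; the store API below (put returning a stamp, get taking a min_stamp) is hypothetical:

```python
class SingleNodeStore:
    """Stand-in for a replica that tracks a version stamp per key."""
    def __init__(self):
        self.data = {}       # key -> (value, stamp)
        self.stamp = 0

    def put(self, key, value):
        self.stamp += 1
        self.data[key] = (value, self.stamp)
        return self.stamp

    def get(self, key, min_stamp=0):
        value, stamp = self.data[key]
        if stamp < min_stamp:
            raise RuntimeError("replica too stale for this session; retry")
        return value, stamp

class Session:
    def __init__(self, store):
        self.store = store
        self.last_stamp = 0   # latest version stamp seen by this session

    def write(self, key, value):
        self.last_stamp = self.store.put(key, value)

    def read(self, key):
        # Demand data at least as fresh as this session has already seen,
        # which gives read-your-writes consistency without sticky sessions.
        value, stamp = self.store.get(key, min_stamp=self.last_stamp)
        self.last_stamp = max(self.last_stamp, stamp)
        return value

s = Session(SingleNodeStore())
s.write("room:101", "booked")
print(s.read("room:101"))   # guaranteed to reflect the write above
```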
Relaxing Consistency
• Consistency is a Good Thing—but, sadly, sometimes we have to sacrifice it.
• It is always possible to design a system to avoid inconsistencies, but
often impossible to do so without making unbearable sacrifices in
other characteristics of the system.
• In single server – consistency can be achieved through transaction
• Transaction systems usually come with the ability to relax isolation
levels, allowing queries to read data that hasn’t been committed yet.
• Most people use the read-committed transaction level, which eliminates some read-write conflicts.
• NoSQL databases relax consistency to increase availability and low latency, and
to accommodate large data volumes and flexible data models.
• This is because of the CAP theorem, which states that a distributed data store
can't simultaneously provide more than two of the following three guarantees:
• Consistency: All identical requests receive the same response
• Availability: Requests receive responses even during a partial system failure
• Partition Tolerance: Operations remain intact even when some nodes are
unavailable

• In NoSQL databases, consistency is typically eventual rather than strong. This means that replicas may temporarily show inconsistencies, but will eventually converge to the same state once updates have propagated.
The CAP Theorem
Relaxing Durability
• The key to Consistency is serializing requests by forming Atomic,
Isolated work units.

• There are cases where we can trade off some durability for higher performance.
• If a database can run mostly in memory, apply updates to its in-
memory representation, and periodically flush changes to disk, then it
may be able to provide substantially higher responsiveness to
requests.
• The cost is that, should the server crash, any updates since the last
flush will be lost.
• Ex: storing user-session state.
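A sketch of the trade-off (the flush-every-N-writes policy and file format are made up for illustration):

```python
import json

class MostlyInMemoryStore:
    """Apply updates in memory and flush to disk only periodically.
    Updates made since the last flush are lost if the server crashes."""

    def __init__(self, path: str, flush_every: int = 100):
        self.path = path
        self.data = {}
        self.unflushed = 0
        self.flush_every = flush_every

    def put(self, key, value):
        self.data[key] = value            # fast: memory only
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.flush()                  # slow: hits the disk

    def flush(self):
        with open(self.path, "w") as f:   # durability boundary
            json.dump(self.data, f)
        self.unflushed = 0
```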
Quorums
• When you’re trading off consistency or durability, it’s not an all-or-nothing proposition.
• The more nodes you involve in a request, the higher the chance of avoiding an inconsistency.
• This leads to the question: How many nodes need to be involved to
get strong consistency?
• Imagine some data replicated over three nodes.
• You don’t need all nodes to acknowledge a write to ensure strong
consistency; all you need is two of them—a majority.
• If you have conflicting writes, only one can get a majority.
• This is referred to as a write quorum and expressed as W > N/2, meaning the number of nodes participating in the write (W) must be more than half the number of nodes involved in replication (N). The number of replicas is often called the replication factor.
• Read quorum: How many nodes you need to contact to be sure you
have the most up-to-date change.
• The read quorum is a bit more complicated because it depends on
how many nodes need to confirm a write

• This relationship between the number of nodes you need to contact for a read (R), those confirming a write (W), and the replication factor (N) can be captured in an inequality: we can have a strongly consistent read if R + W > N.
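The two quorum conditions can be checked directly, as in this small sketch:

```python
def is_write_quorum(w: int, n: int) -> bool:
    """W > N/2: two conflicting writes cannot both get a majority."""
    return w > n / 2

def is_strongly_consistent_read(r: int, w: int, n: int) -> bool:
    """R + W > N: every read quorum overlaps every write quorum,
    so at least one contacted node has seen the latest write."""
    return r + w > n

n = 3                                         # replication factor
print(is_write_quorum(2, n))                  # True: 2 of 3 is a majority
print(is_strongly_consistent_read(2, 2, n))   # True: 2 + 2 > 3
print(is_strongly_consistent_read(1, 2, n))   # False: a stale read is possible
```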
Version Stamps
• Many critics of NoSQL databases focus on the lack of support for
transactions.
• Transactions are a useful tool that helps programmers support
consistency.
• Aggregate-oriented NoSQL databases do support atomic updates within an aggregate, and aggregates are designed so that their data forms a natural unit of update. Hence developers have less need for transactions.
Business and System Transactions
• The need to support update consistency without transactions is
actually a common feature of systems even when they are built on
top of transactional databases
Business transaction
• A business transaction may be something like browsing a product catalog, choosing a product at a good price, filling in credit card information, and confirming the order.
• Yet all of this usually won’t occur within the system transaction
provided by the database because this would mean locking the
database elements while the user is trying to find their credit card
and gets called off to lunch by their colleagues.
system transaction
• Usually applications only begin a system transaction at the end of the
interaction with the user, so that the locks are only held for a short
period of time.
• The problem, however, is that calculations and decisions may have
been made based on data that's changed. The price list may have updated the price of the item, or someone may have updated the customer's address, changing the shipping charges.
Version stamp
• Offline concurrency techniques are useful in NoSQL situations too.
• The Optimistic Offline Lock is a form of conditional update where a client operation rereads any information that the business transaction relies on and checks that it hasn't changed since it was originally read and displayed to the user.
• A good way of doing this is to ensure that records in the database contain a version stamp: a field that changes every time the underlying data in the record changes.
• When you read the data you keep a note of the version stamp, so that
when you write data you can check to see if the version has changed.
Conditional update: the compare-and-set (CAS) operation
• A conditional update allows you to ensure updates won't be based on stale data.
• You can do this check yourself, although you then have to ensure no
other thread can run against the resource between your read and
your update
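A minimal sketch of a version-stamp-based compare-and-set (the record layout is illustrative; a real store performs the check and the write atomically on the server):

```python
record = {"value": "old address", "version": 7}

def compare_and_set(rec: dict, expected_version: int, new_value) -> bool:
    """Apply the update only if no one else changed the record meanwhile."""
    if rec["version"] != expected_version:
        return False                 # stale stamp: caller should re-read and retry
    rec["value"] = new_value
    rec["version"] += 1              # bump the version stamp on every change
    return True

v = record["version"]                             # client reads value and stamp
print(compare_and_set(record, v, "new address"))  # True: stamp matched
print(compare_and_set(record, v, "other value"))  # False: stamp has moved on
```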
Ways to perform version stamps
• Use of a counter: increment the counter on every update. This requires the server to generate the counter value, and also needs a single master to ensure the counters aren't duplicated.

• Use of a GUID: a large random number, formed from a combination of dates, hardware information, and whatever other sources of randomness can be picked up.
• Hash of the contents of the resource.
• Timestamp of the last update.
• A combination of the above approaches.
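Sketches of these approaches (all illustrative; real stores have their own stamp formats):

```python
import hashlib, time, uuid

counter = 0
def counter_stamp():
    """Counter: needs a single master so values are never duplicated."""
    global counter
    counter += 1
    return counter

def guid_stamp():
    """GUID: unique, but two GUIDs cannot be ordered by recency."""
    return str(uuid.uuid4())

def content_hash_stamp(resource: bytes):
    """Content hash: the same contents always yield the same stamp."""
    return hashlib.sha256(resource).hexdigest()

def timestamp_stamp():
    """Last-update time: requires reasonably synchronized clocks."""
    return time.time()
```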
Version Stamps on Multiple Nodes
• In the case of master-slave replication, the version stamp is controlled by the master; the slaves follow the master's stamps.

• The simplest solution is to use a counter.
• Each time a node updates the data, it increments the counter and puts the value of
the counter into the version stamp.
• If you have blue and green slave replicas of a single master, and the blue node
answers with a version stamp of 4 and the green node with 6, you know that the
green’s answer is more recent
• In multiple-master cases, an approach used by distributed version control systems is to ensure that all nodes contain a history of version stamps.
• Timestamps can also be used, but all nodes may not be synchronized to a common clock.
Use of vector time stamp
• A vector stamp is a set of counters, one for each node.
• A vector stamp for three nodes (blue, green, black) would look something like [blue: 43, green: 54, black: 12].

• Each time a node has an internal update, it updates its own counter, so an
update in the green node would change the vector to [blue: 43, green: 55,
black: 12].
• Whenever two nodes communicate, they synchronize their vector stamps.
• Vector clocks and version vectors are specific forms of vector stamps that differ in how they synchronize.
• Using this scheme, you can tell if one version stamp is newer than another because the newer stamp will have all its counters greater than or equal to those in the older stamp.
• So [blue: 1, green: 2, black: 5] is newer than [blue:1, green: 1, black 5] since
one of its counters is greater.
• If both stamps have a counter greater than the other, e.g. [blue: 1, green: 2,
black: 5] and [blue: 2, green: 1, black: 5], then you have a write-write
conflict.
• There may be missing values in the vector, in which case we treat the missing value as 0. So [blue: 6, black: 2] would be treated as [blue: 6, green: 0, black: 2].
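A sketch of comparing two vector stamps under these rules (node names follow the text's example):

```python
def compare_vector_stamps(a: dict, b: dict) -> str:
    """Return 'newer', 'older', 'equal', or 'conflict' for a versus b.
    Missing counters are treated as 0."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"    # each is ahead somewhere: a write-write conflict
    return "newer" if a_ahead else ("older" if b_ahead else "equal")

print(compare_vector_stamps({"blue": 1, "green": 2, "black": 5},
                            {"blue": 1, "green": 1, "black": 5}))  # newer
print(compare_vector_stamps({"blue": 1, "green": 2, "black": 5},
                            {"blue": 2, "green": 1, "black": 5}))  # conflict
print(compare_vector_stamps({"blue": 6, "black": 2},
                            {"blue": 6, "green": 0, "black": 2}))  # equal
```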
