
MODULE 2
Prepared By:
Madhuri J
Assistant Professor
Department of Computer Science and Engineering
Bangalore Institute of Technology

Distribution Models
• Depending on the distribution model, the data store can give us the ability:
• To handle a large quantity of data
• To process greater read or write traffic
• To have more availability in the case of network slowdowns or breakages
• There are two paths for distribution:
• Replication
• Sharding

Distribution Model: Single Server

• It is the first and simplest distribution option.
• Even if a NoSQL database is designed to run on a cluster, it can be used in a single-server application.
• This makes sense when the NoSQL data model is well suited to the application.
• Graph databases are the obvious category here—these work best in a single-server configuration.

Sharding
• Often, a data store is busy because different people are accessing different parts of the dataset.
• In these cases we can support horizontal scalability by putting different parts of the data onto different servers (sharding).

Sharding
• Sharding can be part of the application logic.
• This complicates the programming model, as the application code needs to ensure that queries are distributed across the various shards.
• In the ideal setting, each user talks to one server and the load is balanced across the servers.

Sharding Approaches
• In order to get the ideal case, we have to guarantee that data accessed together is stored on the same node (see the sketch below).
• This is very simple when using aggregates.
• When considering data distribution across nodes:
• If access is based on physical location, we can place data close to where it is accessed.
• Keep data balanced: we should arrange aggregates so they are evenly distributed, so that each node receives an equal share of the load.
• It is useful to put aggregates together if we think they may be read in sequence (as BigTable does).
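To make the idea concrete, here is a minimal Python sketch of key-based shard routing: all data for one aggregate key hashes to the same node, so data accessed together stays together. The node names and the hash/modulo scheme are assumptions for illustration, not the mechanism of any particular datastore.

```python
import hashlib

# Minimal sketch of key-based shard routing.
# Assumptions for illustration: the node names and the hash/modulo scheme are made up.
SHARDS = ["node-a", "node-b", "node-c"]

def shard_for(aggregate_key: str) -> str:
    """Route an aggregate key to the shard responsible for it."""
    digest = hashlib.md5(aggregate_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every access to the "customer:1042" aggregate goes to the same node,
# so data that is read together is stored together.
print(shard_for("customer:1042"))
```

A real auto-sharding store would also rebalance data when nodes join or leave; the simple modulo scheme above does not.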

Sharding and NoSQL

• Many NoSQL databases offer auto-sharding.
• The database takes responsibility for allocating data to shards and ensures that data access goes to the right shard.
• This can make it much easier to use sharding in an application.
• Sharding is valuable for performance because it improves both read and write performance.
• It scales reads and writes across the different nodes of the same cluster.

Sharding and Resilience

• Sharding does little to improve resilience when used alone.
• Since the data is split across different nodes, a node failure makes that shard's data unavailable.
• So in practice, sharding alone is likely to decrease resilience.

Master-Slave Replication
• In this setting, one node is designated as the master (or primary) and the others as slaves.
• The master is the authoritative source for the data and is designed to process updates and send them to the slaves.
• The slaves are used for read operations.
• This allows us to scale a read-intensive dataset: we can scale horizontally by adding more slaves to handle more read requests (see the routing sketch below).
• The limitation is the ability of the master to process incoming updates.
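As an illustration of this split, the sketch below routes writes to the master and spreads reads across the slaves. The node addresses are placeholders; in practice the database driver or a proxy usually does this routing for you.

```python
import random

# Sketch of client-side read/write routing in a master-slave setup.
# The addresses are placeholders for illustration only.
MASTER = "db-master:27017"
SLAVES = ["db-slave-1:27017", "db-slave-2:27017"]

def route(operation: str) -> str:
    """Send writes to the master; spread reads across the slaves."""
    if operation == "write":
        return MASTER
    return random.choice(SLAVES)

print(route("write"))  # always the master
print(route("read"))   # one of the slaves, chosen at random
```

Adding more entries to SLAVES is how read capacity scales horizontally; write capacity is still bounded by the single master.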

Master-Slave Replication
• Read resilience: if the master fails, the slaves can still handle read requests.
• Writes are not allowed until the master is restored.
• To get read resilience, we need to ensure that the read and write paths are different.
• Masters can be appointed manually or automatically.
• Manual: when the cluster is configured, one node is designated as the master.
• Automatic: a cluster of nodes is created and the cluster elects a master itself.
• A slave can be promoted to master to replace a failed master.

Master-Slave Replication
• Replication in the master-slave style has the advantages analyzed above, but it comes with the problem of inconsistency.
• Readers reading from the slaves can read data that has not yet been updated.
• When the master fails, any update not yet passed to the slaves is lost.


Peer-to-Peer Replication
• Master-slave replication helps with read scalability but does not help with scalability of writes. It provides resilience for reads but not for writes: the master is still a single point of failure.
• In peer-to-peer replication all the replicas are equal; every node accepts both reads and writes.
• With peer-to-peer replication we can have node failures without losing write capability or losing data.
• We can easily add nodes to improve performance.
• When we can write to different nodes, we increase the probability of inconsistency between writes.

Combining Sharding with Replication

• Replication and sharding strategies can be combined.
• By combining master-slave replication and sharding, we can have multiple masters, but each data item has a single master.
• A node can be a master for some data and a slave for other data.
• Combining peer-to-peer replication and sharding is a common strategy for column-family datastores.
• Each piece of data is typically replicated on three nodes, i.e. the replication factor is 3.

Master-Slave Replication and Sharding (diagram)

Peer-to-Peer Replication and Sharding (diagram)

• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of the data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple places.

• A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes, while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data.
• Master-slave replication reduces the chance of update conflicts, but peer-to-peer replication avoids loading all writes onto a single point of failure.

Consistency
• A write-write conflict occurs when two clients try to write the same data at the same time.
• The server serializes the writes.
• This can lead to a lost update.
• A read-write conflict occurs when one client reads inconsistent data in the middle of another client's write.
• To get good consistency, more nodes have to be involved in data operations, but this increases latency. So there is a trade-off between consistency and latency.

• Update consistency – ensuring serial database changes
• Pessimistic approach – prevents conflicts from occurring (e.g. locking)
• Write lock
• Optimistic approach – lets conflicts happen, then detects and resolves them (e.g. validation)
• Conditional update – just before the update, check whether the value has changed since it was last read (see the sketch below)
• Write-write conflict resolution – save both records that are in conflict and merge the updates, automatically or manually; this follows from version control.
• Highly domain-specific and needs to be programmed for each particular case
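The sketch below shows the conditional-update idea in miniature: an update only succeeds if the item's version is still the one the caller read. The VersionedStore class is a made-up in-memory example, not any particular product's API.

```python
import threading

# Made-up in-memory store illustrating an optimistic conditional update.
# Each key maps to (value, version); a write succeeds only if the caller
# still holds the version it last read.
class VersionedStore:
    def __init__(self):
        self._data = {}              # key -> (value, version)
        self._lock = threading.Lock()

    def read(self, key):
        return self._data.get(key, (None, 0))

    def conditional_update(self, key, new_value, expected_version):
        with self._lock:
            _, current = self._data.get(key, (None, 0))
            if current != expected_version:
                return False         # someone else updated it since our read
            self._data[key] = (new_value, current + 1)
            return True

store = VersionedStore()
value, version = store.read("price")
print(store.conditional_update("price", 42, version))  # True: first writer wins
print(store.conditional_update("price", 99, version))  # False: stale version, retry needed
```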

Read consistency – ensuring users read the same value for data at a given time

• Logical consistency vs. replication consistency
• Logical consistency
• An inconsistent read, or read-write conflict
• To avoid a logically inconsistent read-write conflict, relational databases support the notion of transactions.
• You could avoid running into that inconsistency if the order, the delivery charge, and the line items are all part of a single order aggregate.
• Not all data can be put in the same aggregate, so any update that affects multiple aggregates leaves open a time when clients could perform an inconsistent read.
• The length of time an inconsistency is present is called the inconsistency window.

Read Consistency
• Read-write conflict: logical consistency ensures that different data items make sense together.

Replication Inconsistency (diagram)

Read Consistency
• Replication consistency ensures that the same data item has the same value when read from different replicas.
• Eventual consistency: nodes may have replication inconsistencies, but if there are no further updates, all nodes will eventually be updated to the same value.
• Techniques to provide session consistency:
• Sticky sessions (session affinity) – assign a session to a given database node for all of its work to ensure read-your-writes consistency (see the sketch below).
• Version stamps
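A minimal sketch of session affinity, assuming made-up replica names: every request carrying the same session id is routed to the same replica, so within a session you read your own writes.

```python
import hashlib

# Sketch of sticky sessions (session affinity); the replica names are
# placeholders, and real setups usually do this in a load balancer or driver.
REPLICAS = ["replica-1", "replica-2", "replica-3"]

def node_for_session(session_id: str) -> str:
    """Pin all of a session's reads and writes to one replica."""
    digest = hashlib.sha1(session_id.encode("utf-8")).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

# The same session id always maps to the same replica.
print(node_for_session("session-8f3a"))
print(node_for_session("session-8f3a"))  # identical result
```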

Relaxed consistency
• CAP Theorem – pick two of these three
• Consistency
• Availability – ability to read and write data to a node in the
cluster
• Partition tolerance – cluster can survive network
breakage that separates it into multiple isolated partitions
• If there is a network partition, need to trade off availability
of data vs. consistency
• Depending on the domain, it can be beneficial to balance
consistency with latency (performance)
• BASE – Basically Available, Soft state, Eventual
consistency

CAP Theorem
• Consistency: all clients always have the same view of the data.
• Availability: each client can always read and write.
• Partition tolerance: the system can continue to operate in the presence of a network partition.

Relaxed durability

• Durability can be relaxed: data may be kept in memory and not flushed to disk immediately, at the risk of losing it if the node fails.
• A replication durability failure occurs when a node processes an update but fails before that update is replicated to the other nodes.
• Durability can be traded off against latency, particularly if you want to survive failures with replicated data.

Quorums
• The more nodes you involve in a request, the higher the chance of avoiding an inconsistency.
• How many nodes need to be involved to get strong consistency?
• Not all the nodes need to acknowledge a write to ensure strong consistency.
• A write quorum is expressed as W > N/2: the number of nodes participating in the write (W) must be more than half the number of nodes involved in replication (N).
• How many nodes do you need to contact to be sure you have the most up-to-date change?
• The read quorum depends on how many nodes confirmed the write.
• The relationship between the number of nodes you need to contact for a read (R), the number confirming a write (W), and the replication factor (N) can be captured in an inequality: you can have a strongly consistent read if R + W > N.

• A replication factor of 3 is enough to give good resilience.
• This allows a single node to fail while still maintaining quora for reads and writes.
• The number of nodes participating in an operation can vary with the operation. When writing, we might require a quorum for some types of updates but not for others.
• If you need fast, strongly consistent reads, you could require writes to be acknowledged by all the nodes, allowing reads to contact only one (N = 3, W = 3, R = 1). That would mean your writes are slow, since they have to contact all three nodes, and you would not be able to tolerate losing a node (see the sketch below).
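The two inequalities above are easy to check mechanically; the helper below is a small illustrative sketch, not taken from any particular datastore.

```python
# Illustrative quorum checks based on the inequalities above.
def is_write_quorum(w: int, n: int) -> bool:
    # A write quorum needs more than half the replicas: W > N/2.
    return w > n / 2

def is_strongly_consistent_read(r: int, w: int, n: int) -> bool:
    # A read is strongly consistent when R + W > N.
    return r + w > n

# Replication factor 3: W = 2, R = 2 gives quorum writes and strongly
# consistent reads while tolerating the loss of one node.
print(is_write_quorum(2, 3))                  # True
print(is_strongly_consistent_read(2, 2, 3))   # True
# The N = 3, W = 3, R = 1 configuration from the slide: fast reads,
# slow writes, and no tolerance for a lost node during writes.
print(is_strongly_consistent_read(1, 3, 3))   # True
```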

Version Stamps
• Provide a means of detecting concurrency conflicts.
• Each data item has a version stamp which gets incremented each time the item is updated.
• Before updating a data item, a process can check its version stamp to see whether it has been updated since it was last read.
• Implementation methods (see the sketch below):
• Counter – requires a single master to "own" the counter
• GUID (Globally Unique Identifier) – can be computed by any node, but GUIDs are large and cannot be compared directly
• Hash of the contents of the resource
• Timestamp of last update – node clocks must be synchronized
• Vector stamp – a set of version stamps, one for each node in the distributed system
• Allows detection of conflicting updates made on different nodes
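The sketch below shows, in a few lines, what each implementation method from the list might look like; the function names are made up for this illustration.

```python
import hashlib
import time
import uuid

# Illustrative version-stamp generators for the methods listed above
# (the function names are made up for this sketch).

def counter_stamp(previous: int) -> int:
    # Counter: a single master increments it on every update.
    return previous + 1

def guid_stamp() -> str:
    # GUID: any node can generate one, but GUIDs cannot be ordered or compared.
    return str(uuid.uuid4())

def content_hash_stamp(document: bytes) -> str:
    # Content hash: the same content always yields the same stamp.
    return hashlib.sha256(document).hexdigest()

def timestamp_stamp() -> float:
    # Timestamp of last update: only reliable if node clocks are synchronized.
    return time.time()
```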

Business and System Transactions

• A business transaction may be something like browsing a product catalog, choosing a bottle of Talisker at a good price, filling in credit card information, and confirming the order.
• A business transaction is usually not run inside a single system transaction provided by the database, because that would mean locking database elements for the whole time the user is trying to find their credit card.
• A version stamp is a field that changes every time the underlying data in the record changes. When you read the data you keep a note of the version stamp, so that when you write data you can check whether the version has changed.

• A technique for updating resources with HTTP is to use etags.
• Whenever you GET a resource, the server responds with an etag in the header. This etag is an opaque string that indicates the version of the resource.
• If you then update that resource, you can use a conditional update by supplying the etag that you got from your last GET (see the sketch below).
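A minimal sketch of this flow using Python's requests library; the URL is a placeholder, and it assumes the server supports ETag and If-Match conditional requests.

```python
import requests

# Sketch of an etag-based conditional update over HTTP.
# The URL is a placeholder; the server must support ETag / If-Match.
url = "https://example.com/api/orders/42"

resp = requests.get(url)
etag = resp.headers.get("ETag")        # opaque version of the resource
order = resp.json()
order["status"] = "confirmed"

update = requests.put(url, json=order, headers={"If-Match": etag})
if update.status_code == 412:          # Precondition Failed
    print("Conflict: the resource changed since our last GET; re-read and retry")
```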

Version Stamps on Multiple Nodes

• One approach, used by distributed version control systems, is to ensure that all nodes contain a history of version stamps.
• This requires the clients to hold onto version stamp histories, or the server nodes to keep version stamp histories and include them when asked for data.
• The problem with timestamps is that it is usually difficult to ensure that all the nodes have a consistent notion of time, particularly if updates can happen rapidly. If one node's clock is out of sync, it causes problems.
• The common approach used by peer-to-peer NoSQL systems is a special form of version stamp which we call a vector stamp.

Vector stamp
• A vector stamp is a set of counters, one for each node. A vector stamp for three nodes (blue, green, black) would look something like [blue: 43, green: 54, black: 12].
• Each time a node has an internal update, it increments its own counter, so an update on the green node would change the vector to [blue: 43, green: 55, black: 12].
• Vector clock and version vector are other terms used for this.
• By using this scheme you can tell whether one version stamp is newer than another.

Vector stamp
• [blue: 1, green: 2, black: 5] is newer than [blue: 1, green: 1, black: 5], since one of its counters is greater and none is smaller (see the comparison sketch below).
• If both stamps have a counter greater than the corresponding counter in the other, e.g. [blue: 1, green: 2, black: 5] and [blue: 2, green: 1, black: 5], then you have a write-write conflict.
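The comparison rule can be written down directly; the function below is a small illustrative sketch using Python dictionaries as vector stamps.

```python
# Illustrative comparison of two vector stamps (dicts of node -> counter).
def compare(a: dict, b: dict) -> str:
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "write-write conflict"   # concurrent updates on different nodes
    if a_ahead:
        return "first is newer"
    if b_ahead:
        return "second is newer"
    return "equal"

print(compare({"blue": 1, "green": 2, "black": 5},
              {"blue": 1, "green": 1, "black": 5}))   # first is newer
print(compare({"blue": 1, "green": 2, "black": 5},
              {"blue": 2, "green": 1, "black": 5}))   # write-write conflict
```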
