
NoSQL Databases

Principles
Ecole d’Ingénierie Digitale et d’Intelligence Artificielle (EIDIA)
Cycle Préparatoire formation Ingénieur
Khawla TADIST
Année 2023-2024
Outline
Different aspects of data distribution
 Scaling
 Vertical vs. horizontal
 Distribution models
 Sharding
 Replication
 Master-slave vs. peer-to-peer architectures
 CAP properties
 Consistency, Availability and Partition tolerance
 ACID vs. BASE
Scalability

What is scalability?
 Capability of a system to handle growing amounts of data and/or queries without
losing performance,
 or its potential to be enlarged in order to accommodate such growth.
 Two general approaches
 Vertical scaling
 Horizontal scaling
Scalability - Vertical Scalability

Vertical scaling (scaling up/down)


 Adding resources to a single node in a system
 Increasing the number of CPUs,
 Extending system memory,
 Using larger disk arrays,
 …
 i.e. larger and more powerful machines are involved
Scalability - Vertical Scalability

Vertical scaling (scaling up/down)


 Traditional choice
 In favor of strong consistency
 Easy to implement and deploy
 No issues caused by data distribution
 …
 Works well in many cases but …
Scalability - Vertical Scalability Drawbacks

Performance limits
 Even the most powerful machine has a limit
 Moreover, everything works well… unless we start approaching such limits
Higher costs
 The cost of expansion grows disproportionately
 In particular, it is higher than the sum of costs of equivalent commodity
hardware
Scalability - Vertical Scalability Drawbacks

Proactive provisioning
 New projects/applications might evolve rapidly
 Upfront budget is needed when deploying new machines
 And so flexibility is seriously limited
Scalability - Vertical Scalability Drawbacks

Vendor lock-in
 There are only a few manufacturers of large machines
 Customer is made dependent on a single vendor
 Their products, services, but also implementation details, proprietary formats,
interfaces, …
 i.e. it is difficult or impossible to switch to another vendor
Deployment downtime
 Downtime is often unavoidable when scaling up
Scalability - Horizontal Scalability

Horizontal scaling (scaling out/in)


 Adding more nodes to a system
 i.e. the system is distributed across multiple nodes in a cluster
 Choice of many NoSQL systems
Scalability - Horizontal Scalability

Horizontal scaling (scaling out/in)


 Advantages
 Commodity hardware, cost effective
 Flexible deployment and maintenance
 Often outperforms vertical scaling
 …
 Unfortunately, there are also plenty of false assumptions…
Scalability - Horizontal Scalability Drawbacks
False assumptions
 Network is reliable
 Network is secure
 Latency is zero
 Bandwidth is infinite
 Topology does not change
 There is one administrator
 Transport cost is zero
Scalability - Horizontal Scalability Consequences
Significantly increases complexity
 Complexity of management,
 Programming model, …
 Introduces new issues and problems
Synchronization of nodes
 Data distribution
 Data consistency
 Recovery from failures
 …
Scalability - Horizontal Scalability

A standalone node might still be a better option in certain cases


 e.g. for graph databases
 Simply because it is difficult to split and distribute graphs
 In other words
 It can make sense to run even a NoSQL database system on a single node
 No distribution at all is the simplest, and often preferred, scenario
But in general, horizontal scaling does open new possibilities
Scalability - Horizontal Scalability Architecture
What is a cluster?
 A collection of mutually interconnected commodity nodes
 Based on the shared-nothing architecture
 Nodes do not share their CPUs, memory, hard drives,…
 Each node runs its own operating system instance
 Nodes send messages to interact with each other
 Nodes of a cluster can be heterogeneous
 Data, queries, computation, workload, …
 This is all distributed among the nodes within a cluster
Distribution Models

Generic techniques of data distribution


 Sharding
 Different data on different nodes
 Motivation: increasing volume of data, increasing performance
 Replication
 Copies of the same data on different nodes
 Motivation: increasing performance, increasing fault tolerance
Distribution Models

The two techniques are orthogonal


 i.e. we can use either of them, or combine them both
NoSQL systems often offer automatic sharding and replication
Distribution Models - Sharding

Sharding (horizontal partitioning)


 Placement of different data on different nodes
 What does different data mean? Different aggregates
– E.g. key-value pairs, documents, …
Distribution Models - Sharding

Sharding (horizontal partitioning)


 Placement of different data on different nodes
 Related pieces of data that are accessed together should also be kept together
– Specifically, operations involving data on multiple shards should be
avoided
Distribution Models - Sharding

Sharding (horizontal partitioning)


 The questions are…
 How to design aggregate structures?
 How to actually distribute these aggregates?
Distribution Models - Sharding
Sharding (horizontal partitioning)
Distribution Models - Sharding

Objectives
 Uniformly distributed data (volume of data)
 Balanced workload (read and write requests)
 Respecting physical locations
 e.g. different data centers for users around the world
 …
Unfortunately, these objectives…
 May mutually contradict each other
 May change in time
Distribution Models - Sharding
Sharding (horizontal partitioning)

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Sharding

How to actually determine shards for aggregates?


 We not only need to be able to place new data when handling write requests,
 but also to find the data in case of read requests,
 i.e. when a given search criterion is provided (e.g. key, id, …)
Distribution Models - Sharding

How to actually determine shards for aggregates?


 We must be able to determine the shard corresponding to a given key
 So that the requested data can be accessed and returned,
 Or failure can be correctly detected when the data is missing
Distribution Models - Sharding

Sharding strategies
 Based on mapping structures
 Placing data on shards in a random fashion, e.g. round-robin (not suitable)
 Based on general rules:
 Hash partitioning,
 Range partitioning
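The two rule-based strategies above can be sketched in a few lines of Python; the function names and range boundaries below are illustrative assumptions, not taken from any particular system:

```python
import hashlib

def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: a stable hash of the key selects the shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(key: str, boundaries: list) -> int:
    """Range partitioning: sorted key ranges are assigned to shards."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # last shard holds keys >= the final boundary

# The same function answers both questions from the slides: where to
# place an aggregate on a write, and where to look for it on a read.
assert hash_shard("user:42", 4) == hash_shard("user:42", 4)   # deterministic
assert range_shard("carol", ["f", "p"]) == 0    # "carol" < "f"
assert range_shard("martin", ["f", "p"]) == 1   # "f" <= "martin" < "p"
assert range_shard("trent", ["f", "p"]) == 2    # "trent" >= "p"
```

Note the trade-off: hash partitioning distributes data uniformly but scatters related keys, while range partitioning keeps related keys together but can produce unbalanced shards.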
Distribution Models - Replication

Replication
 Placement of multiple copies – replicas – of the same data on different nodes
 Replication factor = the number of copies
 Two approaches:
 Master-slave architecture
 Peer-to-peer architecture
Distribution Models - Replication - Master-Slave
Master-Slave Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Replication - Master-Slave

Architecture
 One node is primary (master), all the other secondary (slave)
 Master node bears all the management responsibility
 All the nodes contain identical data
Distribution Models - Replication - Master-Slave

Architecture
 Read requests can be handled by either the master or the slaves
 Suitable for read-intensive applications
 More read requests to deal with → more slaves to deploy
 When the master fails, read operations can still be handled
Distribution Models - Replication - Master-Slave

Write requests can only be handled by the master


Newly written data is propagated to all the slaves
Consistency issue
 Luckily enough, at most one write request is handled at a time
 But the propagation still takes some time, during which stale reads might happen
 Hence certain synchronization is required to avoid conflicts
Distribution Models - Replication - Master-Slave

In case of master failure, a new one needs to be appointed


 Manually (user-defined)
 Automatically (cluster-elected)
 Since the nodes are identical, appointment can be fast
The master might nevertheless represent a bottleneck (because of performance limits
or failures)
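The routing rules of the master-slave slides (writes only to the master, reads to any node, fast appointment of a new master) can be sketched as a toy in-memory model; the class and node names are illustrative assumptions:

```python
import random

class MasterSlaveCluster:
    """Minimal sketch: writes go to the master, reads to any replica."""
    def __init__(self, node_names):
        self.master, *self.slaves = node_names
        self.data = {name: {} for name in node_names}

    def write(self, key, value):
        self.data[self.master][key] = value   # only the master accepts writes
        for slave in self.slaves:             # then propagates to every slave
            self.data[slave][key] = value

    def read(self, key):
        node = random.choice([self.master] + self.slaves)
        return self.data[node].get(key)

    def promote(self):
        """Appoint a new master; since nodes hold identical data, this is fast."""
        self.master, *self.slaves = self.slaves

cluster = MasterSlaveCluster(["n1", "n2", "n3"])
cluster.write("k", "v")
assert cluster.read("k") == "v"
cluster.promote()             # n1 failed: n2 takes over as master
cluster.write("k", "v2")
assert cluster.read("k") == "v2"
```

In this toy model propagation is synchronous and instantaneous; in a real system it is asynchronous, which is exactly what opens the consistency issue described above.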
Distribution Models - Replication - Peer-to-Peer
Peer-to-Peer Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Replication

Architecture
 All the nodes have equal roles and responsibilities
 All the nodes contain identical data once again
Distribution Models - Replication

Both read and write requests can be handled by any node


 No bottleneck, no single point of failure
 More requests to deal with → more nodes to deploy
Distribution Models - Replication

Both read and write requests can be handled by any node


 Consistency issues
 Unfortunately, multiple write requests can be initiated independently and handled at
the same time
 Hence synchronization is required to avoid conflicts
Distribution Models - Sharding and Replication

Observations with respect to replication:


 Does the replication factor really need to correspond to the number of nodes?
 No, a replication factor of 3 is often the right choice
 Consequences
– Nodes will no longer contain identical data
– Replica placement strategy will be needed
 Sharding and replication can be combined… but how?
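One possible replica placement strategy is sketched below using rendezvous hashing (an assumption for illustration; the slides do not prescribe a concrete strategy): each aggregate is stored on exactly 3 of the N nodes, so nodes no longer contain identical data:

```python
import hashlib

def preference_list(key, nodes, replication_factor=3):
    """Replica placement sketch: rank the nodes by a hash of (key, node)
    and keep the first `replication_factor` of them, so each aggregate
    gets a stable set of 3 replicas out of the N cluster nodes."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.md5(f"{key}:{n}".encode()).hexdigest())
    return ranked[:replication_factor]

nodes = ["n1", "n2", "n3", "n4", "n5"]
replicas = preference_list("order:17", nodes)
assert len(replicas) == 3                              # replication factor of 3
assert replicas == preference_list("order:17", nodes)  # placement is stable
```

Because the ranking depends on the key, different aggregates land on different node subsets, which is how sharding and replication end up combined on the same cluster.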
Distribution Models - Sharding and Replication

Combinations of sharding and replication


 Sharding + master-slave replication
 Multiple masters, each for different data
 Roles of the nodes can overlap
– Each node can be master for some data
and/or slave for other
Distribution Models - Sharding and Replication

Combinations of sharding and replication


 Sharding + peer-to-peer replication
 Placement of anything anywhere
CAP Theorem

Assumptions
 System with sharding and replication
 Read and write operations on a single aggregate
CAP properties = properties of a distributed system
 Consistency
 Availability
 Partition tolerance
CAP Theorem

CAP theorem
 It is not possible to have a distributed system that would guarantee consistency,
availability, and partition tolerance at the same time.
 At most 2 of these 3 properties can be guaranteed at once.
 But, what do these properties actually mean?
CAP Theorem - Properties

Consistency
 Read and write operations must be executed atomically
 There must exist a total order on all operations such that each operation looks as if it
was completed at a single instant,
 i.e. as if all the operations were executed one by one on a single standalone node
CAP Theorem - Properties

Consistency
 Practical consequence: After a write operation, all readers see the same data
 Since any node can be used for handling of read requests, atomicity of write
operations means that changes must be propagated to all the replicas
CAP Theorem - Properties

Availability
 If a node is working, it must respond to user requests
 Every read or write request received by a non-failing node in the system must result in a
response
CAP Theorem - Properties

Partition tolerance
 System continues to operate even when two or more sets of nodes get isolated
 i.e. a connection failure MUST NOT shut the whole system down
CAP Theorem - Consequences

At most two properties can be guaranteed


 CA = Consistency + Availability
 CP = Consistency + Partition tolerance
 AP = Availability + Partition tolerance
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 CA = Consistency + Availability
 Traditional ACID properties are easy to achieve
 Examples: traditional relational DBMS
 Any single-node system
 However, should the network partition happen, all the nodes must be forced to stop
accepting user requests
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 CP = Consistency + Partition tolerance
 Examples: MongoDB, HBase
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 AP = Availability + Partition tolerance
 New concept of BASE properties
 Examples: Apache Cassandra, Apache CouchDB
 Other examples: web caching, DNS
CAP Theorem - Consequences

Partition tolerance is necessary in clusters


 Why?
 Because network failures cannot be prevented and are difficult to detect
 Does it mean that only purely CP and AP systems are possible?
 No…
CAP Theorem - Consequences

The real meaning of the CAP theorem:


 Partition tolerance is a MUST,
 But we can trade off consistency versus availability
 Even slightly relaxed consistency can bring a lot of availability
 Such trade-offs are not only possible, but often work very well in practice
ACID Properties
Traditional ACID properties
 Atomicity
 Partial execution of transactions is not allowed (all or nothing)
 Consistency
 Transactions bring the database from one consistent (valid) state to another
 Isolation
 Although multiple transactions may execute in parallel, each must appear to execute
in isolation, unaffected by the others
 Durability
 Effects of committed transactions must remain durable
BASE Properties

New concept of BASE properties


 Basically Available
 The system works basically all the time
 Partial failures can occur, but without total system failure
 Soft State
 The system is in flux (unstable)
 Changes occur all the time
 Eventual Consistency
 Sooner or later, the system will converge to a consistent state
ACID and BASE
ACID
 Choose consistency over availability
 Pessimistic approach
 Implemented by traditional relational databases
BASE
 Choose availability over consistency
 Optimistic approach
 Common in NoSQL databases
 Allows levels of scalability that cannot be achieved with ACID
Current trend in NoSQL:
 Strong consistency → eventual consistency
Consistency

Consistency in general…
 Consistency is the lack of contradiction in the database
 Strong consistency is achievable even in clusters, but eventual consistency might
often be sufficient
 Even when an already occupied hotel room gets double-booked, the situation can be
resolved in the real world
 …
Consistency

Write consistency (update consistency)


 Problem: write-write conflict
 Two or more write requests on the same aggregate are initiated concurrently
 Issue: lost update
 Question: Do we need to solve the problem in the first place?
Consistency

Write consistency (update consistency)


 Question: Do we need to solve the problem in the first place?
 If yes, then there are two general solutions
 Pessimistic approaches
 Preventing conflicts from occurring
 Techniques: write locks, …
 Optimistic approaches
 Conflicts may occur, but are detected and resolved later on
 Techniques: version stamps, …
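The optimistic approach with version stamps can be sketched as a compare-and-set store; this toy model is an illustration, not any particular database's API:

```python
class VersionedStore:
    """Optimistic write consistency: a version stamp detects lost updates."""
    def __init__(self):
        self.store = {}   # key -> (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

    def write(self, key, value, expected_version):
        version, _ = self.store.get(key, (0, None))
        if version != expected_version:
            return False  # write-write conflict detected; the caller must retry
        self.store[key] = (version + 1, value)
        return True

db = VersionedStore()
v, _ = db.read("cart")                      # both writers read version 0
assert db.write("cart", ["book"], v)        # first writer succeeds
assert not db.write("cart", ["pen"], v)     # concurrent writer is rejected,
                                            # so no update is silently lost
assert db.read("cart") == (1, ["book"])
```

A pessimistic approach would instead take a write lock before the read, preventing the conflict from ever occurring at the cost of reduced concurrency.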
Consistency

Read consistency (replication consistency)


 Problem: read-write conflict
 Write and read requests on the same aggregate are initiated concurrently
 Issue: inconsistent read
 When not treated, an inconsistency window will exist
 Propagation of changes to all the replicas takes some time
 Until this process is finished, inconsistent reads may happen; even the initiator of
the write request may read stale data!
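The inconsistency window can be simulated with a toy master/replica pair where propagation is deferred; the class name and the explicit sync() step are illustrative assumptions:

```python
class LaggingReplica:
    """Sketch of the inconsistency window: the master applies a write
    immediately, but the replica only catches up when sync() runs."""
    def __init__(self):
        self.master = {}
        self.replica = {}
        self.pending = []

    def write(self, key, value):
        self.master[key] = value
        self.pending.append((key, value))  # propagation is not instantaneous

    def sync(self):
        for key, value in self.pending:    # closes the inconsistency window
            self.replica[key] = value
        self.pending.clear()

db = LaggingReplica()
db.write("x", 1)
assert db.master.get("x") == 1
assert db.replica.get("x") is None   # inconsistent read inside the window
db.sync()
assert db.replica.get("x") == 1      # eventually consistent
```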
Conclusion
There is a wide range of options influencing…
 Scalability
– how well does the system scale (data and requests)?
 Availability
– when may nodes refuse to handle user requests?
 Consistency
– what level of consistency is required?
 Latency
– how complicated is it to handle user requests?
 Durability
– is committed data written reliably?
 Resilience
– can the data be recovered in case of failures?
It is good to know these properties and choose the right trade-offs
