
NoSQL Database (21CS745)

Module -2 : Introduction to NoSQL

Dr. Rama Satish K V


Associate Professor
Department of AI & ML
RNSIT, Bengaluru
Syllabus
Introduction to Distribution Models

• NoSQL databases are increasingly popular due to their capability to run on large
clusters, making them suitable for handling growing data volumes.
• Scaling Options: Instead of scaling up with larger servers, NoSQL allows for
scaling out by distributing databases across clusters of servers.
• Aggregate Orientation: The use of aggregates aligns well with scaling out, serving
as a natural unit for data distribution.
• Data Distribution Benefits: Effective distribution models can enhance data storage
capacity, improve read/write performance, and increase availability during network
issues.
• Distribution Techniques: Two main methods for data distribution are replication
(copying the same data to multiple nodes) and sharding (placing different data on
different nodes).
• Replication Types: Replication comes in two forms: master-slave and peer-to-peer.
The discussion covers single-server setups, master-slave replication, sharding, and
peer-to-peer replication.
Single-Server Database Distribution in NoSQL

• Simplicity of Single-Server Model: The simplest, and often recommended,
approach is to run the database on a single machine, which simplifies management
and development.
• Elimination of Complexities: A single-server setup avoids the complexities
associated with distributed systems, making it easier for operations teams and
application developers to work with.
• Suitability for Certain Applications: While many NoSQL databases are designed
for clusters, some data models, like graph databases, perform better in a single-
server configuration.
• Efficiency in Aggregate Processing: For applications focused on processing
aggregates, single-server document or key-value stores can be more efficient and
user-friendly for developers.
• Preference for Single-Server Approach: Despite exploring more complex
distribution schemes, the authors prefer the simplicity of a single-server model
whenever possible.
Understanding Sharding for NoSQL (1)

• Definition of Sharding: Sharding is a technique for horizontal scalability in which
different parts of a dataset are distributed across multiple servers, allowing
simultaneous access by multiple users.
Understanding Sharding for NoSQL (2)

• Ideal User-Server Interaction: In an ideal scenario, each user interacts with a
different server, leading to balanced load distribution and faster response times,
with each server handling an equal share of requests.
• Clumping Data: To approach the ideal distribution, related data should be
grouped together on the same server, which can be achieved through aggregate
orientation that combines frequently accessed data.
• Optimizing Data Placement: Placing data closer to its point of access, such as
location-based storage, can enhance performance, especially when certain
aggregates are accessed frequently.
• Maintaining Load Balance: It's important to evenly distribute aggregates across
nodes to prevent load imbalance, which can change over time based on user access
patterns.
Understanding Sharding for NoSQL (3)

• Sequential Data Access: Aggregates that are likely to be read in sequence can be
arranged to improve processing efficiency, as illustrated by the organization of
data in the Bigtable paper.
• Challenges of Manual Sharding: Historically, sharding has been managed
through application logic, complicating programming and requiring code changes
for rebalancing data across shards.
• Benefits of Auto-Sharding: Many NoSQL databases offer auto-sharding,
simplifying the distribution of data and allowing for more efficient application
development.
• Performance Enhancement: Sharding improves both read and write performance
by horizontally scaling the database, making it valuable for applications with high
write demands.
• Cautions with Sharding: While sharding can enhance performance, it can also
decrease resilience if not implemented carefully, and transitioning from a single-
server to a sharded configuration should be done proactively to avoid issues in
production.
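The routing idea behind sharding can be sketched in a few lines. This is a simplified assumption, not how any particular database implements it: real stores use consistent hashing or range partitioning, but the principle — mapping each aggregate key to one owning node — is the same. The node names and the `shard_for` helper are illustrative.

```python
import hashlib

# Hypothetical sketch: assign each aggregate to a shard by hashing its key.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(aggregate_key: str) -> str:
    """Map an aggregate key to the node that owns its shard."""
    digest = hashlib.md5(aggregate_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# All data for one customer (one aggregate) lands on the same node,
# so a single request touches a single server.
print(shard_for("customer:1234"))
print(shard_for("customer:1234") == shard_for("customer:1234"))  # True: stable
```

Because the mapping is deterministic, clients can compute the target node themselves; auto-sharding databases hide this routing behind the driver.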
Master-Slave Replication (1)

• Definition: This method involves one primary node (the master) that holds the
authoritative data, while secondary nodes (slaves) replicate that data.
Master-Slave Replication (2)

• Update Processing: The master is responsible for all data updates, and the slaves
are synchronized with it to ensure they reflect the latest data.
• Read Scalability: Master-slave replication is beneficial for read-intensive
datasets, allowing horizontal scaling by adding more slave nodes to handle
increased read requests.
• Limitations on Write Traffic: The master’s ability to process updates can become
a bottleneck in write-heavy environments, as it handles all write operations.
• Read Resilience: If the master fails, slaves can still process read requests,
providing continuity for read-heavy applications.
• Recovery Speed: In the event of a master failure, a slave can be quickly
promoted to master, facilitating faster recovery.
Master-Slave Replication (3)

• Hot Backup Functionality: The system can function like a single-server setup with a
hot backup, improving resilience without needing complex scaling.
• Master Appointment Methods: Masters can be appointed manually during
configuration or automatically through a cluster election process, which enhances uptime.
• Separate Read and Write Paths: To achieve read resilience, applications should
have distinct paths for read and write operations, which may require specialized
database connections.
• Testing for Resilience: It's essential to conduct tests to ensure that reads can occur
even when writes are disabled, verifying the system's read resilience.
• Data Consistency Challenges: A major drawback of master-slave replication is the
potential for inconsistency; different clients might read different values if updates
haven’t fully propagated to the slaves.
• Risk of Data Loss: If the master fails before updates are replicated to the slaves,
those changes can be lost, emphasizing the importance of consistency and recovery
strategies.
Peer-to-Peer Replication (1)

• Limitations of Master-Slave Replication: While master-slave replication
enhances read scalability, it does not improve write scalability and leaves a
single point of failure at the master node.
• Introduction to Peer-to-Peer Replication: Peer-to-peer (P2P) replication
eliminates the master node; all replicas have equal status, and any of them
can accept writes.
Peer-to-Peer Replication (2)

• Node Failure Resilience: In a P2P setup, the loss of any single node does not disrupt
access to the data store, enhancing overall data availability.
• Performance Enhancement: Adding more nodes in a P2P cluster can improve
performance without the bottleneck of a single master.
• Consistency Challenges: A major complication with P2P replication is maintaining
consistency, especially during simultaneous writes to the same record, leading to
write-write conflicts.
• Impact of Inconsistent Writes: While read inconsistencies are typically transient,
write inconsistencies can have permanent effects, complicating data integrity.
• Coordinating Writes for Consistency: One approach to handle write inconsistencies
involves coordinating writes across replicas, requiring majority agreement for a
valid update.
Peer-to-Peer Replication (3)

• Network Traffic Trade-off: Ensuring coordinated writes can increase network
traffic, which might impact performance.
• Coping with Inconsistent Writes: In some scenarios, it may be acceptable to allow
inconsistent writes and develop policies to merge them later, maximizing write
performance.
• Performance vs. Consistency: Choosing between strict coordination for consistency
or accepting inconsistencies involves balancing performance benefits against the
risks of data integrity issues.
• Operational Flexibility: P2P replication offers flexibility in managing nodes and
handling failures, making it a robust alternative to traditional master-slave
configurations.
• Future Considerations: Strategies for dealing with inconsistencies will be explored
further, highlighting the importance of developing effective conflict resolution
policies in P2P systems.
Combining Sharding and Replication (1)

• Combination of Strategies: Replication and sharding can be effectively
combined to enhance data management in databases.
• Multiple Masters: Using both master-slave replication and sharding allows for
multiple master nodes, improving scalability and performance.
• Single Master for Data Items: Each data item is assigned a single master,
ensuring clear ownership and update responsibility.
• Flexible Node Roles: Nodes can be configured flexibly, serving as masters for
certain data while acting as slaves for others.
• Dedicated Roles: Alternatively, nodes can be designated specifically for master
or slave roles, optimizing resource allocation and management.
Figure: Using master-slave replication together with sharding
Peer-to-Peer Replication Together with Sharding

• Common Strategy: Peer-to-peer replication combined with sharding is
frequently used in column-family databases for enhanced data management.
• Cluster Size: This approach can involve tens or even hundreds of nodes within a
cluster, allowing for scalable data distribution.
• Replication Factor: A recommended starting point for peer-to-peer replication is
a replication factor of 3, ensuring that each shard is stored on three different
nodes.
• Node Failure Resilience: If a node fails, the shards that were on that node can be
rebuilt using data from the other nodes in the cluster.
• Data Availability: This replication strategy enhances data availability and fault
tolerance, ensuring that data remains accessible even during node failures.
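The placement rule described above — each shard stored on three nodes — can be sketched as follows. The node-selection scheme is a deliberate simplification (taking consecutive nodes); real column-family stores walk a consistent-hashing ring, and the node names are illustrative.

```python
# Hypothetical sketch: placing each shard on 3 of the cluster's nodes
# (replication factor 3). Real systems use a consistent-hashing ring;
# here we simply take 3 consecutive nodes for clarity.
NODES = ["n0", "n1", "n2", "n3", "n4", "n5"]
REPLICATION_FACTOR = 3

def replicas_for(shard_id: int) -> list[str]:
    """Return the nodes holding copies of this shard."""
    start = shard_id % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(replicas_for(4))  # ['n4', 'n5', 'n0']
# If n4 fails, shard 4 can be rebuilt from the copies on n5 and n0.
```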
Figure 2.5. Using peer-to-peer replication together with sharding
Next Class

• Module -2 : Chapter 5
Consistency

1. Transitioning from relational databases to NoSQL changes the approach to
consistency.
2. Relational databases prioritize strong
consistency, while NoSQL introduces concepts
like the CAP theorem and eventual consistency.
3. Understanding consistency is crucial when
building NoSQL systems, as it impacts design
choices.
4. Consistency encompasses various forms, each
potentially leading to different types of errors.
5. The discussion will include reasons for
relaxing consistency and durability in NoSQL
databases.
CAP Theorem

• The CAP theorem, proposed by Eric Brewer, states that in a distributed data
store it is impossible to simultaneously achieve all three of the following goals:
• 1. Consistency: All nodes see the same data at
the same time, ensuring that every read
receives the most recent write.
• 2. Availability: Every request receives a
response, regardless of whether it was
successful or not, ensuring that the system is
always operational.
• 3. Partition Tolerance: The system continues to
operate despite network partitions or
communication failures between nodes.
According to the theorem, a distributed system
can only guarantee two of these three properties
at any given time. This means that when
designing a system, trade-offs must be made
based on the specific needs of the application.
Update Consistency (1)

• 1. Concurrency Approaches: Consistency maintenance is categorized as
pessimistic or optimistic.
• 2. Pessimistic Approach: Prevents conflicts by using mechanisms like write
locks, allowing only one client to change a value at a time.
• 3. Example of Pessimistic: In a scenario, only the first client (Martin) acquires the
lock and can update, while the second (Pramod) must wait.
• 4. Optimistic Approach: Allows conflicts to occur but detects and resolves them
after the fact, often through conditional updates.
• 5. Example of Optimistic: Martin’s update succeeds while Pramod’s fails,
prompting him to check the value again before retrying.
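The pessimistic approach can be sketched with an ordinary write lock. This is a minimal in-process illustration, not a database implementation; the client names follow the slide's example, and the value being updated is a hypothetical phone number.

```python
import threading

# Sketch of pessimistic concurrency: a write lock ensures only one
# client at a time can change the value, so updates are serialized.
phone_number = "555-0100"
write_lock = threading.Lock()

def update_phone(client: str, new_value: str) -> None:
    global phone_number
    with write_lock:  # the second client blocks here until the lock is free
        print(f"{client} holds the lock, writing {new_value}")
        phone_number = new_value

t1 = threading.Thread(target=update_phone, args=("Martin", "555-0101"))
t2 = threading.Thread(target=update_phone, args=("Pramod", "555-0102"))
t1.start(); t2.start()
t1.join(); t2.join()
# Updates are serialized: the final value is whichever write ran last,
# but no write is ever lost mid-flight to interleaving.
```

The optimistic alternative skips the lock and instead checks, at write time, that the value has not changed since it was read — the conditional-update pattern discussed later under version stamps.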
Update Consistency (2)

• 6. Serialization of Updates: Both approaches depend on consistent serialization
of updates across systems.
• 7. Multiple Servers: With multiple servers, different update orders can lead to
discrepancies in data (e.g., different phone numbers).
• 8. Sequential Consistency: A common requirement in distributed systems
ensuring all nodes apply operations in the same order.
• 9. Handling Write-Write Conflicts: An alternative optimistic method saves both
updates and marks them as conflicting.
• 10. Version Control Analogy: This conflict resolution is similar to processes in
distributed version control systems.
Update Consistency (3)

11. Merging Updates: Conflicting updates may require user intervention to merge,
or automatic handling based on specific rules.

12. Tradeoffs in Concurrency: Pessimistic concurrency may be preferred to avoid
conflicts but can lead to reduced responsiveness.

13. Safety vs. Liveness: There is a fundamental tradeoff between avoiding errors
(safety) and maintaining quick responses (liveness).

14. Deadlocks: Pessimistic approaches can lead to deadlocks, which are challenging
to prevent and debug.

15. Replication Challenges: Increased replication leads to more write-write
conflicts unless measures are taken to ensure consistency.
Read Consistency (1)

1. Update Consistency vs. Read Consistency: A data store can maintain update
consistency, but readers may not always receive consistent data.
2. Logical Consistency:
- Ensures related data items (e.g., order line items and shipping charges) are
consistent.
- Transactions in relational databases prevent read-write conflicts.
3. NoSQL Transactions:
- Claims that NoSQL databases lack transaction support are misleading.
- Aggregate-oriented NoSQL databases allow atomic updates within single
aggregates, not across multiple aggregates.
4. Inconsistency Window:
- Time during which inconsistent reads can occur when updates affect multiple
aggregates.
- Example: Amazon’s SimpleDB has a short inconsistency window (usually <1 second).
A read-write conflict in logical consistency
Read Consistency (2)

5. Replication Consistency:
- Different replicas may return inconsistent data (e.g., hotel room booking).
- Eventually consistent: all nodes will update to the same value eventually.
6. Impact of Replication on Consistency: Replication can extend logical
inconsistency windows, especially if updates happen rapidly.
7. Configurable Consistency Levels: Applications can specify the desired
consistency level per request (weak or strong).
8. User Experience and Inconsistency: Inconsistencies can confuse users, especially
during simultaneous actions (e.g., booking hotel rooms).
9. Read-Your-Writes Consistency: Guarantees users will see their updates after
they write, often implemented via session consistency.
Replication Consistency Example
Read Consistency (3)

10. Session Consistency Techniques:
- Sticky Sessions: Ties user sessions to one node for consistency.
- Version Stamps: Ensures data interaction includes the latest version for
consistency.
11. Handling Write Operations:
- Writes can be sent to slaves which forward them to the master while
maintaining session consistency.
- Temporary switches to the master may be necessary during writes.
12. Application Design Considerations:
- Transactions should not remain open during user interactions to avoid conflicts.

Relaxing Consistency

1. Trade-offs in System Design: Consistency is valuable, but achieving it often
requires sacrificing other system characteristics, necessitating trade-offs.

2. Domain-Specific Tolerances: Different domains have varying tolerances for
inconsistency, which must be considered during system design.

3. Transactions and Isolation Levels: Transactions provide strong consistency
guarantees, but most applications relax isolation levels (e.g., using read-committed)
to enhance performance.

4. Forgoing Transactions for Performance: Many systems avoid transactions due to
their performance costs, as seen with MySQL’s early popularity and eBay’s
architecture choices.
5. Interacting with Remote Systems: In enterprise applications, updates often occur
outside of transaction boundaries due to interactions with remote systems that
cannot be included in transactions.
The CAP Theorem

1. The CAP theorem states that in distributed systems you can only achieve two
out of three properties: Consistency, Availability, and Partition tolerance, leading to
necessary trade-offs.
2. Definitions:
- Consistency: All nodes see the same data at the same time.
- Availability: If a node is reachable, it responds to requests.
- Partition Tolerance: The system continues to operate despite communication
breakdowns.
3. Single-Server vs. Cluster Systems: Single-server systems are naturally CA
(Consistency and Availability) but cannot tolerate partitions. In contrast, cluster
systems must often prioritize Partition tolerance, leading to compromises on
Consistency.
4. Practical Trade-offs: Systems may allow inconsistent writes to enhance
availability, such as in hotel bookings or shopping carts, where some level of
overbooking or merging data may be acceptable.
Partition Tolerance

5. BASE vs. ACID: NoSQL systems are often described as following BASE
(Basically Available, Soft state, Eventual consistency), but this is seen as a
spectrum rather than a strict alternative to the ACID properties of relational
databases. The focus should be on the trade-off between consistency and latency
rather than just availability.
Relaxing Durability

1. ACID Properties and Consistency: Consistency in databases involves serializing
requests into atomic, isolated work units, which is central to ACID properties.
2. Durability Trade-offs: While durability is crucial for data integrity, there are
scenarios where sacrificing some durability can enhance performance, such as
running databases primarily in memory and periodically flushing to disk.
3. Use Cases for Nondurable Writes: User-session state management is a prime
example, where losing session data is less critical than maintaining a responsive
user experience. Durability needs can often be specified on a call-by-call basis.
4. Telemetry Data Collection: For applications like telemetric data capture,
prioritizing speed over durability may be acceptable, accepting the risk of losing the
most recent updates during a server crash.
5. Replication Durability Challenges: Issues arise when a master node fails before
updates are replicated, leading to potential data loss and conflicts upon recovery.
To enhance durability, the master can wait for replicas to acknowledge updates,
though this may slow down processing and impact availability.
Quorums

1. Partial Trade-offs: Consistency and durability can be adjusted; involving more
nodes in a request increases the chance of achieving consistency.
2. Write Quorum: To ensure strong consistency, a majority of the replicas (more
than half) must acknowledge a write. This is expressed as W > N/2, where W is the
number of nodes confirming a write and N is the number of nodes the data is
replicated to (the replication factor).
3. Read Quorum: The number of nodes needed to confirm reads (R) depends on the
write quorum (W). For strong consistency, the relationship R + W > N must hold,
allowing detection of conflicts during reads.
4. Replication Factor: A replication factor of 3 is commonly sufficient for resilience,
allowing one node to fail while maintaining the ability to achieve quorums for
reads and writes.
5. Flexible Strategy: Operations can vary in their quorum requirements based on
consistency and availability needs. This flexibility enables choosing optimal
configurations based on specific use cases, demonstrating that the relationship
between consistency and availability is more nuanced than a simple trade-off.
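The quorum inequalities above can be checked mechanically. This is a minimal sketch of the two conditions from the slide, with `n` standing for the replication factor; the function names are illustrative.

```python
# Quorum conditions from the slide:
#   write quorum:  W > N/2   (a majority of replicas confirm each write)
#   strong reads:  R + W > N (read and write sets must overlap)

def is_write_quorum(w: int, n: int) -> bool:
    return w > n / 2

def is_strongly_consistent_read(r: int, w: int, n: int) -> bool:
    return r + w > n

N = 3  # replication factor of 3, the common starting point
print(is_write_quorum(2, N))                 # True: 2 of 3 is a majority
print(is_strongly_consistent_read(2, 2, N))  # True: 2 + 2 > 3
print(is_strongly_consistent_read(1, 2, N))  # False: a read could miss the write
```

With N = 3, the usual choices are W = 2, R = 2 for strong consistency, or W = 1 / R = 1 when availability and latency matter more than consistency.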
Next Class

• Module -2 : Chapter 3
Version Stamps

• NoSQL databases are often criticized for lack of transaction support.
• Transactions help programmers maintain consistency.
• NoSQL proponents argue that aggregate-oriented NoSQL databases support
atomic updates within an aggregate.
• Aggregates are designed to form a natural unit of update.
• Transactional needs should be considered when choosing a database.
• Transactions have limitations, even in transactional systems.
• Some updates require human intervention and can't be run within transactions.
• Long-running transactions are problematic.
• Version stamps can be used to cope with updates requiring human intervention.
• Version stamps are useful in other situations, especially in distributed systems.
• The single-server distribution model is becoming less common.
Business and System Transactions

Business Transactions and System Transactions
Challenges with Data Consistency
Optimistic Offline Lock
Version Stamps
Conditional Updates
Additional Uses of Version Stamps

Business Transactions and System Transactions

• Business transactions (e.g., browsing a product, filling in credit card info,
confirming an order) usually do not occur within a single system transaction
provided by the database.
• System transactions are typically begun
at the end of the user interaction to
avoid long lock periods.
Challenges with Data Consistency

• Calculations and decisions may be based on data that has changed during the
business transaction (e.g., an updated price list, a changed customer address).
• This requires techniques to handle
offline concurrency and ensure data
consistency.
Optimistic Offline Lock

• A form of conditional update where the client rereads information relied on by
the business transaction and checks whether it has changed since it was
originally read.

• Uses version stamps: a field that changes every time the underlying data in the
record changes.
Version Stamps

- Ensure records in the database contain a version stamp.
- When reading data, note the version stamp to check for changes during updates.
- Techniques for creating version stamps include:
• Counters: Increment on update; easy to compare recentness; require a single
master.
• GUIDs: Large random numbers; globally unique; can’t compare recentness.
• Content Hashes: Deterministic; globally unique; can’t compare recentness.
• Timestamps: Short; can compare recentness; require synchronized clocks;
potential duplicates.
• Composite Stamps: Combining multiple schemes (e.g., CouchDB uses a
combination of counter and content hash) to blend advantages.
Conditional Updates

• Use version stamps to perform conditional updates, ensuring updates are not
based on stale data.
• Similar mechanisms are used in HTTP with etags for resource updates.
• Compare-and-set (CAS) operations can also be used, comparing a version stamp
before setting the new value.
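A conditional update with a counter version stamp can be sketched as follows. The in-memory dictionary is a stand-in for a database record; the record name, field layout, and `conditional_update` helper are illustrative assumptions, not any particular store's API.

```python
# Sketch of a conditional update (compare-and-set) using a counter
# version stamp: the write succeeds only if the stamp is unchanged.
store = {"price": {"value": 100, "version": 7}}

def conditional_update(key: str, new_value, expected_version: int) -> bool:
    """Apply the update only if the version stamp has not changed."""
    record = store[key]
    if record["version"] != expected_version:
        return False          # someone else updated first; caller must reread
    record["value"] = new_value
    record["version"] += 1    # the stamp changes with the data
    return True

v = store["price"]["version"]                # read the value, note the stamp
print(conditional_update("price", 120, v))   # True: first writer wins
print(conditional_update("price", 130, v))   # False: stale stamp, must reread
```

This is the same shape as an HTTP update guarded by an etag, or a CAS instruction: compare the stamp, then set the value, as a single atomic step.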
Additional Uses of Version Stamps

• Helpful in avoiding update conflicts and providing session consistency.
• Useful in peer-to-peer replication scenarios to spot conflicts if multiple peers
update simultaneously.
Version Stamps on Multiple Nodes

Challenges:
• No single authoritative source for version stamps.
• Multiple nodes may provide different answers.
• Difficulty in determining the latest version.
• Potential for inconsistent updates.

Simple Approaches:
• Counters: Increment on update; works in master-slave scenarios.
• Timestamps: Difficult to maintain consistent time across nodes; prone to issues
with clock synchronization.
Advanced Approaches for Peer-to-Peer

 Version Histories:
 Requires each node to track version stamp history.
 Clients or server nodes must store and share histories.
 Can detect inconsistencies by checking if one history is an ancestor of another.
 Not commonly used in NoSQL databases.
 Vector Stamps:
 A set of counters, one for each node.
 Each node updates its own counter on internal updates.
 Nodes synchronize their vector stamps during communication.
 Allows for determining newer versions: all counters in the newer stamp are greater than or equal to
those in the older stamp.
 Detects write-write conflicts when both stamps have counters greater than the other.
 Missing values are treated as 0, allowing easy addition of new nodes.
 Helps spot inconsistencies but does not resolve them.
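The comparison rule for vector stamps described above can be sketched directly. Each stamp maps a node name to its counter; missing entries count as 0, which is what lets new nodes join without rewriting old stamps. The node names and the `compare` helper are illustrative.

```python
# Sketch of vector-stamp comparison between two stamps a and b.
def compare(a: dict, b: dict) -> str:
    """Return 'newer', 'older', 'equal', or 'conflict' for stamp a vs. b."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "newer"      # every counter in a >= the one in b
    if b_ge:
        return "older"
    return "conflict"        # write-write conflict: each stamp is ahead somewhere

print(compare({"blue": 4, "green": 2}, {"blue": 3, "green": 2}))  # newer
print(compare({"blue": 1, "green": 2}, {"blue": 2, "green": 1}))  # conflict
print(compare({"blue": 1}, {"blue": 1, "green": 1}))              # older
```

Note that `compare` only detects the conflict; resolving it (merging, last-write-wins, or user intervention) is a separate policy decision, as the slide points out.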
Next Class

• Module 1 and Module -2
