Explain The Update Consistency - Update (Write-Write Conflict), Read (Read-Write Conflict) With An Example and A Neat Diagram
Update (Write-Write Conflict): This happens when two users update the same data item at the same time.
Example: Martin and Pramod both update a contact number on a company website. Martin changes
it to "123-456-7890," while Pramod updates it to "987-654-3210." If the server processes Martin's
update first and Pramod's second, the final stored value is "987-654-3210," and Martin's update is lost.
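The lost update can be reproduced with a small in-memory sketch. The `ContactStore` class, its method names, and the starting value are hypothetical and used only for illustration; the store simply applies writes in arrival order, so the last write wins.

```python
class ContactStore:
    """Hypothetical in-memory store that applies writes in arrival order."""

    def __init__(self, phone):
        self.phone = phone
        self.history = []  # every (user, value) pair that was written

    def write(self, user, phone):
        # No conflict check: the newest write silently overwrites the previous one.
        self.history.append((user, phone))
        self.phone = phone


store = ContactStore(phone="000-000-0000")  # placeholder starting value (assumption)
store.write("Martin", "123-456-7890")  # processed first
store.write("Pramod", "987-654-3210")  # processed second, overwrites Martin's value

print(store.phone)  # 987-654-3210
```

Martin's value survives only in `history`; nothing tells either user that an update was overwritten.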
Read (Read-Write Conflict): This happens when one user reads data while another user is updating it.
For instance, if Martin reads the phone number while Pramod's update is still in progress, he may not see the
latest value, so different users can disagree about what the current data is.
Example: If Martin reads the phone number while Pramod is changing it to "987-654-3210," he may still see
the old number "123-456-7890." Decisions based on that stale value can be wrong.
2. Define Quorums and explain read and write quorum with examples
A quorum is a subset of nodes in a distributed system that must agree on a read or write
operation to ensure strong consistency.
The concept is crucial when dealing with replicated data across multiple nodes, as it helps
avoid inconsistencies that can arise from concurrent operations.
Write Quorum
A write quorum is the minimum number of nodes that must acknowledge a write
operation for it to be considered successful.
For example, if data is replicated across three nodes (N = 3), a write quorum (W) of 2
means that at least two nodes must confirm the write. This can be expressed as W > N/2,
ensuring that a majority of nodes have the latest data.
If two nodes acknowledge the write while one does not, the system can still maintain
consistency, as the majority has agreed on the new value.
Read Quorum
A read quorum is the minimum number of nodes that must be contacted to ensure that
the most recent write is read.
Continuing with the previous example, if the write quorum is W = 2, then a read quorum
(R) of 2 is also required to guarantee that the latest data is retrieved. This can be expressed
as R + W > N.
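If a read operation contacts only one node (R = 1), then R + W = 3 is not greater than N, so the read
may hit the one node that missed the write and return stale data. If it contacts two nodes, at least
one of them holds the latest write, so the read can return the most up-to-date value by taking the newest version.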
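The two inequalities W > N/2 and R + W > N can be checked directly for any configuration. The function below is a minimal sketch; the name `quorum_properties` and the dictionary keys are illustrative choices, not part of any particular database's API.

```python
def quorum_properties(n, w, r):
    """Report which quorum inequalities hold for n replicas."""
    return {
        # W > N/2: two conflicting writes cannot both reach a majority.
        "majority_write": w > n / 2,
        # R + W > N: every read set overlaps every write set in at least one node.
        "read_sees_latest": r + w > n,
    }


print(quorum_properties(n=3, w=2, r=2))
# {'majority_write': True, 'read_sees_latest': True}
print(quorum_properties(n=3, w=2, r=1))
# {'majority_write': True, 'read_sees_latest': False}
```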
Example Scenario
Consider a system with three nodes (A, B, and C) where:
• A write operation is performed, and nodes A and B acknowledge the write (W = 2).
• For a subsequent read operation, if nodes A and C are contacted (R = 2), the read
returns the latest data. Even if C is stale, A holds the new value, because any two of the
three nodes must include at least one that acknowledged the write.
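The scenario with nodes A, B, and C can be simulated in a few lines. This is a sketch under stated assumptions: each replica is assumed to store a version number next to its value, the phone numbers are placeholder values, and the reader resolves disagreement by taking the highest version.

```python
from itertools import combinations

# Each replica stores (version, value). Nodes A and B acknowledged the write;
# node C missed it and still holds the older version.
replicas = {
    "A": (2, "987-654-3210"),
    "B": (2, "987-654-3210"),
    "C": (1, "123-456-7890"),
}


def quorum_read(nodes):
    """Read from the given nodes and return the value with the highest version."""
    newest = max((replicas[name] for name in nodes), key=lambda pair: pair[0])
    return newest[1]


# Every possible read set of size R = 2 overlaps the write set {A, B}.
for nodes in combinations(replicas, 2):
    print(nodes, quorum_read(nodes))  # always 987-654-3210

# With R = 1 the read can land on C alone and return the stale value.
print(quorum_read(["C"]))  # 123-456-7890
```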
3. Define Version Stamps. List and explain the approaches through which version stamps
can be constructed for single source models.
Version stamps are mechanisms used to track changes in data records, ensuring that updates
are based on the most current information. They help prevent conflicts in multi-user
environments by indicating the version of a record at any given time. When a record is
updated, its version stamp changes, allowing systems to verify whether the data being
modified is up-to-date.
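One way to picture this check is a conditional update: the writer passes the version stamp it read, and the write is rejected if the record has changed since. The `Record` class, `StaleVersionError`, and the phone numbers below are hypothetical, chosen only to illustrate the idea.

```python
class StaleVersionError(Exception):
    """Raised when an update is based on an out-of-date version stamp."""


class Record:
    def __init__(self, value):
        self.value = value
        self.version = 1  # counter-style version stamp

    def read(self):
        return self.value, self.version

    def update(self, new_value, expected_version):
        # Reject the write if someone else changed the record since it was read.
        if expected_version != self.version:
            raise StaleVersionError(
                f"expected version {expected_version}, found {self.version}"
            )
        self.value = new_value
        self.version += 1


record = Record("123-456-7890")
_, seen_by_martin = record.read()
_, seen_by_pramod = record.read()

record.update("987-654-3210", expected_version=seen_by_pramod)  # succeeds

try:
    record.update("555-000-1111", expected_version=seen_by_martin)
except StaleVersionError as err:
    print("rejected:", err)  # rejected: expected version 1, found 2
```

The second writer has to re-read the record and decide how to proceed, instead of silently overwriting the first update.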
Approaches to Construct Version Stamps for Single Source Models
1. Counter-Based Version Stamps:
• Each time a record is updated, a counter is incremented.
• This approach is straightforward and allows easy comparison of versions; a
higher counter indicates a more recent update.
• However, it requires a single authoritative source to manage the counter to
avoid duplication [1].
2. GUID (Globally Unique Identifier):
• A GUID is a large random number that is, for practical purposes, unique across different systems.
• It can be generated by any node without coordination, with a negligible risk of duplication.
• The downside is that GUIDs are large and cannot be directly compared for
recency, making it difficult to determine which version is newer [1].
3. Content Hashing:
• This method involves creating a hash of the contents of the resource.
• With a sufficiently large hash key size, a content hash is globally unique for
practical purposes and can be generated by anyone.
• While deterministic (the same content will always produce the same hash), it
cannot be directly compared for recency [1].
4. Timestamp-Based Version Stamps:
• This approach uses the timestamp of the last update to indicate the version.
• Timestamps are relatively short and can be directly compared to determine
which version is more recent.
• However, it requires synchronized clocks across multiple machines to avoid
issues with data corruption due to clock discrepancies [1].
5. Composite Version Stamps:
• A combination of the above methods can be used to create a composite version
stamp.
• For example, using both a counter and a content hash can help in identifying
conflicts while still allowing versions to be compared for recency.
• This method is particularly useful in systems that require high availability and
consistency, such as peer-to-peer replication systems.
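The approaches listed above can be sketched with the Python standard library. The variable names, the sample content, and the `compare` helper are illustrative assumptions; the composite stamp here is a (counter, content hash) pair, which is one possible combination.

```python
import hashlib
import time
import uuid

content = b"phone=987-654-3210"  # sample record content (assumption)

counter_stamp = 42                                # incremented by one authoritative node
guid_stamp = str(uuid.uuid4())                    # random, no ordering between stamps
hash_stamp = hashlib.sha256(content).hexdigest()  # same content gives the same stamp
time_stamp = time.time()                          # comparable only if clocks agree

# Composite stamp: the counter gives recency, the hash detects differing content.
composite_a = (3, hashlib.sha256(b"phone=123-456-7890").hexdigest())
composite_b = (3, hashlib.sha256(b"phone=987-654-3210").hexdigest())


def compare(a, b):
    """Classify two (counter, content_hash) stamps."""
    if a == b:
        return "same version"
    if a[0] == b[0]:
        return "conflict"  # same counter but different content
    return "a is newer" if a[0] > b[0] else "b is newer"


print(compare(composite_a, composite_b))  # conflict
```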
Partitioning Stage
Definition:
Partitioning is the process of dividing the output of the map function into different
segments or partitions. Each partition contains key-value pairs that will be sent to a specific
reducer. The goal is to ensure that all data for the same key is grouped together in one
partition so it can be processed by a single reducer.
Example:
Consider a scenario where we have the following key-value pairs emitted by the map
function:
(Product A, 2)
(Product B, 1)
(Product A, 3)
(Product C, 4)
If we have two reducers, the partitioning might look like this:
Reducer 1: (Product A, 2), (Product A, 3)
Reducer 2: (Product B, 1), (Product C, 4)
Here, all entries for Product A are sent to Reducer 1, while Products B and C go to Reducer
2. This allows each reducer to work on its own set of keys in parallel, improving processing
speed.
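A common way to assign keys to reducers is to hash the key and take the result modulo the number of reducers. The sketch below assumes that scheme; the function names are hypothetical, and the actual assignment it produces may differ from the illustrative split shown above. What it guarantees is that every pair with the same key lands in the same partition.

```python
import zlib
from collections import defaultdict

map_output = [
    ("Product A", 2),
    ("Product B", 1),
    ("Product A", 3),
    ("Product C", 4),
]


def partition_for(key, num_reducers):
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers):
    """Group key-value pairs by the reducer that owns each key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return dict(partitions)


for reducer, pairs in sorted(partition(map_output, 2).items()):
    print(reducer, pairs)
```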
Combining Stage
Definition:
The combining stage is an optional step that occurs before the data is sent to the reducers.
A combiner function can be used to combine all values for the same key within each
partition. This helps reduce the amount of data that needs to be transferred across the
network, making the process more efficient.
Example:
Using the same key-value pairs from the previous example, if we apply a combiner function
that sums the quantities for each product, the output before sending to the reducers
might look like this:
(Product A, 5) // Combined from (Product A, 2) and (Product A, 3)
(Product B, 1)
(Product C, 4)
This means that instead of sending multiple entries for Product A to the reducer, we only
send a single entry with the total quantity. This reduces the amount of data transferred
and speeds up the overall process.
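The summing combiner from this example can be written as a short function. This is a minimal sketch with a hypothetical `combine` name; it assumes all four pairs were emitted on the same node, and it works as a combiner because addition gives the same total regardless of how the values are grouped.

```python
from collections import defaultdict


def combine(pairs):
    """Sum the values for each key in one node's map output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = [
    ("Product A", 2),
    ("Product B", 1),
    ("Product A", 3),
    ("Product C", 4),
]

print(combine(mapper_output))
# [('Product A', 5), ('Product B', 1), ('Product C', 4)]
```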
Key-value stores are a type of NoSQL database that uses a simple data model to store data as a
collection of key-value pairs. Each key is unique and acts as an identifier for the associated value, which
can be a simple data type or a more complex data structure. This model is akin to a hash table, where
the key is the index, and the value is the data being stored.
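The hash-table analogy can be made concrete with a dictionary-backed sketch. The `KeyValueStore` class and its `put`, `get`, and `delete` methods are hypothetical and do not correspond to the API of any specific product; the key format and sample value are also assumptions.

```python
class KeyValueStore:
    """Hypothetical minimal key-value store backed by a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # the store does not inspect the value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)  # deleting a missing key is a no-op


kv = KeyValueStore()
kv.put("user:42", {"name": "Martin", "phone": "123-456-7890"})
print(kv.get("user:42")["phone"])  # 123-456-7890
kv.delete("user:42")
print(kv.get("user:42"))  # None
```

All access goes through the key; the store offers no way to query by a field inside the value.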
Popular Key-Value Databases
• Redis: An in-memory data structure store, often used as a database, cache, and message
broker. It supports various data structures such as strings, hashes, lists, sets, and more.
• Amazon DynamoDB: A fully managed NoSQL database service that provides fast and
predictable performance with seamless scalability. It is designed for high availability and
durability.
• Riak: A distributed NoSQL database that offers high availability, fault tolerance, and scalability.
It is designed to handle large amounts of data across many servers.
• Cassandra: While primarily a wide-column store, it can also function as a key-value store. It is
known for its high availability and scalability, making it suitable for handling large datasets
across multiple nodes.
• Berkeley DB: A high-performance embedded database that provides a key-value store
interface. It is often used in applications requiring fast data access.
• LevelDB: A fast key-value storage library written at Google that provides an ordered mapping
from string keys to string values. It is designed for high performance and efficiency.