Nosql Mod3
MAP-REDUCE
▪ On the application machine: All the relevant data must first be transferred from the database to the application, which is costly for large datasets.
▪ On the database server: This reduces the amount of data that needs to be transferred but may increase the load on the server.
o The challenge in both cases is how to efficiently handle large datasets while considering factors like server load and data transfer.
• Map-Reduce in Clusters:
• Map-Reduce Overview:
• Implementation Differences:
• Scenario Setup:
o The example involves orders and line items as aggregates. Each order has line
items containing product ID, quantity, and price charged.
o The goal is to compute the total revenue for a product over the last seven
days, but this does not fit the current aggregate structure of the orders,
which makes it difficult to generate the report directly from the data.
o To generate the product revenue report, you would need to visit every
machine in the cluster and examine records across all machines. This is
inefficient for large datasets.
• Map-Reduce Solution:
o This scenario calls for map-reduce, which is ideal for distributed computing
where the computation can be spread across multiple machines.
• Map Function:
o The first stage of a map-reduce job is the map. The map function takes a
single aggregate (in this case, an order) as input and outputs a set of key-
value pairs.
o For each line item in the order, the output would be a key-value pair where
the product ID is the key, and the quantity and price are the values.
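As a rough illustration, such a map function might look like the following plain-Java sketch (the Order, LineItem, and SalesValue types are assumptions for illustration, not part of any particular framework):

import java.util.*;

// Illustrative aggregates: an order containing line items.
record LineItem(String productId, int quantity, double price) {}
record Order(String orderId, List<LineItem> lineItems) {}

// Value emitted for each line item: the quantity and the price charged.
record SalesValue(int quantity, double price) {}

class ProductSalesMapper {
    // The map function reads a single order and emits one
    // (productId -> quantity/price) pair per line item.
    static List<Map.Entry<String, SalesValue>> map(Order order) {
        List<Map.Entry<String, SalesValue>> output = new ArrayList<>();
        for (LineItem item : order.lineItems()) {
            output.add(Map.entry(item.productId(),
                    new SalesValue(item.quantity(), item.price())));
        }
        return output;
    }
}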
• Reduce Function:
o After the map function generates key-value pairs, the reduce function
processes them. The reduce function takes multiple map outputs with the
same key and combines their values.
o For example, if the map function produces 1000 line items for a product (e.g.,
"Database Refactoring"), the reduce function will aggregate them into a single
output, summarizing the total quantity and revenue for that product.
o The reduce function can use all values associated with a single key. This is
where the aggregation occurs.
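Continuing the same illustrative types, a matching reduce function could sum every value that shares a product ID:

import java.util.*;

// Same illustrative value type emitted by the mapper.
record SalesValue(int quantity, double price) {}
// Output of the reducer: total quantity and revenue for one product.
record ProductSummary(String productId, int totalQuantity, double totalRevenue) {}

class ProductSalesReducer {
    // The framework calls this once per key, passing every value
    // the mappers emitted for that product ID.
    static ProductSummary reduce(String productId, List<SalesValue> values) {
        int totalQuantity = 0;
        double totalRevenue = 0.0;
        for (SalesValue v : values) {
            totalQuantity += v.quantity();
            totalRevenue += v.quantity() * v.price();
        }
        return new ProductSummary(productId, totalQuantity, totalRevenue);
    }
}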
• Map-Reduce Framework:
o The map-reduce framework collects all the values associated with a specific key and calls the reduce function once with that key and its corresponding values. This simplifies the task of writing the reduce function.
• Job Execution:
o To execute a map-reduce job, you need to write the map and reduce
functions.
o The framework takes care of the underlying task distribution, ensuring the
computation is efficient across the cluster.
o In the simplest map-reduce setup, there is a single reduce function that takes
the outputs from all the map tasks running across different nodes.
o These outputs are concatenated and sent to the reduce function, which
works well but has room for optimization in terms of parallelism and data
transfer reduction.
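A toy, in-memory illustration of this simplest setup (all names are made up; a real framework distributes this work across nodes): map outputs from every input are concatenated, grouped by key, and passed to a single reduce function.

import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

class SingleReducerJob {
    // Runs map over every input, concatenates and groups the emitted pairs
    // by key, then calls the single reduce function once per key.
    static <I, K, V, R> Map<K, R> run(List<I> inputs,
                                      Function<I, List<Map.Entry<K, V>>> map,
                                      BiFunction<K, List<V>, R> reduce) {
        Map<K, List<V>> grouped = new HashMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : map.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<K, R> results = new HashMap<>();
        grouped.forEach((key, values) -> results.put(key, reduce.apply(key, values)));
        return results;
    }
}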
• Increasing Parallelism with Partitioning:
o The map outputs are grouped into partitions based on the key of each
processing node. These partitions are then combined into single groups and
sent to the corresponding reducers.
o The process of grouping the map outputs into partitions and sending them to
the reducers is known as shuffling, and the partitions are often referred to as
buckets or regions.
o A combiner function helps by combining the data for the same key into a
single value before it is transferred. This reduces the data being shuffled
between nodes.
o The combiner function is similar to the reduce function and can often be the
same function used for final reduction.
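A rough sketch of these two ideas in plain Java (names are illustrative, not from any framework): a partition function that assigns keys to reducers, and a summing combiner for the revenue example that can double as the final reducer because addition is associative.

import java.util.List;

class ShuffleSketch {
    // Partitioner: pairs with the same key always land in the same partition
    // (and therefore reach the same reducer), whichever node emitted them.
    static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Combiner for the revenue example: per-node partial totals can be
    // summed again later, so this same function can serve as both the
    // combiner and the final reducer.
    static double sumRevenue(List<Double> revenues) {
        double total = 0.0;
        for (double r : revenues) total += r;
        return total;
    }
}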
• Non-Combinable Reducers:
o Not all reduce functions can be used as combiners. Consider, for example, a function that counts the number of unique customers for a particular product (sketched after this list):
▪ The map emits product-customer pairs, and the reducer combines them and counts how many distinct customers bought that product.
o In this case, the output of the reducer differs from its input, so it cannot be
used as a combiner.
o You can still apply a combining function to eliminate duplicate product-
customer pairs before sending data to the reducer, but this is not the same as
the final reduction.
o Averaging Example: An average cannot be combined from partial averages; instead, the map and combine stages carry the running total and the record count, and only the final reduce divides the total by the count.
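Returning to the unique-customer count above, a plain-Java sketch (types and names are illustrative) shows why that reducer cannot serve as its own combiner, while a deduplicating combine step is still possible:

import java.util.*;

class UniqueCustomerCount {
    // Reducer: input is the customer IDs seen for one product,
    // output is a single count. Because the output (a number) has a
    // different shape than the input (customer IDs), this function
    // cannot be reused as a combiner.
    static int reduce(String productId, List<String> customerIds) {
        return new HashSet<>(customerIds).size();
    }

    // What a per-node combining step CAN still do: drop duplicate
    // product-customer pairs before they are shuffled to the reducer.
    static List<String> dedupe(String productId, List<String> customerIds) {
        return new ArrayList<>(new LinkedHashSet<>(customerIds));
    }
}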
• Counting Example:
o When performing counts, the map function will emit a count field with a
value of 1 for each item. These values can then be summed during the reduce
phase to get the total count.
o This aligns with the map-reduce model, where data emitted in the map phase
is aggregated in the reduce phase.
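A minimal sketch of this counting pattern (names are illustrative):

import java.util.List;
import java.util.Map;

class CountingSketch {
    // Map: emit a count field of 1 for each matching item.
    static Map.Entry<String, Integer> map(String productId) {
        return Map.entry(productId, 1);
    }

    // Reduce: sum the emitted 1s to get the total count for the key.
    static int reduce(String productId, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }
}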
o Example: To compare sales of products for each month in 2011 to the prior
year, the calculations are divided into two stages:
1. The first stage produces records showing the aggregate sales figures
for each product per month.
2. The second stage compares the current year's sales to the prior year's
sales for the same month.
o The first stage processes original order records, outputting key-value pairs
that show sales for each product per month.
o This stage uses a composite key (product and month) to reduce records based
on multiple fields.
o The second-stage mappers process the output from the first stage and
categorize it by year. For example, a 2011 record populates the current year
quantity, and a 2010 record populates the prior year quantity.
o The reduce function merges records, summing values for each key and
calculating the comparison between the current and prior year’s sales.
o Small steps are easier to manage and combine than large, complex steps.
o Intermediate output from early stages can be reused for other outputs, which
saves both programming and execution time. This reuse is particularly useful
when intermediate records represent heavy data access.
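A rough sketch of the second stage under these assumptions (record and field names are invented for illustration): the mapper routes each first-stage record into a current-year or prior-year slot using the composite product/month key, and the reducer merges the slots.

import java.util.*;

class YearOnYearSales {
    // Composite key produced by the first stage: product plus month.
    record ProductMonth(String productId, int month) {}
    // First-stage output / second-stage input: sales of a product in one month of one year.
    record MonthlySales(ProductMonth key, int year, int quantity) {}
    // Second-stage value: the quantity placed into a current-year or prior-year slot.
    record YearSlots(int currentYearQty, int priorYearQty) {}

    // Second-stage map: categorize a monthly record by year.
    static Map.Entry<ProductMonth, YearSlots> map(MonthlySales sales) {
        YearSlots slots = (sales.year() == 2011)
                ? new YearSlots(sales.quantity(), 0)   // current year
                : new YearSlots(0, sales.quantity());  // prior year
        return Map.entry(sales.key(), slots);
    }

    // Second-stage reduce: merge all slots that share a (product, month) key.
    static YearSlots reduce(ProductMonth key, List<YearSlots> values) {
        int current = 0, prior = 0;
        for (YearSlots v : values) {
            current += v.currentYearQty();
            prior += v.priorYearQty();
        }
        return new YearSlots(current, prior);
    }
}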
• Materialized Views:
o The output of a map-reduce computation is often saved as a materialized view, so that reports can be read repeatedly without rerunning the computation.
o Apache Pig and Hive are languages built specifically to simplify writing map-reduce programs.
• Importance of Map-Reduce:
• Future of Map-Reduce:
o The map stage is easy to handle incrementally. The mapper only needs to be
rerun if the input data changes.
o Since maps are isolated from each other, incremental updates for the map
stage are straightforward and efficient.
o The reduce stage is more complex because it combines outputs from many
map tasks. Any change in the map outputs could trigger a new reduction,
leading to the need for recomputation.
• Combinable Reducers:
o If the changes are additive (i.e., new records are added without modifying or
deleting old ones), the reduce operation can be run with the existing results
combined with the new additions.
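A small sketch of this additive case (illustrative names): because the summing reducer's output has the same shape as its input, the stored result can simply be reduced together with the reduction of the new records.

import java.util.List;

class IncrementalReduce {
    // A combinable reducer: it sums revenue figures, and its output
    // has the same shape as its input.
    static double reduce(List<Double> revenues) {
        double total = 0.0;
        for (double r : revenues) total += r;
        return total;
    }

    // Additive update: fold the reduction of the newly added records into
    // the previously stored result instead of recomputing from scratch.
    static double update(double previousResult, List<Double> newRevenues) {
        return reduce(List.of(previousResult, reduce(newRevenues)));
    }
}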
• Framework Control:
KEY-VALUE DATABASES
• A key-value store is a type of database that functions as a simple hash table, typically
used when all database access is done via a primary key.
• The key column (e.g., ID) acts as the unique identifier, and the value column stores
the associated data (e.g., NAME).
• When inserting data, the application provides a key-value pair. If the key already
exists, the existing value is overwritten; otherwise, a new entry is created.
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what's inside; it's the responsibility of the application to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
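Conceptually, the whole API is a hash table; a toy in-memory version in Java (real products add persistence, replication, and networking on top of this interface):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy key-value store: the value is an opaque blob that the store
// never inspects; interpreting it is the application's job.
class SimpleKeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    byte[] get(String key)             { return data.get(key); }    // fetch the value for a key
    void put(String key, byte[] value) { data.put(key, value); }    // insert or overwrite
    void delete(String key)            { data.remove(key); }        // remove the key
}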
Some of the popular key-value databases are Riak [Riak], Redis (often referred to as Data
Structure server) [Redis], Memcached DB and its flavors [Memcached], Berkeley DB
[Berkeley DB], HamsterDB (especially suited for embedded use) [HamsterDB], Amazon
DynamoDB [Amazon's Dynamo] (not open-source), and Project Voldemort [Project
Voldemort] (an open-source implementation of Amazon DynamoDB).
In some key-value stores, such as Redis, the aggregate being stored does not have to be a domain object; it could be any data structure. Redis supports storing lists, sets, and hashes and can do range, diff, union, and intersection operations. These features allow Redis to be used in a wider variety of ways than a standard key-value store.
There are many more key-value databases and many new ones are being worked on at this
time. For the sake of keeping discussions in this book easier we will focus mostly on Riak.
Riak lets us store keys into buckets, which are just a way to segment the keys—think of
buckets as flat namespaces for the keys.
If we wanted to store user session data, shopping cart information, and user preferences in
Riak, we could just store all of them in the same bucket with a single key and single value for
all of these objects. In this scenario, we would have a single object that stores all the data
and is put into a single bucket (Figure 8.1).
The downside of storing all the different objects (aggregates) in the single bucket would be that one bucket would store different types of aggregates, increasing the chance of key conflicts. An alternate approach would be to append the name of the object to the key, such as 288790b8a421_userProfile, so that we can get to individual objects as they are needed (Figure 8.2).
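A tiny sketch of that key-naming convention (the helper is hypothetical, not part of the Riak API):

// Builds keys such as 288790b8a421_userProfile so that different
// aggregates for the same ID can share one bucket without key conflicts.
class KeyDesign {
    static String key(String id, String objectType) {
        return id + "_" + objectType;
    }

    public static void main(String[] args) {
        System.out.println(key("288790b8a421", "userProfile"));   // 288790b8a421_userProfile
        System.out.println(key("288790b8a421", "shoppingCart"));  // 288790b8a421_shoppingCart
    }
}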
We could also create buckets that store specific data. In Riak, these are known as domain buckets, which allow the serialization and deserialization to be handled by the client driver. Using domain buckets or different buckets for different objects (such as UserProfile and ShoppingCart) segments the data across different buckets, allowing you to read only the object you need without having to change the key design.
Key-value stores such as Redis also support storing random data structures, which can be
sets, hashes, strings, and so on. This feature can be used to store lists of things, like states or
addressTypes, or an array of user’s visits.
o Understanding how these stores work helps in knowing what features are missing and how the application architecture needs to adapt to key-value data stores.
8.2.1 Consistency
o Optimistic writes can be done, but they are expensive because the data store
cannot determine the change in value.
o With the Riak Java client, consistency-related settings such as sibling handling, the number of replicas, and the read and write quorums are supplied when creating the bucket:
Bucket bucket = connection
    .createBucket(bucketName)
    .withRetrier(attempts(3))
    .allowSiblings(siblingsAllowed)
    .nVal(numberOfReplicasOfTheData)
    .w(numberOfNodesToRespondToWrite)
    .r(numberOfNodesToRespondToRead)
    .execute();
• Transaction Guarantees:
o Key-value stores generally offer no guarantees on writes; Riak uses the quorum model, applying the W value (write quorum) during writes.
▪ For example, with a replication factor N = 5 and W = 3, a write is reported successful once three nodes respond; the cluster can tolerate N - W = 2 nodes being down for writes, although data that never reached those 2 nodes can be missed by read operations.
o Key-value stores can only be queried by key; to check an attribute inside the value, the application must read the whole value and process it itself.
o Without knowing the key, querying becomes challenging (especially for ad-
hoc debugging).
o Some key-value stores (like Riak) allow searching inside values (e.g., Riak
Search with Lucene indexing).
o Careful thought therefore has to go into the design of the key. A key can be:
▪ Generated by an algorithm.
▪ Provided by the user (for example, a user ID or email).
▪ Derived from timestamps or other data available outside the application.
o The expiry_secs property can be used to expire keys after a certain time
(useful for session data).
{ "lastVisit":1324669989288,
"user":{"customerId":"91cfdf5bcb7c","name":"buyer","countryCode":
"US","tzOffset":0}
http://localhost:8098/buckets/session/keys/a7e618d9db25
curl -i http://localhost:8098/buckets/session/keys/a7e618d9db25
o Riak allows specifying the Content-Type in the POST request to indicate the
data type.
8.2.5 Scaling
o Sharding determines which node stores a key based on the value of the key.
o Example: If sharding is done by the first character of the key, a key starting
with 'f' will be stored on a different node than one starting with 'a'.
o If a node goes down (e.g., storing keys starting with 'f'), the data becomes
unavailable.
o Riak addresses this by replicating each key to N nodes and letting the client choose how many nodes must respond to a read (R) or a write (W), which gives flexibility in handling node failures for read or write operations.
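A toy sketch of key-based sharding (the hash function here stands in for the first-character rule in the example; real stores such as Riak use consistent hashing plus replication):

import java.util.List;

class KeySharding {
    // Chooses the node responsible for a key by hashing it; the text's
    // example of sharding on the key's first character works the same way.
    // Without replication, losing the chosen node makes its keys unavailable.
    static String nodeFor(String key, List<String> nodes) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("nodeA", "nodeB", "nodeC");
        System.out.println(nodeFor("f1234_shoppingCart", nodes));
        System.out.println(nodeFor("a5678_shoppingCart", nodes));
    }
}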
o Storing session data in a key-value store is efficient, as the entire session can
be stored in a single PUT request and retrieved with a single GET.
o Memcached is often used for such cases, and Riak is suitable when availability
is a priority.
o User profiles typically have unique identifiers (e.g., userId, username) along
with preferences (e.g., language, color, timezone, product access).
o All these attributes can be stored in a single object, allowing quick retrieval
via a GET operation.
o E-commerce websites tie shopping carts to user IDs, ensuring that cart data is
available across browsers, machines, and sessions.
o The user’s shopping cart information can be stored as the value, with the user
ID as the key.
o Riak clusters are well-suited for handling such applications due to their
availability and scaling features.
1. Relationships among Data
o Key-value stores are not ideal when you need relationships between different sets of data or need to correlate data across multiple keys.
o Although some key-value stores offer link-walking features, they are not
designed for handling complex relationships between data sets.
2. Multioperation Transactions
o If multiple keys need to be saved, and you require a rollback or revert if any
one operation fails, key-value stores are not the best solution.
3. Query by Data
o Key-value stores do not perform well when you need to search keys based on
attributes within the value part of the key-value pairs.
o The database cannot inspect the value side, except in products like Riak
Search or external indexing engines like Lucene or Solr.
4. Operations by Sets