DRKP Module 3
Contents – Module 3
Map-Reduce
Basic Map-Reduce
Partitioning and Combining
Composing Map-Reduce Calculations
A Two Stage Map-Reduce Example
Incremental Map-Reduce
Chapter 1 – Map-Reduce
The rise of aggregate-oriented databases is in large part due to the need for handling large
amounts of data across multiple machines in clusters.
When working with a single machine, data storage and processing are simpler. You can either:
Process the data on the same computer where it is stored (the server).
Running it on the server is often more efficient for large data, but it puts more load on that machine.
Send the data to another computer to process it (the client).
Processing on a client offers the flexibility to set up any software, tools, or programming language you want, without restrictions.
However, it takes time to move data between computers, which can slow things down.
In a cluster, where multiple machines are connected, processing can be distributed across
many machines, which speeds things up.
However, when computers in a cluster share a lot of data with each other, it can slow
things down. This is because transferring data between machines takes time and uses
network resources.
To handle this efficiently, developers use the "map-reduce" pattern
Map Step: Each computer processes its own local data independently, keeping the
work and data on the same computer as much as possible.
Reduce Step: After each computer completes its part, the results are combined
(reduced) across the cluster to produce a final result.
This method was popularized by Google’s MapReduce framework, and there is a widely used open-source version called Hadoop.
Definition
MapReduce is a programming model used for processing large datasets by distributing the
work across multiple machines in a cluster.
The basic idea is to divide a big task into smaller sub-tasks (the "Map" phase), and then
combine the results of those sub-tasks into a final answer (the "Reduce" phase).
Phases
1. Map phase
The data is split into smaller chunks, and each chunk is processed independently using a
"map function." It focuses on processing input data and transforming it into
intermediate key-value pairs.
2. Reduce Phase
The reduce function processes the grouped key-value pairs, typically performing an
aggregation (e.g., sum, count) to produce the final result.
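As a minimal illustration of the two phases, here is a word-count sketch in plain Java (not tied to any particular framework; the grouping a real framework performs between the phases is simulated in main):

import java.util.*;

public class WordCountSketch {
    // Map phase: transform one input record (a line of text) into intermediate key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: aggregate all values that were emitted for the same key.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        List<String> input = List.of("tea tea oolong", "oolong tea");
        // Group the intermediate pairs by key (the framework normally does this shuffle step).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        grouped.forEach((word, counts) -> System.out.println(word + " = " + reduce(word, counts)));
    }
}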
Example - We have a system for managing customer orders. Each order includes several items, and each item has a product, quantity, and price.
We group orders as aggregates because people usually want to view a whole order at once (such as when a customer wants to review all the items they purchased).
Since we have a lot of orders, we have spread them across many computers in a cluster (sharding). This means that each machine stores a portion of the total orders, and when querying, the system needs to access these machines and combine the data as needed.
Now, the sales team wants a report showing the total revenue for each product over the last seven days. This is tricky because our data is organized by orders, not by products. To get this report, we would need to check every order on every computer, which would be slow.
This is where MapReduce helps:
Map Step
Each computer in the cluster looks at its orders. For each item in an order, it creates a key-value pair, where the key is the product and the value contains the quantity and price. A map function reads records from the database and emits key-value pairs.
Example - We have a system for managing customer orders. Each order includes several items, and each item has a product, quantity, and price. For Product 1 (puerh), the map output will be:
("puerh", (26, 8))
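A sketch of what such a map function might look like in plain Java (the Order and LineItem classes and helper names are illustrative, not from any specific framework), following the ("puerh", (26, 8)) output above:

import java.util.*;

public class OrderMapSketch {
    // Illustrative records for an order and its line items.
    record LineItem(String product, int quantity, double price) {}
    record Order(List<LineItem> items) {}

    // Map step: read one order and emit a (product -> (quantity, price)) pair per line item.
    static Map<String, List<double[]>> map(Order order) {
        Map<String, List<double[]>> emitted = new HashMap<>();
        for (LineItem item : order.items()) {
            emitted.computeIfAbsent(item.product(), k -> new ArrayList<>())
                   .add(new double[] { item.quantity(), item.price() });
        }
        return emitted;
    }

    public static void main(String[] args) {
        Order order = new Order(List.of(new LineItem("puerh", 26, 8.0)));
        map(order).forEach((product, pairs) ->
            pairs.forEach(qp -> System.out.println(
                "(\"" + product + "\", (" + (int) qp[0] + ", " + qp[1] + "))")));
    }
}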
Partitioning strategies:
4. Size-Based Partitioning - Divide keys based on the size of each word (short, medium, long). Short words (1–4 letters) go to Partition 1.
5. Custom Partition Function - Define a custom function based on any specific criteria relevant to the dataset. A custom function might assign words based on whether they appear frequently, contain specific letters, or match certain patterns, dynamically creating partitions.
A sketch of both strategies appears below.
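A sketch of what the size-based and custom partition functions above might look like (illustrative plain Java; the boundaries for medium and long words are assumed, only the 1–4 letter rule comes from the text):

public class PartitionSketch {
    // Size-based partitioning: short words (1-4 letters) go to partition 1;
    // the cut-offs for medium and long words below are assumed examples.
    static int sizeBasedPartition(String word) {
        if (word.length() <= 4) return 1;
        if (word.length() <= 8) return 2;
        return 3;
    }

    // Custom partition function: any criterion relevant to the dataset,
    // here simply whether the word contains a specific letter.
    static int customPartition(String word) {
        return word.contains("e") ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(sizeBasedPartition("tea"));        // 1
        System.out.println(sizeBasedPartition("genmaicha"));  // 3
        System.out.println(customPartition("puerh"));         // 1
    }
}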
Combining
The combiner function is like a mini-reducer; its goal is to reduce the amount of data sent from the mapper to the reducer by partially processing data.
Another problem we can solve is the large amount of data being transferred between the map and reduce stages. A lot of this data is repeated, with multiple key-value pairs for the same key. A combiner function helps reduce this by merging all the data for the same key into one value.
A combiner function is basically like a reducer function. In many cases, the same function can be used for both combining and the final reduction. The reduce function needs to be set up in a specific way for this to work: its output should match its input. We call this a "combinable reducer."
Combining reduces data before sending it across the network.
Combining
Before sending the data to the reducers, we can use a combiner to combine all
values for the same product within each partition. This step is optional, but it helps to
reduce the amount of data that needs to be sent across the network.
In the reduce step, each reducer will receive the data for a single product and
combine the results. Since we used partitioning, each reducer will only work with
data for a specific product.
Reducers process each key's data in parallel, and multiple reducers can operate at
the same time on different keys (products).
Combinable reducer
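A minimal sketch of what a combinable reducer for the product-revenue example might look like: because its input values and its output have the same form (revenue amounts for one product), the same function can run as a combiner on each node and again for the final reduction (illustrative plain Java, not a specific framework's API):

import java.util.*;

public class CombinableReducerSketch {
    // Reduce: sum the partial revenues for one product. The input values and the output
    // have the same form (a revenue amount), so this reducer is combinable.
    static double reduce(String product, List<Double> partialRevenues) {
        double total = 0;
        for (double r : partialRevenues) total += r;
        return total;
    }

    public static void main(String[] args) {
        // Combiner pass on one node: merge that node's pairs for "puerh".
        double combinedOnNode = reduce("puerh", List.of(208.0, 96.0));
        // Final reduce pass across nodes: the same function, fed the combined outputs.
        double finalTotal = reduce("puerh", List.of(combinedOnNode, 150.0));
        System.out.println(finalTotal);  // 454.0
    }
}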
Two-stage Map-Reduce Example
If the map-reduce calculations get more complex, it is useful to break them down into stages using a pipes-and-filters approach, with the output of one stage serving as input to the next.
Two-stage Map-Reduce Example
Example - Compare the sales of products for each month in 2011 to the prior year (2010).
The first stage will produce records showing
the aggregate figures for a single product in
a single month of the year.
The second stage then uses these as inputs
and produces the result for a single product
by comparing one month’s results with the
same month in the prior year to determine
the difference or growth.
Two-stage Map-Reduce Example – Second stage
The reduce in this case is a merge of
records, where combining the values
by summing allows two different year
outputs to be reduced to a single value
(with a calculation based on the
reduced values thrown in for good
measure).
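A sketch of that second-stage reduce (illustrative plain Java; the record fields and figures are assumptions, not the book's code):

import java.util.*;

public class YearComparisonSketch {
    // A second-stage record: monthly sales for one product, with one field per year.
    // A record coming from the 2011 data has only the 2011 field filled in, and vice versa.
    record MonthlySales(String product, int month, double sales2011, double sales2010) {}

    // Reduce: merge records for the same (product, month) by summing the per-year fields,
    // then compute the comparison from the reduced values.
    static String reduce(String product, int month, List<MonthlySales> records) {
        double total2011 = 0, total2010 = 0;
        for (MonthlySales r : records) {
            total2011 += r.sales2011();
            total2010 += r.sales2010();
        }
        double growth = (total2011 - total2010) / total2010;
        return product + " month " + month + ": 2011=" + total2011
                + " 2010=" + total2010 + " growth=" + growth;
    }

    public static void main(String[] args) {
        List<MonthlySales> inputs = List.of(
            new MonthlySales("puerh", 3, 330.0, 0.0),    // from the 2011 first-stage output
            new MonthlySales("puerh", 3, 0.0, 300.0));   // from the 2010 first-stage output
        System.out.println(reduce("puerh", 3, inputs));  // growth = 0.1
    }
}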
Advantages of Two-stage Map-Reduce
Instead of trying to perform all the calculations in one large, complex step, it is easier to break them into smaller steps using multiple MapReduce stages.
Another advantage is the intermediate outputs generated during the MapReduce process can often be
reused for other tasks.
For example, aggregated data (such as monthly sales) can serve multiple downstream analyses
without reprocessing the raw data.
Map-reduce is a pattern that can be implemented in any programming language. However, the constraints
of the style make it a good fit for languages specifically designed for map-reduce computations.
Apache Pig [Pig] , an offshoot of the Hadoop [Hadoop] project, is a language specifically built to
make it easy to write map-reduce programs. It certainly makes it much easier to work with Hadoop
than the underlying Java libraries.
In a similar vein, if you want to specify map-reduce programs using an SQL-like syntax, there is Hive [Hive], another Hadoop offshoot.
MapReduce is not limited to NoSQL databases. It originated with Google’s system for processing large-
scale data on distributed file systems.
The open-source Hadoop project follows this model, making it accessible to more organizations.
Incremental Map-Reduce
The examples we’ve discussed so far are complete map-reduce computations, where we start with raw
inputs and create a final output.
Many map-reduce computations take a while to perform, even with clustered hardware, and new data
keeps coming in which means we need to rerun the computation to keep the output up to date.
Starting from scratch each time can take too long, so often it’s useful to structure a map-reduce
computation to allow incremental updates, so that only the minimum computation needs to be done.
The map stages of a map-reduce are easy to handle incrementally—only if the input data changes
does the mapper need to be rerun. Since maps are isolated from each other, incremental updates are
straightforward.
Incremental Map-Reduce
The more complex case is the reduce step, since it pulls together the outputs from many maps and any change
in the map outputs could trigger a new reduction.
This recomputation can be lessened depending on how parallel the reduce step is. If we are partitioning the data
for reduction, then any partition that’s unchanged does not need to be re-reduced.
Similarly, if there’s a combiner step, it doesn’t need to be rerun if its source data hasn’t changed. If our reducer is combinable, there are more opportunities for computation avoidance.
If the changes are additive—that is, if we are only adding new records but are not changing or deleting any old
records—then we can just run the reduce with the existing result and the new additions.
If there are destructive changes, that is updates and deletes, then we can avoid some recomputation by breaking
up the reduce operation into steps and only recalculating those steps whose inputs have changed—essentially,
using a Dependency Network (It helps figure out what needs to be recalculated and what doesn’t) to organize
the computation.
The map-reduce framework controls much of this, so you have to understand how a specific framework supports incremental operation.
Incremental Map-Reduce - Example: calculating the total sales for each product

Step 1: Initial MapReduce

Customer   Product      Quantity
Jane       Puerh        25
Jane       Puerh        15
John       Dragonwell   23
Max        Genmaicha    10
Jane       Dragonwell    7

Map step output: (Puerh, 25), (Puerh, 15), (Dragonwell, 23), (Genmaicha, 10), (Dragonwell, 7)
Reduce step (combine quantities for each product): Puerh: 40, Dragonwell: 30, Genmaicha: 10

Step 2: New Data Arrives

Customer   Product      Quantity
Jane       Puerh        12
Max        Genmaicha    10

Instead of starting over with all the data, use incremental MapReduce to update only what has changed.
Map step output for the new records: (Puerh, 12), (Genmaicha, 10)
New totals: Puerh: 12, Genmaicha: 10
Combine with existing results: add the new results to the existing totals.
Existing totals: Puerh: 40, Dragonwell: 30, Genmaicha: 10
Combined totals: Puerh: 40 + 12 = 52, Genmaicha: 10 + 10 = 20, Dragonwell: 30 (unchanged)

What if data changes (updates/deletions)?

Case 1: Update a record. Suppose one of Jane’s Puerh records changes from 15 to 16.
1. Subtract the old value (15) from the total: Puerh: 52 - 15 = 37.
2. Add the new value (16) to the total: Puerh: 37 + 16 = 53.

Case 2: Delete a record. Suppose Jane’s Dragonwell purchase (7) is deleted.
1. Subtract the deleted value from the total: Dragonwell: 30 - 7 = 23.

How incremental MapReduce saves time
Instead of reprocessing all sales records, you process only the new, updated, or deleted records.
The framework tracks dependencies using a Dependency Network so that only the affected totals (per product) are recalculated.
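A minimal sketch of this incremental update logic, assuming the previously reduced totals are kept around (illustrative plain Java, not a specific framework):

import java.util.*;

public class IncrementalUpdateSketch {
    public static void main(String[] args) {
        // Previously reduced totals, kept from the last run.
        Map<String, Integer> totals = new HashMap<>(Map.of(
            "Puerh", 40, "Dragonwell", 30, "Genmaicha", 10));

        // Additive change: new records only touch the keys they mention.
        totals.merge("Puerh", 12, Integer::sum);        // 40 + 12 = 52
        totals.merge("Genmaicha", 10, Integer::sum);    // 10 + 10 = 20

        // Update: subtract the old value, then add the new one.
        totals.merge("Puerh", -15, Integer::sum);       // remove the old quantity (15)
        totals.merge("Puerh", 16, Integer::sum);        // add the corrected quantity (16)

        // Delete: subtract the deleted record's value.
        totals.merge("Dragonwell", -7, Integer::sum);   // 30 - 7 = 23

        System.out.println(totals);  // untouched keys were never recomputed
    }
}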
Key points
Map-reduce is a pattern to allow computations to be parallelized over a cluster.
The map task reads data from an aggregate and boils it down to relevant key-value pairs. Maps only read a
single record at a time and can thus be parallelized and run on the node that stores the record.
Reduce tasks take many values for a single key output from map tasks and summarize them into a single output.
Each reducer operates on the result of a single key, so it can be parallelized by key.
Reducers whose output has the same form as their input can also be used as combiners; this improves parallelism and reduces the amount of data to be transferred.
Map-reduce operations can be composed into pipelines where the output of one reduce is the input to another
operation’s map.
If the result of a map-reduce computation is widely used, it can be stored as a materialized view.
Materialized views can be updated through incremental map-reduce operations that only compute changes to
the view instead of recomputing everything from scratch.
Chapter 2 – Key-Value Databases
Key-Value Database
A key-value store is a type of database that works like a simple hash table, primarily used when all access to the database is via the primary key.
Key: Think of it as a unique ID, like a label or identifier.
Value: This is the data or information associated with that key
For example:
In a traditional database (RDBMS), think of a table with two columns:
ID (the unique identifier, like a key)
NAME (the information, like a value).
In a key-value store, you store and retrieve data by providing the key:
If the key exists, the database retrieves the value.
If you save a new value for the same key, it overwrites the old value.
If the key doesn't exist, it creates a new entry.
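These semantics can be sketched with a plain in-memory map (illustrative Java, not any specific product's API):

import java.util.HashMap;
import java.util.Map;

public class KeyValueSemanticsSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();

        store.put("user_123", "{\"name\": \"Alice\"}");   // create a new entry for the key
        store.put("user_123", "{\"name\": \"Bob\"}");     // same key again: overwrites the old value

        System.out.println(store.get("user_123"));        // key exists: returns the latest value
        System.out.println(store.get("user_999"));        // key missing: nothing to return (null here)

        store.remove("user_123");                         // delete the key and its value
    }
}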
Key-Value Database
Let’s look at how terminology compares in Oracle and Riak.
What is a Key-Value Store
Key-value stores are the simplest NoSQL data stores to use from an API perspective.
The client can either get the value for the key, put a value for a key, or delete a key from the data store.
The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s the responsibility
of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
Example - A shopping cart application:
Key: user_123
Value: { "items": ["item1", "item2"], "total": 200 }
The application fetches, updates, or deletes the shopping cart for a user using the key user_123. The value
structure and meaning are managed by the application, not the key-value store
Popular Key-Value Databases
Some of the popular key-value databases are:
Riak [Riak], Redis (often referred to as Data Structure server) [Redis], Memcached DB and its flavors
[Memcached], Berkeley DB [Berkeley DB], HamsterDB (especially suited for embedded use)
[HamsterDB], Amazon DynamoDB [Amazon’s Dynamo] (not open-source), and Project Voldemort
[Project Voldemort] (an open-source implementation of Amazon DynamoDB).
In some key-value stores, such as Redis, the value stored for a key does not have to be a domain object (JSON, XML, CSV, and so on).
Instead, the value can be a data structure such as a list, set, or hash, and the store can perform range, diff, union, and intersection operations on it.
These features allow Redis to be used in more ways than a standard key-value store.
Riak stores keys in buckets.
Bucket: The container or namespace where data is stored (e.g., shopping_carts,
user_profiles).
Key: The unique identifier for each object in the bucket (e.g., user12345 or
user12345_order_1).
Object (Value): The actual data associated with the key. This is what is retrieved or
stored in the bucket.
Disadvantage - When storing all the different types of objects (or aggregates) of information in a single bucket, there is a downside: a single bucket would contain multiple types of aggregates (e.g., user profiles, shopping carts, session data), increasing the likelihood of key conflicts. When two objects (a user profile and a shopping cart) share the same key, the system might overwrite one with the other, leading to data corruption or loss.

Example - A bucket called "user_data" stores two different types of information for a user: a User Profile (name, email, etc.) and a Shopping Cart (items, total cost, etc.). If we use the key "user:12345" to store the user's profile and the same key "user:12345" to store the user's shopping cart, then when the shopping cart is saved the user profile data is overwritten, and the database only stores the shopping cart information. This leads to data corruption or loss.

Solution
1. Append the name of the object to the key. For example, instead of just storing a key as "user:12345", include the object type: "user:12345_userProfile". Similarly, a shopping cart could use the key "user:12345_shoppingCart".
2. Create a bucket to store each specific kind of data. For example, a bucket named "userProfiles" can store all user profile data, and a bucket named "shoppingCarts" can store all shopping cart data. These specialized buckets, called domain buckets, allow you to group and manage data more efficiently.

In Riak, domain buckets support serialization and deserialization of data, which is handled by the client driver.
Serialization is the process of converting an object (like a UserProfile) into a string format (usually JSON) that can be stored.
Deserialization is the process of converting a JSON string back into an object.
For example, the driver can automatically convert objects like UserProfile or ShoppingCart into a format that Riak can store and back.
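A sketch of the serialization side, using Jackson as an assumed JSON library; the class names and keys are illustrative, and the Riak store calls (in the style shown later in this chapter) are commented out so the snippet stands on its own:

import com.fasterxml.jackson.databind.ObjectMapper;

public class DomainBucketSketch {
    // Hypothetical domain class; the field names are illustrative.
    public static class UserProfile {
        public String name;
        public String email;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        UserProfile profile = new UserProfile();
        profile.name = "Alice";
        profile.email = "[email protected]";

        // Serialization: convert the object into a JSON string that can be stored.
        String json = mapper.writeValueAsString(profile);

        // Option 1: append the object type to the key inside a shared bucket.
        String key = "user:12345_userProfile";
        // bucket.store(key, json).execute();

        // Option 2: a dedicated domain bucket, keyed by the plain user id.
        // Bucket userProfiles = getBucket("userProfiles");
        // userProfiles.store("user:12345", json).execute();

        // Deserialization: convert the JSON string back into an object.
        UserProfile restored = mapper.readValue(json, UserProfile.class);
        System.out.println(key + " -> " + restored.name);
    }
}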
In a key-value store like Redis, the "value" you store doesn't have to be just a single piece of data
(like a number or text). It can be a more complex data structure, such as:
A list (e.g., a shopping list).
A set (e.g., a collection of unique tags or categories).
A hash (e.g., a dictionary with key-value pairs like a user profile).
Features
When using NoSQL data stores, it is important to compare them with a traditional RDBMS to understand which features they lack, such as consistency or transactions.
The primary reason is to understand what features are missing and how the application architecture needs to change to make better use of the features of a key-value data store.
For example, since key-value stores may not support complex queries or transactions
like relational databases, you might need to design your application differently to handle
these limitations
Some features to focus on for key-value stores are consistency, transactions, query features, structure of data, and scaling.
Features
1. Consistency
Consistency in key-value stores like Riak refers to how data is handled when it is written to the system.
Operations like get, put, or delete apply to a single key, which means consistency is easy to maintain for individual key operations. Consistency ensures that when you read a key, you get the most up-to-date value.
Example - Imagine a user profile with the key user:123 storing the name "Alice". If you update the name
to "Bob", you will always get the most recent value when you query user:123.
Optimistic writes (for example, two users trying to update the same data at the same time) can be performed, but they are very expensive to implement, because a change in value cannot be determined by the data store (system).
Example: If two users try to change the same field (e.g., profile_picture) at the same time, without
checking if the field was changed by someone else, it might cause a problem because the system cannot
automatically choose which value to keep.
In distributed key-value store implementations like Riak, data might not be immediately consistent across all nodes. The system aims for eventual consistency, meaning it will eventually synchronize the data across all servers.
Example - If you update your profile photo on one server and then immediately check from a different server, you might not see the update right away. However, after some time, all servers will synchronize, and you will see the same data everywhere.
1. Consistency
Since the value may have already been replicated to other nodes (when multiple writes occur to the same data
simultaneously), there can be conflicts.
Riak can resolve conflicts in two ways:
either the newest write wins and older writes lose,
or both (all) values are returned, allowing the client to resolve the conflict.
Example: If two users try to update the same product's price at the same time, Riak can either keep the most recent update (newest write wins) or return both prices (all versions) for the client to decide. In Riak, these options can be set up during bucket creation.

Buckets are just a way to namespace keys so that key collisions can be reduced; for example, all customer keys may reside in the customer bucket. When creating a bucket, default values for consistency can be provided, for example that a write is considered good only when the data is consistent across all the nodes where the data is stored.

Bucket bucket = connection
    .createBucket(bucketName)
    .withRetrier(attempts(3))
    .allowSiblings(siblingsAllowed)
    .nVal(numberOfReplicasOfTheData)
    .w(numberOfNodesToRespondToWrite)
    .r(numberOfNodesToRespondToRead)
    .execute();

For example, a user_profiles bucket configured so that only the latest write wins:

Bucket bucket = connection
    .createBucket("user_profiles")
    .withRetrier(attempts(3))
    .allowSiblings(false)   // only the latest write wins
    .nVal(3)                // three replicas of the data
    .w(3)                   // all 3 nodes must confirm the write
    .r(3)                   // read from all 3 nodes
    .execute();

If we need the data in every node to be consistent, we can increase the numberOfNodesToRespondToWrite set by w to be the same as nVal. Of course, doing that will decrease the write performance of the cluster.
To improve on write or read conflicts, we can change the allowSiblings flag during bucket creation: if it is set to false, we let the last write win and do not create siblings.
When allowSiblings = true, Riak does not automatically resolve write conflicts. Instead, it stores all versions of the conflicting data (i.e., the "siblings").
2. Transactions
Different products of the key-value store kind have different specifications of transactions.
Generally speaking, there are no guarantees on the writes, meaning that there is no certainty or
guarantee that the data will be successfully written to all nodes or even to a specific node.
Many data stores do implement transactions in different ways.
Riak uses the concept of quorum, implemented by using the W value (the write quorum) together with the replication factor during the write API call.
Assume we have a Riak cluster with a replication factor (N) of 5 and we supply the W value of 3.
When writing, the write is reported as successful only when it is written and reported as a success
on at least three of the nodes.
This allows Riak to have write tolerance (the system successfully completes the write operation even if some of the nodes (servers) or their replicas are unavailable).
In our example, with N equal to 5 and with a W value of 3, the cluster can tolerate N - W = 2
nodes being down for write operations, though we would still have lost some data on those nodes
for read.
By setting the W value, you can balance between consistency (making sure data is written to
enough nodes) and availability (ensuring writes happen even if some nodes are unavailable).
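A tiny sketch of the N and W arithmetic, using the values from the example above (illustrative only, not a Riak API):

public class QuorumSketch {
    public static void main(String[] args) {
        int n = 5;  // replication factor: copies of each value
        int w = 3;  // write quorum: nodes that must confirm a write
        int tolerableNodeFailuresForWrites = n - w;
        System.out.println("Writes still succeed with up to "
                + tolerableNodeFailuresForWrites + " nodes down");  // prints 2
    }
}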
3. Query Features
Key-value stores allow querying only by the key.
You can’t query based on the attributes inside the value directly.
For example, with Key: user:12345 and Value: {"name": "Alice", "country": "US"}, we can retrieve the user data only if we know the key; if we want all users with country: US, the database does not support such a query.
If you don’t know the key (e.g., during debugging), retrieving data can be challenging because
most databases will not give you a list of all the primary keys. For instance, if you forgot the Key:
user:12345, you can't search directly.
Some databases, like Riak Search, allow you to search inside values (e.g., finding all users
from country: US) using Lucene indexes.
Designing the key is crucial in key-value stores because the key is the primary way to access or
query data.
Keys can be:
created automatically by the system, using some algorithm to ensure uniqueness;
provided by the user, such as an email address or user ID;
derived from timestamps, for example to store events or logs.
3. Query Features
Key-value stores are great for storing data like:
Session Data: Use session ID as the key, e.g., session:a7e618d9db25.
Shopping Cart Data: Use cart ID as the key, e.g., cart:12345.
User Profiles: Use user ID as the key, e.g., user:98765.
Some data, like session or shopping cart data, is temporary and should automatically expire
(expiry_secs property) after a certain time. For example, Session Key: session:a7e618d9db25
and Expiry: 30 minutes (if the user logs out or session expires).
To read and write data in the database using Java
Using the store API, we write data into the Riak bucket with a specified key
Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.store(key, value).execute();
Similarly, we can get the value stored for the key using the fetch API.
Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.fetch(key).execute();
byte[] bytes = riakObject.getValue();
String value = new String(bytes);
3. Query Features
Saving data through the HTTP-based interface
Riak provides an HTTP-based interface to interact with the Riak database via the command line or a browser, using curl.
To save the data to Riak (this uploads the key-value pair into the Riak bucket over HTTP):
curl -v -X POST -d '
{ "lastVisit":1324669989288,
  "user":{"customerId":"91cfdf5bcb7c",
  "name":"buyer", "countryCode":"US",
  "tzOffset":0} }'
  -H "Content-Type: application/json"
  http://localhost:8098/buckets/session/keys/a7e618d9db25
The data for the key a7e618d9db25 can be fetched by using the curl command
curl -i http://localhost:8098/buckets/session/keys/a7e618d9db25
Features
4. Structure of Data
Key-value databases don’t care what is stored in the value part of the key-value pair.
The value can be a blob, text, JSON, XML, and so on.
In Riak, we can use the Content-Type in the POST request to specify the data type.
5. Scaling
Many key-value stores scale by using sharding. With sharding, the value of the key determines on
which node the key is stored.
Let’s assume we are sharding by the first character of the key; if the key is f4b19d79587d, which starts with an f, it will be sent to a different node than the key ad9c7a396542.
This kind of sharding setup can increase performance as more nodes are added to the cluster.
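A sketch of key-based shard routing (illustrative; a hash of the key is a common choice, though the first-character scheme described above works the same way):

import java.util.List;

public class ShardingSketch {
    static final List<String> NODES = List.of("node-A", "node-B", "node-C");

    // Route a key to a node: the key itself determines where the value is stored.
    static String nodeFor(String key) {
        int index = Math.floorMod(key.hashCode(), NODES.size());
        return NODES.get(index);
    }

    public static void main(String[] args) {
        System.out.println(nodeFor("f4b19d79587d"));  // these two keys may land on
        System.out.println(nodeFor("ad9c7a396542"));  // different nodes
    }
}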
Suitable Use Cases
1. Storing Session Information
Generally, every web session is unique and is assigned a unique sessionid value.
Applications that store the sessionid on disk or in an RDBMS will greatly benefit from moving to a key-value
store, since everything about the session can be stored by a single PUT request or retrieved using GET.
This single-request operation makes it very fast, as everything about the session is stored in a single object.
Solutions such as Memcached are used by many web applications, and Riak can be used when availability is
important.
2. User Profiles, Preferences
Almost every user has a unique userId, username, or some other attribute, as well as preferences such as
language, color, timezone, which products the user has access to, and so on.
This can all be put into an object, so getting preferences of a user takes a single GET operation. Similarly,
product profiles can be stored.
3. Shopping Cart Data
E-commerce websites have shopping carts tied to the user.
As we want the shopping carts to be available all the time, across browsers, machines, and sessions, all the
shopping information can be put into the value where the key is the userid.
A Riak cluster would be best suited for these kinds of applications
When not to use
1. Relationships among Data
If you need to have relationships between different sets of data, or correlate the data between different sets of
keys, key-value stores are not the best solution to use, even though some key-value stores provide link-
walking features.
2. Multioperation Transactions
If you’re saving multiple keys and there is a failure to save any one of them, and you want to revert or roll
back the rest of the operations, key-value stores are not the best solution to be used.
3. Query by Data
If you need to search the keys based on something found in the value part of the key-value pairs, then key-
value stores are not going to perform well for you. There is no way to inspect the value on the database side,
with the exception of some products like Riak Search or indexing engines like Lucene [Lucene] or Solr
[Solr] .
4. Operations by Sets
Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same
time. If you need to operate upon multiple keys, you have to handle this from the client side.