DRKP Module 3
Contents – Module 3
Map-Reduce
Basic Map-Reduce
Partitioning and Combining
Composing Map-Reduce Calculations
A Two Stage Map-Reduce Example
Incremental Map-Reduce
Chapter 1 – Map-Reduce
The rise of aggregate-oriented databases is in large part due to the need for handling large
amounts of data across multiple machines in clusters.
When working with a single machine, data storage and processing are simpler. You can either:
Process the data on the same computer where it is stored (the server).
Running it on the server is often more efficient for large data, but it puts more load on that machine.
Send the data to another computer to process it (the client).
Processing on a client offers the flexibility to set up any software, tools, or programming language you want, without restrictions.
However, it takes time to move data between computers, which can slow things down.
In a cluster, where multiple machines are connected, processing can be distributed across
many machines, which speeds things up.
However, when computers in a cluster share a lot of data with each other, it can slow
things down. This is because transferring data between machines takes time and uses
network resources.
To handle this efficiently, developers use the "map-reduce" pattern
Map Step: Each computer processes its own local data independently, keeping the
work and data on the same computer as much as possible.
Reduce Step: After each computer completes its part, the results are combined
(reduced) across the cluster to produce a final result.
This method was popularized by Google’s MapReduce framework, and there is a widely used open-source version called Hadoop.
Definition
MapReduce is a programming model used for processing large datasets by distributing the
work across multiple machines in a cluster.
The basic idea is to divide a big task into smaller sub-tasks (the "Map" phase), and then
combine the results of those sub-tasks into a final answer (the "Reduce" phase).
Phases
1. Map phase
The data is split into smaller chunks, and each chunk is processed independently using a
"map function." It focuses on processing input data and transforming it into
intermediate key-value pairs.
2. Reduce Phase
The reduce function processes the grouped key-value pairs, typically performing an
aggregation (e.g., sum, count) to produce the final result.
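As a minimal illustration of the two phases, here is a word-count sketch in plain Java (not tied to any particular framework; the grouping a real framework performs between the phases is simulated in main):

import java.util.*;

public class WordCountSketch {
    // Map phase: transform one input record (a line of text) into intermediate key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce phase: aggregate all values that were emitted for the same key.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        List<String> input = List.of("tea tea oolong", "oolong tea");
        // Group the intermediate pairs by key (the framework normally does this shuffle step).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        grouped.forEach((word, counts) -> System.out.println(word + " = " + reduce(word, counts)));
    }
}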
Example - We have a system for managing customer orders. Each order includes several items, and each item has a product, quantity, and price.
We group orders as aggregates because people usually want to view a whole order at once (such as when a customer wants to review all the items they purchased).
Since we have a lot of orders, we have spread them across many computers in a cluster (sharding). This means that each machine stores a portion of the total orders, and when querying, the system needs to access these machines and combine the data as needed.
Now, the sales team wants a report showing the total revenue for each product over the last seven days. This is tricky because our data is organized by orders, not by products. To get this report, we would need to check every order on every computer, which would be slow.
This is where MapReduce helps:
Map Step
Each computer in the cluster looks at its orders. For each item in an order, it creates a key-value pair, where the key is the product and the value contains the quantity and price. A map function reads records from the database and emits key-value pairs.
Example - We have a system for managing customer orders. Each order includes several items, and each item has a product, quantity, and price. For Product 1 (puerh), the map output will be:
("puerh", (26, 8))
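A sketch of what such a map function might look like in plain Java (the Order and LineItem classes and helper names are illustrative, not from any specific framework), following the ("puerh", (26, 8)) output above:

import java.util.*;

public class OrderMapSketch {
    // Illustrative records for an order and its line items.
    record LineItem(String product, int quantity, double price) {}
    record Order(List<LineItem> items) {}

    // Map step: read one order and emit a (product -> (quantity, price)) pair per line item.
    static Map<String, List<double[]>> map(Order order) {
        Map<String, List<double[]>> emitted = new HashMap<>();
        for (LineItem item : order.items()) {
            emitted.computeIfAbsent(item.product(), k -> new ArrayList<>())
                   .add(new double[] { item.quantity(), item.price() });
        }
        return emitted;
    }

    public static void main(String[] args) {
        Order order = new Order(List.of(new LineItem("puerh", 26, 8.0)));
        map(order).forEach((product, pairs) ->
            pairs.forEach(qp -> System.out.println(
                "(\"" + product + "\", (" + (int) qp[0] + ", " + qp[1] + "))")));
    }
}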
Partitioning strategies:
4. Size-Based Partitioning - Divide keys based on the size of each word (short, medium, long). Short words (1–4 letters) go to Partition 1.
5. Custom Partition Function - Define a custom function based on any specific criteria relevant to the dataset. A custom function might assign words based on whether they appear frequently, contain specific letters, or match certain patterns, dynamically creating partitions.
A sketch of both strategies appears below.
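A sketch of what the size-based and custom partition functions above might look like (illustrative plain Java; the boundaries for medium and long words are assumed, only the 1–4 letter rule comes from the text):

public class PartitionSketch {
    // Size-based partitioning: short words (1-4 letters) go to partition 1;
    // the cut-offs for medium and long words below are assumed examples.
    static int sizeBasedPartition(String word) {
        if (word.length() <= 4) return 1;
        if (word.length() <= 8) return 2;
        return 3;
    }

    // Custom partition function: any criterion relevant to the dataset,
    // here simply whether the word contains a specific letter.
    static int customPartition(String word) {
        return word.contains("e") ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(sizeBasedPartition("tea"));        // 1
        System.out.println(sizeBasedPartition("genmaicha"));  // 3
        System.out.println(customPartition("puerh"));         // 1
    }
}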
Combining
The combiner function is like a mini-reducer; its goal is to reduce the amount of data sent from the mapper to the reducer by partially processing data.
Another problem we can solve is the large amount of data being transferred between the map and reduce stages. A lot of this data is repeated, with multiple key-value pairs for the same key. A combiner function helps reduce this by merging all the data for the same key into one value.
A combiner function is basically like a reducer function. In many cases, the same function can be used for both combining and the final reduction. The reduce function needs to be set up in a specific way for this to work: its output should match its input. We call this a "combinable reducer."
Combining reduces data before sending it across the network.
Combining
Before sending the data to the reducers, we can use a combiner to combine all
values for the same product within each partition. This step is optional, but it helps to
reduce the amount of data that needs to be sent across the network.
In the reduce step, each reducer will receive the data for a single product and
combine the results. Since we used partitioning, each reducer will only work with
data for a specific product.
Reducers process each key's data in parallel, and multiple reducers can operate at
the same time on different keys (products).
Combinable reducer
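A minimal sketch of what a combinable reducer for the product-revenue example might look like: because its input values and its output have the same form (revenue amounts for one product), the same function can run as a combiner on each node and again for the final reduction (illustrative plain Java, not a specific framework's API):

import java.util.*;

public class CombinableReducerSketch {
    // Reduce: sum the partial revenues for one product. The input values and the output
    // have the same form (a revenue amount), so this reducer is combinable.
    static double reduce(String product, List<Double> partialRevenues) {
        double total = 0;
        for (double r : partialRevenues) total += r;
        return total;
    }

    public static void main(String[] args) {
        // Combiner pass on one node: merge that node's pairs for "puerh".
        double combinedOnNode = reduce("puerh", List.of(208.0, 96.0));
        // Final reduce pass across nodes: the same function, fed the combined outputs.
        double finalTotal = reduce("puerh", List.of(combinedOnNode, 150.0));
        System.out.println(finalTotal);  // 454.0
    }
}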
Two-stage Map-Reduce Example
If the map-reduce calculations get more complex, it is useful to break them down into stages using a pipes-and-filters approach, with the output of one stage serving as input to the next.
Two-stage Map-Reduce Example
Example - Compare the sales of products for each month in 2011 to the prior year (2010).
The first stage will produce records showing
the aggregate figures for a single product in
a single month of the year.
The second stage then uses these as inputs
and produces the result for a single product
by comparing one month’s results with the
same month in the prior year to determine
the difference or growth.
Two-stage Map-Reduce Example – Second stage
The reduce in this case is a merge of
records, where combining the values
by summing allows two different year
outputs to be reduced to a single value
(with a calculation based on the
reduced values thrown in for good
measure).
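A sketch of that second-stage reduce (illustrative plain Java; the record fields and figures are assumptions, not the book's code):

import java.util.*;

public class YearComparisonSketch {
    // A second-stage record: monthly sales for one product, with one field per year.
    // A record coming from the 2011 data has only the 2011 field filled in, and vice versa.
    record MonthlySales(String product, int month, double sales2011, double sales2010) {}

    // Reduce: merge records for the same (product, month) by summing the per-year fields,
    // then compute the comparison from the reduced values.
    static String reduce(String product, int month, List<MonthlySales> records) {
        double total2011 = 0, total2010 = 0;
        for (MonthlySales r : records) {
            total2011 += r.sales2011();
            total2010 += r.sales2010();
        }
        double growth = (total2011 - total2010) / total2010;
        return product + " month " + month + ": 2011=" + total2011
                + " 2010=" + total2010 + " growth=" + growth;
    }

    public static void main(String[] args) {
        List<MonthlySales> inputs = List.of(
            new MonthlySales("puerh", 3, 330.0, 0.0),    // from the 2011 first-stage output
            new MonthlySales("puerh", 3, 0.0, 300.0));   // from the 2010 first-stage output
        System.out.println(reduce("puerh", 3, inputs));  // growth = 0.1
    }
}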
Advantages of Two-stage Map-Reduce
Instead of trying to perform all the calculations in one large, complex step, it is easier to break them into smaller steps using multiple MapReduce stages.
Another advantage is the intermediate outputs generated during the MapReduce process can often be
reused for other tasks.
For example, aggregated data (such as monthly sales) can serve multiple downstream analyses
without reprocessing the raw data.
Map-reduce is a pattern that can be implemented in any programming language. However, the constraints
of the style make it a good fit for languages specifically designed for map-reduce computations.
Apache Pig [Pig] , an offshoot of the Hadoop [Hadoop] project, is a language specifically built to
make it easy to write map-reduce programs. It certainly makes it much easier to work with Hadoop
than the underlying Java libraries.
In a similar vein, if you want to specify map-reduce programs using an SQL-like syntax, there is Hive [Hive], another Hadoop offshoot.
MapReduce is not limited to NoSQL databases. It originated with Google’s system for processing large-
scale data on distributed file systems.
The open-source Hadoop project follows this model, making it accessible to more organizations.
Incremental Map-Reduce
The examples we’ve discussed so far are complete map-reduce computations, where we start with raw
inputs and create a final output.
Many map-reduce computations take a while to perform, even with clustered hardware, and new data
keeps coming in which means we need to rerun the computation to keep the output up to date.
Starting from scratch each time can take too long, so often it’s useful to structure a map-reduce
computation to allow incremental updates, so that only the minimum computation needs to be done.
The map stages of a map-reduce are easy to handle incrementally—only if the input data changes
does the mapper need to be rerun. Since maps are isolated from each other, incremental updates are
straightforward.
Incremental Map-Reduce
The more complex case is the reduce step, since it pulls together the outputs from many maps and any change
in the map outputs could trigger a new reduction.
This recomputation can be lessened depending on how parallel the reduce step is. If we are partitioning the data
for reduction, then any partition that’s unchanged does not need to be re-reduced.
Similarly, if there’s a combiner step, it doesn’t need to be rerun if its source data hasn’t changed. If our reducer is combinable, there are more opportunities for computation avoidance.
If the changes are additive—that is, if we are only adding new records but are not changing or deleting any old
records—then we can just run the reduce with the existing result and the new additions.
If there are destructive changes, that is updates and deletes, then we can avoid some recomputation by breaking
up the reduce operation into steps and only recalculating those steps whose inputs have changed—essentially,
using a Dependency Network (It helps figure out what needs to be recalculated and what doesn’t) to organize
the computation.
The map-reduce framework controls much of this, so you have to understand how a specific framework supports incremental operation.
Incremental Map-Reduce - Example: calculating the total sales for each product

Step 1: Initial MapReduce

Customer   Product      Quantity
Jane       Puerh        25
Jane       Puerh        15
John       Dragonwell   23
Max        Genmaicha    10
Jane       Dragonwell    7

Map step output: (Puerh, 25), (Puerh, 15), (Dragonwell, 23), (Genmaicha, 10), (Dragonwell, 7)
Reduce step (combine quantities for each product): Puerh: 40, Dragonwell: 30, Genmaicha: 10

Step 2: New Data Arrives

Customer   Product      Quantity
Jane       Puerh        12
Max        Genmaicha    10

Instead of starting over with all the data, use incremental MapReduce to update only what has changed.
Map step output for the new records: (Puerh, 12), (Genmaicha, 10)
New totals: Puerh: 12, Genmaicha: 10
Combine with existing results: add the new results to the existing totals.
Existing totals: Puerh: 40, Dragonwell: 30, Genmaicha: 10
Combined totals: Puerh: 40 + 12 = 52, Genmaicha: 10 + 10 = 20, Dragonwell: 30 (unchanged)

What if data changes (updates/deletions)?

Case 1: Update a record. Suppose one of Jane’s Puerh records changes from 15 to 16.
1. Subtract the old value (15) from the total: Puerh: 52 - 15 = 37.
2. Add the new value (16) to the total: Puerh: 37 + 16 = 53.

Case 2: Delete a record. Suppose Jane’s Dragonwell purchase (7) is deleted.
1. Subtract the deleted value from the total: Dragonwell: 30 - 7 = 23.

How incremental MapReduce saves time
Instead of reprocessing all sales records, you process only the new, updated, or deleted records.
The framework tracks dependencies using a Dependency Network so that only the affected totals (per product) are recalculated.
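A minimal sketch of this incremental update logic, assuming the previously reduced totals are kept around (illustrative plain Java, not a specific framework):

import java.util.*;

public class IncrementalUpdateSketch {
    public static void main(String[] args) {
        // Previously reduced totals, kept from the last run.
        Map<String, Integer> totals = new HashMap<>(Map.of(
            "Puerh", 40, "Dragonwell", 30, "Genmaicha", 10));

        // Additive change: new records only touch the keys they mention.
        totals.merge("Puerh", 12, Integer::sum);        // 40 + 12 = 52
        totals.merge("Genmaicha", 10, Integer::sum);    // 10 + 10 = 20

        // Update: subtract the old value, then add the new one.
        totals.merge("Puerh", -15, Integer::sum);       // remove the old quantity (15)
        totals.merge("Puerh", 16, Integer::sum);        // add the corrected quantity (16)

        // Delete: subtract the deleted record's value.
        totals.merge("Dragonwell", -7, Integer::sum);   // 30 - 7 = 23

        System.out.println(totals);  // untouched keys were never recomputed
    }
}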
Key points
Map-reduce is a pattern to allow computations to be parallelized over a cluster.
The map task reads data from an aggregate and boils it down to relevant key-value pairs. Maps only read a
single record at a time and can thus be parallelized and run on the node that stores the record.
Reduce tasks take many values for a single key output from map tasks and summarize them into a single output.
Each reducer operates on the result of a single key, so it can be parallelized by key.
Reducers whose output has the same form as their input can also be used as combiners; this improves parallelism and reduces the amount of data to be transferred.
Map-reduce operations can be composed into pipelines where the output of one reduce is the input to another
operation’s map.
If the result of a map-reduce computation is widely used, it can be stored as a materialized view.
Materialized views can be updated through incremental map-reduce operations that only compute changes to
the view instead of recomputing everything from scratch.
Chapter 2 – Key-Value Databases
Key-Value Database
A key-value store is a type of database that works like a simple hash table, primarily used when all access to the database is via the primary key.
Key: Think of it as a unique ID, like a label or identifier.
Value: This is the data or information associated with that key
For example:
In a traditional database (RDBMS), think of a table with two columns:
ID (the unique identifier, like a key)
NAME (the information, like a value).
In a key-value store, you store and retrieve data by providing the key:
If the key exists, the database retrieves the value.
If you save a new value for the same key, it overwrites the old value.
If the key doesn't exist, it creates a new entry.
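These semantics can be sketched with a plain in-memory map (illustrative Java, not any specific product's API):

import java.util.HashMap;
import java.util.Map;

public class KeyValueSemanticsSketch {
    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();

        store.put("user_123", "{\"name\": \"Alice\"}");   // create a new entry for the key
        store.put("user_123", "{\"name\": \"Bob\"}");     // same key again: overwrites the old value

        System.out.println(store.get("user_123"));        // key exists: returns the latest value
        System.out.println(store.get("user_999"));        // key missing: nothing to return (null here)

        store.remove("user_123");                         // delete the key and its value
    }
}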
Key-Value Database
Let’s look at how terminology compares in Oracle and Riak.
What is a Key-Value Store
Key-value stores are the simplest NoSQL data stores to use from an API perspective.
The client can either get the value for the key, put a value for a key, or delete a key from the data store.
The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s the responsibility
of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
Example - A shopping cart application:
Key: user_123
Value: { "items": ["item1", "item2"], "total": 200 }
The application fetches, updates, or deletes the shopping cart for a user using the key user_123. The value
structure and meaning are managed by the application, not the key-value store
Popular Key-Value Databases
Some of the popular key-value databases are:
Riak [Riak], Redis (often referred to as Data Structure server) [Redis], Memcached DB and its flavors
[Memcached], Berkeley DB [Berkeley DB], HamsterDB (especially suited for embedded use)
[HamsterDB], Amazon DynamoDB [Amazon’s Dynamo] (not open-source), and Project Voldemort
[Project Voldemort] (an open-source implementation of Amazon DynamoDB).
In some key-value stores, such as Redis, the value stored for a key does not have to be a domain object (JSON, XML, CSV, and so on).
Instead, the value can be a data structure such as a list, set, or hash, and the store can perform range, diff, union, and intersection operations on it.
These features allow Redis to be used in more ways than a standard key-value store.
Riak stores keys in buckets.
Bucket: The container or namespace where data is stored (e.g., shopping_carts,
user_profiles).
Key: The unique identifier for each object in the bucket (e.g., user12345 or
user12345_order_1).
Object (Value): The actual data associated with the key. This is what is retrieved or
stored in the bucket.
Disadvantage - When storing all the different types of objects (or aggregates) of information in a single bucket, there is a downside: a single bucket would contain multiple types of aggregates (e.g., user profiles, shopping carts, session data), increasing the likelihood of key conflicts. When two objects (a user profile and a shopping cart) share the same key, the system might overwrite one with the other, leading to data corruption or loss.

Example - A bucket called "user_data" stores two different types of information for a user: a User Profile (name, email, etc.) and a Shopping Cart (items, total cost, etc.). If we use the key "user:12345" to store the user's profile and the same key "user:12345" to store the user's shopping cart, then when the shopping cart is saved the user profile data is overwritten, and the database only stores the shopping cart information. This leads to data corruption or loss.

Solution
1. Append the name of the object to the key. For example, instead of just storing a key as "user:12345", include the object type: "user:12345_userProfile". Similarly, a shopping cart could use the key "user:12345_shoppingCart".
2. Create a bucket to store each specific kind of data. For example, a bucket named "userProfiles" can store all user profile data, and a bucket named "shoppingCarts" can store all shopping cart data. These specialized buckets, called domain buckets, allow you to group and manage data more efficiently.

In Riak, domain buckets support serialization and deserialization of data, which is handled by the client driver.
Serialization is the process of converting an object (like a UserProfile) into a string format (usually JSON) that can be stored.
Deserialization is the process of converting a JSON string back into an object.
For example, the driver can automatically convert objects like UserProfile or ShoppingCart into a format that Riak can store and back.
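A sketch of the serialization side, using Jackson as an assumed JSON library; the class names and keys are illustrative, and the Riak store calls (in the style shown later in this chapter) are commented out so the snippet stands on its own:

import com.fasterxml.jackson.databind.ObjectMapper;

public class DomainBucketSketch {
    // Hypothetical domain class; the field names are illustrative.
    public static class UserProfile {
        public String name;
        public String email;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        UserProfile profile = new UserProfile();
        profile.name = "Alice";
        profile.email = "[email protected]";

        // Serialization: convert the object into a JSON string that can be stored.
        String json = mapper.writeValueAsString(profile);

        // Option 1: append the object type to the key inside a shared bucket.
        String key = "user:12345_userProfile";
        // bucket.store(key, json).execute();

        // Option 2: a dedicated domain bucket, keyed by the plain user id.
        // Bucket userProfiles = getBucket("userProfiles");
        // userProfiles.store("user:12345", json).execute();

        // Deserialization: convert the JSON string back into an object.
        UserProfile restored = mapper.readValue(json, UserProfile.class);
        System.out.println(key + " -> " + restored.name);
    }
}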
In a key-value store like Redis, the "value" you store doesn't have to be just a single piece of data
(like a number or text). It can be a more complex data structure, such as:
A list (e.g., a shopping list).
A set (e.g., a collection of unique tags or categories).
A hash (e.g., a dictionary with key-value pairs like a user profile).
Features
When using NoSQL data stores, it is important to compare them with a traditional RDBMS to understand which features they lack, such as consistency or transactions.
The primary reason is to understand what features are missing and how the application architecture needs to change to make better use of the features of a key-value data store.
For example, since key-value stores may not support complex queries or transactions
like relational databases, you might need to design your application differently to handle
these limitations
Some features to focus on for key-value stores are consistency, transactions, query features, structure of data, and scaling.
Features
1. Consistency
Consistency in key-value stores like Riak refers to how data is handled when it is written to the system.
Operations like get, put, or delete apply to a single key, which means consistency is easy to maintain for individual key operations. Consistency ensures that when you read a key, you get the most up-to-date value.
Example - Imagine a user profile with the key user:123 storing the name "Alice". If you update the name
to "Bob", you will always get the most recent value when you query user:123.
Optimistic writes (for example, two users trying to update the same data at the same time) can be performed, but they are very expensive to implement, because a change in value cannot be determined by the data store (system).
Example: If two users try to change the same field (e.g., profile_picture) at the same time, without
checking if the field was changed by someone else, it might cause a problem because the system cannot
automatically choose which value to keep.
In distributed key-value store implementations like Riak, data might not be immediately consistent across all nodes. The system aims for eventual consistency, meaning it will eventually synchronize the data across all servers.
Example - If you update your profile photo on one server and then immediately check from a different server, you might not see the update right away. However, after some time, all servers will synchronize, and you will see the same data everywhere.
1. Consistency
Since the value may have already been replicated to other nodes (when multiple writes occur to the same data
simultaneously), there can be conflicts.
Riak can resolve conflicts in two ways:
either the newest write wins and older writes lose,
or both (all) values are returned, allowing the client to resolve the conflict.
Example: If two users try to update the same product's price at the same time, Riak can either keep the most recent update (newest write wins) or return both prices (all versions) for the client to decide. In Riak, these options can be set up during bucket creation.

Buckets are just a way to namespace keys so that key collisions can be reduced; for example, all customer keys may reside in the customer bucket. When creating a bucket, default values for consistency can be provided, for example that a write is considered good only when the data is consistent across all the nodes where the data is stored.

Bucket bucket = connection
    .createBucket(bucketName)
    .withRetrier(attempts(3))
    .allowSiblings(siblingsAllowed)
    .nVal(numberOfReplicasOfTheData)
    .w(numberOfNodesToRespondToWrite)
    .r(numberOfNodesToRespondToRead)
    .execute();

For example, a user_profiles bucket configured so that only the latest write wins:

Bucket bucket = connection
    .createBucket("user_profiles")
    .withRetrier(attempts(3))
    .allowSiblings(false)   // only the latest write wins
    .nVal(3)                // three replicas of the data
    .w(3)                   // all 3 nodes must confirm the write
    .r(3)                   // read from all 3 nodes
    .execute();

If we need the data in every node to be consistent, we can increase the numberOfNodesToRespondToWrite set by w to be the same as nVal. Of course, doing that will decrease the write performance of the cluster.
To improve on write or read conflicts, we can change the allowSiblings flag during bucket creation: if it is set to false, we let the last write win and do not create siblings.
When allowSiblings = true, Riak does not automatically resolve write conflicts. Instead, it stores all versions of the conflicting data (i.e., the "siblings").
2. Transactions
Different products of the key-value store kind have different specifications of transactions.
Generally speaking, there are no guarantees on the writes, meaning that there is no certainty or
guarantee that the data will be successfully written to all nodes or even to a specific node.
Many data stores do implement transactions in different ways.
Riak uses the concept of quorum, implemented by using the W value (the write quorum) together with the replication factor during the write API call.
Assume we have a Riak cluster with a replication factor (N) of 5 and we supply the W value of 3.
When writing, the write is reported as successful only when it is written and reported as a success
on at least three of the nodes.
This allows Riak to have write tolerance (the system successfully completes the write operation even if some of the nodes (servers) or their replicas are unavailable).
In our example, with N equal to 5 and with a W value of 3, the cluster can tolerate N - W = 2
nodes being down for write operations, though we would still have lost some data on those nodes
for read.
By setting the W value, you can balance between consistency (making sure data is written to
enough nodes) and availability (ensuring writes happen even if some nodes are unavailable).
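A tiny sketch of the N and W arithmetic, using the values from the example above (illustrative only, not a Riak API):

public class QuorumSketch {
    public static void main(String[] args) {
        int n = 5;  // replication factor: copies of each value
        int w = 3;  // write quorum: nodes that must confirm a write
        int tolerableNodeFailuresForWrites = n - w;
        System.out.println("Writes still succeed with up to "
                + tolerableNodeFailuresForWrites + " nodes down");  // prints 2
    }
}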
3. Query Features
Key-value stores allow querying only by the key.
You can’t query based on the attributes inside the value directly.
For example, with Key: user:12345 and Value: {"name": "Alice", "country": "US"}, we can retrieve the user data only if we know the key; if we want all users with country: US, the database does not support such a query.
If you don’t know the key (e.g., during debugging), retrieving data can be challenging because
most databases will not give you a list of all the primary keys. For instance, if you forgot the Key:
user:12345, you can't search directly.
Some databases, like Riak Search, allow you to search inside values (e.g., finding all users
from country: US) using Lucene indexes.
Designing the key is crucial in key-value stores because the key is the primary way to access or
query data.
Keys can be:
created automatically by the system, using some algorithm to ensure uniqueness;
provided by the user, such as an email address or user ID;
derived from timestamps, for example to store events or logs.
3. Query Features
Key-value stores are great for storing data like:
Session Data: Use session ID as the key, e.g., session:a7e618d9db25.
Shopping Cart Data: Use cart ID as the key, e.g., cart:12345.
User Profiles: Use user ID as the key, e.g., user:98765.
Some data, like session or shopping cart data, is temporary and should automatically expire
(expiry_secs property) after a certain time. For example, Session Key: session:a7e618d9db25
and Expiry: 30 minutes (if the user logs out or session expires).
To read and write data in the database using Java
Using the store API, we write data into the Riak bucket with a specified key
Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.store(key, value).execute();
Similarly, we can get the value stored for the key using the fetch API.
Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.fetch(key).execute();
byte[] bytes = riakObject.getValue();
String value = new String(bytes);
3. Query Features
Saving data through the HTTP-based interface
Riak provides an HTTP-based interface to interact with the Riak database via the command line or a browser, using curl.
To save the data to Riak (this uploads the key-value pair into the Riak bucket over HTTP):
curl -v -X POST -d '
{ "lastVisit":1324669989288,
  "user":{"customerId":"91cfdf5bcb7c",
  "name":"buyer", "countryCode":"US",
  "tzOffset":0} }'
  -H "Content-Type: application/json"
  http://localhost:8098/buckets/session/keys/a7e618d9db25
The data for the key a7e618d9db25 can be fetched by using the curl command
curl -i http://localhost:8098/buckets/session/keys/a7e618d9db25
Features
4. Structure of Data
Key-value databases don’t care what is stored in the value part of the key-value pair.
The value can be a blob, text, JSON, XML, and so on.
In Riak, we can use the Content-Type in the POST request to specify the data type.
5. Scaling
Many key-value stores scale by using sharding. With sharding, the value of the key determines on
which node the key is stored.
Let’s assume we are sharding by the first character of the key; if the key is f4b19d79587d, which starts with an f, it will be sent to a different node than the key ad9c7a396542.
This kind of sharding setup can increase performance as more nodes are added to the cluster.
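A sketch of key-based shard routing (illustrative; a hash of the key is a common choice, though the first-character scheme described above works the same way):

import java.util.List;

public class ShardingSketch {
    static final List<String> NODES = List.of("node-A", "node-B", "node-C");

    // Route a key to a node: the key itself determines where the value is stored.
    static String nodeFor(String key) {
        int index = Math.floorMod(key.hashCode(), NODES.size());
        return NODES.get(index);
    }

    public static void main(String[] args) {
        System.out.println(nodeFor("f4b19d79587d"));  // these two keys may land on
        System.out.println(nodeFor("ad9c7a396542"));  // different nodes
    }
}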
Suitable Use Cases
1. Storing Session Information
Generally, every web session is unique and is assigned a unique sessionid value.
Applications that store the sessionid on disk or in an RDBMS will greatly benefit from moving to a key-value
store, since everything about the session can be stored by a single PUT request or retrieved using GET.
This single-request operation makes it very fast, as everything about the session is stored in a single object.
Solutions such as Memcached are used by many web applications, and Riak can be used when availability is
important.
2. User Profiles, Preferences
Almost every user has a unique userId, username, or some other attribute, as well as preferences such as
language, color, timezone, which products the user has access to, and so on.
This can all be put into an object, so getting preferences of a user takes a single GET operation. Similarly,
product profiles can be stored.
3. Shopping Cart Data
E-commerce websites have shopping carts tied to the user.
As we want the shopping carts to be available all the time, across browsers, machines, and sessions, all the
shopping information can be put into the value where the key is the userid.
A Riak cluster would be best suited for these kinds of applications
When not to use
1. Relationships among Data
If you need to have relationships between different sets of data, or correlate the data between different sets of
keys, key-value stores are not the best solution to use, even though some key-value stores provide link-
walking features.
2. Multioperation Transactions
If you’re saving multiple keys and there is a failure to save any one of them, and you want to revert or roll
back the rest of the operations, key-value stores are not the best solution to be used.
3. Query by Data
If you need to search the keys based on something found in the value part of the key-value pairs, then key-
value stores are not going to perform well for you. There is no way to inspect the value on the database side,
with the exception of some products like Riak Search or indexing engines like Lucene [Lucene] or Solr
[Solr] .
4. Operations by Sets
Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same
time. If you need to operate upon multiple keys, you have to handle this from the client side.