
MODULE 3

MAP-REDUCE

• Rise of Aggregate-Oriented Databases:

o The growth of clusters has contributed to the rise of aggregate-oriented databases. When data is stored on a cluster, the approach to data storage and computation changes compared to single-machine setups.

o Clusters introduce new considerations for both data storage and computation, requiring a different way of organizing processing tasks.

• Centralized Database Processing:

o In a centralized database (single machine), there are typically two ways to run processing logic:

▪ On the database server: This reduces the amount of data that needs to be transferred but may increase the load on the server.

▪ On a client machine: This offers flexibility in choosing the programming environment, making it easier to develop or extend programs. However, it requires transferring large amounts of data from the database server.

o The challenge in both cases is how to efficiently handle large datasets while considering factors like server load and data transfer.

• Map-Reduce in Clusters:

o The map-reduce pattern is a solution designed to efficiently process large datasets on clusters. The key advantage of using a cluster is the ability to distribute computation across multiple machines.

o However, to optimize the process, it's essential to minimize data transfer across the network. The map-reduce approach aims to keep the data and processing on the same machine as much as possible, reducing the need to move large amounts of data between nodes.

• Map-Reduce Overview:

o The map-reduce pattern, inspired by functional programming concepts, involves two main operations:

▪ Map: This operation processes data in parallel, distributing the task across multiple machines in the cluster.

▪ Reduce: After the mapping phase, the reduce operation aggregates or processes the results.

o Map-Reduce first gained widespread attention through Google's MapReduce framework and is commonly implemented in the Hadoop project. Other databases also have their own implementations of map-reduce.

• Map-Reduce and Scatter-Gather:

o Map-reduce can be seen as a form of the Scatter-Gather pattern, which involves distributing data to multiple machines (scatter) and then gathering the results.

o The goal of map-reduce is to efficiently distribute computation across a cluster while minimizing the data that needs to be moved between machines.

• Implementation Differences:

o While Google’s MapReduce framework and Hadoop offer popular implementations, different databases may have slight variations in how they implement map-reduce. The core idea remains consistent, but the specifics of each implementation may vary.

7.1 BASIC MAP-REDUCE

• Scenario Setup:

o The example involves orders and line items as aggregates. Each order has line
items containing product ID, quantity, and price charged.

o The goal is to compute the total revenue for a product over the last seven
days, but this does not fit the current aggregate structure of the orders,
which makes it difficult to generate the report directly from the data.
o To generate the product revenue report, you would need to visit every
machine in the cluster and examine records across all machines. This is
inefficient for large datasets.

• Map-Reduce Solution:

o This scenario calls for map-reduce, which is ideal for distributed computing
where the computation can be spread across multiple machines.

• Map Function:

o The first stage of a map-reduce job is the map. The map function takes a single aggregate (in this case, an order) as input and outputs a set of key-value pairs.

o For each line item in the order, the output would be a key-value pair where the product ID is the key, and the quantity and price are the values.

o The map function works independently on each aggregate, making it parallelizable. This means that the map-reduce framework can distribute map tasks across multiple nodes, ensuring parallelism and locality of data access.

o The map function can perform more complex operations as long as it operates on a single aggregate.
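
As an illustration, a minimal map function for this example could look like the following plain-Java sketch. The Order, LineItem, Pair, and LineItemValue types and the ProductRevenueMapper class are hypothetical helpers invented here, not part of any framework; real frameworks such as Hadoop supply their own mapper interfaces and serializable types.

import java.util.ArrayList;
import java.util.List;

// Hypothetical types for the running example.
record LineItemValue(int quantity, double revenue) {}
record Pair(String productId, LineItemValue value) {}
record LineItem(String productId, int quantity, double price) {}
record Order(List<LineItem> lineItems) {}

class ProductRevenueMapper {
    // Map works on a single aggregate (one order) and emits one key-value
    // pair per line item, keyed by product ID.
    static List<Pair> map(Order order) {
        List<Pair> output = new ArrayList<>();
        for (LineItem item : order.lineItems()) {
            output.add(new Pair(item.productId(),
                    new LineItemValue(item.quantity(), item.quantity() * item.price())));
        }
        return output;
    }
}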

• Reduce Function:

o After the map function generates key-value pairs, the reduce function
processes them. The reduce function takes multiple map outputs with the
same key and combines their values.

o For example, if the map function produces 1000 line items for a product (e.g.,
"Database Refactoring"), the reduce function will aggregate them into a single
output, summarizing the total quantity and revenue for that product.

o The reduce function can use all values associated with a single key. This is
where the aggregation occurs.
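
A matching reduce function, continuing the hypothetical types from the map sketch above, could look like this; it folds every value emitted for one product ID into a single quantity-and-revenue total.

import java.util.List;

class ProductRevenueReducer {
    // Reduce sees one key (productId) and every value emitted for it,
    // and collapses them into a single summary value.
    static LineItemValue reduce(String productId, List<LineItemValue> values) {
        int totalQuantity = 0;
        double totalRevenue = 0.0;
        for (LineItemValue v : values) {
            totalQuantity += v.quantity();
            totalRevenue += v.revenue();
        }
        return new LineItemValue(totalQuantity, totalRevenue);
    }
}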

• Map-Reduce Framework:

o The map-reduce framework handles the distribution of tasks. It ensures that the map tasks are run on the appropriate nodes to process all the data.

o The framework also collects all the values associated with a specific key and
calls the reduce function with the key and its corresponding values. This
simplifies the task of writing the reduce function.
• Job Execution:

o To execute a map-reduce job, you need to write the map and reduce
functions.

o The framework takes care of the underlying task distribution, ensuring the
computation is efficient across the cluster.

7.2 PARTITIONING AND COMBINING

• Single Reduce Function:

o In the simplest map-reduce setup, there is a single reduce function that takes
the outputs from all the map tasks running across different nodes.

o These outputs are concatenated and sent to the reduce function, which
works well but has room for optimization in terms of parallelism and data
transfer reduction.
• Increasing Parallelism with Partitioning:

o One way to improve parallelism is by partitioning the output of the map tasks. Each reduce function processes the results for a single key, which limits what the reduce function can do but also creates an opportunity for parallelism.

o On each processing node, the map outputs are grouped into partitions based on their keys. The partitions for the same keys from all nodes are then combined and sent to the corresponding reducer.

o This partitioning allows multiple reducers to operate in parallel on different partitions, improving the speed of processing. The final results are merged after all reducers have completed their work.

o The process of grouping the map outputs into partitions and sending them to
the reducers is known as shuffling, and the partitions are often referred to as
buckets or regions.

• Reducing Data Transfer with a Combiner Function:

o Another way to optimize map-reduce is by reducing the amount of data transferred between the map and reduce stages. A significant amount of this data is repetitive, especially when multiple key-value pairs for the same key are involved.

o A combiner function helps by combining the data for the same key into a
single value before it is transferred. This reduces the data being shuffled
between nodes.

o The combiner function is similar to the reduce function and can often be the
same function used for final reduction.

o For a function to work as a combiner, the reducer must be combinable: its output must have the same form as its input.
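
Because the ProductRevenueReducer sketched above produces output of the same shape as its input (a quantity-and-revenue value), it can also serve as a combiner. The following sketch, reusing those hypothetical types, shows partial reductions on two nodes followed by a final reduction over the partial results:

import java.util.List;

class CombinerDemo {
    public static void main(String[] args) {
        List<LineItemValue> onNodeA = List.of(new LineItemValue(2, 40.0), new LineItemValue(1, 20.0));
        List<LineItemValue> onNodeB = List.of(new LineItemValue(5, 100.0));

        // Combine locally on each node before shuffling.
        LineItemValue partialA = ProductRevenueReducer.reduce("database-refactoring", onNodeA);
        LineItemValue partialB = ProductRevenueReducer.reduce("database-refactoring", onNodeB);

        // Final reduce runs over the (much smaller) combined outputs.
        LineItemValue total = ProductRevenueReducer.reduce("database-refactoring", List.of(partialA, partialB));
        System.out.println(total);  // quantity 8, revenue 160.0
    }
}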

• Non-Combinable Reducers:

o Not all reduce functions can be used as combiners. For example, a function
that counts the number of unique customers for a particular product:

▪ The map function would emit a product and a customer.

▪ The reducer would combine these and count how many times each
customer appears for that product.

o In this case, the output of the reducer differs from its input, so it cannot be
used as a combiner.
o You can still apply a combining function to eliminate duplicate product-
customer pairs before sending data to the reducer, but this is not the same as
the final reduction.
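
A hedged sketch of this case, again in plain Java with hypothetical names: the final reducer turns customer IDs into a count, so its output is not a valid input, but a separate combining step can still deduplicate product-customer pairs before the shuffle.

import java.util.HashSet;
import java.util.List;

class UniqueCustomerCount {
    // Final reducer: input is customer IDs, output is a count, so this
    // function cannot be reused as a combiner.
    static int reduce(String productId, List<String> customerIds) {
        return new HashSet<>(customerIds).size();
    }

    // A separate combining step can still shrink the shuffle: it removes
    // duplicate product-customer pairs but keeps the customer IDs themselves.
    static List<String> combine(String productId, List<String> customerIds) {
        return List.copyOf(new HashSet<>(customerIds));
    }
}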

• Flexibility of Combining Reducers:

o When using combining reducers, the map-reduce framework allows the operations to run in parallel (on different partitions) and in series (on the same partition, at different times and places).

o This flexibility is beneficial because it allows combining to occur before the data is transmitted between nodes and even while mappers are still processing data.

o Some map-reduce frameworks require all reducers to be combinable, which maximizes this flexibility. If a non-combinable reducer is needed, it requires separating the process into pipelined map-reduce steps.

7.3 COMPOSING MAP-REDUCE CALCULATIONS

• Map-Reduce as Concurrent Processing:

o The map-reduce approach is a method for parallelizing computation over a cluster. It trades flexibility in how computations are structured for a simple model that can be applied to distributed systems.

o There are constraints to consider when using map-reduce:

▪ In a map task, you can only operate on a single aggregate (such as an individual order).

▪ In a reduce task, you can only operate on a single key (such as a specific product ID).

o These constraints require careful structuring of programs to ensure that they fit within the map-reduce model.
• Constraints on Calculation Types:

o Averaging Example:

▪ Calculating averages is a non-composable operation. This means that you can't simply combine the averages of two groups of data. Instead, you need to combine the total sum and count of orders from each group and then calculate the average from these combined values.

▪ The structure of map-reduce requires thinking about operations that reduce neatly (i.e., operations where data can be combined effectively at the reduce stage). This impacts how calculations like averages are handled.
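
A small sketch of this idea, using a hypothetical SumCount helper rather than any framework type: partial (sum, count) pairs combine correctly, while averaging the averages would not.

// Averages are carried as (sum, count) pairs; the average is derived only at the end.
record SumCount(double sum, long count) {
    SumCount merge(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }
    double average() {
        return count == 0 ? 0.0 : sum / count;
    }
}

class AverageDemo {
    public static void main(String[] args) {
        SumCount partition1 = new SumCount(300.0, 4);  // average 75.0
        SumCount partition2 = new SumCount(100.0, 1);  // average 100.0
        // Averaging the averages would give 87.5, which is wrong;
        // merging sums and counts gives the correct 80.0.
        System.out.println(partition1.merge(partition2).average());  // 80.0
    }
}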

• Counting Example:

o When performing counts, the map function will emit a count field with a
value of 1 for each item. These values can then be summed during the reduce
phase to get the total count.

o This aligns with the map-reduce model, where data emitted in the map phase
is aggregated in the reduce phase.
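
A correspondingly small sketch of the counting pattern, again in plain Java rather than any particular framework's API:

import java.util.List;

class CountDemo {
    // Map side: emit a count of 1 for each matching item.
    static int map(Object item) {
        return 1;
    }

    // Reduce side: sum the emitted 1s for a key to get the total count.
    static int reduce(String key, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }
}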

7.3.1 A TWO-STAGE MAP-REDUCE EXAMPLE


• Decomposing Complex Map-Reduce Calculations:

o As map-reduce calculations grow in complexity, it's beneficial to break them down into stages using a pipes-and-filters approach (similar to UNIX pipelines). Each stage's output becomes the input for the next stage.

o Example: To compare sales of products for each month in 2011 to the prior
year, the calculations are divided into two stages:

1. The first stage produces records showing the aggregate sales figures
for each product per month.

2. The second stage compares the current year's sales to the prior year's
sales for the same month.

• First Stage: Aggregate Sales Calculation:

o The first stage processes original order records, outputting key-value pairs
that show sales for each product per month.

o This stage uses a composite key (product and month) to reduce records based
on multiple fields.

• Second Stage: Comparison of Current and Prior Year:

o The second-stage mappers process the output from the first stage and
categorize it by year. For example, a 2011 record populates the current year
quantity, and a 2010 record populates the prior year quantity.

o The reduce function merges records, summing values for each key and
calculating the comparison between the current and prior year’s sales.
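
A compact sketch of the second stage, using hypothetical record types for the intermediate data; the field names are illustrative, and only the 2010/2011 years follow the example in the text.

import java.util.List;

// Hypothetical intermediate records for the two-stage example.
record ProductMonthKey(String productId, int month) {}
record MonthlySales(ProductMonthKey key, int year, int quantity) {}
record YearComparison(ProductMonthKey key, int currentYearQty, int priorYearQty) {}

class SalesComparison {
    // Second-stage map: route each first-stage record into the current-year
    // or prior-year slot of a comparison record.
    static YearComparison map(MonthlySales record) {
        if (record.year() == 2011) {
            return new YearComparison(record.key(), record.quantity(), 0);
        } else {  // assume 2010 records populate the prior-year quantity
            return new YearComparison(record.key(), 0, record.quantity());
        }
    }

    // Second-stage reduce: merge the records for one composite key by summing each slot.
    static YearComparison reduce(ProductMonthKey key, List<YearComparison> values) {
        int current = 0, prior = 0;
        for (YearComparison v : values) {
            current += v.currentYearQty();
            prior += v.priorYearQty();
        }
        return new YearComparison(key, current, prior);
    }
}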

• Advantages of Decomposing Calculations:

o Decomposing complex calculations into multiple map-reduce steps makes them easier to write.

o Small steps are easier to manage and combine than large, complex steps.

o Intermediate output from early stages can be reused for other outputs, which
saves both programming and execution time. This reuse is particularly useful
when intermediate records represent heavy data access.

• Materialized Views:

o Intermediate records, especially those from early stages, can be stored as materialized views to save computation time for downstream processes.

o It's essential to base reuse on real query experiences, as speculative reuse often doesn't deliver as expected.
• Map-Reduce Implementation:

o The map-reduce pattern can be implemented in any programming language, but it's particularly well-suited to languages designed for map-reduce computations.

o Apache Pig and Hive are languages specifically built to simplify map-reduce programming:

▪ Apache Pig: An offshoot of the Hadoop project, it simplifies working with Hadoop compared to using Java libraries.

▪ Hive: Provides an SQL-like syntax for specifying map-reduce programs.

• Importance of Map-Reduce:

o The map-reduce pattern is crucial even outside NoSQL databases. Google’s original map-reduce system worked on files stored in a distributed file system, a concept used by the Hadoop project.

o Map-reduce is particularly well-suited for running computations on clusters, making it an ideal solution for handling high data volumes in cluster-oriented systems.

• Future of Map-Reduce:

o As organizations process increasing volumes of data, cluster-oriented approaches like map-reduce will become more common, and the map-reduce pattern will see broader adoption.

7.3.2 INCREMENTAL MAP-REDUCE

• Incremental Map-Reduce Computations:

o Many map-reduce computations take significant time to perform, especially when large data volumes are involved. Since new data constantly comes in, rerunning the computation from scratch to keep the output up-to-date can be inefficient and time-consuming.

o Incremental updates allow the map-reduce computation to only update the portions that need recomputation, rather than starting over from scratch each time.

• Incremental Map Stage:

o The map stage is easy to handle incrementally. The mapper only needs to be
rerun if the input data changes.
o Since maps are isolated from each other, incremental updates for the map
stage are straightforward and efficient.

• Incremental Reduce Stage:

o The reduce stage is more complex because it combines outputs from many
map tasks. Any change in the map outputs could trigger a new reduction,
leading to the need for recomputation.

o Parallel reduce steps help minimize unnecessary recomputation. If data is partitioned for reduction, partitions that remain unchanged do not need to be re-reduced.

o Similarly, if a combiner step is used, it doesn’t need to be rerun if its source data remains unchanged.

• Combinable Reducers:

o If the reduce function is combinable, further opportunities arise for avoiding unnecessary computation.

o If the changes are additive (i.e., new records are added without modifying or
deleting old ones), the reduce operation can be run with the existing results
combined with the new additions.

o For destructive changes (updates or deletions), recomputation can be minimized by breaking the reduce operation into steps and recalculating only those steps whose inputs have changed. This approach is similar to using a Dependency Network to organize computations.
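
For the additive case, a minimal sketch (reusing the hypothetical ProductRevenueReducer from earlier) shows how the previously stored result can simply be reduced together with the values derived from the new records:

import java.util.List;

class IncrementalReduceDemo {
    // With a combinable reducer and additive changes, the previously stored
    // result is treated as just another partial value and reduced together
    // with the values mapped from the newly arrived records.
    static LineItemValue incrementalReduce(String productId,
                                           LineItemValue previousResult,
                                           List<LineItemValue> newValues) {
        LineItemValue newPortion = ProductRevenueReducer.reduce(productId, newValues);
        return ProductRevenueReducer.reduce(productId, List.of(previousResult, newPortion));
    }
}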

• Framework Control:

o The map-reduce framework often controls how incremental updates are handled, so it’s important to understand how a specific framework supports incremental operations. Different frameworks may have varying capabilities and mechanisms for handling incremental updates efficiently.

KEY-VALUE DATABASES

• A key-value store is a type of database that functions as a simple hash table, typically
used when all database access is done via a primary key.

• It can be compared to a table in a traditional Relational Database Management System (RDBMS) with two columns: one for the key (e.g., ID) and one for the value (e.g., NAME).

• The key column (e.g., ID) acts as the unique identifier, and the value column stores
the associated data (e.g., NAME).
• When inserting data, the application provides a key-value pair. If the key already
exists, the existing value is overwritten; otherwise, a new entry is created.

8.1 WHAT IS A KEY-VALUE STORE

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what’s inside; it’s the responsibility of the application to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.

Some of the popular key-value databases are Riak [Riak], Redis (often referred to as Data
Structure server) [Redis], Memcached DB and its flavors [Memcached], Berkeley DB
[Berkeley DB], HamsterDB (especially suited for embedded use) [HamsterDB], Amazon
DynamoDB [Amazon's Dynamo] (not open-source), and Project Voldemort [Project
Voldemort] (an open-source implementation of Amazon DynamoDB).

In some key-value stores, such as Redis, the aggregate being stored does not have to be a domain object—it could be any data structure. Redis supports storing lists, sets, hashes and can do range, diff, union, and intersection operations. These features allow Redis to be used in more different ways than a standard key-value store.

There are many more key-value databases and many new ones are being worked on at this
time. For the sake of keeping discussions in this book easier we will focus mostly on Riak.
Riak lets us store keys into buckets, which are just a way to segment the keys—think of
buckets as flat namespaces for the keys.

If we wanted to store user session data, shopping cart information, and user preferences in
Riak, we could just store all of them in the same bucket with a single key and single value for
all of these objects. In this scenario, we would have a single object that stores all the data
and is put into a single bucket (Figure 8.1).

The downside of storing all the different objects (aggregates) in the single bucket would be that one bucket would store different types of aggregates, increasing the chance of key conflicts. An alternate approach would be to append the name of the object to the key, such as 288790b8a421_userProfile, so that we can get to individual objects as they are needed (Figure 8.2).
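
A trivial sketch of this key-naming convention, using a hypothetical helper method (not part of the Riak client):

class KeyNaming {
    // Append the object type to a shared identifier, e.g. "288790b8a421_userProfile",
    // so different aggregates for the same user can live in one bucket without colliding.
    static String keyFor(String id, String objectType) {
        return id + "_" + objectType;
    }

    public static void main(String[] args) {
        System.out.println(keyFor("288790b8a421", "userProfile"));   // 288790b8a421_userProfile
        System.out.println(keyFor("288790b8a421", "shoppingCart"));  // 288790b8a421_shoppingCart
    }
}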

We could also create buckets which store specific data. In Riak, they are known as domain
buckets allowing the serialization and deserialization to be handled by the client driver.

Using domain buckets or different buckets for different objects (such as UserProfile and ShoppingCart) segments the data across different buckets, allowing you to read only the object you need without having to change key design.

Key-value stores such as Redis also support storing random data structures, which can be
sets, hashes, strings, and so on. This feature can be used to store lists of things, like states or
addressTypes, or an array of user’s visits.

8.2 KEY-VALUE STORE FEATURES

• Understanding Key-Value Store Features:

o It's important to understand the features of key-value stores compared to traditional RDBMS.

o This understanding helps in knowing what features are missing and how the application architecture needs to adapt for key-value data stores.

o Features to consider include consistency, transactions, query features, data structure, and scaling.

8.2.1 Consistency

• Consistency for Operations:


o Consistency applies to operations on a single key (get, put, delete).

o Optimistic writes can be done, but they are expensive because the data store
cannot determine the change in value.

• Riak's Consistency Model:

o Riak implements an eventually consistent model in distributed setups.

o When a value is replicated to other nodes, conflict resolution can occur in two ways:

1. Newest write wins: The last update overwrites previous ones.

2. All values returned: The client resolves the conflict.

o Configuring Consistency in Riak:

▪ During bucket creation, you can configure consistency:

▪ Set the number of replicas (nVal).

▪ Set the number of nodes to respond to read/write (w, r).

▪ Example code for configuring a bucket in Riak:

Bucket bucket = connection
    .createBucket(bucketName)
    .withRetrier(attempts(3))
    .allowSiblings(siblingsAllowed)
    .nVal(numberOfReplicasOfTheData)
    .w(numberOfNodesToRespondToWrite)
    .r(numberOfNodesToRespondToRead)
    .execute();

▪ If data consistency across all nodes is required:

▪ Set w (write) to the same value as nVal (number of replicas).

▪ This will reduce write performance but ensure consistency.

▪ To improve write/read conflict resolution:

▪ Set allowSiblings to false: Last write wins, and no sibling versions are created.
8.2.2 Transactions

• Transaction Guarantees:

o Key-value stores generally do not guarantee transactions in the same way RDBMS do.

o Riak uses the quorum model for transactions, utilizing the W value (write
quorum) during writes.

• Riak Write Quorum:

o If a Riak cluster has a replication factor of 5, and the W value is 3:

▪ The write is successful only when it is acknowledged by at least 3 nodes.

▪ This gives write tolerance: the cluster can tolerate up to 2 nodes being down for write operations.

▪ For read operations, however, data loss can occur on those 2 nodes.

8.2.3 Query Features

• Querying in Key-Value Stores:

o All key-value stores support querying by key only.

o Querying based on attributes of the value is not possible directly.

o To check an attribute, the application must read the value and process it.

• Limitations of Key Querying:

o Without knowing the key, querying becomes challenging (especially for ad-
hoc debugging).

o Most key-value stores don’t allow retrieving a list of all keys.

o Some key-value stores (like Riak) allow searching inside values (e.g., Riak
Search with Lucene indexing).

• Key Design Considerations:

o The key can be generated using:

▪ An algorithm.

▪ A user-provided value (e.g., user ID, email).

▪ Timestamps or other derived data.

o Key-value stores are useful for storing:


▪ Session data (with session ID as the key).

▪ Shopping cart data.

▪ User profiles.

o The expiry_secs property can be used to expire keys after a certain time
(useful for session data).

• Riak Key-Value Store Example:

o Storing data in a Riak bucket:

Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.store(key, value).execute();

o Fetching data from a Riak bucket:

Bucket bucket = getBucket(bucketName);
IRiakObject riakObject = bucket.fetch(key).execute();
byte[] bytes = riakObject.getValue();
String value = new String(bytes);

o Example of saving data to Riak using HTTP-based interface (via curl):

▪ POST data to Riak:

curl -v -X POST -d '{
  "lastVisit": 1324669989288,
  "user": {"customerId": "91cfdf5bcb7c", "name": "buyer", "countryCode": "US", "tzOffset": 0}
}' -H "Content-Type: application/json" http://localhost:8098/buckets/session/keys/a7e618d9db25

▪ Fetch data from Riak:

curl -i http://localhost:8098/buckets/session/keys/a7e618d9db25

8.2.4 Structure of Data

• Data in Key-Value Stores:

o Key-value stores do not care about the format of the value.

o The value can be any blob of data, such as:


▪ Text, JSON, XML, etc.

o Riak allows specifying the Content-Type in the POST request to indicate the
data type.

8.2.5 Scaling

• Sharding for Scaling:

o Key-value stores often scale using sharding.

o Sharding determines which node stores a key based on the value of the key.

o Example: If sharding is done by the first character of the key, a key starting
with 'f' will be stored on a different node than one starting with 'a'.

o Sharding improves performance as more nodes are added to the cluster.
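
A naive sketch of the idea, routing keys to nodes by their first character; real key-value stores typically use consistent hashing rather than anything this simple, so treat this purely as an illustration.

class ShardRouter {
    private final String[] nodes;

    ShardRouter(String... nodes) {
        this.nodes = nodes;
    }

    // Sharding by the value of the key (here, its first character) decides which
    // node stores the data, so keys starting with 'a' and keys starting with 'f'
    // generally live on different nodes.
    String nodeFor(String key) {
        char first = Character.toLowerCase(key.charAt(0));
        return nodes[Math.floorMod(first, nodes.length)];
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter("node1", "node2", "node3");
        System.out.println(router.nodeFor("apple"));  // routed by 'a'
        System.out.println(router.nodeFor("fig"));    // routed by 'f'
    }
}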

• Challenges with Sharding:

o If a node goes down (e.g., storing keys starting with 'f'), the data becomes
unavailable.

o New data with keys starting with 'f' cannot be written.

• Riak's CAP Theorem Control:

o Riak allows you to control aspects of the CAP Theorem (Consistency, Availability, Partition tolerance).

o Riak allows fine-tuning of:

▪ N: Number of nodes to store replicas.

▪ R: Number of nodes required to respond for a read to be considered successful.

▪ W: Number of nodes required to respond for a write to be considered successful.

o Example configuration for Riak:

▪ Assume a 5-node Riak cluster:

▪ Set N to 3 (replicate data to at least 3 nodes).

▪ Set R to 2 (2 nodes must reply to a GET request).

▪ Set W to 2 (2 nodes must acknowledge a PUT request).

o This allows flexibility in handling node failures for read or write operations.

o Choose W based on the required consistency for your application.


o These settings can be configured during bucket creation for better read or
write availability.
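
Assuming the same Riak Java client calls shown in the bucket-creation example in 8.2.1, such a 5-node setup might be configured like this (the bucket name is illustrative):

Bucket bucket = connection
    .createBucket("shoppingCart")  // illustrative bucket name
    .nVal(3)                       // replicate each value to 3 nodes
    .r(2)                          // 2 nodes must respond to a read
    .w(2)                          // 2 nodes must acknowledge a write
    .execute();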

8.3 SUITABLE USE CASES

1. Storing Session Information

o Web sessions are unique, each with a session ID.

o Storing session data in a key-value store is efficient, as the entire session can
be stored in a single PUT request and retrieved with a single GET.

o This method improves speed, especially compared to storing session data on disk or in an RDBMS.

o Memcached is often used for such cases, and Riak is suitable when availability
is a priority.

2. User Profiles and Preferences

o User profiles typically have unique identifiers (e.g., userId, username) along
with preferences (e.g., language, color, timezone, product access).

o All these attributes can be stored in a single object, allowing quick retrieval
via a GET operation.

o This method can also be used to store product profiles efficiently.

3. Shopping Cart Data

o E-commerce websites tie shopping carts to user IDs, ensuring that cart data is
available across browsers, machines, and sessions.

o The user’s shopping cart information can be stored as the value, with the user
ID as the key.

o Riak clusters are well-suited for handling such applications due to their
availability and scaling features.

8.4 WHEN NOT TO USE

1. Relationships Among Data

o Key-value stores are not ideal when you need relationships between different
sets of data or need to correlate data across multiple keys.

o Although some key-value stores offer link-walking features, they are not
designed for handling complex relationships between data sets.

2. Multioperation Transactions
o If multiple keys need to be saved, and you require a rollback or revert if any
one operation fails, key-value stores are not the best solution.

o They do not support multioperation transactions with rollback capabilities, unlike relational databases that handle this scenario better.

3. Query by Data

o Key-value stores do not perform well when you need to search keys based on
attributes within the value part of the key-value pairs.

o The database cannot inspect the value side, except in products like Riak
Search or external indexing engines like Lucene or Solr.

4. Operations by Sets

o Key-value stores operate on one key at a time.

o If you need to perform operations on multiple keys simultaneously, this must be handled at the client level, making key-value stores inefficient for set-based operations.
