Introduction To MapReduce
2. Distributed Systems:
5. Linux/Unix Commands:
• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.
• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.
• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe the MapReduce execution steps with a neat diagram. (12 marks)
In a MapReduce job:
1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.
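The two steps above can be sketched as a small single-process simulation. This is only an illustration of the data flow, not how Hadoop schedules work across a cluster. The function names (run_mapreduce, map_sale, reduce_total) and the city sales records are hypothetical and chosen just for this example.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    # Map: apply map_fn to every record of every chunk independently.
    intermediate = []
    for chunk in chunks:
        for record in chunk:
            intermediate.extend(map_fn(record))

    # Shuffle and sort: collect values under their key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: combine each key's list of values into the final output.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical input: (city, sale_amount) records split into two chunks.
chunks = [
    [("Pune", 10), ("Delhi", 5)],
    [("Pune", 7), ("Delhi", 1), ("Agra", 4)],
]


def map_sale(record):
    city, amount = record
    return [(city, amount)]


def reduce_total(city, amounts):
    return [(city, sum(amounts))]


totals = run_mapreduce(chunks, map_sale, reduce_total)
print(totals)  # [('Agra', 4), ('Delhi', 6), ('Pune', 17)]
```

In a real cluster the chunks would be processed by separate map tasks on different nodes; the loop over chunks here only stands in for that parallelism.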
Map Tasks
• A map task runs the user application code for each key-value pair (k1, v1).
• A reduce task takes the output v2 from the map as input and combines those
data pieces into a smaller set of data. A combiner, when used, performs the
same kind of aggregation locally on each map task's output before the shuffle.
• The reduce task is always performed after the map task.
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.
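As a rough illustration of that conversion, the sketch below turns raw text into (k1, v1) pairs, using the byte offset of each line as the key and the line itself as the value. This follows the convention used by Hadoop's default text input, but the helper name to_records and the sample text are assumptions made for this example.

```python
def to_records(text):
    """Yield (k1, v1) pairs: byte offset of each line, and the line itself."""
    offset = 0
    for line in text.splitlines(keepends=True):
        # The value is the line without its trailing newline characters.
        yield offset, line.rstrip("\r\n")
        # Advance by the encoded length so that offsets are in bytes.
        offset += len(line.encode("utf-8"))


records = list(to_records("hello world\nhello hadoop\n"))
print(records)  # [(0, 'hello world'), (12, 'hello hadoop')]
```

Each (offset, line) pair is what a Mapper would receive as its input key and value.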
Grouping by Key
• When a map task completes, the shuffle process aggregates (combines) all the
Mapper outputs by grouping them by key, and each value v2 is appended to the
list of values for its key.
• A "Group By" operation on the intermediate keys creates the list of v2 values.
• All pairs with the same group key (k2) are collected and grouped together,
creating one group for each key.
• The shuffle output is a list of (k2, list(v2)) groups. A different subset of
the intermediate key space is then assigned to each reduce node.
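The grouping and the assignment of keys to reduce nodes can be sketched as follows. The hash-based partitioning shown here is an assumption made for illustration (a CRC32 checksum modulo the number of reducers); the notes do not say which partitioning rule is used. The names group_by_key and partition are hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def group_by_key(pairs):
    """Turn [(k2, v2), ...] into a key-sorted list of (k2, [v2, ...])."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def partition(grouped, num_reducers):
    """Assign every key group to exactly one reducer by hashing the key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, values in grouped:
        # Assumed rule: a stable checksum of the key picks the reducer.
        index = crc32(key.encode("utf-8")) % num_reducers
        buckets[index].append((key, values))
    return buckets


mapper_output = [("hello", 1), ("world", 1), ("hello", 1)]
grouped = group_by_key(mapper_output)
print(grouped)  # [('hello', [1, 1]), ('world', [1])]

buckets = partition(grouped, num_reducers=2)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Because the reducer index depends only on the key, every value for a given key ends up on the same reduce node.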
Reduce Tasks
• Implements reduce(), which takes the shuffled and sorted Mapper output,
grouped by key as (k2, list(v2)), and is applied in parallel to each group.
• The reduce function iterates over the list of values associated with a key and
produces outputs such as aggregations and statistics.
• The reduce function writes zero or more key-value pairs (k3, v3) to the final
output file. Reduce: (k2, list(v2)) -> list(k3, v3)
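A minimal sketch of a reduce function with the signature (k2, list(v2)) -> list(k3, v3) is given below. It iterates over the values for one key and emits either two statistics or nothing at all, to show the "zero or more" outputs. The function name reduce_stats, the output key format and the sensor readings are all hypothetical.

```python
def reduce_stats(key, values):
    """Reduce: (k2, list(v2)) -> list(k3, v3), with zero or more outputs."""
    count = 0
    total = 0
    for value in values:  # iterate over the values grouped under this key
        count += 1
        total += value
    if count == 0:
        return []  # nothing to aggregate, so zero pairs are emitted
    return [
        (f"{key}.count", count),
        (f"{key}.mean", total / count),
    ]


print(reduce_stats("sensor_a", [20, 22, 24]))
# [('sensor_a.count', 3), ('sensor_a.mean', 22.0)]
print(reduce_stats("sensor_b", []))  # []
```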
MapReduce Implementation
The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).
Map("hello") → {("hello", 1)}
The Reduce function takes each unique key and its list of values and calculates
the sum.
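The word count described above might look like this in plain Python. It is a single-process sketch rather than Hadoop code; the names map_words and reduce_sum, the lowercasing and whitespace splitting, and the sample document are assumptions made for the example.

```python
def map_words(document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def reduce_sum(word, counts):
    """Reduce: sum the list of 1s collected for one word."""
    return word, sum(counts)


document = "hello world hello"

# Group the mapper output by key, as the shuffle phase would.
grouped = {}
for word, one in map_words(document):
    grouped.setdefault(word, []).append(one)

result = [reduce_sum(word, counts) for word, counts in sorted(grouped.items())]
print(result)  # [('hello', 2), ('world', 1)]
```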