Applications

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL; 1). The reduce function adds together all values for the same URL and emits a (URL; total count) pair.

Reverse Web-Link Graph: The map function outputs (target; source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target; list(source)).

Big Data Computing Vu Pham Hadoop MapReduce 2.0

Contd…

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs. The map function emits a (hostname; term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname; term vector) pair.
Contd…

Inverted Index: The map function parses each document, and emits a sequence of (word; document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a (word; list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
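The inverted index computation can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not Hadoop code: the names `map_doc`, `reduce_word`, and `inverted_index` are hypothetical, whitespace tokenization is an assumption, and dropping duplicate document IDs is a choice made here rather than something the slide specifies.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Reduce: sort the document IDs seen for a word (duplicates dropped)."""
    return word, sorted(set(doc_ids))


def inverted_index(docs):
    """Run map, group by key (standing in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            groups[word].append(d)
    return dict(reduce_word(w, ids) for w, ids in groups.items())


docs = {1: "map reduce map", 2: "reduce sort"}
index = inverted_index(docs)
print(index)
```

To track word positions, the map function could emit (word; (document ID, position)) pairs instead.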
Distributed Sort: The map function extracts the key from
each record, and emits a (key; record) pair. The reduce function emits all pairs unchanged.
Applications of MapReduce (1)
Distributed Grep:
Input: large set of files
Output: lines that match pattern
Map – Emits a line if it matches the supplied pattern
Reduce – Copies the intermediate data to output
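As a rough illustration of the grep job above, the following sketch runs the map and identity reduce steps over in-memory data. The function names, the sample file contents, and the choice to key each match by file name are assumptions made for the example; a real Hadoop job would read input splits from HDFS.

```python
import re


def grep_map(filename, lines, pattern):
    """Map: emit (filename, line) for every line matching the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield filename, line


def grep_reduce(pairs):
    """Reduce: identity, copies the intermediate pairs to the output."""
    yield from pairs


# Hypothetical input: two small "files" held in memory.
files = {
    "a.log": ["error: disk full", "ok"],
    "b.log": ["ok", "error: timeout"],
}

matches = []
for name, lines in files.items():
    matches.extend(grep_reduce(grep_map(name, lines, r"^error")))

print(matches)
```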
Applications of MapReduce (2)
Reverse Web-Link Graph:
Input: Web graph: tuples (a, b) where (page a → page b)
Output: For each page, list of pages that link to it
Map – Processes the web graph and, for each input <source, target>, outputs <target, source>
Reduce – Emits <target, list(source)>
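The reverse web-link graph job can be sketched as follows, with a dictionary standing in for the shuffle phase. The names `link_map` and `link_reduce` and the sample edges are hypothetical, and sorting each source list is an assumption added only to make the output deterministic.

```python
from collections import defaultdict


def link_map(source, target):
    """Map: flip each (source, target) edge into (target, source)."""
    yield target, source


def link_reduce(target, sources):
    """Reduce: collect every source that links to the target."""
    return target, sorted(sources)


# Hypothetical web graph: page a links to b and c, page c links to b.
edges = [("a", "b"), ("c", "b"), ("a", "c")]

groups = defaultdict(list)  # stands in for the shuffle/group-by-key phase
for source, target in edges:
    for key, value in link_map(source, target):
        groups[key].append(value)

reverse_graph = dict(link_reduce(t, srcs) for t, srcs in groups.items())
print(reverse_graph)
```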
Applications of MapReduce (3)
Count of URL access frequency:
Input: Log of accessed URLs, e.g., from proxy server
Output: For each URL, % of total accesses for that URL
Map – Processes the web log and outputs <URL, 1>
Multiple Reducers – Emit <URL, URL_count>
(So far, like Wordcount. But still need %)
Chain another MapReduce job after the above one:
Map – Processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – Does two passes. In the first pass, sums up all URL_count values to calculate overall_count. In the second pass, calculates the percentages and emits multiple <URL, URL_count/overall_count>

Applications of MapReduce (4)
• Map task’s output is sorted (e.g., quicksort)
• Reduce task’s input is sorted (e.g., mergesort)
Sort:
Input: Series of (key, value) pairs
Output: Sorted <value>s
Map – <key, value> → <value, _> (identity)
Reducer – <key, value> → <key, value> (identity)
Partitioning function – partition keys across reducers based on ranges (can’t use hashing!)
• Take data distribution into account to balance reducer tasks
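The sort job depends on range partitioning: every key sent to reducer i must be smaller than every key sent to reducer i+1, so that concatenating the reducer outputs gives a globally sorted result. A hash partitioner would scatter neighbouring keys across reducers and lose that property. The sketch below is a minimal in-memory illustration; `range_partition`, `distributed_sort`, and the boundary values are hypothetical, and in practice the boundaries would be chosen by sampling the data to balance the reducers.

```python
import bisect


def range_partition(key, boundaries):
    """Return the index of the reducer whose key range contains key."""
    return bisect.bisect_right(boundaries, key)


def distributed_sort(values, boundaries):
    """Sort values using one partition per key range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for value in values:
        # Map: the value becomes the key, then it is routed by range.
        partitions[range_partition(value, boundaries)].append(value)

    output = []
    for part in partitions:
        # The framework hands each reducer its keys in sorted order;
        # the reducer itself is the identity function.
        output.extend(sorted(part))
    return output


values = [42, 7, 19, 88, 3, 56]
result = distributed_sort(values, boundaries=[20, 50])
print(result)
```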
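Returning to the URL access frequency example, the two chained jobs can be sketched as two plain Python functions. This is an illustrative sketch only: `count_job` and `fraction_job` are hypothetical names, the log is a small in-memory list, and the result is expressed as a fraction of total accesses rather than a formatted percentage.

```python
from collections import defaultdict


def count_job(log):
    """Job 1: map emits (URL, 1); reducers sum the values per URL."""
    counts = defaultdict(int)
    for url in log:
        counts[url] += 1
    return dict(counts)


def fraction_job(url_counts):
    """Job 2: all records share the key 1, so one reducer sees them all."""
    grouped = {1: list(url_counts.items())}  # map output: <1, (URL, count)>
    records = grouped[1]
    overall_count = sum(count for _, count in records)  # first pass
    return {url: count / overall_count for url, count in records}  # second pass


log = ["/a", "/b", "/a", "/c", "/a", "/b", "/a", "/c"]
url_counts = count_job(log)
fractions = fraction_job(url_counts)
print(fractions)
```

Routing everything to a single reducer keeps the second job simple, at the cost of that reducer having to hold or re-read every <URL, URL_count> record for its second pass.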
The YARN Scheduler
• Used underneath Hadoop 2.x+
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
  – Container = fixed CPU + fixed memory
• Has 3 main components
  – Global Resource Manager (RM)
    • Scheduling
  – Per-server Node Manager (NM)
    • Daemon and server-specific functions
  – Per-application (job) Application Master (AM)
    • Container negotiation with RM and NMs
    • Detecting task failures of that job

YARN: How a job gets a container
In this figure
• 2 servers (A, B)
• 2 jobs (1, 2)
[Figure: Resource Manager (Capacity Scheduler); Node A with Node Manager A; Node B with Node Manager B. Labeled steps: 1. Need container; 2. Container completed; 3. Container on Node B]