
Map-reduce-Developing a map-reduce application – Map-reduce working procedure-2


MapReduce Applications

Big Data Computing Vu Pham Hadoop MapReduce 2.0


Applications
Here are a few simple examples of interesting programs that
can be easily expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a
supplied pattern. The reduce function is an identity function that
just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs (URL; 1). The reduce
function adds together all values for the same URL and emits a
(URL; total count) pair.
Reverse Web-Link Graph: The map function outputs (target;
source) pairs for each link to a target URL found in a page named
source. The reduce function concatenates the list of all source
URLs associated with a given target URL and emits the pair:
(target; list(source))
Term-Vector per Host: A term vector summarizes the
most important words that occur in a document or a set
of documents as a list of (word; frequency) pairs.

The map function emits a (hostname; term vector) pair
for each input document (where the hostname is
extracted from the URL of the document).

The reduce function is passed all per-document term
vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits
a final (hostname; term vector) pair.
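
A minimal in-memory sketch of the term-vector computation, assuming whitespace tokenization and a hypothetical cutoff `MIN_COUNT` for what counts as an "infrequent" term (the slides give neither). The function names and the example URLs are illustrative only, and the grouping loop stands in for the shuffle that a real framework performs.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

MIN_COUNT = 2  # assumed threshold: terms seen fewer times than this are dropped


def map_doc(url, text):
    """Emit one (hostname, term vector) pair for a document."""
    host = urlparse(url).hostname
    yield host, Counter(text.lower().split())


def reduce_host(host, vectors):
    """Add the per-document vectors and throw away infrequent terms."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    yield host, {word: n for word, n in total.items() if n >= MIN_COUNT}


def term_vectors_per_host(docs):
    groups = defaultdict(list)  # stand-in for the shuffle: group values by key
    for url, text in docs:
        for host, vec in map_doc(url, text):
            groups[host].append(vec)
    return dict(pair for host in groups for pair in reduce_host(host, groups[host]))


docs = [
    ("http://a.example/p1", "hadoop map reduce map"),
    ("http://a.example/p2", "map reduce yarn"),
    ("http://b.example/x", "sort sort grep"),
]
result = term_vectors_per_host(docs)
```

With this input, `result` maps each hostname to the terms that survive the cutoff, together with their summed frequencies.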

Inverted Index: The map function parses each document,
and emits a sequence of (word; document ID) pairs. The
reduce function accepts all pairs for a given word, sorts
the corresponding document IDs and emits a (word;
list(document ID)) pair. The set of all output pairs forms a
simple inverted index. It is easy to augment this
computation to keep track of word positions.
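
The inverted index can be sketched in a few lines of Python. This is an in-memory illustration, not Hadoop code: the names are hypothetical, tokenization is a plain whitespace split, and duplicate document IDs are removed in the reducer, which is an assumption the slide does not state.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Emit a (word, document ID) pair for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_word(word, doc_ids):
    """Sort the document IDs for a word (deduplicated here by assumption)."""
    yield word, sorted(set(doc_ids))


def build_inverted_index(docs):
    groups = defaultdict(list)  # stand-in for the shuffle phase
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            groups[word].append(emitted_id)
    index = {}
    for word in sorted(groups):
        for key, ids in reduce_word(word, groups[word]):
            index[key] = ids
    return index


docs = {2: "reduce then sort", 1: "map then reduce"}
index = build_inverted_index(docs)
```

To track word positions as well, the map function could emit (word; (document ID, position)) pairs instead; the reducer would stay the same apart from what it sorts.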

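Distributed Sort: The map function extracts the key from
each record, and emits a (key; record) pair. The reduce
function emits all pairs unchanged. The sorting itself is done
by the framework, which orders the keys within each partition
before they reach the reduce function.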

Applications of MapReduce
(1) Distributed Grep:

Input: large set of files
Output: lines that match pattern

Map – Emits a line if it matches the supplied pattern
Reduce – Copies the intermediate data to output
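
A small sketch of distributed grep, run in memory. Using the file name as the intermediate key is an assumption made for illustration (the slide does not say what the key is), and `files` is made-up sample data.

```python
import re
from collections import defaultdict


def grep_map(filename, line, pattern):
    """Emit the line only if it matches the supplied pattern."""
    if pattern.search(line):
        yield filename, line


def identity_reduce(key, values):
    """Copy the intermediate data to the output unchanged."""
    for value in values:
        yield key, value


def distributed_grep(files, regex):
    pattern = re.compile(regex)
    groups = defaultdict(list)  # stand-in for the shuffle phase
    for filename, lines in files.items():
        for line in lines:
            for key, value in grep_map(filename, line, pattern):
                groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(identity_reduce(key, groups[key]))
    return output


files = {
    "a.log": ["ERROR disk full", "INFO ok"],
    "b.log": ["INFO started", "ERROR timeout"],
}
matches = distributed_grep(files, r"^ERROR")
```

All the filtering happens in the map function, so the reduce step does no real work here.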

Applications of MapReduce
(2) Reverse Web-Link Graph:

Input: Web graph: tuples (a, b) where (page a → page b)
Output: For each page, list of pages that link to it

Map – Processes the web graph and, for each input <source, target>, outputs <target, source>
Reduce – Emits <target, list(source)>
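
The link reversal can be sketched as follows. The page names are placeholders, and sorting each source list is an assumption added only to make the output deterministic.

```python
from collections import defaultdict


def reverse_map(source, target):
    """Swap each <source, target> edge into <target, source>."""
    yield target, source


def reverse_reduce(target, sources):
    """Emit <target, list(source)>; sorted here for a stable result."""
    yield target, sorted(sources)


def reverse_links(edges):
    groups = defaultdict(list)  # stand-in for the shuffle phase
    for source, target in edges:
        for key, value in reverse_map(source, target):
            groups[key].append(value)
    return dict(
        pair for target in sorted(groups) for pair in reverse_reduce(target, groups[target])
    )


edges = [("a", "b"), ("c", "b"), ("b", "a")]
incoming = reverse_links(edges)
```

Here `incoming` maps each page to the pages that link to it.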

Applications of MapReduce
(3) Count of URL access frequency:

Input: Log of accessed URLs, e.g., from proxy server
Output: For each URL, % of total accesses for that URL

Map – Processes web log and outputs <URL, 1>
Multiple Reducers – Emit <URL, URL_count>
(So far, like Wordcount. But still need %)

Chain another MapReduce job after the one above:
Map – Processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – Does two passes. In the first pass, sums up all URL_counts to calculate overall_count. In the second pass, calculates the percentages and emits multiple <URL, URL_count/overall_count>
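
The two chained jobs can be sketched with a tiny in-memory runner. `run_job` is a hypothetical helper standing in for the framework, and the log entries are made up. The second job sends every pair to the single key 1, so one reducer sees all the counts.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Toy MapReduce runner: map, group by key, then reduce in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Job 1: like Wordcount, but over URLs.
def count_map(url):
    yield url, 1


def count_reduce(url, ones):
    yield url, sum(ones)


# Job 2: route everything to one reducer, then normalize.
def single_key_map(url_and_count):
    yield 1, url_and_count


def fraction_reduce(_key, pairs):
    overall_count = sum(count for _url, count in pairs)  # first pass
    for url, count in pairs:                             # second pass
        yield url, count / overall_count


log = ["/home", "/about", "/home", "/home"]
counts = run_job(log, count_map, count_reduce)
fractions = dict(run_job(counts, single_key_map, fraction_reduce))
```

The single reducer in the second job is a bottleneck by design. That is tolerable here because its input is one record per distinct URL, not one per log line.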

Applications of MapReduce
(4) Sort:

Input: Series of (key, value) pairs
Output: Sorted <value>s

Map – <key, value> → <value, _> (identity)
Reducer – <key, value> → <key, value> (identity)

The framework does the sorting: a map task's output is sorted (e.g., quicksort) and a reduce task's input is sorted (e.g., mergesort).

Partitioning function – partition keys across reducers based on ranges (can't use hashing!)
• Take data distribution into account to balance reducer tasks
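
A sketch of why the partitioner must be range-based: if reducer i only receives keys smaller than those of reducer i+1, the reducer outputs can simply be concatenated. The `boundaries` list is an assumed set of split points; in practice it would be chosen from the data distribution, for example by sampling.

```python
from bisect import bisect_right


def sort_map(key, value):
    """Make the value the new key so that the framework sorts on it."""
    yield value, None


def range_partition(key, boundaries):
    """Return the index of the reducer whose key range contains this key."""
    return bisect_right(boundaries, key)


def mapreduce_sort(pairs, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key, value in pairs:
        for new_key, new_value in sort_map(key, value):
            partitions[range_partition(new_key, boundaries)].append((new_key, new_value))
    output = []
    for partition in partitions:
        partition.sort(key=lambda kv: kv[0])           # framework sorts reducer input
        output.extend(new_key for new_key, _ in partition)  # identity reduce
    return output


pairs = [("r1", 42), ("r2", 7), ("r3", 19), ("r4", 88), ("r5", 3)]
sorted_values = mapreduce_sort(pairs, boundaries=[10, 50])
```

With a hash partitioner each reducer would still produce sorted output, but neighbouring reducers would hold interleaved key ranges, so the concatenated result would not be globally sorted.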

The YARN Scheduler
• Used underneath Hadoop 2.x and later
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
  – Container = fixed CPU + fixed memory
• Has 3 main components
  – Global Resource Manager (RM)
    • Scheduling
  – Per-server Node Manager (NM)
    • Daemon and server-specific functions
  – Per-application (job) Application Master (AM)
    • Container negotiation with RM and NMs
    • Detecting task failures of that job
YARN: How a job gets a container
[Figure: a Resource Manager running the Capacity Scheduler, above 2 servers (A, B) and 2 jobs (1, 2). Node A and Node B each run a Node Manager (A and B); the figure also shows Application Master 1, Application Master 2, and a Task (App2). Numbered steps in the figure: 1. Need container; 2. Container Completed; 3. Container on Node B; 4. Start task, please!]
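
The four numbered steps can be mimicked with a toy model. This is not the YARN API: `ToyResourceManager`, its methods, and the first-come-first-served policy are all assumptions made for illustration, and the real Capacity Scheduler is considerably more involved.

```python
from collections import deque


class ToyResourceManager:
    """Toy model: free containers per node plus a queue of waiting requests."""

    def __init__(self, free_containers):
        self.free = dict(free_containers)  # node name -> number of free containers
        self.waiting = deque()             # Application Masters waiting for a container
        self.log = []

    def request_container(self, app_master):
        # Step 1: an Application Master says it needs a container.
        self.waiting.append(app_master)
        self._schedule()

    def container_completed(self, node):
        # Step 2: a Node Manager reports that a container on its node finished.
        self.free[node] += 1
        self._schedule()

    def _schedule(self):
        # Step 3: grant free containers to waiting Application Masters (FIFO, assumed).
        for node in sorted(self.free):
            while self.free[node] > 0 and self.waiting:
                app_master = self.waiting.popleft()
                self.free[node] -= 1
                # Step 4: the Application Master would now ask that node's
                # Node Manager to start the task.
                self.log.append(f"{app_master} starts task on {node}")


rm = ToyResourceManager({"A": 0, "B": 0})
rm.request_container("AM1")   # no free container yet, so AM1 waits
rm.container_completed("B")   # a container frees up on node B
```

After these two calls the request from `AM1` has been matched to the container that was freed on node B.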
