Map-reduce-Developing a map-reduce application – Map-reduce working procedure-2

Uploaded by

cakvlr

MapReduce Applications

Big Data Computing Vu Pham Hadoop MapReduce 2.0


Applications
Here are a few simple examples of interesting programs that
can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a
supplied pattern. The reduce function is an identity function that
just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes
logs of web page requests and outputs (URL; 1). The reduce
function adds together all values for the same URL and emits a
(URL; total count) pair.

Reverse Web-Link Graph: The map function outputs (target;
source) pairs for each link to a target URL found in a page named
source. The reduce function concatenates the list of all source
URLs associated with a given target URL and emits the pair
(target; list(source)).
Term-Vector per Host: A term vector summarizes the
most important words that occur in a document or a set
of documents as a list of (word; frequency) pairs.

The map function emits a (hostname; term vector) pair
for each input document (where the hostname is
extracted from the URL of the document).

The reduce function is passed all per-document term
vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits
a final (hostname; term vector) pair.
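
The term-vector job can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code: the `run_mapreduce` driver, the function names, the sample documents, and the `min_count` threshold used to decide what counts as an "infrequent" term are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def map_term_vector(doc):
    """doc is a (url, text) pair; emit (hostname, term vector)."""
    url, text = doc
    yield urlparse(url).hostname, Counter(text.lower().split())

def reduce_term_vector(host, vectors, min_count=2):
    """Add the per-document vectors and drop terms below min_count (assumed cutoff)."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    yield host, {term: n for term, n in total.items() if n >= min_count}

# Hypothetical sample input.
docs = [
    ("http://a.example/1", "big data big cluster"),
    ("http://a.example/2", "big data"),
    ("http://b.example/1", "yarn yarn scheduler"),
]
result = dict(run_mapreduce(docs, map_term_vector, reduce_term_vector))
```

Here `result` maps each hostname to the terms that survived the cutoff, for example "big" and "data" for `a.example`.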

Inverted Index: The map function parses each document,
and emits a sequence of (word; document ID) pairs. The
reduce function accepts all pairs for a given word, sorts
the corresponding document IDs and emits a (word;
list(document ID)) pair. The set of all output pairs forms a
simple inverted index. It is easy to augment this
computation to keep track of word positions.
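
A minimal sketch of the inverted index, assuming documents arrive as (document ID, text) pairs and words are split on whitespace. The `run_mapreduce` driver and all names below are hypothetical stand-ins for the framework, and removing duplicate IDs in the reducer is a choice made for this example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def map_index(doc):
    """Emit one (word, document ID) pair per word occurrence."""
    doc_id, text = doc
    for word in text.lower().split():
        yield word, doc_id

def reduce_index(word, doc_ids):
    """Sort the document IDs for a word; duplicates are removed (assumption)."""
    yield word, sorted(set(doc_ids))

# Hypothetical sample input.
docs = [
    (1, "hadoop runs mapreduce"),
    (2, "yarn schedules hadoop jobs"),
    (3, "mapreduce jobs"),
]
index = dict(run_mapreduce(docs, map_index, reduce_index))
```

To track word positions, the mapper could emit (word, (document ID, position)) instead, with the reducer left unchanged apart from the value type.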

Distributed Sort: The map function extracts the key from
each record, and emits a (key; record) pair. The reduce
function emits all pairs unchanged.



Applications of MapReduce
(1) Distributed Grep:

Input: large set of files
Output: lines that match pattern

Map – Emits a line if it matches the supplied pattern
Reduce – Copies the intermediate data to output
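
The grep job might look like the following single-process sketch. The input layout (filename, line number, text), the choice of (filename, line number) as the intermediate key, and the `run_mapreduce` driver are assumptions for illustration, not part of Hadoop.

```python
import re
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def make_grep_mapper(pattern):
    """Build a map function that emits a line only if it matches the pattern."""
    regex = re.compile(pattern)
    def map_grep(record):
        filename, line_no, text = record
        if regex.search(text):
            yield (filename, line_no), text
    return map_grep

def reduce_identity(key, values):
    """Identity reducer: copy the intermediate data to the output."""
    for value in values:
        yield key, value

# Hypothetical sample input.
lines = [
    ("a.log", 1, "ERROR disk full"),
    ("a.log", 2, "INFO started"),
    ("b.log", 1, "INFO ok"),
    ("b.log", 2, "ERROR timeout"),
]
matches = run_mapreduce(lines, make_grep_mapper(r"ERROR"), reduce_identity)
```

All the filtering happens in the map phase; the reducer adds nothing beyond collecting the matches into the output.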



Applications of MapReduce
(2) Reverse Web-Link Graph:

Input: Web graph: tuples (a, b), where page a links to page b (a → b)
Output: For each page, list of pages that link to it

Map – Processes the web graph and, for each input <source, target>, outputs <target, source>
Reduce – Emits <target, list(source)>
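
As a small illustration, the link reversal can be written as below. The `run_mapreduce` driver, the function names, and the sample edges are hypothetical, and sorting the source list is only done to make the output deterministic.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def map_reverse(edge):
    """For an input <source, target>, emit <target, source>."""
    source, target = edge
    yield target, source

def reduce_collect(target, sources):
    """Emit <target, list(source)>."""
    yield target, sorted(sources)

# Hypothetical web graph: (a, b) means page a links to page b.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
incoming = dict(run_mapreduce(edges, map_reverse, reduce_collect))
```

Each entry of `incoming` lists the pages that link to the key page, which is the output described above.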



Applications of MapReduce
(3) Count of URL access frequency:

Input: Log of accessed URLs, e.g., from a proxy server
Output: For each URL, % of total accesses for that URL

Map – Processes the web log and outputs <URL, 1>
Multiple Reducers – Emit <URL, URL_count>
(So far, like Wordcount. But we still need the %.)

Chain another MapReduce job after the one above:
Map – Processes <URL, URL_count> and outputs <1, (<URL, URL_count>)>
1 Reducer – Does two passes. The first pass sums all URL_counts to
calculate overall_count. The second pass calculates the %s and
emits multiple <URL, URL_count/overall_count> pairs.
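
The two chained jobs can be sketched as follows. This is a single-process illustration under assumptions: the `run_mapreduce` driver and the function names are hypothetical, the log is reduced to a bare list of URLs, and the result is kept as a fraction rather than multiplied by 100.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Job 1: count accesses per URL (same shape as Wordcount).
def map_count(url):
    yield url, 1

def reduce_count(url, ones):
    yield url, sum(ones)

# Job 2: send every <URL, URL_count> to one reducer under the single key 1.
def map_single_key(pair):
    yield 1, pair

def reduce_fraction(_key, pairs):
    overall_count = sum(count for _url, count in pairs)  # first pass
    for url, count in pairs:                             # second pass
        yield url, count / overall_count

# Hypothetical access log.
log = ["/home", "/about", "/home", "/home"]
counts = run_mapreduce(log, map_count, reduce_count)
fractions = dict(run_mapreduce(counts, map_single_key, reduce_fraction))
```

The second job funnels everything through one reducer because the overall count has to be known before any share can be computed.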
Applications of MapReduce
(4) Sort:

Input: Series of (key, value) pairs
Output: Sorted <value>s

Map – <key, value> → <value, _> (identity)
Reducer – <key, value> → <key, value> (identity)

The framework does the sorting: each map task's output is sorted
(e.g., quicksort) and each reduce task's input is sorted (e.g., mergesort).

Partitioning function – partition keys across reducers
based on ranges (can't use hashing, because it would scatter
neighbouring keys across reducers and the concatenated reducer
outputs would not be globally sorted)
• Take data distribution into account to balance
reducer tasks
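
A minimal sketch of sorting with a range partitioner, run in one process. The function names and the way split points are chosen (quantiles of the full key list, standing in for a sample of the data) are assumptions for illustration; the per-partition `sorted` call plays the role of the framework sorting each reducer's input.

```python
import bisect

def map_sort(pair):
    """<key, value> -> <value, _>: the value becomes the sort key."""
    _key, value = pair
    return value, None

def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundaries at quantiles of the sample,
    so each reducer receives a similar share of the data."""
    ordered = sorted(sample)
    return [ordered[i * len(ordered) // num_reducers]
            for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Index of the reducer whose key range contains key."""
    return bisect.bisect_right(split_points, key)

def distributed_sort(pairs, num_reducers=3):
    mapped = [map_sort(p) for p in pairs]
    split_points = choose_split_points([k for k, _ in mapped], num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for key, _ in mapped:
        partitions[range_partition(key, split_points)].append(key)
    output = []
    for part in partitions:          # reducers in key-range order
        output.extend(sorted(part))  # sorted reducer input, identity reduce
    return output

# Hypothetical input records.
pairs = [("r1", 42), ("r2", 7), ("r3", 19), ("r4", 3), ("r5", 88), ("r6", 19)]
sorted_values = distributed_sort(pairs)
```

Because reducer i only holds keys below those of reducer i + 1, concatenating the reducer outputs in order yields a globally sorted result. Replacing `range_partition` with a hash of the key would break that property.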



The YARN Scheduler
• Used underneath Hadoop 2.x and later
• YARN = Yet Another Resource Negotiator
• Treats each server as a collection of containers
– Container = fixed CPU + fixed memory

• Has 3 main components
– Global Resource Manager (RM)
• Scheduling
– Per-server Node Manager (NM)
• Daemon and server-specific functions
– Per-application (job) Application Master (AM)
• Container negotiation with RM and NMs
• Detecting task failures of that job
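
The division of labour among the three components can be illustrated with a toy model. This is not the YARN API: every class and method name below is hypothetical, containers are reduced to a free-slot counter per server, and the "most free containers first" policy is an assumption standing in for a real scheduler.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Per-server daemon: tracks free containers and the tasks it runs."""
    name: str
    free_containers: int
    running: list = field(default_factory=list)

    def start_task(self, app, task):
        self.running.append((app, task))

class ResourceManager:
    """Global scheduler: decides which node grants the next container."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        # Assumed policy: pick the node with the most free containers.
        node = max(self.nodes, key=lambda n: n.free_containers)
        if node.free_containers == 0:
            return None
        node.free_containers -= 1
        return node

class ApplicationMaster:
    """Per-job master: negotiates containers, then asks the node to start tasks."""
    def __init__(self, app, rm):
        self.app = app
        self.rm = rm

    def run_task(self, task):
        node = self.rm.allocate()          # ask the RM for a container
        if node is None:
            return None                    # no capacity anywhere
        node.start_task(self.app, task)    # ask the NM to start the task
        return node.name

node_a = NodeManager("A", free_containers=1)
node_b = NodeManager("B", free_containers=2)
rm = ResourceManager([node_a, node_b])
am1, am2 = ApplicationMaster("App1", rm), ApplicationMaster("App2", rm)
placements = [am2.run_task("t1"), am1.run_task("t2"),
              am2.run_task("t3"), am1.run_task("t4")]
```

The last request returns `None` because both servers have run out of containers. Failure detection by the Application Master is left out of this sketch.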
YARN: How a job gets a container
[Figure: a Resource Manager running the Capacity Scheduler, with 2 servers (Node A and Node B, each with its own Node Manager) and 2 jobs (Application Master 1 and Application Master 2, plus a task of App2). Message labels in the figure: 1. Need container; 2. Container completed; 3. Container on Node B; 4. Start task, please!]
