MapReduce Example
Franco-German Summer University for Young Researchers 2011
Cloud Computing: Challenges and Opportunities
Hadoop in Practice
Pietro Michiardi (Eurecom)
Introduction
Overview of this Lecture

Hadoop: Architecture Design and Implementation Details [45 minutes]
  HDFS
  Hadoop MapReduce
  Hadoop MapReduce I/O

Exercise Session:
  Warm up: WordCount and first design patterns [45 minutes]
  Exercises on various design patterns [60 minutes]
    Pairs
    Stripes
    Order Inversion [Homework]
Hadoop MapReduce
Preliminaries
Terminology
MapReduce:
Job: an execution of a Mapper and Reducer across a data set
Task: an execution of a Mapper or a Reducer on a slice of data
Task Attempt: an instance of an attempt to execute a task
Example:
Running WordCount across 20 files is one job
20 files to be mapped = 20 map tasks, plus some number of reduce tasks
At least 20 task attempts will be performed... more if a machine crashes
HDFS in details
Secondary NameNode
Merges the namespace image with the edit log
A useful trick to recover from a failure of the NameNode is to use the NFS copy of the metadata and switch the secondary to primary
DataNode
They store data and talk to clients
They report periodically to the NameNode the list of blocks they hold
External clients
For each block, the NameNode returns a set of DataNodes holding a copy of it
DataNodes are sorted according to their proximity to the client
MapReduce clients
TaskTrackers and DataNodes are colocated
For each block, the NameNode usually returns the local DataNode
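As an illustrative sketch (not from the original deck): a client-side read through the FileSystem API, where the NameNode lookup and the proximity-sorted DataNode list are handled behind fs.open(). The path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // talks to the NameNode
            // open() fetches block locations from the NameNode; the returned
            // stream then reads each block from the closest DataNode
            FSDataInputStream in = fs.open(new Path("/user/data/sample.txt"));
            try {
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }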
Details on replication
Clients ask the NameNode for a list of suitable DataNodes
This list forms a pipeline: the first DataNode stores a copy of a block, then forwards it to the second, and so on
Replica Placement
Tradeoff between reliability and bandwidth
Default placement:
  First copy on the same node as the client, second replica is off-rack, third replica is on the same rack as the second but on a different node
Since Hadoop 0.21, replica placement can be customized
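A minimal write sketch under an explicit replication factor; the dfs.replication value mirrors the default of three copies, and the path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-client override of the cluster-wide default replication factor
            conf.setInt("dfs.replication", 3);
            FileSystem fs = FileSystem.get(conf);
            // Each written block travels down the DataNode pipeline described above
            FSDataOutputStream out = fs.create(new Path("/user/data/out.txt"));
            out.writeUTF("hello pipeline");
            out.close();
        }
    }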
Job Submission
JobClient class
The runJob() method creates a new instance of a JobClient
Then it calls submitJob() on this class
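A minimal driver sketch using the classic mapred API; IdentityMapper and IdentityReducer are stand-ins for real job logic, and the input/output paths come from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MinimalDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MinimalDriver.class);
            conf.setJobName("minimal-job");
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // runJob() creates a JobClient, calls submitJob(), and polls
            // the JobTracker for progress until the job completes
            JobClient.runJob(conf);
        }
    }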
Selecting a task
The JobTracker first needs to select a job (i.e., scheduling)
TaskTrackers have a fixed number of slots for map and reduce tasks
The JobTracker gives priority to map tasks (WHY?)
Data locality
JobTracker is topology aware
Useful for map tasks
Unused for reduce tasks
Task Execution
Shuffle and Sort

The MapReduce framework guarantees that the input to every reducer is sorted by key
The process by which the system sorts and transfers map outputs to reducers is known as the shuffle
The shuffle is the most important part of the framework, where the magic happens
A good understanding allows optimizing both the framework and the execution time of MapReduce jobs
Shuffle and Sort: the Map Side

The output of a map task is not simply written to disk
In-memory buffering
Pre-sorting
Disk spills:
  Written in round-robin to a local directory
  Output data is partitioned according to the reducers they will be sent to
  Within each partition, data is sorted (in memory)
  Optionally, if there is a combiner, it is executed just after the sort phase
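The partitioning step can be pictured with the logic of the default HashPartitioner, reproduced here as a sketch: a given key always lands in the same partition, so all its values meet at one reducer.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Same logic as Hadoop's default HashPartitioner
    public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // mask the sign bit so the modulo result is non-negative
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public void configure(JobConf job) { /* no configuration needed */ }
    }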
Shuffle and Sort: the Reduce Side

The map output file is located on the local disk of the TaskTracker
Another TaskTracker (in charge of a reduce task) requires input from many other TaskTrackers (that finished their map tasks)
How do reducers know which TaskTrackers to fetch map output from?
When a map task finishes, it notifies the parent TaskTracker
The TaskTracker notifies the JobTracker (through the heartbeat mechanism)
A thread in the reducer periodically polls the JobTracker
TaskTrackers do not delete local map output as soon as a reduce task has fetched it (WHY?)
The map outputs are copied to the TaskTracker running the reducer in memory (if they fit)
Otherwise, they are copied to disk
Input consolidation
A background thread merges all partial inputs into larger, sorted files
Note that if compression was used (for map outputs, to save bandwidth), decompression will take place in memory
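A sketch of switching on map output compression with the classic JobConf API; the codec choice is illustrative.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressMapOutput {
        // Called from a driver before job submission
        static void enable(JobConf conf) {
            // Compress intermediate map output to save shuffle bandwidth;
            // reducers decompress the fetched map segments in memory
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(GzipCodec.class);
        }
    }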
Hadoop I/O
What's next

Overview of what Hadoop offers
For in-depth knowledge, refer to [7]
MapReduce Types
Types:
K types implement WritableComparable
V types implement Writable
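A sketch of what these constraints look like in the classic API; the mapper name and its logic (emitting each line with its byte length) are made up for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper<K1, V1, K2, V2>: both key types (LongWritable, Text here)
    // implement WritableComparable; value types only need Writable
    public class LineLengthMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // emit (line, byte length of the line)
            output.collect(line, new IntWritable(line.getLength()));
        }
    }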
What is a Writable?

Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
All keys are instances of WritableComparable
Why comparable?
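Keys are comparable because the framework sorts map output by key during the shuffle, so every key type must define a total order. A sketch of a custom key follows; the WordPair class is made up for illustration (it also serves the Pairs pattern later in this lecture).

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // write()/readFields() define the wire format, compareTo() the sort
    // order used during the shuffle, hashCode() the default partitioning
    public class WordPair implements WritableComparable<WordPair> {
        private final Text left = new Text();
        private final Text right = new Text();

        public void set(String l, String r) { left.set(l); right.set(r); }

        public Text getLeft() { return left; }

        public void write(DataOutput out) throws IOException {
            left.write(out);
            right.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            left.readFields(in);
            right.readFields(in);
        }

        public int compareTo(WordPair other) {
            int cmp = left.compareTo(other.left);
            return cmp != 0 ? cmp : right.compareTo(other.right);
        }

        @Override
        public int hashCode() {
            return left.hashCode() * 163 + right.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof WordPair)) return false;
            WordPair p = (WordPair) o;
            return left.equals(p.left) && right.equals(p.right);
        }

        @Override
        public String toString() { return "(" + left + ", " + right + ")"; }
    }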
Reading Data
KeyValueTextInputFormat
  Maps newline-terminated text lines of the form key SEPARATOR value
SequenceFileInputFormat
  Binary file of key-value pairs with some additional metadata
SequenceFileAsTextInputFormat
  Same as before, but maps (k.toString(), v.toString())
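A sketch of selecting an input format in the driver with the classic API; the separator property name below is the one used by Hadoop 0.20/1.x, stated here as an assumption.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;

    public class InputFormatSetup {
        static void configure(JobConf conf) {
            // Treat each line as key SEPARATOR value instead of (offset, line)
            conf.setInputFormat(KeyValueTextInputFormat.class);
            // The default separator is the tab character; it can be overridden
            conf.set("key.value.separator.in.input.line", "\t");
        }
    }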
Record Readers
KeyValueLineRecordReader
Used by KeyValueTextInputFormat
Example: with TextInputFormat, the text

  On the top of the Crumpetty Tree
  The Quangle Wangle sat,
  But his face you could not see,
  On account of his Beaver Hat.

is mapped to the records (key = byte offset of the line)

  (0, On the top of the Crumpetty Tree)
  (33, The Quangle Wangle sat,)
  (57, But his face you could not see,)
  (89, On account of his Beaver Hat.)
Writing Data

Analogous to InputFormat
TextOutputFormat writes key SEPARATOR value <newline> strings to the output file (the separator is a tab by default)
SequenceFileOutputFormat uses a binary format to pack key-value pairs
NullOutputFormat discards output
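Symmetrically, a sketch of selecting the output format (TextOutputFormat is the default when nothing is set).

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class OutputFormatSetup {
        static void configure(JobConf conf) {
            // Pack key-value pairs in a binary, splittable container
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            // Alternatively, discard all output:
            // conf.setOutputFormat(org.apache.hadoop.mapred.lib.NullOutputFormat.class);
        }
    }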
Algorithm Design
Preliminaries
Learn by examples
Design patterns
Synchronization is perhaps the most tricky aspect
Aspects that are not under the control of the designer:
  Where a mapper or reducer will run
  When a mapper or reducer begins or finishes
  Which input key-value pairs are processed by a specific mapper
  Which intermediate key-value pairs are processed by a specific reducer
Optimizations
Scalability (linear)
Resource requirements (storage and bandwidth)
Local Aggregation
Motivations

In the context of data-intensive distributed processing, the most important aspect of synchronization is the exchange of intermediate results
This involves copying intermediate results from the processes that produced them to those that consume them
In general, this involves data transfers over the network
In Hadoop, disk I/O is also involved, as intermediate results are written to disk
Example: Word Count (baseline)

The mapper:
Takes an input key-value pair and tokenizes the document
Emits intermediate key-value pairs: the word is the key and the integer one is the value
The framework:
Guarantees all values associated with the same key (the word) are brought to the same reducer
The reducer:
Receives all values associated with the same key
Sums the values and writes output key-value pairs: the key is the word and the value is the number of occurrences
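A sketch of this baseline algorithm with the classic mapred API (the canonical Word Count):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // tokenize the line and emit (word, 1) for each term
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                // sum all the ones for this word
                int sum = 0;
                while (values.hasNext()) sum += values.next().get();
                output.collect(key, new IntWritable(sum));
            }
        }
    }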
Technique 1: Combiners

Combiners are a general mechanism to reduce the amount of intermediate data
They could be thought of as mini-reducers
Note: due to the Zipfian nature of term distributions, not all mappers will see all terms
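For Word Count, the reducer itself can serve as the combiner, since summing partial counts is associative and commutative; a sketch of the driver-side setup, referring to the WordCount sketch above:

    import org.apache.hadoop.mapred.JobConf;

    public class CombinerSetup {
        static void configure(JobConf conf) {
            // Run the reducer logic locally on each mapper's output
            // before the shuffle, to cut intermediate data volume
            conf.setCombinerClass(WordCount.Reduce.class);
        }
    }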
In-Mapper Combiners
With ordinary combiners, mappers still need to emit all key-value pairs; combiners only reduce network traffic
Scalability bottleneck
The in-mapper combining technique strictly depends on having sufficient memory to store intermediate results
  And you don't want the OS to deal with swapping
  Multiple threads compete for the same resources
A possible solution: block and flush
Implemented with a simple counter
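A sketch of in-mapper combining with block and flush: counts are buffered in an associative array inside the mapper and spilled whenever a simple size counter crosses a threshold (the threshold value is illustrative; saving the OutputCollector so close() can emit the tail is a common pattern, assumed here).

    import java.io.IOException;
    import java.util.HashMap;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InMapperCombiningMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int FLUSH_THRESHOLD = 100000; // illustrative bound
        private final HashMap<String, Integer> counts = new HashMap<String, Integer>();
        private OutputCollector<Text, IntWritable> collector;

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            collector = output; // saved so close() can emit the tail
            for (String term : value.toString().split("\\s+")) {
                if (term.isEmpty()) continue;
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }
            // block and flush: bound memory by spilling the dictionary
            // whenever it grows past a simple counter threshold
            if (counts.size() > FLUSH_THRESHOLD) flush();
        }

        private void flush() throws IOException {
            for (java.util.Map.Entry<String, Integer> e : counts.entrySet()) {
                collector.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
            counts.clear();
        }

        @Override
        public void close() throws IOException {
            if (collector != null) flush(); // emit whatever is left at the end
        }
    }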
Next, we focus on a particular problem that benefits from these two methods
Problem statement

The problem: building word co-occurrence matrices for large corpora
The co-occurrence matrix of a corpus is a square n × n matrix:
  n is the number of unique words (i.e., the vocabulary size)
  A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context
  Context: a sentence, a paragraph, a document, or a window of m words
  NOTE: the matrix may be symmetric in some cases
Motivation
This problem is a basic building block for more complex operations
Estimating the distribution of discrete joint events from a large number of observations
A similar problem arises in other domains:
  Customers who buy this tend to also buy that
The Pairs approach

The mapper:
Processes each input document
Emits key-value pairs with:
  each co-occurring word pair as the key
  the integer one (the count) as the value
The reducer:
Receives pairs relative to co-occurring words
Computes an absolute count of the joint event
Emits the pair and the count as the final key-value output
Basically, reducers emit the cells of the matrix
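A sketch of the Pairs approach, reusing the hypothetical WordPair key from the Writable discussion; here the co-occurrence context is the input line.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Pairs {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, WordPair, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final WordPair pair = new WordPair();

            public void map(LongWritable key, Text doc,
                    OutputCollector<WordPair, IntWritable> output, Reporter reporter)
                    throws IOException {
                String[] terms = doc.toString().split("\\s+");
                // two nested loops: one emission per co-occurring pair
                for (int i = 0; i < terms.length; i++) {
                    for (int j = 0; j < terms.length; j++) {
                        if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
                        pair.set(terms[i], terms[j]);
                        output.collect(pair, ONE);
                    }
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<WordPair, IntWritable, WordPair, IntWritable> {
            public void reduce(WordPair pair, Iterator<IntWritable> counts,
                    OutputCollector<WordPair, IntWritable> output, Reporter reporter)
                    throws IOException {
                // absolute count of the joint event: one cell of the matrix
                int sum = 0;
                while (counts.hasNext()) sum += counts.next().get();
                output.collect(pair, new IntWritable(sum));
            }
        }
    }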
The Stripes approach

The mapper:
Same two nested loops structure as before
Co-occurrence information is first stored in an associative array
Emits key-value pairs with words as keys and the corresponding arrays as values
The reducer:
Receives all associative arrays related to the same word
Performs an element-wise sum of all associative arrays with the same key
Emits key-value output in the form of (word, associative array)
Basically, reducers emit rows of the co-occurrence matrix
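A sketch of the Stripes approach using MapWritable as the associative array:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Stripes {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, MapWritable> {
            public void map(LongWritable key, Text doc,
                    OutputCollector<Text, MapWritable> output, Reporter reporter)
                    throws IOException {
                String[] terms = doc.toString().split("\\s+");
                for (int i = 0; i < terms.length; i++) {
                    if (terms[i].isEmpty()) continue;
                    MapWritable stripe = new MapWritable(); // associative array
                    for (int j = 0; j < terms.length; j++) {
                        if (i == j || terms[j].isEmpty()) continue;
                        Text neighbor = new Text(terms[j]);
                        IntWritable c = (IntWritable) stripe.get(neighbor);
                        stripe.put(neighbor, new IntWritable(c == null ? 1 : c.get() + 1));
                    }
                    output.collect(new Text(terms[i]), stripe);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, MapWritable, Text, MapWritable> {
            public void reduce(Text word, Iterator<MapWritable> stripes,
                    OutputCollector<Text, MapWritable> output, Reporter reporter)
                    throws IOException {
                MapWritable sum = new MapWritable();
                while (stripes.hasNext()) {
                    // element-wise sum of the associative arrays
                    for (java.util.Map.Entry<Writable, Writable> e : stripes.next().entrySet()) {
                        IntWritable cur = (IntWritable) sum.get(e.getKey());
                        int add = ((IntWritable) e.getValue()).get();
                        sum.put(e.getKey(), new IntWritable(cur == null ? add : cur.get() + add));
                    }
                }
                output.collect(word, sum); // one row of the co-occurrence matrix
            }
        }
    }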
The values are more complex and have serialization/deserialization overhead
Greatly benefits from combiners, as the key space is the vocabulary
Suffers from memory paging problems, if not properly engineered
Order Inversion
Computing relative frequencies

f(w_j | w_i) = N(w_i, w_j) / Σ_w' N(w_i, w')

N(w_i, w_j) is the number of times a co-occurring word pair is observed
The denominator is called the marginal
Computing relative frequencies: a basic approach

We must define the sort order of the pair:
  In this way, the keys are first sorted by the left word, and then by the right word (in the pair)
  Hence, we can detect when all pairs associated with the word we are conditioning on (w_i) have been seen
  At this point, we can use the in-memory buffer, compute the relative frequencies, and emit
What we want is that all pairs with the same left word are sent to the same reducer
Limitations of this approach
Computing relative frequencies: order inversion

Recall that mappers emit pairs of co-occurring words as keys
The mapper:
  additionally emits a special key of the form (w_i, *)
  The value associated with the special key is one; it represents the contribution of the word pair to the marginal
  Using combiners, these partial marginal counts will be aggregated before being sent to the reducers
The reducer:
We must make sure that the special key-value pairs are processed before any other key-value pairs where the left word is w_i
We also need to modify the partitioner: it must take into account only the first word of the pair, as sketched below
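A sketch of such a partitioner, again assuming the hypothetical WordPair key (with its getLeft() accessor); note that with byte-wise Text ordering the '*' character sorts before letters, so the special marginal keys naturally arrive first.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Partition on the left word only, so the special (w_i, *) key and
    // every (w_i, w_j) pair reach the same reducer
    public class LeftWordPartitioner implements Partitioner<WordPair, IntWritable> {
        public int getPartition(WordPair key, IntWritable value, int numPartitions) {
            return (key.getLeft().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public void configure(JobConf job) { /* nothing to configure */ }
    }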
Memory requirements:
Minimal, because only the marginal (an integer) needs to be stored
No buffering of individual co-occurring word counts
No scalability bottleneck
PageRank
P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)

|G| is the number of nodes in the graph
α is a random jump factor
L(n) is the set of pages that link to n
C(m) is the out-degree of node m
PageRank in MapReduce
The reducer updates the value of the PageRank of every single node
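A heavily simplified reducer sketch consistent with the formula above: values are assumed to carry only the rank mass P(m)/C(m) emitted by mappers along links, and re-emitting the graph structure for the next iteration is omitted; ALPHA and NUM_NODES are hypothetical job constants.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PageRankReducer extends MapReduceBase
            implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private static final double ALPHA = 0.15;     // random jump factor (assumed)
        private static final long NUM_NODES = 1000000; // |G|, assumed known

        public void reduce(Text node, Iterator<DoubleWritable> contributions,
                OutputCollector<Text, DoubleWritable> output, Reporter reporter)
                throws IOException {
            double mass = 0.0;
            while (contributions.hasNext()) mass += contributions.next().get();
            // P(n) = alpha * (1/|G|) + (1 - alpha) * sum of incoming mass
            double rank = ALPHA / NUM_NODES + (1 - ALPHA) * mass;
            output.collect(node, new DoubleWritable(rank));
        }
    }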
References
References I

[1] Adversarial Information Retrieval Workshop.
[2] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions on Internet Technology, 2005.
[3] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In Proc. of SPAA, 2011.
[4] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proc. of SIGKDD, 2005.
References II

[5] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Stanford Digital Library Working Paper, 1999.
[6] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proc. of the 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST). IEEE, 2010.
[7] Tom White. Hadoop: The Definitive Guide. O'Reilly, 2010.