
HaLoop: Efficient Iterative Data Processing on Large Clusters

Yingyi Bu∗ Bill Howe Magdalena Balazinska Michael D. Ernst


Department of Computer Science and Engineering
University of Washington, Seattle, WA, U.S.A.
[email protected], {billhowe, magda, mernst}@cs.washington.edu

∗ Work was done while the author was at the University of Washington, Seattle. Current affiliation: Yingyi Bu, University of California, Irvine.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore.
Proceedings of the VLDB Endowment, Vol. 3, No. 1
Copyright 2010 VLDB Endowment 2150-8097/10/09... $10.00.

ABSTRACT
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, it also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers.

Figure 1: PageRank example.
(a) Initial Rank Table R0:
    url          rank
    www.a.com    1.0
    www.b.com    1.0
    www.c.com    1.0
    www.d.com    1.0
    www.e.com    1.0
(b) Linkage Table L:
    url_source   url_dest
    www.a.com    www.b.com
    www.a.com    www.c.com
    www.c.com    www.a.com
    www.e.com    www.d.com
    www.d.com    www.b.com
    www.c.com    www.e.com
    www.e.com    www.c.com
    www.a.com    www.d.com
(c) Loop Body:
    MR1:  T1 = Ri ⋈_{url = url_source} L
          T2 = γ_{url, rank, rank/COUNT(url_dest) → new_rank}(T1)
    MR2:  T3 = T2 ⋈_{url = url_source} L
          Ri+1 = γ_{url_dest → url, SUM(new_rank) → rank}(T3)
(d) Rank Table R3:
    url          rank
    www.a.com    2.13
    www.b.com    3.89
    www.c.com    2.60
    www.d.com    2.60
    www.e.com    2.13

1. INTRODUCTION
The need for highly scalable parallel data processing platforms is rising due to an explosion in the number of massive-scale data-intensive applications both in industry (e.g., web-data analysis, click-stream analysis, network-monitoring log analysis) and in the sciences (e.g., analysis of data produced by massive-scale simulations, sensor deployments, high-throughput lab equipment).

MapReduce [4] is a well-known framework for programming commodity computer clusters to perform large-scale data processing in a single pass. A MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. Although parallel database systems [5] may also serve these data analysis applications, they can be expensive, difficult to administer, and lack fault-tolerance for long-running queries [16]. Hadoop [7], an open-source MapReduce implementation, has been adopted by Yahoo!, Facebook, and other companies for large-scale data analysis. With the MapReduce framework, programmers can parallelize their applications simply by implementing a map function and a reduce function to transform and aggregate their data, respectively. Many algorithms naturally fit into the MapReduce model, such as word counting, equi-join queries, and inverted list construction [4].

However, many data analysis techniques require iterative computations, including PageRank [15], HITS (Hypertext-Induced Topic Search) [11], recursive relational queries [3], clustering, neural-network analysis, social network analysis, and network traffic analysis. These techniques have a common trait: data are processed iteratively until the computation satisfies a convergence or stopping condition. The MapReduce framework does not directly support these iterative data analysis applications. Instead, programmers must implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution using a driver program [12].

There are two key problems with manually orchestrating an iterative program in MapReduce. The first problem is that even though much of the data may be unchanged from iteration to iteration, the data must be re-loaded and re-processed at each iteration, wasting I/O, network bandwidth, and CPU resources. The second problem is that the termination condition may involve detecting when a fixpoint has been reached, i.e., when the application's output does not change for two consecutive iterations. This condition may itself require an extra MapReduce job on each iteration, again incurring overhead in terms of scheduling extra tasks, reading extra data from disk, and moving data across the network. To illustrate these problems, consider the following two examples.
EXAMPLE 1. (PageRank) PageRank is a link analysis algorithm that assigns weights (ranks) to each vertex in a graph by iteratively computing the weight of each vertex based on the weight of its inbound neighbors. In the relational algebra, the PageRank algorithm can be expressed as a join followed by an update with two aggregations. These steps must be repeated by a driver program until a termination condition is satisfied (e.g., the rank of each page converges or a specified number of iterations has been performed).

Figure 1 shows a concrete example. R0 (Figure 1(a)) is the initial rank table, and L (Figure 1(b)) is the linkage table. Two MapReduce jobs (MR1 and MR2 in Figure 1(c)) are required to implement the loop body of PageRank. The first MapReduce job joins the rank and linkage tables. Mappers emit records from the two relations with the join column as the key and the remaining columns as the value. Reducers compute the join for each unique source URL, as well as the rank contribution for each outbound edge (new_rank). The second MapReduce job computes the aggregate rank of each unique destination URL: the map function is the identity function, and the reducers sum the rank contributions of each incoming edge. In each iteration, Ri is updated to Ri+1. For example, one could obtain R3 (Figure 1(d)) by iteratively computing R1, R2, R3.

In the PageRank algorithm, the linkage table L is invariant across iterations. Because the MapReduce framework is unaware of this property, however, L is processed and shuffled at each iteration. Worse, the invariant linkage data may frequently be larger than the resulting rank table. Finally, determining whether the ranks have converged requires an extra MapReduce job on each iteration.

EXAMPLE 2. (Descendant Query) Given the social network relation in Figure 2(a), who is within two friend-hops from Eric? To answer this query, we can first find Eric's direct friends, and then all the friends of these friends. A related query is to find all people who can be reached from Eric following the friend relation F. These queries can be implemented by a driver program that executes two MapReduce jobs (MR1 and MR2 in Figure 2(b)), either for two iterations or until fixpoint, respectively. The first MapReduce job finds a new generation of friends by joining the friend table F with the friends discovered in the previous iteration, ∆Si. The second MapReduce job removes duplicate tuples from ∆Si that also appear in ∆Sj for j < i. The final result is the union of results from each iteration.

Let ∆Si be the result of the join after iteration i, computed by joining ∆Si−1 with F and removing duplicates. ∆S0 = {(Eric, Eric)} is the trivial friend relationship that initiates the computation. Figure 2(c) shows how results evolve from ∆S0 to ∆S2. Finally, ∆S = ∪_{0<i≤2} ∆Si is returned as the final result, as in Figure 2(d).

Figure 2: Descendant query example.
(a) Friend Table F:
    name1    name2
    Tom      Bob
    Tom      Alice
    Elisa    Tom
    Elisa    Harry
    Sherry   Todd
    Eric     Elisa
    Todd     John
    Robin    Edward
(b) Loop Body:
    MR1:  T1 = ∆Si ⋈_{∆Si.name2 = F.name1} F
          T2 = π_{∆Si.name1, F.name2}(T1)
    MR2:  T3 = ∪_{0≤j≤(i−1)} ∆Sj
          ∆Si+1 = δ(T2 − T3)
(c) Result Generating Trace: Eric (∆S0) reaches Elisa (∆S1), who reaches Tom and Harry (∆S2).
(d) Result Table ∆S:
    name1   name2
    Eric    Elisa
    Eric    Tom
    Eric    Harry

As in the PageRank example, a significant fraction of the data (the friend table F) remains constant throughout the execution of the query, yet still gets processed and shuffled at each iteration.

Many other data analysis applications have characteristics similar to the above two examples: a significant fraction of the processed data remains invariant across iterations, and the analysis should typically continue until a fixpoint is reached. Examples include most iterative model-fitting algorithms (such as k-means clustering and neural network analysis), most web/graph ranking algorithms (such as HITS [11]), and recursive graph or network queries.

This paper presents a new system called HaLoop that is designed to efficiently handle the above types of applications. HaLoop extends MapReduce and is based on two simple intuitions. First, a MapReduce cluster can cache the invariant data in the first iteration, and then reuse them in later iterations. Second, a MapReduce cluster can cache reducer outputs, which makes checking for a fixpoint more efficient, without an extra MapReduce job.

This paper makes the following contributions:
• New Programming Model and Architecture for Iterative Programs: HaLoop handles loop control that would otherwise have to be manually programmed. It offers a programming interface to express iterative data analysis applications (Section 2).
• Loop-Aware Task Scheduling: HaLoop's task scheduler enables data reuse across iterations, by physically co-locating tasks that process the same data in different iterations (Section 3).
• Caching for Loop-Invariant Data: HaLoop caches and indexes data that are invariant across iterations in cluster nodes during the first iteration of an application. Caching the invariant data reduces the I/O cost for loading and shuffling them in subsequent iterations (Section 4.1 and Section 4.3).
• Caching to Support Fixpoint Evaluation: HaLoop caches and indexes a reducer's local output. This avoids the need for a dedicated map-reduce step for fixpoint or convergence checking (Section 4.2).
• Experimental Study: We evaluated our system on iterative programs that process both synthetic and real-world datasets. HaLoop outperforms Hadoop in all metrics; on average, HaLoop reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers (Section 5).
2. HALOOP OVERVIEW
This section introduces HaLoop's architecture and its application programming model.

2.1 Architecture
Figure 3 illustrates the architecture of HaLoop, a modified version of the open source MapReduce implementation Hadoop [7]. HaLoop inherits the basic distributed computing model and architecture of Hadoop. HaLoop relies on a distributed file system (HDFS [8]) that stores each job's input and output data. The system is divided into two parts: one master node and many slave nodes. A client submits jobs to the master node. For each submitted job, the master node schedules a number of parallel tasks to run on slave nodes. Every slave node has a task tracker daemon process to communicate with the master node and manage each task's execution. Each task is either a map task (which usually performs transformations on an input data partition, and calls a user-defined map function with one ⟨key, value⟩ pair each time) or a reduce task (which usually copies the corresponding partition of mapper output, groups the input keys, and invokes a user-defined reduce function with one key and its associated values each time). For example, in Figure 3, there are three jobs running in the system: job 1, job 2, and job 3. Each job has three tasks running concurrently on slave nodes.

Figure 3: The HaLoop framework, a variant of the Hadoop MapReduce framework. (Diagram: the master node's task queue schedules the tasks of three jobs onto slave nodes; the legend marks components as identical to Hadoop, modified from Hadoop, or new in HaLoop, and distinguishes local from remote communication.)

In order to accommodate the requirements of iterative data analysis applications, we made several changes to the basic Hadoop MapReduce framework. First, HaLoop exposes a new application programming interface to users that simplifies the expression of iterative MapReduce programs (Section 2.2). Second, HaLoop's master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body, until a user-specified stopping condition is met (Section 2.2). Third, HaLoop uses a new task scheduler for iterative applications that leverages data locality in these applications (Section 3). Fourth, HaLoop caches and indexes application data on slave nodes (Section 4). As shown in Figure 3, HaLoop relies on the same file system and has the same task queue structure as Hadoop, but the task scheduler and task tracker modules are modified, and the loop control, caching, and indexing modules are new. The task tracker not only manages task execution, but also manages caches and indices on the slave node, and redirects each task's cache and index accesses to the local file system.

2.2 Programming Model
The PageRank and descendant query examples are representative of the types of iterative programs that HaLoop supports. Here, we present the general form of the recursive programs we support and a detailed API.
The iterative programs that HaLoop supports can be distilled into the following core construct:

    Ri+1 = R0 ∪ (Ri ⋈ L)

where R0 is an initial result and L is an invariant relation. A program in this form terminates when a fixpoint is reached, that is, when the result does not change from one iteration to the next, i.e., Ri+1 = Ri. This formulation is sufficient to express a broad class of recursive programs.¹

¹ SQL (ANSI SQL 2003, ISO/IEC 9075-2:2003) queries using the WITH clause can also express a variety of iterative applications, including complex analytics that are not typically implemented in SQL such as k-means and PageRank; see Section 9.5.
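As a worked illustration of how an application maps onto this template, the descendant query of Example 2 can be read, loosely speaking (ignoring MR2's duplicate elimination), as an instance with L = F; the operator subscripts below only spell out the correspondence and are not notation used elsewhere in this paper:

    R_{i+1} \;=\; R_0 \,\cup\, \pi_{R_i.name1,\;F.name2}\bigl(R_i \bowtie_{R_i.name2 = F.name1} F\bigr),
    \qquad R_0 = \Delta S_0 = \{(\mathrm{Eric},\,\mathrm{Eric})\}

PageRank follows the same shape with L the linkage table, except that the join of the template stands for the two-job loop body of Figure 1(c).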
A fixpoint is typically defined by exact equality between iterations, but HaLoop also supports the concept of an approximate fixpoint, where the computation terminates when either the difference between two consecutive iterations is less than a user-specified threshold, or the maximum number of iterations has been reached. Both kinds of approximate fixpoints are useful for expressing convergence conditions in machine learning and complex analytics. For example, for PageRank, it is common to either use a user-specified convergence threshold ε [15] or a fixed number of iterations as the loop termination condition.

Although our recursive formulation describes the class of iterative programs we intend to support, this work does not develop a high-level declarative language for expressing recursive queries. Rather, we focus on providing an efficient foundation API for iterative MapReduce programs; we posit that a variety of high-level languages (e.g., Datalog) could be implemented on this foundation.

To write a HaLoop program, a programmer specifies the loop body (as one or more map-reduce pairs) and optionally specifies a termination condition and loop-invariant data. We now discuss HaLoop's API (see Figure 16 in the appendix for a summary). Map and Reduce are similar to standard MapReduce and are required; the rest of the API is new and is optional.
To specify the loop body, the programmer constructs a multi-step MapReduce job, using the following functions:
• Map transforms an input ⟨key, value⟩ tuple into intermediate ⟨in_key, in_value⟩ tuples.
• Reduce processes intermediate tuples sharing the same in_key, to produce ⟨out_key, out_value⟩ tuples. The interface contains a new parameter for cached invariant values associated with the in_key.
• AddMap and AddReduce express a loop body that consists of more than one MapReduce step. AddMap (AddReduce) associates a Map (Reduce) function with an integer indicating the order of the step.
HaLoop defaults to testing for equality from one iteration to the next to determine when to terminate the computation. To specify an approximate fixpoint termination condition, the programmer uses the following functions.
• SetFixedPointThreshold sets a bound on the distance between one iteration and the next. If the threshold is exceeded, then the approximate fixpoint has not yet been reached, and the computation continues.
• The ResultDistance function calculates the distance between two out_value sets sharing the same out_key. One out_value set vi is from the reducer output of the current iteration, and the other out_value set vi−1 is from the previous iteration's reducer output. The distance between the reducer outputs of the current iteration i and the last iteration i − 1 is the sum of ResultDistance over every key. (It is straightforward to support additional aggregations besides sum.) A minimal sketch of such a distance function follows this list.
• SetMaxNumOfIterations provides further control of the loop termination condition. HaLoop terminates a job if the maximum number of iterations has been executed, regardless of the distance between the current and previous iterations' outputs. SetMaxNumOfIterations can also be used to implement a simple for-loop.
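To illustrate the ResultDistance hook described above, the following is a minimal sketch of a per-key distance for a PageRank-style job. It assumes a single numeric out_value per key and uses hypothetical Java types; the paper's own pseudocode version appears as ResultDistance in Figure 17 of the appendix, and the exact interface signatures are not reproduced here.

    import java.util.List;

    public class RankDistance {
        // vPrev holds the out_values produced for this out_key in iteration i-1,
        // vCurr holds the out_values produced for the same out_key in iteration i.
        // For PageRank each list contains exactly one rank value, so the per-key
        // distance is simply the absolute rank difference; HaLoop sums this over
        // all keys and compares the total with the SetFixedPointThreshold bound.
        public static float resultDistance(List<Float> vPrev, List<Float> vCurr) {
            return Math.abs(vCurr.get(0) - vPrev.get(0));
        }
    }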
To specify and control inputs, the programmer uses:
• SetIterationInput associates an input source with a specific iteration, since the input files to different iterations may be different. For example, in Example 1, at each iteration i + 1, the input is Ri ∪ L.
• AddStepInput associates an additional input source with an intermediate map-reduce pair in the loop body. The output of the preceding map-reduce pair is always part of the input of the next map-reduce pair.
• AddInvariantTable specifies an input table (an HDFS file) that is loop-invariant. During job execution, HaLoop will cache this table on cluster nodes.

This programming interface is sufficient to express a variety of iterative applications. The appendix sketches the implementation of PageRank (Section 9.2), the descendant query (Section 9.3), and k-means (Section 9.4) using this programming interface. Figure 4 shows the difference between HaLoop and Hadoop, from the application's perspective: in HaLoop, a user program specifies loop settings and the framework controls the loop execution, but in Hadoop, it is the application's responsibility to control the loops.

Figure 4: Boundary between an iterative application and the framework (HaLoop vs. Hadoop). HaLoop knows and controls the loop, while Hadoop only knows jobs with one map-reduce pair. (Diagram: with HaLoop, the application submits map and reduce functions plus a stop condition, and the framework evaluates the stop condition itself; with Hadoop, the application submits one job per iteration and checks the stop condition in its own driver.)
3. LOOP-AWARE TASK SCHEDULING
This section introduces the HaLoop task scheduler. The scheduler provides potentially better schedules for iterative programs than Hadoop's scheduler. Sections 3.1 and 3.2 illustrate the desired schedules and the scheduling algorithm, respectively.

3.1 Inter-Iteration Locality
The high-level goal of HaLoop's scheduler is to place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data. With this approach, data can more easily be cached and re-used between iterations. For example, Figure 5 is a sample schedule for the join step (MR1 in Figure 1(c)) of the PageRank application from Example 1. There are two iterations and three slave nodes involved in the job.

Figure 5: A schedule exhibiting inter-iteration locality. Tasks processing the same inputs on consecutive iterations are scheduled to the same physical nodes. (Diagram: mappers for L-split0, L-split1, and R0-split0 and reducers for partitions 0, 1, and 2 run on nodes n1, n2, and n3 in iteration 1, and the corresponding iteration-2 tasks are placed on the same nodes; re-computing and re-shuffling L in iteration 2 is marked as unnecessary computation and unnecessary communication.)

The scheduling of iteration 1 is no different than in Hadoop. In the join step of the first iteration, the input tables are L and R0. Three map tasks are executed, each of which loads a part of one or the other input data file (a.k.a., a file split). As in ordinary Hadoop, the mapper output key (the join attribute in this example) is hashed to determine the reduce task to which it should be assigned. Then, three reduce tasks are executed, each of which loads a partition of the collective mapper output. In Figure 5, reducer R00 processes mapper output keys whose hash value is 0, reducer R10 processes keys with hash value 1, and reducer R20 processes keys with hash value 2.

The scheduling of the join step of iteration 2 can take advantage of inter-iteration locality: the task (either mapper or reducer) that processes a specific data partition D is scheduled on the physical node where D was processed in iteration 1. Note that the two file inputs to the join step in iteration 2 are L and R1.

The schedule in Figure 5 makes it possible to reuse loop-invariant data from past iterations. Because L is loop-invariant, mappers M01 and M11 would compute identical results to M00 and M10. There is no need to re-compute these mapper outputs, nor to communicate them to the reducers. In iteration 1, if reducer input partitions 0, 1, and 2 are stored on nodes n3, n1, and n2 respectively, then in iteration 2, L need not be loaded, processed, or shuffled again. In that case, in iteration 2, only one mapper M21 for R1-split0 needs to be launched, and thus the three reducers will only copy intermediate data from M21. With this strategy, the reducer input is no different, but it now comes from two sources: the output of the mappers (as usual) and the local disk.

We refer to the property of the schedule in Figure 5 as inter-iteration locality. Let d be a file split (mapper input partition) or a reducer input partition², and let T_d^i be a task consuming d in iteration i. Then we say that a schedule exhibits inter-iteration locality if for all i > 1, T_d^i and T_d^{i−1} are assigned to the same physical node if T_d^{i−1} exists.
² Mapper input partitions are represented by an input file URL plus an offset and length; reducer input partitions are represented by an integer hash value. Two partitions are assumed to be equal if their representations are equal.

The goal of task scheduling in HaLoop is to achieve inter-iteration locality. To achieve this goal, the only restriction is that the number of reduce tasks must be invariant across iterations, so that the hash function assigning mapper outputs to reducer nodes remains unchanged.

3.2 Scheduling Algorithm
HaLoop's scheduler keeps track of the data partitions processed by each map and reduce task on each physical machine, and it uses that information to schedule subsequent tasks taking inter-iteration locality into account.
More specifically, the HaLoop scheduler works as follows. Upon receiving a heartbeat from a slave node, the master node tries to assign the slave node an unassigned task that uses data cached on that node. To support this assignment, the master node maintains a mapping from each slave node to the data partitions that this node processed in the previous iteration. If the slave node already has a full load, the master re-assigns its tasks to a nearby slave node.
Figure 6 gives pseudocode for the scheduling algorithm. Before each iteration, previous is set to current, and then current is set to a new empty HashMap object. In a job's first iteration, the schedule is exactly the same as that produced by Hadoop (line 2). After scheduling, the master remembers the association between data and node (lines 3 and 13). In later iterations, the scheduler tries to retain previous data-node associations (lines 11 and 12). If the associations can no longer hold due to the load, the master node will associate the data with another node (lines 6–8).
Task Scheduling
Input: Node node
// The current iteration's schedule; initially empty
Global variable: Map⟨Node, List⟨Partition⟩⟩ current
// The previous iteration's schedule
Global variable: Map⟨Node, List⟨Partition⟩⟩ previous
 1: if iteration == 0 then
 2:   Partition part = hadoopSchedule(node);
 3:   current.get(node).add(part);
 4: else
 5:   if node.hasFullLoad() then
 6:     Node substitution = findNearestIdleNode(node);
 7:     previous.get(substitution).addAll(previous.remove(node));
 8:     return;
 9:   end if
10:   if previous.get(node).size() > 0 then
11:     Partition part = previous.get(node).get(0);
12:     schedule(part, node);
13:     current.get(node).add(part);
14:     previous.get(node).remove(part);
15:   end if
16: end if
Figure 6: Task scheduling algorithm. If there are running jobs, this function is called when the master node receives a heartbeat from a slave.

4. CACHING AND INDEXING
Thanks to the inter-iteration locality offered by the task scheduler, access to a particular loop-invariant data partition is usually only needed by one physical node. To reduce I/O cost, HaLoop caches those data partitions on the physical node's local disk for subsequent re-use. To further accelerate processing, it indexes the cached data. If a cache becomes unavailable, it is automatically re-loaded, either from map task physical nodes, or from HDFS. HaLoop maintains three types of caches: the reducer input cache, the reducer output cache, and the mapper input cache. Each of them fits a number of application scenarios. Application programmers can choose to enable or disable a cache type using the HaLoop API (see Appendix 9.1).

4.1 Reducer Input Cache
If an intermediate table is specified to be loop-invariant (via the HaLoop API AddInvariantTable) and the reducer input cache is enabled, HaLoop will cache reducer inputs across all reducers and create a local index for the cached data. Note that reducer inputs are cached before each reduce function invocation, so that tuples in the reducer input cache are sorted and grouped by reducer input key.

Let us consider the social network example (Example 2) to see how the reducer input cache works. Three physical nodes n1, n2, and n3 are involved in the job, and the number of reducers is set to 2. In the join step of the first iteration, there are three mappers: one processes F-split0, one processes F-split1, and one processes ∆S0-split0. The three splits are shown in Figure 7. The two reducer input partitions are shown in Figure 8. The reducer on n1 corresponds to hash value 0, while the reducer on n2 corresponds to hash value 1. Then, since table F (with table ID "#1") is set to be invariant by the programmer using the AddInvariantTable function, every reducer will cache the tuples with table ID "#1" in its local file system.

Figure 7: Mapper Input Splits in Example 2.
(a) F-split0:
    name1   name2
    Tom     Bob
    Tom     Alice
    Elisa   Tom
    Elisa   Harry
(b) F-split1:
    name1   name2
    Sherry  Todd
    Eric    Elisa
    Todd    John
    Robin   Edward
(c) ∆S0-split0:
    name1   name2
    Eric    Eric

Figure 8: Reducer Input Partitions in Example 2.
(a) partition 0:
    name1   name2    table ID
    Elisa   Tom      #1
    Eric    Elisa    #1
    Eric    Eric     #2
    Tom     Bob      #1
    Tom     Alice    #1
(b) partition 1:
    name1   name2    table ID
    Elisa   Harry    #1
    Robin   Edward   #1
    Sherry  Todd     #1
    Todd    John     #1

In later iterations, when a reducer passes a shuffled key with its associated values to the user-defined Reduce function, it also searches for the key in the local reducer input cache to find associated values and passes them together to the Reduce function (note that HaLoop's modified Reduce interface accepts this parameter; see details in Appendix 9.1). Also, if the reducer input cache is enabled, mapper outputs in the first iteration are cached in the corresponding mapper's local disk, for future reducer cache reloading.

In the physical layout of the cache, keys and values are separated into two files, and each key has an associated pointer to its corresponding values. Sometimes the selectivity in the cached loop-invariant data is low. Thus, after reducer input data are cached to local disk, HaLoop creates an index over the keys and stores it in the local file system too. Since the reducer input cache is sorted and then accessed by reducer input key in the same sorted order, the disk seek operations are only conducted in a forward manner, and in the worst case, in each iteration, the input cache is sequentially scanned from local disk only once.

The reducer input cache is suitable for PageRank, HITS, various recursive relational queries, and any other algorithm with repeated joins against large invariant data. The reducer input cache requires that the partition function f for every mapper output tuple t satisfies three conditions: (1) f must be deterministic, (2) f must remain the same across iterations, and (3) f must not take any inputs other than the tuple t. In HaLoop, the number of reduce tasks is unchanged across iterations, so the default hash partitioning satisfies these conditions.
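As a point of reference, hash partitioning of the mapper output key, the default behavior in Hadoop-style systems, meets all three conditions. The sketch below uses Hadoop's old-style Partitioner interface (the interface itself is Hadoop's; the class name and the reading that this is what the paragraph above relies on are ours):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // A deterministic hash partitioner in the style of Hadoop's default HashPartitioner.
    // It satisfies the three conditions above: it is deterministic, it does not change
    // across iterations (as long as the number of reduce tasks is fixed), and it looks
    // only at the tuple being partitioned.
    public class UrlHashPartitioner implements Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        @Override
        public void configure(JobConf job) { /* no per-iteration state is read here */ }
    }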
4.2 Reducer Output Cache
The reducer output cache stores and indexes the most recent local output on each reducer node. This cache is used to reduce the cost of evaluating fixpoint termination conditions. That is, if the application must test the convergence condition by comparing the current iteration output with the previous iteration output, the reducer output cache enables the framework to perform the comparison in a distributed fashion.

The reducer output cache is used in applications where fixpoint evaluation should be conducted after each iteration. For example, in PageRank, a user may set a convergence condition specifying that the total rank difference from one iteration to the next is below a given threshold. With the reducer output cache, the fixpoint can be evaluated in a distributed manner without requiring a separate MapReduce step. After all Reduce function invocations are done, each reducer evaluates the fixpoint condition within the reduce process and reports local evaluation results to the master node, which computes the final answer.

The reducer output cache requires that in the last map-reduce pair of the loop body, the mapper output partition function f and the reduce function satisfy the following conditions: if (ko1, vo1) ∈ reduce(ki, Vi), (ko2, vo2) ∈ reduce(kj, Vj), and ko1 = ko2, then f(ki) = f(kj). That is, if two Reduce function calls produce the same output key from two different reducer input keys, both reducer input keys must be in the same partition so that they are sent to the same reduce task. Further, f should also meet the requirements of the reducer input cache. Satisfying these requirements guarantees that reducer output tuples in different iterations but with the same output key are produced on the same physical node, which ensures the usefulness of the reducer output cache and the correctness of the local fixpoint evaluation. Our PageRank, descendant query, and k-means clustering implementations on HaLoop all satisfy these conditions.
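To make the distributed evaluation concrete, the sketch below shows the shape of the per-reducer check. It is an assumption-laden illustration (the in-memory maps standing in for the cached output, the first-seen-key convention, and the class name are hypothetical), not HaLoop's actual code, which modifies Hadoop's ReduceTask and TaskTracker as described in Appendix 9.1.2.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: local fixpoint evaluation inside one reducer.
    // After the last reduce step of iteration i, each reducer compares its freshly
    // written output against its reducer output cache from iteration i-1 and reports
    // only a partial distance to the master, which sums the partials and compares
    // the total with the SetFixedPointThreshold bound.
    public class LocalFixpointCheck {
        public static float localDistance(Map<String, List<Float>> currentOutput,
                                          Map<String, List<Float>> cachedPreviousOutput) {
            float partial = 0f;
            for (Map.Entry<String, List<Float>> e : currentOutput.entrySet()) {
                List<Float> prev = cachedPreviousOutput.get(e.getKey());
                if (prev != null) {
                    // Per-key distance as given by the user's ResultDistance function;
                    // the PageRank-style absolute rank difference is assumed here.
                    partial += Math.abs(e.getValue().get(0) - prev.get(0));
                } else {
                    // One possible convention: a key seen for the first time
                    // contributes its full value to the distance.
                    partial += Math.abs(e.getValue().get(0));
                }
            }
            return partial;   // reported to the master, which aggregates across reducers
        }
    }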
4.3 Mapper Input Cache
Hadoop [7] attempts to co-locate map tasks with their input data. On a real-world Hadoop cluster [1], the rate of data-local mappers is around 70%–95%, depending on the runtime environment. HaLoop's mapper input cache aims to avoid non-local data reads in mappers during non-initial iterations. In the first iteration, if a mapper performs a non-local read on an input split, the split will be cached in the local disk of the mapper's physical node. Then, with loop-aware task scheduling, in later iterations, all mappers read data only from local disks, either from HDFS or from the local file system. The mapper input cache can be used by model-fitting applications such as k-means clustering, neural network analysis, and any other iterative algorithm whose mapper inputs do not change across iterations.

4.4 Cache Reloading
There are a few cases where the cache must be re-constructed: (1) the hosting node fails, or (2) the hosting node has a full load and a map or reduce task must be scheduled on a different substitution node. A reducer reconstructs the reducer input cache by copying the desired partition from all first-iteration mapper outputs. To reload the mapper input cache or the reducer output cache, the mapper/reducer only needs to read the corresponding chunks from the distributed file system, where replicas of the cached data are stored. Cache re-loading is completely transparent to user programs.

5. EXPERIMENTAL EVALUATION
We compared the performance of iterative data analysis applications on HaLoop and Hadoop. Since use of the reducer input cache, reducer output cache, and mapper input cache are independent options, we evaluated them separately in Sections 5.1–5.3.

5.1 Evaluation of Reducer Input Cache
This suite of experiments used virtual machine clusters of 50 and 90 slave nodes in Amazon's Elastic Compute Cloud (EC2). There is always one master node. The applications were PageRank and the descendant query. Both are implemented in both HaLoop (using our new programming model) and Hadoop (using the traditional driver approach).

We used both semi-synthetic and real-world datasets: Livejournal (18GB, social network data), Triples (120GB, semantic web data), and Freebase (12GB, concept linkage graph). Detailed hardware and dataset descriptions are in Section 9.6.

We executed the PageRank query on the Livejournal and Freebase datasets and the descendant query on the Livejournal and Triples datasets. Figures 9–12 show the results for Hadoop and HaLoop. The number of reduce tasks is set to the number of slave nodes. The performance with fail-overs has not been quantified; all experimental results are obtained without any node failures.

Overall, as the figures show, for a 10-iteration job, HaLoop lowers the runtime by a factor of 1.85 on average when the reducer input cache is used. As we discuss later, the reducer output cache creates an additional gap between Hadoop and HaLoop, but the impact is less significant on overall runtime. We now present these results in more detail.

Overall Run Time. In this experiment, we used SetMaxNumOfIterations, rather than fixedPointThreshold and ResultDistance, to specify the loop termination condition. The results are plotted in Figure 9(a), Figure 10(a), Figure 11(a), and Figure 12(a).
In the PageRank algorithm, there are two steps in every iteration: join and aggregation. The running time in Figure 9(a) and Figure 10(a) is the sum of join time and aggregation time over all iterations. In the descendant query algorithm, there are also two steps per iteration: join and duplicate elimination. The running time in Figure 11(a) and Figure 12(a) is the sum of join time and duplicate elimination time over all iterations.
HaLoop always performs better than Hadoop. The descendant query on the Triples dataset has the best improvement, PageRank on Livejournal and Freebase have intermediate gains, but the descendant query on the Livejournal dataset has the least improvement. Livejournal is a social network dataset with high fan-out and reachability. As a result, the descendant query in later iterations (>3) produces so many duplicates that duplicate elimination dominates the cost, and HaLoop's caching mechanism does not significantly reduce overall runtime. In contrast, the Triples dataset is less connected, so the join step is the dominant cost and the cache is crucial.

Join Step Run Time. HaLoop's task scheduling and reducer input cache potentially reduce join step time, but do not reduce the cost of the duplicate elimination step for the descendant query, nor the final aggregation step in PageRank. Thus, to partially explain why overall job running time is shorter with HaLoop, we compare the performance of the join step in each iteration. Figure 9(b), Figure 10(b), Figure 11(b), and Figure 12(b) plot join time in each iteration. HaLoop significantly outperforms Hadoop. In the first iteration, HaLoop is slower than Hadoop, as shown in (a) and (b) of all four figures. The reason is that HaLoop performs additional work in the first iteration: HaLoop caches the sorted and grouped data on each reducer's local disks, creates an index for the cached data, and stores the index to disk. That is, in the first iteration, HaLoop does the exact same thing as Hadoop, but also writes caches to local disk.

Cost Distribution for Join Step. To better understand HaLoop's improvements to each phase, we compared the cost distribution of the join step across Map and Reduce phases. Figure 9(c), Figure 10(c), Figure 11(c), and Figure 12(c) show the cost distribution of the join step in a certain iteration (here it is iteration 3). The measurement is time spent on each phase. In both HaLoop and Hadoop, reducers start to copy data immediately after the first mapper completes. "Shuffle time" is normally the time between reducers starting to copy map output data and reducers starting to sort copied data; shuffling is concurrent with the rest of the unfinished mappers. The first completed mapper's running time in the two algorithms is very short, e.g., 1–5 seconds to read data from one 64MB HDFS block. If we were to plot the first mapper's running time as "map phase", the duration would be too brief to be visible compared to the shuffle phase and reduce phase. Therefore we let the "shuffle time" in the plots be the usual shuffle time plus the first completed mapper's running time. The "reduce time" in the plots is the total time a reducer spends after the shuffle phase, including sorting and grouping, as well as accumulated Reduce function call time. Note that in the plots, "shuffle time" plus "reduce time" constitutes what we have referred to as the "join step". Considering all four plots, we conclude that HaLoop outperforms Hadoop in both phases.
(Figures 9–12 plot running time in seconds and shuffled data in bytes against iteration number, comparing HaLoop and Hadoop.)
Figure 9: PageRank Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.
Figure 10: PageRank Performance: HaLoop vs. Hadoop (Freebase Dataset, 90 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.
Figure 11: Descendant Query Performance: HaLoop vs. Hadoop (Triples Dataset, 90 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.
Figure 12: Descendant Query Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.

The "reduce" bar is not visible in Figure 11(c), although it is present: the reduce time is not 0, but rather very short compared to the shuffle bar. The reduce phase takes advantage of the index HaLoop creates for the cached data, so the join between ∆Si and F uses an index seek to find qualifying tuples in the cache of F. Also, in each iteration there are few new records produced, so the join's selectivity on F is very low and the cost becomes negligible. By contrast, for PageRank, the index does not help much, because the selectivity is high. For the descendant query on Livejournal (Figure 12), in iterations beyond 3 the index does not help either, because the selectivity becomes high.

I/O in Shuffle Phase of Join Step. To tell how much shuffling I/O is saved, we compared the amount of shuffled data in the join step of each iteration. Since HaLoop caches loop-invariant data, the overhead of shuffling these invariant data is completely avoided. These savings contribute an important part of the overall performance improvement. Figure 9(d), Figure 10(d), Figure 11(d), and Figure 12(d) plot the sizes of shuffled data. On average, HaLoop's join step shuffles only 4% as much data as Hadoop's does.

5.2 Evaluation of Reducer Output Cache
This experiment shares the same hardware and datasets as the reducer input cache experiments. To see how effective HaLoop's reducer output cache is, we compared the cost of fixpoint evaluation in each iteration. Since the descendant query has a trivial fixpoint evaluation step that only requires testing whether a file is empty, we run the PageRank implementation in Section 9.2 on Livejournal and Freebase. In the Hadoop implementation, the fixpoint evaluation is implemented by an extra MapReduce job. On average, compared with Hadoop, HaLoop reduces the cost of this step to 40%, by taking advantage of the reducer output cache and a built-in distributed fixpoint evaluation. Figures 13(a) and (b) show the time spent on fixpoint evaluation in each iteration.

5.3 Evaluation of Mapper Input Cache
Since the mapper input cache aims to reduce data transportation between slave nodes, but we do not know the disk I/O implementations of EC2 virtual machines, this suite of experiments uses an 8-node physical machine cluster. PageRank and the descendant query cannot utilize the mapper input cache because their inputs change from iteration to iteration. Thus, the application used in the evaluation is the k-means clustering algorithm. We used two real-world astronomy datasets (multi-dimensional tuples): cosmo-dark (46GB) and cosmo-gas (54GB). Detailed hardware and dataset descriptions are in Section 9.6. We vary the number of total iterations and plot the algorithm running time in Figure 14. The mapper locality rate is around 95% since there are no concurrent jobs in our lab HaLoop cluster. By avoiding non-local data loading, HaLoop performs marginally better than Hadoop.
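For completeness, enabling this cache in a k-means-style job is a one-line switch in the API summarized in Figure 16. The fragment below is a schematic sketch in the driver style of Figure 17; the step and input class names are assumptions (the paper's k-means implementation is in Section 9.4, not reproduced in this excerpt).

    // Schematic k-means driver fragment; only the cache-related calls matter here.
    // The mapper input (the point dataset) is identical in every iteration, so the
    // mapper input cache can serve it from local disk after iteration 1.
    Job job = new Job();
    job.AddMap(new MapAssignToNearestCenter(), 1);   // assumed user-defined step names
    job.AddReduce(new ReduceRecomputeCenters(), 1);
    job.SetInput(new SameFileEveryIteration());      // mapper input does not change
    job.SetMapperInputCache(true);                   // avoid non-local reads after iteration 1
    job.SetMaxNumOfIterations(12);
    job.Submit();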
Figure 13: Fixpoint Evaluation Overhead in PageRank: HaLoop vs. Hadoop. Panels: (a) Livejournal, 50 nodes; (b) Freebase, 90 nodes.
Figure 14: Performance of k-means: HaLoop vs. Hadoop. Panels: (a) Cosmo-dark, 8 nodes; (b) Cosmo-gas, 8 nodes.

6. RELATED WORK
Parallel database systems [5] partition data storage and parallelize query workloads to achieve better performance. However, they are sensitive to failures and have not been shown to scale to thousands of nodes. Various optimization techniques for evaluating recursive queries have been proposed in the literature [3, 17]. The existing work has not been shown to operate at large scale. Further, most of these techniques are orthogonal to our research; we provide a low-level foundation for implementing data-intensive iterative programs.
More recently, MapReduce [4] has emerged as a popular alternative for massive-scale parallel data analysis in shared-nothing clusters. Hadoop [7] is an open-source implementation of MapReduce. MapReduce has been followed by a series of related systems including Dryad [10], Hive [9], Pig [14], and HadoopDB [2]. Like Hadoop, none of these systems provides explicit support and optimizations for iterative or recursive types of analysis.
Mahout [12] is a project whose goal is to build a set of scalable machine learning libraries on top of Hadoop. Since most machine learning algorithms are model-fitting applications, nearly all of them involve iterative programs. Mahout uses an outside driver program to control the loops, and new MapReduce jobs are launched in each iteration. The drawback of this approach has been discussed in Section 1. Like Mahout, we are trying to help iterative data analysis algorithms work on scalable architectures, but we differ in that we modify the fundamental system: we inject the iterative capability into the MapReduce engine itself.
Twister [6] is a stream-based MapReduce framework that supports iterative programs, in which mappers and reducers are long-running with distributed memory caches. They are established to avoid repeated mapper data loading from disks. However, Twister's streaming architecture between mappers and reducers is sensitive to failures, and long-running mappers/reducers plus memory caches are not a scalable solution for commodity machine clusters, where each node has limited memory and resources.
Finally, Pregel [13] is a distributed system for processing large-size graph datasets, but it does not support general iterative programs.

7. CONCLUSION AND FUTURE WORK
This paper presents the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations that include (1) a loop-aware task scheduler, (2) loop-invariant data caching, and (3) caching for efficient fixpoint verification. We evaluated our HaLoop prototype on several large datasets and iterative queries. Our results demonstrate that pushing support for iterative programs into the MapReduce engine greatly improves the overall performance of iterative data analysis applications. In future work, we would like to implement a simplified Datalog evaluation engine on top of HaLoop, to enable large-scale iterative data analysis programmed in a declarative way.

Acknowledgements
The HaLoop project is partially supported by NSF CluE grants IIS-0844572 and IIS-0844580, NSF CAREER Award IIS-0845397, NSF grant CNS-0855252, Woods Hole Oceanographic Institute Grant OCE-0418967, Amazon, the University of Washington eScience Institute, and the Yahoo! Key Scientific Challenges program. We thank Michael J. Carey, Rares Vernica, Vinayak R. Borkar, Hongbo Deng, Congle Zhang, and the anonymous reviewers for their suggestions and comments.

8. REFERENCES
[1] http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm. Accessed July 7, 2010.
[2] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB, 2(1):922–933, 2009.
[3] François Bancilhon and Raghu Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD Conference, pages 16–52, 1986.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[5] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.
[6] Jaliya Ekanayake and Shrideep Pallickara. MapReduce for data intensive scientific analysis. In IEEE eScience, pages 277–284, 2008.
[7] Hadoop. http://hadoop.apache.org/. Accessed July 7, 2010.
[8] HDFS. http://hadoop.apache.org/common/docs/current/hdfs_design.html. Accessed July 7, 2010.
[9] Hive. http://hadoop.apache.org/hive/. Accessed July 7, 2010.
[10] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59–72, 2007.
[11] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[12] Mahout. http://lucene.apache.org/mahout/. Accessed July 7, 2010.
[13] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.
[14] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099–1110, 2008.
[15] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.
[16] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165–178, 2009.
[17] Weining Zhang, Ke Wang, and Siu-Cheung Chau. Data partition and parallel evaluation of datalog programs. IEEE Trans. Knowl. Data Eng., 7(1):163–176, 1995.
9. APPENDIX runjobComputeDistance();
while(! isFixedPoint() &&
This appendix presents additional implementation details for the {
! exceedMaxIterations())
in Hadoop:
HaLoop system and our sample applications, experiment setup de- Job Client kickOffJobForNewIteration();
…}
tails, and a discussion.
aggregateDistance();
9.1 HaLoop Implementation Details TaskScheduler in HaLoop:
while(! isFixedPoint() &&
!exceedMaxIterations())
{
We first provide some additional details about HaLoop’s exten- kickOffNewIteration();
….}
sions of Hadoop.

9.1.1 Background on Hadoop


TaskTracker TaskTracker TaskTracker
In Hadoop, client programs must implement the fixpoint evalua-
tion on their own, either in a centralized way or by an extra MapRe-
Map Reduce Map Reduce Map Reduce
duce job. They must also decide when to launch a new MapReduce Task Task Task Task Task Task
job. The Mahout [12] project has implemented multiple iterative
machine learning and data mining algorithms with this approach.
Figure 15 demonstrates how an iterative program is executed in Figure 15: Job Execution: HaLoop V.s. Hadoop
Hadoop. It also shows how the following classes fit together in the Name Functionality
Hadoop system. AddMap & AddReduce specify a step in loop
Hadoop master node. In Hadoop, interface TaskScheduler SetDistanceMeasure specify a distance for results
SetInput specify inputs to iterations
and class JobInProgress play the role of master node: they ac- AddInvariantTable specify loop-invariant data
cept heartbeats from slave nodes and manage task scheduling. SetFixedPointThreshold a loop termination condition
Hadoop slave nodes. Class TaskTracker is a daemon process SetMaxNumOfIterations specify the max iterations
SetReducerInputCache enable/disable reducer input caches
on every slave node. It sends heartbeats to the master node includ- SetReducerOutputCache enable/disable reducer output caches
ing information about completed tasks. It receives task execution SetMapperInputCache enable/disable mapper input caches
commands from the master node.
User-defined map and reduce functions. Class MapTask and Figure 16: HaLoop API
ReduceTask are containers for user-defined Mapper and Reducer
classes. These wrapper classes load, preprocess and pass data to After the final reduce phase of an iteration, ReduceTask com-
user code. Once a TaskTracker gets task execution commands putes the sum of the user-defined distances between the current
from the TaskScheduler, it kicks off a process to start a MapTask output and that of the previous iteration by executing the user-
or ReduceTask thread. defined distance function. Then, the host TaskTracker sends
the aggregated value back to JobInProgress. JobInProgress
9.1.2 HaLoop Extensions to Hadoop computes the sum of the locally pre-aggregated distance values
We extended and modified Hadoop as follows: returned by each TaskTracker and compares the overall dis-
Hadoop master node: loop control and new API. We im- tance value with fixedPointThreshold. If the distance is
plemented HaLoop’s loop control and task scheduler by im- less than fixedPointThreshold or current iteration number
plementing our own TaskScheduler and modifying the class is already maxNumOfIterations, JobInProgress will raise a
JobInProgress. “job complete” event to terminate the job execution. Otherwise,
Additionally, HaLoop provides an extended API to facilitate JobInProgress will put a number of tasks in its task queue to
client programming, with functions to set up the loop body, as- start a new iteration. Figure 15 also shows how HaLoop executes
sociate the input files with each iteration, specify a loop termina- a job. In particular, we see that the TaskScheduler manages the
tion condition, enable/disable caches, and inform HaLoop about lifecycle of an iterative job execution.
any loop-invariant data. JobConf class represents a client job and
hosts these APIs. Figure 16 shows the descriptions of this API. 9.2 PageRank Implementation
Hadoop slave nodes: caching. We implemented HaLoop's caching mechanisms by modifying the MapTask, ReduceTask, and TaskTracker classes. In map and reduce tasks, HaLoop creates a directory in the local file system to store the cached data. The directory is under the task's working directory and is tagged with the iteration number, so that a task accessing the cache later knows in which iteration the data was generated. After the iterative job finishes, the entire cache associated with the job is erased.
User-defined map and reduce functions: iterations. We added abstract classes MapperIterative and ReducerIterative to wrap the Mapper/Reducer interfaces in Hadoop. Both provide an empty implementation for the user-defined map/reduce functions and add new map/reduce functions that accept both the parameters of ordinary map/reduce functions and iteration-related parameters such as the current iteration number. ReducerIterative's new reduce function also adds a parameter that holds the cached reducer input values associated with the key.
User-defined map and reduce functions: fixpoint evaluation. HaLoop evaluates the fixpoint in a distributed fashion. Each reduce task computes the sum of the user-defined distances between the current output and that of the previous iteration by executing the user-defined distance function. Then, the host TaskTracker sends the aggregated value back to JobInProgress. JobInProgress computes the sum of the locally pre-aggregated distance values returned by each TaskTracker and compares the overall distance value with fixedPointThreshold. If the distance is less than fixedPointThreshold, or the current iteration number has already reached maxNumOfIterations, JobInProgress raises a "job complete" event to terminate the job execution. Otherwise, JobInProgress puts a number of tasks in its task queue to start a new iteration. Figure 15 also shows how HaLoop executes a job; in particular, the TaskScheduler manages the lifecycle of an iterative job execution.
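As a rough illustration of this control flow (this is not HaLoop's actual code), the decision made at the end of each iteration can be thought of as follows, where localDistances holds the pre-aggregated values reported by the TaskTrackers and all names are hypothetical:

// Illustrative sketch only: the fixpoint/termination decision described above.
// JobInProgress sums the locally pre-aggregated distances and stops when the
// total falls below fixedPointThreshold or the iteration budget is spent.
static boolean shouldStartNewIteration(double[] localDistances,
                                       double fixedPointThreshold,
                                       int currentIteration,
                                       int maxNumOfIterations) {
  double overallDistance = 0.0;
  for (double d : localDistances) {
    overallDistance += d;              // sum of per-TaskTracker distances
  }
  boolean fixpointReached = overallDistance < fixedPointThreshold;
  boolean iterationBudgetSpent = currentIteration >= maxNumOfIterations;
  // Raise "job complete" when either condition holds; otherwise enqueue the
  // tasks for the next iteration.
  return !(fixpointReached || iterationBudgetSpent);
}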
9.2 PageRank Implementation
Let us walk through how PageRank (from Example 1) is implemented on top of HaLoop. Figure 17 shows the pseudo-code of this implementation. There are two steps in PageRank's loop body: one is to join Ri and L and populate the ranks; the other is to aggregate the ranks on each URL. Each step is a map-reduce pair, and each pair is added to the overall iterative program by calling HaLoop's AddMap and AddReduce functions (lines 2-5 in Main).
The join step comprises two user-defined functions, Map Rank and Reduce Rank. In the first iteration, Map Rank reads an input tuple, either from the linkage table L or from the initial rank table R0. It outputs the join column as the key (L.url src or R0.url) and the rest of the input tuple as the value. It also attaches a table ID to each output tuple to distinguish their sources. In Figure 17, #1 is the table ID for L, while #2 is the table ID for the rank table Ri. In later iterations, Map Rank simply reads tuples from Ri, outputs column url as the key and column rank as the value, and attaches the table ID as before.
On each iteration, Reduce Rank calculates the local rank for the destination URLs (in invariantValues): each destination URL is assigned the source URL's rank divided by the number of destination URLs.
The aggregation step includes Map Aggregate and Reduce Aggregate: Map Aggregate reads the raw ranks produced by Reduce Rank, and Reduce Aggregate sums the local ranks for each URL.
The distance measure between reducer outputs from consecutive iterations is simply the rank difference (ResultDistance and line 6 in Main). Table L is set as loop-invariant (lines 1-2 in Map Rank and line 7 in Main). IterationInput and line 8 in Main specify the input to each iteration: {L, R0} for the first iteration and {Ri−1} for each later iteration i. Therefore, in Reduce Rank, invariantValues are obtained by looking up key (the input key to Reduce Rank) in the cached L partition and projecting on the url dest column. The fixedPointThreshold is set to 0.1, while the maxNumOfIterations is set to 10 (lines 9-10 in Main). Lines 11-12 in Main enable the reducer input cache, to improve the performance of the join step, and the reducer output cache, to support distributed fixpoint evaluation. Finally, the job is submitted to the HaLoop master node (line 13 in Main).

Map Rank
Input: Key k, Value v, int iteration
1: if v from L then
2:   Output(v.url src, v.url dest, #1);
3: else
4:   Output(v.url, v.rank, #2);
5: end if

Reduce Rank
Input: Key key, Set values, Set invariantValues, int iteration
1: for url dest in invariantValues do
2:   Output(url dest, values.get(0)/invariantValues.size());
3: end for

Map Aggregate
Input: Key k, Value v, int iteration
1: Output(v.url, v.rank);

Reduce Aggregate
Input: Key key, Set values, int iteration
1: Output(key, AggregateRank(values));

ResultDistance
Input: Key out key, Set vi−1, Set vi
1: return |vi.get(0) − vi−1.get(0)|;

IterationInput
Input: int iteration
1: if iteration==1 then
2:   return L ∪ R0;
3: else
4:   return Riteration−1;
5: end if

Main
1: Job job = new Job();
2: job.AddMap(Map Rank, 1);
3: job.AddReduce(Reduce Rank, 1);
4: job.AddMap(Map Aggregate, 2);
5: job.AddReduce(Reduce Aggregate, 2);
6: job.SetDistanceMeasure(ResultDistance);
7: job.AddInvariantTable(#1);
8: job.SetInput(IterationInput);
9: job.SetFixedPointThreshold(0.1);
10: job.SetMaxNumOfIterations(10);
11: job.SetReducerInputCache(true);
12: job.SetReducerOutputCache(true);
13: job.Submit();

Figure 17: Implementation of Example 1 on HaLoop
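To connect Figure 17 with the iteration-aware interfaces of Section 9.1.2, the following Java-style sketch shows how the join-step reducer could look. The reduce signature with an extra invariantValues argument mirrors the description of ReducerIterative above; the Collector helper and concrete types are assumptions made for illustration, not HaLoop's published interface.

import java.util.List;

// Illustrative sketch only: the join-step reducer in the spirit of Reduce Rank.
public class ReduceRankSketch {
  // key: a URL; values: its current rank (a single element); invariantValues:
  // the url dest column of the cached L partition for this key.
  public void reduce(String key, List<Double> values,
                     List<String> invariantValues, int iteration,
                     Collector out) {
    double rank = values.get(0);
    int outDegree = invariantValues.size();
    for (String urlDest : invariantValues) {
      // Each destination receives the source's rank split evenly among its
      // out-links, exactly as in lines 1-3 of Reduce Rank.
      out.emit(urlDest, rank / outDegree);
    }
  }
  interface Collector { void emit(String key, double value); }
}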
9.3 Descendant Query Implementation
We present the pseudo-code for the HaLoop implementation of Example 2 (the descendant query) in Figure 18. As in the PageRank example, the loop body has two steps: one is the join (to find friends-of-friends by looking one hop further), and the other is duplicate elimination (to remove duplicates in the extended friend set). We still utilize the reducer input cache (line 11 in Main) and set F to be loop-invariant (lines 1-2 in Map Join and line 8 in Main).
Map Join and Reduce Join form the join step. In the first iteration, Map Join reads input tuples from both F and ∆S0, and outputs the join column as the key and the remaining columns plus the table ID as the value. In this example, #1 is the ID of the friend table F and #2 is the ID of ∆Si−1. In each later iteration i, Map Join simply reads ∆Si−1 tuples and attaches the table ID to them as output. For each key (∆Si−1.name2), Reduce Join computes the cartesian product of the corresponding values (∆Si−1.name1) and invariantValues (F.name2). The duplicate elimination step includes Map Distinct and Reduce Distinct. Map Distinct emits tuples with column name1 as the key and column name2 as the value, while Reduce Distinct outputs distinct ⟨key, value⟩ (⟨∆Si.name1, ∆Si.name2⟩) pairs. The binding to IterationInput at line 7 in Main specifies the input to each iteration: {F, ∆S0} for the first iteration and {∆Si−1} for each later iteration i. The ResultDistance function simply returns the size of the current out key's output value set vi. The fixedPointThreshold is set to 1 at line 9 in Main, and the maxNumOfIterations is set to 2. Thus, the loop terminates when either ∆Si is empty or two iterations have passed. Since the fixpoint evaluation does not compare results from two iterations, we disable the reducer output cache option. The other parts of the Main function are similar to the corresponding parts in Figure 17.

Map Join
Input: Key k, Value v, int iteration
1: if v from F then
2:   Output(v.name1, v.name2, #1);
3: else
4:   Output(v.name2, v.name1, #2);
5: end if

Reduce Join
Input: Key key, Set values, Set invariantValues, int iteration
1: Output(Product(values, invariantValues));

Map Distinct
Input: Key k, Value v, int iteration
1: Output(v.name1, v.name2, iteration);

Reduce Distinct
Input: Key key, Set values, int iteration
1: for name in values do
2:   if (name.iteration < iteration) then
3:     set old.add(name);
4:   else set new.add(name);
5: end for
6: Output(Product(key, Distinct(set new − set old)));

IterationInput
Input: int iteration
1: if iteration==1 then
2:   return F ∪ ∆S0;
3: else
4:   return ∆Siteration−1;
5: end if

StepInput
Input: int step, int iteration
1: if step==2 then
2:   return ∪0≤j≤(iteration−1) ∆Sj;
3: end if

ResultDistance
Input: Key out key, Set vi−1, Set vi
1: return vi.size();

Main
1: Job job = new Job();
2: job.AddMap(Map Join, 1);
3: job.AddReduce(Reduce Join, 1);
4: job.AddMap(Map Distinct, 2);
5: job.AddReduce(Reduce Distinct, 2);
6: job.SetDistanceMeasure(ResultDistance);
7: job.SetInput(IterationInput);
8: job.AddInvariantTable(#1);
9: job.SetFixedPointThreshold(1);
10: job.SetMaxNumOfIterations(2);
11: job.SetReducerInputCache(true);
12: job.AddStepInput(StepInput);
13: job.Submit();

Figure 18: Implementation of Example 2 on HaLoop
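The interplay between the iteration tag added by Map Distinct and the new-versus-old split in Reduce Distinct is the subtle part of Figure 18, so here is a small, self-contained Java sketch of that logic. The NameTag representation and the Collector helper are assumptions for illustration; only the rule (emit a pair only if it first appears in the current iteration) follows the pseudo-code.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: duplicate elimination in the spirit of
// Reduce Distinct. Each value carries the iteration in which the (key, name)
// pair was produced; pairs already seen in earlier iterations are filtered
// out so that only the new delta survives.
public class ReduceDistinctSketch {
  static class NameTag {               // assumed representation of one value
    final String name;
    final int iteration;
    NameTag(String name, int iteration) { this.name = name; this.iteration = iteration; }
  }

  public void reduce(String key, List<NameTag> values, int currentIteration,
                     Collector out) {
    Set<String> oldNames = new HashSet<>();
    Set<String> newNames = new HashSet<>();
    for (NameTag v : values) {
      if (v.iteration < currentIteration) oldNames.add(v.name);
      else newNames.add(v.name);
    }
    newNames.removeAll(oldNames);      // Distinct(set new − set old)
    for (String name : newNames) {
      out.emit(key, name);             // the surviving delta tuples
    }
  }
  interface Collector { void emit(String key, String value); }
}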
9.4 K-means Implementation
K-means clustering is another popular iterative data analysis algorithm that can be implemented on top of HaLoop. Unlike the previous two examples, however, k-means takes advantage of the mapper input cache rather than the reducer input cache, because the input data to the mappers are invariant across iterations, while the reducer input data keep changing. Also, since the output of each iteration is very small, there is no need to enable the reducer output cache.
We sketch the code for this application in Figure 19. There is only one map-reduce step in the program: Map Kmeans and Reduce Kmeans. Map Kmeans assigns an input tuple to the nearest cluster (based on the distances between the tuple and every cluster's mean) and outputs the cluster ID as the key and the tuple as the value, while Reduce Kmeans calculates the mean of all tuples in one cluster. We only output the cluster means as the result of each iteration. One extra MapReduce job finally determines and outputs every tuple's cluster membership after the loop is completed; for simplicity, we omit this extra job here. IterationInput returns a constant (the HDFS path to the dataset), so that in each iteration Map Kmeans reads the same input files. Each mapper also loads the latest cluster means from HDFS in the mapper hook function Map Kmeans Configure before the mapper function Map Kmeans is called. ResultDistance measures the dissimilarity between two clusters produced in different iterations but with the same cluster ID; the distance measure is the Manhattan distance3 between the two cluster means. The fixedPointThreshold is set to 0.01 at line 5 in Main, while the maxNumOfIterations is set to 12 at the next line. At line 8 of Main, the mapper input cache is enabled.

3 http://en.wikipedia.org/wiki/Manhattan_distance

Map Kmeans Configure
1: loadLatestCluster();

Map Kmeans
Input: Key k, Value v, int iteration
1: Output(assignCluster(v), v);

Reduce Kmeans
Input: Key key, Set values, Set invariantValues, int iteration
1: Output(key, AVG(values));

IterationInput
Input: int iteration
1: return "input";

ResultDistance
Input: Key out key, Set vi−1, Set vi
1: return Manhattan Distance(vi.get(0), vi−1.get(0));

Main
1: Job job = new Job();
2: job.AddMap(Map Kmeans, 1);
3: job.AddReduce(Reduce Kmeans, 1);
4: job.SetDistanceMeasure(ResultDistance);
5: job.SetFixedPointThreshold(0.01);
6: job.SetMaxNumOfIterations(12);
7: job.SetInput(IterationInput);
8: job.SetMapperInputCache(true);
9: job.Submit();

Figure 19: K-means Implementation on HaLoop
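Because the k-means mapper depends on side data (the latest means) rather than on changing reducer input, the configure-then-map pattern in Figure 19 is the essential part. A hedged Java sketch of that pattern follows; the hook name, the means representation, the distance function used for assignment, and the Collector helper are all assumptions for illustration.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the configure/map pattern of Figure 19.
// loadLatestCluster() would read the means written by the previous iteration's
// reducers from HDFS; here it is a stub.
public class MapKmeansSketch {
  private final List<double[]> means = new ArrayList<>();

  // Corresponds to Map Kmeans Configure: called once before the map() calls.
  public void configure(int iteration) {
    means.clear();
    means.addAll(loadLatestCluster(iteration));   // assumed helper
  }

  // Corresponds to Map Kmeans: emit (nearest cluster ID, tuple).
  public void map(double[] tuple, Collector out) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int kid = 0; kid < means.size(); kid++) {
      double d = distance(tuple, means.get(kid));  // any per-application metric
      if (d < bestDist) { bestDist = d; best = kid; }
    }
    out.emit(best, tuple);
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;                  // Manhattan distance shown for concreteness
    for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
    return sum;
  }

  private List<double[]> loadLatestCluster(int iteration) {
    return new ArrayList<>();          // placeholder for the HDFS read
  }
  interface Collector { void emit(int clusterId, double[] tuple); }
}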
9.5 Higher-Level Query Language
We observe that the general form of the recursive queries we support has a basic structure similar to recursive queries as defined in the SQL standard. Recall that our recursive programs have the form:

Ri+1 = R0 ∪ (Ri ⋈ L)

Descendant Query in SQL using WITH. To illustrate how this formulation relates to a recursive query expressed in SQL using the WITH syntax, consider a simple descendant query as an example:

WITH descendants (parent, child) AS (
  -- R0: base case
  SELECT parent, child FROM parentof
  WHERE parent = 'Eric'
  UNION ALL
  -- R ⋈ L: step case
  SELECT d.parent, e.child
  FROM descendants d, parentof e
  WHERE d.child = e.parent
)
-- Ri+1 = R0 ∪ (Ri ⋈ L)
SELECT DISTINCT * FROM descendants

This query computes the transitive closure of the parentof table by repeatedly joining an initial result set (records with parent = 'Eric') with an invariant relation (the entire parentof relation), and (optionally) appending the results. The last line removes duplicates and returns all results.
We find this formulation to be very general; SQL queries using the WITH clause are sufficient to express a variety of iterative applications, including complex analytics that are not typically implemented in SQL.
K-means in SQL using WITH. We now show how to express k-means clustering as a recursive query. Assume there are two relations, points(pid, point) and means(kid, center). The points relation holds the data values for which we wish to compute the k clusters. The means relation holds an initial estimate of the means, usually randomized.
-- find minimum dist for each point
CREATE VIEW dmin AS
  SELECT pid, min(dist(pp.point, kk.mean)) AS dist, center
  FROM points pp, means kk
  GROUP BY pid

-- find mean for each pid
CREATE VIEW assign_cluster AS
  SELECT pid, point, kid
  FROM points p, means k, dmin d
  WHERE dist(p.point, k.mean) = d.dist

-- update step
CREATE VIEW newmeans AS
  SELECT kid, avg(point)
  FROM assign_cluster
  GROUP BY kid

-- put it all together
WITH means AS (
  SELECT kid, mean, 0 FROM initial_means
  UNION ALL
  SELECT kid, avg(point), level + 1
  FROM points p, means k
  WHERE dist(p.point, k.center) =
        (SELECT min(dist(p.point, m.center)) FROM means m)
    AND k.level = (SELECT max(level) FROM means)
    AND dist(k.center, d.center) < $threshold
  GROUP BY kid
);
SELECT * FROM means
Since MapReduce has been used as a foundation to express relational algebra operators, it is straightforward to translate these SQL queries into MapReduce jobs. Essentially, PageRank, the descendant query, and k-means clustering all share a recursive join structure. Our PageRank and descendant query implementations are similar to map-reduce joins in Hive [9], while the k-means implementation is similar to Hive's map-side joins; the difference is that these three applications are recursive, for which neither Hive nor MapReduce has built-in support. Further, with a modest extension to a high-level language such as Hive, common table expressions could be supported directly and optimized using HaLoop, greatly reducing programmers' implementation effort.

9.6 Hardware and Dataset Descriptions
This section presents additional details about our experimental design, for both the reducer (input/output) cache evaluation and the mapper input cache evaluation.

9.6.1 Settings for Reducer Cache Evaluations
All nodes in these experiments are default Amazon small instances4, with 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage (150 GB plus 10 GB for the root partition), a 32-bit platform, and moderate I/O performance.
Livejournal is a semi-synthetic dataset generated from a base real-world dataset5. The base dataset consists of all edge tuples in a social network, and its size is 1GB. We substituted all node identifiers with longer strings to make the dataset larger without changing the network structure. The extended Livejournal dataset is 18GB.
Triples is an RDF (Resource Description Framework) benchmark graph dataset from the billion triple challenge6. Each raw tuple in Triples is a line of ⟨subject, predicate, object, context⟩. We ignore the predicate and context columns, and treat the dataset as a graph where each unique string that appears as either a subject or an object is a node, and each ⟨subject, object⟩ tuple is an edge. The filtered Triples dataset is 120GB in size.
Freebase is another real-world dataset7, in which a large number of concepts are connected by various relationships. If we search for a keyword or concept ID on the Freebase website, it returns the description of a matched concept, as well as outgoing links to the connected concepts. Therefore, we filter the Freebase raw dataset (a crawl of the whole Freebase website) to extract tuples of the form ⟨concept_id1, concept_id2⟩. The filtered Freebase dataset (12.2GB in total) is a concept-connection graph, where each unique concept_id is a node and each tuple represents an edge. Detailed dataset statistics are in Figure 20.

Name         Nodes          Edges          Size
Livejournal  4,847,571      68,993,773     18GB
Triples      1,464,829,200  1,649,506,981  120GB
Freebase     7,024,741      154,544,312    12GB

Figure 20: Dataset Descriptions

We run PageRank on the Livejournal and Freebase datasets because ranking on social-network and crawl graphs makes sense in practice. Similarly, we run the descendant query on the Livejournal and Triples datasets. In the social network application, a descendant query finds one's friend network, while for the RDF triples, such a query finds a subject's impacted scope. The initial source node in the query is chosen at random.
By default, experiments on Livejournal are run on a 50-node cluster, while experiments on both Triples and Freebase are executed on a 90-node cluster.

9.6.2 Settings for Mapper Input Cache Evaluations
All nodes in these experiments contain a 2.60GHz dual quad-core Intel Xeon CPU with 16GB of RAM. The Cosmo dataset8 is a snapshot from an astronomy simulation of the universe. The simulation covered a volume of 110 million light years on a side, with 900 million particles total. Tuples in Cosmo are multi-dimensional vectors.

9.7 Discussion
Here we compare some other design alternatives with HaLoop.
• Disk Cache vs. Memory Cache. To cache loop-invariant data, one can use either disk or memory. HaLoop caches data only to disk. The reason is that in a commodity machine cluster, a slave node does not have sufficient memory to hold the cache, especially when a large number of tasks have to run on the node.
• Synchronized Iteration vs. Asynchronous Iteration. HaLoop only utilizes partitioned parallelism. There could be some dataflow parallelism if iterations were not strictly synchronized. However, dataflow parallelism is not the goal of MapReduce, and it is also out of this work's scope.
• Loop Body: Single Pipeline vs. DAGs. Currently, HaLoop only supports articulated map-reduce pairs forming a single pipeline in the loop body, rather than DAGs. Although DAGs are a more general form of loop body, we believe the current design can meet the requirements of many iterative data analysis applications.

4 http://aws.amazon.com/ec2/instance-types/
5 http://snap.stanford.edu/data/index.html
6 http://challenge.semanticweb.org/
7 http://www.freebase.com/
8 http://nuage.cs.washington.edu/benchmark/astro-nbody/dataset.php