[Figure 9: PageRank Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.]
[Figure 10: PageRank Performance: HaLoop vs. Hadoop (Freebase Dataset, 90 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.]
[Figure 11: Descendant Query Performance: HaLoop vs. Hadoop (Triples Dataset, 90 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.]
[Figure 12: Descendant Query Performance: HaLoop vs. Hadoop (Livejournal Dataset, 50 nodes). Panels: (a) Overall Performance, (b) Join Step, (c) Cost Distribution, (d) Shuffled Bytes.]
each iteration, there are few new records produced, so the join's selectivity on F is very low and the cost becomes negligible. By contrast, for PageRank the index does not help much, because the selectivity is high. For the descendant query on Livejournal (Figure 12), the index does not help after iteration 3 either, because the selectivity becomes high by then.

I/O in Shuffle Phase of Join Step. To measure how much shuffle I/O is saved, we compared the amount of data shuffled in the join step of each iteration. Since HaLoop caches loop-invariant data, the overhead of shuffling this invariant data is avoided entirely. These savings account for an important part of the overall performance improvement. Figure 9(d), Figure 10(d), Figure 11(d), and Figure 12(d) plot the sizes of the shuffled data. On average, HaLoop's join step shuffles only 4% as much data as Hadoop's does.

5.2 Evaluation of Reducer Output Cache
This experiment shares the same hardware and datasets as the reducer input cache experiments. To see how effective HaLoop's reducer output cache is, we compared the cost of fixpoint evaluation in each iteration. Since the descendant query has a trivial fixpoint evaluation step that only requires testing whether a file is empty, we run the PageRank implementation of Section 9.2 on Livejournal and Freebase. In the Hadoop implementation, fixpoint evaluation is implemented as an extra MapReduce job. On average, compared with Hadoop, HaLoop reduces the cost of this step to 40%, by taking advantage of the reducer output cache and a built-in distributed fixpoint evaluation. Figures 13(a) and 13(b) show the time spent on fixpoint evaluation in each iteration.
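For reference, the following is a minimal sketch of the kind of driver-side fixpoint check described above for plain Hadoop: an extra MapReduce job counts the ranks that moved by more than a convergence threshold, and the client reads that count back through a counter. The driver class, helper methods, and counter below are hypothetical placeholders; the PageRank job and the delta-counting job themselves are not shown.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch of plain-Hadoop fixpoint evaluation via an extra MapReduce job.
    // runRankIteration() and newDeltaCountJob() stand in for application job
    // setup that is not shown; FixpointCounter.CHANGED is assumed to be
    // incremented by the delta job's reducer for every rank that moved more
    // than the convergence threshold.
    public class PageRankDriver {

      enum FixpointCounter { CHANGED }

      static final int MAX_ITERATIONS = 10;

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long changed = Long.MAX_VALUE;

        for (int i = 0; i < MAX_ITERATIONS && changed > 0; i++) {
          runRankIteration(conf, i);          // PageRank MapReduce job(s) for iteration i

          // Extra job whose only purpose is fixpoint evaluation: it joins the
          // ranks of iterations i and i-1 and counts large deltas.
          Job check = newDeltaCountJob(conf, i);
          check.waitForCompletion(true);
          changed = check.getCounters()
                         .findCounter(FixpointCounter.CHANGED)
                         .getValue();
        }
      }

      // Placeholders for application-specific job configuration (not shown).
      static void runRankIteration(Configuration conf, int i) throws Exception { }
      static Job newDeltaCountJob(Configuration conf, int i) throws Exception {
        return Job.getInstance(conf, "fixpoint-check-" + i);
      }
    }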
[Figure 13: Fixpoint Evaluation Overhead in PageRank: HaLoop vs. Hadoop. Panels: (a) Livejournal, 50 nodes; (b) Freebase, 90 nodes.]
5.3 Evaluation of Mapper Input Cache
The mapper input cache aims to reduce data transfer between slave nodes; since we do not know the disk I/O implementation of EC2 virtual machines, this suite of experiments uses an 8-node physical cluster. PageRank and the descendant query cannot utilize the mapper input cache because their inputs change from iteration to iteration, so the application used in this evaluation is the k-means clustering algorithm. We used two real-world astronomy datasets of multi-dimensional tuples: cosmo-dark (46GB) and cosmo-gas (54GB). Detailed hardware and dataset descriptions are in Section 9.6. We vary the total number of iterations and plot the running time in Figure 14. The mapper locality rate is around 95%, since there are no concurrent jobs in our lab HaLoop cluster. By avoiding non-local data loading, HaLoop performs marginally better than Hadoop.
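As a point of reference for this setup, here is a minimal sketch of a k-means driver loop on plain Hadoop (not the implementation used in our experiments). It illustrates why the mapper input cache applies to k-means: the large point dataset is re-read as mapper input in every iteration, while only the small set of centroids, passed through the job configuration in this sketch, changes between iterations. The configuration key and helper method are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch of an iterative k-means driver on plain Hadoop. The same large
    // input (the points) is added as job input in every iteration; only the
    // centroids change, so a mapper input cache can serve the points from
    // local disk after the first iteration.
    public class KMeansDriver {

      public static void main(String[] args) throws Exception {
        Path points = new Path(args[0]);              // large, loop-invariant input
        String centroids = args[1];                   // initial centroids (small)
        int iterations = Integer.parseInt(args[2]);

        for (int i = 0; i < iterations; i++) {
          Configuration conf = new Configuration();
          conf.set("kmeans.centroids", centroids);    // hypothetical key; small per-iteration state

          Job job = Job.getInstance(conf, "kmeans-" + i);
          FileInputFormat.addInputPath(job, points);  // identical mapper input every iteration
          FileOutputFormat.setOutputPath(job, new Path("centroids-" + (i + 1)));
          // job.setMapperClass(...); job.setReducerClass(...);  // application classes, not shown
          job.waitForCompletion(true);

          centroids = readCentroids("centroids-" + (i + 1));  // hypothetical helper
        }
      }

      // Placeholder: read the new centroids produced by the iteration (not shown).
      static String readCentroids(String dir) { return dir; }
    }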
[Figure 14: Performance of k-means: HaLoop vs. Hadoop. Panels: (a) Cosmo-dark, 8 nodes; (b) Cosmo-gas, 8 nodes.]
6. RELATED WORK
Parallel database systems [5] partition data storage and parallelize query workloads to achieve better performance. However, they are sensitive to failures and have not been shown to scale to thousands of nodes. Various optimization techniques for evaluating recursive queries have been proposed in the literature [3, 17], but that work has not been shown to operate at large scale. Further, most of these techniques are orthogonal to our research; we provide a low-level foundation for implementing data-intensive iterative programs.

More recently, MapReduce [4] has emerged as a popular alternative for massive-scale parallel data analysis in shared-nothing clusters. Hadoop [7] is an open-source implementation of MapReduce. MapReduce has been followed by a series of related systems, including Dryad [10], Hive [9], Pig [14], and HadoopDB [2]. Like Hadoop, none of these systems provides explicit support or optimizations for iterative or recursive analysis.

Mahout [12] is a project whose goal is to build a set of scalable machine learning libraries on top of Hadoop. Since most machine learning algorithms are model-fitting applications, nearly all of them involve iterative programs. Mahout uses an outside driver program to control the loops, and new MapReduce jobs are launched in each iteration; the drawbacks of this approach were discussed in Section 1. Like Mahout, we aim to help iterative data analysis algorithms run on scalable architectures, but we differ in that we modify the fundamental system: we inject the iterative capability into a MapReduce engine.

Twister [6] is a stream-based MapReduce framework that supports iterative programs, in which mappers and reducers are long-running and use distributed memory caches to avoid repeatedly loading mapper data from disk. However, Twister's streaming architecture between mappers and reducers is sensitive to failures, and long-running mappers and reducers with in-memory caches are not a scalable solution for commodity clusters, where each node has limited memory and resources.

Finally, Pregel [13] is a distributed system for processing large graph datasets, but it does not support general iterative programs.

7. CONCLUSION AND FUTURE WORK
This paper presents the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations, including (1) a loop-aware task scheduler, (2) loop-invariant data caching, and (3) caching for efficient fixpoint verification. We evaluated our HaLoop prototype on several large datasets and iterative queries. Our results demonstrate that pushing support for iterative programs into the MapReduce engine greatly improves the overall performance of iterative data analysis applications. In future work, we would like to implement a simplified Datalog evaluation engine on top of HaLoop, to enable large-scale iterative data analysis programmed in a declarative way.

Acknowledgements
The HaLoop project is partially supported by NSF CluE grants IIS-0844572 and IIS-0844580, NSF CAREER Award IIS-0845397, NSF grant CNS-0855252, Woods Hole Oceanographic Institute Grant OCE-0418967, Amazon, the University of Washington eScience Institute, and the Yahoo! Key Scientific Challenges program. We thank Michael J. Carey, Rares Vernica, Vinayak R. Borkar, Hongbo Deng, Congle Zhang, and the anonymous reviewers for their suggestions and comments.

8. REFERENCES
[1] http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm. Accessed July 7, 2010.
[2] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922–933, 2009.
[3] François Bancilhon and Raghu Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD Conference, pages 16–52, 1986.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[5] David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.
[6] Jaliya Ekanayake and Shrideep Pallickara. MapReduce for data intensive scientific analysis. In IEEE eScience, pages 277–284, 2008.
[7] Hadoop. http://hadoop.apache.org/. Accessed July 7, 2010.
[8] HDFS. http://hadoop.apache.org/common/docs/current/hdfs_design.html. Accessed July 7, 2010.
[9] Hive. http://hadoop.apache.org/hive/. Accessed July 7, 2010.
[10] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59–72, 2007.
[11] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[12] Mahout. http://lucene.apache.org/mahout/. Accessed July 7, 2010.
[13] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.
[14] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD Conference, pages 1099–1110, 2008.
[15] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.
[16] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165–178, 2009.
[17] Weining Zhang, Ke Wang, and Siu-Cheung Chau. Data partition and parallel evaluation of Datalog programs. IEEE Trans. Knowl. Data Eng., 7(1):163–176, 1995.
9. APPENDIX
This appendix presents additional implementation details for the HaLoop system and our sample applications, experiment setup details, and a discussion.

[Figure: loop control code boundary, Hadoop (Job Client) vs. HaLoop (TaskScheduler)]

    in Hadoop (Job Client):
        runjobComputeDistance();
        while(! isFixedPoint() &&
              ! exceedMaxIterations())
        {
            kickOffJobForNewIteration();
            …
        }
        aggregateDistance();

    in HaLoop (TaskScheduler):
        while(! isFixedPoint() &&
              ! exceedMaxIterations())
        {
            kickOffNewIteration();
            …
        }

9.1 HaLoop Implementation Details
We first provide some additional details about HaLoop's extensions of Hadoop.
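To make the boundary shift above concrete, here is a brief, purely hypothetical sketch of what client code can reduce to once the framework, rather than the job client, drives the loop: the application registers the per-iteration map/reduce pair and a termination test, and the task scheduler runs every iteration. All identifiers below (IterativeJob, setIterationBody, setFixpointTest, and the referenced classes) are illustrative placeholders, not HaLoop's actual programming interface.

    // Hypothetical client-side sketch only; these names are illustrative
    // placeholders and do not correspond to HaLoop's real API.
    IterativeJob job = new IterativeJob(conf, "k-means");
    job.setIterationBody(ComputeDistanceMapper.class,   // map/reduce pair run once per iteration
                         UpdateCentroidsReducer.class);
    job.setMaxIterations(10);                            // iteration bound checked by the scheduler
    job.setFixpointTest(new CentroidDelta(0.01));        // framework-evaluated termination test
    job.submit();                                        // the task scheduler drives all iterations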