
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 9, SEPTEMBER 2011

MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

Dawei Jiang, Anthony K. H. Tung, and Gang Chen

D. Jiang and A.K.H. Tung are with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Computing Drive, Singapore 117417. E-mail: {jiangdw, atung}@comp.nus.edu.sg.
G. Chen is with the College of Computer Science, Zhejiang University, Hangzhou 310027, P.R. China. E-mail: [email protected].

Manuscript received 15 Mar. 2010; revised 16 July 2010; accepted 11 Aug. 2010; published online 15 Dec. 2010. Recommended for acceptance by D. Lomet. Digital Object Identifier no. 10.1109/TKDE.2010.248.

Abstract—Data analysis is an important functionality in cloud computing which allows a huge amount of data to be processed over very large clusters. MapReduce is recognized as a popular way to handle data in the cloud environment due to its excellent scalability and good fault tolerance. However, compared to parallel databases, the performance of MapReduce is slower when it is adopted to perform complex data analysis tasks that require the joining of multiple data sets in order to compute certain aggregates. A common concern is whether MapReduce can be improved to produce a system with both scalability and efficiency. In this paper, we introduce Map-Join-Reduce, a system that extends and improves the MapReduce runtime framework to efficiently process complex data analysis tasks on large clusters. We first propose a filtering-join-aggregation programming model, a natural extension of MapReduce's filtering-aggregation programming model. Then, we present a new data processing strategy which performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job applies filtering logic to all the data sets in parallel, joins the qualified tuples, and pushes the join results to the reducers for partial aggregation. The second job combines all partial aggregation results and produces the final answer. The advantage of our approach is that we join multiple data sets in one go and thus avoid frequent checkpointing and shuffling of intermediate results, a major performance bottleneck in most of the current MapReduce-based systems. We benchmark our system against Hive, a state-of-the-art MapReduce-based data warehouse, on a 100-node Amazon EC2 cluster using the TPC-H benchmark. The results show that our approach significantly boosts the performance of complex analysis queries.

Index Terms—Cloud computing, parallel systems, query processing.

1 INTRODUCTION

CLOUD computing is a service through which a service provider delivers elastic computing resources (virtual compute nodes) to a number of users. This computing paradigm is attracting increasing interest since it enables users to scale their applications up and down seamlessly in a pay-as-you-go manner. To unleash the full power of cloud computing, it is well accepted that a cloud data processing system should provide a high degree of elasticity, scalability, and fault tolerance.

MapReduce [1] is recognized as a possible means to perform elastic data processing in the cloud. There are three main reasons for this. First, the programming model of MapReduce is simple yet expressive. A large number of data analytical tasks can be expressed as a set of MapReduce jobs, including SQL query, data mining, machine learning, and graph processing. Second, MapReduce achieves the desired elastic scalability through block-level scheduling and is proven to be highly scalable; Yahoo! has deployed MapReduce on a 4,000-node cluster [2]. Finally, MapReduce provides fine-grained fault tolerance whereby only tasks on failed nodes have to be restarted.

With the above features, MapReduce has become a popular tool for processing large-scale data analytical tasks. However, there are two problems when MapReduce is adopted for processing complex data analytical tasks which join multiple data sets for aggregation. First, MapReduce is mainly designed for performing a filtering-aggregation data analytical task on a single homogenous data set [16]. It is not very convenient to express join processing in the map() and reduce() functions [4].

Second, in certain cases, performing a multiway join using MapReduce is not efficient. The performance issue is mainly due to the fact that MapReduce employs a sequential data processing strategy which frequently checkpoints and shuffles intermediate results during data processing. Suppose we join three data sets, i.e., R ⋈ S ⋈ T, and conduct an aggregation on the join results. Most MapReduce-based systems (e.g., Hive, Pig) will translate this query into four MapReduce jobs. The first job joins R and S and writes the result U into a file system (e.g., the Hadoop Distributed File System, HDFS). The second job joins U and T and produces V, which will again be written to HDFS. The third job aggregates the tuples in V. If more than one reducer is used in step three, a final job merges the results from the reducers of the third job and writes the final query results into one HDFS file. Here, checkpointing U and V to HDFS, and shuffling them in the next MapReduce jobs, incurs a huge cost if U and V are large. Although one can achieve better performance by allocating more nodes from the cloud, this "renting more nodes" solution is not really cost efficient in a pay-as-you-go environment like the cloud. An ideal cloud data processing system should offer elastic data processing in the most economical way.
This paper introduces Map-Join-Reduce, an extended and enhanced MapReduce system for simplifying and efficiently processing complex data analytical tasks. To solve the first problem described above, we introduce a filtering-join-aggregation programming model which is an extension of MapReduce's filtering-aggregation programming model. In addition to the mapper and reducer, we introduce a third operation, join (called the joiner), to the framework. To join multiple data sets for aggregation, users specify a set of join() functions and the join order. The runtime system automatically joins the data sets according to the join order and invokes the join() functions to process the joined records. Like MapReduce, a Map-Join-Reduce job can be chained with an arbitrary number of MapReduce or Map-Join-Reduce jobs to form a complex data processing flow. Therefore, Map-Join-Reduce benefits both end users and high-level query engines built on top of MapReduce. For end users, Map-Join-Reduce removes the burden of presenting complex join algorithms to the system. For MapReduce-based high-level query engines such as Hive [5] and Pig [6], Map-Join-Reduce provides a new building block for generating query plans.

To solve the second problem, we introduce a one-to-many shuffling strategy in Map-Join-Reduce. MapReduce adopts a one-to-one shuffling scheme which shuffles each intermediate key/value pair produced by a map() function to a unique reducer. In addition to this shuffling scheme, Map-Join-Reduce offers a one-to-many shuffling scheme which shuffles each intermediate key/value pair to many joiners at one time. We show that, with a proper partition strategy, one can utilize the one-to-many shuffling scheme to join multiple data sets in one phase instead of in a set of MapReduce jobs. This one-phase joining approach is, in certain cases, more efficient than the multiphase joining approach employed by MapReduce in that it avoids checkpointing and shuffling intermediate join results in the next MapReduce jobs.

This paper makes the following contributions:

- We propose filtering-join-aggregation, a natural extension of MapReduce's filtering-aggregation programming model. This extended programming model covers more complex data analytical tasks which require joining multiple data sets for aggregation. The complexity of parallel join processing is handled by the runtime system, so it is straightforward for both humans and high-level query planners to generate data analytical programs.
- We introduce a one-to-many shuffling strategy and demonstrate its usage to perform filtering-join-aggregation data analytical tasks. This data processing scheme outperforms MapReduce's sequential data processing scheme since it avoids frequent checkpointing and shuffling of intermediate results.
- We implement the proposed approach on Hadoop and show that our technique is ready to be adopted. Although our solution is intrusive to Hadoop, our implementation is such that our system is binary compatible with Hadoop. Existing MapReduce programs can run directly on our system without modifications. This design makes it very easy for users to gradually migrate their legacy MapReduce programs to Map-Join-Reduce programs for better performance.
- We provide a comprehensive performance study of our system. We benchmark our system against Hive using four TPC-H queries. The results show that the performance gap between our system and Hive grows as more joins are involved and more intermediate results are produced. For TPC-H Q9, our system runs three times faster than Hive.

The rest of this paper is organized as follows: Section 2 presents the filtering-join-aggregation model and describes Map-Join-Reduce at a high level. Section 3 discusses the implementation details on Hadoop. Section 4 presents optimization techniques. Section 5 reports experimental results. Section 6 reviews related work. We present our conclusions in Section 7.

2 MAP-JOIN-REDUCE

This section presents filtering-join-aggregation, a natural extension of MapReduce's filtering-aggregation programming model, and describes the overall data processing flow in Map-Join-Reduce.

2.1 Filtering-Join-Aggregation

As described before, MapReduce represents a two-phase filtering-aggregation data analysis framework, with mappers performing filtering logic and reducers performing aggregation logic [16]. In [1], the signatures of the map and reduce functions are defined as follows:

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)

This programming model is mainly designed for homogenous data sets, namely, the same filtering logic, represented by the map() function, is applied to each tuple in the data set. We extend this model to filtering-join-aggregation in order to process multiple heterogenous data sets. In addition to the map() and reduce() functions, we introduce a third join() function, i.e., the joiner. A filtering-join-aggregation data analytical task involves n data sets D_i, i in {1, ..., n}, and n - 1 join functions. The signatures of the map, join, and reduce functions are as follows:

map_i (k1_i, v1_i) -> (k2_i, list(v2_i))
join_j ((k2_{j-1}, list(v2_{j-1})), (k2_j, list(v2_j))) -> (k2_{j+1}, list(v2_{j+1}))
reduce (k2, list(v2)) -> list(v2)

The signature of map in Map-Join-Reduce is similar to that of MapReduce except for the subscript i, which denotes that the filtering logic defined by map_i is applied on data set D_i. The join function join_j, j in {1, ..., n-1}, defines the logic for processing the jth joined tuples. If j = 1, the first input list of join_j comes from the mapper output; if j > 1, the first input list is from the (j-1)th join results. The second input list of join_j must be from mapper output. From a database perspective, the join chain of Map-Join-Reduce is equivalent to a left-deep tree. Currently, we only support equal joins. For each function join_j, the runtime system guarantees that the key of the first input list is equal to the key of the second input list, namely, k2_{j-1} = k2_j. The reduce function's signature is the same as in MapReduce, so we shall not explain it further.
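For illustration only, these signatures can be rendered as Java-style generic interfaces as sketched below. The interface names and the Emitter helper are assumptions made for exposition and are not the actual Map-Join-Reduce API; the concrete Hadoop-based API is introduced in Section 3.1.

    import java.util.Iterator;

    // Illustrative rendering of the filtering-join-aggregation signatures.
    // K1/V1 are the raw input key/value types of data set D_i; K2/V2 are the
    // intermediate types shared along the join chain. All names are assumed.
    interface MapFunction<K1, V1, K2, V2> {
      void map(K1 k1, V1 v1, Emitter<K2, V2> out);           // map_i
    }

    interface JoinFunction<K2, V2> {
      // join_j: consumes the (j-1)th join result (left) and one mapper output
      // (right); the runtime guarantees leftKey equals rightKey (equal join).
      void join(K2 leftKey, Iterator<V2> leftValues,
                K2 rightKey, Iterator<V2> rightValues, Emitter<K2, V2> out);
    }

    interface ReduceFunction<K2, V2> {
      void reduce(K2 k2, Iterator<V2> values, Emitter<K2, V2> out);
    }

    interface Emitter<K, V> { void emit(K key, V value); }    // assumed helper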
A Map-Join-Reduce job can be chained with an arbitrary number of MapReduce or Map-Join-Reduce jobs to form a complex data processing flow by feeding its output to the next MapReduce or Map-Join-Reduce job. This chaining strategy is a standard technique in MapReduce-based data processing systems [1]. Therefore, in this paper, we only focus on presenting the execution flow of a single Map-Join-Reduce task.

2.2 Example

We give a concrete example of a filtering-join-aggregation task here. The data analytical task in the example is a simplified TPC-H Q3 query [19]. This will be our running example for illustrating the features of Map-Join-Reduce. The TPC-H Q3 task, represented in SQL, is as follows:

select
  O.orderdate, sum(L.extendedprice)
from
  customer C, orders O, lineitem L
where
  C.mksegment = 'BUILDING' and
  C.custkey = O.custkey and
  L.orderkey = O.orderkey and
  O.orderdate < date '1995-03-15' and
  L.shipdate > date '1995-03-15'
group by
  O.orderdate

This data analytical task requires the system to apply a filtering condition on all three data sets, i.e., customer, orders, and lineitem, join them, and calculate the corresponding aggregates. The schemas of the data sets can be found in [19]; we intentionally omit them to save space. The Map-Join-Reduce program that performs this analytical task is similar to the following pseudocode:

mapC (long tid, Tuple t):
  // tid: tuple ID
  // t: tuple in customer
  if t.mksegment = 'BUILDING'
    emit(t.custkey, null)

mapO (long tid, Tuple t):
  if t.orderdate < date '1995-03-15'
    emit(t.custkey, (t.orderkey, t.orderdate))

mapL (long tid, Tuple t):
  if t.shipdate > date '1995-03-15'
    emit(t.orderkey, (t.extendedprice))

join1 (long lKey, Iterator lValues, long rKey, Iterator rValues):
  for each V in rValues
    emit(V.orderkey, (V.orderdate))

join2 (long lKey, Iterator lValues, long rKey, Iterator rValues):
  for each V1 in lValues
    for each V2 in rValues
      emit(V1.orderdate, (V2.extendedprice))

reduce (Date d, Iterator values):
  double price = 0.0
  for each V in values
    price += V
  emit(d, price)

To launch a Map-Join-Reduce job, in addition to the above pseudocode, one also needs to specify a join order which defines the execution order of the joiners. This is achieved by providing a Map-Join-Reduce job specification, an extension of the original MapReduce job specification, to the runtime system. Details of providing the job specification can be found in Section 3.1. Here, we only focus on presenting the logic of the map(), join(), and reduce() functions.

To evaluate TPC-H Q3, three mappers, mapC, mapO, and mapL, are specified to process records in customer, orders, and lineitem, respectively. The first joiner join1 processes the results of C ⋈ O, i.e., customer and orders. For each joined record pair, it produces a key/value pair with the orderkey as the key and the (orderdate) as the value. The result pairs are then passed to the second joiner. The second joiner joins the result tuples of join1 with lineitem and emits orderdate as the key and (extendedprice) as the value. Finally, the reducer aggregates extendedprice for each possible date.
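To make the pseudocode concrete, the sketch below renders mapC as a plain Hadoop-style (old mapred API) mapper over '|'-delimited TPC-H text. The class name and field positions are illustrative assumptions (TPC-H names the segment column c_mktsegment), and the actual Map-Join-Reduce mapper API, introduced in Section 3.1, may differ.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical rendering of mapC: filter customer tuples on the market
    // segment and emit (custkey, null). Field positions follow the '|'-delimited
    // TPC-H customer layout; not the actual Map-Join-Reduce mapper class.
    public class CustomerMap extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, NullWritable> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, NullWritable> out,
                      Reporter reporter) throws IOException {
        String[] fields = line.toString().split("\\|");
        long custkey = Long.parseLong(fields[0]);   // c_custkey
        String segment = fields[6];                 // c_mktsegment
        if ("BUILDING".equals(segment)) {
          out.collect(new LongWritable(custkey), NullWritable.get());
        }
      }
    }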
2.3 Execution Overview

To execute a Map-Join-Reduce job, the runtime system launches two kinds of processes, called MapTask and ReduceTask. Mappers run inside the MapTask process, while joiners and reducers are invoked inside the ReduceTask process. The MapTask process and ReduceTask process are semantically equivalent to the map worker process and reduce worker process presented in [1]. Map-Join-Reduce's process model allows for pipelining intermediate results between joiners and reducers, since joiners and reducers run inside the same ReduceTask process. The failure recovery strategy of Map-Join-Reduce is identical to MapReduce's. In the presence of a node failure, only MapTasks and uncompleted ReduceTasks need to be restarted; completed ReduceTasks do not need to be reexecuted. The process of task restarting in Map-Join-Reduce is also similar to MapReduce except for the ReduceTask: in addition to rerunning the reduce() function, when a ReduceTask is restarted, all the joiners are also reexecuted.

Map-Join-Reduce is compatible with MapReduce. Therefore, a filtering-join-aggregation task can be evaluated by the standard sequential data processing strategy described in Section 1. In this case, for each MapReduce job, the ReduceTask process only invokes a unique join() to process an intermediate two-way join result. We shall omit the details of this data processing scheme. Alternatively, Map-Join-Reduce can also perform a filtering-join-aggregation task by two successive MapReduce jobs. The first job performs filtering, join, and partial aggregation. The second job combines the partial aggregation results and writes the final aggregation results to HDFS. (Strictly speaking, if only one reducer is used in the first job, the second merge job is unnecessary. However, in real-world workloads, a number of reducers are required in the first job to speed up data processing; therefore, the second job is needed to produce the final query results.)
In the first MapReduce job, the runtime system splits the input data sets into chunks in a per-data set manner and then launches a set of MapTasks onto those chunks, with one MapTask allocated to each chunk. Each MapTask executes the corresponding map function to filter tuples and emits intermediate key-value pairs. The output is then forwarded to the combiner, if a map-side partial aggregation is necessary, and to the partitioner in turn. The partitioner applies a user-specified partitioning function on each map output and creates corresponding partitions for a set of reducers. We will see later how Map-Join-Reduce partitions the same intermediate pair to many reducers; for now, we simply state that the partitioning ensures that each reducer can independently perform all joins on the intermediate results that it receives. Details of partitioning will be presented later. Finally, the intermediate pairs are sorted by the key and then written to local disks.

When the MapTasks are completed, the runtime system launches a set of ReduceTasks. Each reducer builds a join list data structure which links all joiners according to the user-specified join order. Then, each ReduceTask remotely reads (shuffles) the partitions associated to it from all mappers. When a partition is successfully read, the ReduceTask checks whether the first joiner is ready to perform. A joiner is ready if and only if both its first and second input data sets are ready, either in memory or on local disk. When the joiner is ready, the ReduceTask performs a merge-join algorithm on its input data sets and fires its join function on the joined results. The ReduceTask buffers the output of the joiner in memory; if the memory buffer is full, it sorts the results and writes the sorted results to disk. The ReduceTask repeats the whole loop until all the joiners are completed. Here, the shuffling and join operations overlap with each other. The output of the final joiner is then fed to the reducer for partial aggregation. Fig. 1 depicts the execution flow of the first MapReduce job.
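As a minimal illustration of the merge-join step that a ReduceTask performs once both inputs of a joiner are ready, the sketch below joins two key-sorted, in-memory inputs and collects the matched value pairs. The in-memory maps and string values are simplifications of the actual disk-backed implementation; all names are illustrative.

    import java.util.*;

    // Sort-merge join over two inputs that are sorted by join key; pairs with
    // equal keys would be handed to the joiner's join() logic, here simply
    // collected as (left, right) value pairs.
    public class MergeJoinSketch {
      static <V> List<Map.Entry<V, V>> mergeJoin(SortedMap<Long, List<V>> left,
                                                 SortedMap<Long, List<V>> right) {
        List<Map.Entry<V, V>> out = new ArrayList<>();
        Iterator<Map.Entry<Long, List<V>>> li = left.entrySet().iterator();
        Iterator<Map.Entry<Long, List<V>>> ri = right.entrySet().iterator();
        Map.Entry<Long, List<V>> l = li.hasNext() ? li.next() : null;
        Map.Entry<Long, List<V>> r = ri.hasNext() ? ri.next() : null;
        while (l != null && r != null) {
          int cmp = l.getKey().compareTo(r.getKey());
          if (cmp < 0) l = li.hasNext() ? li.next() : null;          // advance left
          else if (cmp > 0) r = ri.hasNext() ? ri.next() : null;     // advance right
          else {
            for (V lv : l.getValue())
              for (V rv : r.getValue())
                out.add(new AbstractMap.SimpleEntry<>(lv, rv));      // matched pair
            l = li.hasNext() ? li.next() : null;
            r = ri.hasNext() ? ri.next() : null;
          }
        }
        return out;
      }

      public static void main(String[] args) {
        SortedMap<Long, List<String>> customers = new TreeMap<>();
        customers.put(1L, Arrays.asList("cust-1"));
        SortedMap<Long, List<String>> orders = new TreeMap<>();
        orders.put(1L, Arrays.asList("order-10", "order-11"));
        System.out.println(mergeJoin(customers, orders)); // [cust-1=order-10, cust-1=order-11]
      }
    }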
Fig. 1. Execution flow of the first MapReduce job.

In Fig. 1, the data sets D1 and D2 are chopped into two chunks each. For each chunk, a mapper is launched to filter the qualified tuples. The output of all mappers is then shuffled to the joiners for joining. Finally, the output of the final joiner is passed to the reducer for partial aggregation.

When the first job is completed, the second MapReduce job is launched to combine the partial results (typically by applying the same reduce function on the results) and write the final aggregation results to HDFS. The second job is a standard MapReduce job, and thus we omit its execution details.

2.4 Partitioning

Obviously, to make the framework described above work, the important step is to properly partition the output of the mappers so that each reducer can join all data sets locally.

This problem is fairly easy to solve if the analytical task only involves two data sets. Consider joining two data sets, R ⋈_{R.a=S.b} S. To partition R and S to n_r reducers, we adopt a partition function H(x) = h(x) mod n_r, where h(x) is a universal hash function, apply it to each tuple of R and S on the join column, and take the output of H(x) as the partition signature that is associated with a unique reducer for processing. Therefore, tuples that can be joined with each other will eventually go to the same reducer. This technique is equivalent to a standard parallel hash join algorithm [20] and is widely used in current MapReduce-based systems. The scheme is also feasible for joining multiple data sets (more than two) if each data set has a unique join column. As an example, to perform R ⋈_{R.a=S.b} S ⋈_{S.b=T.c} T, we can apply the same partition function H(x) on the join columns R.a, S.b, and T.c, and partition R, S, and T to the same n_r reducers to complete all joins in one MapReduce job.

However, if a data analytical task involves a data set that has more than one join column, the above technique will not work. For example, if we perform R ⋈_{R.a=S.a} S ⋈_{S.b=T.c} T, it is impossible to use a single partition function to partition all three data sets to the reducers in one pass. Map-Join-Reduce solves this problem by utilizing k partition functions to partition the input data sets, where k is the number of connected components in the derived join graph of the query. We will first give a concrete example; general rules for partitioning will be provided later. Recall the previous simplified TPC-H Q3 query. The query performs C ⋈ O ⋈ L for aggregation, where C, O, and L stand for customer, orders, and lineitem, respectively. The join condition is C.custkey = O.custkey and O.orderkey = L.orderkey.

To partition the three input data sets into n_r = 4 reducers, we use two partition functions <H1(x), H2(x)>, with H1(x) partitioning the columns C.custkey and O.custkey and H2(x) partitioning the columns O.orderkey and L.orderkey. Function H1(x) is defined as H1(x) = h(x) mod n1; function H2(x) is defined as H2(x) = h(x) mod n2. To facilitate the discussion, we assume that the universal hash function h(x) is h(x) = x. The point is that the partition numbers n1 and n2 must satisfy the constraint n1 × n2 = n_r. Suppose we set n1 = 2 and n2 = 2. Each reducer is then associated with a unique partition signature pair among all possible outcomes. In this example, reducer R0 is associated with <0, 0>, R1 with <0, 1>, R2 with <1, 0>, and R3 with <1, 1>.

Now, we use <H1(x), H2(x)> to partition the data sets. We begin with the customer relation. Suppose the input key-value pair t is <1, null>, where 1 is the custkey. The partition signature of this custkey is calculated as H1(1) = 1. Since customer has no column that belongs to the partition function H2(x), all possible outcomes of H2(x) are considered. Therefore, t is partitioned to reducers R2: <1, 0> and R3: <1, 1>.
The same logic is applied to the relations orders and lineitem. Suppose the input pair of orders o is <1, (0, '1995-03-01')>; then o is partitioned to R2: <H1(1) = 1, H2(0) = 0>. The input pair of lineitem l: <0, 120.34> is partitioned to R0: <0, H2(0) = 0> and R2: <1, H2(0) = 0>. Now, all three tuples can be joined in R2. Fig. 2 shows the whole partitioning process.

Fig. 2. Process of partitioning customer, orders, and lineitem.

The above algorithm is correct: clearly, any tuples from the three data sets that can be joined, namely, C(x, null), O(x, (y, date)), and L(y, price), will eventually be partitioned to the same reducer Ri: <H1(x), H2(y)>.

In general, to partition n data sets to n_r reducers for processing, we first build k partition functions

H(x) = <H1(x), ..., Hk(x)>.

Each partition function Hi(x) = h(x) mod n_i, i in {1, ..., k}, is responsible for partitioning a set of join columns. We call the join columns that Hi(x) operates on the domain of Hi(x), denoted by Dom[Hi]. The constraint is that the parameters n_i of all partition functions must satisfy Π_{i=1..k} n_i = n_r.

When the k partition functions are built, the partitioning process is straightforward, as shown by the previous TPC-H Q3 example. First, we associate each reducer with a unique signature, a k-dimensional vector, from all possible partition outcomes. The reducer will then process the intermediate mapper output that belongs to the assigned partition. For each intermediate output pair, the k-dimensional partition signature is calculated by applying all k partition functions in H on the pair in turn. For Hi(x), if the intermediate result contains a join column c that falls in Dom[Hi], the ith value of the k-dimensional signature is Hi(c); otherwise, all possible outcomes of Hi(x) are considered.

The remaining questions of partitioning are: 1) how to build the k partition functions for a query; 2) how to determine the domain of each partition function Hi(x); and 3) given n_r, what are the optimal values for the partition parameters {n1, ..., nk}. We solve the first two problems in this section; the optimization problem is discussed in Section 4.

We build a derived join graph for a data analytical task according to the following definitions:

Definition 1. A graph G is called a derived join graph of task Q if each vertex in G is a unique join column involved in Q and each edge is a join condition that joins two data sets in Q.

Definition 2. A connected component Gc of a derived join graph is a subgraph in which any vertices are reachable by a path.

Fig. 3. Derived join graph for TPC-H Q3.

Fig. 3 shows the derived join graph of TPC-H Q3. We only support queries whose derived join graphs have no loops. Even with this restriction, we will see that this model covers many complex queries, including those of TPC-H.

We build H(x) as follows: first, we enumerate all connected components in the derived join graph. Suppose k connected components are found; we build a partition function for each connected component. The domain of the partition function is the set of vertices (join columns) in the corresponding connected component. For example, the derived join graph of TPC-H Q3 has two connected components, as shown in Fig. 3. So, we build two partition functions H(x) = <H1(x), H2(x)>. The domains of the partition functions are the vertices (join columns) in each connected component, namely,

Dom[H1] = {C.custkey, O.custkey},
Dom[H2] = {O.orderkey, L.orderkey}.

Currently, we rely on users to manually build the partition functions, as we have done in the experiments. In the longer term, we plan to introduce an optimizer for automatically building the derived join graph and partition functions.
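The construction above amounts to computing connected components over join columns. The sketch below illustrates that step with a small union-find; it mirrors the manual construction used in the experiments rather than the planned optimizer, and all class and method names are made up for illustration.

    import java.util.*;

    // Derive partition-function domains from a query's equi-join conditions
    // (Definitions 1 and 2): each join column is a vertex, each join predicate
    // an edge, and every connected component yields one H_i whose domain is
    // the component's columns.
    public class DerivedJoinGraph {
      private final Map<String, String> parent = new HashMap<>();

      private String find(String c) {
        parent.putIfAbsent(c, c);
        String p = parent.get(c);
        if (!p.equals(c)) { p = find(p); parent.put(c, p); }   // path compression
        return p;
      }

      public void addJoinCondition(String leftColumn, String rightColumn) {
        parent.put(find(leftColumn), find(rightColumn));       // union
      }

      // One domain (set of join columns) per connected component.
      public Collection<Set<String>> domains() {
        Map<String, Set<String>> comps = new LinkedHashMap<>();
        for (String c : new ArrayList<>(parent.keySet())) {
          comps.computeIfAbsent(find(c), k -> new LinkedHashSet<>()).add(c);
        }
        return comps.values();
      }

      public static void main(String[] args) {
        DerivedJoinGraph g = new DerivedJoinGraph();
        g.addJoinCondition("C.custkey", "O.custkey");   // TPC-H Q3 conditions
        g.addJoinCondition("O.orderkey", "L.orderkey");
        // Two components: {C.custkey, O.custkey} and {O.orderkey, L.orderkey}
        // (printed order may vary), i.e., two partition functions.
        System.out.println(g.domains());
      }
    }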
2.5 Discussion

In Map-Join-Reduce, we only shuffle the input data sets to reducers and do not checkpoint and shuffle intermediate join results. Instead, intermediate join results are pipelined between joiners and the reducer, either through an in-memory buffer or local disk. In general, if the join selectivity is low, a common case in real-world workloads, it is less costly to shuffle the input data sets than the intermediate join results. Furthermore, the shuffling and join operations overlap on the reducer side to speed up query processing: a joiner is launched immediately once both its input data sets are ready.

Although Map-Join-Reduce is designed for multiway joins, it can also be used together with existing two-way join techniques. Assume we want to perform S ⋈ R ⋈ T ⋈ U. If S is quite small and fits into the memory of any single machine, we can join R with S on the map side by loading S into the memory of each mapper launched on R, as described in [14]. In this case, the mappers of R emit joined tuples to the reducers, which perform the normal Map-Join-Reduce procedure to join with T and U for aggregation.
Also, if R and S are already partitioned on the join columns among the available nodes, we can use a map-side join and shuffle the results to reducers for further processing. In the future, we will introduce a map-side joiner, which runs a joiner inside the MapTask process, and transparently integrate all these two-way join techniques into the Map-Join-Reduce framework.

The potential problem of Map-Join-Reduce is that it may consume more memory and local disk space to process a query. Compared to the sequential processing strategy described in Section 1, reducers in Map-Join-Reduce receive larger portions of the input data sets than reducers in MapReduce. In MapReduce, data sets are partitioned with the full number of reducers, namely, each data set is partitioned into n_r portions; in Map-Join-Reduce, a data set may be partitioned into a smaller number of partitions. In the TPC-H Q3 example, customer and lineitem are both partitioned into two partitions although the total number of available reducers is 4. Therefore, compared to sequential query processing, the reducers of Map-Join-Reduce may need a larger memory buffer and more disk space to hold the input data sets. One possible solution to this problem is to allocate more nodes from the cloud and utilize those nodes to process the data analytical task.

Even in environments with a limited number of compute nodes, we still have some ways to solve the problem. First, given a fixed number of reducers, we can use the technique presented in Section 4 to tune the partition number n_i of each partition function to minimize the input data set portions that each reducer receives. Second, we can compress the output of the mappers and operate on the data in compressed format, a technique which is widely used in column-wise database systems to reduce disk and memory cost [21]. Finally, we can adopt a hybrid query processing strategy. For example, suppose we need to join six data sets but the available computing resources only allow us to join four data sets at once; we can first launch a MapReduce job to join four data sets, write the results to HDFS, and then launch another MapReduce job to join the rest of the data sets for aggregation. This hybrid processing strategy is equivalent to ZigZag processing in parallel databases [22].

3 IMPLEMENTATION ON HADOOP

We implement Map-Join-Reduce on Hadoop 0.19.2, an open source MapReduce implementation. Although large-scale data analysis is an emerging application of MapReduce, not all MapReduce programs are of this kind; some of them build inverted indexes, perform large-scale machine learning, and conduct many other data processing tasks. Therefore, it is important that our modifications do not damage the interface and semantics of MapReduce to the extent that those nondata analysis tasks fail to work. That is, we must make sure that the resulting system is binary compatible with Hadoop and that existing MapReduce jobs can run on our system without any problems. Fortunately, based on the experiences from building this implementation, such a requirement does not introduce huge engineering efforts. This section describes these implementation details.

3.1 New APIs

We introduce new APIs to Hadoop for the new features of Map-Join-Reduce. Since MapReduce is mainly designed for processing a single homogenous data set, the API that Hadoop provides only supports specifying one mapper, combiner, and partitioner for each MapReduce job.

In Map-Join-Reduce, the mapper, combiner, and partitioner are defined in a per-data set manner. Users specify a mapper, combiner, and partitioner for each data set using the following API:

TableInputs.addTableInput(path, mapper, combiner, partitioner)

In the above code, path points to the location in HDFS that stores the data set, while mapper, combiner, and partitioner define the Java classes that the user creates to process the data set. The return value of addTableInput() is an integer which specifies the data set id (called the table id in Map-Join-Reduce). The table id is used to specify the join input data sets, and is used by the system to perform various per-data set operations, e.g., launching the corresponding mappers and combiners, which will be discussed in the following sections.

Following Hadoop's principle, the joiner is also implemented as a Java interface in Map-Join-Reduce. Users specify join processing logic by creating a joiner class and implementing its join() functions. Joiners are registered to the system as follows:

TableInputs.addJoiner(leftId, rightId, joiner)

The input data sets of the joiner are specified by leftId and rightId. Both ids are integers either returned by addTableInput() or addJoiner(). The return value of addJoiner() represents the resulting table id; therefore, joiners can be chained. As stated previously, to simplify the implementation, the right input of a joiner must be a source input, namely, a data set added by addTableInput(); only the left input can be the result of a joiner. As an example, the specification of the TPC-H Q3 query is described as follows:

C = TableInputs.addTableInput(CPath, CMap, CPartitioner)
O = TableInputs.addTableInput(OPath, OMap, CPartitioner)
L = TableInputs.addTableInput(LPath, LMap, CPartitioner)
tmp = TableInputs.addJoiner(C, O, Join1)
TableInputs.addJoiner(tmp, L, Join2)

3.2 Data Partitioning

Before the MapReduce job can be launched, the data sets need to be partitioned into chunks. Each chunk is called a FileSplit in Hadoop. We implement a TableInputFormat class to split multiple data sets for launching Map-Join-Reduce jobs. TableInputFormat walks through the data set list produced by addTableInput(). For each data set, it adopts conventional Hadoop code to split the files of the data set into FileSplits. When all the FileSplits for the data set being processed are collected, TableInputFormat rewrites each FileSplit into a TableSplit by appending additional information including the table id, mapper class, combiner class, and partitioner class.
When the TableSplits of all data sets are generated, TableInputFormat sorts these splits first by access order, then by split size. This is to ensure that data sets that will be joined first have a bigger chance of being scanned first. (In MapReduce, there is no method to control the data set access order; here, we only give a hint to the Hadoop scheduler for scheduling MapTasks to process the data sets.)

3.3 MapTask

When a MapTask is launched, it first reads the TableSplit that is assigned to it, then parses the information from the TableSplit, and launches the mapper, combiner, and partitioner to process the data set. Overall, the workflow of MapTask is the same as in the original MapReduce except for partitioning. There are two problems here. First, in Map-Join-Reduce, the partition signature of an intermediate pair is a k-dimensional vector; however, Hadoop can only shuffle an intermediate pair based on a single-value partition signature. Second, Map-Join-Reduce requires shuffling the same intermediate pair to many reducers; however, Hadoop can only shuffle the same intermediate pair to one reducer.

To solve the first problem, we convert our k-dimensional signature into a single value. Given the k-dimensional partition signature S = <x1, ..., xk> and the k partition function parameters {n1, ..., nk}, the single signature value s is calculated as follows:

s = x1 + Σ_{i=2..k} x_i × n_{i-1}.

For the second problem, a naive method involves writing the same intermediate pair to disk several times. Suppose we need to shuffle an intermediate key-value pair I to m reducers; we can emit I m times in the map functions, with each I_i associated with a different partition vector. Hadoop is then able to shuffle each I_i to a unique reducer. Unfortunately, this method dumps I to disk m times and introduces a huge I/O overhead on the map side.

We adopt an alternative solution to rectify this problem. First, we extend Hadoop's Partitioner interface to a TablePartitioner interface and add a new function that can return a set of reducer ids as shuffling targets. The new function is as follows:

int[] getTablePartitions(K key, V value)

The MapTask then collects and sorts all the intermediate pairs. The sorting groups pairs that will be shuffled to the same set of reducers into partitions, based on the information returned by the getTablePartitions() function, and orders the pairs in each partition according to their keys. Using this approach, the same intermediate pair is only written to disk once and thus does not introduce additional I/Os.
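The sketch below illustrates the one-to-many partitioning just described: given a partial signature (with -1 marking a dimension whose join column is absent from the record) and the parameters {n1, ..., nk}, it enumerates every complete signature and flattens each one with the formula above. The real getTablePartitions(key, value) would also extract the join columns from the record, which is omitted here; the class and method names are illustrative.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Enumerate the target reducer ids of one intermediate pair.
    public class TablePartitionSketch {

      // partial[i] = H_i(column) if Dom[H_i] covers a column of this record,
      // otherwise -1 (all outcomes of H_i must be covered).
      static int[] targetPartitions(int[] partial, int[] n) {
        List<int[]> signatures = new ArrayList<>();
        signatures.add(new int[n.length]);
        for (int i = 0; i < n.length; i++) {
          List<int[]> next = new ArrayList<>();
          int lo = partial[i] >= 0 ? partial[i] : 0;
          int hi = partial[i] >= 0 ? partial[i] : n[i] - 1;
          for (int[] s : signatures)
            for (int x = lo; x <= hi; x++) {
              int[] t = s.clone();
              t[i] = x;
              next.add(t);
            }
          signatures = next;
        }
        int[] out = new int[signatures.size()];
        for (int j = 0; j < out.length; j++) {
          int[] s = signatures.get(j);
          int flat = s[0];                                   // s = x1 + sum x_i * n_{i-1}
          for (int i = 1; i < n.length; i++) flat += s[i] * n[i - 1];
          out[j] = flat;
        }
        return out;
      }

      public static void main(String[] args) {
        // TPC-H Q3 with n1 = n2 = 2: a customer tuple has H1 = 1 and no H2
        // column, so it targets the two reducers with signatures <1,0> and <1,1>.
        int[] targets = targetPartitions(new int[]{1, -1}, new int[]{2, 2});
        System.out.println(Arrays.toString(targets));        // prints [1, 3]
      }
    }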
3.4 ReduceTask

The structure of ReduceTask is also similar to the original Hadoop version. The only difference is that it holds an array of counters for a multiple-data-set job, one for each data set. For each data set, the counter is initially set to the total number of mappers that the reducer needs to connect to for shuffling data. When a partition is successfully read from a certain mapper, the ReduceTask decreases the corresponding counter. When all the needed partitions for a data set are read, the ReduceTask checks the joiner at the front of the joiner list. If the joiner is ready, the ReduceTask performs the merge-join algorithm and calls the joiner's join function to process the results.

4 OPTIMIZATION

In addition to the modifications described in the previous section, three optimization strategies are also adopted. We present them here.

4.1 Speedup Parsing

Following MapReduce, Map-Join-Reduce is designed to be storage independent. As a result, users have to decode the record stored in the value part of the input key/value pair in the map() and reduce() functions. Previous studies show that this runtime data decoding process introduces considerable overhead [12], [10], [11].

There are two kinds of decoding schemes: immutable decoding and mutable decoding. The immutable decoding scheme transforms raw data into immutable, i.e., read-only, objects. Using this approach, decoding 4 million records results in 4 million immutable objects and thus introduces huge CPU overhead. We found that the poor performance of data parsing reported in previous studies is due to the fact that all these studies adopt the immutable scheme.

To reduce the record parsing problem, we adopt a mutable decoding scheme in Map-Join-Reduce. The idea is straightforward: to decode the records of a data set D, we create a mutable object according to the schema of D and use that object to decode all records belonging to D. Therefore, no matter how many records are decoded, only one mutable object is created. Our benchmarking results show that mutable decoding outperforms immutable decoding by a factor of four.
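A minimal sketch of the two decoding schemes is shown below, assuming a toy '|'-delimited record layout; the class names and field positions are illustrative and do not reproduce the actual parsing library benchmarked in Section 5.2.

    // Mutable versus immutable decoding of '|'-delimited text records.
    public class DecodingSketch {

      // Immutable decoding: one new object per record (high GC/CPU overhead).
      static final class ImmutableOrder {
        final long orderkey; final String orderdate;
        ImmutableOrder(long orderkey, String orderdate) {
          this.orderkey = orderkey; this.orderdate = orderdate;
        }
      }

      // Mutable decoding: a single object is reused for the whole data set.
      static final class MutableOrder {
        long orderkey; String orderdate;
        void decode(String line) {
          int bar = line.indexOf('|');
          orderkey = Long.parseLong(line.substring(0, bar));
          int next = line.indexOf('|', bar + 1);
          orderdate = line.substring(bar + 1, next);   // toy layout: date is field 2
        }
      }

      public static void main(String[] args) {
        String[] lines = {"1|1995-01-02|...", "2|1995-02-27|..."};
        MutableOrder order = new MutableOrder();       // created once
        for (String line : lines) {
          order.decode(line);                          // fields overwritten in place
          System.out.println(order.orderkey + " " + order.orderdate);
        }
      }
    }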
4.2 Tuning Partition Functions

In Map-Join-Reduce, an intermediate pair may be shuffled to many reducers for joining. To save network bandwidth and computation cost, it is important to ensure that each reducer receives only a minimal number of intermediate pairs to process. This section discusses this problem.

Suppose a filtering-join-aggregation task Q involves n data sets D = {D1, ..., Dn}. The derived join graph includes k connected components, and the corresponding partition functions are H = {H1(x), ..., Hk(x)}, with partition numbers {n1, ..., nk}. For each data set Di, m_i partition functions H_i = {H_{m_1}(x), ..., H_{m_i}(x)} are used to partition its join columns. Furthermore, we assume that there is no data distribution skew for the time being; data skew is our future work. The optimization problem is to minimize the number of intermediate pairs that each reducer receives:

minimize   F(x) = Σ_{i=1..n} |D_i| / Π_{j=1..m_i} n_j
subject to Π_{i=1..k} n_i = n_r,
           n_i ≥ 1 is an integer.

In the above problem formulation, n_r is the number of ReduceTasks specified by the user. As in MapReduce, the number n_r is often set to a small multiple of the number of slave nodes [1]. The optimization problem is equivalent to a nonlinear integer program. In general, nonlinear integer programming is NP-hard [23], and there is no efficient algorithm to solve it. However, in this case, if the number of reducers n_r is small, we can use a brute-force approach which enumerates all feasible solutions to minimize the objective function F(x). If n_r is large, say 10,000, finding the optimal partition numbers for four partition functions requires O(10^16) computations, which makes the brute-force approach infeasible.

If n_r is large, we use a heuristic approach to solve the optimization problem and produce an approximate solution that works reasonably well. For a very large n_r, we first round it to a number n'_r close to n_r with n'_r = 2^d. Then we replace n_r with n'_r and rewrite the constraint as

Π_{i=1..k} n_i = n'_r.

It is easy to see that after the constraint rewriting, each n_i must be of the form n_i = 2^{j_i}, where j_i is an integer. So, the constraint can further be written as

Σ_{i=1..k} j_i = d.

Now, the brute-force approach can be used to find the optimal j_i, i in {1, ..., k}, that minimize the objective function. The computation cost is reduced to O(d^k).
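The sketch below illustrates this heuristic search: fix n'_r = 2^d, enumerate the exponent vectors with j_1 + ... + j_k = d, and keep the one that minimizes F. The data set sizes and the mapping from data sets to partition functions are hypothetical inputs, hard-wired here for a TPC-H Q3-like shape; none of the numbers come from the paper.

    import java.util.Arrays;

    // Brute-force search over exponents j_i with sum(j_i) = d (n_i = 2^{j_i}).
    public class PartitionTuningSketch {
      static double[] sizes = {1.5, 10.0, 40.0};      // |C|, |O|, |L| in GB (made up)
      static int[][] uses = {{0}, {0, 1}, {1}};       // which H_i apply to each data set

      static double cost(int[] j) {
        double f = 0;
        for (int d = 0; d < sizes.length; d++) {
          double denom = 1;
          for (int h : uses[d]) denom *= 1 << j[h];   // n_h = 2^{j_h}
          f += sizes[d] / denom;                      // F = sum |D_i| / prod n_h
        }
        return f;
      }

      static int[] best; static double bestCost = Double.MAX_VALUE;

      static void search(int[] j, int pos, int remaining) {
        if (pos == j.length - 1) {
          j[pos] = remaining;                         // last exponent is forced
          if (cost(j) < bestCost) { bestCost = cost(j); best = j.clone(); }
          return;
        }
        for (int x = 0; x <= remaining; x++) { j[pos] = x; search(j, pos + 1, remaining - x); }
      }

      public static void main(String[] args) {
        int d = 6;                                    // n_r' = 2^6 = 64 reducers
        search(new int[2], 0, d);
        // For these made-up sizes: exponents [1, 5], F = 2.15625
        System.out.println("exponents " + Arrays.toString(best) + ", F = " + bestCost);
      }
    }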
We now build a cost model and analyze the I/O costs of evaluating a filtering-join-aggregation task by: 1) the standard sequential data processing strategy employed by the original MapReduce, and 2) the alternative data processing strategy introduced by Map-Join-Reduce. We regard the whole cluster as a single computer and estimate the total I/O costs for both approaches. The difference between the two approaches lies in the method of joining multiple data sets; the final aggregation step is the same. Therefore, we only consider the joining phase.

For the sequential data processing, the multiway join is evaluated by a set of MapReduce jobs. The I/O cost C_s is the sum of the I/O costs of all map() and reduce() functions, which is equivalent to scanning and shuffling the input data sets and intermediate join results:

C_s = 2 ( Σ_{i=1..n} |D_i| + Σ_{j=1..n-1} |J_j| ),

where |J_j| is the size of the jth join result. The coefficient is two since both the input data sets and the intermediate results are first read from HDFS by mappers and then shuffled to reducers for processing; thus, two I/Os are introduced.

For the one-phase join processing, the input data sets are first read by mappers and then replicated to multiple reducers for joining. Thus, the total I/O cost C_p is

C_p = Σ_{i=1..n} |D_i| + Σ_{i=1..n} ( Π_{H_j ∉ H_i} n_j ) × |D_i|.

Comparing C_s and C_p, it is obvious that if the intermediate join results are huge, and thus the I/O cost of checkpointing and shuffling those intermediate results is higher than that of replicating the input data sets on multiple reducers, then one-phase join processing is more efficient than sequential data processing.
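As a rough worked comparison with made-up sizes (not measurements from the paper): for a three-way join R ⋈ S ⋈ T with |R| = |S| = |T| = 100 GB, suppose the sequential plan produces |J_1| = 500 GB and |J_2| = 50 GB, and Map-Join-Reduce uses two partition functions with n_1 = n_2 = 4, where R falls only in Dom[H1], T only in Dom[H2], and S in both. Then

C_s = 2(|R| + |S| + |T| + |J_1| + |J_2|) = 2(100 + 100 + 100 + 500 + 50) GB = 1700 GB,
C_p = (|R| + |S| + |T|) + (n_2|R| + |S| + n_1|T|) = 300 GB + (400 + 100 + 400) GB = 1200 GB,

so the one-phase strategy wins. If instead the intermediate results are small, say |J_1| = 10 GB and |J_2| = 5 GB, then C_s = 2 × 315 GB = 630 GB < C_p, which matches the later observation in Section 5.3.2 that sequential processing can be preferable when the intermediate results are small.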
4.3 Speeding Up the Final Merge

In Map-Join-Reduce, in order to calculate the final aggregates, the second MapReduce job often needs to process a large number of small files. This is because the first MapReduce job launches a large number of reducers for join processing and partial aggregation and produces a partial aggregation results file for each reducer.

Currently, Hadoop schedules mappers in a per-file manner, one mapper for each file. If there are 400 files to process, at least 400 mappers will be launched. This per-file assignment scheme is quite inefficient for the second, merging job. After the join and partial aggregation, the partial results files produced by the first job are quite small, typically several KBs. However, we observe that the start-up cost of a mapper in a 100-node cluster is around 7-10 seconds, thousands of times larger than the actual data processing time.

To speed up the final merging process, we adopt another scheduling strategy for the second MapReduce job [24]. Instead of scheduling mappers in a per-file manner, we schedule a mapper to process multiple files in order to enlarge the payload. In particular, we schedule a mapper to process 128 MB of data consolidated from multiple files. Using this approach, the number of mappers needed in the second job is significantly reduced. The typical merging time, based on our experiments on the TPC-H queries, is around 15 seconds, which approaches the minimal cost of launching a MapReduce job.
reducers for processing. Thus, two I/Os are introduced. with df -h, we found out only one 420 GB disk was installed.
For the one-phase join processing, the input data sets are The raw disk speed of a large instance is roughly 120 MB/s,
first read by mappers and then replicated to multiple and the network bandwidth is about 100 MB/s. For analysis
reducers for joining. Thus the total I/O cost Cp is tasks, we benchmark the performance with cluster sizes of 10,
j 62Hi
n HY 50, and 100 nodes.3 We implemented Map-Join-Reduce on
X
n X  
Cp ¼ jDi j þ nj  jDi j : Hadoop v0.19.2 and use the enhanced Hadoop to run all
i¼1 i¼1 benchmarks. The Java system we used is 1.6.0_16.
Comparing Cs and Cp , it is obvious that if the 3. These nodes are slave nodes. To make Hadoop run, we use an
intermediate join results is huge and thus the I/O cost of additional master node to run NameNode and JobTracker.
5.1.1 Hive Settings

There are two important reasons that we choose Hive as the system to benchmark against. First, Hive represents a state-of-the-art MapReduce-based system that processes complex analytical workloads. Second, and more importantly, Hive has already benchmarked itself using the TPC-H benchmark and released its HiveQL (an SQL-like query declaration language) scripts and Hadoop configurations [25]. This simplifies our effort in setting up Hive and tuning the parameters for better performance. We assume that the HiveQL scripts that Hive provides are well tuned and use them without modification.

We carefully follow the Hadoop configurations used by Hive for the TPC-H benchmarking, and only make a few small modifications. First, we set each slave node to run two MapTasks and two ReduceTasks concurrently instead of four, since we only have two cores in each slave node. Second, we set the sort buffer to 500 MB to ensure that a MapTask can hold all intermediate pairs in memory. This setting makes both systems, Map-Join-Reduce and Hive, run a little faster. Third, we set the HDFS block size to 512 MB instead of the 128 MB that Hive used in the TPC-H benchmarking. This is because we observe that although Hive sets the block size to 128 MB, it manually sets the minimal chunk split size to 512 MB in each query. This setting is in line with our observation that Hive's MapTask should process a reasonably sized data chunk to amortize the startup cost, so we directly use a 512 MB block size. Hive enables map output compression in its benchmark. At present, we do not support compression and therefore disable it. Disabling compression will not significantly affect the performance: according to other benchmarking results published by Hive, enabling compression only improves performance by less than 4 percent [26]. The final modification is to enable JVM task reuse.

5.1.2 Map-Join-Reduce Settings

Map-Join-Reduce shares the same common Hadoop settings with Hive. Furthermore, we set the joiner output buffer to 150 MB.

5.2 Performance Study of Tuple Parsing

This benchmark studies whether MapReduce's runtime parsing cost can be reduced. Since Hive is a complete system, we have no way to test only its parsing component. Therefore, we compare our parsing library (called the MJR approach) with the code that is used in [10] (called the Java approach). The difference between the two approaches is that our code does not create temporary objects in splitting and parsing, while the Java code does.

We create a one-node cluster and populate HDFS with a 725 MB lineitem data set. We run two MapReduce jobs to test the performance. The first job extracts the tuple structure by reading a line from the input as a tuple and then splitting it into fields according to the delimiter "|". The second job splits the tuple into fields and parses two date columns, i.e., l_commitdate and l_receiptdate, to compare which date is earlier. Here, the computation is merely for testing purposes; we are interested in whether the parsing cost is acceptable. Both jobs only have map functions and do not produce output into HDFS. The minimal file split size is set to 1 GB so that the mapper will take the whole data set as input. We only report the execution time of the mapper and ignore the startup cost of the job. Fig. 4 plots the results.

Fig. 4. Tuple parsing results.

In Fig. 4, the left two columns represent the time the MJR code used to split and parse the tuple; the right two columns record the execution time of the Java code. We can see that our split approach runs about four times faster than the Java code. Furthermore, parsing the two columns actually introduces very little overhead (less than one second). The results confirm our claim made in Section 4.1: the cost of runtime parsing is mainly due to the creation of temporary Java objects, and with proper and careful coding, much of the cost can be removed.

5.3 Analytical Tasks

We benchmark Map-Join-Reduce against Hive. The original TPC-H benchmark of Hive [25] runs on an 11-node cluster with ten slaves to process a TPC-H 100 GB data set, i.e., 10 GB of data per node. Each slave node has 4 cores, 8 GB memory, and 4 hard disks with 1.6 TB of space. However, our EC2 instances only have 2 cores, 7.5 GB memory, and 1 hard disk. Therefore, to enable the benchmark to be completed within a reasonable time frame, we process 5 GB of data on each node. With the use of 10, 50, and 100-node clusters, we subsequently have three data sets of 50, 250, and 500 GB. Unfortunately, Hive fails to perform all four analytical queries on the 500 GB data set; the JVM throws various runtime exceptions, such as "GC overhead limit exceeded", during query processing. This problem is not due to our modifications to Hadoop, since the same problem occurred when we ran Hive on the standard Hadoop v0.19.2 release. Therefore, for the 500 GB data set, we only report the results of Map-Join-Reduce. Instead, for the 100-node cluster, we reduce the data set size to 350 GB so that Hive can complete all four queries.

We choose four TPC-H queries for benchmarking, namely, Q3, Q4, Q7, and Q9. Each query is executed three times and the average of the three runs is reported. For Hive, we use the latest 0.4.0 release, and its HiveQL scripts are used for query submission. For Map-Join-Reduce, all the programs are hand coded. For each query, we specify the same join order as Hive. We set the HDFS replication factor to r = 3, in which case the data set and results are replicated twice, that is, three copies of the data are stored. The effect of replication on performance is studied in Section 5.3.5, where no replication is used. Data are generated by the TPC-H DBGEN tool and are loaded into HDFS as text files. We do not report the loading time since both systems directly perform queries on the files.

It would be useful to list each query's SQL and HiveQL script; however, the full listing would exceed our space limit. Therefore, we only present each query's execution flow in Hive. The details of the SQL queries and HiveQL scripts can be found in [19] and [25].
1308 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 9, SEPTEMBER 2011

Fig. 5. TPC-H Q4 (r ¼ 3). Fig. 6. TPC-H Q3 (r ¼ 3).

determine the number of reducers to use based on the input aggregates I2 on the group keys. The fourth job J4 sorts the
size. Map-Join-Reduce, however, gives the user more aggregation results in decreasing order of revenue. The
freedom to specify the number of reducers to be used. To final job J5 limits the results to top ten orders with the largest
make a fair comparison, we set the number of reducers to be revenues and writes these ten result tuples into HDFS.
no more than the total number of reducers that Hive used in Map-Join-Reduce can process the query with two jobs.
processing the same query. For example, if Hive uses The first job scans all three data sets and shuffles qualified
50 reducers to process a query, we will set the number of tuples to reducers. At the reducer side, two joiners are
reducers to no more than 50 in Map-Join-Reduce. We could chained to join all tuples. The join order is the same as
not set the number of reducers to be same with Hive since Hive’s, namely, first joining orders and customer,
some reducer number, e.g., a prime number, makes us fail followed by joining the results with lineitem. In partial
to build partition functions.
aggregation, the reducers in the first job maintain a heap to
5.3.1 TPC-H Q4 hold the top ten tuples with the largest revenues. Finally,
the second job combines partial aggregations and produces
This query joins lineitem with orders and counts the
the final query answers. Two partition functions are built
number of orders based on order priority. Hive uses four
for the first job to partition intermediate pairs. The domains
MapReduce jobs to evaluate this query. The first job writes
of the partition functions are Dom½H1  ¼ fc custkey;
the unique l_orderkeys in lineitem to HDFS. The
o custkeyg and Dom½H2  ¼ fl orderkey; o orderkeyg. We
second job joins the unique l_orderkeys with orders set the number of reducers employed in the first job to be
and writes the join results to HDFS. The third job aggregates close to the sum of reducers Hive used in J1 , J2 , and J3 . The
joined tuples and writes the aggregation results to HDFS. partition numbers in each partition function are tuned by
The fourth job merges results into one HDFS file. the brute-force search algorithm described in Section 4.2.
Map-Join-Reduce launches two MapReduce jobs to Fig. 6 illustrates the results of this benchmark task.
evaluate the query. The first job performs the ordinary Although Map-Join-Reduce can perform all joins within one
filtering, joining and partial aggregation tasks on line- job, it only runs twice faster than Hive. This is because the
item and orders. The second job combines all partial intermediate join results are small (relative to the input).
aggregation results and produce the final answers to HDFS. Therefore checkpointing intermediate join results and
The partition function for this query is straightforward. shuffling those results in the complete job does not
Since there are only two data sets involved in the query, one introduce too much overhead. In 350 GB data set, the first
partition function is sufficient. It partitions tuples from job of Hive only writes 1.7 GB intermediate results into
lineitem and orders to all available reducers. We set the HDFS which is much smaller than the input data sets. Since
number of reducers to the be sum of reducers in the first Map-Join-Reduce may need more memory and disk space
and second jobs that Hive launched. in each processing node, this observation suggests that if
Fig. 5 presents the performance of each system. In general, Map-Join-Reduce runs twice as fast as Hive. The main reason Hive is slower than Map-Join-Reduce is that Hive uses two MapReduce jobs to join lineitem and orders. This plan causes the intermediate results, namely, the unique keys of lineitem produced by J1, to be shuffled again for joining in J2. Actually, to speed up writing the unique l_orderkeys to HDFS, Hive already partitions and shuffles these keys to all reducers in J1. If this shuffling were also applied to orders in J1, so that all qualified orders tuples were shuffled to the reducers for joining, we believe Hive would be able to deliver the same performance as Map-Join-Reduce.

5.3.2 TPC-H Q3

Hive runs Q3 using five MapReduce jobs. The first job J1 joins the qualified tuples in customer and orders and writes the join results I1 to HDFS. The second job J2 joins I1 and lineitem and writes the join results I2 to HDFS. The third job J3 aggregates the joined tuples, and the remaining jobs sort the aggregation results and produce the final top ten tuples.

Map-Join-Reduce can process the query with two jobs. The first job scans all three data sets and shuffles the qualified tuples to the reducers. At the reducer side, two joiners are chained to join all tuples. The join order is the same as Hive's, namely, first joining orders and customer, followed by joining the results with lineitem. For partial aggregation, the reducers in the first job maintain a heap that holds the ten tuples with the largest revenues. Finally, the second job combines the partial aggregations and produces the final query answers.
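The per-reducer partial-aggregation state is therefore tiny. As a minimal sketch, assuming a simple Row representation of the joined tuples, the bounded heap can be maintained as follows.

```java
// Minimal sketch of the top-ten partial aggregation kept by each first-job reducer:
// a size-bounded min-heap ordered by revenue, so only the ten largest tuples survive.
// The Row type is assumed for illustration.
import java.util.PriorityQueue;

public class TopTenSketch {
    static class Row {
        final String orderkey;
        final double revenue;
        Row(String orderkey, double revenue) { this.orderkey = orderkey; this.revenue = revenue; }
    }

    private final PriorityQueue<Row> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.revenue, b.revenue));

    // Called once per joined tuple produced by the chained joiners.
    void offer(Row row) {
        heap.add(row);
        if (heap.size() > 10) heap.poll();   // evict the current smallest revenue
    }

    // Emitted as the partial aggregate; the second job merges these per-reducer
    // top-ten lists into the global answer.
    PriorityQueue<Row> partialResult() { return heap; }
}
```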
Two partition functions are built for the first job to partition the intermediate pairs. The domains of the partition functions are Dom[H1] = {c_custkey, o_custkey} and Dom[H2] = {l_orderkey, o_orderkey}. We set the number of reducers employed in the first job to be close to the sum of the reducers Hive used in J1, J2, and J3. The partition numbers of each partition function are tuned by the brute-force search algorithm described in Section 4.2.
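The one-to-many shuffling behind these partition functions can be sketched concretely. The sketch below is illustrative only: it assumes the first-job reducers are laid out as an n1 x n2 grid (n1 buckets for H1 over the custkey columns, n2 buckets for H2 over the orderkey columns), uses a crude replication-volume cost in the tuning loop, and requires n1 * n2 to equal the reducer count, which is also one reading of why a prime number of reducers prevents the partition functions from being built. The actual construction and search are the ones defined in Section 4.2.

```java
// Hedged sketch of composite partitioning for Q3. The reducers are viewed as an
// n1 x n2 grid: H1 hashes custkey values into n1 buckets, H2 hashes orderkey
// values into n2 buckets. A relation that lacks one of the two columns is sent
// to every bucket along the missing dimension (the one-to-many shuffle). The
// grid layout, the tuning objective, and the n1 * n2 == R requirement are
// assumptions for illustration only.
import java.util.ArrayList;
import java.util.List;

public class Q3PartitionSketch {
    final int n1, n2;                                   // tuned partition numbers

    Q3PartitionSketch(int n1, int n2) { this.n1 = n1; this.n2 = n2; }

    int reducerOf(int custBucket, int orderBucket) {
        return custBucket * n2 + orderBucket;           // grid cell -> reducer id
    }

    int bucket(long key, int n) {
        return (int) Math.floorMod(key, (long) n);      // simple per-dimension hash
    }

    // orders carries both o_custkey and o_orderkey: exactly one target reducer.
    int ordersTarget(long custkey, long orderkey) {
        return reducerOf(bucket(custkey, n1), bucket(orderkey, n2));
    }

    // customer carries only c_custkey: replicated across every orderkey bucket.
    List<Integer> customerTargets(long custkey) {
        List<Integer> targets = new ArrayList<>();
        for (int j = 0; j < n2; j++) targets.add(reducerOf(bucket(custkey, n1), j));
        return targets;
    }

    // lineitem carries only l_orderkey: replicated across every custkey bucket.
    List<Integer> lineitemTargets(long orderkey) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < n1; i++) targets.add(reducerOf(i, bucket(orderkey, n2)));
        return targets;
    }

    // Brute-force tuning stand-in: pick the factorization of R that minimizes a
    // crude replication-volume proxy (customer copied n2 times, lineitem n1 times).
    static int[] tune(long customerBytes, long lineitemBytes, int R) {
        int[] best = null;
        long bestCost = Long.MAX_VALUE;
        for (int n1 = 2; n1 <= R / 2; n1++) {
            if (R % n1 != 0) continue;
            int n2 = R / n1;
            long cost = customerBytes * n2 + lineitemBytes * n1;
            if (cost < bestCost) { bestCost = cost; best = new int[] { n1, n2 }; }
        }
        return best;                                    // null if R has no such split
    }
}
```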
Fig. 6 illustrates the results of this benchmark task. Although Map-Join-Reduce can perform all the joins within one job, it runs only about twice as fast as Hive. This is because the intermediate join results are small relative to the input, so checkpointing the intermediate join results and shuffling them in the subsequent job does not introduce much overhead. On the 350 GB data set, the first job of Hive writes only 1.7 GB of intermediate results to HDFS, which is much smaller than the input data sets. Since Map-Join-Reduce may need more memory and disk space on each processing node, this observation suggests that when computation resources are inadequate and the intermediate results are small, sequential processing can be used instead without much performance degradation.

5.3.3 TPC-H Q7

Hive compiles this query into ten MapReduce jobs. The first four jobs perform a self-join on nation to find the desired nation key pairs for the given nation names. The key pairs that are found are written to HDFS as temporary results I1. Then, another four jobs are launched to join lineitem, orders, customer, supplier, and I1. Finally, two additional jobs are launched to compute the aggregations, order the result tuples, and store them in HDFS.

In contrast to the complexity of Hive, Map-Join-Reduce only needs two MapReduce jobs to evaluate the query. The simplicity of Map-Join-Reduce is one of its strong points and, as expected, Map-Join-Reduce is much easier to use than MapReduce for expressing complex analysis logic. This feature becomes important if users prefer to write programs to analyze the data. In the first MapReduce job, mappers scan the data sets in parallel. We load nation into the memory of the mappers that scan supplier and customer for a map-side join. The logic on the reducer side is as per normal: three joiners, linked in Hive's join order, join the tuples and push the results to the reducer for aggregation.
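The map-side join of nation is essentially a broadcast hash join. The following is a minimal sketch of that idea; the class and method names are invented for illustration, and the small table is read from a plain local file here, whereas a Hadoop deployment would typically ship it to every mapper beforehand (for example, through the distributed cache).

```java
// Sketch of the map-side join described above: the small nation table is loaded
// into each mapper's memory, and supplier (or customer) tuples are joined against
// it as they are scanned.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MapSideJoinSketch {
    private final Map<Long, String> nationByKey = new HashMap<>();

    // Called once per mapper before any input tuple is processed.
    void loadNation(String localPath) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(localPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\\|");
                nationByKey.put(Long.parseLong(cols[0]), cols[1]); // n_nationkey -> n_name
            }
        }
    }

    // Called for every supplier tuple; returns the tuple extended with its nation
    // name, which is then shuffled to the reducers for the remaining joins.
    String mapSupplier(String[] supplier) {
        String nation = nationByKey.get(Long.parseLong(supplier[3])); // s_nationkey
        return nation == null ? null : supplier[0] + "|" + nation;    // dropped if no match
    }
}
```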
Three partition functions are also built. Their domains are

Dom[H1] = {s_suppkey, l_suppkey},
Dom[H2] = {c_custkey, o_custkey}, and
Dom[H3] = {o_orderkey, l_orderkey}.

The partial aggregates, ordered by the group key, are combined in the second job to generate the final query result. We also use the brute-force search method for parameter tuning.

Fig. 7. TPC-H Q7 (r = 3). Fig. 8. TPC-H Q9 (r = 3). Fig. 9. TPC-H Q4 (r = 1).
Fig. 10. TPC-H Q3 (r = 1). Fig. 11. TPC-H Q7 (r = 1). Fig. 12. TPC-H Q9 (r = 1).

Fig. 7 presents the performance of both systems. On average, Map-Join-Reduce runs nearly three times faster than Hive. We attribute this significant performance boost to the benefit of avoiding checkpointing and shuffling. In the test with the 350 GB data set, Hive needs to write more than 88 GB of data to HDFS, and all these data need to be shuffled again in the next MapReduce job. Frequent shuffling of such a large volume of data incurs a huge performance overhead. Thus, Hive is significantly slower than Map-Join-Reduce.

5.3.4 TPC-H Q9

Hive compiles this query into seven MapReduce jobs. Five jobs join lineitem, supplier, partsupp, part, and orders. Hive also joins nation with supplier on the map side. Then, two additional jobs are launched for aggregation and ordering. Map-Join-Reduce still performs the query with two MapReduce jobs. The procedure is similar to that of the previous queries, so we only list the partition functions and omit the other details. The domains of the partition functions are

Dom[H1] = {s_suppkey, l_suppkey, ps_suppkey},
Dom[H2] = {l_partkey, ps_partkey, p_partkey}, and
Dom[H3] = {o_orderkey, l_orderkey}.

We also join supplier with nation on the map side. Fig. 8 plots the result of this benchmark. This query shows the largest performance gap between the two systems: Map-Join-Reduce runs nearly four times faster than Hive. This is because Hive needs to checkpoint and shuffle a huge amount of intermediate results. In the test with the 350 GB data set on 100 nodes, Hive needs to checkpoint and shuffle more than 300 GB of intermediate join results. Although Map-Join-Reduce also incurs more than 400 GB of disk I/O in the first job's reducers for holding the input data and join results, reading and writing data on the local disks is less costly than shuffling those results over the network. Therefore, the significant performance boost is mainly due to the ability of Map-Join-Reduce to join all the data sets locally, which is another of its strong points.

5.3.5 Effect of Replication

We also set the replication factor r = 1 to study how replication affects performance. In this setting, the data sets and intermediate results are not replicated by HDFS and thus have only one copy. Since Hive failed to process the 500 GB data set, we only conduct this test on the 350 GB data set in a 100-node cluster.

Figs. 9, 10, 11, and 12 present the results of this benchmark. We do not see significant performance improvements for either system. It is reasonable for the performance of Map-Join-Reduce to be insensitive to replication, since replication only affects the volume of data written to HDFS, and Map-Join-Reduce writes only small partial aggregations to HDFS.
We also observe that in some settings, Map-Join-Reduce runs a little slower than the version that runs with three replicas. This is because fewer replicas increase the chance that the JobTracker schedules non-data-local map tasks.

For the tasks that produce small intermediate results, i.e., Q3 and Q4, there is also no performance improvement in Hive. However, for queries producing large intermediate results, i.e., Q9, setting the replication factor to one improves the performance of Hive by 10 percent.

6 RELATED WORK

There are two kinds of systems that are able to perform large-scale data analytical tasks on a shared-nothing cluster: 1) parallel databases and 2) MapReduce-based systems. The research on parallel databases started in the late 1980s [7]. Pioneering research systems include Gamma [8] and Grace [9]. A full comparison between parallel databases and MapReduce is presented in [10] and [11]. The comparison shows that the main differences between the two kinds of systems are performance and scalability. The scalability issues of the two systems are further studied in [12].

Efficient join processing has also been studied extensively in parallel database systems. The proposed work can be categorized into two classes: 1) two-way join algorithms and 2) schemes for evaluating multiway joins based on two-way joins. Work in the first category includes parallel nested-loop join, parallel sort-merge join, parallel hash join, and parallel partition join [8]. All these join algorithms have been implemented in MapReduce-based systems in one form or another [13], [14], [12]. Although Map-Join-Reduce targets multiway joins, these two-way join techniques can also be integrated into the Map-Join-Reduce framework. We discuss this problem in Section 2.5.

Parallel database systems evaluate multijoin queries through a pipelined processing strategy. Suppose we are to perform a three-way join R1 ⋈ R2 ⋈ R3. Typical pipelined processing works as follows: first, two nodes N1 and N2 scan R2 and R3 in parallel and load them into in-memory hash tables if the tables fit in memory. Then, a third node N3 reads tuples from R1 and pipelines them to N1 and N2, which probe R2 and R3 in turn and produce the final query results. Pipelined processing has been shown to be superior to sequential processing [15]. However, pipelined processing suffers from node failures, since it introduces dependencies between processing nodes. When a node (say N2) fails, the data flow is broken and thus the whole query needs to be resubmitted. Therefore, all MapReduce-based systems adopt the sequential processing strategy.
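The probe chain of such a pipelined plan can be illustrated with a single-process sketch; the tuple layout and key positions below are assumptions made purely for illustration. In the actual strategy each hash table resides on a different node, which is exactly the inter-node dependency that makes failures expensive.

```java
// Single-process illustration of the probe chain in the pipelined plan described
// above: R2 and R3 are loaded into hash tables, then each R1 tuple is streamed
// through both probes in turn.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipelinedJoinSketch {
    static List<String[]> pipeline(List<String[]> r1,
                                   Map<String, String[]> r2ByKey,
                                   Map<String, String[]> r3ByKey) {
        List<String[]> results = new ArrayList<>();
        for (String[] t1 : r1) {
            String[] t2 = r2ByKey.get(t1[1]);      // probe R2 on the first join key
            if (t2 == null) continue;
            String[] t3 = r3ByKey.get(t2[1]);      // probe R3 on the second join key
            if (t3 == null) continue;
            results.add(new String[] { t1[0], t2[0], t3[0] });
        }
        return results;
    }
}
```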
MapReduce was introduced by Dean et al. to simplify the construction of inverted indexes [1], but it was quickly found that the framework is also able to perform filtering-aggregation data analysis tasks [16]. More complex data analytical tasks can also be evaluated by a set of MapReduce jobs [1], [10]. Although join processing can be implemented in a MapReduce framework, processing heterogeneous data sets and manually writing a join algorithm is not straightforward. In [3], Map-Reduce-Merge is proposed to simplify join processing by introducing a Merge operation. Compared to that work, Map-Join-Reduce aims not only to alleviate the development effort but also to improve the performance of multiway join processing. There are also a number of query processing systems built on top of MapReduce, including Pig [6], Hive [5], and Cascading [17]. These systems provide a high-level query language and an associated optimizer for efficiently evaluating complex queries that may involve multiple joins. Compared to these systems, Map-Join-Reduce provides built-in support for multiway join processing and processes joins at the system level rather than at the application level. Therefore, Map-Join-Reduce can be used as a new building block (in addition to MapReduce) for these systems to generate efficient query plans.
During the preparation of this paper, we noticed that the work presented in a newly accepted paper [18] is closely related to ours. In [18], the authors propose an efficient multiway join processing strategy for MapReduce systems. The basic idea of their join processing strategy is similar to ours; our work was developed independently. Compared to [18], our work not only targets efficient join processing but also aims to simplify the development of complex data analytical tasks. For the join processing technique, the differences between their work and ours lie in the technical details. First, we adopt a derived join graph approach for generating the k-partition functions and consider every join column. In contrast, their work does not provide a concrete method for constructing partition functions; the join columns included in the partition functions are determined by a Lagrangean optimization process, and not every join column will be included. Second, to optimize the parameters of the partition functions, we adopt a heuristic solution to an integer programming problem. This heuristic guarantees that feasible parameters (satisfying all constraints) can always be found, although the parameters may not be optimal. In contrast, their Lagrangean optimization technique does not guarantee that feasible parameters can always be found; in cases where no feasible parameters are returned, they employ alternative strategies for deriving the parameters. Finally, we provide a working system and a comprehensive benchmarking study of the proposed approach.
7 CONCLUSION

In this paper, we present Map-Join-Reduce, a system that extends and improves the MapReduce runtime system to efficiently process complex data analytical tasks on large clusters. The novelty of Map-Join-Reduce is that it introduces a filtering-join-aggregation programming model. This programming model allows users to specify data analytical tasks that require joining multiple data sets for aggregate computation through a relatively simple interface offering three functions: map(), join(), and reduce(). We have designed a one-to-many shuffling strategy and demonstrated its use in efficiently processing filtering-join-aggregation tasks. We have conducted a benchmarking study against Hive on Amazon EC2 using the TPC-H benchmark. The benchmarking results show that our approach significantly improves the performance of complex data analytical tasks and confirm the potential of the Map-Join-Reduce approach.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Operating Systems Design and Implementation (OSDI), pp. 137-150, 2004.
[2] http://developer.yahoo.net/blogs/hadoop/2008/09/, 2011.
[3] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D.S. Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07), 2007.
[4] D. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov, "Clustera: An Integrated Computation and Data Management System," Proc. VLDB Endowment, vol. 1, no. 1, pp. 28-41, 2008.
[5] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive—A Warehousing Solution over a Map-Reduce Framework," Proc. VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.
[6] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-so-Foreign Language for Data Processing," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), 2008.
[7] D. DeWitt and J. Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Comm. ACM, vol. 35, no. 6, pp. 85-98, 1992.
[8] D.J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, K.B. Kumar, and M. Muralikrishna, "Gamma—A High Performance Dataflow Database Machine," Proc. 12th Int'l Conf. Very Large Data Bases, pp. 228-237, 1986.
[9] S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An Overview of the System Software of a Parallel Relational Database Machine Grace," Proc. 12th Int'l Conf. Very Large Data Bases, pp. 209-219, 1986.
[10] A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. DeWitt, S. Madden, and M. Stonebraker, "A Comparison of Approaches to Large-Scale Data Analysis," Proc. 35th SIGMOD Int'l Conf. Management of Data (SIGMOD '09), http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf, June 2009.
[11] M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?" Comm. ACM, vol. 53, no. 1, pp. 64-71, 2010.
[12] A. Abouzeid, K. Bajda-Pawlikowski, D.J. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," Proc. VLDB Endowment, vol. 2, no. 1, pp. 922-933, 2009.
[13] http://hadoop.apache.org, 2011.
[14] K.S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E.J. Shekita, D.E. Simmen, S. Tata, S. Vaithyanathan, and H. Zhu, "Towards a Scalable Enterprise Content Analytics Platform," IEEE Data Eng. Bull., vol. 32, no. 1, pp. 28-35, Mar. 2009.
[15] D.A. Schneider and D.J. DeWitt, "Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines," Proc. 16th Int'l Conf. Very Large Data Bases, pp. 469-480, 1990.
[16] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall," Scientific Programming, vol. 13, no. 4, pp. 277-298, 2005.
[17] http://www.cascading.org, 2011.
[18] F.N. Afrati and J.D. Ullman, "Optimizing Joins in a Map-Reduce Environment," Proc. 13th Int'l Conf. Extending Database Technology (EDBT '10), 2010.
[19] http://www.tpc.org/tpch/, 2011.
[20] D.A. Schneider and D.J. DeWitt, "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," ACM SIGMOD Record, vol. 18, no. 2, pp. 110-121, 1989.
[21] M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S.R. Madden, E.J. O'Neil, P.E. O'Neil, A. Rasin, N. Tran, and S.B. Zdonik, "C-Store: A Column-Oriented DBMS," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 553-564, 2005.
[22] M. Ziane, M. Zaït, and P. Borla-Salamet, "Parallel Query Processing with Zigzag Trees," The VLDB J.—Int'l J. Very Large Data Bases—Parallelism in Database Systems, vol. 2, no. 3, pp. 277-302, 1993.
[23] A. Weintraub, "Integer Programming in Forestry," Annals of Operations Research, vol. 149, no. 1, pp. 209-216, 2007.
[24] http://www.cloudera.com/blog/2009/05/07/what's-new-in-hadoop-core-020/, 2011.
[25] http://issues.apache.org/jira/browse/hive-600, 2011.
[26] http://issues.apache.org/jira/browse/hive-396, 2011.

Dawei Jiang received the BSc and PhD degrees in computer science from Southeast University in 2001 and 2008, respectively. He is currently a research fellow at the School of Computing, National University of Singapore. His research interests include cloud computing, database systems, and large-scale distributed systems.

Anthony K.H. Tung received the BSc (second class honour) and MSc degrees in computer sciences from the National University of Singapore in 1997 and 1998, respectively. In 2001, he received the PhD degree in computer sciences from Simon Fraser University (SFU). He is currently an associate professor in the School of Computing, National University of Singapore (NUS), a junior faculty member in the NUS Graduate School for Integrative Sciences and Engineering, and a SINGA supervisor. His research interests include the whole process of converting data into intelligence. He is also affiliated with the Database Lab, the Computational Biology Lab, and the NUS Bioinformatics Programs.

Gang Chen received the BSc, MSc, and PhD degrees in computer science and engineering from Zhejiang University in 1993, 1995, and 1998, respectively. He is currently a professor at the College of Computer Science, Zhejiang University. His research interests include databases, information retrieval, information security, and computer-supported cooperative work. He is also the executive director of the Zhejiang University—Netease Joint Lab on Internet Technology.