Big Data Processing in Cloud Computing Environments

Abstract—With the rapid growth of emerging applications like social network analysis, semantic Web analysis and bioinformatics network analysis, the volume and variety of data to be processed continue to increase rapidly. Effective management and analysis of large-scale data poses an interesting but critical challenge. Recently, big data has attracted a lot of attention from academia, industry and government. This paper introduces several big data processing techniques from the system and application aspects. First, from the view of cloud data management and big data processing mechanisms, we present the key issues of big data processing, including the cloud computing platform, cloud architecture, cloud databases and data storage schemes. Following the MapReduce parallel processing framework, we then introduce MapReduce optimization strategies and applications reported in the literature. Finally, we discuss the open issues and challenges, and explore future research directions for big data processing in cloud computing environments.

Keywords-Big Data; Cloud Computing; Data Management; Distributed Computing.

I. INTRODUCTION

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Big data is not only becoming more available but also more understandable to computers. For example, modern high-energy physics experiments, such as DZero (http://www-d0.fnal.gov/), typically generate more than one terabyte of data per day. The famous social network website Facebook (http://www.facebook.com) serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. Google's search and ad business, Facebook, Flickr, YouTube and LinkedIn use a bundle of artificial-intelligence techniques that require parsing vast quantities of data and making decisions instantaneously. Multimedia data mining platforms make it easy for everybody to achieve these goals with a minimum amount of effort in terms of software, CPU and network. On March 29, 2012, the American government announced the "Big Data Research and Development Initiative", making big data a national policy for the first time (http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal). All these examples show that daunting big data challenges exist and that significant resources have been allocated to support these data-intensive operations, which leads to high storage and data processing costs.

Current technologies, such as grid and cloud computing, all aim to provide access to large amounts of computing power by aggregating resources and offering a single system view. Among these technologies, cloud computing is becoming a powerful architecture for performing large-scale and complex computing, and it has revolutionized the way computing infrastructure is abstracted and used. In addition, an important aim of these technologies is to deliver computing as a solution for tackling big data, such as large-scale, multimedia and high-dimensional data sets.

Big data and cloud computing are both among the fastest-moving technologies identified in Gartner Inc.'s 2012 Hype Cycle for Emerging Technologies (http://www.gartner.com). Cloud computing is associated with a new paradigm for the provision of computing infrastructure and big data processing methods for all kinds of resources. Moreover, some new cloud-based technologies have to be adopted, because dealing with big data for concurrent processing is difficult.

Then what is big data? In a 2008 publication of the journal Nature, "big data" is defined as representing "the progress of the human cognitive processes, usually includes data sets with sizes beyond the ability of current technology, method and theory to capture, manage, and process the data within a tolerable elapsed time"[1]. Recently, a definition of big data was also given by Gartner: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"[2]. According to Wikipedia, "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools" (http://en.wikipedia.org/wiki/Big-data).

The goal of this paper is to report on the status of big data studies and related work, aiming to provide a general view of big data management technologies and
C. Open Source Cloud Platform

The main idea behind the data center is to leverage virtualization technology to maximize the utilization of computing resources. Therefore, it provides the basic ingredients, such as storage, CPUs and network bandwidth, as a commodity through specialized service providers at low unit cost. To reach the goals of big data management, most research institutions and enterprises bring virtualization into their cloud architectures. Amazon Web Services (AWS), Eucalyptus, OpenNebula, CloudStack and OpenStack are the most popular cloud management platforms for infrastructure as a service (IaaS). AWS (http://aws.amazon.com/what-is-aws/) is not free, but it is heavily used as an elastic platform; it is very easy to use and charges on a pay-as-you-go basis. Eucalyptus[14] works in IaaS as open source software and uses virtual machines to control and manage resources. Since Eucalyptus is the earliest cloud management platform for IaaS, it signed an API compatibility agreement with AWS, and it has a leading position in the private cloud market for the AWS ecosystem. OpenNebula[15] integrates with various environments. It can offer the richest features, flexible ways and better interoperability to build private, public or hybrid clouds. However, OpenNebula is not a Service Oriented Architecture (SOA) design and has weak decoupling among its independent computing, storage and network components. CloudStack (http://cloudstack.org/software.pdf) is an open source cloud operating system which delivers public cloud computing similar to Amazon EC2, but using users' own hardware. CloudStack users can take full advantage of cloud computing to deliver higher efficiency, limitless scale and faster deployment of new services and systems to the end user. At present, CloudStack is one of the Apache open source projects and already has mature functionality; however, it needs to further strengthen its loose coupling and component design. OpenStack (http://www.openstack.org/) is a collection of open source software projects aiming to build an open-source community of researchers, developers and enterprises. People in this community share a common goal: to create a cloud that is simple to deploy, massively scalable and full of rich features. The architecture and components of OpenStack are straightforward and stable, so it is a good choice for providing specific applications for enterprises. At present, OpenStack has a good community and ecosystem; however, it still has some shortcomings, such as incomplete functionality and a lack of commercial support.

III. APPLICATIONS AND OPTIMIZATION

A. Application

In this age of data explosion, parallel processing is essential for processing a massive volume of data in a timely manner. The use of parallelization techniques and algorithms is the key to achieving better scalability and performance for processing big data. At present, there are many popular parallel processing models, including MPI, general-purpose GPU (GPGPU) computing, MapReduce and MapReduce-like models. MapReduce, proposed by Google, is a very popular big data processing model that has rapidly been studied and applied by both industry and academia. MapReduce has two major advantages: the model hides details related to data storage, distribution, replication, load balancing and so on; furthermore, it is so simple that programmers only specify two functions, a map function and a reduce function, to perform the processing of the big data. We divide existing MapReduce applications into three categories: partitioning sub-space, decomposing sub-processes and approximate overlapping calculations.

While MapReduce is referred to as a new approach to processing big data in cloud computing environments, it has also been criticized as a "major step backwards" compared with DBMSs[16]. MapReduce is schema-free and index-free, so the framework has to parse each record when reading its input. As the debate continued, the eventual conclusion was that neither is good at what the other does well, and that the two technologies are complementary[17]. Recently, some DBMS vendors have also integrated MapReduce front-ends into their systems, including Aster, HadoopDB[18], Greenplum[19] and Vertica. Most of these are still databases, which simply provide a MapReduce front-end to a DBMS. HadoopDB is a hybrid system that efficiently combines the best features of both: the scalability of MapReduce and the performance of a DBMS. The results show that HadoopDB improves the task processing times of Hadoop by a large factor, matching a shared-nothing DBMS. Lately, J. Dittrich et al. proposed a new type of system named Hadoop++[20] and pointed out that HadoopDB also has severe drawbacks, including forcing users to use a DBMS and changing the interface to SQL. There are also papers adapting the inverted index, a simple but practical index structure appropriate for MapReduce processing of big data, such as [21]. We have also studied the large-scale spatial data environment intensively and designed a distributed inverted grid index by combining an inverted index and a spatial grid partition with the MapReduce model; it is simple, dynamic, scalable and fit for processing high-dimensional spatial data[22].

MapReduce has received a lot of attention in many fields, including data mining, information retrieval, image retrieval, machine learning and pattern recognition. For example, Mahout (http://mahout.apache.org/) is an Apache project that aims at building scalable machine learning libraries, all implemented on Hadoop. However, as the amount of data that needs to be processed grows, many data processing methods have become unsuitable or limited.
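The two-function programming model discussed in Section III-A is commonly illustrated with word counting: map emits a (word, 1) pair for every word, and reduce sums the counts per word. The sketch below runs this dataflow in memory; the run_mapreduce driver is our own minimal stand-in for the framework's shuffle and sort, not part of any surveyed system.

```python
from itertools import groupby
from operator import itemgetter

# User-supplied map function: emit (word, 1) for every word in a line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# User-supplied reduce function: sum all counts observed for one key.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# Minimal in-memory stand-in for the framework's shuffle/sort/reduce;
# a real runtime would also handle storage, replication and load balancing.
def run_mapreduce(lines, map_fn, reduce_fn):
    intermediate = [kv for line in lines for kv in map_fn(line)]
    intermediate.sort(key=itemgetter(0))  # shuffle: bring equal keys together
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            results[k] = v
    return results

if __name__ == "__main__":
    docs = ["big data cloud", "cloud computing", "big data"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'big': 2, 'cloud': 2, 'computing': 1, 'data': 2}
```

The point of the model is that only map_fn and reduce_fn are application code; everything in run_mapreduce stands for what the framework itself provides and parallelizes across a cluster.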
Recently, many research efforts have exploited the MapReduce framework to solve challenging data processing problems on large-scale datasets in different domains. For example, Ricardo[23] is a software system that integrates the R statistical tool with Hadoop to support parallel data analysis. RankReduce[24] combines Locality Sensitive Hashing (LSH) with MapReduce to effectively perform k-nearest-neighbor search in high-dimensional spaces. R. F. Cordeiro et al.[25] proposed the BoW method for clustering very large, multi-dimensional datasets with MapReduce; it is a hard-clustering method that allows an automatic, dynamic trade-off between disk delay and network delay. MapDupReducer[26] is a MapReduce-based system capable of efficiently detecting near duplicates over massive datasets. In addition, C. Ranger et al.[27] implemented the MapReduce framework on multiple processors in a single machine and achieved good performance. Recently, B. He et al. developed Mars[28], a GPU-based MapReduce framework that gains better performance than state-of-the-art CPU-based frameworks.

B. Optimization

In this section, we present details of approaches to improve the performance of processing big data with MapReduce.

1) Data Transfer Bottlenecks: A big challenge cloud users must consider is how to minimize the cost of data transmission. Consequently, researchers have begun to propose a variety of approaches. Map-Reduce-Merge[29] is a new model that adds a Merge phase after the Reduce phase; it combines the reduced outputs of two different MapReduce jobs into one and can efficiently merge data that is already partitioned and sorted (or hashed) by the map and reduce modules. Map-Join-Reduce[30] is a system that extends and improves the MapReduce runtime framework by adding a Join stage before the Reduce stage to perform complex data analysis tasks on large clusters. Its authors present a new data processing strategy which runs filtering-join-aggregation tasks as two consecutive MR jobs and adopts a one-to-many shuffling scheme to avoid frequent checkpointing and shuffling of intermediate results. Moreover, different jobs often perform similar work, so sharing that work reduces the overall amount of data transferred between jobs. MRShare[31] is a sharing framework proposed by T. Nykiel et al. that transforms a batch of queries into a new batch that can be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Data skew is also an important factor that affects data transfer cost. To overcome this deficiency, we proposed a method[32] that divides a MapReduce job into two phases: a sampling MapReduce job and the expected MapReduce job. The first phase samples the input data, gathers the inherent distribution of key frequencies and prepares a good partition scheme in advance. In the second phase, the expected MapReduce job applies this partition scheme in every mapper to group the intermediate keys quickly.

2) Iterative Optimization: MapReduce is also a popular platform in which the dataflow takes the form of a directed acyclic graph of operators. However, it incurs a great deal of I/O and unnecessary computation when solving iterative problems. Twister[33], proposed by J. Ekanayake et al., is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently by adding an extra Combine stage after the Reduce stage; data output by the Combine stage flows into the next iteration's Map stage. It avoids instantiating workers repeatedly across iterations: previously instantiated workers are reused for the next iteration with different inputs. HaLoop[34] is similar to Twister. It is a modified version of the MapReduce framework that supports iterative applications by adding loop control, and it can cache each stage's input and output to save further I/O across iterations. Graph data processing likewise involves many iterations. Pregel[35] implements a programming model motivated by the Bulk Synchronous Parallel (BSP) model, in which each node has its own input and transfers to other nodes only the messages required for the next iteration.

3) Online: Some jobs need to process data online, which the original MapReduce cannot do very well. MapReduce Online[36] is designed to support online aggregation and continuous queries in MapReduce. It raises the issue that frequent checkpointing and shuffling of intermediate results limit pipelined processing. Its authors modify the MapReduce framework so that mappers periodically push data temporarily stored in local storage to the reducers within the same MR job; in addition, map-side pre-aggregation is used to reduce communication. The Hadoop Online Prototype (HOP)[37] proposed by T. Condie et al. is similar to MapReduce Online. HOP is a modified version of the MapReduce framework that allows users to get early returns from a job as it is being computed. It also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing while retaining the fault tolerance properties of Hadoop. D. Jiang et al.[38] found that the merge sort in MapReduce is I/O-intensive and seriously affects the performance of MapReduce. In their study, results are hashed and pushed to hash tables held by the reducers as soon as each map task outputs its intermediate results, and the reducers then aggregate the values in each bucket. Since each bucket in the hash table holds all the values corresponding to a distinct key, no grouping is required. In addition, reducers can perform aggregation on the fly, even before all mappers have completed.

4) Join Query Optimization: Join queries are a popular problem in the big data area. However, a join needs two or more inputs, while MapReduce is devised for processing a single input.
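Because a MapReduce job consumes a single input, a common workaround for two-input equi-joins is the reduce-side (repartition) join: the map phase tags each record with its source relation, and the reduce phase pairs the tagged groups per join key. The single-process sketch below only illustrates the idea; the relations, key extractors and helper names are invented for this example and are not taken from any of the surveyed systems.

```python
from collections import defaultdict

# Map phase: tag each record with its source relation so the reducer
# can tell the two inputs apart after they are shuffled together.
def map_tagged(records, tag, key_fn):
    for rec in records:
        yield (key_fn(rec), (tag, rec))

# Reduce phase: for one join key, pair every record from R with
# every record from S (an equi-join on that key).
def reduce_join(key, tagged):
    r_side = [rec for tag, rec in tagged if tag == "R"]
    s_side = [rec for tag, rec in tagged if tag == "S"]
    for r in r_side:
        for s in s_side:
            yield (key, r, s)

def repartition_join(r_records, s_records, r_key, s_key):
    # Shuffle: group all tagged records by their join key.
    groups = defaultdict(list)
    for k, v in map_tagged(r_records, "R", r_key):
        groups[k].append(v)
    for k, v in map_tagged(s_records, "S", s_key):
        groups[k].append(v)
    # Reduce each key group independently, as separate reducers would.
    return [t for k in sorted(groups) for t in reduce_join(k, groups[k])]

if __name__ == "__main__":
    users = [(1, "alice"), (2, "bob")]    # (user_id, name)
    orders = [(10, 1), (11, 1), (12, 2)]  # (order_id, user_id)
    for row in repartition_join(users, orders, lambda u: u[0], lambda o: o[1]):
        print(row)
```

The cost of this scheme is that both relations are shuffled in full, which is exactly the overhead that the specialized join strategies surveyed below try to reduce.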
R. Vernica et al.[39] proposed a 3-stage approach for end-to-end set-similarity joins. They efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. Wei Lu et al. investigated how to perform kNN joins using MapReduce[40]: mappers cluster objects into groups, and reducers then perform the kNN join on each group of objects separately. To reduce shuffling and computational costs, they designed an effective mapping mechanism that exploits pruning rules for distance filtering. In addition, two approximate algorithms minimize the number of replicas to reduce the shuffling cost.

IV. DISCUSSION AND CHALLENGES

We are now in the days of big data, able to gather more and more information from the daily life of every human being. The top seven big data drivers are science data, Internet data, finance data, mobile device data, sensor data, RFID data and streaming data. Coupled with recent advances in machine learning and reasoning, as well as rapid rises in computing power and storage, we are transforming our ability to make sense of these increasingly large, heterogeneous, noisy and incomplete datasets collected from a variety of sources.

So far, researchers have not been able to agree on the essential features of big data. Some think that big data is the data that we are not able to process using pre-existing technology, methods and theory. However, no matter how we define big data, the world is entering an age of "helplessness" as immeasurable varieties of data are generated by science, business and society. Big data poses new challenges for data management and analysis, and even for the whole IT industry.

We consider three important aspects of the problems encountered in processing big data, and we present our points of view in detail as follows.

Big Data Storage and Management: Current data management systems are not able to satisfy the needs of big data, and storage capacity is increasing far more slowly than data volume, so a revolutionary reconstruction of the information framework is desperately needed. We need to design a hierarchical storage architecture. Besides, previous computer algorithms cannot effectively store data directly acquired from the real world, due to the heterogeneity of big data, although they perform excellently on homogeneous data. Therefore, how to re-organize data is one big problem in big data management. Virtual server technology can exacerbate the problem, raising the prospect of overcommitted resources, especially if communication is poor between the application, server and storage administrators. We also need to solve the bottleneck problems of highly concurrent I/O and of the single name node in the present master-slave system model.

Big Data Computation and Analysis: When processing a query over big data, speed is a significant demand[41]. However, the process may take time, because in most cases it cannot traverse all the related data in the whole database in a short time. In this case, an index is an optimal choice. At present, indexes for big data only target simple types of data, while big data is becoming more complicated. Combining an appropriate big data index with up-to-date preprocessing technology will be a desirable solution when we encounter this kind of problem. Application parallelization and divide-and-conquer are natural computational paradigms for approaching big data problems. But obtaining additional computational resources is not as simple as just upgrading to a bigger and more powerful machine on the fly. Traditional serial algorithms are inefficient for big data; if there is enough data parallelism in the application, users can exploit the cloud's reduced cost model to use hundreds of computers for a short time.

Big Data Security: By using online big data applications, many companies can greatly reduce their IT costs. However, security and privacy concerns affect the entire process of big data storage and processing, since there is massive use of third-party services and infrastructures to host important data or to perform critical operations. The scale of data and applications grows exponentially, bringing huge challenges for dynamic data monitoring and security protection. Unlike traditional security methods, security in big data is mainly a question of how to perform data mining without exposing users' sensitive information. Besides, current privacy protection technologies are mainly based on static data sets, while data is always changing dynamically, including the data pattern, the variation of attributes and the addition of new data. Thus, it is a challenge to implement effective privacy protection in this complex circumstance. In addition, legal and regulatory issues also need attention.

V. CONCLUSION

This paper presented a systematic survey of big data processing in the context of cloud computing. We discussed the key issues, including cloud storage and computing architectures, popular parallel processing frameworks, and major applications and optimizations of MapReduce. Big data is not a new concept, but it is very challenging. It calls for scalable storage indexes and a distributed approach to retrieving required results in near real time, because it is a fundamental fact that data is too big to process conventionally. Nevertheless, big data will remain complex and will persist through all these big challenges, which are also big opportunities for us. In the future, significant challenges need to be tackled by industry and academia. There is an urgent need for computer scientists and social science scholars to cooperate closely in order to guarantee the long-term success of cloud computing and to collectively explore new territory.
REFERENCES

[1] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, p. 1, 2008.
[2] Douglas and Laney, "The importance of 'big data': A definition," 2008.
[3] D. Kossmann, T. Kraska, and S. Loesing, "An evaluation of alternative architectures for transaction processing in the cloud," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 579–590.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, 2007.
[7] A. Rabkin and R. Katz, "Chukwa: A system for reliable large-scale log collection," in USENIX Conference on Large Installation System Administration, 2010, pp. 1–15.
[8] S. Sakr, A. Liu, D. Batista, and M. Alomari, "A survey of large scale data management approaches in cloud environments," IEEE Communications Surveys & Tutorials, vol. 13, no. 3, pp. 311–336, 2011.
[9] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. Ooi, H. Vo, S. Wu, and Q. Xu, "ES2: A cloud data storage system for supporting both OLTP and OLAP," in Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011, pp. 291–302.
[10] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed structured data storage system," in 7th OSDI, 2006, pp. 305–314.
[11] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277–1288, 2008.
[12] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205–220.
[13] Y. Lin, D. Agrawal, C. Chen, B. Ooi, and S. Wu, "Llama: leveraging columnar storage for scalable join processing in the MapReduce framework," in Proceedings of the 2011 international conference on Management of data. ACM, 2011, pp. 961–972.
[14] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Cluster Computing and the Grid, 2009. CCGRID '09. 9th IEEE/ACM International Symposium on. IEEE, 2009, pp. 124–131.
[15] P. Sempolinski and D. Thain, "A comparison and critique of Eucalyptus, OpenNebula and Nimbus," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 417–426.
[16] D. DeWitt and M. Stonebraker, "MapReduce: A major step backwards," The Database Column, vol. 1, 2008.
[17] M. Stonebraker, D. Abadi, D. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: friends or foes?" Communications of the ACM, vol. 53, no. 1, pp. 64–71, 2010.
[18] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 922–933, 2009.
[19] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and parallel DBMS," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 969–974.
[20] J. Dittrich, J. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.
[21] D. Logothetis and K. Yocum, "Ad-hoc data processing in the cloud," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1472–1475, 2008.
[22] C. Ji, T. Dong, Y. Li, Y. Shen, K. Li, W. Qiu, W. Qu, and M. Guo, "Inverted grid-based kNN query processing with MapReduce," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 25–33.
[23] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson, "Ricardo: integrating R and Hadoop," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 987–998.
[24] A. Stupar, S. Michel, and R. Schenkel, "RankReduce: processing k-nearest neighbor queries on top of MapReduce," in Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2010, pp. 13–18.
[25] R. Ferreira Cordeiro, C. Traina Junior, A. Machado Traina, J. López, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 690–698.
[26] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "MapDupReducer: detecting near duplicates over massive datasets," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 1119–1122.
[27] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 13–24.
[28] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 260–269.
[29] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-Reduce-Merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007, pp. 1029–1040.
[30] D. Jiang, A. Tung, and G. Chen, "Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 9, pp. 1299–1311, 2011.
[31] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, "MRShare: Sharing across multiple queries in MapReduce," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 494–505, 2010.
[32] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in MapReduce for skewed data," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 1–8.
[33] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 810–818.
[34] Y. Bu, B. Howe, M. Balazinska, and M. Ernst, "HaLoop: Efficient iterative data processing on large clusters," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 285–296, 2010.
[35] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 135–146.
[36] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce Online," in Proceedings of the 7th USENIX conference on Networked systems design and implementation, 2010, pp. 21–21.
[37] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, "Online aggregation and continuous query support in MapReduce," in ACM SIGMOD, 2010, pp. 1115–1118.
[38] D. Jiang, B. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: An in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472–483, 2010.
[39] R. Vernica, M. Carey, and C. Li, "Efficient parallel set-similarity joins using MapReduce," in SIGMOD Conference, 2010, pp. 495–506.
[40] C. Zhang, F. Li, and J. Jestes, "Efficient parallel kNN joins for large data in MapReduce," in Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012, pp. 38–49.
[41] X. Zhou, J. Lu, C. Li, and X. Du, "Big data challenge in the management perspective," Communications of the CCF, vol. 8, pp. 16–20, 2012.