Big Data Processing in Cloud Computing Environments

Abstract—With the rapid growth of emerging applications like social network analysis, semantic Web analysis and bioinformatics network analysis, the volume and variety of data to be processed continue to increase rapidly. Effective management and analysis of large-scale data poses an interesting but critical challenge. Recently, big data has attracted a lot of attention from academia, industry and government. This paper introduces several big data processing techniques from the system and application aspects. First, from the view of cloud data management and big data processing mechanisms, we present the key issues of big data processing, including the cloud computing platform, cloud architecture, cloud databases and data storage schemes. Following the MapReduce parallel processing framework, we then introduce MapReduce optimization strategies and applications reported in the literature. Finally, we discuss the open issues and challenges, and explore future research directions for big data processing in cloud computing environments.

Keywords-Big Data; Cloud Computing; Data Management; Distributed Computing.

I. INTRODUCTION

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Big data is not only becoming more available but also more understandable to computers. For example, modern high-energy physics experiments, such as DZero (http://www-d0.fnal.gov/), typically generate more than one terabyte of data per day. The famous social network website Facebook (http://www.facebook.com) serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. Google's search and ad business, Facebook, Flickr, YouTube and LinkedIn use a bundle of artificial-intelligence techniques that require parsing vast quantities of data and making decisions instantaneously. Multimedia data mining platforms make it easy for everybody to achieve these goals with a minimum amount of effort in terms of software, CPU and network. On March 29, 2012, the American government announced the "Big Data Research and Development Initiative", making big data a national policy for the first time (http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal). All these examples show that daunting big data challenges exist and that significant resources have been allocated to support these data-intensive operations, which leads to high storage and data processing costs.

Current technologies, such as grid and cloud computing, all aim to provide access to large amounts of computing power by aggregating resources and offering a single system view. Among these technologies, cloud computing is becoming a powerful architecture for performing large-scale and complex computing, and it has revolutionized the way computing infrastructure is abstracted and used. In addition, an important aim of these technologies is to deliver computing as a solution for tackling big data, such as large-scale, multimedia and high-dimensional data sets.

Big data and cloud computing are both among the fastest-moving technologies identified in Gartner Inc.'s 2012 Hype Cycle for Emerging Technologies (http://www.gartner.com). Cloud computing is associated with a new paradigm for the provision of computing infrastructure and big data processing methods for all kinds of resources. Moreover, some new cloud-based technologies have to be adopted, because dealing with big data for concurrent processing is difficult.

Then what is big data? In a 2008 publication of the journal Nature, "big data" is defined as representing "the progress of the human cognitive processes, usually includes data sets with sizes beyond the ability of current technology, method and theory to capture, manage, and process the data within a tolerable elapsed time"[1]. Recently, a definition of big data was also given by Gartner: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"[2]. According to Wikipedia, "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools" (http://en.wikipedia.org/wiki/Big-data).

The goal of this paper is to report on the status of big data studies and related work, aiming to provide a general view of big data management technologies and
C. Open Source Cloud Platform

The main idea behind the data center is to leverage virtualization technology to maximize the utilization of computing resources. Therefore, it provides the basic ingredients, such as storage, CPUs and network bandwidth, as a commodity through specialized service providers at low unit cost. To reach the goals of big data management, most research institutions and enterprises bring virtualization into their cloud architectures. Amazon Web Services (AWS), Eucalyptus, OpenNebula, CloudStack and OpenStack are the most popular cloud management platforms for infrastructure as a service (IaaS). AWS (http://aws.amazon.com/what-is-aws/) is not free, but it is heavily used as an elastic platform; it is very easy to use and charges on a pay-as-you-go basis. Eucalyptus[14] works in IaaS as open source software and uses virtual machines to control and manage resources. Since Eucalyptus is the earliest cloud management platform for IaaS, it signed an API compatibility agreement with AWS, and it has a leading position in the private cloud market for the AWS ecosystem. OpenNebula[15] integrates with various environments. It can offer the richest features, flexible ways and better interoperability to build private, public or hybrid clouds. However, OpenNebula is not a Service Oriented Architecture (SOA) design and has weak decoupling among its independent computing, storage and network components. CloudStack (http://cloudstack.org/software.pdf) is an open source cloud operating system which delivers public cloud computing similar to Amazon EC2, but using users' own hardware. CloudStack users can take full advantage of cloud computing to deliver higher efficiency, limitless scale and faster deployment of new services and systems to the end user. At present, CloudStack is one of the Apache open source projects and already has mature functionality; however, it needs to further strengthen its loose coupling and component design. OpenStack (http://www.openstack.org/) is a collection of open source software projects aiming to build an open-source community of researchers, developers and enterprises. People in this community share a common goal: to create a cloud that is simple to deploy, massively scalable and full of rich features. The architecture and components of OpenStack are straightforward and stable, so it is a good choice for providing specific applications for enterprises. At present, OpenStack has a good community and ecosystem; however, it still has some shortcomings, such as incomplete functionality and a lack of commercial support.

III. APPLICATIONS AND OPTIMIZATION

A. Application

In this age of data explosion, parallel processing is essential for processing a massive volume of data in a timely manner. The use of parallelization techniques and algorithms is the key to achieving better scalability and performance for processing big data. At present, there are many popular parallel processing models, including MPI, general-purpose GPU (GPGPU) computing, MapReduce and MapReduce-like models. MapReduce, proposed by Google, is a very popular big data processing model that has rapidly been studied and applied by both industry and academia. MapReduce has two major advantages: the model hides details related to data storage, distribution, replication, load balancing and so on; furthermore, it is so simple that programmers only specify two functions, a map function and a reduce function, to perform the processing of the big data. We divide existing MapReduce applications into three categories: partitioning sub-space, decomposing sub-processes and approximate overlapping calculations.

While MapReduce is referred to as a new approach to processing big data in cloud computing environments, it has also been criticized as a "major step backwards" compared with DBMSs[16]. MapReduce is schema-free and index-free, so the framework has to parse each record when reading its input. As the debate continued, the eventual conclusion was that neither is good at what the other does well, and that the two technologies are complementary[17]. Recently, some DBMS vendors have also integrated MapReduce front-ends into their systems, including Aster, HadoopDB[18], Greenplum[19] and Vertica. Most of these are still databases, which simply provide a MapReduce front-end to a DBMS. HadoopDB is a hybrid system that efficiently combines the best features of both: the scalability of MapReduce and the performance of a DBMS. The results show that HadoopDB improves the task processing times of Hadoop by a large factor, matching a shared-nothing DBMS. Lately, J. Dittrich et al. proposed a new type of system named Hadoop++[20] and pointed out that HadoopDB also has severe drawbacks, including forcing users to use a DBMS and changing the interface to SQL. There are also papers adapting the inverted index, a simple but practical index structure appropriate for MapReduce processing of big data, such as [21]. We have also studied the large-scale spatial data environment intensively and designed a distributed inverted grid index by combining an inverted index and a spatial grid partition with the MapReduce model; it is simple, dynamic, scalable and fit for processing high-dimensional spatial data[22].

MapReduce has received a lot of attention in many fields, including data mining, information retrieval, image retrieval, machine learning and pattern recognition. For example, Mahout (http://mahout.apache.org/) is an Apache project that aims at building scalable machine learning libraries, all implemented on Hadoop. However, as the amount of data that needs to be processed grows, many data processing methods have become unsuitable or limited.
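The two-function programming model discussed in Section III-A is commonly illustrated with word counting: map emits a (word, 1) pair for every word, and reduce sums the counts per word. The sketch below runs this dataflow in memory; the run_mapreduce driver is our own minimal stand-in for the framework's shuffle and sort, not part of any surveyed system.

```python
from itertools import groupby
from operator import itemgetter

# User-supplied map function: emit (word, 1) for every word in a line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# User-supplied reduce function: sum all counts observed for one key.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# Minimal in-memory stand-in for the framework's shuffle/sort/reduce;
# a real runtime would also handle storage, replication and load balancing.
def run_mapreduce(lines, map_fn, reduce_fn):
    intermediate = [kv for line in lines for kv in map_fn(line)]
    intermediate.sort(key=itemgetter(0))  # shuffle: bring equal keys together
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            results[k] = v
    return results

if __name__ == "__main__":
    docs = ["big data cloud", "cloud computing", "big data"]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'big': 2, 'cloud': 2, 'computing': 1, 'data': 2}
```

The point of the model is that only map_fn and reduce_fn are application code; everything in run_mapreduce stands for what the framework itself provides and parallelizes across a cluster.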
Recently, many research efforts have exploited the MapReduce framework to solve challenging data processing problems on large-scale datasets in different domains. For example, Ricardo[23] is a software system that integrates the R statistical tool with Hadoop to support parallel data analysis. RankReduce[24] combines Locality Sensitive Hashing (LSH) with MapReduce to effectively perform k-nearest-neighbor search in high-dimensional spaces. R. F. Cordeiro et al.[25] proposed the BoW method for clustering very large, multi-dimensional datasets with MapReduce; it is a hard-clustering method that allows an automatic, dynamic trade-off between disk delay and network delay. MapDupReducer[26] is a MapReduce-based system capable of efficiently detecting near duplicates over massive datasets. In addition, C. Ranger et al.[27] implemented the MapReduce framework on multiple processors in a single machine and achieved good performance. Recently, B. He et al. developed Mars[28], a GPU-based MapReduce framework that gains better performance than state-of-the-art CPU-based frameworks.

B. Optimization

In this section, we present details of approaches to improve the performance of processing big data with MapReduce.

1) Data Transfer Bottlenecks: A big challenge cloud users must consider is how to minimize the cost of data transmission. Consequently, researchers have begun to propose a variety of approaches. Map-Reduce-Merge[29] is a new model that adds a Merge phase after the Reduce phase; it combines the reduced outputs of two different MapReduce jobs into one and can efficiently merge data that is already partitioned and sorted (or hashed) by the map and reduce modules. Map-Join-Reduce[30] is a system that extends and improves the MapReduce runtime framework by adding a Join stage before the Reduce stage to perform complex data analysis tasks on large clusters. Its authors present a new data processing strategy which runs filtering-join-aggregation tasks as two consecutive MR jobs and adopts a one-to-many shuffling scheme to avoid frequent checkpointing and shuffling of intermediate results. Moreover, different jobs often perform similar work, so sharing that work reduces the overall amount of data transferred between jobs. MRShare[31] is a sharing framework proposed by T. Nykiel et al. that transforms a batch of queries into a new batch that can be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Data skew is also an important factor that affects data transfer cost. To overcome this deficiency, we proposed a method[32] that divides a MapReduce job into two phases: a sampling MapReduce job and the expected MapReduce job. The first phase samples the input data, gathers the inherent distribution of key frequencies and prepares a good partition scheme in advance. In the second phase, the expected MapReduce job applies this partition scheme in every mapper to group the intermediate keys quickly.

2) Iterative Optimization: MapReduce is also a popular platform in which the dataflow takes the form of a directed acyclic graph of operators. However, it incurs a great deal of I/O and unnecessary computation when solving iterative problems. Twister[33], proposed by J. Ekanayake et al., is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently by adding an extra Combine stage after the Reduce stage; data output by the Combine stage flows into the next iteration's Map stage. It avoids instantiating workers repeatedly across iterations: previously instantiated workers are reused for the next iteration with different inputs. HaLoop[34] is similar to Twister. It is a modified version of the MapReduce framework that supports iterative applications by adding loop control, and it can cache each stage's input and output to save further I/O across iterations. Graph data processing likewise involves many iterations. Pregel[35] implements a programming model motivated by the Bulk Synchronous Parallel (BSP) model, in which each node has its own input and transfers to other nodes only the messages required for the next iteration.

3) Online: Some jobs need to process data online, which the original MapReduce cannot do very well. MapReduce Online[36] is designed to support online aggregation and continuous queries in MapReduce. It raises the issue that frequent checkpointing and shuffling of intermediate results limit pipelined processing. Its authors modify the MapReduce framework so that mappers periodically push data temporarily stored in local storage to the reducers within the same MR job; in addition, map-side pre-aggregation is used to reduce communication. The Hadoop Online Prototype (HOP)[37] proposed by T. Condie et al. is similar to MapReduce Online. HOP is a modified version of the MapReduce framework that allows users to get early returns from a job as it is being computed. It also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing while retaining the fault tolerance properties of Hadoop. D. Jiang et al.[38] found that the merge sort in MapReduce is I/O-intensive and seriously affects the performance of MapReduce. In their study, results are hashed and pushed to hash tables held by the reducers as soon as each map task outputs its intermediate results, and the reducers then aggregate the values in each bucket. Since each bucket in the hash table holds all the values corresponding to a distinct key, no grouping is required. In addition, reducers can perform aggregation on the fly, even before all mappers have completed.

4) Join Query Optimization: Join queries are a popular problem in the big data area. However, a join needs two or more inputs, while MapReduce is devised for processing a single input.
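Because a MapReduce job consumes a single input, a common workaround for two-input equi-joins is the reduce-side (repartition) join: the map phase tags each record with its source relation, and the reduce phase pairs the tagged groups per join key. The single-process sketch below only illustrates the idea; the relations, key extractors and helper names are invented for this example and are not taken from any of the surveyed systems.

```python
from collections import defaultdict

# Map phase: tag each record with its source relation so the reducer
# can tell the two inputs apart after they are shuffled together.
def map_tagged(records, tag, key_fn):
    for rec in records:
        yield (key_fn(rec), (tag, rec))

# Reduce phase: for one join key, pair every record from R with
# every record from S (an equi-join on that key).
def reduce_join(key, tagged):
    r_side = [rec for tag, rec in tagged if tag == "R"]
    s_side = [rec for tag, rec in tagged if tag == "S"]
    for r in r_side:
        for s in s_side:
            yield (key, r, s)

def repartition_join(r_records, s_records, r_key, s_key):
    # Shuffle: group all tagged records by their join key.
    groups = defaultdict(list)
    for k, v in map_tagged(r_records, "R", r_key):
        groups[k].append(v)
    for k, v in map_tagged(s_records, "S", s_key):
        groups[k].append(v)
    # Reduce each key group independently, as separate reducers would.
    return [t for k in sorted(groups) for t in reduce_join(k, groups[k])]

if __name__ == "__main__":
    users = [(1, "alice"), (2, "bob")]    # (user_id, name)
    orders = [(10, 1), (11, 1), (12, 2)]  # (order_id, user_id)
    for row in repartition_join(users, orders, lambda u: u[0], lambda o: o[1]):
        print(row)
```

The cost of this scheme is that both relations are shuffled in full, which is exactly the overhead that the specialized join strategies surveyed below try to reduce.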
R. Vernica et al.[39] proposed a 3-stage approach for end-to-end set-similarity joins. They efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. Wei Lu et al. investigated how to perform kNN joins using MapReduce[40]: mappers cluster objects into groups, and reducers then perform the kNN join on each group of objects separately. To reduce shuffling and computational costs, they designed an effective mapping mechanism that exploits pruning rules for distance filtering. In addition, two approximate algorithms minimize the number of replicas to reduce the shuffling cost.

IV. DISCUSSION AND CHALLENGES

We are now in the days of big data, able to gather more and more information from the daily life of every human being. The top seven big data drivers are science data, Internet data, finance data, mobile device data, sensor data, RFID data and streaming data. Coupled with recent advances in machine learning and reasoning, as well as rapid rises in computing power and storage, we are transforming our ability to make sense of these increasingly large, heterogeneous, noisy and incomplete datasets collected from a variety of sources.

So far, researchers have not been able to agree on the essential features of big data. Some think that big data is the data that we are not able to process using pre-existing technology, methods and theory. However, no matter how we define big data, the world is entering an age of "helplessness" as immeasurable varieties of data are generated by science, business and society. Big data poses new challenges for data management and analysis, and even for the whole IT industry.

We consider three important aspects of the problems encountered in processing big data, and we present our points of view in detail as follows.

Big Data Storage and Management: Current data management systems are not able to satisfy the needs of big data, and storage capacity is increasing far more slowly than data volume, so a revolutionary reconstruction of the information framework is desperately needed. We need to design a hierarchical storage architecture. Besides, previous computer algorithms cannot effectively store data directly acquired from the real world, due to the heterogeneity of big data, although they perform excellently on homogeneous data. Therefore, how to re-organize data is one big problem in big data management. Virtual server technology can exacerbate the problem, raising the prospect of overcommitted resources, especially if communication is poor between the application, server and storage administrators. We also need to solve the bottleneck problems of highly concurrent I/O and of the single name node in the present master-slave system model.

Big Data Computation and Analysis: When processing a query over big data, speed is a significant demand[41]. However, the process may take time, because in most cases it cannot traverse all the related data in the whole database in a short time. In this case, an index is an optimal choice. At present, indexes for big data only target simple types of data, while big data is becoming more complicated. Combining an appropriate big data index with up-to-date preprocessing technology will be a desirable solution when we encounter this kind of problem. Application parallelization and divide-and-conquer are natural computational paradigms for approaching big data problems. But obtaining additional computational resources is not as simple as just upgrading to a bigger and more powerful machine on the fly. Traditional serial algorithms are inefficient for big data; if there is enough data parallelism in the application, users can exploit the cloud's reduced cost model to use hundreds of computers for a short time.

Big Data Security: By using online big data applications, many companies can greatly reduce their IT costs. However, security and privacy concerns affect the entire process of big data storage and processing, since there is massive use of third-party services and infrastructures to host important data or to perform critical operations. The scale of data and applications grows exponentially, bringing huge challenges for dynamic data monitoring and security protection. Unlike traditional security methods, security in big data is mainly a question of how to perform data mining without exposing users' sensitive information. Besides, current privacy protection technologies are mainly based on static data sets, while data is always changing dynamically, including the data pattern, the variation of attributes and the addition of new data. Thus, it is a challenge to implement effective privacy protection in this complex circumstance. In addition, legal and regulatory issues also need attention.

V. CONCLUSION

This paper presented a systematic survey of big data processing in the context of cloud computing. We discussed the key issues, including cloud storage and computing architectures, popular parallel processing frameworks, and major applications and optimizations of MapReduce. Big data is not a new concept, but it is very challenging. It calls for scalable storage indexes and a distributed approach to retrieving required results in near real time, because it is a fundamental fact that data is too big to process conventionally. Nevertheless, big data will remain complex and will persist through all these big challenges, which are also big opportunities for us. In the future, significant challenges need to be tackled by industry and academia. There is an urgent need for computer scientists and social science scholars to cooperate closely in order to guarantee the long-term success of cloud computing and to collectively explore new territory.
REFERENCES

[1] "Big data: science in the petabyte era," Nature, vol. 455, no. 7209, p. 1, 2008.
[2] Douglas and Laney, "The importance of 'big data': A definition," 2008.
[3] D. Kossmann, T. Kraska, and S. Loesing, "An evaluation of alternative architectures for transaction processing in the cloud," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 579–590.
[4] S. Ghemawat, H. Gobioff, and S. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, 2007.
[7] A. Rabkin and R. Katz, "Chukwa: A system for reliable large-scale log collection," in USENIX Conference on Large Installation System Administration, 2010, pp. 1–15.
[8] S. Sakr, A. Liu, D. Batista, and M. Alomari, "A survey of large scale data management approaches in cloud environments," IEEE Communications Surveys & Tutorials, vol. 13, no. 3, pp. 311–336, 2011.
[9] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. Ooi, H. Vo, S. Wu, and Q. Xu, "ES2: A cloud data storage system for supporting both OLTP and OLAP," in Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011, pp. 291–302.
[10] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed structured data storage system," in 7th OSDI, 2006, pp. 305–314.
[11] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277–1288, 2008.
[12] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205–220.
[13] Y. Lin, D. Agrawal, C. Chen, B. Ooi, and S. Wu, "Llama: leveraging columnar storage for scalable join processing in the MapReduce framework," in Proceedings of the 2011 international conference on Management of data. ACM, 2011, pp. 961–972.
[14] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Cluster Computing and the Grid, 2009. CCGRID '09. 9th IEEE/ACM International Symposium on. IEEE, 2009, pp. 124–131.
[15] P. Sempolinski and D. Thain, "A comparison and critique of Eucalyptus, OpenNebula and Nimbus," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 417–426.
[16] D. DeWitt and M. Stonebraker, "MapReduce: A major step backwards," The Database Column, vol. 1, 2008.
[17] M. Stonebraker, D. Abadi, D. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and parallel DBMSs: friends or foes?" Communications of the ACM, vol. 53, no. 1, pp. 64–71, 2010.
[18] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 922–933, 2009.
[19] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and parallel DBMS," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 969–974.
[20] J. Dittrich, J. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.
[21] D. Logothetis and K. Yocum, "Ad-hoc data processing in the cloud," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1472–1475, 2008.
[22] C. Ji, T. Dong, Y. Li, Y. Shen, K. Li, W. Qiu, W. Qu, and M. Guo, "Inverted grid-based kNN query processing with MapReduce," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 25–33.
[23] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson, "Ricardo: integrating R and Hadoop," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 987–998.
[24] A. Stupar, S. Michel, and R. Schenkel, "RankReduce: processing k-nearest neighbor queries on top of MapReduce," in Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2010, pp. 13–18.
[25] R. Ferreira Cordeiro, C. Traina Junior, A. Machado Traina, J. López, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 690–698.
[26] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "MapDupReducer: detecting near duplicates over massive datasets," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 1119–1122.
[27] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 13–24.
[28] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 260–269.
[29] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-Reduce-Merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007, pp. 1029–1040.
[30] D. Jiang, A. Tung, and G. Chen, "Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 9, pp. 1299–1311, 2011.
[31] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, "MRShare: Sharing across multiple queries in MapReduce," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 494–505, 2010.
[32] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in MapReduce for skewed data," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 1–8.
[33] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 810–818.
[34] Y. Bu, B. Howe, M. Balazinska, and M. Ernst, "HaLoop: Efficient iterative data processing on large clusters," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 285–296, 2010.
[35] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 135–146.
[36] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce Online," in Proceedings of the 7th USENIX conference on Networked systems design and implementation, 2010, pp. 21–21.
[37] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, "Online aggregation and continuous query support in MapReduce," in ACM SIGMOD, 2010, pp. 1115–1118.
[38] D. Jiang, B. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: An in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472–483, 2010.
[39] R. Vernica, M. Carey, and C. Li, "Efficient parallel set-similarity joins using MapReduce," in SIGMOD Conference, 2010, pp. 495–506.
[40] C. Zhang, F. Li, and J. Jestes, "Efficient parallel kNN joins for large data in MapReduce," in Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012, pp. 38–49.
[41] X. Zhou, J. Lu, C. Li, and X. Du, "Big data challenge in the management perspective," Communications of the CCF, vol. 8, pp. 16–20, 2012.