Big Data Platforms
1. Introduction
According to [Dresner Advisory Services 2017], 53% of companies were adopting Big
Data Analytics platforms. Figures like this one prompt a reflection on why Big Data
adoption is not yet a reality for many companies. On the one hand, we are witnessing a
data deluge, which demands scalable solutions to extract value from large volumes of
data. On the other hand, Big Data technologies are usually presented as the key answer
to this need. However, the industry does not seem convinced by the promises of Big
Data. In our understanding, this problem stems from the lack of a thorough knowledge
of what a Big Data problem is and of the advantages and drawbacks of the available
Big Data technologies.
The management and analysis of large-scale datasets are usually associated with
the term Big Data. According to [Begoli and Horey 2012], Big Data is the practice of
collecting and processing large datasets, together with the systems and algorithms used
to analyze them. [Wu et al. 2014] claim that Big Data refers to large, heterogeneous
volumes of data from autonomous sources with distributed and decentralized control, and
seeks to explore the complex and evolving relationships among data. Because distributed
processing involves many nodes, data management in Big Data must treat node failure
as a frequent event rather than as an exception. Meanwhile, Gartner characterized Big
Data by three Vs: volume, variety, and velocity [Sakr 2016]. The three Vs definition was
later extended to four Vs with the addition of value.
In this paper, we analyze Big Data from two different perspectives: Big Data
technologies and Big Data platforms. Several research studies [Cohen et al. 2009,
LADaS 2018 - Latin America Data Science Workshop
Herodotou et al. 2011, Begoli and Horey 2012, Singh and Reddy 2015], together with
the authors' development experience with Big Data problems, advocate that all Big Data
technologies should provide fault tolerance, scalability, elasticity, a distributed
architecture, and generic storage and processing of large volumes of data, in the order of
terabytes or even petabytes. In addition, a Big Data platform, with its ecosystem of
services and technologies, should also provide resource management, data governance, and
monitoring. In this work, we consider only technologies and platforms that offer these
features.
Notably, we aim at comparing different Big Data technologies and analytics platforms
according to the following categories: processing (streaming and batch), storage,
data integration, analytics, governance, and monitoring. Several papers compare Big
Data technologies, for example [Inoubli et al. 2018, Sakr 2016, Singh and Reddy 2015].
However, they do not address Big Data analytics platforms. This study, in contrast, aims
to help organizations select the platforms most suitable for their analytic processes,
since, before deciding on a technology or platform, a user or organization typically
investigates what the application or algorithm needs and what each technology or
platform provides. It is worth mentioning that our focus is not on comparing Big Data
technologies and platforms across application domains, such as Cloud Computing and the
Internet of Things, but on comparing them according to categories of Big Data problems.
The remainder of the paper is structured as follows: Sections 2 and 3 provide an
overview of the relevant Big Data technologies and platforms, respectively, drawn from
state-of-the-art works. These sections also compare the technologies and platforms
according to categories of Big Data problems. Finally, Section 4 presents final
considerations and research challenges.
ning computation over a large volume of data, all at once, over a period. It is typically
used for ETL (Extract, Transform and Load) tasks, data aggregation, and the training and
updating of machine learning models. Hadoop was broadly adopted for batch processing
because its MapReduce implementation distributes data processing across a computing
cluster with many nodes. Hyracks also performs batch data processing. However, Spark
has become the most widely adopted engine for large-scale data processing across a
variety of companies¹, since its fast in-memory data processing avoids the disk read and
write overheads of Hadoop.
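As a rough illustration of the MapReduce model mentioned above, the following single-process Python sketch counts words in three phases (map, shuffle, reduce). It is not Hadoop code: the function names are hypothetical, and in a real cluster the map and reduce phases would run in parallel on different nodes, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair per word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # Each document stands in for a data block that a cluster node would map.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data platforms", "Big Data technologies"]))
```

Keeping the three phases as separate functions mirrors how the model lets a framework distribute the map and reduce steps independently.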
Streaming Processing. In stream processing, data is processed and results are produced
strictly within specific time constraints (often in the order of milliseconds, and
sometimes microseconds, depending on the application and the user requirements). For
instance, Spark Streaming receives live input data streams and divides the data into
micro-batches, which are processed by the Spark engine to generate the final stream
of results, also in batches. Micro-batching handles a stream as a sequence of small
batches or chunks of data. However, it can introduce considerable task-scheduling
overhead. Flink, on the other hand, can deliver the advantages of buffering without the
task-scheduling overhead. Flink also performs well in real-time or near-real-time
scenarios, where insights from data should be available at nearly the same moment the
data is generated.
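The micro-batching idea described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the Spark Streaming API: the function names are hypothetical, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of at most batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def process_stream(stream, batch_size):
    """Run a small batch job (here, a sum) on each micro-batch.

    One result is emitted per batch, so the output is itself a stream
    of batch-level results.
    """
    for batch in micro_batches(stream, batch_size):
        yield sum(batch)


if __name__ == "__main__":
    for result in process_stream(range(7), batch_size=3):
        print(result)
```

Each call to the per-batch job corresponds to a scheduled task, which is where the scheduling overhead noted in the text comes from: smaller batches lower latency but increase the number of tasks.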
Generic Storage. HDFS can store a diverse mix of structured, unstructured, and
semi-structured data. Hyracks can consume data from HDFS, and it also provides
the AsterixDB data store to ingest, store, index, query, and analyze mass quantities of
data using a flexible data model (ADM). Spark supports integration with a wide variety of
storage systems, including HDFS, MapR File System, Cassandra, and Amazon S3, as well as
custom solutions. Flink enables the integration of heterogeneous data sets, ranging from
strictly structured relational data to unstructured text and semi-structured data. It also
works with HDFS and connects to various other data storage systems. Neither Flink
nor Spark provides a primary storage solution.
Data Analytics. The Hadoop/YARN ecosystem includes several top-level projects that
provide development tools and manage data flow and processing, such as Giraph, Pig, Hive,
Mahout, and HBase. Spark also supports a wide range of applications, including ETL,
Machine Learning (MLlib), Stream Processing (Spark Streaming), and Graph computation
(GraphX). Flink's stack offers libraries with high-level APIs for different use cases:
Complex Event Processing (CEP), Machine Learning (FlinkML), and Graph Analytics (Gelly).
The software stack of the Hyracks system also comprises various interfaces for analytics,
such as SQL (Hivesterix), XQuery (Apache VXQuery), and Graph (Pregelix). Although
Hyracks can efficiently execute complex distributed data-flow operations and express full
relational algebra, it exposes low-level APIs and requires a machine learning (ML)
expert to reformulate algorithms as dataflow operators [Sparks et al. 2013].
Finally, Table 1 summarizes our discussion. It provides a short comparison of the
capabilities of Big Data technologies according to the categories of Big Data problems
analyzed.
¹ https://spark.apache.org/powered-by.html
Table 1. Comparison of Big Data technologies by category.

Category          Hadoop          Spark               Flink               Hyracks
Processing Type   Batch           Mini-batch          Streaming, Batch    Batch
Generic Storage   HDFS            no primary storage  no primary storage  AsterixDB
Data Analytics    SQL, ML, Graph  ETL, ML, Graph      ML, CEP, Graph      SQL, XQuery, Graph
3.1. Overview
The Big Data Europe (BDE) platform [Jabeen et al. 2017] provides a computing
infrastructure for handling large volumes of data in a variety of formats. It addresses the
requirements of simplifying use, easing deployment, managing heterogeneity, and improving
scalability, and it facilitates the execution and integration of Big Data frameworks and
tools such as Hadoop, Spark, Flink, and many others. Its authors chose Docker as the
packaging and deployment method, which also helps manage the variety of underlying
hardware resources efficiently alongside varying software requirements. BDE supports
a variety of Big Data flow tasks, such as message passing (via Kafka, Flume), storage (via
Hive, Cassandra), analysis (via Spark, Flink), and publishing (via GeoTriples). Moreover,
the platform is open source and completely free.
Hortonworks Data Platform (HDP)² is an open-source data architecture that reduces
storage costs by integrating YARN into the data center, and lowers Enterprise Data
Warehouse costs by offloading low-value computing tasks such as ETL to YARN. YARN
allows HDP to integrate data processing engines from across the community and
commercial ecosystem and to deliver consistent shared services and resources across the
platform. Ambari provides a Web UI and a REST API that make HDP management simpler,
more consistent, and more secure. Furthermore, HDP offers not only data processing and
management but also the enterprise capabilities of security, governance, and operations.
Cloudera³ was the first company to develop and distribute Apache Hadoop-based
software, and it has made data analytics on Big Data more convenient and accessible.
It integrates Hadoop with more than a dozen other critical open source projects.
Cloudera created a functionally advanced system that helps to perform end-to-end Big
Data workflows. Different projects compose the Cloudera ecosystem for a variety of Big
Data tasks: stream processing (via Spark), message passing (via Kafka, Flume), storage
(via Accumulo, Hive, Pig, HBase), analysis (via Flink, Impala), search (via Cloudera
Search), and an extensible and productive web GUI for users (via HUE).
² https://br.hortonworks.com/
³ https://www.cloudera.com/
Table 2. Comparison of Big Data platforms by category.

Category          HDP            BDE                    Cloudera
Data Integration  Talend, ODI    Ontario, Semagrow      Talend
Data Governance   Atlas, Ranger  No support             Cloudera Navigator
Monitoring        Ambari         Prometheus, ELK stack  Cloudera Manager
centralized security management for Hadoop. By integrating Atlas and Ranger, HDP allows
companies to implement dynamic, runtime access policies that proactively prevent
violations. BDE does not delve into data governance: it does not address issues
such as data privacy, sharing, and rights.
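The idea of a dynamic, runtime access policy can be illustrated with a minimal tag-based check. This sketch does not use the Atlas or Ranger APIs: the catalog, the policies, the role names, and the `can_read` function are all hypothetical, and only assume that datasets carry classification tags and that policies are expressed per tag.

```python
# Hypothetical metadata catalog: dataset name -> classification tags.
CATALOG = {
    "customers": {"PII"},
    "web_logs": set(),
}

# Hypothetical policies: tag -> roles allowed to read data carrying that tag.
POLICIES = {
    "PII": {"compliance_officer"},
}


def can_read(role, dataset, catalog=CATALOG, policies=POLICIES):
    """Allow access only if the role is permitted for every tag on the dataset."""
    tags = catalog.get(dataset)
    if tags is None:
        return False  # unknown datasets are denied by default
    # A tag without a policy has no allowed roles, so it also denies access.
    return all(role in policies.get(tag, set()) for tag in tags)


if __name__ == "__main__":
    print(can_read("analyst", "customers"))
    print(can_read("compliance_officer", "customers"))
```

Because the decision is evaluated at access time against the current tags, classifying a dataset as sensitive takes effect on the next request without rewriting per-dataset permissions.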
Monitoring. Monitoring is the process of proactively reviewing and evaluating
data, resources, or applications. Monitoring software helps to measure and track them,
usually through dashboards, alerts, and reports. Cloudera Manager provides many
features for monitoring the health and performance of cluster components (hosts,
service daemons) as well as the performance and resource demands of the jobs running
on clusters. BDE distinguishes between resource monitoring and status monitoring. The
former tracks the health of a server or a component in the platform (CPU usage, memory
usage, network I/O, and disk utilization), while the latter offers insight into the status
of a specific application. For resource monitoring, the tools Docker Stats, cAdvisor,
Prometheus, InfluxDB, and Grafana can be used on the BDE platform. For status
monitoring, BDE supports Docker's built-in logging and the ELK stack. As part of HDP,
Apache Ambari makes it possible to plan, install, and securely configure clusters of
computers, and eases ongoing cluster maintenance and management.
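The distinction between resource monitoring and status monitoring can be made concrete with a small sketch. It does not call any of the tools named above: the metric names, thresholds, log keywords, and function names are assumptions chosen for illustration only.

```python
def resource_alerts(metrics, thresholds):
    """Resource monitoring: names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


def application_status(log_lines):
    """Status monitoring: derive a coarse application state from its log lines."""
    if any("ERROR" in line for line in log_lines):
        return "failed"
    if any("FINISHED" in line for line in log_lines):
        return "finished"
    return "running"


if __name__ == "__main__":
    # Hypothetical utilization ratios for one server.
    metrics = {"cpu": 0.95, "memory": 0.40, "disk": 0.91}
    thresholds = {"cpu": 0.90, "memory": 0.80, "disk": 0.90}
    print(resource_alerts(metrics, thresholds))
    print(application_status(["INFO job started", "INFO job FINISHED"]))
```

The first function looks only at numeric host-level measurements, while the second looks only at what an application reports about itself, which is why the two kinds of monitoring typically rely on different tools.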
Finally, Table 2 summarizes this section, providing a short review of Big Data
platforms according to the categories of Big Data problems analyzed.
4. Conclusion
This paper surveys various Big Data technologies and platforms that are currently
available and discusses their capabilities. We compare different technologies with respect
to important categories of Big Data problems, and we compare different Big Data platforms
with respect to their support for data integration, data governance, and monitoring. With
this guideline, we aim to help organizations select the technologies and platforms most
appropriate to their Big Data problems. As future work, we plan an empirical evaluation
of these technologies and platforms using different Big Data scenarios and applications.
We also intend to compare them according to other categories of Big Data problems, such
as how they manage and integrate different data analysis outputs and algorithms.
Acknowledgments
This work has been supported by the FUNCAP SPU 8789771/2017 research project and
a CAPES fellowship.
References
Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O.,
Leich, M., Leser, U., Markl, V., et al. (2014). The stratosphere platform for big data
analytics. The VLDB Journal, 23(6):939–964.
Begoli, E. and Horey, J. (2012). Design principles for effective knowledge discovery from
big data. In Joint ICSA and ECSA, pages 215–218.
Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. (2011). Hyracks: A flexible
and extensible foundation for data-intensive computing. In ICDE, pages 1151–1162.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., and Tzoumas, K. (2015).
Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE
Computer Society TCDE, 36(4).
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. (2009). Mad skills:
new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–
1492.
Data Governance Institute (2018). Definitions of data governance.
http://www.datagovernance.com/adg_data_governance_definition/.
Accessed: 2018-05-01.
Dresner Advisory Services (2017). Big Data Analytics Market Study.
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., and Babu, S. (2011).
Starfish: a self-tuning system for big data analytics. In CIDR, pages 261–272.
Inoubli, W., Aridhi, S., Mezni, H., Maddouri, M., and Nguifo, E. M. (2018). An experi-
mental survey on big data frameworks. Future Generation Computer Systems.
Jabeen, H., Archer, P., Scerri, S., Versteden, A., Ermilov, I., Mouchakis, G., Lehmann, J.,
and Auer, S. (2017). Big data europe. In EDBT/ICDT Workshops.
Sakr, S. (2016). Big data 2.0 processing systems: a survey. Springer.
Singh, D. and Reddy, C. K. (2015). A survey on platforms for big data analytics. Journal
of Big Data, 2(1):8.
Soares, S. (2012). Big data governance: an emerging imperative. Mc Press.
Sparks, E. R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J., Franklin, M. J.,
Jordan, M. I., and Kraska, T. (2013). Mli: An api for distributed machine learning. In
ICDM.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves,
T., Lowe, J., Shah, H., Seth, S., et al. (2013). Apache hadoop yarn: Yet another
resource negotiator. In Proceedings of the 4th Symposium SOCC.
White, T. (2012). Hadoop: The definitive guide. O'Reilly Media, Inc.
Wu, X., Zhu, X., Wu, G.-Q., and Ding, W. (2014). Data mining with big data. IEEE
TKDE, 26(1):97–107.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark:
Cluster computing with working sets. HotCloud, 10(10-10):95.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen,
J., Venkataraman, S., Franklin, M. J., et al. (2016). Apache spark: a unified engine for
big data processing. Communications of the ACM, 59(11):56–65.