
LADaS 2018 - Latin America Data Science Workshop

Big Data Analytics Technologies and Platforms: a brief review

Ticiana L. Coelho da Silva¹, Regis P. Magalhães¹, Igo R. Brilhante¹,
José Antonio F. de Macêdo¹, David Araújo¹, Paulo A. L. Rego¹, Aloisio Vieira Lira Neto²
¹ Federal University of Ceará, Brazil
² Brazilian Federal Highway Police
{ticianalc, regismagalhaes, pauloalr}@ufc.br, [email protected]
{igobrilhante, jose.macedo, araujodavid}@lia.ufc.br

Abstract. A plethora of Big Data Analytics technologies and platforms have been
proposed in recent years. However, in 2017, only 53% of companies were adopting
such tools. It seems that either the industry is not convinced by the promises of Big
Data, or choosing the right technology/platform requires in-depth knowledge of the
capabilities of all these tools. Before deciding which technology or platform to
adopt, organizations have to investigate the needs of their applications/algorithms
and the advantages and drawbacks of each technology/platform. In this paper, we aim
to help organizations select the technologies/platforms most appropriate to their
analytic processes by offering a short review organized by categories of Big Data
problems: processing (streaming and batch), storage, data integration, analytics,
data governance, and monitoring.

1. Introduction
According to [Dresner Advisory Services 2017], 53% of companies were adopting Big
Data Analytics platforms in 2017. This and similar reports motivate a reflection on
why Big Data adoption is not yet a reality for many companies. On the one hand, we are
witnessing a data deluge, which demands scalable solutions to extract value from large
volumes of data. On the other hand, Big Data technologies are usually presented as the
key answer to this need. However, it seems that the industry is not convinced by the
promises of Big Data. To our understanding, this problem comes from the lack of thorough
knowledge of what constitutes a Big Data problem and of the advantages and drawbacks of
the available Big Data technologies.
The management and analysis of large-scale datasets are usually associated with
the term Big Data. According to [Begoli and Horey 2012], Big Data is the practice of
collecting and processing large datasets, together with the systems and algorithms used
to analyze these massive datasets. [Wu et al. 2014] claim that Big Data refers to large,
heterogeneous volumes of data from autonomous sources with distributed and decentralized
control, and seeks to explore the complex and evolving relationships among data. Because
distributed processing involves many nodes, data management in Big Data must treat
node failure as a frequent event rather than as an exception to the processing.
Meanwhile, Gartner introduced Big Data as characterized by three Vs: volume, variety,
and velocity [Sakr 2016]. The three Vs definition was later extended to four Vs with the
addition of value.
In this paper, we analyze Big Data from two different perspectives: Big Data
technologies and Big Data platforms. Several research studies [Cohen et al. 2009,
Herodotou et al. 2011, Begoli and Horey 2012, Singh and Reddy 2015], together with
the authors’ development experience with Big Data problems, advocate that all Big Data
technologies should provide fault tolerance, scalability, elasticity, a distributed
architecture, and generic storage and processing of large volumes of data, on the order
of terabytes or even petabytes. In addition, a Big Data platform with an ecosystem of
services and technologies should also provide resource management, data governance, and
monitoring. In this work, we discuss technologies and platforms only with respect to
such features.
Specifically, we aim to compare different Big Data technologies and analytics platforms
according to the following categories: processing (streaming and batch), storage,
data integration, analytics, governance, and monitoring. Several papers compare
Big Data technologies, to name a few [Inoubli et al. 2018, Sakr 2016,
Singh and Reddy 2015]. However, they do not address Big Data analytics platforms.
This study, in contrast, aims to help organizations select the platforms most
suitable to their analytic processes, since, typically, before deciding which technology
or platform to adopt, the user/organization investigates what the application/algorithm
needs and what each technology/platform can provide. It is worth mentioning that our
focus is not on comparing Big Data technologies and platforms for different
applications, such as Cloud Computing and the Internet of Things, but on comparing them
according to categories of Big Data problems.
The remainder of the paper is structured as follows: Sections 2 and 3 provide an
overview of the relevant Big Data technologies and platforms, respectively, from
state-of-the-art works. These sections also present a comparison of such technologies
and platforms based on some categories of problems. Finally, Section 4 draws final
considerations and research challenges.

2. Big Data Technologies

A plethora of Big Data technologies have been proposed [Alexandrov et al. 2014,
White 2012, Borkar et al. 2011, Zaharia et al. 2010]. In this section, we briefly describe
some of those technologies and provide a comparison between them according to some
categories of problems.

2.1. Overview of Big Data Technologies

This section presents an overview of the most widely used and recently discussed Big
Data technologies [Sakr 2016, Singh and Reddy 2015]. To this end, we examine the main
features of YARN/Hadoop, Spark, Flink and Hyracks/ASTERIX.
In many Big Data scenarios, Apache Hadoop has become the de facto standard for
sharing and accessing data and computational resources
[Vavilapalli et al. 2013]. Hadoop is a scalable open-source computation framework that
allows computation to be partitioned across many host servers, which are not
necessarily high-performance computers [White 2012]. It has two main components: a
MapReduce execution engine and a distributed file system called HDFS (Hadoop
Distributed File System). The advantages of Hadoop lie mainly in its flexibility,
scalability, low cost, and reliability for managing and efficiently processing large
volumes of structured and unstructured data, as well as in providing job scheduling for
balancing data, resource, and task loads. Hadoop later evolved to include YARN (Yet
Another Resource Negotiator), whose architecture decouples the programming model from
the resource management infrastructure and delegates many scheduling functions to
per-application components [Vavilapalli et al. 2013].
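To make the MapReduce execution model mentioned above more concrete, the following is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are illustrative assumptions, and the "splits" merely stand in for data blocks that a real cluster would process on different hosts.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record of one input split."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Hypothetical reducer: add up the counts emitted for one word."""
    return sum(values)


def run_job(splits, mapper, reducer):
    """Run map over each split independently, then shuffle and reduce."""
    pairs = []
    for split in splits:  # on a cluster, each split could run on a different host
        pairs.extend(map_phase(split, mapper))
    return reduce_phase(shuffle(pairs), reducer)


splits = [["big data big"], ["data platforms"]]
counts = run_job(splits, word_count_mapper, sum_reducer)
```

The point of the sketch is the separation of concerns: the user supplies only the mapper and reducer, while partitioning, grouping, and scheduling belong to the framework.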
Apache Spark [Zaharia et al. 2016] is a unified engine for distributed data
processing. It has a programming model similar to MapReduce but extends it with a
data-sharing abstraction called Resilient Distributed Datasets, or RDDs. Using this
extension, Spark can capture a wide range of processing workloads that previously needed
separate engines, including SQL, streaming, machine learning, and graph processing. Spark
[Zaharia et al. 2010] was also designed to overcome the disk I/O limitations and improve
on the performance of earlier systems. Its main feature is the ability to perform
in-memory computations: data can be cached in memory, which avoids the disk
overhead that Hadoop MapReduce incurs on iterative tasks.
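The benefit of in-memory caching for iterative tasks can be illustrated with a toy model. The sketch below is plain Python and does not use Spark's API: `ToyDataset`, `slow_source`, and `iterate` are hypothetical names, and the read counter stands in for the cost of re-reading input from disk on every iteration.

```python
class ToyDataset:
    """Toy stand-in for an RDD-like dataset: lazily computed, optionally cached."""

    def __init__(self, compute):
        self._compute = compute      # function that (re)builds the records
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        """Return a new dataset derived from this one (recomputed on demand)."""
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        """Keep the records in memory after the first computation."""
        self._cache_enabled = True
        return self

    def collect(self):
        if not self._cache_enabled:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached


reads = {"count": 0}


def slow_source():
    """Hypothetical expensive input read; counts how often it is invoked."""
    reads["count"] += 1
    return [1, 2, 3, 4]


def iterate(dataset, steps):
    """Iterative task: gradient steps that move an estimate toward the mean."""
    estimate = 0.0
    for _ in range(steps):
        data = dataset.collect()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


uncached_result = iterate(ToyDataset(slow_source).map(float), 5)
reads_without_cache = reads["count"]

reads["count"] = 0
cached_result = iterate(ToyDataset(slow_source).map(float).cache(), 5)
reads_with_cache = reads["count"]
```

Both runs compute the same estimate, but the uncached dataset rebuilds its input on every iteration while the cached one reads it once.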
Apache Flink [Carbone et al. 2015] is an open-source stream and batch processing
framework for distributed, high-performance applications that originated from the
Stratosphere project [Alexandrov et al. 2014]. It is built on the philosophy that many
classes of data processing applications, including real-time analytics, continuous data
pipelines, historical data processing, and iterative algorithms, can be expressed and
executed as pipelined fault-tolerant data flows. Flink can run as a completely
independent framework or on top of HDFS and YARN. It leverages in-memory storage to
improve the performance of runtime execution. The main novelties of Flink in comparison
to previous Big Data technologies are: a distributed data flow runtime that exploits
pipelined streaming execution for batch and stream workloads; exactly-once state
consistency through lightweight checkpointing; native iterative processing; and
sophisticated window semantics that support out-of-order processing.
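Window semantics with out-of-order events can be sketched as follows. This is a simplified illustration in plain Python, not Flink's API: the function name, the watermark rule (largest timestamp seen minus a fixed bound), and the policy of dropping late events are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, max_out_of_orderness):
    """Sum values per event-time window, tolerating bounded out-of-order arrival.

    events: iterable of (event_timestamp, value) in arrival order.
    A window [start, start + window_size) is emitted once the watermark
    (largest timestamp seen minus max_out_of_orderness) passes its end.
    """
    open_windows = defaultdict(int)  # window start -> running sum
    results, dropped = [], []
    watermark = float("-inf")

    for timestamp, value in events:
        window_start = (timestamp // window_size) * window_size
        if window_start + window_size <= watermark:
            dropped.append((timestamp, value))  # window already emitted
            continue
        open_windows[window_start] += value
        watermark = max(watermark, timestamp - max_out_of_orderness)
        closed = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in closed:
            results.append((start, open_windows.pop(start)))

    # End of input: emit whatever is still open.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results, dropped


# Timestamps 8 and 3 arrive after later events; 8 is still accepted, 3 is too late.
events = [(1, 1), (12, 1), (8, 1), (16, 1), (3, 1), (27, 1)]
window_sums, late_events = tumbling_window_sums(
    events, window_size=10, max_out_of_orderness=5
)
```

With these inputs, the event at timestamp 8 is still counted in the first window because the watermark has not yet passed 10, whereas the event at timestamp 3 arrives after that window was emitted and is reported as late.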
Hyracks/ASTERIX [Borkar et al. 2011] is a partitioned-parallel software platform
designed to run data-intensive computations on large shared-nothing clusters. Hyracks
includes a collection of operators that can be used to assemble data processing jobs
without needing to write Map and Reduce code. Moreover, it provides a YARN-compatible
layer to run existing MapReduce jobs. ASTERIX, built on top of Hyracks, is a scalable
information management system that supports the storage, querying, and analysis of large
collections of semi-structured nested data objects. Hyracks provides performance gains
over MapReduce through its more flexible user model, while also being a more efficient
implementation than Hadoop for MapReduce jobs in a variety of data-intensive use cases.
Hyracks also achieves fault recovery performance gains over Hadoop by offering a less
pessimistic approach to fault handling.
2.2. Comparison of Big Data Technologies
Companies using Big Data technologies usually face challenges such as: (i) storing
data from heterogeneous sources, whether structured, unstructured, or semistructured;
(ii) discovering knowledge from large and heterogeneous datasets not only by applying
SQL queries but also by running complex machine learning algorithms or graph
computations; and (iii) continuously receiving streams of data that must be
processed on the order of milliseconds for (near) real-time analytics. Based on
these challenges, we discuss whether and how these frameworks support streaming and
batch processing, generic storage, and data analytics.
Batch Processing. This kind of data processing is closely related to long-running
computations over a large volume of data, processed all at once over a period of time.
It is typically used for ETL (Extract, Transform and Load) tasks, data aggregation, and
training and updating machine learning models. Hadoop was broadly adopted for batch
processing due to its MapReduce implementation, which distributes data processing across
a computing cluster with many nodes. Hyracks also performs batch data processing.
However, Spark has become the most widely adopted engine for large-scale data processing
across a variety of companies¹, since it brings fast in-memory data processing, which
overcomes Hadoop's read and write overheads.
Streaming Processing. In stream processing, data is processed and results are
produced strictly within specific time constraints (often on the order of milliseconds
and sometimes microseconds, depending on the application and the user requirements). For
instance, Spark Streaming receives live input data streams and divides the data into
micro-batches, which are processed by the Spark engine to generate the final stream
of results in batches. Micro-batching handles a stream as a sequence of small
batches or chunks of data. However, it can introduce considerable overhead in the form
of task scheduling. Flink, on the other hand, processes records in a pipelined fashion
and can deliver the advantages of buffering without the task-scheduling overhead. Flink
also performs well in real-time or near-real-time scenarios, where insights from data
should be available at nearly the same moment the data is generated.
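The difference between micro-batching and record-at-a-time processing can be sketched in a few lines of plain Python. The code below does not use Spark Streaming or Flink; the function names and the millisecond timestamps are illustrative assumptions. It shows only when results become visible, not the scheduling cost itself.

```python
def micro_batch_sums(stream, batch_interval_ms):
    """Yield (emit_time_ms, batch_sum): results appear only when a batch closes.

    stream: iterable of (arrival_ms, value), ordered by arrival time.
    """
    batch_end = None
    batch = []
    for arrival_ms, value in stream:
        if batch_end is None:
            batch_end = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
        while arrival_ms >= batch_end:  # close every batch that has elapsed
            yield batch_end, sum(batch)
            batch = []
            batch_end += batch_interval_ms
        batch.append(value)
    if batch:
        yield batch_end, sum(batch)


def per_record_sums(stream):
    """Yield (emit_time_ms, running_sum): a result is emitted as each record arrives."""
    total = 0
    for arrival_ms, value in stream:
        total += value
        yield arrival_ms, total


stream = [(100, 5), (400, 3), (1200, 7), (2900, 1)]
batched = list(micro_batch_sums(stream, batch_interval_ms=1000))
pipelined = list(per_record_sums(stream))
```

In the micro-batch variant, the record that arrives at 100 ms contributes to a result that is emitted only at 1000 ms, so result latency is bounded below by the batch interval; the record-at-a-time variant emits at the arrival time of each record.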
Generic Storage. HDFS can store a diverse mix of structured, unstructured, and
semistructured data. Hyracks can consume data from HDFS, and it also provides
the AsterixDB data store to ingest, store, index, query, and analyze massive quantities
of data using a flexible data model (ADM). Spark supports integration with a wide
variety of storage systems, including HDFS, MapR File System, Cassandra, and Amazon S3,
as well as custom solutions. Flink enables the integration of heterogeneous datasets,
ranging from strictly structured relational data to unstructured text and
semi-structured data. It also works with HDFS and connects to various other data storage
systems. Neither Flink nor Spark provides a primary storage solution.
Data Analytics. YARN/Hadoop is supported by several top-level projects that provide
development tools and manage its data flow and processing, such as Giraph, Pig, Hive,
Mahout, and HBase. Spark also supports a wide range of applications, including ETL,
Machine Learning (MLlib), Stream Processing (Spark Streaming), and Graph computation
(GraphX). Flink's stack offers libraries with high-level APIs for different use cases:
Complex Event Processing (CEP), Machine Learning (FlinkML), and Graph Analytics (Gelly).
The software stack of the Hyracks system also comprises various interfaces for
analytics, such as SQL (Hivesterix), XQuery (Apache VXQuery), and Graph (Pregelix). Even
though Hyracks can efficiently execute complex distributed data-flow operations and
express full relational algebra, it exposes low-level APIs and requires a machine
learning (ML) expert to reformulate algorithms as dataflow operators [Sparks et al. 2013].
Finally, Table 1 summarizes our discussion. It provides a short comparison of the
capabilities of Big Data technologies according to the categories of Big Data problems
analyzed.

¹ https://spark.apache.org/powered-by.html


Category          Hadoop            Spark               Flink               Hyracks
Processing Type   Batch             Mini-batch          Streaming, Batch    Batch
Generic Storage   HDFS              no primary storage  no primary storage  AsterixDB
Data Analytics    SQL, ML, Graph    ETL, ML, Graph      ML, CEP, Graph      SQL, XQuery, Graph

Table 1. Summary of Big Data technologies discussed.

3. Big Data Platforms

A Big Data platform is an ecosystem of services and technologies that needs to perform
analysis on voluminous, complex, and dynamic data. Thus, scaling up the hardware
platform becomes imminent, and choosing the right hardware/software technologies becomes
a crucial decision if the user’s requirements are to be satisfied in a reasonable amount
of time [Singh and Reddy 2015]. A set of Big Data platforms has recently emerged,
including Big Data Europe (BDE) [Jabeen et al. 2017], Hortonworks² and Cloudera³. In
this section, we briefly describe these Big Data platforms and compare
them according to some categories of Big Data problems.

3.1. Overview
The BDE platform [Jabeen et al. 2017] provides a computing infrastructure for handling
large volumes of data in a variety of formats. It addresses the requirements of
simplifying use, easing deployment, managing heterogeneity, and improving scalability,
and it facilitates the execution and integration of Big Data frameworks and tools such
as Hadoop, Spark, Flink, and many others. Its authors chose Docker as the packaging
and deployment methodology, which also helps to manage the variety of underlying
hardware resources efficiently alongside the varying software requirements. BDE supports
a variety of Big Data flow tasks, such as message passing (via Kafka, Flume), storage
(via Hive, Cassandra), analysis (via Spark, Flink), and publishing (via GeoTriples).
Moreover, the platform is open-source and completely free.
Hortonworks Data Platform (HDP) is an open-source data architecture
that aims to reduce storage costs by integrating YARN into the data
center, and to optimize Enterprise Data Warehouse costs by offloading low-value
computing tasks such as ETL to YARN. YARN allows HDP to integrate data processing
engines from across the community and commercial ecosystem and to deliver consistent
shared services and resources across the platform. Ambari provides a Web UI and a
REST API that make HDP management simpler, more consistent, and more secure.
Furthermore, HDP offers not only data processing and management but also the
enterprise capabilities required for security, governance, and operations.
Cloudera was the first company to develop and distribute Apache Hadoop-based
software, and it has made data analytics on Big Data more convenient and accessible.
It integrates Hadoop with more than a dozen other critical open
source projects. Cloudera created a functionally advanced system that helps to perform
end-to-end Big Data workflows. Different projects compose the Cloudera ecosystem for a
variety of Big Data tasks: streaming processing (via Spark), message passing (via Kafka,
Flume), storage (via Accumulo, Hive, Pig, HBase), analysis (via Flink, Impala), searching
(via Cloudera Search), and an extensible and productive web GUI for users (via
HUE).

² https://br.hortonworks.com/
³ https://www.cloudera.com/

3.2. Comparison of Big Data Platforms

Next, we present a set of recurrent problems faced by organizations that
may become more complex when dealing with Big Data: (i) integrating different Big
Data sources and providing a transparent view to users; (ii) managing and protecting
the organization’s data assets in order to guarantee generally understandable, correct,
complete, and secure corporate data; and (iii) monitoring data, resources, and
applications to review and evaluate the health and performance of the whole system. Each
challenge falls into one of the following categories of problems: Data Integration, Data
Governance, and Monitoring Services. In what follows, we compare the Big Data
platforms mentioned in this work in terms of what they provide to deal with such problems.
Data Integration. Data integration involves combining data from different
sources and providing users with a unified view of them. HDP has partnered with Talend,
a powerful and versatile open source solution for Big Data integration that natively
supports Hadoop, including connectors for HDFS, HBase, Pig, Sqoop, and Hive, without
requiring the user to write any code. Talend also supports Cloudera Navigator. Another
alternative for HDP is Oracle Data Integrator (ODI), with which a user can create a flow
from sources to targets of different technologies, including relational databases,
applications, XML, JSON, Hive tables, HBase, HDFS files, and so on. The BDE platform
goes further than HDP and Cloudera by including a Semantic Data Lake named Ontario, a
repository for processing and analyzing datasets in their original formats. Ontario
builds a Semantic Layer on top of the Data Lake, which is responsible for mapping data
onto existing semantic vocabularies/ontologies. A successful mapping process, termed
Semantic Lifting, provides a view over the whole data. In this way, data can be
extracted, queried, or analyzed from the heterogeneous sources in the lake as if it were
in a single format, using a high-level query language. Another relevant component is
Semagrow, a SPARQL query processing system that federates multiple remote endpoints.
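The idea behind Semantic Lifting can be illustrated with a small sketch. The code below is plain Python and is not Ontario's implementation or API: the field mappings, the `ex:` vocabulary terms, and the sample records are all hypothetical. It shows how two sources in different formats can be mapped onto one shared vocabulary and then queried uniformly.

```python
import csv
import io
import json

# Hypothetical mappings from source-specific field names to a shared vocabulary.
CSV_MAPPING = {"full_name": "ex:name", "town": "ex:city"}
JSON_MAPPING = {"name": "ex:name", "location": "ex:city"}


def lift(records, mapping, id_field):
    """Map source records onto (subject, predicate, object) triples."""
    triples = set()
    for record in records:
        subject = f"ex:person/{record[id_field]}"
        for field, predicate in mapping.items():
            triples.add((subject, predicate, str(record[field])))
    return triples


def subjects_where(triples, predicate, value):
    """Uniform query over the lifted view, independent of the source format."""
    return sorted(s for s, p, o in triples if p == predicate and o == value)


# Two sources kept in their original formats (CSV text and JSON text).
csv_source = "id,full_name,town\n1,Ana,Fortaleza\n"
json_source = '[{"uid": 2, "name": "Bruno", "location": "Fortaleza"}]'

lake_view = lift(csv.DictReader(io.StringIO(csv_source)), CSV_MAPPING, "id") | lift(
    json.loads(json_source), JSON_MAPPING, "uid"
)

people_in_fortaleza = subjects_where(lake_view, "ex:city", "Fortaleza")
```

The sources are never converted to a common storage format; only the mapping layer differs per source, which is what allows a single query to span both.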
Data Governance. Data Governance is a system of decision rights and
accountabilities for information-related processes, executed according to agreed-upon
models which describe who can take what actions with what information, and when, under
what circumstances, using what methods [Data Governance Institute 2018]. [Soares 2012]
expands this definition by including policies regarding the optimization, privacy, and
monetization of Big Data. Governing Big Data systems can be complex, and securing
datasets consistently across multiple repositories can be extremely error-prone. The
Cloudera Navigator Data Management component is a fully integrated data management and
security tool for Hadoop, designed to meet the compliance, data governance, and
auditing needs of global enterprises. HDP uses Apache Atlas and Apache Ranger, which
combine data classification with security policy enforcement. Apache Atlas was created
as part of the Hadoop Data Governance initiative, and it offers the ability to view
cross-component lineage, providing a complete view of the data movement through some
parsing engines such as Apache Storm, Kafka, Falcon, and Hive. Apache Ranger provides
centralized security management for Hadoop. By integrating Atlas and Ranger, HDP
allows companies to implement dynamic, runtime access policies that proactively prevent
violations. BDE does not delve much into data governance, since it does not address
issues such as data privacy, sharing, and rights.

Category          HDP            BDE                    Cloudera
Data Integration  Talend, ODI    Ontario, Semagrow      Talend
Data Governance   Atlas, Ranger  No support             Cloudera Navigator
Monitoring        Ambari         Prometheus, ELK stack  Cloudera Manager

Table 2. Summary of Big Data Platforms studied.
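Combining data classification with security policy enforcement, as described for Atlas and Ranger, can be sketched as a tag-based access check. The example below is plain Python and does not use the Atlas or Ranger APIs: the `Policy` class, the tags, the roles, and the deny-by-default rule for unclassified assets are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: only the listed roles may read data carrying `tag`."""
    tag: str
    allowed_roles: frozenset


# Classification catalog (hypothetical): data asset -> set of classification tags.
CLASSIFICATIONS = {
    "customers.email": {"PII"},
    "sales.total": {"FINANCE"},
    "web.page_views": set(),
}

POLICIES = [
    Policy("PII", frozenset({"privacy_officer"})),
    Policy("FINANCE", frozenset({"analyst", "auditor"})),
]


def is_allowed(role, asset, classifications=CLASSIFICATIONS, policies=POLICIES):
    """Decide access at request time from the asset's current tags."""
    tags = classifications.get(asset)
    if tags is None:
        return False  # assumption: unclassified assets are denied by default
    for policy in policies:
        if policy.tag in tags and role not in policy.allowed_roles:
            return False
    return True


analyst_can_read_sales = is_allowed("analyst", "sales.total")
analyst_can_read_email = is_allowed("analyst", "customers.email")
```

Because the decision is derived from the tags at request time, reclassifying an asset in the catalog changes who may access it without rewriting any per-asset rule.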
Monitoring. Monitoring is the process of proactively reviewing and evaluating
the state of data, resources, or applications. Monitoring software helps
to measure and track them, usually through dashboards, alerts, and reports. Cloudera
Manager provides many features for monitoring the health and performance of cluster
components (hosts, service daemons), as well as the performance and resource demands
of the jobs running on clusters. BDE distinguishes between resource monitoring and
status monitoring. The former tracks the health of a server or a component
in the platform (CPU usage, memory usage, network I/O, and disk utilization), while the
latter offers insight into the status of a specific application. For resource
monitoring, the tools Docker Stats, cAdvisor, Prometheus, InfluxDB, and Grafana can be
used on the BDE platform. For status monitoring, BDE supports Docker's built-in logging
and the ELK stack. As part of HDP, Apache Ambari makes it possible to plan, install, and
securely configure clusters of computers, and it simplifies ongoing cluster maintenance
and management.
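Resource monitoring with alerts can be sketched as a threshold check over recent samples. The example below is plain Python and is not tied to any of the tools named above: the metric names, thresholds, window size, and sample values are hypothetical.

```python
from statistics import mean

# Hypothetical alert thresholds, in percent.
THRESHOLDS = {"cpu_percent": 90.0, "memory_percent": 85.0}


def resource_alerts(samples, thresholds=THRESHOLDS, window=3):
    """Alert on metrics whose mean over the last `window` samples exceeds its limit."""
    recent = samples[-window:]
    alerts = []
    for metric, limit in thresholds.items():
        average = mean(sample[metric] for sample in recent)
        if average > limit:
            alerts.append(
                f"{metric} averaged {average:.1f} over the last "
                f"{len(recent)} samples (limit {limit:.1f})"
            )
    return alerts


# Hypothetical samples collected from one host, oldest first.
samples = [
    {"cpu_percent": 40.0, "memory_percent": 70.0},
    {"cpu_percent": 95.0, "memory_percent": 80.0},
    {"cpu_percent": 97.0, "memory_percent": 82.0},
    {"cpu_percent": 93.0, "memory_percent": 84.0},
]
alerts = resource_alerts(samples)
```

Averaging over a short window rather than alerting on a single sample is one simple way to avoid reacting to momentary spikes; the window length is a tuning choice.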
Finally, a summary of this section is presented in Table 2, which provides a short
review of Big Data platforms according to the categories of Big Data problems analyzed.

4. Conclusion
This paper surveys various Big Data technologies and platforms that are currently
available and discusses their capabilities. We compare different technologies based
on some important Big Data problems. In addition, we compare
different Big Data platforms based on their support for data integration, data
governance, and monitoring. By providing this guideline, we aim to help organizations
select the technologies/platforms most appropriate to their Big Data problems. As future
work, we plan an empirical evaluation of these technologies/platforms using different
Big Data scenarios/applications. Moreover, we intend to compare them according to
other categories of Big Data problems, such as how these platforms/technologies manage
and integrate different data analysis outputs and algorithms.

Acknowledgments
This work has been supported by FUNCAP SPU 8789771/2017 research project and
CAPES fellowship.

References
Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O.,
Leich, M., Leser, U., Markl, V., et al. (2014). The stratosphere platform for big data
analytics. The VLDB Journal, 23(6):939–964.
Begoli, E. and Horey, J. (2012). Design principles for effective knowledge discovery from
big data. In Joint ICSA and ECSA, pages 215–218.
Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. (2011). Hyracks: A flexible
and extensible foundation for data-intensive computing. In ICDE, pages 1151–1162.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., and Tzoumas, K. (2015).
Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE
Computer Society TCDE, 36(4).
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. (2009). Mad skills:
new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492.
Data Governance Institute (2018). Definitions of data governance.
http://www.datagovernance.com/adg_data_governance_definition/. Accessed: 2018-05-01.
Dresner Advisory Services (2017). Big Data Analytics Market Study.
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., and Babu, S. (2011).
Starfish: a self-tuning system for big data analytics. In CIDR, pages 261–272.
Inoubli, W., Aridhi, S., Mezni, H., Maddouri, M., and Nguifo, E. M. (2018). An
experimental survey on big data frameworks. Future Generation Computer Systems.
Jabeen, H., Archer, P., Scerri, S., Versteden, A., Ermilov, I., Mouchakis, G., Lehmann, J.,
and Auer, S. (2017). Big data europe. In EDBT/ICDT Workshops.
Sakr, S. (2016). Big data 2.0 processing systems: a survey. Springer.
Singh, D. and Reddy, C. K. (2015). A survey on platforms for big data analytics. Journal
of Big Data, 2(1):8.
Soares, S. (2012). Big data governance: an emerging imperative. MC Press.
Sparks, E. R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J., Franklin, M. J.,
Jordan, M. I., and Kraska, T. (2013). Mli: An api for distributed machine learning. In
ICDM.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves,
T., Lowe, J., Shah, H., Seth, S., et al. (2013). Apache hadoop yarn: Yet another
resource negotiator. In Proceedings of the 4th Symposium SOCC.
White, T. (2012). Hadoop: The definitive guide. O’Reilly Media, Inc.
Wu, X., Zhu, X., Wu, G.-Q., and Ding, W. (2014). Data mining with big data. IEEE
TKDE, 26(1):97–107.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark:
Cluster computing with working sets. HotCloud, 10(10-10):95.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen,
J., Venkataraman, S., Franklin, M. J., et al. (2016). Apache spark: a unified engine for
big data processing. Communications of the ACM, 59(11):56–65.

Co - Ownership
7 pages
Big Data Landscape 2017
No ratings yet
Big Data Landscape 2017
1 page
The Flow of Food: Storage General Storage Guidelines
No ratings yet
The Flow of Food: Storage General Storage Guidelines
18 pages
Literature
No ratings yet
Literature
15 pages
Big Data and Hadoop For Developers - Syllabus
No ratings yet
Big Data and Hadoop For Developers - Syllabus
6 pages
Thesis Well Testing, Methods and Applicability
No ratings yet
Thesis Well Testing, Methods and Applicability
164 pages
Blank Column
No ratings yet
Blank Column
12 pages
SDC Lab Manual (Mid - 1)
No ratings yet
SDC Lab Manual (Mid - 1)
35 pages
Murud Janjira The Unsung Legacy of Siddi
No ratings yet
Murud Janjira The Unsung Legacy of Siddi
10 pages
Fuckbook
No ratings yet
Fuckbook
5 pages
Grammar Quiz - Gustavo Millan
No ratings yet
Grammar Quiz - Gustavo Millan
2 pages
Wika 0900766b813ecd99
No ratings yet
Wika 0900766b813ecd99
2 pages
(Ebook) Mental Disorder in Canada: An Epidemiological Perspective by John Cairney David L. Streiner ISBN 9781442698574, 1442698578
100% (2)
(Ebook) Mental Disorder in Canada: An Epidemiological Perspective by John Cairney David L. Streiner ISBN 9781442698574, 1442698578
77 pages
Wa0018.
No ratings yet
Wa0018.
17 pages
Pathfit 103 Learning Materials
No ratings yet
Pathfit 103 Learning Materials
47 pages
July 2010 Nursing Board Exam Topnotchers
No ratings yet
July 2010 Nursing Board Exam Topnotchers
2 pages
Competition Law in India by Nishith Desai
No ratings yet
Competition Law in India by Nishith Desai
120 pages
Thermodynamics Problems
No ratings yet
Thermodynamics Problems
10 pages
Mla Bibliography Website
100% (1)
Mla Bibliography Website
4 pages
Ethical Issues Powerpoint
No ratings yet
Ethical Issues Powerpoint
12 pages
Final Script Assembly Play
No ratings yet
Final Script Assembly Play
3 pages
GD
No ratings yet
GD
18 pages
Dynamics Ax 2012 r2 Import Export Framework Walkthrough Installation v1 Secured
No ratings yet
Dynamics Ax 2012 r2 Import Export Framework Walkthrough Installation v1 Secured
17 pages
Exam 1 MGMT 363 Review (CH 1-4)
No ratings yet
Exam 1 MGMT 363 Review (CH 1-4)
7 pages
Team Energy Vs Cir
No ratings yet
Team Energy Vs Cir
1 page

LADaS 2018 - Latin America Data Science Workshop

Big Data Analytics Technologies and Platforms: a brief review

2. Big Data Technologies

2.1. Overview of Big Data Technologies

Table 1. Summary of Big Data technologies discussed.

3. Big Data Platforms

3.2. Comparison of Big Data Platforms

Table 2. Summary of Big Data Platforms studied.
