
Exploring Apache Spark Data APIs for Water Big Data Management

Nassif El Hassane and Hicham Hajji

School of Geomatic Sciences and Surveying Engineering, SGIT, IAV Institute, Rabat, Morocco
[email protected], [email protected]

Abstract. Managing data complexity is a recurrent problem in multiple domains related to water resources management, such as utilities and hydrological and meteorological modelling. Since the advent of intelligent sensors, the volume of collected data has grown considerably. Moreover, these sensors generate near real-time data in various formats. To extract the full value of such water datasets, we need to design new solutions that are efficient enough to manage massive data coming from intelligent sensors in near real time and in various formats. In this paper we present a reference architecture for managing massive data collected from smart meters. We also show how recent advances in big data technologies, mainly the Apache Spark project, can effectively be used to obtain insights from massive datasets. Finally, we focus on the advantages provided by Spark's distributed execution model by exploring three Apache Spark APIs: RDD, Dataframe, and SparkR.

Keywords: Big Data · Spark · Water management · RDD · Dataframe

1 Introduction

During the past few years, growing data volumes have been a key challenge for organizations, pushing them to look for new approaches to scale their applications and computations. One solution they adopted was to distribute data storage and processing across clusters of hundreds of machines (as Google, Facebook, and Amazon do). In addition to simple queries, complex algorithms such as machine learning and graph analysis are becoming common in many domains, and streaming analysis of real-time data is required to let organizations take timely action.
Water management is no exception: data collection and processing are becoming a challenge for practitioners, IT teams, and decision makers. Whether for managing river basin information, managing water utility data, or carrying out data-intensive hydrologic modelling, the data management task has always been challenging, and it has become more difficult with the advent of real-time sensors, remote imagery, and the need to speed up decision making [5–7].
Throughout this paper, we present a reference architecture for handling and managing smart metering water datasets. We demonstrate how recent advances in big data technologies (especially the Apache Spark project) can handle water big data
efficiently and with fault tolerance to extract insights from those datasets. Finally, we highlight the advantages provided by Spark's distributed execution model by exploring three APIs and abstractions provided by Apache Spark: RDD, Dataframe, and SparkR. The aim of this paper is mainly to explore how Spark can be used with different abstractions to handle the big data constraints encountered in smart metering data processing. Due to lack of space, the impact of volume on these approaches is not addressed in this paper and will be developed in further work.

2 Big Data Examples in Water Management

2.1 Smart Grid and Water Smart Metering


With the challenges of managing water scarcity, preventing disasters such as flooding, and minimizing the impacts of drought in arid regions, water utilities need to reinvent their monitoring techniques and rely heavily on IT technologies. Such new techniques can play a central role in assisting water decision makers by giving them faster access to better information, enabling better decisions and the rapid dissemination of that information to customers and other stakeholders. Among the recent advances, one can list the following two initiatives and technologies:
• Advanced metering systems: systems that enable the measurement of detailed, time-based information and the frequent collection and transmission of such information to various parties [3]. All the data is collected from one smart meter as the water enters the property. Such a meter can measure water usage, pressure, and temperature with high accuracy, with measurements taken at a fixed interval (every second, minute, etc.) [8]. The collected water consumption flow data can enhance many tasks, such as leak detection and understanding end-use events (e.g. shower, toilet, washing machine) [9].
• Real-time sensors for specific measurements have been developed recently for water quality monitoring [1], and they aim to simplify remote water quality monitoring. With multiple sensors that measure a dozen of the most relevant water quality parameters, they are suitable for potable water monitoring, chemical leakage detection in rivers, remote measurement of swimming pools and spas, and monitoring levels of seawater pollution.

2.2 Data Intensive Hydrologic Modelling


Computer models of watershed hydrology are highly data intensive, and the time consumed running hydrologic models (especially physically based and distributed hydrologic models) is still a concern for hydrologic practitioners and scientists [2, 4]. In addition, the complexity of the calibration problem has increased substantially. Recall that the successful application of a hydrologic model depends on how well the model is calibrated.

2.3 Remote Sensing, Atmospheric Measurements and Climate Change Modelling
Many climate models are using high-end supercomputers to complete a comprehensive
set of climate change simulations that will be used to advance scientists’ knowledge of
climate variability and climate change. In addition, remote sensing observations (e.g., remote sensing imagery, Atmospheric Radiation Measurement (ARM) data) are generating large amounts of scientific data (see Table 1, which shows some of the climate and earth systems data stored at the Earth System Grid (ESG) portal).

Table 1. Example of scientific data stored on the Earth System Grid.

                      CMIP5        ARM                      DAAC
Sponsor               SciDAC       DOE/BER                  NASA
Description of data   40+ models   Atmospheric processes    Biogeochemical dynamics,
                                   and cloud dynamics       FLUXNET
Archive size          ~6 PB        ~200 TB                  1 TB
Year started          2010         1991                     1993

3 Overview of Constraints in Big Data Water Management Datasets

Water datasets, whether captured through remote sensors or produced by large-scale simulations, have always been big, as stated above. Like traditional big data, they can be characterized mainly by three features: Volume, Variety, and Velocity, the three "V" dimensions defined by Laney at META Group in 2001 [13]:

3.1 Velocity
The velocity of big data in water data management involves both the rapid rate at which data is generated and the efficiency of data processing and analysis. The data should be analyzed in a near-real-time manner to achieve a given task, e.g. flooding prediction or leak detection.

3.2 Variety
In terms of variety, water datasets come from multiple sources (sensors, smart metering, hydrological models, DEM, etc.) and at different temporal and spatial resolutions.

3.3 Volume
To illustrate the growth of the data volume generated by water management, Table 2 shows how the data collected from smart metering grows with the granularity of measurement (for instance, one measurement per second yields 365 × 86 400 = 31 536 000 readings per meter per year).

Table 2. Data storage issues for smart metering [14].

Granularity   Measures per year   Storage per year   Storage per year for 40 million meters
One second    31 536 000          120 MB             4 PB
One minute    525 600             2 MB               76 TB
One hour      8760                34 KB              1 TB
One day       365                 1 KB               54 GB

4 Big Data Architecture for Managing Massive Water Datasets Using Spark

As stated earlier, this paper focuses on managing water big data coming from smart metering datasets with efficiency, scalability, and fault tolerance.
To achieve these goals, it is not enough to focus on the processing layer alone. We should rather treat all layers, from data collection and ingestion through processing and even visualization. It is in this sense that we propose an end-to-end architecture (Fig. 1) based on big data tools to ensure timely collection, rapid ingestion, and efficient query processing.

Fig. 1. End to end big data architecture for managing massive water datasets.

This architecture (Fig. 1) is able to respond to both usual use cases: near real time and batch. Data collection and ingestion are carried out by three ingestion tools: Apache Camel [15] for file data sources, Apache Sqoop [16] for DBMS data sources, and Apache Kafka [17] for streaming data (real water smart metering datasets). Kafka is a publish/subscribe messaging system that is horizontally scalable and fault-tolerant. Data stored in Kafka is then consumed by Apache Spark, which can clean, transform, and process the data before sending it to Apache Cassandra [18]. Once the data is stored in Cassandra, water smart metering data is made available to users through Spark SQL, one of Spark's APIs. Among the analytical queries that can be seamlessly executed within our architecture are leak detection and customer profiling queries (such advanced queries will be developed separately in further works).
Smart metering data also has a spatial component, such as customer coordinates (X, Y) or smart meter locations (latitude, longitude). We have introduced in our architecture at least one case where locations are handled: during the ingestion of smart metering data into Cassandra, spatial indexes (such as a Z-index) can be constructed from coordinates and locations.
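The following sketch illustrates the streaming path of this architecture in Scala. It is a minimal, illustrative example against the Spark 1.6 streaming API, assuming the spark-streaming-kafka and spark-cassandra-connector packages; the topic name, ZooKeeper address, record layout, and Cassandra schema are hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

object MeterIngestion {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WaterMeterIngestion")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Consume raw readings from the hypothetical Kafka topic "meter-readings"
    val lines = KafkaUtils
      .createStream(ssc, "zookeeper:2181", "spark-ingest", Map("meter-readings" -> 1))
      .map(_._2)

    // Parse "meterId,year,month,day,interval,value" records, dropping malformed lines
    val readings = lines.flatMap { line =>
      line.split(",") match {
        case Array(id, y, m, d, i, v) =>
          Some((id, y.toInt, m.toInt, d.toInt, i.toInt, v.toDouble))
        case _ => None
      }
    }

    // Persist cleaned readings to a hypothetical Cassandra table water.readings
    readings.saveToCassandra("water", "readings",
      SomeColumns("meter_id", "year", "month", "day", "interval", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}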

4.1 Processing Water Datasets Using Apache Spark

In this part, three approaches based on Apache Spark for processing water smart metering data are explored. Recall that Apache Spark is an open-source parallel data flow system built on the concept of Resilient Distributed Datasets (RDD), a fault-tolerant collection of elements that can be operated on in parallel [10–12]. Because RDDs are cached in memory and the data flow is created lazily, Spark's model is well suited to bulk iterative algorithms.
In the following sections, we briefly recall the three Apache Spark abstractions, RDD, Dataframe and SparkR, that are explored in our work.
Resilient Distributed Dataset (RDD)
The first abstraction of Apache Spark we explore is the Resilient Distributed Dataset (RDD), which represents an immutable, partitioned collection of elements that can be operated on in parallel.
Internally, each RDD is characterized by five main properties (see Fig. 2): a list of partitions, a function for computing each partition, a list of dependencies on other RDDs, and, optionally, a partitioner for key-value RDDs and a list of preferred locations for computing each partition. RDDs support two types of operations (Table 3):

Fig. 2. Resilient distributed dataset properties.

• Transformations, which create a new dataset from an existing one.
• Actions, which return a value to the driver program after running a computation on the dataset.


Table 3. Two types of operations supported by Spark.

Operation            Functions
RDD transformation   map(func); flatMap(func); filter(func); mapPartitions(func);
                     mapPartitionsWithIndex(); union(otherDataset);
                     intersection(otherDataset); distinct(); groupByKey();
                     reduceByKey(func, [numTasks]); sortByKey(); join(); coalesce()
RDD action           count(); collect(); take(n); top(n); countByValue();
                     reduce(func); fold(); aggregate(); foreach(func)

All transformations in Spark are lazy: they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
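The following sketch illustrates this lazy evaluation on meter readings. It is a minimal example assuming a spark-shell session, where the SparkContext sc is predefined; the path and record layout are hypothetical:

// Transformations build the lineage but trigger no computation
val lines  = sc.textFile("hdfs:///smip/loggers/*.csv")              // hypothetical path
val values = lines.map(_.split(",")).map(fields => fields(5).toDouble)
val high   = values.filter(_ > 100.0)

// Only this action forces Spark to actually execute the chain above
val count = high.count()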
Spark SQL Dataframe
Spark SQL is an extension of Apache Spark for structured data processing. Unlike the basic Spark RDD API, it provides Spark with more information about the structure of both the data and the computation being performed, and it allows SQL queries written in a basic SQL syntax to be executed.

Table 4. RDD and DataFrame comparison.

                     RDD                                    DataFrame
Data formats         Can be used to process structured      Data is organized into named
                     as well as unstructured data           columns
Data representation  A distributed collection of data       Data is organized into named
                     elements spread across many machines   columns; basically the same as
                     in the cluster, as a set of Scala or   a table in a relational database
                     Java objects representing data
Optimization         No built-in optimization engine        The Catalyst optimizer plans
                                                            and optimizes queries
Serialization        Java serialization is used whenever    Data can be serialized into
                     data is distributed over the cluster   off-heap storage in binary format
Efficiency and       Serializing Java and Scala objects     Off-heap serialization reduces
memory use           individually decreases efficiency      the overhead

A DataFrame can be considered a distributed collection of data organized into named columns, and can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. To understand the main differences between Apache Spark RDD and DataFrame, we compare them on the basis of different features (Table 4).
SparkR Package Using R Software
R is a popular tool for statistics and data analysis, with rich visualization capabilities and a large collection of libraries. However, in the context of big data, it is designed to run only on in-memory, single-machine data, which makes it unsuitable for large datasets.
SparkR is an R package that provides a frontend for using an Apache Spark cluster from R (Fig. 3). It provides a distributed data frame implementation (similar to R data frames) that supports operations such as selection, filtering, and aggregation, but on large datasets.
SparkR allows users to connect R programs to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs. It can operate on a variety of data sources through the Spark DataFrame interface.

Fig. 3. Diagram of SparkR connecting to a Spark cluster (https://aws.amazon.com/fr/blogs/big-data/crunching-statistics-at-scale-with-sparkr-on-amazon-emr/).

4.2 Three Approaches for Processing Water Datasets: RDD, Dataframe and SparkR
To illustrate the use of Apache Spark for water management, we briefly present how the Spark ecosystem can be used to process analytical queries, from simple to complex, on water smart metering datasets. In practice, we explored Spark processing using three approaches:
• Using Resilient Distributed Datasets as the central backbone for describing our queries.
• Using DataFrame and Spark SQL for representing and formulating our queries.
• Using SparkR for interacting with our water smart metering datasets through the R API over Apache Spark.

The code related to the three approaches can be found at https://github.com/hajjihi/BD4WM. The data used in our prototype consists of 2246 files downloaded from the Smart Metering Information Portal SMIP (https://code.google.com/p/smart-meter-information-portal/). SMIP is an online environment that allows researchers to collect, preserve, access, and collate data gathered from smart meter devices. It is also a secure service provided to assist researchers from the Smart Water Research Centre in querying and maintaining water logger details and associated data (logger data, household survey data, logger history).
Using Spark Dataframe
Water Smart Metering Data Ingestion
When Spark ingests logger data files, available through any convenient storage such as HDFS or a local file system (available on all nodes), it constructs an RDD collection. From this point on, the data is presented as a distributed dataset that can be operated on in parallel.
Preparing Case Classes for Inferring Schema
Case classes are regular classes that provide a recursive decomposition mechanism via pattern matching. In our case, the case class describes the granular information collected from smart metering (see the sketch after this list), such as:
• Id of the smart meter.
• Date of the measure (year, month, day, and interval).
• Value of the measure.
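The corresponding case class might look as follows. This is a minimal sketch: only the class name SmartWaterMeasure appears in the text, so the field names and types are assumptions based on the fields listed above:

case class SmartWaterMeasure(
  meterId: String,   // id of the smart meter
  year: Int,         // date of the measure
  month: Int,
  day: Int,
  interval: Int,     // intra-day measurement interval
  value: Double      // value of the measure
)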


Then comes the data preparation phase, where Spark maps the smart metering data into the above case class SmartWaterMeasure and transforms it into a Dataframe.
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. Once the RDD is implicitly converted to a DataFrame and registered as a table, it can be used in subsequent SQL statements.
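A sketch of this preparation step, assuming Spark 1.6, an existing SparkContext sc (as in spark-shell), and comma-separated logger files at a hypothetical path:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Parse each "meterId,year,month,day,interval,value" record into the case class,
// convert the RDD implicitly to a DataFrame, and register it as a table
val measures = sc.textFile("hdfs:///smip/loggers/*.csv")
  .map(_.split(","))
  .map(p => SmartWaterMeasure(p(0), p(1).toInt, p(2).toInt,
                              p(3).toInt, p(4).toInt, p(5).toDouble))
  .toDF()
measures.registerTempTable("measures")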
Analytical Queries for Smart Metering Datasets
To illustrate our approaches, we present some analytical queries and show how they can be expressed using the Dataframe abstraction.

Fig. 4. Directed acyclic graph of the filtering query.

– Query 1: Getting water smart meter data with measurement year greater than 2009
The first query is a simple one that returns water smart meter data whose measurement year is greater than 2009 (see the sketch below). In the corresponding directed acyclic graph of the Spark query (Fig. 4), we can see that the job associated with the query consists of a chain of RDD dependencies organized in a directed acyclic graph (DAG).
First, it ingests the available logger files, then applies the Map operation, followed by the Filter operation. Recall that, for tuning and optimization, Spark uses Project Tungsten to improve memory and CPU efficiency.
As the query merely filters the smart metering dataset, no shuffle is needed between the nodes of the cluster. The corresponding DAG consequently shows a single stage composed of subsequent tasks.
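A possible formulation of this query, sketched against the measures table and DataFrame registered above:

// SQL formulation against the registered table
val after2009 = sqlContext.sql("SELECT * FROM measures WHERE year > 2009")

// Equivalent DataFrame expression
val after2009Df = measures.filter(measures("year") > 2009)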
– Query 2: Aggregating smart meter data
This aggregation query groups smart metering data by two attributes, meter id and year of measurement, and then applies two aggregate functions, average and maximum (see the sketch below).
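A possible formulation of this query, again sketched against the measures table registered above:

val aggregated = sqlContext.sql(
  """SELECT meterId, year, AVG(value) AS avgValue, MAX(value) AS maxValue
     FROM measures
     GROUP BY meterId, year""")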

In the corresponding directed acyclic graph of the Spark query (Fig. 5), we can notice that the job associated with the query is composed of two stages, because of the shuffling caused by the aggregation part of the query. Recall that shuffling is the process of data transfer between stages; it is one of the problems that need to be minimized and tuned when developing big data applications. Fortunately, most of the shuffling is taken care of automatically when using Spark SQL, contrary to the RDD-based approach, where the user must pay close attention to the shuffling issue.
Using Spark RDD
For this case, we found that the most interesting way to use RDDs for constructing analytical queries is to make use of the Accumulator variable together with pair RDDs (see the code below). Accumulators are variables used for aggregating information across the executors. Similar to counters in MapReduce, they are variables that are "added" to through an associative and commutative "add" operation, and they are designed to be used safely and efficiently in parallel and distributed Spark computations.

Fig. 5. Directed acyclic graph of the aggregation query.
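A sketch of this computation, assuming readingsRdd is an RDD of the SmartWaterMeasure records defined earlier (a hypothetical name; the full code is in the project repository):

// Key each reading by (customer meter id, month), pairing the value with a count of 1
val pairs = readingsRdd.map(r => ((r.meterId, r.month), (r.value, 1.0)))

// Sum values and counts per key
val sums = pairs.reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

// Divide the sum by the count for each key and return the result as a dictionary
val avgByKey = sums.mapValues { case (sum, count) => sum / count }.collectAsMap()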

The above code computes the average water consumption by month and by customer. First, each value is transformed into a pair RDD of (value, 1). Then, to compute the average by key, the map method is used to divide the sum by the count for each key. Finally, the collectAsMap method is used to return the average by key as a dictionary. The keys in this example are month and customer.
Let us note that even though the Accumulator, as a shared variable, can help express analytical queries, it suffers from some issues, such as the large number of accumulator copies that each executor must handle. Moreover, handling shuffling manually for advanced queries makes RDDs very difficult to use and maintain.
Using SparkR
The connection between R and a Spark cluster can be summarized as ingesting data with the textFile Spark operation and constructing a dataframe (in the sense of a Spark Dataframe), as sketched below. This dataframe is then available to R operations such as the filter or groupBy functions.
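A minimal SparkR sketch of this connection, in R. It assumes SparkR 1.6, whose RDD functions are internal and accessed with the ::: operator; the path and field layout are assumptions:

library(SparkR)

sc <- sparkR.init(appName = "WaterSparkR")
sqlContext <- sparkRSQL.init(sc)

# Ingest the water metering logger file as an RDD
# (RDD functions are internal in SparkR 1.6, hence the ::: accessor)
lines <- SparkR:::textFile(sc, "hdfs:///smip/loggers/part-00000")

# flatMap each line into a named record
rows <- SparkR:::flatMap(lines, function(line) {
  p <- strsplit(line, ",")[[1]]
  list(list(meterId = p[[1]], year = as.integer(p[[2]]), value = as.numeric(p[[6]])))
})

# Build a Spark DataFrame and filter it exactly as an ordinary R command
df <- createDataFrame(sqlContext, rows)
recent <- filter(df, df$year > 2009)
head(recent)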

As seen from the code above, SparkR allows analytical queries to be expressed exactly as if the user were executing an R command. Thus, an RDD is first created by parallelizing the water metering logger file; then, using the flatMap and createDataFrame functions, a new dataframe is made available to the user for executing a filter query on the dataset.

5 Conclusion

Smart water initiatives rely heavily on new technologies such as sensors and smart meters, which tend to produce large volumes of data that are difficult to manage. To manage such data, going beyond the traditional database approach, we presented an approach for processing water smart metering datasets that uses Apache Spark as the back end for processing and analyzing the large volumes of data produced by smart meters. As an early conclusion, and with respect to the version of Spark used (1.6), the DataFrame abstraction appears to be an efficient, optimized, and easy-to-use method, especially compared to the RDD abstraction, for handling smart metering data and performing analytical tasks on it.

References
1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks. IEEE Commun. Mag. 40(8), 102–114 (2002)
2. Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.
J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B.,
Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V.: Position paper: characterising
performance of environmental models. Environ. Model. Softw. 40, 1–20 (2013)
3. Bernardo, V., Curado, M., Staub, T., Braun, T.: Towards energy consumption measurement
in a cloud computing wireless testbed. In: Proceedings of the 2011 First International
Symposium on Network Cloud Computing and Applications, NCCA 2011, Washington,
DC, pp. 91–98. IEEE Computer Society (2011)
4. D’Agostino, D., Clematis, A., Galizia, A., Quarati, A., Danovaro, E., Roverelli, L., Zereik,
G., Kranzlmüller, D., Schiffers, M., Felde, N.G., Straube, C., Caumont, O., Richard, E.,
Garrote, L., Harpham, Q., Jagers, H.R.A., Dimitrijevic, V., Dekic, L., Fiorii, E., Delogu, F.,
Parodi, A.: The DRIHM project: a flexible approach to integrate HPC, grid and cloud
resources for hydro-meteorological research. In: Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis, SC 2014, Piscataway,
pp. 536–546. IEEE Press (2014)

5. Dunning, T., Friedman, E.: Time Series Databases. O’Reilly Media, Greenwich (2014)
6. Eichinger, F., Pathmaperuma, D., Vogt, H., Müller, E.: Data analysis challenges in the future energy domain. In: Yu, T., Chawla, N., Simoff, S. (eds.) Computational Intelligent Data Analysis for Sustainable Development, Data Mining and Knowledge Discovery Series, Chapter 7. CRC Press, Taylor & Francis Group, Boca Raton (2013)
7. Vatsavai, R.R., Ganguly, A., Chandola, V., Stefanidis, A., Klasky, S., Shekhar, S.:
Spatiotemporal data mining in the era of big spatial data: algorithms and applications. In:
Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big
Geospatial Data, BigSpatial 2012, New York, pp. 1–10. ACM (2012)
8. Fang, X., Misra, S., Xue, G., Yang, D.: Smart grid - the new and improved power grid: a
survey. IEEE Commun. Surv. Tutor. (2011)
9. Yigit, M., Cagri Gungor, V., Baktir, S.: Cloud computing for smart grid applications.
Comput. Netw. 70, 312–329 (2014)
10. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J.,
Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-
memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked
Systems Design and Implementation, NSDI 2012, Berkeley, p. 2. USENIX Association
(2012)
11. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster
computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot
Topics in Cloud Computing, HotCloud 2010, Berkeley, p. 10. USENIX Association (2010)
12. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-
tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles, SOSP 2013, New York, pp. 423–438. ACM
(2013)
13. Laney, D.: META Group, 3D Data Management: Controlling Data Volume, Velocity, and
Variety, February 2001
14. Eichinger, F., Pathmaperuma, D., Vogt, H., Müller, E.: Data analysis challenges in the future
energy domain. In: Yu, T., Chawla, N., Simoff, S. (eds.) Computational Intelligent Data
Analysis for Sustainable Development. Chapman and Hall/CRC, London (2013)
15. http://camel.apache.org/
16. http://sqoop.apache.org/
17. https://kafka.apache.org/
18. http://cassandra.apache.org/
