10.1007@978-3-030-11881-510_paper_8
1 Introduction
During the past few years, growing data volumes have been considered a key challenge for organizations, pushing them to look for new approaches to scale their applications and computations. One of the solutions they considered was to distribute data storage and processing across clusters of hundreds of machines (e.g., Google, Facebook, Amazon). In addition to simple queries, complex algorithms such as machine learning and graph analysis are becoming common in many domains. Streaming analysis of real-time data is also required to let organizations take timely action.
Water management is no exception, since data collection and processing are becoming a challenge for practitioners, IT teams, and decision makers. Whether for managing river basin information, for managing water utility data, or for carrying out data-intensive hydrologic modelling, the data management task has always been challenging, and it has become more difficult with the advent of real-time sensors, remote imagery, and the need to speed up decision making [5–7].
Throughout this paper, we present a reference architecture for handling and managing smart metering water datasets. We demonstrate how recent advances in big data technologies (especially the Apache Spark project) can handle water big data efficiently and with fault tolerance in order to extract insights from those datasets. Finally, we highlight the advantages provided by Spark's distributed execution model by exploring three APIs and abstractions offered by Apache Spark: RDD, DataFrame, and SparkR. The aim of this paper is mainly to explore how Spark can be used with different abstractions to handle the big data constraints encountered in smart metering data processing. Due to lack of space, the impact of volume on these approaches is not addressed in this paper and will be developed in further work.
3.1 Velocity
The velocity of big data in water data management refers to the rapid rate at which data are generated and the efficiency with which they must be processed and analyzed. The data should be analyzed in a near-real-time manner to support tasks such as flood prediction and leak detection.
3.2 Variety
In terms of variety, water datasets come from multiple sources (sensors, smart metering, hydrological records, DEM, etc.) and at different temporal and spatial resolutions.
3.3 Volume
To illustrate the growth of the data volume generated by water management, Table 2 shows how the amount of data collected from smart metering grows with the granularity of measurement.
108 N. El Hassane and H. Hajji
As stated earlier, we focus in this paper on managing water big data coming from smart metering datasets with efficiency, scalability and fault tolerance.
In order to achieve those goals, it is not enough to focus on the processing layer alone. We should rather address all layers, from data collection and ingestion to processing and even visualization. It is in this sense that we propose an end-to-end architecture (Fig. 1) based on big data tools to ensure timely collection, rapid ingestion and efficient query processing.
Fig. 1. End to end big data architecture for managing massive water datasets.
This architecture (Fig. 1) is able to respond to both usual use cases: near real time and batch. Data collection and ingestion are carried out with three ingestion tools: Apache Camel [15] for file data sources, Apache Sqoop [16] for DBMS data sources, and Apache Kafka [17] for streaming data (real water smart metering datasets). Kafka is a publish/subscribe messaging system that is horizontally scalable and fault-tolerant. Data stored in Kafka is then consumed by Apache Spark, which can clean, transform and process the data before
Exploring Apache Spark Data APIs for Water Big Data Management 109
sending it to Apache Cassandra [18]. Once the data is stored in Cassandra, water smart metering data are made available to users through Spark SQL, one of Spark's APIs. Among the analytical queries that can be seamlessly executed within our architecture are leak detection and customer profiling queries¹.
Smart metering data also carry a spatial component, such as customer coordinates (X, Y) or smart meter locations (latitude, longitude). We have therefore introduced into our architecture at least one case where locations are handled: during the ingestion phase of smart metering data into Cassandra, spatial indexes (such as the Z-index) can be constructed from coordinates and locations.
¹ Such advanced queries will be developed separately in future work.
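To make the Z-index concrete, the following plain-Python sketch (not part of our ingestion code; coordinates are assumed to be pre-quantized to small unsigned integers) interleaves the bits of the two coordinates into a single Morton key:

```python
def z_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two quantized coordinates (Morton / Z-order code).

    x contributes the even bit positions of the key, y the odd ones, so
    points that are close in 2-D tend to map to close one-dimensional keys.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> even position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> odd position 2i+1
    return z
```

Because nearby points share long key prefixes, range scans over the resulting one-dimensional key preserve spatial locality reasonably well, which is what makes such an index attractive on top of a key-ordered store like Cassandra.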
All transformations in Spark are lazy: they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a
file). The transformations are only computed when an action requires a result to be
returned to the driver program.
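As a rough analogy only (plain Python generators, not Spark itself), the same contract can be sketched: building the pipeline merely records the transformations, and nothing runs until an action pulls the results:

```python
readings = range(1, 6)  # stand-in for a base dataset, e.g. a logger file

# "Transformations": nothing is computed yet, only the recipe is recorded.
litres = (r * 10 for r in readings)        # analogous to map
high = (v for v in litres if v > 20)       # analogous to filter

# "Action": iterating forces the whole pipeline to execute at once.
result = list(high)  # [30, 40, 50]
```

In Spark the same deferral lets the scheduler see the whole chain of transformations and optimize it before any data is touched.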
Spark SQL Dataframe
Spark SQL is an extension of Apache Spark for structured data processing. Unlike the basic Spark RDD API, it provides Spark with more information about the structure of both the data and the computation being performed, and it allows SQL queries written in basic SQL syntax to be executed.
² https://code.google.com/p/smart-meter-information-portal/.
Then comes the data preparation phase where Spark maps Smart metering data into
the above case class SmartWaterMeasure, and transforms it to Dataframe.
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. Once the RDD is implicitly converted to a DataFrame and registered as a table, it can be used in subsequent SQL statements.
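By analogy, this record-to-row flow can be sketched in plain Python (the field names below are illustrative, not the exact schema of our experiments): each raw logger line becomes a typed record, and the records are then flattened into rows that a SQL engine could query, mirroring the registration of the DataFrame as a table.

```python
from dataclasses import dataclass, astuple

@dataclass
class SmartWaterMeasure:  # mirrors the Scala case class in spirit
    meter_id: str
    year: int
    litres: float

# Parsing phase: raw logger lines become typed records.
raw = ["m1,2010,120.5", "m2,2012,98.0"]
records = [
    SmartWaterMeasure(m, int(y), float(l))
    for m, y, l in (line.split(",") for line in raw)
]

# "Registration" phase: typed records flattened into plain rows.
table = [astuple(r) for r in records]
# table == [("m1", 2010, 120.5), ("m2", 2012, 98.0)]
```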
Analytical Queries for Smart Metering Datasets
To illustrate the use of our approaches, we will present some analytical queries and
show how they can be expressed using the Dataframe abstraction.
– Query 1: Getting water smart meter data with measurement year greater than 2009
The first query is a simple one that returns water smart meter data whose measurement year is greater than 2009. The corresponding execution plan of the Spark query (Fig. 4) shows that the associated job is a chain of RDD dependencies organized in a directed acyclic graph (DAG).
First, it performs an ingestion operation on the available logger files, then applies the Map operation before the Filter operation. Recall that, for tuning and optimization, Spark relies on Project Tungsten to improve memory and CPU efficiency.
As the query merely filters the smart metering dataset, no shuffle is needed between the nodes of the cluster. The corresponding DAG consequently shows a single stage composed of subsequent tasks.
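The logic of Query 1 can be reproduced on a single machine with Python's standard sqlite3 module; this is only an illustrative analogue of the Spark SQL statement (table name, columns and sample rows are invented):

```python
import sqlite3

# Illustrative rows: (meter_id, year, consumption in litres)
rows = [("m1", 2008, 120.0), ("m2", 2010, 95.5), ("m3", 2012, 110.2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measures (meter_id TEXT, year INTEGER, litres REAL)")
conn.executemany("INSERT INTO measures VALUES (?, ?, ?)", rows)

# Same predicate as Query 1: keep measurements with year greater than 2009.
recent = conn.execute(
    "SELECT meter_id, year FROM measures WHERE year > 2009 ORDER BY year"
).fetchall()
# recent == [("m2", 2010), ("m3", 2012)]
```

As in Spark, the filter is a per-row predicate, which is why no data exchange between workers would be needed to evaluate it in a distributed setting.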
– Query 2: Aggregating smart meter data
This example of an aggregation query groups smart metering data by two attributes, Meter Id and Year of Measurement, and then applies two aggregate functions: Average and Maximum.
In the corresponding directed acyclic graph of the Spark query, we can notice that the associated job is composed of two stages because of the shuffling caused by the aggregation part of the query. Recall that shuffling is the process of transferring data between stages; it is one of the issues that must be minimized and tuned when developing big data applications. Fortunately, most of the shuffling is handled transparently by Spark SQL, in contrast to the RDD-based approach, where the user must deal with the shuffling issue explicitly (Fig. 5).
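The shape of Query 2 can likewise be illustrated with the standard sqlite3 module (again an analogue with invented names and sample values, not our Spark code): grouping by the two attributes and computing both aggregates in one pass.

```python
import sqlite3

# Illustrative readings: (meter_id, year, litres)
rows = [("m1", 2010, 100.0), ("m1", 2010, 200.0), ("m2", 2011, 50.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measures (meter_id TEXT, year INTEGER, litres REAL)")
conn.executemany("INSERT INTO measures VALUES (?, ?, ?)", rows)

# Group by (meter id, year) and apply Average and Maximum, as in Query 2.
agg = conn.execute(
    "SELECT meter_id, year, AVG(litres), MAX(litres) "
    "FROM measures GROUP BY meter_id, year ORDER BY meter_id"
).fetchall()
# agg == [("m1", 2010, 150.0, 200.0), ("m2", 2011, 50.0, 50.0)]
```

In a distributed engine, this is exactly the step that forces a shuffle: all rows sharing a (meter, year) key must end up on the same node before the aggregates can be finalized.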
Using Spark RDD
For this case, we found that the most interesting way to use RDDs for constructing analytical queries is to make use of the accumulator variable (see code below). Accumulators are variables used for aggregating information across the executors. Similar to counters in MapReduce, they are variables that are "added" to through an associative and commutative "add" operation. They are designed to be used safely and efficiently in parallel and distributed Spark computations.
The above code computes the average water consumption by month and by customer. First, each value is transformed into a pair RDD entry (value, 1). Then, to compute the average by key, the map method divides the sum by the count for each key. Finally, the collectAsMap method returns the averages as a dictionary. The keys in this example are month and customer.
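The average-by-key pattern just described can be sketched in plain Python (keys and values are illustrative; in Spark the reduction step would run distributed, e.g. via reduceByKey, rather than in a local loop):

```python
# Illustrative (key, litres) pairs, where key = (month, customer).
pairs = [((1, "c1"), 10.0), ((1, "c1"), 30.0), ((2, "c2"), 8.0)]

# Step 1: map each value v to (v, 1) so sum and count travel together.
mapped = [(k, (v, 1)) for k, v in pairs]

# Step 2: reduce by key, adding sums and counts component-wise
# (the associative, commutative operation a distributed reduce needs).
acc = {}
for k, (s, c) in mapped:
    s0, c0 = acc.get(k, (0.0, 0))
    acc[k] = (s0 + s, c0 + c)

# Step 3: divide sum by count per key; the dict result mirrors collectAsMap.
averages = {k: s / c for k, (s, c) in acc.items()}
# averages == {(1, "c1"): 20.0, (2, "c2"): 8.0}
```

Carrying the (sum, count) pair instead of a running average is what keeps the reduction associative, and hence safe to evaluate in any order across partitions.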
Note that although accumulators, as shared variables, can help express analytical queries, they suffer from issues such as the large number of accumulator copies that each executor must handle. Moreover, handling shuffling manually for advanced queries makes RDDs difficult to use and to maintain.
Using SparkR
Connecting the R software to a Spark cluster can be summarized as ingesting data through Spark's textFile operation and constructing a DataFrame (in the sense of a Spark DataFrame). This DataFrame is then available to R operations such as the filter or groupBy functions.
As seen from the code above, SparkR allows analytical queries to be expressed exactly as if the user were executing an R command. Thus, an RDD is first created by parallelizing the water metering logger file; then, using the flatMap and createDataFrame functions, a new DataFrame is made available to the user for executing a filter query on the dataset.
5 Conclusion
Smart water initiatives rely heavily on new technologies such as sensors and smart meters, which tend to produce large and hard-to-manage volumes of data. To manage such data, and to go beyond the traditional database approach, we presented an approach for processing water smart metering datasets using Apache Spark as a back end for processing and analyzing the large volumes of data produced by smart meters. As an early conclusion, and with respect to the version of Spark used (1.6), the DataFrame API appears to be an efficient, optimized and easy-to-use method, especially compared to the RDD abstraction, for handling smart metering data and for carrying out analytical tasks on it.
References
1. Akyildiz, L.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor networks
(2002)
2. Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V.: Position paper: characterising performance of environmental models. Environ. Model. Softw. 40, 1–20 (2013)
3. Bernardo, V., Curado, M., Staub, T., Braun, T.: Towards energy consumption measurement
in a cloud computing wireless testbed. In: Proceedings of the 2011 First International
Symposium on Network Cloud Computing and Applications, NCCA 2011, Washington,
DC, pp. 91–98. IEEE Computer Society (2011)
4. D’Agostino, D., Clematis, A., Galizia, A., Quarati, A., Danovaro, E., Roverelli, L., Zereik,
G., Kranzlmüller, D., Schiffers, M., Felde, N.G., Straube, C., Caumont, O., Richard, E.,
Garrote, L., Harpham, Q., Jagers, H.R.A., Dimitrijevic, V., Dekic, L., Fiorii, E., Delogu, F.,
Parodi, A.: The DRIHM project: a flexible approach to integrate HPC, grid and cloud
resources for hydro-meteorological research. In: Proceeding of the International Conference
for High Performance Computing, Networking, Storage and Analysis, SC 2014, Piscataway,
pp. 536–546. IEEE Press (2014)
5. Dunning, T., Friedman, E.: Time Series Databases. O’Reilly Media, Greenwich (2014)
6. Eichinger, F., Pathmaperuma, D., Vogt, H., Muller, E.: Data analysis challenges in the future
energy domain. In: Yu, T., Chawla, N., Simoff, S. (eds.) Computational Intelligent Data
Analysis for Sustainable Development; Data Mining and Knowledge Discovery Series. CRC
Press, Taylor Francis Group, Boca Raton. Chapter 7
7. Vatsavai, R.R., Ganguly, A., Chandola, V., Stefanidis, A., Klasky, S., Shekhar, S.:
Spatiotemporal data mining in the era of big spatial data: algorithms and applications. In:
Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big
Geospatial Data, BigSpatial 2012, New York, pp. 1–10. ACM (2012)
8. Fang, X., Misra, S., Xue, G., Yang, D.: Smart grid - the new and improved power grid: a
survey. IEEE Commun. Surv. Tutor. (2011)
9. Yigit, M., Cagri Gungor, V., Baktir, S.: Cloud computing for smart grid applications.
Comput. Netw. 70, 312–329 (2014)
10. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J.,
Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-
memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked
Systems Design and Implementation, NSDI 2012, Berkeley, p. 2. USENIX Association
(2012)
11. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster
computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot
Topics in Cloud Computing, HotCloud 2010, Berkeley, p. 10. USENIX Association (2010)
12. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-
tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles, SOSP 2013, New York, pp. 423–438. ACM
(2013)
13. Laney, D.: META Group, 3D Data Management: Controlling Data Volume, Velocity, and
Variety, February 2001
14. Eichinger, F., Pathmaperuma, D., Vogt, H., Müller, E.: Data analysis challenges in the future
energy domain. In: Yu, T., Chawla, N., Simoff, S. (eds.) Computational Intelligent Data
Analysis for Sustainable Development. Chapman and Hall/CRC, London (2013)
15. http://camel.apache.org/
16. http://sqoop.apache.org/
17. https://kafka.apache.org/
18. http://cassandra.apache.org/