DDI Book Chapter Tools and Techniques
Author(s):
M. Ateeq (The Islamia Univ of Bahawalpur, [email protected])
*M.K. Afzal (COMSATS Univ Islamabad, Wah Campus, [email protected])
With the advent of data-driven techniques, it has become inevitable to provide practical support for benefiting from them. Because the systems are largely controlled through software, programming languages are the primary focus for providing the necessary support. The development of custom tools that can automate known processes and abstract the implementation away from users is considered productive. Such tools can help novice users by easing the implementation and can save time for advanced users by automating routine steps.
An important aspect of realizing a data-driven network is to devise practical frameworks and process flows that can be followed in order to benefit from this promising artefact. In theory, the methods for handling and learning from data have been proposed and explored at length. These studies and findings greatly benefit the proponents of data-driven techniques in the communication discipline. The methods include procedures such as cleaning and completing data, analysing and visualizing the data, identifying and posing the right questions, and applying suitable learning models.
This chapter is divided into five sections. In section 2, we survey popular programming languages such as Python and R with relevant libraries (e.g., pandas, NumPy, Matplotlib, scikit-learn, data.table, dplyr, and ggplot2). Section 3 comprises useful tools such as Weka, Orange, and RapidMiner. The flow of an example generic process is presented in section 4.
1.2 Programming Languages
When it comes to programming languages for data-driven techniques, Python and R are considered among the most prominent examples in this context. In the following, we describe each of them along with their important libraries.
1.2.1 Python
Python, with its general-purpose language design, ease of learning, and wide applicability, is considered the most obvious choice when applying data-driven techniques and models in any domain. Its pseudocode-like syntax makes the language easy to learn and allows solutions to be developed with a focus on the problem rather than on the language itself.
Data-driven techniques require support for descriptive and inferential data analysis, data visualization, data cleaning, data transformation, statistical measures, and machine learning, among others. Python, with its rich set of libraries and an open-source community process, serves these needs adequately. In the following, we give a brief overview of the most important libraries.
NumPy
NumPy is short for Numerical Python and provides the foundation for numerical operations. NumPy is among the most actively maintained and contributed-to libraries of Python. It implements multi-dimensional arrays that are expressive and fast. In order to meet the speed challenges, NumPy provides a wide range of built-in functions that are optimized to perform well even on large arrays. Its notable strengths are as follows:
- Facilitates interfacing with code from other languages like C/C++ and Fortran.
- Provides a solid foundation for other libraries like Pandas, scikit-learn, and SciPy.
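As a minimal sketch of the kind of numerical work NumPy supports, the following computes simple delay statistics over a small array; the sample values are illustrative only.

import numpy as np

# Illustrative per-packet delays (ms), e.g., extracted from a traffic log.
delays = np.array([12.4, 15.1, 9.8, 22.3, 11.0])

# Vectorized operations execute in optimized native code instead of
# Python-level loops, which is where NumPy's speed comes from.
mean_delay = delays.mean()
jitter = delays.std()
normalized = (delays - mean_delay) / jitter
print(mean_delay, jitter, normalized)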
Pandas
Pandas is generally the foremost library that a data scientist has to learn, along with NumPy and Matplotlib. Pandas is amongst the most actively maintained Python libraries. It is extensively used for all preparatory and initial analytical steps involved in data cleaning and analysis. For large datasets, Pandas uses NumPy at the backend and is usually blazingly fast compared to pure-Python alternatives. Its notable strengths are as follows:
- An expressive set of functions and features that enable adequate data handling.
- Implements complex and demanding data operations with a high level of abstraction.
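As a brief illustration of typical preparatory steps, the following sketch loads a hypothetical traffic log (the file name and column names are assumptions) and performs basic cleaning and summarization with Pandas.

import pandas as pd

# Hypothetical traffic log; the file and column names are assumptions.
df = pd.read_csv("traffic_log.csv")

# Typical preparatory steps: drop incomplete rows, derive a feature,
# and summarize the data before any modelling.
df = df.dropna()
df["throughput_mbps"] = df["bytes"] * 8 / df["duration_s"] / 1e6
print(df.describe())
print(df.groupby("node_id")["throughput_mbps"].mean())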
Matplotlib
Matplotlib is the primary library for visualization and plotting provided for Python. Thanks to its diverse output formats, visualizations created with Matplotlib can be used from almost any other platform, programming language, or tool. An object-oriented API makes sure that the visualizations created with Matplotlib can be integrated into other applications with ease. Its notable strengths are as follows:
- An open-source and free competitor to MATLAB with adequate features and facilities.
- Implements diverse output types that are useable across different platforms without modification.
- Efficient in using system resources, so plots from big datasets can be created with sufficient ease.
- Integrates necessary statistical analysis and measures like correlations and confidence intervals.
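A minimal sketch of the object-oriented API follows; the delay samples are synthetic and stand in for real measurements.

import matplotlib.pyplot as plt
import numpy as np

# Synthetic delay measurements standing in for real data.
time = np.arange(100)
delay = 10 + 2 * np.random.randn(100)

# The object-oriented API returns figure and axes objects that can be
# embedded in other applications.
fig, ax = plt.subplots()
ax.plot(time, delay, label="delay (ms)")
ax.set_xlabel("sample")
ax.set_ylabel("delay (ms)")
ax.legend()
fig.savefig("delay.png")  # one of several supported output formats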
Scikit-learn
Scikit-learn is Python's most prolific machine learning library. It implements almost all common machine learning models and relevant concepts, and it works well with Pandas, NumPy, and Matplotlib.
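As a minimal sketch of a typical scikit-learn workflow, the following trains and evaluates a classifier; the feature matrix and labels are fabricated for illustration only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real feature matrix and labels.
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# The usual split-fit-predict-evaluate cycle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))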
Some other libraries turn out to be important at various stages of data-driven systems, including SciPy (scientific Python), TensorFlow, Keras, and PyTorch (for deep learning), and Scrapy and BeautifulSoup (for scraping and dealing with semi-structured or unstructured data).
1.2.2 R
R is a language specialized for statistical computing that is free as well as open source. However, in contrast to Python, R is more confined and specialized for data processing and does not find appreciation for solving computing problems in general. The language supports data types like lists, vectors, arrays, and data frames that make it very convenient for manipulating data.
As expected, R has a wide range of packages related to data manipulation and processing. In the following, we briefly explain some of the prominent packages and libraries by category.
Data Loading
R offers wide support in terms of loading data from diverse formats and sources. R can be used to read plain data files without needing any package. In order to read data from databases, packages like DBI and odbc are available. XLConnect and xlsx are packages to read and write Excel files. R also interfaces with software like SPSS or Stata using packages such as foreign and haven.
Data Manipulation
A collection of relevant packages is available in the form of the tidyverse. The primary package for data manipulation is dplyr. It provides data manipulation similar to pandas and is quite fast. In addition, there are further useful packages for data manipulation like tidyr, stringr, and lubridate.
Data Visualization
R also has a number of packages for data visualization. ggplot2 is the main library in R for creating feature-rich custom visuals. In addition, ggvis, rgl, htmlwidgets, and googleVis make a list of other useful visualization packages.
Data Modelling
A number of packages implement popular machine learning models for R programmers. In addition, there are many specialized packages for various machine learning and statistical models, including car (having ANOVA functions), mgcv (additive models), multcomp, vcd, glmnet, and caret.
Other Packages
In addition to the packages discussed above, R has great support for data scientists through additional packages for various tasks. For example, it provides shiny, xtable, and Markdown for reporting results. sp, maptools, maps, and ggmap are popular when dealing with spatial data. zoo, xts, and quantmod are useful for analyzing time series and financial data. R also provides the facility to write one's own packages through devtools, testthat, and roxygen2.
1.3 Tools
In this section, we list a few important tools useful in adopting and implementing data-driven methods. In this context, we discuss three different categories of tools. In the first category, we place tools that are capable of handling large-scale data. The second category belongs to tools useful for data analysis, and the third category covers tools for predictive modelling.
Tools for Big Data Handling
Intuitively, data-driven techniques are driven by some kind of data at the backend. Data can be structured, semi-structured, unstructured, or a combination of these. When it comes to dealing with communication and networks, the data can be performance statistics coming from traffic logs or some kind of signals. In general, any kind of statistics measured from traffic logs presents data in a structured way. Even if the data represents some kind of signals, it is often expressed in the form of numerical quantities. Therefore, communication and networks often deal with structured data.
Normal-sized data can be represented using text files, where CSV is a popular format. However, larger datasets require better support to store, retrieve, and process the data at scale. In the following, we briefly discuss some important and useful tools for big data handling.
Hadoop
Hadoop offers a solution to big data problems using a network of computers. It is a collection of open-source libraries with the Hadoop Distributed File System (HDFS) at its core for the storage of data, and it uses MapReduce as the programming model. It uses YARN to manage computing resources in the clusters and to schedule applications. However, a major limitation is that MapReduce can run one job at a time in batch-processing mode. This limits the usefulness of Hadoop as a real-time analysis framework and makes it a prominent choice for data warehousing instead.
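To make the MapReduce programming model concrete, the following is a pure-Python sketch of its two phases applied to counting status codes in log lines; it does not use Hadoop's actual API, and the log records are invented.

from collections import defaultdict

# Map phase: emit (key, value) pairs, here one count per status code.
def map_phase(record):
    status = record.split()[-1]
    yield (status, 1)

# Reduce phase: aggregate all values observed for one key.
def reduce_phase(key, values):
    return key, sum(values)

log = ["GET /index 200", "GET /a 404", "GET /b 200"]
groups = defaultdict(list)
for record in log:
    for key, value in map_phase(record):
        groups[key].append(value)
print([reduce_phase(k, v) for k, v in groups.items()])  # [('200', 2), ('404', 1)]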
Spark
Spark is popular for real-time data stream processing in the context of big data systems. Although the workflows used in Spark are based on Hadoop MapReduce, they are more efficient because Spark provides its own streaming API rather than banking on Hadoop YARN. This makes Spark more suitable for real-time data stream processing, as against Hadoop, which has turned out to be a tool for storage and batch processing. Spark banks on data stored in Hadoop and does not implement its own storage system; for storage, it relies on HDFS or other compatible systems.
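A minimal sketch using PySpark, Spark's Python API, follows; the HDFS path and column names are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("traffic-analysis").getOrCreate()

# Read a hypothetical traffic log stored in HDFS.
df = spark.read.csv("hdfs:///logs/traffic.csv", header=True, inferSchema=True)

# The aggregation is executed in a distributed fashion across the cluster.
df.groupBy("node_id").agg(F.avg("delay_ms").alias("mean_delay")).show()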
Cloudera
Cloudera was developed as an enterprise-level deployment solution based on Hadoop. It can interact with and access data from heterogeneous environments, offering real-time analysis. Cloudera can interact with different clouds, thus implementing truly enterprise-wide solutions. In addition to data analysis, it also provides the capability to train and deploy data models. Cloudera is versatile in that it can be deployed across multiple clouds as well as on site. It provides support for application development in multiple languages, including C/C++.
MongoDB
MongoDB is a free solution for implementing databases that can overcome the limitations of relational databases and is based on the NoSQL design scheme. MongoDB can handle large volumes of data. MongoDB Atlas enables developers to manage databases across different cloud providers, including Azure, AWS, and Google Cloud, and it supports more than 10 programming languages. Some of the prominent features of MongoDB are listed here. The game-changing feature of MongoDB is that it supports real-time analytics based on ad-hoc queries. Moreover, its indexing and data replication features demonstrate a great performance advantage, and its load balancing supports deployments at scale.
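The following is a short sketch of schema-free storage and an ad-hoc query using the pymongo driver; the connection string, database, and collection names are assumptions.

from pymongo import MongoClient

# Connection string and names are assumptions for illustration.
client = MongoClient("mongodb://localhost:27017")
collection = client["netdata"]["traffic"]

# Documents need no fixed schema, unlike rows in a relational table.
collection.insert_one({"node_id": 7, "delay_ms": 12.4, "status": "ok"})

# An index plus an ad-hoc query supports the real-time analytics use case.
collection.create_index("node_id")
for doc in collection.find({"delay_ms": {"$gt": 10.0}}):
    print(doc)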
Tools for Data Analysis
There are a lot of tools that provide users with the capability to do primitive and advanced statistical analysis of data without having to program anything. In general, such tools have built-in methods that carry out the implemented tasks by following a well-defined sequence of steps. This is why these tools are easy for novice users and useful for advanced users in carrying out some initial analysis or handling simple situations without having to program anything. Although there are numerous tools available for data analysis, in the following we discuss spreadsheets and SPSS.
Spreadsheets
Spreadsheets are the most widely used software to handle structured data arranged into rows and columns. Amongst the most used spreadsheet software are Microsoft Excel and Google Sheets. Excel is a proprietary package, whereas Google Sheets is free and also available online. Formulas, custom functions, data handling, and basic plotting are some of its prominent and widely used features.
SPSS
Statistical Product and Service Solutions (SPSS) is a proprietary solution by IBM for advanced statistical analysis. It has been widely used in the social science domains. Market researchers, healthcare and survey companies, educational institutions, and various government-sector setups find it very useful.
The noticeable features of SPSS include descriptive statistical analysis, statistical tests,
simple predictive modelling, text analysis, and visualizations.
Tools for Predictive Modelling
Although the libraries for machine learning are well implemented and available in programming languages like Python and R, there are certain tools in the market that make it convenient for users to do predictive modelling without having to program explicitly. In the following, we briefly introduce Weka and Orange, two of the most popular of these tools.
Weka
Weka is free software developed at the University of Waikato for the purpose of data analysis and predictive modelling. It provides support for both supervised and unsupervised machine learning models. Data in Weka can be read from files of different formats, from the web via URL, as well as from databases. Weka makes it easy to see the behavior of various machine learning models on the data at hand.
Orange
Orange is cross-platform free software for data analysis and machine learning. Like Weka, it implements all basic machine learning models. In addition to the features provided by Weka, Orange is Python-based and provides support for plugins. It also implements support for visual programming through interactive workflows. Both Weka and Orange are useful software for testing machine learning models with ease. The support is adequate and learning is rather easy. In addition, there are a number of other tools, such as RapidMiner, that offer similar facilities.
1.4 An Example Process Flow
As an example, we consider a data-driven and adaptive solution for QoS in IoT. To the best of our knowledge, there is no existing proposal that uses such a data-driven approach for adaptive QoS in WSN-driven IoT, where the communication parameters are used as the primary source for capturing variations and facilitating adaptation through real-time reconfigurability. For a practical system, data is collected and processed interactively, and real-time decisions are made. However, from a design-in-research perspective, data can be acquired from public repositories or testbeds, or generated using simulations.
1.4.1 Data Sources
In the following, we analyze the potential data sources (e.g., simulators, testbeds, etc.) [11].
Simulations: Historically, it has not always been possible to realize real deployments of desired network technologies and topologies for research purposes. Therefore, several simulators are available and in use to create design scenarios and evaluate the performance of wireless networks. Some popular examples include ns-2, ns-3, OMNeT++, and COOJA [11]. With these, it has been possible to create custom network topologies with desired software and hardware configurations.
Testbeds: The second possibility is to generate the data using a testbed. With the proliferation of wireless communication, IoT is very much a realization on the timeline. This has induced a lot of interest from research, commercial, and governmental organizations to create testbeds [11]. Lately, FED4FIRE+ has federated a large set of testbeds 1, focusing on diverse wireless technologies.
Public datasets: Although there are some wireless datasets related to QoS performance, these do not comprehend the diverse deployment scenarios and application requirements, and they also do not cope with the evolving nature of network designs. Most of the datasets primarily focus on specific settings; for example, a dataset on parameter settings in WSNs [12] is hosted by CRAWDAD 2, a large public repository for wireless data.
Real deployments: Real deployments have their own dynamic and evolving nature. Therefore, it is more desirable and effective to acquire data from these real scenarios. Moreover, the changes, growth, and evolution can only be accommodated through the integration of the prediction system with real deployments. However, gathering data from live sources and carrying out real-time processing remains challenging.
1.4.2 Data Processing and Prediction
The data need to be pre-processed, resulting in feature set(s) and target(s). The features can vary as the target metric and context change. A sample of what the feature vector could look like is shown in Fig. 2. The prediction of QoS metrics involves identifying the correct set of features for a particular metric, and feature selection techniques can help. Big data platforms and services like Hadoop, Kafka, MQTT, and Spark Streaming play a vital role in putting the system components together. FED4FIRE+ facilitates testbeds like w-iLab.t 3 for experiments involving WSNs, WiFi, LTE/5G, and cognitive radio. A big data platform like Tengu 4 is available with adequate facilities for hosting the big data and provides streaming services for live interactions. In order to glue the testbeds and the big data platform together, seamless connectivity is delivered by the Virtual Wall 5.
The next important thing is choosing suitable machine learning model(s) that can meet the desired performance requirements at an affordable cost. In the past, we have used deep neural networks for predicting QoS in WSNs [13], [14], [15]. The results have been promising and encouraging. A brief comparison with conventional regression models (e.g., linear regression and decision-tree-based regression) reveals that these simpler models carry the potential to yield effective predictions as well.
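The following sketch mirrors that comparison on synthetic data; the communication parameters and QoS target are fabricated stand-ins, so the numbers only illustrate the procedure, not our published results.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for communication parameters (features) and a QoS
# metric such as delay (target); real data would come from testbeds,
# simulations, or public repositories.
X = np.random.rand(500, 3)
y = 5 * X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(500)

# Cross-validated R^2 for two simple regression models.
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=5)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())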
1.4.3 Recommendations
Using real-time analytics for configuring the nodes and the network can serve the real advantage of a data-driven network.
Finally, recommendations are formulated, consisting of befitting values for critical features considering the QoS target(s), and these are disseminated to the interested nodes in the network. For this purpose, the values need to be maintained in the form of target : value and feature : value pairs. This way, for each threshold of a prediction target, a recommended set of values for the critical set of features can be provided proactively or reactively to the sensor nodes. The sole task that the sensor nodes need to perform is to choose the right set of feature values for their desired QoS target.
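A hypothetical sketch of how such target : value and feature : value pairs could be organized and looked up on a node follows; all names, thresholds, and values are assumptions.

# Hypothetical mapping: for each threshold of a prediction target, a
# recommended set of values for the critical features.
recommendations = {
    ("delay_ms", "<= 20"): {"tx_power_dbm": 0, "payload_bytes": 64},
    ("delay_ms", "<= 50"): {"tx_power_dbm": -5, "payload_bytes": 128},
}

def recommend(target, threshold):
    # A sensor node looks up the feature values matching its QoS target.
    return recommendations.get((target, threshold), {})

print(recommend("delay_ms", "<= 20"))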
1.4.4 Milestones
The milestones to achieve the proposed data-driven framework are shown in Fig. 3. The first four tasks (identification of facilities for experimentation, design and execution of experiments, statistical analysis, and prediction of QoS metrics) have already been completed for WSNs. We have published some promising results in [13], [14], [15]. Currently, we are designing experiments for WiFi and LTE. The online evaluations of predictions are also being carried out using FED4FIRE+ facilities in parallel. After completing these ongoing tasks, the framework can be taken toward deployment.
1.4.5 Deployment
References