
Chapter 9: Programming Languages, Tools, and Techniques

Author(s):
M. Ateeq (The Islamia University of Bahawalpur, [email protected])
*M.K. Afzal (COMSATS University Islamabad, Wah Campus, [email protected])

1.1 Introduction [1-1.5 pages]

With the advent of data-driven techniques, it has become essential to support their practical application. Because systems are largely controlled through software, programming languages are the primary vehicle for providing the constructs and toolkits needed to implement data-driven models. In addition, custom tools that automate well-known processes and abstract the implementation away from users are highly productive: they help novice users by easing implementation, and they save time for advanced users by allowing more focus on the problem at hand.

An important aspect of realizing a data-driven network is to devise practical frameworks and process flows that can be followed in order to benefit from this promising artefact. In theory, methods for handling and learning from data have been proposed and explored at length, and these studies and findings greatly benefit proponents of data-driven techniques in the communication discipline. The methods include procedures such as cleaning and completing data, analysing and visualizing data, identifying and posing the right questions, and learning from data.

This chapter is divided into five sections. In Section 2 we survey popular programming languages such as Python and R along with their relevant libraries (e.g., Pandas, NumPy, Matplotlib, scikit-learn, data.table, dplyr, and ggplot2). Section 3 covers useful tools such as Weka, Orange, and RapidMiner. The flow of a generic example process is presented in Section 4. Conclusions and challenges are listed in Section 5.

1.2 Programming Languages [2-3 pages]


Data science finds growing support in programming languages, and Python and R are considered the most prominent examples in this context. In the following, we describe both languages, covering their relevant features and libraries.

1.2.1 Python

Python, with its general-purpose design, ease of learning, and wide applicability, is considered the most obvious choice when applying data-driven techniques and models in any domain. Its pseudocode-like syntax makes the language easy to learn and lets developers focus on the problem rather than on the language itself.

Data-driven techniques require support for descriptive and inferential data analysis, data visualization, data cleaning, data transformation, statistical measures, and machine learning, among others. Python, with its rich ecosystem of libraries and an open-source community process, serves these needs adequately. In the following, we give a brief overview of the libraries most relevant to data science.

NumPy

NumPy is short for Numerical Python and provides the foundation for numerical operations. It is among the most actively maintained and contributed-to Python libraries. NumPy implements multi-dimensional arrays that are both expressive and fast. To meet speed requirements, it provides a wide range of built-in functions that are optimized to work well with NumPy arrays.

Some of the prominent features of NumPy are listed here:

 Provides fast one- and multi-dimensional array objects
 Facilitates interfacing with code from other languages like C/C++ and Fortran
 Supports vectorization for fast numerical computations
 Provides a solid foundation for other libraries like Pandas, scikit-learn, and SciPy
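As a minimal sketch of the vectorization point (assuming only that NumPy is installed), the following compares a vectorized computation with an equivalent pure-Python loop:

import math
import numpy as np

# A one-dimensional array of one million samples
x = np.linspace(0.0, 10.0, 1_000_000)

# Vectorized computation: applied to the whole array at once,
# executed in optimized compiled code under the hood
y_vectorized = np.sin(x) * np.exp(-x / 5.0)

# Equivalent (much slower) pure-Python loop for comparison
y_loop = [math.sin(v) * math.exp(-v / 5.0) for v in x]

# Multi-dimensional arrays and aggregations are equally concise
m = x.reshape(1000, 1000)
print(m.mean(axis=0).shape)   # column means -> (1000,)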

Pandas
Pandas is generally the first library that a data scientist learns alongside NumPy and Matplotlib, and it is among the most actively maintained Python libraries. It is used extensively for the preparatory and initial analytical steps involved in data cleaning and analysis, and it also supports some basic visualizations. To support a wide range of operations on large datasets, Pandas uses NumPy at the backend and is usually very fast compared to the conventional programming constructs provided by the Python language.

Some of the prominent features of Pandas are listed here:

 Complete documentation that explains the whole library with examples
 An expressive set of functions and features that enable adequate data handling
 Provides an API for custom development and contribution to the library
 Implements complex and demanding data operations with a high level of abstraction
 Supports operations like data wrangling and data cleaning adequately
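A minimal data-cleaning sketch is shown below; the file name and column names are hypothetical, and only standard Pandas calls are used:

import pandas as pd

# Load a (hypothetical) CSV file of network performance logs
df = pd.read_csv("traffic_log.csv")

# Inspect structure and summary statistics
print(df.info())
print(df.describe())

# Data cleaning: drop duplicate rows, fill missing latency
# values with the column median
df = df.drop_duplicates()
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())

# Simple wrangling: average latency per node
per_node = df.groupby("node_id")["latency_ms"].mean()
print(per_node.head())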

Matplotlib

Matplotlib is the primary visualization and plotting library for Python. Owing to its wide support and contributions, Matplotlib is competitive with the plotting facilities of any other platform, programming language, or tool. Its object-oriented API ensures that visualizations created with Matplotlib can be integrated into other applications with ease.

Some of the prominent features of Matplotlib are listed here:

 An open-source and free alternative to MATLAB with adequate features and facilities
 Implements diverse output types that are usable across different platforms without any special assistance
 Efficient in its use of system resources, so plots from big datasets can be created with sufficient ease
 Integrates necessary statistical analyses and measures like correlations and confidence intervals
 Serves as a base for more sophisticated libraries like Seaborn
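The following sketch uses only the standard object-oriented API to plot a hypothetical latency series and save it to a portable output format:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurement series
t = np.arange(0, 60)                       # time in seconds
latency = 20 + 5 * np.random.randn(60)     # latency in ms

# Object-oriented API: create an explicit figure and axes
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(t, latency, label="node 1")
ax.set_xlabel("time (s)")
ax.set_ylabel("latency (ms)")
ax.legend()

# Diverse output types: the same figure can be saved as PNG, PDF, SVG, ...
fig.savefig("latency.png", dpi=150)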

Scikit-learn

Scikit-learn is Python's most prominent machine learning library. It implements almost all common machine learning models and related concepts, works well with Pandas, NumPy, and other relevant libraries, and supports both supervised and unsupervised learning.

Some of the prominent features of Scikit-learn are listed here:

 Supports pre-processing, including transformation, normalization, and encoding
 Provides implementations of classification and regression models based on linear, non-linear, gradient descent, tree-based, Bayesian, and ensemble methods
 Supports clustering for unsupervised learning, implementing methods such as k-means, affinity propagation, spectral, and hierarchical clustering, among others
 Implements dimensionality reduction, covering principal component analysis, independent component analysis, and latent Dirichlet allocation
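A minimal end-to-end classification sketch follows; it uses the iris sample dataset bundled with scikit-learn, chaining pre-processing and an ensemble classifier in one pipeline:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a bundled sample dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-processing (normalization) plus an ensemble classifier
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=100))
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))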

Some other libraries that turn out to be important at various stages of data-driven systems include SciPy (scientific Python); TensorFlow, Keras, and PyTorch (for deep learning); and Scrapy and BeautifulSoup (for scraping and dealing with semi-structured or unstructured data).

1.2.2 R

R is a programming language designed for statistical computing. Like Python, it is interpreted and open source. Unlike Python, however, R is more confined to and specialized for data processing and is rarely used for general-purpose computing. The language supports data types like lists, vectors, arrays, and data frames that make it very convenient to process data.

As expected, R has a wide range of packages for data manipulation and processing. In the following, we briefly explain some of the prominent packages and libraries by category.

Data Loading

R offers wide support for loading data from diverse formats and sources. Plain data files can be read without any additional package. To read data from databases, packages like DBI and ODBC are available, while XLConnect and xlsx read and write Excel files. R also interfaces with software like SPSS and Stata through packages like foreign and haven.

Data Manipulation and Visualization

A collection of relevant packages is available in the form of the tidyverse. The primary package for data manipulation is dplyr; it provides data manipulation similar to Pandas and is quite fast. In addition, there are useful data manipulation packages like tidyr, stringr, and lubridate.

R also has a number of packages for data visualization. ggplot2 is the main R library for creating feature-rich custom visuals. In addition, ggvis, rgl, htmlwidgets, and googleVis round out the list of relevant visualization tools in R.

Data Modelling

Implementations of statistical as well as machine learning models are available in R through a number of packages. tidymodels is a comprehensive collection of packages that provides popular machine learning models to R programmers. In addition, there are many specialized packages for various machine learning and statistical models, including car (ANOVA functions), mgcv (additive models), multcomp, vcd, glmnet, and caret.

Other Packages

In addition to the packages discussed above, R supports data scientists through additional packages for various tasks. For example, it provides shiny, xtable, and Markdown tooling for reporting results. sp, maptools, maps, and ggmap are popular when dealing with spatial data, while zoo, xts, and quantmod are useful for analyzing time series and financial data. R also provides facilities for writing your own packages through devtools, testthat, and roxygen2.

1.3 Tools [3-4 pages]

In this section we list a few important tools useful in adopting and implementing data-driven methods. We discuss three categories of tools: the first contains tools capable of handling large-scale data, the second contains tools useful for data analysis, and the third covers tools for predictive modelling and machine learning.

1.3.1 Big Data

Intuitively, data-driven techniques are driven by some kind of data at the backend. Data can be structured, semi-structured, or even unstructured; it can be numbers, text, images, or any combination of these. When it comes to communication and networks, the data can be performance statistics coming from traffic logs or some kind of signals. In general, statistics measured from traffic logs present data in a structured way, and even data representing signals is often expressed as numerical quantities. Therefore, communication and networks typically deal with structured data.

Normal-sized data can be represented using text files, where CSV is a popular format. However, larger datasets require better support to store, retrieve, and process the data at scale. In the following, we briefly discuss some important and useful tools for big data handling.
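To make the scale limit concrete: a CSV file too large for memory can still be processed incrementally in Pandas (a minimal sketch with a hypothetical file and column names), although dedicated big data tools become preferable beyond this point:

import pandas as pd

total, count = 0.0, 0

# Stream a large (hypothetical) CSV in 100,000-row chunks instead of
# loading it all into memory at once
for chunk in pd.read_csv("huge_traffic_log.csv", chunksize=100_000):
    total += chunk["latency_ms"].sum()
    count += len(chunk)

print("mean latency:", total / count)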

Hadoop

Hadoop offers a solution to big data problems using a network of computers. It is a collection of open-source libraries with the Hadoop Distributed File System (HDFS) at its core for data storage, and it uses MapReduce as its programming model. It uses YARN to manage computing resources in the cluster and to schedule applications. A major limitation, however, is that MapReduce runs one job at a time in batch-processing mode. This limits the usefulness of Hadoop as a real-time analysis framework and instead makes it a prominent choice for data warehousing.
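The MapReduce model itself is easy to illustrate. The sketch below is plain Python, not Hadoop; it only mimics the map, shuffle, and reduce phases that Hadoop distributes across a cluster:

from collections import defaultdict

documents = ["spark streams data", "hadoop stores data", "data drives networks"]

# Map phase: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # 'data' maps to 3, all other words to 1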

Spark

Spark is popular for real-time data stream processing in the context of big data systems. Although the workflows used in Spark are based on Hadoop MapReduce, they are more efficient because Spark provides its own streaming API rather than relying on Hadoop YARN. This makes Spark more suitable for real-time stream processing than Hadoop, which has turned out to be a tool for storage and batch processing instead.

Spark relies on data stored in Hadoop and does not implement its own storage system. For development, Spark uses Scala tuples.
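A minimal PySpark sketch (assuming a local Spark installation and a hypothetical CSV file) shows the typical DataFrame workflow:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("qos-demo").getOrCreate()

# Load a hypothetical CSV of traffic statistics into a distributed DataFrame
df = spark.read.csv("traffic_log.csv", header=True, inferSchema=True)

# Transformations are lazy; this plan can run in parallel across a cluster
summary = df.groupBy("node_id").avg("latency_ms")

# Actions like show() trigger execution
summary.show()
spark.stop()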

Cloudera

Cloudera was developed as an enterprise-level deployment solution based on Hadoop. It can interact with and access data from heterogeneous environments, offering real-time analysis, and it can interact with different clouds, thus enabling truly enterprise-wide solutions. In addition to data analysis, it also provides the capability to train and deploy data models. Cloudera is versatile in that it can be deployed across multiple clouds as well as on site, and it is a popular choice for implementing business intelligence solutions.

Cloudera supports multiple languages for application development, including C/C++, Python, Scala, Go, and Java.

MongoDB

MongoDB is a free solution for implementing databases that overcomes the limitations of relational databases and is based on the NoSQL design scheme. MongoDB can handle large amounts of data that go beyond the traditional structure followed by relational databases. MongoDB Atlas enables developers to manage databases across different cloud providers, including Azure, AWS, and Google Cloud, and the database supports more than ten languages.

The game-changing feature of MongoDB is its support for real-time analytics based on ad-hoc queries. Moreover, its indexing and data replication features demonstrate a great performance advantage, and its load balancing also outperforms many competing solutions.
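A brief pymongo sketch (assuming a local MongoDB instance; the database, collection, and field names are hypothetical) illustrates schema-less inserts and ad-hoc queries:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
stats = client["netdata"]["link_stats"]          # database and collection

# Documents need no fixed schema
stats.insert_one({"node_id": 7, "latency_ms": 34.2, "hops": 3})
stats.insert_one({"node_id": 9, "latency_ms": 120.5})   # 'hops' simply absent

# Index a frequently queried field for performance
stats.create_index("latency_ms")

# Ad-hoc query: all links with latency above 100 ms
for doc in stats.find({"latency_ms": {"$gt": 100}}):
    print(doc)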

1.3.2 Data Analysis

Many tools enable users to perform both basic and advanced statistical analysis of data without having to program anything. In general, these tools have built-in methods that carry out the implemented tasks by following a well-defined sequence of steps. This makes them easy for novice users, and useful for advanced users carrying out initial analyses or handling simple situations without having to program. Although numerous tools are available for data analysis, in the following we discuss two broad ones.

Spreadsheets

Spreadsheets are the most widely used software for handling structured data arranged into rows and columns. Among the most used spreadsheet packages are Microsoft Excel and Google Sheets. Excel is a proprietary package, whereas Google Sheets is free and also available through Google Cloud as software-as-a-service. Numerical operations, statistical formulas, custom functions, data handling, and basic plotting are some of the most prominent and most used features of spreadsheets.

SPSS

Statistical Product and Service Solutions (SPSS) is a proprietary solution by IBM for advanced statistical analysis. It has been widely used in the social sciences, and market researchers, healthcare and survey companies, educators, and various government-sector organizations find it very useful.

The noticeable features of SPSS include descriptive statistical analysis, statistical tests, simple predictive modelling, text analysis, and visualizations.

1.3.3 Machine Learning

Although machine learning libraries are well implemented and available in programming languages like Python and R, certain tools on the market make it convenient for users to do predictive modelling without having to program explicitly. In the following, we briefly introduce Weka and Orange, two of the most widely used tools for machine learning and data mining.

Weka

Weka is free software developed at the University of Waikato for data analysis and predictive modelling. It supports both supervised and unsupervised machine learning, and its features include pre-processing, classification, regression, and clustering.

Data in Weka can be read from files of different formats, from the web via a URL, and from databases. Weka makes it easy to observe the behavior of various machine learning models on datasets of interest without explicit programming knowledge.

Orange

Orange is a cross-platform free software package for data analysis and machine learning. Like Weka, it implements all the basic machine learning models. Beyond the features provided by Weka, Orange is Python based and provides support for plugins, and it also implements support for text processing and simulations.

Both Weka and Orange are useful for testing machine learning models with ease; their support is adequate and they are rather easy to learn. In addition, there are a number of other software packages like RapidMiner, KNIME, Neural Designer, and KEEL.

1.4 Techniques [4-6 pages]

1.4.1 Data Collection


As in any data-driven system, data is of primary importance for achieving an intelligent and adaptive solution for QoS in the IoT. To the best of our knowledge, there is no existing proposal that uses such a data-driven approach for adaptive QoS in WSN-driven IoT, where the communication parameters are used as the primary source for capturing variations and facilitating adaptation through real-time reconfigurability. In a practical system, data is collected and processed interactively, and real-time decisions are made. From a design-in-research perspective, however, data can be acquired from public repositories or testbeds, or generated using simulations.

In the following, we analyze the potential data sources (e.g., simulators, testbeds, etc.) [11] and explain their benefits and drawbacks.

Simulations: Historically, it has rarely been possible to realize real deployments of desired network technologies and topologies for research purposes. Therefore, several simulators are available and in use to create design scenarios and evaluate the performance of wireless networks. Some popular examples include ns-2, ns-3, OMNeT++, and COOJA [11]. While simulators make it possible to create custom network topologies with the desired software and hardware characteristics and configurations, they fail to reproduce the behavior of real deployments because even the stochastic events are ultimately deterministic.

Testbeds: The second possibility is to generate the data using a testbed. With the proliferation of wireless communication, the IoT is well on its way to realization. This has drawn considerable interest from research, commercial, and governmental organizations in creating suitable testbeds to foster research in the domain of wireless communications. Some prominent examples of accessible testbeds include MoteLab, TWIST, and Indriya [11]. Lately, FED4FIRE+ has federated a large set of testbeds focusing on diverse networking and cloud facilities.


Public Sources: Public data sources are of paramount importance for research driven by data. Although some wireless datasets related to QoS performance exist, they do not comprehend the diverse deployment scenarios and application requirements, nor do they cope with the evolving nature of network designs. Most datasets focus primarily on the sensed information rather than on communication performance. A prominent public dataset providing comprehensive measurements based on a large combination of diverse parameter settings in WSNs [12] is hosted by CRAWDAD, a large public repository for networking-related datasets.

Real-World Deployments: Real-time decision making is critical, particularly for time-critical, safety-related scenarios. The purpose of adopting data-driven QoS prediction is to facilitate real-world deployments of the communication systems forming the IoT. These real-world deployments have their own dynamic and evolving nature, so it is more desirable and effective to acquire data from these real scenarios. Moreover, changes, growth, and evolution can only be accommodated by integrating the prediction system with real deployments. However, gathering data from live sources and carrying out real-time QoS analytics still requires considerable attention.

1.4.2 Data Analysis and Machine Learning

The data needs to be pre-processed into feature set(s) and target(s). The features can vary as the target metric and context change; a sample of what the feature vector could look like is shown in Fig. 2. Predicting QoS metrics involves identifying the correct set of features for a particular metric, and feature selection techniques can help. Big data platforms and services like Hadoop, Kafka, MQTT, and Spark Streaming play a vital role in putting the system components together. FED4FIRE+ facilitates testbeds like w-iLab.t for experiments involving WSNs, WiFi, LTE/5G, and cognitive radio. A big data platform like Tengu is available with adequate facilities for hosting big data and provides streaming services for live interaction. To glue the testbeds and the big data platform together, seamless connectivity is delivered by the Virtual Wall. The next important step is choosing suitable machine learning model(s) that can meet the desired performance requirements at an affordable cost. In the past, we have used deep neural networks for predicting QoS in WSNs [13], [14], [15], with promising and encouraging results. A brief comparison with conventional regression models (e.g., linear regression and decision-tree-based regression) reveals that these simpler models carry the potential to yield effective predictions in less complex scenarios [13].

Using real-time analytics to configure the nodes and the network can deliver the real advantage of adaptivity and self-reconfigurability.
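As a hedged illustration of this step (the feature and target names are hypothetical, and the models are the conventional ones mentioned above), feature selection and QoS regression might be sketched with scikit-learn as follows:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical pre-processed dataset: communication parameters as
# features, a QoS metric (packet reception ratio, PRR) as the target
df = pd.read_csv("wsn_measurements.csv")
X = df[["tx_power", "packet_interval", "payload_size", "channel", "queue_len"]]
y = df["prr"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the 3 features most correlated with the target, then fit a
# decision-tree regressor on the reduced feature set
model = make_pipeline(SelectKBest(f_regression, k=3),
                      DecisionTreeRegressor(max_depth=5))
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))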

1.4.3 Dissemination of Recommendation

Finally, recommendations consisting of befitting values for the critical features, given the QoS target(s), are formulated and disseminated to the interested nodes in the network. For this purpose, the values need to be maintained in the form of target : value and feature : value pairs. This way, for each threshold of a prediction target, a recommended set of values for the critical features can be provided proactively or reactively to the sensor nodes. The sole task the sensor nodes need to perform is to choose the right set of values for a given performance goal, which is computationally very simple. An example of such target : value and feature : value pairs is sketched below.
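A minimal sketch (all names, values, and thresholds are hypothetical) of how such a recommendation table could be represented and queried on a node:

# Recommendation table maintained by the prediction system: for each
# threshold of the prediction target (here PRR), the recommended
# values for the critical features
recommendations = {
    ("prr", 0.90): {"tx_power": 7, "packet_interval": 250},
    ("prr", 0.95): {"tx_power": 15, "packet_interval": 500},
    ("prr", 0.99): {"tx_power": 31, "packet_interval": 1000},
}

def configure(target, goal):
    """Pick the lowest-threshold configuration meeting the performance goal."""
    feasible = [(thr, cfg) for (t, thr), cfg in recommendations.items()
                if t == target and thr >= goal]
    return min(feasible)[1] if feasible else None

# A node requiring at least 95% packet reception ratio
print(configure("prr", 0.95))   # {'tx_power': 15, 'packet_interval': 500}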

Achieving this dissemination of recommendations requires robust, efficient, state-preserving, and compatible reconfiguration (e.g., via firmware updates).

1.4.4 Milestones

The milestones in achieving the proposed data-driven framework are shown in Fig. 3. The first four tasks (identification of facilities for experimentation, design and execution of experiments, statistical analysis, and prediction of QoS metrics) have already been completed for WSNs, and we have published some promising results in [13], [14], [15]. Currently, we are designing experiments for WiFi and LTE, and online evaluations of the predictions are being carried out in parallel using FED4FIRE+ facilities. After these ongoing steps are complete, the proposed framework will be evaluated in real environments. An elementary case study is presented below.

1.4.5 Deployment

1.5 Challenges [2 pages]

References [2 pages]
