
Chapter 9: Programming Languages, Tools, and Techniques

Author(s):
M. Ateeq (The Islamia University of Bahawalpur, [email protected])
*M.K. Afzal (COMSATS University Islamabad, Wah Campus, [email protected])

1.1 Introduction [1-1.5 pages]

With the advent of data-driven techniques, it has become essential to support their practical application. Because systems are largely controlled through software, programming languages are the primary vehicle for providing the constructs and toolkits needed to implement data-driven models. In addition, custom tools that automate well-known processes and abstract the implementation away from users are highly productive: they help novice users by easing implementation, and they save time for advanced users by allowing more focus on the problem at hand.

An important aspect of realizing a data-driven network is to devise practical frameworks and process flows that can be followed in order to benefit from this promising artefact. In theory, methods for handling and learning from data have been proposed and explored at length, and these studies and findings greatly benefit proponents of data-driven techniques in the communication discipline. The methods include procedures such as cleaning and completing data, analysing and visualizing data, identifying and posing the right questions, and learning from data.

This chapter is divided into five sections. In Section 2 we survey popular programming languages such as Python and R along with their relevant libraries (e.g., Pandas, NumPy, Matplotlib, scikit-learn, data.table, dplyr, and ggplot2). Section 3 covers useful tools such as Weka, Orange, and RapidMiner. The flow of a generic example process is presented in Section 4. Conclusions and challenges are listed in Section 5.

1.2 Programming Languages [2-3 pages]


Data science finds growing support in programming languages, and Python and R are considered the most prominent examples in this context. In the following, we describe both languages, covering their relevant features and libraries.

1.2.1 Python

Python, with its general-purpose design, ease of learning, and wide applicability, is considered the most obvious choice when applying data-driven techniques and models in any domain. Its pseudocode-like syntax makes the language easy to learn and lets developers focus on the problem rather than on the language itself.

Data-driven techniques require support for descriptive and inferential data analysis, data visualization, data cleaning, data transformation, statistical measures, and machine learning, among others. Python, with its rich ecosystem of libraries and an open-source community process, serves these needs adequately. In the following, we give a brief overview of the libraries most relevant to data science.

NumPy

NumPy is short for Numerical Python and provides the foundation for numerical operations. It is among the most actively maintained and contributed-to Python libraries. NumPy implements multi-dimensional arrays that are both expressive and fast. To meet speed requirements, it provides a wide range of built-in functions that are optimized to work well with NumPy arrays.

Some of the prominent features of NumPy are listed here:

 Provides fast one- and multi-dimensional array objects
 Facilitates interfacing with code from other languages like C/C++ and Fortran
 Supports vectorization for fast numerical computations
 Provides a solid foundation for other libraries like Pandas, scikit-learn, and SciPy
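As a minimal sketch of the vectorization point (assuming only that NumPy is installed), the following compares a vectorized computation with an equivalent pure-Python loop:

import math
import numpy as np

# A one-dimensional array of one million samples
x = np.linspace(0.0, 10.0, 1_000_000)

# Vectorized computation: applied to the whole array at once,
# executed in optimized compiled code under the hood
y_vectorized = np.sin(x) * np.exp(-x / 5.0)

# Equivalent (much slower) pure-Python loop for comparison
y_loop = [math.sin(v) * math.exp(-v / 5.0) for v in x]

# Multi-dimensional arrays and aggregations are equally concise
m = x.reshape(1000, 1000)
print(m.mean(axis=0).shape)   # column means -> (1000,)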

Pandas
Pandas is generally the first library that a data scientist learns alongside NumPy and Matplotlib, and it is among the most actively maintained Python libraries. It is used extensively for the preparatory and initial analytical steps involved in data cleaning and analysis, and it also supports some basic visualizations. To support a wide range of operations on large datasets, Pandas uses NumPy at the backend and is usually very fast compared to the conventional programming constructs provided by the Python language.

Some of the prominent features of Pandas are listed here:

 Complete documentation that explains the whole library with examples
 An expressive set of functions and features that enable adequate data handling
 Provides an API for custom development and contribution to the library
 Implements complex and demanding data operations with a high level of abstraction
 Supports operations like data wrangling and data cleaning adequately
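A minimal data-cleaning sketch is shown below; the file name and column names are hypothetical, and only standard Pandas calls are used:

import pandas as pd

# Load a (hypothetical) CSV file of network performance logs
df = pd.read_csv("traffic_log.csv")

# Inspect structure and summary statistics
print(df.info())
print(df.describe())

# Data cleaning: drop duplicate rows, fill missing latency
# values with the column median
df = df.drop_duplicates()
df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())

# Simple wrangling: average latency per node
per_node = df.groupby("node_id")["latency_ms"].mean()
print(per_node.head())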

Matplotlib

Matplotlib is the primary visualization and plotting library for Python. Owing to its wide support and contributions, Matplotlib is competitive with the plotting facilities of any other platform, programming language, or tool. Its object-oriented API ensures that visualizations created with Matplotlib can be integrated into other applications with ease.

Some of the prominent features of Matplotlib are listed here:

 An open-source and free alternative to MATLAB with adequate features and facilities
 Implements diverse output types that are usable across different platforms without any special assistance
 Efficient in its use of system resources, so plots from big datasets can be created with sufficient ease
 Integrates necessary statistical analyses and measures like correlations and confidence intervals
 Serves as a base for more sophisticated libraries like Seaborn
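The following sketch uses only the standard object-oriented API to plot a hypothetical latency series and save it to a portable output format:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurement series
t = np.arange(0, 60)                       # time in seconds
latency = 20 + 5 * np.random.randn(60)     # latency in ms

# Object-oriented API: create an explicit figure and axes
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(t, latency, label="node 1")
ax.set_xlabel("time (s)")
ax.set_ylabel("latency (ms)")
ax.legend()

# Diverse output types: the same figure can be saved as PNG, PDF, SVG, ...
fig.savefig("latency.png", dpi=150)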

Scikit-learn

Scikit-learn is Python's most prominent machine learning library. It implements almost all common machine learning models and related concepts, works well with Pandas, NumPy, and other relevant libraries, and supports both supervised and unsupervised learning.

Some of the prominent features of Scikit-learn are listed here:

 Supports pre-processing, including transformation, normalization, and encoding
 Provides implementations of classification and regression models based on linear, non-linear, gradient descent, tree-based, Bayesian, and ensemble methods
 Supports clustering for unsupervised learning, implementing methods such as k-means, affinity propagation, spectral, and hierarchical clustering, among others
 Implements dimensionality reduction, covering principal component analysis, independent component analysis, and latent Dirichlet allocation
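A minimal end-to-end classification sketch follows; it uses the iris sample dataset bundled with scikit-learn, chaining pre-processing and an ensemble classifier in one pipeline:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a bundled sample dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-processing (normalization) plus an ensemble classifier
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=100))
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))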

Some other libraries that turn out to be important at various stages of data-driven systems include SciPy (scientific Python); TensorFlow, Keras, and PyTorch (for deep learning); and Scrapy and BeautifulSoup (for scraping and dealing with semi-structured or unstructured data).

1.2.2 R

R is a programming language designed for statistical computing. Like Python, it is interpreted and open source. Unlike Python, however, R is more confined to and specialized for data processing and is rarely used for general-purpose computing. The language supports data types like lists, vectors, arrays, and data frames that make it very convenient to process data.

As expected, R has a wide range of packages for data manipulation and processing. In the following, we briefly explain some of the prominent packages and libraries by category.

Data Loading

R offers wide support for loading data from diverse formats and sources. Plain data files can be read without any additional package. To read data from databases, packages like DBI and ODBC are available, while XLConnect and xlsx read and write Excel files. R also interfaces with software like SPSS and Stata through packages like foreign and haven.

Data Manipulation and Visualization

A collection of relevant packages is available in the form of the tidyverse. The primary package for data manipulation is dplyr; it provides data manipulation similar to Pandas and is quite fast. In addition, there are useful data manipulation packages like tidyr, stringr, and lubridate.

R also has a number of packages for data visualization. ggplot2 is the main R library for creating feature-rich custom visuals. In addition, ggvis, rgl, htmlwidgets, and googleVis round out the list of relevant visualization tools in R.

Data Modelling

Implementations of statistical as well as machine learning models are available in R through a number of packages. tidymodels is a comprehensive collection of packages that provides popular machine learning models to R programmers. In addition, there are many specialized packages for various machine learning and statistical models, including car (ANOVA functions), mgcv (additive models), multcomp, vcd, glmnet, and caret.

Other Packages

In addition to the packages discussed above, R supports data scientists through additional packages for various tasks. For example, it provides shiny, xtable, and Markdown tooling for reporting results. sp, maptools, maps, and ggmap are popular when dealing with spatial data, while zoo, xts, and quantmod are useful for analyzing time series and financial data. R also provides facilities for writing your own packages through devtools, testthat, and roxygen2.

1.3 Tools [3-4 pages]

In this section we list a few important tools useful in adopting and implementing data-driven methods. We discuss three categories of tools: the first contains tools capable of handling large-scale data, the second contains tools useful for data analysis, and the third covers tools for predictive modelling and machine learning.

1.3.1 Big Data

Intuitively, data-driven techniques are driven by some kind of data at the backend. Data can be structured, semi-structured, or even unstructured; it can be numbers, text, images, or any combination of these. When it comes to communication and networks, the data can be performance statistics coming from traffic logs or some kind of signals. In general, statistics measured from traffic logs present data in a structured way, and even data representing signals is often expressed as numerical quantities. Therefore, communication and networks typically deal with structured data.

Normal-sized data can be represented using text files, where CSV is a popular format. However, larger datasets require better support to store, retrieve, and process the data at scale. In the following, we briefly discuss some important and useful tools for big data handling.
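To make the scale limit concrete: a CSV file too large for memory can still be processed incrementally in Pandas (a minimal sketch with a hypothetical file and column names), although dedicated big data tools become preferable beyond this point:

import pandas as pd

total, count = 0.0, 0

# Stream a large (hypothetical) CSV in 100,000-row chunks instead of
# loading it all into memory at once
for chunk in pd.read_csv("huge_traffic_log.csv", chunksize=100_000):
    total += chunk["latency_ms"].sum()
    count += len(chunk)

print("mean latency:", total / count)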

Hadoop

Hadoop offers a solution to big data problems using a network of computers. It is a collection of open-source libraries with the Hadoop Distributed File System (HDFS) at its core for data storage, and it uses MapReduce as its programming model. It uses YARN to manage computing resources in the cluster and to schedule applications. A major limitation, however, is that MapReduce runs one job at a time in batch-processing mode. This limits the usefulness of Hadoop as a real-time analysis framework and instead makes it a prominent choice for data warehousing.
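The MapReduce model itself is easy to illustrate. The sketch below is plain Python, not Hadoop; it only mimics the map, shuffle, and reduce phases that Hadoop distributes across a cluster:

from collections import defaultdict

documents = ["spark streams data", "hadoop stores data", "data drives networks"]

# Map phase: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # 'data' maps to 3, all other words to 1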

Spark

Spark is popular for real-time data stream processing in the context of big data systems. Although the workflows used in Spark are based on Hadoop MapReduce, they are more efficient because Spark provides its own streaming API rather than relying on Hadoop YARN. This makes Spark more suitable for real-time stream processing than Hadoop, which has turned out to be a tool for storage and batch processing instead.

Spark relies on data stored in Hadoop and does not implement its own storage system. For development, Spark uses Scala tuples.
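A minimal PySpark sketch (assuming a local Spark installation and a hypothetical CSV file) shows the typical DataFrame workflow:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("qos-demo").getOrCreate()

# Load a hypothetical CSV of traffic statistics into a distributed DataFrame
df = spark.read.csv("traffic_log.csv", header=True, inferSchema=True)

# Transformations are lazy; this plan can run in parallel across a cluster
summary = df.groupBy("node_id").avg("latency_ms")

# Actions like show() trigger execution
summary.show()
spark.stop()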

Cloudera

Cloudera was developed as an enterprise-level deployment solution based on Hadoop. It can interact with and access data from heterogeneous environments, offering real-time analysis, and it can interact with different clouds, thus enabling truly enterprise-wide solutions. In addition to data analysis, it also provides the capability to train and deploy data models. Cloudera is versatile in that it can be deployed across multiple clouds as well as on site, and it is a popular choice for implementing business intelligence solutions.

Cloudera supports multiple languages for application development, including C/C++, Python, Scala, Go, and Java.

MongoDB

MongoDB is a free solution for implementing databases that overcomes the limitations of relational databases and is based on the NoSQL design scheme. MongoDB can handle large amounts of data that go beyond the traditional structure followed by relational databases. MongoDB Atlas enables developers to manage databases across different cloud providers, including Azure, AWS, and Google Cloud, and the database supports more than ten languages.

The game-changing feature of MongoDB is its support for real-time analytics based on ad-hoc queries. Moreover, its indexing and data replication features demonstrate a great performance advantage, and its load balancing also outperforms many competing solutions.
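A brief pymongo sketch (assuming a local MongoDB instance; the database, collection, and field names are hypothetical) illustrates schema-less inserts and ad-hoc queries:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
stats = client["netdata"]["link_stats"]          # database and collection

# Documents need no fixed schema
stats.insert_one({"node_id": 7, "latency_ms": 34.2, "hops": 3})
stats.insert_one({"node_id": 9, "latency_ms": 120.5})   # 'hops' simply absent

# Index a frequently queried field for performance
stats.create_index("latency_ms")

# Ad-hoc query: all links with latency above 100 ms
for doc in stats.find({"latency_ms": {"$gt": 100}}):
    print(doc)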

1.3.2 Data Analysis

Many tools enable users to perform both basic and advanced statistical analysis of data without having to program anything. In general, these tools have built-in methods that carry out the implemented tasks by following a well-defined sequence of steps. This makes them easy for novice users, and useful for advanced users carrying out initial analyses or handling simple situations without having to program. Although numerous tools are available for data analysis, in the following we discuss two broad ones.

Spreadsheets

Spreadsheets are the most widely used software for handling structured data arranged into rows and columns. Among the most used spreadsheet packages are Microsoft Excel and Google Sheets. Excel is a proprietary package, whereas Google Sheets is free and also available through Google Cloud as software-as-a-service. Numerical operations, statistical formulas, custom functions, data handling, and basic plotting are some of the most prominent and most used features of spreadsheets.

SPSS

Statistical Product and Service Solutions (SPSS) is a proprietary solution by IBM for advanced statistical analysis. It has been widely used in the social sciences, and market researchers, healthcare and survey companies, educators, and various government-sector organizations find it very useful.

The noticeable features of SPSS include descriptive statistical analysis, statistical tests, simple predictive modelling, text analysis, and visualizations.

1.3.3 Machine Learning

Although machine learning libraries are well implemented and available in programming languages like Python and R, certain tools on the market make it convenient for users to do predictive modelling without having to program explicitly. In the following, we briefly introduce Weka and Orange, two of the most widely used tools for machine learning and data mining.

Weka

Weka is free software developed at the University of Waikato for data analysis and predictive modelling. It supports both supervised and unsupervised machine learning, and its features include pre-processing, classification, regression, and clustering.

Data in Weka can be read from files of different formats, from the web via a URL, and from databases. Weka makes it easy to observe the behavior of various machine learning models on datasets of interest without explicit programming knowledge.

Orange

Orange is a cross-platform free software package for data analysis and machine learning. Like Weka, it implements all the basic machine learning models. Beyond the features provided by Weka, Orange is Python based and provides support for plugins, and it also implements support for text processing and simulations.

Both Weka and Orange are useful for testing machine learning models with ease; their support is adequate and they are rather easy to learn. In addition, there are a number of other software packages like RapidMiner, KNIME, Neural Designer, and KEEL.

1.4 Techniques [4-6 pages]

1.4.1 Data Collection


As in any data-driven system, data is of primary importance for achieving an intelligent and adaptive solution for QoS in the IoT. To the best of our knowledge, there is no existing proposal that uses such a data-driven approach for adaptive QoS in WSN-driven IoT, where the communication parameters are used as the primary source for capturing variations and facilitating adaptation through real-time reconfigurability. In a practical system, data is collected and processed interactively, and real-time decisions are made. From a design-in-research perspective, however, data can be acquired from public repositories or testbeds, or generated using simulations.

In the following, we analyze the potential data sources (e.g., simulators, testbeds, etc.) [11] and explain their benefits and drawbacks.

Simulations: Historically, it has rarely been possible to realize real deployments of desired network technologies and topologies for research purposes. Therefore, several simulators are available and in use to create design scenarios and evaluate the performance of wireless networks. Some popular examples include ns-2, ns-3, OMNeT++, and COOJA [11]. While simulators make it possible to create custom network topologies with the desired software and hardware characteristics and configurations, they fail to reproduce the behavior of real deployments because even the stochastic events are ultimately deterministic.

Testbeds: The second possibility is to generate the data using a testbed. With the proliferation of wireless communication, the IoT is well on its way to realization. This has drawn considerable interest from research, commercial, and governmental organizations in creating suitable testbeds to foster research in the domain of wireless communications. Some prominent examples of accessible testbeds include MoteLab, TWIST, and Indriya [11]. Lately, FED4FIRE+ has federated a large set of testbeds focusing on diverse networking and cloud facilities.


Public Sources: Public data sources are of paramount importance for research driven by data. Although some wireless datasets related to QoS performance exist, they do not comprehend the diverse deployment scenarios and application requirements, nor do they cope with the evolving nature of network designs. Most datasets focus primarily on the sensed information rather than on communication performance. A prominent public dataset providing comprehensive measurements based on a large combination of diverse parameter settings in WSNs [12] is hosted by CRAWDAD, a large public repository for networking-related datasets.

Real-World Deployments: Real-time decision making is critical, particularly for time-critical, safety-related scenarios. The purpose of adopting data-driven QoS prediction is to facilitate real-world deployments of the communication systems forming the IoT. These real-world deployments have their own dynamic and evolving nature, so it is more desirable and effective to acquire data from these real scenarios. Moreover, changes, growth, and evolution can only be accommodated by integrating the prediction system with real deployments. However, gathering data from live sources and carrying out real-time QoS analytics still requires considerable attention.

1.4.2 Data Analysis and Machine Learning

The data needs to be pre-processed into feature set(s) and target(s). The features can vary as the target metric and context change; a sample of what the feature vector could look like is shown in Fig. 2. Predicting QoS metrics involves identifying the correct set of features for a particular metric, and feature selection techniques can help. Big data platforms and services like Hadoop, Kafka, MQTT, and Spark Streaming play a vital role in putting the system components together. FED4FIRE+ facilitates testbeds like w-iLab.t for experiments involving WSNs, WiFi, LTE/5G, and cognitive radio. A big data platform like Tengu is available with adequate facilities for hosting big data and provides streaming services for live interaction. To glue the testbeds and the big data platform together, seamless connectivity is delivered by the Virtual Wall. The next important step is choosing suitable machine learning model(s) that can meet the desired performance requirements at an affordable cost. In the past, we have used deep neural networks for predicting QoS in WSNs [13], [14], [15], with promising and encouraging results. A brief comparison with conventional regression models (e.g., linear regression and decision-tree-based regression) reveals that these simpler models carry the potential to yield effective predictions in less complex scenarios [13].

Using real-time analytics to configure the nodes and the network can deliver the real advantage of adaptivity and self-reconfigurability.
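As a hedged illustration of this step (the feature and target names are hypothetical, and the models are the conventional ones mentioned above), feature selection and QoS regression might be sketched with scikit-learn as follows:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Hypothetical pre-processed dataset: communication parameters as
# features, a QoS metric (packet reception ratio, PRR) as the target
df = pd.read_csv("wsn_measurements.csv")
X = df[["tx_power", "packet_interval", "payload_size", "channel", "queue_len"]]
y = df["prr"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the 3 features most correlated with the target, then fit a
# decision-tree regressor on the reduced feature set
model = make_pipeline(SelectKBest(f_regression, k=3),
                      DecisionTreeRegressor(max_depth=5))
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))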

1.4.3 Dissemination of Recommendation

Finally, recommendations consisting of befitting values for the critical features, given the QoS target(s), are formulated and disseminated to the interested nodes in the network. For this purpose, the values need to be maintained in the form of target : value and feature : value pairs. This way, for each threshold of a prediction target, a recommended set of values for the critical features can be provided proactively or reactively to the sensor nodes. The sole task the sensor nodes need to perform is to choose the right set of values for a given performance goal, which is computationally very simple. An example of such target : value and feature : value pairs is sketched below.
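A minimal sketch (all names, values, and thresholds are hypothetical) of how such a recommendation table could be represented and queried on a node:

# Recommendation table maintained by the prediction system: for each
# threshold of the prediction target (here PRR), the recommended
# values for the critical features
recommendations = {
    ("prr", 0.90): {"tx_power": 7, "packet_interval": 250},
    ("prr", 0.95): {"tx_power": 15, "packet_interval": 500},
    ("prr", 0.99): {"tx_power": 31, "packet_interval": 1000},
}

def configure(target, goal):
    """Pick the lowest-threshold configuration meeting the performance goal."""
    feasible = [(thr, cfg) for (t, thr), cfg in recommendations.items()
                if t == target and thr >= goal]
    return min(feasible)[1] if feasible else None

# A node requiring at least 95% packet reception ratio
print(configure("prr", 0.95))   # {'tx_power': 15, 'packet_interval': 500}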

Achieving this dissemination of recommendations requires robust, efficient, state-preserving, and compatible reconfiguration (e.g., via firmware updates).

1.4.4 Milestones

The milestones in achieving the proposed data-driven framework are shown in Fig. 3. The first four tasks (identification of facilities for experimentation, design and execution of experiments, statistical analysis, and prediction of QoS metrics) have already been completed for WSNs, and we have published some promising results in [13], [14], [15]. Currently, we are designing experiments for WiFi and LTE, and online evaluations of the predictions are being carried out in parallel using FED4FIRE+ facilities. After these ongoing steps are complete, the proposed framework will be evaluated in real environments. An elementary case study is presented below.

1.4.5 Deployment

1.5 Challenges [2 pages]

References [2 pages]
