
Pervasive AI for IoT Applications: A Survey on Resource-Efficient Distributed Artificial Intelligence

The document presents a comprehensive survey on Pervasive AI for IoT applications, focusing on resource-efficient distributed artificial intelligence techniques. It discusses the integration of pervasive computing and AI, addressing challenges such as latency, privacy, and scalability in centralized cloud approaches. The survey reviews various AI methodologies, including federated learning and multi-agent reinforcement learning, while outlining their applications and performance metrics in the context of IoT systems.


Delft University of Technology

Pervasive AI for IoT Applications


A Survey on Resource-Efficient Distributed Artificial Intelligence
Baccour, Emna; Mhaisen, Naram; Abdellatif, Alaa Awad; Erbad, Aiman; Mohamed, Amr; Hamdi, Mounir;
Guizani, Mohsen
DOI
10.1109/COMST.2022.3200740
Publication date
2022
Document Version
Final published version
Published in
IEEE Communications Surveys and Tutorials

Citation (APA)
Baccour, E., Mhaisen, N., Abdellatif, A. A., Erbad, A., Mohamed, A., Hamdi, M., & Guizani, M. (2022).
Pervasive AI for IoT Applications: A Survey on Resource-Efficient Distributed Artificial Intelligence. IEEE
Communications Surveys and Tutorials, 24(4), 2366-2418. https://doi.org/10.1109/COMST.2022.3200740
Important note
To cite this publication, please use the final published version (if applicable).
Please check the document version above.

Copyright
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent
of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Takedown policy
Please contact us and provide details if you believe this document breaches copyrights.
We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology.


For technical reasons the number of authors shown on this cover page is limited to a maximum of 10.
2366 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

Pervasive AI for IoT Applications: A Survey


on Resource-Efficient Distributed
Artificial Intelligence
Emna Baccour, Naram Mhaisen , Alaa Awad Abdellatif , Member, IEEE,
Aiman Erbad , Senior Member, IEEE, Amr Mohamed , Senior Member, IEEE, Mounir Hamdi , Fellow, IEEE,
and Mohsen Guizani , Fellow, IEEE

Abstract—Artificial intelligence (AI) has witnessed a substantial breakthrough in a variety of Internet of Things (IoT) applications and services, spanning from recommendation systems and speech processing applications to robotics control and military surveillance. This is driven by the easier access to sensory data and the enormous scale of pervasive/ubiquitous devices that generate zettabytes of real-time data streams. Designing accurate models using such data streams, to revolutionize the decision-taking process, inaugurates pervasive computing as a worthy paradigm for a better quality-of-life (e.g., smart homes and self-driving cars). The confluence of pervasive computing and artificial intelligence, namely Pervasive AI, expanded the role of ubiquitous IoT systems from mainly data collection to executing distributed computations, offering a promising alternative to centralized learning while presenting various challenges, including privacy and latency requirements. In this context, an intelligent resource scheduling should be envisaged among IoT devices (e.g., smartphones, smart vehicles) and infrastructure (e.g., edge nodes and base stations) to avoid communication and computation overheads and ensure maximum performance. In this paper, we conduct a comprehensive survey of the recent techniques and strategies developed to overcome these resource challenges in pervasive AI systems. Specifically, we first present an overview of pervasive computing, its architecture, and its intersection with artificial intelligence. We then review the background, applications, and performance metrics of AI, particularly Deep Learning (DL) and reinforcement learning, running in a ubiquitous system. Next, we provide a deep literature review of communication-efficient techniques, from both algorithmic and system perspectives, for distributed training and inference across the combination of IoT devices, edge devices, and cloud servers. Finally, we discuss our future vision and research challenges.

Index Terms—Pervasive computing, deep learning, distributed inference, federated learning, reinforcement learning.

Manuscript received 4 May 2021; revised 11 October 2021, 24 February 2022, 8 May 2022, and 19 June 2022; accepted 9 August 2022. Date of publication 25 August 2022; date of current version 22 November 2022. This work was supported by the National Priorities Research Program-Standard (NPRP-S) Thirteenth (13th) Cycle from the Qatar National Research Fund under Grant NPRP13S-0205-200265. The findings herein reflect the work and are solely the responsibility of the authors. (Corresponding author: Mohsen Guizani.)

Emna Baccour, Aiman Erbad, and Mounir Hamdi are with the Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar. Naram Mhaisen was with the Machine Learning Department, Qatar University, Doha, Qatar. He is now with the Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, 2628 CD Delft, The Netherlands. Alaa Awad Abdellatif and Amr Mohamed are with the Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar. Mohsen Guizani is with the Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.

Digital Object Identifier 10.1109/COMST.2022.3200740

I. INTRODUCTION

DRIVEN by the recent development and prevalence of computing power, Internet of Things (IoT) systems, and big data, a booming era of AI has emerged, covering a wide spectrum of applications including natural language processing [1], speech recognition [2], computer vision [3], and robotics [4]. Owing to these breakthroughs, AI has achieved unprecedented improvements in multiple sectors of academia, industry, and daily services, improving humans' productivity and lifestyle. As an example, multiple intelligent IoT applications have been designed, such as self-driving cars, disease mapping services, smart home appliances, manufacturing robots, and surveillance systems. In this context, studies estimate that AI will have a higher impact on the global Gross Domestic Product (GDP) by 2030, accounting for $13 trillion in additional gains compared to 2018 [5].

The popularity of AI is also related to the abundance of storage and computing devices, ranging from server clusters in the cloud to personal phones and computers, and further to wearables and IoT units. In fact, the unprecedented amount of data generated by the massive number of ubiquitous devices opens up an attractive opportunity to provide intelligent IoT services that can transform all aspects of our modern life and fuel the continuous advancement of AI. Statistics forecast that, by 2025, the number of devices connected to the Internet will reach more than 500 billion [6], owing to the maturity of their sensing capabilities and affordable prices. Furthermore, reports revealed that these devices will generate enormous data reaching more than 79 ZB by 2025 and will increase the economic gains up to $11 trillion by the same year [7].

With the rapid evolution of AI and the enormous bulk of data generated by pervasive devices, conventional wisdom resorts to centralized cloud servers for analytics. In fact, the high performance of AI systems applied to multiple fields comes at the expense of a huge memory requirement and an intensive computational load to perform both training and inference phases, which requires powerful servers. However, this approach is no longer sustainable as it introduces several challenges: (1) the appearance of a new breed of services and the advent of delay-sensitive technologies spanning from
1553-877X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.

self-driving cars to Virtual and Augmented Reality (VR/AR), make the cloud approaches inadequate for AI tasks due to the long transmission delays. More precisely, the aforementioned applications are real-time and cannot allow any additional latency or connectivity loss. For example, autonomous cars sending camera frames to remote servers need to receive prompt inferences to detect potential obstacles and apply brakes [8], [9]. Sending data to cloud servers may not satisfy the latency requirements of real-time applications. Particularly, experiments in [10] demonstrated that executing a computer vision task on a camera frame offloaded to an Amazon server takes more than 200 ms. (2) In addition to latency, privacy presents a major concern for cloud-based AI approaches. In fact, end-users are typically reluctant to upload their private data (e.g., photos or audio) to cloud servers, as they can be highly exposed to cyber risks, malicious attacks, or disclosures. Among the most notable breaches reported in the 21st century, we can cite the Marriott attack, revealed in 2018 and affecting 500 million customers, and the Equifax breach, recorded in 2017 and affecting 147 million users [11]. (3) A tremendous number of AI tasks, involving unstructured and bandwidth-intensive data, needs to be transferred across the Wide Area Network (WAN), which puts huge pressure on a network infrastructure of varying quality. (4) In the same context, offloading the data to remote servers also encounters scalability issues, as the access to the cloud can become a bottleneck when the number of data sources increases, particularly if some devices import irrelevant and noisy inputs. (5) Nowadays, Explainable AI (XAI) [12] has become extremely popular, aiming to enhance the transparency of learning and detect prediction errors. However, consigning AI tasks to the cloud makes the whole process a black box vis-a-vis the end-user, and prevents model decomposability and debugging.

Pushing AI to the network edge has been introduced as a viable solution to face the latency, privacy, and scalability challenges described earlier. As such, a large amount of computational tasks can be handled by edge devices without exchanging the related data with remote servers, which guarantees agile IoT services owing to the physical proximity of computing devices to the data sources [13]. In cases where the AI tasks can only be executed at the cloud datacenters, the edge devices can be used to pre-process the data and polish it from noisy inputs in order to reduce the transmission load [14]. Furthermore, the edge network can play the role of a firewall that enhances privacy by discarding sensitive information prior to data transfer. A variety of edge devices can be candidates for executing AI tasks with different computation requirements, ranging from edge servers provisioned with GPUs, to smartphones with strong processors, and even small IoT wearables with Raspberry Pi computing. These edge devices have been continuously improving to fit deep AI models. In spite of this technological advancement, a large range of pervasive devices used in countless fields of our daily life still suffers from limited power and memory, such as smart home IoT appliances, sensors, and gaming gear.

Given the limited resources of edge devices, computing the full AI model in one device may be infeasible, particularly when the task requires a high computational load, e.g., Deep Neural Networks (DNNs). A promising solution is to opt for pervasive computing, where the different data storage and processing capacities existing everywhere (e.g., distributed cloud datacenters, edge servers, and IoT devices) cooperate to accomplish AI tasks that need large memory and intensive computation. This marriage of pervasive computing and AI has given rise to a new research area, which has garnered considerable attention from both academia and industry. The new research area comprises different concepts (e.g., federated learning, active learning, etc.) that suggest distributing AI tasks onto pervasive devices for multiple objectives. In this paper, we propose to gather all existing concepts having different terminologies under the same umbrella, which we name "Pervasive AI". Indeed, we define pervasive AI as "the intelligent and efficient distribution of AI tasks and models over/amongst any types of devices with heterogeneous capabilities in order to execute sophisticated global missions".

The pervasive AI concepts were first introduced to solve the described challenges of centralized approaches (e.g., on-cloud or on-device computation): (1) To preserve privacy and reduce the huge overhead of data collection and the complexity of training on an astronomical dataset, Federated Learning (FL) is proposed, where raw data are stored in their source entities and the model is trained collaboratively. Particularly, each entity computes a local model using its collected data, then sends the results to a fusion server to aggregate the global model. Such an approach suggests the distribution of data and the assembly of the trained AI models. (2) To cope with the limited resources provided by edge devices and simultaneously avoid latency overheads caused by cloud transmissions, the inference task is distributed among ubiquitous devices located in the proximity of the source. The basic idea is to divide the trained model into segments; subsequently, each segment is assigned to a participant. Each participant shares its output with the next one until the final prediction is generated. In other words, Pervasive Inference covers the distribution of the established model resulting from the training phase. (3) Some AI techniques are inherently distributed, such as Multi-Agent Reinforcement Learning (MARL) or Multi-Agent Bandits (MAB), where agents cooperate to build and improve a policy in real-time, enabling them to take on-the-fly decisions/actions based on the environment status. In this case, the distribution covers the online creation and update of the Reinforcement Learning (RL) policy. A scenario illustrating some pervasive AI techniques is presented in Fig. 1.

Pervasive AI exploits on-device computation capacities to collaboratively achieve learning tasks. This requires careful scheduling to wisely use the available resources without resorting to remote computing. Yet, some intensive AI tasks can only be performed by involving the cloud servers, which results in higher communication costs. Therefore, leveraging the small and ubiquitous resources and managing the enormous communication overheads present a major bottleneck for pervasive AI.

A. Our Scope

In this survey, we focus on the confluence of the two emerging paradigms, pervasive computing and artificial intelligence, which we name Pervasive AI. Pervasive AI is a promising


Fig. 1. A scenario illustrating examples of pervasive AI techniques.

research field, in which the system design is highly correlated with the resource constraints of the ubiquitous participants (e.g., memory, computation, bandwidth, and energy) and the communication overheads between them. More specifically, the size of some deep AI models, their computational requirements, and their energy consumption may exceed the available memory (e.g., RAM) or the power supply capacity of some devices, which restricts them from participating in the collaborative system. Furthermore, the process of decentralized training or inference may involve a large number of participants that potentially communicate over wireless links, which creates new challenges related to channel capacities and conditions, delay performance, and privacy. Therefore, pervasive AI should rely on various parameters, including the optimal AI partitioning, the efficient design of architectures and algorithms managing the distributed learning, and the smart selection and scheduling of pervasive participants supported by efficient communication protocols. Moreover, all the on-device constraints should be taken into consideration, such as memory, computation, and energy, not to mention the privacy requirements of the system. Finally, the load of real-time inferences (e.g., an area that needs 24/7 surveillance), the pace of data collection (e.g., weather monitoring), and the dynamics of the studied environment should also be considered, as they highly impact the number of selected participants and the parallelization strategies. In this paper, we survey the aforementioned challenges in deploying pervasive AI models and algorithms. Particularly, we provide a deep study of resource-efficient distributed learning for the training phase, the inference tasks, and the real-time training and decision process. We start by identifying the motives behind establishing a pervasive AI system for IoT applications and the corresponding communication and resource challenges.

B. Contributions and Structure of the Paper

The contributions of this paper are summarized as follows:
• We present an overview of pervasive computing and introduce its architecture and potential participants.
• We provide a brief background of artificial intelligence, particularly deep learning and reinforcement learning. We also describe the frameworks that support AI tasks and the metrics that assess their performance. Furthermore, we present multiple IoT applications in which pervasive AI can be useful.
• For each phase of AI (i.e., training and inference), we profile the communication and computation models and review the state-of-the-art. A comparison between different existing works, lessons learned, and recent use cases are also provided.
• We conclude with an elaborative discussion of our future vision, and we identify some open challenges that may arouse new promising research ideas.

The rest of this paper is organized as follows: Sections II and III present the fundamentals of pervasive computing and artificial intelligence, respectively. In Section IV, we introduce the related surveys and highlight the novelty of our paper. Section V presents the related studies that investigated the potential of federated learning schemes in different domains; moreover, it highlights the use of FL within UAV swarms for cooperative target recognition as a case study. Then, we investigate diverse reinforcement learning schemes and active learning. Specifically, we focus on the state-of-the-art algorithms


Fig. 2. Pervasive AI survey roadmap.

that study the trade-off between the utilized communication resources and the performance, which allows us next to evaluate the strengths and weaknesses of the discussed approaches. In Section VI, we present a deep study of pervasive inference. Particularly, we review the state-of-the-art approaches adopting different splitting strategies and managing the existing pervasive resources to distribute the inference. Next, we compare the performance of these works and discuss the learned lessons and potential use cases. In Section VII, we discuss the privacy and security issues associated with pervasive AI and the corresponding mitigation strategies proposed in the literature. We discuss the future vision and open challenges in Section VIII. Finally, we conclude in Section IX. More details about the road map of the paper are illustrated in Fig. 2.

II. FUNDAMENTALS OF PERVASIVE COMPUTING

A. Definition

Pervasive computing [15], [16], also named ubiquitous computing, is the growing trend of embedding computational capabilities in all devices in order to enable them to communicate efficiently and accomplish any computing task, while minimizing their resource consumption, e.g., battery, memory, CPU time, etc. Pervasive computing can occur in any device, in any format, in any place, and at any time. More specifically, it can span from resource-constrained devices to highly performant servers, and can involve cloud datacenters, mobile edge computing servers, mobile devices, wearable computers, embedded systems, laptops, tablets, a pair of intelligent glasses, and even a refrigerator or a TV. These ubiquitous devices are constantly connected and available for any task. In other words, we are no longer talking about devices acting on passive data. Instead, pervasive systems are able to collect, process, and communicate any data type or size, understand their surroundings, adapt to the input context, and enhance humans' experiences and lifestyles.

B. Ubiquitous Participants

Pervasive systems are characterized by highly heterogeneous devices (see Fig. 3), where the critical challenge is to design a scalable infrastructure able to dynamically discover different components, manage their interconnection and interaction, interpret their context, and adapt rapidly to the deployment of new software and user interfaces. A pervasive system can be composed of:

1) Data Center and Cloud Servers: Cloud computing [17], [18], [19], [20], [21] is defined as delivering on-demand


Fig. 3. Ubiquitous participants.

Fig. 4. Pervasive architecture.

services from storage, management, advertising, and computation to artificial intelligence and natural language processing, following different pricing models, such as pay-as-you-go and subscription-based billing. Hence, instead of owning computing servers, companies, operators, and end-users can exploit the high-performance facilities offered by the cloud service provider. In this way, they can benefit from better computational capacities, while reducing the cost of owning and maintaining a computation infrastructure and paying only for their requested services. Cloud computing underpins a broad number of services, including data storage, cloud back-up of photos, video streaming services, and online gaming.

2) Mobile Edge Computing (MEC) Servers: Edge computing is introduced as a solution to bring cloud facilities to the vicinity of users in order to minimize the perceived service latency, relieve the data transmission, and ease the cloud congestion. In other words, edge computing has become an essential complement to the cloud, and even a substitute in some scenarios. Services and computing capabilities equipped at the edge of cellular networks are called Mobile Edge Computing (MEC) facilities [22], [23], [24]. Deploying MEC servers within the edge Base Stations (BSs) allows providing location and context awareness, deploying new services quickly and flexibly, and enhancing the Quality of Service (QoS).

3) Cloudlet Devices: Cloudlets [25] are the network components that connect cloud computing to mobile computing (e.g., computer clusters). This network part presents the middle layer of the three-tier hierarchical architecture composed of mobile devices, micro-clouds, and cloud data centers. The role of cloudlets is to define the algorithms and implement the functionalities that support low-latency edge-cloud task offloading.

4) Fog Devices: Fog [26] and cloud computing share the same set of services provided to end-users, such as storage, networking, computing, and artificial intelligence. However, the cloud architecture is composed of fully distributed large-scale data centers, whereas fog services focus on IoT devices in a specific geographical area and target applications requiring real-time response, such as live streaming, interactive applications, and online collective gaming. Examples include phones, wearable health monitoring devices, connected vehicles, etc.

5) Edge Devices: In most studies, the interpretation of edge devices (i.e., edge nodes and IoT devices) is still ambiguous [27], meaning the difference between end or IoT devices and edge nodes is still unclear. Yet, common consensus defines end-devices/IoT as ubiquitous gadgets that are embedded with processing capacities, sensors, and software, serving to connect and exchange data with other systems over different communication networks. Meanwhile, edge nodes are defined as devices at higher levels, including fog nodes, MEC servers, and cloudlets. The edge nodes are expected to possess high storage and computation capacities and to offer high-quality networking and processing services with a lower response time compared to remote cloud servers.

Driven by the expansion and pervasiveness of computing devices, we believe that the heterogeneity of ubiquitous systems will increase in the future. These devices have to interact seamlessly and coherently, despite their differences in terms of software and hardware capacities.

C. Architecture and Intersection With AI

1) Architecture: Fig. 4 illustrates the hierarchical architecture of a pervasive system [16], which is composed of three layers:


Fig. 5. AI forms and considerations.

Fig. 6. Relation between AI, machine learning, deep learning, and reinforcement learning. This survey mainly focuses on pervasive deep and reinforcement learning.

• Data source layer: the data is collected from different monitored sources generating information about the physical world or human activities, multimedia data such as images and audio, and social media information.
• Data management layer: this layer involves the storage and integration of heterogeneous data incoming from pervasive sources, the cleaning and pre-processing that tailor the context of the system, and the data analytics that convert the raw information into useful and personalized insights using multiple approaches, such as business and artificial intelligence.
• Application layer: to this end, the insights generated from the previous layer are used to offer multiple intelligent applications, such as health advisors and smart home applications.

In our paper, we focus only on the data management layer, specifically the data analytics using artificial intelligence. The data source layer is thoroughly discussed in [28], whereas the application layer can be found in [7].

2) Intersection With AI: Artificial intelligence techniques are centralized by design, and most of the challenges revolve around accuracy, complexity, computation power, memory, and explainability. With the evolution and migration to edge computing, characterized by scarce resources, solving these issues as well as facing the new challenges related to privacy becomes crucial. This has called for AI distribution, where the training and the inference (e.g., data, models, policies) are split into smaller parts in order to reduce local computation and memory overheads, while considering the privacy and latency constraints. However, the nascent IoT applications deployed on large-scale IoT devices have driven the distributed computation towards further dispersion, which urged the support of interoperability, high heterogeneity, scalability, context-awareness, smart resource allocation, coordination, and invisibility, which are the characteristics of pervasive computing. To this end, the intersection between AI and pervasive computing came to light, paving the way to introduce "Pervasive AI". As shown in Fig. 5, Pervasive AI is a special class of distributed AI, where the decentralization of AI models is managed using intelligent techniques that take into consideration the IoT resource constraints, their heterogeneity, the application context, etc.

III. FUNDAMENTALS OF ARTIFICIAL INTELLIGENCE

Since the approaches and techniques reviewed in this survey rely on artificial intelligence and deep neural networks, we first provide a brief background of deep learning. A deeper and more detailed review of AI can be found in the reference book [29].

A. Background

Even though AI has recently gained enormous attention, it is not a new term; it was initially coined in 1956. Multiple techniques and procedures fall under AI's broad umbrella, such as rule-based systems, expert systems, control systems, and the well-known machine learning algorithms. Machine learning generally includes three categories: supervised, unsupervised, and reinforcement learning. An important branch of machine learning is deep learning, which can be supervised or unsupervised and is based on simulating the biological nervous system and performing the learning through successive layer transformations. As most pervasive applications are led by deep learning techniques and, more recently, reinforcement learning, the crossover between the above-mentioned domains (shown in Fig. 6) defines the scope of this paper.

1) Deep Learning and Deep Neural Networks: In the following, we briefly present an overview of the most common deep learning networks.

Neural networks consist of a first input layer, one or multiple hidden layers, and a last output layer, as shown in Fig. 7. When a neural network contains a high number of sequential layers, it is called a Deep Neural Network (DNN). The DNN layers include smaller units, namely neurons. Most commonly, the output of one layer is the input of the next layer, and the output of the final layer is either a classification or a feature. The correctness of the prediction is assessed by the loss function that calculates the error between the true and predicted values.

DNN networks have various structures. Hence, we introduce the fundamentals of the most known types as follows:

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
2372 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

Fig. 7. Examples of NN structures: (a) Multilayer Perceptron (MLP), (b) Convolutional Neural Network (CNN), (c) Residual Neural Network, (d) Randomly
Wired Neural Network.

a) Multilayer perceptron (MLP): If the output of one layer is fed forward to the subsequent layer, the Neural Network (NN) is termed a Feed-Forward NN (FNN). The baseline FNN is called MLP or Vanilla. As shown in Fig. 7 (a), each layer is Fully connected (Fc) to the next one, and the output is sent to the next layer's perceptron without any additional computation or recursion other than the activation function.
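To make the feed-forward structure concrete, a two-layer MLP forward pass can be sketched as follows (a minimal NumPy illustration; the layer sizes and the ReLU/softmax choices are our own assumptions, not taken from the text):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Feed-forward pass: each layer is fully connected to the next,
    and only an activation function is applied between layers."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                  # hidden layer(s)
    logits = h @ weights[-1] + biases[-1]
    # softmax output: the final layer yields a classification
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# illustrative sizes: 4 input features -> 8 hidden units -> 3 classes
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
biases = [np.zeros(8), np.zeros(3)]
probs = mlp_forward(rng.normal(size=4), weights, biases)
print(probs.shape)  # (3,): one probability per class
```

Training such a network would additionally require the loss function mentioned above and a backward pass, which are omitted here.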
b) Convolutional neural networks (CNN): Processing vision-based tasks (e.g., image data) using MLP potentially requires a deep model with a huge number of perceptrons, as a perceptron is assigned to each data pixel, which makes the network hard to train and scale. One of the successors of MLP is the CNN, which was introduced to solve this problem by defining additional pre-processing layers (i.e., convolutional (conv) and pooling layers), as shown in Fig. 7 (b). Furthermore, the convolutional layer includes a set of learning parameters, namely filters, that have the same number of channels as the data feature maps with smaller dimensions. Each filter channel passes through the length and width of the corresponding input feature map and calculates the inner product with the data. The summation of all the outputs produces one feature map. Finally, the number of output feature maps equals the number of filters, as illustrated in Fig. 8. The main difference between the Fc and the conv layers is that each neuron in Fc networks is connected to the entire input, whereas a CNN neuron is connected to only a subset of the input. The second basic component of the CNN network is the pooling task, whose objective is to reduce the spatial size of the input feature maps and minimize the computation time.

Fig. 8. Convolutional task.

A milestone for CNN applied to computer vision problems is the design of AlexNet [30] and VGG [31].

c) Deep residual networks: Following the victory of AlexNet and VGG, the deep residual networks have achieved a new breakthrough in the computer vision challenges during the recent years. Particularly, the residual networks paved the way for the deep learning community to train up to hundreds and even thousands of layers, while achieving high performance. ResNet [3] is the state-of-the-art variant of the residual network. This model uses the so-called shortcut/skip connections that skip multiple nodes and feed the intermediate output to a destination layer (see Fig. 7 (c)), which serves as a memory to the model. A similar idea is applied in the Long Short-Term Memory (LSTM) networks [32], where a forget gate is added to control the information that will be fed to the next time step. LSTM belongs to the Recurrent Neural Networks (RNN) family.

d) Randomly wired networks: The aforementioned networks focus more on connecting operations such as convolutional tasks through wise and sequential paths. Unlike previous DNNs, the randomly wired networks [33] arbitrarily connect the same operations throughout the sequential micro-architectures, as shown in Fig. 7 (d). Still, some decisions are required to design a random DNN, such as the number of stages to down-sample feature maps using Maxpooling and the number of nodes to deploy in each stage. The edge of the randomly wired networks over the other models is that the training is faster, the number of weights is reduced, and the memory footprint is optimized.

Fig. 7 presents the NN structures introduced in this section and serves to understand the following sections. Other state-of-the-art structures achieved unprecedented performance in multiple deep learning applications [34], including Recurrent Neural Networks (RNNs) [35], Auto-Encoders (AEs) [36], and Generative Adversarial Networks (GANs) [37]; however, a detailed overview of all models falls outside the scope of this paper.
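To make the convolution arithmetic of the conv layer concrete (each filter spans all input channels, and every filter yields exactly one output feature map), the following is a minimal NumPy sketch with illustrative sizes and no padding or stride handling:

```python
import numpy as np

def conv2d(x, filters):
    """x: (C, H, W) input feature maps; filters: (F, C, k, k).
    Each filter channel slides over the matching input map; the inner
    products are summed over all channels, so each filter produces
    exactly one output feature map."""
    C, H, W = x.shape
    F, _, k, _ = filters.shape
    out = np.zeros((F, H - k + 1, W - k + 1))
    for f in range(F):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[f, i, j] = np.sum(x[:, i:i + k, j:j + k] * filters[f])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))           # 3 input channels of size 8x8
filters = rng.normal(size=(5, 3, 3, 3))  # 5 filters of size 3x3
y = conv2d(x, filters)
print(y.shape)  # (5, 6, 6)
```

The output shape (5, 6, 6) confirms that the number of output feature maps equals the number of filters, as stated above.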

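Similarly, the shortcut/skip connection used by residual networks can be illustrated with a toy block (our own NumPy sketch; the two-layer block and the ReLU choice are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Output = F(x) + x: the input skips the two inner layers and is
    fed directly to the destination layer, acting as a memory path."""
    h = relu(x @ W1)
    return relu(h @ W2 + x)  # shortcut: add the untransformed input

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W1 = rng.normal(size=(16, 16))
W2 = np.zeros((16, 16))
# With the inner weights zeroed, the block reduces to the identity
# (after ReLU), which hints at why very deep stacks stay trainable.
print(np.allclose(residual_block(x, W1, W2), relu(x)))  # True
```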
BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2373

Fig. 9. Deep Reinforcement Learning (DRL) design.

2) Reinforcement Learning (RL): Reinforcement learning, also known as sequential decision making, refers to techniques that update the model/policy at each time step, i.e., when receiving each new instance of data. The advantage of RL is that it is adaptable, as it does not have any knowledge or assumption about the data distribution. In this way, if the trend of data drifts or morphs, the policy or the model can adapt to the changes on the fly.

a) Bandit learning: The bandit problem represents the simplest RL formulation, where an agent interacts with an environment by performing actions at discrete time steps. Each of these actions results in a feedback signal that is referred to as reward, which describes the goodness of that action. Consider a website that wants to maximize the engagement and relevance of articles presented to users. When a new user arrives, the website needs to decide on an article header to show and observe whether or not the user interacts with this article. In this example, the selected action is the article to display, and the reward is binary: 1 if clicked, 0 otherwise. Note that a critical assumption in bandits is that actions do not have any effect on the agent other than causing a sample of a reward signal. In cases where actions may transform the environment from a well-described state to another, a Markov Decision Process (MDP) is required to model the problem, and the formulation is known as MDP reinforcement learning.

b) Markov decision process (MDP)-based learning: This RL concept is based on learning how to map MDP states to actions in order to maximize the long-term reward signal. The RL agent is not apprised which action to choose; instead, it discovers the actions that achieve the highest reward by trying different combinations and receiving immediate gains and penalties, which can be modeled as an MDP. Different from bandit learning, the chosen RL action does not impact only the direct reward, but also all subsequent situations and related rewards. Deep Reinforcement Learning (DRL) [38], [39] combines reinforcement learning and deep learning, as illustrated in Fig. 9. The DRL is well-suited, and even indispensable, when the environment is highly dynamic and dimensional and the number of states is large or continuous. Variants of DRL include the deep policy gradient RL [40], the Deep Q-Networks (DQN) [41], Distributed Proximal Policy Optimization (DPPO) [42], and Asynchronous Advantage Actor-Critic [43].

B. Performance Metrics

The assessment of the DNN performance depends on the proximity-aware IoT application where deep learning is used. For example, for object detection, face authentication, or self-driving cars, the accuracy is of ultrahigh importance. Yet, some performance metrics are general and not specific to any application, including latency, memory footprint, and energy consumption. An overview of different performance metrics is presented as follows:

1) Latency: The latency, typically measured in milliseconds, is defined as the required time to perform the whole inference/training process, which includes the data pre-processing, data transmission, the classification process or the model training, and the post-processing. Real-time applications led by artificial intelligence (e.g., autonomous vehicles and AR/VR gaming) usually have stringent latency constraints, of around 100 ms [44]. Hence, near-processing is advantageous for fast inference response. The latency metric is affected by different factors, such as the size of the DNN model, the computational capacity of the host device, and the transmission efficiency.

2) Energy Efficiency: Unlike the cloud and edge servers, the IoT devices are battery-limited (e.g., commercial drones). Moreover, the communication and computation overhead caused by the deep model training/inference incurs huge energy consumption. Hence, the energy efficiency, typically measured in nanojoules, is of large importance in the context of edge AI, and it primarily depends on the size of the DNN and the capabilities of the computing device.

3) Computation and Memory Footprint: To perform DNN training/inference, significant cycles are executed for memory data transfer to/from the computational array, which makes it a highly intensive and challenging task. For example, VGG 16 and AlexNet require respectively 512 MB and 217 MB of memory to store more than 138 M and 60 M weights, in addition to the model complexity of Multiply-ACCumulate operations (MACC), which is equal to 154.7 G and 7.27 G respectively [45]. Such amounts of memory and computational tasks, typically measured in Megabytes and number of multiplications respectively, are infeasible to execute in power- and resource-constrained devices with a real-time response.

4) Communication Overhead: The communication overhead impacts the performance of the system when the DNN computation is offloaded to the cloud or other edge participants. Hence, it is indispensable to minimize this overhead, particularly in costly network infrastructures. The data overhead, typically measured in Megabytes, depends on the input and how the model is designed, i.e., the types and configuration of the layers that determine the output size, in addition to the communication technology. Furthermore, fault-tolerance should be guaranteed to deal with communication failures efficiently.

5) Privacy: IoT devices produce and offload a massive amount of data every second, which can result in serious privacy vulnerabilities and security attacks such as white-box attacks [46], data poisoning [47], and membership attacks [48]. Guaranteeing the robustness and privacy of the DNN system has become a primary concern for the deep learning community. The traditional wisdom resorts to data encryption, pre-processing, and watermarking. Yet, all these solutions can be neutralized using model stealing attacks. Hence, more
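As a toy illustration of the article-selection bandit described above, the following sketch uses an ε-greedy strategy; the exploration policy and the click probabilities are our own illustrative assumptions, not something the text prescribes:

```python
import random

def epsilon_greedy_bandit(click_prob, steps=10_000, eps=0.1, seed=42):
    """Each arm is an article header; the binary reward is 1 if the
    (simulated) user clicks, 0 otherwise. The agent keeps a running
    mean reward per arm and mostly exploits the best arm so far."""
    rng = random.Random(seed)
    n = len(click_prob)
    counts = [0] * n
    values = [0.0] * n  # empirical mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n)                         # explore
        else:
            arm = max(range(n), key=lambda a: values[a])   # exploit
        reward = 1 if rng.random() < click_prob[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = epsilon_greedy_bandit([0.05, 0.12, 0.30])
best = max(range(3), key=lambda a: values[a])
print(best)  # index of the article the agent estimates as most clicked
```

Because a bandit action affects nothing but the sampled reward, a running mean per arm is a sufficient statistic; in the MDP setting described next, value estimates over states would be required instead.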


sophisticated defenses need to be designed to secure the DNN training and execution, through data distribution. The robustness of a privacy mechanism is judged by its ability to protect the data from attacks while maintaining the accuracy performance.

To design an efficient deep learning network or select the adequate one for the targeted application, a large number of hyperparameters need to be considered. Therefore, understanding the trade-off between these parameters (e.g., latency, accuracy, energy, privacy, and memory) is essential before designing the model. Recently, automated machine learning frameworks responsible for DNN selection and parameter tuning have been introduced, such as Talos [49].

C. Pervasive Frameworks for AI

Several hardware and software libraries are publicly available for pervasive devices, particularly resource-limited ones, to enable DNN training and inference. As a first example, Google TensorFlow [50] is an open source deep learning framework released in 2015 to execute DNN tasks on heterogeneous distributed systems based on their estimated computational and communication capacities, which was optimized later to be adequate for resource-constrained devices (e.g., Raspberry Pi) and GPU execution. Another lightweight deep learning framework, developed by Facebook, is Caffe2 [51], which provides a straightforward way to experiment with heterogeneous deep learning models on low-power devices. Core ML [52] and DeepLearningKit [53] are two machine learning frameworks commercialized by Apple to support pre-trained models on iPhone/iPad devices. More specifically, Core ML was designed to leverage the CPU/GPU endowed with the end-device for deep learning applications such as natural language and image processing, while DeepLearningKit supports more complex networks such as CNNs and is designed to utilize the GPU more efficiently for iOS-based applications.

Since pervasive AI is still in its early stages, only a few frameworks are dedicated specifically to distributed learning. One of these deep learning frameworks is MXNet [54], which is used for pervasive training. MXNet uses KVStore1 to synchronize parameters shared among participants during the learning process. To monitor the utilization of pervasive resources, Ganglia [55] is designed to identify memory, CPU, and network requirements of the training and track the hardware usage for each participant. As for the inference phase, authors in [56] designed a hardware prototype targeting distributed deep learning for on-device prediction.

1 www.kvstore.io

D. Pervasive AI for IoT Applications

Deep learning methods have brought substantial breakthroughs in a broad range of IoT applications, spanning from signal and natural language processing to image and motion recognition. In this section, we review the accomplishments of deep learning in different domains where pervasive computing is needed, including intelligent vehicles and robots, smart homes and cities, and virtual reality/augmented reality.

1) Intelligent Vehicles, Robots, and Drones: Recently, DNNs have been widely used to lead a variety of mobile platforms such as drones, robots, and vehicles, in order to achieve critical tasks. In this context, applications such as driving assistance, autonomous driving, and mobility mapping have become more reliable and commonly used in intelligent mobile systems. As an example, in [57], the captured image from the vehicle front-facing camera is used to decide the steering angle and keep the car in the middle of the lane. The ever-improving online learning is broadly exploited for UAV/robot guidance, including the work in [58], where drones learn how to navigate and avoid obstacles while searching for target objects. Several start-ups are also using DL for their self-driving systems, such as the Prime Air UAVs of Amazon used to deliver packages [59], and Uber self-navigating cars [60].

2) Smart Homes and Cities: The concept of a smart home covers a large range of applications that contribute to enhancing the productivity, convenience, and life quality of the house occupants. Nowadays, many smart appliances are able to connect to the Internet and offer intelligent services, such as smart air conditioners, smart televisions, and lighting control systems. Most of these appliances require the deployment of wireless controllers and sensors in walls, floors, and corners to collect data for motion recognition DL services. Speech/voice DL recognition services are also involved for better home control, where a well-known example is Amazon Alexa [61]. Compared to smart homes, smart city services are more relevant to the deep learning community, as the data collected from different ubiquitous participants is huge and highly heterogeneous, which allows high-quality analysis. Examples involve waste classification [62], energy consumption and smart grid [63], and parking control [64].

3) Virtual Reality (VR) and Augmented Reality (AR): VR is designed to create an artificial environment, where users are placed into a 3D experience, while AR can be defined as a VR that inserts artificial objects into the real environment. Popular examples of applications using AR/VR include the tactile Internet and holographic telepresence [65], and multi-player VR games. The latency of virtual reality systems is measured in terms of the "motion-to-photons" metric, which is defined as the delay starting from moving the headset to updating the display according to the movement. This motion-to-photons latency should be in the range of tens to hundreds of milliseconds [66]. Offloading the VR/AR computation to remote cloud servers may incur higher latencies exceeding the required constraints. Hence, on-device computation is indispensable to achieve real-time performance.

E. Lessons Learned

In this section, we reviewed state-of-the-art deep learning and reinforcement learning techniques, examined their performance metrics, and presented some of their applications that may require pervasive deployment. In this context, multiple conclusions can be stated:
• The AI proximity-aware IoT applications have different requirements, and each one has its distinctive performance keys. For example, VR/AR is highly sensitive to delays


and cannot tolerate any motion sickness. Meanwhile, the applications relying on UAVs and moving robots have stringent requirements in terms of energy to accomplish their missions. For surveillance applications, the accuracy is paramount. However, such requirements come with other costs. More specifically, lower delays and energy consumption can be achieved using small DL networks that generate fast inference and can be deployed locally. On the other hand, high accuracy requires deep networks that incur higher memory and computation utilization and, consequently, higher communication overheads for remote execution. Therefore, understanding the requirements of the targeted application and the trade-off between different hyper-parameters is crucial for selecting the adequate AI model and the processing device.
• The common characteristic of most AI applications, particularly IoT applications that require real-time data collection, is the need for prompt response and fast analytics that should not be piled up for later processing. Hence, centralized solutions such as cloud-based data analytics are not feasible anymore, due to the communication overheads. Pervasive computation has emerged as a solution that enables the deployment of AI at the proximity of the data source for latency-sensitive applications, and in collaboration with high-performance servers for better computational resources.
• Understanding the application requirements and the pervasive environment, and wisely selecting the data shape and the adopted AI technique, are critical for determining the distribution mode. More specifically, the privacy constraints and the size of the data open the doors for federated learning, where each entity trains its data locally. The low latency requirements and the limited resources imposed by some pervasive systems push for the partitioning of inference, where the AI model is split into smaller segments. Finally, the dynamics of the system, the unavailability of labeled data, and the inherently decentralized architectures call for reinforcement learning, where agents are distributed.

After understanding the motivations for pervasive AI and the requirements of the IoT applications and their related AI models, we present different distribution modes and their communication and computation models in the subsequent sections. We review, first, the pervasive training including federated learning, multi-agent RL, and active learning, and then we survey the pervasive inference. However, we start by presenting the related surveys and highlighting the novelty of our paper.

IV. RELATED SURVEYS AND PAPER NOVELTY

The intersection of pervasive computing and AI is still in its early stage, which attracts researchers to review the existing works and provide innovative insights, as illustrated in Table I. First, many efforts discussed the applications of artificial intelligence that support edge networks, in order to meet the networking requirements. Multiple edge contexts are explored, such as healthcare, smart cities, and grid energy. As an example, two recent surveys [67], [68] provided an in-depth discussion of the usage of AI in wireless and 5G networks to empower caching and offloading, resource scheduling and sharing, and network privacy. These surveys touched upon pervasive AI, particularly federated learning and distributed inference. However, the distribution was discussed briefly as one of the techniques that further enables AI in the edge. In our survey, the applications of AI for pervasive networks are not the main topic. Instead, the deployment of AI on pervasive devices is the scope of this paper.

The surveys in [69], [70], [71], [72], [73], [74] conducted a comprehensive review of the systems, architectures, frameworks, software, technologies, and algorithms that enable AI computation on edge networks, and discussed the advantages of edge computing to support the AI deployment compared to cloud approaches. However, even though they dedicated a short part to distributed AI, these papers did not discuss the resource and communication challenges of pervasive computing nor the partitioning techniques of AI (e.g., splitting strategies of the trained DNN models or the training data). Moreover, they did not consider cloud computing as an indispensable part of the distributed system. Therefore, unlike the previous surveys [69], [70], [71], [72], [73], [74], we present an in-depth review that covers the resources, communication, and computation challenges of distributed AI among ubiquitous devices. More specifically, applying the same classical communication and computation techniques adopted in centralized approaches to pervasive AI is not trivial. As an alternative, both pervasive computing systems and distributed AI techniques are tailored to take into consideration the heterogeneous resources of participants, the AI model, and the requirements of the system. These customized strategies for pervasive AI are the main focus of our survey.

Multi-agent reinforcement learning has not been reviewed by any of the previous papers. Other papers surveyed the single-agent and multi-agent RL, such as [75], [76], [77], [78], [79]. In these tutorials, the authors conducted comprehensive studies to show that the single-agent RL is not sufficient anymore to meet the requirements of emerging networks in terms of efficiency, latency, and reliability. In this context, they highlighted the importance of cooperative MARL to develop decentralized and scalable systems. They also surveyed the existing decision making models, including game theory and the Markov decision process, and presented an overview of the evolution of cooperative and competitive MARL in terms of rewards optimization, policy convergence, and performance improvement. Finally, the applications of MARL for networking problems are also reported. However, despite this recent popularity of MARL, the designed algorithms to achieve efficient communication between agents and minimum computation are not surveyed yet. To the best of our knowledge, we are the first to survey the computation and communication challenges faced to achieve a consensus on the distributed RL policy. In other words, our focus is not the performance of the RL policy. Instead, we survey the computational load, communication schemes, and architectures


TABLE I
COMPARISON WITH EXISTING SURVEYS

experienced by cooperative agents during learning and execution.

Unlike the aforementioned papers, the authors of [80], [92], [93], [94] focused only on distributed machine learning. In these papers, they covered the training phase and data partitioning. The survey in [92] discussed the issues of learning from data characterised by large volume, different types of samples, uncertainty, incompleteness, and low value density. Solutions to minimize the learning complexity and divide the data are introduced in [93], where the authors reviewed the algorithms and decision rules to fragment large-scale data into distributed datasets. The paper in [80] described the architectures and topologies of nodes participating in the distributed training by presenting existing frameworks and communication patterns that can be employed to exchange states. Authors in [94] presented a contemporary and comprehensive survey of distributed ML techniques, which includes the applicability of such a concept to wireless communication networks, and the computation and communication efficiency. However, this survey, along with the previous works, focuses only on the training phase. Also, the authors do not provide a comprehensive and deep summary of the complexity, computation, and communication efficiency witnessed by different decentralized architectures and the amount of data shared by participants, particularly for multi-agent reinforcement learning. Our survey aims to bridge the gap by providing a comprehensive review of distributed AI, including both training and inference phases. More specifically, we thoroughly study the


architectures of federated learning, active learning, and reinforcement learning, and the partitioning strategies of DNN trained models. Then, for each approach, we show the impact on the communication and computation complexity and the algorithms scheduling the collaboration between devices.

Finally, the authors in [81], [82], [83] provided a deep review of the communication challenges of AI-based applications on edge networks. Specifically, the survey in [83] provided insights about allocating mobile and wireless network resources for AI learning tasks. However, the distribution of AI techniques was not targeted in this latter paper. The surveys in [81], [82] are considered the closest ones to our topic, as they explored the communication-efficient AI distribution. However, they mainly focused on the training phase, i.e., federated learning, whereas the pervasive inference and MARL were not studied. The inference distribution is briefly discussed in [81] from a communication angle, without considering other constraints such as the memory and computation, nor presenting the partitioning strategies (i.e., splitting of the trained model), which highly impact the distribution process, the parallelization technique, and participants orchestration. Our paper represents a holistic survey that covers all AI tasks that require cooperation between pervasive devices, motivated either by the application requirements or by the system design and the AI model.

Our search of related papers has been conducted through different databases and engines, including IEEE Xplore, ScienceDirect, and arXiv; and the papers have been chosen from a time frame set between 2017 and 2021, in addition to some well-established research works. More specifically, we selected all surveys with high citation rates that cover AI, pervasive computing, federated learning, reinforcement learning, bandit learning, deep learning applications in IoT systems, and AI deployment on edge networks. In the rest of our survey, we review the research conference and journal papers with solid results that provide comprehensive studies on resource-efficient distributed inference and training.

V. PERVASIVE TRAINING

In this section, we discuss the pervasive training, where the fitting of the model or the learning policy is accomplished within the distributed devices. Particularly, we present the resource management for federated learning, multi-agent reinforcement learning, and active learning. These aforementioned techniques are distributed by design, which means their objective is to train the learning model within pervasive devices. More specifically, the concept of federated learning and active learning is based on guaranteeing the privacy of the pervasive data by training each set locally at its source. Similarly, multi-agent reinforcement learning is designed to be implemented in a system comprising multiple independent/related entities that interact with the same environment. However, the distribution concepts of these techniques are different. In fact, in federated learning, the data is distributed and each participant creates a local model. Then, the global model is obtained by aggregating these pervasive models. Meanwhile, in active learning, the participants collaborate to label the data, and each one can benefit from on-the-fly labeled samples incoming from the others. Finally, agents in MARL collaborate to converge to the policy that ensures the selection of the best actions.

A. Federated Learning

Despite the great potential of deep learning in different applications, it still has major challenges that need to be addressed. These challenges are mainly due to the massive amount of data needed for training deep learning models, which imposes severe communication overheads on both the network design and end-users. Moreover, the conventional way of transferring the acquired data to a central server for training comes with many privacy concerns that several applications may not tolerate. In this context, the need for intelligent and on-device DL training has emerged. More specifically, instead of moving the data from the users to a centralized data center, pervasive data-sources engage the server to broadcast a pre-trained model to all of them. Then, each participant deploys and personalizes this generic model by training it on its own data locally. In this way, privacy is guaranteed as the data is processed within the host. The on-device training has been widely used in many applications [88], such as the medical field, assistance services, and smart education. However, this no-round-trip training technique precludes the end-devices from benefiting from others' experiences, which limits the performance of the local models. To this end, Federated Learning (FL) has been advanced, where end-users can fine-tune their learning models while preserving privacy and local data processing. Then, these local models (i.e., model updates) are aggregated and synchronized (averaged) at a centralized server, before being sent back to the end-users. This process is repeated several times (i.e., communication rounds) until reaching convergence. Accordingly, each participant builds a model from its local data and benefits from other experiences, without violating privacy constraints. FL was proposed by Google researchers in 2016 [95], and since then, it has witnessed unprecedented growth in both industry and academia.

We present in what follows an overview of this emerging pervasive learning technique, i.e., Federated Learning. In particular, we introduce the computation and communication models of the FL techniques. Then, we present a brief summary of the related works in the literature, while highlighting a use case that considers the application of FL within UAV swarms. It is worth mentioning that FL can be used for both online and offline learning (i.e., the training can be performed on static datasets at once, or continuously on new data received by different participants).

1) Profiling Computation and Communication Models: Generally, the FL system is composed of two main entities, which are the data-sources (i.e., owners of data or pervasive participants) and the centralized server (i.e., model owner). Let N denote the number of data-sources. Each one of these devices has its own dataset D_i. This private data is used to train the local model m_i, and then the local parameters are sent to the centralized server. Next, the local models are collected and aggregated onto a global model m_G = Σ_{i=1}^{N} m_i. The FL is different from training in the remote server, where

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
2378 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

the distributed data are collected and aggregated first, i.e., $D_G = \bigcup_{i=1}^{N} D_i$, and then one model m is trained centrally. We assume that data-sources are honest and submit their real data or their true local models to the centralized server. Otherwise, control and incentive techniques are used to guarantee the reliability of FL, including [96].

Typically, the life cycle of FL is composed of multiple communication rounds that are completed when the centralized model reaches a satisfactory accuracy. Each round includes the following steps:
• Initialization of FL: The centralized server fixes the training task, the data shape, the initial model parameters, and the learning process (e.g., the learning rate). This initial model $m_G^0$ is broadcast to the selected participants.
• Training and updating the local models: Based on the current global model $m_G^t$, each data-source i utilizes its own data $D_i$ to update the local model $m_i^t$. We note that t denotes the current round index. Hence, at each step t, the goal of each participant is to find the optimal parameters minimizing the loss function $L(m_i^t)$, defined as:

  $m_i^{t*} = \arg\min_{m_i^t} L(m_i^t)$.  (1)

Subsequently, the updated parameters of the local models are offloaded to the server by all selected participants.
• Global model aggregation: The received parameters are aggregated into one global model $m_G^{t+1}$, which will in its turn be sent back to the data owners. This process is repeated continuously, until reaching convergence. The server's goal is to minimize the global loss function, presented as follows:

  $L(m_G^t) = \frac{1}{N} \sum_{i=1}^{N} L(m_i^t)$.  (2)

The aggregation of the global model is the most important phase of FL. A classical and straightforward aggregation technique, namely FedAvg, was proposed in the Google reference paper [95]. In this technique, the centralized server tries to minimize the global loss function through a weighted averaging aggregation, following the equation below:

  $m_G^{t+1} = \sum_{i=1}^{N} \frac{|D_i|}{\sum_{j=1}^{N} |D_j|} m_i^{t+1}$,  (3)

where $D_i$ is the local dataset. The FL system is iterated continuously until the convergence of the global loss function or until reaching a desirable accuracy.

A major challenge in FL is the large communication and energy overhead related to exchanging the model updates between the different end-users and the centralized server [97], [98]. Such overheads depend on multiple parameters, including the size of the model updates, the number of participating users, the number of epochs per user, and the number of communication rounds required to reach convergence. Particularly, the energy consumed by an FL participant i, characterized by a frequency f, a local dataset $D_i$, and a number of local epochs E, is given by [99], [100]:

  $e_i^c = E \times \gamma \phi |D_i| f^2$,  (4)

where $\phi$ is the number of CPU cycles required to compute one input instance, and $\gamma$ is a constant related to the CPU. The latency required to compute the local model can be expressed as:

  $t_i^c = E \times \frac{\phi |D_i|}{f}$.  (5)

From equations (4) and (5), we can see that a trade-off exists between the local training latency and the consumed energy. More specifically, for a fixed accuracy determined by the number of local epochs and a fixed frequency, the latency scales with the size of the private data. If the data size and the accuracy are fixed, increasing the CPU frequency can help to minimize the local model computation time. However, minimizing the latency comes at the expense of an energy consumption that increases with the square of the operating frequency.

The transmission time to share the model updates between the centralized server and the different FL participants mainly depends on the channel quality, the number of devices, and the number of global rounds, illustrated as follows:

  $t^T = T \times \sum_{i=1}^{N} \frac{K}{\rho_i}$,  (6)

where K is the size of the model parameters shared with the server and $\rho_i$ is the data rate of participant i. On the other hand, the total energy consumed during the federated learning process using the local transmit powers $P_i$ is equal to:

  $e^T = T \times \sum_{i=1}^{N} \frac{K P_i}{\rho_i}$.  (7)

From the above equations, we can see that the number of local iterations E and the number of global communication rounds T are very important to optimize the energy, computation, and communication costs. Particularly, for a relative local accuracy $\theta_l$, E can be expressed as follows [101]:

  $E = \alpha \times \log\frac{1}{\theta_l}$,  (8)

where $\alpha$ is a parameter that depends on the dataset size and the local sub-problems. The global upper bound on the number of iterations to reach the targeted accuracy $\theta_G$ can be presented as [101]:

  $E_g = \frac{\zeta \log(\frac{1}{\theta_G})}{1 - \theta_l}$.  (9)

We note that $\zeta \log(\frac{1}{\theta_G})$ is used instead of $O(\log(\frac{1}{\theta_G}))$, where $\zeta$ is a positive constant. The computation cost, which depends on the local iterations E, and the communication cost, which depends on the global rounds T, are contradictory: minimizing E implies maximizing T to update the local parameters frequently, which results in increasing the convergence latency.

To summarize, the FL pervasiveness aspects being tackled by different studies to reduce communication and energy overheads may include:
1) reducing the communication frequency, i.e., the number of communication rounds;
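The FedAvg rule of Eq. (3) and the cost model of Eqs. (4)-(7) above can be sketched in code as follows. This is an illustrative toy implementation; the function names and all numerical values (data sizes, rates, powers) are assumptions for the example, not figures from the surveyed works.

```python
import numpy as np

def fedavg_aggregate(local_models, data_sizes):
    """Eq. (3): weighted average of local parameter vectors m_i,
    with weights |D_i| / sum_j |D_j|."""
    weights = np.asarray(data_sizes, dtype=float)
    weights /= weights.sum()
    return sum(w * m for w, m in zip(weights, local_models))

def local_energy(E, phi, gamma, D, f):
    """Eq. (4): e_i^c = E * gamma * phi * |D_i| * f^2."""
    return E * gamma * phi * D * f**2

def local_latency(E, phi, D, f):
    """Eq. (5): t_i^c = E * phi * |D_i| / f."""
    return E * phi * D / f

def transmission_time(T, K, rates):
    """Eq. (6): t^T = T * sum_i K / rho_i."""
    return T * sum(K / r for r in rates)

def transmission_energy(T, K, rates, powers):
    """Eq. (7): e^T = T * sum_i K * P_i / rho_i."""
    return T * sum(K * P / r for P, r in zip(powers, rates))

# Toy example: 3 participants with 2-parameter models (assumed values).
models = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 100, 200]
global_model = fedavg_aggregate(models, sizes)  # -> [0.75, 0.75]
```

Note the latency/energy trade-off discussed above: doubling f in `local_latency` halves the latency, while `local_energy` grows with f squared.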
2) reducing the number of local iterations;
3) selecting a minimum number of participating users in the training process;
4) optimizing the operating frequencies of local devices;
5) minimizing the entropy of the model updates by using lossy compression schemes;
6) using efficient encoding schemes in communicating the model updates.

In what follows, we categorize the different FL schemes presented in the literature based on the system architecture, namely one-layer FL and edge-assisted FL. The former refers to a user-cloud architecture, where different users share their learning models with a cloud or centralized server for aggregation, while the latter refers to a user-edge-cloud architecture, where edge nodes are leveraged to reduce communication overheads and accelerate FL convergence (see Figure 10).

Fig. 10. The FL architectures considered in the literature: (a) one-layer FL, (b) edge-assisted FL.

2) Resource Management for Federated Learning:
a) One-layer FL: The efficiency of the FL concept has been proved by different experiments on various datasets. In particular, the proposed model in [95] presented a one-layer FL, where the available users/devices could exchange their local models with a centralized server that collects the local models and forms a global model. Afterward, several extensions have been proposed to the original FL. The investigated problems/approaches in FL, considering the one-layer architecture, can be categorized into:
• studying the convergence behaviour of the proposed FL schemes from a theoretical perspective, while optimizing the learning process given limited computational and communication resources [102], [103], [104], [105];
• considering partial user participation for the FL aggregation process in a resource-constrained environment, while balancing between the model accuracy and the communication cost [106], [107], [108], [109];
• presenting communication-efficient schemes that aim at reducing the FL communication cost by adopting distinct sparsification and compression techniques [110], [111], [112].

The effect of non-Independent and Identically Distributed (non-IID) data on the performance of FL has been investigated in [102]. This work illustrated, theoretically and empirically, that highly skewed non-IID data (i.e., the local data at different users are not identically distributed) can substantially decrease the accuracy of the obtained trained model, by up to 55%. To solve this issue, the authors suggested sharing a small subset of data between all participants. By integrating these data from the neighboring participants with the local data at each participant, the local dataset becomes less skewed. However, sharing data among the available participants is not always feasible, given strict privacy constraints and the communication cost of sharing such data. The convergence analysis of the FedAvg scheme using non-IID data has been investigated in [103] for strongly convex problems. In [104], the authors first studied the convergence behaviour of a gradient-descent based FL scheme on non-IID data from a theoretical point of view. After that, the obtained convergence bound was used to develop a control mechanism for resource-limited systems, by adjusting the frequency of the global model aggregation in real-time while minimizing the learning loss. A new FL algorithm, named FEDL, is presented in [105]. This algorithm uses a local surrogate function that enables each user to solve its local problem approximately, up to a certain accuracy level. The authors presented the linear convergence rate of FEDL as a function of the local accuracy and the hyper-learning rate. Then, a resource allocation problem over wireless networks was formulated, using FEDL, to capture the trade-off between the training time of FEDL and the user's energy consumption.

In [103], the effect of considering the participation of all users in the FL algorithm has been studied. Indeed, it is shown that increasing the number of participating users may increase the learning time, since the central server has to wait for stragglers, i.e., participants with bad wireless channels or large computational delay. To overcome the impact of stragglers, different schemes have been proposed to select the best subset of users that can participate in the FL aggregation [106], [107]. For instance, the authors in [107] presented a control algorithm, leveraging reinforcement learning, in order to accelerate the FL convergence by obtaining the subset of users that can participate in each communication round of FL,

while accounting for the effect of the non-IID data distribution. To maintain the balance between the computational and communication costs and the global model accuracy, the authors in [108] presented a joint optimization model for data and user selection. In [109], the problem of user selection to minimize the FL training time was investigated for Cell-Free massive Multiple-Input Multiple-Output (CFmMIMO) networks.

Alternatively, sparsification and compression techniques are used to decrease the entropy of the models exchanged in the FL process. In particular, instead of communicating dense model updates, the authors in [110] presented a framework that aims at accelerating the distributed stochastic gradient descent process by exchanging sparse updates (i.e., forwarding the fraction of entries with the biggest magnitude for each gradient). In [111], a sparse ternary compression technique was presented to compress both the upstream and downstream communications of FL, using sparsification, error accumulation, ternarization, and optimal Golomb encoding.

b) Edge-assisted FL: Some studies have considered the edge-assisted FL architecture to tackle the problem of non-IID data. For example, the authors in [113] extended the work in [104] in order to analytically prove the convergence of the edge-assisted FedAvg algorithm. Then, this work was further extended in [114] to mitigate the effect of stragglers by proposing a probabilistic user selection scheme. The authors in [115] presented two strategies to prevent the training bias caused by non-IID data. The first strategy was applied before training the global model, by performing data augmentation to tackle the challenge of non-IID data. The second strategy utilized mediators, i.e., edge nodes, to reschedule the training of the participants based on the distribution distance between the mediators. In [116], the impact of non-IID data in the edge-assisted FL architecture was investigated and compared to the centralized FL architecture. This study defined the main parameters that affect the learning process of edge-assisted FL. Table II presents the taxonomy of the federated learning techniques described in this section.

3) Use Case: Learning in the Sky: Nowadays, deep learning has been widely used in Flying Ad-hoc Networks (FANETs). Different tasks can be executed using DL techniques at UAV swarms, such as coordinated trajectory planning [123] and jamming attack defense [124]. However, due to the related massive network communication overheads, forwarding the large amount of data generated by the UAV swarm to a centralized entity, e.g., ground base stations, makes implementing centralized DL challenging. As a promising solution, FL was introduced within a UAV swarm in several studies [123], [124], [126], [127] to avoid transferring raw data, while forwarding only the locally trained model updates to the centralized entity that generates the global model and sends it to the end-user and all participants over the intra-swarm network (see Fig. 11).

Fig. 11. An example of FL applications in a UAV-assisted environment.

In [123], the authors present an FL framework for a swarm of wirelessly connected UAVs flying at the same altitude. The considered swarm includes a leader UAV and a set of follower UAVs. It is assumed that each follower collects data while flying and implements FL for executing inference tasks such as trajectory planning and cooperative target recognition. Hence, each follower exploits its gathered data to train its own learning model, then forwards its model updates to the leading UAV. All received models are then aggregated at the leading UAV to generate a global FL model, which will be used by the follower UAVs in the next iteration. Interestingly, [123] investigates the impact of wireless factors (such as fading, transmission delay, and UAV antenna angle deviations) on the performance of FL within UAV swarms. The authors present the convergence analysis of FL while highlighting the communication rounds needed to obtain FL convergence. Using this analysis, a joint power allocation and scheduling optimization problem is then formulated and solved for the UAV swarm network in order to minimize the FL convergence time. The proposed problem considers the resource limitations of UAVs in terms of: (1) the strict energy limitations due to the energy consumed by learning, communications, and flying during FL convergence; and (2) the delay constraints imposed by the control system that guarantees the stability of the swarm.

4) Lessons Learned: Despite the prompt development of diverse DL techniques in different areas, they still impose a major challenge, which is: How can we efficiently leverage the massive amount of data generated by pervasive IoT devices for training DL models if these data cannot be shared/transferred to a centralized server?
• FL has emerged as a promising privacy-preserving collaborative learning scheme to tackle this issue, by enabling multiple collaborators to jointly train their deep learning models, using their locally acquired data, without the need of revealing their data to a centralized server [111].
• The model aggregation mechanisms are the most discussed in the FL literature; they are applied to address the communication efficiency, system and model performance, reliability issues, statistical heterogeneity, data security, and scalability. More specifically, one-layer FL approaches are the most studied by previous works, even if researchers are recently investigating decentralized strategies.
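The top-k sparsification idea behind the communication-efficient schemes above (forwarding only the largest-magnitude gradient entries, with local error accumulation) can be sketched as follows. The function name and the 40% fraction are illustrative assumptions; this is not the exact scheme of [110] or [111].

```python
import numpy as np

def topk_sparsify(grad, fraction=0.01):
    """Keep only the top-`fraction` entries of `grad` by magnitude and
    zero out the rest (the sparse-update idea of [110]). Also return the
    residual, to be accumulated locally and added to the next round's
    gradient (error accumulation, as used in [111])."""
    k = max(1, int(fraction * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest magnitudes
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    residual = grad - sparse  # information withheld this round
    return sparse, residual

# Toy gradient with 5 entries; keep the top 40% (k = 2) by magnitude.
g = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
sparse, res = topk_sparsify(g, fraction=0.4)
```

Only the non-zero entries of `sparse` (value plus index) need to be transmitted, which is where the entropy reduction discussed above comes from; `res` is kept locally so that no gradient mass is permanently lost.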

TABLE II
TAXONOMY OF FEDERATED LEARNING TECHNIQUES

• A major dilemma in FL is the large communication overhead associated with transferring the model updates. Typically, by following the main steps of the FL protocol, every node or collaborator has to send a full model update in every communication round. Such an update has the same size as the trained model, which can be in the range of gigabytes for densely connected DL models [132]. Given that a large number of communication rounds may be needed to reach FL convergence on big datasets, the overall communication cost of FL can become unproductive or even unfeasible. Thus, minimizing the communication overheads associated with the FL process is still an open research area.
• We also remark that, despite the considerable presented studies that have provided significant insights about different FL scenarios and user selection schemes, optimizing the performance and wireless resource usage for edge-assisted FL is still missing. Most of the existing schemes for FL suffer from slow convergence. Also, considering FL schemes in highly dynamic networks, such as vehicular networks, or resource-constrained environments, such as healthcare systems, is still challenging.

B. Multi-Agent Reinforcement Learning

In reinforcement learning [133], the agent learns how to map the environment's states to actions in order to maximize
Algorithm 1 Basic Bandit Problem
Input: The set of K actions (arms) K
1: for each round t ∈ [T]:
2:   The algorithm picks an action a_t
3:   The environment returns a reward r_t ∼ D_{a_t}
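As a concrete illustration of Algorithm 1, the toy simulation below runs the UCB rule discussed later in this section on a small set of Gaussian arms, and reports the cumulative regret of Eq. (10). The arm means, horizon, and noise level are assumed values chosen for the example.

```python
import math
import random

def ucb_bandit(means, T, sigma=1.0, seed=0):
    """Simulate Algorithm 1 with a UCB learner on K Gaussian arms and
    return the regret R_T = u* * T - sum_t u_{a_t}, computed from the
    true means (which the learner never sees)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K   # times each arm was pulled
    est = [0.0] * K    # running sample-mean estimate per arm
    picked = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize estimates
        else:
            # Optimism: pick the arm with the highest upper confidence bound.
            a = max(range(K),
                    key=lambda i: est[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = rng.gauss(means[a], sigma)  # environment returns reward r_t ~ D_{a_t}
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]
        picked.append(a)
    u_star = max(means)
    regret = u_star * T - sum(means[a] for a in picked)
    return regret, counts

regret, counts = ucb_bandit(means=[0.2, 0.5, 0.9], T=2000)
```

After the run, the best arm (mean 0.9) accumulates the vast majority of the pulls, and the regret grows sub-linearly in T, in line with the O(log(T)) bound discussed below.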
a reward signal. The agent is not told which actions to choose, but instead should discover which ones lead to the best reward by trying them. In the most general case, actions affect not only the immediate reward but also the next environment states and, potentially, all subsequent rewards; this framework is called Markov decision process reinforcement learning. In the simplified and special-case setting of RL with a single state, the agent only needs to detect the best action that maximizes the current reward (the bandit objective), without accounting for the transition to any other state. This non-associative setting, namely multi-arm bandit learning, has less power for modeling real systems, but avoids much of the complexity of the full reinforcement learning problem.

In this section, we discuss multi-agent Reinforcement Learning (MARL) in both of its forms. The "multi-agent" prefix indicates the existence of multiple collaborative agents² that aim to optimize a specific criterion through learning from past experience (i.e., past interactions). We note that RL was originally proposed to model a single agent interacting with the environment and aiming to maximize its reward. However, in pervasive computing systems, where there are numerous but resource-limited agents (i.e., devices), collaboration becomes essential to leverage the potential of the collective experience of these devices.

Motivated by the prevalence of collaboration in pervasive systems, we review in this section distributed MAB and MDP MARL algorithms from a resource-utilization perspective. As has been the case throughout the paper, we are interested in the performance/resource-management trade-offs. Specifically, we propose a taxonomy based on the obtained performance under specific resource budgets (e.g., communication rounds).

1) Multi-Agent Multi-Arm Bandit Learning: In this section, we first provide technical definitions of the single-agent stochastic bandit problem, and then explain its multi-agent extension.

a) Overview: The bandit problem, introduced earlier in Section III, is given in Algorithm 1 and visually illustrated in Fig. 12. Fundamentally, there exists a set of actions K (10 actions in the figure), where each action a results in a reward sampled from a distribution $D_a$ (Gaussians in the example illustrated in Fig. 12).

The problem instance is fully specified by the time horizon T and the mean reward vector (the vector of the expected reward of each action/arm) $\mu = (u_a)_{a \in K}$, where $u_a = E[D_a]$. The optimal policy is simply choosing the action whose expected value is the highest, i.e., $a^* = \arg\max_a u_a$. However, as this action is not known a priori ($D_a$ is not known), it has to

²Note that while competitive settings can also be modeled, the focus of this section is on systems that aim to jointly optimize an objective function with minimum resource utilization (pervasive AI systems). Thus, competitive and zero-sum games will not be deeply surveyed.

Fig. 12. The basic bandit problem: a set of actions corresponding to different reward distributions.

to be estimated online from samples. Thus, it is inevitable that some sub-optimal actions will be picked while building certainty on the optimal one. A reasonable performance measure is the regret, which is defined as the difference between the optimal policy's cumulative rewards and the cumulative rewards achieved by a solution algorithm:

  $R_T = \underbrace{u^* \times T}_{\text{optimal policy's cumulative rewards}} - \underbrace{\sum_{t=1}^{T} u_{a,t}}_{\text{an algorithm's cumulative rewards}}$  (10)

In other words, the regret $R_T$ is the sum of per-step regrets. A per-step regret at time step t is simply the difference between the best action's expected reward $u^*$ and the expected reward of the action chosen by the algorithm, $u_{a,t}$ (i.e., a is selected by the algorithm we are following). Thus, it represents how much reward is missed because the best action is not known and has to be estimated from samples. Solution algorithms typically prove a sub-linear regret growth (i.e., this difference goes to zero as time progresses; in this way, learning is achieved). The best achievable regret bound for the described bandit problem was proven to be O(log(T)) [134].

Several solution algorithms with optimal performance guarantees have been proposed in the literature [134], which fall generally into two categories: explore-then-commit and optimism-based algorithms. The explore-then-commit class, such as the successive elimination algorithm, acts in phases and eliminates arms using increasingly sensitive hypothesis tests. On the other hand, optimism algorithms, such as the Upper Confidence Bound (UCB) algorithm, build a confidence bound on the reward of each action and select the action with the highest upper bound. The asymptotic performance of both classes is similar. Note that performance guarantees are also classified into instance-dependent bounds, which depend on problem information such as the difference between the best and second-best arms, and instance-independent regret (i.e., worst-case regret). These algorithms are recently being extended to model pervasive computing through two main MAB formulations: distributed and federated bandits, as shown in Fig. 13.

In distributed bandits, agents aim to solve the same bandit instance (i.e., quickly discover the best action), represented by

Fig. 13. Multi-agent bandits formulations: (a) Distributed bandits: each agent collaborates with others to identify the best action in the same environment. (b) Federated bandits: each agent collaborates with others to identify the best global action using biased local samples. In this example, the local environments were generated (e.g., sampled) from a global one.

the action set and their generating distributions. Meanwhile, in the federated bandit settings, agents handle different bandit instances and utilize each other's experiences to solve them. While the terms used to describe the exact problem are sometimes ambiguous in the literature (i.e., distributed, federated, and decentralized were sometimes used interchangeably), in this work we adopt the recent convention of reserving the term federated for the case where each agent faces a different (but related to the others) problem instance, while keeping the term distributed for the case where the instance is the same for all agents but the decision making is distributed across the agents.

b) Distributed bandits formulations: In many bandit problem instances, it is appealing to employ more agents to learn collaboratively and concurrently in order to speed up the learning process. In the distributed bandit problem, there exists a set of agents [M] collaborating to solve the same bandit instance (the K arms are the same). These agents communicate according to some networking topology. In many contexts, the sequential decision-making problem at hand is distributed by nature. For example, we can consider a recommender system deployed over multiple servers in different locations. While every server aims to always recommend the best item, it is intuitive to reuse each other's experiences and cut the time needed to learn individually. Furthermore, since their communication may violate the latency constraints, it is desirable that this collaboration and reuse of experience achieve a minimum communication overhead.

While the classical single-agent bandit algorithm has been proposed since 2002, its multi-agent counterpart is much more recent, with new state-of-the-art algorithms currently being proposed. The work in [135] initiated the interest in the communication-regret trade-off. The authors established a non-trivial bound on the regret, given an explicitly stated bound on the number of exchanged messages. However, they focused on the full-information setting, assuming that the agents observe the rewards of all actions at each round, and not only the one picked, which is the case in bandit settings. Nonetheless, this work initiated the interest in studying the same trade-off under the bandit settings. The authors of [136] considered the partial feedback (i.e., bandit settings) and presented an optimal trade-off between performance and communication. This work did not consider regret as the performance criterion, but rather assumed the less common "best arm identification" setup, where the goal is to purely explore in order to eventually identify the best arm with high probability after some number of rounds. The authors in [137] studied the regret of distributed bandits with a gossiping-based P2P communication specialized to their setup, where at every step each agent communicates only with two other randomly selected agents. Reference [138] studied the regret under the assumption that the reward obtained by each agent is observable by all its neighbors. Reference [139] proposed a collaborative UCB algorithm on a graph-network of agents and studied the effect of the communication graph topology on the regret bound. Reference [140] improved this line of work, as the approach requires less global information about the communication graph, by removing the graph-dependent factor multiplying the time horizon in the regret bound.

Other works go beyond merely studying the effect of the network topology on the regret bound, and explicitly account for the communication resources in use. The authors in [141] deduced an upper bound on the number of needed communicated bits, proving the ability to achieve the regret bound in [140] with a finite number of communication bits. However, the interesting question, particularly from the perspective of pervasive computing design, is whether the use of communication resources can also be bounded, i.e., can the order of the optimal regret bound be guaranteed with a maximum number of communicated bits/messages?

The work in [142] established the first logarithmic upper bound on the number of communication rounds needed for an optimal regret bound. The authors considered a complete-graph network topology, wherein a set of agents are initialized with a disjoint set of arms. As time progresses, a gossiping protocol is used to spread the best-performing arm among agents. The authors showed that, with high probability, all agents will be
aware of the best arm, while progressively communicating at less frequent (halving) periods. The authors generalized this work with a sequel formulation [143], which relaxes the assumption of a complete graph and introduces the option for agents to pull information. However, this approach still uses the same gossiping style of communication. According to [144], this dependence on pair-wise gossiping communication results in a sub-optimal instance-independent regret bound. The authors in [145] focused on the regret-communication trade-off in the distributed bandit problem. The networking model utilizes a central node that all agents communicate with. Initially, agents work independently to eliminate bad arms. Then, they start communicating with the central node at the end of each epoch, where the epochs' duration grows exponentially, leading to a logarithmic bound on the number of needed messages.

The work in [144] presents a state-of-the-art distributed bandit learning algorithm. The authors proposed algorithms for both fully connected and partially connected graphs (i.e., assuming that every agent can broadcast to everyone, and assuming that agents can communicate with a subset of the others). Similar to elimination-based algorithms, the proposed algorithm proceeds in epochs of doubling lengths, communicating only at the end of an epoch, thus guaranteeing a logarithmic need for communication resources. The communicated messages are only the ID of the action played most often. Furthermore, the regret is proved to be optimal even in instance-independent problems, for reasonable values of the time horizon (i.e., $\log(T) < 2^{14}$). During each epoch, agents maintain a set of arms that were recommended by other agents at the end of previous epochs, and implement a UCB algorithm among them.

c) Federated bandits formulations: The federated bandit formulation, shown in Fig. 13 (b), is a recently emerging framework akin to the federated learning framework discussed earlier. In this formulation, there exists a set of agents, each one facing a different instance of the bandit (but the instances are related to each other). This is different from the distributed bandit formulation discussed in the previous section, where a set of agents collaborate to solve the same instance of the multi-arm bandits. Recall that a bandit instance is determined by the mean reward vector µ. By "related" instances in the federated bandit settings, we mean that each local bandit instance is a noisy and potentially biased observation of the mean vector. In light of this, collaboration

First, each agent shares a set of local information with its neighbors (the number of times an arm was sampled and its sampled mean). Second, a gossip update is performed, where each agent incorporates the information received from neighbors in updating its estimate of each arm's mean.

Shi et al. [149] presented a more general formulation, where the global mean vector is not necessarily the average of the local ones. Instead, the local means are themselves samples from a distribution whose mean is unknown. The local observation for each agent is, in turn, a set of samples from the local distribution. The communication model is similar to supervised federated learning, where agents communicate periodically with an orchestrator that updates the estimates of the arms' payoffs and instructs the agents on which arms to keep and which to eliminate. Although the communication is periodic, the total number of communication rounds is bounded (logarithmic with respect to the horizon T). This is because the number of agents incorporated in the learning process decays exponentially with time. Such an approach works since the average of the clients' local means concentrates exponentially fast around the global mean (a known result from probability concentration analyses).

A setting that is slightly different from the federated bandits was studied in [150]. The difference is that, although agents have similar yet not identical local models, the reward for each agent is actually sampled from its local distribution. Thus, each agent is trying to identify the best arm in its local instance through using information from other agents on arms that are similar. This work is different from the other aforementioned approaches, where the agents' rewards are sampled from the global distribution that they are collaboratively trying to estimate from biased local observations.

Table III summarizes the aforementioned works on MAB problems. It lists the problem formulation: distributed bandits (DB) or federated bandits (FB); the communication model (i.e., the network topology); the communication guarantee (i.e., the number of messages needed to achieve the performance); the regret guarantee (i.e., the growth of the regret with respect to the time horizon); and the main characteristics of the method, which describe how the reward estimates are communicated among the agents (recall that the agents aim to collectively learn accurate estimates of the reward distributions). CG denotes a constant related to the communication graph or gossiping matrix, and N is the number of agents.
is necessary, as even perfect local learning algorithms might d) Use case: MAB for recommender systems: Online
not perform adequately due to their biased observations. learning systems are fundamental tools in recommender
The setting of federated bandits is first proposed by [146] systems, which are, in turn, a cornerstone in the develop-
(although not under the same term). The authors proposed ment of current intelligent user applications, from social media
an algorithm, where agents agree on the best global arm, application feeds to content caching across networks. Due
and they all play it at the beginning of each round. In this to the recent growth in data generation, local geo-distributed
way, communication is needed at the beginning of each round. servers are often used to support applications that utilize rec-
Recently, [147] studied this federated setting, where the global ommender systems. Furthermore, privacy concerns sometimes
arm mean vector is the average of the local ones. Although the limit the ability of these local servers to share data with other
authors did not propose a bound on the number of messages servers. The work in [149] studies the case of a set of servers
needed to be exchanged, the communication model considered that run a recommender system for their prospective clients.
a partially connected graph, where each agent communicates The goal of each one is to recommend the most popular con-
only with neighbors but with a focus on constrained commu- tent across all servers. However, due to latency constraints,
nication resources. The algorithm contains two main steps: communication at every decision-making step is infeasible.
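The elimination-style algorithms above share one skeleton: agents pull the surviving arms locally, and only at the end of exponentially growing epochs do they synchronize with a central node, which discards provably bad arms. The following minimal sketch illustrates that pattern; the function names, confidence radius, and Gaussian reward model are illustrative choices, not the exact algorithm of [144]:

```python
import math
import random

def confidence_radius(pulls, t):
    # Illustrative UCB-style radius; the exact constants vary per algorithm.
    return math.sqrt(2 * math.log(max(t, 2)) / max(pulls, 1))

def distributed_elimination(arm_means, n_agents=4, horizon=4096, seed=0):
    """Agents sample the surviving arms locally and talk to a central
    node only at the end of exponentially growing epochs, so the number
    of communication rounds is logarithmic in the horizon."""
    rng = random.Random(seed)
    active = list(range(len(arm_means)))
    sums = [0.0] * len(arm_means)   # aggregated reward sums per arm
    pulls = [0] * len(arm_means)    # aggregated pull counts per arm
    t, epoch_len, rounds = 0, 1, 0
    while t < horizon and len(active) > 1:
        # Local phase: each agent plays every active arm once per step.
        for _ in range(epoch_len):
            for _agent in range(n_agents):
                for a in active:
                    sums[a] += rng.gauss(arm_means[a], 1.0)
                    pulls[a] += 1
            t += 1
        # Sync phase: the central node eliminates provably bad arms.
        rounds += 1
        means = {a: sums[a] / pulls[a] for a in active}
        rad = {a: confidence_radius(pulls[a], t) for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = [a for a in active if means[a] + rad[a] >= best_lcb]
        epoch_len *= 2  # doubling epochs -> O(log horizon) sync rounds
    return active, rounds
```

Because the epoch lengths double, the number of synchronization rounds is at most about log2(horizon) + 1, which is the source of the logarithmic communication bounds discussed above.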

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2385

TABLE III
MULTI-AGENT STOCHASTIC BANDIT LEARNING LITERATURE
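Several of the distributed-bandit entries summarized in Table III rely on gossip-style mixing, where the communication guarantee involves a graph constant (CG). The mechanism is easiest to see in a plain synchronous gossip average: each agent repeatedly replaces its reward estimate with a weighted average of its neighbors' estimates. The sketch below is a generic illustration, not the algorithm of any specific cited work:

```python
def gossip_round(estimates, weights):
    """One synchronous gossip round over a communication graph.
    `weights` is a doubly stochastic mixing matrix: row i holds the
    averaging weights agent i applies to its neighbors' estimates
    (zero entries mean no link)."""
    n = len(estimates)
    return [sum(weights[i][j] * estimates[j] for j in range(n))
            for i in range(n)]

def run_gossip(estimates, weights, rounds):
    """Iterate gossip rounds; estimates converge to the network mean."""
    for _ in range(rounds):
        estimates = gossip_round(estimates, weights)
    return estimates
```

With a doubly stochastic mixing matrix, the network-wide mean is preserved at every round, and each local estimate converges to it geometrically at a rate set by the second-largest eigenvalue of the matrix; this spectral quantity is the kind of graph constant (CG) that enters the communication and regret guarantees.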

Besides, sharing individual samples of rewards violates privacy, as all servers would learn about a particular user's choices and preferences. For these reasons, the authors proposed and utilized a federated bandits algorithm (Fed-UCB), which communicates only log T times over a horizon T to minimize recommendation latency. At each round, only the sample means are communicated, preserving a certain level of privacy (additional improvements are also discussed). Finally, the performance of the system is shown to be near-optimal, thus achieving the goal of recommending the best item across all servers while meeting the privacy and communication constraints.

e) Lessons learned:
• Distributed bandits formulations are the most popular in the literature compared to the recent federated formulation. Specifically, we note that distributed bandits with gossip-style communication, like the one introduced in [139], are a prevalent choice despite their sub-optimal communication resource utilization. This is attributed to the balance between complexity and the robustness resulting from the lack of a central controller.
• Communication-cognizant multi-agent bandit formulations: Online-learning systems need to account for the communication resources. Thus, recent works do not only analyze regret but explicitly optimize the communication resources. This is manifested through two main observations. First, the derived regret guarantees are always affected by the networking topology (e.g., parameters representing the connectivity of a communication graph, the number of involved agents, or the number of exchanged messages). Second, accompanying every regret guarantee, an upper bound on communication resource usage is also provided (e.g., the maximum number of exchanged messages or exchanged bits).
• Towards the federation of the bandit framework: When the bandit instances faced by each agent are local biased instances, the federated bandits framework arises. In such a situation, agents need to learn with the help of a logically centralized controller, similar to supervised federated learning, in order to estimate the true global instance and the true best action [149]. However, if agents are not interested in solving a hidden global instance but rather only their own, they may reuse their peers' experience and an instance-similarity metric to help them solve their own instances [150].

2) Multi-Agent Markov Decision Process Learning: This section presents an overview of multi-agent MDPs from a pervasive computing perspective. We specifically focus on the communication-performance trade-off and classify previous works according to their approach to handling this trade-off. We note that our perspective is different from that of previous surveys (e.g., [75], [151]), which studied the technical merits and demerits of the learning algorithms. Instead, we are interested in the systems aspects of the considered works. That is, what communication topology and protocol are used between agents, and how do these choices affect the performance (rewards obtained by all agents).

a) Overview: Unlike MAB formulations, in MDPs we have a state space, which is the set of all possible states the environment might be in, along with a transition operator, which describes the distribution over the next states given the current state and performed actions. Therefore, agents need not only to detect the best actions, which maximize the reward (bandits objective), but also to account for the possible next state, as

2386 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

Fig. 14. MARL Framework: multiple autonomous agents interact with an environment, observe (parts of) its state, and receive potentially different reward signals.

it might be arbitrarily bad/good regardless of the current one. Hence, in MARL, the collaborative agents aim to maximize the current and future expected sum of rewards, where the expectation is with respect to both the randomness of the state transitions and the action selection.

MARL problems, visualized in Fig. 14, are often modeled as a Partially Observable Markov Game (POMG) [152], which is a multi-agent extension of the Partially Observable Markov Decision Process (POMDP). POMGs are represented by the tuple (N, S, A, O, T, R, γ), where:
• N is the set of all agents.
• s_t ∈ S is a possible configuration of all the agents at time t.
• a_t ∈ A is a possible action vector for the agents, where A = A_1 × A_2 × · · · × A_N.
• o_t ∈ O is a possible observation of the agents, where O = O_1 × O_2 × · · · × O_N.
• T : O × A → O is the state transition probability.
• R is the set of rewards for all the agents, r_i : O × A → R.
• γ is the reward discount factor.

Each agent aims to find a policy π_n that maximizes its own reward. If the rewards are not the same for all agents, the framework is referred to as a mixed Decentralized-POMDP. When the rewards are similar for all agents (i.e., r_n = r ∀n ∈ N), the POMG is collaborative, and each agent's policy aims at maximizing the total reward. In the following, we discuss algorithms that might work in one or both settings. The main focus will be on the communication aspects (i.e., topology and cost) of MARL algorithms.

There exist results on the hardness of solving the POMG under several settings. We can cite the case of a tabular representation of the spaces and the cases where function approximation is used (linear or nonlinear). The main solution approaches are similar to those of the single-agent RL case, mainly policy gradient and value-based methods [133]. Policy gradient methods parametrize agents' policies within a class of functions and utilize gradient descent to optimize an objective function (i.e., the sum of future expected rewards obtained by the policy). Value-based methods aim to learn the value of a state-action pair and generalize the famous Q-learning algorithm to the multi-agent settings, either through making each agent learn its own Q-function and treating others as a part of a non-stationary environment, or through learning a global Q-function.

The optimization in policy gradient methods is done on the objective function J(θ) := v_{π_θ}(s_0), which is the value of starting from the initial state s_0 and following the parametrized policy π_θ thereafter. The gradient of this function can be written as:

∇J(θ) = (G_t − b(s_t)) · ∇π(a_t | s_t, θ_t) / π(a_t | s_t, θ_t),   (11)

where G_t is the sum of future discounted rewards, G_t := r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · ·. As shown in the policy gradient algorithm [133], b is any function of the state, referred to as the baseline. If it is the zero function, the resulting equation represents the REINFORCE algorithm. Another popular option is the value function of the state. If this state value function is updated through bootstrapping, the resulting method is called Actor-critic. Thus, Actor-critic methods are policy gradients that use the state value function as a baseline (b(s) = V(s)) and update this function through bootstrapping. Readers may refer to [153] for more details and a comparison between these approaches. As will be clarified next, each work tunes different parts of these main solution approaches according to the application. In the sequel, we present a classification of the MARL literature from a pervasive computing perspective.

b) Centralized Training and Decentralized Execution (CTDE): The Centralized Training and Decentralized Execution (CTDE) approach was first proposed in [154]. This approach leverages the assumption that, in most application scenarios, the initial training is done on centralized simulators, where agents can communicate with each other at no communication cost. This phase is denoted as centralized training. Then, at deployment, agents are assumed not to communicate at all, or to conduct limited communication with each other, and they rely on their "experience" from the training phase to actually execute their collaborative policies.

• Communication only at training: The advantage of such an approach is that it does not require communication between agents upon deployment and thus incurs no communication cost. However, this comes at the cost of losing adaptability, which is the major motivation behind online learning. Such a loss might occur in case of a major shift in the environment model between training and deployment, where the learned coordinated policies are no longer performant and new coordination is needed. The main workaround is to monitor the agents' performance and re-initiate the centralized training phase to learn new coordinated policies whenever needed. This approach has been popularized by recent methods such as VDN [155], QMIX [156], and QTRAN [157]. These works adopt the value function factorization technique, where factorizations of the global value function in terms of individual (i.e., depending only on local observations) value functions are learned during centralized training. Then, the global function (i.e., neural network) can be discarded at execution time, and each individual agent utilizes only the local function. When each agent acts greedily according to its local network, global optimality can still be guaranteed since, at the training


phase, these local networks were trained according to gradient signals with respect to the global reward.

Another approach to solving the POMG is actor-critic. The CTDE version of actor-critic approaches is represented by learning a centralized critic, which evaluates the global action, and a decentralized policy network, which takes an action based only on the local observation. During training, the actor-critic networks are jointly learned, and hence the global critic "guides" the training of the actors. Then, at execution, the global critic may be discarded, and only the actors can be used. The work in [158] presents a deep deterministic policy gradient method that follows the described approach, where each agent learns a centralized critic and a decentralized actor. Similarly, [159] follows the same approach, but all agents learn the same critic. Multiple other variations are done on the same DDPG algorithm, aiming either to enhance performance [160] through incorporating an attention mechanism, or to reduce the use of communication resources (a limited budget on the number of messages, or designing the message as (a part of) an agent's state) [161].

• Learned communication: An important line of work within the MARL community is the study of learned communication between agents. In these settings, agents are allowed to send arbitrary bits through a communication channel to their peers in order to convey useful information for collaboration. These agents need to learn what to send, and how to interpret the received messages, so that they inform each other's action selection. Thus, the agents are effectively learning communication protocols, which is a difficult task [170]. While the learned communication can be trained centrally and executed in a decentralized fashion, agents can still communicate at the execution phase through a limited-bandwidth channel. Hence, we distinguish this setting from the works discussed in the previous section. Yet, similar approaches can be followed, for example, discarding the critic in execution (sometimes used interchangeably with the term CTDE) but still maintaining the learned communication [165], and parameter sharing and gradient pushing in [170], where, in execution, these messages are discretized.

Within the learned communication line of work, the authors in [171] aimed to learn to schedule communication between agents in a wireless environment and focused only on the collision avoidance mechanism in the wireless environment. In [162], an information theoretic approach was used to compress the content of the message. In addition, the source and destination are also learned through a scheduler. On the other hand, a popular line of work targeted the design of the so-called gating mechanism techniques in order to accomplish the efficiency of the learned communication protocols. In this line of work, agents train a gating network, which generates a binary action to specify whether the agent should communicate with others or not at a given time step, limiting the number of communicated bits/messages needed to realize a certain desirable behavior. Reference [163] investigates the adaptability of these communication protocols and demonstrates the importance of communicating only with selected groups. Specifically, agents cannot distinguish messages that are particularly important to them (i.e., have implications on their actions) from the messages of all other agents. Thus, they introduce an attention scheme within each agent, where an attention unit receives the encoded local observation and action intention of the agent to decide whether communication with others in its observable field is needed. The communication group dynamically changes and persists only when necessary.

The authors in [164] looked at communication at scale and proposed an Individualized Controlled Continuous Communication Model (IC3Net), where agents are trained according to their own rewards (hence the approach can also work for competitive scenarios). Then, they demonstrated that their designed gating mechanism allows agents to block their communication, which is useful in competitive scenarios and reduces communication in cooperative scenarios by opting out from sending unnecessary messages. However, the effect of the gate on communication efficiency was not thoroughly studied, and the focus was instead on the emerging behavior. The work in [165] presents the state of the art on efficient learned communication. The authors introduced the Actor-Critic Message Learner (ACML), wherein the gate adaptively prunes less beneficial messages. To quantify the benefit of an action, Gated-ACML adopts a global Q-value difference as well as a specially designed threshold. Then, it applies the gating value to prune the messages which do not hold value. The authors showed that, surprisingly, not only does the communication efficiency significantly increase, but in specific scenarios even the performance improves as a result of well-tuned communication. The reason behind this is that, since the communication protocol is learned, it is likely to hold redundant information that agents do not decode successfully. The proposed gating mechanism can also be integrated with several other learned communication methods.

c) Fully decentralized agents: In fully decentralized reinforcement learning, there is no distinction between training and testing environments. Thus, the communication model stays the same throughout the agents' interaction with the environment. Under these settings, we recognize two extreme cases. First, agents do not communicate with each other and learn to coordinate solely through the obtained rewards, without communicating messages. In the case of no communication, the major challenge faced by the agents is the non-stationarity of the environment. A non-stationary environment, from the perspective of the agents, is when the distribution of the next states varies for the same current state and action pairs. Fully decentralized DRL was recently popularized by [166]. In [166], the authors proposed a 3-dimensional replay buffer whose axes are the episode index, timestep index, and agent index. It was illustrated that conditioning on data from that buffer helps agents to minimize the effect of the perceived non-stationarity of the environment.

On the other extreme, agents can be modeled to be able to communicate at every step. Specifically, the problem of graph-networked agents is investigated in [167]. In this paper, agents are connected via a time-varying and possibly sparse communication graph. The policy of each agent takes actions that are based on the local observation and the neighbors' messages to maximize the globally averaged return. The authors proposed fully decentralized actor-critic algorithms and provided convergence


TABLE IV
COMMUNICATION-COGNIZANT MULTI-AGENT REINFORCEMENT LEARNING LITERATURE

guarantees when the value functions are approximated by linear functions. However, a possible disadvantage of this algorithm is that the full parameter vector of the value function is required to be transmitted at each step. This has been addressed in [168], where graph-networked agents are also assumed, but each agent broadcasts only one (scaled) entry of its estimate of the parameters. This significantly reduces the communication cost (given that communication occurs at every iteration). The paper also does not assume a bidirectional communication matrix and deals with only unidirectional ones, which is a more general formulation that appeals to more applications. The decentralized actor-critic-based algorithm also solves the distributed reinforcement learning problem for strongly connected graphs with linear value function approximation.

Reference [169] considered the communication efficiency of fully decentralized agents, but with the assumption of a centralized controller. The paper utilizes policy gradient solution methods, where the controller aggregates the gradients of the agents to update the policy parameters. This process is akin to client selection in federated learning. The authors propose a process to determine which clients should communicate with the controller based on the amount of progress in their local optimization. They also propose a methodology to quantify the importance of the local gradient (i.e., the local optimization progress) and then only involve agents who are above a certain threshold. Following this approach, the authors showed that the performance (i.e., cumulative reward) is similar to the case where all clients are participating, with considerable communication round savings.

Table IV summarizes the works discussed above according to their communication model and their approach to handling the communication-performance tradeoff. We first identify the framework (CTDE, CTDE with learned communication, or fully decentralized) as well as the learning framework (value, policy gradient, or actor-critic). Note that these MARL algorithmic frameworks, which are based on a single-agent variant of the problem, involve learning the state space transition operator, as it plays a major role in estimating the future expected sum of rewards. Then, we list two important configurations. First, the communication scheme, which states how agents communicate with each other. In CTDE, the training is done in simulation. Thus, agents are logically centralized and do not communicate. If no messages are passed between agents and their collaboration is solely learned through rewards, then the communication scheme is indirect. Otherwise, it is either gated with neighbors directly or through a central controller. Lastly, the communication decision states when the communication is made, which can be at every step (with an optimized message size or not), or according to other conditions as detailed in the discussion.

Fig. 15. UAV-assisted networks: UAV agents are trained to deduce collaborative policies for providing compute/communication resources for on-ground equipment.

d) MDP MARL for UAV-assisted networks: UAVs have provided new potentials for extending the communication coverage and even the compute capabilities of devices in areas where full networking infrastructure is not present. This is done through wireless communication between UAVs and on-ground equipment, enabling such equipment to extend its connectivity and potentially offload tasks to a broader network relayed by the flying devices. The work


in [172] aims at utilizing UAVs to provide intermittent compute/communication resource support for user equipments (UEs) on the ground. The benefits of such UAV-assisted scenarios are numerous, including creating dynamic networking graphs without the need for full networking infrastructure, which can be of extreme value in catastrophe response, for example. Nonetheless, the UAVs need to optimize their trajectory paths so that they cover the widest areas with minimum movement (i.e., energy consumption) and maximum utility (i.e., providing the resources to the UE that needs them the most). However, such optimization is shown to be intractable. Thus, the authors opted for learning-assisted (i.e., data-driven) methods. Since centralized training was possible in their tackled scenario, they used a CTDE algorithm, specifically Multi-Agent DDPG (MADDPG). In MADDPG, agents aim to jointly optimize the UE-load of each UAV while minimizing the overall energy consumption experienced by UEs, which depends on the UAV's trajectory and offloading decisions. Following the MADDPG algorithm, the UAVs' observations were communicated among them to deduce the collaborative policy at training. At execution, no message sharing was needed. This resulted in satisfactory performance due to the accurate simulator. However, as discussed earlier, environments that are expected to change require the use of other algorithms that maintain periodic, learned, or persistent communication after deployment.

We note that the application of reinforcement learning in resource-constrained environments (e.g., IoT devices), requiring the design of communication-aware techniques, is still scarce. Most testing for these methods is done on artificial testing environments like OpenAI's Particle environments [173], or video games like StarCraft II [174], which is a typical practice in the RL community since success in these environments is often indicative of broader applicability.

e) Lessons learned:
• Most of the MDP MARL works focused on performance gains and benchmarking, with little attention to resource utilization. This is because MARL is applied to games and other areas (e.g., robotics) where the priority is performance. IoT applications, where resource utilization plays a major role, are yet to make full use of state-of-the-art MARL algorithms.
• CTDE, a practical middle ground: We note that CTDE is the most adopted in pervasive/IoT scenarios. We attribute this to the simplicity of the way agents communicate in this framework. Specifically, the CTDE algorithm leverages the fact that training is often done in simulators, where there is no communication cost, and agents may share experience tuples, network parameters, and observations freely, in order to train policies that can be executed later on based only on local observations. This approach seems to model most of the pervasive computing applications, where agents do not need to start training while being decentralized. In this framework, the actor-critic-based algorithms are more popular, where a centralized critic network that uses the observations of all agents guides the training of a decentralized policy network that uses only the local observations. The critic network can be discarded at execution time, enabling decentralized execution. The framework is emerging as a possible alternative to the fully decentralized extremes, where agents communicate at every step or do not communicate at all and try to indirectly and independently learn collaborative policies [158], [161].
• Scheduling for efficient learned communication: In learned communication, agents learn to encode and decode useful messages. In this area, gating mechanisms are the main tools towards efficient communication [163], [164], [165]. In gate design, agents learn when to send and when to refrain from sending a message by quantifying the benefit (i.e., reward) of the actions following this communication. More general scheduler modules investigate the design of communication modules that also learn to minimize the content of the messages (i.e., compressing the communication messages) [162]. Overall, scheduling mechanisms are being increasingly used in MARL settings with learned communication, in order to face the limited-bandwidth problems often encountered in practical scenarios.

C. Active Learning (AL)

Among pervasive training schemes, AL has emerged as a promising and effective concept. Herein, we first present a brief overview of the concept of AL, then we discuss some recent applications of AL presented in the literature.

1) Overview: The main idea behind AL is that an active learner is allowed to actively select over time the most informative data to be added to its training dataset in order to enhance its learning goals [175], [176]. Hence, in the AL framework, the training dataset is not static, which means that the training dataset and learning model are progressively updated in order to continuously promote the learning quality [177]. Specifically, the main steps of AL are: (1) acquiring new data from the contiguous nodes; (2) picking out the most informative data to append to the training dataset; (3) retraining the learning model using the newly-acquired data. Hence, the communication overheads associated with different AL schemes will depend on:
• The type and amount of exchanged data between the contiguous nodes. We remark here that contiguous nodes can exchange labels, features, or samples. Hence, based on the type and amount of exchanged data, there will always be a tradeoff between enhancing the performance and decreasing the communication overheads.
• The number of selected nodes that will be considered in the AL process.

It is worth mentioning also that FL allows multiple nodes to cooperatively train a global model without sharing their local data, which differs from AL in many ways. In particular, FL seeks synchronization between different cooperative nodes, in addition to the presence of a centralized node (or server) to generate the global model. Thus, AL and FL are addressing orthogonal problems: the former leverages the newly-acquired data from the contiguous nodes to retrain


its model, while the latter trains its model in a distributed manner by sharing the model's updates with the contiguous nodes [178].

2) Applications of AL: Traditionally, AL algorithms depend on the presence of an accurate classifier that generates the ground-truth labels for unlabeled data. However, this assumption becomes hard to maintain in several real-time applications, such as crowdsourcing applications and automated vehicles. Specifically, in crowdsourcing, many sources are typically weak labelers, which may generate noisy data, i.e., data that may be affected by errors due to low resolution and age-of-information problems. However, most of the existing studies on AL investigate the effect of noisy data (or imperfect labels) on binary classification problems [179], [180], while few works consider the general problem of multi-class or multi-labeled data [181], [182], [183].

One of the main problems in crowdsourcing is how to collect a large amount of labeled data with high quality, given that the labeling can be done by volunteers or non-expert labelers. Hence, the process of acquiring large amounts of labeled data has turned out to be challenging, computationally demanding, resource hungry, and often redundant. Moreover, crowdsourced data with cheap labels comes with its own problems. Despite the labels being cheap, handling the problem of noisy labels remains expensive. Thus, when data/labelers are not selected carefully, the acquired data may be very noisy [184], [185], due to many reasons such as varying degrees of competence, labeler biases, and disingenuous behavior, which significantly affects the performance of supervised learning. Such challenges have encouraged researchers to design innovative schemes that can enhance the quality of the data acquired from different labelers. For instance, [181] tackles the large-dataset demand of deep learning techniques by presenting an AL-based solution that leverages multiple freely accessible crowdsourced geographic datasets to increase the dataset size. However, in order to effectively deal with the noisy labels extracted from these data and avoid performance degradation, the authors propose a customized loss function that integrates multiple datasets by assigning different weights to the acquired data based on the estimated noise. [183] enhances the performance of supervised learning with noisy labels in crowdsourcing systems by introducing a simple quality metric and selecting the -optimal labeled data samples. The authors investigate the data subset selection problem based on the Probably Approximately Correct (PAC) learning model. Then, they consider the majority-voting label integration method and propose two data selection algorithms that optimally select a subset of k samples with high labeling quality. In [182], the authors investigate the problem of imbalanced noisy data, where the acquired labeled data are not uniformly distributed across the different classes. The authors therein aim to label training data given noisy labels received from diverse sources. Then, they use their learning model to predict the labels for new unlabeled data, and update their learning model until some conditions are met (e.g., the performance of the learned model meets a predefined requirement, or it cannot be improved any more). Specifically, for labeled data, they implement a label integration and data selection scheme that considers the data uncertainty and class imbalance level, while classifying the unlabeled data using the trained model before adding them to the training dataset. Hence, the proposed framework presents two core procedures: label integration and sample selection. In the label integration procedure, a Positive LAbel Threshold (PLAT) algorithm is used to infer the correct label from the received noisy labels of each sample in the training set. After that, three sample selection schemes are proposed to enhance the learning performance. These schemes are respectively based on the uncertainty derived from the received noisy labels, the uncertainty derived from the learned model, and a combination of the two.

A different application of AL is investigated in [177], where AL is exploited for incremental face identification. Conventional incremental face recognition approaches, such as incremental subspace approaches, have limited performance in complex and large-scale environments. Typically, the performance may drastically drop when the training data of face images is either noisy or insufficient. Moreover, most existing incremental methods suffer from noisy data or outliers when updating the learning model. Hence, the authors in [177] present an active self-paced learning framework, which combines active learning and Self-Paced Learning (SPL). The latter refers to a recently developed learning approach that mimics the human learning process by gradually adding data, from easy to more complex, to the training set, where easy data is that with high classification confidence. In particular, this study aims to solve the incremental face identification problem by building a classifier that progressively selects and labels the most informative samples in an active self-paced way, then adds them to the training set.

AL has also been considered in various applications of intelligent transportation systems. For instance, the authors in [186] investigate the vehicle type recognition problem, in which labeling a sufficient amount of data in surveillance images is very time consuming. To tackle this problem, this work leveraged fully labeled Web data to decrease the required labeling time of surveillance images using deep transfer learning. Then, the unlabeled images with high uncertainty are selected to be queried, in order to be added later to the training set. Indeed, a cross-domain similarity metric is linearly combined with the entropy in the objective function of the query criteria to actively select the best samples. Ultimately, we highlight that most of the studies presented so far build their AL frameworks around specific classifiers (or learning models), which cannot be easily reused with other learning models [187]. Accordingly, obtaining an optimal label integration and data selection strategy that can be used with generic multi-class classification techniques is still worth further investigation.

3) Use Case: AL for Connected Vehicles: Traditional machine learning models require massive, accurately labeled datasets for training in order to ensure high classification accuracy for new data as it arrives [188]. This assumption cannot be guaranteed in many real-time applications, such as connected and autonomous vehicles. Indeed, vehicles are typically weak labelers (i.e., data sources that generate labels with low classification confidence). Hence, they may acquire/generate noisy data, e.g., data generated by cameras in the presence of fog

BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2391

Fig. 16. AL for time-varying vehicular network.

or rain. Also, in a highly dynamic environment like a vehicular network, not only can the data generated by the vehicles' classifiers have low classification accuracy, but also the data received from neighboring vehicles may be prone to noise and communication errors. Hence, the authors in [189] have tackled these challenges by proposing a cooperative AL framework for connected vehicles. The main goal of this work is two-fold: (1) selecting the optimal set of labelers, those considered to be the most reliable ones; (2) selecting a maximally diverse subset of high-quality data that are locally generated and/or received from neighboring vehicles, to be used for updating the learning model at each vehicle.

In [189], the time-varying vehicular network shown in Figure 16 is considered. It is assumed that each vehicle can communicate and exchange information only with the neighboring vehicles that are located within its communication range. For instance, the set N_{v0}(t) = {v_j, v_{j+1}, v_{j+2}} at time t means that there are only three vehicles staying in the communication range of vehicle v0. Furthermore, this framework considers two types of data: a multiple-labeled online dataset and an offline/historical labeled dataset. The online dataset is considered as sequences of samples that arrive from neighboring vehicles or are generated at vehicle v0 within time T (i.e., the period of time during which vehicle v0 is exposed to a certain road view). At time T, vehicle v0 receives a sequence of training samples/labels that contains input features associated with multiple noisy labels generated by the vehicles sending data to v0. The presented framework in [189] includes five main stages, as described below:
1) Offline Learning: Initially, each vehicle with its own offline/historical training data generates an initial learning model with a certain accuracy level.
2) Online Labeling: The vehicle starts to collect new information through its local sensors or from neighboring vehicles. This information can be labels, features, or samples, depending on the adopted operational mode.
3) Label Integration: After acquiring the new information, each vehicle obtains an aggregated label for the received data using different proposed label integration strategies.
4) Labeler Selection: After monitoring the behavior of the neighboring vehicles, each vehicle selects a subset of high-quality labelers, based on their reputation values that are estimated from past interactions using a subjective logic model.
5) Data Selection and Model Update: Finally, each vehicle selects the maximally diverse collection of high-quality samples to update its learning model.

Fig. 17. Pervasive inference system in multiple scenarios.

The proposed AL framework in [189] demonstrates its efficiency for connected automated vehicles as follows: (1) it allows increasing the amount of acquired data at different vehicles during the training phase; (2) it accounts for the labelers' accuracy, data freshness, and data diversity while selecting the optimal subset of labelers and data to be included in the training set; (3) using different real-world datasets, it could provide a 5–10% increase in classification accuracy compared to state-of-the-art approaches that consider random data selection, while enhancing the classification accuracy by 6% compared to random labeler selection approaches.

VI. PERVASIVE INFERENCE

In this section, we discuss pervasive inference, where the trained model is partitioned and its different segments are distributed among ubiquitous devices. It is worth mentioning that the training method of the distributed model is out of the scope of this section. Fig. 17 illustrates different scenarios where the distribution can solve the challenges presented by the centralized approaches. In the following subsections, the communication and computation components of pervasive inference are introduced. Then, the resource management approaches for the distribution are reviewed and a use case is described.

A. Profiling Computation and Communication Models

The computation and communication models present the mechanisms to formulate different operations and functions into an optimization problem, in order to facilitate the theoretical analysis of DNN distribution. More specifically, we discuss the computational requirements of different DNN tasks, the wireless communication latency between the different pervasive participants, and their energy consumption.

1) Computation Models: Various parameters play a critical role in modeling the computational tasks of the different segments of the DNN network, including latency, generality, scalability and context awareness. In this section, we describe the computation


Fig. 18. Inference parallelization: data and model parallelization.

models of two popular splitting strategies adopted in the literature, which are per-layer and per-segment splitting. These models are presented after introducing some definitions.

a) Overview and definitions: Binary Offloading: Relatively simple or highly complex tasks that cannot be divided into sub-tasks and have to be computed as a whole, either locally at the source device or sent to remote servers because of resource constraints, are called binary offloading tasks. These tasks can be denoted by the three-tuple notation T(K, τ, c). This commonly used notation captures the size K of the data to be classified and the constraint τ (e.g., a completion deadline, a maximum energy, or a required accuracy). The computational load to execute the input data of the DNN task is modeled as a variable c, typically measured in number of multiplications [190]. Although binary offloading has been widely studied in the literature, we note that it is out of the scope of this survey, which covers the pervasivity and distribution of AI tasks.

Fig. 19. Typical topologies of DNNs and partitioning strategies.
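To make the notation concrete, the following minimal Python sketch (our own illustration, not code from the survey) compares local execution against full offloading for a binary task T(K, τ, c), using the computation-latency model t = c/e (eq. (12)) and the upload-latency model t = K/ρ (eq. (16)) used in this section; all numeric values and the function name are assumptions for illustration.

```python
# Illustrative sketch: whole-task (binary) offloading decision for T(K, tau, c).
# K: input size (bits), tau: completion deadline (s), c: load (multiplications),
# e_local / e_server: multiplication speeds, rho: uplink data rate (bits/s).

def binary_offload_choice(K, tau, c, e_local, e_server, rho):
    """Return 'local', 'offload', or 'infeasible' for a binary task T(K, tau, c)."""
    t_local = c / e_local                 # eq. (12): local computation latency
    t_offload = K / rho + c / e_server    # eq. (16) upload latency + remote compute
    best = min(t_local, t_offload)
    if best > tau:
        return "infeasible"               # neither option meets the deadline
    return "local" if t_local <= t_offload else "offload"

# Example: a 4 MB input, 2e9 multiplications, 100 ms deadline.
print(binary_offload_choice(K=32e6, tau=0.1, c=2e9,
                            e_local=5e9, e_server=1e11, rho=1e9))  # -> offload
```

With these assumed values, the upload cost (32 ms) plus the fast remote computation (20 ms) beats the slow local computation (400 ms), so the whole task is offloaded; partial offloading, discussed next, refines this all-or-nothing choice.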
Partial Offloading: In practice, DNN classification is composed of multiple subtasks (e.g., layer execution, multiplication tasks, and feature map creation), which makes it possible to implement fine-grained (partial) computations. More specifically, the AI task can be split into two or more segments, where the first one can be computed at the source device and the others are offloaded to pervasive participants (either remote servers or neighboring devices).

Data Parallelization: The most manageable form of partial offloading is data parallelization, where the duplicated offloaded segments are independent and can be arbitrarily divided into different groups and executed by different participants of the pervasive computing system, e.g., segments from different classification requests (as shown in Fig. 18 (a)). We highlight that the input data to parallel segments are independent and can be different or alike.

Model Parallelization: A more sophisticated partial offloading pattern is model parallelization, where the execution of one task is split across multiple pervasive devices. Accordingly, the input data is also split and fed to the different parallel segments. Then, their outputs are merged again. In this offloading pattern, the dependency between the different tasks cannot be ignored, as it affects the execution of the inference. Particularly, the computation order of the different tasks (e.g., layers) cannot be determined arbitrarily, because the outputs of some segments serve as the inputs of others (as shown in Fig. 18 (b)). In this context, the inter-dependency between the different computational parts of the DNN model needs to be defined. It is worth mentioning that several slightly different definitions of data and model parallelism are presented in the literature. In our paper, we opted for the definitions presented in [191].

Typical Dependencies: Different DNN networks can be abstracted as task-call graphs. These graphs are generally represented by Directed Acyclic Graphs (DAGs), which have a finite directed structure with no cycles. Each DNN graph is defined as G(V, E), where the set of vertices V represents the different segments of the network, while the set of edges E denotes their relations and dependencies. Typically, three types of dependencies contribute to determining the partition strategies: the sequential dependency, which occurs in conventional CNN networks with sequential layers and without any residual blocks (e.g., VGG [31]); the parallel dependency, which depicts the relation between different tasks in the same layer (e.g., different feature map transformations); and the general dependency existing in general DNN models (e.g., randomly wired CNNs [33]). The different dependencies are depicted in Fig. 19. The required computation workload and memory are specified for each vertex in V, and the amount of input and output data can be defined on the edges.
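As a concrete illustration of this abstraction (our own sketch with assumed names and values, not code from the survey), a DNN task-call graph G(V, E) can be stored as a DAG whose vertices carry the per-segment workload and memory, and whose edges carry the size of the data exchanged between segments:

```python
from collections import defaultdict

# Hypothetical task-call graph G(V, E) for a small sequential CNN.
# Vertices hold (workload in multiplications, memory in bytes);
# edges hold the amount of data passed from one segment to the next.
class TaskGraph:
    def __init__(self):
        self.segments = {}              # v -> (workload, memory)
        self.edges = defaultdict(dict)  # u -> {v: data_size}

    def add_segment(self, name, workload, memory):
        self.segments[name] = (workload, memory)

    def add_dependency(self, u, v, data_size):
        self.edges[u][v] = data_size

    def successors(self, u):
        return list(self.edges[u])

g = TaskGraph()
g.add_segment("conv1", workload=87e6, memory=7e3)    # assumed values
g.add_segment("conv2", workload=150e6, memory=74e3)
g.add_segment("fc",    workload=4e6,  memory=16e6)
g.add_dependency("conv1", "conv2", data_size=800e3)  # feature maps passed on
g.add_dependency("conv2", "fc",    data_size=200e3)

# A purely sequential dependency chain, as in VGG-style CNNs:
print(g.successors("conv1"))   # -> ['conv2']
```

A parallel or general dependency would simply add several outgoing edges per vertex; a partitioning algorithm can then map each vertex to a participant subject to the workload, memory, and edge-data costs.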


Fig. 20. Partitioning of fully connected layers.

Based on the presented dependencies, two partition strategies can be introduced, namely per-layer and per-segment partitioning (see Fig. 19). Per-layer partitioning consists in dividing the model into layers and allocating each set of layers to a pervasive participant (e.g., an IoT device or a remote server). On the other hand, per-segment partitioning denotes segmenting the DNN model into smaller tasks, such as feature map transformations, multiplication tasks, and even per-neuron segmentation.

Computation Latency: The primary and most common engine of pervasive devices for performing local computation is the CPU. The performance of the CPU is assessed by the cycle frequency/clock speed f [192] or the multiplication speed e [193]. In the literature, authors adopt the multiplication speed to characterize the performance of the devices executing the deep inference. In practice, e is bounded by a maximum value e_max reflecting the limitation of the device's computation capacity. Based on the model introduced for binary offloading, the computation latency of the inference task T(K, τ, c) is calculated as follows [193]:

t^c = c / e.   (12)

Importantly, a higher computational capacity e_max is desirable to minimize the computation latency, at the cost of energy consumption. As end-devices are energy constrained, the energy consumption of local computation is considered a key measure for evaluating the inference efficiency. More specifically, a high amount of energy consumed by AI applications is not desirable for end-devices due to the incurred cost. Similarly, a significant energy consumption at edge nodes (e.g., access points or MEC servers) increases the cost envisaged by the service providers.

Computation Energy: If the inference is executed at the data source, the consumed energy is mainly associated with the task computation. In contrast, if the task is delegated to remote servers or to neighboring devices, the power consumption consists of the energy required to transfer the data between participants, the amount of energy consumed for the computation of the different segments, and the energy required to await and receive the classification results. Suppose that the inference task/sub-task T_i takes a time t_i^c to be computed locally at the device participating in the pervasive inference, and let P_i denote the processing power to execute the task per second. The energy consumed to accomplish an inference task T_i locally at the computing device is equal to [194]:

e_i^local = t_i^c × P_i.   (13)

Next, we profile the DNN partitioning strategies presented in the literature, in terms of computation and memory requirements first, and then in terms of the data communicated to offload the output of segments. The key idea of partitioning a DNN network is to evenly or unequally distribute the computational load and the data weights across the pervasive devices intending to participate in the inference process, while minimizing the classification latency. A partitioning can be achieved by simply segmenting the model per layer or per set of layers (see Fig. 18 (a)) or by splitting the layers' tasks (see Fig. 18 (b)). Then, each part is mapped to a participant.

b) Per-layer splitting: As previously mentioned, the computational load of each layer is measured as the number of multiplications required to accomplish the layer's goal [195].

Fully-Connected Layers: The computation requirement of a fully-connected layer can be calculated as follows:

c^Fc = n × m,   (14)

where n represents the number of input neurons and m is the number of output neurons.

Convolutional Layers: The computation load of a convolutional layer can be formulated as follows [195]:

c^conv = (D1 × Wf × Hf) × D2 × (W2 × H2).   (15)

We remind that D1 is the number of input channels of the convolutional layer, which is equal to the number of feature maps generated by the previous layer, (Wf × Hf) denotes the spatial size of the layer's filter, D2 represents the number of filters, and (W2 × H2) represents the spatial size of the output feature map (see Fig. 8).

The computational load introduced by pooling and ReLU layers can commonly be neglected, as these layers do not require any multiplication task [195]. We highlight that per-layer splitting is motivated by the sequential dependency between layers. This dependency permits neither model parallelism nor latency minimization. Instead, it allows resource-constrained devices to participate in the AI inference.

c) Per-segment splitting:
Fully-Connected Layers: We start by profiling the fully-connected layer partitioning. More specifically, the computations of the different neurons y_i in a fully-connected layer are independent. Hence, their executions can be distributed, and model parallelism can be applied to minimize the inference latency. Two methods are introduced in the literature (e.g., [196], [197]), which are output and input partitioning, as shown in Fig. 20.
• Output Splitting: the computation of each neuron y_i is performed on a single participant that receives all the input data {x_1, x_2, ..., x_n}, as highlighted in Fig. 20 (a). Later, when the computation of all neurons is done, the results are merged by concatenating the outputs of all devices in the correct order. The activation function can be applied on each device or after the merging process.
• Input Splitting: each participant computes a part of all output neurons. Fig. 20 (b) illustrates an example, where


Fig. 21. Partitioning of convolutional layer: (a) is an output splitting, and (b) and (c) are input splittings.

each device executes 1/n of the required multiplications. By adopting this partitioning method, only a part of the input, x_i, is fed to each participant. Subsequently, when all participants accomplish their tasks, summations are performed to build the output neurons. However, in contrast to the output-splitting method, the activation function can only be applied after the merging process.

Convolutional Layers: Next, we illustrate the different partitioning strategies of the convolutional layer. As described in the previous Section III-A1b, each filter is responsible for creating one of the feature maps of the output data (Fig. 8). We remind that the dimensions of the input data are H1 × W1 × D1, the dimensions of the k filters are Hf × Wf × Df, and the dimensions of the output feature maps are defined by H2 × W2 × D2. We note that, by definition, D1 is equal to Df and k is equal to D2. Furthermore, each filter contains D1 × (Hf × Wf) weights and performs D1 × (Hf × Wf) multiplications per output element. Similarly to the fully-connected layers, two partitioning strategies characterize the convolutional layer, namely input and output splitting. In this context, output splitting includes the channel partitioning strategy, while input splitting consists of the spatial and filter partitioning strategies (see Fig. 21). These splitting strategies are introduced and adopted by multiple recent works, including [196], [198], [199], for which we will thoroughly review the resource management techniques in the following section.
• Channel splitting: each participant computes one or multiple non-overlapping output feature maps, which serve as input channels for the next layer. This implies that each device i possesses only 1 ≤ k_i ≤ k filters responsible for generating k_i feature maps, where Σ_i k_i = k. In addition to the k_i filters, the entire input data is fed to each device to compute the different outputs. In this way, the filters' weights are distributed across participants, (k_i × D1 × Hf × Wf) each, and the total number of multiplications is equal to D1 × (Hf × Wf) × k_i × (W2 × H2) per device. The channel partitioning strategy allows model parallelization, and consequently inference acceleration. At the end, when all devices finish their tasks, the different feature maps are concatenated depth-wise, with a complexity equal to O(k). We emphasize that the activation function can be applied before merging at each device or once at the concatenation device. Fig. 21 (a) shows an example of channel partitioning.
• Spatial splitting: this fine-grained splitting divides the input spatially, along the x or y axis, in order to jointly assemble the output data, as shown in Fig. 21 (b). Let d_x and d_y define the split dimensions on the x-axis and y-axis, respectively. Therefore, the input data are partitioned into segments of size (d_x × d_y), and each group of segments can be transmitted to a device. Furthermore, each part allocated to a participant needs to be extended with overlapping elements from the neighboring parts, so that the convolution can be performed on the borders. Compared to channel splitting, in which all the input data should be copied to all participants along with parts of the filters, spatial splitting distributes only parts of the data, with all the filters, to each device. This means that, in addition to the segment of the input data, an amount of (k × D1 × Wf × Hf) weights should be transmitted to and stored at each device. Note that storing the filters is considered a one-time memory cost, as they will be used for all subsequent inferences. Also, the total number of multiplications is reduced per device, and each one executes only (d_x × d_y)/(H1 × W1) of the computational load per segment. When all computations are done, the output data is concatenated spatially with a complexity of O((H2 × W2)/(d_x × d_y)), and the activation function can be applied before or after the merging process. Note that, for simplicity, we presented, for spatial splitting, the case where filters do not apply any size reduction.
• Filter splitting: in this splitting strategy, both the filters and the input data are split channel-wise, with a size of k_i for each participant i. Figure 21 (c) illustrates the convolution of the input data by one filter in order to produce one feature map. In this example, the input channels and one of the filters are divided among 4 devices, which implies that each device stores only its assigned channels of the input data and of the filter, so the memory footprint is also divided. The computational load is also reduced, in such a way that each participant executes k_i × (Hf × Wf) × (W2 × H2) multiplications. In the end, all final outputs are summed to create one feature map, and the activation function can only be applied after


TABLE V
CHARACTERISTICS OF DIFFERENT SPLITTING STRATEGIES. A: AFTER, B: BEFORE, NFc: NUMBER OF FULLY-CONNECTED LAYERS, n: NUMBER OF INPUT NEURONS (FC), m: NUMBER OF OUTPUT NEURONS (FC), H1, W1, D1: DIMENSIONS OF THE INPUT DATA (CONV), H2, W2, D2: DIMENSIONS OF THE OUTPUT DATA (CONV), Hf, Wf, D1: DIMENSIONS OF THE FILTER (CONV), k/D2: NUMBER OF FILTERS, dx, dy: DIMENSIONS OF THE SPATIAL SPLITTING (CONV), N: NUMBER OF PARTICIPANTS, ki: NUMBER OF SEGMENTS PER PARTICIPANT

the merging process. A concatenation task is performed when all feature maps are created.

Table V summarizes the computation and memory characteristics of the different splitting strategies. In this table, we present the number of smallest segments per model; the input, output and computation requirements for each small segment; the weights of the filters assigned to each device owning k_i segments; and the transmitted data per layer when having N participants.

2) Communication Models: Latency is of paramount importance in AI applications. Hence, minimizing the communication delay and the data transmission by designing an efficient DNN splitting is the main focus of pervasive inference.

a) Overview:
Communication Latency: In the literature, the communication channels between different pervasive devices are abstracted as bit-pipes with either constant rates or random rates following a defined distribution. However, this simplified bit-pipe model is insufficient to illustrate the fundamental properties of wireless propagation. More specifically, wireless channels are characterized by different key aspects, including: (1) the multi-path fading caused by reflections from objects existing in the environment (e.g., walls, trees, and buildings); (2) the interference with other signals occupying the same spectrum due to the broadcast nature of wireless transmissions, which reduces their Signal-to-Interference-plus-Noise Ratios (SINRs) and increases the probability of errors; (3) the bandwidth shortage, motivating the research community to exploit new spectrum resources, design new spectrum sharing and aggregation schemes, and propose new solutions (e.g., in-device caching and data compression). Based on these characteristics, the communication/upload latency between two devices, either resource-constrained devices or high-performance servers, can be expressed as follows:

t^u = K / ρ_{i,j},   (16)

where K is the size of the transmitted data and ρ_{i,j} is the achievable data rate between two participants i and j. The total transmission latency t^T of the entire inference is related to the type of dependency between the different layers of the model. This latency is defined in eq. (17) if the dependency is sequential (e.g., layers), and in eq. (18) if the dependency is parallel (e.g., feature maps). In case the dependency is general (e.g., randomly wired networks), we formulate the total latency as the sum of the sequential communications and the maximum of the parallel transmissions.

t^T = Σ_{s=1}^{S} t_s^u.   (17)

t^T = max(t_s^u, ∀s ∈ {1, ..., S}).   (18)

Communication Energy: The energy consumption to offload the inference sub-tasks to other participants consists of the amounts of energy consumed on outward data transmissions and when receiving the classification results generated by the last segment of the task T. This energy is formulated as follows [192], [194]:

e_i^ofd = t_i^u × P_i + Σ_s Σ_k Σ_j (K_s / ρ_{k,j}) × P_s × X_{k,s} × X_{j,s+1},   (19)

where t_i^u is the upload delay to send the original data/task i to the first participant, K_s is the output of segment s (e.g., layers or feature maps), ρ_{k,j} denotes the data rate of the communication, and X_{k,s} is a binary variable indicating whether participant k executes segment s.

Using only its onboard battery and resources, the source-generating device may not be able to accomplish the inference task within the required delays and the energy constraint. In such a case, partitioning the task among neighboring devices or offloading the whole inference to remote servers are desirable solutions.

b) Per-layer splitting: Per-layer partitioning is characterized by a simple dependency between the different segments and
ρi,j ized by a simple dependency between different segments and

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
2396 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

a higher data transmission per device. Indeed, the computation of one Fc layer per participant costs the system a total communication overhead equal to (n + m). Meanwhile, the allocation of a convolutional layer requires a transmission load equal to (H1 × W1 × D1) + (H2 × W2 × k).

c) Per-segment splitting: The per-segment partitioning requires a higher total transmission load with less computation and memory footprint per device. In other words, this type of partitioning trades communication for memory. More details are illustrated in Table V, where the output and input splitting of the fully-connected layers have a total communication overhead of n × (N − 1) and m × (N − 1), respectively, compared to the per-layer distribution. Hence, depending on the input and output sizes, namely n and m, the optimal partition strategy can be selected. Regarding the convolutional layers, the channel splitting has an overhead of (N − 1) × H1 × W1 × D1, since a copy of the entire input data needs to be broadcast to all devices; the spatial splitting pays an overhead of the padding equal to N × padding; and the filter splitting has an overhead of (N − 1) × H2 × W2 × k incurred in the merging process.

3) Lessons Learned: The main lessons acquired from the review of the splitting strategies are:
• The performance of model parallelism is always better than that of data parallelism in terms of latency minimization, as it allows computing multiple sub-tasks simultaneously. Meanwhile, data parallelism pays the high costs of merging and transmitting the same inputs, either for fault-tolerance purposes or to handle multiple concurrent requests.
• The choice of the parallelism mode highly depends on the partitioning strategy and the dependency between different segments. For example, in the per-layer splitting with a sequential dependency, model parallelization cannot be applied to compute different fragments. On the other hand, the general and parallel dependencies pave the way to distributing concurrent segments.
• Data parallelism is highly important for AI applications with a high load of inference requests, such as 24/7 monitoring systems and VR/AR applications. In such scenarios, the classifications and feature learning are required every short interval of time. Generally, source devices do not have sufficient resources to compute this huge load of inferences. In this case, distributing the requests within neighboring devices and parallelizing their computations contribute to minimizing the queuing time.
• Understanding the characteristics of the pervasive computing system is compulsory for selecting the partition strategy. More specifically, the per-layer distribution is more adequate for systems with a lower number of participants and higher pervasive capacities. For example, VGG19 has 19 layers and accordingly needs a maximum of 19 participants. More importantly, these devices are required to be able to accommodate the computation demand of convolutional layers. Meanwhile, opting for fine-grained partitioning results in small fragments that fit in resource-limited devices, such as sensors. However, a high number of sensors (e.g., N_i^conv = D1 × k_i segments for filter splitting) should be involved to accomplish the inference.
• Choosing the optimal partitioning for the per-segment strategy highly depends on the properties of the DNN network, including the channel sizes, the number of filters, the size of feature maps, and the number of neurons. Particularly, for Fc splitting, m and n are the decisive variables for choosing input or output partitioning. For convolutional layers, the size of the channels and filters and the capacities of participants are the decisive parameters to select the strategy. In terms of memory requirements, the channel splitting requires copying the whole input channels to all devices along with a part of the filters. Meanwhile, the spatial splitting copies all the filters and a part of the data, whereas the filter splitting needs only a part of the channels and filters. In terms of transmission load, the spatial splitting has less output data per segment compared to channel and filter strategies. Finally, the channel splitting has a higher computational load per device. Still, it incurs less dependency between segments.

B. Resource Management for Distributed Inference

The joint computational and transmission resource management plays a key role in achieving low inference latency and efficient energy consumption. In this section, we conduct a comprehensive review of the existing literature on resource management for deep inference distribution and segment allocation on pervasive systems. We start by discussing the remote collaboration, which consists of the cooperation between the data source and remote servers to achieve the DNN inference. In this part, we determine the key design methodologies and considerations (e.g., partitioning strategies and number of split points) in order to shorten the classification delays. Subsequently, a more complex collaboration, namely localized collaboration, is examined, where multiple neighboring devices are coordinated to use the available computational and wireless resources and accomplish the inference tasks with optimized energy, delays, and data sharing.

1) Remote Collaboration: The remote collaboration encompasses two approaches, the binary and the partial offloading defined in the previous section. The binary offloading consists of delegating the DNN task from a single data-generating device to a single powerful remote entity (e.g., edge or cloud server), with an objective to optimize the classification latency, accuracy, energy, and cost (see Fig. 22 (a)). The decision will be whether to offload the entire DNN or not, depending on the hardware capability of the device, the size of the data, the network quality, and the DNN model, among other factors. Reference papers covering binary offloading of deep learning include DeepDecision [200], [201] and MCDNN [202]. The authors of these papers based their studies on empirical measurements of trade-offs between the different aforementioned parameters. The binary offloading has been thoroughly investigated in the literature for different contexts. However, the DNN offloading

BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2397

Fig. 22. Resource management for distributed inference.

has a particular characteristic that distinguishes it from other networking tasks, namely the freedom to choose the type, the parameters, and the depth of the neural network, according to the available resources.

As the scope of this survey is the pervasive AI, we focus on the partial offloading that covers the per-layer distribution applying one or multiple splitting points, along with the per-segment distribution.

a) Per-layer distribution - one split point: The partial offloading leverages the unique structure of the deep model, particularly layers, to allow the collaborative inference between the source device and the remote servers. More specifically, in such an offloading approach, some layers are executed in the data-generating device whereas the rest are computed by the cloud or the edge servers, as shown in Fig. 22 (b). In this way, latency is potentially reduced owing to the high computing cycles of the powerful remote entities. Furthermore, the latency to communicate the intermediate data resulting from the DNN partitioning should lead to an overall classification time benefit. The key idea behind the per-layer partitioning is that after the shallow layers, the size of the intermediate data is relatively small compared to the original raw data thanks to the sequential filters. This can speed up the transmission over the network, which motivates the partition at deep layers.

Neurosurgeon [203] is one of the first works that investigated layer-wise partitioning, where the split point is decided intelligently depending on the network conditions. Particularly, the authors examined deeply the status quo of the cloud and in-device inference and confirmed that the wireless network is the bottleneck of the cloud approach and that the mobile device can outperform the cloud servers only when holding a GPU unit. As a next step, the authors investigated the DNN split performance in terms of computing and output data size of multiple state-of-the-art DNNs over multiple types of devices and wireless networks and concluded that layers have significantly different characteristics. Based on the computation and the latency to transmit the output data of the DNN layers, the optimal partition points that minimize the energy consumption and end-to-end latency are identified. Finally, after collecting these data, Neurosurgeon is trained to predict the power consumption and latency based on the layer type and network configuration and dynamically partition the model between the data source and the cloud server.

However, while the DNN splitting significantly minimizes the inference latency by leveraging the computational resources of the remote server, this strategy is constrained by the characteristics of intermediate layers that can still generate high-sized data, which is the case of VGG16 illustrated in Fig. 23.

Fig. 23. The transmitted data size between different layers of the VGG16 network.

To tackle the problem of large intermediate data, the authors of [204] proposed to combine the early-exit strategy, namely BranchyNet [205], with their splitting approach. The objective is to execute only a few layers and exit the model without resorting to the cloud, if the accuracy is satisfactory. In this way,


the model inference is accelerated, while sacrificing the accuracy of the classification. We note that BranchyNet is a model trained to tailor the right size of the network with minimum latency and higher accuracy. Accordingly, both models cooperate to select the optimal exit and split points. The authors extended the work by replacing both trained models with a reinforcement learning strategy [206], namely Boomerang. This RL approach offers a more flexible and adaptive solution for real-time networks and presents a less complex and more optimal selection of split and exit points. The early-exit strategy is also proposed along with the layer-wise partitioning by the ADDA approach [207], where the authors implemented the first layers on the source device and encouraged the exit point before the split point to use only local computing and eliminate the transmission time. Similarly, authors in [208] formulated the problem of merging the exit point selection and the splitting strategy, while aiming to minimize the transmission energy instead of focusing on latency.

In addition to using the early-exit to accelerate the inference, other efforts adopted compression combined with the partitioning to reduce the shared data between collaborating entities. Authors in [209] introduced a distribution approach with feature space encoding, where the edge device computes up to an intermediate layer, compresses the output features (lossless or lossy), and offloads the compressed data to the host device to compute the rest of the inference, which enhances the bandwidth utilization. To maintain high accuracy, the authors proposed to re-train the DNN with the encoded features on the host side. The works in [210], [211] also suggested compressing the intermediate data through quantization, aiming at reducing the transmission latency between edge and cloud entities. The authors examined the trade-off between the output data quantization and the model accuracy for different partitioning scenarios. Then, they accordingly designed a model to predict the edge and cloud latencies and the communication overhead. Finally, they formulated an optimization problem to find the optimal split layer constrained by the accuracy requirements. To make the solution adaptive to runtime, an RL-based channel-wise feature compression, namely JALAD, is introduced by the authors in [210]. Pruning is another compression technique, proposed in [212] to be joined with the partitioning strategy. The authors introduced a 2-step pruning framework, where the first step mainly focuses on the reduction of the computation workload and the second one handles the removal of non-important features transmitted between collaborative entities, which results in less computational and offloading latency. This can be done by pruning the input channels, as their height, length, and number directly impact the size of the output data and the computation requirements, as we illustrated in Table V.

b) Per-layer distribution - back and forth, and hierarchical distribution: Solely offloading the deep learning computation to the cloud can violate the latency constraints of AI applications requiring real-time and prompt intervention. Meanwhile, using only the edge nodes or IoT devices can deprive the system of powerful computing resources and potentially increase the processing time. Hence, a judicious selection of multiple cuts and distribution between different resources, i.e., IoT device, edge server, and cloud, contributes to establishing a trade-off between minimizing the transmission time and exploiting the powerful servers. Additionally, the layers of the DNN model are not always stacked in a sequential dependency. More specifically, layers can be arranged in a general dependency as shown in Fig. 19 (c), where some of them can be executed in parallel or do not depend on the output of the previous ones. In this case, adopting an optimized back and forth distribution strategy, where the end-device and the remote servers parallelize the computation of the layers and merge the output, can be beneficial for the inference latency. Authors in [213] designed a Dynamic Adaptive DNN Surgery (DADS) scheme that optimally distributes complex structured deep models, presented by DAG graphs, under variable network conditions. In case the load of requests is light, the min-cut problem [214] is applied to minimize the overall delay of processing one frame of the DNN structure. When the load condition is heavy, scheduling the computation of multiple requests (data parallelization) is envisaged using the 3-approximation ratio algorithm [215] that maximizes the parallelization of the frames from different requests. Complex DNN structures were also the focus in [216], where the authors used the shortest path problem to formulate the allocation of different frames of the DNN back and forth between the cloud and the end-device. The path, in this case, is defined as the latency or energy of the end-to-end inference.

On the other hand, the hierarchical architecture for sequential structures is very popular as a one-way distribution solution to establish a trade-off between transmission latency and computation delay (see Fig. 22 (c)). The papers in [217], [218], [219] proposed to divide the trained DNN over a hierarchical distribution, comprising "IoT-edge-cloud" resources. Furthermore, they adopted the state-of-the-art work BranchyNet [205] to early-exit the inference if the system has a good accuracy. In this way, fast, private, and localized inference using only shallow layers becomes possible at the end and edge devices, and an offloading to the cloud is only performed when additional processing is required. Hierarchical distribution can also be combined with compression strategies to reduce the size of the data to be transmitted and accordingly minimize the communication delay and the time of the entire inference, such as using the encoding techniques as done in [220]. Authors in [221], [222] also opted for hierarchical offloading, while focusing primarily on the fault-tolerance of the shared data. Particularly, authors in [221] considered two fault-tolerance methods, namely reassigning and monitoring, where the first one consists of assigning all layers' tasks at least once; then, the unfinished tasks are reassigned to all participants regardless of their current state. This method generates a considerable communication and latency overhead related to allocating redundant tasks, particularly to devices with limited capacities. Hence, a second strategy is designed to monitor the availability of devices before the re-assignment. Finally, the work in [222] proposed to add skip blocks [3] to the DNN model and include at least one block in each partition, to enhance the robustness of the system in case the previous layer connection fails.

c) Per-segment distribution: The per-segment partitioning is generally more popular when distributing the inference


among IoT devices with limited capacities, as some devices, such as sensors, cannot execute an entire layer of a deep network. Furthermore, per-segment partitioning creates a huge dependency between devices; consequently, multiple communications with remote servers are required. That is why only few works adopted this strategy for inference collaboration between end devices and edge/fog servers, including [223]. Authors in [223] proposed a spatial splitting (see Fig. 21 (b)) that minimizes the communication overhead per device. Then, a distribution solution is designed based on the matching theory [227] and the swap matching problem [228], to jointly accomplish the DNN inference. The matching theory is a mathematical framework in economics that models interactions between two sets of selfish agents, each one competing to match agents of the other set. The objective was to reduce the total computation time while increasing the utilization of the resources related to the two sets of IoT and fog devices.

2) Localized Collaboration: Another line of work considers the distribution of DNN computation across multiple edge participants, as shown in Fig. 22 (d). These participants present neighboring nodes that co-exist in the same vicinity, e.g., IoT devices or fog nodes. The model distribution over neighboring devices can be classified into two types: the per-layer distribution, where each participant performs the computation of one layer or more, and the per-segment allocation, where smaller segments of the model are allocated to resource-limited devices.

a) Per-layer distribution: The layer-wise partitioning can itself be classified under two categories, the one-splitting-point strategy where only two participants are involved and the multiple-splitting-points strategy where two or more devices are collaborating. For example, the DeepWear [225] approach splits the DNN into two sub-models that are separately computed on a wearable and a mobile device. First, the authors conducted in-depth measurements on different devices and for multiple models to demystify the performance of the wearable-side DL and study the potential gain from the partial offloading. The derived conclusions are incorporated into a prediction-based online scheduling algorithm that judiciously determines how and when to offload, in order to minimize the latency and energy consumption of the inference. On the other hand, authors in [193] proposed a methodology for optimal placement of CNN layers among multiple IoT devices, while being constrained by their computation and memory capacities. This methodology minimizes the latency of decision-making, which is measured as the total of processing times and transmissions between participants. Furthermore, this proposed technique can be applied both to CNNs in which the number of layers is fixed and CNNs with an early-exit. Similarly, authors in [224] proposed a CNN multi-splitting approach to accelerate the inference process, namely AAIoT. Unlike the above-mentioned efforts, AAIoT deploys the layers of the neural network on a multi-layer IoT architecture. More specifically, the lower-layer device presents the data source, and the higher-layer devices have more powerful capacities. Offloading the computation to higher participants implies sacrificing the transmission latency to reduce the computation time. However, delivering the computation to lower participants does not bring any benefit to the system. An optimal solution and an online algorithm that uses dynamic programming are designed to make the best architectural offloading strategy. Other than capacity-constrained IoT devices, the distribution of the inference process over cloudlets in a 5G-enabled MEC system is the focus of the work in [194], where authors proposed to minimize the energy consumption, while meeting stringent delay requirements of AI applications, using an RL technique.

b) Per-segment distribution: The per-segment distribution is defined as allocating a fine-grained partitioned DNN on lightweight devices such as Raspberry Pis. The partitioning strategy is based on the system configuration and the pervasive network characteristics, including the memory, computation, and communication capabilities of the IoT devices and their number. The segmentation of the DNN models varies from neurons partitioning to channels, spatial, and filters splitting, as discussed in section VI-A. For example, the work in [199] opted for the spatial splitting (see Fig. 21 (b)), where the input and the output feature maps are partitioned into a grid and distributed among lightweight devices. The authors proposed to allocate the cells along the longer edge of the input matrix (rows or columns) to each participant, in order to reduce the padding overhead produced by the spatial splitting. Different segments are distributed to IoT devices according to load-balancing principles using the MapReduce model. The same rows/columns partitioning is proposed in [229], namely the data-lookahead strategy. More specifically, each block contains data from other blocks within the same layer such that its connected blocks in subsequent layers can be executed independently without requesting intermediate/padding data from other participants. The spatial splitting is also adopted in [198], where the authors proposed a Fused Tile Partitioning (FTP) method. This method fuses the layers and divides them into a grid. Then, cells connected across layers are assigned to one participant, which largely reduces the communication overhead and the memory footprint.

The previous works introduced homogeneous partitioning, where segments are similar. Unlike these strategies, authors in [226], [230] proposed a heterogeneous partitioning of the input data to be compatible with IoT systems containing devices with different capabilities, ranging from small participants that fit only a few cells to high-capacity participants suitable for layer computation. For the same purpose, authors in [231] jointly conducted per-layer and per-segment partitioning, where the neurons and links of the network are modeled as a DAG. In this work, grouped convolutional techniques [232] are used to boost the model parallelization of different nodes of the graph. The papers in [196], [197], [233], [234] studied different partitioning strategies of the convolutional layers (channel, spatial, and filter splitting) and fully connected layers (output and input splitting). Next, they emphasized that an optimal splitting depends greatly on the parameters of the CNN network and that the inference speedup depends on the number of tasks to be parallelized, which is related to the adopted splitting method. Hence, one partitioning approach cannot bring benefits to all types of CNNs. Based on these conclusions, a dynamic heuristic is designed to select the most


TABLE VI
PERFORMANCE OF DISTRIBUTION STRATEGIES COMPARED TO: CLOUD ONLY; ON-DEVICE ONLY; EDGE-SERVER ONLY

adequate splitting and model parallelism for different inference scenarios.

Table VI shows the performance of these techniques in terms of latency, bandwidth, energy, computation, memory, and throughput, whereas Table VII presents a comparison between the different distributed inference techniques introduced in this section.

3) Lessons Learned: The lessons acquired from the literature review covering the DNN distribution can be summarized as follows:
• The per-layer strategy with remote collaboration is the most studied approach in the literature, owing to its simple splitting scheme and its assets in using high-performance servers while reducing the transmitted data. However, such strategies may not be efficient in terms of privacy or for networks with unstable transmission links, and hence may not be suitable for all applications.
• In per-layer strategies, selecting the split points depends on multiple parameters, which are the capacity of the end device that constrains the length of the first segment, the characteristics of the network (e.g., Wi-Fi, 4G, or LTE) that impact the transmission time, and the DNN topology that determines the intermediate data size.
• Deep neural networks with a small reduction capacity of pooling layers or with fully-connected layers of similar sizes undergo small variations in the per-layer data size. In this case, remote collaboration is not beneficial for data transmission. Hence, compression (e.g., quantization, pruning, and encoding) can be a good solution to benefit from the remote capacity with the minimum communication overhead.
• Recently, even if it is still not mature yet, multiple efforts have focused on the localized inference through per-segment distribution, which allows involving resource-limited devices and avoiding the transmission to remote servers. These works targeted model parallelization and aimed to maximize the concurrent computation of different segments within the same request. However, fewer works covered data parallelization and real-time adaptability to the dynamics and number of requests. Particularly, the load of inferences highly impacts the distribution of segments to fit them to the capacity of participants.
• Adopting a mixed partitioning strategy is advantageous for heterogeneous systems composed of high- and low-capacity devices and multiple DNNs, which allows fully utilizing the pervasive capacities while minimizing the dependency and data transmission between devices.

C. Use Case: Distribution on Moving Robots

Currently, robotic systems have been progressively converging to computationally expensive AI networks for tasks like path planning and object detection. However, resource-limited robots, such as low-power UAVs, have insufficient on-board battery power or computational resources to scalably execute the highly accurate neural networks.

The work in [240], [241] examined the case of per-layer distribution with one split point between one UAV and one MEC server (see Fig. 24). More specifically, the authors proposed a framework for an AI-based visual target tracking system, where low-level layers of the DNN classifier are deployed in the UAV device and high-level layers are assigned to the remote servers.
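Across the per-layer works above (Neurosurgeon [203] and the UAV framework of [240], [241] alike), the one-split-point decision reduces to a search over candidate split layers. A minimal sketch, assuming per-layer latency profiles and intermediate output sizes are available as plain lists (the function names and the additive latency model are illustrative assumptions, not taken from the surveyed implementations):

```python
def best_split(device_lat, server_lat, out_size, bandwidth, input_size):
    """Pick the layer index after which to offload so that end-to-end
    latency (device part + upload + server part) is minimal.

    device_lat[i] / server_lat[i]: latency of layer i on the device / server.
    out_size[i]: size of layer i's intermediate output (same units as
    input_size); bandwidth: achievable uplink rate.
    Split s = 0 means full offload (the raw input is uploaded);
    split s = len(device_lat) means fully local inference (no upload).
    """
    n_layers = len(device_lat)
    best_s, best_t = 0, float("inf")
    for s in range(n_layers + 1):
        if s == n_layers:
            upload = 0.0  # fully local: nothing left to offload
        else:
            upload = (input_size if s == 0 else out_size[s - 1]) / bandwidth
        total = sum(device_lat[:s]) + upload + sum(server_lat[s:])
        if total < best_t:
            best_s, best_t = s, total
    return best_s, best_t

# Toy profile: a fast server behind a slow link favors splitting right
# after the layer whose intermediate output is smallest.
split, latency = best_split([10, 10, 10], [1, 1, 1], [100, 5, 5], 10, 1000)
# split == 2: run two layers locally, ship the small 5-unit feature map
```

Richer schemes (e.g., the prediction- and RL-based ones surveyed above) replace this exhaustive search with a learned decision, but the objective being minimized is the same.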


TABLE VII
COMPARISON BETWEEN DISTRIBUTED INFERENCE TECHNIQUES

The classification can be performed using only the low-level layers, if the image quality is good. Otherwise, the output of these layers should be further processed in the MEC server, for higher accuracy. In this context, the authors formulated a weighted-sum cost minimization problem for binary offloading and partial offloading, while taking into consideration the error rate/accuracy, the data quality, the communication bandwidth, and the computing capacity of the MEC and the UAV.
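Such a weighted-sum decision rule can be sketched as follows (the weights, units, and candidate decisions below are illustrative assumptions; the actual formulation in [240], [241] additionally models the data quality and channel state):

```python
def weighted_cost(latency, energy, error_rate, w_t, w_e, w_a):
    """Scalarized cost of one offloading decision."""
    return w_t * latency + w_e * energy + w_a * error_rate

def choose(decisions, weights):
    """decisions: {name: (latency, energy, error_rate)} for each candidate
    (e.g., local-only, binary offload, or a partial split ratio).
    weights: (w_t, w_e, w_a) reflecting the application's priorities.
    Returns the name of the minimum-cost decision."""
    w_t, w_e, w_a = weights
    return min(decisions,
               key=lambda d: weighted_cost(*decisions[d], w_t, w_e, w_a))

# Illustrative numbers only: seconds, joules, and error probability.
candidates = {
    "local":   (0.9, 0.2, 0.15),  # slow UAV CPU, cheap radio, worse accuracy
    "offload": (0.4, 0.5, 0.05),  # fast MEC, costly transmission
    "partial": (0.5, 0.3, 0.08),  # split: some layers on-board, rest remote
}
decision = choose(candidates, (1.0, 1.0, 1.0))  # -> "partial"
```

With equal weights, the partial split wins in this toy setting because it balances latency and energy; shifting weight onto the error term would favor full offloading, mirroring the trade-offs discussed next.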


Fig. 24. A fire detection scenario with distributed DNN.

The offloading probability is derived for the binary offloading, and the offloading ratio (i.e., the segment of the DNN to execute in the MEC) is obtained for the partial offloading scheme. In this model, the mobility of the UAVs (i.e., the distance between the UAV and the server) is involved through the transmission data rate between the device and the MEC. Additionally, the distance between the UAV and the target impacts the quality of the image and consequently impacts the offloading decisions. In the proposed framework, multiple trade-offs are experienced:
• The accuracy is achieved at the expense of delay and transmitted data: if most of the images have bad quality, the system is not able to accomplish a low average latency, as on-board inference is not sufficient. For this reason, different inferences should be extended wisely using the segment allocated in the MEC, particularly if the environment is challenging, such as bad weather, or when the target is highly dynamic.
• A trade-off also exists between the accuracy and latency, and the position of the UAVs: when the device is close to the target, high-resolution images can be taken, which allows obtaining good accuracy on-board and avoiding the data offloading. Being close to the targets is not always possible, particularly in harsh environments or when the surveillance should be hidden.
• The battery life is increased at the expense of the inference latency: the battery can be saved if the processing coefficient is decreased, which enlarges the computation time of the classification.
• The split point selection: if the intermediate data is smaller than the raw data, the offloading is encouraged to enhance the accuracy.
An online solution of this offloading trade-off using reinforcement learning is presented in [242].

The previous works adopted the layer-wise approach with one split point and remote collaboration. This strategy is more adequate for flying devices that can enhance their link quality by approaching the MEC stations. However, for ground robots, offloading segments of the inference to remote servers costs the system a large transmission overhead and high energy consumption. Authors in [243] studied the distribution of the DNN network among ground robots and profiled the energy consumed for such tasks, when moving or being in idle mode. Several conclusions are stated:
• When the robot is idle, the DNN computation and offloading increase the power consumption of the device by 50%.
• If the device is moving, the DNN execution causes high spikes in power consumption, which may prevent the device from attaining a high performance, as this variation incurs a frequent change of the power-saving settings in the CPU.
• Distributing the inference contributes to reducing the energy consumed per device, even though the total power consumption is higher. This is due to the reduced computation and memory operations per device and the idle time experienced after offloading the tasks.
Based on the energy study of moving robots, the authors proposed to distribute the DNN model into smaller segments among multiple low-power robots to achieve an equilibrium of performance in terms of energy and number of executed tasks [191]. Still, the distribution of the model into small segments (e.g., filter splitting) requires the intervention of a large number of robots that are highly dependent, which is not realistic.

VII. PRIVACY OF PERVASIVE AI SYSTEMS

Even though pervasive AI has presented unprecedented opportunities to empower IoT applications, it gave rise to novel security and privacy concerns. In fact, if servers and participants are not controlled or owned by one operator, they are considered malicious by nature. Particularly, sensitive information can be leaked while sharing intermediate data or updates between participants. Moreover, an untrusted participant can alter the local data or send wrong parameters to slow the learning or mislead the system. In this section, we overview the privacy (i.e., one of the devices revealing private information about others) and security (i.e., one of the devices injecting false information to disrupt the collective behavior of the devices) challenges, and we survey different approaches that address these issues.

A. Privacy for Pervasive Training

1) Privacy and Security Challenges: In some federated learning settings, participants can randomly join or leave the training process, which raises various vulnerabilities from different sources, including malicious servers, insider adversaries, and outsider attackers. More specifically, aggregation servers can be honest but curious to inspect the models without introducing any changes. On the other hand, potentially malicious servers [244], as well as untrusted participants, can tamper with the model during the learning rounds or dissimulate participation in order to obtain the final aggregated model without actually contributing any data or contributing only a small number of samples. This attack is called free-riding [245]. Outsider eavesdroppers can also intercept the communication channels between trusted devices and the server to spoof the model or inject noisy data (data poisoning). Authors in [246] proved that it is possible to extract sensitive

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2403

information from a trained model, as it implies the correlation between the training samples. The research work in [247] showed that the confidence information returned by ML classifiers introduces new model inversion attacks that enable the adversary to reconstruct samples of training subjects with high accuracy. Inferring sensitive information is also possible through querying the prediction model.

Bandit and MARL algorithms are also prone to attacks if one of the agents is compromised. In general, if any of the agents starts communicating false data (i.e., false data injection attacks), the regret guarantees in bandits and the convergence behavior in MARL no longer hold. In addition to these expected effects, recent works in [248], [249] have demonstrated that a malicious agent may not only be disruptive but can also actively sway the policy toward malicious objectives by driving other agents to reach a policy of its choice.

2) Defense Techniques and Solutions:

a) Differential privacy (DP): DP is a data perturbation technique that was first introduced for ML in [250]. In DP, statistical noise is injected into the sensitive data to mask it, and an algorithm is called differentially private if its output cannot give any insight or reveal any information about its input. DP has been widely used to preserve the privacy of the learning, although it is often criticized for its effect on the accuracy of the results due to noise growth. Hence, a careful calibration between the privacy level and the model usability is needed. The use of differential privacy for distributed learning systems has become a very active research area. The authors in [251] proposed a differentially private stochastic gradient descent technique [252] that adds random noise (e.g., using the Gaussian mechanism) to the trained parameters before sending them to the aggregation server. Then, during the local training, each participant keeps calculating the probability that an attacker succeeds in exploiting the shared data, until reaching a predefined threshold at which it stops the process. Moreover, in each round, the aggregation server chooses random participants. In this way, neither the local parameters nor the global model can be exploited, as the attacker has no information about the devices participating in the current round. DP has also been used to ensure the agents' privacy in federated bandits [253], where the authors considered the federated bandit formulation with contextual information (in both the centralized-aggregator and p2p communication styles). The authors provided regret and privacy guarantees so that the peers, or the central aggregator, do not learn individual agents' samples.

While concealing the agents' contributions during the training, a trade-off between privacy and learning performance should be established. In this context, authors in [254] tested the performance of FL applied to real-world healthcare datasets while securing the private data using differential privacy techniques. Results show that a significant performance loss is witnessed, even though the privacy level is increased. This encouraged the research community to propose alternative approaches to ensure privacy in federated learning.

b) Homomorphic encryption (HE): HE is a form of encryption that allows computational operations to be performed on ciphertexts while yielding the same results as would be generated on the original data. The approach in [255] and [256] ensures the integrity of the DL training process against outsider attackers as well as an honest-but-curious server. The key idea is to encode and compress the parameters of the trained neural networks before sharing them with the server. Then, these aggregated updates are directly computed with a decoder on the server. This guarantees their privacy during the communication and after decoding. Although the encryption technique can preclude the server from extracting information from local models, it costs the system more communication rounds and cannot prevent collusion between the server and a malicious participant. To solve this problem, authors in [257] proposed to adopt a hybrid solution that integrates both lightweight homomorphic encryption and differential privacy. In this work, intentional noise is added to perturb the original parameters in case the curious server colludes with one of the participants to obtain the encryption parameters.

Even though encryption is a robust approach to achieve privacy preservation in many applications, its adoption for deep learning faces various challenges, as it can only be deployed on tasks of certain sizes and complexities. In other words, fully homomorphic schemes are still not efficient for practical use.

c) Blockchain-based solutions: Blockchain is a recent distributed ledger system, initially designed for cryptocurrency and later increasingly applied to IoT systems, where a record of transactions is deployed distributively in a peer-to-peer network [258]. Authors in [259] proposed to use a blockchain-based communication scheme to exchange updates in a distributed ML system, with the aim of leveraging the blockchain's security features in the learning process. In such a practice, local models are shared and verified in the trusted blockchain network. Furthermore, this framework can prevent participants from free-riding, as their updates are checked and they receive rewards proportional to the number of trained data samples. However, in contrast to vanilla FL, Block FL needs to take into consideration the extra delay incurred by the blockchain network. To address this, Block FL is formulated by considering the communication, the computation, and the block generation rate, i.e., the proof-of-work difficulty. A possible drawback of this approach is its vulnerability to any latency increase. Also, the use of blockchain implies a significant cost to implement and maintain miners.

d) Secure multi-party computation (SMC): SMC is a subfield of cryptographic protocols whose goal is to keep all data except the output secret when multiple participants jointly compute an arbitrary function over their private inputs. A study in [260] has used SMC to build FL systems. The proposed protocols consider secret sharing, which adds a new round at the beginning of the process for key sharing, a double-masking round that protects against a potentially malicious server, and a server-mediated key agreement that minimizes trust.

e) Prevention against data poisoning: Data poisoning is one of the most destructive attacks on ML: an attacker injects poisoned data (e.g., mislabeled samples and wrong parameters) into the dataset, which can mislead the learning process. Authors in [261] and [262] propose secure decentralized techniques to protect the learning against data
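The pairwise-masking idea behind such secret-sharing protocols can be sketched as follows (a minimal illustration: dropout recovery and the double-masking round of the full protocol are omitted, and the values and modulus are illustrative):

```python
import random

# Every pair (i, j) agrees on a random mask that device i adds and
# device j subtracts, so the masks cancel in the server-side sum and
# individual inputs stay hidden from the server.
def masked_inputs(values, modulus=2**16):
    n = len(values)
    masks = {(i, j): random.randrange(modulus)
             for i in range(n) for j in range(i + 1, n)}
    out = []
    for i, v in enumerate(values):
        m = v
        for j in range(n):
            if i < j:
                m = (m + masks[(i, j)]) % modulus
            elif j < i:
                m = (m - masks[(j, i)]) % modulus
        out.append(m)
    return out

values = [7, 11, 24]
shares = masked_inputs(values)
# The server only sees masked shares, yet their sum is the true sum.
assert sum(shares) % 2**16 == sum(values) % 2**16
```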
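The noise-injection step of the differentially private SGD scheme discussed above can be sketched as follows (the clipping bound and noise multiplier are illustrative values; a production system would also track the cumulative privacy budget):

```python
import numpy as np

# Gaussian-mechanism step of DP-SGD: per-example gradients are clipped
# to bound their sensitivity, then Gaussian noise is added before the
# averaged update leaves the participant.
def privatize_gradients(per_example_grads, clip_norm=1.0,
                        noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=mean.shape)
    return mean + noise  # what is sent to the aggregation server

grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]
update = privatize_gradients(grads)
assert update.shape == (2,)
```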

2404 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

poisoning, as well as other system attacks. A zero-sum game is proposed to formulate the conflicting objectives between honest participants that utilize Distributed Support Vector Machines (DSVMs) and a malicious attacker that can change data samples and labels. This game characterizes the contention between the honest learner and the attacker. Then, a fully distributed and iterative algorithm is developed based on the Alternating Direction Method of Multipliers (ADMoM) [263] to procure the instantaneous responses of the different agents. Blockchain-based solutions can also be used to protect the FL system from data poisoning attacks.

f) Other techniques: Most of the aforementioned techniques protect the private data from outsider attackers while assuming that the server is trustful and the participants are honest. However, one malicious insider can cause serious privacy threats. Motivated by this challenge, authors in [264] proposed a collaborative DL framework to solve the problem of internal attackers. The key idea is to select only a small number of gradients to share with the server and, similarly, to receive only a part of the global parameters, instead of uploading and updating the whole set of parameters. In this way, a malicious participant cannot obtain the whole information and hence cannot infer it. However, this approach suffers from accuracy loss. Furthermore, authors in [265] presented a new attack based on Generative Adversarial Networks (GANs) that can infer sensitive information from a victim participant even with just a portion of the shared parameters. In the same context, a defense approach based on GANs is designed by authors in [266], in which participants generate artificial data that can replace the real samples. In this way, the trained model is called a federated generative model, and the private data parameters are not exposed to external malicious devices. Still, this approach can lead to potential learning instability and performance reduction due to the fake data used in the training.

B. Privacy for Pervasive Inference

1) Privacy and Security Challenges: The data captured by end-devices and sent to remote servers (e.g., from cameras or sensors to cloud servers) may contain sensitive information, such as camera images, GPS coordinates of critical targets, or vital signs of patients. Exposing these data has become a big security concern for the deep learning community. This issue is even more concerning when the data is collected from a small geographical area (e.g., edge computing) involving a set of limited and cooperating users. In fact, if an attacker reveals some data (even public or slightly sensitive), a DL classifier can be trained to automatically infer the private data of a known community. These attacks, posing severe privacy threats, are called inference attacks: they analyze trivial or available data to illegitimately acquire knowledge about more sensitive information without accessing it, by only capturing their statistical correlations. An example of a popular inference attack is the Cambridge Analytica scandal in 2016, where public data of Facebook users were exploited to predict their private attributes (e.g., political views and location). Some well-known inference attacks are summarized in Table VIII.

TABLE VIII: EXAMPLES OF INFERENCE ATTACKS

Edge computing naturally enhances the privacy of sensitive information by minimizing the data transfer to the cloud through the public Internet. However, additional privacy techniques should be adopted to further protect the data from eavesdroppers. In this context, in addition to its ability to allow the pervasive deployment of neural networks, DNN splitting has also been used for privacy purposes. That is, by partitioning the model, partially processed data is sent to the untrusted party instead of transmitting raw data. In fact, in contrast to the training data, which belongs to a specific dataset and generally follows a statistical distribution, the inference samples are random and harder to revert. Furthermore, the model parameters are independent of the input data, which makes the inference process reveal less information about the sample [46]. While preserving privacy, the inevitable challenge of DNN partitioning that remains valid is selecting the splitting point that meets the latency requirements of the system.

2) Defense Techniques and Solutions:

a) Features extraction: Authors in [272] proposed to extract, from the original image or from one of the layers' outputs, the features sufficient and necessary to conduct the classification, using an encoder, and to transmit these data to the centralized server for inference. This approach prevents the exposure of irrelevant information to the untrusted party, which may use it for unwanted inferences. The work in [273] also proposed feature extraction for data privacy, while achieving a trade-off between the on-device computation, the size of the transmitted data, and the security constraints. In fact, selecting the split layer from which the data will be extracted intrinsically presents a security compromise. Particularly, as we go deeper in the DNN network, the features become more task-specific, and the irrelevant data that can involve sensitive information are mitigated [278]. Hence, if the split is performed at a deep layer, the privacy is more robust and the transmission overhead is lower; however, a higher processing load is imposed on the source device. The latter work [273], along with the work in [46], advised performing a deep partition in case the source device has enough computational capacity. If the source device is resource-constrained, the model should be partitioned at the shallow layers, although most of the output features are not related to the main task. Authors in [273] proposed a solution based on Siamese fine-tuning [279] and dimensionality reduction to manipulate the intermediate data and send only
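As a simple illustration of why robust aggregation blunts poisoned updates (this is a generic coordinate-wise-median defense, not the game-theoretic DSVM formulation of [261], [262]; the update values are illustrative):

```python
import numpy as np

# Replacing the mean with the coordinate-wise median bounds the influence
# of a minority of malicious clients on the aggregated model update.
honest = [np.array([1.0, 1.1]), np.array([0.9, 1.0]), np.array([1.1, 0.9])]
poisoned = [np.array([100.0, -100.0])]          # attacker's update
updates = honest + poisoned

mean_agg = np.mean(updates, axis=0)             # dragged away by the attacker
median_agg = np.median(updates, axis=0)         # stays near the honest updates

assert abs(median_agg[0] - 1.0) < 0.2
assert abs(mean_agg[0] - 1.0) > 10
```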
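The selective sharing idea of [264] can be sketched as uploading only the largest-magnitude fraction of the local gradient (the exact selection rule of [264] may differ; the 10% ratio here is illustrative):

```python
import numpy as np

# Each participant uploads only a small, high-magnitude subset of its
# gradient, so neither the server nor a malicious peer sees the full update.
def top_k_share(grad, ratio=0.1):
    k = max(1, int(ratio * grad.size))
    idx = np.argsort(np.abs(grad))[-k:]          # indices of largest entries
    return idx, grad[idx]                        # sparse update to upload

grad = np.array([0.02, -1.5, 0.3, 0.01, 0.9, -0.05, 0.4, 0.0, -0.2, 0.1])
idx, vals = top_k_share(grad)
assert idx.size == 1 and idx[0] == 1            # the -1.5 entry dominates
```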
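The DNN-splitting idea can be sketched with a toy two-segment model in which only intermediate activations, never the raw input, leave the device (the random weights below stand in for a trained network):

```python
import numpy as np

# Split inference: the end-device runs the shallow layers and ships only
# the intermediate features; the untrusted server finishes the prediction.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(4, 3))

def device_part(x):                 # shallow layers on the end-device
    return np.maximum(x @ W1, 0.0)  # ReLU activation

def server_part(features):          # deep layers on the remote server
    logits = features @ W2
    return int(np.argmax(logits))

x = rng.normal(size=(8,))
label = server_part(device_part(x)) # only features cross the network
assert 0 <= label < 3
```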


the primary measures without any irrelevant information. In addition to enhancing privacy, this mechanism contributes to reducing the communication overhead between the end-device and the remote server.

However, the arms race between attacks and defenses for DNN models has come to the forefront: the amount of extracted features can be sufficient for adversarial approaches to recover the original image, whereas fewer shared features may also result in low classification accuracy. The works in [46], [280], [281] proposed adversarial attacks to predict the inference input data (or the trained model) using only the available features from the outputs shared between participants. Authors in [46] focused particularly on the privacy threats presented by DNN distribution and, accordingly, designed a white-box attack that assumes the structure of the trained model is known and inverts the intermediate data through regularized Maximum Likelihood Estimation (rMLE). Additionally, a black-box attack is also proposed, where the malicious participant only has knowledge of its own segment and attempts to design an inverse DNN network to map the received features to the targeted input and recover the original data. The authors demonstrated that reversing the original data is possible when the neural system is distributed into layers.

b) Noise addition: Adding noise to the intermediate data is adopted in [274]. In this paper, the authors proposed to perform a simple data transformation on the source device to extract relevant features and add noise. Next, these features, extracted from shallow layers, are sent to the cloud to complete the inference. To maintain a high classification accuracy, the neural network is re-trained with a dataset containing noisy samples. However, adding noise to the intermediate data costs the system additional energy consumption and computational overhead. Therefore, the splitting should be done at a layer where the output size is minimal, although the latter work did not describe the partition strategy. The Shredder approach [275] resolved this dilemma by considering the computation overhead during the noise injection process. The idea is to conduct an offline machine learning training to find the noise distribution that strikes a balance between privacy (i.e., information loss) and accuracy. In this way, the DNN model does not require retraining with the noisy data, and the network can be cut at any point to directly apply the noise distribution. The partitioning decision is based on the communication and computation costs. A higher privacy level and lower communication overhead are guaranteed when the split is performed at deep layers; however, the allocation at the end-device becomes less scalable. Adding noise or extracting task-specific data can be included under the umbrella of differential privacy, which at a high level ensures that the model does not reveal any information about the private input data, while still presenting a satisfactory classification. The performance of differential privacy is assessed by a privacy budget parameter ε that denotes the level of distinguishability. Authors in [276] conducted a theoretical analysis to minimize ε, while considering the accuracy and the communication overhead of offloading the intermediate features among fog participants.

c) Cryptography: Cryptography is another technique that can be used to protect the distributed inference. The main idea is to encrypt the input data and process it using a model trained on an encrypted dataset, in such a way that the intermediate data cannot be used by a malicious participant. Little research, including [277], investigated encrypted DNN distribution, as this approach suffers from a prohibitive computation and communication overhead that exacerbates the complexity of the inference process, particularly when executed on resource-constrained devices.

d) Distribution for privacy: All the previous techniques applied additional tasks to secure the shared data, e.g., feature extraction, noise addition, and encryption, which overloads the pervasive devices with computational overhead. Differently from previous works, DistPrivacy [239] used the partitioning scheme itself to guarantee the privacy of the data. In fact, all the existing privacy-aware approaches adopted the per-layer distribution of the DNN model. This partitioning strategy incurs intermediate shared information that can easily be reverted using adversarial attacks. The main idea in [239] is to divide the data resulting from each layer into small segments and distribute them to multiple IoT participants, which contributes to hiding the properties of the original image, as each participant holds only a small amount of information. Particularly, the authors adopted the filter splitting strategy, in such a way that each device computes only a part of the feature maps. However, as stated in Section VI-A, this partitioning strategy results in large data transmission between participants. Therefore, the authors formulated an optimization that establishes a trade-off between privacy and communication overhead.

TABLE IX: COMPARISON BETWEEN PRIVACY-AWARE DISTRIBUTION STRATEGIES (H: HIGH, M: MEDIUM, L: LOW)

Table IX illustrates the different privacy-aware strategies for distributed inference existing in the literature and shows their performance. We can see that choosing the adequate strategy
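The noise-injection idea can be sketched as perturbing the intermediate features under a privacy budget ε before offloading them (a hedged illustration: the clipping range, ε value, and per-coordinate sensitivity below are illustrative choices, not the exact calibration of [274], [275], [276]):

```python
import numpy as np

# Laplace noise, scaled by sensitivity / epsilon, is added to the
# intermediate activations before they leave the source device.
def noisy_features(features, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    clipped = np.clip(features, -1.0, 1.0)      # bound the sensitivity
    scale = 2.0 / epsilon                       # sensitivity / epsilon
    return clipped + rng.laplace(0.0, scale, size=features.shape)

f = np.array([0.4, -2.0, 0.7])
out = noisy_features(f)
assert out.shape == f.shape
```

A smaller ε means stronger privacy but noisier features, which is exactly the accuracy trade-off discussed above.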
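The filter-splitting strategy can be illustrated with a toy fully connected layer whose output units (standing in for feature maps) are divided among participants, so each device computes, and sees, only its own slice:

```python
import numpy as np

# Filter splitting in the spirit of DistPrivacy [239]: the 6 output
# feature maps of a layer are partitioned across three participants.
rng = np.random.default_rng(0)
x = rng.normal(size=(16,))                      # flattened layer input
W = rng.normal(size=(16, 6))                    # 6 output feature maps

def participant(channels):
    return x @ W[:, channels]                   # partial feature maps only

full = x @ W
split = np.concatenate([participant([0, 1]), participant([2, 3]),
                        participant([4, 5])])
assert np.allclose(full, split)                 # union reconstructs the layer
```

No single participant holds the complete feature maps, but gathering all slices still yields the exact layer output.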


depends on the requirements of the pervasive computing system, as multiple trade-offs need to be established, such as between the security level and the accuracy, or between the computation and communication loads.

C. Lessons Learned

• Current works addressing pervasive AI privacy proved their efficiency while trying to maintain an acceptable accuracy. However, some of these efforts incur significant extra communication and computation costs, while others incorporate new hyper-parameters that not only affect the accuracy but also burden the communication.
• Most of the efforts in the literature explored the attacks that target federated learning and the possible defenses. However, only little research investigated the threats facing distributed inference. More specifically, changing the intermediate data or injecting malicious information to mislead the prediction has not been studied yet.
• Limited efforts have investigated privacy and security in multi-agent reinforcement learning. This could be attributed to the satisfactory performance of the well-known defense mechanisms (e.g., DP) when applied to multi-agent systems without many modifications. It is also worth mentioning that these privacy and security protocols add communication and computation requirements, which are already high in the case of multi-agent learning. Thus, most works assume that the agents are trusted and focus on minimizing the communication and computational resource utilization.

VIII. FUTURE DIRECTIONS AND OPEN CHALLENGES

In this section, we present a list of open challenges and issues facing pervasive AI systems, and we propose some promising ideas to mitigate these issues. Specifically, we introduce the opportunities to integrate pervasive AI into emerging systems, and we suggest some future directions for efficient distributed inference and enhanced federated learning algorithms. Finally, we present some innovative ideas related to the new concepts of multi-agent reinforcement learning. Fig. 25 presents an illustration of the proposed directions.

Fig. 25. Future directions and open challenges.

A. Deployment of Pervasive AI in Emerging Systems

1) Pervasive AI-as-a-Service: While the main goal of 5G is to provide high-speed mobile services, 6G pledges to establish next-generation softwarization and to improve the network configurability in order to support pervasive AI services deployed on ubiquitous devices. However, the research on 6G is still in its infancy, and only the first steps have been taken to conceptualize its design, study its implementation, and plan for use cases. Toward this end, the academic and industrial communities should pass from theoretical studies of AI distribution to real-world deployment and standardization, aiming at instating the concept of Pervasive AI-as-a-Service (PAIaaS). PAIaaS allows service operators and AI developers to be more domain-specific and to focus on enhancing the users' quality of experience, instead of worrying about task distribution. Moreover, it permits systemizing the mass production and unifying the interfaces to access the joint software that gathers all participants and


applications. Some recent works, including [282] and [283], started to design distributed AI services. However, the authors did not present an end-to-end architecture enclosing the whole process, nor did they envisage an automated, trusted management of the service provisioning.

2) Incentive and Trust Mechanisms for Distributed AI Using Blockchain: The distribution of heavy deep neural networks over ubiquitous and resource-limited devices contributes to minimizing the latency of the AI task and guarantees the privacy of the data. However, even though pervasive systems are composed of computing units existing everywhere and anytime, and not necessarily belonging to any operator, the distribution is based on the assumption that pervasive devices consent to participate in the collaborative system. In this context, several considerations should be examined first: (1) the design of an incentive mechanism to motivate different nodes to take over AI tasks and sacrifice their memory, energy, communication, and computation resources in order to gain some rewards (e.g., monetary remuneration and free access to services); (2) in addition to the security of the private data, the security of the participants' information should also be guaranteed (e.g., locations, identifiers, and capacities). Recently, blockchain [258], [284] has gained large popularity as a decentralized ledger managing transaction records across distributed devices while ensuring trusted communication. Moreover, the aforementioned incentive mechanism can also be handled by blockchain systems. More specifically, all source devices and pervasive nodes first have to register to the blockchain system to benefit from the distributed AI or to participate in the computation. Then, data-generating devices request help to accomplish a task and submit, at the same time, a transaction application to the blockchain with a reward. Next, when the joining devices complete the offloaded tasks, they return the results to the source device and validate the completion of the transaction. Finally, the recorded participants are awarded according to their contribution to the blockchain transaction. The edge-based blockchain has a promising potential to prevent the security threats of transferring data between heterogeneous, decentralized, and untrusted devices. However, this approach is still in its infancy. Particularly, deploying it on resource-constrained devices is challenging due to the huge energy and computation load of blockchain mining [285].

3) Explainable AI (XAI): AI-based applications are increasingly involved in many fields where the decisions are very critical to lives and personal wellness, such as smart health applications and autonomous drones used during wars. However, most of the users do not have visibility on how the AI is making decisions. This lack of explainability prevents us from fully trusting the predictions generated by AI systems. Finding reasons and logical explanations for decisions made by AI is called Explainable AI (XAI) [12], [286]. XAI is an emerging field that is expected to answer questions including: Why are some predictions chosen, and why not others? When does an AI model succeed in taking the right decision, and when does it fail?

Various techniques are used to explain the AI: (1) One of these techniques is decomposability, which stands for the ability to describe each part of the model, extract features of the output, and analyze them using clustering methods. The pervasive AI system is the most adequate environment to empower XAI by improving the ability to interpret, understand, and explain the behavior of the model. More specifically, by distributing the inference, the model becomes algorithmically transparent, and each segment can be interpreted and clustered by its importance for the prediction. (2) Moreover, among the most important directions supporting XAI is model debugging. Debugging a DNN allows detecting errors and understanding their sources and their influence on misleading the prediction. A distributed model produces fine-grained outputs that help to follow the inference process and localize the errors before reaching the prediction layer. (3) A third direction to explain the AI is the extraction of data samples that are highly correlated with the results generated by the model. In fact, similarly to human behavior when trying to understand some processes, examples are analyzed to grasp the inner correlations derived by the AI model. Federated learning is based on clustering data entries and training local models. This technique permits narrowing the example search and enables the detection of the most influential inputs on the model's behavior. Research on XAI is still in its infancy, and pervasive DNN computing looks like a promising environment to track the AI process and interpret the results.

B. Efficient Algorithms for Pervasive Inference

1) Online Resource Orchestration: Pervasive computing systems are characterized by an extremely dynamic environment, where the available computing resources are volatile and the load of requests may follow some statistical distribution. Additionally, the quality of the collected data may affect the depth of the adopted DL model and, consequently, the computation requirements of the tasks. As an example, capturing high-quality images allows obtaining a good prediction using smaller networks. In this scenario, early-exit techniques or squeezed models can be adopted.

Because of these network dynamics, pervasive systems deploying distributed inference need a well-designed online resource orchestration and participant selection strategy to support the large number of AI services with minimum latency. Meanwhile, the heterogeneous and limited resources and the high-dimensional parameters of the DNN should be taken into consideration. In Section VI, we introduced existing theoretical approaches to split different DNN networks and distribute the resultant segments to pervasive devices in order to optimize pervasive computing [196], [197], [198], [199]. Nonetheless, most of them focused on how to partition the model in order to maximize the model parallelization and minimize the dependency between participants. Yet, there is no relevant work that deeply studied the performance of inference distribution and reported the bottlenecks and gains of such an approach in long-term online resource orchestration, with different loads of requests and a dynamic behavior of participants and sources. In other words, data parallelization is not well investigated in the literature, where sources can generate multiple requests at the same time and offload them to neighboring devices. In this scenario, the critical decision of
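The early-exit idea mentioned above can be sketched as follows (the confidence threshold and the toy model pieces are illustrative stand-ins for a trained multi-exit network):

```python
import numpy as np

# Early exit: an intermediate classifier returns immediately when its
# softmax confidence clears a threshold, so easy (e.g., high-quality)
# inputs skip the deeper, costlier layers.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def infer(x, shallow, exit_head, deep, threshold=0.9):
    h = shallow(x)
    p = softmax(exit_head(h))
    if p.max() >= threshold:                    # confident: exit early
        return int(p.argmax()), "early"
    return int(softmax(deep(h)).argmax()), "full"

# Toy pieces standing in for a trained network.
shallow = lambda x: x
exit_head = lambda h: np.array([8.0, 0.0])      # very confident exit head
deep = lambda h: np.array([0.0, 1.0])
label, path = infer(np.zeros(2), shallow, exit_head, deep)
assert path == "early" and label == 0
```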


each device is to choose whether to process the same task from different requests, while minimizing the memory needed to store the filters' weights, or to compute sequential tasks from the same request, while reducing the transmission among participants. Furthermore, age-aware inference is an important factor that can be foreseen in online data parallelization. In fact, some requests are highly sensitive to delays and need to be processed timely, such as in self-driving cars, whereas others are less critical, including machine translation and recommendation systems. Thus, prioritizing urgent tasks and assigning better resources and intensive data parallelization to them is of high importance. We believe that pervasive AI computing should focus more on the online configuration to implement the above vision.

2) Privacy-Aware Distributed Inference: Guaranteeing the privacy of the data shared between collaborative devices is one of the main concerns of pervasive computing systems, since untrusted participants may join the inference and observe critical information. Because of the heterogeneity of ubiquitous devices, the trained models are subject to malicious attacks, such as black-box and white-box risks, by which the original inputs may be in jeopardy. In this case, privacy-aware mechanisms should be enhanced to ensure the security of the distributed inference process. Many efforts have been conducted in this context, such as noise addition and cryptography. Even though these techniques succeeded in hiding the features of the data from untrusted devices, most of them suffer from computation overhead and incompatibility with some end-devices or DNNs. More specifically, noisy or encrypted data need to be re-trained to preserve the accuracy of the prediction, and each input has to be obfuscated, which adds a compu-

inferences and leverage the computation capacity of ground robots to accomplish the predictive tasks. However, only few works covered the distribution of the inference among flying drones, characterized by their faster navigation, higher power consumption, and ability to reach areas with high interference (e.g., high-rise buildings) compared to ground devices [288], [289]. Moreover, recent efforts did not cover the path planning for different moving robots to complete their assigned missions while performing latency-aware predictions. More specifically, the time period from capturing the data to the moment when the tasks from all the points are collected should be minimized by optimizing the devices' trajectories and planning close paths for the participants handling subsequent segments. Furthermore, the trajectories of devices with available resources should cross the paths of the nodes that need to offload the tasks, because of resource constraints.

4) Remote Inference of Non-Sequential DNN Models: A major part of the pervasive inference literature analyzes the remote collaboration, where the source device computes the shallow layers of the model while the cloud handles the deep layers [203], [206]. In this context, the split point is chosen based on the size of the shared data, the resource capability of the end-device, and the network capacity. This DNN partitioning approach may work well for the standard sequential model, where filters sequentially reduce the size of the intermediate data. However, state-of-the-art networks do not only include sequential layers with reduced outputs. Indeed, generative models (GANs) [37] proved their efficiency for image generation, image quality enhancement, text-to-image generation, etc. Auto-encoders also showed good performance for image generation, compression, and denois-
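The urgency-based prioritization of inference requests suggested above can be sketched with an earliest-deadline-first queue (the task names and deadlines are illustrative):

```python
import heapq

# Age/deadline-aware scheduling: pending inference requests are served
# earliest-deadline-first, so latency-critical tasks pre-empt tolerant ones.
queue = []
heapq.heappush(queue, (0.05, "collision-avoidance"))   # deadline in seconds
heapq.heappush(queue, (5.00, "machine-translation"))
heapq.heappush(queue, (0.50, "recommendation"))

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
assert order[0] == "collision-avoidance"
```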
tation overhead. Moreover, encryption may not be applicable ing. These types of networks have large-sized inputs and
for all DNN operations nor possible in some end-devices due outputs. Hence, despite the reduced intermediate data, the
to the crypto key management requirements. A notable recent cloud servers have to return the high-sized results to the
work in [239] and [287] proposed to use the distribution for source device, which implies high transmission overhead.
data privacy, without applying any additional task requiring Another family of efficient neural networks is the RNN (see
computation overhead. In fact, per-segment splitting leads by Section III-A1c) [35], used mostly for speech recognition and
design to assigning only some features of the input data to natural language processing. These networks include loops in
participants. Authors, of this work, applied filter partitioning their structures and multiple outputs of a single layer, which
and conducted empirical experiments to test the efficiency of imposes multiple communications with remote servers in case
black-box attacks on different segments’ sizes (i.e., number of partitioning. Other complex DNN structures prevent remote
of feature maps per device). The lower the number of fea- collaboration wisdom, such as the randomly wired networks
ture maps per device, the higher the privacy. However, filter and Bolzman Machines (BM) having a non-sequential depen-
partitioning incurs high communication load and dependency dency. Keeping up with ever-advancing deep learning designs
between devices. This study is still immature. Other partition- is a major challenge for per-layer splitting, particularly for
ing strategies (e.g., channel and spatial.) can be examined to remote collaboration. Based on these insights, the scheduling
identify the optimal partitioning and distribution that guaran- of DNN partitioning should have various patterns depending
tee satisfactory privacy and minimum resource utilization per on the model structure.
participant. 5) Fault-Tolerance of Distributed Inference: When a deep
3) Trajectory Optimization of Moving Robots for Latency- neural network is split into small segments and distributed
Aware Distributed Inference: The usage of robots (e.g., UAVs) among multiple physical devices, the risk of nodes failure is
proved its efficiency to improve services in critical and hard- increased, which leads to performance drop and even infer-
reaching regions. Recently, moving robots have been used ence abortion. The typical networking wisdom resorts to
for real-time image analysis, such as highway inspection, re-transmission mechanisms along with scheduling redundant
search and rescue operations, and border surveillance missions. paths. These failure management techniques inevitably con-
These devices have numerous challenges, including energy sume additional bandwidths. The DNNs are characterized by
consumption and unstable communication with remote servers. a unique structure that may enclose skip connections, convolu-
Recent works, e.g., [191], [243], proposed to avoid remote AI tional neural connections, and recurrent links. These features
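The fault-tolerant role of such skip connections can be made concrete with a minimal sketch (not from the survey; the residual blocks, devices, and failure pattern are all hypothetical): each participant hosts one residual block, and when a participant fails, the identity path simply forwards the previous activation to the next device.

```python
import random

# Minimal sketch (not from the survey): a chain of residual blocks, one per
# hypothetical device. The identity (skip) path lets inference continue when
# a participant fails mid-chain.

random.seed(0)
DIM = 8

def make_block():
    """One participant's segment: a tiny residual block y = x + relu(W x)."""
    W = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
    def block(x):
        h = [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in W]
        return [xi + hi for xi, hi in zip(x, h)]  # skip connection: x + f(x)
    return block

def resilient_forward(x, blocks, alive):
    """Run the device chain; a failed device is bypassed over the skip path,
    so its residual contribution is lost but the inference still completes."""
    for block, ok in zip(blocks, alive):
        if ok:
            x = block(x)
        # else: x is forwarded unchanged to the next participant
    return x

blocks = [make_block() for _ in range(4)]
x0 = [random.gauss(0.0, 1.0) for _ in range(DIM)]

full = resilient_forward(x0, blocks, [True] * 4)
degraded = resilient_forward(x0, blocks, [True, False, True, True])
drift = sum((a - b) ** 2 for a, b in zip(full, degraded)) ** 0.5
print(f"output drift with one failed participant: {drift:.3f}")
```

The sketch only illustrates the principle: the identity path acts as a zero-cost fallback route, and the output degrades gracefully (the "drift") rather than aborting, with the accuracy loss growing with the number of bypassed segments.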

BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2409

These features of state-of-the-art networks implicitly increase the robustness and resiliency of joint inference. More specifically, skip blocks allow a layer to receive information from an intermediate layer in addition to the data fed from the previous one. These connections, serving as a memory for some DL models (e.g., ResNet), can play the role of fault-tolerant paths. If one of the devices fails or leaves the joint system, information from a prior participant can still be propagated forward to the current device via the skip blocks, which adds some failure resiliency. Skip connections have shown an unprecedented ability to enhance the accuracy of deep models, in addition to their potential to strengthen the fault-tolerance of pervasive computing. However, transmission overheads are experienced, particularly in failure-free systems. Thus, a trade-off between accuracy, resilience, and resource utilization should be envisaged. Another vision to be investigated is to train the system without skip connections and use them only in case of failures. This idea is inspired by the Dropout technique [290], which is used to reduce overfitting and is based on randomly dropping some neurons during training while activating all of them during inference. Studying the impact of cutting off some transmissions during inference for different splitting strategies, while re-thinking dropout training, is an interesting way to strengthen the fault-tolerance of pervasive computing. Very recent works [222], [291] have started to discuss such insights; however, they are still immature.

6) Data-Locality-Aware Algorithms: Most of the efforts studying pervasive inference focus on splitting and parallelizing the DNN tasks related to a predictive request. Next, based on the resource requirements and their availability in the joint system (e.g., computation and energy), tasks are distributed and assigned to the participants. However, in terms of memory, only the weight of the input data is considered, whereas the memory needed to store the DNN structure itself is never taken into account. For example, the VGG-16 model has 138 M parameters and requires 512 MB to store its filters [31]. What worsens the situation is that some partitioning schemes impose copying the filters to all participants (e.g., spatial splitting). Moreover, if the intelligent application is driven by multiple DNN models and different segments are assigned to each device, a huge memory burden is experienced. Therefore, data-locality-aware algorithms should be designed. More specifically, the distribution system has to account for the past tasks assigned to each participant and try to maximize the re-usability of previously stored weights, with consideration of the capacity of the devices. Minimizing the number of weights assigned to each participant not only contributes to reducing the memory usage, but also guards the privacy of the structure against white-box attacks [292].

7) Pervasive Inference for Nanotechnology Applications: Nanotechnology is the field of innovation and research that focuses on creating particles (e.g., devices and materials) at the scale of atoms. These particles can be used in multiple domains, such as nanomedicine, which studies new ways of detecting and curing diseases. One interesting example of nanomedicine is the detection of diabetes through analyzing human breath. Before nanotechnology, it was not possible to precisely detect nano biomarkers. Nowadays, intelligent and invisible nano-sensors can be trained to sniff human breath and analyze the concentration of specific particles. Still, the full potential of nanomedicine (e.g., drug delivery systems and precision cancer medicine) is yet to be realized. To guarantee that nano particles achieve the targeted objectives, a large amount of data and computational analysis is expected. While traditional techniques opt for an in-depth understanding of biological and chemical knowledge, AI only requires data for training. Thus, it is highly interesting to integrate AI to evaluate and formulate nanoscale particles [293], [294]. However, these particles suffer from a small energy capacity that limits their communication with remote devices (e.g., handheld mobiles and computers). Hence, distributing the inference within the nano-sensors can provide localized processing and minimize data transmission. In this context, new partitioning strategies should be envisaged, as the existing ones do not fit the extremely limited computational resources of the particles. In particular, even neuron, spatial, or filter splitting, which involve numerous multiplications, are considered complex tasks. Thus, per-multiplication partitioning and the related dependency between millions of nano-participants have to be investigated to ensure the practicality of this futuristic convergence between pervasive AI and nanotechnology.

C. Enhanced Federated Learning Algorithms

1) Active Federated Learning: Given the main limitations of FL in terms of communication overheads and slow convergence, combining the AL concept with emerging FL schemes would be of great interest. Since most of the existing schemes for FL suffer from slow convergence, a novel active FL solution would be needed, which exploits the distributed nature of FL while coping with highly dynamic environments and ensuring adequately fast convergence. Indeed, the heterogeneity of the local training data at the distributed participating nodes, together with considering all nodes in the FL process, can significantly slow down the convergence; full node participation forces the centralized server to wait for the stragglers. Thus, we envision that: (1) exchanging some side information between the participating nodes (e.g., the unique data samples or class distribution) can significantly help in tackling the data heterogeneity problem; (2) partial node participation through efficient user selection schemes can play an important role in decreasing communication overheads and accelerating the convergence. A preliminary study of this approach can be found in [295].

2) Blending Inter and Intra Data Parallelism for Federated Learning: Deep neural networks require intensive memory and computational loads. This challenge is compounded when the model is larger and deeper, as it becomes infeasible to acquire training results from a single resource-limited device. Triggered by this challenge, federated learning has been proposed to train deep models over tens and even hundreds of CPUs and GPUs by taking advantage of inter-data parallelism [86]. At present, federated learning techniques split the data to be trained among pervasive nodes while copying the whole DL model to all of them.
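The inter-data parallelism underlying FL, together with the partial node participation envisioned above, can be sketched in a few lines. This is a FedAvg-style toy, not the survey's algorithm: the clients, their data shards, and the random selection rule are all hypothetical.

```python
import random

# Minimal FedAvg-style sketch (not the survey's algorithm): inter-data
# parallelism with partial node participation. Each round, a few hypothetical
# clients train locally on their own shard, and the server averages the
# resulting models, weighted by local data size.

random.seed(1)

def local_update(w, data, lr=0.1, epochs=5):
    """One client: fit y = w * x to its local shard by gradient descent."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Heterogeneous shards of the same underlying relation y = 3 * x.
clients = [[(x / 10.0, 3.0 * x / 10.0) for x in range(i + 2)]
           for i in range(10)]

w_global = 0.0
for _ in range(20):
    # Partial participation: a (here: random) user-selection scheme.
    selected = random.sample(range(len(clients)), k=3)
    updates = [(local_update(w_global, clients[i]), len(clients[i]))
               for i in selected]
    total = sum(n for _, n in updates)
    w_global = sum(w * n for w, n in updates) / total  # weighted FedAvg step

print(f"learned weight after 20 rounds: {w_global:.2f} (target 3.0)")
```

The point of the sketch is that only the model (here a single weight) crosses the network, never the raw shards, and that each round involves only a subset of the nodes, which is exactly the lever for reducing communication overhead and straggler delay.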


Still, small devices cannot participate in such a process due to their limited capacities. Hence, blending inter-data parallelism, where the training data are distributed, with intra-data parallelism, where the intermediate data of the model are partitioned, can be a feasible solution to enable training on non-GPU devices. Certainly, the practicality, gains, and bottlenecks of such an approach are yet to be examined and studied, as the backpropagation characterizing the training phase imposes heavy dependency and communication between devices.

D. Communication-Efficient Multi-Agent Reinforcement Learning

1) Demonstrated Applications: Since most of the MAB algorithms discussed in this paper are recent [143], [144], [147], [149], it remains interesting to see their implications for practical applications, for example, quantifying the effect of bounded communication resources or energy used in wearable devices and of congestion between edge nodes. Similarly, quantifying the improvement that better regret bounds bring to actual Quality of Experience (QoE) metrics can be promising.

2) More General Forms of MABs: The state-of-the-art algorithms in the distributed and federated setup address the finite-action, stochastic case of the multi-agent setting. However, there exist many more general forms of the bandit problem that are yet to be studied under multi-agent settings. These include, but are not limited to, adversarial bandits, linear bandits, pure exploration, and non-stationary bandits [296]. Investigating potential regret improvements and communication resource utilization in the MAB settings of non-stochastic and infinite-action bandits remains to be tackled.

3) Heterogeneity of Bandit Agents: In MAB settings, agents might differ not only in the instances they are trying to solve, but also in their computational capabilities. Different computational capabilities mean that agents interact with their environments at different rates, collecting different amounts of samples and hence having estimates of different quality. While the effect of this computational heterogeneity is heavily studied in supervised federated learning [297], it is not yet investigated in either distributed or federated bandits.

4) MARL Performance/Communication Trade-Off: Methods that train in a logically centralized server and then execute in a decentralized manner (CTDE) are able to communicate less (even not at all) at execution time while still learning a good joint policy thanks to the central training phase, as illustrated earlier. However, their adaptability is not guaranteed when dealing with a non-stationary environment, and they might require re-training to adapt to the new environment. On the other hand, fully decentralized agents can continue learning throughout their deployment but need to communicate more often to reason about their joint action; otherwise, learning can be challenging and might diverge [298]. A natural goal is to design adaptable methods that communicate conservatively, which is the main motivation behind scheduling in learned communication. Thus, more work is needed to address the question of adaptable and communication-cognizant MARL.

5) MARL Under Networking Constraints: Several communication characteristics have not been investigated under the MDP and POMG settings. For example, while delay, noise, failure, and time-varying topologies are vital factors in today's practical networks, they were not considered in most MARL papers. These factors were, however, considered in other optimization frameworks, such as multi-agent (distributed) convex optimization [299]. Some works have started to study bandwidth and multiple-access aspects [162], [165]. Yet, it is important to study the performance of emerging MARL policies under realistic networking constraints.

IX. CONCLUSION

Recently, AI and pervasive computing have drawn the attention of academia and industrial verticals, as their confluence has proved highly effective in enhancing human productivity and lifestyle. In particular, the computing capacities offered by the massive number of ubiquitous devices open up an attractive opportunity to fuel the continuously advancing pervasive IoT services, transforming all aspects of our modern life. In this survey, we presented a comprehensive review of the resource allocation and communication challenges of pervasive AI systems, which enable the support of a plethora of latency-sensitive applications. More specifically, we first presented the fundamentals of AI networks, applications, and performance metrics, and the taxonomy of pervasive computing and its intersection with AI. Then, we summarized the resource management algorithms for distributed training and inference. In this context, partitioning strategies, architectures, and communication issues and solutions were extensively reviewed. Additionally, relevant use cases were described and futuristic applications were discussed. The challenges encountered in writing this paper revolve around choosing the categorization of the different AI distribution strategies; for example, MARL can be classified under the umbrella of pervasive training, pervasive decision-making, or simply pervasive online learning.

Multiple challenges remain to be addressed to further improve the performance, resource management, privacy, and avant-garde applications. Therefore, we presented our vision of the technical challenges and directions that may emerge in the future, along with some opportunities for innovation. We hope that this survey will elicit fruitful discussion and inspire new promising ideas.

ACKNOWLEDGMENT

The findings achieved herein are solely the responsibility of the authors.

REFERENCES

[1] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," 2015, arXiv:1508.04025.
[2] A. Y. Hannun et al., "Deep speech: Scaling up end-to-end speech recognition," 2014, arXiv:1412.5567.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.


[4] A. I. Chen, M. L. Balter, T. J. Maguire, and M. L. Yarmush, "Deep learning robotic guidance for autonomous vascular access," Nat. Mach. Intell., vol. 2, pp. 104–115, Feb. 2020.
[5] J. Bughin and J. Seong, Assessing the Economic Impact of Artificial Intelligence, vol. 1, ITU Trends Issue Paper, Geneva, Switzerland, 2018.
[6] "Prepare to succeed with the Internet of Things," San Jose, CA, USA, Cisco, White Paper, 2017. [Online]. Available: http://docs.media.bitpipe.com/io_13x/io_138814/item_1588083/Prepa%20%20Succe%20wi%20t%20IoT.pdf
[7] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, "Deep learning for IoT big data and streaming analytics: A survey," IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, 4th Quart., 2018.
[8] M. K. Abdel-Aziz, C.-F. Liu, S. Samarakoon, M. Bennis, and W. Saad, "Ultra-reliable low-latency vehicular networks: Taming the age of information tail," 2018, arXiv:1811.03981.
[9] S.-C. Lin et al., "The architectural implications of autonomous driving: Constraints and acceleration," SIGPLAN Notices, vol. 53, no. 2, pp. 751–766, 2018.
[10] M. Satyanarayanan, "The emergence of edge computing," Computer, vol. 50, no. 1, pp. 30–39, 2017.
[11] D. Swinhoe. "The 15 biggest data breaches of the 21st century." CSO. 2021. [Online]. Available: https://www.csoonline.com/article/2130877/the-biggest-data-breaches-of-the-21st-century.html
[12] A. B. Arrieta et al., "Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI," Inf. Fusion, vol. 58, pp. 82–115, Jun. 2020.
[13] A. A. Abdellatif, A. Mohamed, C. F. Chiasserini, M. Tlili, and A. Erbad, "Edge computing for smart health: Context-aware approaches, opportunities, and challenges," IEEE Netw., vol. 33, no. 3, pp. 196–203, May/Jun. 2019.
[14] A. A. Abdellatif, A. Mohamed, C. F. Chiasserini, A. Erbad, and M. Guizani, "Edge computing for energy-efficient smart health systems: Data and application-specific approaches," in Energy Efficiency of Medical Devices and Healthcare Applications. London, U.K.: Elsevier, 2020, pp. 53–67.
[15] E. Nieuwdorp, "The pervasive discourse: An analysis," Comput. Entertainment, vol. 5, no. 2, p. 13, 2007.
[16] F. Colace, M. De Santo, V. Moscato, A. Picariello, F. A. Schreiber, and L. Tanca, Pervasive Systems Architecture and the Main Related Technologies. Cham, Switzerland: Springer Int., 2015, pp. 19–42.
[17] E. Baccour, S. Foufou, R. Hamila, and Z. Tari, "Achieving energy efficiency in data centers with a performance-guaranteed power aware routing," Comput. Commun., vol. 109, pp. 131–145, Sep. 2017.
[18] F. Haouari, E. Baccour, A. Erbad, A. Mohamed, and M. Guizani, "QoE-aware resource allocation for crowdsourced live streaming: A machine learning approach," in Proc. IEEE Int. Conf. Commun. (ICC), 2019, pp. 1–6.
[19] E. Baccour, A. Erbad, A. Mohamed, F. Haouari, M. Guizani, and M. Hamdi, "RL-OPRA: Reinforcement learning for online and proactive resource allocation of crowdsourced live videos," Future Gener. Comput. Syst., vol. 112, pp. 982–995, Nov. 2020.
[20] E. Baccour, S. Foufou, and R. Hamila, "PTNet: A parameterizable data center network," in Proc. IEEE Wireless Commun. Netw. Conf., 2016, pp. 1–6.
[21] F. Haouari, E. Baccour, A. Erbad, A. Mohamed, and M. Guizani, "Transcoding resources forecasting and reservation for crowdsourced live streaming," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–7.
[22] E. Baccour, A. Erbad, A. Mohamed, and M. Guizani, "CE-D2D: Dual framework chunks caching and offloading in collaborative edge networks with D2D communication," in Proc. 15th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), 2019, pp. 1550–1556.
[23] E. Baccour, A. Erbad, A. Mohamed, M. Guizani, and M. Hamdi, "Collaborative hierarchical caching and transcoding in edge network with CE-D2D communication," J. Netw. Comput. Appl., vol. 172, Dec. 2020, Art. no. 102801.
[24] E. Baccour, A. Erbad, A. Mohamed, M. Guizani, and M. Hamdi, "CE-D2D: Collaborative and popularity-aware proactive chunks caching in edge networks," in Proc. Int. Wireless Commun. Mobile Comput. (IWCMC), 2020, pp. 1770–1776.
[25] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, "The case for VM-based cloudlets in mobile computing," IEEE Pervasive Comput., vol. 8, no. 4, pp. 14–23, Oct./Dec. 2009.
[26] M. Aazam and E.-N. Huh, "Fog computing micro datacenter based dynamic resource estimation and pricing model for IoT," in Proc. IEEE 29th Int. Conf. Adv. Inf. Netw. Appl., 2015, pp. 687–694.
[27] K. Bilal, O. Khalid, A. Erbad, and S. U. Khan, "Potentials, trends, and prospects in edge technologies: Fog, cloudlet, mobile edge, and micro data centers," Comput. Netw., vol. 130, pp. 94–120, Jan. 2018.
[28] Y. Roh, G. Heo, and S. E. Whang, "A survey on data collection for machine learning: A big data—AI integration perspective," IEEE Trans. Knowl. Data Eng., vol. 33, no. 4, pp. 1328–1347, Apr. 2021.
[29] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2016.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25. Red Hook, NY, USA: Curran, 2012, pp. 1097–1105.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015, arXiv:1409.1556.
[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[33] S. Xie, A. Kirillov, R. Girshick, and K. He, "Exploring randomly wired neural networks for image recognition," 2019, arXiv:1904.01569.
[34] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artif. Intell. Rev., vol. 53, pp. 5455–5516, Dec. 2020.
[35] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.
[36] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[37] I. J. Goodfellow et al., "Generative adversarial networks," 2014, arXiv:1406.2661.
[38] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[39] M. S. Allahham, A. A. Abdellatif, A. Mohamed, A. Erbad, E. Yaacoub, and M. Guizani, "I-SEE: Intelligent, secure, and energy-efficient techniques for medical data transmission using deep reinforcement learning," IEEE Internet Things J., vol. 8, no. 8, pp. 6454–6468, Apr. 2021.
[40] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[41] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[43] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[44] A. Marchisio et al., "Deep learning for edge computing: Current trends, cross-layer optimizations, and open research challenges," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), 2019, pp. 553–559.
[45] S. H. Hasanpour, M. Rouhani, M. Fayyaz, and M. Sabokrou, "Lets keep it simple, using simple architectures to outperform deeper and more complex architectures," 2018, arXiv:1608.06037.
[46] Z. He, T. Zhang, and R. B. Lee, "Model inversion attacks against collaborative inference," in Proc. 35th Annu. Comput. Security Appl. Conf., New York, NY, USA, 2019, pp. 148–162. [Online]. Available: https://doi.org/10.1145/3359789.3359824
[47] V. Tolpegin, S. Truex, M. E. Gursoy, and L. Liu, "Data poisoning attacks against federated learning systems," in Proc. Eur. Symp. Res. Comput. Security, 2020, pp. 480–501.
[48] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in Proc. IEEE Symp. Security Privacy (SP), 2017, pp. 3–18.
[49] "Autonomio Talos [computer software]." 2019. [Online]. Available: http://github.com/autonomio/talos
[50] M. Abadi et al. "TensorFlow: Large-scale machine learning on heterogeneous systems." 2015. [Online]. Available: tensorflow.org
[51] "Caffe2." 2021. [Online]. Available: https://caffe2.ai/
[52] "Core ML." 2021. [Online]. Available: https://developer.apple.com/documentation/coreml
[53] A. Tveit, T. Morland, and T. B. Røst, "DeepLearningKit—An GPU optimized deep learning framework for apple's iOS, OS X and tvOS developed in metal and swift," 2016, arXiv:1605.04614.
[54] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[55] S. Park, J. Lee, and H. Kim, "Hardware resource analysis in distributed training with edge devices," Electronics, vol. 9, no. 1, p. 28, 2020.


[56] A.-J. Farcas, G. Li, K. Bhardwaj, and R. Marculescu, "A hardware prototype targeting distributed deep learning for on-device inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2020, pp. 1600–1601.
[57] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[58] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2016, pp. 528–535.
[59] "Prime-air." 2021. [Online]. Available: https://www.aboutamazon.com/news/transportation/amazon-prime-air-prepares-for-drone-deliveries
[60] "Uber self-driving cars." 2020. [Online]. Available: https://www.cnbc.com/2020/01/28/ubers-self-driving-cars-are-a-key-to-its-path-to-profitability.html
[61] "Amazon Alexa." 2021. [Online]. Available: https://en.wikipedia.org/wiki/Amazon_Alexa
[62] O. Adedeji and Z. Wang, "Intelligent waste classification system using deep learning convolutional neural network," Procedia Manuf., vol. 35, pp. 607–612, Jan. 2019.
[63] D. Zhang, X. Han, and C. Deng, "Review on the research and practice of deep learning and reinforcement learning in smart grids," CSEE J. Power Energy Syst., vol. 4, no. 3, pp. 362–370, Sep. 2018.
[64] B. Y. Cai, R. Alvarez, M. Sit, F. Duarte, and C. Ratti, "Deep learning-based video system for accurate and real-time parking measurement," IEEE Internet Things J., vol. 6, no. 5, pp. 7693–7701, Oct. 2019.
[65] A. Clemm, M. T. Vega, H. K. Ravuri, T. Wauters, and F. De Turck, "Toward truly immersive holographic-type communication: Challenges and solutions," IEEE Commun. Mag., vol. 58, no. 1, pp. 93–99, Jan. 2020.
[66] S. LaValle, Virtual Reality. Cambridge, U.K.: Cambridge Univ. Press, 2017.
[67] M. McClellan, C. Cervelló-Pastor, and S. Sallent, "Deep learning at the mobile edge: Opportunities for 5G networks," Appl. Sci., vol. 10, no. 14, p. 4735, 2020.
[68] D. C. Nguyen et al., "Enabling AI in future wireless networks: A data life cycle perspective," IEEE Commun. Surveys Tuts., vol. 23, no. 1, pp. 553–595, 1st Quart., 2021.
[69] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, "Convergence of edge computing and deep learning: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 869–904, 2nd Quart., 2020.
[70] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019.
[71] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[72] M. G. S. Murshed, C. Murphy, D. Hou, N. Khan, G. Ananthanarayanan, and F. Hussain, "Machine learning at the network edge: A survey," 2020, arXiv:1908.00080.
[73] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya,
[81] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
[82] Y. Shi, K. Yang, T. Jiang, J. Zhang, and K. B. Letaief, "Communication-efficient edge AI: Algorithms and systems," IEEE Commun. Surveys Tuts., vol. 22, no. 4, pp. 2167–2191, 4th Quart., 2020.
[83] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 3, pp. 2224–2287, 3rd Quart., 2019.
[84] M. Langer, Z. He, W. Rahayu, and Y. Xue, "Distributed training of deep learning models: A taxonomic perspective," IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 12, pp. 2802–2818, Dec. 2020.
[85] S. Shi, Z. Tang, X. Chu, C. Liu, W. Wang, and B. Li, "A quantitative survey of communication optimizations in distributed deep learning," IEEE Netw., vol. 35, no. 3, pp. 230–237, May/Jun. 2021.
[86] D. C. Nguyen, M. Ding, P. N. Pathirana, A. Seneviratne, J. Li, and H. V. Poor, "Federated learning for Internet of Things: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 23, no. 3, pp. 1622–1658, 3rd Quart., 2021.
[87] M. Aledhari, R. Razzak, R. M. Parizi, and F. Saeed, "Federated learning: A survey on enabling technologies, protocols, and applications," IEEE Access, vol. 8, pp. 140699–140725, 2020.
[88] S. A. Rahman, H. Tout, H. Ould-Slimane, A. Mourad, C. Talhi, and M. Guizani, "A survey on federated learning: The journey from centralized to distributed on-site learning and beyond," IEEE Internet Things J., vol. 8, no. 7, pp. 5476–5497, Apr. 2021.
[89] L. U. Khan, W. Saad, Z. Han, E. Hossain, and C. S. Hong, "Federated learning for Internet of Things: Recent advances, taxonomy, and open challenges," 2020, arXiv:2009.13012.
[90] W. Y. B. Lim et al., "Federated learning in mobile edge networks: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 3rd Quart., 2020.
[91] Z. Du, C. Wu, T. Yoshinaga, K.-L. A. Yau, Y. Ji, and J. Li, "Federated learning for vehicular Internet of Things: Recent advances and open issues," IEEE Open J. Comput. Soc., vol. 1, pp. 45–61, 2020.
[92] D. Peteiro-Barral and B. Guijarro-Berdiñas, "A survey of methods for distributed machine learning," Progr. Artif. Intell., vol. 2, pp. 1–11, Mar. 2013.
[93] J. Qiu, Q. Wu, G. Ding, Y. Xu, and S. Feng, "A survey of machine learning for big data processing," EURASIP J. Adv. Signal Process., vol. 2016, p. 67, May 2016.
[94] S. Hu, X. Chen, W. Ni, E. Hossain, and X. Wang, "Distributed machine learning for wireless communication networks: Techniques, architectures, and applications," IEEE Commun. Surveys Tuts., vol. 23, no. 3, pp. 1458–1493, 3rd Quart., 2021.
[95] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th Int. Conf. Artif. Intell. Stat. (AISTATS), 2017, pp. 1273–1282.
[96] J. Kang, Z. Xiong, D. Niyato, S. Xie, and J. Zhang, "Incentive mech-
2020, arXiv:1908.00080. (AISTATS), 2017, pp. 1273–1282.
[73] S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya, [96] J. Kang, Z. Xiong, D. Niyato, S. Xie, and J. Zhang, “Incentive mech-
“Edge intelligence: The confluence of edge computing and artificial anism for reliable federated learning: A joint optimization approach
intelligence,” IEEE Internet Things J., vol. 7, no. 8, pp. 7457–7469, to combining reputation and contract theory,” IEEE Internet Things J.,
Aug. 2020. vol. 6, no. 6, pp. 10700–10714, Dec. 2019.
[74] S. Voghoei, N. H. Tonekaboni, J. G. Wallace, and H. R. Arabnia, “Deep
[97] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learn-
learning at the edge,” in Proc. Int. Conf. Comput. Sci. Comput. Intell.
ing: Challenges, methods, and future directions,” IEEE Signal Process.
(CSCI), 2018, pp. 895–901.
Mag., vol. 37, no. 3, pp. 50–60, May 2020.
[75] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement
[98] Y. Liu, X. Yuan, Z. Xiong, J. Kang, X. Wang, and D. Niyato,
learning for multiagent systems: A review of challenges, solutions,
“Federated learning for 6G communications: Challenges, methods,
and applications,” IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839,
and future directions,” China Commun., vol. 17, no. 9, pp. 105–118,
Sep. 2020.
Sep. 2020.
[76] Y. Rizk, M. Awad, and E. W. Tunstel, “Decision making in multiagent
systems: A survey,” IEEE Trans. Cogn. Develop. Syst., vol. 10, no. 3, [99] L. U. Khan, M. Alsenwi, Z. Han, and C. S. Hong, “Self organizing
pp. 514–529, Sep. 2018. federated learning over wireless networks: A socially aware clustering
[77] D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization for rein- approach,” in Proc. Int. Conf. Inf. Netw. (ICOIN), 2020, pp. 453–458.
forcement learning: From a single agent to cooperative agents,” IEEE [100] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui,
Signal Process. Mag., vol. 37, no. 3, pp. 123–135, May 2020. “Performance optimization of federated learning over wireless
[78] L. Lei, Y. Tan, K. Zheng, S. Liu, K. Zhang, and X. Shen, “Deep networks,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019,
reinforcement learning for autonomous Internet of Things: Model, pp. 1–6.
applications and challenges,” IEEE Commun. Surveys Tuts., vol. 22, [101] S. R. Pandey, N. H. Tran, M. Bennis, Y. K. Tun, A. Manzoor,
no. 3, pp. 1722–1760, 3rd Quart., 2020. and C. S. Hong, “A crowdsourcing framework for on-device fed-
[79] A. Feriani and E. Hossain, “Single and multi-agent deep reinforce- erated learning,” IEEE Trans. Wireless Commun., vol. 19, no. 5,
ment learning for AI-enabled wireless networks: A tutorial,” 2020, pp. 3241–3256, May 2020.
arXiv:2011.03615. [102] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated
[80] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and learning with non-IID data,” Jun. 2018, arXiv:1806.00582.
J. S. Rellermeyer, “A survey on distributed machine learning,” ACM [103] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the
Comput. Surv., vol. 53, no. 2, pp. 1–33, 2020. convergence of FedAvg on non-IID data,” 2019, arXiv:1907.02189.

Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2413
[104] S. Wang et al., “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[105] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. IEEE INFOCOM Conf. Comput. Commun., 2019, pp. 1387–1395.
[106] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in Proc. IEEE Int. Conf. Commun. (ICC), 2019, pp. 1–7.
[107] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-IID data with reinforcement learning,” in Proc. INFOCOM, 2020, p. 10.
[108] C. Feng, Y. Wang, Z. Zhao, T. Q. S. Quek, and M. Peng, “Joint optimization of data sampling and user selection for federated learning in the mobile edge computing systems,” in Proc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), 2020, pp. 1–6.
[109] T. T. Vu, D. T. Ngo, H. Q. Ngo, M. N. Dao, N. H. Tran, and R. H. Middleton, “User selection approaches to mitigate the straggler effect for federated learning on cell-free massive MIMO networks,” 2020, arXiv:2009.02031.
[110] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” 2017, arXiv:1704.05021.
[111] F. Sattler, S. Wiedemann, K. R. Müller, and W. Samek, “Robust and communication-efficient federated learning from non-i.i.d. data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3400–3413, Sep. 2020.
[112] J. Mills, J. Hu, and G. Min, “Communication-efficient federated learning for wireless edge intelligence in IoT,” IEEE Internet Things J., vol. 7, no. 7, pp. 5986–5994, Jul. 2020.
[113] L. Liu, J. Zhang, S. H. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.
[114] W. Wu, L. He, W. Lin, and R. Mao, “Accelerating federated learning over reliability-agnostic clients in mobile edge computing systems,” Jul. 2020, arXiv:2007.14374.
[115] M. Duan, D. Liu, X. Chen, R. Liu, Y. Tan, and L. Liang, “Self-balancing federated learning with global imbalanced data in mobile systems,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 1, pp. 59–71, Jan. 2021.
[116] N. Mhaisen, A. Awad, A. Mohamed, A. Erbad, and M. Guizani, “Optimal user-edge assignment in hierarchical federated learning based on statistical properties and network topology constraints,” IEEE Trans. Netw. Sci. Eng., vol. 9, no. 1, pp. 55–66, Jan./Feb. 2022.
[117] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Rep. TR-2009, 2009.
[118] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Dec. 1998.
[119] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” 2018, arXiv:1804.03209.
[120] L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy Build., vol. 140, pp. 81–97, Apr. 2017.
[121] H. T. Kahraman, S. Sagiroglu, and I. Colak, “The development of intuitive knowledge classifier and the modeling of domain dependent data,” Knowl. Based Syst., vol. 37, pp. 283–295, Jan. 2013.
[122] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” 2017, arXiv:1708.07747.
[123] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated learning in the sky: Joint power allocation and scheduling with UAV swarms,” in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.
[124] N. I. Mowla, N. H. Tran, I. Doh, and K. Chae, “AFRL: Adaptive federated reinforcement learning for intelligent jamming defense in FANET,” J. Commun. Netw., vol. 22, no. 3, pp. 244–258, Jun. 2020.
[125] P. Oscar, P. Carlos, A. Ana, and G. James, 2014, “CRAWDAD Dataset,” CRAWDAD. [Online]. Available: https://crawdad.org/uportorwthaachen/vanetjamming2014/20140512
[126] Y. Wang, Z. Su, N. Zhang, and A. Benslimane, “Learning in the air: Secure federated learning for UAV-assisted crowdsensing,” IEEE Trans. Netw. Sci. Eng., vol. 8, no. 2, pp. 1055–1069, Apr.–Jun. 2021.
[127] Y. Liu, J. Nie, X. Li, S. H. Ahmed, W. Y. B. Lim, and C. Miao, “Federated learning in the sky: Aerial-ground air quality sensing framework with UAV swarms,” IEEE Internet Things J., vol. 8, no. 12, pp. 9827–9837, Jun. 2021.
[128] Y. Yang, Z. Bai, Z. Hu, Z. Zheng, K. Bian, and L. Song, “AQNet: Fine-grained 3D spatio-temporal air quality monitoring by aerial-ground WSN,” in Proc. IEEE INFOCOM Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2018, pp. 1–2.
[129] D. Dua and C. Graff, 2017, “UCI Machine Learning Repository,” UCI. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise
[130] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: Extending MNIST to handwritten letters,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2017, pp. 2921–2926.
[131] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey, “CINIC-10 is not ImageNet or CIFAR-10,” 2018, arXiv:1810.03505.
[132] S. Wang et al., “An ensemble-based densely-connected deep learning system for assessment of skeletal maturity,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 52, no. 1, pp. 426–437, Jan. 2022.
[133] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[134] T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2020.
[135] V. Kanade, Z. Liu, and B. Radunovic, “Distributed non-stochastic experts,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., 2012, pp. 260–268.
[136] E. Hillel, Z. S. Karnin, T. Koren, R. Lempel, and O. Somekh, “Distributed exploration in multi-armed bandits,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., 2013, pp. 854–862.
[137] B. Szorenyi, R. Busa-Fekete, I. Hegedus, R. Ormandi, M. Jelasity, and B. Kegl, “Gossip-based distributed stochastic bandit algorithms,” in Proc. 30th Int. Conf. Mach. Learn., vol. 28. Atlanta, GA, USA, Jun. 2013, pp. 19–27.
[138] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning of stochastic bandits over a social network,” IEEE/ACM Trans. Netw., vol. 26, no. 4, pp. 1782–1795, Aug. 2018.
[139] P. Landgren, V. Srivastava, and N. E. Leonard, “Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms,” in Proc. IEEE 55th Conf. Decis. Control (CDC), 2016, pp. 167–172.
[140] D. Martinez-Rubio, V. Kanade, and P. Rebeschini, “Decentralized cooperative stochastic bandits,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., 2019, pp. 4529–4540.
[141] P.-A. Wang, A. Proutiere, K. Ariu, Y. Jedra, and A. Russo, “Optimal algorithms for multiplayer multi-armed bandits,” in Proc. Int. Conf. Artif. Intell. Stat., 2020, pp. 4120–4129.
[142] A. Sankararaman, A. Ganesh, and S. Shakkottai, “Social learning in multi agent multi armed bandits,” Proc. ACM Meas. Anal. Comput. Syst., vol. 3, no. 3, pp. 1–35, Dec. 2019. [Online]. Available: https://doi.org/10.1145/3366701
[143] R. Chawla, A. Sankararaman, A. Ganesh, and S. Shakkottai, “The gossiping insert-eliminate algorithm for multi-agent bandits,” in Proc. Int. Conf. Artif. Intell. Stat., 2020, pp. 3471–3481.
[144] M. Agarwal, V. Aggarwal, and K. Azizzadenesheli, “Multi-agent multi-armed bandits with limited communication,” 2021, arXiv:2102.08462.
[145] Y. Wang, J. Hu, X. Chen, and L. Wang, “Distributed bandit learning: Near-optimal regret with efficient communication,” in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–31.
[146] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Multi-armed bandits in multi-agent networks,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2017, pp. 2786–2790.
[147] Z. Zhu, J. Zhu, J. Liu, and Y. Liu, “Federated bandit: A gossiping approach,” Proc. ACM Meas. Anal. Comput. Syst., vol. 5, no. 1, pp. 1–29, 2021.
[148] C. Shi and C. Shen, “Federated multi-armed bandits,” 2021, arXiv:2101.12204.
[149] C. Shi, C. Shen, and J. Yang, “Federated multi-armed bandits with personalization,” in Proc. 24th Int. Conf. Artif. Intell. Stat., Feb. 2021, pp. 2917–2925.
[150] Z. Wang, M. K. Singh, C. Zhang, L. D. Riek, and K. Chaudhuri, “Stochastic multi-player bandit learning from player-dependent feedback,” in Proc. ICML Workshop Real World Exp. Des. Act. Learn., 2020, pp. 1–24.
[151] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” Auton. Agents Multi-Agent Syst., vol. 33, no. 6, pp. 750–797, 2019.
2414 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022
[152] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic programming for partially observable stochastic games,” in Proc. AAAI, vol. 4, 2004, pp. 709–715.
[153] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[154] F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis, “Optimal and approximate Q-value functions for decentralized POMDPs,” J. Artif. Intell. Res., vol. 32, pp. 289–353, May 2008.
[155] P. Sunehag et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proc. AAMAS, 2018, pp. 2085–2087.
[156] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4295–4304.
[157] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 5887–5896.
[158] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6382–6393.
[159] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proc. AAAI Conf. Artif. Intell., vol. 32, 2018, pp. 2974–2982.
[160] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2961–2970.
[161] R. E. Wang, M. Everett, and J. P. How, “R-MADDPG for partially observable environments and limited communication,” in Proc. RL4RealLife Workshop 36th Int. Conf. Mach. Learn., 2020.
[162] R. Wang, X. He, R. Yu, W. Qiu, B. An, and Z. Rabinovich, “Learning efficient multi-agent communication: An information bottleneck approach,” in Proc. 37th Int. Conf. Mach. Learn., vol. 119, Jul. 2020, pp. 9908–9918.
[163] J. Jiang and Z. Lu, “Learning attentional communication for multi-agent cooperation,” in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., vol. 31, 2018, pp. 7254–7264.
[164] A. Singh, T. Jain, and S. Sukhbaatar, “Learning when to communicate at scale in multiagent cooperative and competitive tasks,” in Proc. Int. Conf. Learn. Represent., 2018.
[165] H. Mao, Z. Zhang, Z. Xiao, Z. Gong, and Y. Ni, “Learning agent communication under limited bandwidth by message pruning,” in Proc. AAAI Conf. Artif. Intell., vol. 34, Apr. 2020, pp. 5142–5149.
[166] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 2681–2690.
[167] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in Proc. 35th Int. Conf. Mach. Learn., vol. 80. Stockholm, Sweden, Jul. 2018, pp. 5872–5881.
[168] Y. Lin et al., “A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning,” in Proc. IEEE 58th Conf. Decis. Control (CDC), 2019, pp. 5562–5567.
[169] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient distributed reinforcement learning,” 2018, arXiv:1812.03239.
[170] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2145–2153.
[171] D. Kim et al., “Learning to schedule communication in multi-agent reinforcement learning,” in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–17.
[172] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, “Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 73–84, Mar. 2021.
[173] “openai/multiagent-particle-envs.” Apr. 2021. [Online]. Available: https://github.com/openai/multiagent-particle-envs
[174] O. Vinyals et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[175] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin, “Cost-effective active learning for deep image classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 12, pp. 2591–2600, Dec. 2017.
[176] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 10, pp. 1936–1949, Oct. 2014.
[177] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang, “Active self-paced learning for cost-effective and progressive face identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 1, pp. 7–19, Jan. 2018.
[178] A. A. Abdellatif, C. F. Chiasserini, and F. Malandrino, “Active learning-based classification in automated connected vehicles,” in Proc. IEEE INFOCOM Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2020, pp. 598–603.
[179] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 447–461, Mar. 2016.
[180] B. van Rooyen, A. Menon, and R. C. Williamson, “Learning with symmetric label noise: The importance of being unhinged,” in Proc. NIPS, 2015, pp. 10–18.
[181] J. Chen, Y. Zhou, A. Zipf, and H. Fan, “Deep learning from multiple crowds: A case study of humanitarian mapping,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1713–1722, Mar. 2019.
[182] J. Zhang, X. Wu, and V. S. Shengs, “Active learning with imbalanced multiple noisy labeling,” IEEE Trans. Cybern., vol. 45, no. 5, pp. 1095–1107, May 2015.
[183] M. Fang, T. Zhou, J. Yin, Y. Wang, and D. Tao, “Data subset selection with imperfect multiple labels,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 7, pp. 2212–2221, Jul. 2019.
[184] G. Hua, C. Long, M. Yang, and Y. Gao, “Collaborative active visual recognition from crowds: A distributed ensemble approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 582–594, Mar. 2018.
[185] Y. Liu and M. Liu, “An online learning approach to improving the quality of crowd-sourcing,” IEEE/ACM Trans. Netw., vol. 25, no. 4, pp. 2166–2179, Aug. 2017.
[186] Y. Huang, Z. Liu, M. Jiang, X. Yu, and X. Ding, “Cost-effective vehicle type recognition in surveillance images with deep active learning and Web data,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 1, pp. 79–86, Jan. 2020.
[187] R. Wang, T. Liu, and D. Tao, “Multiclass learning with partially corrupted labels,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2568–2580, Jun. 2018.
[188] A. B. Said et al., “A deep learning approach for vital signs compression and energy efficient delivery in mhealth systems,” IEEE Access, vol. 6, pp. 33727–33739, 2018.
[189] A. A. Abdellatif, C. F. Chiasserini, F. Malandrino, A. Mohamed, and A. Erbad, “Active learning with noisy labelers for improving classification accuracy of connected vehicles,” IEEE Trans. Veh. Technol., vol. 70, no. 4, pp. 3059–3070, Apr. 2021.
[190] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1737–1746.
[191] R. Hadidi, J. Cao, M. Woodward, M. S. Ryoo, and H. Kim, “Distributed perception by collaborative robots,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 3709–3716, Oct. 2018.
[192] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 4th Quart., 2017.
[193] S. Disabato, M. Roveri, and C. Alippi, “Distributed deep convolutional neural networks for the Internet-of-Things,” 2019, arXiv:1908.01656.
[194] Z. Xu et al., “Energy-aware inference offloading for DNN-driven applications in mobile edge clouds,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 4, pp. 799–814, Apr. 2021.
[195] C. Alippi, S. Disabato, and M. Roveri, “Moving convolutional neural networks to embedded systems: The AlexNet and VGG-16 case,” in Proc. 17th ACM/IEEE Int. Conf. Inf. Process. Sens. Netw. (IPSN), 2018, pp. 212–223.
[196] R. Hadidi, J. Cao, M. S. Ryoo, and H. Kim, “Toward collaborative inferencing of deep neural networks on Internet-of-Things devices,” IEEE Internet Things J., vol. 7, no. 6, pp. 4950–4960, Jun. 2020.
[197] R. Stahl, Z. Zhao, D. Mueller-Gritschneder, A. Gerstlauer, and U. Schlichtmann, “Fully distributed deep learning inference on resource-constrained edge devices,” in Proc. Int. Conf. Embedded Comput. Syst., 2019, pp. 77–90.
[198] Z. Zhao, K. M. Barijough, and A. Gerstlauer, “DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2348–2359, Nov. 2018.
[199] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, “MoDNN: Local distributed mobile computing system for deep neural network,” in Proc. Des. Autom. Test Europe Conf. Exhibit. (DATE), 2017, pp. 1396–1401.
[200] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen, “DeepDecision: A mobile deep learning framework for edge video analytics,” in Proc. IEEE INFOCOM Conf. Comput. Commun., 2018, pp. 1421–1429.
[201] X. Ran, H. Chen, Z. Liu, and J. Chen, “Delivering deep learning to mobile devices via offloading,” in Proc. Workshop Virtual Reality Augmented Reality Netw., New York, NY, USA, 2017, pp. 42–47. [Online]. Available: https://doi.org/10.1145/3097895.3097903
[202] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, “McDNN: An approximation-based execution framework for deep stream processing under resource constraints,” in Proc. 14th Annu. Int. Conf. Mobile Syst. Appl. Serv., 2016, pp. 123–136.
[203] Y. Kang et al., “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, 2017.
[204] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proc. Workshop Mobile Edge Commun., 2018, pp. 31–36.
[205] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “BranchyNet: Fast inference via early exiting from deep neural networks,” in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), 2016, pp. 2464–2469.
[206] L. Zeng, E. Li, Z. Zhou, and X. Chen, “Boomerang: On-demand cooperative deep neural network inference for edge intelligence on the industrial Internet of Things,” IEEE Netw., vol. 33, no. 5, pp. 96–103, Sep./Oct. 2019.
[207] H. Wang, G. Cai, Z. Huang, and F. Dong, “ADDA: Adaptive distributed DNN inference acceleration in edge computing environment,” in Proc. IEEE 25th Int. Conf. Parallel Distrib. Syst. (ICPADS), 2019, pp. 438–445.
[208] Y. Jin, J. Xu, Y. Huan, Y. Yan, L. Zheng, and Z. Zou, “Energy-aware workload allocation for distributed deep neural networks in edge-cloud continuum,” in Proc. 32nd IEEE Int. Syst. Chip Conf. (SOCC), 2019, pp. 213–217.
[209] J. H. Ko, T. Na, M. F. Amir, and S. Mukhopadhyay, “Edge-host partitioning of deep neural networks with feature space encoding for resource-constrained Internet-of-Things platforms,” in Proc. 15th IEEE Int. Conf. Adv. Video Signal Based Surveillance (AVSS), 2018, pp. 1–6.
[210] H. Li, C. Hu, J. Jiang, Z. Wang, Y. Wen, and W. Zhu, “JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution,” in Proc. IEEE 24th Int. Conf. Parallel Distrib. Syst. (ICPADS), 2018, pp. 671–678.
[211] G. Li, L. Liu, X. Wang, X. Dong, P. Zhao, and X. Feng, “Auto-tuning neural network quantization framework for collaborative inference between the cloud and edge,” in Artificial Neural Networks and Machine Learning (ICANN), V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis, Eds. Cham, Switzerland: Springer Int., 2018, pp. 402–411.
[212] W. Shi, Y. Hou, S. Zhou, Z. Niu, Y. Zhang, and L. Geng, “Improving device-edge cooperative inference of deep learning via 2-step pruning,” in Proc. IEEE INFOCOM Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2019, pp. 1–6.
[213] C. Hu, W. Bao, D. Wang, and F. Liu, “Dynamic adaptive DNN surgery for inference acceleration on the edge,” in Proc. IEEE INFOCOM Conf. Comput. Commun., 2019, pp. 1423–1431.
[214] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis, “The complexity of multiterminal cuts,” SIAM J. Comput., vol. 23, pp. 864–894, Aug. 1994.
[215] V. Polishchuk and J. Suomela, “A simple local 3-approximation algorithm for vertex cover,” Inf. Process. Lett., vol. 109, no. 12, pp. 642–645, 2009.
[216] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Comput., vol. 20, no. 2, pp. 565–576, Feb. 2021.
[217] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Distributed deep neural networks over the cloud, the edge and end devices,” in Proc. IEEE 37th Int. Conf. Distrib. Comput. Syst. (ICDCS), 2017, pp. 328–339.
[218] J.-I. Chang, J.-J. Kuo, C.-H. Lin, W.-T. Chen, and J.-P. Sheu, “Ultra-low-latency distributed deep neural network over hierarchical mobile networks,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[219] P. Ren, X. Qiao, Y. Huang, L. Liu, S. Dustdar, and J. Chen, “Edge-assisted distributed DNN collaborative computing approach for mobile Web augmented reality in 5G networks,” IEEE Netw., vol. 34, no. 2, pp. 254–261, Mar./Apr. 2020.
[220] R. Du, S. Magnusson, and C. Fischione, “The Internet of Things as a deep neural network,” IEEE Commun. Mag., vol. 58, no. 9, pp. 20–25, Sep. 2020.
[221] Z. Zhang et al., “Towards ubiquitous intelligent computing: Heterogeneous distributed deep neural networks,” IEEE Trans. Big Data, vol. 8, no. 3, pp. 644–657, Jun. 2022.
[222] A. Yousefpour et al., “Guardians of the deep fog: Failure-resilient DNN inference from edge to cloud,” in Proc. 1st Int. Workshop Challenges Artif. Intell. Mach. Learn. Internet Things, 2019, pp. 25–31.
[223] T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, “Distributed inference acceleration with adaptive DNN partitioning and offloading,” in Proc. IEEE INFOCOM Conf. Comput. Commun., 2020, pp. 854–863.
[224] J. Zhou, Y. Wang, K. Ota, and M. Dong, “AAIoT: Accelerating artificial intelligence in IoT systems,” IEEE Wireless Commun. Lett., vol. 8, no. 3, pp. 825–828, Jun. 2019.
[225] M. Xu, F. Qian, M. Zhu, F. Huang, S. Pushp, and X. Liu, “DeepWear: Adaptive local offloading for on-wearable deep learning,” IEEE Trans. Mobile Comput., vol. 19, no. 2, pp. 314–330, Feb. 2020.
[226] F. M. C. de Oliveira and E. Borin, “Partitioning convolutional neural networks to maximize the inference rate on constrained IoT devices,” Future Internet, vol. 11, no. 10, p. 209, 2019.
[227] Z. Han, Y. Gu, and W. Saad, “Fundamentals of matching theory,” in Matching Theory for Wireless Networks (Wireless Networks). Cham, Switzerland: Springer, 2017.
[228] P. Ahmed, C. S. Iliopoulos, A. S. M. Islam, and M. S. Rahman, “The swap matching problem revisited,” Theor. Comput. Sci., vol. 557, pp. 34–49, Nov. 2014.
[229] C.-C. Hsu, C.-K. Yang, J.-J. Kuo, W.-T. Chen, and J.-P. Sheu, “Cooperative convolutional neural network deployment over mobile networks,” in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–7.
[230] F. M. C. de Oliveira and E. Borin, “Partitioning convolutional neural networks for inference on constrained Internet-of-Things devices,” in Proc. 30th Int. Symp. Comput. Archit. High Perform. Comput. (SBAC-PAD), 2018, pp. 266–273.
[231] Y. Chang, X. Huang, Z. Shao, and Y. Yang, “An efficient distributed deep learning framework for fog-based IoT systems,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[232] S. Sahoo, “Grouped convolutions—Convolutions in parallel.” 2018. [Online]. Available: https://towardsdatascience.com/grouped-convolutions-convolutions-in-parallel-3b8cc847e851
[233] R. Hadidi, J. Cao, M. Woodward, M. S. Ryoo, and H. Kim, “Musical chair: Efficient real-time recognition using collaborative IoT devices,” 2018, arXiv:1802.02138.
[234] R. Hadidi, J. Cao, M. Woodward, M. S. Ryoo, and H. Kim, “Real-time image recognition using collaborative IoT devices,” in Proc. 1st Reproducible Qual. Efficient Syst. Tournament Co-Des. Pareto Efficient Deep Learn., 2018, p. 4.
[235] J. Choi, Z. Hakimi, J. Sampson, and V. Narayanan, “Byzantine-tolerant inference in distributed deep intelligent system: Challenges and opportunities,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 3, pp. 509–519, Sep. 2019.
[236] C. Zhang, M. Dong, and K. Ota, “Accelerate deep learning in IoT: Human-interaction co-inference networking system for edge,” in Proc. 13th Int. Conf. Human Syst. Interact. (HSI), 2020, pp. 1–6.
[237] M. Singhal, V. Raghunathan, and A. Raghunathan, “Communication-efficient view-pooling for distributed multi-view neural networks,” in Proc. Des. Autom. Test Europe Conf. Exhibit. (DATE), 2020, pp. 1390–1395.
[238] C. Li, S. Wang, X. Li, F. Zhao, and R. Yu, “Distributed perception and model inference with intelligent connected vehicles in smart cities,” Ad Hoc Netw., vol. 103, Jun. 2020, Art. no. 102152.
[239] E. Baccour, A. Erbad, A. Mohamed, M. Hamdi, and M. Guizani, “DistPrivacy: Privacy-aware distributed deep neural networks in IoT surveillance systems,” in Proc. GLOBECOM, 2020, pp. 1–6.
[240] B. Yang, X. Cao, C. Yuen, and L. Qian, “Offloading optimization in edge computing for deep-learning-enabled target tracking by Internet of UAVs,” IEEE Internet Things J., vol. 8, no. 12, pp. 9878–9893, Jun. 2021.
[241] B. Yang et al., “Intelli-eye: An UAV tracking system with optimized machine learning tasks offloading,” in Proc. IEEE INFOCOM Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2019, pp. 1–6.
Authorized licensed use limited to: TU Delft Library. Downloaded on December 13,2022 at 10:09:07 UTC from IEEE Xplore. Restrictions apply.
2416 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 24, NO. 4, FOURTH QUARTER 2022

BACCOUR et al.: PERVASIVE AI FOR IoT APPLICATIONS 2417

Emna Baccour received the Ph.D. degree in computer science from the University of Burgundy, France, in 2017. She was a Postdoctoral Fellow with Qatar University on a project covering the interconnection networks for massive data centers and then on a project covering video caching and processing in mobile edge computing networks. She currently holds a postdoctoral position with Hamad Bin Khalifa University. Her research interests include data center networks, cloud, edge, and mobile computing, as well as distributed systems. She is also interested in distributed machine learning and reinforcement learning with application to IoT systems.

Naram Mhaisen received the B.Sc. degree in computer engineering with excellence and the M.Sc. degree in computing from Qatar University (QU) in 2017 and 2020, respectively. He is currently pursuing the Ph.D. degree with the Delft University of Technology. Since 2017, he has been working as a Researcher on several funded projects with the Department of Computer Science and Engineering, QU. His research interests include the design and optimization of networked (learning) systems with applications to IoT.

Alaa Awad Abdellatif (Member, IEEE) received the B.Sc. and M.Sc. degrees (with Hons.) in electronics and electrical communications engineering from Cairo University, in 2009 and 2012, respectively, and the Ph.D. degree from the Politecnico di Torino in 2018. He is currently a Postdoctoral Researcher with Qatar University, where he also worked as a Senior Research Assistant and a Research Assistant from 2013 to 2015, and with Cairo University from 2009 to 2012. He has authored or coauthored over 35 refereed journal, magazine, and conference papers in reputable international journals and conferences. His research interests include edge computing, blockchain, machine learning, and resource optimization for wireless heterogeneous networks, smart health, IoT applications, and vehicular networks. He was a recipient of the Graduate Student Research Award from the Qatar National Research Fund and the Best Paper Award from the Wireless Telecommunications Symposium 2018 in the USA, in addition to the Quality Award from the Politecnico di Torino in 2018. He has served as a technical reviewer for many international journals and magazines.

Aiman Erbad (Senior Member, IEEE) received the B.Sc. degree in computer engineering from the University of Washington, Seattle, in 2004, the Master of Computer Science degree in embedded systems and robotics from the University of Essex, U.K., in 2005, and the Ph.D. degree in computer science from the University of British Columbia, Canada, in 2012. He is an Associate Professor and the Head of the Information and Computing Technology Division with the College of Science and Engineering, Hamad Bin Khalifa University. Prior to this, he was an Associate Professor with the Computer Science and Engineering Department and the Director of Research Planning and Development, Qatar University, until May 2020. He also served as the Director of Research Support, responsible for all grants and contracts, from 2016 to 2018, and as the Computer Engineering Program Coordinator from 2014 to 2016. His research interests span cloud computing, edge intelligence, Internet of Things, private and secure networks, and multimedia systems. He received the Platinum Award from H. H. The Emir Sheikh Tamim bin Hamad Al Thani at the Education Excellence Day 2013 (Ph.D. category). He also received the 2020 Best Research Paper Award from Computer Communications, the IWCMC 2019 Best Paper Award, and the IEEE CCWC 2017 Best Paper Award. He is a Senior Member of ACM.

Amr Mohamed (Senior Member, IEEE) received the M.S. and Ph.D. degrees in electrical and computer engineering from the University of British Columbia, Vancouver, Canada, in 2001 and 2006, respectively. He worked as an Advisory IT Specialist with the IBM Innovation Centre, Vancouver, from 1998 to 2007, taking a leadership role in systems development for vertical industries. He is currently a Professor with the College of Engineering, Qatar University, and the Director of the Cisco Regional Academy. He has over 25 years of experience in wireless networking research and industrial systems development. His research interests include wireless networking and edge computing for IoT applications. He has authored or coauthored over 160 refereed journal and conference papers, textbooks, and book chapters in reputable international journals and conferences. He holds three awards from IBM Canada for his achievements and leadership, and four best paper awards from IEEE conferences. He served as the Technical Program Committee Co-Chair for workshops in IEEE WCNC'16 and as the Co-Chair for technical symposia of international conferences, including Globecom'16, Crowncom'15, AICCSA'14, IEEE WLN'11, and IEEE ICT'10. He has served on the organizing committees of many other international conferences as a TPC member, including IEEE ICC, GLOBECOM, WCNC, LCN, and PIMRC, and as a technical reviewer for many international IEEE, ACM, Elsevier, Springer, and Wiley journals.


Mounir Hamdi (Fellow, IEEE) received the B.S. degree (Hons.) in electrical engineering (computer engineering) from the University of Louisiana in 1985, and the M.S. and Ph.D. degrees in electrical engineering from the University of Pittsburgh in 1987 and 1991, respectively. He is currently the Founding Dean of the College of Science and Engineering, Hamad Bin Khalifa University. He was a Chair Professor and a founding member of The Hong Kong University of Science and Technology (HKUST), where he was the Head of the Department of Computer Science and Engineering. From 1999 to 2000, he held visiting professor positions with Stanford University and the Swiss Federal Institute of Technology. His area of research is high-speed wired/wireless networking, in which he has published more than 400 publications, graduated more than 60 M.S./Ph.D. students, and been awarded numerous research awards and grants, including best research paper awards at ICC and Globecom. In addition, he has frequently consulted for companies and governmental organizations in the USA, Europe, and Asia. He received the Best 10 Lecturer Award and the Distinguished Engineering Teaching Appreciation Award from HKUST, and he is a frequent keynote speaker at international conferences and forums. He is or has been on the editorial boards of more than ten prestigious journals and magazines, and he has chaired more than 20 international conferences and workshops. In addition to his commitment to research and academic/professional service, he is a dedicated teacher and quality-assurance educator, frequently involved in higher-education quality assurance activities as well as engineering program accreditation all over the world. He is a Fellow of the IEEE for his significant contributions to the "design and analysis of high-speed packet switching."

Mohsen Guizani (Fellow, IEEE) received the B.S. (with Distinction), M.S., and Ph.D. degrees in electrical and computer engineering from Syracuse University, Syracuse, NY, USA, in 1985, 1987, and 1990, respectively. He is currently a Professor of Machine Learning and the Associate Provost with the Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE. Previously, he worked at different institutions in the USA. He is the author of ten books and more than 800 publications. His research interests include applied machine learning and artificial intelligence, Internet of Things, intelligent autonomous systems, smart cities, and cybersecurity. He has won several research awards, including the 2015 IEEE Communications Society Best Survey Paper Award, the Best ComSoc Journal Paper Award in 2021, as well as five best paper awards from ICC and Globecom conferences. He is also the recipient of the 2017 IEEE Communications Society Wireless Technical Committee Recognition Award, the 2018 AdHoc Technical Committee Recognition Award, and the 2019 IEEE Communications and Information Security Technical Recognition Award. He was listed as a Clarivate Analytics Highly Cited Researcher in Computer Science in 2019, 2020, and 2021. He served as the Editor-in-Chief of IEEE Network and is currently serving on the editorial boards of many IEEE Transactions and magazines. He was the Chair of the IEEE Communications Society Wireless Technical Committee and the TAOS Technical Committee. He served as an IEEE Computer Society Distinguished Speaker and is currently an IEEE ComSoc Distinguished Lecturer.

