Paper 2
Article history: Received 30 June 2022; Received in revised form 21 April 2023; Accepted 6 May 2023; Available online 13 May 2023

Keywords: Internet of Things; Federated Learning; Edge Intelligence; Edge AI; Hierarchical FL; Neural networks; Machine learning

Abstract: The increasing amount of data produced by IoT devices and the need to harness intelligence in our environments impose a shift of computing and intelligence towards the edge, leading to a novel computing paradigm called Edge Intelligence/Edge AI. This paradigm combines Artificial Intelligence and Edge Computing, enables the deployment of machine learning algorithms at the edge, where data is generated, and is able to overcome the drawbacks of a centralized cloud-based approach (e.g., performance bottlenecks, poor scalability, and a single point of failure). Edge AI supports the distributed Federated Learning (FL) model, which maintains local training data at the end devices and shares only the globally learned model parameters with the cloud. This paper proposes a novel, energy-efficient, and dynamic FL-based approach built on a hierarchical edge FL architecture called HED-FL, which supports a sustainable learning paradigm through model parameter aggregation at different layers and adaptive learning rounds at the edge, saving energy while still preserving the learning model's accuracy. Performance evaluations of the proposed approach have been carried out considering model accuracy, loss, and energy consumption.

© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
We live in a connected world where everything is potentially addressable and many objects can perform distributed sensing and computation so as to create Smart Environments [1–3]. In this context, Internet of Things (IoT) [4,5] devices are growing in importance as enabling technologies for the realization of such smart environments. In the last few years, smart environments have been giving way to so-called Cognitive Environments [6–8], that is, environments with the particular capabilities of understanding what is happening in a place, learning from the environment itself, and adapting to special conditions. One of the most important enabling technologies for such cognitive environments is Machine Learning (ML) [9,10]. ML algorithms usually need powerful computers, both in terms of memory and computation, for their execution. Hence, such algorithms typically execute in the cloud, where memory and computation power are potentially unlimited. Executing such algorithms in the cloud, however, means high communication rates between the objects collecting data and the cloud itself, low privacy levels for the data sent to the cloud, and a large global energy consumption [11,12]. To overcome some of the issues related to cloud-based ML, a few years ago a cooperative learning approach called FL [13] was proposed.
∗ Corresponding authors.
E-mail addresses: [email protected] (F. De Rango), [email protected] (A. Guerrieri), [email protected] (P. Raimondo),
[email protected] (G. Spezzano).
https://doi.org/10.1016/j.pmcj.2023.101804
1574-1192/© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
F. De Rango, A. Guerrieri, P. Raimondo et al. Pervasive and Mobile Computing 92 (2023) 101804
FL allows training neural networks across multiple decentralized edge devices. Such edge devices, equipped with some
computation capabilities, enable the so-called Edge Computing [14], a promising solution to support Artificial Intelligence
(AI) applications on edge devices. The FL technique was invented to cope with the privacy problems raised in the last few years by the rapid development of the IoT paradigm. The privacy of edge devices is preserved because they cancel, or significantly limit, the exchange of sensitive data across the Internet. The only information exchanged when applying the FL technique is the set of weights of a common model shared among all the devices. Usually, FL is composed of three simple steps: (i) an orchestrator entity chooses a statistical model to be trained and transmits its parameters to several end devices; (ii) each device trains the model locally with locally generated data; (iii) the locally trained models are sent back to the orchestrator, which uses them to create a global model without accessing any specific data.
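The three steps above can be sketched as a minimal FL round loop. The model representation, clients, and averaging rule below are simplified placeholders for illustration, not the exact algorithm of any specific framework:

```python
import numpy as np

def federated_round(global_weights, client_datasets, local_train):
    """One FL round: broadcast, local training, aggregation.

    global_weights: 1-D array of model parameters chosen by the orchestrator.
    client_datasets: list of local datasets (they never leave the devices).
    local_train: function (weights, data) -> locally updated weights.
    """
    # Step 1: the orchestrator transmits the current model to the clients.
    local_models = []
    for data in client_datasets:
        # Step 2: each device trains the model on its locally generated data.
        local_models.append(local_train(global_weights.copy(), data))
    # Step 3: the orchestrator averages the returned weights (FedAvg-style),
    # weighting each client by its dataset size, without seeing any raw data.
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    stacked = np.stack(local_models)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Toy usage: 3 clients whose "training" nudges weights toward their data mean.
toy_train = lambda w, data: w + 0.1 * (np.mean(data) - w)
clients = [np.array([1.0, 2.0]), np.array([3.0]), np.array([5.0, 6.0, 7.0])]
w = federated_round(np.zeros(4), clients, toy_train)
```

Only the weight vectors travel between the orchestrator and the devices; the raw arrays in `clients` stay local.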
FL helps to reduce security issues by avoiding the transmission of local private data to the cloud. However, only recently have some enhancements to FL been designed that also consider edge computing. The proliferation of devices and the massive IoT access to networks is imposing a shift of the computation paradigm towards the network's border, making the edge layer an additional element to consider. This means that FL can be designed by shifting the learning to the edge [15]. Even if applying distributed intelligence at the edge is already an ambitious step forward in the research, a scalable approach able to include constraints such as energy, wireless and unreliable communications, computational limits, etc., needs to be accounted for when we try to distribute intelligence at the edge while also involving embedded or IoT devices [16,17]. In our view, applying FL can improve the performance of classical centralized learning because it can share multiple datasets from heterogeneous data owners and speed up the learning process.
In this proposal, we account for the problem of the communication between edge nodes and IoT devices during model training. We consider the heterogeneous data generated by constrained devices, reduce the number of communications among edge nodes during the learning phase by organizing the edge into groups of nodes, and exploit a hierarchical architecture able to efficiently perform learning tasks while preserving energy and reducing the data communication overhead. We introduce a novel, hierarchical, energy-efficient, and dynamic edge FL approach called
HED-FL. Our approach is based on a hierarchical nodes architecture and is integrated with algorithms and methodologies
for applying sustainable distributed intelligence and learning at the edge. Indeed, even if federated learning can speed up convergence when a very high resource availability is assumed, it does not consider the sustainability of the process under the increasing demand for IoT devices and services, which keeps increasing the volume of data directed towards the network and cloud servers. Creating sustainable intelligence in the network means distributing the intelligence, and not only the data samples; this has led to the design of a novel architecture where local and trusted data collectors (aggregators) can manage part of the edge networks to distribute resources, security policies, and also learning tasks. A distributed architecture where learning tasks can be distributed and protected through trusted computing and secure lightweight transactions of trained models and model parameters can reduce security risks and, at the same time, save network resources such as bandwidth and energy. On the basis of the issues listed above, the following challenges will be addressed
in this work:
• Hierarchical Edge Learning Architecture: A learning approach (HED-FL) based on a hierarchical architecture at the edge is proposed to take advantage of model/data aggregation at fog/edge nodes. This approach reduces the amount of data sent towards the cloud while achieving a good accuracy value.
• Sustainable Learning Architecture: We propose a sustainable model in which the learning architecture explicitly considers energy consumption, and we compare it with well-known architectures based on Federated Learning and Centralized Learning. The contribution emphasizes the energy cost of the architecture by introducing a simple energy model applied to the proposed hierarchical learning architecture.
• Heuristics for the dynamic learning architecture management: In order to apply the hierarchical architecture in some
specific cases, two heuristics have been proposed to select the iterations made for training purposes (round number)
at the different hierarchy layers. In particular, the first heuristic considers a static approach, and the second one
applies a dynamic strategy where the round number is changed according to an accuracy threshold established at
the first layer of the architecture.
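As a rough illustration of the second (dynamic) heuristic, the sketch below adapts the number of first-layer rounds to an accuracy threshold. The threshold value, round bounds, and `evaluate` callback are hypothetical placeholders, not the exact rule used in this work:

```python
def adaptive_rounds(evaluate, acc_threshold, min_rounds=1, max_rounds=20):
    """Run first-layer aggregation rounds until accuracy crosses a threshold.

    evaluate: callback (round_index) -> accuracy in [0, 1] measured after that
              round's local training + aggregation at the first layer.
    Returns the number of rounds actually performed.
    """
    for r in range(1, max_rounds + 1):
        acc = evaluate(r)
        # Stop early once the model is accurate enough, saving the energy
        # that further rounds (and their communications) would consume.
        if r >= min_rounds and acc >= acc_threshold:
            return r
    return max_rounds

# Toy usage: accuracy grows with each round and plateaus near 0.9.
rounds = adaptive_rounds(lambda r: 0.9 * (1 - 0.5 ** r), acc_threshold=0.85)
```

The static heuristic corresponds to fixing the round number in advance; the dynamic one trades a few evaluation calls for fewer (energy-hungry) communication rounds.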
The paper is organized as follows: Section 2 presents the related work on FL and the different categories of FL;
centralized, distributed, and hierarchical architectures applied to the FL are introduced in Section 3; our proposal is
described in Section 4; and a case study is presented in Section 5; finally, conclusions are summarized in the last section.
2. Related work
In this section, related work on FL is presented. The section is divided into sub-sections on the basis of the specific FL approaches applied in a centralized (in the cloud) or decentralized (at the edge) context, also considering more recent hybrid approaches that combine the benefits of several FL strategies.
FL is an emerging machine learning approach in which many clients collaboratively train an ML model while preserving privacy [13]. The basic idea is to send to a central server only the weights (model parameters) of the neural network
trained on board the devices, avoiding the exposure of private and local data to the whole network (e.g., the Internet). The central server can collect all the weights coming from the local nodes and then perform an averaging operation, or some other operation, to obtain new model parameters to share again with the distributed nodes [18]. Thus, the interesting idea is to share the model parameters on a centralized server (centralized learning) while avoiding sharing the local data, which remains at the distributed producers [19]. In classic FL, homogeneity of computation is assumed among the nodes that share their model
parameters with a cloud server. In this classic approach, the strength lies in the huge number of model parameters, obtained from locally trained models, that can be collected (and aggregated) in the cloud. The cloud can speed up the learning process and convergence, providing a more accurate global model that can be shared among nodes or also sent to new nodes in the network. Concerning the communication load of classic FL, it is evident that it can be huge, since the number of nodes involved in the communication can be very high and there is often a considerable distance between the distributed nodes and the central servers (cloud). This means that, sometimes, the benefits obtained in the learning process can be wasted in terms of the communication resources needed to support centralized learning in the cloud. For
this purpose, some more recent contributions tried to take advantage of the learning speed-up of cooperative learning at the edge, replicating the cloud model on edge nodes that can be equipped with more computation capacity than classic participating nodes. In this case, the learning models are relegated to the network borders, so that communication with a central remote server (cloud) is avoided or reduced, and it becomes possible to benefit from an edge layer that can support computation and storage on some special nodes, such as fog or edge nodes. However, this approach can yield less accurate trained models, because each edge node can collect only a limited amount of data, related to the local area it manages. Supposing that an edge node receives data from a small set of nodes, the training either has to run longer, waiting to receive more data from that small set of nodes, or it can stop after a fixed number of rounds, obtaining a less accurate model. Even when increasing the training time, local learning models can be less accurate because they are based on local data only, not exploiting datasets coming from other remote local areas.
To overcome the edge-related issues mentioned above, some recent studies tried to combine the benefits of the fully centralized approach with those of a locally centralized and more scalable distributed approach at the edge. Some of these recent approaches are called hybrid or hierarchical and are explained in the next sub-sections.
Hybrid learning is a new learning approach where several learning strategies can be combined at different geographic or time scales in order to improve the overall management of the learning tasks and of the resources in the network. These hybrid approaches still rely on the basic FL approach, but they also try to take advantage of the specific capacities or constraints of the nodes distributed across the network to manage heterogeneous data sources, combining the benefits of fully centralized and fully distributed learning. In [20], a hybrid cloud–edge learning approach is proposed to manage heterogeneous IoT devices. The authors consider three stages: offloading, learning, and personalization. The basic idea is to transfer small datasets to the edge nodes to offload the computation of model parameters, reducing the computation load for devices with limited resources. However, the communication between the edge nodes and the cloud server can be significant if several iterations at the global (edge–cloud) level are necessary. Moreover, scalability in the case of dense networks with many IoT devices is not addressed, and energy consumption is not evaluated in the proposed learning approach. In [21], the authors propose a hybrid approach to provide data privacy through the FL approach combined with Secure Multiparty Computation (SMC). Even if this work is interesting in addressing the privacy issues in the communication of learning parameters towards the server that performs the aggregation, its main focus is security and privacy, which is not the main focus of our contribution. Moreover, scalability and sustainability in terms of energy consumption of the proposed model are not considered. A two-timescale hybrid
federated learning is proposed in [22]. The proposed learning scheme is defined as hybrid because two learning procedures are applied: one over Device-to-Server (D2S) communication and one over Device-to-Device (D2D) communication. The communication between devices and servers is performed periodically, multiple Stochastic Gradient Descent (SGD) iterations are applied to the local datasets, and an asynchronous, consensus-based establishment of the model parameters is performed through D2D communications among devices belonging to the same cluster. This approach, like our proposal, tries to reduce communication with the cloud, where FedAvg is performed, through local rounds of model parameter tuning in the local clusters. The local cluster, whose members communicate in a distributed fashion, is similar to the first level of our hierarchy, where the first edge node can perform FedAvg locally. In our case, however, the architecture is broader, because it is possible to exploit more levels of aggregation by considering edge nodes at higher layers, which means much lower overhead and greater energy saving. Moreover, the random selection of cluster members proposed in [22] to apply FedAvg towards the cloud server assumes that all nodes should receive the locally trained model parameters, reducing the scalability of the approach as the cluster size increases (the number of communications increases with the square of the number of nodes). In [23], a novel Hybrid Federated and Centralized Learning (HFCL) framework has been proposed.
It applies a learning model to exploit the clients’ computational capability. Based on HFCL, only the clients that have
sufficient resources employ FL; the remaining clients resort to Centralized Learning (CL) on the cloud by transmitting
their local data set to a central server. This allows all the clients to collaborate on the learning process regardless of their
computational resources. This proposed approach does not consider the problem of learning scalability and the reduction
of the learning model parameters transmission overhead. Moreover, energy consumption is not considered in the overall
architecture evaluation.
Recent contributions considered heterogeneous data sources where devices with different capabilities collect different data types. In this scenario, partial data samples shared among some nodes lead to a hybrid data partitioning, and novel learning architectures need to be explored. In [24], the authors proposed a hierarchical learning architecture to divide samples and feature space in the learning process. The authors' main focus is to show how a simple two-layer hierarchy can provide more robustness in edge learning when horizontal and vertical data partitioning among nodes is considered. They assume that nodes with few features in their local dataset can be combined, through FedAvg at the edge, with nodes having more features, speeding up the convergence process. However, no energy evaluation under different levels of aggregation, or with a dynamic tuning of the aggregation parameters at the edge, is provided. In [25],
the authors apply a hierarchical FL to exploit the heterogeneity of network segments with different bandwidth resources. They take advantage of the higher bandwidth availability of Local Area Networks (LANs), where more communication can be supported, reducing the overhead in the Wide Area Network (WAN). In this case, a hierarchical FL can accommodate these network diversities by providing improved learning models, accelerating FL convergence, and reducing the billing costs of deploying FL in the real world. In line with this approach, we adopt edge learning to speed up the learning convergence while keeping the communication cost low by confining most traffic to the edge. However, while the authors in [25] consider only a two-level hierarchy, we focus on a more general approach where more aggregation levels can be exploited to further reduce the communication with the cloud server and the energy consumption.
A hierarchical FL applied to cellular networks is considered in [26]. In this case, FL is applied at the edge, where mobile users can learn a global model by sharing local updates. In order to reduce the impact of the communication latency that can be experienced on the UpLink (UL) and DownLink (DoL) channels, a Mobile Base Station (MBS) is introduced in the architecture with the purpose of reducing the distance to the mobile users moving in the local areas. In line with this approach, we apply edge learning too. However, differently from this approach, we exploit a more general hierarchical architecture where more levels of model parameter aggregation can be activated. This means that our architecture is more scalable and not specific to a cellular system, where aggregations are applied at the SBS and MBS only.
In [27], the authors present a client–edge–cloud hierarchical Federated Learning system supported by a HierFAVG algorithm that allows multiple edge servers to perform partial model aggregation. In this way, the model can be trained faster and better communication–computation trade-offs can be achieved. This proposal supports the intuition behind our work of applying more aggregation levels. With a similar methodology, we consider the possibility of tuning the local learning rounds, applying a modified FedAvg at different levels. The authors show the benefits of applying only two levels of aggregation, tuning the parameters k1 (rounds between local devices and edge nodes) and k2 (rounds between edge devices and the cloud server), and considering the classical FedAvg between edge nodes and the cloud server. In our case, we exploit the possibility of extending the number of levels beyond 2 to further reduce the overhead and the communication towards the cloud server. Our approach focuses more on higher scalability and energy saving, considering the possibility of increasing the complexity of the edge layer. This means that HierFAVG can be considered a specific case of our more general approach to the hierarchical organization of the edge in the learning design. Moreover, we propose an adaptive, threshold-based tuning of the local rounds in the learning phase.
Much of the research so far, such as the work mentioned in [12], has focused on learning performance in terms of accuracy and iterations, without worrying about the sustainability of the process. For this purpose, efficiency evaluated in terms of energy consumption, bandwidth usage, communication costs, latency, carbon emissions, memory usage, etc., should be accounted for. All these aspects can change the approach to FL design and to the architecture supporting Deep Learning (DL) and FL. For example, the contributions in [28,29], related to computer vision, addressed the efficiency of inference. Also, in [30], some contributions on deploying efficient models on mobile phones have been presented. However, these approaches tried to minimize the processing cost, and no analysis of different architectures, or of the constraints to consider jointly with the learning process, is provided. The diffusion of DL and FL requires attention also to the energy and carbon footprint generated in the network architecture, especially considering broad-scale ML development. For this purpose, the authors in [11] emphasize the importance of considering energy consumption when a distributed learning approach such as FL is applied. In the overall evaluation of the learning process, the packets sent to the cloud and the number of nodes involved in the communication can play an important role in carbon footprint emissions. This contribution inspired our proposed energy models and computation: we applied a similar methodology in considering the energy consumption associated with FedAvg. However, differently from that work, we extended the energy consumption model to the case of hierarchical federated learning, where more layers and aggregation steps are accounted for.
Differently from the contributions presented above, our proposal tries to combine the benefits of different learning categories by proposing a learning and communication architecture based on a multi-level hierarchy and considering dynamic aggregation and transmission of the learning parameters. The hierarchy offers more scalability to the learning process by aggregating model parameters at different layers, whereas the heterogeneity of the data samples and nodes is simulated by organizing the nodes into clusters (groups) with randomly selected members. Moreover, unlike many works
Fig. 1. Architectures comparison — (A) Centralized learning and (B) Federated learning.
presented so far, our main focus has been energy consumption, since we want to propose a sustainable learning architecture that can reduce communication and energy costs while preserving the reliability of the learning model. Furthermore, the very few contributions that considered a hierarchical architecture did not evaluate a multi-layer design with more aggregation levels, so the benefit of the multiple layers was not assessed; typically, these works consider a two-layer architecture (fog/edge and cloud layer). Moreover, an energy evaluation of the layers in the network is also missing in all these works: some of them, such as [27], did not consider the energy consumption of the overall network when the cloud server is far away and remotely located. In addition, two heuristics are introduced to assess the effect of a static and an adaptive Hierarchical Edge Federated Learning (HEFL), in order to evaluate the benefits in terms of energy saving and learning model accuracy.
Federated training was first used by Google to train the model for keyboard suggestions on mobile devices [31]. In this case, all the data exchanged between entities concerns only the computed and merged models. To achieve good performance and accuracy levels, these exchanges need to be repeated several times, so this architecture heavily relies on network communication to gather and distribute the model. For this reason, the natural position of the aggregator is in the cloud. During each training round, every client sends its trained model towards the network as a collection of parameters and receives the same collection structure as the result of the aggregation. In order to achieve good results in terms of accuracy, a high number of rounds is necessary. For this reason, it is really important to address the energy consumption of this training method, focusing mainly on the energy depleted by the communication equipment: by lowering the number of communications, it is possible to reduce the energy consumption of the model training phase. In this paper, we propose some strategies to lower the number of communications and the energy consumption for the continuous training of an FL approach. In many distributed learning architectures, there is the strong assumption that all clients can perform model computation, which may require powerful parallel processing units such as Graphics Processing Units (GPUs). However, this may not always be possible in practice due to the heterogeneity of devices with different computation capabilities, such as mobile phones, vehicular components, and IoT devices. When the edge devices lack computational power or need to save energy, they can perform neither complex model computations nor frequent data exchanges with remote computing servers, thus becoming unable to participate in the learning process. To address this problem, groups of clients can be considered in the learning process, where not all the group members always participate in training the model parameters. In the following, we present some of the best-known learning architectures, highlighting their benefits and drawbacks. All these architectures use local computation on centralized servers or local devices, and they perform aggregation of the model parameters θ and broadcast the updated model.
In CL, the model has access to the whole dataset $D$, which can be collected at the Centralized Server (CS) from the clients (see Fig. 1.A). Let $D_k$ be the local dataset of the $k$th client, with $|D_k|$ being the size of the local dataset, such that $D = \bigcup_{k \in K} D_k$ and $|D| = \sum_{k \in K} |D_k|$, where $K = \{1, 2, \ldots, N\}$. In this case, CL can be performed at the CS by solving Eq. (1):

$$\min_{\theta} F(\theta) = \sum_{k=1}^{N} \frac{|D_k|}{|D|} \cdot F_k(\theta) \tag{1}$$
s.t. $f(X_k^i \mid \theta) = Y_k^i$, where $X_k^i$ and $Y_k^i$ are respectively the input and the output related to the $i$th element of the dataset $D_k$, and $F_k(\theta)$ is the loss function over $\theta$, which can be defined as follows:

$$F_k(\theta) = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} \left( f(X_k^i \mid \theta) - Y_k^i \right) \tag{2}$$
In this case, CL can collect many datasets coming from many clients, and it can converge to a lower loss and a higher model accuracy. However, the communication overhead can be very high, and the security risk is also high, because all the data are forwarded to the CS.
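As a rough numerical sketch of the weighted objective in Eqs. (1)–(2): the snippet below evaluates the global loss for a fixed $\theta$. The linear predictor and the absolute per-sample error are illustrative stand-ins for the paper's generic $f$ and loss:

```python
import numpy as np

def client_loss(theta, X, Y):
    # F_k(theta): average per-sample error of a linear predictor
    # f(x | theta) = x . theta (an illustrative choice for the generic f).
    preds = X @ theta
    return np.mean(np.abs(preds - Y))

def global_loss(theta, datasets):
    # F(theta) = sum_k (|D_k| / |D|) * F_k(theta), mirroring Eq. (1).
    total = sum(len(Y) for _, Y in datasets)
    return sum(len(Y) / total * client_loss(theta, X, Y) for X, Y in datasets)

# Two clients with different dataset sizes.
d1 = (np.array([[1.0], [2.0]]), np.array([1.0, 2.0]))              # |D_1| = 2
d2 = (np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))  # |D_2| = 3
loss = global_loss(np.array([1.0]), [d1, d2])
```

In CL the server can evaluate this objective directly because it holds every $D_k$; in FL, each client can only compute its own `client_loss` term.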
Compared to CL, FL does not involve the transmission of datasets to the CS. Instead, the model training is performed at the clients, while the model parameters produced by the clients are aggregated at the CS, as shown in Fig. 1.B. Consequently, the optimization problem in Eq. (1) is solved in a distributed way: each client minimizes $F_k(\theta)$ on its local dataset, and the CS aggregates the resulting model parameters.
3.3. Hybrid FL
In ML tasks, training a model requires huge computational power to compute the model parameters. The computational
capability of the client devices cannot always satisfy this requirement. In order to train the ML model effectively, taking
into account the computational capability of the clients, a hybrid training framework can also be considered. Hybrid FL
assumes that only the portion of the clients with sufficient computational power performs FL, while the remaining clients, which lack adequate computational capability, send their datasets to the CS for model computation, as illustrated in Fig. 2. At the CS, the aggregated model is computed as:

$$\theta = \frac{1}{|D|} \sum_{k=1}^{K} |D_k| \cdot \theta_k \tag{7}$$
where $K$ is the set of clients with good computation capability that can compute learning parameters, as expressed in Eq. (5). $K'$, instead, is the set of clients with reduced computing capability that prefer to send their local datasets to the CS and wait to receive the tuned model parameters. Eq. (7) shows the aggregation function performed by the CS before sending the model parameters to all the clients. The model is hybrid both because data reaches the CS in two different ways and because the CS has to compute training model parameters as well as aggregate already computed trained model parameters. In this way, some resources are preserved for low-computing devices, but more data is sent to the CS compared to FL. Moreover, in terms of security, there are some potential risks related to the portion of the datasets forwarded to the CS. This solution is thus a compromise between CL and FL, but it is strictly dependent on the ratio of powerful to low-power computing devices.
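The CS-side aggregation of Eq. (7) can be sketched as follows. Here the CS first trains parameters for the weak clients' uploaded datasets (the `train_on` callback is a hypothetical stand-in for that training step), then takes the dataset-size-weighted average of all parameter vectors:

```python
import numpy as np

def hybrid_aggregate(fl_updates, uploaded_datasets, train_on):
    """Eq. (7): theta = (1/|D|) * sum_k |D_k| * theta_k.

    fl_updates: list of (theta_k, |D_k|) from clients that ran FL locally
                (the set K).
    uploaded_datasets: raw datasets from weak clients (the set K'); the CS
                       computes their theta_k via train_on(dataset).
    """
    entries = list(fl_updates)
    for data in uploaded_datasets:
        entries.append((train_on(data), len(data)))  # CS-side model training
    total = sum(size for _, size in entries)
    # Dataset-size-weighted average of all parameter vectors.
    return sum(size * theta for theta, size in entries) / total

# Toy usage: 2 FL clients + 1 weak client whose "model" is its data mean.
fl = [(np.array([1.0, 1.0]), 4), (np.array([3.0, 3.0]), 4)]
weak = [np.array([0.0, 10.0])]  # 2 raw samples uploaded to the CS
theta = hybrid_aggregate(fl, weak, train_on=lambda d: np.full(2, np.mean(d)))
```

Note that the weak clients' raw data crosses the network, which is exactly the security and overhead trade-off discussed above.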
A novel architecture, which will be explained in more depth in the next section, is the HEFL. In this case, as presented in Fig. 3, the edge layer is empowered by extending the fog/edge nodes in terms of computation tasks. The idea is to improve the training model parameters in the layers that are close to the computing devices (e.g., IoT devices). This can reduce the overall dataset transmission towards the CS and allows using the aggregation function of the edge nodes, which can merge model parameters coming from different datasets. The dataset diversity should improve the convergence of the trained model, similarly to CL, while preserving the energy and communication cost of the overall learning model, because a lower number of iterations is performed at the top layers approaching the CS. In Fig. 3 we present an architecture with two edge layers, where a double aggregation is performed separately at the second and third layers. We use a sequential approach, and we will show in the next sections what kinds of benefits and flexibility can be obtained.
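To make the two-edge-layer idea concrete, the sketch below performs the double aggregation sequentially: leaf models are merged per group at the first edge layer, and the resulting group models are merged again at the second layer before anything would travel towards the CS. Group sizes, weights, and scalar "models" are illustrative assumptions:

```python
import numpy as np

def aggregate(models_with_sizes):
    # Dataset-size-weighted FedAvg-style merge of (theta, size) pairs.
    total = sum(s for _, s in models_with_sizes)
    return sum(s * t for t, s in models_with_sizes) / total, total

def hierarchical_aggregate(groups):
    """groups: list of edge-node groups, each a list of (theta, |D_k|) leaves.

    First aggregation: one model per low-level edge aggregator.
    Second aggregation: merge the edge models at the next layer.
    Only this final model would be forwarded towards the CS.
    """
    edge_models = [aggregate(group) for group in groups]  # layer-2 merge
    return aggregate(edge_models)[0]                      # layer-3 merge

# Toy usage: two groups of leaf devices with scalar "models".
g1 = [(np.array([2.0]), 1), (np.array([4.0]), 3)]  # edge model: 3.5, size 4
g2 = [(np.array([10.0]), 2)]                       # edge model: 10.0, size 2
theta = hierarchical_aggregate([g1, g2])
```

Because each edge model carries its accumulated dataset size upward, the two-step weighted merge yields the same result as a flat weighted average over all leaves, while each upload to the upper layer carries one model per group instead of one per device.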
4. Proposal
This section introduces our novel hierarchical, energy-efficient, and dynamic edge FL approach, called HED-FL. HED-FL builds on the HEFL architecture already introduced in the previous section. Our approach aims at creating, at different layers, models for object classification, starting from data gathered at the edge devices. Here, we will detail how our approach can provide energy savings compared to the traditional FL approach. Moreover, we will propose two novel algorithms, one static and one dynamic, for the creation of object classification models, calibrating the rounds performed between the architecture layers so as to find a good trade-off between energy saving and model accuracy.
Fig. 4 summarizes the proposed HED-FL approach. Here, we abstract the underlying physical network, including all the edge devices, as a graph in which the nodes represent the edge devices (or a cloud node, at the top layer) and the links represent the network connections among the nodes. Our model makes use of a well-designed hierarchical aggregation structure imposed on the edge devices. More specifically, a tree structure is used to organize the edge devices. In such a structure, the leaf nodes represent the learning workers, and the non-leaf nodes represent model aggregators. Our architecture
F. De Rango, A. Guerrieri, P. Raimondo et al. Pervasive and Mobile Computing 92 (2023) 101804
considers a set of nodes organized in a multi-layer fashion where, in the bottom layer (Layer0), we have a set (that
can also be quite big) of N0 IoT edge devices scattered in the environment that perform sampling through their sensors.
Operators acting in the IoT edge devices' proximity can locally label such samplings. These labeled samples contribute
to the training of a local Artificial Neural Network (ANN) model. This local model is typically trained over the local data a
number of times, called epochs. Once nodes at Layer0 prepare a local model, they forward it to one of the N1 low-level Edge Node
Aggregators of Layer1. Each low-level Edge Aggregator (EA) runs an algorithm to merge the received models to create a
new model that considers all the models received by its child nodes. Now, there are two options: (i) the node in Layer1
disseminates the calculated model to the child nodes, that substitute their actual model with the new one and continue
with their training, or (ii) the node in Layer1 forwards the calculated model to the upper layers to have high-level unions
of the model. The number of communication rounds (between Layer0 and Layer1), shown in Fig. 4 as K1 , guides the
choice between the options above. In each communication round, usually, a limited number of children of a low-level
EA are randomly selected to perform the training described above, which is followed by aggregation of the individual
model weights into one Layer1 model. Once the communication rounds are finished, the model is forwarded to the Edge
Aggregator of the next layer, until the LayerY Cloud Aggregator (CA), which is the top layer aggregator. Between every
subsequent layer, Ki communication rounds can be performed before sending the new aggregated model to the aggregator
in the layer above. It is worth noting that, from K2 to KY, the communication rounds are intended as rounds in which
only model weights are exchanged (from the lower to the upper layer and vice versa), since Layer0 only performs the sensing
and the training. In our proposal, we assume that, up to Layer_(Y − 1), all the layers can be associated with EAs all
operating in a single or in associated organizations, while the CA node is the only one operating in the cloud. This is a
feasible assumption because, if an aggregator operating in the Layer_i, with i < Y is in the cloud, it can be considered
as a CA (also because, since it is in the cloud, all the nodes in the Layer_(i − 1) should reach it) and can merge all the
aggregators of the layers Layer_l, with l ≥ i. So, we can consider all the edge nodes in adjacent layers as one or two hops
away from each other. On the other hand, we can consider the nodes in Layer_(Y − 1) to be P hops away from the CA node
in LayerY, where P is a value that cannot be calculated a priori for each datacenter. Anyway, some works in the literature
have found that the average hop count on the Internet between a random starting point and a random ending point is
between 12 and 14 [32,33]. This means that we can consider a datacenter to be at least 12 hops away from the nodes at
Layer_(Y − 1).
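The layered tree described above can be sketched in a few lines; the following is an illustrative model (class and function names are ours, and the round-robin grouping is a simplification), not the paper's implementation:

```python
# Illustrative sketch of the HED-FL aggregation tree: leaf nodes are
# learning workers (Layer0), non-leaf nodes are model aggregators.
class Node:
    def __init__(self, layer):
        self.layer = layer          # 0 = edge device, Y = Cloud Aggregator
        self.children = []          # empty for Layer0 workers

    def is_worker(self):
        return not self.children

def build_tree(layer_sizes):
    """layer_sizes[i] = number of nodes N_i at Layer_i, e.g. [28, 6, 2, 1]."""
    layers = [[Node(0) for _ in range(layer_sizes[0])]]
    for i, size in enumerate(layer_sizes[1:], start=1):
        parents = [Node(i) for _ in range(size)]
        # round-robin assignment of lower-layer nodes to aggregators
        for j, child in enumerate(layers[-1]):
            parents[j % size].children.append(child)
        layers.append(parents)
    return layers

# An example shape: 28 edge devices, 6 + 2 edge aggregators, 1 CA on top.
layers = build_tree([28, 6, 2, 1])
```

Each non-leaf node merges the models received from its children; the single node at the top layer plays the role of the CA.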
Several algorithms are presented in the literature to aggregate models in FL. Federated Averaging (FedAvg) [18] is the
first and most common algorithm used to merge the locally trained models at the end of each communication round.
In [34], authors presented FedProx, which tries to improve FedAvg by adding a proximal term. FedNova [35], instead, is
a normalized averaging method that eliminates objective inconsistency while preserving fast error convergence. In this
paper, we will consider an aggregation strategy based on the most common FedAvg algorithm with a simple weighted
strategy. This is necessary because a training set associated with a specific client can be very diverse in terms of the
number of training tuples with respect to other clients. For this reason, the averaging performed on each parameter value
also considers the size of the training set analyzed.
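The size-weighted averaging just described can be sketched as follows; plain Python lists stand in for parameter tensors, and the function name is ours:

```python
def fed_avg(models, sizes):
    """FedAvg with a weighted strategy: each parameter is averaged across
    clients, weighting every client model by its local training-set size."""
    total = sum(sizes)
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(m[name][i] * n for m, n in zip(models, sizes)) / total
            for i in range(len(models[0][name]))
        ]
    return merged

# Two clients with different training-set sizes: the larger one dominates.
m1 = {"w": [1.0, 0.0]}
m2 = {"w": [0.0, 1.0]}
print(fed_avg([m1, m2], sizes=[300, 100]))   # {'w': [0.75, 0.25]}
```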
As already shown in [11], the introduction of FL as a distributed ML process can allow for
optimizing global consumption in terms of energy and, consequently, in terms of CO2 emissions while maintaining good
performance.
Let us now consider our proposal and compare it with the ‘‘traditional’’ federated learning analyzed in [11]. In the
‘‘traditional’’ FL, the total energy used in a round of the FL process can be defined as:
EFLT = ETrainT + EComT (8)
where ETrainT is the total energy used for the training in the edge nodes and EComT is the total amount of energy spent
for all the communications between the edge nodes and the coordinator. Eq. (8) does not consider the energy spent for
the fusion of the parameters sent by all the nodes to the aggregator in the cloud since it can be considered very small
with respect to the energy spent both for training and communication. Similarly, the total energy spent in a round of the
proposed HED-FL can be seen as:
EFLH = ETrainH + EComH (9)
where ETrainH is the total energy used for the training in the edge nodes and EComH is the total amount of energy spent
for all the communications between all the nodes in the hierarchical architecture considered, and depends on the model
parameters size. Also, this equation does not consider the energy spent for the fusion of the parameters sent by all the
nodes to the aggregators since it can be considered very small with respect to the energy spent both for training and
communication.
It is worth noting that, in the following evaluation, we will consider the case in which, in all the rounds, we will
always forward the models merged in all the Edge Aggregator nodes until the Cloud Aggregator and vice versa. While
this is mandatory in the traditional FL (where we have two layers only), in the proposed HED-FL we can have some rounds
in which, for example, the low-level EAs of Layer1 aggregate K1 times the models from the edge nodes in Layer0 without
forwarding the resulting models to the upper layers. This last aggregation strategy can allow the proposed HED-FL model
to save a good amount of communication energy.
Let us now focus on the element EComH in Eq. (9). It can be detailed as follows:
EComH = 2Ehop (N0 + N1 + · · · + NY−2) + NY−1 PEhop (11)
where Y + 1 represents (as already highlighted in Fig. 4) the number of layers in the considered HEFL network, Ni
represents the number of nodes in the ith layer, P is the average number of hops between the nodes at Layer_(Y − 1) and
a CA in a datacenter, and Ehop is (as an assumption) the mean energy needed to allow the (bi-directional) communication
through a communication hop (the factor 2 accounts for the two hops between adjacent edge layers). At the end of this
subsection, we will introduce a real estimation of the communication energy spent, firstly reported in [11]. It is important
to highlight that Ni is also the number of communications between the Ni nodes at Layeri and the Ni+1 nodes at Layeri+1.
Moreover, P can also be considered as the number of hops between all the nodes and the CA in the traditional FL. Based
on this observation, and considering a traditional FL case involving N0 edge nodes (the same number of nodes as the ones
in Layer0 of the considered HEFL architecture) with one CA only, EComT becomes:
EComT = N0 PEhop (12)
To make the comparison fair, we consider configurations with the same number of equivalent low-level rounds (i.e., with K1 · K3 = 50), for example:
• K1 = 50 and K3 = 1, namely 50 low-level rounds with only one model aggregation in the cloud;
• K1 = 25 and K3 = 2, that means 25 low-level rounds repeated twice (K3 = 2) with two model aggregations in the
cloud;
• K1 = 10 and K3 = 5, that provides the repetition of 10 low-level rounds followed by a model aggregation. Such a
process is repeated 5 times.
Obviously, in order to make a proper comparison, the number of active clients per round and the training hyperparameters
(e.g., epochs and batch size) must also be the same.
Going back to our example, in which ETrainT is equal to ETrainH , let us consider the case in which the hierarchical structure
considered for our HEFL is a binary tree. This can be seen as the worst case because, usually, the aggregator nodes are
devoted to merging models coming from more than two nodes. In this case, Eq. (11) becomes:
EComH = 2(2^(Y+1) − 4)Ehop + 2PEhop (13)
while Eq. (12) with the same number of nodes in the edge is:
EComT = (2^Y)PEhop (14)
So, what we need to demonstrate is:
2(2^(Y+1) − 4) + 2P < 2^Y · P (15)
By solving Eq. (15), it results that, independently of the number of layers, the inequality is verified for all values of P
greater than 4:
P > 4 (16)
This means that, in the majority of cases, the proposed HED-FL approach allows saving energy with respect to the
traditional FL approach, since in practice P is almost always greater than 4. Taking into consideration what has been
introduced in [32,33], P can be considered a number between 12 and 14. As an example, let us consider a real-case scenario, in which we have:
(i) 28 edge nodes in the Layer0, divided into groups of 4 or 5 nodes, (ii) 6 low-level EAs in Layer1, (iii) 2 EAs in Layer2,
and (iv) 1 CA in Layer3. If we substitute such numbers in Eq. (11), we have:
EComH = 2(28 + 6)Ehop + 2PEhop = 68Ehop + 2PEhop (17)
By considering the same number of edge nodes in a traditional FL approach, we have:
EComT = 28PEhop (18)
This means that our HED-FL approach, in this case, performs better (in terms of energy) than the FL approach for
P > 2.6 hops (68 + 2P < 28P implies P > 68/26 ≈ 2.6), which means that, in a real case, the HED-FL approach is very
likely better than the FL one in terms of energy consumption. It is worth noting that the proposed approach can perform even better if we consider that:
• we can execute several rounds of training only involving the nodes in the first two layers;
• each aggregator in the edge can aggregate many more nodes than the ones already considered in the examples given.
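The inequalities above can be checked numerically; the following small script (ours) works in units of Ehop:

```python
# Numeric check of the communication-energy comparison, in units of E_hop.
def e_com_h_binary(Y, P):
    return 2 * (2 ** (Y + 1) - 4) + 2 * P      # Eq. (13), binary tree

def e_com_t_binary(Y, P):
    return (2 ** Y) * P                        # Eq. (14), traditional FL

# Independently of the number of layers, the break-even point is P = 4
# (Eq. (16)), and the hierarchy wins for any P > 4:
for Y in range(2, 8):
    assert e_com_h_binary(Y, 4) == e_com_t_binary(Y, 4)
    assert e_com_h_binary(Y, 12) < e_com_t_binary(Y, 12)

# 28-node example, Eqs. (17)-(18): 68 + 2P vs 28P, break-even near P = 2.6.
P = 12                                         # typical Internet hop count [32,33]
print(68 + 2 * P, "vs", 28 * P)                # 92 vs 336
```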
Some simulated experimental results on the energy saving and on the accuracy of the proposed HED-FL approach will be
analyzed in Section 5. In particular, experimental results about energy saving/consumption will be based on the following
equation, which inherits its approach from [11] and extends it for our proposal.
EComH = Σ_{i=1}^{Y} Σ_{k=1}^{Ki} Σ_{j=1}^{Hops(L(i−1)−Li)} Σ_{s=1}^{N(i−1)} [2 · S · (D + U) / (D · U)] · Pr (19)
Eq. (19), in particular, considers, for each layer i from Layer1 up to LayerY, all the rounds Ki to be performed between
Layer(i − 1) and Layeri. The inner sum considers the energy consumption of all the nodes, belonging to the nodes'
cluster, that send weights to the local aggregator and receive the updated weights. An average evaluation is adopted,
applying an average power value Pr consumed by the router in the data communication and considering the uplink
speed U and the downlink speed D. Moreover, for each round, it considers all the hops to be crossed between
two consecutive layers and all the nodes involved in sending/receiving data. For the sake of simplicity, this equation
considers a mean hop-distance between two nodes belonging to two consecutive layers Li−1 and Li through the function
Hops(L(i − 1) − Li). In Fig. 4, for example, between Layer1 and Layer2, Hops(1 − 2) will provide 2 hops, whereas
Hops((Y − 1) − Y) will provide the number P, indicating the P hops to be traversed to reach the servers on the cloud. The
second sum considers all the rounds Ki applied between two layers and the first sum considers all the layers of the
hierarchical structure involved in the aggregation and federated learning at the edge. Furthermore, in Eq. (19), S is the
size of the model transferred between the layers, D is the download speed, and U is the upload speed. The value 2 in the
inner term of the formula is a multiplier that considers that each model is sent to the upper layer and back to the lower
layer. It is worth noting that such an equation is simplified since it considers mean values for D, U, and Pr . Anyway, such
values can simply become functions depending on the involved routers/nodes.
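As a sketch, Eq. (19) can be coded directly with the mean values it assumes; the data layout (lists and dicts) and the MB-to-Mb conversion factor 8 are our assumptions, since the equation itself works with mean values only:

```python
def e_com_h(layer_sizes, rounds, hops, S, D, U, Pr):
    """Eq. (19): total HED-FL communication energy in Joule.
    layer_sizes[i] = N_i nodes at Layer_i; rounds[i] = K_i rounds between
    Layer_(i-1) and Layer_i; hops[i] = Hops(L_(i-1) - L_i); S: model size
    (MB); D, U: download/upload speeds (Mbps); Pr: router power (W).
    The MB -> Mb factor 8 is our assumption."""
    # time (s) to send the model up one hop and receive it back
    per_transfer = 2 * (8 * S) * (D + U) / (D * U)
    total = 0.0
    Y = len(layer_sizes) - 1
    for i in range(1, Y + 1):
        total += rounds[i] * hops[i] * layer_sizes[i - 1] * per_transfer * Pr
    return total

# Shape of the case study in Fig. 5 (20 edge nodes, 4 + 2 EAs, 1 CA), with
# 2 hops between edge layers and an assumed P = 12 hops to the cloud.
E = e_com_h([20, 4, 2, 1], rounds={1: 50, 2: 1, 3: 1},
            hops={1: 2, 2: 2, 3: 12}, S=37.26, D=102.12, U=54.0, Pr=40.0)
```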
In this section, we have described our HED-FL approach to federated learning, also making a preliminary
comparison with the traditional FL in terms of communication energy. Moreover, we now propose two novel heuristics
to choose when to execute model aggregation, in order to both obtain high accuracy and keep energy consumption low.
As already seen, a hierarchical structure gives several advantages in terms of energy spent on communications. During
the traditional FL, an orchestrator, usually in the cloud, asks several clients to perform model training locally. Then, the
orchestrator collects all the models from the designated clients and merges them into a global model that is then re-distributed
to all the clients. This approach was introduced to cope with the privacy issues that have emerged in the last few years.
As a matter of fact, this procedure avoids sending sensitive data towards the cloud because the clients share only the
parameters of the models. The number of times the model is merged and shared with the clients heavily influences the
energy consumption to build the model. Using a hierarchical edge architecture with groups of clients in the Layer0 directly
communicating with EAs in Layer1 allows both to reduce the number of communications towards the cloud and to have
preliminary aggregated models already on the edge. Such aggregated models can be recursively refined (some rounds can
be executed) or aggregated with other models at higher layers. The number of rounds for each layer (above indicated
with K1 , K2 , . . . , KY ) can determine the quality of the model and also the energy used for the training. Anyway, it is really
hard to define a perfect strategy to execute rounds at different layers, maximizing the final accuracy, and minimizing the
energy spent on communication. Moreover, such a strategy can depend on several factors beyond the number of rounds.
In the following, we present two types of algorithms for the realization of HEFL, both belonging to our HED-FL approach:
a static and a dynamic one. Such algorithms rely on two different heuristics to choose round execution at different layers.
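As an illustration of the dynamic heuristic (detailed in Section 4.3.2: K1 starts at 1 and is increased whenever the measured accuracy exceeds a threshold, until about 50 equivalent rounds are reached), the following is a toy sketch with names of our choosing, not the paper's implementation:

```python
def dynamic_schedule(accuracies, threshold, budget=50):
    """Toy sketch of the dynamic heuristic: after each global (Layer3)
    aggregation, K1 is increased once the accuracy exceeds the threshold;
    training stops when the equivalent rounds reach the budget.
    `accuracies` stands in for the accuracy measured per global round."""
    k1, equivalent, schedule = 1, 0, []
    for acc in accuracies:
        if equivalent + k1 > budget:
            break
        schedule.append(k1)
        equivalent += k1
        if acc > threshold:
            k1 += 1     # fewer global aggregations once accuracy is good
    return schedule, equivalent

sched, eq = dynamic_schedule([0.30, 0.40, 0.44, 0.47, 0.50] * 10, threshold=0.42)
```

Early global rounds (low accuracy) keep K1 = 1, forcing frequent high-level aggregations; later rounds run longer at the edge, saving cloud communication.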
This Section will show a case study purposely built to highlight all the benefits of the approach proposed in Section 4. In
particular, the next subsection will describe the case study setup. Here, we will show the configuration of the considered
network, discuss the used dataset, detail the Neural Network deployed in the nodes, and set some parameters
of the experiment. Subsequently, we will present some results and make a comparison between our HED-FL
approach and the traditional FL. Moreover, we will analyze and compare several different configurations using the
proposed approach. Finally, we will show some energy comparisons on the analyzed configurations.
We used a simple configuration composed of 4 layers of nodes for our experiments, as shown in Fig. 5. The nodes at
Layer0 are the only ones to perform training, while the nodes in the layers above manage the merging of the model and
the communication. Layer0 contains 20 nodes (edge devices), divided into 4 groups of 5 nodes each.
Each group refers to a first-level EA in Layer1. Layer2 contains 2 nodes while, in the last layer Layer3, there is only one
CA.
To perform our experiments, we have used the well-known CIFAR-10 dataset1 that is widely used in literature to
train and evaluate image classification models [36]. It consists of 32 × 32 color images divided into 10 classes with 6000
images per class. Moreover, the dataset is already split into 50 000 training images and 10 000 test images, and each
training example contains objects of exactly one class without overlaps. Usually, in a classic machine learning computer
vision application, centralized training in a datacenter uses the whole dataset to train a model on all the available classes.
In such a case, the CIFAR-10 dataset is used as it is, namely Independent and Identically Distributed (IID), providing the
same amount of training instances for all the classes. This IID condition is rarely present in real-life applications, and, for
this reason, we decided to pre-process the CIFAR-10 dataset in order to shape the dataset in a NON-IID way and distribute
it among our 20 edge devices. The resulting distribution among clients is shown in Fig. 6. We tried to provide each
client with a very heterogeneous portion of the CIFAR dataset, with an average number of classes per client between 2
and 3. Using a NON-IID dataset for a federated learning application allows depicting a scenario very close to the
real-world one, in which some Layer0 nodes can also be randomly disconnected, thus leaving the nodes at Layer1 with
reduced information about several classes. In the following, we will consider only three working nodes per low-level
round to emulate frequent disconnections.
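A possible way to obtain such a label-skewed split is sketched below; this is an illustrative partition with 2-3 classes per client (seed, sizes, and names are ours), not the exact distribution of Fig. 6:

```python
import random

def non_iid_split(labels, n_clients=20, classes_per_client=(2, 3), seed=0):
    """Label-skew partition: each client receives samples drawn from only
    2-3 of the available classes."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = []
    for _ in range(n_clients):
        chosen = rng.sample(sorted(by_class), rng.randint(*classes_per_client))
        pool = [i for c in chosen for i in by_class[c]]
        clients.append(rng.sample(pool, len(pool)))   # shuffled local dataset
    return clients

# Toy labels: 1000 samples evenly spread over 10 classes.
labels = [i % 10 for i in range(1000)]
parts = non_iid_split(labels)
```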
The Neural Network we decided to use is a ‘‘Visual Geometry Group (VGG) 11’’, named after the homonymous group at
Oxford University.2 VGG-11 is an 11-layer convolutional network introduced by Simonyan and Zisserman [37]. It is widely
used in the literature for image recognition and, with respect to the other networks of its family, it uses fewer convolutional
layers and therefore fits better on a low-power device during training. The kernel size of the convolutional layers is
Table 1
Simulation parameters of the considered case study.
Parameter Value Description
Common parameters
num clients 20 Total number of clients
num selected 12 Number of selected clients per round
Epochs 5 Number of times each client iterates over the training set
Batch size 32 Number of training inputs analyzed before model update
Hierarchical parameters
Levels 3 Number of levels of the hierarchical architecture
Number of groups 4 Number of groups in the first level
Number of clients per group 5 Number of clients in each group
Number of clients selected per group 3 Number of clients randomly selected in each group per round
K3 rounds Variable Number of times the model is aggregated at Layer3
K1 rounds Variable Number of rounds executed between Layer0 and Layer1
Federated parameters
num rounds 50 Number of training rounds
num clients selected 12 Number of selected clients each round
3 × 3 with rectified linear unit (ReLU) activation function and 5 maxpool layers to reduce the dimensionality of the input.
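The layer counts given above can be checked against the standard VGG-11 (‘‘A’’) configuration from [37]:

```python
# VGG-11 configuration "A": numbers are conv output channels, "M" marks a
# max-pooling stage; three fully connected layers follow the features.
VGG11_CFG = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]

conv_layers = [c for c in VGG11_CFG if c != "M"]
maxpools = VGG11_CFG.count("M")
fc_layers = 3      # 4096, 4096, and num_classes output units

print(len(conv_layers), maxpools, len(conv_layers) + fc_layers)   # 8 5 11
```

Eight 3 × 3 convolutional layers plus three fully connected layers give the 11 weight layers of the name, with five max-pooling stages reducing the dimensionality of the input.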
In this section, we will analyze and compare our HED-FL approach, applied to a network as the one reported in Fig. 5,
with a traditional FL (hereafter referred to as ‘‘Classic’’) on a network with the same number of edge devices (namely
20). We will test our approach in different configurations. In particular, by applying the Static Algorithm introduced in
Section 4.3.1, we will consider the following configurations (note that, in our example, Y = 3): {K1 = 10; K3 = 5},
{K1 = 50; K3 = 1}, {K1 = 1; K3 = 50}, {K1 = 5; K3 = 10}, {K1 = 25; K3 = 2}, and {K1 = 2; K3 = 25} (hereafter referred to as,
respectively, Static(10-5), Static(50-1), Static(1-50), Static(5-10), Static(25-2), and Static(2-25)). It is worth noting that, for
all the considered configurations, we have used K2 = 1. We postpone the study of different values of K2 to future work.
By applying the Dynamic Algorithm provided in Section 4.3.2, we will consider the following thresholds for the accuracy:
{0.42}, {0.45}, {0.46}, and {0.53} (hereafter referred to as Dynamic(0.42), Dynamic(0.45), Dynamic(0.46), and Dynamic(0.53)).
We have selected such thresholds, among many simulations done with different accuracy thresholds, since they guarantee
a good final accuracy and good energy saving with respect to the Classic case. To better compare the cases taken into
consideration, all the other parameters used in the simulations are kept constant. In particular,
all the configuration parameters are shown in Table 1 where, in the ‘‘Common Parameters’’ section, the parameters used
by all the simulations (i.e., Classic, Static, and Dynamic) are presented. Here, num clients represents the total number
of clients available in Layer0. Not all of these clients are used in each round, but only a num selected number of nodes
are randomly chosen among all the clients. epochs and batch size are the hyper-parameters used on each client during
the training phase. The ‘‘Federated Parameters’’ section refers to the parameters used in the Classic FL approach. The
‘‘Hierarchical Parameters’’ section shows the parameters used for the Hierarchical approach. It is worth noting that, in
the Static approach, the number of rounds K3 and K1 is fixed (as reported above in this section) while, in the Dynamic
approach, they dynamically change. Finally, while in the Classic FL approach 12 clients are randomly selected from the
client population, in the hierarchical approach, in each of the four groups, 3 clients out of 5 are randomly selected so that
the total number of active clients per round is the same (i.e., 12).
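The two selection policies can be sketched as follows (function names and the seed are ours; the paper does not show its selection code):

```python
import random

def classic_selection(n_clients=20, n_selected=12, seed=0):
    """Classic FL: 12 clients drawn at random from the whole population."""
    return sorted(random.Random(seed).sample(range(n_clients), n_selected))

def hierarchical_selection(groups=4, per_group=5, selected=3, seed=0):
    """HED-FL: 3 of the 5 clients of each Layer1 group, 12 clients in total."""
    rng = random.Random(seed)
    picked = []
    for g in range(groups):
        members = range(g * per_group, (g + 1) * per_group)
        picked += rng.sample(members, selected)
    return sorted(picked)
```

Both policies activate the same number of clients per round, which keeps the comparison between the two approaches fair.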
All the simulations hereafter considered have been realized by using the Pro version of Google Colaboratory.3 Google
Colaboratory, or ‘‘Colab’’ for short, allows users to write and execute Python code in any browser, giving access to
powerful GPUs and providing easy sharing of code among colleagues. The framework used for ANN training and merging is
PyTorch.4 In the simulations, first of all, the Classic FL and the Static algorithm of HED-FL have been compared. It has to be
noted that, to make a fair comparison of the two approaches, the number of equivalent rounds for the Static algorithm
has always been kept constant and equal to the num rounds reported in Table 1 for the Classic FL approach. For
this reason, the product between all the considered K1 and K3 is always 50.
Fig. 7 shows a comparison, between the Classic FL approach and some Static HED-FL cases, regarding the accuracy at
the end of 50 equivalent rounds. This chart highlights that it is possible to obtain, with our Static algorithm, a similar
result, in terms of accuracy, to the Classic FL approach but using a HEFL network. In particular, the best-performing case
is the one in which K1 = 1 and K3 = 50. This means that, every time a model aggregation is done at Layer1, the model
is aggregated up to the top layer and disseminated to all the nodes down to Layer0. Obviously, this last case requires more
communication than the other Static cases. On the other hand, the case in which K1 = 50 and K3 = 1 appears to be
the worst one, since the single models, aggregated at Layer1 for 50 times, turn out to be too specific to the random datasets
Fig. 7. Accuracy comparison between the Classical FL approach and some Static HED-FL cases.
Fig. 8. Accuracy comparison, over global rounds, for the Dynamic cases considered.
given to the single groups. In this case, a single aggregation at the higher level is insufficient to have good effects on the
global accuracy. Good compromises, among these experiments, are the Static(5-10) and Static(10-5) cases, which greatly
reduce the communications towards the cloud while still giving a quite good global accuracy.
We also simulated the novel Dynamic algorithm of HED-FL introduced in Section 4.3.2 that dynamically changes the
number of K1 rounds (initially set at 1) according to an accuracy threshold. When the threshold is exceeded, the K1
number of rounds is increased for each global aggregation done at Layer3. In our experiments, global rounds continue
until the equivalent rounds are equal to or close to 50. Figs. 8 and 9 show the results for the Dynamic cases considered
regarding accuracy and loss over global rounds. It is worth noting that even the slightest change in the accuracy threshold
can cause significant differences in the final model accuracy. At present, it is impossible to know a priori which is
the best accuracy threshold; this can be a task for future research works. The accuracy threshold heavily influences the
number of low-level K1 rounds followed by high-level aggregations at Layer3. Anyway, it is evident that, in order to build
final models with good accuracies (also using models trained on NON-IID datasets), a certain number of aggregations
until the Layer3 needs to be performed at the beginning of the training process. Fig. 8 highlights that the best performing
Dynamic cases, in terms of accuracy, are the Dynamic(0.42) and the Dynamic(0.46). Fig. 10 compares such two cases with
the Classic FL, Static(1-50), and Static(5-10) cases. This chart shows that the Dynamic algorithm in our HED-FL approach
with a threshold accuracy of 0.42 outperforms all the other considered cases. To better understand the evolution of both
the accuracy and the loss of the models built during the training, Figs. 11 and 12 provide a comparison, respectively for
accuracy and loss, between the best cases of each approach (i.e., Classic, Static, and Dynamic). For each model, the figures
Fig. 9. Loss comparison, over global rounds, for the Dynamic cases considered.
Fig. 10. Accuracy comparison among the Classic and some Static/Dynamic cases.
show a value for each time a fusion of the model is performed at Layer3, and on the x-axis the equivalent rounds are
reported.
Regarding the energy consumption for communication purposes in the considered cases, we applied Eq. (19) to make
a comprehensive comparison. To reach our purpose, we hypothesized an average router consumption of Pr = 40 W.
Such consumption has been chosen as an average value among a set of considered routers.5 Regarding the values of U
and D, we chose to use the global average values for fixed broadband taken from the Speedtest Global Index,6 namely
D = 102.12 Mbps and U = 54 Mbps. Finally, regarding the size S of the model, we have considered the one actually used
in the performed experiments, which weighs 37.26 MB (PyTorch model). Table 2 shows the results of our computations
together with the related accuracy values. Here, we show the energy consumption per pair of adjacent layers. In particular, we
show the ECom for the communications between Layer0 and Layer1 (ECom - L0-L1), between Layer1 and Layer2 (ECom - L1-L2),
and between Layer2 and Layer3 (ECom - Cloud-Edge). We express all the measures in Joule [J] and kilowatt-hour [kWh]. It
is worth noting that, in the Classic approach, only the column ECom - Cloud-Edge has a value since all the communications,
in this case, are between the edge devices and the cloud. Regarding the values in the columns for the other cases, it
is evident that ECom - L0-L1 is constant for all the configurations because we decided to train the models on the same
Fig. 11. Accuracy evolution, over equivalent rounds, for Classic, Static(1-50), and Dynamic(0.42) cases.
Fig. 12. Loss evolution, over equivalent rounds, for Classic, Static(1-50), and Dynamic(0.42) cases.
number of equivalent rounds. The only row in which ECom - L0-L1 is different is Dynamic(0.46) because, in this case, the
final equivalent rounds are 47 only (adding one more high-level round would mean reaching a number of equivalent
rounds higher than 50). While ECom - L1-L2 does not have a significant impact on the total energy consumption, ECom - Cloud-Edge
represents, in many cases, an important component of the total energy consumption, since it is the one including
all the hops towards the cloud. Looking at the TOTAL ECom columns, it is evident that the HED-FL approach is always
more energy-efficient than the Classic FL. It is important to highlight that the Dynamic cases considered, which already
outperform the Static ones in terms of accuracy, still behave well in terms of energy consumption. In particular,
the Dynamic(0.46) and Dynamic(0.42) cases consume, respectively, only 21% and 19% of the communication
energy consumed in the Classic case.
To better highlight the different levels of the TOTAL ECom in the considered cases, we provide Fig. 13, where we also show
the energy spent in the communication between adjacent layers.
6. Conclusions
In this work, a novel hierarchical, energy-efficient, and dynamic edge Federated Learning approach, called HED-FL,
has been introduced. Our approach is based on a hierarchical architecture of edge nodes that allows the aggregation of
Table 2
The consumed energy (in Joule and in kWh) for the communication between all the layers of the considered case study.
Neural Network models at different layers, thus enabling distributed intelligence and learning at the edge. In addition, we
have introduced two heuristics considering the effect of static and dynamic round execution on the different layers of the
proposed Hierarchical Edge Federated Learning (HEFL) architecture.
Evaluations executed in our case study highlight how the proposed HED-FL approach allows overcoming the
accuracy of a classic FL while still saving much energy. In particular, simulations reveal that HED-FL can reach a comparable
accuracy with respect to the classic FL, but its energy consumption, in terms of communication energy, is about 20% of
the classic FL case.
Future works will focus on: studying how the aggregation method used in our approach affects the quality of the final
model when vertical and horizontal data partitioning are considered; the definition of other heuristics to dynamically
change the number of rounds at all the layers K1 , K2 , K3 , . . . , KY considering the resources of both selected edge nodes
and end devices; the definition of an optimization problem to find the optimal trade-off between energy saving and model
accuracy.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Data availability
Acknowledgments
This work has been partially supported by the COGITO (A COGnItive dynamic sysTem to allOw buildings to learn and
adapt) project, funded by the Italian Government (PON ARS01 00836) and by the CNR, Italy project ‘‘Industrial transition
and resilience of post-Covid19 Societies - Sub-project: Energy Efficient Cognitive Buildings’’.
Appendix A. Acronyms
Table of acronyms
AI Artificial Intelligence
ANN Artificial Neural Network
CA Cloud Aggregator
CL Centralized Learning
CS Centralized Server
D2D Device-to-Device
DL Deep Learning
DoL DownLink
EA Edge Aggregator
FedAvg Federated Averaging
FL Federated Learning
GD Gradient Descent
GPU Graphical Processing Unit
HED-FL Hierarchical Energy Efficient and Dynamic Model for Federated Learning
HEFL Hierarchical Edge Federated Learning
HFCL Hybrid Federated and Centralized Learning
HFL Hybrid Federated Learning
IID Independent and Identically Distributed
IoT Internet of Things
LAN Local Area Network
MBS Mobile Base Station
ML Machine Learning
MLLA Multi-Layer Learning Architecture
ReLU rectified linear unit
SGD Stochastic Gradient Descent
SMC Secure Multiparty Computation
UL UpLink
VGG Visual Geometry Group
WAN Wide Area Network
F. De Rango, A. Guerrieri, P. Raimondo et al. Pervasive and Mobile Computing 92 (2023) 101804
Appendix B. Symbols
Table of symbols
θ Model parameters
D Dataset
Dk Local dataset of the kth client
Fk(θ) Loss function over θ
Nk Number of nodes in the kth layer
Ki Number of communication rounds between Layeri−1 and Layeri
EFLT Total energy consumption for traditional FL
ETrainT Total energy consumption for the Training Phase in traditional FL
EComT Total energy consumption for Communication in traditional FL
EFLH Total energy consumption in HED-FL
ETrainH Total energy consumption for the Training Phase in HED-FL
ECommH Total energy consumption for Communication in HED-FL
ECommLi−Lk Energy consumption for Communication between Layeri and Layerk in HED-FL
P Number of hops to reach the cloud
S Size of the model (in MB)
D Download speed (in Mbps)
U Upload speed (in Mbps)
Pr Power of the router (in Watt)
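To make the communication-energy symbols above concrete, the following sketch computes the energy of a single model exchange from S, D, U, Pr, and P, assuming each traversed router draws Pr watts for the full transfer time. This per-hop model is an illustrative assumption, not necessarily the paper's exact formulation.

```python
def comm_energy_per_exchange(s_mb: float, d_mbps: float, u_mbps: float,
                             pr_watt: float, p_hops: int) -> float:
    """Energy (in joules) to upload and download one model of size S MB across
    P routers, each drawing Pr watts during the transfer (illustrative model)."""
    s_mbit = s_mb * 8  # model size in megabits
    transfer_s = s_mbit / u_mbps + s_mbit / d_mbps  # upload + download time (s)
    return p_hops * pr_watt * transfer_s

# Example: a 5 MB model, 100 Mbps down / 20 Mbps up, 10 W routers, 12 hops.
print(comm_energy_per_exchange(5, 100, 20, 10, 12))  # 288.0 (joules)
```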
References
[1] A. Souri, A. Hussien, M. Hoseyninezhad, M. Norouzi, A systematic review of IoT communication strategies for an efficient smart environment,
Trans. Emerg. Telecommun. Technol. (2019) e3736.
[2] D. Singh, E. Merdivan, S. Hanke, J. Kropf, M. Geist, A. Holzinger, Convolutional and recurrent neural networks for activity recognition in smart
environment, in: Towards Integrative Machine Learning and Knowledge Extraction, Springer, 2017, pp. 194–205.
[3] F. Cicirelli, A. Guerrieri, C. Mastroianni, G. Spezzano, A. Vinci, The Internet of Things for Smart Urban Ecosystems, Springer International
Publishing, 2019.
[4] E. Ahmed, I. Yaqoob, A. Gani, M. Imran, M. Guizani, Internet-of-things-based smart environments: state of the art, taxonomy, and open research
challenges, IEEE Wirel. Commun. 23 (5) (2016) 10–16.
[5] A. Guerrieri, V. Loscri, A. Rovella, G. Fortino, Management of Cyber Physical Objects in the Future Internet of Things, Springer International
Publishing, 2016.
[6] P.K.D. Pramanik, S. Pal, P. Choudhury, Beyond automation: the cognitive IoT. artificial intelligence brings sense to the Internet of Things, in:
Cognitive Computing for Big Data Systems over IoT, Springer, 2018, pp. 1–37.
[7] F. Cicirelli, A. Guerrieri, A. Mercuri, G. Spezzano, A. Vinci, ITEMa: A methodological approach for cognitive edge computing IoT ecosystems,
Future Gener. Comput. Syst. 92 (2019) 189–197.
[8] A.K. Sangaiah, J.S.A. Dhanaraj, P. Mohandas, A. Castiglione, Cognitive IoT system with intelligence techniques in sustainable computing
environment, Comput. Commun. 154 (2020) 347–360.
[9] F. Samie, L. Bauer, J. Henkel, From cloud down to things: An overview of machine learning in internet of things, IEEE Internet Things J. 6 (3)
(2019) 4921–4934.
[10] M. Shafique, T. Theocharides, C.-S. Bouganis, M.A. Hanif, F. Khalid, R. Hafız, S. Rehman, An overview of next-generation architectures for machine
learning: Roadmap, opportunities and challenges in the IoT era, in: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE),
IEEE, 2018, pp. 827–832.
[11] X. Qiu, T. Parcollet, D.J. Beutel, T. Topal, A. Mathur, N.D. Lane, A first look into the carbon footprint of federated learning, 2020, arXiv preprint
arXiv:2010.06537.
[12] R. Schwartz, J. Dodge, N.A. Smith, O. Etzioni, Green AI, Commun. ACM 63 (12) (2020) 54–63.
[13] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, H. Yu, Federated Learning, in: Synthesis Lectures on Artificial Intelligence and Machine Learning,
Vol. 13, No. 3, 2019, pp. 1–207.
[14] W. Shi, S. Dustdar, The promise of edge computing, Computer 49 (5) (2016) 78–81.
[15] F. De Rango, A. Guerrieri, P. Raimondo, G. Spezzano, A novel edge-based multi-layer hierarchical architecture for federated learning, in: 2021
IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and
Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), IEEE, 2021, pp. 221–225.
[16] A.F. Santamaria, F. De Rango, A. Serianni, P. Raimondo, A real IoT device deployment for e-Health applications under lightweight communication
protocols, activity classifier and edge data filtering, Comput. Commun. 128 (2018) 60–73.
[17] A.F. Santamaria, P. Raimondo, M. Tropea, F. De Rango, C. Aiello, An IoT surveillance system based on a decentralised architecture, Sensors 19
(6) (2019) 1469.
[18] B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in:
Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282.
[19] T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: Challenges, methods, and future directions, IEEE Signal Process. Mag. 37 (3) (2020)
50–60.
[20] Q. Wu, K. He, X. Chen, Personalized federated learning for intelligent IoT applications: A cloud-edge based framework, IEEE Open J. Comput.
Soc. 1 (2020) 35–44.
[21] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, Y. Zhou, A hybrid approach to privacy-preserving federated learning, in:
Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 2019, pp. 1–11.
[22] F.P.-C. Lin, S. Hosseinalipour, S.S. Azam, C.G. Brinton, N. Michelusi, Two timescale hybrid federated learning with cooperative D2D local model
aggregations, 2021, arXiv preprint arXiv:2103.10481.
[23] A.M. Elbir, S. Coleri, K.V. Mishra, Hybrid federated and centralized learning, 2020, arXiv preprint arXiv:2011.06892.
[24] L. Su, V.K. Lau, Hierarchical federated learning for hybrid data partitioning across multi-type sensors, IEEE Internet Things J. (2021).
[25] J. Yuan, M. Xu, X. Ma, A. Zhou, X. Liu, S. Wang, Hierarchical federated learning through LAN-WAN orchestration, 2020, arXiv preprint
arXiv:2010.11612.
[26] M.S.H. Abad, E. Ozfatura, D. Gunduz, O. Ercetin, Hierarchical federated learning across heterogeneous cellular networks, in: ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 8866–8870.
[27] L. Liu, J. Zhang, S. Song, K.B. Letaief, Client-edge-cloud hierarchical federated learning, in: ICC 2020-2020 IEEE International Conference on
Communications (ICC), IEEE, 2020, pp. 1–6.
[28] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, E. Choi, MorphNet: Fast & simple resource-constrained structure learning of deep
networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1586–1595.
[29] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[30] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[31] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, D. Ramage, Federated learning for mobile
keyboard prediction, 2018, arXiv preprint arXiv:1811.03604.
[32] P. Van Mieghem, G. Hooghiemstra, R. Van Der Hofstad, Stochastic model for the number of traversed routers in internet, in: Proceedings of
Passive and Active Measurement (PAM2001), Amsterdam, the Netherlands, 2001, pp. 23–24.
[33] F. Begtasevic, P. Van Mieghem, Measurements of the hopcount in internet, in: PAM2001, a Workshop on Passive and Active Measurements,
Amsterdam, the Netherlands, April 23-24, 2001, 2001.
[34] T. Li, A.K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks, in: Proceedings of the 1st
Adaptive & Multitask Learning Workshop, Long Beach, California, 2019.
[35] J. Wang, Q. Liu, H. Liang, G. Joshi, H.V. Poor, Tackling the objective inconsistency problem in heterogeneous federated optimization, in: 34th
Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020.
[36] B. Recht, R. Roelofs, L. Schmidt, V. Shankar, Do CIFAR-10 classifiers generalize to CIFAR-10? in: 34th Conference on Neural Information
Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020.
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning
Representations, 2015.