
Federated and continual learning for classification tasks in a society of devices

Fernando E. Casado^a,∗, Dylan Lema^a, Roberto Iglesias^a, Carlos V. Regueiro^b, Senén Barro^a

^a CiTIUS (Centro Singular de Investigación en Tecnoloxías Intelixentes), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain
^b CITIC, Computer Architecture Group, Universidade da Coruña, 15071 A Coruña, Spain

∗ Corresponding author. Email addresses: [email protected] (Fernando E. Casado), [email protected] (Dylan Lema), [email protected] (Roberto Iglesias), [email protected] (Carlos V. Regueiro), [email protected] (Senén Barro)

Abstract
Today we live in a context in which devices are increasingly interconnected, sensorized and almost ubiquitous. In recent years, deep learning has become a popular way to extract knowledge from the huge amount of data that these devices are able to collect. Nevertheless, centralized state-of-the-art learning methods have a number of drawbacks when facing real distributed problems, in which the available information is usually private, partial, biased and evolving over time. Federated learning is a popular framework that allows multiple distributed devices to train models remotely, collaboratively, and preserving data privacy. However, the current proposals in federated learning focus on deep architectures that in many cases are not feasible to implement on non-dedicated devices such as smartphones. Also, little research has been done regarding the scenario where the data distribution changes over time in unforeseen ways, causing what is known as concept drift. Therefore, in this work we present Light Federated and Continual Consensus (LFedCon2), a new federated and continual architecture that uses light, traditional learners. Our method allows resource-constrained devices (such as smartphones or robots) to learn in real time, locally, continuously, autonomously and from users, while also improving models globally, in the cloud, by combining what is learned locally on the devices. In order to test our proposal, we have applied it to a heterogeneous community of smartphone users to solve the problem of walking recognition. The results show the advantages that LFedCon2 provides with respect to other state-of-the-art methods.
Keywords: federated learning, continual learning, distributed learning, semi-supervised classification, cloud-based ensemble,
smartphones.

1. Introduction

Smartphones, tablets, wearables, robots and "things" from the Internet of Things (IoT) are already counted in millions and allow a growing and sophisticated number of applications related to absolutely all human domains: education, health, leisure, travel, banking, sport, social interaction, etc. If we also take into account factors such as the progressive sensorization and network connection of these devices, we can talk about an authentic "society of devices" that is being formed around people.

The volume of data generated by these agents is growing rapidly. Having such an exponentially growing amount of data collected under real working conditions from distributed environments, together with good intercommunication between devices [1], opens up a new world of opportunities. In particular, it will allow devices to incorporate models that evolve and adapt in order to perform better and better, thus benefiting the consumers.

In the context of distributed devices (smartphones, robots, etc.), applying traditional cloud-centric machine learning processes involves gathering data from all the devices in the cloud. This data, typically sensor measurements, photos, videos and location information, must be later uploaded and processed centrally in a cloud-based server or data center. Thereafter, the data is used to provide insights or produce effective inference models. With this approach, deep learning techniques have proven to be very effective in terms of model accuracy [2]. However, in this kind of scenario, cloud-centric learning is usually either ineffective or infeasible, since it involves facing the following challenges:

Challenge 1: Scalability. There are limitations both in storage and communication costs, as well as in computing speeds. Central storage and the transfer of huge amounts of data over the network might take an extremely long time and also incur an unbearable financial cost. Note also that communication may be a continuous overhead, as data from real environments is continuously being updated. This can be especially challenging in tasks involving unstructured data, e.g., in video analytics [3]. A cloud-centric approach can also imply long propagation delays and incur unacceptable latency for applications in which real-time decisions have to be made, e.g., in autonomous driving systems [3, 4]. Similarly, central computing can take much more time than parallel processing of smaller parts of data.
Challenge 2: Data privacy and sensitivity. Many popular applications deal with sensitive data, that is, any information about a user that circulates on the Internet and allows him/her to be identified: for example, the ID card, telephone number or address, but also pictures, videos, browsing history or geolocation. The central collection of such data is not desirable, as it puts people's privacy at risk. In recent years, governments have been implementing data privacy legislation in order to protect the consumer. Examples of this are the European Commission's General Data Protection Regulation (GDPR) [5] or the Consumer Privacy Bill of Rights in the US [6]. In particular, the consent (GDPR Art. 6) and data minimisation principles (GDPR Art. 5) limit data collection and storage to only what is consented to by the consumer and absolutely necessary for processing.

These limitations have led in recent years to the emergence of new learning paradigms that bring processing and analysis capacity closer to the devices themselves, following the philosophy of edge or fog computing [7, 8]. A good example of this is the recent federated learning (FL) [9, 10], which allows the training of a model across multiple decentralized edge devices holding local data samples, without exchanging them. Nevertheless, there is still plenty of room for improvement in this area. For example, most work on FL is based on deep neural network architectures, but deep learning usually requires huge amounts of data and high computational capabilities (not available on edge devices). Also, there is a number of practical concerns that arise when running federated learning in production. In particular, the problem of nonstationarity and concept drift (i.e., when the underlying data distribution changes over time) is perhaps one of the most important ones, and it has not yet been addressed in the FL literature.

Therefore, in the present work, we propose a new FL algorithm, Light Federated and Continual Consensus (LFedCon2). Our method allows resource-constrained devices (such as smartphones, wearables, or robots) to learn in real time, locally, continuously, autonomously and from users, while also improving models globally, in the cloud, by combining what is learned locally on the devices. Basically, it consists of training light, weak, local learners on the devices themselves, which are later consensuated globally, in the cloud, using ensemble techniques. This global model is then returned to the devices so that it speeds up the local learning and helps to quickly improve device behaviour, at the same time as the devices undergo a new local adaptation process. Our algorithm is also able to deal with continuous single-task problems where the underlying distribution of data might be non-IID (independent and identically distributed) among the different client devices, but may also change in unforeseen ways over time (concept drift). As we will describe later, there are important benefits as a consequence of the explicit detection of concept drifts. In short, we can say that it helps us to answer two key questions: what needs to be learned and when it should be learned. In this work, we will show the performance of our approach when it is used to solve the specific semi-supervised classification task of walking recognition on smartphones.

The rest of the paper is structured as follows: Section 2 provides a review of the state of the art. In Section 3, LFedCon2 is presented. Section 4 explains in detail the learning carried out on the devices. Section 5 exposes the details of the consensus performed in the cloud. Section 6 presents the experimental results in walking recognition. Finally, some conclusions are presented in Section 7.

2. Related work

As we have already mentioned in the previous section, applying cloud-centric learning in networks of smart and distributed agents, such as smartphones, robots or IoT devices, involves a number of issues. We are referring to the two challenges (scalability and data privacy) mentioned above. In this section, we review some of the strategies that are being applied in this context of devices in order to address these challenges: cloud-centric variations, distributed learning, and federated learning.

2.1. Cloud-centric variations

In the last decade, some cloud-centric variations have emerged which suggest that the learning process can take place in the cloud, but the learned model can then be transferred to the devices, which is where it is executed [11, 12]. Other proposals suggest that there may even be a final adjustment of the model at the local level. The latter would fall within the transfer learning paradigm [13], which focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. In this case, a global deep learning model obtained in the cloud would be a general solution that could then be adapted to each device locally. Other hybrid solutions using local and cloud computing have also been explored. For instance, in [14], the pre-training of a deep neural network (DNN) is carried out on the smartphone and the subsequent supervised training is performed in the cloud.

However, any of these strategies will still involve moving a significant volume of potentially sensitive data. Therefore, a better option for learning in this kind of scenario, where data is naturally distributed, seems to be a decentralized approach.

2.2. Distributed learning

Distributed learning [15, 16, 17] is not something new. In contrast to the cloud-centric approach, in this kind of algorithm the learning process can be carried out in a distributed and parallel manner along a network of interconnected devices, a.k.a. clients or agents. These client devices are able to collect, store, and process their own data locally. The learning process can be performed with or without explicit knowledge of the other agents. Once each device performs its local learning, a global integration stage is typically carried out in the cloud, so that a global model is agreed upon. Allocating the learning process among the network of devices is a natural way of scaling up learning algorithms in terms of storage, communication and computational cost. Furthermore, it makes it easier to protect the privacy of the users, since sharing raw data with the cloud or with other participants can be avoided.

Distributed machine learning algorithms can be roughly classified into two different groups: (1) distributed optimization algorithms, and (2) ensemble methods.
2.2.1. Distributed optimization algorithms

These methods mainly focus on how to train a global model more efficiently, by managing a large scale of devices simultaneously and making proper use of their hardware in a distributed and parallel way. For that purpose, there already exist many powerful optimization algorithms to solve the local sub-problems, which are developed from Gradient Descent (GD). The best known one is Stochastic Gradient Descent (SGD), which greatly reduces the computation cost in each iteration compared with normal gradient descent [18, 19]. There are other proposals that rely on other optimization methods, such as those that use augmented Lagrangian methods, the Alternating Direction Method of Multipliers (ADMM) being the most popular one [20, 21]. Another example is Newton-like techniques, such as Distributed Approximate NEwton (DANE) [22] or Distributed Self-Concordant Optimization (DiSCO) [23].

The main drawback of this kind of algorithm is that it requires the devices to share a common representation of the model and feature set. Moreover, most of the proposals that fall into this category are not usually intended to be applied in the context of smartphones and IoT devices, as they typically involve high computing costs in each training step. Thus, they are not suitable for working on this kind of device, as they might have a negative effect on the user's daily usage experience (high battery consumption, overheating, etc.).

2.2.2. Ensemble methods

Ensemble learning [24] improves the predictive performance of a single model by training multiple models and combining their predictions (e.g., using a voting system) or even the classifiers themselves (e.g., generating a definitive model from different local sets of rules) [25, 26, 27, 28]. Thus, the ensemble approach is almost directly applicable to a distributed environment, since a local classifier can be trained at each device and then the classifiers can eventually be aggregated using some ensemble strategy. Some examples of the existing proposals that follow this approach are Effective Stacking [25], Knowledge Probing [27] or Distributed Boosting [29].

2.3. Federated learning

In recent years, a new learning framework called federated learning (FL) has been boosted by Google [30, 10, 31, 9]. Its core idea is very similar to distributed learning, i.e., solving local problems on the devices and aggregating updates on a server without uploading the original user data. In fact, we could even contemplate state-of-the-art federated methods as distributed optimization algorithms (the first group of distributed methods mentioned above). However, FL has been considered in the literature as an independent paradigm, as it has some key differences with more traditional distributed learning. Firstly, the main objective of federated learning is to ensure the privacy of the data and the users, so sharing sensitive information is not an option in FL. Besides, both approaches make different assumptions on the properties of the local datasets: distributed learning mainly focuses on parallelizing computing power, whereas federated learning, essentially designed to work on mobile device networks, aims at training on heterogeneous datasets [30]. A common underlying assumption in distributed learning is that the local datasets are identically distributed and roughly have the same size. Neither of these hypotheses is made for federated learning. Instead, the datasets are typically heterogeneous and their sizes may span several orders of magnitude.

Federated learning seems, at present, the most suitable approach for this context of multi-device learning. Nevertheless, there are still some constraints in the state-of-the-art methods that need to be addressed. In this work, we would like to highlight two of them: (1) the non-viability of deep architectures, and (2) the lack of methods to deal with nonstationarity of data. Following the list of challenges initiated in Section 1, we have:

Challenge 3: Deployment of deep federated learning. As far as we are aware, all the existing FL algorithms are based on deep neural network architectures. This is a significant constraint, since the type of learner to be used is often conditioned by the task to be solved, and deep neural networks are not always the best solution. In addition, the use of DNNs usually demands high amounts of data and high computing capabilities. To date, client devices, such as smartphones, do not have the appropriate hardware and software to carry out this computing. The most popular deep learning frameworks and libraries, such as TensorFlow [32] or PyTorch [33], still do not have any available version for operating systems such as Android or iOS that allows the training of these models. In terms of hardware, although much progress has been made in recent decades, these kinds of devices still cannot compete with dedicated computers capable of getting the most out of their GPUs. This is a major impediment to the introduction of FL-based solutions in the real world. Nevertheless, in many cases it could be solved using traditional learning methods, such as decision trees or support vector machines, which have lightweight implementations in multiple programming languages and for multiple operating systems.

Challenge 4: Nonstationarity. Most machine learning algorithms assume that data is stationary and IID, i.e., independent and identically distributed. However, this does not hold in many real-world applications, where the underlying distribution of the data streams changes over time and between the different devices. Some authors have evaluated FL algorithms on non-IID scenarios, analyzing the impact of non-identical client data partitions [10, 34, 35]. Nevertheless, little research has been done studying the effect of non-stationary data streams over time. This phenomenon is known in the literature as concept drift [36]. If concept drift occurs, the pattern induced from past data may not be relevant to the new data, leading to poor predictions or decision outcomes. Yoon et al. [37] propose a method for federated continual learning with Adaptive Parameter Communication, Fed-APC, which additively decomposes the network weights into global shared parameters and sparse task-specific parameters. They validate their approach in several settings, including multi-incremental-task continual scenarios.
However, to the best of our knowledge, no other researchers working on federated learning have addressed concept drift detection and adaptation in the underlying distribution of data.

The new federated method we propose in this paper, LFedCon2, is intended to overcome these challenges. On the one hand, it is designed to work with weak traditional learners, e.g. decision trees, instead of DNNs. On the other hand, our algorithm is also able to deal with concept drift in continuous single-task problems. Moreover, the proposed architecture is asynchronous and contemplates semi-supervised learning, two additional aspects that make it more suitable for deployment in real-world scenarios.

3. Light Federated and Continual Consensus: an overview

Figure 1 shows a high-level diagram of our proposal. As we can see, LFedCon2 has a cyclical architecture. In the figure there are smartphones, but we could think of any other set of agents, either homogeneous or heterogeneous, including tablets, wearables, robots, or IoT devices. Each client device is able to perceive its environment through its sensors and is connected to the cloud.

Figure 1: Diagram of the proposed architecture. Each device learns and readjusts a local model from its own data stream; the local models are combined in the cloud into a global model, which is then shared back with all the devices.

Iteratively, each of the devices creates and refines its own local model of the learning problem that is intended to be solved. For that, devices are continuously acquiring and storing new information through their sensors. This information is raw data, which must be locally preprocessed before being able to use it in a learning stage: noise detection, data transformation, feature extraction, data normalization, instance selection, etc. When local models are obtained, they are sent to the cloud, where a new learning stage is performed to join the local knowledge, thus obtaining a global model. The global model is then shared with all the devices in the network. Each device can take advantage of that global model to improve the local one, thus starting a new cycle. New data is continuously recorded in each device and will also be used to retrain better local models or refine existing ones. Of course, an improvement of the local models will result in an improvement at the global level too. Note that the information available at the local level will increase progressively. As it is unrealistic to assume that infinite storage is available, a compromise solution must be reached that retains previously learned knowledge that is still relevant while old information is replaced with new knowledge, avoiding catastrophic forgetting.

In this work we have focused on semi-supervised scenarios, where the data in the local devices is partially labeled. Nevertheless, the application of LFedCon2 to fully supervised problems is immediate. In the following sections we explain the details of our method at both learning levels, local (Section 4) and global (Section 5). However, it should be borne in mind that this kind of architecture would allow the integration of methods other than those we propose below. Indeed, as we will see later, we believe that there is room for improvement at both levels. The sketch after this paragraph condenses one turn of the cycle just described.
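As a rough illustration of the cyclical architecture, the following pseudo-Python compresses one turn of the loop into sequential steps. All names (Device, Cloud and their methods) are illustrative assumptions, not an API defined by the paper, and the real architecture is asynchronous rather than a synchronous loop.

```python
# Minimal sketch of one LFedCon2 cycle (illustrative names throughout).
def lfedcon2_cycle(devices, cloud):
    for d in devices:                        # in reality, clients act asynchronously
        d.preprocess()                       # noise detection, features, normalization...
        d.update_local_model()               # learn/readjust the local model
        cloud.receive(d.id, d.local_model)   # report local changes to the cloud
    global_model = cloud.consensus()         # join local knowledge into a global model
    for d in devices:
        d.global_model = global_model        # shared back; used to label new local data
```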
4. Local learning

Figure 2 shows the continuous work flow for each device. Basically, devices gather data from the environment. This data, conveniently preprocessed, is used to build or update a local model. The preprocessing of the data refers to feature extraction, normalization, instance selection, etc. As we deal with a semi-supervised task, this data will be partially annotated.

Figure 2: Work flow on a local device. After preprocessing, unlabeled data is labeled with the global model if one is available; once enough balanced data has been stored, the first local model is trained; from then on, drift detection and drift adaptation keep the local model up to date, and the global model is updated in the cloud.

In order to learn the local model, in this work we have opted for the use of ensemble methods.
These methods have received much attention in stream data mining and large-scale learning. In particular, every device builds an ensemble of base classifiers. Any algorithm that provides posterior probabilities for its predictions can be used as a base classifier. In our experimental results (Section 6), we tried different methods as base classifiers: Naïve Bayes, Generalized Linear Models (GLM), C5.0 Decision Trees, Support Vector Machines (SVM), Random Forests and Stochastic Gradient Boosting (SGB). In the same way, any state-of-the-art method could be used to combine the predictions of the base classifiers of the local ensemble. We opted for a simple but effective approach, employing decision rules, which effectively combine the a posteriori class probabilities given by each classifier. Rule-based ensembles have received much attention because of their simplicity and because they do not require training [38, 39, 40]. When the base classifiers operate in the same measure space, as is the case here, averaging the different posterior estimates of each base classifier reduces the estimation noise, and therefore improves the decision [40]. Thus, we should use a rule that averages the posterior estimates of the base classifiers. The sum rule could be used, but if any of the classifiers outputs an a posteriori probability for some class which is an outlier, it will affect the average and this could lead to an incorrect decision. It is well known that a robust estimate of the mean is the median. Thus, we use the median rule. The median rule predicts that an instance x belongs to class c_j if the following condition is fulfilled:

median{y_1j(x), ..., y_Nj(x)} = max_{k=1,...,C} median{y_1k(x), ..., y_Nk(x)},   (1)

where C is the number of possible classes (c_1, c_2, ..., c_C), N is the number of base classifiers and y_i = {y_i1(x), ..., y_iC(x)} is the output of the i-th classifier, for i = 1, ..., N.
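To make Equation (1) concrete, here is a minimal sketch of the median rule in Python. It is not the authors' code; it only assumes that each base classifier outputs a vector of posterior probabilities over the C classes:

```python
import numpy as np

def median_rule(posteriors):
    """posteriors: array of shape (N, C), the class posteriors predicted
    by each of the N base classifiers for a single instance x."""
    per_class_median = np.median(posteriors, axis=0)  # median over classifiers
    return int(np.argmax(per_class_median))           # index j of the winning class c_j

# Example with N = 3 classifiers and C = 2 classes (walking / not walking):
y = np.array([[0.9, 0.1],
              [0.2, 0.8],   # an outlying vote barely shifts the medians
              [0.7, 0.3]])
print(median_rule(y))  # -> 0 (per-class medians are 0.7 vs 0.3)
```

Unlike the sum rule, a single outlying posterior (the second row above) cannot drag the combined decision, which is precisely the motivation for using the median.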
As we can see in Figure 2, our local device gathers data and, if there is a global model available, it uses it to annotate the unlabeled data (semi-supervised transduction). Otherwise, it keeps collecting data until it has enough examples to perform the training of the first local model. A common question in machine learning is how much data is necessary to train a model. Unfortunately, the answer is not simple. In fact, it will depend on many factors, such as the complexity of the problem and the complexity of the learning algorithm. Significant research efforts have been made to find the relationship between classification error, learning sample size, dimensionality and complexity of the classification algorithm [41, 42, 43]. Statistical heuristic methods have frequently been used to calculate a suitable sample size, typically based on the number of classes, the number of input features or the number of model parameters. To ensure a minimally balanced training set with representation of all classes, in this work we establish a single heuristic rule that must be fulfilled in order to allow the training of a local model. We define a minimum amount of labeled data, L, such that there must be at least L/(2C) examples from each class in the training set to allow the training process, where C is the number of possible classes. The first base model of the local ensemble (Equation (1)) can be trained as soon as this rule is met. In our experiments we have used L = 2∆, where ∆ is a parameter used in the drift detection algorithm that we will expose in Section 4.1.1. We define L in terms of ∆ as both parameters are strongly related. For instance, with ∆ = 100 (so L = 200) and a binary task (C = 2), training is allowed once there are at least 200/4 = 50 labeled examples of each class.

As we pointed out before, the local device collects data which is partially annotated. In fact, it will be common to have a much greater number of unlabeled than labeled examples. In order to deal with this, we perform what is called semi-supervised transduction. To understand how it works, we must remember that in LFedCon2 all devices receive the last global model achieved by consensus in the cloud (Figure 1). To take advantage of that globally agreed knowledge, our proposal uses the global model for labeling data that has no label. As soon as a new unlabeled instance is available, we use the latest global model, if any, to predict a possible label and, then, we filter the predictions based on their degree of confidence. We define the confidence of a prediction as the classifier conditional posterior probability [44], i.e., the probability P(c_i|x) of a class c_i, from one of the C possible classes c_1, c_2, ..., c_C, being the correct class for an instance x. It is a normalized value between 0 and 1. We accept the predicted label as the real label of the example when its confidence is equal to or greater than a threshold γ, whose optimal value we have empirically set at γ = 0.9. Low thresholds (γ < 0.8) may introduce a lot of noise in the training set, while very high thresholds (γ ≥ 0.95) may allow very few examples to be added to the labeled set. Therefore, we consider that γ = 0.9 is an adequate value.
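The transduction step just described amounts to a simple confidence filter. A minimal sketch, assuming a model that exposes a predict_proba method returning the C posteriors (the name is an illustrative assumption, not the paper's API):

```python
GAMMA = 0.9  # confidence threshold empirically chosen by the authors

def transduce(global_model, x):
    """Return a pseudo-label for the unlabeled instance x, or None if the
    global model is not confident enough to label it."""
    probs = global_model.predict_proba(x)   # posterior P(c_i | x) for each class
    c_i = int(probs.argmax())
    return c_i if float(probs[c_i]) >= GAMMA else None
```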
Finally, once the device has a global model and a local model, the remaining question is when this local model should be updated using the new data that is being collected. It makes no sense to update the local model if it is performing well. However, as we said before, data is usually non-stationary, i.e., its distribution evolves in time in an unpredictable way. This is what is usually called concept drift. If a concept drift happens, the model will lower its performance. Thus, as we will show in Section 4.1, we will update the local model when a concept drift is detected.

4.1. Learning under concept drift

Concept drift is a continual learning issue [45]. Formally, we can define it as follows: given a time period [0, t] and a set of samples, denoted as S_{0,t} = {s_0, ..., s_t}, where s_i = (x_i, y_i) is one observation (or data instance), x_i is the feature vector, y_i is the label, and S_{0,t} has a certain joint probability distribution of x and y, P_t(x, y), concept drift can be defined as a change in the joint probability at timestamp t, such that ∃t : P_t(x, y) ≠ P_{t+1}(x, y).

The aforementioned is the definition of conventional concept drift. Nevertheless, in a federated learning setting, with multiple clients and a global server, we must adapt it a little. Now, the goal is to train a global model in a distributed and parallel way using the local data of the D available clients. Each client will have a different bias because of the conditions of its local environment and, likewise, its data stream may change in different ways over time. Therefore, there may occur concept drifts that affect all clients, some of them, or just a single one.
Thus, we can generalize the problem in the following way: given a time period [0, t], a set of clients {d_1, ..., d_D}, and a set of local samples for each client, denoted as S^k_{0,t} = {s^k_0, ..., s^k_t}, where s^k_i = (x^k_i, y^k_i) is one data instance from client d_k, x^k_i is the feature vector and y^k_i is the label, each local dataset S^k_{0,t} has a certain joint probability distribution P^k_t(x, y). A local concept drift occurs at timestamp t for client d_k if ∃t : P^k_t(x, y) ≠ P^k_{t+1}(x, y).

However, note that a local concept drift does not necessarily have a direct impact on the global federated model. It may be the case that a local drift on device d_k will result in a change in the distribution of d_k, but not in the joint distribution of all clients, P^G_t(x, y). In that case, the federated model will not be affected by the local change and it can be ignored. Thus, we can define a global concept drift as a distribution change at time t in one or more clients {d_k, ..., d_l} ⊆ {d_1, ..., d_D} such that ∃t : P^G_t(x, y) ≠ P^G_{t+1}(x, y). If a global concept drift happens, the model might lower its performance, so we need to detect these global changes and adapt to them.

Research on learning under concept drift is typically classified into three components [46]: (1) drift detection (whether or not drift occurs), (2) drift understanding (when, how and where it occurs) and (3) drift adaptation (reaction to the existence of drift). In this work, we have focused on drift detection and adaptation.

4.1.1. Drift detection

Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift by identifying change points or change intervals in the underlying data distribution. In our proposal, if a concept drift is identified, it means that the local model is no longer a good abstraction of the knowledge of the device, so it must be updated. Drift detection algorithms are typically classified into three categories [46]: (1) error rate-based methods, (2) data distribution-based methods and (3) multiple hypothesis test methods. The first class of algorithms focuses on tracking changes in the online error rate of a reference classifier (in our case, it would be the latest local model). The second type uses a distance function or metric to quantify the dissimilarity between the distribution of historical data and the new data. The third group basically combines techniques from the two previous categories in different ways. Error rate-based methods operate only on truly labeled data, because they need the labels to estimate the error rate of the classifier. Therefore, in order to take advantage of both labeled and unlabeled data, in this work we decided to use a data distribution-based algorithm, which does not present such a restriction. Thus, we have used an adapted version of the change detection technique (CDT) originally proposed in [47], which is a CUSUM-type CDT on the beta distribution [48]. Algorithm 1 outlines the proposed method.

Algorithm 1: Change detection algorithm.
 1  Input: W, N, x, T_h, α, ∆, N_max
 2  Output: W, N
 3  [ŷ, ς_x] ← classify(x)
 4  w_new ← [x, ς_x]
 5  W ← W ∪ w_new                          // Add the pattern and its confidence into W
 6  N ← N + 1
 7  if N ≥ N_max then
 8      w_0 ← ∅                            // Remove the oldest element in W
 9      N ← N − 1
10  end
11  s_f ← 0
12  r ← random(0, 1)                       // Generate a random number in the interval [0,1]
13  if e^(−2ς_x) ≥ r then
14      for k ← ∆ to N − ∆ do
15          m_b ← mean(ς_1 : ς_k ∈ W)
16          m_a ← mean(ς_{k+1} : ς_N ∈ W)
17          if m_a ≤ (1 − α) · m_b then
18              s_k ← 0
19              [α̂_b, β̂_b] ← estimateParams(ς_1 : ς_k ∈ W)
20              [α̂_a, β̂_a] ← estimateParams(ς_{k+1} : ς_N ∈ W)
21              for i ← k + 1 to N do
22                  s_k ← s_k + log( f(ς_i ∈ w_i | α̂_a, β̂_a) / f(ς_i ∈ w_i | α̂_b, β̂_b) )
23              end
24              if s_k > s_f then
25                  s_f ← s_k
26              end
27          end
28      end
29      if s_f > T_h then
30          driftAdaptation(W)             // See details in Section 4.1.2
31          W ← {∅}                        // Reinitialize the sliding window
32          N ← 0
33      end
34  end

As soon as a new instance x is available, Algorithm 1 is invoked. First of all, the confidence of the current local classifier on that instance, ς_x, is estimated (line 3 in the pseudocode) and both instance and confidence are stored in a sliding window W of length N (lines 4 to 6). Note that W = {w_1, w_2, ..., w_N} is a history of tuples of the form "[instance, confidence]", so that W = [X, Σ], where X = {x_1, x_2, ..., x_N} is the history of instances and Σ = {ς_1, ς_2, ..., ς_N} is the associated confidence vector. We will try to detect changes in the confidence of the predictions provided by the current local model for the patterns in W. In the original method proposed in [47], the authors do not use a sliding window but a dynamic window, which is reinitialized every time a drift is identified. However, they do not establish any limit on the size of this dynamic window, which is not very realistic, since its size could grow to infinity if no drift is detected. Therefore, in our proposal, we use a sliding window and we set a maximum size N_max = 20∆, where ∆ is a strongly related parameter used in the core of the CDT algorithm, as we will explain below. Once this maximum size is reached, adding a new element to W implies deleting the oldest one (lines 7 to 9).

The core of this CDT algorithm (lines 14 to 32) can be a bottleneck in our system if we have to execute it after inserting each confidence value into W. Therefore, as shown in Figure 2, CDT will be invoked only if W contains a representative number of instances X ∈ W, i.e., there are at least L/(2C) labeled instances for each class.
However, once this condition is met, CDT would be invoked for every new recorded sample. Therefore, we restrict the number of executions, so that the core of Algorithm 1 will be executed with a probability of e^(−2ς_x) (line 13 in the pseudocode). Hence, the higher the confidence, the lower the probability of executing CDT, and vice versa.

In the core of the CDT algorithm, W is divided into two sub-windows for every pattern k between ∆ and N − ∆ (lines 14 to 16 in Algorithm 1). Let W_a and W_b be the two sub-windows, where W_a contains the most recent instances and their confidences. Each sub-window is required to contain at least ∆ examples to preserve the statistical properties of a distribution. When a concept drift occurs, confidence scores are expected to decrease. Therefore, only changes in the negative direction need to be detected. In other words, if m_a and m_b are the mean values of the confidences in W_a and W_b respectively, a change point is searched for only if m_a ≤ (1 − α) × m_b, where α is the sensitivity to change (line 17). As in [47], we use α = 0.05 and ∆ = 100 in our experiments, values which are also widely used in the literature.

We can model the confidence values in each sub-window, W_a and W_b, as two different beta distributions. However, the actual parameters of each one are unknown. The proposed CDT algorithm estimates these parameters at lines 19 and 20. Next, the sum of the log-likelihood ratios s_k is calculated in the inner loop between lines 21 and 23, where f(ς_i ∈ w_i | α̂, β̂) is the probability density function (PDF) of the beta distribution with estimated parameters (α̂, β̂), applied to the confidence ς_i of the tuple w_i = [x_i, ς_i] ∈ W. This PDF describes the relative likelihood for a random variable, in this case the confidence ς, to take on a given value, and it is defined as:

f(ς | α, β) = ς^(α−1) (1 − ς)^(β−1) / B(α, β)   if 0 < ς < 1,  and 0 otherwise,   (2)

where

B(α, β) = ∫₀¹ ς^(α−1) (1 − ς)^(β−1) dς.   (3)

The variable s_k is a dissimilarity score for each iteration k of the outer loop between lines 14 and 28. The larger the difference between the PDFs in W_a and W_b, the higher the value of s_k (line 22). Let k_max be the value of k for which the algorithm calculated the maximum s_k value, where ∆ ≤ k ≤ N − ∆. Finally, a change is detected at point k_max if s_{k_max} ≡ s_f is greater than a prefixed threshold T_h (line 29). As in the original work, we use T_h = −log(α). In case a drift is detected, a drift adaptation strategy is applied (line 30). We will discuss this strategy in detail in Section 4.1.2. Moreover, the sliding window W is reinitialized (lines 31 and 32).
reinitialized (lines 31 and 32). given training set. However, it presents a great limitation, which
is the need to have this meta-dataset available in advance. In
4.1.2. Drift adaptation the context of devices, this would involve gathering a certain
Once a drift is detected, the local classifier should be updated amount of labeled data from all users in the cloud before being
according to that drift. There exist three main groups of drift able to create a global model. Therefore, a better solution is
adaptation (or reaction) methods [46]: (1) simple retraining, (2) to use a simpler but equally effective rule-based ensemble, as
ensemble retraining and (3) model adjusting. The first strategy we already do at the local level (Section 4). In this case, the
is to simply retrain a new model combining in some way the optimal combining rule is the product rule, because each local
latest data and the historical data to replace the obsolete model. classifier operates in a different measure space [39]: different
7
environments, sensors, users, etc. The product rule predicts that with a significance level of 0.05 is performed for each pair of
an instance x belongs to class c j if the following condition is models ci , c j so that:
met: 
N N
C 1
 if ci is significantly better than c j ,
∏ yi j (x) = max
k=1
∏ yik (x), (4)
t(ci , c j ) = −1 if c j is significantly better than ci , (5)
i=1 i=1 
0 otherwise.

where C is the number of possible classes (c1 , c2 , . . . , cC ), N
is the number of base classifiers and yi = {yi1 (x), . . . , yiC (x)}
Then, for each model we calculate its overall significance index:
is the output of the i-th classifier, for i = 1, . . . , N. Note that
the outputs of this global ensemble are the estimated posterior N
probabilities that we use on each local device as a confidence S(ci ) = ∑ t(ci , c j ). (6)
measure to decide which unlabeled patterns can and cannot be j=1
labeled.
Finally, we select the new Mg models with the highest signif-
Although the time complexity on the server side is linear,
icance index or score, S. If there are ties, we break them by
O(N), including all local models in the ensemble is not the best
selecting the most accurate ones (we compute the mean accu-
option for several reasons. First, depending on the problem,
racy from the q available evaluations). Figure 3 summarizes
there could be thousands or millions of connected devices, so
the whole process, that we can call Distributed Effective Voting
using an ensemble of those dimensions could be computation-
(DEV).
ally very expensive. As the global model is sent back to local
In our experiments we just had 10 different devices, so we
devices, it would also have a negative impact on the bandwidth
tried several values for Mg , from 3 to 10. However, the size
and computational requirements of the nodes. In addition, as-
of ensemble usually ranges from 5 to 30 models and it will be
suming that there is an optimal global abstraction of knowl-
strongly dependent on the problem. The p and q sizes depend
edge, not all the local models will reflect such knowledge in
both on the size of the ensemble (Mg ) and the total number of
equal measure or bring the same wealth to the ensemble. On
devices available online (N), so that Mg ≤ q ≤ p ≤ N. For sim-
the contrary, there will be devices that, accidentally or inten-
plicity, in our experiments we always used p = q = Mg .
tionally, may be creating local models totally separated from
reality, which should be detected in time so as not to participate
in the ensemble. 6. Experimental results
Because of all of this, we propose to keep a selection of the
Mg best local models to participate in the global ensemble. In The aim of this section is to evaluate the performance of
this way, we can know a priori the computational and storage LFedCon2. In particular, we are interested in checking to what
resources we will need more easily. When a new local model extent the global models obtained from the consensus of the
is received on the cloud, if the device owner of that model is different local devices are capable of solving distributed and
already present in the global ensemble, we can simply update semi-supervised classification tasks. In addition, we also want
the global ensemble with the new version of the local model. to evaluate the evolution of these models over the time. In our
Otherwise, it must compete against the Mg models that are cur- experiments, we have used several base classifiers at the local
rently in the global ensemble. The server will keep a score level so as not to link the results and the architecture to any par-
representing the relevance of each of the Mg models and will ticular method (although, as we will see below, some of them
compute that score for each new incoming model. For the com- are clearly suboptimal according to the strategy we propose).
putation of the scores we base on the Effective Voting (EV) The task we have chosen to conduct our experiments is the
technique [51]. EV propose to perform 10-fold cross validation detection of the walking activity on smartphones. It is relatively
for the evaluation of each model and then apply a paired t-test easy to detect the walking activity and even count the steps
for each pair of models to evaluate the statistical significance when a person walks ideally with the mobile in the palm of
of their relative performance. Finally, the most significant ones his/her hand, facing upwards and without moving it too much.
are selected. However, the situation gets much worse in real life, when the
In our context, a cross-validation is not a fair way to evaluate orientation of the mobile with respect to the body, as well as its
each local model due to the skewness and bias inherent in the location (hand, bag, pocket, ear, etc.), may change constantly
distributed paradigm. Thus, when a new local model arrives, as the person moves [52, 53]. Figure 4 shows the complexity of
the server choose p different local devices, randomly selected, this problem with a real example. In this figure we can see the
and ask them to evaluate that classifier on their respective local norm of the acceleration experienced by a mobile phone while
training sets. Once this evaluation is done, each device sends its owner is doing two different activities. The person and the
back to the cloud its accuracy. Assuming that not all the p se- device are the same in both cases. In Figure 4a, we can see the
lected devices are necessarily available to handle the request, signal obtained when the person is walking with the mobile in
the server waits until it has received q performance measures the pocket. Figure 4b shows a very similar acceleration signal
for that model, being q ≤ p. This process could be considered experienced by the mobile, but in this case when the user is
a distributed cross-validation. After gathering the q measure- standing still with the mobile in his hand, without walking, but
ments for the current Mg models and the new one, a paired t-test gesticulating with the arms in a natural way.
8
dev. j

2 dev. k
Ask p random

li
de
devices to validate

i,j
mo
that model

cy
ra
cu
ac
1
Send new model i
local model dev. l

local model i accuracy i,l


device i
mod
el i
4
Obtain significance dev. m
indices and update

m
od
the ranking

el
ac

i
cu
ra
cy
i,n
3
Wait for q
dev. n
responses
(accuracies)

Figure 3: Work flow of the Distributed Effective Voting.

Figure 4: Norm of the acceleration (m/s²) over time (centiseconds) experienced by a mobile phone when its owner is (a) walking, and (b) not walking, but gesticulating with the mobile in his/her hand.

In order to evaluate the performance of our method and the evolution of the models over time, we require an independent test set that is representative enough of the problem. Thus, we started by building a fully labeled test dataset, for which 77 different people participated. We asked the volunteers to perform different walking and non-walking activities under several conditions. Since manual data labeling would be very time-consuming, volunteers carried a set of sensorized devices on their legs (besides their own mobile), tied with sports armbands, in order to automatically get a ground truth [52, 53]. Several features (21) from the time and frequency domains were extracted from the 9-dimensional time series composed of the 3 axes of each of the 3 inertial sensors of the smartphone: accelerometer, gyroscope and magnetometer. This fully labeled dataset is composed of 7917 instances when dividing the data into time windows of 2.5 seconds.
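As an illustration of this windowing and feature extraction step, the sketch below computes a few generic time- and frequency-domain features per sensor channel. The exact 21 features are not listed in this excerpt, so the features shown here are assumptions for illustration only.

```python
import numpy as np

def extract_features(window):
    """window: array of shape (T, 9), one 2.5 s window of the 3 axes of
    accelerometer, gyroscope and magnetometer. Returns a feature vector."""
    feats = []
    for channel in window.T:                       # per-channel statistics
        spectrum = np.abs(np.fft.rfft(channel))
        feats += [channel.mean(),                  # time domain: mean
                  channel.std(),                   # time domain: deviation
                  int(spectrum[1:].argmax()) + 1]  # dominant non-DC frequency bin
    return np.array(feats)
```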
The ultimate goal of this set is to help us evaluate the performance of the models we obtain applying LFedCon2. Nevertheless, we decided to apply the traditional learning approach first, in order to have a baseline. Remember that in the context of distributed devices it is not always feasible to apply traditional methods. However, for this task it is, because we were able to obtain a sufficient amount of well-labeled data from a high number of users. This is one of the main reasons why we chose this problem to test our proposal. Thus, we applied some of the most popular and widely known supervised classification methods on this dataset:

• a simple Naïve Bayes (NB),
• a Generalized Linear Model (GLM),
• a C5.0 Decision Tree (C5.0),
• a Support Vector Machine with radial basis function kernel (RBF SVM),
• a Random Forest (RF),
• a Stochastic Gradient Boosting (SGB).

Table 1 summarizes the most relevant results we achieved.
Optimal hyperparameters for each of these traditional classifiers were estimated applying 10-fold cross-validation on the dataset. We also tried several Convolutional Neural Network (CNN) architectures [52]. Table 1 shows the results provided by the best one. Note that in this last case, as the size of the dataset is too small for deep learning, we had to carry out an oversampling process to generate more instances artificially. The training and testing of the different models was carried out using the framework provided by the R language. In particular, for the training of the traditional models (Naïve Bayes, GLM, C5.0, etc.) we used the implementations already provided by the caret package [54]. In the case of the CNNs, we employed the keras package [55].

Table 1: Performance of several supervised classifiers, trained and tested on a dataset of 77 different people [52].

                Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes     0.8938              0.9580        0.8295
GLM¹            0.8855              0.8826        0.8884
C5.0 Tree       0.9108              0.9015        0.9200
RBF SVM²        0.9368              0.9410        0.9326
RF³             0.9444              0.9375        0.9516
SGB⁴            0.9324              0.9406        0.9242
CNN⁵            0.9791              0.9755        0.9822

¹ GLM = Generalized Linear Model
² RBF SVM = Support Vector Machine with radial basis kernel
³ RF = Random Forests
⁴ SGB = Stochastic Gradient Boosting
⁵ CNN = Convolutional Neural Network

However, despite the high performances shown in Table 1, we must admit that these numbers are unrealistic. Obtaining performances like those involves months of work, collecting and labeling data from different volunteers and re-tuning and re-obtaining increasingly robust models until significant improvements are no longer achieved. It is at that moment that the model is finally put into exploitation. In fact, we have tried all the models in Table 1 on a smartphone and we have identified several situations, quite common in everyday life, where all of them fail, such as walking with the mobile in a back pocket of the pants, carrying the mobile in a backpack, or shaking it slowly in the hand, simulating the forces the device experiences when the user is walking. No matter how complete we believe our dataset is, in many real problems there are going to be some situations that are poorly represented in the training data. The classical solution to this would be to collect new data from these situations where the model fails, try to obtain a better model that is robust to these situations and, once achieved, replace the old model with the new one.

In order to build an appropriate training dataset to evaluate the federated, continual and semi-supervised learning possibilities of our proposal, we developed an Android application that samples and logs the inertial data on mobile phones continuously, after being processed. This data is all unlabeled. Nevertheless, the app allows the user to indicate whether he/she is walking or not through a switch button in the graphical interface, but this is optional, so depending on the user's willingness to participate, there will be more or less labeled data. The app also labels some examples autonomously, applying a series of heuristic rules when it comes to clearly identifiable positives or negatives (e.g., when the phone is at rest). With this app, we collected partially labeled data from 10 different people. Participants installed our application and recorded data continuously while performing their usual routine. Table 2 summarizes the data distribution in the new semi-labeled dataset by user, and compares it with the already presented test set.

Table 2: Summary of the training and test sets, indicating the number of examples according to the label.

                         Walking   Not walking   Unlabeled   Total
Training set   user 1    3130      2250          4620        10000
               user 2    2519      4359          3122        10000
               user 3    186       325           125         636
               user 4    2432      2455          5113        10000
               user 5    233       2785          6982        10000
               user 6    554       1821          69          2444
               user 7    2582      2052          769         5403
               user 8    232       678           229         1139
               user 9    1151      2669          6180        10000
               user 10   2329      2669          5002        10000
               Total     15348     22063         32211       69622
Test set       Total     6331      1586          0           7917

Applying the same techniques as in Table 1, but now using the new dataset for training and the old one for testing (Table 2), the results obtained are those of Table 3. If we compare both tables, Table 1 and Table 3, it is clear that the latter results are substantially worse (approximately 10% worse in all the cases). This is logical, since training is now carried out using a totally new dataset, obtained in a more realistic and unconstrained way, which contains biased and partially labeled data from a much smaller number of users. Note that we have not obtained new results using CNNs in Table 3. This is because, for the experiments in Table 1, a dedicated mobile device was used, so recording raw data for training a CNN was not a problem. However, the new training set was obtained in a federated context using the participants' own smartphones, so recording raw data was not an option.

Table 3: Performance of supervised classifiers, trained and tested using the datasets described in Table 2.

              Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes   0.7949              0.9044        0.6854
GLM           0.7221              0.8206        0.6236
C5.0 Tree     0.8173              0.8842        0.7503
RBF SVM       0.8378              0.9430        0.7327
RF            0.8546              0.9021        0.8071
SGB           0.8596              0.9084        0.8108

Using LFedCon2, the training data from Table 2 is distributed among the different local devices. Data acquisition is continuous over time. For simplicity, we assume all local devices record data with the same frequency and, therefore, each iteration of LFedCon2 will correspond to a new pattern.
10
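To make this iteration scheme concrete, the following is a deliberately simplified, runnable sketch of the roles just described. It is not the authors' implementation: the class names (LocalDevice, GlobalEnsemble) are ours, and the drift detection and DEV voting described earlier are replaced here by a fixed retraining period and an externally supplied model selection.

    import numpy as np
    from collections import deque
    from sklearn.svm import SVC

    class LocalDevice:
        """One user's device: a bounded ensemble of at most Ml base models."""
        def __init__(self, Ml=5, window=500, retrain_every=100):
            self.models = deque(maxlen=Ml)   # oldest model dropped when full
            self.X, self.y = [], []          # recent (pseudo-)labeled patterns
            self.window, self.retrain_every, self.t = window, retrain_every, 0

        def step(self, x, label, global_model=None):
            self.t += 1
            # Semi-supervised labeling: an unlabeled pattern is labeled by
            # the most recent global model, if one has been received.
            if label is None:
                if global_model is None:
                    return
                label = int(global_model.predict(np.reshape(x, (1, -1)))[0])
            self.X.append(x); self.y.append(label)
            self.X, self.y = self.X[-self.window:], self.y[-self.window:]
            # Periodic retraining stands in for drift-triggered updates.
            if self.t % self.retrain_every == 0 and len(set(self.y)) > 1:
                clf = SVC(kernel="rbf")      # RBF SVM as base learner
                clf.fit(np.asarray(self.X), np.asarray(self.y))
                self.models.append(clf)

    class GlobalEnsemble:
        """Server side: majority vote over the Mg selected local models."""
        def __init__(self, selected_models):
            self.models = selected_models

        def predict(self, X):
            votes = np.stack([m.predict(np.atleast_2d(X)) for m in self.models])
            return (votes.mean(axis=0) >= 0.5).astype(int)  # 0/1 labels assumed

In LFedCon2 itself, the local models travel to the cloud, where the global ensemble is built, and it is the returned global ensemble that labels the unlabeled local data.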
size, Ml . Similarly, a subset of local models of size Mg is se- other than DEV. However, designing a good selection method
lected on the server to form the global ensemble model. In this is not straightforward as some challenges must be addressed.
experiments we used ensembles of 5 models, both locally and Firstly, local models must be evaluated without centralized data
globally, i.e., Ml = 5 and Mg = 5. Table 4 shows the perfor- being available anywhere and without even being able to re-
mance of the global model after 10000 iterations using each of tain all local data. In addition, a balance is required between
the methods in Table 3 as base local learners. Figure 5 com- precision in the selection and efficiency in the overall system.
pares the Receiver Operating Characteristic (ROC) curve of the Improving the voting system may involve more communication
traditional models from Table 3 with its federated equivalent at between server and devices, which limits the scalability of the
different iterations. proposal. Nevertheless, in the results we have obtained we can
notice how the global ensemble is able to provide good perfor-
Table 4: Performance of LFedCon2 after 10000 iterations, when Mg = Ml = 5, mance almost from the beginning. This leads us to think that
using as base classifier the methods in Table 3.
perhaps it is not necessarily bad that some suboptimal classi-
Balanced Accuracy Sensitivity Specificity fiers are chosen and even that the diversity on the global ensem-
LFedCon2 + NB 0.8280 0.7909 0.8651 ble is more important than the accuracy of each of its members.
LFedCon2 + GLM 0.7405 0.8668 0.6141
LFedCon2 + C5.0 0.8168 0.9665 0.6671 Table 5: Detail of the balanced accuracies achieved in the LFedCon2 setting of
LFedCon2 + SVM 0.8663 0.8499 0.8827 Figure 6.
LFedCon2 + RF 0.8314 0.9489 0.7129
LFedCon2 + SGB 0.8231 0.9187 0.7264 Number of iterations
Model 2500 5000 7500 10000
user 1 0.4069* 0.4069 0.4813 0.4813
As we can see, our federated proposal gives similar perfor-
mances to those achieved by a single model trained with all data user 2 0.7468* 0.7158* 0.7158 0.7157
available. However, the comparison between Tables 3 and 4 is user 3 - - 0.6730* 0.6730*
not fair, as they correspond to two very different approaches. user 4 0.4999 0.7549* 0.7549* 0.7549*
Note that in the LFedCon2 setting of Table 4 we are only using user 5 0.8093* 0.8093* 0.8093* 0.8093*
half of the users to build the global ensemble model (Mg = 5), user 6 - - 0.6080* 0.6080*
whereas the models in Table 3 use all labeled data from all user 7 0.5555* 0.7016 0.7016 0.7016
users. Furthermore, the use of RF or SGB as base classifiers user 8 0.6789 0.8644 0.8644 0.8644
does not seem to be the best option in the federated scenario, user 9 0.5882* 0.5882* 0.8304 0.8304
since these algorithms are already ensembles. For these cases user 10 - 0.7266* 0.7306* 0.7306*
there are probably much more efficient combination strategies global 0.8603 0.8552 0.8663 0.8663
that could be tested. For example, by creating an ensemble of * Cells marked with an asterisk indicate that the current model of
Random Forests we are combining the final decisions of each that user was chosen to participate in the global ensemble.
of the forest, when it might be better to combine the decisions
of all the trees of each forest, at a lower level. Proposals like In Figure 6, the thick black line corresponds to the global
this are out of the scope of this work, in which we have opted model accuracy while the rest of the coloured lines are each of
for a global learning method scalable to multiple settings. the 10 anonymous users. As in each iteration the unlabeled lo-
In order to better illustrate the learning process of our cal data is labeled with the most recent global model, the more
method, we will take as an example the case where we use a unlabeled data the device has, the more it will be enriched by
RBF SVM as base classifier (fourth row in Table 4). Figure 6 the global knowledge. The bottom graph shows the drift de-
and Table 5 show how the performance of LFedCon2 evolves tection and adaptation. Once again, each coloured line corre-
over the iterations. The upper graph in Figure 6 shows the evo- sponds to one user. In those places where the line is not drawn
lution of the accuracies of all the models —from each of the it means that the device is not taking data. A circumference (◦)
10 users and also the global one—. Table 5 provides the exact on the line indicates when a drift has been detected and the lo-
values of these accuracies. In each column of Table 5, those cal model has been updated. If the circumference is filled (•) it
local models that are part of the global model in that iteration indicates that the local model is chosen as one of the 5 models
are marked with an asterisk (*). We can appreciate that the of the global ensemble –the other 4 chosen are marked with a
global ensemble is not always selecting the most accurate lo- cross (×)–.
cal models. This actually makes sense, since local models are As we can see, LFedCon2 allows the society of devices to
chosen using the Distributed Effective Voting (DEV) from Fig- have a classifier from practically the beginning and to evolve
ure 3, which is based on the voting carried out by a subset of it over time. In order to illustrate some other advantages of
q users randomly chosen (5 in this case, because q = Mg = 5). our approach, we have artificially modified the training dataset
The fact that some suboptimal local models are being selected in different ways. Suppose now there are some users who are
indicates that there are devices that value these models highly, mislabeling the data, whether intentionally or not. To simulate
which means that the selection criteria of the voters can be im- this, we have inverted all the labels of four of the users. This
proved. Remember that LFedCon2 needs a mechanism for se- is the maximum number of local datasets we can poison so that
lecting local models to build the global one, but this could be the system continues to properly choose the best participants for
11
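Returning to the model-selection step: the exact DEV protocol is specified in Figure 3 of the paper, so the sketch below only illustrates its flavour, with each of the q randomly chosen voters scoring every candidate model on data that never leaves its device and the server keeping the Mg most voted models. The function names and the vote-counting rule are our own simplifications, not the actual DEV algorithm.

    import numpy as np

    def local_balanced_accuracy(model, X, y):
        """Balanced accuracy of a candidate model on one voter's private data."""
        pred, y = np.asarray(model.predict(X)), np.asarray(y)
        return float(np.mean([np.mean(pred[y == c] == c) for c in np.unique(y)]))

    def select_global_members(candidates, voter_data, Mg=5, q=5, seed=0):
        """candidates: dict user_id -> model; voter_data: dict user_id -> (X, y)
        held privately on each device. Returns the Mg most voted user ids."""
        rng = np.random.default_rng(seed)
        votes = {uid: 0 for uid in candidates}
        for v in rng.choice(list(voter_data), size=q, replace=False):
            Xv, yv = voter_data[v]
            scores = {uid: local_balanced_accuracy(m, Xv, yv)
                      for uid, m in candidates.items()}
            # Each voter supports its top Mg candidates; only these votes
            # (never the data) would be sent to the server.
            for uid in sorted(scores, key=scores.get, reverse=True)[:Mg]:
                votes[uid] += 1
        return sorted(votes, key=votes.get, reverse=True)[:Mg]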
Figure 5: ROC curves provided by a single classifier (Table 3) compared with the global model from the federated training (Table 4) at iterations 2500, 5000, 7500 and 10000. Panels (a)-(f) use NB, GLM, C5.0, RBF SVM, RF and SGB as base classifiers, respectively.
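Curves like those of Figure 5 can be computed from any classifier that yields class scores. A generic sketch with scikit-learn, assuming the ensemble members expose predict_proba and scoring the ensemble by the average probability of the positive class (one plausible choice; the paper does not fix this detail):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    def ensemble_roc(models, X_test, y_test):
        """ROC of an ensemble, scored by the mean predicted probability
        of the positive ('walking') class over all member models."""
        scores = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
        fpr, tpr, thresholds = roc_curve(y_test, scores)
        return fpr, tpr, auc(fpr, tpr)  # tpr = sensitivity, fpr = 1 - specificity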

As we can see, LFedCon2 allows the society of devices to have a classifier practically from the beginning and to evolve it over time. In order to illustrate some other advantages of our approach, we have artificially modified the training dataset in different ways. Suppose now that there are some users who are mislabeling the data, whether intentionally or not. To simulate this, we have inverted all the labels of four of the users. This is the maximum number of local datasets we can poison such that the system still properly chooses the best participants for the global ensemble. In this example we have chosen three very active users (users 1, 2 and 9) and one more who is not very involved. The results using a single classifier trained on all the available data are those in Table 6. The results using our method with different base classifiers are those shown in Table 7. Figure 7 compares the ROC curves of the traditional models with the federated ones in this scenario. As can be appreciated, in this case LFedCon2 far outperforms the classical approach. This is because the system is able to identify the noisy users and remove them from the global ensemble. In addition, the devices of those users will use the global model to correctly label their unlabeled data, thus overcoming the manually mislabeled data. Figure 8 shows the details of our federated and continual learning process in this scenario, in the particular case where the RBF SVM is used as base classifier.

Table 6: Performance of traditional classifiers when there are users who provide noisy data.

              Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes        0.4837           0.2008        0.7667
GLM                0.6283           0.5321        0.7245
C5.0               0.4293           0.4645        0.3941
RBF SVM            0.5943           0.6986        0.4899
RF                 0.4679           0.5102        0.4256
SGB                0.4499           0.3916        0.5082

Finally, we have simulated a big drift in the data. The biggest drift the system could experience would be a total inversion of the meaning of the two classes. Therefore, from instant t = 2500, all devices begin to identify patterns as "walking" where before they would have been "not walking", and vice versa. Figure 9 shows the process, again using the RBF SVM as base classifier. As we can see, it takes the system a while to converge because the change is so drastic, but it ends up achieving it. Obviously, if we reduced the size of the local ensembles, Ml, the system would adapt faster.
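Both stress tests are easy to reproduce on any stream of binary-labeled patterns. A minimal sketch, assuming labels encoded as 0/1 and arrays ordered by iteration (the helper names are ours):

    import numpy as np

    def poison_users(y, user_ids, poisoned):
        """Simulate users who mislabel everything: invert the 0/1 labels
        of every pattern belonging to the selected users."""
        y = y.copy()
        mask = np.isin(user_ids, list(poisoned))
        y[mask] = 1 - y[mask]
        return y

    def invert_concept(y, t_drift=2500):
        """Simulate the drastic drift above: from iteration t_drift onwards,
        the meaning of the two classes is totally inverted."""
        y = y.copy()
        y[t_drift:] = 1 - y[t_drift:]
        return y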
Table 7: Performance of LFedCon2 when there are users who provide noisy data (Mg = Ml = 5).

                   Balanced Accuracy   Sensitivity   Specificity
LFedCon2 + NB           0.7572           0.6116        0.9029
LFedCon2 + GLM          0.7495           0.9940        0.5050
LFedCon2 + C5.0         0.7834           0.8681        0.6986
LFedCon2 + SVM          0.8632           0.8356        0.8909
LFedCon2 + RF           0.8137           0.9888        0.6387
LFedCon2 + SGB          0.8269           0.9684        0.6854

Figure 6: Results for LFedCon2 using the RBF SVM as local base classifier and Mg = 5 and Ml = 5. The upper graph shows the evolution of the accuracy over time. The bottom graph shows the drifts. The thick black line corresponds to the global model. The rest of the coloured lines are each of the 10 anonymous users.

Figure 7: ROC curves provided by a single classifier (Table 6) compared with the global model from the federated training (Table 7) at iterations 2500, 5000, 7500 and 10000 when there are users who provide noisy data.

Figure 8: Results for LFedCon2 (using RBF SVM as base classifier and Mg = Ml = 5) when 4 of the users are mislabeling the data.

Figure 9: Results for LFedCon2 (using RBF SVM as base classifier and Mg = Ml = 5) when a drastic drift occurs at t = 2500.
7. Conclusions

In this paper we have presented LFedCon2, a novel method for light, semi-supervised, asynchronous, federated and continual classification. Our method makes it possible to face real tasks where several distributed devices (smartphones, robots, etc.) continuously acquire biased and partially labeled data, which evolves over time in unforeseen ways. It basically consists of training local models in each of the devices and then sharing those models with the cloud, where a global model is created. This global model is sent back to the devices and helps them to label unlabeled data so as to improve the performance of their local models. We have applied our proposal to a real classification task, walking recognition, and we have shown that it obtains very high performances without the need for large amounts of labeled data collected at the beginning.

We believe that we have opened a very promising line of research in the context of federated learning, and that there is still a lot of work to be done. In this work we have presented a new method with clearly separated components: local learning, semi-supervised labeling, local drift detection and adaptation, and selection of local candidates for the subsequent global consensus. However, we believe that all of these modules can be further improved. For example, we want to explore optimal ways to combine models locally and also in the cloud, study effective ways to perform distributed feature selection, and analyze the use of the global model for instance selection in the local devices or the application of techniques such as amending [56]. On the other hand, real-world, physically distributed devices have an intrinsic data skewness, so, depending on the problem, instead of having a single global model it could be more interesting to cluster users or devices that share similar properties and then obtain several global models, one for each group. Another promising approach could be to have some system of local adaptation of the global model, for example using transfer learning, so that it would be able to adjust to local particularities.

Acknowledgements

This research has received financial support from AEI/FEDER (European Union) grant number TIN2017-90135-R, as well as the Consellería de Cultura, Educación e Ordenación Universitaria of Galicia (accreditation 2016–2019, ED431G/01 and ED431G/08, and reference competitive group ED431C2018/29), and the European Regional Development Fund (ERDF). It has also been supported by the Ministerio de Educación, Cultura y Deporte of Spain in the FPU 2017 program (FPU17/04154).

Declarations of interest

None.

References

[1] S. Li, L. Da Xu, S. Zhao, 5G internet of things: A survey, Journal of Industrial Information Integration 10 (2018) 1–9.
[2] J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, A survey of machine learning for big data processing, EURASIP Journal on Advances in Signal Processing 2016 (1) (2016) 67.
[3] G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, S. Sinha, Real-time video analytics: The killer app for edge computing, Computer 50 (10) (2017) 58–67.
[4] U. Montanaro, S. Dixit, S. Fallah, M. Dianati, A. Stevens, D. Oxtoby, A. Mouzakitis, Towards connected autonomous driving: review of use-cases, Vehicle System Dynamics 57 (6) (2019) 779–814.
[5] B. Custers, A. M. Sears, F. Dechesne, I. Georgieva, T. Tani, S. van der Hof, EU Personal Data Protection in Policy and Practice, Springer, 2019.
[6] B. M. Gaff, H. E. Sussman, J. Geetter, Privacy and big data, Computer 47 (6) (2014) 7–9.
[7] K. Dolui, S. K. Datta, Comparison of edge computing implementations: Fog computing, cloudlet and mobile edge computing, in: 2017 Global Internet of Things Summit (GIoTS), IEEE, 2017, pp. 1–6.
[8] T. H. Luan, L. Gao, Z. Li, Y. Xiang, G. Wei, L. Sun, Fog computing: Focusing on mobile users at the edge, arXiv preprint arXiv:1502.01815.
[9] Q. Li, Z. Wen, B. He, Federated learning systems: Vision, hype and reality for data privacy and protection, arXiv preprint arXiv:1907.09693.
[10] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, Federated learning of deep networks using model averaging, arXiv preprint arXiv:1602.05629v1.
[11] P. Nakkiran, R. Alvarez, R. Prabhavalkar, C. Parada, Compressing deep neural networks using a rank-constrained topology, in: Proceedings of the 16th Annual Conference of the International Speech Communication Association, ISCA, 2015, pp. 1473–1477.
[12] A. Vasilyev, CNN optimizations for embedded systems and FFT, Stanford University Report.
[13] L. Zhang, Transfer adaptation learning: A decade survey, arXiv preprint arXiv:1903.04687.
[14] N. D. Lane, P. Georgiev, Can deep learning revolutionize mobile sensing?, in: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, ACM, 2015, pp. 117–122.
[15] D. Peteiro-Barral, B. Guijarro-Berdiñas, A survey of methods for distributed machine learning, Progress in Artificial Intelligence 2 (1) (2013) 1–11.
[16] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, J. S. Rellermeyer, A survey on distributed machine learning, arXiv preprint arXiv:1912.09789.
[17] R. Gu, S. Yang, F. Wu, Distributed machine learning on mobile devices: A survey, arXiv preprint arXiv:1909.08329.
[18] H. Robbins, S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics (1951) 400–407.
[19] B. Recht, C. Re, S. Wright, F. Niu, Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[20] F. Andersson, M. Carlsson, J.-Y. Tourneret, H. Wendt, A new frequency estimation method for equally and unequally spaced data, IEEE Transactions on Signal Processing 62 (21) (2014) 5761–5774.
[21] F. Lin, M. Fardad, M. R. Jovanović, Design of optimal sparse feedback gains via the alternating direction method of multipliers, IEEE Transactions on Automatic Control 58 (9) (2013) 2426–2431.
[22] O. Shamir, N. Srebro, T. Zhang, Communication-efficient distributed optimization using an approximate Newton-type method, in: International Conference on Machine Learning, 2014, pp. 1000–1008.
[23] Y. Zhang, X. Lin, DiSCO: Distributed optimization for self-concordant empirical loss, in: International Conference on Machine Learning, 2015, pp. 362–370.
[24] O. Sagi, L. Rokach, Ensemble learning: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4) (2018) e1249.
[25] G. Tsoumakas, I. Vlahavas, Effective stacking of distributed classifiers, in: ECAI, Vol. 2002, 2002, pp. 340–344.
[26] L. O. Hall, N. Chawla, K. W. Bowyer, W. P. Kegelmeyer, Learning rules from distributed data, in: Large-Scale Parallel Data Mining, Springer, 2000, pp. 211–220.
[27] Y. Guo, J. Sutiwaraphun, Probing knowledge in distributed data mining, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 1999, pp. 443–452.
[28] L. O. Hall, N. Chawla, K. W. Bowyer, Decision tree learning on very large data sets, in: SMC'98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3, IEEE, 1998, pp. 2579–2584.
[29] A. Lazarevic, Z. Obradovic, Boosting algorithms for parallel and distributed learning, Distributed and Parallel Databases 11 (2) (2002) 203–229.
[30] J. Konečný, B. McMahan, D. Ramage, Federated optimization: Distributed optimization beyond the datacenter, arXiv preprint arXiv:1511.03575.
[31] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, C. Miao, Federated learning in mobile edge networks: A comprehensive survey, arXiv preprint arXiv:1909.11875.
[32] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[34] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, V. Chandra, Federated learning with non-IID data, arXiv preprint arXiv:1806.00582.
[35] S. Caldas, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, A. Talwalkar, LEAF: A benchmark for federated settings, arXiv preprint arXiv:1812.01097.
[36] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Machine Learning 23 (1) (1996) 69–101.
[37] J. Yoon, W. Jeong, G. Lee, E. Yang, S. J. Hwang, Federated continual learning with adaptive parameter communication, arXiv preprint arXiv:2003.03196.
[38] J. Czyz, J. Kittler, L. Vandendorpe, Multiple classifier combination for face-based identity verification, Pattern Recognition 37 (7) (2004) 1459–1469.
[39] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 226–239.
[40] K. Tumer, J. Ghosh, Error correlation and error reduction in ensemble classifiers, Connection Science 8 (3-4) (1996) 385–404.
[41] A. K. Jain, B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, Handbook of Statistics 2 (1982) 835–855.
[42] L. Kanal, B. Chandrasekaran, On dimensionality and sample size in statistical pattern classification, Pattern Recognition 3 (3) (1971) 225–234.
[43] S. J. Raudys, A. K. Jain, Small sample size effects in statistical pattern recognition: Recommendations for practitioners, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (3) (1991) 252–264.
[44] R. P. Duin, D. M. Tax, Classifier conditional posterior probabilities, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 1998, pp. 611–619.
[45] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, N. Díaz-Rodríguez, Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges, Information Fusion 58 (2020) 52–68.
[46] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under concept drift: A review, IEEE Transactions on Knowledge and Data Engineering.
[47] A. Haque, L. Khan, M. Baron, SAND: Semi-supervised adaptive novel class detection and classification over data stream, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 1652–1658.
[48] M. Baron, Convergence rates of change-point estimators and tail probabilities of the first-passage-time process, Canadian Journal of Statistics 27 (1) (1999) 183–197.
[49] D. H. Wolpert, Stacked generalization, Neural Networks 5 (2) (1992) 241–259.
[50] P. K. Chan, S. J. Stolfo, et al., Toward parallel and distributed learning by meta-learning, in: AAAI Workshop on Knowledge Discovery in Databases, 1993, pp. 227–240.
[51] G. Tsoumakas, L. Angelis, I. Vlahavas, Clustering classifiers for knowledge discovery from physically distributed databases, Data & Knowledge Engineering 49 (3) (2004) 223–242.
[52] F. E. Casado, G. Rodríguez, R. Iglesias, C. V. Regueiro, S. Barro, A. Canedo-Rodríguez, Walking recognition in mobile devices, Sensors 20 (4).
[53] G. Rodríguez, F. E. Casado, R. Iglesias, C. V. Regueiro, A. Nieto, Robust step counting for inertial navigation with mobile phones, Sensors 18 (9).
[54] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, B. Kenkel, R Core Team, et al., Package 'caret', The R Journal.
[55] J. Allaire, F. Chollet, et al., keras: R interface to 'Keras', R package version 2 (3).
[56] I. Triguero, S. García, F. Herrera, Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study, Knowledge and Information Systems 42 (2) (2015) 245–284.
