Federated and Continual Learning for Classification Tasks in a Society of Devices
Fernando E. Casado (a,*), Dylan Lema (a), Roberto Iglesias (a), Carlos V. Regueiro (b), Senén Barro (a)
(a) CiTIUS (Centro Singular de Investigación en Tecnoloxías Intelixentes), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain
(b) CITIC, Computer Architecture Group, Universidade da Coruña, 15071 A Coruña, Spain
Abstract
Today we live in a context in which devices are increasingly interconnected, sensorized and almost ubiquitous. In recent years, deep learning has become a popular way to extract knowledge from the huge amount of data that these devices are able to collect. Nevertheless, centralized state-of-the-art learning methods have a number of drawbacks when facing real distributed problems, in which the available information is usually private, partial, biased and evolving over time. Federated learning is a popular framework that allows multiple distributed devices to train models remotely, collaboratively and while preserving data privacy. However, current proposals in federated learning focus on deep architectures that in many cases are not feasible to implement on non-dedicated devices such as smartphones. Also, little research has been done on the scenario where the data distribution changes over time in unforeseen ways, causing what is known as concept drift. Therefore, in this work we present Light Federated and Continual Consensus (LFedCon2), a new federated and continual architecture that uses light, traditional learners. Our method allows low-power devices (such as smartphones or robots) to learn in real time, locally, continuously, autonomously and from users, while also improving models globally, in the cloud, by combining what is learned locally on the devices. In order to test our proposal, we have applied it to a heterogeneous community of smartphone users to solve the problem of walking recognition. The results show the advantages that LFedCon2 provides with respect to other state-of-the-art methods.
Keywords: federated learning, continual learning, distributed learning, semi-supervised classification, cloud-based ensemble,
smartphones.
s_i^k = (x_i^k, y_i^k) is one data instance from client d_k, where x_i^k is the feature vector and y_i^k is the label. Each local dataset S_{0,t}^k has a certain joint probability distribution P_t^k(x, y). A local concept drift occurs at timestamp t for client d_k if ∃t : P_t^k(x, y) ≠ P_{t+1}^k(x, y).

However, note that a local concept drift does not necessarily have a direct impact on the global federated model. It may be the case that a local drift on device d_k results in a change in the distribution of d_k, but not in the joint distribution of all clients, P_t^G(x, y). In that case, the federated model will not be affected by the local change and this change can be ignored. Thus, we can define a global concept drift as a distribution change at time t in one or more clients {d_k, ..., d_l} ⊆ {d_1, ..., d_C} such that ∃t : P_t^G(x, y) ≠ P_{t+1}^G(x, y). If a global concept drift happens, the model might lower its performance, so we need to detect these global changes and adapt to them.

Research on learning under concept drift is typically classified into three components [46]: (1) drift detection (whether or not drift occurs), (2) drift understanding (when, how and where it occurs) and (3) drift adaptation (reaction to the existence of drift). In this work, we have focused on drift detection and adaptation.
4.1.1. Drift detection
Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift by identifying change points or change intervals in the underlying data distribution. In our proposal, if a concept drift is identified it means that the local model is no longer a good abstraction of the knowledge of the device, so it must be updated. Drift detection algorithms are typically classified into three categories [46]: (1) error rate-based methods, (2) data distribution-based methods and (3) multiple hypothesis test methods. The first class of algorithms focuses on tracking changes in the online error rate of a reference classifier (in our case, the latest local model). The second type uses a distance function or metric to quantify the dissimilarity between the distribution of historical data and the new data. The third group combines techniques from the two previous categories in different ways. Error rate-based methods operate only on truly labeled data, because they need the labels to estimate the error rate of the classifier. Therefore, in order to take advantage of both labeled and unlabeled data, in this work we decided to use a data distribution-based algorithm, which does not have such a restriction. Thus, we have used an adapted version of the change detection technique (CDT) originally proposed in [47], which is a CUSUM-type CDT on the beta distribution [48]. Algorithm 1 outlines the proposed method.

Algorithm 1: Change detection algorithm.
 1  Input: W, N, x, Th, α, ∆, Nmax
 2  Output: W, N
 3  [ŷ, ς_x] ← classify(x)
 4  w_new ← [x, ς_x]
 5  W ← W ∪ w_new                  // add the pattern and its confidence to W
 6  N ← N + 1
 7  if N ≥ Nmax then
 8      w_0 ← ∅                    // remove the oldest element in W
 9      N ← N − 1
10  end
11  s_f ← 0
12  r ← random(0, 1)               // generate a random number in the interval [0, 1]
13  if e^(−2ς_x) ≥ r then
14      for k ← ∆ to N − ∆ do
15          m_b ← mean(ς_1 : ς_k ∈ W)
16          m_a ← mean(ς_(k+1) : ς_N ∈ W)
17          if m_a ≤ (1 − α) · m_b then
18              s_k ← 0
19              [α̂_b, β̂_b] ← estimateParams(ς_1 : ς_k ∈ W)
20              [α̂_a, β̂_a] ← estimateParams(ς_(k+1) : ς_N ∈ W)
21              for i ← k + 1 to N do
22                  s_k ← s_k + log( f(ς_i ∈ w_i | α̂_a, β̂_a) / f(ς_i ∈ w_i | α̂_b, β̂_b) )
23              end
24              if s_k > s_f then
25                  s_f ← s_k
26              end
27          end
28      end
29      if s_f > Th then
30          driftAdaptation(W)     // see details in Section 4.1.2
31          W ← {∅}                // reinitialize the sliding window
32          N ← 0
33      end
34  end
As soon as a new instance x is available, Algorithm 1 is invoked. First of all, the confidence of the current local classifier on that instance, ς_x, is estimated (line 3 in the pseudocode) and both the instance and its confidence are stored in a sliding window W of length N (lines 4 to 6). Note that W = {w_1, w_2, ..., w_N} is a history of tuples of the form "[instance, confidence]", so that W = [X, Σ], where X = {x_1, x_2, ..., x_N} is the history of instances and Σ = {ς_1, ς_2, ..., ς_N} the associated confidence vector. We will try to detect changes in the confidence of the predictions provided by the current local model for the patterns in W. In the original method proposed in [47], the authors do not use a sliding window, but a dynamic window, which is reinitialized every time a drift is identified. However, they do not establish any limit on the size of this dynamic window, which is not very realistic, since its size could grow to infinity if no drift is detected. Therefore, in our proposal, we use a sliding window and we set a maximum size Nmax = 20∆, where ∆ is a strongly related parameter used in the core of the CDT algorithm, as we will explain below. Once this maximum size is reached, adding a new element to W implies deleting the oldest one (lines 7 to 9).

The core of this CDT algorithm (lines 14 to 32) can be a bottleneck in our system if we have to execute it after inserting each confidence value in W. Therefore, as shown in Figure 2, CDT will be invoked only if W contains a representative number of labeled instances X_L ∈ W, i.e., there are at least 2C labeled instances for each class. However, once this condition is met, CDT would be invoked for every new recorded sample. Therefore, we restrict the number of executions, so that the core of Algorithm 1 is executed with a probability of e^(−2ς_x) (line 13 in the pseudocode). Hence, the higher the confidence, the lower the probability of executing CDT, and vice versa.
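As an illustration of the window maintenance (lines 4 to 9) and the probabilistic gating of line 13, the following minimal Python sketch shows one possible implementation. The names (on_new_instance, local_model.confidence) and the use of a deque are our own assumptions, not the authors' code.

```python
import math
import random
from collections import deque

DELTA = 100               # ∆, minimum sub-window size (value used in the paper)
N_MAX = 20 * DELTA        # maximum sliding-window size, Nmax = 20∆

# Sliding window W of (instance, confidence) tuples; the deque drops the oldest
# element automatically once the maximum size is reached (lines 7 to 9).
W = deque(maxlen=N_MAX)

def on_new_instance(x, local_model):
    """Store the new pattern and decide whether to run the CDT core."""
    confidence = local_model.confidence(x)   # ς_x, e.g. the ensemble's estimated posterior
    W.append((x, confidence))
    # Probabilistic gating (line 13): the core runs with probability e^(-2·ς_x),
    # so highly confident predictions rarely trigger the expensive check.
    run_core = math.exp(-2.0 * confidence) >= random.random()
    return run_core
```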
In the core of the CDT algorithm, W is divided into two sub-windows for every pattern k between ∆ and N − ∆ (lines 14 to 16 in Algorithm 1). Let W_a and W_b be the two sub-windows, where W_a contains the most recent instances and their confidences. Each sub-window is required to contain at least ∆ examples to preserve the statistical properties of a distribution. When a concept drift occurs, confidence scores are expected to decrease. Therefore, only changes in the negative direction need to be detected. In other words, if m_a and m_b are the mean values of the confidences in W_a and W_b respectively, a change point is searched for only if m_a ≤ (1 − α) · m_b, where α is the sensitivity to change (line 17). As in [47], we use α = 0.05 and ∆ = 100 in our experiments, values which are also widely used in the literature.

We can model the confidence values in each sub-window, W_a and W_b, as two different beta distributions. However, the actual parameters of each one are unknown. The proposed CDT algorithm estimates these parameters at lines 19 and 20. Next, the sum of the log-likelihood ratios s_k is calculated in the inner loop between lines 21 and 23, where f(ς_i ∈ w_i | α̂, β̂) is the probability density function (PDF) of the beta distribution with estimated parameters (α̂, β̂), applied to the confidence ς_i of the tuple w_i = [x_i, ς_i] ∈ W. This PDF describes the relative likelihood for a random variable, in this case the confidence ς, to take on a given value, and it is defined as:

    f(ς | α, β) = ς^(α−1) (1 − ς)^(β−1) / B(α, β)  if 0 < ς < 1,  and f(ς | α, β) = 0 otherwise,      (2)

where

    B(α, β) = ∫₀¹ ς^(α−1) (1 − ς)^(β−1) dς.      (3)
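The normalizing constant in Equation (3) can also be written in terms of gamma functions, B(α, β) = Γ(α)Γ(β)/Γ(α + β). The short sketch below evaluates Equation (2) that way and checks it against SciPy's beta density; the helper names are ours and SciPy is assumed to be available.

```python
import math
from scipy.stats import beta

# Beta function via gamma functions: B(a, b) = Γ(a)Γ(b) / Γ(a + b),
# which equals the integral in Equation (3).
def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Equation (2): density of a confidence value 0 < s < 1.
def beta_pdf(s, a, b):
    if not 0.0 < s < 1.0:
        return 0.0
    return s ** (a - 1) * (1.0 - s) ** (b - 1) / beta_fn(a, b)

# Sanity check against SciPy's implementation.
print(beta_pdf(0.8, 5.0, 2.0))     # ~2.4576
print(beta.pdf(0.8, 5.0, 2.0))     # same value
```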
The variable s_k is a dissimilarity score for each iteration k of the outer loop between lines 14 and 28. The larger the difference between the PDFs in W_a and W_b, the higher the value of s_k (line 22). Let k_max be the value of k for which the algorithm obtains the maximum s_k value, with ∆ ≤ k ≤ N − ∆. Finally, a change is detected at point k_max if s_{k_max} ≡ s_f is greater than a prefixed threshold Th (line 29). As in the original work, we use Th = −log(α). In case a drift is detected, a drift adaptation strategy is applied (line 30); we will discuss this strategy in detail in Section 4.1.2. Moreover, the sliding window W is reinitialized (lines 31 and 32).
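The core of the detector (lines 14 to 32) could be sketched as follows. The paper does not specify how estimateParams works, so the method-of-moments estimator used here is an assumption, as are the function and variable names; confidences are assumed to lie strictly inside (0, 1).

```python
import math
import numpy as np
from scipy.stats import beta

ALPHA = 0.05                 # sensitivity to change, α
DELTA = 100                  # minimum sub-window size, ∆
TH = -math.log(ALPHA)        # detection threshold, Th = -log(α)

def estimate_params(confidences):
    """Method-of-moments estimate of beta parameters (one possible choice
    for the estimateParams step in Algorithm 1)."""
    m, v = np.mean(confidences), np.var(confidences)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common          # α̂, β̂

def detect_change(confidences):
    """Return True if a drift is detected in the confidence history Σ."""
    s = np.asarray(confidences, dtype=float)
    n, s_f = len(s), 0.0
    for k in range(DELTA, n - DELTA + 1):
        before, after = s[:k], s[k:]               # ς_1..ς_k and ς_(k+1)..ς_N
        m_b, m_a = before.mean(), after.mean()
        if m_a > (1.0 - ALPHA) * m_b:              # only drops in confidence matter (line 17)
            continue
        a_b, b_b = estimate_params(before)
        a_a, b_a = estimate_params(after)
        # Sum of log-likelihood ratios over the most recent sub-window (lines 21 to 23).
        s_k = np.sum(beta.logpdf(after, a_a, b_a) - beta.logpdf(after, a_b, b_b))
        s_f = max(s_f, s_k)
    return s_f > TH                                # line 29
```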
4.1.2. Drift adaptation
Once a drift is detected, the local classifier should be updated according to that drift. There exist three main groups of drift adaptation (or reaction) methods [46]: (1) simple retraining, (2) ensemble retraining and (3) model adjusting. The first strategy is to simply retrain a new model, combining in some way the latest data and the historical data, to replace the obsolete model. The second one preserves old models in an ensemble and, when a new one is trained, it is added to the ensemble, so that the local model itself is the ensemble. The third approach consists of developing a model that adaptively learns from the changing data by partially updating itself. This last strategy is arguably the most efficient when drift only occurs in local regions. However, online model adjusting is not straightforward and it depends on the specific learning algorithm being used. In fact, most of the methods in this category are based on decision trees, because trees have the ability to examine and adapt to each sub-region separately.
In our case, we apply ensemble retraining. In particular, as we already mentioned before, we propose a simple rule-based ensemble using the median rule (Equation (1)) for local adaptation to concept drift. Each local device is allowed to keep up to Ml local models, which make up its ensemble. When a drift is detected, a new model is trained using only the labeled data stored in the sliding window W described before (Section 4.1.1). This new model is added to the ensemble. If there are already Ml models in the ensemble, the new one replaces the oldest, thus ensuring that there are at most Ml models at any time. With this strategy, we also address the infinite length problem, as a constant and limited amount of memory is enough to keep both the training data and the ensemble. In our experiments we used Ml = 5.

Therefore, we conceive each local model as an ensemble of base historical models. From now on, for simplicity, we will refer to this ensemble simply as the local model of the device. Note that the estimated posterior probability of the ensemble from Equation (1) is the confidence of the local model that we use for drift detection.
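A minimal sketch of this local ensemble and of its reaction to drift is shown below. The class and method names are ours, the base models are assumed to expose a predict_proba-style interface, and train_base_model stands for whatever base learner the device uses.

```python
import numpy as np

class LocalEnsemble:
    """Local model = ensemble of up to M_l base models combined with the median rule."""

    def __init__(self, max_models=5):           # M_l = 5 in the experiments
        self.max_models = max_models
        self.models = []                         # oldest first

    def posterior(self, x):
        """Median of the base models' estimated posteriors (one value per class).
        Its maximum is used as the confidence ς_x for drift detection."""
        probs = np.stack([m.predict_proba(x) for m in self.models])
        return np.median(probs, axis=0)

    def predict(self, x):
        return int(np.argmax(self.posterior(x)))

    def adapt_to_drift(self, labeled_window, train_base_model):
        """On drift: train a new base model on the labeled data currently in W
        and add it, dropping the oldest model if the ensemble is full."""
        new_model = train_base_model(labeled_window)
        if len(self.models) >= self.max_models:
            self.models.pop(0)                   # replace the oldest member
        self.models.append(new_model)
```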
5. Global learning

Every time a local device trains or updates its local model, the changes in that local model are reported to the cloud. For the construction of the global model, we decided once again to use an ensemble, because of the good properties of this kind of method, which we already mentioned in Section 4. As at the local level, any state-of-the-art ensemble technique could be used to combine the predictions of the local models. Stacking, or stacked generalization [49], is a technique for achieving the highest generalization accuracy that has probably been the most widely applied in distributed learning [50, 15, 25]. Stacking tries to induce which base classifiers (in our case, local models) are reliable and which are not by training a meta-learner. This meta-learner is a higher-level model which is trained on a meta-dataset composed from the outputs of all base classifiers on a given training set. However, it presents a great limitation, which is the need to have this meta-dataset available in advance. In the context of devices, this would involve gathering a certain amount of labeled data from all users in the cloud before being able to create a global model. Therefore, a better solution is to use a simpler but equally effective rule-based ensemble, as we already do at the local level (Section 4). In this case, the optimal combining rule is the product rule, because each local classifier operates in a different measure space [39]: different environments, sensors, users, etc. The product rule predicts that an instance x belongs to class c_j if the following condition is met:

    ∏_{i=1}^{N} y_{ij}(x) = max_{k=1,...,C} ∏_{i=1}^{N} y_{ik}(x),      (4)

where C is the number of possible classes (c_1, c_2, ..., c_C), N is the number of base classifiers and y_i = {y_{i1}(x), ..., y_{iC}(x)} is the output of the i-th classifier, for i = 1, ..., N. Note that the outputs of this global ensemble are the estimated posterior probabilities that we use on each local device as a confidence measure to decide which unlabeled patterns can and cannot be labeled.
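The following sketch applies the product rule of Equation (4) to a set of per-class posteriors returned by the local models. Computing the product in log space and clipping zero probabilities are our own choices for numerical stability; the function name is hypothetical.

```python
import numpy as np

def product_rule_predict(posteriors):
    """posteriors: array of shape (N, C) with each local model's estimated
    posteriors for one instance x (rows sum to 1).
    Returns the predicted class and the combined (normalized) posteriors."""
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
    log_product = np.sum(np.log(p), axis=0)        # ∏_i y_ik(x), computed in log space
    combined = np.exp(log_product - log_product.max())
    combined /= combined.sum()                     # renormalize over the classes
    return int(np.argmax(combined)), combined

# Example with N = 3 local models and C = 2 classes.
pred, conf = product_rule_predict([[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]])
print(pred, conf)    # class 0, together with its combined confidence
```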
Although the time complexity on the server side is linear, O(N), including all local models in the ensemble is not the best option, for several reasons. First, depending on the problem, there could be thousands or millions of connected devices, so using an ensemble of those dimensions could be computationally very expensive. As the global model is sent back to the local devices, it would also have a negative impact on the bandwidth and computational requirements of the nodes. In addition, assuming that there is an optimal global abstraction of knowledge, not all the local models will reflect such knowledge in equal measure or bring the same wealth to the ensemble. On the contrary, there will be devices that, accidentally or intentionally, may be creating local models totally separated from reality, which should be detected in time so that they do not participate in the ensemble.

Because of all of this, we propose to keep a selection of the Mg best local models to participate in the global ensemble. In this way, we can more easily know a priori the computational and storage resources we will need. When a new local model is received in the cloud, if the device that owns that model is already present in the global ensemble, we simply update the global ensemble with the new version of the local model. Otherwise, it must compete against the Mg models that are currently in the global ensemble. The server keeps a score representing the relevance of each of the Mg models and computes that score for each new incoming model. For the computation of the scores we rely on the Effective Voting (EV) technique [51]. EV proposes to perform 10-fold cross-validation to evaluate each model and then apply a paired t-test for each pair of models to evaluate the statistical significance of their relative performance. Finally, the most significant ones are selected.

In our context, a cross-validation is not a fair way to evaluate each local model due to the skewness and bias inherent in the distributed paradigm. Thus, when a new local model arrives, the server chooses p different local devices, randomly selected, and asks them to evaluate that classifier on their respective local training sets. Once this evaluation is done, each device sends its accuracy back to the cloud. Assuming that not all the p selected devices are necessarily available to handle the request, the server waits until it has received q performance measures for that model, with q ≤ p. This process can be considered a distributed cross-validation. After gathering the q measurements for the current Mg models and the new one, a paired t-test with a significance level of 0.05 is performed for each pair of models c_i, c_j, so that:

    t(c_i, c_j) = 1 if c_i is significantly better than c_j; −1 if c_j is significantly better than c_i; 0 otherwise.      (5)

Then, for each model we calculate its overall significance index:

    S(c_i) = Σ_{j=1}^{N} t(c_i, c_j).      (6)

Finally, we select the Mg models with the highest significance index or score, S. If there are ties, we break them by selecting the most accurate ones (we compute the mean accuracy from the q available evaluations). Figure 3 summarizes the whole process, which we call Distributed Effective Voting (DEV).
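Given the q accuracies reported by the voting devices for every candidate model, the server-side scoring of Equations (5) and (6) could look as follows. This is a sketch under our own assumptions: SciPy's two-sided paired t-test is used and the sign of the mean difference decides which model is "better"; all names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel

def dev_select(accuracies, model_ids, m_g=5, alpha=0.05):
    """accuracies: array of shape (q, M), the accuracy of each candidate model
    (columns) as reported by each of the q voting devices (rows).
    Returns the ids of the M_g models kept in the global ensemble."""
    acc = np.asarray(accuracies, dtype=float)
    q, m = acc.shape
    score = np.zeros(m)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            _, p_value = ttest_rel(acc[:, i], acc[:, j])   # paired t-test on the q votes
            if p_value < alpha:
                # Equation (5): +1 if model i is significantly better, -1 otherwise;
                # Equation (6) accumulates these values into the significance index S.
                score[i] += 1 if acc[:, i].mean() > acc[:, j].mean() else -1
    # Rank by significance index, breaking ties by mean accuracy.
    order = sorted(range(m), key=lambda k: (score[k], acc[:, k].mean()), reverse=True)
    return [model_ids[k] for k in order[:m_g]]
```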
In our experiments we just had 10 different devices, so we tried several values of Mg, from 3 to 10. However, the size of this ensemble usually ranges from 5 to 30 models and it is strongly dependent on the problem. The p and q sizes depend both on the size of the ensemble (Mg) and on the total number of devices available online (N), so that Mg ≤ q ≤ p ≤ N. For simplicity, in our experiments we always used p = q = Mg.

6. Experimental results

The aim of this section is to evaluate the performance of LFedCon2. In particular, we are interested in checking to what extent the global models obtained from the consensus of the different local devices are capable of solving distributed and semi-supervised classification tasks. In addition, we also want to evaluate the evolution of these models over time. In our experiments, we have used several base classifiers at the local level so as not to tie the results and the architecture to any particular method (although, as we will see below, some of them are clearly suboptimal according to the strategy we propose).

The task we have chosen to conduct our experiments is the detection of the walking activity on smartphones. It is relatively easy to detect the walking activity, and even count the steps, when a person walks ideally with the mobile in the palm of his/her hand, facing upwards and without moving it too much. However, the situation gets much worse in real life, when the orientation of the mobile with respect to the body, as well as its location (hand, bag, pocket, ear, etc.), may change constantly as the person moves [52, 53]. Figure 4 shows the complexity of this problem with a real example. In this figure we can see the norm of the acceleration experienced by a mobile phone while its owner is doing two different activities. The person and the device are the same in both cases. In Figure 4a, we can see the signal obtained when the person is walking with the mobile in the pocket. Figure 4b shows a very similar acceleration signal experienced by the mobile, but in this case when the user is standing still with the mobile in his hand, without walking, but gesticulating with the arms in a natural way.
Figure 3 (schematic): the Distributed Effective Voting process: (1) a device sends its new local model to the server; (2) the server asks p randomly selected devices to validate that model on their local training sets; (3) the server waits for q accuracy responses and updates the ranking.
Figure 4: Norm of the acceleration experienced by a mobile phone when its owner is walking (a), and not walking, but gesticulating with the mobile in his/her hand (b).

This difficulty is one of the main reasons why we chose this problem to test our proposal. Thus, we applied some of the most popular and widely known supervised classification methods on this dataset:

• a simple Naïve Bayes (NB),
• a Generalized Linear Model (GLM),
• a C5.0 Decision Tree (C5.0),
• a Support Vector Machine with a radial basis function kernel (RBF SVM),
• a Random Forest (RF),
• a Stochastic Gradient Boosting (SGB).

Table 1 summarizes the most relevant results we achieved.
Optimal hyperparameters for each of these traditional classifiers were estimated applying 10-fold cross-validation on the dataset. We also tried several Convolutional Neural Network (CNN) architectures [52]. Table 1 shows the results provided by the best one. Note that in this last case, as the size of the dataset is too small for deep learning, we had to carry out an oversampling process to generate more instances artificially. The training and testing of the different models were carried out using the framework provided by the R language. In particular, for the training of the traditional models (Naïve Bayes, GLM, C5.0, etc.) we used the implementations already provided by the caret package [54]. In the case of the CNNs, we employed the keras package [55].

Table 1: Performance of several supervised classifiers, trained and tested on a dataset of 77 different people [52].

                 Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes      0.8938              0.9580        0.8295
GLM (1)          0.8855              0.8826        0.8884
C5.0 Tree        0.9108              0.9015        0.9200
RBF SVM (2)      0.9368              0.9410        0.9326
RF (3)           0.9444              0.9375        0.9516
SGB (4)          0.9324              0.9406        0.9242
CNN (5)          0.9791              0.9755        0.9822

(1) GLM = Generalized Linear Model. (2) RBF SVM = Support Vector Machine with radial basis kernel. (3) RF = Random Forests. (4) SGB = Stochastic Gradient Boosting. (5) CNN = Convolutional Neural Network.
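The metrics reported in Table 1 (and in the later tables) are the standard sensitivity, specificity and balanced accuracy of a binary classifier. The sketch below states those definitions explicitly; assuming "walking" is treated as the positive class is our reading of the task, and the function name is ours.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and balanced accuracy for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    sensitivity = tp / (tp + fn)            # recall on the positive ("walking") class
    specificity = tn / (tn + fp)            # recall on the negative ("not walking") class
    balanced_accuracy = (sensitivity + specificity) / 2.0
    return balanced_accuracy, sensitivity, specificity
```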
However, despite the high performances shown in Table 1, we must admit that these numbers are unrealistic. Obtaining performances like those involves months of work, collecting and labeling data from different volunteers and re-tuning and re-obtaining increasingly robust models until significant improvements are no longer achieved. It is at that moment when the model is finally put into exploitation. In fact, we have tried all the models in Table 1 on a smartphone and we have identified several situations, quite common in everyday life, where all of them fail, such as walking with the mobile in a back pocket of the pants, carrying the mobile in a backpack, or shaking it slowly in the hand, simulating the forces the device experiences when the user is walking. No matter how complete we believe our dataset is, in many real problems there are going to be some situations that are poorly represented in the training data. The classical solution to this would be to collect new data from the situations where the model fails, try to obtain a better model that is robust to these situations and, once achieved, replace the old model with the new one.

In order to build an appropriate training dataset to evaluate the federated, continual and semi-supervised learning possibilities of our proposal, we developed an Android application that samples and logs the inertial data on mobile phones continuously, after being processed. These data are all unlabeled. Nevertheless, the app allows the user to indicate whether he/she is walking or not through a switch button in the graphical interface, but this is optional, so depending on the user's willingness to participate, there will be more or less labeled data. The app also labels some examples autonomously, applying a series of heuristic rules when it comes to clearly identifiable positives or negatives (e.g., when the phone is at rest). With this app, we collected partially labeled data from 10 different people. Participants installed our application and recorded data continuously while they were performing their usual routine. Table 2 summarizes the data distribution in the new semi-labeled dataset by user, and compares it with the already presented test set.

Table 2: Summary of the training and test sets, indicating the number of examples attending to the label.

                          Walking   Not walking   Unlabeled   Total
Training set   user 1     3130      2250          4620        10000
               user 2     2519      4359          3122        10000
               user 3     186       325           125         636
               user 4     2432      2455          5113        10000
               user 5     233       2785          6982        10000
               user 6     554       1821          69          2444
               user 7     2582      2052          769         5403
               user 8     232       678           229         1139
               user 9     1151      2669          6180        10000
               user 10    2329      2669          5002        10000
               Total      15348     22063         32211       69622
Test set       Total      6331      1586          0           7917

Applying the same techniques as in Table 1, but now using the new dataset for training and the old one for testing (Table 2), the results obtained are those of Table 3. If we compare both tables, Table 1 and Table 3, it is clear that the latter results are substantially worse (approximately 10% worse in all cases). This is logical, since training is now carried out using a totally new dataset, obtained in a more realistic and unconstrained way, which contains biased and partially labeled data from a much smaller number of users. Note that we have not obtained the new results using CNNs in Table 3. This is because, for the experiments in Table 1, a dedicated mobile device was used, so recording raw data for training a CNN was not a problem. However, the new training set was obtained in a federated context using the participants' own smartphones, so recording raw data was not an option.

Table 3: Performance of supervised classifiers, trained and tested using the datasets described in Table 2.

               Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes    0.7949              0.9044        0.6854
GLM            0.7221              0.8206        0.6236
C5.0 Tree      0.8173              0.8842        0.7503
RBF SVM        0.8378              0.9430        0.7327
RF             0.8546              0.9021        0.8071
SGB            0.8596              0.9084        0.8108

Using LFedCon2, the training data from Table 2 is distributed among the different local devices. Data acquisition is continuous over time. For simplicity, we assume all local devices record data with the same frequency and, therefore, each iteration of LFedCon2 corresponds to a new pattern. It is important to remember that in our proposal each local model is an ensemble that is updated over time and has a maximum size, Ml.
Similarly, a subset of local models of size Mg is selected on the server to form the global ensemble model. In these experiments we used ensembles of 5 models, both locally and globally, i.e., Ml = 5 and Mg = 5. Table 4 shows the performance of the global model after 10000 iterations using each of the methods in Table 3 as base local learners. Figure 5 compares the Receiver Operating Characteristic (ROC) curves of the traditional models from Table 3 with their federated equivalents at different iterations.
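To make the simulated deployment concrete, one iteration of a single device could be sketched as below: use the user-provided label if there is one, otherwise trust the global model only when it is confident, feed the local confidence to the change detector, and retrain on drift. Every name and the confidence threshold are our own illustrative assumptions, not the authors' implementation.

```python
def device_iteration(pattern, label, local_model, global_model,
                     labeled_buffer, change_detector, conf_threshold=0.9):
    """One simulated LFedCon2 iteration on a single device (illustrative only).

    pattern: feature vector recorded at this iteration.
    label:   ground-truth label if the user provided one, otherwise None.
    """
    if label is None:
        # Semi-supervised labeling: accept the global model's label only when confident.
        probs = global_model.predict_proba(pattern)
        if max(probs) >= conf_threshold:
            label = probs.index(max(probs))
    if label is not None:
        labeled_buffer.append((pattern, label))

    # Local drift detection on the confidence of the local model (Algorithm 1).
    confidence = local_model.confidence(pattern)
    if change_detector.update(pattern, confidence):
        local_model.adapt_to_drift(labeled_buffer)   # retrain and add to the local ensemble
        return True                                   # report the updated model to the cloud
    return False
```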
Table 4: Performance of LFedCon2 after 10000 iterations, when Mg = Ml = 5, using as base classifiers the methods in Table 3.

                   Balanced Accuracy   Sensitivity   Specificity
LFedCon2 + NB      0.8280              0.7909        0.8651
LFedCon2 + GLM     0.7405              0.8668        0.6141
LFedCon2 + C5.0    0.8168              0.9665        0.6671
LFedCon2 + SVM     0.8663              0.8499        0.8827
LFedCon2 + RF      0.8314              0.9489        0.7129
LFedCon2 + SGB     0.8231              0.9187        0.7264

As we can see, our federated proposal gives performances similar to those achieved by a single model trained with all the available data. However, the comparison between Tables 3 and 4 is not entirely fair, as they correspond to two very different approaches. Note that in the LFedCon2 setting of Table 4 we are only using half of the users to build the global ensemble model (Mg = 5), whereas the models in Table 3 use all the labeled data from all users. Furthermore, the use of RF or SGB as base classifiers does not seem to be the best option in the federated scenario, since these algorithms are already ensembles. For these cases there are probably much more efficient combination strategies that could be tested. For example, by creating an ensemble of Random Forests we are combining the final decisions of each forest, when it might be better to combine the decisions of all the trees of each forest, at a lower level. Proposals like this are out of the scope of this work, in which we have opted for a global learning method scalable to multiple settings.

In order to better illustrate the learning process of our method, we will take as an example the case where we use an RBF SVM as base classifier (fourth row in Table 4). Figure 6 and Table 5 show how the performance of LFedCon2 evolves over the iterations. The upper graph in Figure 6 shows the evolution of the accuracies of all the models (from each of the 10 users and also the global one). Table 5 provides the exact values of these accuracies. In each column of Table 5, those local models that are part of the global model at that iteration are marked with an asterisk (*). We can appreciate that the global ensemble is not always selecting the most accurate local models. This actually makes sense, since local models are chosen using the Distributed Effective Voting (DEV) from Figure 3, which is based on the voting carried out by a subset of q users randomly chosen (5 in this case, because q = Mg = 5). The fact that some suboptimal local models are being selected indicates that there are devices that value these models highly, which means that the selection criteria of the voters can be improved. Remember that LFedCon2 needs a mechanism for selecting local models to build the global one, but this could be other than DEV. However, designing a good selection method is not straightforward, as some challenges must be addressed. Firstly, local models must be evaluated without centralized data being available anywhere and without even being able to retain all local data. In addition, a balance is required between precision in the selection and efficiency in the overall system. Improving the voting system may involve more communication between server and devices, which limits the scalability of the proposal. Nevertheless, in the results we have obtained we can notice how the global ensemble is able to provide good performance almost from the beginning. This leads us to think that perhaps it is not necessarily bad that some suboptimal classifiers are chosen, and even that the diversity of the global ensemble is more important than the accuracy of each of its members.

Table 5: Detail of the balanced accuracies achieved in the LFedCon2 setting of Figure 6.

             Number of iterations
Model        2500      5000      7500      10000
user 1       0.4069*   0.4069    0.4813    0.4813
user 2       0.7468*   0.7158*   0.7158    0.7157
user 3       -         -         0.6730*   0.6730*
user 4       0.4999    0.7549*   0.7549*   0.7549*
user 5       0.8093*   0.8093*   0.8093*   0.8093*
user 6       -         -         0.6080*   0.6080*
user 7       0.5555*   0.7016    0.7016    0.7016
user 8       0.6789    0.8644    0.8644    0.8644
user 9       0.5882*   0.5882*   0.8304    0.8304
user 10      -         0.7266*   0.7306*   0.7306*
global       0.8603    0.8552    0.8663    0.8663

* Cells marked with an asterisk indicate that the current model of that user was chosen to participate in the global ensemble.

In Figure 6, the thick black line corresponds to the global model accuracy, while the rest of the coloured lines are each of the 10 anonymous users. As in each iteration the unlabeled local data is labeled with the most recent global model, the more unlabeled data the device has, the more it will be enriched by the global knowledge. The bottom graph shows the drift detection and adaptation. Once again, each coloured line corresponds to one user. In those places where the line is not drawn, it means that the device is not taking data. A circle (◦) on the line indicates when a drift has been detected and the local model has been updated. If the circle is filled (•), it indicates that the local model was chosen as one of the 5 models of the global ensemble (the other 4 chosen are marked with a cross (×)).

As we can see, LFedCon2 allows the society of devices to have a classifier practically from the beginning and to evolve it over time. In order to illustrate some other advantages of our approach, we have artificially modified the training dataset in different ways. Suppose now that there are some users who are mislabeling the data, whether intentionally or not. To simulate this, we have inverted all the labels of four of the users. This is the maximum number of local datasets we can poison so that the system continues to properly choose the best participants for the global ensemble.
Figure 5: ROC curves provided by a single classifier (Table 3) compared with the global model from the federated training (Table 4) at iterations 2500, 5000, 7500 and 10000. (a) Using NB as base classifier. (b) Using GLM as base classifier. (c) Using C5.0 as base classifier. (d) Using RBF SVM as base classifier. (e) Using RF as base classifier. (f) Using SGB as base classifier.
In this example we have chosen three very active users (users 1, 2 and 9), and one more who is not very involved (user 9). The results using a single classifier trained on all the available data are those in Table 6. The results using our method with different base classifiers are those shown in Table 7. Figure 7 compares the ROC curves of the traditional models with the federated ones in this scenario. As can be appreciated, in this case LFedCon2 far outperforms the results provided by the classical approach. This is because the system is able to identify those noisy users and remove them from the global ensemble. In addition, the devices of those users will use the global model to label the unlabeled data correctly, thus overcoming the data that was manually mislabeled. Figure 8 shows the details of our federated and continual learning process in this scenario, in the particular case where the RBF SVM is used as base classifier.

Table 6: Performance of traditional classifiers when there are users who provide noisy data.

               Balanced Accuracy   Sensitivity   Specificity
Naïve Bayes    0.4837              0.2008        0.7667
GLM            0.6283              0.5321        0.7245
C5.0           0.4293              0.4645        0.3941
RBF SVM        0.5943              0.6986        0.4899
RF             0.4679              0.5102        0.4256
SGB            0.4499              0.3916        0.5082
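The poisoning experiment described above can be simulated by simply flipping every provided binary label for the selected users, for example as in the following sketch. The data structure and function name are hypothetical; they are not the authors' code.

```python
def invert_labels(dataset, poisoned_users):
    """Simulate mislabeling users by flipping every provided binary label.

    dataset: dict mapping user id -> list of (pattern, label) pairs,
             where label is 0/1 or None for unlabeled patterns.
    """
    for user in poisoned_users:
        dataset[user] = [(x, None if y is None else 1 - y)
                         for (x, y) in dataset[user]]
    return dataset

# Example: poison four of the ten users, as in the experiment described above.
# poisoned = invert_labels(training_data, poisoned_users=four_user_ids)
```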
Finally, we have simulated a big drift in the data. The biggest drift the system could experience would be a total inversion of the meaning of the two classes. Therefore, from instant t = 2500, all devices begin to identify patterns as "walking" where before they would be "not walking", and vice versa. Figure 9 shows the process, again using the RBF SVM as base classifier. As we can see, it takes a while for the system to converge, because the change is very drastic, but it ends up achieving it. Obviously, if we reduce the size of the local ensembles, Ml, the system would adapt faster.

Figure 7: ROC curves provided by a single classifier (Table 6) compared with the global model from the federated training (Table 7) at iterations 2500, 5000, 7500 and 10000 when there are users who provide noisy data. (a) Using NB as base classifier. (b) Using GLM as base classifier. (c) Using C5.0 as base classifier. (d) Using RBF SVM as base classifier. (e) Using RF as base classifier. (f) Using SGB as base classifier.

Figure 8: Results for LFedCon2 (using RBF SVM as base classifier and Mg = Ml = 5) when 4 of the users are mislabeling the data.

Figure 9: Results for LFedCon2 (using RBF SVM as base classifier and Mg = Ml = 5) when a drastic drift occurs at t = 2500.
7. Conclusions

In this paper we have presented LFedCon2, a novel method for light, semi-supervised, asynchronous, federated and continual classification. Our method makes it possible to tackle real tasks where several distributed devices (smartphones, robots, etc.) continuously acquire biased and partially labeled data, which evolves over time in unforeseen ways. It basically consists of training local models on each of the devices and then sharing those models with the cloud, where a global model is created. This global model is sent back to the devices and helps them to label unlabeled data to improve the performance of their local models. We have applied our proposal to a real classification task, walking recognition, and we have shown that it obtains very high performance without the need for large amounts of labeled data collected at the beginning.

We believe that we have opened a very promising line of research in the context of federated learning, and that there is still a lot of work to be done. In this work we have presented a new method with clearly separated components: local learning, semi-supervised labeling, local drift detection and adaptation, and selection of local candidates for the subsequent global consensus. [...] properties and then obtain several global models, one for each group. Another promising approach could be to have some system [...]

Acknowledgements

[...]

References

[1] S. Li, L. Da Xu, S. Zhao, 5G internet of things: A survey, Journal of Industrial Information Integration 10 (2018) 1–9.
[2] J. Qiu, Q. Wu, G. Ding, Y. Xu, S. Feng, A survey of machine learning for big data processing, EURASIP Journal on Advances in Signal Processing 2016 (1) (2016) 67.
[3] G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, S. Sinha, Real-time video analytics: The killer app for edge computing, Computer 50 (10) (2017) 58–67.
[4] U. Montanaro, S. Dixit, S. Fallah, M. Dianati, A. Stevens, D. Oxtoby, A. Mouzakitis, Towards connected autonomous driving: review of use-cases, Vehicle System Dynamics 57 (6) (2019) 779–814.
[5] B. Custers, A. M. Sears, F. Dechesne, I. Georgieva, T. Tani, S. van der Hof, EU Personal Data Protection in Policy and Practice, Springer, 2019.
[6] B. M. Gaff, H. E. Sussman, J. Geetter, Privacy and big data, Computer 47 (6) (2014) 7–9.
[7] K. Dolui, S. K. Datta, Comparison of edge computing implementations: Fog computing, cloudlet and mobile edge computing, in: 2017 Global Internet of Things Summit (GIoTS), IEEE, 2017, pp. 1–6.
[8] T. H. Luan, L. Gao, Z. Li, Y. Xiang, G. Wei, L. Sun, Fog computing: Focusing on mobile users at the edge, arXiv preprint arXiv:1502.01815.
[9] Q. Li, Z. Wen, B. He, Federated learning systems: Vision, hype and reality for data privacy and protection, arXiv preprint arXiv:1907.09693.
[10] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, Federated learning of deep networks using model averaging, arXiv preprint arXiv:1602.05629v1.
[11] P. Nakkiran, R. Alvarez, R. Prabhavalkar, C. Parada, Compressing deep neural networks using a rank-constrained topology, in: Proceedings of the 16th Annual Conference of the International Speech Communication Association, ISCA, 2015, pp. 1473–1477.
[12] A. Vasilyev, CNN optimizations for embedded systems and FFT, Stanford University Report.
[13] L. Zhang, Transfer adaptation learning: A decade survey, arXiv preprint arXiv:1903.04687.
[14] N. D. Lane, P. Georgiev, Can deep learning revolutionize mobile sensing?, in: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, ACM, 2015, pp. 117–122.
[15] D. Peteiro-Barral, B. Guijarro-Berdiñas, A survey of methods for distributed machine learning, Progress in Artificial Intelligence 2 (1) (2013) 1–11.
[16] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, J. S. Rellermeyer, A survey on distributed machine learning, arXiv preprint arXiv:1912.09789.
[17] R. Gu, S. Yang, F. Wu, Distributed machine learning on mobile devices: A survey, arXiv preprint arXiv:1909.08329.
[18] H. Robbins, S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics (1951) 400–407.
[19] B. Recht, C. Re, S. Wright, F. Niu, Hogwild: A lock-free approach to parallelizing stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[20] F. Andersson, M. Carlsson, J.-Y. Tourneret, H. Wendt, A new frequency estimation method for equally and unequally spaced data, IEEE Transactions on Signal Processing 62 (21) (2014) 5761–5774.
[21] F. Lin, M. Fardad, M. R. Jovanović, Design of optimal sparse feedback gains via the alternating direction method of multipliers, IEEE Transactions on Automatic Control 58 (9) (2013) 2426–2431.
[22] O. Shamir, N. Srebro, T. Zhang, Communication-efficient distributed optimization using an approximate Newton-type method, in: International Conference on Machine Learning, 2014, pp. 1000–1008.
[23] Y. Zhang, X. Lin, DiSCO: Distributed optimization for self-concordant empirical loss, in: International Conference on Machine Learning, 2015, pp. 362–370.
[24] O. Sagi, L. Rokach, Ensemble learning: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4) (2018) e1249.
[25] G. Tsoumakas, I. Vlahavas, Effective stacking of distributed classifiers, in: ECAI, Vol. 2002, 2002, pp. 340–344.
[26] L. O. Hall, N. Chawla, K. W. Bowyer, W. P. Kegelmeyer, Learning rules from distributed data, in: Large-Scale Parallel Data Mining, Springer, 2000, pp. 211–220.
[27] Y. Guo, J. Sutiwaraphun, Probing knowledge in distributed data mining, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 1999, pp. 443–452.
[28] L. O. Hall, N. Chawla, K. W. Bowyer, Decision tree learning on very large data sets, in: SMC'98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 98CH36218), Vol. 3, IEEE, 1998, pp. 2579–2584.
[29] A. Lazarevic, Z. Obradovic, Boosting algorithms for parallel and distributed learning, Distributed and Parallel Databases 11 (2) (2002) 203–229.
[30] J. Konečný, B. McMahan, D. Ramage, Federated optimization: Distributed optimization beyond the datacenter, arXiv preprint arXiv:1511.03575.
[31] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, C. Miao, Federated learning in mobile edge networks: A comprehensive survey, arXiv preprint arXiv:1909.11875.
[32] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[34] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, V. Chandra, Federated learning with non-IID data, arXiv preprint arXiv:1806.00582.
[35] S. Caldas, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, A. Talwalkar, LEAF: A benchmark for federated settings, arXiv preprint arXiv:1812.01097.
[36] G. Widmer, M. Kubat, Learning in the presence of concept drift and hidden contexts, Machine Learning 23 (1) (1996) 69–101.
[37] J. Yoon, W. Jeong, G. Lee, E. Yang, S. J. Hwang, Federated continual learning with adaptive parameter communication, arXiv preprint arXiv:2003.03196.
[38] J. Czyz, J. Kittler, L. Vandendorpe, Multiple classifier combination for face-based identity verification, Pattern Recognition 37 (7) (2004) 1459–1469.
[39] J. Kittler, M. Hatef, R. P. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 226–239.
[40] K. Tumer, J. Ghosh, Error correlation and error reduction in ensemble classifiers, Connection Science 8 (3-4) (1996) 385–404.
[41] A. K. Jain, B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, Handbook of Statistics 2 (1982) 835–855.
[42] L. Kanal, B. Chandrasekaran, On dimensionality and sample size in statistical pattern classification, Pattern Recognition 3 (3) (1971) 225–234.
[43] S. J. Raudys, A. K. Jain, Small sample size effects in statistical pattern recognition: Recommendations for practitioners, IEEE Transactions on Pattern Analysis & Machine Intelligence 13 (3) (1991) 252–264.
[44] R. P. Duin, D. M. Tax, Classifier conditional posterior probabilities, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 1998, pp. 611–619.
[45] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, N. Díaz-Rodríguez, Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges, Information Fusion 58 (2020) 52–68.
[46] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under concept drift: A review, IEEE Transactions on Knowledge and Data Engineering.
[47] A. Haque, L. Khan, M. Baron, SAND: Semi-supervised adaptive novel class detection and classification over data stream, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 1652–1658.
[48] M. Baron, Convergence rates of change-point estimators and tail probabilities of the first-passage-time process, Canadian Journal of Statistics 27 (1) (1999) 183–197.
[49] D. H. Wolpert, Stacked generalization, Neural Networks 5 (2) (1992) 241–259.
[50] P. K. Chan, S. J. Stolfo, et al., Toward parallel and distributed learning by meta-learning, in: AAAI Workshop in Knowledge Discovery in Databases, 1993, pp. 227–240.
[51] G. Tsoumakas, L. Angelis, I. Vlahavas, Clustering classifiers for knowledge discovery from physically distributed databases, Data & Knowledge Engineering 49 (3) (2004) 223–242.
[52] F. E. Casado, G. Rodríguez, R. Iglesias, C. V. Regueiro, S. Barro, A. Canedo-Rodríguez, Walking recognition in mobile devices, Sensors 20 (4).
[53] G. Rodríguez, F. E. Casado, R. Iglesias, C. V. Regueiro, A. Nieto, Robust step counting for inertial navigation with mobile phones, Sensors 18 (9).
[54] M. Kuhn, J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, B. Kenkel, R. C. Team, et al., Package 'caret', The R Journal.
[55] J. Allaire, F. Chollet, et al., keras: R interface to 'Keras', R package version 2 (3).
[56] I. Triguero, S. García, F. Herrera, Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study, Knowledge and Information Systems 42 (2) (2015) 245–284.