
Multimodal Federated Learning on IoT Data

Yuchen Zhao, Payam Barnaghi, Hamed Haddadi
UK Dementia Research Institute, Imperial College London
[email protected] [email protected] [email protected]

arXiv:2109.04833v2 [cs.LG] 18 Feb 2022

Abstract—Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.

Index Terms—collaborative work, semisupervised learning, edge computing, multimodal sensors

I. INTRODUCTION

In recent years, we have witnessed a rapid growth in personal data generated from many different aspects in people's daily lives, such as mobile devices and IoT devices. Powered by the enormous amount of personal data, machine-learning (ML) techniques, especially Deep Neural Networks (DNN), have shown great capabilities of conducting complex tasks such as image recognition, natural language processing, human activity recognition, and so forth. Traditionally, ML systems are centralized and need to collect and store personal data on a server to train DNN models, which causes privacy issues.

The long-debated privacy issues in centralized ML systems have motivated researchers to design and implement machine learning in decentralized fashions. Federated learning (FL) [1], which allows different parties to jointly train DNN models without releasing their local data, is a system paradigm that has gained much popularity in both research communities and real-world ML applications.

In FL systems, DNN models are trained on clients at the edge of networks instead of on servers in the cloud. This makes FL systems specifically suitable for privacy-sensitive applications such as smart homes [2]–[4] based on IoT technologies. For example, Wu et al. [5] propose an FL framework that uses personalization to address the device, statistical and model heterogeneity issues in IoT environments. Pang et al. [6] propose an FL framework using reinforcement learning to adjust the model aggregation strategy on models trained with IoT data. As a distributed system paradigm, FL provides a feasible and scalable solution for realizing ML on resource-constrained IoT devices [7].

IoT applications often deploy different types of sensors or devices that generate data from different modalities (e.g., sensory, visual, and audio) [8]. For example, in one smart home, activities of a person can be recorded by body sensors in a smartwatch worn by the person, and also by a video camera in the room at the same time. Meanwhile, for smart homes with different device setups, some of them may have multimodal local data (i.e., multimodal clients) while the others may have unimodal local data (i.e., unimodal clients). One way to apply FL to these IoT applications is to implement individual services for different modalities. However, many centralized ML systems [9]–[13] have shown that combining data from different modalities can improve their performance. Therefore, it is necessary to design and implement FL systems in a way that supports multimodal IoT data and different device setups.

To work on multimodal data, one approach in existing FL systems uses data fusion [14] to mix representations from different modalities before a final decision layer into a new representation space. This requires all the data (i.e., training and testing) in the system to be aligned multimodal data, which means that all the clients need to have data from all modalities in the system. In addition, the labelled data in the system also need to be from all modalities, in order to support supervised learning on the new representation space. This does not work on systems with unimodal clients and increases the complexity of data annotation. Another approach [15] extracts representations from different modalities locally and requires the clients to send the representations to the server in order to align different modalities. This may break the privacy guarantee provided by FL since the representations can be used to recover local data, especially when the server has taken part in the training of the model that extracts the representations. Allowing FL to work on clients with arbitrary data modalities (i.e., unimodal or multimodal) and with labelled data that come from single modalities, however, still remains a challenge.
In this paper, we propose a multimodal FL framework that takes advantage of aligned multimodal data on clients. Although acquiring alignment information for multimodal data across different clients is challenging, our assumption is that data from different modalities (e.g., sensory data and visual data) on a multimodal client inherently have some alignment information (e.g., through synchronized local timestamps of sensory data samples and video frames on that client), based on which we can train models to extract multimodal representations from the data. We utilize multimodal autoencoders [9], [10] to encode the data into shared or correlated hidden representations. To enable the server in our framework to aggregate trained local autoencoders into a global autoencoder, we propose a multimodal version of the FedAvg algorithm [1] that can combine local models trained on data from both unimodal and multimodal clients.

As it is difficult to have adequate labels on clients in real-world FL systems [14], [16], we focus on semi-supervised scenarios wherein local data on the clients are unlabelled and the server has an auxiliary labelled dataset. We use the global autoencoder and the auxiliary labelled dataset on the server to train a classifier for activity recognition tasks [17], [18] and evaluate its performance on a variety of multimodal datasets (e.g., sensory and visual). Compared with existing FL systems [14], [15], our proposed framework does not share representations of local data with the server. Additionally, instead of requiring the clients and the server to have aligned data from all modalities, our framework conducts local training on both multimodal and unimodal clients, and only needs unimodal labelled data on the server. Our experimental results indicate that our proposed framework can improve the classification performance (F1 score) of FL systems in comparison to unimodal FL, and allows us to use unimodal labelled data to train models that can be applied to multimodal testing data.

We make the following contributions in this paper:
• We propose a multimodal FL framework that works on data from different modalities and clients with different device setups, and a multimodal FedAvg algorithm.
• Complementing the existing knowledge on the benefit of using multimodal data in centralized ML, we find that introducing data from more modalities into FL also leads to better classification performance.
• We show that classifiers trained on labelled data on the server from one modality can achieve decent classification F1 scores on testing data from other modalities.
• We show that combining contributions from both unimodal and multimodal clients can further improve the classification F1 scores.

II. RELATED WORK

A. Federated learning

McMahan et al. [1] propose federated learning (FL) as an alternative system paradigm to centralized ML. In an FL system, a server acts as a coordinator to select clients and to send a global DNN model to the clients. The clients use their own data to locally train the model and then send the resulting models back to the server, on which these models are aggregated into a new global model. The system repeats this process for a number of rounds until the performance of the global model on a given task converges. The privacy of the clients' data is protected since the data are never shared with others. Given its decentralized feature, FL is especially suitable for edge computing [19], [20], which moves computation to the place where data are generated.

Canonical FL systems focus on supervised learning that requires all local data on FL clients to be labelled. In edge computing, data generated from IoT devices can only be accessed by the data subjects, since FL clients do not share data to third parties. These data subjects (i.e., end users of an FL system) may not have time or abilities to annotate their data with labels of a given task, especially when the task requires expert knowledge (e.g., labelling time-series sensory data with clinical knowledge). Therefore, one key challenge of deploying FL in real-world IoT environments is the lack of labelled data on clients for local training. In order to address this issue, recent research in FL has been focusing on unsupervised and semi-supervised FL frameworks through data augmentation [16], [21]–[29] to generate pseudo labels for local data, or through unsupervised learning to extract hidden representations from unlabelled local data [17], [18]. For example, van Berlo et al. [17] propose to learn hidden representations through convolutional autoencoders from unlabelled local data on FL clients. Their results show that the learned representations can empower downstream tasks such as classifications. Zhao et al. [18] propose a semi-supervised FL framework for human activity recognition and compared the performance of different autoencoders. Their framework shows better performance than data augmentation schemes do. Our work in this paper follows the path of the latter category. Compared with the existing research, we enable semi-supervised FL to learn from multiple data modalities.
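To make the aggregation step described at the start of this subsection concrete, here is a minimal sketch of FedAvg-style weighted parameter averaging over PyTorch state dicts. The function name, the state-dict representation, and the assumption that clients report their local sample counts are illustrative choices, not details taken from [1] or from this paper's implementation.

```python
import copy

def fedavg(client_states, client_sizes):
    """Weighted average of client model parameters (FedAvg sketch).

    client_states: list of PyTorch state_dicts with identical keys/shapes
                   (floating-point parameters assumed).
    client_sizes:  number of local training samples per client, used as weights.
    """
    total = float(sum(client_sizes))
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            (n / total) * state[key]
            for state, n in zip(client_states, client_sizes)
        )
    return avg_state

# Usage (hypothetical): global_model.load_state_dict(fedavg(states, sizes))
```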
B. Heterogeneity in federated learning

Heterogeneity is one of the most challenging issues [30], [31] in FL because models are locally trained on clients. Different clients may vary in terms of computational capabilities, model structures, distribution of data, or distribution of features. Among all these issues, the heterogeneity in distribution of data (i.e., non-IID local data) has attracted most research efforts [32]–[35]. Smith et al. [32] apply multi-task learning to addressing the issue of training on non-IID data in FL. Instead of training one global model for all clients, they treat each client as a different task and train separate models for them. Similarly, Li et al. [34] extend federated multi-task learning to an online fashion and allow new clients to join the system. To address the heterogeneity in the distribution of features when shifting FL from one domain to another, Chen et al. [35] propose to use transfer learning to align the features in lower-stream layers (e.g., fully connected layers before final output layers). In order to learn from heterogeneous models (i.e., DNN models with different structures), Lin et al. [36] propose to use knowledge distillation [37] to train global models of FL based on the output probability distribution from local models, instead of directly averaging their parameters. Existing research, however, neglected the heterogeneity in data modalities in FL, which is commonplace in many scenarios such as edge computing, IoT environments, and mobile computing.

The recent study by Liu et al. [15] applies FL on data from two modalities (i.e., images and texts) and treats each modality individually, which is the same as running two individual FL instances. In the study, to align the two modalities on a server, representations of local data need to be uploaded to the server. This breaks the privacy guarantee of FL because the server has the global model that generates the representations from raw data and could recover the raw data if it has those representations. The framework proposed by Liang et al. [14] can work on multimodal data only when the clients' local data, the server's labelled data, and testing data are all aligned data from both modalities. Instead of aligning the representations from different modalities, it conducts early fusion (i.e., element-wise multiplication) on the representations. Thus unimodal data cannot contribute to the local training and the trained model cannot be used on unimodal data. Compared to the existing work, we use the alignment information in local data to learn to extract shared or correlated hidden representations from multiple modalities. Our scheme does not require sending representations of local data to the server, which would contradict the motivation of using FL. In addition, it allows models to be trained and used on unimodal data.

C. Multimodal deep learning

When training deep learning models for a certain task, the used data can be generated from a variety of modalities (e.g., recognizing human activities from IoT sensory data or videos). In order to utilize these data, multimodal deep learning has attracted much attention from researchers. Ngiam et al. [9] propose to use deep autoencoders [38] to learn multimodal representations from audio and visual data. The alignment between the two modalities is done by reconstructing the output for both modalities from the hidden representation generated by either modality. Wang et al. [10] compare different multimodal representation learning techniques and propose to combine both deep canonical correlation analysis [39] and autoencoders to map data from different modalities into highly correlated representations instead of one common representation. These techniques have demonstrated that data from different modalities can complement each other when learning representations and improve the overall performance of an ML system. Many applications such as audio-visual speech recognition [9], activity and context recognition [12], [13], and textual description generation for images [40], have been implemented based on multimodal deep learning. The recent survey by Baltrušaitis et al. [41] provides a detailed analysis and taxonomy of multimodal deep learning. In this paper, we apply multimodal representation learning to FL to address the heterogeneity issue in local data modalities.

III. METHODOLOGY

Our goal is to enable FL to work on clients that have different local data modalities. We first introduce the overall design of our framework. We then describe the key techniques that we use to extract representations from multimodal data and the algorithms that we designed to aggregate local models trained on both unimodal and multimodal clients.

A. Framework overview

A canonical FL system, as shown in Fig. 1a, only works on clients that have local data from the same modality and requires the data to be labelled for supervised learning.

We propose an FL framework wherein clients' unlabelled local data can be from either one single modality or multiple modalities. In our framework, as shown in Fig. 1b, unimodal clients (e.g., Clients 1 and 3) only deploy one type of devices due to reasons such as budget or privacy. Multimodal clients (e.g., Client 2) deploy both types of devices and thus have multimodal local data. On a multimodal client, we assume that there is alignment information between the data from two modalities, based on which we can align the hidden representations of two modalities. For example, a person's activity can be captured by the accelerometers in a smartwatch and by an IP camera in the room at the same time. A record of a video call contains both the visual information and audio information of a speech. This kind of matching information is the key to align the hidden representations of multimodal data since they describe the same underlying activities or events.

To address the lack of labelled data in FL systems using IoT devices, similar to existing semi-supervised FL frameworks [17], [18], we assume that no labelled local data are available on clients. Thus we learn to extract hidden representations from unlabelled data. On multimodal clients, we train local models to extract shared or correlated representations between different modalities since we have aligned pairs of multimodal data. On unimodal clients, we train models to extract representations from one single modality. Local models from both types of clients are sent to the server and are aggregated into a global model by using a multimodal version of the FedAvg algorithm [1]. The server uses the global model to encode a labelled dataset from either modality into a labelled representation dataset, based on which a classifier is trained through supervised learning. We believe that, as the service provider, the server can provide such an auxiliary dataset with labels that requires expert knowledge about the task of the service. For example, in many existing human activity datasets, labelling activities with sensory data can be done through controlled laboratory trials with the assistance from video cameras and pre-defined trial scripts [42]. The clients receive both the global model and the classifier from the server during each communication round and can use them on their local data for classifications. Alg. 1 describes the process of multimodal federated learning.

Fig. 1: In canonical federated learning (a), during round t, a server sends a global model w_t^g to selected clients that have data from the same modality. Client k conducts supervised learning to generate a local model w_{t+1}^k. Local models are aggregated on the server by using the FedAvg algorithm. In multimodal federated learning (b), a server sends a global model w_t^{a_g} to selected clients to learn to extract multimodal representations (Sec. III-B) on unlabelled local data. The server uses multimodal FedAvg (Sec. III-C) to aggregate local models into a new global model w_{t+1}^{a_g} and uses it to encode a labelled dataset (modality A or B) to a labelled representation dataset (h, Y). A classifier w_{t+1}^s is then trained on (h, Y), which can be used by all clients.

Algorithm 1 Multimodal Federated Learning
Require: K: number of clients; C: fraction of clients to choose; D = (X, Y): labelled dataset from either modality (A or B)
1: initialize w_0^{a_g}, w_0^s at t = 0
2: for all communication round t do
3:   S_t ← randomly selected K · C clients
4:   W_t ← ∅
5:   for all client k ∈ S_t do
6:     w_{t+1}^{a_k} ← Multimodal Local Training(k, w_t^{a_g})   ▷ on client k
7:     W_t ← W_t ∪ w_{t+1}^{a_k}
8:   end for
9:   w_{t+1}^{a_g} ← Multimodal FedAvg(W_t)   ▷ on the server
10:  h ← w_{t+1}^{a_g}.encoder(X)   ▷ using the encoder for the modality of X
11:  D'_t ← (h, Y)
12:  w_{t+1}^s ← Cloud Training(D'_t, w_t^s)   ▷ on the server
13: end for
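The following is a minimal Python sketch of the server-side loop in Alg. 1. The helper callables (local_train, mm_fedavg, cloud_train, encode) and the client and autoencoder interfaces are assumptions made for illustration; they stand in for the multimodal local training of Sec. III-B and the Mm-FedAvg aggregation of Sec. III-C.

```python
import random

def multimodal_federated_learning(clients, labelled_data, global_autoencoder,
                                  classifier, num_rounds, frac_c, alpha,
                                  mm_fedavg, cloud_train):
    """Server-side loop of Alg. 1 (sketch).

    labelled_data:      (X, Y) from a single modality (A or B), held on the server.
    global_autoencoder: multimodal autoencoder (f_A, g_A, f_B, g_B).
    frac_c:             fraction C of the K clients selected per round.
    alpha:              multimodal weight parameter used by Mm-FedAvg (Alg. 2).
    mm_fedavg, cloud_train: callables implementing Alg. 2 and the cloud training.
    """
    X, Y = labelled_data
    for t in range(num_rounds):
        # Line 3: randomly select K * C clients.
        selected = random.sample(clients, max(1, int(len(clients) * frac_c)))

        # Lines 5-8: unsupervised local training on each selected client.
        local_models = [client.local_train(global_autoencoder) for client in selected]

        # Line 9: aggregate local autoencoders into a new global autoencoder.
        global_autoencoder = mm_fedavg(local_models, alpha)

        # Lines 10-12: encode the labelled data with the encoder of X's modality,
        # then train the classifier on the labelled representations on the server.
        h = global_autoencoder.encode(X)
        classifier = cloud_train((h, Y), classifier)
    return global_autoencoder, classifier
```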

B. Learning to extract representations

The key part of the local training in our proposed framework is how to learn representations from unlabelled unimodal data or multimodal data. We first introduce canonical autoencoders, which we train to extract hidden representations from unimodal data. Then we introduce two types of multimodal autoencoders, which learn to extract shared and correlated hidden representations from different modalities.

Fig. 2: A simple autoencoder structure. An encoder f maps input data X into a hidden representation h. A decoder g maps h into a reconstruction X'.

1) Autoencoders: Autoencoders [38] are one of the most commonly used DNNs in unsupervised ML. A typical autoencoder, as shown in Fig. 2, has two building blocks, which are an encoder (f) and a decoder (g). The encoder maps unlabelled data (X) into a hidden representation (h). The decoder tries to generate a reconstruction (X') of the input data from the representation.
When training an autoencoder, the objective is to minimize the difference between X and X', which is measured by a loss function L(X, X'), such as the mean squared error (MSE). The assumption is that if the reconstruction error is small, then the hidden representation contains the most useful information in the original input. Therefore, minimizing the error will make the encoder learn to extract such useful information.

Fig. 3: In split autoencoders (a), for aligned input (X_A, X_B) from two modalities, data from one modality are input into its encoder to generate an h, which is then used to reconstruct the data for both modalities through two decoders. Each single modality has a loss function (i.e., L_A and L_B) and the overall objective of training is to minimize L_A + L_B. In a canonically correlated autoencoder (b), data from both modalities are input into their encoders to generate two representations. Two parameter matrices are used to maximize the canonical correlation between the paired representations h_A and h_B. The overall objective of the training is to minimize λ(L_A + L_B) + L_C, where λ is a trade-off parameter and L_C is the negative value of the canonical correlation.

2) Split autoencoders: Canonical autoencoders only work on data from the same modality. In order to extract shared representations from multimodal data, Ngiam et al. [9] propose a split autoencoder (SplitAE) that takes input data from one modality and encodes the data into a shared h for two modalities. With the shared h, two decoders are used to generate the reconstructions for two modalities. Fig. 3a shows the structures of SplitAEs for two data modalities. The premise is that the data from two modalities have to be matching pairs, which means that they present the same underlying activities or events. Since the encoders for both modalities aim to extract hidden representations, we want the representations to be not only specific to an individual modality. Instead, we hope that the extracted representations from both encoders can reflect the general nature of the activities or events in question.

For modalities A and B, given a pair of matching samples (X_A, X_B) (e.g., accelerometer data and video data of the same activity), the SplitAE (f_A, g_A, g_B) for input modality A is:

$\arg\min_{f_A, g_A, g_B} L_A(X_A, X'_A) + L_B(X_B, X'_B)$   (1)

X'_A and X'_B are the reconstructions for two modalities. L_A and L_B are the loss functions for two modalities, respectively. By minimizing the compound loss in Eq. 1, the learned encoder f_A will extract representations that are useful for both modalities. Similarly, for input modality B, its SplitAE is (f_B, g_A, g_B).

3) Deep canonically correlated autoencoders: In order to combine deep canonical correlation analysis [39] and autoencoders together, Wang et al. [10] propose a deep canonically correlated autoencoder (DCCAE). Instead of mapping multimodal data into shared representations, DCCAE keeps an individual autoencoder for each modality and tries to maximize the canonical correlation between the hidden representations from two modalities. Fig. 3b shows the structure of a DCCAE for two modalities.

For modalities A and B, given aligned input (X_A, X_B), the DCCAE (f_A, g_A, f_B, g_B) is:

$\arg\min_{f_A, g_A, f_B, g_B, U, V} \lambda(L_A + L_B) + L_C$   (2)

$L_C = -\mathrm{tr}(U^{\top} f_A(X_A) f_B(X_B)^{\top} V)$   (3)

Parameter matrices U and V are canonical correlation analysis directions. Similarly to SplitAE, one of the objectives of DCCAE is to minimize the reconstruction losses. In addition, it uses another objective to increase the canonical correlation between the generated representations from two modalities (i.e., minimizing its negative value L_C). The two objectives are balanced by a parameter λ. By this means, DCCAE maps multimodal data into correlated representations rather than shared representations.
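As a concrete illustration of the SplitAE objective in Eq. 1, the sketch below implements a split autoencoder for input modality A with one encoder and two decoders, trained with an MSE-based compound reconstruction loss on aligned pairs. The layer sizes and module structure are illustrative and are not the LSTM-based configuration used in the experiments (Sec. IV-B).

```python
import torch
import torch.nn as nn

class SplitAE(nn.Module):
    """Split autoencoder for input modality A: (f_A, g_A, g_B), as in Eq. 1."""

    def __init__(self, dim_a, dim_b, dim_h):
        super().__init__()
        self.encoder_a = nn.Sequential(nn.Linear(dim_a, dim_h), nn.ReLU())  # f_A
        self.decoder_a = nn.Linear(dim_h, dim_a)                            # g_A
        self.decoder_b = nn.Linear(dim_h, dim_b)                            # g_B

    def forward(self, x_a):
        h = self.encoder_a(x_a)                 # shared representation
        return self.decoder_a(h), self.decoder_b(h)

def splitae_loss(model, x_a, x_b):
    """Compound loss L_A(X_A, X'_A) + L_B(X_B, X'_B) on one aligned pair/batch."""
    recon_a, recon_b = model(x_a)
    mse = nn.functional.mse_loss
    return mse(recon_a, x_a) + mse(recon_b, x_b)

# Usage on one aligned minibatch (x_a, x_b):
#   loss = splitae_loss(model, x_a, x_b); loss.backward(); optimizer.step()
# A DCCAE replaces the compound loss with lambda*(L_A + L_B) + L_C from Eqs. 2-3.
```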
C. Multimodal federated averaging

During each round t, the server sends a global multimodal autoencoder w_t^{a_g} to selected clients. A selected client is either unimodal or multimodal and the local training on w_t^{a_g} depends on the modality of data on the client. As shown in Fig. 4, a multimodal client (e.g., Client 2) locally updates the encoders and decoders for both modalities. A unimodal client (e.g., Client 1 or 3) only updates the encoder and decoder for its data modality through standard autoencoder training. The encoder and decoder for the other modality will be frozen during the local training.

Fig. 4: Multimodal local training. Clients only update the f and g that are related to the modalities of their data.
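One way to realize the update rule in Fig. 4 is to freeze the parameters of the modality that a client cannot update before running local training, as in the sketch below. The attribute names on the multimodal autoencoder (encoder_a, decoder_a, and so on) are assumptions made for illustration.

```python
def freeze_unused_modality(autoencoder, client_modality):
    """Freeze the parts of a multimodal autoencoder (f_A, g_A, f_B, g_B)
    that the client's local data cannot update (sketch).

    client_modality: "A", "B", or "AB".
    """
    parts = {"A": [autoencoder.encoder_a, autoencoder.decoder_a],
             "B": [autoencoder.encoder_b, autoencoder.decoder_b]}
    for modality, modules in parts.items():
        trainable = modality in client_modality   # e.g. "A" in "AB" -> True
        for module in modules:
            for param in module.parameters():
                param.requires_grad = trainable
```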

We propose a multimodal FedAvg (Mm-FedAvg) algorithm to aggregate autoencoders received from both unimodal clients and multimodal clients. Fig. 5 shows which parts of different local autoencoders are used when generating a new global model. Given a global multimodal autoencoder w_t^{a_g} at round t represented as (f_A, g_A, f_B, g_B)_t, (f_A, g_A)_t is the encoder and decoder for modality A. Similarly, a local multimodal autoencoder updated by client k is w_t^{a_k} and the client's modality m^k is one of A, B and AB. The Mm-FedAvg algorithm is shown in Alg. 2.

Fig. 5: Multimodal FedAvg on the server. Only the updated parts of each local model will be aggregated.

Algorithm 2 Multimodal FedAvg (Mm-FedAvg)
Require: W_t: local multimodal autoencoders at round t; α: multimodal weight parameter; n^k: number of samples on client k; m^k: data modality of client k
1: W_t^A ← {w^{a_k} | w^{a_k} ∈ W_t ∧ m^k = A}
2: W_t^B ← {w^{a_k} | w^{a_k} ∈ W_t ∧ m^k = B}
3: W_t^{AB} ← {w^{a_k} | w^{a_k} ∈ W_t ∧ m^k = AB}
4: n_A ← Σ_{w^{a_k} ∈ W_t^A} n^k + α Σ_{w^{a_k} ∈ W_t^{AB}} n^k
5: n_B ← Σ_{w^{a_k} ∈ W_t^B} n^k + α Σ_{w^{a_k} ∈ W_t^{AB}} n^k
6: (f_A, g_A) ← Σ_{w^{a_k} ∈ W_t^A} (n^k / n_A)(f_A, g_A)^k + α Σ_{w^{a_k} ∈ W_t^{AB}} (n^k / n_A)(f_A, g_A)^k
7: (f_B, g_B) ← Σ_{w^{a_k} ∈ W_t^B} (n^k / n_B)(f_B, g_B)^k + α Σ_{w^{a_k} ∈ W_t^{AB}} (n^k / n_B)(f_B, g_B)^k
8: w_{t+1}^{a_g} ← (f_A, g_A, f_B, g_B)

When aggregating local models from multimodal clients and unimodal clients, the contribution from multimodal clients is controlled by a weight parameter α. Increasing α can give more weights to multimodal clients because they play a key role in aligning two modalities, which helps unimodal clients benefit from the data from another modality.
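A minimal sketch of Alg. 2 is given below, treating each local autoencoder as per-modality state dicts together with the client's modality and sample count. The data structures and dictionary keys are assumptions made for illustration; the weighting follows lines 4–7 of Alg. 2.

```python
def multimodal_fedavg(local_models, alpha):
    """Mm-FedAvg (Alg. 2, sketch).

    local_models: list of dicts, one per client, with keys
        'modality':      'A', 'B', or 'AB'
        'n':             number of local samples
        'AE_A' / 'AE_B': state dict of (f_A, g_A) / (f_B, g_B), if updated.
    alpha: multimodal weight parameter.
    """
    new_global = {}
    multi = [w for w in local_models if w["modality"] == "AB"]
    for m in ("A", "B"):
        uni = [w for w in local_models if w["modality"] == m]
        # Lines 4-5: effective sample counts; multimodal clients scaled by alpha.
        n_m = sum(w["n"] for w in uni) + alpha * sum(w["n"] for w in multi)
        key_ae = "AE_" + m
        param_keys = (uni + multi)[0][key_ae].keys()
        # Lines 6-7: weighted average of the encoder/decoder for modality m.
        new_global[key_ae] = {
            k: sum((w["n"] / n_m) * w[key_ae][k] for w in uni)
               + sum(alpha * (w["n"] / n_m) * w[key_ae][k] for w in multi)
            for k in param_keys
        }
    return new_global  # parameters of (f_A, g_A, f_B, g_B) for round t+1
```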

IV. EVALUATION

We evaluate our proposed framework on different multimodal datasets including sensory data, depth camera data, and RGB camera data through simulations. The research questions that we want to answer are as follows:
• Q1. Does introducing data from multiple modalities into FL improve its performance?
• Q2. Does a classifier trained on labelled data from one modality work on testing data from other modalities?
• Q3. Does learning from both unimodal and multimodal clients provide better performance than only learning from multimodal clients?

A. Datasets

As human activity recognition (HAR) is a domain that often relies on multimodal data, we used three HAR datasets that contain IoT data from different modalities in our experiments. Table I shows the modalities, X sizes, h sizes, and the number of classes in the datasets.
TABLE I: USED MULTIMODAL DATASETS

Dataset | Modality | X size | h size | Classes
Opp     | Acce     | 24     | 10     | 18
Opp     | Gyro     | 15     | 10     | 18
mHealth | Acce     | 9      | 4      | 13
mHealth | Gyro     | 6      | 4      | 13
mHealth | Mag      | 6      | 4      | 13
UR Fall | Acce     | 3      | 2, 4   | 3
UR Fall | RGB      | 512    | 2, 4   | 3
UR Fall | Depth    | 8      | 2, 4   | 3

1) Different sensory modalities: The Opportunity (Opp) challenge dataset [42] contains 18 short-term and non-repeated kitchen activities including opening & closing doors, fridges, dishwashers, and drawers, cleaning tables, drinking from cups, toggling switches, and null activities. Its multimodal data are measured by on-body sensors including accelerometers, gyroscopes, and magnetic sensors. We use the accelerometer data (Acce) measured in milligrams and gyroscope data (Gyro) measured in degrees/s as the two modalities in our experiments. Following the experimental setup used by Hammerla et al. [43], we use the runs ADL4 and ADL5 of subjects 2 and 3 as testing data (118k samples) and the remaining runs (except for ADL2 of subject 1) as training data (525k samples). For NaN data in a sequence, we use their previous value in the sequence to replace them [42]. As the training data are from 15 runs, when generating local data for a client, the size of the randomly sampled sequence is 1/15 of the training data.

The mHealth dataset [44] contains 13 daily living and exercise activities including standing still, sitting & relaxing, lying down, walking, climbing stairs, waist bending forward, frontal elevation of arms, knees bending, cycling, jogging, running, jumping front & back, and null activities. The activities are measured by multimodal on-body sensors including accelerometers, ECG sensors, gyroscopes, and magnetometers. We use the accelerometer data (Acce) measured in meters/s^2, gyroscope data (Gyro) measured in degrees/s, and magnetometer data (Mag) measured as local magnetic field in our experiments and test the combinations of each two of them. For each replicate of our simulations, we use the Leave-One-Subject-Out method to randomly choose one participant and use her data as testing data. The other 9 participants' data are used as training data. The average number of samples from a participant is 122±18k (mean±std). The size of the randomly sampled sequence for a client is 1/9 of the training data.

2) Sensory-Visual modalities: The UR Fall Detection dataset [45] contains 70 video clips recorded by an RGB camera (RGB) and a depth camera (Depth) of human activities including not lying, lying on the ground, and temporary poses. Each video frame is labelled and paired with sensory data from accelerometers (Acce) measured in grams. We use this dataset for our experiments on sensory-visual and visual-visual modality combinations. For the modality RGB, similar to the work by Srivastava et al. [46], we use a pre-trained ResNet-18 [47] to convert each frame into a feature map. For the modality Depth, we use the extracted features including HeightWidthRatio, MajorMinorRatio, BoundingBoxOccupancy, MaxStdXZ, HHmaxRatio, Height, Distance, and P40Ratio, which are provided in the dataset. The size of h is 2 with Acce and is 4 without it. For each replicate of our simulations, we randomly sample 1/10 data (i.e., 7 video clips) as testing data and use the rest as training data. The average number of frames in a video clip is 164 ± 82 (mean ± std). From the training data, the size of a randomly sampled sequence for a client is 1/9 of the training data.
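For the RGB modality, frame features can be obtained from a pre-trained ResNet-18 with its final classification layer removed, so that each frame maps to a 512-dimensional vector (matching the X size in Table I). The sketch below uses torchvision with standard ImageNet preprocessing; the exact preprocessing used in the paper is not specified, so these values are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pre-trained ResNet-18 with the final fully connected layer removed,
# so each RGB frame is mapped to a 512-dimensional feature vector.
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(pil_frames):
    """Convert a list of PIL RGB frames into a (num_frames, 512) tensor."""
    batch = torch.stack([preprocess(f) for f in pil_frames])
    return backbone(batch)
```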
B. Simulation setup

In each replicate of our simulation, the server conducts at most 100 communication rounds with the clients and selects 10% of the clients for local training (2 epochs with a 0.01 or a 0.001 learning rate, whichever provides better performance) in each round, after which the cloud training (5 epochs with a 0.001 learning rate) is conducted. The labelled dataset on the server is randomly sampled from the training dataset and its size is the same as the size of a client's local data. For DCCAE, we set λ = 0.01 as suggested by Wang et al. [10]. For the multimodal weight parameter α, we tested {1, 2, 10, 50, 100, 500} and found that α = 100 provides the best performance. For each individual simulation setup, we use different random seeds to run 64 replicates.

1) Baselines: To answer Q1, we consider a system in which clients have multimodal data and a server has two labelled unimodal datasets. Without multimodal representation learning, a baseline scheme can only use data from one modality, which we refer to as UmFL (30 unimodal clients, 1 label modality). Comparing UmFL with our multimodal scheme (30 multimodal clients, 2 label modalities) will reveal whether introducing more modalities in FL improves its performance. We test both of them on the data from the modality of UmFL.

To answer Q2, we consider a system wherein clients have multimodal data and a server has a labelled dataset from one modality. A baseline scheme trains a global unimodal autoencoder for each modality with the same size of h. The classifier of the baseline is trained on the labelled data from one modality with the help from the autoencoder on that modality. We directly test the classifier on data from the other modality, since the sizes of h from two modalities are the same. This baseline does not use the alignment information to do any multimodal local training. It is for the ablation study on the multimodal local training and multimodal FedAvg component. We refer to this baseline as Abl (30 unimodal clients for each modality, 1 label modality). Comparing Abl with our scheme (30 multimodal clients, 1 label modality) will indicate whether the multimodal component brings any improvement to the performance.

To answer Q3, we consider a system that has both unimodal clients and multimodal clients. The server in the system has a labelled dataset from one modality. A baseline scheme only chooses multimodal clients (30 clients) to update the global autoencoders. Comparing it with other schemes that use both multimodal and unimodal clients for local update will show whether our proposed Mm-FedAvg improves the performance of the system.
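For reference, the simulation settings stated above can be collected into a single configuration; the dictionary below is an illustrative summary of those values, not code from the original implementation.

```python
SIMULATION_CONFIG = {
    "max_rounds": 100,           # communication rounds per replicate
    "client_fraction": 0.1,      # clients selected per round
    "local_epochs": 2,
    "local_lr": (0.01, 0.001),   # whichever performs better
    "cloud_epochs": 5,
    "cloud_lr": 0.001,
    "dccae_lambda": 0.01,        # trade-off parameter for DCCAE
    "mm_fedavg_alpha": 100,      # multimodal weight parameter
    "replicates": 64,            # random seeds per setup
}
```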
2) Models: We implement all the deep learning components through the PyTorch library [48]. For training autoencoders on time-series data, we use long short-term memory (LSTM) [49] autoencoders [46] in our experiments for local training and use the bagging strategy [50] to train our models with random batch sizes and sequence lengths. An LSTM autoencoder takes a time-series sequence (e.g., sensory data, video frames) as its input. The hidden states generated by the LSTM encoder unit are used as the hidden representations of the input samples in the sequence. On the server side, we use a simple classifier that has one multilayer perceptron (MLP) layer connected to one LogSoftmax layer as the model for supervised learning. On the mHealth dataset, we introduce a Dropout layer (rate=0.5) before the MLP layer of the classifier to prevent overfitting.
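A minimal sketch of the model components described above is given below: an LSTM autoencoder whose encoder hidden states serve as per-sample representations, and the server-side classifier with one MLP layer, a LogSoftmax output, and an optional Dropout layer (used on mHealth). Hidden sizes, the decoding scheme, and the loss choices in the closing comment are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """LSTM autoencoder: encoder hidden states are used as representations."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.encoder = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, input_size, batch_first=True)

    def forward(self, x):                    # x: (batch, seq_len, input_size)
        h, _ = self.encoder(x)               # per-step hidden representations
        recon, _ = self.decoder(h)           # reconstruction of the sequence
        return h, recon

class ServerClassifier(nn.Module):
    """One MLP layer followed by LogSoftmax; Dropout is optional (mHealth)."""

    def __init__(self, hidden_size, num_classes, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, h):
        return self.net(h)

# The autoencoder can be trained with nn.MSELoss() on (recon, x) and the
# classifier with nn.NLLLoss() on its LogSoftmax outputs (assumed losses).
```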
C. Metrics

We test the classifier on the server against a labelled testing dataset. We use a sliding time window with length of 2,000 to extract time-series sequences (without overlap) from the testing dataset. We use the encoder of w^{a_g} for the modality of the testing data to convert the sequences into representations and test them on the classifier w^s. We calculate the F1 score of each class within a sequence as:

$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$   (4)

TP, FP, and FN are the numbers of true positive, false positive, and false negative classification results, respectively. The weighted average F1 score of all classes within the sequence (with the number of ground truth samples of a class being its weight) is the F1 score on the sequence. And the average F1 score of all sequences is the F1 score of the classifier. We evaluate the F1 score of the classifier every other communication round until it converges and calculate its average value and standard error from 64 replicates. On each dataset, we evaluate both SplitAE and DCCAE and keep the one that has better F1 scores.
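The per-class F1 in Eq. 4 and the weighted aggregation over a sequence can be computed as in the sketch below (plain NumPy; the function and variable names are illustrative).

```python
import numpy as np

def sequence_f1(y_true, y_pred, classes):
    """Weighted-average F1 over one test sequence (Eq. 4 + class weighting).

    The weight of each class is its number of ground-truth samples
    within the sequence.
    """
    f1_scores, weights = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
        weights.append(np.sum(y_true == c))
    weights = np.array(weights, dtype=float)
    if weights.sum() == 0:
        return 0.0
    return float(np.average(f1_scores, weights=weights))

# The classifier's F1 score is the mean of sequence_f1 over all
# non-overlapping windows of length 2,000 extracted from the test set.
```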
V. RESULTS

We find that by using data from multiple modalities, the F1 score of the classifier is higher than that by using data from one single modality. With the help of multimodal representations, the classifier trained on labelled data from one modality can be used on the data from another modality and achieve acceptable F1 scores. In addition, combining local autoencoders from both unimodal and multimodal clients can achieve higher F1 scores than only using multimodal clients.

Fig. 6: Comparison between UmFL and MmFL. MmFL schemes have higher or the same converged F1 scores as UmFL schemes on the UR Fall dataset. On all three datasets, MmFL converges faster than UmFL does. Panels: (a) Opp, DCCAE (A: Acce, B: Gyro); (b) mHealth, SplitAE (A: Acce, B: Gyro); (c) mHealth, SplitAE (A: Acce, B: Mag); (d) mHealth, SplitAE (A: Gyro, B: Mag); (e) UR Fall, SplitAE (A: Acce, B: Depth); (f) UR Fall, SplitAE (A: RGB, B: Depth).

A. Multimodal data improve F1 scores

On the Opp dataset, as shown in Fig. 6a, the F1 scores of multimodal schemes (MmFL) that are trained on labelled datasets from two modalities (L_AB) converge faster than UmFL_A and UmFL_B do when being tested on each modality (T_A and T_B). Although the converged F1 scores are the same for both UmFL and MmFL, using multimodal data speeds up the convergence.

On the mHealth dataset (Fig. 6b–6d), the results on three modality combinations show similar trends. On each testing modality, the converged F1 scores of MmFL schemes are similar to those of their unimodal counterparts. However, the F1 scores of MmFL schemes converge faster than UmFL schemes do.

On the UR Fall dataset, the sizes of X from Acce and RGB are 3 and 512, respectively. Thus h = 2 is the largest representation size that we can use for the modality combination Acce & RGB and it is not large enough to encode useful representations from RGB data. Therefore we only show the results from the other two modality combinations (Fig. 6e & 6f). The F1 scores of MmFL schemes are higher than those of UmFL schemes when the schemes are tested against Acce data or RGB data. When being tested against Depth data, MmFL schemes converge faster than UmFL schemes do. Even though the modalities of data in UR Fall are more heterogeneous (i.e., sensory & visual) than those in Opp or mHealth (i.e., sensory & sensory), multimodal FL can still align their representations, thereby introducing more data to improve the F1 score of the FL system.

Similar to the results of existing studies on centralized ML systems, our results demonstrate that, in FL systems, combining different modalities through multimodal representation learning can achieve higher F1 scores or faster convergence than only using unimodal data. Compared with existing work using early fusion [14], the labelled data source on the server in our framework does not have to be aligned multimodal data. It can be individual unimodal datasets that are collected separately. This suggests that we can scale up FL systems across different modalities by utilizing the alignment information contained in local data on multimodal clients.
Fig. 7: F1 scores of MmFL with labelled data from one modality (e.g., L_B) and test data from the other modality (e.g., T_A). MmFL schemes achieve higher converged F1 scores or faster convergence than baselines (i.e., Abl schemes) in most cases. Combining contributions from both unimodal and multimodal clients (e.g., MmFL_ABA) can further improve the F1 scores. Panels: (a) Opp, DCCAE (A: Acce, B: Gyro); (b) mHealth, SplitAE (A: Acce, B: Gyro); (c) mHealth, SplitAE (A: Acce, B: Mag); (d) mHealth, SplitAE (A: Gyro, B: Mag); (e) UR Fall, SplitAE (A: Acce, B: Depth); (f) UR Fall, SplitAE (A: RGB, B: Depth).

B. Labels can be used across modalities

To answer Q2, we use labelled data from one modality for supervised learning on the server and test the trained classifier on the other modality that does not have any labels in the system. Fig. 7 shows the F1 scores of MmFL with different modalities for labelled data (e.g., L_B) and testing data (e.g., T_A), in comparison with a baseline scheme (Abl) for the ablation study and a unimodal scheme for the modality of the testing data (e.g., UmFL_A).

On the Opp dataset with DCCAE (Fig. 7a), using only multimodal clients (i.e., MmFL_AB) achieves higher converged F1 scores than baseline schemes do, which means that the multimodal representation learning on clients indeed aligns two modalities. When training classifiers on labelled Gyro data and testing them on Acce data (i.e., MmFL_AB-L_B-T_A), the F1 score is close to that of a unimodal scheme using Acce data (i.e., UmFL_A), which demands labelled Acce data on the server.

On the mHealth dataset (Fig. 7b–7d), the converged F1 scores of baseline schemes and unimodal schemes are close to each other. This means that the different modalities may be correlated even without being aligned (similar to the findings reported by Malekzadeh et al. [51]). This might be due to the fact that except for 1 accelerometer on the chest, 6 sensors for different modalities in the mHealth dataset were attached to 2 body parts (e.g., left-ankle and right-lower-arm). Thus the readings of different modalities from the same body part might be correlated. MmFL_AB schemes still improve the converged F1 scores compared to Abl schemes and have faster convergence in two modality combinations (i.e., Acce & Gyro, Acce & Mag).

On the UR Fall dataset (Fig. 7e–7f), MmFL_AB schemes have higher F1 scores than baselines do. It is worth noting that, when using labelled Depth data (i.e., L_B), the test F1 scores on Acce and RGB data (i.e., MmFL_AB-L_B-T_A schemes in Fig. 7e & 7f) are even higher than those when using labelled data from these two testing modalities (i.e., UmFL_A). In Sec. V-A, results in Fig. 6e & 6f show that the unimodal schemes using Depth data have higher F1 scores than those using Acce or RGB data. Therefore, for MmFL with SplitAE, using labelled Depth data for the supervised learning on the server leads to higher F1 scores than using Acce or RGB data's own labels.

Our results show that, with the help of multimodal representation learning on FL clients, we can use the trained global autoencoder to share the label information from one modality to other modalities by mapping them into shared or related representations. The test F1 scores on the other modalities can be close to or even better than those of unimodal FL schemes using labels from the modalities. This allows us to scale up FL systems even with limited sources of unimodal labelled data. In addition, we can potentially improve the testing performance of a modality by aligning it with other modalities that have labels, instead of directly mapping it to labels.

C. Training on mixed clients

To understand how mixed clients with different device setups (i.e., unimodal clients and multimodal clients), which is a more realistic scenario for FL systems, affect the F1 scores, for each MmFL_AB scheme with 30 multimodal clients, we run one mixed-client scheme that has 10 more clients for modality A (i.e., MmFL_ABA), one that has 10 more clients for modality B (i.e., MmFL_ABB), and one that has 10 more clients for each modality (i.e., MmFL_ABAB). We compare them and keep the one that has the highest F1 scores.

In Fig. 7a, the MmFL_ABA-L_B-T_A scheme on the Opp dataset further speeds up the convergence of test F1 scores compared to MmFL_AB, which means that combining contributions from both unimodal and multimodal clients by using Mm-FedAvg is better than using only multimodal clients. On the mHealth dataset (Fig. 7b & 7c), the mixed-client schemes slightly improve the test F1 scores in two experiments. Similarly, on the UR Fall dataset (Fig. 7e), MmFL_ABA and MmFL_ABB schemes show improved F1 scores in the experiments of the Acce & Depth combination.

The results indicate that using Mm-FedAvg to combine models from both multimodal (with higher weights) and unimodal clients can provide higher F1 scores or faster convergence than only using multimodal clients. Thus, when there are a limited number of multimodal clients in a mixed-client FL system, we can utilize unimodal clients to boost the local training.

VI. DISCUSSIONS

In this paper, we have proposed a multimodal FL framework on IoT data. We now discuss how the framework can be used in real-world FL systems and what potential research topics are in the space of multimodal FL.

A. Heterogeneity beyond data distributions

Training in FL is mainly conducted on clients. In a real-world FL system, each client's local data are generated on an individual level rather than a population level, which means that heterogeneity between clients is commonplace. Some heterogeneity such as data distributions has been well studied and solving it can help keep the performance of FL systems stable across different clients. Other heterogeneity, such as data modalities, is also an important issue in implementing FL systems. As shown in our results, solving such heterogeneity can make FL systems scalable across different modalities, thereby increasing the amount of available data. In an FL system using IoT devices, it is difficult to force all clients to deploy devices that have the same data modality, because users may have different budgets for devices or privacy concerns on the devices installed in their homes. Therefore, multimodal FL plays an important role in realizing those promised FL systems that aim to work with hundreds of thousands of clients. In this paper, we focused on the modality heterogeneity issue and the other types of heterogeneity are out of our scope, which is the limitation of this paper. For future research, we plan to investigate how multimodal FL performs with the influence from the other types of heterogeneity in aspects such as data distributions and DNN model structures.
B. Sharing label information across modalities

The lack of labelled data on FL clients has recently motivated researchers to design semi-supervised FL systems. In many cases, only the service provider (i.e., the FL server) has the ability and expertise to provide labelled data. The existing research on semi-supervised FL assumes that the labelled data on the server and the local data on clients are from the same modality. In this paper, we have shown that our framework allows label information from one modality to be used by other modalities. This can potentially contribute to reducing the cost of data annotation on the server when implementing real-world semi-supervised FL systems. Some modalities (e.g., sensory data) may not be easy to directly annotate on. However, by using the matching information on FL clients, we can align these modalities with other modalities for which annotations are easy to acquire (e.g., visual data) on the server. By this means, we can enable clients from all modalities in the system to utilize the label information through multimodal representations. It may also allow us to deploy fewer privacy-intrusive devices (e.g., cameras) in people's homes since we only need some clients to have multimodal data for alignment.

C. Utilizing mixed FL clients

One of our contributions in this paper is the Mm-FedAvg algorithm that combines locally updated autoencoders from both unimodal and multimodal clients. By giving multimodal clients more weights, combining contributions from mixed clients has higher F1 scores than only using multimodal clients. Thus only a part of the clients in the system needs to be multimodal clients. Currently, all the multimodal clients in the framework use the same type of autoencoder (i.e., either all SplitAE or all DCCAE) and the unimodal clients can directly update a part of the autoencoders. In reality, this assumption may need to be changed due to different local data distributions or computational capabilities. Therefore, we suggest that more flexible multimodal averaging algorithms using techniques such as knowledge distillation [36] should be investigated. It would allow FL systems to use different local autoencoders for multimodal representation learning. In addition, mechanisms that can evaluate the quality of models trained on different data modalities and can dynamically adjust the weights of multimodal clients are necessary, which will allow us to optimise the combined contributions.

VII. CONCLUSIONS

As a new system paradigm, federated learning (FL) has shown great potential to realize deep learning systems in the real world and protect the privacy of data subjects at the same time. In this paper, we propose a multimodal and semi-supervised framework that enables FL systems to work with clients that have local data from different modalities and clients with different device setups (i.e., unimodal clients and multimodal clients). Our experimental results demonstrate that introducing data from multiple modalities into FL systems can improve their classification F1 scores. In addition, it allows us to apply models trained on labelled data from one modality to testing data from other modalities and achieve decent F1 scores. It only requires a part of the clients to be multimodal in order to align different modalities. We believe that our contributions can help machine-learning system designers who want to implement FL in complex real-world scenarios such as IoT environments, wherein data are generated from different modalities. For future research, we plan to investigate broader applications of our framework in domains apart from multimodal human activity recognition.

ACKNOWLEDGEMENT

This work was supported by the UK Dementia Research Institute.

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[2] U. M. Aïvodji, S. Gambs, and A. Martin, "IOTFLA: A Secured and Privacy-Preserving Smart Home Architecture Implementing Federated Learning," in Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 175–180.
[3] B. Liu, L. Wang, M. Liu, and C.-Z. Xu, "Federated Imitation Learning: A Novel Framework for Cloud Robotic Systems With Heterogeneous Sensor Data," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3509–3516, 2020.
[4] Y. Zhao, H. Haddadi, S. Skillman, S. Enshaeifar, and P. Barnaghi, "Privacy-preserving activity and health monitoring on databox," in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, 2020, pp. 49–54.
[5] Q. Wu, K. He, and X. Chen, "Personalized Federated Learning for Intelligent IoT Applications: A Cloud-Edge Based Framework," IEEE Open Journal of the Computer Society, vol. 1, pp. 35–44, 2020.
[6] J. Pang, Y. Huang, Z. Xie, Q. Han, and Z. Cai, "Realizing the Heterogeneity: A Self-Organized Federated Learning Framework for IoT," IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3088–3098, 2021.
[7] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini, "A Survey on Federated Learning for Resource-Constrained IoT Devices," IEEE Internet of Things Journal, pp. 1–1, 2021.
[8] A. Brunete, E. Gambao, M. Hernando, and R. Cedazo, "Smart Assistive Architecture for the Integration of IoT Devices, Robotic Systems, and Multimodal Interfaces in Healthcare Environments," Sensors, vol. 21, no. 6, 2021.
[9] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal Deep Learning," in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 689–696.
[10] W. Wang, R. Arora, K. Livescu, and J. Bilmes, "On Deep Multi-View Representation Learning," in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 1083–1092.
[11] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, "Berkeley MHAD: A Comprehensive Multimodal Human Action Database," in Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV). IEEE, 2013, pp. 53–60.
[12] V. Radu, C. Tong, S. Bhattacharya, N. D. Lane, C. Mascolo, M. K. Marina, and F. Kawsar, "Multimodal Deep Learning for Activity and Context Recognition," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, Jan. 2018.
[13] T. Xing, S. S. Sandha, B. Balaji, S. Chakraborty, and M. Srivastava, "Enabling Edge Devices that Learn from Each Other," in Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking, 2018, pp. 37–42.
[14] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency, "Think Locally, Act Globally: Federated Learning With Local And Global Representations," 2020, arXiv: 2001.01523.
[15] F. Liu, X. Wu, S. Ge, W. Fan, and Y. Zou, "Federated Learning for Vision-and-Language Grounding Problems," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 11572–11579.
[16] W. Jeong, J. Yoon, E. Yang, and S. J. Hwang, "Federated Semi-supervised Learning with Inter-client Consistency," 2020, arXiv: 2006.12097.
[17] B. van Berlo, A. Saeed, and T. Ozcelebi, "Towards Federated Unsupervised Representation Learning," in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, 2020, pp. 31–36.
[18] Y. Zhao, H. Liu, H. Li, P. Barnaghi, and H. Haddadi, "Semi-supervised Federated Learning for Activity Recognition," 2021, arXiv: 2011.00851.
[19] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge Computing: Vision and Challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, Oct. 2016.
[20] J. Chen and X. Ran, "Deep Learning With Edge Computing: A Review," Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
[21] Y. Liu, X. Yuan, R. Zhao, Y. Zheng, and Y. Zheng, "RC-SSFL: Towards Robust and Communication-efficient Semi-supervised Federated Learning System," 2020, arXiv: 2012.04432.
[22] Z. Zhang, Z. Yao, Y. Yang, Y. Yan, J. E. Gonzalez, and M. W. Mahoney, "Benchmarking Semi-supervised Federated Learning," 2021, arXiv: 2008.11364.
[23] Z. Long, L. Che, Y. Wang, M. Ye, J. Luo, J. Wu, H. Xiao, and F. Ma, "FedSiam: Towards Adaptive Federated Semi-Supervised Learning," 2021, arXiv: 2012.03292.
[24] W. Zhang, X. Li, H. Ma, Z. Luo, and X. Li, "Federated Learning for Machinery Fault Diagnosis with Dynamic Validation and Self-supervision," Knowledge-Based Systems, vol. 213, p. 106679, 2021.
[25] Y. Kang, Y. Liu, and T. Chen, "FedMVT: Semi-supervised Vertical Federated Learning with MultiView Training," 2020, arXiv: 2008.10838.
[26] B. Wang, A. Li, H. Li, and Y. Chen, "GraphFL: A Federated Learning Framework for Semi-Supervised Node Classification on Graphs," 2020, arXiv: 2012.04187.
[27] D. Yang, Z. Xu, W. Li, A. Myronenko, H. R. Roth, S. Harmon, S. Xu, B. Turkbey, E. Turkbey, X. Wang et al., "Federated Semi-Supervised Learning for COVID Region Segmentation in Chest CT using Multi-National Data from China, Italy, Japan," Medical Image Analysis, vol. 70, p. 101992, 2021.
[28] A. Saeed, T. Ozcelebi, and J. Lukkien, "Multi-task Self-Supervised Learning for Human Activity Detection," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 2, Jun. 2019.
[29] A. Saeed, F. D. Salim, T. Ozcelebi, and J. Lukkien, "Federated Self-Supervised Learning of Multisensor Representations for Embedded Intelligence," IEEE Internet of Things Journal, vol. 8, no. 2, pp. 1030–1040, 2021.
[30] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and Open Problems in Federated Learning," 2021, arXiv: 1912.04977.
[31] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated Learning: Challenges, Methods, and Future Directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[32] V. Smith, C. K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated Multi-Task Learning," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[33] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated Learning with Non-IID Data," 2018, arXiv: 1806.00582.
[34] R. Li, F. Ma, W. Jiang, and J. Gao, "Online Federated Multitask Learning," in Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 215–220.
[35] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao, "FedHealth: A Federated Transfer Learning Framework for Wearable Healthcare," IEEE Intelligent Systems, vol. 35, no. 4, pp. 83–93, 2020.
[36] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, "Ensemble Distillation for Robust Model Fusion in Federated Learning," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 2351–2363.
[37] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," 2015, arXiv: 1503.02531.
[38] P. Baldi, "Autoencoders, Unsupervised Learning and Deep Architectures," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, Bellevue, Washington, USA, 2012, pp. 37–49.
[39] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep Canonical Correlation Analysis," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013, pp. 1247–1255.
[40] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3128–3137.
[41] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, "Multimodal Machine Learning: A Survey and Taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019.
[42] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, "The Opportunity Challenge: A Benchmark Database for On-Body Sensor-based Activity Recognition," Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
[43] N. Y. Hammerla, S. Halloran, and T. Plötz, "Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 1533–1540.
[44] O. Banos, R. Garcia, J. A. Holgado-Terriza, M. Damas, H. Pomares, I. Rojas, A. Saez, and C. Villalonga, "mHealthDroid: A Novel Framework For Agile Development of Mobile Health Applications," in Proceedings of the 6th International Work-Conference on Ambient Assisted Living and Daily Activities, 2014, pp. 91–98.
[45] B. Kwolek and M. Kepski, "Human Fall Detection on Embedded Platform Using Depth Maps and Wireless Accelerometer," Computer Methods and Programs in Biomedicine, vol. 117, no. 3, pp. 489–501, 2014.
[46] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised Learning of Video Representations Using LSTMs," in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 843–852.
[47] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[49] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[50] Y. Guan and T. Plötz, "Ensembles of Deep LSTM Learners for Activity Recognition using Wearables," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, pp. 1–28, Jun. 2017.
[51] M. Malekzadeh, R. G. Clegg, A. Cavallaro, and H. Haddadi, "DANA: Dimension-Adaptive Neural Architecture for Multivariate Sensor Data," 2020, arXiv: 2008.02397.