Multimodal Federated Learning on IoT Data
Yuchen Zhao, Payam Barnaghi, Hamed Haddadi
Abstract—Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world …

… makes FL systems specifically suitable for privacy-sensitive applications such as smart home [2]–[4] based on IoT technologies. For example, Wu et al. [5] propose an FL framework …
[Fig. 1 diagram: (a) canonical FL, in which clients perform supervised local training on (X, Y) and the server aggregates local models as $w_{t+1}^g = \mathrm{FedAvg}(w_{t+1}^1, w_{t+1}^2, w_{t+1}^3)$; (b) multimodal FL, in which clients train autoencoders unsupervised on local $X_A$ or $X_B$ and the server encodes a labelled dataset into (h, Y) to train a supervised classifier.]
Fig. 1: In canonical federated learning (a), during round t, a server sends a global model $w_t^g$ to selected clients that have data from the same modality. Client k conducts supervised learning to generate a local model $w_{t+1}^k$. Local models are aggregated on the server by using the FedAvg algorithm. In multimodal federated learning (b), a server sends a global model $w_t^{a_g}$ to selected clients to learn to extract multimodal representations (Sec. III-B) on unlabelled local data. The server uses multimodal FedAvg (Sec. III-C) to aggregate local models into a new global model $w_{t+1}^{a_g}$ and uses it to encode a labelled dataset (modality A or B) to a labelled representation dataset (h, Y). A classifier $w_{t+1}^s$ is then trained on (h, Y), which can be used by all clients.
[Fig. 3 diagram: (a) Split autoencoder; (b) Canonically correlated autoencoder.]
Fig. 3: In split autoencoders (a), for aligned input $(X_A, X_B)$ from two modalities, data from one modality are input into its encoder to generate an h, which is then used to reconstruct the data for both modalities through two decoders. Each single modality has a loss function (i.e., $L_A$ and $L_B$) and the overall objective of training is to minimize $L_A + L_B$. In a canonically correlated autoencoder (b), data from both modalities are input into their encoders to generate two representations. Two parameter matrices are used to maximize the canonical correlation between the paired representations $h_A$ and $h_B$. The overall objective of the training is to minimize $\lambda(L_A + L_B) + L_C$, where $\lambda$ is a trade-off parameter and $L_C$ is the negative value of the canonical correlation.
$X'$, which is measured by a loss function $L(X, X')$, such as the mean squared error (MSE). The assumption is that if the reconstruction error is small, then the hidden representation contains the most useful information in the original input. Therefore, minimizing the error will make the encoder learn to extract such useful information.
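To make the reconstruction objective concrete, here is a minimal PyTorch sketch of a canonical autoencoder with an encoder f, a decoder g, and one training step on the MSE reconstruction loss; the layer sizes, optimizer, and toy batch are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal encoder f and decoder g; h = f(x) is the hidden representation."""
    def __init__(self, in_dim=6, hidden_dim=16):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())  # encoder
        self.g = nn.Linear(hidden_dim, in_dim)                            # decoder

    def forward(self, x):
        h = self.f(x)
        return self.g(h), h

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                 # L(X, X') as the reconstruction error

x = torch.randn(32, 6)                 # a toy unlabelled batch
x_rec, h = model(x)
loss = loss_fn(x_rec, x)               # minimizing this drives f to keep useful information in h
loss.backward()
optimizer.step()
```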
2) Split autoencoders: Canonical autoencoders only work on data from the same modality. In order to extract shared representations from multimodal data, Ngiam et al. [9] propose a split autoencoder (SplitAE) that takes input data from one modality and encodes the data into a shared h for two modalities. With the shared h, two decoders are used to generate the reconstructions for two modalities. Fig. 3a shows the structures of SplitAEs for two data modalities. The premise is that the data from the two modalities have to be matching pairs, which means that they present the same underlying activities or events. Since the encoders for both modalities aim to extract hidden representations, we want the representations to be not only specific to an individual modality. Instead, we hope that the extracted representations from both encoders can reflect the general nature of the activities or events in question.

For modalities A and B, given a pair of matching samples $(X_A, X_B)$ (e.g., accelerometer data and video data of the same activity), the SplitAE $(f_A, g_A, g_B)$ for input modality A is:

$\arg\min_{f_A, g_A, g_B} \; L_A(X_A, X_A') + L_B(X_B, X_B')$   (1)

$X_A'$ and $X_B'$ are the reconstructions for the two modalities. $L_A$ and $L_B$ are the loss functions for the two modalities, respectively. By minimizing the compound loss in Eq. 1, the learned encoder $f_A$ will extract representations that are useful for both modalities. Similarly, for input modality B, its SplitAE is $(f_B, g_A, g_B)$.
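The following is a small PyTorch sketch of the SplitAE objective in Eq. (1) for input modality A; the dimensions and layers are illustrative assumptions rather than the paper's architecture (the models actually used are the LSTM autoencoders described later in the Models subsection).

```python
import torch
import torch.nn as nn

class SplitAE(nn.Module):
    """One encoder f_A and two decoders g_A, g_B sharing the representation h."""
    def __init__(self, dim_a, dim_b, hidden_dim):
        super().__init__()
        self.f_a = nn.Sequential(nn.Linear(dim_a, hidden_dim), nn.ReLU())
        self.g_a = nn.Linear(hidden_dim, dim_a)
        self.g_b = nn.Linear(hidden_dim, dim_b)

    def forward(self, x_a):
        h = self.f_a(x_a)                    # shared representation from modality A
        return self.g_a(h), self.g_b(h)

model = SplitAE(dim_a=3, dim_b=64, hidden_dim=16)
mse = nn.MSELoss()

x_a = torch.randn(32, 3)     # aligned pair: e.g., an accelerometer sample ...
x_b = torch.randn(32, 64)    # ... and a feature vector of the matching video frame
rec_a, rec_b = model(x_a)
loss = mse(rec_a, x_a) + mse(rec_b, x_b)     # L_A + L_B from Eq. (1)
loss.backward()
```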
3) Deep canonically correlated autoencoders: In order to combine deep canonical correlation analysis [39] and autoencoders together, Wang et al. [10] propose a deep canonically correlated autoencoder (DCCAE). Instead of mapping multimodal data into shared representations, DCCAE keeps an individual autoencoder for each modality and tries to maximize the canonical correlation between the hidden representations from two modalities. Fig. 3b shows the structure of a DCCAE for two modalities.

For modalities A and B, given aligned input $(X_A, X_B)$, the DCCAE $(f_A, g_A, f_B, g_B)$ is:

$\arg\min_{f_A, g_A, f_B, g_B, U, V} \; \lambda(L_A + L_B) + L_C$   (2)

$L_C = -\mathrm{tr}\left(U^\top f_A(X_A)\, f_B(X_B)^\top V\right)$   (3)

Parameter matrices U and V are canonical correlation analysis directions. Similarly to SplitAE, one of the objectives of DCCAE is to minimize the reconstruction losses. In addition, it uses another objective to increase the canonical correlation between the generated representations from two modalities (i.e., minimizing its negative value $L_C$). The two objectives are balanced by a parameter $\lambda$. By this means, DCCAE maps multimodal data into correlated representations rather than shared representations.
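For illustration, here is a simplified PyTorch sketch of the DCCAE objective in Eqs. (2)–(3). It treats U and V as plain trainable projection matrices and optimizes the raw trace term, omitting the whitening constraints that a full canonical correlation analysis places on the projections, so it is a toy approximation of the training objective rather than the paper's implementation.

```python
import torch
import torch.nn as nn

# Two per-modality autoencoders (f_A, g_A) and (f_B, g_B); U and V stand in for
# the CCA directions in Eq. (3). All dimensions are illustrative assumptions.
d, k, lam = 16, 4, 0.01
f_a, g_a = nn.Linear(3, d), nn.Linear(d, 3)
f_b, g_b = nn.Linear(64, d), nn.Linear(d, 64)
U = nn.Parameter(torch.randn(d, k) * 0.1)
V = nn.Parameter(torch.randn(d, k) * 0.1)
params = list(f_a.parameters()) + list(g_a.parameters()) + \
         list(f_b.parameters()) + list(g_b.parameters()) + [U, V]
opt = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()

x_a, x_b = torch.randn(32, 3), torch.randn(32, 64)     # one aligned batch
h_a, h_b = f_a(x_a), f_b(x_b)                          # per-modality representations
loss_rec = mse(g_a(h_a), x_a) + mse(g_b(h_b), x_b)     # L_A + L_B
loss_c = -torch.trace(U.T @ (h_a.T @ h_b) @ V)         # L_C: negative correlation term (unconstrained toy version)
loss = lam * loss_rec + loss_c                         # Eq. (2)
loss.backward()
opt.step()
```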
C. Multimodal federated averaging

During each round t, the server sends a global multimodal autoencoder $w_t^{a_g}$ to selected clients. A selected client is either unimodal or multimodal, and the local training on $w_t^{a_g}$ depends on the modalities of its local data.
Fig. 4: Multimodal local training. Clients only update the f and g that are related to the modalities of their data.
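A possible sketch of the aggregation step is given below. It averages each parameter only over the clients that actually updated it and applies per-client weights; giving multimodal clients larger weights follows the description in Sec. V-C, but the exact weighting coefficients and bookkeeping here are assumptions, not the paper's Mm-FedAvg implementation.

```python
from collections import OrderedDict
import torch

def mm_fedavg(global_state, client_states, client_weights):
    """Average only the parameters each client actually updated.

    client_states: list of dicts mapping parameter names (e.g. 'f_a.weight') to
    tensors; unimodal clients only submit the encoder/decoder of their own
    modality. client_weights: per-client scalars, e.g. larger for multimodal
    clients (an assumption following Sec. V-C).
    """
    new_state = OrderedDict()
    for name, param in global_state.items():
        contribs, weights = [], []
        for state, w in zip(client_states, client_weights):
            if name in state:                      # this client trained this sub-model
                contribs.append(state[name] * w)
                weights.append(w)
        if contribs:                               # weighted average of the updates
            new_state[name] = torch.stack(contribs).sum(0) / sum(weights)
        else:                                      # nobody updated it: keep the global value
            new_state[name] = param.clone()
    return new_state
```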
[Fig. 6 plots: test F1 score vs. communication rounds for (d) mHealth, SplitAE, A: Gyro, B: Mag; (e) UR Fall, SplitAE, A: Acce, B: Depth; (f) UR Fall, SplitAE, A: RGB, B: Depth; each panel compares UmFLA, UmFLB, MmFLAB-LAB-TA, and MmFLAB-LAB-TB.]
Fig. 6: Comparison between UmFL and MmFL. On the UR Fall dataset, MmFL schemes achieve converged F1 scores that are higher than or equal to those of UmFL schemes. On all three datasets, MmFL converges faster than UmFL does.
2) Models: We implement all the deep learning components through the PyTorch library [48]. For training autoencoders on time-series data, we use long short-term memory (LSTM) [49] autoencoders [46] in our experiments for local training and use the bagging strategy [50] to train our models with random batch sizes and sequence lengths. An LSTM autoencoder takes a time-series sequence (e.g., sensory data, video frames) as its input. The hidden states generated by the LSTM encoder unit are used as the hidden representations of the input samples in the sequence. On the server side, we use a simple classifier that has one multilayer perceptron (MLP) layer connected to one LogSoftmax layer as the model for supervised learning. On the mHealth dataset, we introduce a Dropout layer (rate = 0.5) before the MLP layer of the classifier to prevent overfitting.
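As an illustration of this setup, the sketch below shows an LSTM encoder whose per-step hidden states serve as representations, followed by the one-layer MLP + LogSoftmax classifier used on the server. Only the encoder half of the autoencoder is shown, and all dimensions, the number of classes, and the toy batch are assumptions rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Encode a time-series sequence; the per-step hidden states are the representations."""
    def __init__(self, in_dim=3, hidden_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):                # x: (batch, seq_len, in_dim)
        h, _ = self.lstm(x)              # h: (batch, seq_len, hidden_dim)
        return h

class Classifier(nn.Module):
    """Server-side model: one MLP layer followed by LogSoftmax."""
    def __init__(self, hidden_dim=16, num_classes=5, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),                 # rate 0.5 on mHealth to prevent overfitting
            nn.Linear(hidden_dim, num_classes),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, h):
        return self.net(h)

encoder = LSTMEncoder()
clf = Classifier(dropout=0.5)
x = torch.randn(8, 100, 3)               # a batch of 100-step sensory sequences
log_probs = clf(encoder(x))              # log-probabilities for each time step
```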
C. Metrics

We test the classifier on the server against a labelled testing dataset. We use a sliding time window with a length of 2,000 to extract time-series sequences (without overlap) from the testing dataset. We use the encoder of $w^{a_g}$ for the modality of the testing data to convert the sequences into representations and test them on the classifier $w^s$. We calculate the F1 score of each class within a sequence as:

$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$   (4)

TP, FP, and FN are the numbers of true positive, false positive, and false negative classification results, respectively. The weighted average F1 score of all classes within the sequence (with the number of ground truth samples of a class being its weight) is the F1 score on the sequence, and the average F1 score of all sequences is the F1 score of the classifier. We evaluate the F1 score of the classifier every other communication round until it converges and calculate its average value and standard error from 64 replicates. On each dataset, we evaluate both SplitAE and DCCAE and keep the one that has better F1 scores.
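A minimal sketch of this per-sequence metric (Eq. 4 per class, weighted by ground-truth support) is shown below; it mirrors what sklearn.metrics.f1_score(y_true, y_pred, average='weighted') computes for a single sequence.

```python
import numpy as np

def sequence_f1(y_true, y_pred):
    """Support-weighted per-class F1 (Eq. 4) over one time-series sequence."""
    f1_sum, total = 0.0, len(y_true)
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom > 0 else 0.0
        f1_sum += f1_c * np.sum(y_true == c)       # weight by ground-truth support
    return f1_sum / total

# The classifier's F1 score is then the mean of sequence_f1 over all test sequences.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(sequence_f1(y_true, y_pred))
```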
V. RESULTS

We find that by using data from multiple modalities, the F1 score of the classifier is higher than that obtained by using data from a single modality. With the help of multimodal representations, the classifier trained on labelled data from one modality can be used on the data from another modality and achieve acceptable F1 scores. In addition, combining local autoencoders from both unimodal and multimodal clients can achieve higher F1 scores than only using multimodal clients.

A. Multimodal data improve F1 scores

On the Opp dataset, as shown in Fig. 6a, the F1 scores of multimodal schemes (MmFL) that are trained on labelled datasets from two modalities (LAB) converge faster than UmFLA and UmFLB do when being tested on each modality (TA and TB). Although the converged F1 scores are the same for both UmFL and MmFL, using multimodal data speeds up the convergence.
[Fig. 7 plots: test F1 score vs. communication rounds for (a) Opp, DCCAE, A: Acce, B: Gyro; (b) mHealth, SplitAE, A: Acce, B: Gyro; (c) mHealth, SplitAE, A: Acce, B: Mag; (d) mHealth, SplitAE, A: Gyro, B: Mag; (e) UR Fall, SplitAE, A: Acce, B: Depth; (f) UR Fall, SplitAE, A: RGB, B: Depth; each panel compares UmFL, MmFLAB, mixed-client variants (e.g., MmFLABA, MmFLABB, MmFLABAB), and Abl schemes with labelled data from one modality (LA or LB) and test data from the other (TA or TB).]
Fig. 7: F1 scores of MmFL with labelled data from one modality (e.g., LB ) and test data from the other modality (e.g., TA ).
MmFL schemes achieve higher converged F1 scores or faster convergence than baselines (i.e., Abl schemes) in most cases.
Combining contributions from both unimodal and multimodal clients (e.g., MmFLABA ) can further improve the F1 scores.
On the mHealth dataset (Fig. 6b–6d), the results on three modality combinations show similar trends. On each testing modality, the converged F1 scores of MmFL schemes are similar to those of their unimodal counterparts. However, the F1 scores of MmFL schemes converge faster than UmFL schemes do.

On the UR Fall dataset, the sizes of X from Acce and RGB are 3 and 512, respectively. Thus h = 2 is the largest representation size that we can use for the modality combination Acce & RGB, and it is not large enough to encode useful representations from RGB data. Therefore we only show the results from the other two modality combinations (Fig. 6e & 6f). The F1 scores of MmFL schemes are higher than those of UmFL schemes when the schemes are tested against Acce data or RGB data. When being tested against Depth data, MmFL schemes converge faster than UmFL schemes do. Even though the modalities of data in UR Fall are more heterogeneous (i.e., sensory & visual) than those in Opp or mHealth (i.e., sensory & sensory), multimodal FL can still align their representations, thereby introducing more data to improve the F1 score of the FL system.

Similar to the results of existing studies on centralized ML systems, our results demonstrate that, in FL systems, combining different modalities through multimodal representation learning can achieve higher F1 scores or faster convergence than only using unimodal data. Compared with existing work using early fusion [14], the labelled data source on the server in our framework does not have to be aligned multimodal data. It can be individual unimodal datasets that are collected separately. This suggests that we can scale up FL systems across different modalities by utilizing the alignment information contained in local data on multimodal clients.
B. Labels can be used across modalities

To answer Q2, we use labelled data from one modality for supervised learning on the server and test the trained classifier on the other modality that does not have any labels in the system. Fig. 7 shows the F1 scores of MmFL with different modalities for labelled data (e.g., LB) and testing data (e.g., TA), in comparison with a baseline scheme (Abl) for the ablation study and a unimodal scheme for the modality of the testing data (e.g., UmFLA).

On the Opp dataset with DCCAE (Fig. 7a), using only multimodal clients (i.e., MmFLAB) achieves higher converged F1 scores than baseline schemes do, which means that the multimodal representation learning on clients indeed aligns the two modalities. When training classifiers on labelled Gyro data and testing them on Acce data (i.e., MmFLAB-LB-TA), the F1 score is close to that of a unimodal scheme using Acce data (i.e., UmFLA), which demands labelled Acce data on the server.

On the mHealth dataset (Fig. 7b–7d), the converged F1 scores of baseline schemes and unimodal schemes are close to each other. This means that the different modalities may be correlated even without being aligned (similar to the findings reported by Malekzadeh et al. [51]). This might be due to the fact that, except for one accelerometer on the chest, the six sensors for different modalities in the mHealth dataset were attached to two body parts (e.g., the left ankle and the right lower arm). Thus the readings of different modalities from the same body part might be correlated. MmFLAB schemes still improve the converged F1 scores compared to Abl schemes and converge faster in two modality combinations (i.e., Acce & Gyro, Acce & Mag).

On the UR Fall dataset (Fig. 7e–7f), MmFLAB schemes have higher F1 scores than baselines do. It is worth noting that, when using labelled Depth data (i.e., LB), the test F1 scores on Acce and RGB data (i.e., the MmFLAB-LB-TA schemes in Fig. 7e & 7f) are even higher than those when using labelled data from these two testing modalities (i.e., UmFLA). In Sec. V-A, results in Fig. 6e & 6f show that the unimodal schemes using Depth data have higher F1 scores than those using Acce or RGB data. Therefore, for MmFL with SplitAE, using labelled Depth data for the supervised learning on the server leads to higher F1 scores than using the Acce or RGB data's own labels.

Our results show that, with the help of multimodal representation learning on FL clients, we can use the trained global autoencoder to share the label information from one modality to other modalities by mapping them into shared or related representations. The test F1 scores on the other modalities can be close to or even better than those of unimodal FL schemes using labels from the modalities. This allows us to scale up FL systems even with a limited source of unimodal labelled data. In addition, we can potentially improve the testing performance of a modality by aligning it with other modalities that have labels, instead of directly mapping it to labels.

C. Training on mixed clients

To understand how mixed clients with different device setups (i.e., unimodal clients and multimodal clients), which are a more realistic scenario for FL systems, affect the F1 scores, we run, for each MmFLAB scheme with 30 multimodal clients, one mixed-client scheme that has 10 more clients for modality A (i.e., MmFLABA), one that has 10 more clients for modality B (i.e., MmFLABB), and one that has 10 more clients for each modality (i.e., MmFLABAB). We compare them and keep the one that has the highest F1 scores.

In Fig. 7a, the MmFLABA-LB-TA scheme on the Opp dataset further speeds up the convergence of test F1 scores compared to MmFLAB, which means that combining contributions from both unimodal and multimodal clients by using Mm-FedAvg is better than using only multimodal clients. On the mHealth dataset (Fig. 7b & 7c), the mixed-client schemes slightly improve the test F1 scores in two experiments. Similarly, on the UR Fall dataset (Fig. 7e), the MmFLABA and MmFLABB schemes show improved F1 scores in the experiments of the Acce & Depth combination.

The results indicate that using Mm-FedAvg to combine models from both multimodal (with higher weights) and unimodal clients can provide higher F1 scores or faster convergence than only using multimodal clients. Thus, when there are a limited number of multimodal clients in a mixed-client FL system, we can utilize unimodal clients to boost the local training.
VI. DISCUSSIONS

In this paper, we have proposed a multimodal FL framework on IoT data. We now discuss how the framework can be used in real-world FL systems and what potential research topics are in the space of multimodal FL.

A. Heterogeneity beyond data distributions

Training in FL is mainly conducted on clients. In a real-world FL system, each client's local data are generated on an individual level rather than a population level, which means that heterogeneity between clients is commonplace. Some heterogeneity, such as in data distributions, has been well studied, and addressing it can help keep the performance of FL systems stable across different clients. Other heterogeneity, such as in data modalities, is also an important issue in implementing FL systems. As shown in our results, addressing such heterogeneity can make FL systems scalable across different modalities, thereby increasing the amount of available data. In an FL system using IoT devices, it is difficult to force all clients to deploy devices that have the same data modality, because users may have different budgets for devices or privacy concerns about the devices installed in their homes. Therefore, multimodal FL plays an important role in realizing those promised FL systems that aim to work with hundreds of thousands of clients. In this paper, we focused on the modality heterogeneity issue; the other types of heterogeneity are out of our scope, which is a limitation of this paper. For future research, we plan to investigate how multimodal FL performs under the influence of other types of heterogeneity in aspects such as data distributions and DNN model structures.
B. Sharing label information across modalities

The lack of labelled data on FL clients has recently motivated researchers to design semi-supervised FL systems. In many cases, only the service provider (i.e., the FL server) has the ability and expertise to provide labelled data. The existing research on semi-supervised FL assumes that the labelled data on the server and the local data on clients are from the same modality. In this paper, we have shown that our framework allows label information from one modality to be used by other modalities. This can potentially contribute to reducing the cost of data annotation on the server when implementing real-world semi-supervised FL systems. Some modalities (e.g., sensory data) may not be easy to annotate directly. However, by using the matching information on FL clients, we can align these modalities with other modalities for which annotations are easier to acquire (e.g., visual data) on the server. By this means, we can enable clients from all modalities in the system to utilize the label information through multimodal representations. It may also allow us to deploy fewer privacy-intrusive devices (e.g., cameras) in people's homes since we only need some clients to have multimodal data for alignment.

C. Utilizing mixed FL clients

One of our contributions in this paper is the Mm-FedAvg algorithm that combines locally updated autoencoders from both unimodal and multimodal clients. By giving multimodal clients more weight, combining contributions from mixed clients achieves higher F1 scores than only using multimodal clients. Thus only a part of the clients in the system needs to be multimodal clients. Currently, all the multimodal clients in the framework use the same type of autoencoder (i.e., either all SplitAE or all DCCAE) and the unimodal clients can directly update a part of the autoencoders. In reality, this assumption may need to be changed due to different local data distributions or computational capabilities. Therefore, we suggest that more flexible multimodal averaging algorithms using techniques such as knowledge distillation [36] should be investigated. This would allow FL systems to use different local autoencoders for multimodal representation learning. In addition, mechanisms that can evaluate the quality of models trained on different data modalities and can dynamically adjust the weights of multimodal clients are necessary, which will allow us to optimise the combined contributions.
VII. CONCLUSIONS

As a new system paradigm, federated learning (FL) has shown great potential to realize deep learning systems in the real world and protect the privacy of data subjects at the same time. In this paper, we propose a multimodal and semi-supervised framework that enables FL systems to work with clients that have local data from different modalities and clients with different device setups (i.e., unimodal clients and multimodal clients). Our experimental results demonstrate that introducing data from multiple modalities into FL systems can improve their classification F1 scores. In addition, it allows us to apply models trained on labelled data from one modality to testing data from other modalities and achieve decent F1 scores. It only requires a part of the clients to be multimodal in order to align different modalities. We believe that our contributions can help machine-learning system designers who want to implement FL in complex real-world scenarios such as IoT environments, wherein data are generated from different modalities. For future research, we plan to investigate broader applications of our framework in domains apart from multimodal human activity recognition.

ACKNOWLEDGEMENT

This work was supported by the UK Dementia Research Institute.

REFERENCES

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[2] U. M. Aïvodji, S. Gambs, and A. Martin, "IOTFLA: A Secured and Privacy-Preserving Smart Home Architecture Implementing Federated Learning," in Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 175–180.
[3] B. Liu, L. Wang, M. Liu, and C.-Z. Xu, "Federated Imitation Learning: A Novel Framework for Cloud Robotic Systems With Heterogeneous Sensor Data," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3509–3516, 2020.
[4] Y. Zhao, H. Haddadi, S. Skillman, S. Enshaeifar, and P. Barnaghi, "Privacy-preserving activity and health monitoring on databox," in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, 2020, pp. 49–54.
[5] Q. Wu, K. He, and X. Chen, "Personalized Federated Learning for Intelligent IoT Applications: A Cloud-Edge Based Framework," IEEE Open Journal of the Computer Society, vol. 1, pp. 35–44, 2020.
[6] J. Pang, Y. Huang, Z. Xie, Q. Han, and Z. Cai, "Realizing the Heterogeneity: A Self-Organized Federated Learning Framework for IoT," IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3088–3098, 2021.
[7] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini, "A Survey on Federated Learning for Resource-Constrained IoT Devices," IEEE Internet of Things Journal, pp. 1–1, 2021.
[8] A. Brunete, E. Gambao, M. Hernando, and R. Cedazo, "Smart Assistive Architecture for the Integration of IoT Devices, Robotic Systems, and Multimodal Interfaces in Healthcare Environments," Sensors, vol. 21, no. 6, 2021.
[9] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal Deep Learning," in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 689–696.
[10] W. Wang, R. Arora, K. Livescu, and J. Bilmes, "On Deep Multi-View Representation Learning," in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 1083–1092.
[11] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, "Berkeley MHAD: A Comprehensive Multimodal Human Action Database," in Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), 2013, pp. 53–60.
[12] V. Radu, C. Tong, S. Bhattacharya, N. D. Lane, C. Mascolo, M. K. Marina, and F. Kawsar, "Multimodal Deep Learning for Activity and Context Recognition," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, Jan. 2018.
[13] T. Xing, S. S. Sandha, B. Balaji, S. Chakraborty, and M. Srivastava, "Enabling Edge Devices that Learn from Each Other," in Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking, 2018, pp. 37–42.
[14] P. P. Liang, T. Liu, L. Ziyin, N. B. Allen, R. P. Auerbach, D. Brent, R. Salakhutdinov, and L.-P. Morency, "Think Locally, Act Globally: Federated Learning With Local And Global Representations," 2020, arXiv: 2001.01523.
[15] F. Liu, X. Wu, S. Ge, W. Fan, and Y. Zou, "Federated Learning for Vision-and-Language Grounding Problems," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 11572–11579.
[16] W. Jeong, J. Yoon, E. Yang, and S. J. Hwang, "Federated Semi-supervised Learning with Inter-client Consistency," 2020, arXiv: 2006.12097.
[17] B. van Berlo, A. Saeed, and T. Ozcelebi, "Towards Federated Unsupervised Representation Learning," in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, 2020, pp. 31–36.
[18] Y. Zhao, H. Liu, H. Li, P. Barnaghi, and H. Haddadi, "Semi-supervised Federated Learning for Activity Recognition," 2021, arXiv: 2011.00851.
[19] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge Computing: Vision and Challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, Oct. 2016.
[20] J. Chen and X. Ran, "Deep Learning With Edge Computing: A Review," Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
[21] Y. Liu, X. Yuan, R. Zhao, Y. Zheng, and Y. Zheng, "RC-SSFL: Towards Robust and Communication-efficient Semi-supervised Federated Learning System," 2020, arXiv: 2012.04432.
[22] Z. Zhang, Z. Yao, Y. Yang, Y. Yan, J. E. Gonzalez, and M. W. Mahoney, "Benchmarking Semi-supervised Federated Learning," 2021, arXiv: 2008.11364.
[23] Z. Long, L. Che, Y. Wang, M. Ye, J. Luo, J. Wu, H. Xiao, and F. Ma, "FedSiam: Towards Adaptive Federated Semi-Supervised Learning," 2021, arXiv: 2012.03292.
[24] W. Zhang, X. Li, H. Ma, Z. Luo, and X. Li, "Federated Learning for Machinery Fault Diagnosis with Dynamic Validation and Self-supervision," Knowledge-Based Systems, vol. 213, p. 106679, 2021.
[25] Y. Kang, Y. Liu, and T. Chen, "FedMVT: Semi-supervised Vertical Federated Learning with MultiView Training," 2020, arXiv: 2008.10838.
[26] B. Wang, A. Li, H. Li, and Y. Chen, "GraphFL: A Federated Learning Framework for Semi-Supervised Node Classification on Graphs," 2020, arXiv: 2012.04187.
[27] D. Yang, Z. Xu, W. Li, A. Myronenko, H. R. Roth, S. Harmon, S. Xu, B. Turkbey, E. Turkbey, X. Wang et al., "Federated Semi-Supervised Learning for COVID Region Segmentation in Chest CT using Multi-National Data from China, Italy, Japan," Medical Image Analysis, vol. 70, p. 101992, 2021.
[28] A. Saeed, T. Ozcelebi, and J. Lukkien, "Multi-task Self-Supervised Learning for Human Activity Detection," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 2, Jun. 2019.
[29] A. Saeed, F. D. Salim, T. Ozcelebi, and J. Lukkien, "Federated Self-Supervised Learning of Multisensor Representations for Embedded Intelligence," IEEE Internet of Things Journal, vol. 8, no. 2, pp. 1030–1040, 2021.
[30] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and Open Problems in Federated Learning," 2021, arXiv: 1912.04977.
[31] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated Learning: Challenges, Methods, and Future Directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[32] V. Smith, C. K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated Multi-Task Learning," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[33] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated Learning with Non-IID Data," 2018, arXiv: 1806.00582.
[34] R. Li, F. Ma, W. Jiang, and J. Gao, "Online Federated Multitask Learning," in Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 215–220.
[35] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao, "FedHealth: A Federated Transfer Learning Framework for Wearable Healthcare," IEEE Intelligent Systems, vol. 35, no. 4, pp. 83–93, 2020.
[36] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, "Ensemble Distillation for Robust Model Fusion in Federated Learning," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 2351–2363.
[37] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in A Neural Network," 2015, arXiv: 1503.02531.
[38] P. Baldi, "Autoencoders, Unsupervised Learning and Deep Architectures," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, Bellevue, Washington, USA, 2012, pp. 37–49.
[39] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep Canonical Correlation Analysis," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013, pp. 1247–1255.
[40] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3128–3137.
[41] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, "Multimodal Machine Learning: A Survey and Taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2019.
[42] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, "The Opportunity Challenge: A Benchmark Database for On-Body Sensor-based Activity Recognition," Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
[43] N. Y. Hammerla, S. Halloran, and T. Plötz, "Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 1533–1540.
[44] O. Banos, R. Garcia, J. A. Holgado-Terriza, M. Damas, H. Pomares, I. Rojas, A. Saez, and C. Villalonga, "mHealthDroid: A Novel Framework For Agile Development of Mobile Health Applications," in Proceedings of the 6th International Work-Conference on Ambient Assisted Living and Daily Activities, 2014, pp. 91–98.
[45] B. Kwolek and M. Kepski, "Human Fall Detection on Embedded Platform Using Depth Maps and Wireless Accelerometer," Computer Methods and Programs in Biomedicine, vol. 117, no. 3, pp. 489–501, 2014.
[46] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised Learning of Video Representations Using LSTMs," in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 843–852.
[47] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[49] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[50] Y. Guan and T. Plötz, "Ensembles of Deep LSTM Learners for Activity Recognition using Wearables," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, pp. 1–28, Jun. 2017.
[51] M. Malekzadeh, R. G. Clegg, A. Cavallaro, and H. Haddadi, "DANA: Dimension-Adaptive Neural Architecture for Multivariate Sensor Data," 2020, arXiv: 2008.02397.