Review
Multimodal Federated Learning: A Survey
Liwei Che 1, Jiaqi Wang 1, Yao Zhou 2 and Fenglong Ma 1,*
Abstract: Federated learning (FL), which provides a collaborative training scheme for distributed data
sources with privacy concerns, has become a burgeoning and attractive research area. Most existing
FL studies focus on taking unimodal data, such as images and text, as the model input and on resolving the heterogeneity challenge, i.e., the non-identically distributed (non-IID) data challenge caused by imbalances in data labels and data quantity across clients. In real-world applications, data
are usually described by multiple modalities. However, to the best of our knowledge, only a handful
of studies have been conducted to improve system performance utilizing multimodal data. In this
survey paper, we identify the significance of this emerging research topic of multimodal federated
learning (MFL) and present a literature review on the state-of-the-art MFL methods. Furthermore, we
categorize multimodal federated learning into congruent and incongruent multimodal federated
learning based on whether all clients possess the same modal combinations. We investigate the
feasible application tasks and related benchmarks for MFL. Lastly, we summarize the promising
directions and fundamental challenges in this field for future research.
1. Introduction
In various real-world scenarios, data are usually collected and stored in a distributed and privacy-sensitive manner—for instance, multimedia data on personal smartphones, sensory data from various vehicles, and examination data and diagnostic records of patients across different hospitals. The significant volume of sensitive yet multimodal data being collected and shared has heightened people's concerns regarding privacy protection. Consequently, there has been an emergence of increasingly stringent data regulation policies, such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States. These regulations have given rise to challenges in data collaboration and have raised privacy concerns for traditional centralized multimodal machine learning approaches [1].

To address these data privacy concerns, a novel paradigm called federated learning (FL) [2] has been introduced. This approach enables distributed clients to collaboratively train a high-performing global model without sharing their local data, effectively preventing privacy leakage through data transmission. However, the majority of previous works have focused on the unimodal setting, where all the clients in the federated system hold the same data modality, as shown in Figure 1 (left). Among these studies, statistical heterogeneity [3], i.e., the non-IID challenge, caused by the skew of labels, features, and data quantity among clients, is one of the most critical challenges and has attracted much attention [4–8]. In contrast, multimodal federated learning, as shown in Figure 1 (right), further introduces the modality heterogeneity challenge, which leads to significant differences in model structures, local tasks, and parameter spaces among clients, thereby exposing the substantial limitations of traditional unimodal algorithms.
Federated systems trained with multimodal data are intuitively more powerful and
insightful compared to unimodal ones [1]. We define the modality types held by the
clients as their modality combinations, which determine the local tasks they perform. If
two clients hold the same or similar modality combinations (e.g., both image and text
data), they have a smaller semantic gap and task gap. In other words, the more congruent the clients' modality combinations are, the less heterogeneous the modality distribution of the system is.
Based on the congruence of modality distribution, MFL can be divided into two
categories: congruent MFL and incongruent MFL, as depicted in Figure 2. In congruent
MFL, the clients hold similar or the same local modality combinations, and horizontal FL is
the typical setting of this type. The majority of existing MFL work [9–12] has also focused
on this federated setting, where all the clients hold the same input modality categories and
feature space but differ as to the sample space. In [10], the authors proposed a multimodal
federated learning framework for multimodal activity recognition with an early fusion
approach via local co-attention. The authors in [12] provided a detailed analysis of the
convergence problem of MFL with late fusion methods under the non-IID setting. In the
healthcare domain [13–15], congruent MFL has shown great application value by providing
diagnosis assistance with distributed digital health data.
For incongruent MFL, the clients usually hold unique or partially overlapped data
modality combinations, which makes the federated optimization and model aggregation
more challenging. This category contains vertical multimodal federated learning (VMFL),
multimodal federated transfer learning (MFTL), and hybrid multimodal federated learning
(hybrid MFL). In VMFL, the clients hold different input modalities and feature spaces,
but all the data samples are in the same space. In [16], the authors assumed that each
client only held one specific modality and, correspondingly, proposed FDARN, a five-
module framework, for cross-modal federated human activity recognition (CMF-HAR). For
MFTL, the clients mainly differ as to feature spaces (e.g., photographic images and cartoon
images) and sample ID spaces. For instance, in [17], the authors proposed a fine-grained
representation block named aimNet. They evaluated their methods under different FL
settings, including the transfer setting between two different vision–language tasks.
Hybrid MFL is a more challenging setting, where the data relationships among the
clients cannot be appropriately described by any of the above three settings alone. The
clients in a hybrid setting can hold different local data, varying in terms of both modality
categories and quantities. Given $M$ modalities in a federated system, the number of theoretically possible client types is $2^M - 1$, including both unimodal and multimodal clients. Ref. [18] discussed a
significant challenge for hybrid MFL, i.e., modality incongruity, where the unique modality
combination among the clients enlarges the heterogeneity. They proposed FedMSplit for
multitask learning in the hybrid MFL setting, with a graph-based attention mechanism to
extract the client relationship for aggregation.
Based on our observation of the increasing interest among researchers in exploring
the challenges of multimodal data in FL [10,11,15,16,19,20], multimodal federated learning
has emerged as a promising and practical topic with numerous application scenarios.
However, much of the research in this area has been conducted in customized multimodal
federated learning scenarios, lacking categorization and standardization. The diverse and
varied nature of this field emphasizes the need for a systematic investigation and study on
multimodal federated learning topics. Therefore, we present our perspective on exploring
multimodal data in federated learning and outline our contributions below:
• We conducted a comprehensive literature review on existing multimodal federated
learning research, leading to the formal definition of multimodal federated learning
(MFL). We also introduced essential concepts like modality combination and modality
heterogeneity, which distinguish MFL from traditional FL.
• To enhance the clarity and organization of the field, we classified existing MFL work
into four categories: horizontal, vertical, transfer, and hybrid MFL. By expanding upon
traditional unimodal federated learning, this categorization provides a structured
framework for the design and development of subsequent MFL research, facilitating
method comparison and advancement.
• Given the current lack of well-defined evaluation benchmarks for MFL, we thoroughly
examined feasible application scenarios of MFL and surveyed relevant and suitable
open-source benchmark datasets that can serve as valuable resources for researchers
in this domain.
• We identified and summarized significant challenges and potential research directions
in MFL, shedding light on unique issues such as modality heterogeneity and missing
modalities. These insights offer valuable guidance for future research and innovation
in the field of multimodal federated learning.
The rest of this paper is organized as follows. We introduce the methodology used to
conduct the literature review in Section 2. In Section 3, we summarize the three popular
aspects for mitigating the statistical heterogeneity in unimodal federated learning systems.
In Section 4, we present preliminaries and a formal definition of multimodal federated
learning. In Section 5, we categorize multimodal federated learning into four types based
on the input modalities of the clients. We introduce the common tasks and benchmarks
for MFL in Section 6 and Section 7, respectively. Section 8 identifies the challenges and
promising research directions, as well as the potential application scenarios.
2. Methodology
The exploration of multimodal data in federated learning is still in its nascent stage.
Below, we introduce the process we followed to collect and analyze the related papers.
and global cluster models. PerFedAvg [27] adapted meta-learning into the FL framework,
where it treated the global model as a meta-model to provide a few-shot adaptation for each
client. In [28], the authors added Moreau envelopes as a regularization term in the local loss
functions to help achieve personalized model optimization. In [29], the authors reassem-
bled models and selected the most fitted personalized models for clients by calculating
the similarity.
$$\mathcal{D}_k = \{(x_k^{m_1}, x_k^{m_2}, \ldots, x_k^{m_{M_k}}, y_k)_i\}_{i=1}^{|\mathcal{D}_k|}, \quad (1)$$

where $x_k^{m}$ represents a data sample of modality $m$ in client $k$. The $i$-th data sample of the $k$-th local dataset is $X_k(i) = (x_k^{m_1}, x_k^{m_2}, \ldots, x_k^{m_{M_k}})_i$. The modality combination of this local set is defined as $\mathcal{X}_k = (m_1, m_2, \ldots, m_{M_k})$. As an example, for client $a$ containing both image and text data, its modality combination is $\mathcal{X}_a = (\text{image}, \text{text})$, and its $i$-th local data sample is $X_a(i) = (x_a^{\text{image}}, x_a^{\text{text}})_i$. Therefore, its modality number $M_a$ is 2.
In a communication round $t$, the local model $\theta_k^t$ of client $k$ can be updated by a local training process via stochastic gradient descent (SGD):

$$\theta_k^{t+1} = \theta_k^t - \mu \nabla_{\theta} \mathcal{L}_k(X_k; \theta_k^t), \quad (2)$$

where $\mu$ is the learning rate of the local training process; $X_k$ is the corresponding local multimodal data; $\mathcal{L}_k$ represents the total loss function of client $k$ with multimodal input data $X_k$; and $\theta_k^t$ denotes the parameters of the local model of client $k$ at communication round $t$.
Multiple modalities can make different contributions to the final loss affected by the
problem context, data quality, and downstream tasks. For instance, in an image–text pair
classification task, we may set a higher weight for the loss computed from image data and
a lower one for text data. Therefore, given the input $X_k(i)$, the total loss $\mathcal{L}_k$ is defined as

$$\mathcal{L}_k(X_k(i), \theta_k^t, y_k(i)) = \sum_{j=1}^{M_k} \phi_k^{m_j}\, l_k^{m_j}\big(C_k(x_k^{m_j}; \theta_k^t), y_k(i)\big). \quad (3)$$

Here, $\phi_k^{m_j}$ represents the weight of modality $m_j$ in the sum; $C_k$ is the local model of client $k$; $l_k^{m_j}$ is the loss function for modality $m_j$; and $x_k^{m_j}$ is the input data of modality $m_j$.
Accordingly, we define the local training target as follows:

$$f_k = \frac{1}{|\mathcal{D}_k|} \sum_{i=1}^{|\mathcal{D}_k|} \mathcal{L}_k(X_k(i), \theta_k, y_k(i)). \quad (4)$$

The global optimization objective of the federated system is then

$$\min_{\theta_G} F(\theta_G) = \sum_{k=1}^{K} \omega_k f_k(\theta_k), \quad (5)$$

where $\theta_G$ denotes the global model parameters; $\omega_k$ is the global aggregation weight for client $k$; and $K$ is the total number of clients.
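To make the above formulation concrete, the following is a minimal NumPy sketch of one MFL communication round covering Equations (2)–(5), with a toy per-modality linear model standing in for the local model $C_k$; all names (e.g., `make_client`, `local_update`, `PHI`) are illustrative and do not come from any specific MFL system.

```python
# A minimal NumPy sketch of one MFL communication round following Eqs. (2)-(5).
# A toy per-modality linear model stands in for C_k; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
DIM, MODALITIES = 8, ("image", "text")
PHI = {"image": 0.7, "text": 0.3}   # modality loss weights phi_k^{m_j} in Eq. (3)
MU = 0.05                           # local learning rate mu in Eq. (2)

def make_client(n_samples):
    """Local dataset D_k = {(x^{m_1}, ..., x^{m_M}, y)_i}, as in Eq. (1)."""
    x = {m: rng.normal(size=(n_samples, DIM)) for m in MODALITIES}
    y = rng.normal(size=n_samples)
    return {"x": x, "y": y}

def local_loss(theta, client):
    """Weighted multimodal loss L_k of Eq. (3), averaged over samples as in Eq. (4)."""
    total = 0.0
    for m in MODALITIES:
        pred = client["x"][m] @ theta[m]              # C_k(x^{m}; theta)
        total += PHI[m] * np.mean((pred - client["y"]) ** 2)
    return total

def local_update(theta, client, steps=5):
    """A few local SGD steps on the weighted loss, Eq. (2)."""
    theta = {m: w.copy() for m, w in theta.items()}
    for _ in range(steps):
        for m in MODALITIES:
            err = client["x"][m] @ theta[m] - client["y"]
            grad = PHI[m] * 2.0 * client["x"][m].T @ err / len(err)
            theta[m] -= MU * grad
    return theta

def aggregate(local_thetas, weights):
    """Server-side weighted aggregation, Eq. (5): theta_G = sum_k omega_k * theta_k."""
    return {m: sum(w * th[m] for w, th in zip(weights, local_thetas))
            for m in MODALITIES}

clients = [make_client(n) for n in (40, 60, 100)]
omega = np.array([len(c["y"]) for c in clients], dtype=float)
omega /= omega.sum()                                  # omega_k proportional to |D_k|

theta_global = {m: np.zeros(DIM) for m in MODALITIES}
for rnd in range(3):                                  # a few communication rounds
    local_thetas = [local_update(theta_global, c) for c in clients]
    theta_global = aggregate(local_thetas, omega)
    print(rnd, [round(local_loss(theta_global, c), 3) for c in clients])
```

In this sketch, the aggregation weights $\omega_k$ are simply set proportional to local dataset sizes, mirroring the FedAvg-style weighting of [2]; other weighting schemes discussed in this survey can be substituted without changing the structure of the round.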
Definition 1 (Horizontal Multimodal Federated Learning). Given a client set $\mathcal{N}$ and modality set $\mathcal{M}$ in a federated system, the system is called horizontal multimodal federated learning if, for all $a, b \in \mathcal{N}$, the clients hold the same modality set, i.e., $|M_a| = |M_b|$ and $\mathcal{X}_a = \mathcal{X}_b$. Here, $|M_k|$ denotes the total number of modality types for client $k$, and $\mathcal{X}_k$ is the modality combination set.
For instance, as shown in Figure 4 (left), two mobile users $a$ and $b$ with the same APP usage patterns can hold both image and text data (denoted as the $x^{\text{image}}$ and $x^{\text{text}}$ modalities) on their devices. With the same data modalities locally, the two clients have inputs that are the same in terms of the modality combination but different in terms of the sample IDs, mathematically defined as follows:
$$X_a = \{(x_a^{\text{image}}, x_a^{\text{text}}, y_a)_i\}_{i=1}^{|\mathcal{D}_a|}, \quad X_b = \{(x_b^{\text{image}}, x_b^{\text{text}}, y_b)_j\}_{j=1}^{|\mathcal{D}_b|}, \quad (6)$$

where $(x_a^{\text{image}}, x_a^{\text{text}}, y_a)_i$ denotes the $i$-th data sample of user $a$ with two modalities, image and text, and the corresponding data label $y_a$, and $|\mathcal{D}_a|$ represents the number of data samples.
Figure 4. Illustration of horizontal multimodal federated learning and vertical multimodal federated
learning. (Left): horizontal multimodal federated learning involving two clients. Both hold image
and text data. (Right): the vertical multimodal federated learning example includes two clients with
exclusive modalities. Client a has audio and video data, while client b holds heart rate and acceleration
sensor data.
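As a concrete illustration of Definition 1 and Equation (6), the snippet below simulates a horizontal partition in which every client receives the same (image, text) modality combination but a disjoint set of sample IDs; the data pool, dimensions, and field names are hypothetical and only serve to make the setting explicit.

```python
# Hypothetical sketch of a horizontal MFL partition (Definition 1, Eq. (6)):
# identical modality combinations, disjoint sample IDs per client.
import numpy as np

rng = np.random.default_rng(0)
N, IMG_DIM, TXT_DIM, NUM_CLIENTS = 300, 64, 32, 3

# A centrally simulated multimodal pool, used only to build the toy partition.
pool = {
    "image": rng.normal(size=(N, IMG_DIM)),
    "text": rng.normal(size=(N, TXT_DIM)),
    "label": rng.integers(0, 5, size=N),
}

sample_ids = rng.permutation(N)
shards = np.array_split(sample_ids, NUM_CLIENTS)        # disjoint sample ID sets

clients = []
for ids in shards:
    clients.append({
        "modality_combination": ("image", "text"),       # identical across clients
        "x": {m: pool[m][ids] for m in ("image", "text")},
        "y": pool["label"][ids],
        "sample_ids": ids,                               # D_a, D_b, ... are disjoint
    })

# The two assertions encode the horizontal setting: same modality combination,
# non-overlapping sample IDs.
assert all(c["modality_combination"] == clients[0]["modality_combination"] for c in clients)
assert not set(clients[0]["sample_ids"]) & set(clients[1]["sample_ids"])
```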
the feature extractor to support the downstream classifier on the server side. The authors
also validated the effectiveness of their method in the missing modality challenge, where
some clients only have certain shared data modalities in the horizontal federation. In [9],
the authors used an ensemble of local and global models to reduce both data variance and
device variance in the federated system.
Definition 2 (Vertical Multimodal Federated Learning). Given a client set $\mathcal{N}$ and modality set $\mathcal{M}$ in a federated system, the system is defined as vertical multimodal federated learning if, for all $a, b \in \mathcal{N}$, the clients hold totally different modality combinations while being connected by sample IDs, i.e., $\mathcal{X}_a \cap \mathcal{X}_b = \emptyset$ and $\mathcal{D}_a = \mathcal{D}_b$.
For instance, in the human activity recognition task, a user may own multiple devices
that collect different data modalities due to the divergence of the sensor category, as shown
in Figure 4 (right). In a two-device case, the local datasets of the devices could be defined as:
$$X_a = \{(x_a^{\text{video}}, x_a^{\text{audio}}, y_a)_i\}_{i=1}^{|\mathcal{D}|}, \quad X_b = \{(x_b^{\text{heart\_rate}}, x_b^{\text{acceleration}}, y_b)_i\}_{i=1}^{|\mathcal{D}|}, \quad (7)$$

where client $a$ holds the video and audio modalities, while the heart rate sensor and acceleration sensor modalities are held by client $b$. Unlike the horizontal scenario, the two clients share the same sample ID set $\mathcal{D}$.
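Mirroring Definition 2 and Equation (7), the hypothetical sketch below shows two devices that hold disjoint modality combinations yet are linked through one shared sample ID set, so the features describing a single activity instance can be joined across clients; all names and dimensions are illustrative.

```python
# Hypothetical sketch of a vertical MFL partition (Definition 2, Eq. (7)):
# disjoint modalities per client, one shared sample ID set D.
import numpy as np

rng = np.random.default_rng(0)
shared_ids = np.arange(200)                              # the common sample ID set D

client_a = {                                             # camera/microphone device
    "modality_combination": ("video", "audio"),
    "x": {"video": rng.normal(size=(200, 128)),
          "audio": rng.normal(size=(200, 40))},
    "sample_ids": shared_ids,
}
client_b = {                                             # body sensor device
    "modality_combination": ("heart_rate", "acceleration"),
    "x": {"heart_rate": rng.normal(size=(200, 1)),
          "acceleration": rng.normal(size=(200, 3))},
    "sample_ids": shared_ids,
}

# Modalities are disjoint while sample IDs are identical -- the vertical setting.
assert not set(client_a["modality_combination"]) & set(client_b["modality_combination"])
assert np.array_equal(client_a["sample_ids"], client_b["sample_ids"])

# The i-th activity instance is jointly described by both clients' features,
# e.g., the per-modality feature views for sample i = 7:
i = 7
views = {m: x[i] for c in (client_a, client_b) for m, x in c["x"].items()}
```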
In [16], the authors proposed the feature-disentangled activity recognition network
(FDARN) for the cross-modal federated human activity recognition task. With five ad-
versarial training modules, the proposed method captured both the modality-agnostic
features and modality-specific discriminative characteristics of each client to achieve better
performance than existing personalized federated learning methods. Notably, each client
held a single modality dataset that could differ from group to group in their experiments.
Definition 3 (Multimodal Federated Transfer Learning). Given a client set $\mathcal{N}$ and modality set $\mathcal{M}$ in a federated system, the system is defined as multimodal federated transfer learning if, for all $a, b \in \mathcal{N}$, the clients hold different modality combinations and sample IDs, i.e., $\mathcal{X}_a \cap \mathcal{X}_b = \emptyset$ and $\mathcal{D}_a \neq \mathcal{D}_b$.
For example, consider two medical institutions whose local datasets cover different patient cohorts and different imaging modalities:

$$X_a = \{(x_a^{\text{MRI}}, x_a^{\text{PET}}, y_a)_i\}_{i=1}^{|\mathcal{D}_a|}, \quad X_b = \{(x_b^{\text{MRI}}, x_b^{\text{CT}}, y_b)_j\}_{j=1}^{|\mathcal{D}_b|}, \quad (8)$$

where the two clients differ in terms of both local data modalities and sample ID sets. However, since CT, MRI, and PET scans are all medical imaging techniques for diagnosis, the rich knowledge and model advantages could be shared between the clients, forming a typical multimodal federated transfer learning setting.
Liu et al. [17] proposed aimNet to generate fine-grained image representations and improve performance on various vision–language grounding problems under federated settings. They validated their method in horizontal, vertical, and transfer multimodal federated learning settings to demonstrate its superiority.
Definition 4 (Hybrid Multimodal Federated Learning). Given a client set $\mathcal{N}$ and modality set $\mathcal{M}$ in a federated system, the system is defined as hybrid multimodal federated learning if there exist at least two of the basic relationships (horizontal, vertical, or transfer) among the clients, or both unimodal and multimodal clients. There are at most $2^M - 1$ types of clients in a hybrid federated system.
In Figure 5 (right), we may take the mental health prediction task at the beginning of the section as an example of hybrid MFL, where three mobile users share a horizontally related screen time (ST) modality and each hold one additional unique data modality. The input modalities of this example are

$$X_a = \{(x_a^{\text{ST}}, x_a^{\text{image}}, y_a)_i\}_{i=1}^{|\mathcal{D}_a|}, \quad X_b = \{(x_b^{\text{ST}}, x_b^{\text{video}}, y_b)_j\}_{j=1}^{|\mathcal{D}_b|}, \quad X_c = \{(x_c^{\text{ST}}, x_c^{\text{audio}}, y_c)_k\}_{k=1}^{|\mathcal{D}_c|}. \quad (9)$$
The client category can vary in a hybrid setting. In a bimodal federated system, there
could be three kinds of clients in total; for a trimodal federated system, the number of
client categories could rise to seven depending on the different modality numbers and
combinations. The relationships among the clients in a hybrid MFL system could be
described at the modality level.
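The count of client categories follows directly from enumerating all non-empty subsets of the modality set; the short snippet below, purely illustrative, lists the $2^3 - 1 = 7$ client types of a trimodal system, with arbitrary placeholder modality names.

```python
# Enumerate the 2^M - 1 possible modality combinations (client types) in a
# hybrid MFL system; for M = 3 modalities this yields 7 client categories.
from itertools import combinations

modalities = ("screen_time", "video", "audio")           # M = 3, names are placeholders
client_types = [c for r in range(1, len(modalities) + 1)
                for c in combinations(modalities, r)]

print(len(client_types))   # 7 = 2**3 - 1
for combo in client_types:
    kind = "unimodal" if len(combo) == 1 else "multimodal"
    print(kind, combo)
```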
Chen and Zhang in [18] proposed FedMSplit, a dynamic and multiview graph struc-
ture aiming to solve the modality incongruity challenges in a hybrid MFL setting. The
novel modality incongruity problem in MFL is a significant challenge within the scope of
hybrid MFL. In [39], the authors proposed a general multimodal model that worked on both
multitask and transfer learning for high-modality (a large set of diverse modalities) and
partially observable (each task is only defined on a small subset of modalities) scenarios.
6.4. Healthcare
Numerous healthcare centers and hospitals have accumulated vast amounts of mul-
timodal data during patient consultations and treatments, including X-ray images, CT
scans, physician diagnoses, and physiological measurements of patients. These multi-
modal data are typically tightly linked to patient identifiers and require stringent privacy
protection measures. As a result, these healthcare institutions have formed isolated data
islands, impeding direct collaboration in terms of co-operative training and data sharing
through open databases. This presents a series of crucial challenges within the realm of
multimodal federated learning, encompassing tasks such as AI-assisted diagnosis, medical
image analysis, and laboratory report generation.
Some works in the field of healthcare have explored multimodal federated learning,
often assuming that all institutions have the same set of modalities, referred to as horizontal
MFL, or that each institution possesses only a single modality, known as vertical MFL.
Agbley et al. [14] applied federated learning for the prediction of melanoma and obtained a performance level that was on par with the centralized training results. FedNorm [15]
performed modality-based normalization techniques to enhance liver segmentation and
was trained with unimodal clients holding CT and MRI data, respectively. Qayyum
et al. utilized cluster federated learning for the automatic diagnosis of COVID-19 [13].
Each cluster contained healthcare entities that held the same modality, such as X-ray and
ultrasound data.
Caltech-UCSD Birds-200-2011 (CUB-200-2011). CUB-200-2011 [41] is a fine-grained classification dataset containing 11,788 images of 200 bird categories, with each image annotated with visual attributes. Reed et al. expanded the dataset by providing ten fine-grained text description
sentences for each image [42]. The sentences were collected through the Amazon Mechani-
cal Turk (AMT) platform and had a minimum length of 10 words, without exposing the
label and action information.
Oxford 102 Flower (102 Category Flower Dataset). Oxford 102 Flower [43] is a fine-
grained classification dataset comprising 102 categories of flowers that commonly occur in
the United Kingdom. Each category contains 40 to 258 images. There are 10 text descriptions
for each image.
UPMC Food-101. Food-101 [44] is a noisy multimodal classification dataset that
contains both images and paired captions of 101 food categories. Each category has
750 training and 250 testing images. There are a total of 101,000 images, each paired with
one caption. However, the labels and captions of the training set contain some noise and
may leak the label information. The testing set has been manually cleaned.
Microsoft Common Objects in Context (MS COCO). The MS COCO dataset [45] comprises a total of 328 K images and is used for various tasks such as object detection, segmentation, key-point detection, captioning, stuff image segmentation, panoptic segmentation, and dense pose estimation. The dataset provides detailed annotations for each of these tasks, including bounding boxes and segmentation masks for object detection. Note that the dense pose annotations are only available for training and validation images, totaling more than 39,000 images and 56,000 person instances.
Flickr30k. The Flickr30k dataset [46] comprises 31,000 images sourced from Flickr, accompanied by five reference sentences per image generated by human annotators. Additionally, the associated image caption corpus consists of 158,915 crowd-sourced captions describing 31,783 images. This updated collection of images and captions primarily focuses on individuals participating in routine activities and events.
disorders. ADNI consists of clinical, genetic, imaging, and biomarker data gathered from
participants across multiple sites in the United States and Canada. The dataset includes var-
ious modalities such as magnetic resonance imaging (MRI), positron emission tomography
(PET), cerebrospinal fluid (CSF) biomarkers, and cognitive assessments.
8. Discussion
We introduce the potential directions and challenges of multimodal federated learning
in this section. These challenges are not mutually exclusive; rather, they are all rooted in one core factor: the data modality distribution in the federated learning system.
the unification of the embedding operation among all the clients difficult. A client maps
the original multimodal data onto embedding representations, which all exist in its unique
common subspace. Due to the modality heterogeneity and the different local model tasks,
these common subspaces differ from each other, resulting in the task gaps that are difficult
to bridge. For example, unimodal clients and multimodal clients could hold totally different
parameter spaces and work on different feature spaces. Even if the clients hold the same
modality combinations, the specific local tasks can be different, such as visual question
answering and image captioning.
To solve this challenge, a new aggregation paradigm is necessary. The modality heterogeneity could result in more divergent gradients and even heterogeneous local model architectures. As Wang et al. [62] showed, different modalities overfit and generalize at different rates, so a one-size-fits-all global training strategy for MFL might not work, since optimizing all the clients jointly can produce suboptimal results.
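One direction compatible with this observation is to aggregate parameters block-wise per modality encoder, so that each block is averaged only over the clients that actually own the corresponding modality. The sketch below is a minimal, hypothetical illustration of such modality-wise averaging; it is not a reproduction of FedMSplit or any other published MFL aggregation method, and the weighting by local dataset size is only one possible choice.

```python
# A minimal, hypothetical sketch of modality-wise (block-wise) aggregation:
# each client contributes only the encoder blocks for modalities it owns, and
# the server averages each block over the clients that share it.
import numpy as np

def aggregate_by_modality(client_updates, client_sizes):
    """client_updates: list of dicts {modality_name: parameter_array}."""
    aggregated = {}
    for modality in {m for upd in client_updates for m in upd}:
        owners = [(upd[modality], n) for upd, n in zip(client_updates, client_sizes)
                  if modality in upd]
        total = sum(n for _, n in owners)
        aggregated[modality] = sum(p * (n / total) for p, n in owners)
    return aggregated

# Three heterogeneous clients: two bimodal, one unimodal.
updates = [
    {"image": np.ones(4), "text": np.ones(4) * 2},
    {"image": np.ones(4) * 3},
    {"text": np.ones(4) * 4, "audio": np.ones(4) * 5},
]
sizes = [100, 50, 200]
print(aggregate_by_modality(updates, sizes))
# "image" is averaged over clients 0 and 1 only; "audio" keeps client 2's block.
```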
9. Conclusions
In this paper, we delved into the promising research area of multimodal federated
learning (MFL). We provided an introduction to existing MFL methods and discussed
the motivations behind leveraging distributed multimodal data. Recognizing that many
studies in this domain have proposed customized scenario settings, we took the initiative
to formally define multimodal federated learning and categorize and organize existing
works. Our aim was to establish standards for subsequent research and foster a coherent
and structured approach in this evolving field. Addressing the lack of evaluation and
benchmarking, we refined several representative MFL application scenarios and identi-
fied relevant datasets. These efforts will allow the research community to compare and
analyze task performance, ultimately promoting advancements in MFL. Moreover, we
emphasized the core issue of modality heterogeneity, which presents unique challenges
to MFL, including dealing with missing modalities and the deployment of pre-trained
models. Additionally, traditional privacy protection and data heterogeneity have become
more complex in MFL. By highlighting these challenges, we sought to raise awareness
among researchers and encourage innovative solutions. Overall, this survey paper pro-
vides preliminary summaries and explorations that can significantly contribute to a better
understanding of the importance and uniqueness of the MFL field. We hope our insights
will serve as valuable guidance for researchers and inspire further development in this
promising area of research.
Author Contributions: Conceptualization, L.C. and F.M.; methodology, L.C.; software, L.C.;
validation, L.C., J.W., Y.Z. and F.M.; formal analysis, L.C.; investigation, L.C.; resources, L.C.;
writing—original draft preparation, L.C.; writing—review and editing, J.W., Y.Z. and F.M.; visu-
alization, L.C.; supervision, F.M.; project administration, F.M. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: We sincerely thank all anonymous reviewers for their valuable comments.
References
1. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach.
Intell. 2018, 41, 423–443. [CrossRef] [PubMed]
2. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from
decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017;
pp. 1273–1282.
3. Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390. [CrossRef]
4. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582.
[CrossRef]
5. Sattler, F.; Wiedemann, S.; Müller, K.R.; Samek, W. Robust and communication-efficient federated learning from non-iid data.
IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3400–3413. [CrossRef] [PubMed]
6. Wang, H.; Kaplan, Z.; Niu, D.; Li, B. Optimizing federated learning on non-iid data with reinforcement learning. In Proceedings of
the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; IEEE: Piscataway,
NJ, USA, 2020, pp. 1698–1707.
7. Wang, J.; Zeng, S.; Long, Z.; Wang, Y.; Xiao, H.; Ma, F. Knowledge-Enhanced Semi-Supervised Federated Learning for Aggregating
Heterogeneous Lightweight Clients in IoT. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM),
Minneapolis, MN, USA, 27–29 April 2023; SIAM: Philadelphia, PA, USA, 2023; pp. 496–504.
8. Wang, J.; Qian, C.; Cui, S.; Glass, L.; Ma, F. Towards federated COVID-19 vaccine side effect prediction. In Proceedings of the
Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September
2022; Springer: Berlin/Heidelberg, Germany, 2022, pp. 437–452.
9. Liang, P.P.; Liu, T.; Ziyin, L.; Allen, N.B.; Auerbach, R.P.; Brent, D.; Salakhutdinov, R.; Morency, L.P. Think locally, act globally:
Federated learning with local and global representations. arXiv 2020, arXiv:2001.01523.
10. Xiong, B.; Yang, X.; Qi, F.; Xu, C. A unified framework for multi-modal federated learning. Neurocomputing 2022, 480, 110–118.
[CrossRef]
11. Zong, L.; Xie, Q.; Zhou, J.; Wu, P.; Zhang, X.; Xu, B. FedCMR: Federated Cross-Modal Retrieval. In Proceedings of the 44th
International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July
2021; pp. 1672–1676.
12. Chen, S.; Li, B. Towards Optimal Multi-Modal Federated Learning on Non-IID Data with Hierarchical Gradient Blending. In
Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; IEEE:
Piscataway, NJ, USA, 2022; pp. 1469–1478.
13. Qayyum, A.; Ahmad, K.; Ahsan, M.A.; Al-Fuqaha, A.; Qadir, J. Collaborative federated learning for healthcare: Multi-modal
covid-19 diagnosis at the edge. arXiv 2021, arXiv:2101.07511.
14. Agbley, B.L.Y.; Li, J.; Haq, A.U.; Bankas, E.K.; Ahmad, S.; Agyemang, I.O.; Kulevome, D.; Ndiaye, W.D.; Cobbinah, B.; Latipova, S.
Multimodal melanoma detection with federated learning. In Proceedings of the 2021 18th International Computer Conference on
Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 17–19 December 2021; IEEE:
Piscataway, NJ, USA, 2021, pp. 238–244.
15. Bernecker, T.; Peters, A.; Schlett, C.L.; Bamberg, F.; Theis, F.; Rueckert, D.; Weiß, J.; Albarqouni, S. FedNorm: Modality-Based
Normalization in Federated Learning for Multi-Modal Liver Segmentation. arXiv 2022, arXiv:2205.11096.
16. Yang, X.; Xiong, B.; Huang, Y.; Xu, C. Cross-Modal Federated Human Activity Recognition via Modality-Agnostic and Modality-
Specific Representation Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1
March 2022; Volume 36, pp. 3063–3071 .
17. Liu, F.; Wu, X.; Ge, S.; Fan, W.; Zou, Y. Federated learning for vision-and-language grounding problems. In Proceedings of the
AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11572–11579.
18. Chen, J.; Zhang, A. FedMSplit: Correlation-Adaptive Federated Multi-Task Learning across Multimodal Split Networks. In
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14–18
August 2022; pp. 87–96. [CrossRef]
19. Zhao, H.; Du, W.; Li, F.; Li, P.; Liu, G. FedPrompt: Communication-Efficient and Privacy-Preserving Prompt Tuning in Federated
Learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [CrossRef]
20. Chen, Y.; Hsu, C.F.; Tsai, C.C.; Hsu, C.H. HPFL: Federated Learning by Fusing Multiple Sensor Modalities with Heterogeneous
Privacy Sensitivity Levels. In Proceedings of the 1st International Workshop on Methodologies for Multimedia, Lisboa, Portugal,
14 October 2022; pp. 5–14. [CrossRef]
21. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:1907.02189.
22. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc.
Mach. Learn. Syst. 2020, 2, 429–450. [CrossRef]
23. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. Scaffold: Stochastic controlled averaging for feder-
ated learning. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020;
pp. 5132–5143.
24. Zhou, Y.; Wu, J.; Wang, H.; He, J. Adversarial Robustness through Bias Variance Decomposition: A New Perspective for Federated
Learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA,
USA, 17–21 October 2022; ACM: New York, NY, USA, 2022; pp. 2753–2762.
25. Tan, A.Z.; Yu, H.; Cui, L.; Yang, Q. Towards personalized federated learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–17.
[CrossRef] [PubMed]
26. Ruan, Y.; Joe-Wong, C. Fedsoft: Soft clustered federated learning with proximal local updating. In Proceedings of the AAAI
Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 8124–8131.
27. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized federated learning: A meta-learning approach. arXiv 2020, arXiv:2002.07948.
28. T Dinh, C.; Tran, N.; Nguyen, J. Personalized federated learning with moreau envelopes. Adv. Neural Inf. Process. Syst. 2020,
33, 21394–21405.
29. Wang, J.; Cui, S.; Ma, F. FedLEGO: Enabling Heterogenous Model Cooperation via Brick Reassembly in Federated Learning. In
Proceedings of the International Workshop on Federated Learning for Distributed Data Mining, Long Beach, CA, USA, 7 August
2023.
30. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. Adv. Neural Inf. Process. Syst. 2017, 30,
4427–4437.
31. Corinzia, L.; Beuret, A.; Buhmann, J.M. Variational federated multi-task learning. arXiv 2019, arXiv:1906.06268.
32. Marfoq, O.; Neglia, G.; Bellet, A.; Kameni, L.; Vidal, R. Federated Multi-Task Learning under a Mixture of Distributions. In
Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Ranzato, M., Beygelzimer,
A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15434–15447.
33. Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. TIST
2019, 10, 1–19. [CrossRef]
34. Zhao, Y.; Barnaghi, P.; Haddadi, H. Multimodal Federated Learning on IoT Data. In Proceedings of the 2022 IEEE/ACM
Seventh International Conference on Internet-of-Things Design and Implementation (IoTDI), Milano, Italy, 4–6 May 2022; IEEE:
Piscataway, NJ, USA, 2022, pp. 43–54.
35. Guo, T.; Guo, S.; Wang, J. pFedPrompt: Learning Personalized Prompt for Vision-Language Models in Federated Learning. In
Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1364–1374.
36. Zhang, R.; Chi, X.; Liu, G.; Zhang, W.; Du, Y.; Wang, F. Unimodal Training-Multimodal Prediction: Cross-modal Federated
Learning with Hierarchical Aggregation. arXiv 2023, arXiv:2303.15486.
37. Yu, Q.; Liu, Y.; Wang, Y.; Xu, K.; Liu, J. Multimodal Federated Learning via Contrastive Representation Ensemble. In Proceedings
of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
38. Lu, W.; Hu, X.; Wang, J.; Xie, X. FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning. arXiv 2023,
arXiv:2302.13485.
39. Liang, P.P.; Lyu, Y.; Fan, X.; Mo, S.; Yogatama, D.; Morency, L.P.; Salakhutdinov, R. HighMMT: Towards Modality and Task
Generalization for High-Modality Representation Learning. arXiv 2022, arXiv:2203.01311. https://doi.org/10.48550/arXiv.2203.01311.
40. Liang, P.P.; Liu, T.; Cai, A.; Muszynski, M.; Ishii, R.; Allen, N.; Auerbach, R.; Brent, D.; Salakhutdinov, R.; Morency, L.P. Learning
language and multimodal privacy-preserving markers of mood from mobile data. arXiv 2021, arXiv:2106.13213.
41. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-
2011-001; California Institute of Technology: Pasadena, CA, USA, 2011.
42. Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning deep representations of fine-grained visual descriptions. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 49–58.
43. Nilsback, M.E.; Zisserman, A. Automated Flower Classification over a Large Number of Classes. In Proceedings of the Indian
Conference on Computer Vision, Graphics and Image Processing, Bhubaneswar, India, 16–19 December 2008.
44. Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101—Mining Discriminative Components with Random Forests. In Proceedings
of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
45. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft
COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
46. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for
semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [CrossRef]
47. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity
understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [CrossRef]
48. Damen, D.; Doughty, H.; Farinella, G.M.; Furnari, A.; Ma, J.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al.
Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. Int. J. Comput. Vis. IJCV 2022,
130, 33–55. [CrossRef]
49. Nakamura, K.; Yeung, S.; Alahi, A.; Fei-Fei, L. Jointly learning energy expenditures and activities using egocentric multimodal
signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July
2017; pp. 1868–1877.
50. Banos, O.; Garcia, R.; Saez, A. MHEALTH Dataset; UCI Machine Learning Repository. 2014. Available online: https://archive.ics.
uci.edu/dataset/319/mhealth+dataset (accessed on 3 August 2023).
51. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive
emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [CrossRef]
52. Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.P. Multi-attention recurrent network for human communication
comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7
February 2018.
53. Liang, P.P.; Lyu, Y.; Fan, X.; Wu, Z.; Cheng, Y.; Wu, J.; Chen, L.Y.; Wu, P.; Lee, M.A.; Zhu, Y.; et al. MultiBench: Multiscale
Benchmarks for Multimodal Representation Learning. In Proceedings of the Thirty-fifth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track (Round 1), Virtual, 6–14 December 2021.
54. Johnson, A.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV,
a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1. [CrossRef] [PubMed]
55. Goldberger, A.L.; Amaral, L.A.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley,
H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals.
Circulation 2000, 101, e215–e220. [CrossRef] [PubMed]
56. Alzheimer’s Disease Neuroimaging Initiative (ADNI). ADNI Database. Available online: http://adni.loni.usc.edu (accessed on 3
August 2023).
57. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3d shapenets: A deep representation for volumetric shapes.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 1912–1920.
58. Duarte, M.F.; Hu, Y.H. Vehicle classification in distributed sensor networks. J. Parallel Distrib. Comput. 2004, 64, 826–838.
[CrossRef]
59. Feng, T.; Bose, D.; Zhang, T.; Hebbar, R.; Ramakrishna, A.; Gupta, R.; Zhang, M.; Avestimehr, S.; Narayanan, S. FedMultimodal:
A Benchmark For Multimodal Federated Learning. arXiv 2023, arXiv:2306.09486.
60. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394. [CrossRef]
61. Liang, W.; Zhang, Y.; Kwon, Y.; Yeung, S.; Zou, J. Mind the gap: Understanding the modality gap in multi-modal contrastive
representation learning. arXiv 2022, arXiv:2203.02053.
62. Wang, W.; Tran, D.; Feiszli, M. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12695–12705.
63. Ma, M.; Ren, J.; Zhao, L.; Testuggine, D.; Peng, X. Are Multimodal Transformers Robust to Missing Modality? In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18177–18186.
64. Ma, M.; Ren, J.; Zhao, L.; Tulyakov, S.; Wu, C.; Peng, X. Smil: Multimodal learning with severely missing modality. In Proceedings
of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2302–2310.
65. Wu, M.; Goodman, N. Multimodal generative models for scalable weakly-supervised learning. Adv. Neural Inf. Process. Syst.
2018, 31, 5580–5590.
66. Tsai, Y.H.H.; Liang, P.P.; Zadeh, A.; Morency, L.P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2018,
arXiv:1806.06176.
67. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al.
The future of digital health with federated learning. NPJ Digit. Med. 2020, 3, 119. [CrossRef] [PubMed]
68. Cobbinah, B.M.; Sorg, C.; Yang, Q.; Ternblom, A.; Zheng, C.; Han, W.; Che, L.; Shao, J. Reducing variations in multi-center
Alzheimer’s disease classification with convolutional adversarial autoencoder. Med. Image Anal. 2022, 82, 102585. [CrossRef]
[PubMed]
69. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
70. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine
Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
71. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and
generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022;
pp. 12888–12900.
72. Tian, Y.; Wan, Y.; Lyu, L.; Yao, D.; Jin, H.; Sun, L. FedBERT: when federated learning meets pre-training. ACM Trans. Intell. Syst.
Technol. TIST 2022, 13, 1–26. [CrossRef]
73. Tan, Y.; Long, G.; Ma, J.; Liu, L.; Zhou, T.; Jiang, J. Federated learning from pre-trained models: A contrastive learning approach.
arXiv 2022, arXiv:2209.10083.
74. Nasr, M.; Shokri, R.; Houmansadr, A. Comprehensive privacy analysis of deep learning: Passive and active white-box inference
attacks against centralized and federated learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP),
San Francisco, CA, USA, 19–23 May 2019; IEEE: Piscataway, NJ, USA, 2019, pp. 739–753.
75. Luo, X.; Wu, Y.; Xiao, X.; Ooi, B.C. Feature inference attack on model predictions in vertical federated learning. In Proceedings of
the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; IEEE: Piscataway, NJ,
USA, 2021, pp. 181–192.
76. Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farokhi, F.; Jin, S.; Quek, T.Q.; Poor, H.V. Federated learning with differential privacy:
Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469. [CrossRef]
77. Park, J.; Lim, H. Privacy-preserving federated learning using homomorphic encryption. Appl. Sci. 2022, 12, 734. [CrossRef]
78. Fang, H.; Qian, Q. Privacy preserving machine learning with homomorphic encryption and federated learning. Future Internet
2021, 13, 94. [CrossRef]
79. Qiu, P.; Zhang, X.; Ji, S.; Li, C.; Pu, Y.; Yang, X.; Wang, T. Hijack Vertical Federated Learning Models with Adversarial Embedding.
arXiv 2022, arXiv:2212.00322.
80. Zhuang, W.; Wen, Y.; Zhang, S. Divergence-aware federated self-supervised learning. arXiv 2022, arXiv:2204.04385.
81. Saeed, A.; Salim, F.D.; Ozcelebi, T.; Lukkien, J. Federated self-supervised learning of multisensor representations for embedded
intelligence. IEEE Internet Things J. 2020, 8, 1030–1040. [CrossRef]
82. Jeong, W.; Yoon, J.; Yang, E.; Hwang, S.J. Federated semi-supervised learning with inter-client consistency & disjoint learning.
arXiv 2020, arXiv:2006.12097.
83. Che, L.; Long, Z.; Wang, J.; Wang, Y.; Xiao, H.; Ma, F. FedTriNet: A Pseudo Labeling Method with Three Players for Federated
Semi-supervised Learning. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA,
15–18 December 2021; pp. 715–724. [CrossRef]
84. Long, Z.; Che, L.; Wang, Y.; Ye, M.; Luo, J.; Wu, J.; Xiao, H.; Ma, F. FedSiam: Towards adaptive federated semi-supervised learning.
arXiv 2020, arXiv:2012.03292.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.