
IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 23, NO. 12, DECEMBER 2024

FedFMSL: Federated Learning of Foundation Models With Sparsely Activated LoRA

Panlong Wu, Kangshuo Li, Ting Wang, Yanjie Dong, Member, IEEE, Victor C. M. Leung, Life Fellow, IEEE, and Fangxin Wang, Member, IEEE

Abstract—Foundation models (FMs) have shown great success in natural language processing, computer vision, and multimodal tasks. FMs have a large number of model parameters and thus require a substantial amount of data to optimize the model during training. Federated learning (FL) has revolutionized machine learning by enabling collaborative learning from decentralized data while still preserving clients' data privacy. Despite the great benefits foundation models can gain from being empowered by federated learning, their bulky model parameters cause severe communication challenges for modern networks and computation challenges, especially for edge devices. Moreover, the data distributions of different clients can differ, inducing statistical challenges. In this paper, we propose a novel two-stage federated learning algorithm called FedFMSL. A global expert is trained in the first stage and a local expert is trained in the second stage to provide better personalization. We construct a Mixture of Foundation Models (MoFM) with these two experts and design a gate neural network with an inserted gate adapter that joins the aggregation in every communication round in the second stage. To further adapt to edge computing scenarios with limited computational resources, we design a novel Sparsely Activated LoRA (SAL) algorithm that freezes the pre-trained foundation model parameters, inserts low-rank adaptation matrices into transformer blocks, and activates them progressively during the training. We conduct extensive experiments to verify the effectiveness of FedFMSL; results show that FedFMSL outperforms other SOTA baselines by up to 59.19% in default settings while tuning less than 0.3% of the parameters of the foundation model.

Index Terms—Edge computing, federated learning, foundation model.

Received 20 March 2024; revised 23 August 2024; accepted 27 August 2024. Date of publication 4 September 2024; date of current version 5 November 2024. The work was supported in part by the NSFC under Grant 62293482, in part by the Basic Research Project under Grant HZQB-KCZYZ-2021067 of Hetao Shenzhen-HK S&T Cooperation Zone, in part by the NSFC under Grant 62471423, in part by the Shenzhen Science and Technology Program under Grant JCYJ20230807114204010 and Grant RCBS20221008093120047, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515012668, in part by Shenzhen Outstanding Talents Training Fund under Grant 202002, in part by Guangdong Research Projects under Grant 2019CX01X104, and in part by the Guangdong Provincial Key Laboratory of Future Networks of Intelligence under Grant 2022B1212010001. The work of Yanjie Dong and Victor C. M. Leung was supported in part by the NSFC under Grant 62102266, in part by the Pearl River Talent Recruitment Program of Guangdong Province under Grant 2019ZT08X603, in part by the Public Technology Platform of Shenzhen City under Grant GGFW2018021118145859, and in part by the Shenzhen Science and Technology Innovation Commission under Grant R2020A045. Recommended for acceptance by X. Peng. (Corresponding author: Fangxin Wang.)

Panlong Wu, Kangshuo Li, and Ting Wang are with the Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen) and School of Science and Engineering (SSE), The Chinese University of Hong Kong, Shenzhen 518172, China.

Yanjie Dong and Victor C. M. Leung are with the Artificial Intelligence Research Institute and the Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518172, China.

Fangxin Wang is with the School of Science and Engineering (SSE), Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen), and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen 518172, China.

Digital Object Identifier 10.1109/TMC.2024.3454634

I. INTRODUCTION

Foundation model (FM) has emerged as a potent solution to address the growing demand for machine learning services. It presents several advantages over its predecessors, the traditional smaller models. FM stands out primarily due to the massive amount of training data together with the extensive number of parameters. This massively increased parameter space, together with the intricate patterns and relationships collected from the data, enables FM to improve performance across various machine learning tasks.

FM follows a distinct training methodology compared to smaller models. While smaller models often rely on task-specific training, FM employs a pre-training and fine-tuning strategy. The pre-training phase with large datasets acts as a stepping stone, equipping FM with substantial knowledge and context from massive data. Consequently, when fine-tuning FMs for specific tasks, they derive significant advantages from the initial pre-training, leading to enhanced performance across a diverse set of tasks.

FMs with a tremendous number of parameters are data-hungry due to the large number of parameters to be optimized. However, user data is usually stored on edge devices and is privacy sensitive, and thus cannot meet the data volume required for FM training. Federated learning (FL) has revolutionized the landscape of machine learning by enabling the collaborative training of a shared model across multiple edge devices without sharing raw data. By combining the power of FM with the decentralized approach of FL, we can utilize the distributed edge data while preserving data privacy and overcoming the limitations of centralized training approaches, thus enabling the enhanced generalization ability of FM. This combination allows us to harness the benefits of both FL's collaborative training across edge devices and FM's large parameter capacity as well as its pre-training strategy.

However, existing FMs are cumbersome, bandwidth-intensive, and computation-intensive. This raises several challenges in the domain of FL with FM which make it hard to employ in real-world applications.

• Bandwidth: The first challenge lies within the substantial number of parameters possessed by FMs, which distinguishes them from traditional FL models that possess much fewer parameters. This difference introduces impediments in the areas of communication and networking, as the transmission of the parameters of these FMs is significantly time-consuming for modern mobile networks.
• Computation: The second challenge arises from the huge computational resource requirements posed by FMs, particularly for edge devices with limited computing resources. The considerable computational costs associated with these FMs create difficulties in implementing FL on resource-constrained edge devices.
• Data heterogeneity: The third challenge is that FL encounters statistical challenges due to non-IID decentralized data, potentially resulting in issues such as parameter divergence and data distribution biases. These issues can lead to severe performance loss when FMs are employed in FL, because FMs usually have large parameter spaces which makes them hard to optimize.

To fill the gap, this paper addresses the challenges associated with FL with FM by introducing the FedFMSL algorithm, an FL algorithm with a Mixture of Foundation Models that have sparsely activated parameters. The proposed FedFMSL algorithm consists of two training stages. A global foundation model is trained collaboratively in the first stage and local foundation models are trained in the second stage to provide better personalization.

In the first training stage, each client owns a foundation model, and low-rank adaptation matrices are inserted into every transformer block of the foundation model. During the training, the pre-trained weights of the foundation models are frozen, and all the parameters of the inserted matrices are activated to better extract global information. In each communication round, only the inserted matrices join the weight aggregation to reduce bandwidth consumption. The foundation model trained in the first stage will be frozen in the second stage and act as the global expert.

In the second training stage, we for the first time form a Mixture of Foundation Models (MoFM) system in FL, which specifically addresses the data heterogeneity challenges encountered in FL with FM. The MoFM system consists of a global foundation model, a local foundation model, and a gate model to provide both generalization and personalization ability. We leverage the foundation model trained in the first stage as a global expert and introduce another local expert, which is a foundation model initialized from the weights of the global model, to provide better personalization. We design a gate model with a specially designed gate adapter inserted into it so that it can quickly adapt to the changing relationship between the two experts and intelligently assign weights to the final decision of the two experts. In each communication round, only the gate adapter's activated parameters join the aggregation to save communication resources. To further tackle the computation challenges, we propose a Sparsely Activated LoRA (SAL) algorithm to activate the inserted low-rank adaptation matrices in a progressive way through a controller to suit different edge resource conditions.

In summary, the main contributions of this paper can be summarized as follows:
• We propose a communication- and computation-friendly two-stage personalized FL algorithm, FedFMSL, which can capture global feature information through collaborative learning and capture local feature information through personalized learning. We give theoretical proof of the convergence of the FedFMSL algorithm.
• We propose a Sparsely Activated LoRA (SAL) algorithm that sparsely activates the trainable low-rank decomposition matrices injected into foundation models in a progressive way through a self-defined controller, to adapt to scarce computation and communication resources in edge computing scenarios.
• We propose a Mixture of Foundation Models (MoFM) algorithm, which is, to the best of our knowledge, the first work to construct a mixture of vision language foundation models in personalized federated learning to tackle data heterogeneity in federated learning, and we further prove the effectiveness of FedFMSL through extensive experiments.

II. BACKGROUND AND RELATED WORK

A. Foundation Model

Recently, FMs have achieved remarkable success in various domains such as natural language processing, computer vision, and multimodal tasks. By utilizing deep learning techniques like self-supervised learning and contrastive learning, FMs with a massive number of model parameters are trained on large datasets. Consequently, these models exhibit strong generalization, feature extraction, and comprehension abilities.

Various works have been done related to FMs in natural language processing: BERT [1], referred to as Bidirectional Encoder Representation from Transformers, is an advanced natural language processing model introduced by Devlin et al. (2018). This model employs a transformer architecture and is pre-trained on extensive text data, using a masked language model pre-training objective. GPT-3 [2] is trained using a language modeling pre-training objective. By making the model do next token prediction, it can utilize the massive unlabeled data from the internet and has a powerful few-shot learning ability.

Various works have been done related to visual language FMs: Contrastive Language Image Pre-training (CLIP) [3] is a famous FM proposed by OpenAI. This model uses a visual encoder and a text encoder to extract the semantic meaning and encode images and texts into image features and text features. Throughout the training process, contrastive learning is employed to maximize the similarity between related images and texts while minimizing the similarity between unrelated ones. DALL-E 3 [4] is a modern text-to-image system that has extraordinary prompt-following capability. It addresses the issue of noisy and inaccurate image captions by training another specially designed image captioner.

There are also various works related to applications of FMs: NetLLM [5] uses the foundation model to do three networking-related tasks, which are viewport prediction, adaptive bitrate streaming, and cluster job scheduling.


LMaaS [6] explores the service pricing of foundation models in mobile computing scenarios.

B. Federated Learning

Federated learning [7] is a machine learning technique that enables training on decentralized data while preserving the privacy of the clients participating in the training. Typically, every client shares not its data but its private model after local training in each communication round. Although FL has shown great potential in the Internet of Things, the financial field, and smart healthcare, it still faces many challenges.

Many studies focus on solving the statistical challenges: FL faces serious statistical challenges because the data distribution of the datasets is often non-iid, which can lead to weight divergence after model aggregation. Li et al. [8] propose FedProx, which handles system heterogeneity by introducing an additional proximal term to prevent the local model updates from drifting far from the global model, so that local updates can be safely aggregated under statistical heterogeneity. Li et al. [9] introduce MOON, which uses the idea of contrastive learning to compare the representations learned by local models and the global model, inspired by the philosophy that the global model has better feature extraction ability than local models trained on skewed local datasets. Zhang et al. [10] design FedLC, which introduces a fine-grained calibrated cross-entropy loss to mitigate the local gradient deviation, and give theoretical proof of the deviation bound after calibration. Zec et al. [11] propose a personalized federated learning algorithm with a mixture of experts, with a training pipeline of global expert training, local expert training, and mixer training. Zhang et al. [12] propose a Multi-level Personalized Federated Learning (MuPFL) algorithm to tackle the heterogeneous and long-tailed data problems in FL. Wu et al. [13] propose a FEDCNI algorithm to address the noisy and heterogeneous data problem in FL.

The communication efficiency issue is also an important issue that many researchers focus on: Mao et al. [14] propose an Adaptive Quantized Gradient (AQG) algorithm to decide the level of quantization according to the gradient updates of heterogeneous clients. Huang et al. [15] propose a Residual Pooling Network (RPN) based on the approximation and selection of parameters, and apply it to CNN-based FL training. Haddadpour et al. [16] introduce an algorithm with periodical compressed communication. Specifically, they introduce the FedCOM algorithm to tackle the homogeneous client situation and the FedCOMGATE algorithm to tackle heterogeneous client situations. Chen et al. [17] propose a federated learning algorithm that considers weight quantization in wireless transmission and formulate the federated learning problem as a mixed-integer programming problem. Zhang et al. [18] introduce a CFEL algorithm that jointly considers cloud-based federated learning and edge-based federated learning. Qu et al. [19] design a partially synchronized federated learning algorithm to accelerate federated learning training.

C. FL With FM

Not many works have been done related to FL with FM. Zhang et al. [20] propose a federated generative learning framework that utilizes an FM on the server to generate synthesized images given the prompts transmitted from the clients to improve the training performance of the model. Tao et al. [21] propose a PromptFL algorithm to replace the model aggregation in traditional FL with prompt aggregation to reduce communication and computation costs. Cai et al. [22] design an AdaFL algorithm to fine-tune FMs for modern natural language processing tasks by inserting adapters into models, dividing clients into three groups, and observing each group's training accuracy to decide the best configuration for the adapters. Lu et al. [23] propose a FedCLIP algorithm that inserts adapters into the visual encoder of the FM CLIP and test it on datasets in different domains. Zhang et al. [24] introduce a Federated Instruction Tuning (FedIT) algorithm to leverage federated learning in instruction tuning of FMs to enhance their performance. Vu et al. [25] do a thorough analysis of the privacy leakage of federated learning with foundation models when data is under the protection of differential privacy mechanisms. Cai et al. [26] explore the usage of federated learning to boost the performance of large language models in practical software development. The F-CODLLM they propose supports multi-language combined fine-tuning, which can extract universal knowledge from a variety of programming languages and improve the overall performance of code-related tasks. Liu et al. [27] utilize the Self-Consistency and Chain-of-Thought techniques in federated learning to improve the quality of the answers of LLMs to users without sophisticated parameter tuning. However, none of these works consider the cooperation of FMs and thus cannot achieve good performance under data-heterogeneous conditions on challenging datasets.

III. DESIGN OF FEDFMSL

A. Overview of FedFMSL

We consider a typical FL scenario with a total number of $N$ clients with non-iid datasets $\{D_1, \ldots, D_N\}$. Our method FedFMSL consists of two stages of training, as depicted in Fig. 1. In the first stage, low-rank adaptation matrices are inserted into every transformer block of the foundation model [28]. All the clients freeze the pre-trained foundation model weights and only update and upload the weights of the inserted matrices in every communication round. In this stage, all the inserted low-rank adaptation matrices are activated to better extract general information from all the clients, and every client collaboratively trains a global model, which will be the global expert in stage two. The learning goal of stage one can be expressed as

$\min_{w_i} \mathcal{F} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(x_i, y_i) \sim d_i} F_i(x_i, y_i; w_i), \quad (1)$

where $F_i$ denotes the loss function of client $i \in [N]$; $w_i$ denotes the parameters of the inserted low-rank adaptation matrices of the model of client $i$; $x_i$ denotes its private data; $y_i$ denotes the corresponding label; and $d_i$ denotes the data distribution of client $i$. In the second stage, each client utilizes the global expert trained in the first stage and trains its personalized model. This model consists of a global expert, a local expert, and a gate model, which together constitute a Mixture of Foundation Models. During this stage, local experts are only trained on clients' local datasets using a novel Sparsely Activated LoRA algorithm and do not engage in the global aggregation.
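To make the stage-one exchange above concrete, the following is a minimal sketch of the server-side step in which only the inserted low-rank matrices are uploaded and averaged. It assumes a FedAvg-style, dataset-size-weighted average and illustrative helper names (`lora_state_dict`, `client_sizes`); the paper does not spell out the exact stage-one aggregation rule, so this is a sketch rather than the authors' implementation.

```python
import torch
from typing import Dict, List

def lora_state_dict(model: torch.nn.Module) -> Dict[str, torch.Tensor]:
    """Extract only the LoRA tensors (identified here by a naming convention)."""
    return {
        name: p.detach().clone()
        for name, p in model.named_parameters()
        if "lora_" in name
    }

def aggregate_lora_weights(
    client_lora_states: List[Dict[str, torch.Tensor]],
    client_sizes: List[int],
) -> Dict[str, torch.Tensor]:
    """Server-side aggregation of the inserted LoRA matrices only.

    The frozen pre-trained foundation-model weights never leave the clients;
    each client uploads just its low-rank adaptation tensors, which keeps the
    per-round payload small.
    """
    total = float(sum(client_sizes))
    aggregated = {}
    for key in client_lora_states[0]:
        aggregated[key] = sum(
            (n / total) * state[key]
            for state, n in zip(client_lora_states, client_sizes)
        )
    return aggregated
```

After aggregation, the server would broadcast the averaged LoRA tensors back to the clients, which load them before the next local round.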

Fig. 1. FedFMSL Workflow.

We design and insert a gate adapter into the gate model and aggregate all the parameters of the gate adapters in each communication round to gather information about the global data distribution and thus better assign weights to the two experts. We optimize the parameters of the local expert and the gate adapter by

$\min_{e_i, u_i} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(x_i, y_i) \sim d_i} L_i(x_i, y_i; e_i, u_i), \quad (2)$

where $u_i$ denotes the parameters of the gate adapter of client $i$ and $e_i$ denotes the parameters of the local expert of client $i$. We propose two novel algorithms to tackle the challenges raised by FL with FM.

In this paper, we use typical visual language foundation models like CLIP to do the image classification task. In these models, the visual encoder and the text encoder map input images and text to high-dimensional feature vectors. The model training aims to maximize the cosine similarity between related image-text pairs and minimize it for unrelated pairs, aligning their representations in the feature space. During the evaluation, the model precomputes a set of text feature vectors, one for each class. Then the cosine similarity between the input image feature vector and these precomputed text feature vectors is calculated, and the class with the highest similarity is the classification result.
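As a concrete illustration of this evaluation procedure, the short sketch below scores an image against precomputed class text embeddings by cosine similarity. The encoder outputs are assumed to be generic feature tensors; the function is not tied to any particular CLIP implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    """Pick the class whose text embedding is most similar to the image embedding.

    image_features: (B, D) image embeddings from the visual encoder.
    text_features:  (C, D) precomputed text embeddings, one per class prompt.
    Returns the index of the predicted class for each image.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.t()          # (B, C) cosine similarities
    return logits.argmax(dim=-1)    # class with the highest similarity
```

In FedFMSL, both the global and the local expert produce such similarity scores, which are later mixed by the gate model as described in Section III-C.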
B. Sparsely Activated LoRA

According to [29], the capability of a deep neural network tends to improve with the increase of the number of parameters of the model. FL with FM presents substantial challenges to the communication and computation of the distributed system due to FM's large number of parameters.

In traditional FL [7], [8], model parameters after local training are usually transmitted to the server for model weight aggregation in each communication round. This paradigm faces great challenges when FMs are trained in an FL procedure. Suppose we have an FM whose parameters are represented by $W_f$. For full-parameter fine-tuning, we need to calculate and store another model $W_k$, which has the same parameter size as $W_f$, for each task $k$. An FM, which typically consists of over 10 million model parameters, needs more than 160 million bits to be represented. This results in significant transmission time requirements for modern mobile communication networks.

Moreover, the training of FM necessitates substantial computation power and storage capacity, whereas edge devices typically possess limited computational capabilities and storage space. Therefore, it is imperative to develop an algorithm that mitigates the communication and computation costs associated with FL using FM.

To tackle these challenges, we design a novel Sparsely Activated LoRA algorithm that can achieve SOTA performance while tuning less than 1% of the total parameters of the FM. Common pre-trained language models are capable of efficient learning even when randomly projected into a smaller subspace because they have a very low intrinsic dimension [30]. Edward J. Hu et al. [28] propose Low-rank adaptation (LoRA) to insert trainable low-rank decomposition matrices into FMs, enabling model optimization with minimal parameter tuning. Inspired by this, we insert trainable low-rank decomposition matrices into every layer of the visual encoder and the text encoder of the foundation model.

Fig. 2. Iteration process of SAL.

We denote the weight parameter matrix of the foundation model as $W_0 \in \mathbb{R}^{E \times F}$ and the inserted low-rank decomposition matrices as $\Delta W \in \mathbb{R}^{E \times F}$, which can be calculated from two low-rank matrices, $\Delta W = W_A W_B$, with $W_A \in \mathbb{R}^{E \times H}$ and $W_B \in \mathbb{R}^{H \times F}$ ($H \ll \min(E, F)$). This adjustment allows $W_A W_B$ to have the same dimensions as the weight parameter matrix $W_0$ of the foundation model while reducing the number of tunable parameters for each weight matrix by $EF - (EH + HF)$. For $W_A$, we employ a random Gaussian initialization, while $W_B$ is initialized with zero. During training, $W_0$ is frozen and only $W_A$ and $W_B$ are optimized, which saves computation and storage costs. Suppose the input to the weight matrix and the inserted low-rank decomposition matrices is $x$. The output can be calculated by

$y = (W_0 + W_A W_B)x. \quad (3)$

The procedure of the proposed SAL algorithm is depicted in Fig. 2. We activate the low-rank decomposition matrices sparsely instead of activating them all during the training. At the beginning of the training stage, every layer of the visual encoder and the text encoder is inserted with frozen low-rank decomposition matrices. In deep neural networks, lower layers can better extract general information than higher layers [31]. During the first training stage, the low-rank decomposition matrices in all layers are activated to better extract general information to form a global expert, while in the second stage, we unfreeze the low-rank decomposition matrices from higher layers to lower layers during the training.

More specifically, we introduce a Capability Queue with a maximum queue length of $Q$. The image classification accuracies of clients are forwarded to the Capability Queue after every communication round. Once the Capability Queue is full, the earliest added accuracies are popped out. We set an accuracy threshold $\delta$ to help decide whether the training has come to a bottleneck. The incremental factor $\Delta$ of client $j$ in communication round $i$ is

$\Delta_{i,j} = \mathrm{Acc}_{i,j} - \frac{1}{Q} \sum_{t=i-Q}^{i-1} \mathrm{Acc}_{t,j}, \quad (4)$

where $\mathrm{Acc}_{i,j}$ denotes the image classification accuracy of the model of client $j$ in communication round $i$. If $\Delta_{i,j} < \delta$, the training is considered to have come to a bottleneck, and the low-rank decomposition matrices in the next lower layer will be activated.

The design of the SAL algorithm is inspired by the fact that the performance of FM is usually affected by the model size, the dataset size, and the quality of the dataset. Challenging datasets require more model parameters to be optimized to better extract the semantic meaning of the data. However, there is no silver-bullet configuration in the training of FL with FM, so we introduce the Capability Queue to intelligently decide the number of activated LoRA parameters and enable training on computation-resource-limited devices.
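The two pieces below sketch how (3) and (4) might look in code: a linear layer wrapped with a frozen base weight plus a trainable low-rank update, and a controller that keeps a fixed-length queue of past accuracies and unfreezes the next lower LoRA layer once the incremental factor falls below δ. Class names, the default queue length, and the order in which layers are stored are illustrative assumptions, not the authors' implementation.

```python
import collections
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update WA @ WB, as in (3)."""

    def __init__(self, base: nn.Linear, rank: int = 1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # W0 stays frozen
        e, f = base.out_features, base.in_features
        self.lora_A = nn.Parameter(0.01 * torch.randn(e, rank))  # random Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(rank, f))         # zero init
        self.set_active(False)                                   # SAL: start deactivated

    def set_active(self, active: bool) -> None:
        self.active = active
        self.lora_A.requires_grad = active
        self.lora_B.requires_grad = active

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)                                         # W0 x
        if self.active:
            y = y + x @ (self.lora_A @ self.lora_B).t()          # + WA WB x
        return y

class SALController:
    """Capability-Queue controller implementing the bottleneck test in (4)."""

    def __init__(self, lora_layers, queue_len: int = 5, delta: float = 0.005):
        # Layers are assumed to be ordered from the highest to the lowest layer,
        # matching the higher-to-lower activation order described in the paper.
        self.layers = list(lora_layers)
        self.queue = collections.deque(maxlen=queue_len)         # Capability Queue
        self.delta = delta
        self.next_idx = 0

    def step(self, accuracy: float) -> None:
        """Call once per communication round with the client's current accuracy."""
        if len(self.queue) == self.queue.maxlen:
            increment = accuracy - sum(self.queue) / len(self.queue)   # eq. (4)
            if increment < self.delta and self.next_idx < len(self.layers):
                self.layers[self.next_idx].set_active(True)      # unfreeze next lower layer
                self.next_idx += 1
        self.queue.append(accuracy)                              # oldest entry pops automatically
```

In stage one, all inserted layers would simply be set active from the start; the controller is only needed for the progressive activation in stage two.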
C. Mixture of Foundation Models

In traditional FL, a global model is trained using the decentralized data of clients. Only model weights are aggregated in the central server, while the local data of clients are kept private to ensure clients' data privacy. This paradigm faces statistical challenges especially when the data distribution of clients is non-iid. Such non-iid data distributions could cause weight divergence during the training [32] and cause significant performance drops. Moreover, training a single global model and applying it to all clients cannot suit different clients' needs when their data have different distributions. Training personalized models while benefiting from a global model is essential to providing better performance for different clients.

To tackle this challenge, we design a novel Mixture of Foundation Models (MoFM) algorithm that utilizes an FM as the global expert and another FM as the local expert, thus creating a mixture of foundation models to simultaneously learn personalized feature information as well as global feature information on each client.

As shown in Fig. 1, in the first stage of training, every client collaboratively trains a global FM. Low-rank decomposition matrices are inserted in every layer of the visual encoder and the text encoder. This global FM acts as the global expert.

In the second stage, a local expert is created for each client $i$ to cooperate with the global expert.

More specifically, the local experts have the same neural network architecture as the global expert and are initialized with the weights of the global expert. A gate function $G_i$ for each client $i$ is a neural network introduced to control the relative contributions of the global expert and the local expert to the final image classification decision given different images. We denote the image features and text features extracted by the global expert as $V_g$ and $T_g$, and the image features and text features extracted by local expert $i$ as $V_i$ and $T_i$. The final cosine similarity of the image features and text features extracted from the dataset of client $i$ can be denoted by

$\tilde{O}_i = \lambda_i \langle V_g, T_g \rangle + (1 - \lambda_i) \langle V_i, T_i \rangle, \quad (5)$

where $\lambda_i \in (0, 1)$ is a weight factor representing the mixing ratio of the global expert and the local expert of client $i$. A larger $\lambda_i$ indicates that more global knowledge is used, while a smaller $\lambda_i$ indicates that more personal knowledge is used.

During the second training stage of FedFMSL, the weights of the global expert are frozen, and the local expert and the gate model are optimized only using the local data of client $i$. The adapter [33] has been a popular parameter-efficient tuning method for FMs. It works by inserting very few layers into the FM and optimizing the FM by tuning only these inserted parameters.

We design a novel gate adapter to adapt to the local datasets while maintaining a low computation and communication cost. In each communication round, clients' activated gate adapter parameters are aggregated to learn global feature information. The aggregation of gate adapters can help the gate neural network better assign weights to the different experts by determining whether the given data is more subject to the user's personalized data distribution or to the global data distribution.

We denote the parameters of the gate adapter after aggregation as $u_{\mathrm{global}}$. Specifically, we construct the gate adapter with a Multi-Layer Perceptron (MLP), a batch norm layer, an MLP, a batch norm layer, and finally a Softmax function to ensure the output is between (0, 1). The gate adapter aggregation procedure is denoted as

$u_{\mathrm{global}} = \sum_{i=1}^{N} \frac{|D_i|}{\sum_{j=1}^{N} |D_j|} u_i. \quad (6)$
IV. CONVERGENCE ANALYSIS

In this section, we give a thorough analysis of the convergence of the FedFMSL algorithm. We first prove the convergence of the first stage and then that of the second stage of FedFMSL. For the first stage, we introduce the following assumptions, which are common assumptions used in [34], [35].

Assumption 1. The loss functions $F_1, \ldots, F_i, \ldots, F_N$ are $L$-smooth, that is, for all $w$ and $w'$ we have

$F_i(w) \le F_i(w') + (w - w')\nabla F_i(w') + \frac{L}{2}\|w - w'\|^2. \quad (7)$

Assumption 2. The expectation of the square of the gradient is bounded by $G$, which can be a relatively large number denoting the maximum gradient norm during the backpropagation process in the optimization of the foundation model:

$\mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le G^2. \quad (8)$

We set $\nabla \mathcal{F}(w_t) = \sum_{i=1}^{N} p_i \nabla F_i(w_t)$, where $p_i$ is the weight of the $i$th device, satisfying $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. In communication round $t$, the server performs the gradient descent step

$w_{t+1} = w_t - \gamma \nabla \mathcal{F}(w_t), \quad (9)$

where $\gamma$ is the learning rate. From $L$-smoothness we can get

$\mathcal{F}(w_{t+1}) - \mathcal{F}(w_t) - (w_{t+1} - w_t)\nabla \mathcal{F}(w_t) \le \frac{L}{2}\|w_{t+1} - w_t\|^2, \quad (10)$

then we substitute (9) and get

$\mathcal{F}(w_{t+1}) - \mathcal{F}(w_t) + \gamma \langle \nabla \mathcal{F}(w_t), \nabla \mathcal{F}(w_t) \rangle \le \frac{L}{2}\|w_{t+1} - w_t\|^2. \quad (11)$

After taking the expectations of both sides, we can get

$\mathbb{E}[\mathcal{F}(w_{t+1}) - \mathcal{F}(w_t)] + \gamma \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \frac{L}{2}\gamma^2 G^2, \quad (12a)$
$\gamma \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \mathbb{E}[\mathcal{F}(w_t) - \mathcal{F}(w_{t+1})] + \frac{L}{2}\gamma^2 G^2, \quad (12b)$
$\mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_t) - \mathcal{F}(w_{t+1})]}{\gamma} + \frac{L}{2}\gamma G^2. \quad (12c)$

By iterating the value of $t$ from 1 to $T$ we have

$\mathbb{E}\|\nabla \mathcal{F}(w_1)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_1) - \mathcal{F}(w_2)]}{\gamma} + \frac{L}{2}\gamma G^2, \quad (13a)$
$\mathbb{E}\|\nabla \mathcal{F}(w_2)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_2) - \mathcal{F}(w_3)]}{\gamma} + \frac{L}{2}\gamma G^2, \quad (13b)$
$\vdots$
$\mathbb{E}\|\nabla \mathcal{F}(w_T)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_T) - \mathcal{F}(w_{T+1})]}{\gamma} + \frac{L}{2}\gamma G^2. \quad (13c)$

After summing these inequalities we can get

$\sum_{t=1}^{T} \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_1) - \mathcal{F}(w^*)]}{\gamma} + \frac{TL}{2}\gamma G^2. \quad (14)$

Because of

$\min_{t=1:T} \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2, \quad (15)$

after dividing both sides by $T$ we can have

$\min_{t=1:T} \mathbb{E}\|\nabla \mathcal{F}(w_t)\|^2 \le \frac{\mathbb{E}[\mathcal{F}(w_1) - \mathcal{F}(w^*)]}{\gamma T} + \frac{L}{2}\gamma G^2. \quad (16)$

When $\gamma$ satisfies $\gamma = \frac{\Omega}{\sqrt{T}}$, where $\Omega$ is a constant, we have a convergence rate of $O(\frac{1}{\sqrt{T}})$.

In the second stage, only the gate adapter joins the aggregation in each communication round. We introduce some assumptions on the second training stage of FedFMSL, which are common assumptions used in [34], [36].

Assumption 3. The loss functions $L_1, \ldots, L_i, \ldots, L_N$ are smooth: $\nabla_e L_i(e_i, u)$ is $L_e$-Lipschitz with regard to $e_i$ and $L_{eu}$-Lipschitz with regard to $u$, and $\nabla_u L_i(e_i, u)$ is $L_u$-Lipschitz with regard to $u$ and $L_{eu}$-Lipschitz with regard to $e_i$. We further define $\epsilon = \frac{L_{eu}}{\sqrt{L_e L_u}}$. For all $e_i$, $e_i'$, $u$ and $u'$ we have

$\|\nabla_e L_i(e_i', u) - \nabla_e L_i(e_i, u)\| \le L_e \|e_i' - e_i\|, \quad (17)$
$\|\nabla_u L_i(e_i, u') - \nabla_u L_i(e_i, u)\| \le L_u \|u' - u\|, \quad (18)$
$\|\nabla_e L_i(e_i, u') - \nabla_e L_i(e_i, u)\| \le L_{eu} \|u' - u\|, \quad (19)$
$\|\nabla_u L_i(e_i', u) - \nabla_u L_i(e_i, u)\| \le L_{eu} \|e_i' - e_i\|. \quad (20)$

Assumption 4. The expectation of the square of the gradient is bounded:

$\mathbb{E}\|\nabla_e L_i(e_{i,t}, u_t)\|^2 \le G_e^2, \quad (21)$
$\mathbb{E}\|\nabla_u L_i(e_{i,t}, u_t)\|^2 \le G_u^2. \quad (22)$

In each communication round, client $i$ performs gradient descent by

$e_{i,t+1} = e_{i,t} - \gamma \nabla_e L_i(e_{i,t}, u_t). \quad (23)$

The server receives all the parameters of the gate adapters from the clients and performs gradient descent by

$u_{t+1} = u_t - \gamma \nabla_u \mathcal{L}(e_t, u_t). \quad (24)$

We can rewrite $L_i(e_{i,t+1}, u_{t+1}) - L_i(e_{i,t}, u_t)$ as

$L_i(e_{i,t+1}, u_{t+1}) - L_i(e_{i,t}, u_t) = L_i(e_{i,t+1}, u_{t+1}) - L_i(e_{i,t}, u_{t+1}) + L_i(e_{i,t}, u_{t+1}) - L_i(e_{i,t}, u_t). \quad (25)$

From the smoothness property, we can get

$L_i(e_{i,t+1}, u_{t+1}) - L_i(e_{i,t}, u_{t+1}) \le (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_{t+1}) + \frac{L_e}{2}\|e_{i,t+1} - e_{i,t}\|^2, \quad (26)$
$L_i(e_{i,t}, u_{t+1}) - L_i(e_{i,t}, u_t) \le (u_{t+1} - u_t)\nabla_u L_i(e_{i,t}, u_t) + \frac{L_u}{2}\|u_{t+1} - u_t\|^2. \quad (27)$

By summing (26) and (27) we can get

$L_i(e_{i,t+1}, u_{t+1}) - L_i(e_{i,t}, u_t) \le (u_{t+1} - u_t)\nabla_u L_i(e_{i,t}, u_t) + \frac{L_u}{2}\|u_{t+1} - u_t\|^2 + (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_{t+1}) + \frac{L_e}{2}\|e_{i,t+1} - e_{i,t}\|^2. \quad (28)$

We can rewrite the third term as

$(e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_{t+1})$
$= (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + (e_{i,t+1} - e_{i,t})\big(\nabla_e L_i(e_{i,t}, u_{t+1}) - \nabla_e L_i(e_{i,t}, u_t)\big)$
$\le (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + \|e_{i,t+1} - e_{i,t}\| \, \|\nabla_e L_i(e_{i,t}, u_{t+1}) - \nabla_e L_i(e_{i,t}, u_t)\|$
$\le (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + L_{eu}\|u_{t+1} - u_t\| \, \|e_{i,t+1} - e_{i,t}\|$
$\le (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + \epsilon \sqrt{L_u L_e}\,\|u_{t+1} - u_t\| \, \|e_{i,t+1} - e_{i,t}\|$
$\le (e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + \epsilon^2 \frac{L_u}{2}\|u_{t+1} - u_t\|^2 + \epsilon^2 \frac{L_e}{2}\|e_{i,t+1} - e_{i,t}\|^2. \quad (29)$

By utilizing the bound obtained from (29) we can have

$\mathcal{L}(e_{t+1}, u_{t+1}) - \mathcal{L}(e_t, u_t) \le (u_{t+1} - u_t)\nabla_u \mathcal{L}(e_t, u_t) + \frac{1}{N}\sum_{i=1}^{N}(e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) + (\epsilon^2 + 1)\frac{L_u}{2}\|u_{t+1} - u_t\|^2 + (\epsilon^2 + 1)\frac{L_e}{2}\frac{1}{N}\sum_{i=1}^{N}\|e_{i,t+1} - e_{i,t}\|^2, \quad (30)$

and then we can get

$\mathcal{L}(e_{t+1}, u_{t+1}) - \mathcal{L}(e_t, u_t) - (u_{t+1} - u_t)\nabla_u \mathcal{L}(e_t, u_t) - \frac{1}{N}\sum_{i=1}^{N}(e_{i,t+1} - e_{i,t})\nabla_e L_i(e_{i,t}, u_t) \le (\epsilon^2 + 1)\frac{L_u}{2}\|u_{t+1} - u_t\|^2 + (\epsilon^2 + 1)\frac{L_e}{2}\frac{1}{N}\sum_{i=1}^{N}\|e_{i,t+1} - e_{i,t}\|^2. \quad (31)$

Substituting (23) and (24), we can get

$\mathcal{L}(e_{t+1}, u_{t+1}) - \mathcal{L}(e_t, u_t) + \gamma \frac{1}{N}\sum_{i=1}^{N}\|\nabla_e L_i(e_{i,t}, u_t)\|^2 + \gamma \|\nabla_u \mathcal{L}(e_t, u_t)\|^2 \le (\epsilon^2 + 1)\frac{L_u}{2}\|u_{t+1} - u_t\|^2 + (\epsilon^2 + 1)\frac{L_e}{2}\frac{1}{N}\sum_{i=1}^{N}\|e_{i,t+1} - e_{i,t}\|^2. \quad (32)$

By taking the expected value of both sides we can have

$\mathbb{E}[\mathcal{L}(e_{t+1}, u_{t+1}) - \mathcal{L}(e_t, u_t)] + \gamma \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}\|\nabla_e L_i(e_{i,t}, u_t)\|^2\Big] + \gamma \mathbb{E}\|\nabla_u \mathcal{L}(e_t, u_t)\|^2 \le (\epsilon^2 + 1)\frac{L_u}{2}\gamma^2 G_u^2 + (\epsilon^2 + 1)\frac{L_e}{2}\gamma^2 G_e^2. \quad (33)$


Rearranging (33), we can get

$\mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}\|\nabla_e L_i(e_{i,t}, u_t)\|^2\Big] + \mathbb{E}\|\nabla_u \mathcal{L}(e_t, u_t)\|^2 \le \frac{\mathbb{E}[\mathcal{L}(e_t, u_t) - \mathcal{L}(e_{t+1}, u_{t+1})]}{\gamma} + (\epsilon^2 + 1)\frac{L_u}{2}\gamma G_u^2 + (\epsilon^2 + 1)\frac{L_e}{2}\gamma G_e^2. \quad (34)$

By setting the value of $t$ in (34) from 1 to $T$ and summing the resulting inequalities we can have

$\sum_{t=1}^{T}\Big(\mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}\|\nabla_e L_i(e_{i,t}, u_t)\|^2\Big] + \mathbb{E}\|\nabla_u \mathcal{L}(e_t, u_t)\|^2\Big) \le \frac{\mathbb{E}[\mathcal{L}(e_1, u_1) - \mathcal{L}^*]}{\gamma} + T(\epsilon^2 + 1)\frac{L_e}{2}\gamma G_e^2 + T(\epsilon^2 + 1)\frac{L_u}{2}\gamma G_u^2. \quad (35)$

Finally, we can have

$\min_{t=1:T}\Big(\mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}\|\nabla_e L_i(e_{i,t}, u_t)\|^2\Big] + \mathbb{E}\|\nabla_u \mathcal{L}(e_t, u_t)\|^2\Big) \le \frac{\mathbb{E}[\mathcal{L}(e_1, u_1) - \mathcal{L}^*]}{\gamma T} + \frac{\gamma}{2}(\epsilon^2 + 1)(L_e G_e^2 + L_u G_u^2). \quad (36)$

When $\gamma$ satisfies $\gamma = \frac{M}{\sqrt{T}}$, where $M$ is a constant, we have a convergence rate of $O(\frac{1}{\sqrt{T}})$.

V. EXPERIMENTS

In this section, we conduct comprehensive experiments against SOTA baselines to verify the effectiveness of FedFMSL under different settings.

A. Experiment Setup

1) Datasets: We select some representative datasets that are widely used in the image classification task of the CLIP model. Specifically, we select Food101 [37], which is a food classification dataset containing 101 classes; EuroSAT [38], which is a dataset for land use and land cover classification containing 10 classes; and UCF101 [39], which is a dataset for human action classification in the wild containing 101 classes.

2) Baselines: To verify the effectiveness of the proposed FedFMSL algorithm, we compare the image classification accuracy with the following state-of-the-art baselines.
• Vanilla Fine-Tuning (FT): This is one of the most representative fine-tuning algorithms used in natural language processing and computer vision [1]. This algorithm involves taking a generic, pre-trained model and further training it on a smaller, task-specific dataset by fine-tuning the model's weights.
• PromptFL [21]: This algorithm does prompt tuning instead of tuning the parameters of the FM. Prompts from different clients are aggregated in every communication round. In PromptFL, clients train soft prompts using their local data, making minimal updates to a small set of parameters in the prompt while keeping the weights of the model fixed. The server aggregates these updates and shares them back with the clients in an iterative process.
• LayerFreeze Fine-Tuning (LFFT): This algorithm freezes several layers in FMs and tunes only the activated layers. These parameters are aggregated in every communication round to save communication and computation resources.

3) Default Training Settings: We set the backbone of the visual encoder of CLIP to be ViT-B/16. The batch size is set to 512. The learning rate is set to $2 \times 10^{-4}$. The number of clients is set to 10. The rank of the inserted low-rank decomposition matrices is set to 1. The dropout probability of the inserted low-rank decomposition matrices is set to 0.1. The number of communication rounds of training stage one is set to 25 and the number of communication rounds of training stage two is set to 25.

B. Results Comparisons

We assume the non-iid data partition in FL to follow the Dirichlet distribution [40]. The $\alpha$ parameter in the Dirichlet distribution represents the degree of heterogeneity: the smaller the $\alpha$, the more non-iid the data distributed across the clients will be. We test the image classification accuracy on various datasets at different non-iid levels.

1) Impact of Different System Settings. Impact of degrees of data heterogeneity: Fig. 3 shows the image classification accuracy on the Food101, UCF101, and EuroSAT datasets respectively, assuming different degrees of data heterogeneity. Specifically, we set $\alpha$ to 0.1, 1, and 10. From the results, we can find that FedFMSL achieves the highest accuracy among the four algorithms in all cases. FedFMSL surpasses the average accuracy of FT, PromptFL, and LFFT on all datasets by 20.40%, 29.90%, and 11.11% respectively. Our method has a minimum accuracy gain of 6.20%, 12.79%, and 12.69% and a maximum accuracy gain of 9.45%, 22.18%, and 59.19% on the Food101, UCF101, and EuroSAT datasets respectively compared to other baselines when $\alpha$ is set to 1.

By observing the accuracy at different data heterogeneity levels, we can find that the performance of these algorithms does not always follow a positive correlation with the data heterogeneity level. On the UCF101 and EuroSAT datasets, the accuracy of FedFMSL when $\alpha$ is 10 has an increase of 1.03% and 0.02% compared to the case when $\alpha$ is 0.1. On the Food101 dataset, FedFMSL has an accuracy of 93.78% when $\alpha = 1$ but an accuracy of 93.74% when $\alpha = 10$.

This is because higher data heterogeneity may lead to a smaller number of classes in clients' local datasets. For example, a client's local dataset may contain data from 10 classes at a low data heterogeneity level but may contain 3 classes at a high data heterogeneity level, which can lower the difficulty of identifying the right class given an image.
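For reference, a common way to realize the Dirichlet-based non-iid partition described above (smaller α concentrates each class on fewer clients) is sketched below; the function and its defaults are illustrative rather than the authors' exact data pipeline.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_clients: int, alpha: float,
                        seed: int = 0) -> list:
    """Split sample indices across clients with class proportions ~ Dirichlet(alpha).

    Smaller alpha -> each class is concentrated on fewer clients (more non-iid).
    Returns a list of index arrays, one per client.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for client_id, split in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(split.tolist())
    return [np.array(ix) for ix in client_indices]
```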


Fig. 3. Average accuracy on different datasets at different non-iid α.

Fig. 4. Average accuracy on different datasets under different number of clients.

Fig. 5. Average accuracy on different datasets when under different visual encoders.

Impact of number of clients: We test the performance of FedFMSL and the other baselines under different numbers of clients. Specifically, we set the number of clients to 5, 10, and 15. From Fig. 4 we can conclude that FedFMSL has the highest accuracy in all client-number cases. When the number of clients is 5, FedFMSL achieves accuracies of 92.03%, 86.89%, and 99.40% on the Food101, UCF101, and EuroSAT datasets respectively, while the best accuracies of the other three baselines on these three datasets are 87.56%, 77.15%, and 90.40%; FedFMSL surpasses the best performance of the other baselines by 4.47%, 9.74%, and 9%. In the case of 10 clients, FedFMSL has a maximum accuracy increase of 9.73%, 21.29%, and 54.85% compared to the three baselines. When the number of clients reaches 15, the accuracy of FedFMSL is 94.74%, 87.69%, and 96.19%, while the highest accuracies of the other three baselines are 87.44%, 62.86%, and 83.44%. The results show that FedFMSL works well under different numbers of clients.

2) Impact of Training Settings. Impact of visual encoders: We test the performance of FedFMSL under the visual encoders ViT-B/16 and ViT-B/32 to further verify the effectiveness of FedFMSL under various visual encoders.

We can observe from Fig. 5 that FedFMSL achieves the highest image classification accuracy on all datasets using either the ViT-B/16 or the ViT-B/32 visual backbone. The results show that visual encoders with a larger number of parameters achieve better performance than smaller visual encoders. The accuracy of the four algorithms increased by 0.95%, 11.85%, 4.57%, and 4.25% when using ViT-B/16 as the visual encoder on the UCF101 dataset, which confirms the theoretical analysis that larger models have better feature extraction ability.


Fig. 6. Comparison of accuracy of FedFMSL under different learning rates.

Fig. 7. Accuracy under different accuracy thresholds in the capability queue.

Fig. 8. Accuracy with different LoRA ranks.

Our method surpasses the other baselines by a maximum of 54.85% and a minimum of 8.60% when using ViT-B/16, and surpasses the other baselines by a maximum of 60.01% and a minimum of 5.76% when using ViT-B/32. The results show that FedFMSL can adapt to visual encoders of different scales.

Impact of learning rates: We show the accuracy of FedFMSL on the three datasets under different learning rates. We set the learning rate to $2 \times 10^{-3}$, $2 \times 10^{-4}$, and $2 \times 10^{-5}$.

From Fig. 6 we can observe that when the learning rate is $2 \times 10^{-5}$ the model shows a slow convergence rate; especially on the UCF101 dataset, the accuracy is 67.62%, which is an accuracy loss of 18.8% compared to the accuracy when the learning rate is $2 \times 10^{-4}$. When the learning rate is $2 \times 10^{-3}$, the accuracy has a sharp increase in the first few epochs on all datasets but may encounter severe oscillation in the following epochs, because a large learning rate can cause instability in model training. This phenomenon is especially common in the training of FMs, because they usually have a large number of parameters and the scale of gradient calculations increases, which may cause gradient explosion. Moreover, models with a large number of parameters tend to have more complex optimization spaces, resulting in the training process being more easily affected by noise and instability.

Impact of accuracy thresholds: We further discuss the performance of FedFMSL when the value of the accuracy threshold $\delta$ is different. Typically, if the clients have more computation resources, they can use a higher $\delta$ to encourage more inserted low-rank decomposition matrices to be activated, and if their computing resources are scarce they can set a small $\delta$ to save more computation resources.

We set the accuracy threshold to 0.001, 0.005, 0.01, and 0.02 to see its effect on the model performance. From Fig. 7 we can find that at the beginning, the model's results increase as the value of $\delta$ increases, but after reaching a certain value, the difference in the model's results becomes insignificant.

On the Food101 dataset, the accuracy is 93.80% when $\delta$ is 0.02 and 93.12% when $\delta$ is 0.001, an increase of 0.68%. On the EuroSAT dataset, the accuracy increases from 98.84% to 99.07%. But on the UCF101 dataset, the model has the highest accuracy of 86.73% when the value of $\delta$ is 0.005. The largest accuracy differences on the three datasets are 0.68%, 2.1%, and 0.23%.

The results imply the effectiveness of the design of SAL, because it leverages the idea of curriculum learning by progressively increasing the number of activated low-rank adaptation matrices in the visual encoder and the text encoder. In the optimization process of an FM, optimization is often easier at the beginning, while the optimization of parameters becomes more difficult as training progresses, making it harder to improve accuracy. We increase the number of activated parameters through a controller during the training to tackle this increasing difficulty of optimizing the FM.

Impact of LoRA ranks: In FedFMSL, we incorporate LoRA for model training and optimization. The rank in LoRA refers to the degree of model compression or pruning, and different rank settings can have an impact on the model training performance. We examine the training accuracy performance of FedFMSL under different LoRA ranks, specifically ranks set to 1, 4, 8, and 16.

From Fig. 8, we can observe that as the LoRA rank increases, FedFMSL shows an uncertain trend in accuracy performance across different datasets.

Fig. 9. Accuracy with different LoRA dropout rates.

Fig. 10. Accuracy with different LoRA weight decays.

Fig. 11. Number of activated LoRA layers during training.

On the UCF101 dataset, the performance first increases as the rank increases and is best at a LoRA rank of 8, but the model accuracy decreases as the LoRA rank further increases. The changes in LoRA rank can cause the accuracy of the final model to fluctuate within a 0.97% range on the UCF101 dataset, while on the other datasets the fluctuation is much smaller. One possible reason is that the semantic information contained in the UCF101 dataset is difficult for the model to capture and thus requires more training to optimize the model. A high LoRA rank can lead to a large optimization space, which raises challenges for the training. On the other hand, the information patterns in the Food101 and EuroSAT datasets are relatively simpler and easy for the inserted low-rank matrices with different ranks to capture, so the model's learning performance is not affected significantly.

Impact of LoRA dropout rates: The dropout coefficient in LoRA affects the probability of applying dropout regularization during the model training process. In FedFMSL, we apply dropout to prevent the neural networks from overfitting and to enhance the model's generalization ability. In our experiment, the group with a dropout rate of 0 represents the accuracy of the model without using dropout, and the groups with dropout rates of 0.1, 0.3, and 0.5 are set to investigate the impact of different dropout coefficients.

From Fig. 9, we can clearly see that dropout has an effect on the final model accuracy. The model achieves the best accuracy on the Food101, UCF101, and EuroSAT datasets when the dropout rates are set to 0.3, 0.5, and 0.3 respectively. Setting the dropout rate too high increases the number of discarded neurons during model training, leading to a decrease in the model's learning and expressive capacity, while setting the dropout rate too low may cause overfitting of the dataset.

Impact of weight decays: Weight decay is used to address the overfitting problem. By adding a regularization term to the loss function, it encourages the model to have small weight values during the training. In our experiment, we set up five different experimental groups with varying degrees of weight decay. The group with a weight decay value of 0 represents the training performance of FedFMSL without weight decay.

From Fig. 10 we can conclude that weight decay has a different impact on different datasets, but the impact is not significant. On the UCF101 dataset, when the weight decay is 0.5 the accuracy is 86.67%, and when the weight decay is 0.1 the accuracy is 86.54%. On the Food101 and UCF101 datasets, the accuracy under different weight decays also varies by less than 1%. Overall, the performance of FedFMSL improves with the use of weight decay but decreases when the weight decay value is set too high. This is because a large weight decay can lead to excessive constraints on parameters, limiting the effective information learned by the model.

Impact of different datasets: We compare the number of activated LoRA layers in the visual encoder under different datasets to illustrate the flexibility of our algorithm. We use the CLIP model with the ViT-L/14 encoder and train the model using the SAL algorithm for 25 communication rounds with an accuracy threshold of 0.005. From Fig. 11 we can observe that the SAL algorithm can adapt to different datasets according to their characteristics. On the EuroSAT dataset, the number of activated LoRA layers is small because very few layers can achieve a significant accuracy gain, but on the Food101 and UCF101 datasets, the number of activated layers is higher due to more difficulty in the optimization of the foundation model.

The foundation model is pretrained on large-scale data and has zero-shot image classification ability. The model achieves higher zero-shot accuracy on the Food101 dataset but less increase during the fine-tuning, while it achieves lower zero-shot accuracy on the UCF101 dataset and has a more substantial accuracy gain during the fine-tuning. The greater difficulty in improving the model during fine-tuning on the Food101 dataset results in the activation of more LoRA layers.

Impact of different foundation models: Besides the foundation model CLIP, we also compare the performance of our algorithm with other baselines under another foundation model. More specifically, we compare the performance using the foundation model ALBEF [41], which is also a visual language foundation model pretrained on large-scale data.

Fig. 12. Average accuracy on different datasets when using ALBEF.


Fig. 13. Communication time under different bandwidths.

TABLE I TABLE II
PROPORTION AND TOTAL NUMBER OF TRAINING PARAMETERS FOR DIFFERENT COMPARISON OF THE NUMBER OF PARAMETERS OF THE MODEL
METHODS

for parameter aggregation in each communication round thus


impairs the training efficiency of FL.
From the results, we can observe that when the bandwidth
is 0.1 MB/s, the time for FedFMSL to finish a communication
can observe that FedFMSL achieves the highest accuracy among round is 8.34 seconds while the communication time for FT,
all algorithms on all datasets. FedFMSL has an accuracy gain LFFT, and PromptFL per communication round is 5,960 sec-
of 3.59% − 22.9%, 8.01% − 16.53%, and 13.8% − 38.85% on onds, 3,920 seconds, and 0.32 seconds. FedFMSL saves 99.95%
Food101, UCF101, and EuroSAT datasets respectively com- communication time compared to the FT algorithm. When the
pared to other baselines. The PromptFL algorithm experiences bandwidth is 1 MB/s and 10 MB/s, the difference in training
performance degradation when evaluated on the ALBEF model, time between different algorithms is reduced. FedFMSL and
as it heavily relies on prompt tuning that necessitates a strong PromptFL have communication times of fewer than 1 s per com-
zero-shot performance on the target dataset to yield significant munication round when the network bandwidth is 1 MB/s and
improvements. 10 MB/s. Although PromptFL has the minimum communication
3) Comparison of Resource Consumption. Comparison of the resource consumption, both FedFMSL and PromptFL are com-
number of training parameters: Table I illustrates the proportion munication efficient in real-world scenarios and FedFMSL has
and the total number of training parameters of FedFMSL and much higher accuracy on all datasets compared with PromptFL.
Comparison of transmission time: Considering mobile computing and Internet of Things scenarios, we test the proposed FedFMSL under different bandwidth conditions. Specifically, we set the bandwidth to 0.1 MB/s, 1 MB/s, and 10 MB/s, which are typical bandwidths in modern mobile communication networks.

Fig. 13 illustrates the communication time of the proposed FedFMSL and other baselines on different datasets per communication round under different network bandwidths. To better visualize the results, we take the logarithm of them in the figure. Low bandwidth results in long data transmission time for parameter aggregation in each communication round and thus impairs the training efficiency of FL.

From the results, we can observe that when the bandwidth is 0.1 MB/s, FedFMSL finishes a communication round in 8.34 seconds, while the communication time per round for FT, LFFT, and PromptFL is 5,960 seconds, 3,920 seconds, and 0.32 seconds respectively. FedFMSL saves 99.95% of the communication time compared to the FT algorithm. When the bandwidth is 1 MB/s or 10 MB/s, the difference in training time between the algorithms is reduced, and both FedFMSL and PromptFL have communication times of less than 1 s per communication round. Although PromptFL has the minimum communication resource consumption, both FedFMSL and PromptFL are communication efficient in real-world scenarios, and FedFMSL achieves much higher accuracy on all datasets than PromptFL.

For a more practical comparison, we select a real-world network trace from [42] and test the transmission time. From Fig. 15 we can find that the communication times of FedFMSL, FT, LFFT, and PromptFL in one communication round are 0.31 s, 210 s, 138 s, and 0.005 s respectively. FedFMSL achieves the best performance under very low communication resource consumption.
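The reported round times are consistent with a simple payload-over-bandwidth calculation. The sketch below assumes 4 bytes per parameter and a trainable-parameter count of roughly 0.21 M; that count is an illustrative assumption chosen only to show the arithmetic, not a value taken from Table I.

```python
def round_time_s(num_params: int, bandwidth_mb_per_s: float, bytes_per_param: int = 4) -> float:
    """Upload time of one client's update in a single communication round."""
    payload_mb = num_params * bytes_per_param / 1e6
    return payload_mb / bandwidth_mb_per_s

for bw in (0.1, 1.0, 10.0):
    # ~0.84 MB of LoRA parameters -> about 8.4 s, 0.84 s, and 0.08 s respectively
    print(f"{bw} MB/s: {round_time_s(210_000, bw):.2f} s")
```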
4) Comparison of F1 Score: To evaluate the performance of the different algorithms more comprehensively, we also compare their F1 scores on the different datasets. From Fig. 16 we can find that FedFMSL has a maximum performance gain of 7.34%, 21.96%, and 55.42% on the Food101, UCF101, and EuroSAT datasets respectively.
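For reference, the per-class F1 score is the harmonic mean of precision and recall, F1 = 2PR/(P+R), and a multi-class comparison such as Fig. 16 is typically reported as a macro average over classes. The snippet below illustrates that computation; the labels are placeholders and the macro-averaging choice is our assumption rather than a detail stated in the text.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]      # placeholder ground-truth labels
y_pred = [0, 2, 2, 2, 0, 0]      # placeholder predictions
print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1 across classes
```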
5) Ablation Study. Impact of Mixture of Foundation Models architecture: We compare the image classification accuracy of the model with the MoFM architecture using the ViT-B/16 visual encoder against the model without the MoFM architecture, which relies solely on the SAL algorithm while incorporating the more parameter-intensive ViT-L/14 visual encoder. The total number of epochs for both models is set to 50 to ensure a fair comparison. The comparison of the total number of parameters is shown in Table II. The model with the MoFM architecture but a smaller visual encoder has 27.33% fewer parameters than the other model.


Fig. 14. Comparison of the accuracy on different datasets using the model with MoFM (ViT-B/16) and the model without MoFM (ViT-L/14).

Fig. 15. Communication time under the real-world network trace.

Fig. 16. Comparison of F1 scores on different datasets.

TABLE III. COMPARISON OF ACCURACY ON DIFFERENT DATASETS.

TABLE IV. COMPARISON OF THE NUMBER OF PARAMETERS OF THE GATE ADAPTER AND THE GATE MODEL.

We can observe from Fig. 14 that although it has much fewer parameters, the model with the MoFM architecture trained through the complete two stages achieves higher accuracy on all datasets, with an average accuracy increase of 1.98%: more specifically, increases of 2.09%, 2.33%, and 1.43% on the Food101, UCF101, and EuroSAT datasets. The performance of ViT-L/14 drops on the Food101 dataset because non-IID data may cause divergence between different clients' model parameters during aggregation.

This surprising result demonstrates the superiority of the MoFM architecture, as it can intelligently assign different weights to different experts according to the characteristics of the images to be classified. This enables better generalization to data outside the distribution of clients' local datasets and better personalization to data that follow the distribution of clients' local datasets.

Impact of aggregation of gate parameters: We compare the image classification accuracy of FedFMSL and of FedFMSL with all gate parameters activated in the second training stage. We can observe from Table III and Table IV that by fine-tuning only the last layer of the gate adapter, we achieve accuracies of 93.79%, 86.42%, and 99.07% on the Food101, UCF101, and EuroSAT datasets, which is on average only 0.05% lower than tuning the full parameters of the gate model while saving 97.69% of the communication resource consumption.

This is because the changing parameters of the local expert during training cause the relationship between the global expert and the local expert to keep shifting, which makes it hard for the gate model to decide the decision weight it assigns to the two FMs and can cause instability in training. By freezing the gate parameters and inserting a lightweight gate adapter, the gate model can quickly adapt to the newly optimized parameters of the local expert while maintaining its feature extraction ability.
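A minimal sketch of this gating scheme is given below: a frozen gate backbone extracts a feature, a lightweight trainable adapter maps it to two logits, and the softmax of those logits weights the predictions of the global and local experts. The module names, the feature dimension, and the single-linear-layer adapter are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GatedMixtureOfFMs(nn.Module):
    """Two-expert mixture with a frozen gate backbone and a small trainable adapter."""

    def __init__(self, gate_backbone: nn.Module, global_expert: nn.Module,
                 local_expert: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.gate_backbone = gate_backbone
        for p in self.gate_backbone.parameters():
            p.requires_grad = False                  # gate backbone stays frozen
        self.gate_adapter = nn.Linear(feat_dim, 2)   # only this part is fine-tuned
        self.global_expert = global_expert
        self.local_expert = local_expert

    def forward(self, x):
        w = torch.softmax(self.gate_adapter(self.gate_backbone(x)), dim=-1)  # [B, 2]
        out_global = self.global_expert(x)
        out_local = self.local_expert(x)
        return w[:, :1] * out_global + w[:, 1:] * out_local
```

Because only the gate adapter carries gradients in this sketch, it is the only gate component whose parameters would need to be aggregated across clients, which is consistent with the communication saving reported above.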


Fig. 17. Comparisons of the accuracy of FedFMSL, LoRA+FL, and Adapter+FL.

Impact of modification to LoRA and Adapter: In Fig. 17 we compare the image classification accuracy of FedFMSL, LoRA+FL, and Adapter+FL, where LoRA+FL represents fine-tuning and aggregating LoRA parameters every communication round and Adapter+FL [23] represents fine-tuning and aggregating Adapter parameters every communication round. From the results, we can observe that the FedFMSL algorithm achieves the highest accuracy among all algorithms on the three datasets: FedFMSL has an accuracy gain of 4.41% to 6.97% on the Food101 dataset, 1.84% to 14.55% on the UCF101 dataset, and 1.06% to 9.48% on the EuroSAT dataset.
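The snippet below sketches what the LoRA+FL flavor of this comparison looks like in code: each linear layer keeps its frozen dense weight and learns a low-rank update, and the server averages only the low-rank matrices each round. The rank, scaling, and key-matching convention are illustrative assumptions and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen dense layer plus a trainable low-rank update, in the spirit of [28]."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def aggregate_lora(client_states: list) -> dict:
    """FedAvg restricted to the LoRA tensors: clients upload only the A/B matrices."""
    keys = [k for k in client_states[0] if k.split(".")[-1] in ("A", "B")]
    return {k: torch.stack([s[k] for s in client_states]).mean(dim=0) for k in keys}

# Example: average the LoRA tensors of two clients' state_dicts.
clients = [LoRALinear(nn.Linear(16, 8)).state_dict() for _ in range(2)]
avg = aggregate_lora(clients)   # contains only the low-rank matrices
```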
VI. LIMITATIONS AND FUTURE WORKS

While our work demonstrates competitive performance compared to various baselines, it is not without limitations. The efficiency of FedFMSL can be further enhanced through the incorporation of quantization techniques, which reduce the size of the transmitted model parameters and consequently conserve computational and communication resources during FL training. Furthermore, there is potential for exploring adaptive quantization strategies that dynamically adjust the level of quantization based on data characteristics and model requirements in FL training. Additionally, conducting experiments across a broader range of datasets and a more extensive set of clients could further validate our findings. Future research can address these limitations.
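As a rough illustration of the quantization direction mentioned above (not a component of FedFMSL), the sketch below uniformly quantizes an update tensor to 8 bits before upload and dequantizes it on the server, shrinking each 32-bit payload by roughly a factor of four. The quantizer design is an assumption made purely for illustration.

```python
import torch

def quantize_update(t: torch.Tensor):
    """Symmetric uniform 8-bit quantization of a parameter update before upload."""
    scale = t.abs().max().clamp(min=1e-12) / 127
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_update(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Server-side reconstruction of the update from its 8-bit representation."""
    return q.float() * scale
```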
VII. CONCLUSION

In this paper, we propose a novel FedFMSL algorithm that contains two training stages to address the bandwidth-intensive, computation-intensive, and client data heterogeneity challenges of federated learning with foundation models. In the first stage, we freeze the pre-trained FM weights and insert low-rank decomposition matrices into every transformer block. We activate all the inserted matrices to better extract global feature information. In every communication round, only the parameters of the low-rank decomposition matrices join the weight aggregation, saving more than 99.9% of the communication bandwidth compared to full-parameter transmission. In the second stage, we take the FM trained in the first stage as the global expert and construct another local expert to provide personalization for individual clients. We are the first to form the global expert and the local expert into a Mixture of Foundation Models (MoFM) in federated learning. We specially design and insert a gate adapter into the gate model to help assign the decision weights of the two experts. The aggregation of the gate adapter achieves competitive performance while saving 97.69% of the communication resource consumption. Moreover, to enable efficient training in computation-scarce scenarios, we propose a Sparsely Activated LoRA (SAL) algorithm that activates the low-rank adaptation matrices progressively according to the past accuracies in the Capability Queue.
resource consumption. Moreover, to enable efficient training in personalization for CLIP in federated learning,” IEEE Data Eng. Bull.,
computation-scarce scenarios, we propose a Sparsely Activated vol. 46, no. 1, pp. 52–66, 2023. [Online]. Available: https://dblp.org/rec/
journals/debu/LuH0023.bib
LoRA (SAL) algorithm to activate the low-rank adaptation ma- [24] J. Zhang et al., “Towards building the federated GPT: Federated instruction
trices progressively according to past accuracies in the Capabil- tuning,” 2023, arXiv:2305.05644.
ity Queue. [25] M. Vu et al., “Analysis of privacy leakage in federated large language
models,” in Proc. Int. Conf. Artif. Intell. Statist., 2024, pp. 1423–1431.
We test the performance of FedFMSL through extensive [26] Z. Cai, J. Chen, W. Chen, W. Wang, X. Zhu, and A. Ouyang, “F-CodeLLM:
experiments in various settings, and results show that FedFMSL A federated learning framework for adapting large language models to
outperforms other SOTA baselines by tuning less than 0.3% practical software development,” in Proc. IEEE/ACM 46th Int. Conf. Softw.
Eng.: Companion Proc., 2024, pp. 416–417.
parameters of the foundation model.


REFERENCES

[1] J. D. M.-W. C. Kenton and L. K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics Hum. Lang. Technol., 2019, pp. 4171–4186.
[2] T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[3] A. Radford et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[4] J. Betker et al., “Improving image generation with better captions,” Comput. Sci., vol. 2, no. 3, p. 8, 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf
[5] D. Wu et al., “Large language model adaptation for networking,” 2024, arXiv:2402.02338.
[6] P. Wu, Q. Liu, Y. Dong, and F. Wang, “LMaaS: Exploring pricing strategy of large model as a service for communication,” 2024, arXiv:2401.02675.
[7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Artif. Intell. Statist., 2017, pp. 1273–1282.
[8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proc. Mach. Learn. Syst., vol. 2, pp. 429–450, 2020.
[9] Q. Li, B. He, and D. Song, “Model-contrastive federated learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10713–10722.
[10] J. Zhang et al., “Federated learning with label distribution skew via logits calibration,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 26311–26329.
[11] E. L. Zec, O. Mogren, J. Martinsson, L. R. Sütfeld, and D. Gillblad, “Specialized federated learning using a mixture of experts,” 2020, arXiv:2010.02056.
[12] R. Zhang, Y. Chen, C. Wu, and F. Wang, “Multi-level personalized federated learning on heterogeneous and long-tailed data,” IEEE Trans. Mobile Comput., to be published, doi: 10.1109/TMC.2024.3409159.
[13] C. Wu, Z. Li, F. Wang, and C. Wu, “Learning cautiously in federated learning with noisy and heterogeneous clients,” in Proc. IEEE Int. Conf. Multimedia Expo, 2023, pp. 660–665.
[14] Y. Mao et al., “Communication-efficient federated learning with adaptive quantization,” ACM Trans. Intell. Syst. Technol., vol. 13, no. 4, pp. 1–26, 2022.
[15] A. Huang, Y. Chen, Y. Liu, T. Chen, and Q. Yang, “RPN: A residual pooling network for efficient federated learning,” 2020, arXiv:2001.08600.
[16] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Federated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Stat., 2021, pp. 2350–2358.
[17] R. Chen, L. Li, K. Xue, C. Zhang, M. Pan, and Y. Fang, “Energy efficient federated learning over heterogeneous mobile devices via joint design of weight quantization and wireless transmission,” IEEE Trans. Mobile Comput., vol. 22, no. 12, pp. 7451–7465, Dec. 2023.
[18] Z. Zhang, Z. Gao, Y. Guo, and Y. Gong, “Scalable and low-latency federated learning with cooperative mobile edge networking,” IEEE Trans. Mobile Comput., vol. 23, no. 1, pp. 812–822, Jan. 2024.
[19] Z. Qu et al., “Partial synchronization to accelerate federated learning over relay-assisted edge networks,” IEEE Trans. Mobile Comput., vol. 21, no. 12, pp. 4502–4516, Dec. 2022.
[20] J. Zhang, X. Qi, and B. Zhao, “Federated generative learning with foundation models,” 2023, arXiv:2306.16064.
[21] T. Guo, S. Guo, J. Wang, X. Tang, and W. Xu, “PromptFL: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model,” IEEE Trans. Mobile Comput., vol. 23, no. 5, pp. 5179–5194, May 2024.
[22] D. Cai, Y. Wu, S. Wang, F. X. Lin, and M. Xu, “Efficient federated learning for modern NLP,” in Proc. 29th Annu. Int. Conf. Mobile Comput. Netw., 2023, pp. 1–16.
[23] W. Lu, H. Xixu, J. Wang, and X. Xie, “FedCLIP: Fast generalization and personalization for CLIP in federated learning,” IEEE Data Eng. Bull., vol. 46, no. 1, pp. 52–66, 2023. [Online]. Available: https://dblp.org/rec/journals/debu/LuH0023.bib
[24] J. Zhang et al., “Towards building the federated GPT: Federated instruction tuning,” 2023, arXiv:2305.05644.
[25] M. Vu et al., “Analysis of privacy leakage in federated large language models,” in Proc. Int. Conf. Artif. Intell. Statist., 2024, pp. 1423–1431.
[26] Z. Cai, J. Chen, W. Chen, W. Wang, X. Zhu, and A. Ouyang, “F-CodeLLM: A federated learning framework for adapting large language models to practical software development,” in Proc. IEEE/ACM 46th Int. Conf. Softw. Eng.: Companion Proc., 2024, pp. 416–417.
[27] X. Liu, T. Pang, and C. Fan, “Federated prompting and chain-of-thought reasoning for improving LLMs answering,” in Proc. Int. Conf. Knowl. Sci. Eng. Manage., 2023, pp. 3–11.
[28] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” 2021, arXiv:2106.09685.
[29] J. Kaplan et al., “Scaling laws for neural language models,” 2020, arXiv:2001.08361.
[30] A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic dimensionality explains the effectiveness of language model fine-tuning,” in Proc. 59th Annu. Meeting Assoc. Comput. Linguistics 11th Int. Joint Conf. Natural Lang. Process., 2021, pp. 7319–7328.
[31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3320–3328.
[32] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” 2018, arXiv:1806.00582.
[33] N. Houlsby et al., “Parameter-efficient transfer learning for NLP,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2790–2799.
[34] S. Wang, Y. Hong, R. Wang, Q. Hao, Y.-C. Wu, and D. W. K. Ng, “Edge federated learning via unit-modulus over-the-air computation,” IEEE Trans. Commun., vol. 70, no. 5, pp. 3141–3156, May 2022.
[35] Y. Dong et al., “Accelerating wireless federated learning via Nesterov's momentum and distributed principle component analysis,” IEEE Trans. Wireless Commun., vol. 23, no. 6, pp. 5938–5952, Jun. 2024.
[36] K. Pillutla, K. Malik, A.-R. Mohamed, M. Rabbat, M. Sanjabi, and L. Xiao, “Federated learning with partial model personalization,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 17716–17758.
[37] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 446–461.
[38] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 7, pp. 2217–2226, Jul. 2019.
[39] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” 2012, arXiv:1212.0402.
[40] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni, “Bayesian nonparametric federated learning of neural networks,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7252–7261.
[41] J. Li, R. R. Selvaraju, A. D. Gotmare, S. Joty, C. Xiong, and S. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 9694–9705.
[42] J. Van Der et al., “HTTP/2-based adaptive streaming of HEVC video over 4G/LTE networks,” IEEE Commun. Lett., vol. 20, no. 11, pp. 2177–2180, Nov. 2016.

Panlong Wu received the BEng degree from the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, in 2022. He is currently working toward the PhD degree in the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen. His current research interests include federated learning, foundation models, and multimedia networking.

Kangshuo Li is currently working toward the BEng degree in the School of Data Science, The Chinese University of Hong Kong, Shenzhen. His current research interests include federated learning and foundation models.

Ting Wang is currently working toward the bachelor of engineering degree in electronic information engineering, focusing on computer engineering, with the Chinese University of Hong Kong (Shenzhen). His research interests lie in foundation models and edge computing.

Yanjie Dong (Member, IEEE) received the MASc and PhD degrees from the University of British Columbia, Canada, in 2016 and 2020, respectively. He is currently an associate professor and assistant dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University. His research interests focus on the protocol design of energy-efficient communications, machine learning based resource allocation algorithms, and quantum computing technologies. He regularly serves as a member of the Technical Program Committee of flagship conferences in IEEE ComSoc.

Victor C. M. Leung (Life Fellow, IEEE) is a distinguished professor and dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, China. He is also an emeritus professor of electrical and computer engineering and director of the Laboratory for Wireless Networks and Mobile Systems with the University of British Columbia (UBC), Canada. His research is in the broad areas of wireless networks and mobile systems, and he has published widely in these areas. He is serving as a senior editor of IEEE Transactions on Green Communications and Networking. He is also serving on the editorial boards of IEEE Transactions on Cloud Computing, IEEE Transactions on Computational Social Systems, IEEE Access, IEEE Network, and several other journals. He received the 1977 APEBC Gold Medal, 1977-1981 NSERC Postgraduate Scholarships, IEEE Vancouver Section Centennial Award, 2011 UBC Killam Research Prize, 2017 Canadian Award for Telecommunications Research, 2018 IEEE TCGCC Distinguished Technical Achievement Recognition Award, and 2018 ACM MSWiM Reginald Fessenden Award. He co-authored papers that won the 2017 IEEE ComSoc Fred W. Ellersick Prize, 2017 IEEE Systems Journal Best Paper Award, 2018 IEEE CSIM Best Journal Paper Award, and 2019 IEEE TCGCC Best Journal Paper Award. He is a Fellow of the Royal Society of Canada (Academy of Science), Canadian Academy of Engineering, and Engineering Institute of Canada. He is named in the current Clarivate Analytics list of “Highly Cited Researchers”.

Fangxin Wang (Member, IEEE) received the BEng, MEng, and PhD degrees, all in computer science and technology, from Simon Fraser University, Tsinghua University, and Beijing University of Posts and Telecommunications, respectively. He is an assistant professor with the Chinese University of Hong Kong, Shenzhen (CUHKSZ). Before joining CUHKSZ, he was a postdoctoral fellow with the University of British Columbia. Dr. Wang's research interests include multimedia systems and applications, cloud and edge computing, deep learning, and distributed networking and systems. He leads the Intelligent Networking and Multimedia Lab (INML) at CUHKSZ. He has published more than 50 papers in top journals and conferences, including INFOCOM, Multimedia, VR, ToN, TMC, IOTJ, etc. He was selected in the 8th Young Elite Scientist Sponsorship Program, is a CUHKSZ Presidential Young Scholar, and is a recipient of the SFU Dean's Convocation Medal for Academic Excellence. He serves as an associate editor of IEEE Transactions on Mobile Computing, TPC chair of IEEE Satellite 2023, TPC member of IWQoS, ICC, and BigCom, and reviewer for many top conferences and journals, including INFOCOM, ToN, TMC, JSAC, etc.
