FedFMSL: Federated Learning of Foundation Models With Sparsely Activated LoRA
Abstract—Foundation models (FMs) have shown great success in natural language processing, computer vision, and multimodal tasks. FMs have a large number of model parameters, thus requiring a substantial amount of data to help optimize the model during the training. Federated learning has revolutionized machine learning by enabling collaborative learning from decentralized data while still preserving clients' data privacy. Despite the great benefits foundation models can gain when empowered by federated learning, their bulky model parameters cause severe communication challenges for modern networks and computation challenges, especially for edge devices. Moreover, the data distributions of different clients can be different, thus inducing statistical challenges. In this paper, we propose a novel two-stage federated learning algorithm called FedFMSL. A global expert is trained in the first stage and a local expert is trained in the second stage to provide better personalization. We construct a Mixture of Foundation Models (MoFM) with these two experts and design a gate neural network with an inserted gate adapter that joins the aggregation in every communication round in the second stage. To further adapt to edge computing scenarios with limited computational resources, we design a novel Sparsely Activated LoRA (SAL) algorithm that freezes the pre-trained foundation model parameters, inserts low-rank adaptation matrices into transformer blocks, and activates them progressively during the training. We conduct extensive experiments to verify the effectiveness of FedFMSL; the results show that FedFMSL outperforms other SOTA baselines by up to 59.19% in default settings while tuning less than 0.3% of the parameters of the foundation model.

Index Terms—Edge computing, federated learning, foundation model.

I. INTRODUCTION

FOUNDATION models (FMs) have emerged as a potent solution to address the growing demand for machine learning services. They present several advantages over their predecessors, the traditional smaller models. FMs stand out primarily due to the massive amount of training data together with the extensive number of parameters. This massively increased parameter space, together with the intricate patterns and relationships captured from the data, enables FMs to improve performance across various machine learning tasks.

FMs follow a distinct training methodology compared to smaller models. While smaller models often rely on task-specific training, FMs employ a pre-training and fine-tuning strategy. The pre-training phase on large datasets acts as a stepping stone, equipping FMs with substantial knowledge and context from massive data. Consequently, when FMs are fine-tuned for specific tasks, they derive significant advantages from the initial pre-training, leading to enhanced performance across a diverse set of tasks.

FMs with a tremendous number of parameters are data-hungry due to the large number of parameters to be optimized. However, user data is usually stored on edge devices and is privacy sensitive, so a single client typically cannot meet the data volume required for FM training. Federated learning (FL) has revolutionized the landscape of machine learning by enabling the collaborative training of a shared model across multiple edge devices without sharing raw data. By combining the power of FMs with the decentralized approach of FL, we can utilize the distributed edge data while preserving data privacy and overcoming the limitations of centralized training approaches, thus enhancing the generalization ability of FMs. This combination allows us to harness the benefits of both FL's collaborative training across edge devices and FMs' large parameter capacity as well as the pre-training strategy.

However, existing FMs are cumbersome, bandwidth-intensive, and computation-intensive. This raises several challenges in the domain of FL with FM, which make it hard to employ in real-world applications.
• Bandwidth: The first challenge lies within the substantial number of parameters possessed by FMs, which distinguishes them from traditional FL models that possess far fewer parameters. This difference introduces impediments in the areas of communication and networking, as the transmission of the parameters of these FMs is significantly time-consuming for modern mobile networks.
• Computation: The second challenge arises from the huge computational resource requirements posed by FMs, particularly for edge devices with limited computing resources. The considerable computational costs associated with these FMs create difficulties in implementing FL on resource-constrained edge devices.
• Data heterogeneity: The third challenge is that FL encounters statistical challenges due to non-IID decentralized data, potentially resulting in issues such as parameter divergence and data distribution biases. These issues can lead to severe performance loss when FMs are employed in FL, because FMs usually have large parameter spaces, which makes them hard to optimize.

To fill the gap, this paper addresses the challenges associated with FL with FM by introducing the FedFMSL algorithm, an FL algorithm with a Mixture of Foundation Models that have sparsely activated parameters. The proposed FedFMSL algorithm consists of two training stages. A global foundation model is trained collaboratively in the first stage, and local foundation models are trained in the second stage to provide better personalization.

In the first training stage, each client owns a foundation model, and low-rank adaptation matrices are inserted into every transformer block of the foundation model. During the training, the pre-trained weights of the foundation models are frozen, and all the parameters of the inserted matrices are activated to better extract global information. In each communication round, only the inserted matrices join the weight aggregation, to reduce bandwidth consumption. The foundation model trained in the first stage will be frozen in the second stage and act as the global expert.

In the second training stage, we for the first time form a Mixture of Foundation Models (MoFM) system in FL, which specifically addresses the data heterogeneity challenges encountered in FL with FM. The MoFM system consists of a global foundation model, a local foundation model, and a gate model, providing both generalization and personalization ability. We leverage the foundation model trained in the first stage as the global expert and introduce another local expert, which is a foundation model initialized from the weights of the global model, to provide better personalization. We design a gate model with a specially designed gate adapter inserted into it so that it can quickly adapt to the changing relationship between the two experts and intelligently assign weights to their final decisions. In each communication round, only the gate adapter's activated parameters join the aggregation, to save communication resources. To further tackle the computation challenges, we propose a Sparsely Activated LoRA (SAL) algorithm to activate the inserted low-rank adaptation matrices in a progressive way through a controller, to suit different edge resource conditions.

The main contributions of this paper can be summarized as follows:
• We propose a communication- and computation-friendly two-stage personalized FL algorithm, FedFMSL, which captures global feature information through collaborative learning and captures local feature information through personalized learning. We give theoretical proof of the convergence of the FedFMSL algorithm.
• We propose a Sparsely Activated LoRA (SAL) algorithm that sparsely activates the trainable low-rank decomposition matrices injected into foundation models in a progressive way through a self-defined controller, to adapt to scarce computation and communication resources in edge computing scenarios.
• We propose a Mixture of Foundation Models (MoFM) algorithm which is, to the best of our knowledge, the first work to construct a mixture of vision-language foundation models in personalized federated learning to tackle data heterogeneity in federated learning, and we further prove the effectiveness of FedFMSL through extensive experiments.

II. BACKGROUND AND RELATED WORK

A. Foundation Model

Recently, FMs have achieved remarkable success in various domains such as natural language processing, computer vision, and multimodal tasks. By utilizing deep learning techniques like self-supervised learning and contrastive learning, FMs with a massive number of model parameters are trained on large datasets. Consequently, these models exhibit strong generalization, feature extraction, and comprehension abilities.

Various works have been done related to FMs in natural language processing. BERT [1], referred to as Bidirectional Encoder Representations from Transformers, is an advanced natural language processing model introduced by Devlin et al. (2018). This model employs a transformer architecture and is pre-trained on extensive text data using a masked language model pre-training objective. GPT-3 [2] is trained using a language modeling pre-training objective. By making the model do next-token prediction, it can utilize the massive unlabeled data from the internet and has a powerful few-shot learning ability.

Various works have also been done related to vision-language FMs. Contrastive Language-Image Pre-training (CLIP) [3] is a famous FM proposed by OpenAI. This model uses a visual encoder and a text encoder to extract the semantic meaning and encode images and texts into image features and text features. Throughout the training process, contrastive learning is employed to maximize the similarity between related images and texts while minimizing the similarity between unrelated ones. DALL-E 3 [4] is a modern text-to-image system that has extraordinary prompt-following capability. It addresses the issue of noisy and inaccurate image captions by training another specially designed image captioner.

There are also various works related to applications of FMs. NetLLM [5] uses the foundation model to do three networking-related tasks, which are viewport prediction, adaptive bitrate streaming, and cluster job scheduling.
LMaaS [6] explores the service pricing of foundation models in mobile computing scenarios.

B. Federated Learning

Federated learning [7] is a machine learning technique that enables training on decentralized data while preserving the privacy of the clients participating in the training. Typically, each client shares not its data but its private model after local training in each communication round. Although FL has shown great potential in the Internet of Things, the financial field, and smart healthcare, it still faces many challenges.

Many studies focus on solving the statistical challenges. FL faces serious statistical challenges because the data distribution of the datasets is often non-iid, which can lead to weight divergence after model aggregation. Li et al. [8] propose FedProx, which handles system heterogeneity by introducing an additional proximal term to prevent the local model updates from drifting far from the global model, and can thus safely aggregate the local updates under statistical heterogeneity. Li et al. [9] introduce MOON, which uses the idea of contrastive learning to compare the representations learned by the local models and the global model, inspired by the philosophy that the global model has better feature extraction ability than local models trained on skewed local datasets. Zhang et al. [10] design FedLC, which introduces a fine-grained calibrated cross-entropy loss to mitigate the local gradient deviation, and give theoretical proof of the deviation bound after calibration. Zec et al. [11] propose a personalized federated learning algorithm with a mixture of experts, with a training pipeline of global expert training, local expert training, and mixer training. Zhang et al. [12] propose a Multi-level Personalized Federated Learning (MuPFL) algorithm to tackle the heterogeneous and long-tailed data problems in FL. Wu et al. [13] propose a FedCNI algorithm to address the noisy and heterogeneous data problem in FL.

The communication efficiency issue is also an important issue that many researchers focus on. Mao et al. [14] propose an Adaptive Quantized Gradient (AQG) algorithm to decide the level of quantization according to the gradient updates of heterogeneous clients. Huang et al. [15] propose a Residual Pooling Network (RPN) based on the approximation and selection of parameters, and apply it to CNN-based FL training. Haddadpour et al. [16] introduce an algorithm with periodic compressed communication: specifically, they introduce the FedCOM algorithm to tackle the homogeneous client situation and the FedCOMGATE algorithm to tackle heterogeneous client situations. Chen et al. [17] propose a federated learning algorithm that considers the weight quantization in wireless transmission and formulate the federated learning problem as a mixed-integer programming problem. Zhang et al. [18] introduce a CFEL algorithm that jointly considers cloud-based federated learning and edge-based federated learning. Qu et al. [19] design a partially synchronized federated learning algorithm to accelerate federated learning training.

C. FL With FM

Not many works have been done on FL with FM. Zhang et al. [20] propose a federated generative learning framework that utilizes an FM on the server to generate synthesized images given the prompts transmitted from the clients, to improve the training performance of the model. Tao et al. [21] propose a PromptFL algorithm to replace the model aggregation in traditional FL with prompt aggregation to reduce communication and computation costs. Cai et al. [22] design an AdaFL algorithm to fine-tune FMs for modern natural language processing tasks by inserting adapters into models, dividing clients into three groups, and observing each group's training accuracy to decide the best configuration for the adapters. Lu et al. [23] propose a FedCLIP algorithm to insert adapters in the visual encoder of the FM CLIP and test it on datasets in different domains. Zhang et al. [24] introduce a Federated Instruction Tuning (FedIT) algorithm to leverage federated learning in instruction tuning of FMs to enhance their performance. Vu et al. [25] do a thorough analysis of the privacy leakage of federated learning with foundation models when data is under the protection of differential privacy mechanisms. Cai et al. [26] explore the usage of federated learning to boost the performance of large language models in practical software development; the F-CODLLM they propose supports multi-language combined fine-tuning, which can extract universal knowledge from a variety of programming languages and improve the overall performance of code-related tasks. Liu et al. [27] utilize the Self-Consistency and Chain-of-Thought techniques in federated learning to improve the quality of LLMs' answers to users without sophisticated parameter tuning. However, none of these works consider the cooperation of FMs and thus cannot achieve good performance under data-heterogeneous conditions on challenging datasets.

III. DESIGN OF FEDFMSL

A. Overview of FedFMSL

We consider a typical FL scenario with a total number of N clients with non-iid datasets {D1, . . ., DN}. Our method FedFMSL consists of two stages of training, as depicted in Fig. 1. In the first stage, low-rank adaptation matrices are inserted into every transformer block of the foundation model [28]. All the clients freeze the pre-trained foundation model weights and only update and upload the weights of the inserted matrices in every communication round. In this stage, all the inserted low-rank adaptation matrices are activated to better extract general information from all the clients, and every client collaboratively trains a global model, which will be the global expert in stage two. The learning goal of stage one can be expressed as

$$\min_{w_i} F = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{(x_i, y_i)\sim d_i}\, F_i(x_i, y_i; w_i), \qquad (1)$$

where Fi denotes the loss function of client i ∈ [N]; wi denotes the parameters of the inserted low-rank adaptation matrices of the model of client i; xi denotes its private data; yi denotes the corresponding label; and di denotes the data distribution of client i. In the second stage, each client utilizes the global expert trained in the first stage and trains its personalized model. This model consists of a global expert, a local expert, and a gate model, which together constitute a Mixture of Foundation Models. During this stage, local experts are only trained on clients' local datasets
using a novel Sparsely Activated LoRA algorithm and do not engage in the global aggregation. We design and insert a gate adapter into the gate model and aggregate all the parameters of the gate adapters in each communication round to gather information about the global data distribution and thus better assign weights to the two experts. We optimize the parameters of the local expert and the gate adapter by

$$\min_{e_i, u_i} L = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{(x_i, y_i)\sim d_i}\, L_i(x_i, y_i; e_i, u_i), \qquad (2)$$

where ui denotes the parameters of the gate adapter of client i and ei denotes the parameters of the local expert of client i. We propose two novel algorithms to tackle the challenges raised by FL with FM.
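To make the two-stage procedure concrete, the following is a minimal PyTorch-style sketch of the server-side loop under our own naming assumptions: the client helpers (`local_train`, `get_state`, `load_state`, `num_samples`) are hypothetical stand-ins for whatever training and serialization code a real deployment would use, not functions defined in this paper.

```python
import copy

def weighted_average(state_dicts, weights):
    # Weighted average of parameter dictionaries; weights must sum to 1.
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(w * sd[key] for sd, w in zip(state_dicts, weights))
    return avg

def run_stage_one(clients, num_rounds):
    # Stage 1: every client trains all inserted LoRA matrices; only those
    # matrices are uploaded and averaged to form the global expert.
    for _ in range(num_rounds):
        states, sizes = [], []
        for c in clients:
            c.local_train(trainable="lora")            # pre-trained FM weights stay frozen
            states.append(c.get_state(part="lora"))    # upload LoRA matrices only
            sizes.append(c.num_samples)
        total = float(sum(sizes))
        global_lora = weighted_average(states, [n / total for n in sizes])
        for c in clients:
            c.load_state(global_lora)                  # broadcast the aggregated matrices
    return global_lora                                  # frozen global expert for stage 2

def run_stage_two(clients, num_rounds):
    # Stage 2: the local expert and gate adapter are trained on local data;
    # only the gate adapter parameters are aggregated each round.
    for _ in range(num_rounds):
        states, sizes = [], []
        for c in clients:
            c.local_train(trainable=("local_expert", "gate_adapter"))
            states.append(c.get_state(part="gate_adapter"))
            sizes.append(c.num_samples)
        total = float(sum(sizes))
        global_gate = weighted_average(states, [n / total for n in sizes])
        for c in clients:
            c.load_state(global_gate)                  # local experts remain personalized
```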
putation power and storage capacity, whereas edge devices
In this paper, we use typical visual language foundation mod-
typically possess limited computational capabilities and storage
els like CLIP to do the image classification task. In these models,
space. Therefore, it is imperative to develop an algorithm that
the visual encoder and the text encoder map input images and
mitigates the communication and computation costs associated
text to high-dimensional feature vectors. The model training
with FL using FM.
aims to minimize the cosine similarity between related pairs and
To tackle these challenges, we design a novel Sparsely Acti-
maximize it for unrelated pairs, aligning their representations in
vated LoRA algorithm that can achieve the SOTA performance
the feature space. During the evaluation, the model precomputes
while only tuning less than 1% of the total parameters of FM.
a set of text feature vectors for each class. Then the cosine
Common pre-trained language models are capable of efficient
similarity between the input image feature vector and these
learning even when randomly projected into a smaller subspace
precomputed text feature vectors is calculated and the class with
because they have a very low intrinsic dimension [30]. Edward
the highest similarity is the classification result.
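As an illustration of this evaluation procedure, the sketch below scores an image against precomputed text features using the public CLIP checkpoint from Hugging Face Transformers; the prompt template and the placeholder class names are our own choices for the example, not settings reported in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_names = ["pizza", "sushi", "steak"]  # placeholder classes, e.g. from Food101

# Precompute one text feature vector per class from a prompt template.
text_inputs = processor(text=[f"a photo of {c}" for c in class_names],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify(image: Image.Image) -> str:
    # Return the class whose text feature has the highest cosine similarity.
    image_inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_feat = model.get_image_features(**image_inputs)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    sims = image_feat @ text_feats.T          # cosine similarities (1 x num_classes)
    return class_names[int(sims.argmax())]
```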
B. Sparsely Activated LoRA

According to [29], the capability of a deep neural network tends to improve as the number of parameters of the model increases. FL with FM presents substantial challenges to the communication and computation of the distributed system due to the FM's large number of parameters.

In traditional FL [7], [8], the model parameters after local training are usually transmitted to the server for model weight aggregation in each communication round. This paradigm faces great challenges when FMs are trained in an FL procedure. Suppose we have an FM whose parameters are represented by Wf. For full-parameter fine-tuning, we need to compute and store another model Wk, which has the same parameter size as Wf, for each task k. An FM, which typically consists of over 10 million model parameters, needs more than 160 million bits to be represented (at 16 bits per parameter). This results in significant transmission time requirements for modern mobile communication networks.

Moreover, the training of FMs necessitates substantial computation power and storage capacity, whereas edge devices typically possess limited computational capabilities and storage space. Therefore, it is imperative to develop an algorithm that mitigates the communication and computation costs associated with FL using FM.

To tackle these challenges, we design a novel Sparsely Activated LoRA algorithm that can achieve SOTA performance while tuning less than 1% of the total parameters of the FM. Common pre-trained language models are capable of efficient learning even when randomly projected into a smaller subspace because they have a very low intrinsic dimension [30]. Hu et al. [28] propose Low-Rank Adaptation (LoRA), which inserts trainable low-rank decomposition matrices into FMs, enabling model optimization with minimal parameter tuning.

Inspired by this, we insert trainable low-rank decomposition matrices in every layer of the visual encoder and the text encoder of the foundation model. We denote the weight parameter matrix of the foundation model as W0 ∈ R^{E×F} and the inserted low-rank decomposition matrices as ΔW ∈ R^{E×F}, which can
be calculated as the product of two low-rank matrices, ΔW = WA WB, with WA ∈ R^{E×H} and WB ∈ R^{H×F} (H ≪ min(E, F)). This adjustment allows WA WB to have the same dimensions as the weight parameter matrix W0 of the foundation model while reducing the number of tunable parameters for each weight matrix by EF − (EH + HF). For WA, we employ a random Gaussian initialization, while WB is initialized with zero. During training, W0 is frozen and only WA and WB are optimized, which saves computation and storage costs. Suppose the input of the weight matrix and the inserted low-rank decomposition matrices is x. The output can be calculated by

$$y = (W_0 + W_A W_B)\, x. \qquad (3)$$
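A minimal PyTorch sketch of such a layer is shown below. It is our own illustration rather than the released LoRA code; the Gaussian scale and the absence of LoRA's usual scaling factor are simplifications.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Computes y = W0 x + (low-rank update) x, with W0 frozen and only the
    # two small matrices trainable, as in Eq. (3).
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                          # W0 is frozen
        self.wa = nn.Parameter(torch.randn(in_features, rank) * 0.01)   # Gaussian init
        self.wb = nn.Parameter(torch.zeros(rank, out_features))         # zero init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.wa) @ self.wb
```

Because WB is initialized to zero, the adapted layer initially behaves exactly like the frozen model, and only the EH + HF adapter parameters per weight matrix need to be trained and exchanged.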
The procedure of the proposed SAL algorithm is depicted in Fig. 2. We activate the low-rank decomposition matrices sparsely instead of activating them all during the training. At the beginning of the training stage, every layer of the visual encoder and the text encoder is inserted with frozen low-rank decomposition matrices. In deep neural networks, lower layers can better extract general information than higher layers [31]. During the first training stage, the low-rank decomposition matrices in all layers are activated to better extract general information to form a global expert, while in the second stage, we unfreeze the low-rank decomposition matrices from higher layers to lower layers during the training.

More specifically, we introduce a Capability Queue with a maximum queue length of Q. Image classification accuracies of clients are forwarded to the Capability Queue after every communication round. Once the Capability Queue is full, the previously added accuracies are popped out. We set an accuracy threshold δ to help decide whether the training has come to a bottleneck. The incremental factor Δ of client j in communication round i is

$$\Delta_{i,j} = Acc_{i,j} - \frac{1}{Q}\sum_{t=i-Q}^{i-1} Acc_{t,j}, \qquad (4)$$

where Acc_{i,j} denotes the image classification accuracy of the model of client j in communication round i. If Δ_{i,j} < δ, the training is considered to have come to a bottleneck, and the low-rank decomposition matrices in the next lower layer will be activated.

The design of the SAL algorithm is inspired by the fact that the performance of an FM is usually affected by the model size, the dataset size, and the quality of the dataset. Challenging datasets require more model parameters to be optimized to better extract the semantic meaning of the data. However, there is no silver-bullet configuration in the training of FL with FM, so we introduce the Capability Queue to intelligently decide the number of activated LoRA parameters and enable training on computation resource-limited devices.
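The controller described above can be sketched as follows. This is our own reading of the description; the `activate_lora` hook, the layer indexing, and the source of δ and Q are illustrative assumptions rather than details given in the paper.

```python
from collections import deque

class SALController:
    # Progressively unfreezes LoRA matrices from higher to lower layers
    # when the accuracy improvement stalls, following Eq. (4).
    def __init__(self, num_layers: int, queue_len: int, delta: float):
        self.acc_queue = deque(maxlen=queue_len)   # Capability Queue of length Q
        self.delta = delta                         # accuracy threshold delta
        self.next_layer = num_layers - 1           # start from the highest layer

    def update(self, accuracy: float, model) -> None:
        if len(self.acc_queue) == self.acc_queue.maxlen:
            # Incremental factor: current accuracy minus the queue average.
            increment = accuracy - sum(self.acc_queue) / len(self.acc_queue)
            if increment < self.delta and self.next_layer >= 0:
                # Training hit a bottleneck: activate the next lower layer's LoRA.
                model.activate_lora(layer=self.next_layer)   # assumed helper
                self.next_layer -= 1
        self.acc_queue.append(accuracy)
```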
C. Mixture of Foundation Models

In traditional FL, a global model is trained using the decentralized data of clients. Only model weights are aggregated on the central server while the local data of clients are kept private to ensure clients' data privacy. This paradigm faces statistical challenges, especially when the data distribution of clients is non-iid. Such non-iid data distribution can cause weight divergence during the training [32] and lead to significant performance drops. Moreover, training a single global model and applying it to all clients cannot suit different clients' needs when their data have different distributions. Training personalized models while benefiting from a global model is essential to providing better performance for different clients.

To tackle this challenge, we design a novel Mixture of Foundation Models (MoFM) algorithm that utilizes one FM as the global expert and another FM as the local expert, thus creating a mixture of foundation models that simultaneously learns personalized feature information as well as global feature information on each client.

As shown in Fig. 1, in the first stage of training, every client collaboratively trains a global FM. Low-rank decomposition matrices are inserted in every layer of the visual encoder and the text encoder. This global FM acts as a global
expert. In the second stage, a local expert is created for each client i to cooperate with the global expert.

More specifically, the local experts have the same neural network architecture as the global expert and are initialized with the weights of the global expert. A gate function Gi for each client i is a neural network introduced to control the relative contributions of the global expert and the local expert to the final image classification decision given different images. We denote the image features and text features extracted by the global expert as Vg and Tg, and the image features and text features extracted by the local expert of client i as Vi and Ti. The final cosine similarity of the image features and text features extracted from the dataset of client i can be denoted by

$$\tilde{O}_i = \lambda_i \langle V_g, T_g \rangle + (1 - \lambda_i)\langle V_i, T_i \rangle, \qquad (5)$$

where λi ∈ (0, 1) is a weight factor representing the mixing ratio of the global expert and the local expert of client i. A larger λi indicates that more global knowledge is used, while a smaller λi indicates that more personal knowledge is used.
of the global expert are frozen, and the local expert and the After taking the expectations of both sides, we can get
gate model are optimized only using the local data of client i. L 2 2
The adapter [33] has been a popular parameter-efficient tuning E[F(wt+1 ) − F(wt )] + γE ∇F(wt )2 ≤ γ G (12a)
2
method in FMs. It works by inserting very few layers into L
FMs and optimizing FM by only tuning the inserted very few γE ∇F(wt )2 ≤ E[F(wt ) − F(wt+1 )] + γ 2 G2 (12b)
2
parameters.
E[F(wt ) − F(wt+1 )] L 2
We design a novel gate adapter to adapt to the local datasets E ∇F(wt )2 ≤ + γG . (12c)
while maintaining a low computation and communication cost. γ 2
In each communication round, clients’ activated gate adapter By iterating the value of t from 1 to T we have
parameters are aggregated to learn global feature information,
E[F(w1 ) − F(w2 )] L 2
thus maintaining a low computation and communication cost. E ∇F (w1 )2 ≤ + γG (13a)
The aggregation of gate adapters can help the gate neural net- γ 2
work to better assign weight to different experts by determining E[F(w2 ) − F(w3 )] L 2
E ∇F(w2 )2 ≤ + γG (13b)
whether the given data is more subject to the user’s personalized γ 2
data distribution or the global data distribution.
..
We denote the parameter of the gate adapter after aggregation .
as gglobal . Specifically, we construct the gate adapter with a
E[F(wT ) − F(wT +1 )] L 2
Multi-Layer Perceptron (MLP), a batch norm layer, an MLP, E ∇F(wT )2 ≤ + γG . (13c)
a batch norm layer, and finally a Softmax function to ensure the γ 2
output is between (0,1). The gate adapter aggregation procedure After summing these equations we can get
is denoted as:
T
E[F(w1 ) − F(w∗ ) L
N E ∇F(wt )2 ≤ + T γG2 . (14)
|Di | γ 2
uglobal = N ui . (6) t=1
i=0 |Di | i=1 Because of
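The gate adapter and the aggregation in (6) can be sketched as follows; the layer widths and the two-way softmax output are our assumptions, since the paper does not report these hyperparameters.

```python
import torch
import torch.nn as nn

class GateAdapter(nn.Module):
    # MLP -> BatchNorm -> MLP -> BatchNorm -> Softmax, producing the mixing
    # weights (lambda, 1 - lambda) for the global and local experts.
    def __init__(self, in_dim: int = 512, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, 2),
            nn.BatchNorm1d(2),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def aggregate_gate_adapters(adapter_states, dataset_sizes):
    # Dataset-size-weighted average of the clients' gate adapter parameters, Eq. (6).
    total = float(sum(dataset_sizes))
    return {key: sum((n / total) * sd[key] for sd, n in zip(adapter_states, dataset_sizes))
            for key in adapter_states[0]}
```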
IV. CONVERGENCE ANALYSIS

In this section, we conduct a thorough analysis of the convergence of the FedFMSL algorithm. We first prove the convergence of the first stage and then of the second stage of FedFMSL. For the first stage, we introduce the following assumptions, which are common assumptions used in [34], [35].

Assumption 1. The loss functions F1, . . ., Fi, . . ., FN are L-smooth; that is, for all w and w′ we have

$$F_i(w) \le F_i(w') + (w - w')^{\top}\nabla F_i(w') + \frac{L}{2}\|w - w'\|^2. \qquad (7)$$

Assumption 2. The expectation of the squared gradient norm is bounded by G², where G can be a relatively large number that denotes the maximum gradient norm during the backpropagation process in the optimization of the foundation model:

$$\mathbb{E}\|\nabla F(w_t)\|^2 \le G^2. \qquad (8)$$

We set $\nabla F(w_t) = \sum_{i=1}^{N} p_i \nabla F_i(w_t)$, where $p_i$ is the weight of the $i$-th device, satisfying $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. In communication round t, the server performs the gradient descent step

$$w_{t+1} = w_t - \gamma \nabla F(w_t), \qquad (9)$$

where γ is the learning rate. From L-smoothness we can get

$$F(w_{t+1}) - F(w_t) - (w_{t+1} - w_t)^{\top}\nabla F(w_t) \le \frac{L}{2}\|w_{t+1} - w_t\|^2; \qquad (10)$$

then we substitute (9) and get

$$F(w_{t+1}) - F(w_t) + \gamma \langle \nabla F(w_t), \nabla F(w_t)\rangle \le \frac{L}{2}\|w_{t+1} - w_t\|^2. \qquad (11)$$

After taking the expectations of both sides, we can get

$$\mathbb{E}[F(w_{t+1}) - F(w_t)] + \gamma\,\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{L}{2}\gamma^2 G^2, \qquad (12a)$$
$$\gamma\,\mathbb{E}\|\nabla F(w_t)\|^2 \le \mathbb{E}[F(w_t) - F(w_{t+1})] + \frac{L}{2}\gamma^2 G^2, \qquad (12b)$$
$$\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\gamma} + \frac{L}{2}\gamma G^2. \qquad (12c)$$

By iterating the value of t from 1 to T we have

$$\mathbb{E}\|\nabla F(w_1)\|^2 \le \frac{\mathbb{E}[F(w_1) - F(w_2)]}{\gamma} + \frac{L}{2}\gamma G^2, \qquad (13a)$$
$$\mathbb{E}\|\nabla F(w_2)\|^2 \le \frac{\mathbb{E}[F(w_2) - F(w_3)]}{\gamma} + \frac{L}{2}\gamma G^2, \qquad (13b)$$
$$\vdots$$
$$\mathbb{E}\|\nabla F(w_T)\|^2 \le \frac{\mathbb{E}[F(w_T) - F(w_{T+1})]}{\gamma} + \frac{L}{2}\gamma G^2. \qquad (13c)$$

After summing these inequalities we can get

$$\sum_{t=1}^{T}\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{\mathbb{E}[F(w_1) - F(w_*)]}{\gamma} + \frac{TL}{2}\gamma G^2. \qquad (14)$$

Because

$$\min_{t=1:T}\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla F(w_t)\|^2, \qquad (15)$$

dividing both sides of (14) by T gives

$$\min_{t=1:T}\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{\mathbb{E}[F(w_1) - F(w_*)]}{\gamma T} + \frac{L}{2}\gamma G^2. \qquad (16)$$

When γ satisfies γ = Ω/√T, where Ω is a constant, we obtain a convergence rate of O(1/√T).
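Concretely, substituting γ = Ω/√T into (16) makes the rate explicit:

$$\min_{t=1:T}\ \mathbb{E}\|\nabla F(w_t)\|^2 \;\le\; \frac{\mathbb{E}[F(w_1)-F(w_*)]}{\Omega\sqrt{T}} \;+\; \frac{L\,\Omega\,G^2}{2\sqrt{T}} \;=\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right).$$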
In the second stage, only the gate adapter joins the aggregation in each communication round. We introduce some assumptions
on the second training stage of FedFMSL. These are common assumptions used in [34], [36].

Assumption 3. The loss functions L1, . . ., Li, . . ., LN are smooth: ∇e Li(ei, u) is Le-Lipschitz with regard to ei and Leu-Lipschitz with regard to u, and ∇u Li(ei, u) is Lu-Lipschitz with regard to u and Leu-Lipschitz with regard to ei. We further define χ = Leu/√(Le Lu). For all ei, e′i, u, and u′ we have

$$\|\nabla_e L_i(e_i', u) - \nabla_e L_i(e_i, u)\| \le L_e \|e_i' - e_i\|, \qquad (17)$$
$$\|\nabla_u L_i(e_i, u') - \nabla_u L_i(e_i, u)\| \le L_u \|u' - u\|, \qquad (18)$$
$$\|\nabla_e L_i(e_i, u') - \nabla_e L_i(e_i, u)\| \le L_{eu} \|u' - u\|, \qquad (19)$$
$$\|\nabla_u L_i(e_i', u) - \nabla_u L_i(e_i, u)\| \le L_{eu} \|e_i' - e_i\|. \qquad (20)$$

Assumption 4. The expectation of the squared gradient norm is bounded:

$$\mathbb{E}\|\nabla_e L_i(e_{i,t}, u_t)\|^2 \le G_e^2, \qquad (21)$$
$$\mathbb{E}\|\nabla_u L_i(e_{i,t}, u_t)\|^2 \le G_u^2. \qquad (22)$$

Under these assumptions, the cross term that couples the local expert and the gate adapter can be bounded as

$$\begin{aligned}
(e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_{t+1})
&= (e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_t) + (e_{i,t+1} - e_{i,t})^{\top}\bigl(\nabla_e L_i(e_{i,t}, u_{t+1}) - \nabla_e L_i(e_{i,t}, u_t)\bigr) \\
&\le (e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_t) + \|e_{i,t+1} - e_{i,t}\|\,\|\nabla_e L_i(e_{i,t}, u_{t+1}) - \nabla_e L_i(e_{i,t}, u_t)\| \\
&\le (e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_t) + L_{eu}\,\|u_{t+1} - u_t\|\,\|e_{i,t+1} - e_{i,t}\| \\
&\le (e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_t) + \sqrt{L_u L_e}\,\|u_{t+1} - u_t\|\,\|e_{i,t+1} - e_{i,t}\| \\
&\le (e_{i,t+1} - e_{i,t})^{\top}\nabla_e L_i(e_{i,t}, u_t) + \frac{L_u}{2}\|u_{t+1} - u_t\|^2 + \frac{L_e}{2}\|e_{i,t+1} - e_{i,t}\|^2. \qquad (29)
\end{aligned}$$
Fig. 5. Average accuracy on different datasets under different visual encoders.

Impact of number of clients: We test the performance of FedFMSL and other baselines under different numbers of clients. Specifically, we set the number of clients to 5, 10, and 15. From Fig. 4 we can conclude that FedFMSL has the highest accuracy in all client-number cases. When the number of clients is 5, FedFMSL achieves accuracies of 92.03%, 86.89%, and 99.40% on the Food101, UCF101, and EuroSAT datasets respectively, while the best accuracies of the other three baselines on these three datasets are 87.56%, 77.15%, and 90.40%. FedFMSL surpasses the best performance of the other baselines by 4.47%, 9.74%, and 9%. In the case when there are 10 clients, FedFMSL has a maximum accuracy increase of 9.73%, 21.29%, and 54.85% compared to the three baselines. When the number of clients reaches 15, the accuracy of FedFMSL is 94.74%, 87.69%, and 96.19%, while the highest accuracies of the other three baselines are 87.44%, 62.86%, and 83.44%. The results show that FedFMSL works well under different numbers of clients.

2) Impact of Training Settings. Impact of visual encoders: We test the performance of FedFMSL under the visual encoders ViT-B/16 and ViT-B/32 to further verify the effectiveness of FedFMSL under various visual encoders. We can observe from Fig. 5 that FedFMSL achieves the highest image classification accuracy on all datasets using the visual backbone ViT-B/16 or ViT-B/32. The results show that visual encoders with a larger number of parameters can achieve better performance than those with smaller visual encoders. The accuracy of the four algorithms increased by 0.95%, 11.85%, 4.57%, and 4.25% when using ViT-B/16 as the visual encoder on the UCF101 dataset, which confirms the theoretical analysis that larger models have better feature extraction ability.
Fig. 9. Accuracy with different LoRA dropout rates.
Fig. 10. Accuracy with different LoRA weight decays.
TABLE I: PROPORTION AND TOTAL NUMBER OF TRAINING PARAMETERS FOR DIFFERENT METHODS
TABLE II: COMPARISON OF THE NUMBER OF PARAMETERS OF THE MODEL
Fig. 14. Comparison of the accuracy on different datasets using the model with MoFM (ViT-B/16) and the model without MoFM (ViT-L/14).
TABLE III: COMPARISON OF ACCURACY ON DIFFERENT DATASETS
TABLE IV: COMPARISON OF THE NUMBER OF PARAMETERS OF THE GATE ADAPTER AND THE GATE MODEL
Fig. 15. Communication time under the real-world network trace.
[27] X. Liu, T. Pang, and C. Fan, "Federated prompting and chain-of-thought reasoning for improving LLMs answering," in Proc. Int. Conf. Knowl. Sci. Eng. Manage., 2023, pp. 3–11.
[28] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," 2021, arXiv:2106.09685.
[29] J. Kaplan et al., "Scaling laws for neural language models," 2020, arXiv:2001.08361.
[30] A. Aghajanyan, S. Gupta, and L. Zettlemoyer, "Intrinsic dimensionality explains the effectiveness of language model fine-tuning," in Proc. 59th Annu. Meeting Assoc. Comput. Linguistics and 11th Int. Joint Conf. Natural Lang. Process., 2021, pp. 7319–7328.
[31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3320–3328.
[32] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," 2018, arXiv:1806.00582.
[33] N. Houlsby et al., "Parameter-efficient transfer learning for NLP," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2790–2799.
[34] S. Wang, Y. Hong, R. Wang, Q. Hao, Y.-C. Wu, and D. W. K. Ng, "Edge federated learning via unit-modulus over-the-air computation," IEEE Trans. Commun., vol. 70, no. 5, pp. 3141–3156, May 2022.
[35] Y. Dong et al., "Accelerating wireless federated learning via Nesterov's momentum and distributed principal component analysis," IEEE Trans. Wireless Commun., vol. 23, no. 6, pp. 5938–5952, Jun. 2024.
[36] K. Pillutla, K. Malik, A.-R. Mohamed, M. Rabbat, M. Sanjabi, and L. Xiao, "Federated learning with partial model personalization," in Proc. Int. Conf. Mach. Learn., 2022, pp. 17716–17758.
[37] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 446–461.
[38] P. Helber, B. Bischke, A. Dengel, and D. Borth, "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 7, pp. 2217–2226, Jul. 2019.
[39] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," 2012, arXiv:1212.0402.
[40] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni, "Bayesian nonparametric federated learning of neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 7252–7261.
[41] J. Li, R. R. Selvaraju, A. D. Gotmare, S. Joty, C. Xiong, and S. Hoi, "Align before fuse: Vision and language representation learning with momentum distillation," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 9694–9705.
[42] J. van der Hooft et al., "HTTP/2-based adaptive streaming of HEVC video over 4G/LTE networks," IEEE Commun. Lett., vol. 20, no. 11, pp. 2177–2180, Nov. 2016.

Panlong Wu received the BEng degree from the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, in 2022. He is currently working toward the PhD degree in the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen. His current research interests include federated learning, foundation models, and multimedia networking.

Kangshuo Li is currently working toward the BEng degree in the School of Data Science, The Chinese University of Hong Kong, Shenzhen. His current research interests include federated learning and foundation models.

Ting Wang is currently working toward the bachelor of engineering degree in electronic information engineering, focusing on computer engineering, with the Chinese University of Hong Kong, Shenzhen. His research interests lie in foundation models and edge computing.

Yanjie Dong (Member, IEEE) received the MASc and PhD degrees from the University of British Columbia, Canada, in 2016 and 2020, respectively. He is currently an associate professor and assistant dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University. His research interests focus on the protocol design of energy-efficient communications, machine learning based resource allocation algorithms, and quantum computing technologies. He regularly serves as a member of the Technical Program Committee in flagship conferences of the IEEE ComSoc.

Victor C. M. Leung (Life Fellow, IEEE) is a distinguished professor and dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, China. He is also an emeritus professor of electrical and computer engineering and director of the Laboratory for Wireless Networks and Mobile Systems with the University of British Columbia (UBC), Canada. His research is in the broad areas of wireless networks and mobile systems, and he has published widely in these areas. He is serving as a senior editor of IEEE Transactions on Green Communications and Networking. He is also serving on the editorial boards of IEEE Transactions on Cloud Computing, IEEE Transactions on Computational Social Systems, IEEE Access, IEEE Network, and several other journals. He received the 1977 APEBC Gold Medal, 1977-1981 NSERC Postgraduate Scholarships, IEEE Vancouver Section Centennial Award, 2011 UBC Killam Research Prize, 2017 Canadian Award for Telecommunications Research, 2018 IEEE TCGCC Distinguished Technical Achievement Recognition Award, and 2018 ACM MSWiM Reginald Fessenden Award. He co-authored papers that won the 2017 IEEE ComSoc Fred W. Ellersick Prize, 2017 IEEE Systems Journal Best Paper Award, 2018 IEEE CSIM Best Journal Paper Award, and 2019 IEEE TCGCC Best Journal Paper Award. He is a Fellow of the Royal Society of Canada (Academy of Science), the Canadian Academy of Engineering, and the Engineering Institute of Canada. He is named in the current Clarivate Analytics list of "Highly Cited Researchers".

Fangxin Wang (Member, IEEE) received the BEng, MEng, and PhD degrees, all in computer science and technology, from Beijing University of Posts and Telecommunications, Tsinghua University, and Simon Fraser University, respectively. He is an assistant professor with the Chinese University of Hong Kong, Shenzhen (CUHKSZ). Before joining CUHKSZ, he was a postdoctoral fellow with the University of British Columbia. Dr. Wang's research interests include multimedia systems and applications, cloud and edge computing, deep learning, and distributed networking and systems. He leads the Intelligent Networking and Multimedia Lab (INML) at CUHKSZ. He has published more than 50 papers at top journals and conferences, including INFOCOM, Multimedia, VR, ToN, TMC, IOTJ, etc. He was selected in the 8th Young Elite Scientist Sponsorship Program, is a CUHKSZ Presidential Young Scholar, and is a recipient of the SFU Dean's Convocation Medal for Academic Excellence. He serves as an associate editor of IEEE Transactions on Mobile Computing, TPC chair of IEEE Satellite 2023, TPC member of IWQoS, ICC, and BigCom, and reviewer for many top conferences and journals, including INFOCOM, ToN, TMC, JSAC, etc.