Collaborative Learning Via Prediction Consensus
Abstract
We consider a collaborative learning setting where the goal of each agent is to
improve their own model by leveraging the expertise of collaborators, in addition
to their own training data. To facilitate the exchange of expertise among agents,
we propose a distillation-based method leveraging shared unlabeled auxiliary data,
which is pseudo-labeled by the collective. Central to our method is a trust weighting
scheme that serves to adaptively weigh the influence of each collaborator on the
pseudo-labels until a consensus on how to label the auxiliary data is reached. We
demonstrate empirically that our collaboration scheme is able to significantly boost
the performance of individual models in the target domain from which the auxiliary
data is sampled. By design, our method adeptly accommodates heterogeneity in
model architectures and substantially reduces communication overhead compared
to typical collaborative learning methods. At the same time, it can provably mitigate
the negative impact of bad models on the collective.
1 Introduction
This work considers a decentralized learning setting where each agent locally has access to a labeled
dataset and predictive model. The agents may differ in the data distribution they have access to as
well as the quality of their local models. In addition, we assume a shared unlabeled dataset X ∗
sampled from a target distribution Q is available to all agents. The central question studied in this
work is: how can agents effectively exchange information to benefit from each other’s local expertise
in order to improve their predictive performance on the target domain Q?
Towards this goal, our work takes inspiration from social science on how a panel of human experts
collaborate on a task. Humans typically engage in discourse to exchange information, they share
their opinions, and based on how much they trust their peers, each individual will then adjust their
subjective belief towards the opinion of peers. When repeated, this process gives rise to a dynamic
process of consensus finding, as formalized by DeGroot [1]. Central to the consensus mechanism of
DeGroot is the concept of trust. It determines how much individual agents influence each other’s
opinion, and thus the influence of each agent on the final consensus.
Our proposed algorithm mimics this consensus-finding mechanism in the context of collaborative
learning, inspired by recent work [2]. In particular, our consensus procedure is aimed at deciding how to label
the shared dataset X∗. To this end, we carefully design a strategy by which each agent determines
its trust towards others, given its local information, to optimally leverage each agent’s expertise
to collectively pseudo-label X ∗ . This mechanism of knowledge distillation is then combined with
techniques from self-training [3] in order to transfer the shared knowledge from the pseudo-labels
into the local models in an iterative fashion.
Code available at https://github.com/fan1dy/collaboration-consensus
2 Related work
In the classical federated learning setting, a central server coordinates local updates toward learning a
global model. Local nodes upload gradients or model parameters, instead of data itself, to maintain a
certain level of privacy. McMahan et al. [5] describe the classic FedAvg algorithm. Follow-up works
predominantly focus on addressing challenges from non-i.i.d. local data [6, 7, 8] and robustness
towards Byzantine attacks [9, 10]. Apart from communicating gradients or model parameters, several
works discuss alternatives to allow for heterogeneous model architectures. These methods are based
on variants of model distillation [11, 12], reaching an agreement in the representation space [13]
or in the output space [14]. Similar to our work, both assume access to a shared unlabeled dataset, but
they determine agreement on the unlabeled data using naive averaging, while we aim to account for
heterogeneity in model and data quality through trust weighting.
In contrast to federated learning, the fully decentralized learning setting does not assume the existence
of a central server. Instead, decentralized schemes such as gossip averaging are used to aggregate
local information across agents [15, 16]. Despite the lack of a global state, such methods can provably
converge to the desired global solution, leading to a gradual consensus among individual models [4].
In this context, Dandi et al. [17] and Le Bars et al. [18] optimize the communication topology to adapt to
data heterogeneity but do not offer any collaborator selection mechanism. Bellet et al. [15] allow
personalized models on each agent, but assume prior information about task-relatedness, as opposed
to learned selection. Gossip algorithms typically assume a fixed gossip mixing matrix given by,
e.g., the physical connections of nodes [19, 20, 21]. These approaches do not consider data-dependent
communication informed by task similarities or node qualities. Several recent works have addressed this
issue by proposing alternative methods that take these factors into account. Notably, Li et al. [22] directly
optimizes the mixing weights by minimizing the local validation loss per node, which requires
labeled validation sets. Sui et al. [23] uses the E-step of the EM algorithm to estimate the importance of
other agents to one specific agent i, by evaluating the accuracy of the other agents’ models on the local
data of agent i. This way of computing trust prevents the algorithm from being applied to target
distributions that differ from the local distribution, differentiating it from our work. Moreover, for
both Li et al. [22] and Sui et al. [23] the aggregation is performed in the gradient space, and therefore
heterogeneous models are not supported.
Our work relates to semi-supervised learning as it involves partially unlabeled data. Most relevant are
self-training methods [24] that first train a model using labeled data, then use the trained model to give
pseudo-labels to unlabeled data. The pseudo-labels can further be fed back to the training loop to attain
a better model. Wei et al. [25] shows that under expansion and separation assumptions, self-training
with input consistency regularization can achieve high accuracy with respect to ground-truth labels.
When more than one learner is involved, co-training [26] extends self-training by leveraging the
knowledge of learners with independent views to label a set of unlabeled data. Diao et al. [27]
incorporates semi-supervised learning into federated learning. In a setting where clients hold unlabeled
data and the central server holds labeled data, their experiments demonstrate that the server’s
performance is significantly improved by the unlabeled clients.
Farina [28] presents a collective learning framework for distributed semi-supervised learning, which
combines predictions on a shared dataset via weights derived from the local models’ performances
on local validation datasets. While their algorithm bears similarities to ours, it is important to note
that it is exclusively tailored to scenarios in which the target domain matches the global distribution.
In a similar spirit, we want to leverage unlabeled data in a fully decentralized setting.
Finally, Mendler-Dünner et al. [2] have previously formalized collective prediction as a dynamic
consensus finding procedure. They demonstrated that such an approach can lead to significant gains
over naive model averaging. We leverage these insights and extend their approach from test-time
prediction to collaborative model training.
3 Method description
Our proposed method is designed to take advantage of shared unlabeled data in the context of
collaborative learning through knowledge distillation. To this end, it emulates human opinion dynamics
to collectively pseudo-label the shared auxiliary data. These labels are then incorporated in the local
model update steps towards collectively improving the performance on the data distribution from
which the shared data is sampled.
To describe the pseudo labeling step, let us use fθi to denote the local model of agent i ∈ [N ]
parameterized by θi . We write ŷi = fθi (X ∗ ) to denote the predictions of agent i on the auxiliary
data X ∗ . Agents share these predictions with their peers. Naturally, the individual models may differ
in these predictions and it is a priori unclear which model is most accurate, as ground truth labels
of the auxiliary data are not available. To combine the predictions into pseudo-labels for X ∗ , each
agent locally decides how to weigh other agents’ predictions by estimating their respective expertise
on the target task. We refer to these weights as trust scores, and we use wij to denote the trust of
agent i towards the predictions of agent j. It is worth noting that the trust between agents is not
necessarily mutual, i.e., it can be asymmetric: agent i can trust agent j without agent j necessarily
trusting agent i back. We use W to denote the matrix of trust scores. Given the trust scores, agent i
uses the following pseudo-labels for the auxiliary data:

ψ_i = Σ_j w_ij ŷ_j .    (1)
Trust weights are determined locally by each agent based on query access to other agents’ predictions
and they are refined iteratively throughout training as models are being updated. The adaptive weight
computation will be detailed in Section 4.
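To make (1) concrete: stacking the agents’ soft predictions, the aggregation is a single weighted average per agent. Below is a minimal NumPy sketch; the function name and array layout are our own illustration and not taken from the released code.

```python
import numpy as np

def aggregate_pseudo_labels(W, preds):
    """Combine agents' soft predictions into pseudo-labels as in eq. (1).

    W:     (N, N) row-stochastic trust matrix; W[i, j] is agent i's
           trust towards agent j.
    preds: (N, n_star, C) soft predictions of the N agents on the
           n_star shared samples with C classes.
    Returns an (N, n_star, C) array; row i holds agent i's pseudo-labels.
    """
    # psi_i = sum_j w_ij * yhat_j, vectorized over all agents at once
    return np.einsum("ij,jnc->inc", W, preds)
```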
In the second step, the proxy labels for the auxiliary data are used to augment local model training.
Specifically, in each step, the local optimization problem is augmented by a disagreement loss, and the
new objective is given by

L(f_θi(X_i), y_i) + λ dist(f_θi(X∗), ψ_i) .    (2)
Algorithm 1 Pseudo code of our proposed algorithm

Input: For each agent i ∈ [N ] we are given a local model θ_i^{(0)}, a labeled local dataset (X_i, y_i), and unlabeled shared data X∗.
for t = 1, ..., T do
    Each node i ∈ [N ] broadcasts its soft labels ŷ_i^{(t−1)} = f_{θ_i^{(t−1)}}(X∗) to all other nodes
    in parallel for each agent i do
        • Calculate pairwise trust scores w_ij^{(t)} (j ∈ [N ]), based on the received soft decisions, using the methods provided in Section 4
        • Get pseudo-labels on X∗ from collaborators: ψ_i^{(t)} = Σ_j w_ij^{(t)} ŷ_j^{(t−1)}
        • Do local training with the collaborative disagreement loss:
          θ_i^{(t)} ∈ argmin_θ L(f_θ(X_i), y_i) + λ dist(f_θ(X∗), ψ_i^{(t)})    (3)
end for
where f_θi(X) denotes the vector of agent i’s predictions on the dataset X, L is the local training
loss, and dist(·) is a disagreement measure. We choose the ℓ2 distance as the disagreement measure in
the regression case and cross-entropy in the classification case. λ > 0 is a trade-off hyperparameter
that weighs the local loss against the cost of disagreement. This objective adheres to a conventional
semi-supervised learning approach; however, we generate the pseudo-labels in a trust-based collective
manner.
To iteratively refine the local models in the spirit of self-training, the pseudo-labeling step and the
local training step are performed in an alternating fashion as described in Algorithm 1. Starting from
pre-trained models θ_i^{(0)}, in each round t ∈ {1, .., T} model predictions on the auxiliary data are
shared, and then each agent aggregates them into a set of pseudo-labels to augment its local data and
perform an update step.
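For concreteness, one local update on the objective in (2)–(3) could look as follows in PyTorch; this is a sketch under our own naming (model, opt, lam), with the soft cross-entropy written out explicitly for the classification case, not the authors’ released implementation.

```python
import torch.nn.functional as F

def local_update_step(model, opt, X_local, y_local, X_star, psi, lam=0.5):
    """One gradient step on eq. (3): supervised loss + disagreement loss."""
    opt.zero_grad()
    # standard supervised loss on the labeled local data
    sup_loss = F.cross_entropy(model(X_local), y_local)
    # cross-entropy between local predictions on X* and the collective
    # soft pseudo-labels psi (the dist(.) term, classification case)
    log_probs = F.log_softmax(model(X_star), dim=1)
    dis_loss = -(psi * log_probs).sum(dim=1).mean()
    loss = sup_loss + lam * dis_loss
    loss.backward()
    opt.step()
    return loss.item()
```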
Our algorithm is motivated conceptually by co-training [26], where it was demonstrated that unlabeled
data can be used to augment labeled data to boost model performance. Since agents communicate
by broadcasting predictions, the communication cost of transmitting predictions is significantly
lower than that of sharing model weights. Moreover, the algorithm can be extended to reuse the same
pseudo-labels ψ_i for multiple local epochs to further reduce the communication burden.
We study under what conditions Algorithm 1 will reach a consensus among agents on how to label
the auxiliary data. For the analysis, we focus on the over-parameterized regime¹ and we make the
following assumption on the local data distributions:

Assumption 1. There is no concept shift between the local data distributions and the target domain
Q from which the shared data is sampled, i.e., P_i(Y | X = x) = Q(Y | X = x) for all i ∈ [N ] and
x ∈ supp(Q).

Together with over-parameterization, this assumption implies that the minimizer of the objective
specified in (2) can always reach zero loss. Further, this allows us to model the update of the agents’
predictions on X∗ as a Markov process whose state transition matrix corresponds to the trust
matrix W^{(t)}. It is therefore convenient to write the update of the predictions on X∗ performed by
the algorithm in matrix form. Stacking the predictions as Ψ^{(t)} = [ŷ_1^{(t)}, .., ŷ_N^{(t)}], we have for t ≥ 1

Ψ^{(t)} = W^{(t)} Ψ^{(t−1)} .    (4)

It can be shown that under weak conditions on W^{(t)}, a consensus will be reached by our algorithm.

¹ We say a model is over-parameterized if its training error can reach zero. Over-parameterization is a reasonable assumption in the deep learning regime.
Theorem 1 (Consensus on predictions). Assume all agents’ models are over-parameterized and the
data distributions satisfy Assumption 1. Then, for t → ∞, Algorithm 1 converges to a consensus
among the local models on the predictions on X∗, that is,

ψ_i^{(t)} = ψ_j^{(t)}   ∀ i ≠ j,    (5)

as long as W^{(t)} is row-stochastic and positive for any t ≥ 0.
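The mechanism behind Theorem 1 is easy to observe numerically: under the over-parameterization idealization, each round applies the Markov update (4), and products of positive row-stochastic matrices drive the rows of Ψ together. A self-contained sketch, with random matrices standing in for the learned trust scores:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 3                     # 5 agents, 3-dimensional predictions
Psi = rng.random((N, C))        # initial predictions on one shared point

for t in range(200):
    W = rng.random((N, N)) + 0.1        # positive entries
    W /= W.sum(axis=1, keepdims=True)   # row-stochastic
    Psi = W @ Psi                       # eq. (4): Psi(t) = W(t) Psi(t-1)

# the rows of Psi are now (numerically) identical: a consensus
print(np.ptp(Psi, axis=0))  # per-coordinate spread, on the order of 1e-16
```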
A key feature of our method is that agents do not share model parameters; they communicate
by exchanging prediction queries. Consequently, even if Algorithm 1 achieves a consensus, the agents
have not necessarily learned the same model, nor do they need to agree on predictions outside X∗. We
illustrate this difference between information sharing in prediction space and in parameter space with a
simple example, for which we construct an over-parameterized problem. We generate local data
using cubic regression models with additive i.i.d. noise in the output, as shown in Fig. 1. We apply
the optimal trust weighting scheme, which can be computed in closed form in this example. Then, each
agent fits a polynomial regression of degree 4, so that the model is over-parameterized with respect to
the data. Full details of our example are given in Appendix A. We refer to the work of [2] for a
similar setting with under-parameterized models. Here we note the most interesting observations in
the over-parameterized regime.
First, we observe that for T ≥ 20 the three agents reach a consensus on the predictions on X∗.
However, the model parameters are not the same across the agents, as depicted in the rightmost panel.
Further, considering the properties of the algorithm across rounds and the predictions in different
regions of the input space in more detail, the following desirable behaviors are observed:
a) In data-rich regions, agent i fits the local data more accurately and moves pseudo-labels
closer to its own predictions.
b) In data-scarce regions, agent i only updates its model parameters to fit the pseudo-labels.
c) When local loss minimization and prediction consensus can be achieved at the same time,
agents can arrive at models with a perfect agreement in the target prediction space.
4 Trust weight computation
4.1 Trust evaluation through self-confidence
The quality of the local models can differ due to various factors, such as the amount of labeled data
available during training, the expressivity of the local model, the training algorithm, or the
relevance of the local data for the target task of labeling Q. Thus, a desirable property of the
consensus solution is that malicious agents, or agents with low-quality models, contribute less to the
pseudo-labeling than agents with better models.
Hadjicostis and Dominguez-Garcia [29] differentiate between malicious and non-malicious agents
and they discuss the concept of trustworthy consensus, where only non-malicious agents contribute to
the consensus. In contrast to this prior work, we do not aim for trustworthy agents to contribute equally.
Instead, we specifically want the consensus to arise from potentially unequal contributions of all agents,
weighted according to their relevance. We allow the trust matrix to be asymmetric. Each agent
determines trust from the information available to it locally, which differs across agents.
Central to any such strategy is that the capabilities of models on Q can be estimated appropriately. In
the following, we discuss a strategy of how to determine trust from local data and prediction queries
to other models.
As no label information on X ∗ is available to evaluate trust, it is natural to use agents’ own predictions
on X ∗ as a local reference point. Then, each agent distributes their trust towards other agents based
on the alignment of their predictions. We use weighted pairwise cosine similarity as a measure of
alignment, which motivates the following trust weight calculation:

w_ij^{(t)} = γ_ij^{(t)} / Σ_j γ_ij^{(t)}   with   γ_ij^{(t)} = (1/n⋆) Σ_{x∈X∗} β_i^{(t)}(x) · ⟨f_{θ_i^{(t−1)}}(x), f_{θ_j^{(t−1)}}(x)⟩ / (‖f_{θ_i^{(t−1)}}(x)‖₂ ‖f_{θ_j^{(t−1)}}(x)‖₂) .    (6)

The inclusion of the weighting factor β_i^{(t)}(x) and how to choose it will be discussed in Section 4.2.
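A sketch of the trust computation in (6); the per-sample weights β are passed in so that the confidence weighting of Section 4.2 can be plugged in, and the array layout and names are our own:

```python
import numpy as np

def trust_scores(preds, beta):
    """Trust matrix from weighted pairwise cosine similarity, eq. (6).

    preds: (N, n_star, C) soft predictions of the N agents on X*;
           softmax outputs are nonnegative, so the similarities are too.
    beta:  (N, n_star) per-agent, per-sample confidence weights.
    Returns a row-normalized (N, N) trust matrix W.
    """
    # normalize each prediction vector to unit length
    unit = preds / np.linalg.norm(preds, axis=2, keepdims=True)
    # cos[i, j, x] = <f_i(x), f_j(x)> / (||f_i(x)|| * ||f_j(x)||)
    cos = np.einsum("inc,jnc->ijn", unit, unit)
    # gamma_ij = (1/n*) sum_x beta_i(x) * cos_ij(x)
    gamma = (cos * beta[:, None, :]).mean(axis=2)
    return gamma / gamma.sum(axis=1, keepdims=True)
```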
Self-confident trust. Naturally, pairwise cosine similarity leads to a trust matrix in which the diagonal
entry is the highest value of each row. We call this property self-confidence, as each agent
trusts itself the most. We now demonstrate that this property is not particularly restrictive: even when
the trust matrix is constrained to be self-confident, it is still possible to design a matrix that induces
any desired consensus. The proof is given in Appendix D.
Proposition 2. For any given consensus distribution π, it is always possible to find a trust matrix W
that leads to it and is both row-stochastic and self-confident.
A second appealing property of our trust calculation is that, for an appropriate choice of β_i^{(t)}, the
trust scores in (6) become more evenly distributed over time as agents gradually reach consensus.

Claim 3. Suppose Assumption 1 holds and all agents are over-parameterized. Assume β_i^{(t)} is
chosen such that the trust matrix W^{(t)} is row-stochastic and positive for all t ≥ 0. Then, for the trust
calculation in (6), W^{(t)} loses self-confidence over time and finally converges to a uniform
matrix:

tr(W^{(t)}) ≤ tr(W^{(t−1)})   and   W^{(t)} → (1/N) 11^⊤  as t → ∞.    (7)
The proof is provided in Appendix C. This claim characterizes the behavior of our dynamic trust
scheme: initially, agents distribute trust towards helpful collaborators and work towards a
consensus on X∗. Once consensus is reached, we have ŷ_i^{(t)} = ŷ_j^{(t)} for any i, j, and W^{(t)}
becomes a matrix with uniform weights. Thus, no individual agent retains a disproportionately high weight after
consensus is reached, which in turn implies that no agent has the ability to excessively manipulate
the collective labeling on its own. In the next section we discuss further robustness properties of our
algorithm.
If agents possess low-quality local data, we aim to minimize their influence on the labeling of the
auxiliary data not only at consensus, but also throughout the algorithm. Proposition 4 gives sufficient
conditions for such a desirable property to hold at any step before consensus is reached:
assume there exists only one node with low-quality data; then, as long as it receives the lowest trust
from every other node, it retains the lowest importance in the consensus.

Proposition 4. Suppose Assumption 1 holds, the trust matrices are row-stochastic and positive, and all agents
hold over-parameterized models. Let b be the only node with low-quality data and τ the timestep
at which consensus is reached. Suppose the following properties hold for t < τ :

i) b receives the lowest trust from all nodes other than itself, i.e., w_jb^{(t)} = min_i w_ji^{(t)} for j ≠ b;

ii) the b-th column has the lowest column sum: Σ_j w_jb^{(t)} < min_{i≠b} Σ_j w_ji^{(t)}.

Then b is assigned the lowest importance in the consensus, i.e., π_b = min_i π_i.
The proof is given in Appendix E, where we also provide desired properties in the presence of
multiple nodes with low-quality data, under some extra assumptions. Note that when nodes with
weak model architectures (such as under-parameterized models) are involved, achieving consensus
is not assured. If such a consensus solution does exist, it will be constrained by the underfitting of
weak nodes. Consequently, this solution would not serve as a stationary solution concerning the local
training loss of a strong node. Nevertheless, we conjecture, and find empirically, that these desired
properties can still enhance training by mitigating the impact of the weak nodes.
4.2 Confidence weighting

In the following, we discuss the choice of the weights β_i^{(t)} in (6). Specifically, we incorporate
confidence weighting into the pairwise cosine similarity calculation to emulate the construction of a
transition matrix based on a known consensus distribution.
Let us start by outlining an idealized trust calculation that effectively down-weighs agents with low-quality
data. We first construct an intermediate transition matrix Φ from the pairwise cosine similarities
of the agents’ predictions on X∗ (with row normalization). For the low-quality node b, ϕ_jb will be
the lowest value in the j-th row, for any j ≠ b. According to Proposition 4, in order for low-quality
workers to have low importance in the consensus, the overall trust that b receives needs to be the lowest
among all nodes. To achieve this, we need to assign the trust of regular workers towards the low-quality
workers a very small value, as it is difficult to alter self-confidence. If the consensus importance weights
were known, one could easily calculate the corresponding trust matrix as

w_jb = ϕ_jb min{1, (π(b) ϕ_bj) / (π(j) ϕ_jb)} ,    (8)

following a classical result on Metropolis chains [30] (also see Appendix D). This gives
w_jb < ϕ_jb for j ≠ b, as π(b)/π(j) is sufficiently small.
However, the consensus importance weights in (8) cannot usually be computed, and hence we cannot attain
the ideal trust matrix. Therefore, we propose an alternative weighting scheme that achieves a similar
effect: we up-weight the similarity in the regions where agent i has more confidence, i.e., where
agent i’s class probability assignments have lower entropy. By doing this, we encourage the
trust weights to concentrate on the agents themselves and on helpful workers, and less on
low-quality workers. We incorporate this into the trust weight calculation (6) by choosing

β_i^{(t)}(x) = 1 / H(f_{θ_i^{(t−1)}}(x)) ,

where H denotes the entropy. We offer further intuition as well as justification for this weighting
scheme in Appendix F. Moreover, we empirically demonstrate how our choice of trust matrix leads
to a low column sum for bad nodes in Section 5.2.
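This entropy-based weight plugs directly into the trust computation sketched in Section 4.1; the small eps guarding against zero entropy on one-hot predictions is our own numerical safeguard:

```python
import numpy as np

def entropy_beta(preds, eps=1e-8):
    """beta_i(x) = 1 / H(f_i(x)) for soft class-probability predictions.

    preds: (N, n_star, C) class probabilities.
    Returns (N, n_star) weights; confident (low-entropy) samples get
    a large weight, so they dominate the similarity average.
    """
    H = -(preds * np.log(preds + eps)).sum(axis=2)  # Shannon entropy
    return 1.0 / (H + eps)
```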
5 Experiments
We start with a synthetic example to visualize the decision boundary achieved by our algorithm and
then demonstrate its performance on real data in a heterogeneous collaborative learning setting.
Figure 2: Decision boundary comparison between our dynamic trust update and the naive trust update: (a) local data distributions, (b) dynamic trust, (c) naive trust.
Four classes are generated via multivariate Gaussians, P^c = N (µ_c, Σ), where µ_0 =
(−2, 2)^⊤, µ_1 = (2, 2)^⊤, µ_2 = (−2, −2)^⊤, µ_3 = (2, −2)^⊤ and Σ = I_{2×2}. We have four agents, and
each agent holds local data sampled from an even mixture of the P^c’s. To simulate heterogeneity in data
quality, we flip a fraction of the labels for each agent: for agents 0–2, we randomly
flip 10% of the labels, and for the last agent (agent 3), we flip all labels. The unlabeled data X∗ are sampled
equally from the P^c’s. The data distribution is shown in Fig. 2a. The base model used in each node is a
multi-layer perceptron with 3 layers of 5, 10, and 4 neurons, respectively. We compare the decision
boundary found by Algorithm 1 with dynamic trust weights to that found with naive trust weights. The results are
illustrated in Fig. 2b–c. When a client with low-quality data is involved, i.e., client 3 in this toy
example, our trust update scheme yields a better decision boundary for the good agents after collaboration,
as blind trust towards low-quality clients impairs the effectiveness of the pseudo-labeling.
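For reference, the toy setup of Fig. 2 can be reproduced along these lines; this is a sketch under the stated distributional assumptions, and the sample sizes and random-flip rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-2, 2], [2, 2], [-2, -2], [2, -2]], dtype=float)

def sample_client(n_per_class, flip_frac):
    """Even mixture of the four Gaussians with a fraction of labels flipped."""
    X = np.vstack([rng.normal(mu, 1.0, size=(n_per_class, 2)) for mu in means])
    y = np.repeat(np.arange(4), n_per_class)
    flip = rng.random(len(y)) < flip_frac
    # one possible flipping rule: move each flipped label to a random wrong class
    y[flip] = (y[flip] + rng.integers(1, 4, size=flip.sum())) % 4
    return X, y

clients = [sample_client(100, 0.1) for _ in range(3)]  # agents 0-2: 10% flipped
clients.append(sample_client(100, 1.0))                # agent 3: all labels flipped
X_star = np.vstack([rng.normal(mu, 1.0, size=(50, 2)) for mu in means])  # unlabeled
```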
Next, we consider a more challenging setting, where the local data distributions are non-i.i.d. Two
different types of statistical heterogeneity are considered: (1) Synthetic heterogeneity. We utilize the classic
Cifar10 and Cifar100 datasets [31] and create 10 clients from each dataset. To distribute classes
among clients, we use a Dirichlet distribution² with α = 1. Unless specified otherwise, we employ
ResNet20 [32] without pretraining. (2) Real-world data heterogeneity. We include a real-world dermoscopic
lesion image dataset from the ISIC 2019 challenge [33, 34, 35]. The same client
splits are used as in [36], based on the imaging acquisition systems employed in six different hospitals.
The dataset includes eight classes of lesions to classify, with the class distribution among the clients
displayed in Fig. 3b. Following [36], we choose a pretrained EfficientNet [37] as the base model and
use balanced accuracy as the evaluation metric. For every dataset, we construct X∗ from samples
contributed equally by every agent.
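The Dirichlet-based class split can be sketched as follows; this is the standard construction, equivalent in spirit to the repository referenced in the footnote:

```python
import numpy as np

def dirichlet_partition(labels, n_clients=10, alpha=1.0, seed=0):
    """Split sample indices across clients with Dirichlet class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))   # class-c share per client
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx
```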
Baseline methods. We compare our method to several baseline methods, including FedAvg [5],
FedProx [7], SCAFFOLD [8] (SCA), FedDyn [38], local training without collaboration (LT), and
training with naive trust (Naive). Note that with naive trust we realize soft majority voting, which
corresponds to the baseline method proposed in [14]. We adhere to the same-architecture setting, where
the standard federated learning algorithms can be applied. To initiate the process, we let each client
perform local training for 5 global rounds, with the objective of obtaining a sufficiently refined
model that can be used for trust evaluation. From the 6th training round on, the clients collaborate.

² The Dirichlet-distributed splits are constructed using the code from https://github.com/TsingZ0/PFL-Non-IID
Table 1: Our methods compared to baseline methods. Blue denotes the algorithm with top-1 accuracy
and green denotes the method with the 2nd-best accuracy. “Ours-S” denotes the static version, where the
trust score is kept constant after its first calculation (after 5 rounds of local training), and “Ours-D”
denotes the dynamic version, where the trust score is updated every global round.

                         FedAvg  FedProx  SCA    FedDyn  LT     Naive  Ours-S  Ours-D
Regular      Cifar10     0.542   0.517    0.578  0.578   0.475  0.618  0.604   0.612
             Cifar100    0.261   0.240    0.317  0.310   0.178  0.311  0.319   0.308
             Fed-ISIC    0.279   0.261    0.213  0.243   0.248  0.290  0.302   0.291
Low-Quality  Cifar10     0.541   0.530    0.570  0.575   0.470  0.596  0.605   0.608
Data         Cifar100    0.254   0.240    0.289  0.308   0.171  0.285  0.300   0.306
             Fed-ISIC    0.229   0.242    0.221  0.243   0.217  0.247  0.249   0.269
Figure 5: Target accuracy comparison with 2 different model architectures, with error bars (hatch
pattern denotes that a fully connected NN is used). From left to right: Cifar10, Cifar100, Fed-ISIC.
Over a total of 50 global rounds, each consisting of 5 local epochs, we report the averaged accuracy
results from three repeated experiments in Table 1. The evaluation metric is calculated on the dataset
X ∗ . λ is fixed as 0.5 in all experiments.
Heterogeneity in data quality. When all nodes share the same data quality and degree of statistical
heterogeneity (denoted “regular” in the table), our methods align closely with consensus through
naive averaging, which is optimal in this case. When all nodes share the same degree of statistical
heterogeneity but differ in data quality, exemplified by randomly selecting two nodes (indexed
2 and 9) for a complete flip of their local training labels, our dynamic trust update shows better overall
performance³, demonstrating the effectiveness of our approach in limiting the detrimental influence of
nodes with low-quality data. We further plot the trust matrix learned in the dynamic update
mode during one of the middle training rounds in the left plot of Fig. 4. Clearly, our algorithm is able
to assign low trust weights to the nodes with low-quality data, and the 2nd and 9th columns have the
lowest column sums.
Heterogeneity in model architecture. We allocate a more expressive model architecture to the
first half of the nodes and a less expressive one to the other half. The former comprises ResNet20
and EfficientNet, the models of choice in the previous experiments. For the latter, we
employ a linear model (i.e., a one-layer fully connected neural network) that takes a flattened image tensor
as input and produces an output of size equal to the number of classes. It is worth noting that when
agents with strong and weak model architectures (as in cases of under-parameterization) co-exist,
consensus might not occur, as suggested by our empirical findings illustrated in Fig. 5. Nevertheless,
our trust-based collaborator selection mechanism consistently outperforms local training and simple
averaging. The trust weight matrix learned during Cifar100 training is depicted in the right plot of
Fig. 4, revealing the presence of asymmetric trust. Specifically, the last 5 nodes exhibit a higher level
of trust towards the first 5 nodes, while the opposite is not true. This trust allocation is desirable for
identifying the helpers. We refer to Appendix A.1 for further empirical evidence from a toy
polynomial regression example in the presence of strong and weak architectures.
Reduced communication costs. Gradient-aggregation-based methods incur a significant communication
burden proportional to the number of model parameters (O(N × |params|)), which is particularly
heavy given the over-parameterized nature of modern deep learning. In contrast to existing approaches,
our proposed method significantly reduces the communication burden by letting each
node transmit only its predictions on the shared dataset. This results in a communication overhead of
O(N² × n⋆ × C). This value does not grow with model complexity and is much smaller than the model size.
Moreover, our method maintains its high performance even when the number of local epochs increases.
FedAvg, on the other hand, loses its effectiveness with less frequent synchronization, i.e., more local
epochs between global aggregation rounds, as shown in the left panel of Fig. 6.

³ Here we report the average accuracy of the regular workers, excluding workers with low-quality data.

Figure 6: Algorithm performance on Cifar100 for different algorithm configurations. (left) Effect of
varying the number of local epochs on final performance; (right) algorithm performance as a function of
the number of training rounds with 5 local epochs each.
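As a back-of-the-envelope comparison (illustrative sizes, assuming float32 values; the parameter count is approximate):

```python
# Rough per-round communication, 4 bytes per float32 value.
N, n_star, C = 10, 2500, 10   # clients, shared samples, classes (illustrative)
params = 270_000              # roughly ResNet20-sized model

fedavg_bytes = N * params * 4              # each client uploads its parameters
ours_bytes = N * (N - 1) * n_star * C * 4  # each client broadcasts predictions

print(fedavg_bytes / 1e6, "MB vs", ours_bytes / 1e6, "MB")
# ~10.8 MB vs ~9.0 MB per round here, but the prediction cost is independent
# of model size: with EfficientNet-scale models the gap widens substantially,
# and reusing pseudo-labels over local epochs shrinks it further.
```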
6 Discussion

In the context of decentralized learning, we leverage the collective knowledge of individual nodes to
improve the accuracy of predictions with respect to a target distribution. Our proposed trust update
scheme, based on self-confidence, ensures robustness against nodes with low-quality data. By achieving
consensus in the prediction space, our method effectively handles diverse model architectures
across local clients while maintaining a low communication overhead, thereby exhibiting important
practical potential. Our trust-based collaborative pseudo-labeling method demonstrates a fruitful
interplay between tools from semi-supervised learning and collective learning. Notably, our algorithm
is intrinsically compatible with personalization, in the sense of allowing some concept shift across clients.
We leave this for future work.
Robustness. We designed our algorithm under the assumption that all agents communicate honestly,
meaning that no Byzantine workers intentionally provide incorrect information. Nevertheless,
our method exhibits some resilience against a common Byzantine attack, the label flip
(referred to as “low-quality workers” in this paper). For instance, even with 2 out of 10 workers having
100% flipped labels, our algorithm maintains good performance. If there are malicious workers
deliberately providing incorrect information, the nodes may refuse to reach a consensus rather than
reach a detrimental one, assuming a reasonable λ is chosen. Consider the scenario in
which a detrimental consensus is reached with malicious nodes involved; in this case, the consensus
loss and the local loss of the regular nodes will not decrease in the same direction, making the consensus
solution non-stationary. The “personal” component of our loss function thus adds an element of
robustness against malicious nodes.
Privacy Concerns. While previous works show that training data can be reconstructed from model
parameters [39] or gradients [40], our algorithm requires sharing less privacy-sensitive information,
namely predictions on a shared dataset. While we are aware that model predictions can still leak
private information about the training data due to memorization [41], there is a trade-off between the gain
from collaboration and the amount of information that users are willing to share. As the number of
outer rounds increases, we observe a notable improvement in accuracy on X∗.
However, this enhanced accuracy comes at the cost of disclosing more information, a relationship
that is depicted in the right panel of Fig. 6. An interesting extension would be to apply differential
privacy [42] to further guarantee privacy, as well as to design methods for privacy accounting, e.g., [43],
so that each agent can maintain control over its local privacy leakage at any time.
Acknowledgements. DF would like to thank Anastasia Koloskova, Felix Kuchelmeister, Matteo
Pagliardini, and Nikita Doikov for helpful discussions during the project and El Mahdi Chayti for
proofreading. DF acknowledges funding from EDIC fellowship from the Department of Computer
Science at EPFL. CM acknowledges support from the Tübingen AI Center. This project was supported
by SNSF grant 200020_200342.
References
[1] Morris H. DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69
(345):118–121, 1974. ISSN 01621459.
[2] Celestine Mendler-Dünner, Wenshuo Guo, Stephen Bates, and Michael Jordan. Test-time
collective prediction. In Advances in Neural Information Processing Systems (NeurIPS),
volume 34, pages 13719–13731, 2021.
[3] S. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information
Theory, 13(1):57–64, 1967.
[4] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar-
jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner,
Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Har-
chaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara
Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar,
Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock,
Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar,
Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian
Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu,
Han Yu, and Sen Zhao. Advances and open problems in federated learning. arXiv preprint
arXiv:1912.04977, 2021.
[5] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated
learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.
[6] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated
learning with non-IID data. arXiv preprint arXiv:1806.0058, 2018.
[7] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia
Smith. Federated optimization in heterogeneous networks, 2020.
[8] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning.
In International Conference on Machine Learning, volume 119, pages 5132–5143. PMLR,
2020.
[9] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. Fltrust: Byzantine-robust
federated learning via trust bootstrapping. In Network and Distributed System Security (NDSS)
Symposium, 2021.
[10] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Byzantine-robust learning on heteroge-
neous datasets via bucketing. In International Conference on Learning Representations, 2022.
URL https://openreview.net/forum?id=jXKKDEi5vJt.
[11] Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for
robust model fusion in federated learning. In International Conference on Neural Information
Processing Systems, 2020.
[12] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for hetero-
geneous federated learning. In International Conference on Machine Learning, volume 139,
pages 12878–12889. PMLR, 2021.
[13] Disha Makhija, Xing Han, Nhat Ho, and Joydeep Ghosh. Architecture agnostic federated
learning for neural networks. In International Conference on Machine Learning, volume 162,
pages 14860–14870. PMLR, 2022.
[14] Amr Abourayya, Michael Kamp, Erman Ayday, Jens Kleesiek, Kanishka Rao, Geoffrey I. Webb,
and Bharat Rao. AIMHI: Protecting sensitive data through federated co-training. In Workshop
on Federated Learning: Recent Advances and New Challenges (at NeurIPS), 2022.
[15] Aurélien Bellet, Rachid Guerraoui, Mahsa Taziki, and Marc Tommasi. Personalized and private
peer-to-peer machine learning. In International Conference on Artificial Intelligence and
Statistics, volume 84, pages 473–481. PMLR, 2018.
[16] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A
unified theory of decentralized SGD with changing topology and local updates. In International
Conference on Machine Learning, volume 119, pages 5381–5393. PMLR, 2020.
[17] Yatin Dandi, Anastasia Koloskova, Martin Jaggi, and Sebastian U. Stich. Data-heterogeneity-
aware mixing for decentralized learning. In OPT 2022: NeurIPS Workshop on Optimization for
Machine Learning, 2022.
[18] Batiste Le Bars, Aurélien Bellet, Marc Tommasi, Erick Lavoie, and Anne-Marie Kermarrec.
Refined convergence and topology learning for decentralized sgd with heterogeneous data. In
International Conference on Artificial Intelligence and Statistics, volume 206, pages 1672–1702.
PMLR, 2023.
[19] Lin Xiao and S. Boyd. Fast linear iterations for distributed averaging. In IEEE International
Conference on Decision and Control, volume 5, pages 4997–5002, 2003.
[20] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE
Transactions on Information Theory, 52(6):2508–2530, 2006.
[21] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push
for distributed deep learning. In International Conference on Machine Learning, volume 97,
pages 344–353. PMLR, 2019.
[22] Shuangtong Li, Tianyi Zhou, Xinmei Tian, and Dacheng Tao. Learning to collaborate in
decentralized learning of personalized models. In Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9756–9765, 2022.
[23] Yi Sui, Junfeng Wen, Yenson Lau, Brendan Leigh Ross, and Jesse C. Cresswell. Find
your friends: Personalized federated learning with the right collaborators. arXiv preprint
arXiv:2210.06597, 2022.
[24] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method
for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
URL https://api.semanticscholar.org/CorpusID:18507866.
[25] Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training
with deep networks on unlabeled data. In International Conference on Learning Representations
(ICLR), 2021.
[26] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In
Peter L. Bartlett and Yishay Mansour, editors, Annual Conference on Computational Learning
Theory (COLT), pages 92–100. ACM, 1998.
[27] Enmao Diao, Jie Ding, and Vahid Tarokh. Semifl: Semi-supervised federated learning for
unlabeled clients with alternate training. In Advances in Neural Information Processing Systems,
volume 35, pages 17871–17884, 2022.
[28] Francesco Farina. Collective learning. arXiv preprint arXiv:1912.02580, 2021.
[29] Christoforos N. Hadjicostis and Alejandro D. Dominguez-Garcia. Trustworthy distributed
average consensus. In Conference on Decision and Control (CDC), pages 7403–7408. Institute
of Electrical and Electronics Engineers Inc., 2022.
[30] D.A. Levin, Y. Peres, and E.L. Wilmer. Markov Chains and Mixing Times. American Mathe-
matical Soc., 2008.
[31] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
770–778, 2016.
[33] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of
multi-source dermatoscopic images of common pigmented skin lesions, 2018.
[34] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti,
Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, and Allan
Halpern. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international
symposium on biomedical imaging (isbi), 2017.
[35] Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer
Reiter, Cristina Carrera, Alicia Barreiro, Allan C. Halpern, Susana Puig, and Josep Malvehy.
Bcn20000: Dermoscopic lesions in the wild, 2019.
[36] Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He,
Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, Boris Muzel-
lec, Constantin Philippenko, Santiago Silva, Maria Teleńczuk, Shadi Albarqouni, Salman
Avestimehr, Aurélien Bellet, Aymeric Dieuleveut, Martin Jaggi, Sai Praneeth Karimireddy,
Marco Lorenzi, Giovanni Neglia, Marc Tommasi, and Mathieu Andreux. Flamby: Datasets and
benchmarks for cross-silo federated learning in realistic healthcare settings, 2022.
[37] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning, volume 97, pages 6105–6114.
PMLR, 2019.
[38] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and
Venkatesh Saligrama. Federated learning based on dynamic regularization. In International
Conference on Learning Representations (ICLR), 2021.
[39] Niv Haim, Gal Vardi, Gilad Yehudai, Michal Irani, and Ohad Shamir. Reconstructing training
data from trained neural networks. In Advances in Neural Information Processing Systems,
2022.
[40] Zihan Wang, Jason Lee, and Qi Lei. Reconstructing training data from model gradient, provably.
In International Conference on Artificial Intelligence and Statistics, volume 206, pages 6595–
6612. PMLR, 2023.
[41] Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and
Florian Tramer. The privacy onion effect: Memorization is relative. Advances in Neural
Information Processing Systems, 35:13263–13276, 2022.
[42] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Third Conference on Theory of Cryptography (TCC),
page 265–284. Springer-Verlag, 2006.
[43] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on
Computer and Communications Security (CCS), page 308–318, 2016.
[44] J. Wolfowitz. Products of indecomposable, aperiodic, stochastic matrices. Proceedings of the
American Mathematical Society, 14(5):733–737, 1963.
Appendix

A Details of the polynomial regression example

The true underlying function is chosen as f(x) = 0.5x³ + 0.3x² − 5x + 4. There are three agents in
total, each of whom has 50 data points. The local data points are generated using normal distributions:
x1 ∼ N (−2, 1), x2 ∼ N (0, 1) and x3 ∼ N (2, 1). To introduce noise in the labels, each agent
adds a normally distributed error term with zero mean and unit variance, i.e. yi = f (xi ) + ε with
ε ∼ N (0, 1). A set of 50 equally spaced data points in the range of −4 to 4, denoted as X ∗ , is used
in the analysis.
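A sketch reproducing this setup with plain NumPy polynomial fitting; the degree-4 local fit matches the over-parameterized agents of the main-text example:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * x**3 + 0.3 * x**2 - 5 * x + 4   # true underlying function

# three agents with Gaussian-distributed inputs and unit-variance label noise
xs = [rng.normal(mu, 1.0, 50) for mu in (-2.0, 0.0, 2.0)]
ys = [f(x) + rng.normal(0.0, 1.0, 50) for x in xs]
X_star = np.linspace(-4, 4, 50)                     # shared unlabeled grid

# each agent fits an over-parameterized degree-4 polynomial locally
coefs = [np.polyfit(x, y, deg=4) for x, y in zip(xs, ys)]
preds = np.stack([np.polyval(c, X_star) for c in coefs])  # (3, 50) predictions
```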
For the example in Section 3.4, the algorithm is applied using fixed trust weights with 1/3 in each
entry, and λ is chosen as 1.
A.1 Example with a weak node

For this example we use the same setup as before, but with four agents in total, each of whom
has 50 data points. The local data points are generated using normal distributions: x1 ∼ N (−2, 1),
x2 ∼ N (0, 1), x3 ∼ N (2, 1) and x4 ∼ N (3, 1).
The algorithm is applied using dynamic trust weights and λ is chosen as 1. For the first three agents,
a polynomial model with a maximum degree of four is fit, while for the fourth agent, a polynomial
model with a maximum degree of one is fit, signifying a weak node.
We see that after 50 rounds of model training using our proposed algorithm with dynamic trust, agent
4’s model is still underfitting due to its limited expressiveness. Agents 1–3 end up agreeing with each
other and give good predictions on the union of their local regions. With naive trust weights, in contrast,
the strong agents are also influenced in the regions where they perform well, as the
underfitted model has a stronger impact through the collective pseudo-labeling.
B Proof of Theorem 1
The proof is rooted in the results of Wolfowitz [44]; we recommend that readers consult
the original paper for more details. Note that, in the following, when we say a matrix
W has certain properties, this is equivalent to saying that the Markov chain induced by the transition
matrix W has those properties.
Definition A (Irreducible Markov chains). A Markov chain induced by transition matrix W is
irreducible if for all i, j there exists some n such that (W^n)_{ij} > 0. Equivalently, the graph corresponding
to W is strongly connected.
Definition B (Strongly connected graph). A graph is said to be strongly connected if every vertex is
reachable from every other vertex.
Definition C (Aperiodic Markov chains). A Markov chain induced by transition matrix W is
aperiodic if every state has a self-loop. By self-loop, we mean that there is a nonzero probability of
remaining in that state, i.e. wii > 0 for every i.
Assumption 2. The W^{(t)}’s are row-stochastic and positive, i.e., Σ_j w_ij = 1 for every row i, and w_ij > 0 for all i, j.
Claim 5. Given Assumption 2, the product of any n ≥ 1 elements of {W^{(t)}} is SIA (SIA stands
for stochastic, irreducible and aperiodic).

Proof. By assumption, all W^{(t)}’s are positive, and thus any product of W^{(t)}’s is positive in each
entry, which is equivalent to the graph induced by the product being fully connected. Being fully
connected implies being strongly connected. By Definitions A and B, irreducibility follows.

Since the product is positive, its diagonal entries are all positive. By Definition C, aperiodicity follows.
The product of row-stochastic matrices remains row-stochastic: for A and B row-stochastic,

Σ_j (Σ_k a_ik b_kj) = Σ_k a_ik (Σ_j b_kj) = Σ_k a_ik = 1 ,   ∀ i.

Thus, any product of the W^{(t)}’s is stochastic, irreducible and aperiodic (SIA).
Theorem 6 (Restatement of Wolfowitz [44]). Let A_1, ..., A_k be square row-stochastic matrices of the
same order such that any product of the A’s (of whatever length) is SIA. As k → ∞, the product
A_k ⋯ A_1 reduces to a matrix with identical rows.

By Assumptions 1 and 2, Ψ^{(t)} = W^{(t)} Ψ^{(t−1)} holds for all t ≥ 1. From Claim 5, any product
of the W^{(t)}’s is SIA. From Theorem 6, the product W^{(t)} W^{(t−1)} ⋯ W^{(1)} reduces to a matrix
with identical rows as t goes to infinity. This implies that Ψ^{(∞)} has identical rows, which proves
the statement.
C Proof of Claim 3
Definition D (Row differences). The difference between the rows of W is measured by

δ(W) = max_j max_{i₁,i₂} |w_{i₁ j} − w_{i₂ j}| .    (9)

For identical rows, δ(W) = 0.
Definition E (Scrambling matrix). W is a scrambling matrix if

λ(W) := 1 − Σ_j min_{i₁,i₂} min(w_{i₁ j}, w_{i₂ j}) < 1 .    (10)
In plain words, Definition E says that if for every pair of rows i₁ and i₂ in a matrix W there exists a
column j (which may depend on i₁ and i₂) such that w_{i₁ j} > 0 and w_{i₂ j} > 0, then W is a scrambling
matrix. It is easy to verify that a positive matrix is always a scrambling matrix.
Lemma 1 (Adaptation of Lemma 2 from Wolfowitz [44]). For any t,

δ(W^{(t)} W^{(t−1)} ⋯ W^{(1)}) ≤ Π_{i=1}^{t} λ(W^{(i)}) .    (11)
Lemma 1 states that multiplying with scrambling matrices makes the row differences smaller.

tr(W^{(t)}) = Σ_i w_ii^{(t)} represents the sum of the self-confidences of all nodes. As every W^{(t)} is positive,
all W^{(t)}’s are scrambling. Thus, the differences between the rows of W^{(t)} W^{(t−1)} ⋯ W^{(1)} shrink
as t grows.

As ψ_i^{(t)} = Σ_j [W^{(t)} W^{(t−1)} ⋯ W^{(1)}]_{ij} ψ_j^{(0)}, the predictions on X∗ given by the different nodes become
more similar over time. By our calculation of W^{(t)} in Equation (6), which is based on the cosine
similarity between predictions, it follows that an agent’s trust towards the others grows over time,
i.e., Σ_{j≠i} w_ij^{(t+1)} ≥ Σ_{j≠i} w_ij^{(t)}. Since each row sums to 1, we have w_ii^{(t+1)} ≤ w_ii^{(t)} for all i.

By Theorem 1, ψ_i^{(t)} = ψ_j^{(t)} as t → ∞ for any i and j. By the calculation of W, the matrix W^{(t)}
therefore has equal entries as t approaches infinity.
D Proof of Proposition 2

The proof follows from the construction of Metropolis chains given a stationary distribution. We
first give an example of how Metropolis chains work.

Example 2 (Metropolis chains [30]). Given the stationary distribution π = [0.3, 0.3, 0.3, 0.1], how
can we construct a transition matrix that leads to this stationary distribution?

Given a symmetric matrix Φ, one can construct a Metropolis chain P as follows:

p(x, y) = ϕ(x, y) min{1, π(y)/π(x)}                       for y ≠ x,
p(x, x) = 1 − Σ_{z≠x} ϕ(x, z) min{1, π(z)/π(x)}           for y = x.    (13)
Following Example 2, choose Φ to be any self-confident doubly stochastic matrix. For all x, choosing
P as calculated from (13), we have

p(x, x) = 1 − Σ_{z≠x} ϕ(x, z) min{1, π(z)/π(x)} ≥ 1 − Σ_{z≠x} ϕ(x, z) = ϕ(x, x) ,    (14)

i.e., the probability mass within each row is more concentrated on the diagonal entries in P than in Φ.
As Φ already has high diagonal values, the claim follows.
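The construction in (13)–(14) is easy to verify numerically. A sketch that builds the Metropolis chain from the π of Example 2 and a symmetric, self-confident, doubly stochastic Φ (our own choice of Φ), then checks stationarity and the retained self-confidence:

```python
import numpy as np

pi = np.array([0.3, 0.3, 0.3, 0.1])
# symmetric, self-confident, doubly stochastic proposal matrix Phi
Phi = np.full((4, 4), 0.1) + np.eye(4) * 0.6

P = Phi * np.minimum(1.0, pi[None, :] / pi[:, None])  # off-diagonal rule of (13)
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))              # diagonal rule of (13)

assert np.allclose(pi @ P, pi)              # pi is indeed stationary
assert np.all(np.diag(P) >= np.diag(Phi))   # eq. (14): self-confidence retained
print(P.round(3))
```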
E Proof of Proposition 4

Proposition 4 states sufficient conditions on the W^{(t)}’s such that a low-quality node b is assigned the
lowest importance in π, i.e., π_b = min_i π_i. From Equation (12), π comes from the product of trust
matrices. We start with the product of two such matrices.

Proposition 7. Let A and B be row-stochastic and positive matrices and C = AB. If in both A and
B, (a) the (i, j)-th entry is the lowest value of the i-th row for every i ≠ j, and (b) the j-th column has
the lowest column sum, then the j-th column retains the lowest column sum in C and the (i, j)-th entry
remains the lowest value of the i-th row of C for i ≠ j.
Proof. Let C = AB. The sum of column j of C can be expressed as

Σ_i c_ij = Σ_i Σ_k a_ik b_kj = Σ_k (Σ_i a_ik) b_kj ,    (15)

and likewise for any column t,

Σ_i c_it = Σ_i Σ_k a_ik b_kt = Σ_k (Σ_i a_ik) b_kt .    (16)
We first show that the j-th column retains the lowest column sum in C. For t ≠ j:

Σ_i c_it − Σ_i c_ij = Σ_k (Σ_i a_ik)(b_kt − b_kj)
                    = Σ_{k≠j} (Σ_i a_ik)(b_kt − b_kj) + (Σ_i a_ij)(b_jt − b_jj)
                >(i)  Σ_{k≠j} (Σ_i a_ij)(b_kt − b_kj) + (Σ_i a_ij)(b_jt − b_jj)
                    = (Σ_i a_ij) [ Σ_{k≠j} (b_kt − b_kj) + (b_jt − b_jj) ]
                    = (Σ_i a_ij) ( Σ_k b_kt − Σ_k b_kj )
               >(ii)  0 .    (17)

(i) holds because for k ≠ j we have b_kt − b_kj > 0 and Σ_i a_ij < Σ_i a_ik;
(ii) holds because the j-th column has the lowest column sum in B.
We then show that the (i, j)-th entry remains the lowest value of the i-th row of C for i ≠ j. For t ≠ j, we have

c_it − c_ij = Σ_k a_ik b_kt − Σ_k a_ik b_kj
            = Σ_{k≠j} a_ik (b_kt − b_kj) + a_ij (b_jt − b_jj)
       >(iii) Σ_{k≠j} a_ij (b_kt − b_kj) + a_ij (b_jt − b_jj)
            = a_ij [ Σ_{k≠j} (b_kt − b_kj) + (b_jt − b_jj) ]
            = a_ij ( Σ_k b_kt − Σ_k b_kj )
        >(iv) 0 .    (18)

(iii) holds since b_kt − b_kj > 0 and a_ik > a_ij for i, k ≠ j;
(iv) holds because Σ_k b_kt > Σ_k b_kj.
For time-inhomogeneous trust matrices, Assumptions 1 and 2 ensure the Markov chain update
ψ_i^{(t)} = Σ_j w_ij^{(t)} ψ_j^{(t−1)}, which leads to consensus as proven in Theorem 1. By iteratively applying
Proposition 7, the b-th column retains the lowest column sum in the product W^{(τ)} W^{(τ−1)} ⋯ W^{(1)}.
For t ≥ τ , multiplying the consensus with any row-stochastic matrix preserves the consensus. Thus,
the b-th column remains the smallest column in the consensus. For the time-homogeneous case,
as long as W satisfies the same properties, one can easily verify that the same result still holds.
This proves Proposition 4.
Extension to more than one node with low-quality data. With more than one low-quality node, what
are the desired properties (sufficient conditions) for the transition (trust) matrices? It turns out
that, apart from the two conditions of the single low-quality-node case, we need an extra assumption.

Proposition 8. Given Assumptions 1 and 2 and that all agents are over-parameterized, let R be the set of
indices of regular nodes and B the set of indices of low-quality nodes. Suppose that for t ≤ τ , W^{(t)} satisfies
the following conditions:

i) any regular node’s column sum is larger than any low-quality node’s: min_{r∈R} Σ_i w_ir^{(t)} > max_{b∈B} Σ_i w_ib^{(t)};

ii) the gap between the sums of trust from regular nodes towards any regular node r and towards any low-quality
node b is larger than the gap between b’s self-confidence and its trust towards r:
Σ_{n∈R} (w_nr^{(t)} − w_nb^{(t)}) > w_bb^{(t)} − w_br^{(t)};

iii) any node’s trust towards a regular node is greater than or equal to its trust towards a low-quality
node other than itself: for any r ∈ R and any b ∈ B, w_nr^{(t)} ≥ w_nb^{(t)} holds as long as n ≠ b;

and that for t > τ , W^{(t)} = (1/N) 11^⊤. Then the nodes in B have a lower importance in the
consensus than the nodes in R.
Proof. First, let us look at the product of two such matrices for 1 < t < τ . For any r ∈ R
and b ∈ B, conditions (1)(2)(3) remain true for the product W^{(t)} W^{(t−1)}. We verify them one by one.

Verification of condition (1): any regular node’s column sum is larger than any low-quality node’s
in W^{(t)} W^{(t−1)}. For any r ∈ R and any b ∈ B, we have

Σ_i Σ_n w_in^{(t)} w_nr^{(t−1)} − Σ_i Σ_n w_in^{(t)} w_nb^{(t−1)}
  = Σ_n (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
  = Σ_{n∈R} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)}) + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
    + (Σ_i w_ib^{(t)}) (w_br^{(t−1)} − w_bb^{(t−1)})
 >(i) Σ_{n∈R} (Σ_i w_ib^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)}) + (Σ_i w_ib^{(t)}) (w_br^{(t−1)} − w_bb^{(t−1)})
    + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
  = (Σ_i w_ib^{(t)}) [ Σ_{n∈R} (w_nr^{(t−1)} − w_nb^{(t−1)}) + (w_br^{(t−1)} − w_bb^{(t−1)}) ]
    + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
>(ii) 0 .

(i) holds because Σ_i w_in^{(t)} for any n ∈ R is larger than Σ_i w_ib^{(t)} for any b ∈ B, which follows from
condition (1), and w_nr^{(t−1)} − w_nb^{(t−1)} > 0, which follows from condition (3).
(ii) holds following conditions (2) and (3): from (2), Σ_{n∈R} w_nr^{(t−1)} − Σ_{n∈R} w_nb^{(t−1)} + w_br^{(t−1)} −
w_bb^{(t−1)} > 0, and from (3), w_nr^{(t−1)} ≥ w_nb^{(t−1)} for n ≠ b.
Verification of condition (2): for any r ∈ R and b ∈ B,

Σ_{n∈R} ( Σ_p w_np^{(t)} w_pr^{(t−1)} − Σ_p w_np^{(t)} w_pb^{(t−1)} ) − ( Σ_p w_bp^{(t)} w_pb^{(t−1)} − Σ_p w_bp^{(t)} w_pr^{(t−1)} )
  = Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) w_pr^{(t−1)} − Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) w_pb^{(t−1)}
  = Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
  = Σ_{p∈R} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + Σ_{p∈B∖{b}} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + ( Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)} ) ( w_br^{(t−1)} − w_bb^{(t−1)} )
≥(iii) ( Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)} ) [ Σ_{p∈R} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + ( w_br^{(t−1)} − w_bb^{(t−1)} ) ]
    + Σ_{p∈B∖{b}} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
 ≥(iv) 0 .

(iii) holds because for p a regular node we have Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} > Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)}, which
follows from condition (2), and w_pr^{(t−1)} − w_pb^{(t−1)} ≥ 0 for p ≠ b, following from condition (3).
(iv) holds because of conditions (2) and (3).
Verification of condition (3): for n ≠ b, we want to show that the trust towards a regular node r is larger than
that towards a low-quality node b, that is, Σ_p w_np^{(t)} w_pr^{(t−1)} > Σ_p w_np^{(t)} w_pb^{(t−1)}:

Σ_p w_np^{(t)} w_pr^{(t−1)} − Σ_p w_np^{(t)} w_pb^{(t−1)}
  = Σ_{p∈R} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + Σ_{p∈B∖{b}} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + w_nb^{(t)} ( w_br^{(t−1)} − w_bb^{(t−1)} )
 ≥(v) w_nb^{(t)} [ Σ_{p∈R} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + ( w_br^{(t−1)} − w_bb^{(t−1)} ) ]
    + Σ_{p∈B∖{b}} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} )
≥(vi) 0 .

(v) holds because w_np^{(t)} ≥ w_nb^{(t)} for p ≠ b, following from condition (3);
(vi) holds following conditions (2) and (3).
It follows that in the product W^{(τ)} W^{(τ−1)} ⋯ W^{(1)}, a low-quality node still has a lower column
sum than any regular node, because conditions (1)(2)(3) hold for any product of W^{(t)}’s as long as
each W^{(t)} satisfies (1)(2)(3).

For t > τ , multiplying with the naive weight matrix does not change the ordering of the column sums,
so all low-quality nodes have a lower importance in the consensus than the regular nodes.
F Reasoning for the confidence weighting factor β_i^{(t)}

In this section, we justify our choice of β_i^{(t)}(x) in Section 4.2, i.e., we show that by adding such a term,
we are able to down-weight a regular node’s trust towards a bad node.

Φ^{(t)} is a row-normalized pairwise cosine similarity matrix, with (i, j)-th entry before row normalization
given by

(1/n⋆) Σ_{x′∈X∗} ⟨f_{θ_i^{(t−1)}}(x′), f_{θ_j^{(t−1)}}(x′)⟩ / (‖f_{θ_i^{(t−1)}}(x′)‖₂ ‖f_{θ_j^{(t−1)}}(x′)‖₂) .    (19)
After adding β_i^{(t)}(x) = 1/H(f_{θ_i^{(t−1)}}(x)), the matrix W^{(t)} has (i, j)-th entry before row normalization
given by

(1/n⋆) Σ_{x′∈X∗} [1 / H(f_{θ_i^{(t−1)}}(x′))] · ⟨f_{θ_i^{(t−1)}}(x′), f_{θ_j^{(t−1)}}(x′)⟩ / (‖f_{θ_i^{(t−1)}}(x′)‖₂ ‖f_{θ_j^{(t−1)}}(x′)‖₂) .    (20)
We want to show that the weighting scheme down-weights a regular node i’s trust towards a low-quality
node b, that is,

ϕ_ib^{(t)} > w_ib^{(t)} .

As the comparison is made with respect to the same time step t, we drop the superscript t from now
on. Let {a_0, .., a_{N−1}} be the cosine similarities between a regular agent i and the other agents inside agent i’s
confident region, and {b_0, .., b_{N−1}} be the cosine similarities between i and the others outside agent i’s
confident region. By confident region, we mean the region with low entropy in the class probabilities, i.e.,
where the model is more sure about its prediction. Further, we make the following assumptions:

a) for x′ in agent i’s confident region, the entropy of the predicted class probabilities is low:
H(f_{θ_i^{(t−1)}}(x′)) = 1/c₁; while for x′ outside agent i’s confident region, H(f_{θ_i^{(t−1)}}(x′)) = 1/c₂.
We further assume 0 < c₂ < c₁.

b) inside a regular node i’s confident region, i has a better judgment of the alignment score
produced by the cosine similarity, so that the similarity with the low-quality node b is weighted
lower inside:

a_b / Σ_j a_j < b_b / Σ_j b_j .    (21)
With this notation, ϕ_ib = (a_b + b_b)/Σ_j(a_j + b_j), while the entropy weighting gives
w_ib = (c₁ a_b + c₂ b_b)/Σ_j(c₁ a_j + c₂ b_j). Showing ϕ_ib > w_ib is therefore equivalent to showing

c₂ b_b Σ_j a_j + c₁ a_b Σ_j b_j < c₁ b_b Σ_j a_j + c₂ a_b Σ_j b_j ,    (25)

which holds by assumption b) together with c₂ < c₁. Now, adding c₁ a_b Σ_j a_j + c₂ b_b Σ_j b_j to both sides
of (25), we have

c₁ a_b Σ_j a_j + c₂ b_b Σ_j a_j + c₁ a_b Σ_j b_j + c₂ b_b Σ_j b_j <
c₁ a_b Σ_j a_j + c₁ b_b Σ_j a_j + c₂ a_b Σ_j b_j + c₂ b_b Σ_j b_j ,    (26)

i.e., (c₁ a_b + c₂ b_b) Σ_j (a_j + b_j) < (a_b + b_b) Σ_j (c₁ a_j + c₂ b_j), which is exactly w_ib < ϕ_ib.
G Complementary details

All model training was done using a single GPU (NVIDIA Tesla V100). For each local iteration,
we load local data and shared unlabeled data with batch sizes 64 and 256, respectively. We empirically
observed that a larger batch size for the unlabeled data is necessary for the training to work well. The
optimizer used is Adam with a learning rate of 5e-3. For Cifar10 and Cifar100, as the base model is not
pretrained, we run 50 global rounds with 5 local training epochs per agent per global round. For the
Fed-ISIC-2019 dataset, as the base model is a pretrained EfficientNet, we run 20 global rounds. For the
first 5 global rounds, we set λ = 0 to arrive at good local models, so that every agent can evaluate
trust more fairly. After that, λ is fixed at 0.5. Dynamic trust is computed after each global round,
while static trust denotes using the initially calculated trust values throughout the whole
experiment.

For Cifar10 and Cifar100, we use 5% of the whole dataset to constitute X∗, with equal representation of
each class. We spread the rest over 10 clients using a Dirichlet distribution with
α = 1. For the Fed-ISIC-2019 dataset, we follow the original splits of du Terrail et al. [36], and we let
each client contribute 50 data samples to constitute X∗.

We employ a fixed λ for all our experiments. To select λ, we randomly sample 10% of the full
Cifar10 dataset, which we then split into local training data (95%) and X∗ (5%). The local training
data is then spread over 10 clients using a Dirichlet distribution with α = 1. The global test accuracy
as a function of λ is plotted in Fig. 8. We thus choose λ = 0.5 for all our experiments, which
consistently gives stable performance.