Collaborative Learning Via Prediction Consensus
Abstract
We consider a collaborative learning setting where the goal of each agent is to
improve their own model by leveraging the expertise of collaborators, in addition
to their own training data. To facilitate the exchange of expertise among agents,
we propose a distillation-based method leveraging shared unlabeled auxiliary data,
which is pseudo-labeled by the collective. Central to our method is a trust weighting
scheme that serves to adaptively weigh the influence of each collaborator on the
pseudo-labels until a consensus on how to label the auxiliary data is reached. We
demonstrate empirically that our collaboration scheme is able to significantly boost
the performance of individual models in the target domain from which the auxiliary
data is sampled. By design, our method adeptly accommodates heterogeneity in
model architectures and substantially reduces communication overhead compared
to typical collaborative learning methods. At the same time, it can provably mitigate
the negative impact of bad models on the collective.
1 Introduction
This work considers a decentralized learning setting where each agent locally has access to a labeled
dataset and predictive model. The agents may differ in the data distribution they have access to as
well as the quality of their local models. In addition, we assume a shared unlabeled dataset X ∗
sampled from a target distribution Q is available to all agents. The central question studied in this
work is: how can agents effectively exchange information to benefit from each other’s local expertise
in order to improve their predictive performance on the target domain Q?
Towards this goal, our work takes inspiration from social science on how a panel of human experts
collaborate on a task. Humans typically engage in discourse to exchange information, they share
their opinions, and based on how much they trust their peers, each individual will then adjust their
subjective belief towards the opinion of peers. When repeated, this process gives rise to a dynamic
process of consensus finding, as formalized by DeGroot [1]. Central to the consensus mechanism of
DeGroot is the concept of trust. It determines how much individual agents influence each other’s
opinion, and thus the influence of each agent on the final consensus.
Our proposed algorithm mimics this consensus-finding mechanism in the context of collaborative
learning, inspired by recent work [2]. In particular, our consensus procedure is aimed at deciding how to label
the shared dataset X∗. To this end, we carefully design a strategy by which each agent determines
its trust towards others, given its local information, to optimally leverage each agent’s expertise
to collectively pseudo-label X ∗ . This mechanism of knowledge distillation is then combined with
techniques from self-training [3] in order to transfer the shared knowledge from the pseudo-labels
into the local models in an iterative fashion.
Code available at https://github.com/fan1dy/collaboration-consensus
2 Related work
In the classical federated learning setting, a central server coordinates local updates toward learning a
global model. Local nodes upload gradients or model parameters, instead of data itself, to maintain a
certain level of privacy. McMahan et al. [5] describe the classic FedAvg algorithm. Follow-up works
predominantly focus on addressing challenges from non-i.i.d. local data [6, 7, 8] and robustness
towards Byzantine attacks [9, 10]. Apart from communicating gradients or model parameters, several
works discuss alternatives to allow for heterogeneous model architectures. These methods are based
on variants of model distillation [11, 12], reaching an agreement in the representation space [13]
or in the output space [14]. Similar to our work, both assume access to a shared unlabeled dataset, but
they determine agreement on the unlabeled data using naive averaging, while we aim to account for
heterogeneity in model and data quality through trust weighting.
In contrast to federated learning, the fully decentralized learning setting does not assume the existence
of a central server. Instead, decentralized schemes such as gossip averaging are used to aggregate
local information across agents [15, 16]. Despite the lack of a global state, such methods can provably
converge to the desired global solution, leading to a gradual consensus among individual models [4].
In this context, Dandi et al. [17] and Le Bars et al. [18] optimize the communication topology to adapt to
data heterogeneity but do not offer any collaborator selection mechanism. Bellet et al. [15] allow
personalized models on each agent, but assume prior information about task-relatedness, as opposed
to learned selection. Gossip algorithms typically assume a fixed gossip mixing matrix given by,
e.g., the physical connections of nodes [19, 20, 21]. These approaches do not consider data-dependent
communication informed by task similarities or node qualities. Several recent works have addressed this
issue by proposing alternative methods that take these factors into account. Notably, Li et al. [22] directly
optimizes the mixing weights by minimizing the local validation loss per node, which requires
labeled validation sets. Sui et al. [23] uses the E-step of the EM algorithm to estimate the importance of
other agents to one specific agent i, by evaluating the accuracy of the other agents’ models on the local
data of agent i. This way of computing trust prevents the algorithm from being applied to target
distributions that differ from the local distribution, differentiating it from our work. Moreover, for
both Li et al. [22] and Sui et al. [23] the aggregation is performed in the gradient space, and therefore
heterogeneous models are not supported.
Our work relates to semi-supervised learning as it involves partially unlabeled data. Most relevant are
self-training methods [24] that first train a model using labeled data, then use the trained model to give
pseudo-labels to unlabeled data. The pseudo-labels can further be fed back to the training loop to attain
a better model. Wei et al. [25] shows that under expansion and separation assumptions, self-training
with input consistency regularization can achieve high accuracy with respect to ground-truth labels.
When more than one learner is involved, co-training [26] extends self-training by leveraging the
knowledge of learners with independent views to label a set of unlabeled data. Diao et al. [27]
incorporates semi-supervised learning into federated learning. In a setting where clients hold unlabeled
data and the central server holds labeled data, their experiments demonstrate that the server’s
performance is significantly improved by the unlabeled clients.
Farina [28] presents a collective learning framework for distributed semi-supervised learning, which
combines predictions on a shared dataset via weights derived from the local models’ performances
on local validation datasets. While their algorithm bears similarities to ours, it is important to note
that it is exclusively tailored to scenarios in which the target domain matches the global distribution.
In a similar spirit, we want to leverage unlabeled data in a fully decentralized setting.
Finally, Mendler-Dünner et al. [2] have previously formalized collective prediction as a dynamic
consensus finding procedure. They demonstrated that such an approach can lead to significant gains
over naive model averaging. We leverage these insights and extend their approach from test-time
prediction to collaborative model training.
3 Method description
Our proposed method is designed to take advantage of shared unlabeled data in the context of
collaborative learning through knowledge distillation. To this end, it emulates human opinion dynamics
to collectively pseudo-label the shared auxiliary data. These labels are then incorporated in the local
model update steps towards collectively improving the performance on the data distribution from
which the shared data is sampled.
To describe the pseudo labeling step, let us use fθi to denote the local model of agent i ∈ [N ]
parameterized by θi . We write ŷi = fθi (X ∗ ) to denote the predictions of agent i on the auxiliary
data X ∗ . Agents share these predictions with their peers. Naturally, the individual models may differ
in these predictions and it is a priori unclear which model is most accurate, as ground truth labels
of the auxiliary data are not available. To combine the predictions into pseudo-labels for X ∗ , each
agent locally decides how to weigh other agents’ predictions by estimating their respective expertise
on the target task. We refer to these weights as trust scores, and we use wij to denote the trust of
agent i towards the predictions of agent j. It is worth noting that the trust between agents is not
necessarily mutual, i.e., it can be asymmetric: agent i can trust agent j without agent j necessarily
trusting agent i back. We use W to denote the matrix of trust scores. Given the trust scores, agent i
uses the following pseudo-labels for the auxiliary data:

ψ_i = Σ_j w_ij ŷ_j .    (1)
Trust weights are determined locally by each agent based on query access to other agents’ predictions
and they are refined iteratively throughout training as models are being updated. The adaptive weight
computation will be detailed in Section 4.
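To make (1) concrete: stacking the agents’ soft predictions, the aggregation is a single weighted average per agent. Below is a minimal NumPy sketch; the function name and array layout are our own illustration and not taken from the released code.

```python
import numpy as np

def aggregate_pseudo_labels(W, preds):
    """Combine agents' soft predictions into pseudo-labels as in eq. (1).

    W:     (N, N) row-stochastic trust matrix; W[i, j] is agent i's
           trust towards agent j.
    preds: (N, n_star, C) soft predictions of the N agents on the
           n_star shared samples with C classes.
    Returns an (N, n_star, C) array; row i holds agent i's pseudo-labels.
    """
    # psi_i = sum_j w_ij * yhat_j, vectorized over all agents at once
    return np.einsum("ij,jnc->inc", W, preds)
```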
In the second step, the proxy labels for the auxiliary data are used to augment local model training.
Specifically, in each step, the local optimization problem is augmented by a disagreement loss, and the
new objective is given by

L(f_θi(X_i), y_i) + λ dist(f_θi(X∗), ψ_i) .    (2)
Algorithm 1 Pseudo code of our proposed algorithm

Input: For each agent i ∈ [N ] we are given a local model θ_i^{(0)}, a labeled local dataset (X_i, y_i), and unlabeled shared data X∗.
for t = 1, ..., T do
    Each node i ∈ [N ] broadcasts its soft labels ŷ_i^{(t−1)} = f_{θ_i^{(t−1)}}(X∗) to all other nodes
    in parallel for each agent i do
        • Calculate pairwise trust scores w_ij^{(t)} (j ∈ [N ]), based on the received soft decisions, using the methods provided in Section 4
        • Get pseudo-labels on X∗ from collaborators: ψ_i^{(t)} = Σ_j w_ij^{(t)} ŷ_j^{(t−1)}
        • Do local training with the collaborative disagreement loss:
          θ_i^{(t)} ∈ argmin_θ L(f_θ(X_i), y_i) + λ dist(f_θ(X∗), ψ_i^{(t)})    (3)
end for
where f_θi(X) denotes the vector of agent i’s predictions on the dataset X, L is the local training
loss, and dist(·) is a disagreement measure. We choose the ℓ2 distance as the disagreement measure in
the regression case and cross-entropy in the classification case. λ > 0 is a trade-off hyperparameter
that weighs the local loss against the cost of disagreement. This objective adheres to a conventional
semi-supervised learning approach; however, we generate the pseudo-labels in a trust-based collective
manner.
To iteratively refine the local models in the spirit of self-training, the pseudo-labeling step and the
local training step are performed in an alternating fashion as described in Algorithm 1. Starting from
pre-trained models θ_i^{(0)}, in each round t ∈ {1, .., T} model predictions on the auxiliary data are
shared, and then each agent aggregates them into a set of pseudo-labels to augment its local data and
perform an update step.
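For concreteness, one local update on the objective in (2)–(3) could look as follows in PyTorch; this is a sketch under our own naming (model, opt, lam), with the soft cross-entropy written out explicitly for the classification case, not the authors’ released implementation.

```python
import torch.nn.functional as F

def local_update_step(model, opt, X_local, y_local, X_star, psi, lam=0.5):
    """One gradient step on eq. (3): supervised loss + disagreement loss."""
    opt.zero_grad()
    # standard supervised loss on the labeled local data
    sup_loss = F.cross_entropy(model(X_local), y_local)
    # cross-entropy between local predictions on X* and the collective
    # soft pseudo-labels psi (the dist(.) term, classification case)
    log_probs = F.log_softmax(model(X_star), dim=1)
    dis_loss = -(psi * log_probs).sum(dim=1).mean()
    loss = sup_loss + lam * dis_loss
    loss.backward()
    opt.step()
    return loss.item()
```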
Our algorithm is motivated conceptually by co-training [26], where it was demonstrated that unlabeled
data can be used to augment labeled data to boost model performance. Since agents communicate
by broadcasting predictions, the communication cost of transmitting predictions is significantly
lower than that of sharing model weights. Moreover, the algorithm can be extended to reuse the same
pseudo-labels ψ_i for multiple local epochs to further reduce the communication burden.
We study under what conditions Algorithm 1 will reach a consensus among agents on how to label
the auxiliary data. For the analysis, we focus on the over-parameterized regime¹ and we make the
following assumption on the local data distributions:

Assumption 1. There is no concept shift between the local data distributions and the target domain
Q from which the shared data is sampled, i.e., P_i(Y | X = x) = Q(Y | X = x) for all i ∈ [N ] and
x ∈ supp(Q).

Together with over-parameterization, this assumption implies that the minimizer of the objective
specified in (2) can always reach zero loss. Further, this allows us to model the update of the agents’
predictions on X∗ as a Markov process whose state transition matrix corresponds to the trust
matrix W^{(t)}. It is therefore convenient to write the update of the predictions on X∗ performed by
the algorithm in matrix form. Stacking the predictions as Ψ^{(t)} = [ŷ_1^{(t)}, .., ŷ_N^{(t)}], we have for t ≥ 1

Ψ^{(t)} = W^{(t)} Ψ^{(t−1)} .    (4)

It can be shown that under weak conditions on W^{(t)}, a consensus will be reached by our algorithm.

¹ We say a model is over-parameterized if its training error can reach zero. Over-parameterization is a reasonable assumption in the deep learning regime.
Theorem 1 (Consensus on predictions). Assume all agents’ models are over-parameterized and the
data distributions satisfy Assumption 1. Then, for t → ∞, Algorithm 1 converges to a consensus
among the local models on the predictions on X∗, that is,

ψ_i^{(t)} = ψ_j^{(t)}   ∀ i ≠ j,    (5)

as long as W^{(t)} is row-stochastic and positive for any t ≥ 0.
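The mechanism behind Theorem 1 is easy to observe numerically: under the over-parameterization idealization, each round applies the Markov update (4), and products of positive row-stochastic matrices drive the rows of Ψ together. A self-contained sketch, with random matrices standing in for the learned trust scores:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 3                     # 5 agents, 3-dimensional predictions
Psi = rng.random((N, C))        # initial predictions on one shared point

for t in range(200):
    W = rng.random((N, N)) + 0.1        # positive entries
    W /= W.sum(axis=1, keepdims=True)   # row-stochastic
    Psi = W @ Psi                       # eq. (4): Psi(t) = W(t) Psi(t-1)

# the rows of Psi are now (numerically) identical: a consensus
print(np.ptp(Psi, axis=0))  # per-coordinate spread, on the order of 1e-16
```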
A key feature of our method is that agents do not share model parameters; they communicate
by exchanging prediction queries. Consequently, even if Algorithm 1 achieves a consensus, the agents
have not necessarily learned the same model, nor do they need to agree on predictions outside X∗. We
illustrate this difference between information sharing in prediction space and in parameter space with a
simple example, for which we construct an over-parameterized problem. We generate local data
using cubic regression models with additive i.i.d. noise in the output, as shown in Fig. 1. We apply
the optimal trust weighting scheme, which can be computed in closed form in this example. Then, each
agent fits a polynomial regression of degree 4, so that the model is over-parameterized with respect to
the data. Full details of our example are given in Appendix A. We refer to the work of [2] for a
similar setting with under-parameterized models. Here we note the most interesting observations in
the over-parameterized regime.
First, we observe that for T ≥ 20 the three agents reach a consensus on the predictions on X∗.
However, the model parameters are not the same across the agents, as depicted in the rightmost panel.
Further, considering the properties of the algorithm across rounds and the predictions in different
regions of the input space in more detail, the following desirable behaviors are observed:
a) In data-rich regions, agent i fits the local data more accurately and moves pseudo-labels
closer to its own predictions.
b) In data-scarce regions, agent i only updates its model parameters to fit the pseudo-labels.
c) When local loss minimization and prediction consensus can be achieved at the same time,
agents can arrive at models with a perfect agreement in the target prediction space.
4 Trust weight computation
4.1 Trust evaluation through self-confidence
The quality of the local models can differ due to various factors, such as the amount of labeled data
available during training, the expressivity of the local model, the training algorithm, or the
relevance of the local data for the target task of labeling Q. Thus, a desirable property of the
consensus solution is that malicious agents, or agents with low-quality models, contribute less to the
pseudo-labeling than agents with better models.
Hadjicostis and Dominguez-Garcia [29] differentiate between malicious and non-malicious agents
and they discuss the concept of trustworthy consensus, where only non-malicious agents contribute to
the consensus. In contrast to this prior work, we do not aim for trustworthy agents to contribute equally.
Instead, we specifically want the consensus to arise from potentially unequal contributions of all agents,
weighted according to their relevance. We allow the trust matrix to be asymmetric. Each agent
determines trust from the information available to it locally, which differs across agents.
Central to any such strategy is that the capabilities of models on Q can be estimated appropriately. In
the following, we discuss a strategy of how to determine trust from local data and prediction queries
to other models.
As no label information on X ∗ is available to evaluate trust, it is natural to use agents’ own predictions
on X ∗ as a local reference point. Then, each agent distributes their trust towards other agents based
on the alignment of their predictions. We use weighted pairwise cosine similarity as a measure of
alignment, which motivates the following trust weight calculation:

w_ij^{(t)} = γ_ij^{(t)} / Σ_j γ_ij^{(t)}   with   γ_ij^{(t)} = (1/n⋆) Σ_{x∈X∗} β_i^{(t)}(x) · ⟨f_{θ_i^{(t−1)}}(x), f_{θ_j^{(t−1)}}(x)⟩ / (‖f_{θ_i^{(t−1)}}(x)‖₂ ‖f_{θ_j^{(t−1)}}(x)‖₂) .    (6)

The inclusion of the weighting factor β_i^{(t)}(x) and how to choose it will be discussed in Section 4.2.
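A sketch of the trust computation in (6); the per-sample weights β are passed in so that the confidence weighting of Section 4.2 can be plugged in, and the array layout and names are our own:

```python
import numpy as np

def trust_scores(preds, beta):
    """Trust matrix from weighted pairwise cosine similarity, eq. (6).

    preds: (N, n_star, C) soft predictions of the N agents on X*;
           softmax outputs are nonnegative, so the similarities are too.
    beta:  (N, n_star) per-agent, per-sample confidence weights.
    Returns a row-normalized (N, N) trust matrix W.
    """
    # normalize each prediction vector to unit length
    unit = preds / np.linalg.norm(preds, axis=2, keepdims=True)
    # cos[i, j, x] = <f_i(x), f_j(x)> / (||f_i(x)|| * ||f_j(x)||)
    cos = np.einsum("inc,jnc->ijn", unit, unit)
    # gamma_ij = (1/n*) sum_x beta_i(x) * cos_ij(x)
    gamma = (cos * beta[:, None, :]).mean(axis=2)
    return gamma / gamma.sum(axis=1, keepdims=True)
```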
Self-confident trust. Naturally, pairwise cosine similarity leads to a trust matrix in which the diagonal
entry is the highest value of each row. We call this property self-confidence, as each agent
trusts itself the most. We now demonstrate that this property is not particularly restrictive: even when
the trust matrix is constrained to be self-confident, it is still possible to design a matrix that induces
any desired consensus. The proof is given in Appendix D.
Proposition 2. For any given consensus distribution π, it is always possible to find a trust matrix W
that leads to it and is both row-stochastic and self-confident.
A second appealing property of our trust calculation is that, for an appropriate choice of β_i^{(t)}, the
trust scores in (6) become more evenly distributed over time as agents gradually reach consensus.

Claim 3. Suppose Assumption 1 holds and all agents are over-parameterized. Assume β_i^{(t)} is
chosen such that the trust matrix W^{(t)} is row-stochastic and positive for all t ≥ 0. Then, for the trust
calculation in (6), W^{(t)} loses self-confidence over time and finally converges to a uniform
matrix:

tr(W^{(t)}) ≤ tr(W^{(t−1)})   and   W^{(t)} → (1/N) 11^⊤  as t → ∞.    (7)
The proof is provided in Appendix C. This claim characterizes the behavior of our dynamic trust
scheme: initially, agents distribute trust towards helpful collaborators and work towards a
consensus on X∗. Once consensus is reached, we have ŷ_i^{(t)} = ŷ_j^{(t)} for any i, j, and W^{(t)}
becomes a matrix with uniform weights. Thus, no individual agent retains a disproportionately high weight after
consensus is reached, which in turn implies that no agent has the ability to excessively manipulate
the collective labeling on its own. In the next section we discuss further robustness properties of our
algorithm.
If agents possess low-quality local data, we aim to minimize their influence on the labeling of the
auxiliary data not only at consensus, but also throughout the algorithm. Proposition 4 gives sufficient
conditions for such a desirable property to hold at any step before consensus is reached:
assume there exists only one node with low-quality data; then, as long as it receives the lowest trust
from every other node, it retains the lowest importance in the consensus.

Proposition 4. Suppose Assumption 1 holds, the trust matrices are row-stochastic and positive, and all agents
hold over-parameterized models. Let b be the only node with low-quality data and τ the timestep
at which consensus is reached. Suppose the following properties hold for t < τ :

i) b receives the lowest trust from all nodes other than itself, i.e., w_jb^{(t)} = min_i w_ji^{(t)} for j ≠ b;

ii) the b-th column has the lowest column sum: Σ_j w_jb^{(t)} < min_{i≠b} Σ_j w_ji^{(t)}.

Then b is assigned the lowest importance in the consensus, i.e., π_b = min_i π_i.
The proof is given in Appendix E, where we also provide desired properties in the presence of
multiple nodes with low-quality data, under some extra assumptions. Note that when nodes with
weak model architectures (such as under-parameterized models) are involved, achieving consensus
is not assured. If such a consensus solution does exist, it will be constrained by the underfitting of
weak nodes. Consequently, this solution would not serve as a stationary solution concerning the local
training loss of a strong node. Nevertheless, we conjecture, and find empirically, that these desired
properties can still enhance training by mitigating the impact of the weak nodes.
4.2 Confidence weighting

In the following, we discuss the choice of the weights β_i^{(t)} in (6). Specifically, we incorporate
confidence weighting into the pairwise cosine similarity calculation to emulate the construction of a
transition matrix based on a known consensus distribution.
Let us start by outlining an idealized trust calculation that effectively down-weighs agents with low-quality
data. We first construct an intermediate transition matrix Φ from the pairwise cosine similarities
of the agents’ predictions on X∗ (with row normalization). For the low-quality node b, ϕ_jb will be
the lowest value in the j-th row, for any j ≠ b. According to Proposition 4, in order for low-quality
workers to have low importance in the consensus, the overall trust that b receives needs to be the lowest
among all nodes. To achieve this, we need to assign the trust of regular workers towards the low-quality
workers a very small value, as it is difficult to alter self-confidence. If the consensus importance weights
were known, one could easily calculate the corresponding trust matrix as

w_jb = ϕ_jb min{1, (π(b) ϕ_bj) / (π(j) ϕ_jb)} ,    (8)

following a classical result on Metropolis chains [30] (also see Appendix D). This gives
w_jb < ϕ_jb for j ≠ b, as π(b)/π(j) is sufficiently small.
However, the consensus importance weights in (8) cannot usually be computed, and hence we cannot attain
the ideal trust matrix. Therefore, we propose an alternative weighting scheme that achieves a similar
effect: we up-weight the similarity in the regions where agent i has more confidence, i.e., where
agent i’s class probability assignments have lower entropy. By doing this, we encourage the
trust weights to concentrate on the agents themselves and on helpful workers, and less on
low-quality workers. We incorporate this into the trust weight calculation (6) by choosing

β_i^{(t)}(x) = 1 / H(f_{θ_i^{(t−1)}}(x)) ,

where H denotes the entropy. We offer further intuition as well as justification for this weighting
scheme in Appendix F. Moreover, we empirically demonstrate how our choice of trust matrix leads
to a low column sum for bad nodes in Section 5.2.
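This entropy-based weight plugs directly into the trust computation sketched in Section 4.1; the small eps guarding against zero entropy on one-hot predictions is our own numerical safeguard:

```python
import numpy as np

def entropy_beta(preds, eps=1e-8):
    """beta_i(x) = 1 / H(f_i(x)) for soft class-probability predictions.

    preds: (N, n_star, C) class probabilities.
    Returns (N, n_star) weights; confident (low-entropy) samples get
    a large weight, so they dominate the similarity average.
    """
    H = -(preds * np.log(preds + eps)).sum(axis=2)  # Shannon entropy
    return 1.0 / (H + eps)
```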
5 Experiments
We start with a synthetic example to visualize the decision boundary achieved by our algorithm and
then demonstrate its performance on real data in a heterogeneous collaborative learning setting.
Figure 2: Decision boundary comparison between our dynamic trust update and the naive trust update: (a) local data distributions, (b) dynamic trust, (c) naive trust.
Four classes are generated via multivariate Gaussians, P^c = N (µ_c, Σ), where µ_0 =
(−2, 2)^⊤, µ_1 = (2, 2)^⊤, µ_2 = (−2, −2)^⊤, µ_3 = (2, −2)^⊤ and Σ = I_{2×2}. We have four agents, and
each agent holds local data sampled from an even mixture of the P^c’s. To simulate heterogeneity in data
quality, we flip a fraction of the labels for each agent: for agents 0–2, we randomly
flip 10% of the labels, and for the last agent (agent 3), we flip all labels. The unlabeled data X∗ are sampled
equally from the P^c’s. The data distribution is shown in Fig. 2a. The base model used in each node is a
multi-layer perceptron with 3 layers of 5, 10, and 4 neurons, respectively. We compare the decision
boundary found by Algorithm 1 with dynamic trust weights to that found with naive trust weights. The results are
illustrated in Fig. 2b–c. When a client with low-quality data is involved, i.e., client 3 in this toy
example, our trust update scheme yields a better decision boundary for the good agents after collaboration,
as blind trust towards low-quality clients impairs the effectiveness of the pseudo-labeling.
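For reference, the toy setup of Fig. 2 can be reproduced along these lines; this is a sketch under the stated distributional assumptions, and the sample sizes and random-flip rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-2, 2], [2, 2], [-2, -2], [2, -2]], dtype=float)

def sample_client(n_per_class, flip_frac):
    """Even mixture of the four Gaussians with a fraction of labels flipped."""
    X = np.vstack([rng.normal(mu, 1.0, size=(n_per_class, 2)) for mu in means])
    y = np.repeat(np.arange(4), n_per_class)
    flip = rng.random(len(y)) < flip_frac
    # one possible flipping rule: move each flipped label to a random wrong class
    y[flip] = (y[flip] + rng.integers(1, 4, size=flip.sum())) % 4
    return X, y

clients = [sample_client(100, 0.1) for _ in range(3)]  # agents 0-2: 10% flipped
clients.append(sample_client(100, 1.0))                # agent 3: all labels flipped
X_star = np.vstack([rng.normal(mu, 1.0, size=(50, 2)) for mu in means])  # unlabeled
```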
Next, we consider a more challenging setting, where the local data distributions are non-i.i.d. Two
different types of statistical heterogeneity are considered: (1) Synthetic heterogeneity. We utilize the classic
Cifar10 and Cifar100 datasets [31] and create 10 clients from each dataset. To distribute classes
among clients, we use a Dirichlet distribution² with α = 1. Unless specified otherwise, we employ
ResNet20 [32] without pretraining. (2) Real-world data heterogeneity. We include a real-world dermoscopic
lesion image dataset from the ISIC 2019 challenge [33, 34, 35]. The same client
splits are used as in [36], based on the imaging acquisition systems employed in six different hospitals.
The dataset includes eight classes of lesions to classify, with the class distribution among the clients
displayed in Fig. 3b. Following [36], we choose a pretrained EfficientNet [37] as the base model and
use balanced accuracy as the evaluation metric. For every dataset, we construct X∗ from samples
contributed equally by every agent.
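The Dirichlet-based class split can be sketched as follows; this is the standard construction, equivalent in spirit to the repository referenced in the footnote:

```python
import numpy as np

def dirichlet_partition(labels, n_clients=10, alpha=1.0, seed=0):
    """Split sample indices across clients with Dirichlet class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))   # class-c share per client
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx
```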
Baseline methods. We compare our method to several baseline methods, including FedAvg [5],
FedProx [7], SCAFFOLD [8] (SCA), FedDyn [38], local training without collaboration (LT), and
training with naive trust (Naive). Note that with naive trust we realize soft majority voting, which
corresponds to the baseline method proposed in [14]. We adhere to the same-architecture setting, where
the standard federated learning algorithms can be applied. To initiate the process, we let each client
perform local training for 5 global rounds, with the objective of obtaining a sufficiently refined
model that can be used for trust evaluation. From the 6th training round on, the clients collaborate.

² The Dirichlet-distributed splits are constructed using the code from https://github.com/TsingZ0/PFL-Non-IID
Table 1: Our methods compared to baseline methods. Blue denotes the algorithm with top-1 accuracy
and green denotes the method with the 2nd-best accuracy. “Ours-S” denotes the static version, where the
trust score is kept constant after its first calculation (after 5 rounds of local training), and “Ours-D”
denotes the dynamic version, where the trust score is updated every global round.

                         FedAvg  FedProx  SCA    FedDyn  LT     Naive  Ours-S  Ours-D
Regular      Cifar10     0.542   0.517    0.578  0.578   0.475  0.618  0.604   0.612
             Cifar100    0.261   0.240    0.317  0.310   0.178  0.311  0.319   0.308
             Fed-ISIC    0.279   0.261    0.213  0.243   0.248  0.290  0.302   0.291
Low-Quality  Cifar10     0.541   0.530    0.570  0.575   0.470  0.596  0.605   0.608
Data         Cifar100    0.254   0.240    0.289  0.308   0.171  0.285  0.300   0.306
             Fed-ISIC    0.229   0.242    0.221  0.243   0.217  0.247  0.249   0.269
Figure 5: Target accuracy comparison with 2 different model architectures, with error bars (hatch
pattern denotes that a fully connected NN is used). From left to right: Cifar10, Cifar100, Fed-ISIC.
Over a total of 50 global rounds, each consisting of 5 local epochs, we report the averaged accuracy
results from three repeated experiments in Table 1. The evaluation metric is calculated on the dataset
X ∗ . λ is fixed as 0.5 in all experiments.
Heterogeneity in data quality. When all nodes share the same data quality and degree of statistical
heterogeneity (denoted “regular” in the table), our methods align closely with consensus through
naive averaging, which is optimal in this case. When all nodes share the same degree of statistical
heterogeneity but differ in data quality, exemplified by randomly selecting two nodes (indexed
2 and 9) for a complete flip of their local training labels, our dynamic trust update shows better overall
performance³, demonstrating the effectiveness of our approach in limiting the detrimental influence of
nodes with low-quality data. We further plot the trust matrix learned in the dynamic update
mode during one of the middle training rounds in the left plot of Fig. 4. Clearly, our algorithm is able
to assign low trust weights to the nodes with low-quality data, and the 2nd and 9th columns have the
lowest column sums.
Heterogeneity in model architecture. We allocate a more expressive model architecture to the
first half of the nodes and a less expressive one to the other half. The former comprises ResNet20
and EfficientNet, the models of choice in the previous experiments. For the latter, we
employ a linear model (i.e., a one-layer fully connected neural network) that takes a flattened image tensor
as input and produces an output of size equal to the number of classes. It is worth noting that when
agents with strong and weak model architectures (as in cases of under-parameterization) co-exist,
consensus might not occur, as suggested by our empirical findings illustrated in Fig. 5. Nevertheless,
our trust-based collaborator selection mechanism consistently outperforms local training and simple
averaging. The trust weight matrix learned during Cifar100 training is depicted in the right plot of
Fig. 4, revealing the presence of asymmetric trust. Specifically, the last 5 nodes exhibit a higher level
of trust towards the first 5 nodes, while the opposite is not true. This trust allocation is desirable for
identifying the helpers. We refer to Appendix A.1 for further empirical evidence from a toy
polynomial regression example in the presence of strong and weak architectures.
Reduced communication costs. Gradient-aggregation-based methods incur a significant communication
burden proportional to the number of model parameters (O(N × |params|)), which is particularly
heavy given the over-parameterized nature of modern deep learning. In contrast to existing approaches,
our proposed method significantly reduces the communication burden by letting each
node transmit only its predictions on the shared dataset. This results in a communication overhead of
O(N² × n⋆ × C). This value does not grow with model complexity and is much smaller than the model size.
Moreover, our method maintains its high performance even when the number of local epochs increases.
FedAvg, on the other hand, loses its effectiveness with less frequent synchronization, i.e., more local
epochs between global aggregation rounds, as shown in the left panel of Fig. 6.

³ Here we report the average accuracy of the regular workers, excluding workers with low-quality data.

Figure 6: Algorithm performance on Cifar100 for different algorithm configurations. (left) Effect of
varying the number of local epochs on final performance; (right) algorithm performance as a function of
the number of training rounds with 5 local epochs each.
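As a back-of-the-envelope comparison (illustrative sizes, assuming float32 values; the parameter count is approximate):

```python
# Rough per-round communication, 4 bytes per float32 value.
N, n_star, C = 10, 2500, 10   # clients, shared samples, classes (illustrative)
params = 270_000              # roughly ResNet20-sized model

fedavg_bytes = N * params * 4              # each client uploads its parameters
ours_bytes = N * (N - 1) * n_star * C * 4  # each client broadcasts predictions

print(fedavg_bytes / 1e6, "MB vs", ours_bytes / 1e6, "MB")
# ~10.8 MB vs ~9.0 MB per round here, but the prediction cost is independent
# of model size: with EfficientNet-scale models the gap widens substantially,
# and reusing pseudo-labels over local epochs shrinks it further.
```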
6 Discussion

In the context of decentralized learning, we leverage the collective knowledge of individual nodes to
improve the accuracy of predictions with respect to a target distribution. Our proposed trust update
scheme, based on self-confidence, ensures robustness against nodes with low-quality data. By achieving
consensus in the prediction space, our method effectively handles diverse model architectures
across local clients while maintaining a low communication overhead, thereby exhibiting important
practical potential. Our trust-based collaborative pseudo-labeling method demonstrates a fruitful
interplay between tools from semi-supervised learning and collective learning. Notably, our algorithm
is intrinsically compatible with personalization, in the sense of allowing some concept shift across clients.
We leave this for future work.
Robustness. We designed our algorithm under the assumption that all agents communicate honestly,
meaning that no Byzantine workers intentionally provide incorrect information. Nevertheless,
our method exhibits some resilience against a common Byzantine attack, the label flip
(referred to as “low-quality workers” in this paper). For instance, even with 2 out of 10 workers having
100% flipped labels, our algorithm maintains good performance. If there are malicious workers
deliberately providing incorrect information, the nodes may refuse to reach a consensus rather than
reach a detrimental one, assuming a reasonable λ is chosen. Consider the scenario in
which a detrimental consensus is reached with malicious nodes involved; in this case, the consensus
loss and the local loss of the regular nodes will not decrease in the same direction, making the consensus
solution non-stationary. The “personal” component of our loss function thus adds an element of
robustness against malicious nodes.
Privacy Concerns. While previous works show that training data can be reconstructed from model
parameters [39] or gradients [40], our algorithm requires sharing less privacy-sensitive information,
namely predictions on a shared dataset. While we are aware that model predictions can still leak
private information about the training data due to memorization [41], there is a trade-off between the gain
from collaboration and the amount of information that users are willing to share. As the number of
outer rounds increases, we observe a notable improvement in accuracy on X∗.
However, this enhanced accuracy comes at the cost of disclosing more information, a relationship
that is depicted in the right panel of Fig. 6. An interesting extension would be to apply differential
privacy [42] to further guarantee privacy, as well as to design methods for privacy accounting, e.g., [43],
so that each agent can maintain control over its local privacy leakage at any time.
Acknowledgements. DF would like to thank Anastasia Koloskova, Felix Kuchelmeister, Matteo
Pagliardini, and Nikita Doikov for helpful discussions during the project and El Mahdi Chayti for
proofreading. DF acknowledges funding from EDIC fellowship from the Department of Computer
Science at EPFL. CM acknowledges support from the Tübingen AI Center. This project was supported
by SNSF grant 200020_200342.
References
[1] Morris H. DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69
(345):118–121, 1974. ISSN 01621459.
[2] Celestine Mendler-Dünner, Wenshuo Guo, Stephen Bates, and Michael Jordan. Test-time
collective prediction. In Advances in Neural Information Processing Systems (NeurIPS),
volume 34, pages 13719–13731, 2021.
[3] S. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information
Theory, 13(1):57–64, 1967.
[4] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar-
jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner,
Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Har-
chaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara
Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar,
Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock,
Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar,
Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian
Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu,
Han Yu, and Sen Zhao. Advances and open problems in federated learning. arXiv preprint
arXiv:1912.04977, 2021.
[5] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated
learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.
[6] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated
learning with non-IID data. arXiv preprint arXiv:1806.0058, 2018.
[7] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia
Smith. Federated optimization in heterogeneous networks, 2020.
[8] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning.
In International Conference on Machine Learning, volume 119, pages 5132–5143. PMLR,
2020.
[9] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. Fltrust: Byzantine-robust
federated learning via trust bootstrapping. In Network and Distributed System Security (NDSS)
Symposium, 2021.
[10] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Byzantine-robust learning on heteroge-
neous datasets via bucketing. In International Conference on Learning Representations, 2022.
URL https://openreview.net/forum?id=jXKKDEi5vJt.
[11] Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for
robust model fusion in federated learning. In International Conference on Neural Information
Processing Systems, 2020.
[12] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for hetero-
geneous federated learning. In International Conference on Machine Learning, volume 139,
pages 12878–12889. PMLR, 2021.
[13] Disha Makhija, Xing Han, Nhat Ho, and Joydeep Ghosh. Architecture agnostic federated
learning for neural networks. In International Conference on Machine Learning, volume 162,
pages 14860–14870. PMLR, 2022.
[14] Amr Abourayya, Michael Kamp, Erman Ayday, Jens Kleesiek, Kanishka Rao, Geoffrey I. Webb,
and Bharat Rao. AIMHI: Protecting sensitive data through federated co-training. In Workshop
on Federated Learning: Recent Advances and New Challenges (at NeurIPS), 2022.
[15] Aurélien Bellet, Rachid Guerraoui, Mahsa Taziki, and Marc Tommasi. Personalized and private
peer-to-peer machine learning. In International Conference on Artificial Intelligence and
Statistics, volume 84, pages 473–481. PMLR, 2018.
[16] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A
unified theory of decentralized SGD with changing topology and local updates. In International
Conference on Machine Learning, volume 119, pages 5381–5393. PMLR, 2020.
[17] Yatin Dandi, Anastasia Koloskova, Martin Jaggi, and Sebastian U. Stich. Data-heterogeneity-
aware mixing for decentralized learning. In OPT 2022: NeurIPS Workshop on Optimization for
Machine Learning, 2022.
[18] Batiste Le Bars, Aurélien Bellet, Marc Tommasi, Erick Lavoie, and Anne-Marie Kermarrec.
Refined convergence and topology learning for decentralized sgd with heterogeneous data. In
International Conference on Artificial Intelligence and Statistics, volume 206, pages 1672–1702.
PMLR, 2023.
[19] Lin Xiao and S. Boyd. Fast linear iterations for distributed averaging. In IEEE International
Conference on Decision and Control, volume 5, pages 4997–5002, 2003.
[20] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE
Transactions on Information Theory, 52(6):2508–2530, 2006.
[21] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push
for distributed deep learning. In International Conference on Machine Learning, volume 97,
pages 344–353. PMLR, 2019.
[22] Shuangtong Li, Tianyi Zhou, Xinmei Tian, and Dacheng Tao. Learning to collaborate in
decentralized learning of personalized models. In Conference on Computer Vision and Pattern
Recognition (CVPR), pages 9756–9765, 2022.
[23] Yi Sui, Junfeng Wen, Yenson Lau, Brendan Leigh Ross, and Jesse C. Cresswell. Find
your friends: Personalized federated learning with the right collaborators. arXiv preprint
arXiv:2210.06597, 2022.
[24] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method
for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
URL https://api.semanticscholar.org/CorpusID:18507866.
[25] Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training
with deep networks on unlabeled data. In International Conference on Learning Representations
(ICLR), 2021.
[26] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In
Peter L. Bartlett and Yishay Mansour, editors, Annual Conference on Computational Learning
Theory (COLT), pages 92–100. ACM, 1998.
[27] Enmao Diao, Jie Ding, and Vahid Tarokh. Semifl: Semi-supervised federated learning for
unlabeled clients with alternate training. In Advances in Neural Information Processing Systems,
volume 35, pages 17871–17884, 2022.
[28] Francesco Farina. Collective learning. arXiv preprint arXiv:1912.02580, 2021.
[29] Christoforos N. Hadjicostis and Alejandro D. Dominguez-Garcia. Trustworthy distributed
average consensus. In Conference on Decision and Control (CDC), pages 7403–7408. Institute
of Electrical and Electronics Engineers Inc., 2022.
[30] D.A. Levin, Y. Peres, and E.L. Wilmer. Markov Chains and Mixing Times. American Mathe-
matical Soc., 2008.
[31] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
770–778, 2016.
[33] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of
multi-source dermatoscopic images of common pigmented skin lesions, 2018.
[34] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti,
Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, and Allan
Halpern. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international
symposium on biomedical imaging (isbi), 2017.
[35] Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer
Reiter, Cristina Carrera, Alicia Barreiro, Allan C. Halpern, Susana Puig, and Josep Malvehy.
Bcn20000: Dermoscopic lesions in the wild, 2019.
[36] Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He,
Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, Boris Muzel-
lec, Constantin Philippenko, Santiago Silva, Maria Teleńczuk, Shadi Albarqouni, Salman
Avestimehr, Aurélien Bellet, Aymeric Dieuleveut, Martin Jaggi, Sai Praneeth Karimireddy,
Marco Lorenzi, Giovanni Neglia, Marc Tommasi, and Mathieu Andreux. Flamby: Datasets and
benchmarks for cross-silo federated learning in realistic healthcare settings, 2022.
[37] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural
networks. In International Conference on Machine Learning, volume 97, pages 6105–6114.
PMLR, 2019.
[38] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and
Venkatesh Saligrama. Federated learning based on dynamic regularization. In International
Conference on Learning Representations (ICLR), 2021.
[39] Niv Haim, Gal Vardi, Gilad Yehudai, Michal Irani, and Ohad Shamir. Reconstructing training
data from trained neural networks. In Advances in Neural Information Processing Systems,
2022.
[40] Zihan Wang, Jason Lee, and Qi Lei. Reconstructing training data from model gradient, provably.
In International Conference on Artificial Intelligence and Statistics, volume 206, pages 6595–
6612. PMLR, 2023.
[41] Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and
Florian Tramer. The privacy onion effect: Memorization is relative. Advances in Neural
Information Processing Systems, 35:13263–13276, 2022.
[42] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Third Conference on Theory of Cryptography (TCC),
page 265–284. Springer-Verlag, 2006.
[43] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on
Computer and Communications Security (CCS), page 308–318, 2016.
[44] J. Wolfowitz. Products of indecomposable, aperiodic, stochastic matrices. Proceedings of the
American Mathematical Society, 14(5):733–737, 1963.
Appendix

A Details of the polynomial regression example

The true underlying function is chosen as f(x) = 0.5x³ + 0.3x² − 5x + 4. There are three agents in
total, each of whom has 50 data points. The local data points are generated using normal distributions:
x1 ∼ N (−2, 1), x2 ∼ N (0, 1) and x3 ∼ N (2, 1). To introduce noise in the labels, each agent
adds a normally distributed error term with zero mean and unit variance, i.e. yi = f (xi ) + ε with
ε ∼ N (0, 1). A set of 50 equally spaced data points in the range of −4 to 4, denoted as X ∗ , is used
in the analysis.
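A sketch reproducing this setup with plain NumPy polynomial fitting; the degree-4 local fit matches the over-parameterized agents of the main-text example:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * x**3 + 0.3 * x**2 - 5 * x + 4   # true underlying function

# three agents with Gaussian-distributed inputs and unit-variance label noise
xs = [rng.normal(mu, 1.0, 50) for mu in (-2.0, 0.0, 2.0)]
ys = [f(x) + rng.normal(0.0, 1.0, 50) for x in xs]
X_star = np.linspace(-4, 4, 50)                     # shared unlabeled grid

# each agent fits an over-parameterized degree-4 polynomial locally
coefs = [np.polyfit(x, y, deg=4) for x, y in zip(xs, ys)]
preds = np.stack([np.polyval(c, X_star) for c in coefs])  # (3, 50) predictions
```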
For the example in Section 3.4, the algorithm is applied using fixed trust weights with 1/3 in each
entry, and λ is chosen as 1.
A.1 Example with a weak node

For this example we use the same setup as before, but with four agents in total, each of whom
has 50 data points. The local data points are generated using normal distributions: x1 ∼ N (−2, 1),
x2 ∼ N (0, 1), x3 ∼ N (2, 1) and x4 ∼ N (3, 1).
The algorithm is applied using dynamic trust weights and λ is chosen as 1. For the first three agents,
a polynomial model with a maximum degree of four is fit, while for the fourth agent, a polynomial
model with a maximum degree of one is fit, signifying a weak node.
We see that after 50 rounds of model training using our proposed algorithm with dynamic trust, agent
4’s model is still underfitting due to its limited expressiveness. Agents 1–3 end up agreeing with each
other and give good predictions on the union of their local regions. With naive trust weights, in contrast,
the strong agents are also influenced in the regions where they perform well, as the
underfitted model has a stronger impact through the collective pseudo-labeling.
B Proof of Theorem 1
The proof is rooted in the results of Wolfowitz [44]; we recommend that readers consult
the original paper for more details. Note that, in the following, when we say a matrix
W has certain properties, this is equivalent to saying that the Markov chain induced by the transition
matrix W has those properties.
Definition A (Irreducible Markov chains). A Markov chain induced by transition matrix W is
irreducible if for all i, j there exists some n such that (W^n)_{ij} > 0. Equivalently, the graph corresponding
to W is strongly connected.
Definition B (Strongly connected graph). A graph is said to be strongly connected if every vertex is
reachable from every other vertex.
Definition C (Aperiodic Markov chains). A Markov chain induced by transition matrix W is
aperiodic if every state has a self-loop. By self-loop, we mean that there is a nonzero probability of
remaining in that state, i.e. wii > 0 for every i.
Assumption 2. The W^{(t)}’s are row-stochastic and positive, i.e., Σ_j w_ij = 1 for every row i, and w_ij > 0 for all i, j.
Claim 5. Given Assumption 2, the product of any n ≥ 1 elements of {W^{(t)}} is SIA (SIA stands
for stochastic, irreducible and aperiodic).

Proof. By assumption, all W^{(t)}’s are positive, and thus any product of W^{(t)}’s is positive in each
entry, which is equivalent to the graph induced by the product being fully connected. Being fully
connected implies being strongly connected. By Definitions A and B, irreducibility follows.

Since the product is positive, its diagonal entries are all positive. By Definition C, aperiodicity follows.
The product of row-stochastic matrices remains row-stochastic: for A and B row-stochastic,

Σ_j (Σ_k a_ik b_kj) = Σ_k a_ik (Σ_j b_kj) = Σ_k a_ik = 1 ,   ∀ i.

Thus, any product of the W^{(t)}’s is stochastic, irreducible and aperiodic (SIA).
Theorem 6 (Restatement of Wolfowitz [44]). Let A_1, ..., A_k be square row-stochastic matrices of the
same order such that any product of the A’s (of whatever length) is SIA. As k → ∞, the product
A_k ⋯ A_1 reduces to a matrix with identical rows.

By Assumptions 1 and 2, Ψ^{(t)} = W^{(t)} Ψ^{(t−1)} holds for all t ≥ 1. From Claim 5, any product
of the W^{(t)}’s is SIA. From Theorem 6, the product W^{(t)} W^{(t−1)} ⋯ W^{(1)} reduces to a matrix
with identical rows as t goes to infinity. This implies that Ψ^{(∞)} has identical rows, which proves
the statement.
C Proof of Claim 3
Definition D (Row differences). The difference between the rows of W is measured by

δ(W) = max_j max_{i₁,i₂} |w_{i₁ j} − w_{i₂ j}| .    (9)

For identical rows, δ(W) = 0.
Definition E (Scrambling matrix). W is a scrambling matrix if

λ(W) := 1 − Σ_j min_{i₁,i₂} min(w_{i₁ j}, w_{i₂ j}) < 1 .    (10)
In plain words, Definition E says that if for every pair of rows i₁ and i₂ in a matrix W there exists a
column j (which may depend on i₁ and i₂) such that w_{i₁ j} > 0 and w_{i₂ j} > 0, then W is a scrambling
matrix. It is easy to verify that a positive matrix is always a scrambling matrix.
Lemma 1 (Adaptation of Lemma 2 from Wolfowitz [44]). For any t,

δ(W^{(t)} W^{(t−1)} ⋯ W^{(1)}) ≤ Π_{i=1}^{t} λ(W^{(i)}) .    (11)
Lemma 1 states that multiplying with scrambling matrices makes the row differences smaller.

tr(W^{(t)}) = Σ_i w_ii^{(t)} represents the sum of the self-confidences of all nodes. As every W^{(t)} is positive,
all W^{(t)}’s are scrambling. Thus, the differences between the rows of W^{(t)} W^{(t−1)} ⋯ W^{(1)} shrink
as t grows.

As ψ_i^{(t)} = Σ_j [W^{(t)} W^{(t−1)} ⋯ W^{(1)}]_{ij} ψ_j^{(0)}, the predictions on X∗ given by the different nodes become
more similar over time. By our calculation of W^{(t)} in Equation (6), which is based on the cosine
similarity between predictions, it follows that an agent’s trust towards the others grows over time,
i.e., Σ_{j≠i} w_ij^{(t+1)} ≥ Σ_{j≠i} w_ij^{(t)}. Since each row sums to 1, we have w_ii^{(t+1)} ≤ w_ii^{(t)} for all i.

By Theorem 1, ψ_i^{(t)} = ψ_j^{(t)} as t → ∞ for any i and j. By the calculation of W, the matrix W^{(t)}
therefore has equal entries as t approaches infinity.
D Proof of Proposition 2

The proof follows from the construction of Metropolis chains given a stationary distribution. We
first give an example of how Metropolis chains work.

Example 2 (Metropolis chains [30]). Given the stationary distribution π = [0.3, 0.3, 0.3, 0.1], how
can we construct a transition matrix that leads to this stationary distribution?

Given a symmetric matrix Φ, one can construct a Metropolis chain P as follows:

p(x, y) = ϕ(x, y) min{1, π(y)/π(x)}                       for y ≠ x,
p(x, x) = 1 − Σ_{z≠x} ϕ(x, z) min{1, π(z)/π(x)}           for y = x.    (13)
Following Example 2, choose Φ to be any self-confident doubly stochastic matrix. For all x, choosing
P as calculated from (13), we have

p(x, x) = 1 − Σ_{z≠x} ϕ(x, z) min{1, π(z)/π(x)} ≥ 1 − Σ_{z≠x} ϕ(x, z) = ϕ(x, x) ,    (14)

i.e., the probability mass within each row is more concentrated on the diagonal entries in P than in Φ.
As Φ already has high diagonal values, the claim follows.
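The construction in (13)–(14) is easy to verify numerically. A sketch that builds the Metropolis chain from the π of Example 2 and a symmetric, self-confident, doubly stochastic Φ (our own choice of Φ), then checks stationarity and the retained self-confidence:

```python
import numpy as np

pi = np.array([0.3, 0.3, 0.3, 0.1])
# symmetric, self-confident, doubly stochastic proposal matrix Phi
Phi = np.full((4, 4), 0.1) + np.eye(4) * 0.6

P = Phi * np.minimum(1.0, pi[None, :] / pi[:, None])  # off-diagonal rule of (13)
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))              # diagonal rule of (13)

assert np.allclose(pi @ P, pi)              # pi is indeed stationary
assert np.all(np.diag(P) >= np.diag(Phi))   # eq. (14): self-confidence retained
print(P.round(3))
```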
E Proof of Proposition 4

Proposition 4 states sufficient conditions on the W^{(t)}’s such that a low-quality node b is assigned the
lowest importance in π, i.e., π_b = min_i π_i. From Equation (12), π comes from the product of trust
matrices. We start with the product of two such matrices.

Proposition 7. Let A and B be row-stochastic and positive matrices and C = AB. If in both A and
B, (a) the (i, j)-th entry is the lowest value of the i-th row for every i ≠ j, and (b) the j-th column has
the lowest column sum, then the j-th column retains the lowest column sum in C and the (i, j)-th entry
remains the lowest value of the i-th row of C for i ≠ j.
Proof. Let C = AB. The sum of column j of C can be expressed as

Σ_i c_ij = Σ_i Σ_k a_ik b_kj = Σ_k (Σ_i a_ik) b_kj ,    (15)

and likewise for any column t,

Σ_i c_it = Σ_i Σ_k a_ik b_kt = Σ_k (Σ_i a_ik) b_kt .    (16)
We first show that the j-th column retains the lowest column sum in C. For t ≠ j:

Σ_i c_it − Σ_i c_ij = Σ_k (Σ_i a_ik)(b_kt − b_kj)
                    = Σ_{k≠j} (Σ_i a_ik)(b_kt − b_kj) + (Σ_i a_ij)(b_jt − b_jj)
                >(i)  Σ_{k≠j} (Σ_i a_ij)(b_kt − b_kj) + (Σ_i a_ij)(b_jt − b_jj)
                    = (Σ_i a_ij) [ Σ_{k≠j} (b_kt − b_kj) + (b_jt − b_jj) ]
                    = (Σ_i a_ij) ( Σ_k b_kt − Σ_k b_kj )
               >(ii)  0 .    (17)

(i) holds because for k ≠ j we have b_kt − b_kj > 0 and Σ_i a_ij < Σ_i a_ik;
(ii) holds because the j-th column has the lowest column sum in B.
We then show that the (i, j)-th entry remains the lowest value of the i-th row of C for i ≠ j. For t ≠ j, we have

c_it − c_ij = Σ_k a_ik b_kt − Σ_k a_ik b_kj
            = Σ_{k≠j} a_ik (b_kt − b_kj) + a_ij (b_jt − b_jj)
       >(iii) Σ_{k≠j} a_ij (b_kt − b_kj) + a_ij (b_jt − b_jj)
            = a_ij [ Σ_{k≠j} (b_kt − b_kj) + (b_jt − b_jj) ]
            = a_ij ( Σ_k b_kt − Σ_k b_kj )
        >(iv) 0 .    (18)

(iii) holds since b_kt − b_kj > 0 and a_ik > a_ij for i, k ≠ j;
(iv) holds because Σ_k b_kt > Σ_k b_kj.
For time-inhomogeneous trust matrices, Assumptions 1 and 2 ensure the Markov chain update
ψ_i^{(t)} = Σ_j w_ij^{(t)} ψ_j^{(t−1)}, which leads to consensus as proven in Theorem 1. By iteratively applying
Proposition 7, the b-th column retains the lowest column sum in the product W^{(τ)} W^{(τ−1)} ⋯ W^{(1)}.
For t ≥ τ , multiplying the consensus with any row-stochastic matrix preserves the consensus. Thus,
the b-th column remains the smallest column in the consensus. For the time-homogeneous case,
as long as W satisfies the same properties, one can easily verify that the same result still holds.
This proves Proposition 4.
Extension to more than one node with low-quality data. With more than one low-quality node, what
are the desired properties (sufficient conditions) for the transition (trust) matrices? It turns out
that, apart from the two conditions of the single low-quality-node case, we need an extra assumption.

Proposition 8. Given Assumptions 1 and 2 and that all agents are over-parameterized, let R be the set of
indices of regular nodes and B the set of indices of low-quality nodes. Suppose that for t ≤ τ , W^{(t)} satisfies
the following conditions:

i) any regular node’s column sum is larger than any low-quality node’s: min_{r∈R} Σ_i w_ir^{(t)} > max_{b∈B} Σ_i w_ib^{(t)};

ii) the gap between the sums of trust from regular nodes towards any regular node r and towards any low-quality
node b is larger than the gap between b’s self-confidence and its trust towards r:
Σ_{n∈R} (w_nr^{(t)} − w_nb^{(t)}) > w_bb^{(t)} − w_br^{(t)};

iii) any node’s trust towards a regular node is greater than or equal to its trust towards a low-quality
node other than itself: for any r ∈ R and any b ∈ B, w_nr^{(t)} ≥ w_nb^{(t)} holds as long as n ≠ b;

and that for t > τ , W^{(t)} = (1/N) 11^⊤. Then the nodes in B have a lower importance in the
consensus than the nodes in R.
Proof. First, let us look at the product of two such matrices for 1 < t < τ . For any r ∈ R
and b ∈ B, conditions (1)(2)(3) remain true for the product W^{(t)} W^{(t−1)}. We verify them one by one.

Verification of condition (1): any regular node’s column sum is larger than any low-quality node’s
in W^{(t)} W^{(t−1)}. For any r ∈ R and any b ∈ B, we have

Σ_i Σ_n w_in^{(t)} w_nr^{(t−1)} − Σ_i Σ_n w_in^{(t)} w_nb^{(t−1)}
  = Σ_n (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
  = Σ_{n∈R} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)}) + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
    + (Σ_i w_ib^{(t)}) (w_br^{(t−1)} − w_bb^{(t−1)})
 >(i) Σ_{n∈R} (Σ_i w_ib^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)}) + (Σ_i w_ib^{(t)}) (w_br^{(t−1)} − w_bb^{(t−1)})
    + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
  = (Σ_i w_ib^{(t)}) [ Σ_{n∈R} (w_nr^{(t−1)} − w_nb^{(t−1)}) + (w_br^{(t−1)} − w_bb^{(t−1)}) ]
    + Σ_{n∈B∖{b}} (Σ_i w_in^{(t)}) (w_nr^{(t−1)} − w_nb^{(t−1)})
>(ii) 0 .

(i) holds because Σ_i w_in^{(t)} for any n ∈ R is larger than Σ_i w_ib^{(t)} for any b ∈ B, which follows from
condition (1), and w_nr^{(t−1)} − w_nb^{(t−1)} > 0, which follows from condition (3).
(ii) holds following conditions (2) and (3): from (2), Σ_{n∈R} w_nr^{(t−1)} − Σ_{n∈R} w_nb^{(t−1)} + w_br^{(t−1)} −
w_bb^{(t−1)} > 0, and from (3), w_nr^{(t−1)} ≥ w_nb^{(t−1)} for n ≠ b.
Verification of condition (2): for any r ∈ R and b ∈ B,

Σ_{n∈R} ( Σ_p w_np^{(t)} w_pr^{(t−1)} − Σ_p w_np^{(t)} w_pb^{(t−1)} ) − ( Σ_p w_bp^{(t)} w_pb^{(t−1)} − Σ_p w_bp^{(t)} w_pr^{(t−1)} )
  = Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) w_pr^{(t−1)} − Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) w_pb^{(t−1)}
  = Σ_p ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
  = Σ_{p∈R} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + Σ_{p∈B∖{b}} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + ( Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)} ) ( w_br^{(t−1)} − w_bb^{(t−1)} )
≥(iii) ( Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)} ) [ Σ_{p∈R} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + ( w_br^{(t−1)} − w_bb^{(t−1)} ) ]
    + Σ_{p∈B∖{b}} ( Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} ) ( w_pr^{(t−1)} − w_pb^{(t−1)} )
 ≥(iv) 0 .

(iii) holds because for p a regular node we have Σ_{n∈R} w_np^{(t)} + w_bp^{(t)} > Σ_{n∈R} w_nb^{(t)} + w_bb^{(t)}, which
follows from condition (2), and w_pr^{(t−1)} − w_pb^{(t−1)} ≥ 0 for p ≠ b, following from condition (3).
(iv) holds because of conditions (2) and (3).
Verification of condition (3): for n ≠ b, we want to show that the trust towards a regular node r is larger than
that towards a low-quality node b, that is, Σ_p w_np^{(t)} w_pr^{(t−1)} > Σ_p w_np^{(t)} w_pb^{(t−1)}:

Σ_p w_np^{(t)} w_pr^{(t−1)} − Σ_p w_np^{(t)} w_pb^{(t−1)}
  = Σ_{p∈R} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + Σ_{p∈B∖{b}} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} )
    + w_nb^{(t)} ( w_br^{(t−1)} − w_bb^{(t−1)} )
 ≥(v) w_nb^{(t)} [ Σ_{p∈R} ( w_pr^{(t−1)} − w_pb^{(t−1)} ) + ( w_br^{(t−1)} − w_bb^{(t−1)} ) ]
    + Σ_{p∈B∖{b}} w_np^{(t)} ( w_pr^{(t−1)} − w_pb^{(t−1)} )
≥(vi) 0 .

(v) holds because w_np^{(t)} ≥ w_nb^{(t)} for p ≠ b, following from condition (3);
(vi) holds following conditions (2) and (3).
It follows that in the product W^{(τ)} W^{(τ−1)} ⋯ W^{(1)}, a low-quality node still has a lower column
sum than any regular node, because conditions (1)(2)(3) hold for any product of W^{(t)}’s as long as
each W^{(t)} satisfies (1)(2)(3).

For t > τ , multiplying with the naive weight matrix does not change the ordering of the column sums,
so all low-quality nodes have a lower importance in the consensus than the regular nodes.
F Reasoning for the confidence weighting factor β_i^{(t)}

In this section, we justify our choice of β_i^{(t)}(x) in Section 4.2, i.e., we show that by adding such a term,
we are able to down-weight a regular node’s trust towards a bad node.

Φ^{(t)} is a row-normalized pairwise cosine similarity matrix, with (i, j)-th entry before row normalization
given by

(1/n⋆) Σ_{x′∈X∗} ⟨f_{θ_i^{(t−1)}}(x′), f_{θ_j^{(t−1)}}(x′)⟩ / (‖f_{θ_i^{(t−1)}}(x′)‖₂ ‖f_{θ_j^{(t−1)}}(x′)‖₂) .    (19)
After adding β_i^{(t)}(x) = 1/H(f_{θ_i^{(t−1)}}(x)), the matrix W^{(t)} has (i, j)-th entry before row normalization
given by

(1/n⋆) Σ_{x′∈X∗} [1 / H(f_{θ_i^{(t−1)}}(x′))] · ⟨f_{θ_i^{(t−1)}}(x′), f_{θ_j^{(t−1)}}(x′)⟩ / (‖f_{θ_i^{(t−1)}}(x′)‖₂ ‖f_{θ_j^{(t−1)}}(x′)‖₂) .    (20)
We want to show that the weighting scheme down-weights a regular node i’s trust towards a low-quality
node b, that is,

ϕ_ib^{(t)} > w_ib^{(t)} .

As the comparison is made with respect to the same time step t, we drop the superscript t from now
on. Let {a_0, .., a_{N−1}} be the cosine similarities between a regular agent i and the other agents inside agent i’s
confident region, and {b_0, .., b_{N−1}} be the cosine similarities between i and the others outside agent i’s
confident region. By confident region, we mean the region with low entropy in the class probabilities, i.e.,
where the model is more sure about its prediction. Further, we make the following assumptions:

a) for x′ in agent i’s confident region, the entropy of the predicted class probabilities is low:
H(f_{θ_i^{(t−1)}}(x′)) = 1/c₁; while for x′ outside agent i’s confident region, H(f_{θ_i^{(t−1)}}(x′)) = 1/c₂.
We further assume 0 < c₂ < c₁.

b) inside a regular node i’s confident region, i has a better judgment of the alignment score
produced by the cosine similarity, so that the similarity with the low-quality node b is weighted
lower inside:

a_b / Σ_j a_j < b_b / Σ_j b_j .    (21)
With this notation, ϕ_ib = (a_b + b_b)/Σ_j(a_j + b_j), while the entropy weighting gives
w_ib = (c₁ a_b + c₂ b_b)/Σ_j(c₁ a_j + c₂ b_j). Showing ϕ_ib > w_ib is therefore equivalent to showing

c₂ b_b Σ_j a_j + c₁ a_b Σ_j b_j < c₁ b_b Σ_j a_j + c₂ a_b Σ_j b_j ,    (25)

which holds by assumption b) together with c₂ < c₁. Now, adding c₁ a_b Σ_j a_j + c₂ b_b Σ_j b_j to both sides
of (25), we have

c₁ a_b Σ_j a_j + c₂ b_b Σ_j a_j + c₁ a_b Σ_j b_j + c₂ b_b Σ_j b_j <
c₁ a_b Σ_j a_j + c₁ b_b Σ_j a_j + c₂ a_b Σ_j b_j + c₂ b_b Σ_j b_j ,    (26)

i.e., (c₁ a_b + c₂ b_b) Σ_j (a_j + b_j) < (a_b + b_b) Σ_j (c₁ a_j + c₂ b_j), which is exactly w_ib < ϕ_ib.
G Complementary details

All model training was done using a single GPU (NVIDIA Tesla V100). For each local iteration,
we load local data and shared unlabeled data with batch sizes 64 and 256, respectively. We empirically
observed that a larger batch size for the unlabeled data is necessary for the training to work well. The
optimizer used is Adam with a learning rate of 5e-3. For Cifar10 and Cifar100, as the base model is not
pretrained, we run 50 global rounds with 5 local training epochs per agent per global round. For the
Fed-ISIC-2019 dataset, as the base model is a pretrained EfficientNet, we run 20 global rounds. For the
first 5 global rounds, we set λ = 0 to arrive at good local models, so that every agent can evaluate
trust more fairly. After that, λ is fixed at 0.5. Dynamic trust is computed after each global round,
while static trust denotes using the initially calculated trust values throughout the whole
experiment.

For Cifar10 and Cifar100, we use 5% of the whole dataset to constitute X∗, with equal representation of
each class. We spread the rest over 10 clients using a Dirichlet distribution with
α = 1. For the Fed-ISIC-2019 dataset, we follow the original splits of du Terrail et al. [36], and we let
each client contribute 50 data samples to constitute X∗.

We employ a fixed λ for all our experiments. To select λ, we randomly sample 10% of the full
Cifar10 dataset, which we then split into local training data (95%) and X∗ (5%). The local training
data is then spread over 10 clients using a Dirichlet distribution with α = 1. The global test accuracy
as a function of λ is plotted in Fig. 8. We thus choose λ = 0.5 for all our experiments, which
consistently gives stable performance.