Short Paper
Quality Inference in Federated Learning With Secure Aggregation
Balázs Pejó and Gergely Biczók
Abstract—Federated learning algorithms are developed both for efficiency reasons and to ensure the privacy and confidentiality of personal and business data, respectively. Despite no data being shared explicitly, recent studies showed that the mechanism could still leak sensitive information. Hence, secure aggregation is utilized in many real-world scenarios to prevent attribution to specific participants. In this paper, we focus on the quality (i.e., the ratio of correct labels) of individual training datasets and show that such quality information could be inferred and attributed to specific participants even when secure aggregation is applied. Specifically, through a series of image recognition experiments, we infer the relative quality ordering of participants. Moreover, we apply the inferred quality information to stabilize training performance, measure the individual contribution of participants, and detect misbehavior.

Index Terms—Quality inference, federated learning, secure aggregation, misbehavior detection, contribution score.

Manuscript received 7 August 2022; revised 15 December 2022; accepted 15 May 2023. Date of publication 29 May 2023; date of current version 1 September 2023. This work was supported in part by the European Union (SECURED Project) under Grant 10109571; in part by the Ministry of Innovation and Technology from the NRDI Fund under Grant 138903, financed under the FK_21 funding scheme; and in part by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund under Grants TKP2021-NVA-02 and TKP2021-NVA. Recommended for acceptance by Y. Yang. (Corresponding author: Balázs Pejó.)

Balázs Pejó and Gergely Biczók are with the CrySyS Lab, Department of Networked Systems and Services, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, 1111 Budapest, Hungary, and also with the ELKH-BME Information Systems Research Group, 1111 Budapest, Hungary (e-mail: [email protected]; [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/TBDATA.2023.3280406, provided by the authors.

Digital Object Identifier 10.1109/TBDATA.2023.3280406

I. INTRODUCTION

For machine learning (ML) tasks, it is widely accepted that more training data leads to a more accurate model. Unfortunately, in reality, the data is scattered among multiple entities. Thus, data holders could potentially increase the accuracy of their local models by training a joint model together with others [1]. Several collaborative learning approaches were proposed in the literature, amongst which the least privacy-friendly method is centralized learning, where a server pools the data from all participants together and trains the desired model. On the other end of the privacy spectrum, there are cryptographic techniques such as multi-party computation [2] and homomorphic encryption [3], guaranteeing that only the final model is revealed to legitimate collaborators and nothing more. Neither of these extremes suits most real-world use cases: the first requires participants to share their datasets directly, while the latter requires too much computational power to be practical in Big Data scenarios.

Somewhere between these (in terms of privacy protection) stands federated learning (FL), which mitigates the communication bottleneck and provides flexible participation by selecting a random subset of participants per round, who compute and send their model updates to the aggregator server [4]. FL provides some privacy protection by design, as the actual data never leaves the hardware located within the participants' premises. Yet, there is a rich and growing body of related literature revealing that from these updates (i.e., gradients) a handful of characteristics can be inferred about the underlying training dataset. Potential attacks include model inversion [5], membership inference [6], reconstruction attacks [7], (hyper)parameter inference [8], and property inference [9].

Parallel to these, several techniques have been developed to conceal the participants' updates from the aggregator server, such as differential privacy (DP) [10] and secure aggregation (SA) [11]. Although DP comes with a mathematical privacy guarantee, it also results in heavy utility loss, which limits its applicability in many real-world scenarios. On the other hand, SA does not affect the aggregated final model, which makes it a suitable candidate for many applications. Essentially, SA hides the individual model updates without changing the aggregated model by adding pairwise masks to the participants' gradients in a clever way so that they cancel out during aggregation.
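The cancelation property can be illustrated with a minimal sketch (our simplification of the idea behind [11]; it omits key agreement, dropout handling, and finite-field arithmetic): each pair of participants shares a random mask that one member adds and the other subtracts, so every mask vanishes in the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
participants = [0, 1, 2]
gradients = {p: rng.normal(size=4) for p in participants}

# One shared random mask per pair (i, j), i < j: participant i adds it,
# participant j subtracts it, so all masks cancel in the aggregate.
masks = {(i, j): rng.normal(size=4)
         for i in participants for j in participants if i < j}

def masked_update(p):
    update = gradients[p].copy()
    for (i, j), mask in masks.items():
        if p == i:
            update += mask
        elif p == j:
            update -= mask
    return update

aggregate = sum(masked_update(p) for p in participants)
assert np.allclose(aggregate, sum(gradients.values()))  # masks cancel out
```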
Consequently, SA only protects the participants' individual updates and leaves the aggregated model unprotected. Hence, SA provides a "hiding in the crowd" type of protection [12]; thus, without specific background knowledge, it is unlikely that a privacy attacker could link the leaked information to a specific participant. The lack of attribution severely affects the security of FL as well; we are not aware of any attack detection scheme applicable with SA enabled.

In this paper, we study the possibility of inferring the quality of the individual datasets when SA is in place. This could be utilized for attack detection as well. Note, however, that it is different from mere poisoning and backdoor detection [13], as that line of research is only interested in classifying participants as malicious or benign, while our goal is to enable the fine-grained differentiation of FL participants with respect to their data quality. This is fundamentally similar to contribution score computation, which is also an unsolved problem in the SA setting.

Data quality is a complex concept with multiple dimensions [14]. Moreover, it is relative: it can only be considered in terms of the proposed use, and in relation to other data samples. For this reason (similarly to [15]), we focus on image recognition tasks with noisy labels, as in this scenario data quality has a straightforward interpretation.

A. Contributions

We propose a method called Quality Inference (QI) which (by utilizing the improvement of the aggregated updates) recovers the relative label quality of the contributing participants' datasets. To obtain this quality information, our method takes advantage of the improvements of the aggregated models across multiple rounds, as well as the known per-round selected subset of participants. QI works by evaluating the aggregated updates in each round and assigning scores to the selected participants based on three simple but novel rules called The Good, The Bad, and The Ugly (as in the movie [16]). As a result, we are able to recover the relative quality ordering (i.e., by label correctness rate) of the participants.
We simulate datasets with different qualities by utilizing a unique label-flipping rate for each participant, and conduct experiments on two neural network architectures (MLP and CNN) and two datasets (MNIST and CIFAR10). We consider three FL settings, where 2 out of 5, 5 out of 25, and 10 out of 100 participants are selected in each round to update the model, respectively.

Our experiments show that the three proposed heuristic scoring rules significantly outperform the baseline in determining the participants' data qualities relative to each other (i.e., correct label rates). We find that the accuracy of QI depends on both the complexity of the task and the trained model architecture. We also conduct an ablation study on the hyperparameters of the proposed rules.

Finally, we investigate three potential applications of QI: on-the-fly performance boosting, contribution score computation, and misbehavior detection (considering free-riding and poisoning). We find that i) carefully weighting the participants based on the inferred scores smooths the learning curve, ii) the scores could be used as a measure of participant contribution, and iii) the scores are able to reveal misbehaving participants. The latter implies that besides the label correctness rate, QI is also capable of inferring other, more general quality aspects of the data. We are not aware of any work tackling any of the aforementioned issues when SA is enabled.

II. THE THEORETICAL MODEL

In this section, we introduce the theoretical model of quality inference and highlight its complexity. We denote with $n$ a participant in FL, while $N$ denotes the number of all participants. Similarly, $i$ denotes a round in FL, while $I$ denotes the number of all rounds. The set $S_i$ contains the randomly selected participants for round $i$, and $b = |S_i|$ captures the number of selected participants. $D_n$ is participant $n$'s dataset consisting of $(x, y) \in D_n$ data-label pairs. We assume $D_n$ is associated with a single scalar $u_n$, which measures its quality. We use $\theta_n$ and $v_i$ to capture the quality of the $n$th participant's gradient and the quality of the aggregated gradient in the $i$th round, respectively. A summary of the variables is listed in the Appendix (Table III), available online.

A. Deterministic Case

In this simplified scenario, we assume the gradient quality is equal to the dataset quality, i.e., $\theta_n = u_n$. Consequently, the aggregated gradients represent the average quality of the participants' datasets. As a result, the round-wise quality values of the aggregated gradients form a linear equation system $Au = v$, where $u = [u_1, \ldots, u_N]^T$, $v = [v_1, \ldots, v_I]^T$, and $a_{i,n} \in A^{I \times N}$ indicates whether participant $n$ is selected for round $i$. Depending on the dimensions of $A$, the system can be under- or overdetermined. In the case of $I > N$ (i.e., no exact solution exists) and of $I < N$ (i.e., many exact solutions exist), the problem and the corresponding solution are shown in (1) and (2), respectively.
$$\min_u \|v - Au\|_2^2 \;\Rightarrow\; u = (A^T A)^{-1} A^T v \quad (1)$$

$$\min_u \|u\|_2^2 \;\text{s.t.}\; Au = v \;\Rightarrow\; u = A^T (A A^T)^{-1} v \quad (2)$$
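A minimal numpy sketch of the overdetermined case (our toy instantiation, with binary selection indicators as in $A$): without noise, solving (1) recovers the hidden qualities exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
I_rounds, N, b = 40, 10, 3          # rounds, participants, selected per round

# Binary participation matrix A: each round selects b participants.
A = np.zeros((I_rounds, N))
for i in range(I_rounds):
    A[i, rng.choice(N, size=b, replace=False)] = 1

u_true = rng.uniform(size=N)        # hidden per-participant qualities
v = A @ u_true                      # observed round-wise aggregate qualities

u_hat = np.linalg.pinv(A.T @ A) @ A.T @ v    # Eq. (1), least squares
print(np.allclose(u_hat, u_true))            # True: exact recovery
```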
B. Stochastic Case

The above equations do not take into account any randomness. Given that the training is stochastic, we can treat the quality of participant $n$'s gradient as a random variable $\theta_n$ sampled from a distribution with parameter $u_n$. Moreover, we can represent $\theta_n = u_n + e_n$, where $e_n$ corresponds to a random variable sampled from a distribution with zero mean. We can further assume that $e_n$ and $e_{n'}$ are i.i.d. for $n \neq n'$. As a result, we can express the aggregated gradient as $v_i = \sum_n a_{i,n} u_n + E$, where $E$ is sampled from the convolution of the probability density functions of the $e$'s.

In this case, due to the Gauss–Markov theorem [17], the solution in (1) is the best linear unbiased estimator, with error $\|v - Au\|_2^2 = v^T (I - A (A^T A)^{-1} A^T) v$ (where $I$ is the identity matrix) with an expected value of $b(I - N)$. Note that with more iterations more information is leaking, which should decrease the error. Yet, this is not captured by the theorem, as it considers every round as a new constraint.

This problem lies within estimation theory [18], from which we already know that estimating a single random variable with added noise is hard; more so when factoring in that, in our setting, we have multiple variables forming an equation system. Moreover, these random variables are different per round; a detail we have omitted thus far. Nevertheless, each iteration corresponds to a different expected accuracy improvement level, as with time the iterations improve less and less. Consequently, to estimate individual dataset quality we have to know the baseline expected learning curve; in turn, the learning curve depends exactly on those quality values. Being a chicken-and-egg problem, we focus on empirical observations to break this vicious cycle.
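To make the difficulty concrete, the following self-contained sketch (our toy instantiation, assuming Gaussian noise) adds zero-mean noise to the observations; the least-squares estimate of (1) then recovers the qualities only approximately.

```python
import numpy as np

rng = np.random.default_rng(2)
I_rounds, N, b = 100, 10, 3
A = np.zeros((I_rounds, N))
for i in range(I_rounds):
    A[i, rng.choice(N, size=b, replace=False)] = 1
u_true = rng.uniform(size=N)

# Noisy observations v_i = sum_n a_{i,n} u_n + E_i with zero-mean E_i.
v = A @ u_true + rng.normal(0.0, 0.5, size=I_rounds)

u_hat = np.linalg.lstsq(A, v, rcond=None)[0]  # BLUE per Gauss-Markov
print(np.corrcoef(u_hat, u_true)[0, 1])       # high, but no longer perfect
```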
III. QUALITY SCORING

In this section, we devise the three intuitive scoring rules which are the core of QI: they either reward or punish the participants in the FL rounds. The notations used in this section are summarized in the Appendix (Table IV), available online. We define $\omega_i$ as the aggregated model's improvement in the $i$th round and $\varphi_{i,n}$ as the quality score of participant $n$ after round $i$. Note that in the rest of the paper we slightly abuse the notation by removing index $i$ where it is not relevant.

A. Assumptions

We assume an honest-but-curious setting; the aggregator server (and the participants) cannot deviate from the FL protocol. Further restrictions on the attacker include limited computational power and no background knowledge besides access to an evaluation oracle. For this reason, we utilize neither contribution-score-based techniques nor existing inference attacks, as these require either significant computational resources or user-specific relevant background information.

B. Scoring Rules

Based on the round-wise improvements $\omega_i$, we created three simple rules to reward or punish the participants. We named them The Good, The Bad, and The Ugly (as in the spaghetti western movie [16]); the first one (G) rewards the participants in the more useful aggregates, the second one (B) punishes those in the less useful ones, while the last one (U) punishes when the aggregate does not improve the model at all.

G: Each participant $n$ contributing in round $i$ that improves the model more than the previous round (i.e., $\omega_i > \omega_{i-1}$) receives $+1$, i.e., $\varphi_{i,n} \leftarrow \varphi_{i-1,n} + 1$.

B: Each participant $n$ contributing in round $i$ that improves the model less than the following round (i.e., $\omega_i < \omega_{i+1}$) receives $-1$, i.e., $\varphi_{i+1,n} \leftarrow \varphi_{i,n} - 1$.

U: Each participant $n$ contributing in round $i$ that does not improve the model at all (i.e., $\omega_i < 0$) receives $-1$, i.e., $\varphi_{i,n} \leftarrow \varphi_{i-1,n} - 1$.

Note that the quality score in round $i$ is only updated for participants who contributed in that round (The Good and The Ugly) or in the previous round (The Bad). For instance, if in round $i$ the improvement was negative and in the following round it was positive, the participants of round $i$ receive $-1$ due to The Ugly in round $i$ and another $-1$ in round $i+1$ due to The Bad. For the rest of the participants (denoted with $\hat{n}$), the scores remain unchanged, i.e., $\varphi_{i,\hat{n}} \leftarrow \varphi_{i-1,\hat{n}}$.
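An offline restatement of the three rules as a Python sketch (our illustration; the paper applies The Bad one round later, which yields the same final scores):

```python
def quality_scores(improvements, selections, num_participants):
    """Score participants with The Good / The Bad / The Ugly rules.

    improvements[i] is the aggregated model's improvement in round i;
    selections[i] is the set of participants selected in round i.
    """
    scores = [0] * num_participants
    for i, omega in enumerate(improvements):
        for n in selections[i]:
            if i > 0 and omega > improvements[i - 1]:
                scores[n] += 1                              # The Good
            if i + 1 < len(improvements) and omega < improvements[i + 1]:
                scores[n] -= 1                              # The Bad
            if omega < 0:
                scores[n] -= 1                              # The Ugly
    return scores

# Toy example: 3 participants, 4 rounds.
print(quality_scores([0.3, 0.1, -0.05, 0.2],
                     [{0, 1}, {1, 2}, {0, 2}, {0, 1}], 3))
```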
It is reasonable to expect that the improvements in consecutive rounds are decreasing (i.e., $\omega_i < \omega_{i-1}$): first the model improves rapidly, while improvement slows down considerably in later rounds. The first two scoring rules (The Good and The Bad) capture the deviation from this pattern: we can postulate that i) high dataset quality increases the improvement more than in the previous round, and ii) low dataset quality decreases the improvement, which would be compensated in the following round. These phenomena were also shown in [19]. While these rules are relative, the last one (The Ugly) is absolute: it builds on the premise that if a particular round does not improve the model, there is a higher chance that some of the corresponding participants have supplied low-quality data.

Independently of the participants' dataset qualities, round-wise improvements could deviate from this pattern owing to the stochastic nature of learning. We postulate that this affects all participants evenly, independently of their dataset quality; thus, the ordering among the individual scores is not significantly affected by this "noise". Participant selection also introduces a similar effect; however, we assume that participants are selected uniformly; hence, its effect should also be similar across participants.
IV. EXPERIMENTAL SETUP

In this section, we describe our experimental setup, including the evaluation metric, the quality simulation, and the utilized datasets and model architectures.
A. Evaluation Metric

The quality scores of the participants are unlikely to converge; hence, we focus on their ordering. We denote with $q_{i,n}$ the inferred quality-wise rank of participant $n$ after round $i$, and we measure the accuracy of the inferred qualities by comparing $q_{i,n}$ for each participant to the baseline quality-wise ordering. For this purpose, we use the Spearman correlation coefficient $r_s$ [20], which is based on the Spearman distance $d_s$ [21] (as seen in (3)). The Spearman distance measures the absolute difference between the inferred and the actual position, while the Spearman correlation coefficient assesses monotonic relationships on the scale $[-1, 1]$; 1 corresponds to perfect correlation, while any positive value signals a positive correlation between the actual and the inferred quality ordering. E.g., if the inferred quality order (via the three rules) expressed with participant IDs is 5-3-2-4-1, while the actual quality order is 5-4-3-2-1, then the Spearman distances are 0-2-1-1-0, and the Spearman correlation is 0.7, suggesting that the inferred quality order is very close to the original one. Note that the Spearman distance (and consequently the coefficient) handles any misalignment equally, irrespective of the position.

$$d_s(i, n) = |n - q_{i,n}|, \qquad r_s(i) = 1 - \frac{6 \sum_{n=1}^{N} d_s(i, n)^2}{N (N^2 - 1)} \quad (3)$$
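The worked example above can be checked directly; here is a small Python rendering of (3) (our helper, not from the paper):

```python
def spearman(actual, inferred):
    """Spearman correlation from the rank distances of Eq. (3)."""
    n = len(actual)
    position = {p: i for i, p in enumerate(inferred)}
    d_sq = sum((i - position[p]) ** 2 for i, p in enumerate(actual))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# The example from the text: inferred 5-3-2-4-1 vs. actual 5-4-3-2-1.
print(spearman([5, 4, 3, 2, 1], [5, 3, 2, 4, 1]))  # -> 0.7
```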
widespread technique to mimic the training dataset, and having access
to an evaluation oracle (via an IID test set) is a fundamental assumption
B. Simulating Data Quality
for contribution score computation methods. Although we foresee
C. Datasets, ML Models and Experiment Setup

For our experiments, we used the MNIST [22] and the CIFAR10 [23] datasets. MNIST corresponds to the simple task of digit recognition. It contains 70,000 hand-written digits in the form of 28 × 28 gray-scale images. CIFAR10 is more involved, as it consists of 60,000 32 × 32 color images of various objects. For MLP, we used a three-layered structure with hidden layer size 64, while for CNN, we used two convolutional layers with 10 and 20 kernels of size 5 × 5, followed by two fully-connected hidden layers of sizes 120 and 84. For the optimizer, we used SGD with a learning rate of 0.01 and a dropout rate of 0.5. The combination of the two datasets and the two neural network models yields four use cases. In the rest of the paper, we will refer to these as MM for MLP-MNIST, MC for MLP-CIFAR10, CM for CNN-MNIST, and CC for CNN-CIFAR10.

We ran all the experiments for 100 rounds and with three different FL settings, corresponding to 5, 25, and 100 participants, where 2, 5, and 10 of them are selected in each round, respectively. The three FL settings combined with the four use cases result in twelve evaluation scenarios. We ran every experiment 10-fold, with randomly selected participants.
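One plausible PyTorch rendering of the described architectures for MNIST (our reconstruction; the pooling between the convolutions and the activation choices are assumptions, as the text does not specify them):

```python
import torch.nn as nn

# CNN: two conv layers (10 and 20 kernels of size 5x5), then FC 120 and 84.
cnn = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(10, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Dropout(0.5),
    nn.Linear(20 * 4 * 4, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

# MLP: three-layered structure with hidden layer size 64.
mlp = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10),
)
```

The described optimizer would then be instantiated as torch.optim.SGD(model.parameters(), lr=0.01).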
D. Empirical Quality Scores

We present the pseudo-code of the whole process in Algorithm 1. We split the dataset randomly into $N + 1$ parts (line 1), representing the $N$ datasets of the participants and the test set $D_{N+1}$, used to determine the quality of the aggregated updates. As highlighted earlier, the splitting is done in a way that the resulting sub-datasets are IID; otherwise, the splitting itself would introduce some quality difference between the participants.

Concerning $D_{N+1}$, having access to a dataset is standard practice both in the field of privacy attacks and contribution score computation, and our work is at the intersection of these. Shadow datasets are a widespread technique to mimic the training dataset, and having access to an evaluation oracle (via an IID test set) is a fundamental assumption for contribution score computation methods. Although we foresee multiple options for how $D_{N+1}$ could be obtained, this is orthogonal to our main contribution; we leave it as future work.

Next, we artificially create the baseline dataset qualities using (4) (line 3): each participant's labels are randomized with a different ratio. This is followed by FL (lines 5-9). Round-wise improvements are measured on the test set, and the three scoring rules assign the quality scores accordingly.
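Algorithm 1 itself did not survive into this version of the text; the following self-contained toy (our reconstruction, reusing quality_scores from the earlier sketch and replacing real training with a decaying improvement curve) mirrors the described flow:

```python
import random

def simulate_qi(N=5, b=2, rounds=100, seed=42):
    """Toy end-to-end QI run: round-wise improvement is modeled as a
    decaying curve degraded by the average label noise of the selected
    participants (Eq. (4)) plus zero-mean noise -- no real training."""
    rng = random.Random(seed)
    noise_ratio = [(N - n) / (N - 1) for n in range(1, N + 1)]
    selections, improvements = [], []
    for i in range(rounds):
        selected = rng.sample(range(N), b)
        quality = 1 - sum(noise_ratio[n] for n in selected) / b
        omega = 0.5 * quality / (i + 1) + rng.gauss(0, 0.01)
        selections.append(selected)
        improvements.append(omega)
    return quality_scores(improvements, selections, N)

print(simulate_qi())  # scores tend to increase with the participant ID
```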
V. EXPERIMENTAL RESULTS

In this section, we detail our experimental results and elaborate on possible rule improvements and plausible mitigation strategies.

The quality scores based on the three scoring rules for a handful of selected scenarios are presented in Fig. 1; the rest of the studied cases are shown in the Appendix (Figs. 6 and 7), available online. In Fig. 1(a) we visualize the round-wise evolution of scores for each participant, where the corresponding grayness level depends on the participant ID. More precisely, the lighter shades correspond to participants with higher IDs (i.e., less noisy labels according to (4)), while the darker shades mark low-ID participants (i.e., a higher ratio of random labels). It is visible that the more rounds have passed, the better our scoring rules differentiate the participants.

Fig. 1. Quality scores of the participants. Left - MLP, right - CNN, top - MNIST with 5 and 100 participants, bottom - CIFAR with 25 participants.

In Fig. 1(b) we show the mean (dot), the variance (black line), and the minimum and maximum values (gray line) of the inferred quality scores for each participant. One can see an increasing trend in the quality scores following the participant IDs. This is in line with the ground truth based on (4). Note that even for the participant with the perfect label quality (i.e., the highest ID or the lightest curve), the quality score is rather negative and keeps decreasing with more rounds. This is an expected characteristic of the scoring rules: there is only one rule to increase the score (The Good), while two decrease it (The Bad and The Ugly). Applied jointly, these three heuristic scoring rules approximate the ground truth label quality ordering remarkably well, exclusively from the aggregates.
Finally, we utilize the Spearman coefficient $r_s$ introduced in (3) to measure the accuracy of the inferred qualities; the 12 studied scenarios are presented in Fig. 2. Note that $r_s \in [-1, 1]$, and any positive value indicates correlation. Thus, the value of the baseline (i.e., a randomly guessed ordering) is zero. Consequently, the three simple rules significantly improve on the baseline, as the coefficients for all scenarios are positive. Moreover, as suggested by Fig. 1(a), this value keeps increasing with more rounds, as shown in the Appendix (Fig. 8), available online.

Fig. 2. Spearman coefficient for the 12 scenarios.

A. Fine-Tuning

We consider four ways of improving the accuracy of QI.
B. Mitigation

Note that the demonstrated quality information leakage is not by design; it is a bug rather than a feature of FL. The simplest and most straightforward way to mitigate this vulnerability is to use a protocol where every participant contributes in each round (incurring a sizable communication overhead). Another approach is to hide the participants' IDs (e.g., via mixnets [24]), so no one knows which participant contributed in which round except for the participants themselves. Finally, the aggregation itself could be done in a differentially private manner as well, where carefully calibrated noise is added to the updates in each round. Client-level DP [25] would by default hide the dataset quality of the participants, although at the price of requiring large volumes of noise and, therefore, having low utility.
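A schematic sketch of the last option (our illustration only: clip-sum-noise in the style of client-level DP; calibrating sigma to a concrete (epsilon, delta) guarantee is out of scope here):

```python
import numpy as np

def noisy_aggregate(updates, clip=1.0, sigma=1.0, seed=0):
    """Clip each update's L2 norm, sum them, then add Gaussian noise so
    that no single participant's dataset quality stands out."""
    rng = np.random.default_rng(seed)
    clipped = [u * min(1.0, clip / (np.linalg.norm(u) + 1e-12))
               for u in updates]
    total = np.sum(clipped, axis=0)
    return total + rng.normal(0.0, sigma * clip, size=total.shape)
```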
VI. APPLICATIONS OF QI

In this section, we envisage three scenarios where computing quality scores could be helpful: training accuracy stabilization, contribution score computation, and misbehavior detection.

Even though QI is not a mechanism purposefully engineered into FL (with SA), it does enable the above-mentioned beneficial applications. Note that while there are a handful of existing mechanisms for these tasks within FL, they do not work under SA; hence, we do not compare our results quantitatively to the SotA methods. Our results are shown in Fig. 3.

Fig. 3. QI application scenarios.
A. Enhancing the Training

It is expected that both training speed and obtained accuracy could be improved by weighting the participants according to their data qualities. Hence, a potential use case for QI is to adopt the inferred scores as weights during training. For weighting, we used the multiplicative weight update approach [26], which multiplies the weights with a fixed rate $\kappa$, i.e., each time during training one of the three scoring rules is invoked in Algorithm 1, the weights (initialized as $[1, \ldots, 1]$) are updated in the $i$th round with $\times(1 \pm \kappa)$ for the appropriate participants.

Note that without access to individual gradients (owing to SA), only the aggregates can be scaled by the server. Consequently, in each round, only the aggregate is scaled with the arithmetic mean of the selected participants' weights. For our experiments, we set $\kappa \in \{0.00, 0.05, 0.10, 0.20\}$, where the first value corresponds to the baseline without participant weighting. We highlight some of our results in Fig. 3(a); the rest can be found in the Appendix (Fig. 9), available online. It is conclusive that using weights based on our scoring rules enhances the training, as the training curves are smoother and the final accuracies are higher.
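A minimal sketch of this weighting scheme (our helper names), combining the multiplicative update with SA-compatible scaling of the aggregate:

```python
def update_weights(weights, rewarded, punished, kappa=0.1):
    """Multiplicative weight update: x(1 + kappa) on reward,
    x(1 - kappa) on punishment."""
    for n in rewarded:
        weights[n] *= 1 + kappa
    for n in punished:
        weights[n] *= 1 - kappa
    return weights

def scale_aggregate(aggregate, weights, selected):
    """Under SA only the aggregate is visible, so it is scaled by the
    arithmetic mean of the selected participants' weights."""
    mean_w = sum(weights[n] for n in selected) / len(selected)
    return aggregate * mean_w
```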
B. Contribution Score Computation

The second use case we envisioned for QI is contribution score computation. The holy grail of this sub-discipline is the Shapley value [27], which is exponentially hard to compute, as besides the individual information, it requires information about all potential coalitions of participants. Thus, many approximation methods exist (e.g., [15], [28]). Yet, all methods assume explicit access to the individual datasets or the corresponding gradients, which is not possible with SA. Consequently, there exists no contribution scoring mechanism which could be considered a relevant baseline for QI.

According to [29], payment distribution based on the Shapley value is optimal for our IID setting. Moreover, the federated leave-one-out (LO) method approximates the Federated Shapley value well in this case [15]. Although LO does need individual information (hence, it is not applicable with SA), we compare our method to it, as it only utilizes each individual gradient once (to obtain the grand coalition minus that participant).
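For reference, a sketch of the federated leave-one-out baseline (our rendering with placeholder callables `apply_update` and `evaluate`); note that it needs the individual updates, which SA hides:

```python
def leave_one_out_scores(updates, model, apply_update, evaluate):
    """Each participant's score is the accuracy drop when their update
    is left out of the round's aggregate."""
    full = apply_update(model, sum(updates) / len(updates))
    base_acc = evaluate(full)
    scores = []
    for n in range(len(updates)):
        rest = [u for m, u in enumerate(updates) if m != n]
        partial = apply_update(model, sum(rest) / len(rest))
        scores.append(base_acc - evaluate(partial))
    return scores
```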
The Spearman coefficients of the ordering based on QI and LO are presented in Fig. 3(b). As expected, LO is superior to QI, as it operates on individual information, which is by design avoided by QI. What is somewhat surprising is that LO (benefiting from individual gradients) also struggles to reconstruct the quality-wise ordering perfectly. This suggests that separating participants with different label qualities is indeed a challenging task; given the restricted information setting, QI performs reasonably well.
C. Misbehavior Detection

Another potential application of QI is misbehavior detection, a notoriously hard task even without SA [30]. At the time of writing, we are not aware of any work tackling this problem in the SA setting.

Here we consider both malicious attackers and free-riders. Their goal is either to decrease the accuracy of the aggregated model or to benefit from the aggregated model without contributing, respectively. We do not scramble the labels of honest participants, and simulate attackers by computing the additive inverse of the correct gradients, while we use zero as the gradient for free-riders. These are naive but stealthy strategies owing to SA. With this use case, our goal is not to propose a defense against SotA attackers but rather to demonstrate the usability of QI beyond label quality inference. Note that QI also shows promise for being applicable to determine other quality disparities among participants.
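The two simulated behaviors in a short sketch (our helper):

```python
import numpy as np

def local_update(honest_gradient, behavior="honest"):
    """Attackers send the additive inverse of the correct gradient;
    free-riders send zeros. Under SA, neither is individually visible."""
    if behavior == "attacker":
        return -honest_gradient
    if behavior == "free-rider":
        return np.zeros_like(honest_gradient)
    return honest_gradient
```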
We studied the scores of the honest and malicious participants; the average values for selected scenarios are presented in Table I; the rest can be found in the Appendix (Table II), available online. We also ran various statistical tests to determine whether there is any difference between the honest and the malicious participants' scores. Table I contains highlighted results, while the rest is presented in the Appendix (Tables V, VI, VII, VIII, and IX), available online. The tests concluded unanimously that the two score distributions are different; thus, QI is capable of correctly flagging dishonest participants. Besides the score differences, we also studied the inferred position of a single cheater, which is always in the bottom half (see Fig. 5 in the Appendix, available online).

VII. RELATED WORK

A. Participant Scoring

Simple but effective scoring rules are prevalent in complex ICT-based systems, especially for characterizing quality. For instance, binary or counting signals can be utilized to i) steer peer-to-peer systems by measuring the trustworthiness of peers [32], ii) assess and promote content in social media [33], iii) ensure the proper natural selection of products in online marketplaces [34], and iv) select trustworthy clients via simple credit scoring mechanisms [35].

There exist free-rider detection mechanisms for collaborative learning [36], [37]. In contrast, [38] proposes an online evaluation method that defines each participant's impact based on the current and the previous rounds. Although their goal is similar to ours, we consider SA being utilized, while neither of the above mechanisms is applicable in such a case. A disaggregation technique is presented in [39], which reconstructs the participation matrix by simulating the same round several times with different participants. Instead, we assume such participation information to be available and emulate the training rounds by properly updating the model.

Accuracy boosting by participant weighting is considered in [40], where the weights are determined by the underlying data quality calculated via the cross-entropy of the local model predictions. Those experiments consider only five participants and two quality classes (fully correct or incorrect); we study fine-grained quality levels with larger sets of participants. A similar method was utilized in an SA setting in [41] using homomorphic encryption. In contrast, our method does not require any cryptographic primitive and can be utilized on top of any federated learning protocol.

We naively assume that data quality is directly related to the noise present in the labels. Naturally, this is a simplification: there is an entire computer science discipline devoted to data quality [14].

The authors of [42] listed several incentive mechanisms for contribution computation in FL (which can be interpreted as data quality). A pertinent notion is the Shapley value [27], which was designed to allocate goods to players proportionally to their contributions. A high-level summary of the role of the Shapley value within ML is presented in [43]. The main drawback of the Shapley value is its exponential computational requirement, which makes it unfeasible in most scenarios. Several approximation methods were proposed in the literature using sampling [44], gradients [28], [45], [46], [47], [48], [49], and influence functions [50], [51], [52]. Although some are promising (e.g., the conceptual idea in [53]), all previous methods assume explicit access to either the datasets or the corresponding gradients. Consequently, these methods are not applicable when SA is enabled during FL. QI can be considered a first step towards a contribution score when no information on individual datasets is available.

B. Privacy Attacks

There are several indirect threats against FL models. These could be categorized into model inference [5], membership inference [6], parameter inference [8], and property inference [9]. QI could be considered an instance of the last. Source inference [54] is also such an attack, which could tie the extracted information to specific participants of FL. However, it does not work with SA. Another property inference
attack is the quantity composition attack [55], which aims at inferring the proportion of training labels among the participants in FL. This attack is successful even under SA protocols or DP. In contrast to our work, that paper focuses on inferring the distributions of non-IID datasets, while we aim to recover the relative quality information on IID datasets. Finally, [56] also attempts to explore user-level privacy leakage within FL. Similarly to our work, the attack defines client-dependent properties, which then can be used to distinguish the clients from one another. The authors assume an active malicious server utilizing a computationally heavy GAN for the attack, which is the exact opposite of our honest-but-curious setup with limited computational power.

C. Privacy Defenses

QI can be considered a property inference attack; hence, naturally, it can be "mitigated" via client-level DP [25]. Moreover, as we simulate different dataset qualities with the amount of added noise, we want to prevent the leakage of the added noise volume. Consequently, this problem relates to private privacy parameter selection, as label perturbation [57] (which we use to mimic different dataset quality levels) is one technique for achieving DP [10]. Although some works set the privacy parameter using economic incentives [1], we are not aware of any research aiming to define the privacy parameter itself also privately.

VIII. CONCLUSION

Federated learning is the most popular collaborative learning framework, wherein each round only a subset of participants updates a joint machine learning model. Fortified with secure aggregation, only aggregated information is learned both by the participants and the server. Yet, in this paper, we devised a simple set of quality scoring rules that successfully recover the relative ordering of the participants' dataset qualities (measured by the perturbed label ratio). Besides a small representative dataset to evaluate the improvement of the model after each aggregation, our method requires neither significant computational power nor background information.

Through a series of image recognition experiments, we showed that it is possible to restore the relative ordering based on label quality with reasonably high accuracy. Our experiments also revealed a connection between the accuracy of the quality inference and both the complexity of the task and the used architecture. Moreover, we performed an ablation study suggesting that the original rules are near optimal. Lastly, we demonstrated how quality inference could i) boost training efficiency by weighting the participants, ii) yield an operational contribution metric, and iii) detect misbehaving participants based on their quality scores.

A. Limitations and Future Work

This paper has barely scratched the surface of quality inference in federated learning based only on aggregated updates. We foresee multiple avenues for improving and extending this work, e.g., using machine learning techniques to replace our naive rules by relaxing the attacker constraints concerning computational power and background knowledge. In the early rounds, selecting the participants in a non-random manner similar to [48] could also be beneficial.

For clarity, we have restricted our experiments to visual recognition tasks with noisy labels as the measure of data quality. Although we expect our results to generalize well to other domains, we leave further experiments as future work. Finally, the personal data protection implications of the information leakage caused by quality inference are also of interest: should such quality information be considered private, and, consequently, should it fall under data protection regulations such as the GDPR? This issue has significant practical relevance to federated learning platforms already in operation.

ACKNOWLEDGMENT

The authors are grateful to András Tótth for his work in the experiments on contribution score computation.

REFERENCES

[1] B. Pejo, Q. Tang, and G. Biczok, "Together or alone: The price of privacy in collaborative learning," in Proc. Privacy Enhancing Technol., vol. 2019, pp. 47–65, 2019.
[2] R. Cramer et al., Secure Multiparty Computation. Cambridge, U.K.: Cambridge Univ. Press, 2015.
[3] C. Gentry et al., A Fully Homomorphic Encryption Scheme. Stanford, CA, USA: Stanford Univ. Press, 2009.
[4] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016, arXiv:1610.05492.
[5] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., 2015, pp. 1322–1333.
[6] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 3–18.
[7] L. Zhu, Z. Liu, and S. Han, "Deep leakage from gradients," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 14747–14756.
[8] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in Proc. 25th USENIX Secur. Symp., 2016, pp. 601–618.
[9] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, "Exploiting unintended feature leakage in collaborative learning," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 691–706.
[10] D. Desfontaines and B. Pejó, "SoK: Differential privacies," in Proc. Privacy Enhancing Technol., 2020, vol. 2, pp. 288–313.
[11] H. B. McMahan et al., "Communication-efficient learning of deep networks from decentralized data," 2016, arXiv:1602.05629.
[12] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertainty Fuzziness Knowl.-Based Syst., vol. 10, pp. 557–570, 2002.
[13] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, "How to backdoor federated learning," in Proc. Int. Conf. Artif. Intell. Statist., 2020, pp. 2938–2948.
[14] C. Batini et al., Data and Information Quality. Cham, Switzerland: Springer, 2016.
[15] T. Wang, J. Rausch, C. Zhang, R. Jia, and D. Song, "A principled approach to data valuation for federated learning," in Federated Learning. Berlin, Germany: Springer, 2020.
[16] IMDB, "The good, the bad and the ugly," 1966. [Online]. Available: https://www.imdb.com/title/tt0060196/
[17] D. Harville, "Extension of the Gauss-Markov theorem to include the estimation of random effects," Ann. Statist., vol. 4, pp. 384–395, 1976.
[18] L. C. Ludeman, Random Processes: Filtering, Estimation, and Detection. Hoboken, NJ, USA: Wiley, 2003.
[19] R. Kerkouche, G. Ács, and C. Castelluccia, "Federated learning in adversarial settings," 2020, arXiv:2010.07808.
[20] J. H. Zar, "Spearman rank correlation," in Encyclopedia of Biostatistics. Hoboken, NJ, USA: Wiley, 2005, doi: 10.1002/0470011815.b2a15150.
[21] P. Diaconis and R. L. Graham, "Spearman's footrule as a measure of disarray," J. Roy. Stat. Soc. Ser. B Methodol., vol. 39, pp. 262–268, 1977.
[22] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 141–142, Nov. 2012.
[23] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," 2014. [Online]. Available: http://www.cs.toronto.edu/kriz/cifar.html
[24] D. L. Chaum, "Untraceable electronic mail, return addresses, and digital pseudonyms," Commun. ACM, vol. 24, pp. 84–90, 1981.
[25] R. C. Geyer, T. Klein, and M. Nabi, "Differentially private federated learning: A client level perspective," 2017, arXiv:1712.07557.
[26] S. Arora, E. Hazan, and S. Kale, "The multiplicative weights update method: A meta-algorithm and applications," Theory Comput., vol. 8, pp. 121–164, 2012.
[27] L. S. Shapley, "A value for n-person games," in Contributions to the Theory of Games. Princeton, NJ, USA: Princeton Univ. Press, 1953.
[28] A. Ghorbani and J. Zou, "Data Shapley: Equitable valuation of data for machine learning," 2019, arXiv:1904.02868.
[29] J. Huang, C. Hong, L. Y. Chen, and S. Roos, "Is Shapley value fair? Improving client selection for mavericks in federated learning," 2021, arXiv:2106.10734.
[30] C. Fung, C. J. Yoon, and I. Beschastnikh, "Mitigating sybils in federated learning poisoning," 2018, arXiv:1808.04866.
[31] I. Dinur and K. Nissim, "Revealing information while preserving privacy," in Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp. Princ. Database Syst., 2003, pp. 202–210.
[32] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, "Incentives for combatting freeriding on P2P networks," in Proc. Eur. Conf. Parallel Process., Springer, 2003, pp. 1273–1279.
[33] P. Van Mieghem, "Human psychology of common appraisal: The reddit score," IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1404–1406, Dec. 2011.
[34] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, "Detecting product review spammers using rating behaviors," in Proc. 19th ACM Int. Conf. Inf. Knowl. Manage., 2010, pp. 939–948.
[35] L. Thomas, J. Crook, and D. Edelman, Credit Scoring and its Applications. Philadelphia, PA, USA: SIAM, 2017.
[36] J. Lin, M. Du, and J. Liu, "Free-riders in federated learning: Attacks and defenses," 2019, arXiv:1911.12560.
[37] Y. Fraboni, R. Vidal, and M. Lorenzi, "Free-rider attacks on model aggregation in federated learning," in Proc. Int. Conf. Artif. Intell. Statist., 2021, pp. 1846–1854.
[38] B. Liu et al., "FedCM: A real-time contribution measurement method for participants in federated learning," 2021, arXiv:2009.03510.
[39] J. So, R. E. Ali, B. Guler, J. Jiao, and S. Avestimehr, "Securing secure aggregation: Mitigating multi-round privacy leakage in federated learning," 2021, arXiv:2106.03328.
[40] Y. Chen, X. Yang, X. Qin, H. Yu, B. Chen, and Z. Shen, "Focus: Dealing with label quality disparity in federated learning," 2020, arXiv:2001.11359.
[41] J. Guo, Z. Liu, K.-Y. Lam, J. Zhao, Y. Chen, and C. Xing, "Secure weighted aggregation for federated learning," 2020, arXiv:2010.08730.
[42] J. Huang, R. Talbi, Z. Zhao, S. Boucchenak, L. Y. Chen, and S. Roos, "An exploratory analysis on users' contributions in federated learning," in Proc. 2nd IEEE Int. Conf. Trust Privacy Secur. Intell. Syst. Appl., 2020, pp. 20–29.
[43] B. Rozemberczki et al., "The Shapley value in machine learning," 2022, arXiv:2202.05594.
[44] J. Castro, D. Gómez, and J. Tejada, "Polynomial calculation of the Shapley value based on sampling," Comput. Operations Res., vol. 36, pp. 1726–1730, 2009.
[45] L. Nagalapatti and R. Narayanam, "Game of gradients: Mitigating irrelevant clients in federated learning," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 9046–9054.
[46] A. Ghorbani, M. Kim, and J. Zou, "A distributional framework for data valuation," in Proc. Int. Conf. Mach. Learn., 2020, Art. no. 331.
[47] Y. Kwon and J. Zou, "Beta Shapley: A unified and noise-reduced data valuation framework for machine learning," 2021, arXiv:2110.14049.
[48] Z. Liu, Y. Chen, H. Yu, Y. Liu, and L. Cui, "GTG-Shapley: Efficient and accurate participant contribution evaluation in federated learning," 2021, arXiv:2109.02053.
[49] C. Yang, J. Liu, H. Sun, T. Li, and Z. Li, "WTDP-Shapley: Efficient and effective incentive mechanism in federated learning for intelligent safety inspection," IEEE Trans. Big Data, to be published, doi: 10.1109/TBDATA.2022.3198733.
[50] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," 2017, arXiv:1703.04730.
[51] Y. Xue et al., "Toward understanding the influence of individual clients in federated learning," 2020, arXiv:2012.10936.
[52] X. Xu, A. Hannun, and L. Van Der Maaten, "Data appraisal without data sharing," in Proc. Int. Conf. Artif. Intell. Statist., 2022, pp. 11422–11437.
[53] B. Pejó, G. Biczók, and G. Ács, "Measuring contributions in privacy-preserving federated learning," ERCIM News, vol. 2021, 2021, Art. no. 35.
[54] H. Hu, Z. Salcic, L. Sun, G. Dobbie, and X. Zhang, "Source inference attacks in federated learning," 2021, arXiv:2109.05659.
[55] L. Wang, S. Xu, X. Wang, and Q. Zhu, "Eavesdrop the composition proportion of training labels in federated learning," 2019, arXiv:1910.06044.
[56] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, "Beyond inferring class representatives: User-level privacy leakage from federated learning," in Proc. IEEE Conf. Comput. Commun., 2019, pp. 2512–2520.
[57] N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar, "Semi-supervised knowledge transfer for deep learning from private training data," 2016, arXiv:1610.05755.

Balázs Pejó received the BSc degree in mathematics from the Budapest University of Technology and Economics (BME, Hungary), in 2012, the MSc degrees in computer science from the Security and Privacy Program of EIT Digital, University of Trento (UNITN, Italy) and Eotvos Lorand University (ELTE, Hungary), in 2014, and the PhD degree in informatics from the University of Luxembourg (UNILU, Luxembourg), in 2019. Currently, he is an Assistant Professor with the Laboratory of Cryptography and Systems Security (CrySyS Lab).

Gergely Biczók received the MSc and PhD degrees in computer science from the Budapest University of Technology and Economics (BME), in 2003 and 2010, respectively. He is an associate professor with the CrySyS Lab, Department of Networked Systems and Services, Budapest University of Technology and Economics. Previously, he was a postdoctoral fellow with the Norwegian University of Science and Technology, a Fulbright visiting researcher with Northwestern University, and a research fellow with Ericsson Research. His research focuses on the security, privacy, and economics of networked systems.