
The Balanced-Pairwise-Affinities Feature Transform

Daniel Shalam 1 Simon Korman 1

1 Department of Computer Science, University of Haifa, Israel. Correspondence to: Daniel Shalam <[email protected]>.

Abstract

The Balanced-Pairwise-Affinities (BPA) feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping related tasks. The transformed set encodes a rich representation of high-order relations between the input features. A particular min-cost-max-flow fractional matching problem, whose entropy-regularized version can be approximated by an optimal transport (OT) optimization, leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable. While the Sinkhorn OT solver has been adapted extensively in many contexts, we use it differently, by minimizing the cost between a set of features and itself and using the transport plan's rows as the new representation. Empirically, the transform is highly effective and flexible in its use, and consistently improves networks it is inserted into, in a variety of tasks and training schemes. We demonstrate state-of-the-art results in few-shot classification, unsupervised image clustering and person re-identification. Code is available at github.com/DanielShalam/BPA.

1. Introduction

In this work, we reassess the functionality of features in set-input problems, in which a task is defined over a set of items. Prominent examples of this setting are few-shot classification (Ravi & Larochelle, 2017), clustering (Van Gansbeke et al., 2020), feature matching (Korman & Avidan, 2015) and person re-identification (Ye et al., 2021), to name but a few. In such tasks, features computed at test time are mainly compared relative to one another, and less so to the features seen at training time. For such tasks, the practice of learning a generic feature extractor during training and applying it at test time is sub-optimal.

In set-input problems, such as few-shot classification, an instance of the task is in the form of a set of n items (e.g. images) {x_i}_{i=1}^n. A generic neural-network pipeline (Fig. 1 Left) typically uses a feature embedding (extractor) F, which is applied independently to each input item, to obtain a set of features V = {v_i}_{i=1}^n = {F(x_i)}_{i=1}^n, prior to downstream task-specific processing G (e.g. a clustering head or classifier). The features V can be of high quality (concise, unique, descriptive), but they are limited in representation, since they are extracted based on knowledge acquired from similar examples at train time, with no context of the test-time instance they are part of, which is critical in set-input tasks.

We rather consider the more general framework (Fig. 1 Right), in which the per-item, independently extracted feature collection V is passed to an attention-mechanism type computation, in which some transform jointly processes the entire set of instance features, re-embedding each feature in light of the joint statistics of the entire instance.

The main idea of BPA is very intuitive and is demonstrated on a toy example in Fig. 2. The embedding of each feature will encode the distribution of its affinities to the rest of the set items. Specifically, items in the embedded space will be close if and only if they share a similar such distribution, i.e. 'agree' on the way they 'see' the entire set. In fact, the transform largely discards the item-specific feature information, resulting in a purely relative, normalized representation that yields a highly efficient embedding with many attractive properties.

The proposed transformation can be computed very efficiently, with negligible runtime within the hosting network, and can easily be used in different contexts, as can be seen in the pseudo-code snippets we provide in Sections A and C of the Appendix. The embedding itself is given by the rows of an optimal-transport (OT) plan matrix, which is the solution to a regularized min-cost-max-flow fractional matching problem that is defined over the pairwise (self-)affinities matrix of the features in the set.
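The following is a minimal, illustrative sketch of the two generic designs of Fig. 1 (not code from the paper); the modules F, T and G are placeholders for whatever extractor, set transform and task head a hosting method uses.

    import torch.nn as nn

    class SetPipeline(nn.Module):
        # Generic set-input pipeline of Fig. 1: a per-item extractor F,
        # an optional joint set transform T (e.g. BPA), and a task head G.
        def __init__(self, F, G, T=None):
            super().__init__()
            self.F, self.G, self.T = F, G, T

        def forward(self, items):                          # items: a set of n inputs, shape (n, ...)
            V = self.F(items)                              # (n, d) independently extracted features
            W = self.T(V) if self.T is not None else V     # joint re-embedding (Fig. 1 Right)
            return self.G(W)                               # downstream task-specific inference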


Figure 1. Generic designs of networks that act on sets of items. These cover relevant architectures, e.g. for few-shot classification and clustering. Left: A generic network for processing a set of input items typically follows the depicted structure: (i) each item separately goes through a common feature extractor F; (ii) the set of extracted features is the input to a downstream task processing module G. Right: A more general structure in which the extracted features undergo joint processing by a transform T. Our BPA transform (as well as other attention mechanisms) is of this type and its high-level design (within the 'green' module) is detailed in Fig. 2.

Technically, it involves the computation of pairwise distances and several normalization iterations of a Sinkhorn (Cuturi, 2013) algorithm, bearing apparent similarities to many related methods, based on either Spectral Clustering (Ng et al., 2001), which normalizes the same affinity matrix, attention mechanisms (Vaswani et al., 2017), which learn features based on a self-affinities matrix (perhaps even a normalized one (Sander et al., 2022)), and other matching (Sarlin et al., 2020) or classification (Hu et al., 2020) algorithms, where optimal-transport plans are computed between source items and target items or class centers. However, the most important difference, and our main novel observation, is that the self fractional matching itself (which can be viewed as a balanced affinity matrix) can serve as a powerful embedding, since the distances in this space (between assignment vectors) have explicit interpretations that we explore, which are highly beneficial to general grouping-based algorithms that are applied to such set-input tasks.

Contribution

We propose a parameter-less optimal-transport based feature transform, termed BPA, which can be used as a drop-in addition that converts a generic feature extraction scheme to one that is well suited to set-input tasks (e.g. from Figure 1 Left to Right). It is analyzed and shown to have the following attractive set of qualities. (i) efficiency: having real-time inference; (ii) differentiability: allowing end-to-end training of the entire 'embedding-transform-inference' pipeline of Fig. 1 Right; (iii) equivariance: ensuring that the embedding works coherently under any order of the input items; (iv) probabilistic interpretation: each embedded feature will encode its distribution of affinities to all other features, by conforming to a doubly-stochastic constraint; (v) valuable metrics for the item set: distances between embedded vectors will include both direct and indirect (third-party) similarity information between input features.

Empirically, we show BPA's flexibility and ease of application to a wide variety of tasks, by incorporating it in leading methods of each type. We test different configurations, such as whether the hosting network is pre-trained or re-trained with BPA inside, across different backbones, whether transductive or inductive. Few-shot classification is our main application, with extensive experimentation on standard benchmarks; testing on unsupervised image clustering shows the potential of BPA in the unsupervised domain; and the person re-identification experiments show how BPA deals with non-curated large-scale tasks. In all three applications, over the different setups and datasets, BPA consistently improves its hosting methods, achieving new state-of-the-art results.

2. Relation to Prior Work

2.1. Related Techniques

Set-to-Set (or Set-to-Feature) Functions have been developed to act jointly on a set of items (typically features) and output an updated set (or a single feature), which are used for downstream inference tasks. Deep-Sets (Zaheer et al., 2017) formalized fundamental requirements from architectures that process sets. Point-Net (Qi et al., 2017) presented an influential design for learning local and global features on 3D point-clouds, while Maron et al. (2020) study the design of equi/in-variant layers. Unlike BPA, the joint processing in these methods is limited, amounting to weight-sharing between separate processes and joint aggregations.

Attention Mechanisms. The introduction of Relational Networks (Santoro et al., 2017) and Transformers (Vaswani et al., 2017), with their initial applications in vision models (Ramachandran et al., 2019), has led to the huge impact of Vision Transformers (ViTs) (Dosovitskiy et al., 2020) in many vision tasks (Khan et al., 2021). While BPA can be seen as a self-attention module, it is very different, first, since it is parameterless, and hence can work at test-time on a pre-trained network. In addition, it can provide an explicit probabilistic global interpretation of the instance data.

Spectral Methods have been widely used as simple transforms applied on data that needs to undergo grouping or search based operations, jointly processing the set of features, resulting in a compact and perhaps discriminative representation. PCA (Pearson, 1901) provides a joint dimension reduction, which maximally preserves data variance, but does not necessarily improve feature affinities for downstream tasks. Spectral Clustering (SC) (Shi & Malik, 2000; Ng et al., 2001) is the leading non-learnable clustering method in use in the field.


Figure 2. The BPA transform: illustrated on a toy 7 image 3-class MNIST example.

If we ignore its final clustering stage, SC consists of forming a pairwise affinity matrix which is normalized (Zass & Shashua, 2006) before extracting its leading eigenvectors, which form the final embedding. BPA is also based on normalizing an affinity matrix, but it uses this matrix's rows as embedded features and avoids any further spectral decompositions, which are costly and difficult to differentiate through.

Optimal Transport (OT) problems are directly related to measuring distances between distributions or sets of features. Cuturi (2013) popularized the Sinkhorn algorithm, a simple, differentiable and fast approximation of entropy-regularized OT, which has since been used extensively, for clustering (Lee et al., 2019; Asano et al., 2020), few-shot classification (Huang et al., 2019; Ziko et al., 2020; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021; Zhu & Koniusz, 2022), matching (Wang et al., 2019; Fey et al., 2020; Sarlin et al., 2020), representation learning (Caron et al., 2020; Asano et al., 2020), retrieval (Xie et al., 2020), person re-identification (Wang et al., 2022), style-transfer (Kolkin et al., 2019) and attention (Sander et al., 2022).

Our approach also builds on some attractive properties of the Sinkhorn solver. While our usage of Sinkhorn is extremely simple (see Algorithm 1), it is fundamentally different from all other OT usages we are aware of, since: (i) we compute the transport-plan between a set of features and itself - not between feature-sets and label/class-prototypes (Huang et al., 2019; Ziko et al., 2020; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021; Zhu & Koniusz, 2022; Lee et al., 2019; Asano et al., 2020; Xie et al., 2020; Wang et al., 2022; Kolkin et al., 2019), or between two different feature-sets (Wang et al., 2019; Fey et al., 2020; Sarlin et al., 2020; Sander et al., 2022); (ii) while others use the transport-plan to obtain distances or associations between features and features/classes, we use its own rows as new feature vectors for downstream tasks.

2.2. Instance-Specific Applications

Few-Shot Classification (FSC) is a branch of few-shot learning in which a classifier learns to recognize previously unseen classes given a limited number of labeled examples. In the meta-learning approach, the training data is split into tasks (or episodes) mimicking the test-time tasks to which the learner is required to generalize. MAML (Finn et al., 2017) "learns to fine-tune" by learning a network initialization from which it can quickly adapt to novel classes. In ProtoNet (Snell et al., 2017), a learner is meta-trained to predict query feature classes, based on distances from support class-prototypes in the embedding space. The trainable version of BPA can be viewed as a meta-learning algorithm.

Subsequent works (Chen et al., 2018; Dhillon et al., 2020) advocate using larger and more expressive backbones, employing transductive inference, which fully exploits the data at inference, including unlabeled images. BPA is transductive, but does not make assumptions on (nor needs to know) the number of classes (ways) or items per class (shots), as it executes a general probabilistic grouping action.

Recently, attention mechanisms were shown to be effective for FSC (Kang et al., 2021; Zhang et al., 2020; Ye et al., 2020), and a number of works (Ziko et al., 2020; Huang et al., 2019; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021) have adopted Sinkhorn (Cuturi, 2013) as a parameterless unsupervised classifier that computes matchings between query embeddings and class centers. Sill-Net (Zhang et al., 2021), which augments training samples with illumination features, and PTMap-SF (Chen & Wang, 2021), which proposes a DCT-based feature embedding, are both based on PTMap (Hu et al., 2020). The state-of-the-art PMF (Hu et al., 2022) proposed a 3-stage pipeline of pre-training on external data, meta-training with labelled tasks, and fine-tuning on unseen tasks. BPA can be incorporated into these methods, immediately after their feature extraction stage.

Unsupervised Image Clustering (UIC) is the task of grouping related images, without any label information, into representative clusters. Naturally, the ability to measure the similarities among samples is a crucial aspect of UIC. Recent methods have achieved tremendous progress in this task, towards closing the gap with supervised counterparts. The leading approaches directly learn to map images to labels, by constraining the training of an unsupervised classification model with different types of indirect loss functions.


Prominent works in this area include DAC (Chang et al., 2017), which recasts the clustering problem into a binary pairwise-classification framework, and SCAN (Van Gansbeke et al., 2020), which builds on a pre-trained encoder that provides nearest-neighbor based constraints for training a classifier. The recent state-of-the-art SPICE (Niu et al., 2022) is a pseudo-labeling based method, which divides the clustering network into a feature model for measuring the instance-level similarity and a clustering head for identifying the cluster-level discrepancy.

Person Re-Identification (Re-ID) is the task of identifying a certain person (identity) across multiple detected pedestrian images, from different non-overlapping cameras. It is challenging due to the scale of the problem and the large variation in pose, background and illumination. See Ye et al. (2021) for an excellent comprehensive survey on the topic. Among the most popular methods are OSNet (Zhou et al., 2019), which developed an efficient small-scale network with high performance, and DropBlock (Top-DB-Net) (Quispe & Pedrini, 2020), which achieved state-of-the-art results by dropping a region block in the feature map for attentive learning. The Re-ID task is typically of a larger scale - querying thousands of identities against a target of tens of thousands. Also, the data is much more real-world compared to the carefully curated FSC sets.

3. The BPA Transform

3.1. Derivation

Assume we are given a task instance which consists of an inference problem over a set of n items {x_i}_{i=1}^n, where each of the items belongs to a space of input items Ω ⊆ R^D. The inference task can be modeled as f_θ({x_i}_{i=1}^n), using a learned function f_θ, which acts on the set of input items and is parameterized by a set of parameters θ. Typically, such functions combine an initial feature extraction stage that is applied independently to each input item, with a subsequent stage of (separate or joint) processing of the feature vectors (see Fig. 1 Left or Right, respectively).

That is, the function f_θ takes the form f_θ({x_i}_{i=1}^n) = G_ψ({F_φ(x_i)}_{i=1}^n), where F_φ is the feature extractor (or embedding network) and G_ψ is the task inference function, parameterized by φ and ψ respectively, where θ = φ ∪ ψ. The feature embedding F : R^D → R^d, usually in the form of a neural network (with d ≪ D), could be either pre-trained, or trained in the context of the task function f, along with the inference function G.

For an input {x_i}_{i=1}^n, let us define the set of embedded features {v_i}_{i=1}^n = {F(x_i)}_{i=1}^n. In the following, we consider these sets of input vectors and features as real-valued row-stacked matrices X ∈ R^{n×D} and V ∈ R^{n×d}.

We suggest a novel re-embedding of the feature set V, using a transform that we denote by T, in order to obtain a new set of features W = T(V), where W ∈ R^{n×n}. The new feature set W has an explicit probabilistic interpretation, which is specifically suited for tasks related to classification, matching or grouping of items in the input set X. In particular, W will be a symmetric, doubly-stochastic matrix (non-negative, with rows and columns that sum to 1), where the entry w_ij (for i ≠ j) encodes the belief that items x_i and x_j belong to the same class or cluster.

The proposed transform T : R^{n×d} → R^{n×n} (see Fig. 2) acts on the original feature set V as follows. It begins by computing the squared Euclidean pairwise distance matrix D, namely, d_ij = ||v_i − v_j||^2, which can be computed efficiently as d_ij = 2(1 − cos(v_i, v_j)) = 2(1 − v_i · v_j^T) when the rows of V are unit normalized. Or, in compact form, D = 2(1 − S), where 1 is the all-ones n × n matrix and S = V · V^T is the cosine affinity matrix of V.

W will be computed as the optimal transport (OT) plan matrix between the n-dimensional all-ones vector 1_n and itself, under the self cost matrix D_∞, which is the distance matrix D with a very (infinitely) large scalar replacing each of the entries on its diagonal (which were all zero), enforcing the affinities of each feature to distribute among the others. Explicitly, let D_∞ = D + αI, where α is a very (infinitely) large constant and I is the n × n identity matrix. W is defined to be the doubly-stochastic matrix that is the minimizer of the functional

    W = arg min_{W ∈ B_n} ⟨D_∞, W⟩                                     (1)

where B_n is the set (known as the Birkhoff polytope) of n × n doubly-stochastic matrices and ⟨·, ·⟩ stands for the Frobenius (standard) dot-product.

This objective can be minimized using simplex or interior point methods with complexity Θ(n^3 log n). In practice, we use the highly efficient Sinkhorn-Knopp method (Cuturi, 2013), which is an iterative scheme that optimizes an entropy-regularized version of the problem, where each iteration takes Θ(n^2). Namely:

    W = arg min_{W ∈ B_n} ⟨D_∞, W⟩ − (1/λ) h(W)                        (2)

where h(W) = −Σ_{i,j} w_ij log(w_ij) is the Shannon entropy of W and λ is the entropy regularization parameter.

The transport-plan matrix W that is the minimizer of Equation (2) will become the result of our transform, after 'restoring' perfect affinities on the diagonal (replacing the diagonal entries from 0s to 1s), i.e. W ← W + I, where I is the n × n identity matrix. Our final set of features is T(V) = W, and each of its rows is the re-embedding of the corresponding feature (row) in V.


The BPA transform is given in Algorithm 1 in the appendix, in PyTorch-style pseudo-code. Note that W is symmetric as a result of the symmetry of D and its own double-stochasticity. We next explain its probabilistic interpretation.

3.2. Probabilistic interpretation

The optimization problem in Equation (1) can be written more explicitly as follows:

    min_W ⟨D_∞, W⟩   s.t.   W · 1_n = W^T · 1_n = 1_n                   (3)

which can be seen to be the same as:

    min_W ⟨D, W⟩   s.t.   W · 1_n = W^T · 1_n = 1_n,   w_ii = 0 for i = 1, ..., n    (4)

since the use of the infinite weights on the diagonal of D_∞ is equivalent to using the original D with a constraint of zeros along the diagonal of W.

The optimization problem in Equation (4) is in fact a fractional matching instance between the set of n original features and itself. It can be posed as a bipartite-graph min-cost max-flow instance (the problem of finding a min-cost flow out of all max-flow solutions), as depicted in Fig. 3. The graph has n nodes on each side, representing the original features {v_i}_{i=1}^n (the rows of V). Across the two sides, the cost of the edge (v_i, v_j) is the distance d_ij, and the edges of the type (v_i, v_i) have a cost of infinity (or can simply be removed). Each 'left' node is connected to a 'source' node S by an edge of cost 0 and similarly each 'right' node is connected to a 'target' (sink) node T. All edges in the graph have a capacity of 1 and the goal is to find an optimal fractional self matching, by finding a min-cost max-flow from source to sink. Note that the max-flow can easily be seen to be n, but a min-cost flow is sought among max-flows.

Figure 3. The min-cost max-flow perspective: Costs are shown.

In this set-to-itself matching view, each vector v_i is fractionally matched to the set of all other vectors V − {v_i} based on the pairwise distances, but importantly taking into account the fractional matches of the rest of the vectors in order to satisfy the double-stochasticity constraint. The construction constrains the max flow to have a total outgoing flow of 1 from each 'left' node and a total incoming flow of 1 to each 'right' node. Therefore, the ith transformed feature w_i (the ith row of W) is a distribution (non-negative entries, summing to 1), where w_ii = 0 and w_ij is the relative belief that features i and j belong to the same 'class'.

3.3. Properties

We can now point out some important properties of the proposed embedding, given by the rows of the matrix W. Some of these properties can be observed in the toy 3-class MNIST digit example, illustrated in Fig. 2.

Interpretability of distances in the embedded space: An important property of our embedding is that each embedded feature encodes its distribution of affinities to all other features. In particular, the comparison of embedded vectors w_i and w_j (of items i and j in a set) includes both direct and indirect information about the similarity between the features. Refer to Figure 4 for a detailed explanation of this property. If we look at the different coordinates k of the absolute difference vector a = |w_i − w_j|, BPA captures (i) direct affinity: for k which is either i or j, it holds that a_k = 1 − w_ij = 1 − w_ji¹. This amount measures how high (close to 1) is the mutual belief of features i and j about one another; (ii) indirect (3rd-party) affinity: for k ∉ {i, j}, we have a_k = |w_ik − w_jk|, which is a comparison of the beliefs of features i and j regarding the (third-party) feature k. The double-stochasticity of the transformed feature-set ensures that the compared vectors are similarly scaled (as distributions, plus 1 on the diagonal) and the symmetry further enforces the equal relative affinity between pairs.

Figure 4. The (symmetric) embedding matrix W and the absolute difference between its ith and jth rows.

As an example, observe the output features 4 and 5 in Fig. 2, which re-embed the 'green' features of the digit '7' images. As desired, these embeddings are close in the target 7D space. The closeness is driven both by their closeness in the original space (coordinates 4 and 5) and by the agreement on specific large differences from other images. This property is responsible for better separation between classes in the target domain, which leads to improved performance on tasks like classification, clustering or retrieval.

¹ Note: (i) w_ii = w_jj = 1; (ii) w_ij = w_ji from the symmetry of W; (iii) all elements of W are ≤ 1, hence the |·| can be dropped.


Parameterless-ness, Differentiability and Equivariance: These three properties are inherited from the Sinkhorn OT solver. The transform is parameterless, giving it the flexibility to be used in other pipelines, directly over different kinds of embeddings, without the harsh requirement of re-training the entire pipeline. Retraining is certainly possible, and beneficial in many situations, but not mandatory, as our experiments work quite well without it. Also, due to the differentiability of the Sinkhorn algorithm (Cuturi, 2013), back-propagating through BPA can be done naturally, hence it is possible to (re-)train the hosting network to adapt to BPA, if desired. The embedding works coherently with respect to any change of order of the input items (features). This can be shown by construction, since min-cost max-flow solvers, as well as the Sinkhorn OT solver, are equivariant with respect to permutations of their inputs.

Usage flexibility: Recall that BPA is applied on sets of features, typically computed by some embedding network, and its output features are passed to downstream network components. Since BPA is parameterless, it can simply be inserted into any trained hosting network, and since it is differentiable, it is possible to train the hosting network with BPA inside it. We therefore denote by BPAp the basic drop-in usage of BPA, inserted into a pretrained network. This is the easiest and most flexible way to use BPA, nevertheless showing consistent benefits in the different tested applications. We denote by BPAt the usage where the hosting network is trained with BPA within. It allows adapting the hosting network's parameters to the presence of the transform, with the potential of further improving performance.

Transductive or Inductive: Note that BPA is a transductive method in the sense that it needs to jointly process the data, but in doing so, unlike many transductive methods, it does not make any limiting assumptions about the input structure, such as knowing the number of classes, or items per class. In any case, we consider the BPAp and BPAt variants to be transductive, regardless of the nature of the hosting network. Nevertheless, being transductive is possibly restrictive for certain tasks, for which test-time inputs might be received one-by-one. Therefore, we suggest a third usage type, BPAi, where the hosting network is trained with BPA inside (just like in BPAt), but BPA is not applied at inference (simply not inserted), hence the hosting network remains inductive if it was so in the first place.

Dimensionality: BPA has the unique property that the dimension of its embedded feature depends on (equals) the number of features in the set. Given a batch of n d-dimensional features V ∈ R^{n×d}, it outputs a batch of n n-dimensional features W = BPA(V) ∈ R^{n×n}. On one hand, this is a desired property, since it is natural that the feature dimensionality (and capacity) depends on the complexity of the task, which typically grows with the number of features (think of the inter-relations, which are more complex to model). On the other hand, it might pose a problem in situations in which the downstream calculation that follows expects a specific feature dimension, for example with a pre-trained non-convolutional layer.

In order to make BPA usable in such cases, we propose an attention-like variant, BPA Attn, in which the normalized BPA matrix is used to balance the input features without changing their dimension, by simple multiplication, i.e. BPA Attn(V) = BPA(V) · V. This variant allows maintaining the original feature dimension d, or even a smaller dimension if desired, by applying dimension reduction on the original set of features prior to applying BPA Attn.
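A minimal sketch of the attention-like variant (again, not the paper's released code), reusing the bpa_transform sketch from Section 3.1:

    import torch

    def bpa_attn(V, lam=0.1, iters=5):
        # Re-weight the original features by the balanced BPA matrix,
        # keeping the original feature dimension d.
        W = bpa_transform(V, lam=lam, iters=iters)   # (n, n) BPA matrix (sketch above)
        return W @ V                                  # (n, d) attention-like output

Applying dimension reduction (e.g. PCA) to V before this call gives the reduced-dimension variants compared in Table 1.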
In Table 1, we examine few-shot classification accuracy on MiniImagenet (Vinyals et al., 2016), with downstream classification by ProtoNet (Snell et al., 2017). Each classification instance consists of 80 images, encoded to 640-dimensional features by a pre-trained resnet-12 network. ProtoNet works on either: (i) the original feature set V; (ii) its dimension-reduced versions, calculated by either PCA or Spectral Clustering (SC); (iii) vanilla BPA; or (iv) BPA Attn on original or reduced features. As can be observed, the best accuracies are achieved by vanilla BPA, but the attention provided by BPA is able to stabilize performance across the entire range of dimensions.

Table 1. Feature-dimension control strategies: Accuracy on 5-way 1-shot MiniImagenet. * marks the dimension of the original 640d pre-trained resnet-12 features. # marks the size of a batch that includes a single 5-way 1-shot 15-query task (80 = 5 · (1 + 15)), which is the output dimension of vanilla BPA. Best and second best results, per dimension, are in bold and italics.

input to ProtoNet / dim.    5      10     20     40     80#    640*
V (original)                -      -      -      -      -      64.6
PCA(V)                      66.2   65.7   64.4   64.1   64.3   -
SC(V)                       66.8   58.2   46.2   38.3   25.5   -
BPAp(V)                     -      -      -      -      71.2   -
BPAt(V)                     -      -      -      -      72.1   -
BPAp Attn(V)                -      -      -      -      -      69.1
BPAt Attn(V)                -      -      -      -      -      70.0
BPAp Attn(SC(V))            69.1   69.1   68.1   68.5   69.2   -
BPAp Attn(PCA(V))           67.1   67.8   67.5   67.6   67.8   -

Hyper-parameters and ablations: BPA has two hyper-parameters, which were chosen through cross-validation and kept fixed for each application over all datasets. The number of Sinkhorn iterations for computing the optimal transport plan was fixed to 5, and the entropy regularization parameter λ (Eq. (2), Sec. 3.1) was set to 0.1 for UIC and FSC and to 0.25 for Re-ID. In Appendix B we ablate these hyper-parameters as well as the scalability of BPA in terms of set-input size (Fig. 5) on few-shot classification, and in Appendix D we study its robustness to noise and feature dimensionality (Fig. 10) by a controlled synthetic clustering experiment.

6
Table 2. Few-Shot Classification (FSC) accuracy on MiniImagenet. Results are ordered by backbone (resnet-12, wrn-28-10, ViT small/base), each listing baseline methods and BPA variants. BPA improvements (the percentages in parentheses) are in comparison with each respective baseline hosting method (obtained by division). Bold and italics highlight best and second best results per backbone. T/I denotes transductive/inductive methods. (&) from Ziko et al. (2020); ($) from original paper; (#) our implementation.

method            T/I   network   5-way 1-shot     5-way 5-shot
ProtoNet(#)       I     ResNet    62.39            80.33
DeepEMD($)        I     ResNet    65.91            82.41
FEAT($)           I     ResNet    66.78            82.05
RENet($)          I     ResNet    67.60            82.58
ProtoNet-BPAp     T     ResNet    67.34 (+7.9%)    81.84 (+1.6%)
ProtoNet-BPAi     I     ResNet    64.36 (+3.1%)    81.82 (+1.8%)
ProtoNet-BPAt     T     ResNet    67.90 (+8.8%)    83.09 (+3.2%)
ProtoNet(&)       I     WRN       62.60            79.97
PTMap($)          T     WRN       82.92            88.80
SillNet($)        T     WRN       82.99            89.14
PTMap-SF($)       T     WRN       84.81            90.62
PTMap-BPAp        T     WRN       83.19 (+0.3%)    89.56 (+0.9%)
PTMap-BPAt        T     WRN       84.18 (+1.5%)    90.51 (+1.9%)
SillNet-BPAp      T     WRN       83.35 (+0.4%)    89.65 (+0.6%)
PTMap-SF-BPAp     T     WRN       85.59 (+0.9%)    91.34 (+0.8%)
PMF($)            I     ViT-s     93.10            98.00
PMF-BPAp          T     ViT-s     94.49 (+1.4%)    97.68 (-0.3%)
PMF-BPAi          I     ViT-s     92.70 (-0.4%)    98.00 (+0.0%)
PMF-BPAp          T     ViT-s     95.30 (+2.3%)    97.90 (-0.1%)
PMF($)            I     ViT-b     95.30            98.40
PMF-BPAp          T     ViT-b     95.90 (+0.6%)    98.30 (-0.1%)
PMF-BPAi          I     ViT-b     95.20 (-0.1%)    98.70 (+0.3%)
PMF-BPAt          T     ViT-b     96.3 (+1.0%)     98.5 (+0.1%)

Table 3. Few-Shot Classification (FSC) accuracy on CIFAR-FS.

method            T/I   network   5-way 1-shot     5-way 5-shot
PTMap($)          T     WRN       87.69            90.68
SillNet($)        T     WRN       87.73            91.09
PTMap-SF($)       T     WRN       89.39            92.08
PTMap-BPAp        T     WRN       87.37 (-0.4%)    91.12 (+0.5%)
SillNet-BPAp      T     WRN       87.30 (-0.5%)    91.40 (+0.3%)
PTMap-SF-BPAp     T     WRN       89.94 (+0.6%)    92.83 (+0.8%)
PMF($)            I     ViT-s     81.1             92.5
PMF-BPAp          T     ViT-s     84.7 (+4.4%)     92.8 (+0.3%)
PMF-BPAi          I     ViT-s     84.80 (+4.5%)    93.40 (+0.9%)
PMF-BPAt          T     ViT-s     88.90 (+9.6%)    93.80 (+1.4%)
PMF($)            I     ViT-b     84.30            92.20
PMF-BPAp          T     ViT-b     88.2 (+4.6%)     94.0 (+1.9%)
PMF-BPAi          I     ViT-b     87.10 (+3.3%)    94.70 (+2.7%)
PMF-BPAt          T     ViT-b     91.00 (+7.9%)    95.00 (+3.0%)

4. Results

In this section, we experiment with BPA on three applications: Few-Shot Classification (Sec. 4.1), Unsupervised Image Clustering (Sec. 4.2) and Person Re-Identification (Sec. 4.3). In each, we achieve state-of-the-art results, by merely using current state-of-the-art methods as hosting networks of the BPA transform. Perhaps more importantly, we demonstrate the flexibility and simplicity of applying BPA in these setups, with improvements in the entire range of testing, including different hosting methods, different feature embeddings of different-complexity backbones, and whether retraining the hosting network or just dropping-in BPA and performing standard inference. To show the simplicity of inserting BPA into hosting algorithms, we provide pseudocodes for each of the experiments in Appendix C.

4.1. Few-Shot Classification (FSC)

Our main experiment is a comprehensive evaluation on the standard few-shot classification benchmarks MiniImagenet (Vinyals et al., 2016) and CIFAR-FS (Bertinetto et al., 2019), with detailed results in Tables 2 and 3, respectively. We evaluate the performance of the proposed BPA, applying it to a variety of FSC methods, including the recent state-of-the-art (PTMap (Hu et al., 2020), SillNet (Zhang et al., 2021), PTMap-SF (Chen & Wang, 2021) and PMF (Hu et al., 2022)) as well as to conventional methods like the popular ProtoNet (Snell et al., 2017). While in the MiniImagenet evaluation we include a wide range of methods and backbones, in the CIFAR-FS evaluation we focus on the state-of-the-art methods and configurations.

For each evaluated 'hosting' method, we incorporate BPA into the pipeline as follows. Given an FSC instance, we transform the entire set of method-specific feature representations using BPA, in order to better capture relative information. The rest of the pipeline is resumed, allowing for both inference and training. Note that BPA flexibly fits into the FSC task, with no required knowledge or assumptions regarding the setting (# of ways, shots or queries); see the sketch at the end of this subsection.

The basic 'drop-in' BPAp consistently, and in many cases also significantly, improves the hosting method's performance, including the state-of-the-art, across all benchmarks and backbones, with accuracy improvements of around 3.5% and 1.5% on the 1- and 5-shot tasks. This improvement without re-training the embedding backbone shows BPA's effectiveness in capturing meaningful relationships between features in a very general sense. When re-training the hosting network with BPA inside, in an end-to-end fashion, BPAt provides further improvements, in almost every method, with averages of 5% and 3% on the 1- and 5-shot tasks.

While most of the leading methods are transductive, our inductive version, BPAi, can be seen to steadily improve on inductive methods like ProtoNet and PMF, without introducing transductive inference. This further emphasizes the generality and applicability of our method.
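The following is an illustrative sketch of the drop-in (BPAp) usage on a single FSC task, with a ProtoNet-style nearest-prototype classifier. Function names and shapes are assumptions for the example, and bpa_transform refers to the sketch in Section 3.1.

    import torch

    def protonet_with_bpa(support, query, support_labels, n_way, lam=0.1, iters=5):
        # support: (n_s, d), query: (n_q, d), support_labels: (n_s,) with values in [0, n_way).
        feats = torch.cat([support, query], dim=0)
        W = bpa_transform(feats, lam=lam, iters=iters)     # joint transform of the whole task
        s, q = W[: support.size(0)], W[support.size(0):]   # transformed support / query rows
        # class prototypes = mean of transformed support features, per class
        protos = torch.stack([s[support_labels == c].mean(dim=0) for c in range(n_way)])
        logits = -torch.cdist(q, protos)                   # negative Euclidean distances
        return logits.argmax(dim=1)                        # predicted class per query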
4.2. Unsupervised Image Clustering (UIC)

Next, we evaluate BPA in the unsupervised domain, using the unsupervised image clustering task, with the additional challenge of capturing the relations between features that were learned without labels. To do so, we adopt SPICE (Niu et al., 2022), a recent method that has shown phenomenal success in the field.

7
Table 4. Unsupervised Image Clustering (UIC) results on STL-10 (Coates et al., 2011), CIFAR-10 (Krizhevsky & Hinton, 2009) and CIFAR-100-20 (Krizhevsky & Hinton, 2009).

benchmark        STL-10                    CIFAR-10                  CIFAR-100-20
network          ACC     NMI     ARI       ACC     NMI     ARI       ACC     NMI     ARI
k-means          0.192   0.125   0.061     0.229   0.087   0.049     0.130   0.084   0.028
DAC              0.470   0.366   0.257     0.522   0.396   0.306     0.238   0.185   0.088
DSEC             0.482   0.403   0.286     0.478   0.438   0.340     0.255   0.212   0.110
IDFD             0.756   0.643   0.575     0.815   0.711   0.663     0.425   0.426   0.264
SPICEs           0.908   0.817   0.812     0.838   0.734   0.705     0.468   0.448   0.294
SPICE            0.938   0.872   0.870     0.926   0.865   0.852     0.538   0.567   0.387
SPICEs-BPAt      0.912   0.823   0.821     0.880   0.784   0.769     0.494   0.477   0.334
SPICE-BPAt       0.943   0.880   0.879     0.933   0.870   0.866     0.550   0.560   0.402

Table 5. Image Re-Identification (Re-ID) results on CUHK03 (Li et al., 2014) and Market-1501 (Zheng et al., 2015).

benchmark        CUHK03-det         CUHK03-lab         Market-1501
network          mAP     Rank-1     mAP     Rank-1     mAP     Rank-1
MHN              65.4    71.7       72.4    77.2       85.0    95.1
SONA             76.3    79.1       79.2    81.8       88.6    95.6
OSNet            67.8    72.3       –       –          84.9    94.8
Pyramid          74.8    78.9       76.9    81.8       88.2    95.7
TDB              72.9    75.7       75.6    77.7       85.7    94.3
TDBRK            87.1    87.1       89.1    89.0       94.0    95.3
TDB-BPAp         77.9    80.4       80.4    82.6       88.1    94.4
TDBRK-BPAp       87.9    88.0       89.5    89.8       94.0    95.0

In SPICE, training is divided into 3 phases: (i) unsupervised representation learning (using MoCo (He et al., 2019) over a resnet-34 backbone); (ii) clustering-head training, with the result termed SPICEs; and (iii) a joint training phase (using FixMatch (Sohn et al., 2020) over a wrn backbone), with the result termed SPICE.

We insert BPA into phase (ii), clustering-head training, as follows. Given a batch of representations, SPICE assigns class pseudo-labels to the nearest neighbors of the most probable samples (the k samples with the highest probability per class). In the original work, SPICE uses the dot-product of the MoCo features to find the neighbors. Instead, we transform each batch of MoCo features using BPA and use the same dot-product on the resulting informative BPA features to find a more reliable set of neighbors (see the sketch below). We experiment on 3 standard datasets, STL-10 (Coates et al., 2011), CIFAR-10 and CIFAR-100-20 (Krizhevsky & Hinton, 2009), while keeping all original SPICE implementation hyper-parameters unchanged. We report both SPICEs and SPICE results, as in the original work (Niu et al., 2022).
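An illustrative sketch (not the SPICE codebase) of the neighbor-selection change described above; bpa_transform refers to the sketch in Section 3.1 and k is the number of neighbors used for pseudo-labeling.

    import torch

    def reliable_neighbors(moco_feats, k, lam=0.1, iters=5):
        # Rank neighbors by the dot-product between BPA rows instead of raw features;
        # the rest of SPICE's pseudo-labeling logic is left unchanged.
        W = bpa_transform(moco_feats, lam=lam, iters=iters)   # (n, n) BPA features
        sim = W @ W.t()                                       # dot-products of BPA rows
        sim.fill_diagonal_(float('-inf'))                     # exclude each sample itself
        return sim.topk(k, dim=1).indices                     # k neighbor indices per sample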

Table 4 summarizes the experiment, in terms of clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). It is done for the two stages of SPICE, with and without BPA, along with several other baselines. The results show a significant improvement of SPICEs-BPAt over SPICEs (just by applying BPA to the learned features), with an average increase of 5% in NMI and 8% in ARI. The advantage brought by the insertion of BPA carries on to the joint-processing stage (BPAt over SPICEs), though with a smaller average increase of 0.1% in NMI and 2.2% in ARI, leading to new state-of-the-art results on these datasets. These results demonstrate the relevance of BPA to unsupervised feature learning setups and its possible potential for other applications in this area.

4.3. Person Re-Identification (Re-ID)

We explore the application of BPA to large-scale instances and datasets by considering the person re-identification task (Ye et al., 2021). Given a set of query images and a large set of gallery images, the task is to rank the similarities of each query against the entire gallery. This is typically done by learning specialized image features that are compared by Euclidean distances. BPA is used to replace such pre-computed image features by a well-balanced representation with strong relative information, which is jointly computed over the union of query and gallery features. BPA is applied on pre-trained TopDBNet (Quispe & Pedrini, 2020) resnet-50 features and tested on the large-scale Re-ID benchmarks CUHK03 (Li et al., 2014) (both 'detected' and 'labeled') as well as the Market-1501 (Zheng et al., 2015) set, reporting mAP (mean Average Precision) and Rank-1 metrics.

In Table 5, TDB and TDBRK are shorthands for using TopDBNet features, before and after re-ranking (Zhong et al., 2017). There is a consistent benefit in applying BPA to these state-of-the-art features, prior to the distance computations, with a significant average increase of over 5% in mAP and 4% in Rank-1 prior to re-ranking, and a modest increase of 0.5% in both measures after re-ranking. These results demonstrate that BPA can handle large-scale instances (with thousands of features) and successfully improve performance measures in such retrieval-oriented tasks.
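A sketch of this usage (illustrative only; bpa_transform from Section 3.1, with λ = 0.25 as reported for Re-ID): transform the union of query and gallery descriptors jointly, then rank by distances between the transformed rows.

    import torch

    def reid_ranking_with_bpa(query_feats, gallery_feats, lam=0.25, iters=5):
        # query_feats: (n_q, d), gallery_feats: (n_g, d)
        n_q = query_feats.size(0)
        feats = torch.cat([query_feats, gallery_feats], dim=0)
        W = bpa_transform(feats, lam=lam, iters=iters)   # joint (n_q+n_g, n_q+n_g) transform
        q, g = W[:n_q], W[n_q:]
        dists = torch.cdist(q, g)                        # (n_q, n_g) pairwise distances
        return dists.argsort(dim=1)                      # ranked gallery indices per query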
5. Conclusions, Limitations and Future Work

We presented a novel feature-embedding approach for set-input grouping-related tasks such as clustering, classification and retrieval. The proposed BPA feature-set transform is non-parametric, differentiable, efficient, easy to use and is shown to capture complex relations between the set-input items. Applying BPA to the tasks of few-shot classification, unsupervised image clustering and person re-identification, whether by insertion into a pre-trained network or by re-training the hosting network, has shown across-the-board improvements, setting new state-of-the-art results.

In future work, we plan to address current limitations and explore potential extensions. BPA is currently limited to producing features that represent relative information within the set items. It could possibly be applied to tokens (e.g. patches) of a single item (e.g. an image), similarly to transformers, perhaps dropping the equivariance property and utilizing spatial encoding, to improve non-relative representations. In addition, it could be useful for guiding contrastive self-supervised learning, where embeddings are trained by relative information of augmented views.


References He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Mo-
mentum contrast for unsupervised visual representation
Asano, Y., Rupprecht, C., and Vedaldi, A. Self-labelling via
learning. arXiv:1911.05722, 2019.
simultaneous clustering and representation learning. In
International Conference on Learning Representations Hu, S. X., Li, D., Stühmer, J., Kim, M., and Hospedales,
(ICLR), 2020. T. M. Pushing the limits of simple pipelines for few-shot
Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. learning: External data and fine-tuning make a difference.
Meta-learning with differentiable closed-form solvers. In In Proceedings of the IEEE/CVF conference on computer
International Conference on Learning Representations vision and pattern recognition (CVPR), 2022.
(ICLR), 2019.
Hu, Y., Gripon, V., and Pateux, S. Leveraging the fea-
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., ture distribution in transfer-based few-shot learning. In
and Joulin, A. Unsupervised learning of visual features arXiv:2006.03806, 2020.
by contrasting cluster assignments. Advances in neural
information processing systems (NeurIPS), 2020. Huang, G., Larochelle, H., and Lacoste-Julien, S. Are
few-shot learning benchmarks too simple? solving them
Chang, J., Wang, L., Meng, G., Xiang, S., and Pan, C. Deep without task supervision at test-time. arXiv:1902.08605,
adaptive image clustering. In Proceedings of the IEEE 2019.
International Conference on Computer Vision (ICCV),
2017. Kang, D., Kwon, H., Min, J., and Cho, M. Relational
embedding for few-shot classification. In Proceedings
Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, of the IEEE/CVF International Conference on Computer
J.-B. A closer look at few-shot classification. In Interna- Vision (ICCV), 2021.
tional Conference on Learning Representations (ICLR),
2018. Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan,
F. S., and Shah, M. Transformers in vision: A survey.
Chen, X. and Wang, G. Few-shot learning by integrating arXiv:2101.01169, 2021.
spatial and frequency representation. arXiv:2105.05348,
2021. Kolkin, N., Salavon, J., and Shakhnarovich, G. Style trans-
fer by relaxed optimal transport and self-similarity. In
Coates, A., Ng, A., and Lee, H. An analysis of single- Proceedings of the IEEE/CVF Conference on Computer
layer networks in unsupervised feature learning. In Pro- Vision and Pattern Recognition (CVPR), 2019.
ceedings of the fourteenth international conference on
artificial intelligence and statistics, 2011. Korman, S. and Avidan, S. Coherency sensitive hashing.
IEEE Transactions on Pattern Analysis and Machine In-
Cuturi, M. Sinkhorn distances: Lightspeed computation
telligence (PAMI), 2015.
of optimal transport. In Advances in Neural Information
Processing Systems (NeurIPS), 2013. Krizhevsky, A. and Hinton, G. Learning multiple layers of
Dhillon, G. S., Chaudhari, P., Ravichandran, A., and Soatto, features from tiny images. 2009.
S. A baseline for few-shot image classification. In Inter-
Kuhn, H. W. The hungarian method for the assignment
national Conference on Learning Representations (ICLR),
problem. Naval Research Logistics Quarterly, 2, 1955.
2020.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, Y. W. Set transformer: A framework for attention-based
M., Heigold, G., Gelly, S., et al. An image is worth permutation-invariant neural networks. In International
16x16 words: Transformers for image recognition at scale. Conference on Machine Learning (ICML), 2019.
arXiv:2010.11929, 2020.
Li, W., Zhao, R., Xiao, T., and Wang, X. Deepreid: Deep
Fey, M., Lenssen, J. E., Morris, C., Masci, J., and filter pairing neural network for person re-identification.
Kriege, N. M. Deep graph matching consensus. In Proceedings of the IEEE Conference on Computer
arXiv:2001.09621, 2020. Vision and Pattern Recognition (CVPR), 2014.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta- Maron, H., Litany, O., Chechik, G., and Fetaya, E. On
learning for fast adaptation of deep networks. In Interna- learning sets of symmetric elements. In International
tional Conference on Machine Learning (ICML), 2017. Conference on Machine Learning (ICML), 2020.


Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Fixmatch: Simplifying semi-supervised learning with
Analysis and an algorithm. Advances in neural informa- consistency and confidence. arXiv:2001.07685, 2020.
tion processing systems (NeurIPS), 2001.
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proes-
Niu, C., Shan, H., and Wang, G. Spice: Semantic pseudo- mans, M., and Van Gool, L. Scan: Learning to classify
labeling for image clustering. IEEE Transactions on images without labels. In European Conference on Com-
Image Processing (TIP), 2022. puter Vision (ECCV), 2020.

Pearson, K. On lines and planes of closest fit to systems Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
of points in space. The London, Edinburgh, and Dublin L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Atten-
philosophical magazine and journal of science, 1901. tion is all you need. In Advances in Neural Information
Processing Systems (NeurIPS), 2017.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet:
Deep learning on point sets for 3d classification and seg- Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K.,
mentation. In Proceedings of the IEEE Conference on and Wierstra, D. Matching networks for one shot learning.
Computer Vision and Pattern Recognition (CVPR), 2017. In Proceedings of the 30th International Conference on
Neural Information Processing Systems (NeurIPS), 2016.
Quispe, R. and Pedrini, H. Top-db-net: Top dropblock
for activation enhancement in person re-identification. Wang, J., Zhang, Z., Chen, M., Zhang, Y., Wang, C., Sheng,
International Conference on Pattern Recognition (ICPR), B., Qu, Y., and Xie, Y. Optimal transport for label-
2020. efficient visible-infrared person re-identification. In Pro-
ceedings of the European Conference on Computer Vision
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Lev- (ECCV), 2022.
skaya, A., and Shlens, J. Stand-alone self-attention in
vision models. Advances in Neural Information Process- Wang, R., Yan, J., and Yang, X. Learning combinatorial
ing Systems (NeurIPS), 2019. embedding networks for deep graph matching. In Pro-
ceedings of the IEEE/CVF international conference on
Ravi, S. and Larochelle, H. Optimization as a model for few- computer vision (ICCV), 2019.
shot learning. In International Conference on Learning
Representations (ICLR), 2017. Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H.,
Wei, W., and Pfister, T. Differentiable top-k with optimal
Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. Sink- transport. Advances in Neural Information Processing
formers: Transformers with doubly stochastic attention. Systems (NeurIPS), 2020.
In International conference on artificial intelligence and
statistics (AISTATS). PMLR, 2022. Ye, H.-J., Hu, H., Zhan, D.-C., and Sha, F. Few-shot learning
via embedding adaptation with set-to-set functions. In
Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Proceedings of the IEEE Conference on Computer Vision
Pascanu, R., Battaglia, P., and Lillicrap, T. A simple and Pattern Recognition (CVPR), 2020.
neural network module for relational reasoning. Advances
in Neural Information Processing Systems (NeurIPS), Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi,
2017. S. C. Deep learning for person re-identification: A survey
and outlook. IEEE Transactions on Pattern Analysis and
Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich, Machine Intelligence (PAMI), 2021.
A. Superglue: Learning feature matching with graph
neural networks. In Proceedings of the IEEE/CVF Con- Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B.,
ference on Computer Vision and Pattern Recognition Salakhutdinov, R. R., and Smola, A. J. Deep sets.
(CVPR), 2020. In Advances in Neural Information Processing Systems
(NeurIPS), 2017.
Shi, J. and Malik, J. Normalized cuts and image segmenta-
tion. IEEE Transactions on pattern analysis and machine Zass, R. and Shashua, A. Doubly stochastic normalization
intelligence (PAMI), 2000. for spectral clustering. Advances in neural information
processing systems (NeurIPS), 2006.
Snell, J., Swersky, K., and Zemel, R. Prototypical networks
for few-shot learning. In Advances in Neural Information Zhang, C., Cai, Y., Lin, G., and Shen, C. Deepemd: Few-
Processing Systems (NeurIPS), 2017. shot image classification with differentiable earth mover’s
distance and structured classifiers. In IEEE/CVF Con-
Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., ference on Computer Vision and Pattern Recognition
Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. (CVPR), 2020.


Zhang, H., Cao, Z., Yan, Z., and Zhang, C. Sill-net: Feature augmentation with separated illumination representation. arXiv:2102.03539, 2021.

Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. Scalable person re-identification: A benchmark. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

Zhong, Z., Zheng, L., Cao, D., and Li, S. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Zhou, K., Yang, Y., Cavallaro, A., and Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

Zhu, H. and Koniusz, P. Ease: Unsupervised discriminant subspace learning for transductive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Ziko, I. M., Dolz, J., Granger, E., and Ayed, I. B. Laplacian regularized few-shot learning. In International Conference on Machine Learning (ICML), 2020.

Appendix

The Appendix includes the following sections:

A. PyTorch-style BPA Implementation
B. Ablation Studies
C. BPA Insertion into Hosting Algorithms
D. Clustering on the Sphere - a Case Study

A. PyTorch-style BPA Implementation

We provide in Algorithm 1 a PyTorch-style implementation that fully aligns with the description in the paper as well as with our actual implementation that was used to execute all of the experiments. In Appendix C we further demonstrate the "insertions" of BPA into hosting methods, for each of our three main applications.

Notice mainly that: (i) the transform can easily be dropped in, using the simple one-line call X = BPA(X); (ii) it is fully differentiable (as Sinkhorn and the other basic operations are); (iii) the transform does not need to know (or even assume) anything about the number of features, their dimension, or the distribution statistics among classes (e.g. whether balanced or not).

It follows the simple steps of: (i) computing Euclidean self pairwise distances (using cosine similarities between unit-normalized input features); (ii) avoiding self-matching by placing infinity values on the distance matrix diagonal; (iii) applying a standard Sinkhorn procedure, given the distance matrix and the only 2 (hyper-)parameters with their fixed values: the entropy regularization parameter λ and the number of row/column iterative normalization steps (note that Sinkhorn by default maps between source and target vectors of ones); (iv) restoring the perfect self-matching probabilities of one along the diagonal.

Algorithm 1 BPA transform on a set of n features.
input: n × d matrix V    output: n × n matrix W

    def BPA(V):
        # compute self pairwise-distances
        D = 1 - pwise_cosine_sim(V / V.norm())
        # infinite self-distances on the diagonal
        D_inf = D.fill_diagonal(10e9)
        # compute the optimal transport plan
        W = Sinkhorn(D_inf, lambda=0.1, iters=5)
        # stretch affinities to [0, 1]
        W = W / W.max()
        # set self-affinity on the diagonal to 1
        return W.fill_diagonal(1)

B. Ablation Studies

B.1. Scalability (accuracy, runtime vs. input size)

Being a transductive module, the accuracy and efficiency of the BPA transform depend on the number of inputs that are processed as a batch. Recall that BPA is a drop-in addition that usually follows feature extraction and precedes further computation - e.g. k-means for clustering, or (often transductive) layers in FSC and Re-ID.

The Re-ID experiment is a good stress-test for BPA, since we achieve excellent results for batch sizes of up to ∼15K image descriptors. In terms of runtime, although BPA's complexity is quadratic in sample size, its own (self) runtime is empirically negligible compared to that of the processing that follows, in all applications tested.

Typical FSC task sizes ((shots+queries)·ways) are small: 100 = (5 + 15) · 5 at the largest. To concretely address this matter, we test a resnet-12 PTMap-BPAp on large-scale FSC, following (Dhillon et al., 2020), on the Tiered-Imagenet dataset and report accuracy for 1/5/10-shot (15-query) tasks for an increasing range of ways. The results, shown in Fig. 5, show that: (i) total runtime, where BPA is only a small contributor (compare black vs. yellow dashed line), increases gracefully (notice the log10 x-axis) even for extremely large FSC tasks of 4000 = (10 + 15) · 160 images; (ii) our accuracy scales as expected, following the observation in (Dhillon et al., 2020) that it changes logarithmically with ways (straight line in log-scale).
100
1 Shot
5 Shot 0.200

Running time per batch (sec)


80 10 Shot
0.175

0.150
Accuracy (%)

60
0.125

40 0.100
Running time PT-MAP-BPA
Running time PT-MAP 0.075
20 0.050

0.025
0
0.8 1.0 1.2 1.4 1.6 1.8
Ways (Log10)

Figure 5. BPA scaling in terms of accuracy and efficiency.

(Dhillon et al., 2020) that it changes logarithmically with


ways (straight line in log-scale).

B.2. Sinkhorn Iterations


In Table 6 we ablate the number of normalization iterations
in the Sinkhorn-Knopp (SK) (Cuturi, 2013) algorithm at
test-time. We measured accuracy on the validation set of
MiniImagenet (Vinyals et al., 2016), using ProtoNet-BPAp
(which is the non-fine-tuned drop-in version of BPA within
ProtoNet (Snell et al., 2017)). As was reported in prior works following (Cuturi, 2013), we empirically observe that a very small number of iterations provides rapid convergence, with diminishing returns for higher iteration counts. We observed similar behavior for other hosting methods, and therefore chose to use a fixed number of 5 iterations throughout the experiments.

Figure 6. Ablation of the entropy regularization parameter λ using the Few-Shot-Classification (FSC) task: considering different ‘ways’ (top) and different ‘shots’ (bottom). See text for details.
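To reproduce the qualitative behavior of Table 6 (below) outside of the FSC pipeline, a quick convergence check can be run with the sketch from Appendix A; the bpa_sketch module name and the random stand-in features are hypothetical, for illustration only.

    import torch
    from bpa_sketch import bpa   # hypothetical module holding the Appendix A sketch

    torch.manual_seed(0)
    V = torch.randn(100, 64)     # random stand-in features (not the benchmark data)

    prev = None
    for iters in [1, 2, 4, 8, 16]:
        W = bpa(V, iters=iters)
        if prev is not None:
            # how much the transformed features still change as iterations double
            print(f"iters={iters:2d}  max|dW|={(W - prev).abs().max().item():.2e}")
        prev = W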

Table 6. Sinkhorn iterations ablation study: See text for details.

method           iters   5-way 1-shot   5-way 5-shot
ProtoNet-BPAp      1        70.71          83.79
ProtoNet-BPAp      2        71.10          84.01
ProtoNet-BPAp      4        71.18          84.08
ProtoNet-BPAp      8        71.20          84.10
ProtoNet-BPAp     16        71.20          84.10

B.3. Sinkhorn Entropy Regularization λ

We measured the impact of using different values of the optimal-transport entropy regularization parameter λ (the main parameter of the Sinkhorn algorithm) on a variety of configurations (ways and shots) in Few-Shot-Classification (FSC) on MiniImagenet (Vinyals et al., 2016) in Fig. 6, as well as on the Person-Re-Identification (Re-ID) experiment on Market-1501 (Zheng et al., 2015) in Fig. 7. In both cases, the ablation was executed on the validation set.

For FSC, in Fig. 6, the top plot shows that the effect of the choice of λ is similar across tasks with a varying number of ways. The bottom plot shows the behavior as a function of λ across multiple shot-values, where the optimal value of λ can be seen to have a certain dependence on the number of shots. Recall that we chose to use a fixed value of λ = 0.1, which gives an overall good accuracy trade-off. Note that a further improvement could be achieved by picking the best values for the particular cases. Notice also the log-scale of the x-axes, showing that performance is rather stable around the chosen value.

For Re-ID, in Fig. 7, we experiment with a range of λ values on the validation set of the Market-1501 dataset. The results (shown for both mAP and rank-1 measures) reveal a strong resemblance to those of the FSC experiment in Fig. 6; however, the optimal choices for λ are slightly higher, which is consistent with the dependence on the number of shots, since the Re-ID tasks are typically large ones. We found that a value of λ = 0.25 gives good results across both datasets.

Figure 7. Ablation of the entropy regularization parameter λ using the Person-Re-Identification (Re-ID) task. Accuracy vs. λ, using the validation set of Market-1501 (Zheng et al., 2015) and considering both mAP and Rank-1 measures. See text for details.

B.4. BPA vs. Naive Baselines

In Fig. 8, we ablate different simple alternatives to BPA, with the PTMap (Hu et al., 2020) few-shot classifier as the ‘hosting’ method, using MiniImagenet (Vinyals et al., 2016). Each result is the average of 100 few-shot episodes, using a WRN-28-10 backbone feature encoder. In blue is the
baseline of applying no transform at all, using the original features. In orange - using BPA. In gray and yellow, respectively, are other naive ways of transforming the features, where the affinity matrix is only row-normalized (‘softmax’) or not normalized at all (‘cosine’) before taking its rows as the output features. It is empirically evident that only BPA outperforms the baseline consistently, which is due to the properties that we have proved regarding the transform.

Figure 8. Comparison of BPA to different baselines over different configurations in few-shot learning tasks over MiniImagenet (Vinyals et al., 2016). Created by measuring accuracy (y-axis) over a varying number of shots (x-axis), with fixed 5-ways and 15-queries. See text for details.
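For concreteness, the two naive baselines can be written next to BPA as follows; this is our reading of the description above (the function name is ours, and bpa refers to the hypothetical bpa_sketch module from Appendix A), not the exact ablation code.

    import torch
    import torch.nn.functional as F
    from bpa_sketch import bpa   # hypothetical module holding the Appendix A sketch

    def naive_affinity_features(V, mode):
        # 'cosine': rows of the raw cosine-affinity matrix;
        # 'softmax': the same rows, only row-normalized with a softmax.
        V = F.normalize(V, dim=1)
        A = V @ V.t()
        return A if mode == "cosine" else torch.softmax(A, dim=1)

    V = torch.randn(100, 64)
    feats_cosine  = naive_affinity_features(V, "cosine")
    feats_softmax = naive_affinity_features(V, "softmax")
    feats_bpa     = bpa(V)   # balanced over rows and columns jointly, unlike the baselines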
C. BPA Insertion into Hosting Algorithms

C.1. PTMap (Hu et al., 2020) (Few-Shot Classification)

We present the pseudo-code for utilizing BPA within the PTMap pipeline in Algorithm 2. The only alteration from the original implementation pertains to a single row (the BPA line), wherein the support and query sets are concatenated and transformed using BPA. This approach can be extended to a wide range of distance-based methodologies, thus providing a simple and versatile solution to a variety of applications.

Algorithm 2 PTMap training and inference
inputs: xs, xq       # support, query images
        ℓs, (ℓq)     # support, (query) labels
        fϕ           # pre-trained embedding network

fs = fϕ(xs), fq = fϕ(xq)                        # extract features
(fs ∪ fq) = BPA(fs ∪ fq)                        # BPA-transformed features
cj = (1/s) · Σ{f ∈ fs : ℓs(f) = j} f , ∀j       # init class centers
repeat:
  · Lij = ∥fi − cj∥², ∀i, fi ∈ fq               # feature-center dists
  · M = Sinkhorn(L, λ)                          # Sinkhorn soft assignments
  · cj ← cj + α(g(M, j) − cj), ∀j               # update centers
ℓ̂q(fi) = arg maxj M[i, j]                       # prediction per fi ∈ fq
if inference:
  return ℓ̂q                                     # query predictions
else (training):
  update fϕ by ∇ϕ C-Entropy(M, ℓq)              # gradient descent
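The following is a schematic PyTorch rendering of the inference part of Algorithm 2, under our own simplifying assumptions: bpa and sinkhorn are taken from the hypothetical bpa_sketch module of Appendix A, g(M, j) is instantiated as the M-weighted mean of the query features, and the step size and iteration count are illustrative rather than the tuned PT-MAP values.

    import torch
    from bpa_sketch import bpa, sinkhorn   # hypothetical module holding the Appendix A sketch

    def ptmap_bpa_inference(x_s, y_s, x_q, f_phi, n_way, alpha=0.2, lam=0.1, steps=20):
        # x_s/x_q: support/query image batches; y_s: support labels in {0, ..., n_way-1};
        # f_phi: a pre-trained embedding network (torch.nn.Module).
        f_s, f_q = f_phi(x_s), f_phi(x_q)                      # extract features
        f_all = bpa(torch.cat([f_s, f_q], dim=0))              # BPA-transformed features (rows of W)
        f_s, f_q = f_all[: len(x_s)], f_all[len(x_s):]

        # initialize class centers as per-class means of the support features
        c = torch.stack([f_s[y_s == j].mean(dim=0) for j in range(n_way)])

        for _ in range(steps):
            L = torch.cdist(f_q, c) ** 2                       # query-to-center squared distances
            M = sinkhorn(L, lam=lam)                           # balanced soft assignments
            new_c = (M.t() @ f_q) / M.sum(dim=0).unsqueeze(1)  # M-weighted means of the queries
            c = c + alpha * (new_c - c)                        # move the centers toward them

        return M.argmax(dim=1)                                 # hard label prediction per query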
C.2. SPICE (Niu et al., 2022) (Unsupervised Clustering)

In our implementation of SPICE, as detailed in the paper, we utilize BPA during phase 2 of the algorithm (clustering-head training). Specifically, as depicted in Algorithm 3, we transform the features using BPA, batch-wise, before conducting a nearest-neighbor search. Afterwards, we retrieve the pseudo-labels and resume with the original features, as in the original implementation.

Algorithm 3 SPICE training
Phase (i): pre-train embedding network fϕ
Phase (ii): train clustering network cθ
  repeat per batch x:
  · f = fϕ(x)                                       # extract features
  · fBPA = BPA(f)                                   # BPA-transformed features
  · Find 3 most confident samples per cluster (use f)
  · Compute cluster centers as their means (use fBPA)
  · Find nearest-neighbors of each center (use fBPA)
  · Assign them to the cluster (as pseudo-labels)
  · Use pseudo-labels to train (update) cθ
Phase (iii): jointly fine-tune fϕ and cθ

C.3. TopDBNet (Quispe & Pedrini, 2020) (Person ReID)

Finally, Algorithm 4 illustrates the application of BPA during inference in the context of Person ReID. Typically, the query identity search within the gallery involves identifying the nearest sample to each query. In our implementation, we adopt the same methodology, with the additional step
of transforming the concatenated set of query and gallery features, using the BPA transform prior to the search.

Algorithm 4 TopDBNet inference
inputs: xg, xq       # gallery images, query images
        fϕ           # pre-trained embedding network

fg = fϕ(xg), fq = fϕ(xq)                        # extract features
(fg ∪ fq) = BPA(fg ∪ fq)                        # transform them with BPA
return argmin{j : fj ∈ fg} ∥fi − fj∥, for every {i : fi ∈ fq}   # closest gallery feature per query
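Below is a short PyTorch sketch of this BPA-augmented search; the function name is ours and bpa again refers to the hypothetical bpa_sketch module of Appendix A, so treat it as an illustration of Algorithm 4 rather than the TopDBNet code.

    import torch
    from bpa_sketch import bpa   # hypothetical module holding the Appendix A sketch

    def bpa_gallery_search(f_q, f_g):
        # f_q: (nq, d) query features, f_g: (ng, d) gallery features.
        W = bpa(torch.cat([f_q, f_g], dim=0))    # (nq+ng, nq+ng) BPA features
        q, g = W[: len(f_q)], W[len(f_q):]       # rows of W act as the new descriptors
        dists = torch.cdist(q, g)                # query-to-gallery distances
        return dists.argmin(dim=1)               # closest gallery index per query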

D. Clustering on the Sphere - a Case Study


We demonstrate the effectiveness of BPA using a
controlled synthetically generated clustering experiment,
with k = 10 cluster centers that are distributed uniformly at
random on a d-dimensional unit-sphere, and 20 points per
cluster (200 in total) that are perturbed around the cluster
centers by Gaussian noise of increasing standard deviation,
of up to 0.75, followed by a re-projection back to the sphere
by dividing each vector by its L2 magnitude. See Fig. 9
for a visualization of the 3D case, for several noise STDs.
Following the random data generation, we also apply dimen-
sionality reduction with PCA to d = 50, if d > 50.
We performed the experiment over a logarithmic 2D grid
of combinations of data dimensionalities d in the range
[10, 1234] and Gaussian in-cluster noise STD in the range
[0.1, 0.75]. Each point is represented by its d-dimensional
coordinates vector, where the baseline clustering is obtained
by running k-means on these location features. In addition,
we run k-means on the set of features that has undergone
BPA. Hence, the benefits of the transform (embedding) are measured indirectly, through the accuracy² achieved by running k-means on the embedded vs. the original vectors.
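A compact sketch of this protocol is given below, for a single illustrative grid point (d = 200, noise STD 0.4), using numpy/scikit-learn utilities and the bpa helper from the hypothetical bpa_sketch module of Appendix A; the actual experiment sweeps the full grid and averages over 10 runs.

    import numpy as np
    import torch
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
    from bpa_sketch import bpa   # hypothetical module holding the Appendix A sketch

    def make_sphere_clusters(d, sigma, k=10, per_cluster=20, seed=0):
        # k centers uniformly distributed on the unit sphere (normalized Gaussians),
        # Gaussian in-cluster noise of std sigma, re-projected back onto the sphere.
        rng = np.random.default_rng(seed)
        centers = rng.normal(size=(k, d))
        centers /= np.linalg.norm(centers, axis=1, keepdims=True)
        X = np.repeat(centers, per_cluster, axis=0) + sigma * rng.normal(size=(k * per_cluster, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        y = np.repeat(np.arange(k), per_cluster)
        return X, y

    d, sigma = 200, 0.4
    X, y = make_sphere_clusters(d, sigma)
    if d > 50:
        X = PCA(n_components=50).fit_transform(X)

    W = bpa(torch.from_numpy(X).float()).numpy()   # BPA features: rows of the affinity matrix

    for name, feats in [("original", X), ("BPA", W)]:
        pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats)
        print(f"{name:8s}  NMI={normalized_mutual_info_score(y, pred):.3f}"
              f"  ARI={adjusted_rand_score(y, pred):.3f}")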
Evaluation results, in terms of Normalized Mutual Informa-
tion (NMI) and Adjusted Rand Index (ARI), are reported
in Fig. 10, averaged over 10 runs, as a function of either
dimensionality (for different noise STDs) or noise STDs
(for different dimensionalities). The results show: (i) general gains and robustness over wide ranges of data dimensionality; and (ii) the ability of BPA to find meaningful representations that enable the clustering quality to degrade gracefully with the increase in the cluster noise level. Note that the levels of noise are rather high, as they are relative to a unit-radius sphere.

² Accuracy is measured by comparison with the optimal permutation of the predicted labels, found by the Hungarian Algorithm (Kuhn, 1955).


Figure 9. Clustering on the sphere: Data Generation. 10 random cluster centers on the unit sphere, perturbed by increasing noise STD.

Figure 10. Clustering on the sphere: Detailed Results. Clustering measures (top: ARI, bottom: NMI) of k-means, using BPA features
(dashed lines) vs. original features (solid lines). For both measures - the higher the better. Shown over different configurations of feature
dimensions d (left) and noise levels σ (right).