The Balanced-Pairwise-Affinities Feature Transform
Figure 1. Generic designs of networks that act on sets of items. These cover relevant architectures, e.g. for few-shot classification and clustering. Left: A generic network for processing a set of input items typically follows the depicted structure: (i) each item separately goes through a common feature extractor F; (ii) the set of extracted features is the input to a downstream task processing module G. Right: A more general structure, in which the extracted features undergo joint processing by a transform T. Our BPA transform (as well as other attention mechanisms) is of this type and its high-level design (within the 'green' module) is detailed in Fig. 2.
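As a concrete rendering of the 'Right' design, the following PyTorch-style sketch (ours; all module names are illustrative placeholders, not from the paper's code) shows where a joint transform T sits between a per-item feature extractor F and a downstream head G:

import torch
import torch.nn as nn

class SetPipeline(nn.Module):
    # Generic set-input pipeline of Fig. 1 (Right): G(T(F(X))).
    def __init__(self, extractor: nn.Module, transform, head: nn.Module):
        super().__init__()
        self.extractor = extractor  # F: applied to each item separately
        self.transform = transform  # T: acts jointly on the whole feature set
        self.head = head            # G: downstream task processing module

    def forward(self, items: torch.Tensor) -> torch.Tensor:
        feats = self.extractor(items)  # (n, d) - one feature row per item
        feats = self.transform(feats)  # joint processing, e.g. the BPA transform
        return self.head(feats)

Setting transform to the identity recovers the 'Left' design.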
...ized (Sander et al., 2022) and other matching (Sarlin et al., 2020) or classification (Hu et al., 2020) algorithms, where optimal-transport plans are computed between source items and target items or class centers. However, the most important difference, and our main novel observation, is that the self fractional matching itself (which can be viewed as a balanced affinity matrix) can serve as a powerful embedding, since the distances in this space (between assignment vectors) have explicit interpretations that we explore, which are highly beneficial to general grouping-based algorithms that are applied to such set-input tasks.

Contribution

We propose a parameter-less optimal-transport based feature transform, termed BPA, which can be used as a drop-in addition that converts a generic feature extraction scheme to one that is well suited to set-input tasks (e.g. from Figure 1 Left to Right). It is analyzed and shown to have the following attractive set of qualities: (i) efficiency: real-time inference; (ii) differentiability: allowing end-to-end training of the entire 'embedding-transform-inference' pipeline of Fig. 1 Right; (iii) equivariance: ensuring that the embedding works coherently under any order of the input items; (iv) probabilistic interpretation: each embedded feature encodes its distribution of affinities to all other features, by conforming to a doubly-stochastic constraint; (v) valuable metrics for the item set: distances between embedded vectors include both direct and indirect (third-party) similarity information between input features.

Empirically, we show BPA's flexibility and ease of application to a wide variety of tasks, by incorporating it in leading methods of each type. We test different configurations, such as whether the hosting network is pre-trained or re-trained with BPA inside, across different backbones, whether transductive or inductive. Few-shot classification is our main application, with extensive experimentation on standard benchmarks; testing on unsupervised image clustering shows the potential of BPA in the unsupervised domain; and the person re-identification experiments show how BPA deals with non-curated large-scale tasks. In all three applications, over the different setups and datasets, BPA consistently improves its hosting methods, achieving new state-of-the-art results.

2. Relation to Prior Work

2.1. Related Techniques

Set-to-Set (or Set-to-Feature) Functions have been developed to act jointly on a set of items (typically features) and output an updated set (or a single feature), which is used for downstream inference tasks. Deep-Sets (Zaheer et al., 2017) formalized fundamental requirements from architectures that process sets. Point-Net (Qi et al., 2017) presented an influential design for learning local and global features on 3D point-clouds, while Maron et al. (2020) study the design of equi/in-variant layers. Unlike BPA, the joint processing in these methods is limited, amounting to weight-sharing between separate processes and joint aggregations.

Attention Mechanisms. The introduction of Relational Networks (Santoro et al., 2017) and Transformers (Vaswani et al., 2017), with their initial applications in vision models (Ramachandran et al., 2019), has led to the huge impact of Vision Transformers (ViTs) (Dosovitskiy et al., 2020) in many vision tasks (Khan et al., 2021). While BPA can be seen as a self-attention module, it is very different: first, it is parameterless, and hence can work at test-time on a pre-trained network; in addition, it can provide an explicit probabilistic global interpretation of the instance data.

Spectral Methods have been widely used as simple transforms, applied to data that needs to undergo grouping- or search-based operations, that jointly process the set of features, resulting in a compact and perhaps discriminative representation. PCA (Pearson, 1901) provides a joint dimension reduction, which maximally preserves data variance, but does not necessarily improve feature affinities for downstream tasks. Spectral Clustering (SC) (Shi & Malik, 2000; Ng et al., 2001) is the leading non-learnable clustering method in use in the field.
Figure 2. The BPA transform: illustrated on a toy 7-image, 3-class MNIST example.
If we ignore its final clustering stage, SC consists of forming a pairwise affinity matrix, which is normalized (Zass & Shashua, 2006) before extracting its leading eigenvectors, which form the final embedding. BPA is also based on normalizing an affinity matrix, but it uses this matrix's rows as embedded features and avoids any further spectral decompositions, which are costly and difficult to differentiate through.

Optimal Transport (OT) problems are directly related to measuring distances between distributions or sets of features. Cuturi (2013) popularized the Sinkhorn algorithm, a simple, differentiable and fast approximation of entropy-regularized OT, which has since been used extensively, for clustering (Lee et al., 2019; Asano et al., 2020), few-shot classification (Huang et al., 2019; Ziko et al., 2020; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021; Zhu & Koniusz, 2022), matching (Wang et al., 2019; Fey et al., 2020; Sarlin et al., 2020), representation learning (Caron et al., 2020; Asano et al., 2020), retrieval (Xie et al., 2020), person re-identification (Wang et al., 2022), style-transfer (Kolkin et al., 2019) and attention (Sander et al., 2022).

Our approach also builds on some attractive properties of the Sinkhorn solver. While our usage of Sinkhorn is extremely simple (see Algorithm 1), it is fundamentally different from all other OT usages we are aware of, since: (i) we compute the transport-plan between a set of features and itself - not between feature-sets and label/class-prototypes (Huang et al., 2019; Ziko et al., 2020; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021; Zhu & Koniusz, 2022; Lee et al., 2019; Asano et al., 2020; Xie et al., 2020; Wang et al., 2022; Kolkin et al., 2019), or between two different feature-sets (Wang et al., 2019; Fey et al., 2020; Sarlin et al., 2020; Sander et al., 2022); (ii) while others use the transport-plan to obtain distances or associations between features and features/classes, we use its own rows as new feature vectors for downstream tasks.

2.2. Instance-Specific Applications

Few-Shot Classification (FSC) is a branch of few-shot learning in which a classifier learns to recognize previously unseen classes given a limited number of labeled examples. In the meta-learning approach, the training data is split into tasks (or episodes) mimicking the test-time tasks to which the learner is required to generalize. MAML (Finn et al., 2017) "learns to fine-tune" by learning a network initialization from which it can quickly adapt to novel classes. In ProtoNet (Snell et al., 2017), a learner is meta-trained to predict query feature classes, based on distances from support class-prototypes in the embedding space. The trainable version of BPA can be viewed as a meta-learning algorithm.

Subsequent works (Chen et al., 2018; Dhillon et al., 2020) advocate using larger and more expressive backbones, employing transductive inference, which fully exploits the data at inference, including unlabeled images. BPA is transductive, but does not make assumptions on (nor needs to know) the number of classes (ways) or items per class (shots), as it executes a general probabilistic grouping action.

Recently, attention mechanisms were shown to be effective for FSC (Kang et al., 2021; Zhang et al., 2020; Ye et al., 2020) and a number of works (Ziko et al., 2020; Huang et al., 2019; Hu et al., 2020; Zhang et al., 2021; Chen & Wang, 2021) have adopted Sinkhorn (Cuturi, 2013) as a parameterless unsupervised classifier that computes matchings between query embeddings and class centers. Sill-Net (Zhang et al., 2021), which augments training samples with illumination features, and PTMap-SF (Chen & Wang, 2021), which proposes a DCT-based feature embedding, are both based on PTMap (Hu et al., 2020). The state-of-the-art PMF (Hu et al., 2022) proposed a 3-stage pipeline of pre-training on external data, meta-training with labelled tasks, and fine-tuning on unseen tasks. BPA can be incorporated into these methods, immediately after their feature extraction stage.

Unsupervised Image Clustering (UIC) is the task of grouping related images, without any label information, into representative clusters. Naturally, the ability to measure the similarities among samples is a crucial aspect of UIC. Recent methods have achieved tremendous progress in this task, towards closing the gap with supervised counterparts. The leading approaches directly learn to map images to labels, by constraining the training of an unsupervised classification model with different types of indirect loss functions. Prominent works in this area include DAC (Chang et al.,
2017), which recasts the clustering problem into a binary pairwise-classification framework, and SCAN (Van Gansbeke et al., 2020), which builds on a pre-trained encoder that provides nearest-neighbor based constraints for training a classifier. The recent state-of-the-art SPICE (Niu et al., 2022) is a pseudo-labeling based method, which divides the clustering network into a feature model for measuring the instance-level similarity and a clustering head for identifying the cluster-level discrepancy.

Person Re-Identification (Re-ID) is the task of identifying a certain person (identity) between multiple detected pedestrian images, from different non-overlapping cameras. It is challenging due to the scale of the problem and the large variation in pose, background and illumination. See Ye et al. (2021) for an excellent comprehensive survey on the topic. Among the most popular methods are OSNet (Zhou et al., 2019), which developed an efficient small-scale network with high performance, and DropBlock (Top-DB-Net) (Quispe & Pedrini, 2020), which achieved state-of-the-art results by dropping a region block in the feature map for attentive learning. The Re-ID task is typically of a larger scale - querying thousands of identities against a target of tens of thousands. Also, the data is much more real-world compared to the carefully curated FSC sets.

We suggest a novel re-embedding of the feature set V, using a transform that we denote by T, in order to obtain a new set of features W = T(V), where W ∈ R^{n×n}. The new feature set W has an explicit probabilistic interpretation, which is specifically suited for tasks related to classification, matching or grouping of items in the input set X. In particular, W will be a symmetric, doubly-stochastic matrix (non-negative, with rows and columns that sum to 1), where the entry w_ij (for i ≠ j) encodes the belief that items x_i and x_j belong to the same class or cluster.

The proposed transform T : R^{n×d} → R^{n×n} (see Fig. 2) acts on the original feature set V as follows. It begins by computing the squared Euclidean pairwise distances matrix D, namely d_ij = ||v_i − v_j||^2, which can be computed efficiently as d_ij = 2(1 − cos(v_i, v_j)) = 2(1 − v_i v_j^T), when the rows of V are unit-normalized. Or, in compact form, D = 2(1 − S), where 1 is the all-ones n × n matrix and S = V V^T is the cosine affinity matrix of V.

W will be computed as the optimal-transport (OT) plan matrix between the n-dimensional all-ones vector 1_n and itself, under the self cost matrix D_∞, which is the distance matrix D with a very (infinitely) large scalar replacing each of the entries on its diagonal (which were all zero). This enforces the affinities of each feature to distribute among the others. Explicitly, let D_∞ = D + αI, where α is a very (infinitely) large constant and I is the n × n identity matrix.
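To make the construction concrete, here is a minimal PyTorch sketch of computing D and D_∞ (a sketch under the unit-normalization assumption above; function and variable names are ours):

import torch

def pairwise_sq_dists(V: torch.Tensor) -> torch.Tensor:
    # Squared Euclidean pairwise distances via d_ij = 2(1 - cos(v_i, v_j)).
    V = torch.nn.functional.normalize(V, dim=1)  # unit-normalize the rows
    S = V @ V.T                                  # cosine affinity matrix S
    return 2.0 * (1.0 - S)                       # D = 2(1 - S)

V = torch.randn(7, 64)                   # e.g. n = 7 features of dimension d = 64
D = pairwise_sq_dists(V)
D_inf = D + 1e9 * torch.eye(V.shape[0])  # large constant stands in for infinity

The large diagonal constant plays the role of α, preventing any mass from being assigned to self-matches.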
W is then obtained as the solution of the optimal-transport problem:

\[
\min_{W} \; \langle D_\infty, W \rangle \quad \text{s.t.} \quad W\,\mathbf{1}_n = W^{T}\mathbf{1}_n = \mathbf{1}_n
\tag{3}
\]

which can be seen to be the same as:

\[
\min_{W} \; \langle D, W \rangle \quad \text{s.t.} \quad W\,\mathbf{1}_n = W^{T}\mathbf{1}_n = \mathbf{1}_n, \;\; w_{ii} = 0 \;\; \text{for } i = 1, \ldots, n
\tag{4}
\]

since the use of the infinite weights on the diagonal of D_∞ is equivalent to using the original D with a constraint of zeros along the diagonal of W.

The optimization problem in Equation (4) is in fact a fractional matching instance between the set of n original features and itself. It can be posed as a bipartite-graph min-cost max-flow instance (the problem of finding a min-cost flow out of all max-flow solutions), as depicted in Fig. 3. The graph has n nodes on each side, representing the original features {v_i}_{i=1}^{n} (the rows of V). Across the two sides, the cost of the edge (v_i, v_j) is the distance d_ij, and edges of the type (v_i, v_i) have a cost of infinity (or can simply be removed). Each 'left' node is connected to a 'source' node S by an edge of cost 0 and, similarly, each 'right' node is connected to a 'target' (sink) node T. All edges in the graph have a capacity of 1 and the goal is to find an optimal fractional self-matching, by finding a min-cost max-flow from source to sink. Note that the max-flow can easily be seen to be n, but a min-cost flow is sought among max-flows.

In this set-to-itself matching view, each vector v_i is fractionally matched to the set of all other vectors V − {v_i} based on the pairwise distances, but, importantly, taking into account the fractional matches of the rest of the vectors in order to satisfy the double-stochasticity constraint. The construction constrains the max flow to have a total outgoing flow of 1 from each 'left' node and a total incoming flow of 1 to each 'right' node. Therefore, the ith transformed feature ...

We can now point out some important properties of the proposed embedding, given by the rows of the matrix W. Some of these properties can be observed in the toy 3-class MNIST digit example, illustrated in Fig. 2.

Interpretability of distances in the embedded space: An important property of our embedding is that each embedded feature encodes its distribution of affinities to all other features. In particular, the comparison of embedded vectors w_i and w_j (of items i and j in a set) includes both direct and indirect information about the similarity between the features. Refer to Figure 4 for a detailed explanation of this property. Looking at the different coordinates k of the absolute difference vector a = |w_i − w_j|, BPA captures (i) direct affinity: for k which is either i or j, it holds that a_k = 1 − w_ij = 1 − w_ji.¹ This amount measures how high (close to 1) is the mutual belief of features i and j about one another; (ii) indirect (3rd-party) affinity: for k ∉ {i, j}, we have a_k = |w_ik − w_jk|, which is a comparison of the beliefs of features i and j regarding the (third-party) feature k. The double-stochasticity of the transformed feature-set ensures that the compared vectors are similarly scaled (as distributions, plus 1 on the diagonal) and the symmetry further enforces the equal relative affinity between pairs.

As an example, observe the output features 4 and 5 in Fig. 2, which re-embed the 'green' features of the digit '7' images. As desired, these embeddings are close in the target 7D space. The closeness is driven by both their closeness in the original space (coordinates 4 and 5) as well as by the agreement on specific large differences from other images. This property is responsible for better separation between classes in the target domain, which leads to improved performance on tasks like classification, clustering or retrieval.

¹ Note: (i) w_ii = w_jj = 1; (ii) w_ij = w_ji from the symmetry of W; (iii) all elements of W are ≤ 1, hence the |·| can be dropped.
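To see the decomposition in action, the following self-contained toy sketch (our illustration, not from the paper) produces an approximately doubly-stochastic W by Sinkhorn row/column normalization of a random symmetric affinity matrix, and then splits |w_i − w_j| into its direct and indirect parts:

import torch

def sinkhorn_normalize(K: torch.Tensor, iters: int = 50) -> torch.Tensor:
    # Alternating row/column normalization (Sinkhorn-Knopp) of a positive matrix.
    for _ in range(iters):
        K = K / K.sum(dim=1, keepdim=True)  # make rows sum to 1
        K = K / K.sum(dim=0, keepdim=True)  # make columns sum to 1
    return K

torch.manual_seed(0)
n = 6
A = torch.rand(n, n)
W = sinkhorn_normalize((A + A.T) / 2)  # approximately doubly-stochastic

i, j = 0, 1
a = (W[i] - W[j]).abs()                # the absolute difference vector
direct = a[[i, j]]                     # coordinates k in {i, j}: direct affinity
mask = torch.ones(n, dtype=torch.bool)
mask[[i, j]] = False
indirect = a[mask]                     # coordinates k outside {i, j}: 3rd-party terms
print('direct:', direct, 'indirect:', indirect)

(Unlike the actual BPA output, this toy W does not have ones on its diagonal, so the direct terms here are generic |w_ik − w_jk| values rather than exactly 1 − w_ij.)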
Table 2. Few-Shot Classification (FSC) accuracy on MiniImagenet. Results are ordered by backbone (resnet-12, wrn-28-10, ViT small/base), each listing baseline methods and BPA variants. BPA improvements (colored percentages) are in comparison with each respective baseline hosting method (obtained by division). Bold and italics highlight best and second best results per backbone. T/I denotes transductive/inductive methods. (&) from Ziko et al. (2020); ($) from original paper; (#) our implementation.

method T/I network 5-way 1-shot 5-way 5-shot
ProtoNet(#) I ResNet 62.39 80.33
DeepEMD($) I ResNet 65.91 82.41
FEAT($) I ResNet 66.78 82.05
RENet($) I ResNet 67.60 82.58
ProtoNet-BPAp T ResNet 67.34 (+7.9%) 81.84 (+1.6%)
ProtoNet-BPAi I ResNet 64.36 (+3.1%) 81.82 (+1.8%)
ProtoNet-BPAt T ResNet 67.90 (+8.8%) 83.09 (+3.2%)
ProtoNet(&) I WRN 62.60 79.97
PTMap($) T WRN 82.92 88.80
SillNet($) T WRN 82.99 89.14
PTMap-SF($) T WRN 84.81 90.62
PTMap-BPAp T WRN 83.19 (+0.3%) 89.56 (+0.9%)
PTMap-BPAt T WRN 84.18 (+1.5%) 90.51 (+1.9%)
SillNet-BPAp T WRN 83.35 (+0.4%) 89.65 (+0.6%)
PTMap-SF-BPAp T WRN 85.59 (+0.9%) 91.34 (+0.8%)
PMF($) I ViT-s 93.10 98.00
PMF-BPAp T ViT-s 94.49 (+1.4%) 97.68 (-0.3%)
PMF-BPAi I ViT-s 92.70 (-0.4%) 98.00 (+0.0%)
PMF-BPAt T ViT-s 95.30 (+2.3%) 97.90 (-0.1%)
PMF($) I ViT-b 95.30 98.40
PMF-BPAp T ViT-b 95.90 (+0.6%) 98.30 (-0.1%)
PMF-BPAi I ViT-b 95.20 (-0.1%) 98.70 (+0.3%)
PMF-BPAt T ViT-b 96.3 (+1.0%) 98.5 (+0.1%)

Table 3. Few-Shot Classification (FSC) accuracy on CIFAR-FS.

method T/I network 5-way 1-shot 5-way 5-shot
PTMap($) T WRN 87.69 90.68
SillNet($) T WRN 87.73 91.09
PTMap-SF($) T WRN 89.39 92.08
PTMap-BPAp T WRN 87.37 (-0.4%) 91.12 (+0.5%)
SillNet-BPAp T WRN 87.30 (-0.5%) 91.40 (+0.3%)
PTMap-SF-BPAp T WRN 89.94 (+0.6%) 92.83 (+0.8%)
PMF($) I ViT-s 81.1 92.5
PMF-BPAp T ViT-s 84.7 (+4.4%) 92.8 (+0.3%)
PMF-BPAi I ViT-s 84.80 (+4.5%) 93.40 (+0.9%)
PMF-BPAt T ViT-s 88.90 (+9.6%) 93.80 (+1.4%)
PMF($) I ViT-b 84.30 92.20
PMF-BPAp T ViT-b 88.2 (+4.6%) 94 (+1.9%)
PMF-BPAi I ViT-b 87.10 (+3.3%) 94.70 (+2.7%)
PMF-BPAt T ViT-b 91.00 (+7.9%) 95.00 (+3.0%)

4. Results

In this section, we experiment with BPA on three applications: Few-Shot Classification (Sec. 4.1), Unsupervised Image Clustering (Sec. 4.2) and Person Re-Identification (Sec. 4.3). In each, we achieve state-of-the-art results, by merely using current state-of-the-art methods as hosting networks of the BPA transform. Perhaps more importantly, we demonstrate the flexibility and simplicity of applying BPA in these setups, with improvements over the entire range of testing, including different hosting methods, different feature embeddings of different-complexity backbones, and whether retraining the hosting network or just dropping in BPA and performing standard inference. To show the simplicity of inserting BPA into hosting algorithms, we provide pseudocodes for each of the experiments in Appendix C.

4.1. Few-Shot Classification (FSC)

Our main experiment is a comprehensive evaluation on the standard few-shot classification benchmarks MiniImagenet (Vinyals et al., 2016) and CIFAR-FS (Bertinetto et al., 2019), with detailed results in Tables 2 and 3 respectively. We evaluate the performance of the proposed BPA, applying it to a variety of FSC methods, including the recent state-of-the-art (PTMap (Hu et al., 2020), SillNet (Zhang et al., 2021), PTMap-SF (Chen & Wang, 2021) and PMF (Hu et al., 2022)) as well as to conventional methods like the popular ProtoNet (Snell et al., 2017). While in the MiniImagenet evaluation we include a wide range of methods and backbones, in the CIFAR-FS evaluation we focus on the state-of-the-art methods and configurations.

For each evaluated 'hosting' method, we incorporate BPA into the pipeline as follows. Given an FSC instance, we transform the entire set of method-specific feature representations using BPA, in order to better capture relative information. The rest of the pipeline is resumed, allowing for both inference and training. Note that BPA flexibly fits into the FSC task, with no required knowledge or assumptions regarding the setting (# of ways, shots or queries).

The basic 'drop-in' BPAp consistently, and in many cases also significantly, improves the hosting method performance, including state-of-the-art, across all benchmarks and backbones, with accuracy improvements of around 3.5% and 1.5% on the 1- and 5-shot tasks. This improvement without retraining the embedding backbone shows BPA's effectiveness in capturing meaningful relationships between features in a very general sense. When re-training the hosting network with BPA inside, in an end-to-end fashion, BPAt provides further improvements, in almost every method, with averages of 5% and 3% on the 1- and 5-shot tasks.

While most of the leading methods are transductive, our inductive version, BPAi, can be seen to steadily improve on inductive methods like ProtoNet and PMF, without introducing transductive inference. This further emphasizes the generality and applicability of our method.

4.2. Unsupervised Image Clustering (UIC)

Next, we evaluate BPA in the unsupervised domain, using the unsupervised image clustering task, with the additional challenge of capturing the relation between features that were learned without labels. To do so, we adopt SPICE (Niu et al., 2022), a recent method that has shown phenomenal success in the field. In SPICE, training is divided into 3 phases: (i) unsupervised representation learning (using
Table 4. Unsupervised Image Clustering (UIC) results on STL-10 (Coates et al., 2011), CIFAR-10 (Krizhevsky & Hinton, 2009) and CIFAR-100-20 (Krizhevsky & Hinton, 2009).

benchmark STL-10 CIFAR-10 CIFAR-100-20
network ACC NMI ARI ACC NMI ARI ACC NMI ARI
k-means 0.192 0.125 0.061 0.229 0.087 0.049 0.130 0.084 0.028
DAC 0.470 0.366 0.257 0.522 0.396 0.306 0.238 0.185 0.088
DSEC 0.482 0.403 0.286 0.478 0.438 0.340 0.255 0.212 0.110
IDFD 0.756 0.643 0.575 0.815 0.711 0.663 0.425 0.426 0.264
SPICEs 0.908 0.817 0.812 0.838 0.734 0.705 0.468 0.448 0.294
SPICE 0.938 0.872 0.870 0.926 0.865 0.852 0.538 0.567 0.387
SPICEs-BPAt 0.912 0.823 0.821 0.880 0.784 0.769 0.494 0.477 0.334
SPICE-BPAt 0.943 0.880 0.879 0.933 0.870 0.866 0.550 0.560 0.402

Table 5. Image Re-Identification (Re-ID) results on CUHK03 (Li et al., 2014) and Market-1501 (Zheng et al., 2015).

benchmark CUHK03-det CUHK03-lab Market-1501
network mAP Rank-1 mAP Rank-1 mAP Rank-1
MHN 65.4 71.7 72.4 77.2 85.0 95.1
SONA 76.3 79.1 79.2 81.8 88.6 95.6
OSNet 67.8 72.3 – – 84.9 94.8
Pyramid 74.8 78.9 76.9 81.8 88.2 95.7
TDB 72.9 75.7 75.6 77.7 85.7 94.3
TDBRK 87.1 87.1 89.1 89.0 94.0 95.3
TDB-BPAp 77.9 80.4 80.4 82.6 88.1 94.4
TDBRK-BPAp 87.9 88.0 89.5 89.8 94.0 95.0
References

Asano, Y., Rupprecht, C., and Vedaldi, A. Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations (ICLR), 2020.

Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations (ICLR), 2019.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Chang, J., Wang, L., Meng, G., Xiang, S., and Pan, C. Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. A closer look at few-shot classification. In International Conference on Learning Representations (ICLR), 2018.

Chen, X. and Wang, G. Few-shot learning by integrating spatial and frequency representation. arXiv:2105.05348, 2021.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.

Dhillon, G. S., Chaudhari, P., Ravichandran, A., and Soatto, S. A baseline for few-shot image classification. In International Conference on Learning Representations (ICLR), 2020.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.

Fey, M., Lenssen, J. E., Morris, C., Masci, J., and Kriege, N. M. Deep graph matching consensus. arXiv:2001.09621, 2020.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722, 2019.

Hu, S. X., Li, D., Stühmer, J., Kim, M., and Hospedales, T. M. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Hu, Y., Gripon, V., and Pateux, S. Leveraging the feature distribution in transfer-based few-shot learning. arXiv:2006.03806, 2020.

Huang, G., Larochelle, H., and Lacoste-Julien, S. Are few-shot learning benchmarks too simple? Solving them without task supervision at test-time. arXiv:1902.08605, 2019.

Kang, D., Kwon, H., Min, J., and Cho, M. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M. Transformers in vision: A survey. arXiv:2101.01169, 2021.

Kolkin, N., Salavon, J., and Shakhnarovich, G. Style transfer by relaxed optimal transport and self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Korman, S. and Avidan, S. Coherency sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2015.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 1955.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning (ICML), 2019.

Li, W., Zhao, R., Xiao, T., and Wang, X. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Maron, H., Litany, O., Chechik, G., and Fetaya, E. On learning sets of symmetric elements. In International Conference on Machine Learning (ICML), 2020.
Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NeurIPS), 2001.

Niu, C., Shan, H., and Wang, G. SPICE: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing (TIP), 2022.

Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Quispe, R. and Pedrini, H. Top-DB-Net: Top dropblock for activation enhancement in person re-identification. In International Conference on Pattern Recognition (ICPR), 2020.

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2022.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2000.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv:2001.07685, 2020.

Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., and Van Gool, L. SCAN: Learning to classify images without labels. In European Conference on Computer Vision (ECCV), 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Wang, J., Zhang, Z., Chen, M., Zhang, Y., Wang, C., Sheng, B., Qu, Y., and Xie, Y. Optimal transport for label-efficient visible-infrared person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.

Wang, R., Yan, J., and Yang, X. Learning combinatorial embedding networks for deep graph matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., and Pfister, T. Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Ye, H.-J., Hu, H., Zhan, D.-C., and Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., and Hoi, S. C. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Zass, R. and Shashua, A. Doubly stochastic normalization for spectral clustering. In Advances in Neural Information Processing Systems (NeurIPS), 2006.

Zhang, C., Cai, Y., Lin, G., and Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Zhang, H., Cao, Z., Yan, Z., and Zhang, C. Sill-Net: Feature augmentation with separated illumination representation. arXiv:2102.03539, 2021.

Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision (ICCV), 2015.

Zhong, Z., Zheng, L., Cao, D., and Li, S. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Zhou, K., Yang, Y., Cavallaro, A., and Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

Zhu, H. and Koniusz, P. EASE: Unsupervised discriminant subspace learning for transductive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Ziko, I. M., Dolz, J., Granger, E., and Ayed, I. B. Laplacian regularized few-shot learning. In International Conference on Machine Learning (ICML), 2020.

Appendix

The Appendix includes the following sections:

A. PyTorch-style BPA Implementation
B. Ablation Studies
C. BPA Insertion into Hosting Algorithms
D. Clustering on the Sphere - a Case Study

A. PyTorch-style BPA Implementation

We provide in Algorithm 1 a PyTorch-style implementation that fully aligns with the description in the paper, as well as with our actual implementation that was used to execute all of the experiments. In Appendix C we further demonstrate the "insertions" of BPA into hosting methods, for each of our three main applications.

Notice mainly that: (i) the transform can easily be dropped in, using the simple one-line call X = BPA(X); (ii) it is fully differentiable (as Sinkhorn and the other basic operations are); (iii) the transform does not need to know (or even assume) anything about the number of features, their dimension, or distribution statistics among classes (e.g. whether balanced or not).

It follows the simple steps of: (i) computing Euclidean self pairwise distances (using cosine similarities between unit-normalized input features); (ii) avoiding self-matching by placing infinity values on the distance matrix diagonal; (iii) applying a standard Sinkhorn procedure, given the distance matrix and the only 2 (hyper-)parameters with their fixed values: the entropy regularization parameter λ and the number of row/column iterative normalization steps (note that Sinkhorn by default maps between source and target vectors of ones); (iv) restoring the perfect self-matching probabilities of one, along the diagonal.

Algorithm 1 BPA transform on a set of n features.
input: n × d matrix V    output: n × n matrix W

def BPA(V):
    # compute self pairwise-distances
    D = 1 - pwise_cosine_sim(V / V.norm())
    # infinity self-distances on diagonal
    D_inf = D.fill_diagonal(10e9)
    # compute optimal transport plan
    W = Sinkhorn(D_inf, lambda=0.1, iters=5)
    # stretch affinities to [0, 1]
    W = W / W.max()
    # self-affinity on diagonal to 1
    return W.fill_diagonal(1)
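For completeness, one way to flesh Algorithm 1 out into runnable PyTorch is sketched below; the Sinkhorn routine is a standard entropy-regularized implementation written by us, and may differ in detail from the authors' released code:

import torch

def sinkhorn(C: torch.Tensor, lam: float = 0.1, iters: int = 5) -> torch.Tensor:
    # Entropy-regularized OT plan between two uniform measures (vectors of ones),
    # normalized so that rows and columns (approximately) sum to 1.
    K = torch.exp(-C / lam)                 # Gibbs kernel of the cost matrix
    for _ in range(iters):
        K = K / K.sum(dim=1, keepdim=True)  # row normalization
        K = K / K.sum(dim=0, keepdim=True)  # column normalization
    return K

def BPA(V: torch.Tensor) -> torch.Tensor:
    # Balanced-Pairwise-Affinities transform of an (n, d) feature set.
    V = torch.nn.functional.normalize(V, dim=1)  # unit-normalize rows
    D = 1.0 - V @ V.T                            # pairwise cosine distances
    D_inf = D + 1e9 * torch.eye(V.shape[0])      # forbid self-matching
    W = sinkhorn(D_inf, lam=0.1, iters=5)        # optimal transport plan
    W = W / W.max()                              # stretch affinities to [0, 1]
    return W.fill_diagonal_(1.0)                 # self-affinity on diagonal to 1

X = torch.randn(100, 512)
X = BPA(X)  # the one-line drop-in call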
B. Ablation Studies

B.1. Scalability (accuracy, runtime vs. input size)

Being a transductive module, the accuracy and efficiency of the BPA transform depend on the number of inputs that are processed as a batch. Recall that BPA is a drop-in addition that usually follows feature extraction and precedes further computation - e.g. k-means for clustering, or (often transductive) layers in FSC and Re-ID.

The Re-ID experiment is a good stress-test for BPA, since we achieve excellent results for batch sizes of up to ∼15K image descriptors. In terms of runtime, although BPA's complexity is quadratic in sample size, its own (self) runtime is empirically negligible compared to that of the processing that follows, in all applications tested.

Typical FSC task sizes ((shots+queries)·ways) are small: 100 = (5 + 15) · 5 at the largest. To concretely address this matter, we test a resnet-12 PTMap-BPAp on large-scale FSC, following (Dhillon et al., 2020), on the Tiered-Imagenet dataset and report accuracy for 1/5/10-shot (15-query) tasks over an increasing range of ways. The results, shown in Fig. 5, show that: (i) total runtime, where BPA is only a small contributor (compare black vs. yellow dashed line), increases gracefully (notice the log10 x-axis) even for extremely large FSC tasks of 4000 = (10 + 15) · 160 images; (ii) our accuracy scales as expected, following the observation in (Dhillon et al., 2020).

Figure 5. Accuracy (%) (1-shot and 5-shot) and running times of PT-MAP and PT-MAP-BPA, as a function of the number of ways (log10 scale).
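The runtime trend is easy to probe in isolation; a rough micro-benchmark sketch (ours, reusing the BPA function sketched in Appendix A; memory also grows quadratically with n) is:

import time
import torch

for n in [128, 512, 2048, 8192]:
    V = torch.randn(n, 128)
    t0 = time.perf_counter()
    W = BPA(V)  # builds the (n, n) plan; quadratic time and memory in n
    print(f'n={n:5d}  BPA time: {time.perf_counter() - t0:.3f}s')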
Table 6. Sinkhorn iterations ablation study: see text for details.

method iters 5-way 1-shot 5-way 5-shot
ProtoNet-BPAp 1 70.71 83.79
ProtoNet-BPAp 2 71.10 84.01
ProtoNet-BPAp 4 71.18 84.08
ProtoNet-BPAp 8 71.20 84.10
ProtoNet-BPAp 16 71.20 84.10

B.3. Sinkhorn Entropy Regularization λ

We measured the impact of using different values of the optimal-transport entropy regularization parameter λ (the main parameter of the Sinkhorn algorithm) on a variety of configurations (ways and shots) in Few-Shot Classification (FSC) on MiniImagenet (Vinyals et al., 2016) in Fig. 6, as well as on the Person Re-Identification (Re-ID) experiment on Market-1501 (Zheng et al., 2015) in Fig. 7. In both cases, the ablation was executed on the validation set.

For FSC, in Fig. 6, the top plot shows that the effect of the choice of λ is similar across tasks with a varying number of ways. The bottom plot shows the behavior as a function of λ across multiple shot-values, where the optimal value of λ can be seen to have a certain dependence on the number of shots. Recall that we chose to use a fixed value of λ = 0.1, which gives an overall good accuracy trade-off. Note that a further improvement could be achieved by picking the best values for the particular cases. Notice also the log-scale of the x-axes, showing that performance is rather stable around the chosen value.

For Re-ID, in Fig. 7, we experiment with a range of λ values on the validation set of the Market-1501 dataset. The results (shown both for mAP and rank-1 measures) reveal a strong resemblance to those of the FSC experiment in Fig. 6; however, the optimal choices for λ are slightly higher, which is consistent with the dependence on the shots number, since the Re-ID tasks are typically large ones. We found that a value of λ = 0.25 gives good results across both datasets.

Figure 7. Ablation of the entropy regularization parameter λ using the Person Re-Identification (Re-ID) task. Accuracy vs. λ, using the validation set of Market-1501 (Zheng et al., 2015) and considering both mAP and Rank-1 measures. See text for details.
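The qualitative role of λ can be reproduced in isolation: the following self-contained sketch (ours, not from the paper's code) shows how the Sinkhorn plan sharpens (its row entropy drops) as λ decreases:

import torch

def sinkhorn(C, lam, iters=50):
    K = torch.exp(-C / lam)
    for _ in range(iters):
        K = K / K.sum(dim=1, keepdim=True)
        K = K / K.sum(dim=0, keepdim=True)
    return K

torch.manual_seed(0)
C = torch.rand(8, 8) + 1e9 * torch.eye(8)  # random costs; no self-matching
for lam in [0.01, 0.1, 0.25, 1.0]:
    W = sinkhorn(C, lam)
    H = -(W * (W + 1e-12).log()).sum(dim=1).mean()  # mean row entropy
    print(f'lambda={lam:<5} mean row entropy={H:.3f}')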
B.4. BPA vs. Naive Baselines

In Fig. 8, we ablate different simple alternatives to BPA, with the PTMap (Hu et al., 2020) few-shot classifier as the 'hosting' method, using MiniImagenet (Vinyals et al., 2016). Each result is the average of 100 few-shot episodes, using a WRN-28-10 backbone feature encoder. In blue is the baseline of applying no transform at all, using the original features. In orange - using BPA. In gray and yellow, respectively, are other naive ways of transforming the features, where the affinity matrix is only row-normalized ('softmax') or not normalized at all ('cosine') before taking its rows as the output features. It is empirically evident that only BPA outperforms the baseline consistently, which is due to the properties that we proved regarding the transform.

Figure 8. Comparison of BPA to different baselines over different configurations in few-shot learning tasks over MiniImagenet (Vinyals et al., 2016). Created by measuring accuracy (y-axis) over a varying number of shots (x-axis), with fixed 5-ways and 15-queries. See text for details.

C. BPA Insertion into Hosting Algorithms

C.1. PTMap (Hu et al., 2020) (Few-Shot Classification)

We present the pseudo-code for utilizing BPA within the PTMap pipeline, as outlined in Alg. C.1. The only alteration from the original implementation pertains to row 5, wherein the support and query sets are concatenated and transformed using BPA. This approach can be extended to a wide range of distance-based methodologies, thus providing a simple ...

repeat:
  · L_ij = ∥f_i − c_j∥², ∀i, f_i ∈ f_q     # feature-center distances
  · M = Sinkhorn(L, λ)                      # Sinkhorn soft assignments
  · c_j ← c_j + α(g(M, j) − c_j), ∀j        # update centers
ℓ̂_q(f_i) = argmax_j M[i, j]                # prediction per f_i ∈ f_q
if inference:
  return ℓ̂_q                               # query predictions
else (training):
  update f_ϕ by ∇_ϕ CrossEntropy(M, ℓ_q)    # gradient descent

C.2. SPICE (Niu et al., 2022) (Unsupervised Clustering)

In our implementation of SPICE, as detailed in the paper, we utilize BPA during phase 2 of the algorithm (clustering-head training). Specifically, as depicted in Alg. C.2, we transform the features using BPA, batch-wise, before conducting a nearest-neighbor search. Afterwards, we retrieve the pseudo-labels and resume with the original features, as in the original implementation.

Algorithm 3 SPICE training

Phase (i): pre-train embedding network f_ϕ
Phase (ii): train clustering network c_θ
  repeat per batch x:
    · f = f_ϕ(x)                 # extract features
    · f_BPA = BPA(f)             # BPA-transformed features
    · Find 3 most confident samples per cluster (use f)
    · Compute cluster centers as their means (use f_BPA)
    · Find nearest-neighbors of each center (use f_BPA)
    · Assign them to the cluster (as pseudo-labels)
    · Use pseudo-labels to train (update) c_θ
Phase (iii): jointly fine-tune f_ϕ and c_θ

Finally, Alg. C.3 illustrates the application of BPA during inference in the context of Person Re-ID. Typically, the query identity search within the gallery involves identifying the nearest sample to each query. In our implementation, we adopt the same methodology, with the additional step of first transforming the gallery and query features jointly with BPA:
# extract features
f_g = f_ϕ(x_g), f_q = f_ϕ(x_q)
# transform them with BPA
(f_g ∪ f_q) = BPA(f_g ∪ f_q)
# return gallery image with closest feature, for every query f_i ∈ f_q
return argmin_{j : f_j ∈ f_g} ∥f_i − f_j∥
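A minimal runnable rendering of this retrieval step could look as follows (our sketch; it assumes the BPA function from the Appendix A sketch and plain Euclidean nearest-neighbor search):

import torch

def bpa_reid_retrieve(f_gallery: torch.Tensor, f_query: torch.Tensor) -> torch.Tensor:
    # Returns, for each query, the index of its closest gallery feature,
    # after jointly transforming gallery and query features with BPA.
    n_g = f_gallery.shape[0]
    F = torch.cat([f_gallery, f_query], dim=0)  # the joint feature set
    W = BPA(F)                                  # assumes BPA from Appendix A
    g, q = W[:n_g], W[n_g:]                     # split the embedded rows back
    return torch.cdist(q, g).argmin(dim=1)      # nearest gallery item per query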
² Accuracy is measured by comparison with the optimal permutation of the predicted labels, found by the Hungarian Algorithm (Kuhn, 1955).
Figure 9. Clustering on the sphere: Data Generation. 10 random cluster centers on the unit sphere, perturbed by increasing noise STD.
Figure 10. Clustering on the sphere: Detailed Results. Clustering measures (top: ARI, bottom: NMI) of k-means, using BPA features
(dashed lines) vs. original features (solid lines). For both measures - the higher the better. Shown over different configurations of feature
dimensions d (left) and noise levels σ (right).
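The data generation and evaluation behind Figures 9 and 10 can be sketched roughly as follows (our reconstruction from the captions, so the exact protocol may differ; it reuses the BPA function from the Appendix A sketch):

import torch
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def sphere_clusters(k=10, per_cluster=50, d=32, sigma=0.1):
    # k random cluster centers on the unit sphere, perturbed by Gaussian noise
    centers = torch.nn.functional.normalize(torch.randn(k, d), dim=1)
    labels = torch.arange(k).repeat_interleave(per_cluster)
    X = centers[labels] + sigma * torch.randn(k * per_cluster, d)
    return torch.nn.functional.normalize(X, dim=1), labels.numpy()

X, y = sphere_clusters()
for name, feats in [('original', X), ('BPA', BPA(X))]:
    pred = KMeans(n_clusters=10, n_init=10).fit_predict(feats.numpy())
    print(name,
          'ARI:', adjusted_rand_score(y, pred),
          'NMI:', normalized_mutual_info_score(y, pred))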