A Simple Baseline For Low-Budget Active Learning
1 Introduction
We are interested in active learning with a very low budget. Given a large set
of unlabeled images and an oracle that can label a small set of images, we
want to train an accurate image classification model by choosing a small set of
images for the oracle to annotate. Active learning [14] has been studied for a long
time. However, most active learning methods start with a large initial seed pool
(usually randomly chosen images that are annotated) and also choose a large
set of images to be annotated actively. For instance, [49,16,5] use more than 3%
of ImageNet [44] as the initial pool, which contains more than 40,000 images.
In some applications, e.g., medical image analysis, the labeling budget is much
smaller, so the current active learning algorithms may not be a viable solution
[32,55].
One may argue that our setting is similar to few-shot and semi-supervised
learning, where a few labeled examples of each class are assumed to be available.
However, we argue that those settings are not practical in many applications, since
sampling a set of images for labeling that is "uniformly" distributed across categories
itself requires a larger subset of the data to be annotated first. In some real
applications, even finding one example of an object to annotate may be challenging.
Fig. 1: Our simple baseline. The goal is to train an accurate image classifica-
tion model with a very small set of labeled images. a) A self-supervised learning
model learns from the unlabeled data and provides the feature embeddings. b)
The K-means algorithm clusters the unlabeled data features and chooses the data
points nearest to the center of each cluster. c) These selected points are then
annotated by an oracle. Finally, a linear classifier on top of the features learns
from the annotated data and performs the image classification task.
For instance, in a self-driving car application, one may want to build a detector
for motorbikes by annotating only a few motorbike examples, but finding those
few examples in a large dataset of many hours of driving video footage is a
challenging task by itself.
As an example, Table 6a shows that some categories may not even be rep-
resented in a pool of 3,000 randomly sampled ImageNet images. This makes
standard few-shot or semi-supervised learning methods impractical. Hence, we
believe that active learning with only a few annotations per category on average
is an important practical problem.
2 Related Work
archical approach to avoid selecting repeated samples from the same cluster.
Core-set [46] increases diversity in the selected batch by minimizing
the Euclidean distance between already labeled instances and instances
in the unlabeled pool. Despite working well on datasets with a small number of
classes, the performance of Core-set degrades in high dimensions, where p-norms
suffer from the curse of dimensionality [18].
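For concreteness, a common greedy approximation of the k-Center objective that Core-set builds on can be sketched as below; this is our own illustration (function and variable names are ours), not the authors' released implementation, and it assumes feature embeddings have already been computed for the whole pool.

```python
import numpy as np

def greedy_k_center(features, labeled, budget):
    """Greedily add the point that is farthest from its nearest already-chosen
    (labeled or previously selected) point."""
    n = features.shape[0]
    min_dist = np.full(n, np.inf)
    for j in labeled:  # distance of every point to its nearest labeled point
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[j], axis=1))
    selected = []
    for _ in range(budget):
        i = int(np.argmax(min_dist))  # farthest point from the current centers
        selected.append(i)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[i], axis=1))
    return selected
```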
Some methods take advantage of both uncertainty and diversity [34,29,4].
VAAL [49] and DFAL [15] use adversarial learning to learn the representation
of data points. MAL [16] shows that although VAAL does not need annotations
to sample data, it may select multiple instances of a class that is already well
represented in the labeled pool.
Many active learning methods require a large initial labeled pool and a large
sampling budget [46,49,16,47]. This condition is difficult to meet in medical image
analysis and other domains where the unlabeled training data is very large and
labeling a large subset of it demands substantial expert effort [26]. An approach
to mitigate this problem is presented in [36]; however, it requires computing
large distance matrices to solve Wasserstein distances for large datasets
(e.g., with 1M images), which presents a scalability bottleneck.
In this paper, we train an off-the-shelf SSL method on the unlabeled data only
once to obtain rich feature vectors, and select a small proportion of the samples
in a single non-iterative step. This achieves strong performance on a variety of
datasets, including large-scale ones, without requiring an initial labeled pool.
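A minimal sketch of this selection step (see also Fig. 1) is shown below; it assumes the SSL embeddings have already been extracted into a float32 array, and the function and variable names are ours rather than from any released code.

```python
import numpy as np
import faiss  # also used later for nearest neighbor evaluation [27]

def select_by_kmeans(features, budget):
    """Cluster the unlabeled features into `budget` clusters and return, for each
    cluster, the index of the sample closest to its center (steps a-b of Fig. 1)."""
    x = np.ascontiguousarray(features, dtype=np.float32)
    kmeans = faiss.Kmeans(x.shape[1], budget, niter=20)
    kmeans.train(x)
    # For every centroid, retrieve the nearest real data point in Euclidean distance.
    index = faiss.IndexFlatL2(x.shape[1])
    index.add(x)
    _, nearest = index.search(kmeans.centroids, 1)  # shape (budget, 1)
    return np.unique(nearest.ravel())               # de-duplicate, just in case

# selected = select_by_kmeans(features, budget=1000)
# The selected image ids are sent to the oracle; a linear classifier is then trained
# on the frozen features of the annotated subset.
```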
Semi-supervised learning. Semi-supervised learning for image classification
aims to make the best use of limited labeled data while leveraging the unlabeled
dataset [11,54,31]. The three most explored directions are consistency regularization
[45,33], entropy minimization [23], and pseudo-labeling [6,51]. Despite achieving
performance highly competitive with supervised methods, semi-supervised learn-
ing, like few-shot learning, assumes a small and equal number of labeled examples
per category; building such a uniformly distributed labeled set requires annotating
more samples from the unlabeled dataset than the final training set size. Therefore,
we cannot compare active and semi-supervised methods directly. However, in Section
4.5, we introduce two labeled set generation settings to show that active
strategies can benefit the performance of semi-supervised models.
Self-supervised learning. SSL provides a strategy to train a neural network on
unlabeled data and create rich features that can be fine-tuned for different
downstream tasks using limited labeled data. SSL methods broadly either learn to
solve a pretext task [22,42,41] or contrast similar pairs of an image against its
negative pairs [8,10,25]. CompRess [1] is a recent state-of-the-art SSL
model that compresses a deep teacher network into a smaller student network
such that, for any query, the student ranks anchors similarly to the teacher.
Prior active learning methods that take advantage of unsupervised learning
involve a time-consuming human-in-the-loop workflow of waiting for a new
model to train on previously annotated batches before selecting the next batch
to annotate [48,20,56,16], or use large sampling budgets [9,38,17]. In contrast,
we select a single batch of a few examples using the initial SSL pre-trained model.
pixels. The validation set for the ImageNet/LT experiments is the same and contains
50,000 samples. All datasets are augmented by horizontally flipping the images.
Baselines. We compare K-means and multi K-means with the following base-
lines: i) Random, in which samples are selected randomly (uniformly) from the
entire dataset; ii) Max-Entropy [53], which samples the points with the highest
predictive entropy; iii) Core-set [46]; iv) VAAL [49]; and v)
Uniform, which selects an equal number of random samples per class. Note that
few-shot and semi-supervised learning methods use the Uniform strategy to create
training sets, which may require annotating more samples than the size of the resulting set.
Implementation details. In all experiments, unless specified, we use the feature
outputs of a ResNet-18 that is pre-trained on unlabeled ImageNet for 130 epochs
using the CompRess SSL method [1], which uses MoCo-v2 [12] as its teacher
network. Note that this pre-trained feature extractor is used even for the CIFAR ex-
periments, which means that, technically, the CIFAR experiments use unlabeled
data beyond the CIFAR datasets. The Max-Entropy sampling method [53] freezes the
pre-trained backbone and trains an extra linear layer as the classifier on top
of it for 100 epochs. In all Max-Entropy experiments, we use the Adam optimizer
with lr = 0.001, which is multiplied by 0.1 at epochs 50 and 75. The budget for it-
erative sampling methods is the difference between two consecutive budget sizes.
For Random, Uniform, and K-means sampling, each budget size corresponds to
the total amount of unlabeled data selected.
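As a rough illustration (not the authors' code), the Max-Entropy acquisition described above reduces to scoring every unlabeled image by the entropy of the linear head's softmax output and keeping the top `budget` images; the helper name below is ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def max_entropy_select(linear_head, features, budget):
    """features: (N, d) frozen backbone outputs of the unlabeled pool."""
    probs = F.softmax(linear_head(features), dim=1)            # (N, num_classes)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)   # predictive entropy per image
    return torch.topk(entropy, k=budget).indices               # most uncertain images
```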
Evaluation metrics. Unless specified, all experiments are averaged over 3 runs
with 3 fixed random seeds. We follow four evaluation metric protocols:
i) Linear classification. We train a linear classifier on top of the frozen
backbone features (without back-propagating into the backbone weights) on the
pool of labeled data for 100 epochs and report its top-1 accuracy on the test
set. We apply mean and standard deviation normalization to each dimension
of the backbone outputs to reduce the computational overhead of tuning the
hyper-parameters per experiment. We use the Adam optimizer with lr = 0.01, which is
multiplied by 0.1 at epochs 50 and 75. The batch size is 128 on ImageNet/LT.
For the CIFAR-10/100 experiments, the initial pools contain only 10/100 examples,
so we set the batch size to 4.
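A minimal sketch of this linear probing recipe, assuming the frozen backbone features have already been extracted into tensors; the hyper-parameters follow the description above, and the helper name is ours.

```python
import torch
import torch.nn as nn

def train_linear_probe(train_feats, train_labels, num_classes, batch_size=128, epochs=100):
    # Per-dimension mean/std normalization of the frozen features.
    mean = train_feats.mean(0, keepdim=True)
    std = train_feats.std(0, keepdim=True) + 1e-5
    feats = (train_feats - mean) / std
    head = nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=0.01)
    # lr is multiplied by 0.1 at epochs 50 and 75, as in the text.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50, 75], gamma=0.1)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(feats, train_labels),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            nn.functional.cross_entropy(head(x), y).backward()
            opt.step()
        sched.step()
    return head, (mean, std)  # reuse the same normalization statistics at test time
```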
ii) Nearest neighbor classification. This protocol uses cosine similarity as the
distance metric to find the most semantically similar neighbors of the test set
data in the pool of labeled images. When the pool of labeled data is small, this
metric is faster than linear evaluation since nearest neighbor classification needs
no hyper-parameter tuning. We use the FAISS GPU library [27] for the implementation.
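A short sketch of this protocol: after L2-normalizing the features, cosine similarity becomes an inner product, so a FAISS inner-product index can return the nearest labeled neighbor of each test sample (array and function names are ours).

```python
import numpy as np
import faiss

def nn_classify(labeled_feats, labeled_targets, test_feats):
    # L2-normalize so that inner product equals cosine similarity.
    lab = labeled_feats / np.linalg.norm(labeled_feats, axis=1, keepdims=True)
    qry = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    index = faiss.IndexFlatIP(lab.shape[1])
    index.add(lab.astype(np.float32))
    _, nn_idx = index.search(qry.astype(np.float32), 1)
    return labeled_targets[nn_idx[:, 0]]  # predict the label of the nearest labeled sample
```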
iii) Evaluation on fine-grained tasks. We evaluate on Flowers-102 [40], DTD-
47 [13], and Aircraft [37] as examples of fine-grained datasets. For feature em-
beddings, we use the same frozen backbone that is pre-trained on unlabeled
ImageNet. In each dataset, we first choose a small subset of the training data to
be annotated, and then use this subset to train a linear classifier on top of the
frozen backbone. This is similar to the transfer learning procedure in [10,24].
Full details on the training sets are in Table 15 in the appendix.
examples per class as long as it does not surpass the class size. Table 7 shows
that K-means strategies are strong active learning methods in both evaluation
metrics without using any annotation information. Also, in Appendix A.2, we report
top-1 linear and nearest neighbor classification results of different strategies on
ImageNet-LT using an ImageNet-LT pre-trained backbone as a realistic fea-
ture extractor, to show that K-means is insensitive to the category distribution of
the unlabeled training set.
Table 4: Top-1 linear (LIN) and nearest neighbor (NN) classification re-
sults of different strategies on CIFAR-100. In both evaluation benchmarks,
K-means strategies outperform Max-Entropy, Core-set, VAAL, and Random and
are on par with Uniform sampling. Despite having an equal number of examples per
category, Uniform outperforms K-means in linear classification only on budgets
larger than 4%. K-means and multi K-means are competitive on CIFAR-100.
Budgets (images)    100        300        500        1000       2000       2500       4000       5000       7500
Budgets (% of set)  0.2%       0.6%       1%         2%         4%         5%         8%         10%        15%

Linear classification (LIN)
Uniform        10.2 ± .2  18.7 ± .1  21.3 ± .1  27.2 ± .0  29.9 ± .1  30.9 ± .1  32.7 ± .0  32.9 ± .0  33.6 ± .0
Random         10.4 ± .3  16.5 ± .1  20.7 ± .1  24.6 ± .4  29.3 ± .1  29.5 ± .5  30.8 ± .1  31.7 ± .1  32.8 ± .1
Max-Entropy    10.4 ± .3  14.6 ± .1  17.2 ± .0  20.8 ± .1  21.9 ± .2  24.6 ± .1  26.1 ± .0  27.6 ± .0  28.8 ± .0
Core-set       10.4 ± .3  15.1 ± .0  17.4 ± .1  22.2 ± .1  25.9 ± .0  26.9 ± .0  27.8 ± .1  28.6 ± .0  29.3 ± .1
VAAL           10.4 ± .3  15.9 ± .1  19.1 ± .0  24.1 ± .0  28.4 ± .1  29.9 ± .1  30.9 ± .0  31.6 ± .1  33.1 ± .0
Multi K-means  13.4 ± .1  20.0 ± .1  22.2 ± .0  26.1 ± .0  29.5 ± .0  30.2 ± .0  31.5 ± .0  31.5 ± .0  32.8 ± .0
K-means        13.4 ± .1  18.8 ± .1  23.7 ± .1  27.1 ± .0  29.4 ± .0  31.1 ± .1  32.5 ± .0  32.8 ± .1  33.4 ± .0

Nearest neighbor classification (NN)
Uniform        10.1 ± .3  13.8 ± .1  15.3 ± .1  17.8 ± .0  20.2 ± .0  21.9 ± .0  24.1 ± .1  25.8 ± .1  30.0 ± .1
Random          8.3 ± .1  12.8 ± .1  15.0 ± .2  16.9 ± .1  19.7 ± .1  20.8 ± .0  22.0 ± .1  23.2 ± .0  25.5 ± .0
Max-Entropy     8.3 ± .1  10.1 ± .0  10.9 ± .0  12.1 ± .1  12.5 ± .1  12.7 ± .1  12.9 ± .0  13.1 ± .0  13.6 ± .1
Core-set        8.3 ± .1  10.4 ± .1  10.9 ± .0  13.3 ± .0  16.4 ± .1  16.8 ± .0  18.2 ± .0  18.9 ± .0  20.7 ± .1
VAAL            8.3 ± .1  12.1 ± .0  13.7 ± .1  16.7 ± .0  19.2 ± .1  20.4 ± .0  22.1 ± .1  23.1 ± .0  24.9 ± .0
Multi K-means  13.7 ± .2  17.2 ± .0  18.0 ± .1  20.4 ± .1  22.9 ± .0  23.5 ± .0  24.2 ± .1  25.1 ± .0  27.2 ± .0
K-means        13.7 ± .2  17.1 ± .0  19.4 ± .1  21.1 ± .1  23.1 ± .0  23.6 ± .1  24.5 ± .0  24.9 ± .1  26.1 ± .1
Budgets        1K (0.08%)  3K (0.2%)  7K (0.5%)  13K (≥ 1%)
Uniform        100 ± 0     100 ± 0    100 ± 0    100 ± 0
Random         62.9 ± .2   94.6 ± .4  100 ± 0    100 ± 0
Max-Entropy    62.9 ± .2   84.3 ± .5  94.8 ± .2  100 ± 0
Core-set       62.9 ± .2   87.9 ± .1  97.0 ± .5  100 ± 0
VAAL           62.9 ± .2   94.6 ± .1  98.1 ± .3  100 ± 0
Multi K-means  72.2 ± .1   97.0 ± .0  99.8 ± .0  100 ± 0
K-means        72.2 ± .1   97.8 ± .2  99.9 ± .0  100 ± 0
Random 80.56 ±1.6 82.06 ±1.3 53.86 ±0.6 61.06 ±0.7 36.10 ±1.7 37.36 ±0.1
K-means 82.20 ±1.0 82.76 ±1.2 64.30 ±0.0 65.96 ±0.7 37.20 ±0.0 38.67 ±0.2
Table 10: Semi-supervised evaluation with FixMatch [51]. The scores are
top-1 accuracies (in %) of the model on the test set. In contrast to Random and
K-means, Uniform and Uniform(K-means) take advantage of annotation infor-
mation to sample a labeled set. With no labels, K-means performs consistently
better than Random. Using labels, Uniform(K-means) outperforms Uniform in
low budgets. The results are from 1 repetition of the experiments.
                  CIFAR-10                     Flowers-102                  DTD-47
Methods           1×10    4×10    10×10        1×102   3×102   4×102        1×47    3×47    4×47
                  0.02%   0.08%   0.2%         10%     30%     40%          2.5%    7.5%    10%
With Labels
Uniform           54.79   88.76   91.20        15.22   34.33   42.54        8.88    18.40   23.19
Uniform(K-means)  57.53   89.11   89.40        18.08   37.79   41.60        16.49   22.29   22.39
Without Labels
Random            38.69   60.21   86.73        13.81   31.42   42.51        8.30    16.86   21.65
K-means           43.37   84.46   86.96        19.87   37.60   46.09        14.36   19.04   23.56
5 Ablation Study
In this section, we perform an ablation study on the initial labeled pool and the
feature extraction backbone.
Effect of size. We repeat the experiments with the same setting as in Section 4.2
but with a larger, randomly sampled initial set of 2%. Comparing the linear classification
results of Tables 5 and 11, we find that all three iterative methods perform better
with a larger initial pool. As a result, iterative methods are suitable options
when large initial pools are available.
Effect of sampling strategy. We repeat the analyses in Section 4.2 with a K-
means-selected, instead of a randomly selected, 0.08% of ImageNet as the initial
pool. The linear classification results on 0.2% and 0.5% of ImageNet are shown
in Table 12. Comparing Tables 5 and 12 demonstrates that although the per-
formance of the iterative methods improves with a richer initialization, a large
gap with K-means remains in low budgets.
Ablating the sampling backbone. In Table 13, we find that changing the
selection backbone weights to random, while keeping the evaluation setting the
same, drops the K-means performance in low budgets.
Ablating the classification backbone. Table 13 also shows the key role of the
SSL pre-trained classification backbone in achieving strong accuracy. When not
fine-tuning the randomly initialized backbone, we train a linear layer on top
of the frozen random features.
Effect of fine-tuning. In Table 13, we also report the fine-tuning results of
all ablation variants. For randomly initialized backbones, we train both the feature
extractor and the linear classifier using the SGD optimizer for 100 epochs with the same
learning rate of 0.1, which is multiplied by 0.1 at epochs 30, 60, and 90. For the SSL
pre-trained variants, we apply mean and standard deviation normalization to the
features before feeding them to the linear layer. The optimizer is Adam, and a lower
learning rate is used for the backbone than for the linear layer (10−4 vs.
10−2). Table 13 shows that back-propagating through a pre-trained model
with a new objective causes the model to forget previously learned features,
and the performance drops.
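The two learning rates for the SSL pre-trained variants can be expressed as parameter groups of a single Adam optimizer; below is a sketch with placeholder module names.

```python
import torch

def make_finetune_optimizer(backbone, head):
    # Smaller steps for the pre-trained backbone, larger steps for the new linear head.
    return torch.optim.Adam([
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-2},
    ])
```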
Table 14: Effect of MoCo-v2 pre-trained R50 on top-1 linear (LIN) and
nearest neighbor (NN) classification results of ImageNet. The superi-
ority of K-means is insensitive to the choice of network architecture and SSL
pre-training method. The results are from 1 repetition of the experiments.
Budgets        1K      3K      7K      13K     26K     64K
               0.08%   0.2%    0.5%    1%      2%      5%

Linear classification (LIN)
Uniform        30.0    40.2    46.8    50.8    54.9    59.5
Random         23.4    35.4    45.2    49.9    54.5    59.5
Max-Entropy    23.4    26.5    34.8    40.7    45.6    51.8
Core-set       23.4    32.1    37.6    41.6    46.5    53.8
VAAL           23.4    35.9    41.8    48.0    54.0    59.4
K-means        33.3    42.5    47.9    51.8    55.7    60.2

Nearest neighbor classification (NN)
Uniform        30.8    37.5    40.8    43.1    45.2    48.1
Random         23.7    33.9    40.2    42.9    45.2    48.0
Max-Entropy    23.7    25.1    29.5    31.6    32.5    33.4
Core-set       23.7    30.3    32.5    33.7    35.3    38.2
VAAL           23.7    34.3    38.2    41.9    44.8    47.8
K-means        33.6    41.2    44.2    46.0    47.7    49.8
6 Discussion
7 Conclusion
Most active learning benchmarks assume access to a large budget and a
large labeled seed pool. We believe there is a practical need for active learning
with smaller budgets. However, the problem is challenging, as some categories in
image classification may not be represented in the seed pool. We introduce a very sim-
ple baseline for this problem and show that it outperforms state-of-the-art active
learning methods in low budgets. Our method leverages recent progress in
self-supervised learning along with simple K-means clustering to select the
images that need to be annotated.
References
1. Abbasi Koohpayegani, S., Tejankar, A., Pirsiavash, H.: Compress: Self-supervised
learning by compressing representations. In: Advances in Neural Information Pro-
cessing Systems. vol. 33, pp. 12980–12992 (2020) 2, 4, 6
2. Aggarwal, U., Popescu, A., Hudelot, C.: Optimizing active learning for low anno-
tation budgets. arXiv preprint arXiv:2201.07200 (2022) 3
3. Andrychowicz, M., Denil, M., Gómez, S., Hoffman, M.W., Pfau, D., Schaul, T.,
Shillingford, B., de Freitas, N.: Learning to learn by gradient descent by gradient
descent. In: Advances in Neural Information Processing Systems. vol. 29 (2016) 3
4. Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch
active learning by diverse, uncertain gradient lower bounds. In: International Con-
ference on Learning Representations (ICLR) (2020) 4
5. Beluch, W.H., Genewein, T., Nürnberger, A., Köhler, J.M.: The power of ensembles
for active learning in image classification. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. pp. 9368–9377 (2018) 1
6. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.:
Mixmatch: A holistic approach to semi-supervised learning. In: Advances in Neural
Information Processing Systems. vol. 32 (2019) 4
7. Brinker, K.: Incorporating diversity in active learning with support vector ma-
chines. In: Proceedings of the 20th international conference on machine learning
(ICML-03). pp. 59–66 (2003) 3
8. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised
learning of visual features by contrasting cluster assignments. In: Proceedings of
Advances in Neural Information Processing Systems (NeurIPS) (2020) 4
9. Chandra, A.L., Desai, S.V., Devaguptapu, C., Balasubramanian, V.N.: On initial
pools for deep active learning. arXiv preprint arXiv:2011.14696 (2020) 4
10. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for con-
trastive learning of visual representations. In: International conference on machine
learning. pp. 1597–1607. PMLR (2020) 2, 4, 6
11. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised
models are strong semi-supervised learners. In: Advances in Neural Information
Processing Systems. vol. 33, pp. 22243–22255 (2020) 4
12. Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum con-
trastive learning. arXiv preprint arXiv:2003.04297 (2020) 2, 6, 13
13. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures
in the wild. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 3606–3613 (2014) 6, 7, 19
14. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models.
Journal of artificial intelligence research 4, 129–145 (1996) 1
15. Ducoffe, M., Precioso, F.: Adversarial active learning for deep networks: a margin
based approach. arXiv preprint arXiv:1802.09841 (2018) 4
16. Ebrahimi, S., Gan, W., Salahi, K., Darrell, T.: Minimax active learning. arXiv
preprint arXiv:2012.10467 (2020) 1, 4
17. Emam, Z.A.S., Chu, H.M., Chiang, P.Y., Czaja, W., Leapman, R., Gold-
blum, M., Goldstein, T.: Active learning at the imagenet scale. arXiv preprint
arXiv:2111.12880 (2021) 4
18. François, D.: High-dimensional data analysis. From Optimal Metric to Feature
Selection pp. 54–55 (2008) 4
19. Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: Representing
model uncertainty in deep learning. In: international conference on machine learn-
ing. pp. 1050–1059. PMLR (2016) 3
20. Gao, M., Zhang, Z., Yu, G., Arık, S.Ö., Davis, L.S., Pfister, T.: Consistency-based
semi-supervised active learning: Towards minimizing labeling cost. In: European
Conference on Computer Vision. pp. 510–526. Springer (2020) 2, 4
21. Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition. pp. 4367–4375 (2018) 3
22. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by pre-
dicting image rotations. In: International Conference on Learning Representations
(2018) 4
23. Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In:
Advances in Neural Information Processing Systems. vol. 17 (2005) 4
24. Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doer-
sch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own
latent-a new approach to self-supervised learning. Advances in Neural Information
Processing Systems 33, 21271–21284 (2020) 6, 20
25. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 9729–9738 (2020) 4
26. Hoi, S.C., Jin, R., Zhu, J., Lyu, M.R.: Batch mode active learning and its appli-
cation to medical image classification. In: Proceedings of the 23rd international
conference on Machine learning. pp. 417–424 (2006) 4
27. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. arXiv
preprint arXiv:1702.08734 (2017) 6
28. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with gaussian
processes for object categorization. In: 2007 IEEE 11th International Conference
on Computer Vision. pp. 1–8. IEEE (2007) 3
29. Kirsch, A., van Amersfoort, J., Gal, Y.: Batchbald: Efficient and diverse batch
acquisition for deep bayesian active learning. In: Advances in Neural Information
Processing Systems. vol. 32 (2019) 4
30. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep.,
Citeseer (2009) 5
31. Kuo, C.W., Ma, C.Y., Huang, J.B., Kira, Z.: Featmatch: Feature-based augmenta-
tion for semi-supervised learning. In: European Conference on Computer Vision.
pp. 479–495 (2020) 4
32. Kuo, W., Häne, C., Yuh, E., Mukherjee, P., Malik, J.: Cost-sensitive active learning
for intracranial hemorrhage detection. In: International Conference on Medical Im-
age Computing and Computer-Assisted Intervention. pp. 715–723. Springer (2018)
1
33. Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. In: ICLR
(Poster) (2017) 4
34. Li, X., Guo, Y.: Adaptive active learning for image classification. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2013) 4
35. Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., Yu, S.X.: Large-scale long-tailed
recognition in an open world. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 2537–2546 (2019) 5
36. Mahmood, R., Fidler, S., Law, M.T.: Low budget active learning via wasser-
stein distance: An integer programming approach. arXiv preprint arXiv:2106.02968
(2021) 4
37. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual
classification of aircraft. arXiv preprint arXiv:1306.5151 (2013) 6, 19
38. Mottaghi, A., Yeung, S.: Adversarial representation active learning. arXiv preprint
arXiv:1912.09720 (2019) 4
39. Nguyen, H.T., Smeulders, A.: Active learning using pre-clustering. In: Proceedings
of the twenty-first international conference on Machine learning. p. 79 (2004) 3
40. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number
of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image
Processing. pp. 722–729. IEEE (2008) 6, 7, 19
41. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving
jigsaw puzzles. In: European conference on computer vision. pp. 69–84. Springer
(2016) 4
42. Noroozi, M., Pirsiavash, H., Favaro, P.: Representation learning by learning to
count. In: Proceedings of the IEEE International Conference on Computer Vision.
pp. 5898–5906 (2017) 4
43. Roy, N., McCallum, A.: Toward optimal active learning through monte carlo esti-
mation of error reduction. ICML, Williamstown pp. 441–448 (2001) 3
44. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog-
nition challenge. International journal of computer vision 115(3), 211–252 (2015)
1, 5
45. Sajjadi, M., Javanmardi, M., Tasdizen, T.: Regularization with stochastic trans-
formations and perturbations for deep semi-supervised learning. In: Advances in
Neural Information Processing Systems. vol. 29 (2016) 4
46. Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core-
set approach. In: International Conference on Learning Representations (2018) 2,
3, 4, 6
47. Shui, C., Zhou, F., Gagné, C., Wang, B.: Deep active learning: Unified and princi-
pled method for query and training. In: Chiappa, S., Calandra, R. (eds.) Proceed-
ings of the Twenty Third International Conference on Artificial Intelligence and
Statistics. vol. 108, pp. 1308–1318. PMLR (26–28 Aug 2020) 4
48. Siméoni, O., Budnik, M., Avrithis, Y., Gravier, G.: Rethinking deep active learning:
Using unlabeled data at model training. arXiv preprint arXiv:1911.08177 (2019) 4
49. Sinha, S., Ebrahimi, S., Darrell, T.: Variational adversarial active learning. In: Pro-
ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(2019) 1, 2, 4, 6
50. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In:
Advances in Neural Information Processing Systems. vol. 30 (2017) 3
51. Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk,
E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with
consistency and confidence. In: Advances in Neural Information Processing Sys-
tems. vol. 33, pp. 596–608 (2020) 4, 7, 11
52. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. Journal of machine learning research 2(Nov), 45–66 (2001) 3
53. Wang, D., Shang, Y.: A new active labeling method for deep learning. In: 2014
International Joint Conference on Neural Networks (IJCNN). pp. 112–119 (2014)
3, 6
54. Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.: Unsupervised data augmentation
for consistency training. In: Advances in Neural Information Processing Systems.
vol. 33, pp. 6256–6268 (2020) 4
55. Yang, L., Zhang, Y., Chen, J., Zhang, S., Chen, D.Z.: Suggestive annotation: A deep
active learning framework for biomedical image segmentation. In: International
conference on medical image computing and computer-assisted intervention. pp.
399–407. Springer (2017) 1
56. Zhang, B., Li, L., Yang, S., Wang, S., Zha, Z.J., Huang, Q.: State-relabeling adver-
sarial active learning. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 8756–8765 (2020) 4
A Appendix
Here, we provide additional details about Sections 4.2 through 4.5. Table 15 shows
the details of the fine-grained datasets used in the semi-supervised learning and
fine-grained evaluation tasks described in Sections 4.4 and 4.5.
Table 15: Fine-grained dataset details. Training, val, and test split details
of the fine-grained datasets used in the semi-supervised learning and fine-grained
evaluation tasks are listed. For DTD and Flowers, we use the provided val sets.
For Aircraft, we sample 20% of the samples per class.
Dataset        Classes  Train size  Val size  Test size  Accuracy measure
DTD [13]       47       1,880       1,880     1,880      Top-1
Aircraft [37]  100      5,367       1,300     3,333      Mean per-class
Flowers [40]   102      1,020       1,020     6,149      Mean per-class
Table 16: Top-1 linear (LIN) and nearest neighbor (NN) classification
results of different sampling methods on ImageNet-LT. We use a ResNet-
50 that is pre-trained on ImageNet-LT using BYOL [24] to verify that K-means
is insensitive to the category distribution of the unlabeled training set. With no label
information, K-means performs better than the non-uniform sampling methods in
both linear and nearest neighbor classification.
Budgets        1K         3K         5K         7K         9K         12K
               0.8%       3%         4%         6%         8%         10%

Linear classification (LIN)
Uniform        5.34 ± .1  10.6 ± .1  13.5 ± .2  15.6 ± .0  17.6 ± .1  19.5 ± .0
Random         5.04 ± .2  8.60 ± .1  11.4 ± .1  12.8 ± .0  14.1 ± .0  15.6 ± .0
Max-Entropy    5.04 ± .2  7.41 ± .0  9.29 ± .1  10.7 ± .0  11.9 ± .1  13.6 ± .1
Core-set       5.04 ± .2  7.77 ± .0  9.66 ± .0  10.8 ± .1  12.0 ± .0  13.7 ± .0
VAAL           5.04 ± .2  8.58 ± .0  10.6 ± .0  12.2 ± .1  13.4 ± .0  14.9 ± .0
Multi K-means  6.01 ± .0  9.69 ± .0  11.4 ± .1  12.7 ± .0  13.9 ± .0  15.4 ± .0
K-means        6.01 ± .0  9.60 ± .1  11.7 ± .0  13.1 ± .2  14.4 ± .0  15.9 ± .1

Nearest neighbor classification (NN)
Uniform        4.81 ± .0  7.03 ± .1  8.21 ± .1  9.02 ± .1  9.95 ± .0  10.8 ± .0
Random         4.42 ± .1  6.40 ± .0  7.64 ± .0  8.22 ± .0  8.85 ± .0  9.55 ± .0
Max-Entropy    4.42 ± .1  4.45 ± .1  4.56 ± .1  4.76 ± .0  5.02 ± .0  5.36 ± .0
Core-set       4.42 ± .1  5.58 ± .0  6.34 ± .0  7.03 ± .0  7.65 ± .0  8.40 ± .0
VAAL           4.42 ± .1  6.33 ± .0  7.26 ± .0  7.95 ± .0  8.51 ± .1  9.12 ± .0
Multi K-means  5.48 ± .0  7.58 ± .0  8.50 ± .0  9.20 ± .1  9.74 ± .1  10.4 ± .0
K-means        5.48 ± .0  7.61 ± .0  8.64 ± .1  9.05 ± .0  9.74 ± .0  10.2 ± .0