Class-Agnostic Object Counting Robust to Intraclass Diversity
Shenjian Gong¹, Shanshan Zhang¹(B), Jian Yang¹, Dengxin Dai², and Bernt Schiele²
¹ PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
{shenjiangong,shanshan.zhang,csjyang}@njust.edu.cn
² MPI Informatics, Saarbrücken, Germany
{ddai,schiele}@mpi-inf.mpg.de
Abstract. Most previous works on object counting are limited to pre-defined cat-
egories. In this paper, we focus on class-agnostic counting, i.e., counting object
instances in an image by simply specifying a few exemplar boxes of interest. We
start with an analysis of intraclass diversity and point out that three factors, namely color, shape and scale diversity, seriously hurt counting performance. Motivated by this
analysis, we propose a new counter robust to high intraclass diversity, for which
we propose two effective modules: Exemplar Feature Augmentation (EFA) and
Edge Matching (EM). Aiming to handle diversity from all aspects, EFA generates
a large variety of exemplars in the feature space based on the provided exemplars.
Additionally, the edge matching branch focuses on the more reliable cue of shape,
making our counter more robust to color variations. Experimental results on stan-
dard benchmarks show that our Robust Class-Agnostic Counter (RCAC) achieves
state-of-the-art performance. The code is publicly available at https://github.com/
Yankeegsj/RCAC.
1 Introduction
Object counting, i.e., estimating the number of object instances of a certain category in a
given image, has a wide range of applications such as video surveillance and agriculture.
However, most methods in previous works can only count pre-defined categories, such
as people [13, 16], animals [3], plants [18, 22] and cars [17]. For most existing works,
each model is typically trained for one category with a large amount of labeled data.
They have two notable limitations. On the one hand, we need to train multiple models if we are required to count objects of various categories, which is computationally expensive and inconvenient. On the other hand, such models cannot be adapted to unseen categories at test time. In practice, however, it is desirable to develop counting methods that are more general and flexible, extendable to arbitrary new categories at test time.
To this end, class-agnostic object counting is more suited for real applications and
has been investigated recently. Interactive Object Counting (IOC) [2] addresses the
counting task with human interaction. The user is asked to annotate a small number
of objects with dots and the algorithm learns a codebook and partitions all pixels into
object and background groups. This process is repeated until the results are satisfactory.
In contrast, some more recent works [15, 19] formulate counting as a matching prob-
lem, turning out to be more effective and efficient. Generic Matching Network (GMN)
[15] learns the matching function from concatenation of query image and exemplar box
features to a similarity heatmap. When adapting the model to a novel category, only
a fraction of parameters need to be optimized. Few-shot adaptation & matching Net-
work (FamNet) [19] computes the correlation maps between exemplar box and image
features and then predicts the density map.
However, the current best performance is still far from satisfactory. For example,
the average ground truth count on the FSC-147 validation set is 63.54, while the mean absolute error (MAE) of the current top method FamNet [19] is as high as 24. In order
to understand the limitations of current methods, we analyze failure cases and find that
objects of interest in the same image may differ in color, shape and scale, which largely
hinders counting performance. A detailed analysis can be found in Sect. 3. It has been
shown by FamNet [19] that it is helpful to provide more diverse exemplar boxes. Yet the
exemplar boxes are provided by annotators subjectively and thus the diversity cannot
be guaranteed; also, the number of provided exemplar boxes is limited, potentially not
covering all instances. To address this problem, in this paper, we aim to develop a
new counting method, which is more robust to intraclass diversity. Specifically, we
propose two effective modules. On the one hand, we apply exemplar augmentation in the feature space to handle high diversity in various aspects. On the other hand, we
introduce an additional matching branch that uses edge features to deal with diversity
in color.
To summarize, the main contributions of our work are as follows: (1) We ana-
lyze the top-performing class-agnostic counting method FamNet [19], showing that
intra-class diversity is a key factor decreasing counting performance, and point out
the diversity comes from three aspects: color, shape and scale. (2) Two modules are
proposed to overcome the high diversity challenge. The exemplar feature augmenta-
tion module increases the exemplar diversity so as to achieve more effective matching
with a wide range of instances. Moreover, the additional matching branch using edge
features focuses on the more reliable cue of shape, down-weighting less reliable
cues, including background and object colors. (3) Experimental results on two related
datasets show that our method achieves state-of-the-art results for class-agnostic count-
ing, outperforming previous methods by a large margin; also, since no test time adapta-
tion is employed, our method is more convenient to apply.
2 Related Work
In this section, we first briefly review recent works on class-aware object counting and
then focus on class-agnostic object counting methods.
Class-Aware Object Counting. Most object counting methods are limited to pre-
defined categories, e.g., people, animals and cars. Generally, they can be divided into
two groups. One of them is detection based counting [4, 9, 12]. These methods apply an object detector to the given image, and then count the number of bounding boxes. However, it is hard to choose a proper detection confidence threshold to select reasonable boxes, and object detectors usually perform poorly in crowded scenes. The other
group is regression based counting [5, 6, 13, 16, 19, 22]. These methods estimate a den-
sity map for each image, and counting is achieved by summing up the pixel values. For
both kinds of methods, box or point annotations for all instances are required at training
time, which are rather expensive. Class-aware object counters perform well on trained
categories but they cannot be adapted to a new category at test time. Also, it is expensive
to obtain rich training annotations.
Class-Agnostic Object Counting. Similar to class-aware object counting, a straight-
forward way for class-agnostic object counting is to apply a few-shot object detector
[7, 10, 11] on the given image. But the major disadvantage is that it is tricky to choose a
proper detection score threshold for counting; also, the detectors usually fail in crowded scenes. In contrast, regression based methods are cleaner and expected to achieve higher
performance.
Some early regression based works perform pixel-wise classification. For example,
IOC [2] learns a codebook from a few dot annotations marked by the user, so as to
distinguish object and background pixels. Few-Shot Sequential Approach (FSSA) [24]
uses the extracted prototype features to classify each pixel as one of the object classes
present in the support set or as background. More recently, counting is formulated as
a matching problem, which becomes more effective and efficient. GMN [15] proposed
a class-agnostic counting approach consisting of three modules, namely embedding
module, matching module and adaptation module. The exemplar box and query image
features extracted from the embedding module are concatenated and fed to the match-
ing module to predict a similarity heatmap. The adaptation module is used to adapt to a new domain and is the only module that needs to be updated. FamNet
[19] and Class-agnostic Few-shot Object Counting Network (CFOCNet) [25] are most
related to our work. They both compute correlation maps between the exemplar boxes and query image and then predict density maps based on them. FamNet performs additional fine-tuning at test time. Model Agnostic Meta Learning (MAML) [8]
based few-shot approaches also fine-tune some parameters to make the model better
adapt to novel classes. In this paper, we also employ correlation maps for matching.
The major difference is that, we propose new modules against high diversity aiming for
more effective matching.
Fig. 1. Failure cases of FamNet* [19] from the FSC-147 dataset. In each row, from left to right, we show the query image, its ground truth density map, the estimated density map given each exemplar box, and the final averaged density map. $B^r_1$, $B^r_2$, $B^r_3$ are shown as red, blue and white bounding boxes, respectively. The numbers indicate the ground truth counts or estimated counting results. Best viewed in color.
3 Analysis of Intraclass Diversity

We take FamNet [19] as our baseline; since we do not use its test-time adaptation, we denote our re-trained version as FamNet*. The pipeline of FamNet* is as follows (shown as the black arrows in Fig. 2): the query image is fed to the backbone network (ResNet-50) for feature extraction, which is pre-trained on ImageNet and not updated during training; multi-scale features for each
exemplar box are obtained by performing ROI pooling on the feature maps from the
third and fourth ResNet-50 blocks; the query image features also come from the third
and fourth blocks; correlation maps are calculated by taking each exemplar box feature
as a convolution kernel, which is applied to the query image feature maps; the den-
sity map is then predicted by a shallow subnet consisting of 5 conv layers using the
correlation matching maps as input.
We start by analyzing failure cases of the FamNet* model we trained on FSC-147.
We pick those samples with relative errors higher than 20% and do visual inspection.
The relative error is calculated as absolute prediction error divided by the ground truth
count. By observing the above samples, we find three typical factors that affect the
performance: high diversity w.r.t. color, scale and shape. In Fig. 1 we show some failure cases from the FSC-147 dataset. Each image is provided with three exemplar boxes ($B^r_1$, $B^r_2$, $B^r_3$), each of which generates a density map; the final output density map is obtained by averaging the three. The counting number (shown at the bottom right) of each density map is calculated by summing up all its pixel values.
The color diversity comes from two aspects: the foreground objects and the back-
ground. For the query image in row 1, there is a high color difference among the object
instances. Although the provided three exemplar boxes are of different colors, they still
fail to cover all colors of different objects. Similarly, in row 2, we can also see color
Fig. 2. Pipeline of our proposed method, Robust Class-Agnostic Counter (RCAC). Given an input query image $I^q$ along with several exemplar boxes $B^r$, we first apply an edge detector to obtain a gray-scale edge image, and then run a two-stream scheme that matches the RGB and edge images in parallel. Specifically, the RGB and edge images are fed into separate backbone networks for feature extraction; exemplar box features are cropped from the full feature maps via ROI pooling and augmented via our proposed feature augmentation module; after that, the feature correlation layer takes each exemplar feature map as a convolution kernel to compute a correlation map over the full feature maps; the correlation maps from the same exemplar go through the density prediction module to generate one density map, and the final predicted density map is obtained by averaging all density maps. Note that only the edge backbone and the density prediction module (represented by blue trapezoids) are optimized during training. The black arrows indicate the flows shared between our method and the baseline.
difference inside the object boxes. The fish are all white, but their background regions
differ in color. The high diversity w.r.t. color results in large counting errors.
For the query image in row 3, the chairs are distributed from near to far, showing large variance in scale. Although the provided three exemplar boxes are of different scales, they are not able to cover the full scale range of all instances. This challenging scenario makes the predicted count (37.36) much lower than the ground truth (252).
For the query image in row 4, the skateboards show different shapes caused by different orientations. Many of them are placed horizontally, but some are placed vertically, e.g. those close to the blue box. Also, they are of different scales from near to far. Moreover, the color varies a lot across instances. This example shows that different factors may occur simultaneously, leading to very challenging scenarios.
We further provide some quantitative analysis regarding the impact of diversity.
First, from the full validation set we select three subsets with high diversity w.r.t. color,
scale and shape respectively in the following way. For each image, we compute the
variance values w.r.t. color (represented by hue), scale (represented by area) and shape
(represented by aspect ratio) based on the provided exemplar boxes. Then we select the few hundred images with the highest scale variance as the scale diversity subset, and proceed analogously for color and shape. A comparison of the results of FamNet* on the full validation set
and three diverse subsets is shown in Table 1. We can find that compared to the full
validation set, diversity w.r.t. color, scale and shape all bring a significant performance
drop. Especially for the subsets with high color and scale diversity, the performance drops by ∼20 pp w.r.t. MAE.
The above analysis indicates that counting performance is highly affected by the
diversity of object instances. Therefore, we are aiming to develop a new counting
method robust to high intraclass diversity. Qualitative results of our method on high intraclass diversity images are shown in Fig. 4.
4 Method

In this section, we first introduce the setting of few-shot counting. After that, we provide the pipeline of our method, followed by detailed descriptions of the two new modules: exemplar feature augmentation and edge matching.

4.1 Few-Shot Counting Setting
We follow the few-shot setting of our baseline method FamNet* [19]: given a query image $I^q \in \mathbb{R}^{3 \times H \times W}$ and $K$ exemplar bounding boxes $B^r \in \mathbb{R}^{K \times 4}$ that locate the reference instances belonging to the same category, the task is to predict the density map $Y$ of the query image; the counting number is calculated by summing up all pixel values of $Y$.
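As a minimal illustration of this setting (our own sketch, not code from the paper), the final count is simply the sum over the predicted density map:

```python
import torch

def count_from_density(density_map: torch.Tensor) -> float:
    """Predicted count = sum of all pixel values of the density map Y."""
    return density_map.sum().item()
```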
4.2 Pipeline
Each exemplar feature map is first resized to a common size, determined by the largest exemplar box. Then we rescale each exemplar feature map by factors of 0.9 and 1.1 to obtain multi-scale features, following FamNet*. In this way, we obtain multi-level and multi-scale feature maps for each exemplar:

$F^{q,r}_{k,i,s} = \mathrm{Resize}(F^{q,r}_{k,i},\, s), \quad i \in \{1, 2\},\ s \in \{0.9, 1.0, 1.1\}, \qquad (2)$

where $F^{q,r}_{k,i}$ denotes the ROI-pooled feature map of the $k$-th exemplar at the $i$-th feature level.
Each of these exemplar feature maps is then used as a convolution kernel and correlated with the query feature maps via a convolution operation (Conv), yielding multiple correlation maps. After this convolution, for each exemplar we append the obtained 6 correlation maps (2 × 3: two levels, i.e., the outputs of the 3rd and 4th ResNet-50 blocks, and three scales, i.e., 0.9, 1.0 and 1.1) to $M^q_k$ for density prediction.
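As a rough sketch of this matching step (our own illustration; the tensor layout and the padding choice are assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def correlation_map(query_feat: torch.Tensor,
                    exemplar_feat: torch.Tensor) -> torch.Tensor:
    """Use one (resized) exemplar feature map as a convolution kernel
    and slide it over the query feature map.

    query_feat:    (1, C, H, W) backbone feature map of the query image.
    exemplar_feat: (C, h, w)    one ROI-pooled, resized exemplar feature.
    Returns a (1, 1, ~H, ~W) correlation map.
    """
    kernel = exemplar_feat.unsqueeze(0)      # (1, C, h, w)
    pad_h = exemplar_feat.shape[-2] // 2     # roughly preserve spatial size
    pad_w = exemplar_feat.shape[-1] // 2
    return F.conv2d(query_feat, kernel, padding=(pad_h, pad_w))
```

In the full pipeline, this step is repeated for every exemplar, level and scale, giving the 6 correlation maps per exemplar mentioned above.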
Density Prediction. For the $k$-th exemplar, given $M^q_k$ from the previous step, the density prediction module $D$ predicts a corresponding density map. The final density map is obtained by averaging the $K$ density maps. The density prediction module is trained with the squared $L_2$ distance between the predicted density map $Y$ and the ground-truth density map $Y^{gt}$:

$L_D = \left\lVert Y - Y^{gt} \right\rVert_2^2. \qquad (5)$
4.3 Exemplar Feature Augmentation

To increase exemplar diversity, we generate additional exemplar features as convex combinations of the $K$ provided ones, i.e., each augmented feature at level $i$ is $\sum_{k=1}^{K} \alpha_k F^{q,r}_{k,i}$ with a weight vector $\alpha = (\alpha_1, \dots, \alpha_K)$. For instance, when the weight vector is equal to $(1, 0, \dots, 0)$, the augmented feature is the same as $F^{q,r}_{1,i}$. Please note that we sample $\alpha$ from a Dirichlet distribution.
In this way, we obtain a larger set of density maps, and the final density map can be formulated as:

$Y = \mathrm{Mean}\left(D(M^q_1), D(M^q_2), \dots, D(M^q_{K+N})\right), \qquad (7)$

where $N$ denotes the number of augmented exemplars.
Imagine that we want to count objects of various colors, but only three exemplars are given. EFA then acts like creating new samples of different colors in the feature space by combining the provided exemplars. In this way, objects with various colors can be better matched; similarly, intraclass diversity w.r.t. shape and scale can also be handled.
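A minimal sketch of EFA under these assumptions (the function name, tensor layout, and use of torch.distributions are ours, not the paper's):

```python
import torch

def augment_exemplar_features(exemplar_feats: torch.Tensor,
                              num_aug: int,
                              theta=(2.0, 2.0, 2.0)) -> torch.Tensor:
    """Exemplar Feature Augmentation (EFA) sketch.

    exemplar_feats: (K, C, h, w) ROI-pooled features of the K exemplars.
    Returns (K + num_aug, C, h, w): the originals plus convex combinations
    whose weights alpha are drawn from a Dirichlet distribution.
    """
    K = exemplar_feats.shape[0]
    dirichlet = torch.distributions.Dirichlet(torch.tensor(theta[:K]))
    alphas = dirichlet.sample((num_aug,))   # (num_aug, K), rows sum to 1
    # Weighted sums over the exemplar axis -> new "virtual" exemplars.
    augmented = torch.einsum('nk,kchw->nchw', alphas, exemplar_feats)
    return torch.cat([exemplar_feats, augmented], dim=0)
```

Because the weights lie on the simplex, every augmented feature remains a plausible blend of the provided exemplars; e.g., a weight vector close to $(1/K, \dots, 1/K)$ yields their average fusion.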
Dirichlet Distribution. A common distribution in machine learning is the Beta distribution:

$\mathrm{Beta}(\alpha \mid \theta_1, \theta_2) = \dfrac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\,\Gamma(\theta_2)}\, \alpha^{\theta_1 - 1} (1 - \alpha)^{\theta_2 - 1}, \qquad (8)$

where $\Gamma$ denotes the Gamma function. The Dirichlet distribution generalizes the Beta distribution to multiple dimensions and is expressed as:

$\mathrm{Dir}(\alpha \mid \theta) = \dfrac{\Gamma\left(\sum_{k=1}^{K} \theta_k\right)}{\prod_{k=1}^{K} \Gamma(\theta_k)} \prod_{k=1}^{K} \alpha_k^{\theta_k - 1}, \qquad (9)$

where $\alpha = (\alpha_1, \dots, \alpha_K)$ lies on the probability simplex, i.e., $\alpha_k \geq 0$ and $\sum_k \alpha_k = 1$.
4.4 Edge Matching

Different object instances may differ in color, while shape is a more reliable cue across instances, leading to more robust counting. Moreover, edge is a kind of class-agnostic knowledge, which does not introduce category bias. To allow our model to focus more on the shape cue, we introduce an additional matching stream, where edge features are used instead of RGB features.
The gray-scale edge image we use in this paper is generated by the RCF model [14] trained on the BSDS500 dataset [1]. We obtain one edge image for each RGB image. For instance, in Fig. 2, $I^e$ is predicted from $I^q$ with the trained RCF model.
The structure of the edge stream is the same as that of the RGB stream. The only difference is that we use a shallower network for edge feature extraction: since the gray-scale edge image carries far less information than the RGB image, we employ a VGG-like network [23] with fewer channels as the edge backbone, and update it during training.

In the same way as described in Sect. 4.2, for the $k$-th exemplar, we obtain 6 edge correlation maps and append them to $M^e_k$ for the edge branch. Finally, the density prediction module takes $M^q_k$ and $M^e_k$ as input and predicts the corresponding density map, so the final density map can be computed as:

$Y = \mathrm{Mean}\left(D(M^q_1, M^e_1), D(M^q_2, M^e_2), \dots, D(M^q_{K+N}, M^e_{K+N})\right). \qquad (10)$
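A minimal sketch of this two-stream fusion for one exemplar; we assume (our guess, not stated explicitly in the paper) that the RGB and edge correlation maps are concatenated channel-wise before the density prediction module $D$:

```python
import torch

def predict_density(rgb_corr: torch.Tensor,
                    edge_corr: torch.Tensor,
                    density_head: torch.nn.Module) -> torch.Tensor:
    """Fuse RGB and edge correlation maps, then predict a density map.

    rgb_corr, edge_corr: (1, 6, H, W) -- 6 = 2 levels x 3 scales each.
    density_head: the density prediction module D (a shallow conv net).
    """
    fused = torch.cat([rgb_corr, edge_corr], dim=1)  # (1, 12, H, W)
    return density_head(fused)                       # (1, 1, H, W)
```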
5 Experiments
In this section, we first describe the datasets and evaluation metrics we use, followed
by implementation details; then we show our experimental results with comparisons to
the state-of-the-art; finally, we perform ablation studies.
5.1 Datasets
1
N
M AE = Yi − Yi , (11)
N i=1
1 2
N
RM SE = Yi − Yi , (12)
N i=1
where N is the number of test images; Yi , Yi represent ground truth and predicted
counts.
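These metrics are straightforward to compute; below is a small helper (our own sketch) mirroring Eqs. (11) and (12):

```python
import numpy as np

def counting_metrics(gt_counts, pred_counts):
    """Return (MAE, RMSE) over a test set, as in Eqs. (11)-(12)."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    mae = float(np.mean(np.abs(gt - pred)))
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))
    return mae, rmse
```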
From the results in Table 2, we have the following observations. (1) Generally,
regression based counting methods (GMN [15], MAML [8], FamNet [19] and Ours)
perform better than detection based approaches (FR [11], FSOD [7]). (2) Our method
outperforms the baseline method FamNet [19] by a large margin. In particular, on the
validation set, the gain is 3.21 pp w.r.t. MAE; on the test set, the improvement is as
large as 17.68 pp w.r.t. RMSE. These improvements demonstrate the effectiveness of our two proposed modules. (3) Our method surpasses all existing methods, defining a new state-of-
the-art on the FSC-147 dataset.
Table 2. Comparison of our method and previous methods on the FSC-147 dataset.
We further examine the improvements of our method over the baseline on high-diversity images. Table 1 shows the effects of the two modules on the high-diversity subsets. As stated in the abstract and introduction, EFA handles all kinds of diversity, while EM focuses on handling color diversity; the results indicate that our method is more robust to high intraclass diversity.
Additionally, Fig. 4 shows some qualitative results on the FSC-147 dataset. In row 1, our method obtains stronger responses and a more accurate count in a scenario of high shape diversity caused by severe occlusion. In row 2, our method produces cleaner density maps with less noise in the background regions than the baseline by handling color diversity: inside each exemplar box, the background colors are dominant, resulting in noisy responses at background regions on the baseline density map. In row 3, our method produces more balanced density maps across different scales than the baseline by handling scale diversity. In row 4, our method produces more uniform density maps across instances with diverse foreground colors.
CARPK Dataset. Following [19], we further verify our method on the CARPK dataset, due to the lack of class-agnostic counting datasets. The experiments are conducted under the same few-shot setting. Since CARPK contains only one category, it can be considered a rather simple version of class-agnostic object counting. The results are shown in Table 3. Our model outperforms all previous approaches except GMN, which
Table 3. Comparison of car counting performance on the CARPK dataset. *GMN uses extra
images of cars from the ILSVRC video dataset for training. “Fine-tuned” denotes whether the
models are further fine-tuned on CARPK.
uses external training images of cars from the ILSVRC video dataset. It is notable that
our approach improves over FamNet by 4.57 pp w.r.t. MAE and 14.58 pp w.r.t. RMSE.
These results indicate that our method generalizes well to different datasets.
Fig. 4. Qualitative results of different methods on high intraclass diversity images from FSC-147. Best viewed zoomed in and in color. (Color figure online)
In the following, we conduct some ablation studies to analyze our proposed exemplar
feature augmentation and edge matching modules. All experiments are conducted on
the validation set of FSC-147.
Effects of Two New Modules. As shown in Table 4, the performance is improved by 1.24 pp w.r.t. MAE (from 24.32 to 23.08) when exemplar feature augmentation is employed. On the other hand, we also observe a remarkable improvement of 1.03 pp w.r.t. MAE (from 24.32 to 23.29) from edge matching. Moreover, we obtain a total gain of 3.78 pp w.r.t. MAE by adding both modules. These results demonstrate the effectiveness of the two proposed modules.
Impact of Dirichlet Distribution Parameter θ. From Table 5, we can see that our method obtains consistent improvements over the baseline by using exemplar feature augmentation, no matter which sampling parameter we choose. By analyzing Fig. 3 and Table 5 together, we find it works better to set the $\theta_i$ evenly ({2,2,2} vs. {3,2,2}), such that we have a high probability of including the average fusion of the $K$ exemplars. It also helps to have a larger sampling area ({2,2,2} vs. {5,5,5}), such that more diverse combinations can be generated. Finally, we set θ to {2,2,2} for all our experiments as it performs the best.
Table 5. Impact of the Dirichlet distribution parameter θ (FSC-147 validation set).

θ         | MAE   | RMSE
–         | 23.29 | 63.35
{3,2,2}   | 21.44 | 63.40
{5,5,5}   | 20.46 | 61.70
{2,2,2}   | 20.54 | 60.78
Inference Time Analysis. To verify the efficiency of our RCAC, we compare the inference time of RCAC with FamNet* and FamNet in Table 7. Our RCAC runs slightly slower than the baseline FamNet* (75 ms vs. 47 ms) due to the additional computation for edge detection. In order to accelerate RCAC, we replace RCF with Sobel operators, which reduces the inference time at the cost of a small performance drop (a minimal sketch of such a Sobel edge extractor is given after Table 7). Note that RCAC (w/ Sobel) still outperforms FamNet* by ∼2 pp at a similar speed; and compared to the previous top method FamNet, our RCAC (w/ RCF) not only obtains better performance (by ∼3 pp), but also runs much faster (75 ms vs. 3,900 ms).
Table 7. Inference time comparison (K = 3, N = 0).

Method           | MAE   | T (ms)
FamNet*          | 24.32 | 47
FamNet           | 23.75 | 3,900
RCAC (w/ Sobel)  | 22.41 | 59
RCAC (w/ RCF)    | 20.94 | 75
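As referenced above, a Sobel edge extractor of this kind can be written in a few lines; the following is our own sketch (the grayscale conversion and lack of normalization are assumptions), not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sobel_edge_image(rgb: torch.Tensor) -> torch.Tensor:
    """Gray-scale edge-magnitude image from an RGB image.

    rgb: (1, 3, H, W), values in [0, 1]. Returns (1, 1, H, W).
    """
    gray = rgb.mean(dim=1, keepdim=True)          # naive grayscale conversion
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]])            # horizontal-gradient kernel
    ky = kx.t()                                   # vertical-gradient kernel
    gx = F.conv2d(gray, kx.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(gray, ky.view(1, 1, 3, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)  # gradient magnitude
```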
In the supplementary material, we provide further ablation studies on the effect of using edge images in the 2nd branch, the effect of the number of exemplars, qualitative results of augmented exemplars, and an application of EFA to another task.
6 Conclusion
In this paper, we analyze failure cases of the previous top-performing class-agnostic object counter and find that high intraclass diversity in the query image has an adverse effect on
counting performance. To solve this problem, we propose two novel modules: exemplar
feature augmentation and edge matching. They make our counter robust to high intra-
class diversity. Extensive experiments have demonstrated the effectiveness and robust-
ness of our method.
Acknowledgements. This work was supported in part by the National Natural Science Founda-
tion of China (Grant No. 62172225), Fundamental Research Funds for the Central Universities
(No. 30920032201) and the “111” Program B13022.
References
1. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image
segmentation. PAMI 33(5), 898–916 (2010)
2. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Interactive object counting. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 504–518. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_33
3. Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 483–498. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_30
4. Chattopadhyay, P., Vedantam, R., Selvaraju, R.R., Batra, D., Parikh, D.: Counting everyday
objects in everyday scenes. In: CVPR, pp. 1135–1144 (2017)
5. Cholakkal, H., Sun, G., Khan, F.S., Shao, L.: Object counting and instance segmentation
with image-level supervision. In: CVPR, pp. 12397–12405 (2019)
6. Cholakkal, H., Sun, G., Khan, S., Khan, F.S., Shao, L., Van Gool, L.: Towards partial super-
vision for generic object counting in natural scenes. PAMI (2020)
7. Fan, Q., Zhuo, W., Tang, C.K., Tai, Y.W.: Few-shot object detection with attention-RPN and
multi-relation detector. In: CVPR, pp. 4013–4022 (2020)
8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep
networks. In: ICML, pp. 1126–1135 (2017)
9. Hsieh, M.R., Lin, Y.L., Hsu, W.H.: Drone-based object counting by spatially regularized
regional proposal network. In: ICCV, pp. 4145–4153 (2017)
10. Hsieh, T.I., Lo, Y.C., Chen, H.T., Liu, T.L.: One-shot object detection with co-attention and
co-excitation. In: NIPS, pp. 2725–2734 (2019)
11. Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature
reweighting. In: ICCV, pp. 8420–8429 (2019)
12. Laradji, I.H., Rostamzadeh, N., Pinheiro, P.O., Vazquez, D., Schmidt, M.: Where are the
blobs: counting by localization with point supervision. In: ECCV, pp. 547–562 (2018)
13. Liu, Y., Wen, Q., Chen, H., Liu, W., Qin, J., Han, G., He, S.: Crowd counting via cross-stage
refinement networks. IEEE TIP 29, 6800–6812 (2020)
14. Liu, Y., Cheng, M.M., Hu, X., Bian, J.W., Zhang, L., Bai, X., Tang, J.: Richer convolutional
features for edge detection. PAMI 41(8), 1939–1946 (2019)
15. Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 669–684. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6_42
16. Mo, H., Ren, W., Xiong, Y., Pan, X., Zhou, Z., Cao, X., Wu, W.: Background noise filtering
and distribution dividing for crowd counting. IEEE TIP 29, 8199–8212 (2020)
17. Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A large contextual dataset for classification, detection and counting of cars with deep learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 785–800. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_48
18. Rahnemoonfar, M., Sheppard, C.: Deep count: fruit counting based on deep simulated learn-
ing. Sensors 17(4), 905 (2017)
19. Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In: CVPR, pp.
3394–3403 (2021)
20. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time
object detection. In: CVPR, pp. 779–788 (2016)
21. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. NIPS 28, 91–99 (2015)
22. Ribera, J., Guera, D., Chen, Y., Delp, E.J.: Locating objects without bounding boxes. In:
CVPR, pp. 6479–6489 (2019)
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recog-
nition. arXiv preprint arXiv:1409.1556 (2014)
24. Sokhandan, N., Kamousi, P., Posada, A., Alese, E., Rostamzadeh, N.: A few-shot sequential
approach for object counting. arXiv preprint arXiv:2007.01899 (2020)
25. Yang, S.D., Su, H.T., Hsu, W.H., Chen, W.C.: Class-agnostic few-shot object counting. In:
WACV, pp. 870–878 (2021)