and Correlative-Attention modules, LaoNet exploits the correlation among objects of a novel category with high accuracy and efficiency.

• We propose a Scale Aggregation mechanism to extract more comprehensive features and fuse multi-scale information from the supporting box.

• Experimental results show that our model achieves state-of-the-art results with significant improvements on the FSC-147 [4] and COCO [6] datasets under the one-shot setting, without fine-tuning.

2. RELATED WORKS

Object counting methods can be broadly divided into two types. Detection based methods [7] count objects by exhaustively detecting every target in an image, but they rely on complex labels such as bounding boxes. Regression based methods [1, 2] learn to count by predicting a density map, in which each value represents the density of target objects at the corresponding location. The predicted count equals the sum over the density map.

Nevertheless, most counting methods are category specific, e.g. for human crowds [1, 2, 8, 9, 10, 11], cars [3, 12], plants [13] or cells [14, 15]. They focus on a single category and lose their original performance when transferred to other categories. Moreover, most traditional approaches rely on tens of thousands of instances to train a counting model [2, 8, 9, 11, 3, 12].

To considerably reduce the number of samples needed to train a counting model for a particular category, the few-shot counting task has recently been developed. The key lies in the generalization ability of the model to handle novel categories given only a few labeled examples. The study [16] proposes a Generic Matching Network (GMN) for class-agnostic counting; however, it still needs several dozens to hundreds of examples of a novel category for adaptation and good performance. CFOCNet is introduced to match and utilize the similarity between objects within the same category [5]. The work [4] presents a Few-Shot Adaptation and Matching Network (FamNet) to learn feature correlations and few-shot adaptation, and also introduces a few-shot counting dataset named FSC-147.

When the number of labeled examples decreases to one, the task evolves into one-shot counting. In other visual tasks, researchers have developed methods for one-shot segmentation [17] and one-shot object detection [18, 19]. Compared to the few-shot setting, which usually uses at least three instances for each object [4], the one-shot setting, where only one instance is available, is clearly more challenging.

It is worth mentioning that detection based approaches [20, 21, 22] are inferior for the tasks of few-shot and one-shot counting. One main reason is that they require extra and costly bounding-box annotations of all instances in the training stage, whereas the one-shot counting approach we focus on depends only on dot annotations and a single supporting box. To illustrate this point further, we perform experiments in Section 4.3 to compare with detection based approaches and validate the proposed network for one-shot counting.

3. APPROACH

3.1. Problem Definition

One-shot object counting involves a training set (I_t, s_t, y_t) ∈ T and a query set (I_q, s_q) ∈ Q, whose categories are mutually exclusive. Each input to the model contains an image I and a supporting bounding box s annotating one object of the desired category. In the training set, abundant
point annotations y_t are available to supervise the model. In the inference stage, the model is expected to count the novel objects in I_q given a supporting instance of the target category sampled from s_q.
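For concreteness, a training sample and a query sample could be represented roughly as follows. This is a minimal sketch: the field names and tensor shapes are illustrative assumptions, not the actual format of FSC-147.

```python
import torch

# Minimal sketch of one-shot counting samples (field names and shapes are illustrative).
train_sample = {
    "image": torch.rand(3, 384, 512),                          # I_t: image of a base category
    "support_box": torch.tensor([120.0, 80.0, 160.0, 130.0]),  # s_t: one exemplar box (x1, y1, x2, y2)
    "points": torch.rand(57, 2),                               # y_t: dot annotations of all instances
}
query_sample = {
    "image": torch.rand(3, 384, 512),                          # I_q: image of a novel category
    "support_box": torch.tensor([30.0, 40.0, 70.0, 95.0]),     # s_q: a single exemplar box, no point labels
}
```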
3.2. Feature Correlation

As the model is required to learn to count from only one supporting object, capturing the correlation between features with high efficiency is crucial. Therefore, we build the feature correlation model in our one-shot network upon Self-Attention and Correlative-Attention modules, which learn inner-relations and inter-relations respectively.

As illustrated in Figure 1 (violet block), our Self-Attention module consists of a Multi-head Attention (MA) and a layer normalization (LN). We first introduce the definition of attention [23], given the query Q, key K and value vector V:

A(Q, K, V | W) = S( (Q W^Q)(K W^K)^T / √d + PE ) (V W^V),   (1)

where S is the softmax function and 1/√d is a scaling factor based on the vector dimension d. W: W^Q, W^K, W^V ∈ R^{d×d} are projection weight matrices and PE is the position embedding.

To leverage more representation subspaces, we adopt the extended form with multiple attention heads:

MA(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where  head_i = A(Q, K, V | W_i).   (2)

The representation dimensions are divided across the parallel attention heads, where the parameter matrices W_i: W_i^Q, W_i^K, W_i^V ∈ R^{d×d/h} and W^O ∈ R^{d×d}.
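As a minimal sketch of Eqs. (1)–(2): the single-head function below follows Eq. (1) literally (with PE as an additive term inside the softmax, as written above), while the multi-head form of Eq. (2) is shown via PyTorch's standard multi-head attention as a stand-in; all names and default dimensions here are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v, wq, wk, wv, pe_bias=None):
    """Eq. (1): softmax((Q Wq)(K Wk)^T / sqrt(d) + PE) (V Wv).

    q, k, v: (batch, seq, d); wq, wk, wv: (d, d) projection matrices.
    pe_bias: optional additive term inside the softmax, as written in Eq. (1).
    """
    d = q.size(-1)
    scores = (q @ wq) @ (k @ wk).transpose(-2, -1) / d ** 0.5
    if pe_bias is not None:
        scores = scores + pe_bias
    return F.softmax(scores, dim=-1) @ (v @ wv)

# Eq. (2): h parallel heads of width d/h, concatenated and projected by W^O.
# PyTorch's nn.MultiheadAttention performs exactly this head splitting and output projection.
multi_head_attention = nn.MultiheadAttention(embed_dim=512, num_heads=4, batch_first=True)
```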
One challenging problem in the counting task is the presence of many complex interfering objects. To efficiently weaken the negative influence of such irrelevant background, we apply Multi-head Self-Attention to the image features to learn inner-relations and encourage the model to focus on the repetitive objects that can be counted.

We denote the feature sequences of the query image and the supporting box region as X and S, with sizes X ∈ R^{HW×C} and S ∈ R^{hw×C}. The refined query feature is calculated by:

X̃ = LN(MA(X_Q, X_K, X_V) + X).   (3)

A layer normalization (LN) is adopted to balance the value scales.

Meanwhile, as there is only one supporting object in the one-shot counting problem, refining the salient features within that object is necessary and helpful for counting efficiency and accuracy. Therefore we apply another Self-Attention module to the supporting features and obtain the refined S̃.
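A minimal sketch of the Self-Attention module of Eq. (3), using PyTorch's multi-head attention as the MA block; the embedding width and head count below are placeholder defaults, not prescribed values.

```python
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Eq. (3): X_tilde = LN(MA(X, X, X) + X)."""
    def __init__(self, d=512, h=4):
        super().__init__()
        self.ma = nn.MultiheadAttention(d, h, batch_first=True)
        self.ln = nn.LayerNorm(d)

    def forward(self, x):                 # x: (batch, HW, C) flattened feature sequence
        attn_out, _ = self.ma(x, x, x)    # query, key and value all come from x
        return self.ln(attn_out + x)      # residual connection followed by layer normalization
```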
Previous few-shot counting methods [4, 5] usually adopt a convolution operation in which the supporting features act as kernels to match similarities for the target category. However, the results then depend heavily on the quality of the supporting features and on the consistency of object properties such as rotation and scale.

To this end, we propose a Correlative-Attention module to learn inter-relations between query and supporting features and to alleviate the constraints of such irrelevant properties.

Specifically, we extend the MA by learning correlations between different feature sequences and add a feed-forward network (FFN) to fuse the features, i.e.,

X* = Corr(X̃, S̃) = G(MA(X̃_Q, S̃_K, S̃_V) + X̃),   (4)

where G includes two LNs and an FFN in residual form (light blue block in Figure 1). Finally, X* and S̃ are fed into the next cycle as new feature sequences, where each cycle consists of two Self-Attention modules and one Correlative-Attention module.
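A sketch of the Correlative-Attention module of Eq. (4). The exact placement of the two layer normalizations around the FFN inside G is our assumption about the "light blue block" of Figure 1, and the FFN width is a placeholder.

```python
import torch.nn as nn

class CorrelativeAttentionBlock(nn.Module):
    """Eq. (4): X* = G(MA(X_tilde as query, S_tilde as key/value) + X_tilde)."""
    def __init__(self, d=512, h=4, ffn_dim=2048):
        super().__init__()
        self.ma = nn.MultiheadAttention(d, h, batch_first=True)
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, d))

    def forward(self, x, s):               # x: (B, HW, C) query sequence; s: (B, hw, C) supporting sequence
        cross, _ = self.ma(x, s, s)        # the query sequence attends to the supporting sequence
        y = self.ln1(cross + x)            # first LN on the residual cross-attention output
        return self.ln2(y + self.ffn(y))   # FFN in residual form, then the second LN
```

In one correlation cycle, two Self-Attention blocks (one for X̃ and one for S̃) would then be followed by a Correlative-Attention block, and the cycle repeated T times.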
3.3. Feature Extraction and Scale Aggregation

To extract feature sequences from images, we use VGG-19 as our backbone. For the query image, the output of the final level is directly flattened and passed into the Self-Attention module. For the supporting box, as there are uncontrollable scale variations among instances due to perspective, we propose a Scale Aggregation mechanism to fuse information from different scales.

Given l as the number of layers in the CNN, we aggregate the feature maps across different scales:

S = Concat(F^l(s), F^{l−1}(s), ..., F^{l+1−δ}(s)),   (5)

where F^i represents the feature map at the i-th level and δ ∈ [1, l] decides the number of layers taken for aggregation.
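A sketch of the Scale Aggregation of Eq. (5) on top of VGG-19, under two assumptions of ours: the "levels" are taken to be the VGG-19 stages ending at each max-pooling layer, and the multi-scale maps are flattened and concatenated along the token dimension (which requires matching channel widths; the last two VGG-19 stages both have 512 channels, compatible with δ = 2).

```python
import torch
import torch.nn as nn
import torchvision

class ScaleAggregation(nn.Module):
    """Eq. (5): concatenate supporting-box feature maps from the top `delta` backbone levels."""
    def __init__(self, delta=2):
        super().__init__()
        backbone = torchvision.models.vgg19(weights=None).features  # weights=None keeps the sketch self-contained
        stages, current = [], []
        for layer in backbone:                     # split the VGG-19 features at max-pool boundaries
            current.append(layer)
            if isinstance(layer, nn.MaxPool2d):
                stages.append(nn.Sequential(*current))
                current = []
        self.stages = nn.ModuleList(stages)
        self.delta = delta

    def forward(self, s):                          # s: (B, 3, H, W) crop of the supporting box
        feats, x = [], s
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # keep the top `delta` levels, flatten each to (B, h*w, C) and concatenate as one sequence
        tokens = [f.flatten(2).transpose(1, 2) for f in feats[-self.delta:]]
        return torch.cat(tokens, dim=1)
```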
Meanwhile, we leverage the position embedding to help the attention model distinguish the integrated scale information. By adopting the fixed sinusoidal absolute position embedding [23], feature sequences from different scales can still maintain consistency between positions, i.e.,

PE_(pos_j, 2i)   = sin(pos_j / 10000^{2i/d}),
PE_(pos_j, 2i+1) = cos(pos_j / 10000^{2i/d}),   (6)

where i is the dimension index and pos_j is the position in the j-th feature map.
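A minimal sketch of the fixed sinusoidal embedding of Eq. (6); it assumes an even embedding width d.

```python
import torch

def sinusoidal_position_embedding(num_positions, d):
    """Eq. (6): fixed sinusoidal absolute position embedding [23]; assumes d is even."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                        # even dimension indices
    angles = pos / torch.pow(torch.tensor(10000.0), i / d)                # (P, d/2)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                                             # (P, d)
```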
3.4. Training Loss

We use the Euclidean distance to measure the difference between the estimated density map and the ground-truth density map, which is generated from the annotated points following [1]. The loss is defined as follows:

L_E = ||D^gt − D||_2^2,   (7)
where D is the estimated density map and D^gt is the ground-truth density map. To improve local pattern consistency, we also adopt an SSIM loss following the calculation in [8]. Integrating the two loss terms, we have

L = L_E + λ L_SSIM,   (8)

where λ is the balancing weight.
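A sketch of Eqs. (7)–(8). The SSIM term below is a simplified local SSIM computed with a uniform window rather than the Gaussian-weighted version of [8], and taking L_SSIM = 1 − mean SSIM is our assumption.

```python
import torch
import torch.nn.functional as F

def ssim_map(d, d_gt, window=11, c1=1e-4, c2=9e-4):
    """Local SSIM with a uniform window (simplification of the Gaussian-windowed SSIM in [8])."""
    pad = window // 2
    mu_x = F.avg_pool2d(d, window, 1, pad)
    mu_y = F.avg_pool2d(d_gt, window, 1, pad)
    sigma_x = F.avg_pool2d(d * d, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(d_gt * d_gt, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(d * d_gt, window, 1, pad) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))

def counting_loss(d, d_gt, lam=1e-4):
    """Eq. (8): L = L_E + lambda * L_SSIM, with L_E the squared Euclidean distance of Eq. (7)."""
    l_e = F.mse_loss(d, d_gt, reduction="sum")     # ||D_gt - D||_2^2 over density maps of shape (B, 1, H, W)
    l_ssim = 1.0 - ssim_map(d, d_gt).mean()        # assumed form of the SSIM loss
    return l_e + lam * l_ssim
```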
4. EXPERIMENTS

4.1. Implementation Details and Evaluation Metrics

We build the density regressor from an upsampling layer and three convolution layers with ReLU activation. The kernel sizes of the first two layers are 3 × 3 and that of the last is 1 × 1. Random scaling and flipping are applied to each training image. Adam [24] with a learning rate of 0.5 × 10^−5 is used to optimize the parameters. We set the number of attention heads h to 4, the number of correlation cycles T to 2, the number of aggregated layers δ to 2, and the loss balance weight λ to 10^−4.
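A sketch of the density regressor as described: one upsampling layer followed by three convolutions with ReLU, using 3 × 3, 3 × 3 and 1 × 1 kernels. The channel widths and the upsampling factor are not stated above and are assumptions here.

```python
import torch.nn as nn

# Density regressor sketch: upsample, then 3x3 conv, 3x3 conv, 1x1 conv, each with ReLU.
# Channel widths (512 -> 256 -> 128 -> 1) and the x2 upsampling factor are assumed values.
density_regressor = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 1, kernel_size=1), nn.ReLU(inplace=True),
)
```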
Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are used to measure the performance of our method. They are defined by:

MAE  = (1/M) Σ_{i=1}^{M} |N_i^gt − N_i|,
RMSE = √( (1/M) Σ_{i=1}^{M} (N_i^gt − N_i)^2 ),   (9)

where M is the number of test images and N_i^gt and N_i denote the ground-truth and estimated counts of the i-th image.
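A direct implementation sketch of Eq. (9); the function name and the use of plain Python lists as input are our own choices.

```python
import torch

def mae_rmse(pred_counts, gt_counts):
    """Eq. (9): MAE and RMSE over M test images."""
    pred = torch.as_tensor(pred_counts, dtype=torch.float32)
    gt = torch.as_tensor(gt_counts, dtype=torch.float32)
    diff = gt - pred
    return diff.abs().mean().item(), diff.pow(2).mean().sqrt().item()

# Example: mae, rmse = mae_rmse([31.2, 15.7, 36.0], [33, 14, 35])
```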
[Qualitative results figure; ground-truth counts of the shown examples: 33, 14, 35.]

                         Val               Test
Methods               MAE     RMSE     MAE     RMSE
3-shot
Mean                  53.38   124.53   47.55   147.67
Median                48.68   129.70   47.73   152.46
FR detector [25]      45.45   112.53   41.64   141.04
FSOD detector [26]    36.36   115.00   32.53   140.65
GMN [16]              29.66    89.81   26.52   124.57
MAML [27]             25.54    79.44   24.90   112.68
FamNet [4]            23.75    69.07   22.08    99.54
1-shot
CFOCNet [5]           27.82    71.99   28.60   123.96
FamNet [4]            26.55    77.01   26.76   110.95
LaoNet (Ours)         17.11    56.81   15.78    97.15

Table 1. Comparisons with previous state-of-the-art few-shot methods on FSC-147. The upper part of the table presents results in the 3-shot setting, while the lower part presents 1-shot results. FamNet [4] uses the adaptation strategy during testing. It is worth noticing that our one-shot LaoNet outperforms all previous methods, even those in the 3-shot setting, without any fine-tuning strategy.

Table 2. Results on each of the four folds of COCO val2017. Methods with † follow the experiment setting in [5]. Our method achieves high accuracy without any fine-tuning on the testing categories.
Table 3. Ablation study of the different terms. X stands for the feature sequence of the query image and S stands for that of the supporting box region. Experiments are performed on the FSC-147 val and test sets.

Table 4. Comparisons with pre-trained object detectors on the FSC147-COCO splits of FSC-147, which contain images of COCO categories. Even though they are pre-trained with thousands of annotated examples on the MS-COCO dataset, these object detectors still yield unsatisfactory accuracy on the counting task.