where $W = Y_N (X_N^\top X_N + \lambda I)^{-1} X_N^\top$ is the embedding matrix computed from the neighborhood of $x$ in the two feature spaces, with $X_N$ and $Y_N$ stacking the neighboring feature vectors and their density map patches, respectively.

Inspired by the idea that the possible crowd behaviors are infinite but the space of distinguishable crowd motion patterns may not be all that large [11], we extend this insight further: in VOC, the salient patterns of the image patches extracted from images of objects are finite. It follows that the primary neighborhoods of image patches in feature space are also finite; hence they can be characterized in the training phase instead of the testing phase, which saves a great deal of time.
To describe the possible neighborhoods of a specific VOC problem, we cluster the feature space of the image patches to approximate them. Assume that the number of possible salient neighborhoods of the training data $X$ and $Y$ is $K$: all feature vectors in $X$ are then clustered into $K$ groups $C_k$ ($1 \le k \le K$), and the counterpart density map cluster $D_k$ is produced by putting the density map patches together according to the index set of the corresponding elements in $C_k$. In Fig. 3, some centroids of the clusters on the UCSD dataset are visualized. These centroids clearly capture the prominent crowd patterns, such as the outlines of heads and bodies.

[Fig. 3. The partial centroids of the clusters on the UCSD dataset (displayed in the foreground feature). The patch size is 8×8.]

With $C_k$ and $D_k$, their embedding matrix $W_k$ can be formulated as

$W_k = D_k (C_k^\top C_k + \lambda I)^{-1} C_k^\top$    (7)

where $\lambda = 0.001$ for our method in the experiments.
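To make this offline stage concrete, a minimal sketch follows. It assumes k-means for the clustering step and uses NumPy and scikit-learn; the function name `build_codebook` and the row-wise data layout are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(X, Y, K, lam=1e-3):
    """Offline stage (sketch): cluster training features and precompute
    one embedding matrix W_k per cluster following Eqn. (7).

    X : (n, d) array, training patch feature vectors (one per row).
    Y : (n, p) array, the corresponding flattened density map patches.
    """
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    centroids = km.cluster_centers_          # (K, d), the c_k used in Eqn. (9)
    Ws = []
    for k in range(K):
        C_k = X[km.labels_ == k].T           # (d, n_k) features in cluster k
        D_k = Y[km.labels_ == k].T           # (p, n_k) matching density patches
        n_k = C_k.shape[1]
        # Eqn. (7): W_k = D_k (C_k^T C_k + lam I)^{-1} C_k^T
        G = C_k.T @ C_k + lam * np.eye(n_k)
        Ws.append(D_k @ np.linalg.solve(G, C_k.T))   # (p, d)
    return centroids, Ws
```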
With all the embedding matrices precomputed, we need to figure out which neighborhood a test patch belongs to. In LLE, this correlation is described by the similarity of the feature vectors in their feature space. Following this idea, we determine the neighborhood of a test patch $x$ from the similarity between $x$ and the centroid $c_k$ of each neighborhood:

$k^* = \arg\min_{k}\ \mathrm{dist}\big(f(x), f(c_k)\big)$    (8)

where $k^*$ is the index of the cluster in which $x$ most likely lies. Here $\mathrm{dist}(\cdot)$ is again the Euclidean distance metric, so the classification of an input patch $x$ is finally formulated as

$k^* = \arg\min_{k}\ \| f(x) - f(c_k) \|_2$    (9)

Thus, the density map of $x$ can be quickly calculated from $W_{k^*}$ and Eqn. (6).

The whole FE-VOC algorithm is summarized in Algorithm 1 and illustrated in Fig. 2; a code sketch of the test-time loop is given after the algorithm.

Algorithm 1 (FE-VOC)
Input: test image $T$, training example sets $X$ and $Y$
Output: the density map $\hat{D}$, the estimated count $N(T)$
1: for each input patch $x_{ij} = P_{ij} T$ extracted from the test image $T$, where $P_{ij}$ is a projection matrix that extracts the $(i,j)$-th patch from $T$, do
2:   Find the index $k^*$ of the desired neighborhood based on $f(x_{ij})$ and Eqn. (9).
3:   Compute the density map patch $\hat{d}_{ij} = W_{k^*} f(x_{ij})$, and put $\hat{d}_{ij}$ into $\hat{D}$ based on $P_{ij}$.
4: end for
5: Get the estimated density map $\hat{D}$ of $T$, and the estimated count $N(T) = \sum_{p \in T} \hat{D}(p)$.
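The following is a minimal sketch of Algorithm 1 under the same assumptions as the previous snippet; the name `predict_density`, the identity feature map standing in for $f(\cdot)$, and the averaging of overlapping patches are our own illustrative choices rather than details fixed by the paper.

```python
import numpy as np

def predict_density(T, centroids, Ws, feat=lambda v: v, patch=4, step=2):
    """Test-time stage of FE-VOC (sketch of Algorithm 1).

    T : (H, W) test image; returns the density map D_hat and the count N(T).
    """
    H, W = T.shape
    D_hat = np.zeros((H, W))
    hits = np.zeros((H, W))                      # overlap counts for averaging
    for i in range(0, H - patch + 1, step):
        for j in range(0, W - patch + 1, step):
            x = T[i:i + patch, j:j + patch].ravel()
            fx = feat(x)
            # Eqn. (9): pick the cluster whose centroid is nearest in feature space.
            k_star = np.argmin(np.linalg.norm(centroids - fx, axis=1))
            d = Ws[k_star] @ fx                  # density patch via W_{k*}
            D_hat[i:i + patch, j:j + patch] += d.reshape(patch, patch)
            hits[i:i + patch, j:j + patch] += 1
    D_hat /= np.maximum(hits, 1)                 # average the overlapping patches
    return D_hat, D_hat.sum()                    # N(T): sum over the density map
```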
and Eqn. (6). use in UCSD are just the same as that in [6]. Specifically, the
4.1. Performance on bacterial cell dataset

On this dataset, we adhere to the experimental protocol of [6], so the performance of the comparative methods can be compared directly. Specifically, the first 100 images are reserved for training and the rest are used for validation. Each time, 5 different random subsets containing $N$ ($N = 1, 2, 4, \ldots, 32$) samples from the training set are generated for calculating the MAE and its standard deviation. In our method, the clustering number $K$ is estimated empirically on the union of the training and testing sets. For features, RR [1], KRR [12], and Density MESA [6] employ dense SIFT coded by bag of words, while E-VOC and FE-VOC only use the raw data extracted from the blue channel of the images.

From Fig. 4, it is worth noting that the MAE of all methods descends gradually as the number of training images rises, and the MAE produced by FE-VOC stays the lowest among all methods when no more than 8 training images are used. When the training size is larger than 8, Density MESA [6] performs best, but only slightly better than FE-VOC.

4.2. Performance on pedestrian datasets

On the UCSD pedestrian dataset, the experimental protocols we use are the same as those in [6]. Specifically, the dataset is divided into 4 different training and testing sets: 1) ‘maximal’: training on frames 600:5:1400; 2) ‘downscale’:
Table 1. Statistics of the three datasets (–: total number of frames; R: the resolution; –: the mean number of objects present in a single frame; C: the color channels of the frames).

[Fig. 4: plot of mean absolute error versus the number of training images on the bacterial cell dataset; the compared methods include RR [1] and KRR [12].]
7. REFERENCES

[1] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, 2012, p. 3.
[2] D. Kong, D. Gray, and H. Tao, "A viewpoint invariant approach for crowd counting," in Proc. 18th International Conference on Pattern Recognition (ICPR), 2006, pp. 1187-1190.
[3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-7.
[4] D. Ryan, S. Denman, C. Fookes, and S. Sridharan, "Crowd counting using multiple local features," in Proc. Digital Image Computing: Techniques and Applications (DICTA), 2009, pp. 81-88.
[5] K. Chen, S. Gong, T. Xiang, and C. C. Loy, "Cumulative attribute space for age and crowd density estimation," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2467-2474.
[6] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Advances in Neural Information Processing Systems, 2010, pp. 1324-1332.
[7] Y. Zhou and J. Luo, "A practical method for counting arbitrary target objects in arbitrary scenes," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1-6.
[8] L. Fiaschi, R. Nair, U. Koethe, and F. Hamprecht, "Learning to count with regression forest and structured labels," in Proc. 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 2685-2688.
[9] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.
[10] H. Chang, D.-Y. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. I-I.
[11] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert, "Data-driven crowd analysis in videos," in Proc. IEEE International Conference on Computer Vision (ICCV), 2011, pp. 1235-1242.
[12] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1-7.