
FAST VISUAL OBJECT COUNTING VIA EXAMPLE-BASED DENSITY ESTIMATION

Yi Wang, Yuexian Zou*

ADSPLAB/ELIP, School of ECE, Peking University, Shenzhen, 518055, China


*Corresponding author: [email protected]
ABSTRACT

Density estimation based visual object counting (DE-VOC) methods estimate the count of an image by integrating over its predicted density map. They perform effectively but inefficiently. This paper proposes a fast DE-VOC method that maintains this effectiveness. Essentially, the feature space of image patches for VOC can be clustered into subspaces, and the examples of each subspace can be collected to learn its embedding. It is also assumed that the neighborhood embeddings of image patches and of their corresponding density maps, both generated from training images, are similar. With these principles, a closed-form DE-VOC algorithm is derived, in which the embedding and centroid of each neighborhood are precomputed from the training samples. Consequently, the density map of a given patch is estimated by simple classification and mapping. Experimental results show that our proposed method is comparable with mainstream ones on counting accuracy while running much faster in the testing phase.

Index Terms— Visual object counting, density estimation, example-based, locally linear embedding, fast implementation

Fig. 1. Images with objects (first row) and their corresponding generated density maps (second row, displayed in jet colormap). (a) cell image; (b) pedestrian image in dataset UCSD.

1. INTRODUCTION

The task of visual object counting (VOC) is to estimate the number of objects of interest in an image or video. Nowadays, VOC has been attracting tremendous attention owing to its pervasive application in numerous fields, such as wildlife census, crowd surveillance, etc.

The VOC problem is challenging as a result of common overlap between objects, severe occlusion, and complicated background environments. To tackle these problems, mainstream methods can be categorized into two types: global regression based and density estimation based. Global regression based methods (GR-VOC) [1-5] learn a direct mapping between global image features and the corresponding count. The performance of such methods depends heavily on the craftiness of the feature and regression model used.

Compared with GR-VOC, density estimation based methods (DE-VOC) are more promising with their ability to estimate the object count in any image region, which can offer object distribution information. DE-VOC counts the number of objects by predicting an image density whose integral over any image region yields the object count within that region [6]. This idea was first introduced by Lempitsky, who realized it by learning a pixel-wise linear regression between dense image features and the object density. Following his work, Zhou extended the original framework to do VOC for arbitrary objects and scenes [7]; Fiaschi applied regression forests and structural labels for an efficient implementation while preserving the effectiveness [8]. These DE-VOC methods all use regression to estimate the object density. To achieve satisfying regression performance, the dense features they use are sophisticated and time-consuming to compute.

In this paper, we propose a fast example-based object density estimation method for VOC (FE-VOC). Instead of learning a mapping between dense features and their counterpart density maps, we exploit the relationship between images and their corresponding density maps in two distinct feature spaces. Specifically, based on the observation of Fig. 1, the images of cells and pedestrians look similar to their counterpart density maps in geometry. Hence, we assume that the patches of object images share a similar local geometry with those of the counterpart density maps in the two feature spaces. Such geometry can be solved by locally linear embedding (LLE) [9, 10], and the density map of an image patch can then be estimated by preserving the geometry.

However, implementing LLE is time-consuming due to the exhaustive search for nearest neighbors. Encouraged by the work in [11], we stick to the same insight that the distinguishable object distribution patterns are finite. It suggests that the possible salient neighborhoods of the object image patches are also finite in feature space. Thus, we divide the feature spaces
of image patches and their counterpart density maps into subspaces, and compute the embedding of each subspace formed by image patches. Consequently, the density map of an input patch can be estimated by simple classification and mapping with the corresponding embedding matrix. The whole framework of this method is illustrated in Fig. 2.

Fig. 2. The framework of our proposed method. The area left of the dashed line is for testing and the other side is for training. (Training: the image patch set Y and the counterpart density map patch set Y_d are clustered, yielding cluster centroids and embedding matrices E_1, ..., E_K. Testing: each patch x_ij of the test image X is classified to a label t and its density map patch is regressed as x_dij = E_t f(x_ij), forming the estimated density map X_d.)

The rest of the paper is organized as follows: Section 2 introduces the generation of density maps for training; Section 3 presents the DE-VOC problem formulation in the LLE way and a fast example-based method for it; Section 4 shows extensive experiments and related analysis on cell and pedestrian data; Section 5 concludes our work.

2. PRELIMINARIES

The core of object density estimation based VOC is to compute the relationship between object images and their counterpart density maps. Usually the ground truth density maps are defined as a sum of 2D Gaussian kernels over the distribution of objects, as described in [6, 7]. In the training phase, a set of images I_1, I_2, ..., I_N is pre-allocated. In every I_i (1 ≤ i ≤ N), all objects of interest are assumed to be annotated with a set of 2D points P_i. Thus the ground truth density of each pixel p ∈ I_i is computed as a sum of 2D Gaussian kernels based on the annotated points:

D(p) = Σ_{P ∈ P_i} N(p; P, σ²)   (1)

where P is a user-annotated dot and σ is the smoothness parameter. σ is set to 6 for all experiments in Section 4. With the definition in Eqn. (1), the ground truth density map D_i of training image I_i is defined as

∀p ∈ I_i, D_i(p) = D(p)   (2)

Some instances of D_i are displayed in Fig. 1.

With the density map D_i, the object count C(I_i) can be computed by integrating over the density map:

C(I_i) = Σ_{p ∈ I_i} D_i(p)   (3)

In our method, training data are desired in patch form. Consequently, a set of image patches Y = {y_1, y_2, ..., y_M} is extracted from the training images I_i, i ∈ {1, 2, ..., N}, and the set of corresponding density map patches Y_d = {y_d1, y_d2, ..., y_dM} is extracted from the density maps D_i, i ∈ {1, 2, ..., N}. The feature set F = {f(y_1), f(y_2), ..., f(y_M)} is generated by applying a feature extractor f(·) to all patches in Y. All elements from F and Y_d can be seen as feature vectors in their respective feature spaces. By using Eqn. (3) with Y_d, the object count of every image patch can be acquired.

3. METHOD

3.1. Example-based object density estimation by LLE

Instead of learning a regression model between image features and their counterpart density maps, we learn to estimate the density of an input image patch through the generalization of the training patches. It is assumed that the two manifolds, formed by the features of image patches and by their counterpart density maps respectively, share a similar local geometry. In LLE, such local geometry of a feature vector can be characterized by how the feature vector can be linearly reconstructed from its neighbors [9, 10]. For a given test image patch x_ij with unknown density, we compute the reconstruction weights of its neighbors in the feature space of F by minimizing the reconstruction error. Then the density map is predicted by applying the reconstruction weights to the density maps of the neighboring patches from Y_d. As this method is implemented via the generalization of examples, it is named example-based VOC (E-VOC). Similar to the formulation in [9, 10], E-VOC can be modeled as:

w* = arg min_w || f(x_ij) − N_x w ||²   (4)

x_dij ≅ N_d w*   (5)

where N_x = [f(y_n1), f(y_n2), ..., f(y_nK)] is a training patch subset formed by the nearest neighbors of f(x_ij) from F, N_d = [y_dn1, y_dn2, ..., y_dnK], and x_dij is the density map of x_ij. In E-VOC, Eqn. (4) captures the local geometry of f(x_ij), and Eqn. (5) reconstructs the target density map x_dij by preserving such local geometry.

In its constrained least squares form, Eqn. (4) has an analytic solution and can be solved efficiently: w* = (N_xᵀN_x + λI)⁻¹N_xᵀ f(x_ij). The neighborhood constraint in Eqn. (5) is usually realized by the K-nearest-neighbors (KNN) algorithm.

3.2. Fast example-based VOC

Taking the analytic solution of w* into Eqn. (5), the solution of x_dij can be expanded as:

x_dij ≅ N_d (N_xᵀN_x + λI)⁻¹ N_xᵀ f(x_ij) = E f(x_ij)   (6)

where E = N_d (N_xᵀN_x + λI)⁻¹ N_xᵀ is the embedding matrix computed from the neighborhood of x_ij in the two feature spaces.


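To make the closed form in Eqns. (4)-(6) concrete, the following minimal NumPy sketch (toy random matrices, not the authors' released code) computes the ridge-regularized reconstruction weights of Eqn. (4), maps them through the neighbors' density patches as in Eqn. (5), and verifies that the two steps collapse into a single precomputable embedding matrix as in Eqn. (6):

```python
import numpy as np

def embedding_matrix(Nx, Nd, lam=0.001):
    """Closed-form embedding E = Nd (Nx^T Nx + lam*I)^-1 Nx^T, as in Eqn. (6)."""
    K = Nx.shape[1]
    return Nd @ np.linalg.solve(Nx.T @ Nx + lam * np.eye(K), Nx.T)

def evoc_estimate(fx, Nx, Nd, lam=0.001):
    """E-VOC: regularized reconstruction weights (Eqn. (4)),
    then apply them to the neighbors' density patches (Eqn. (5))."""
    K = Nx.shape[1]
    w = np.linalg.solve(Nx.T @ Nx + lam * np.eye(K), Nx.T @ fx)  # w* of Eqn. (4)
    return Nd @ w                                                # x_d of Eqn. (5)

rng = np.random.default_rng(0)
d, p, K = 16, 16, 8                 # feature dim, density-patch dim, #neighbors (toy)
Nx = rng.standard_normal((d, K))    # features of the K nearest training patches
Nd = rng.standard_normal((p, K))    # their vectorized density-map patches
fx = rng.standard_normal(d)         # feature of the test patch

xd_direct = evoc_estimate(fx, Nx, Nd)
xd_via_E = embedding_matrix(Nx, Nd) @ fx   # Eqn. (6): same result, E precomputable
assert np.allclose(xd_direct, xd_via_E)
```

Because E depends only on training patches, it can be computed once per neighborhood offline, which is exactly what makes the testing phase cheap.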
Inspired by the idea that the possible crowd behaviors are infinite but the space of distinguishable crowd motion patterns may not be all that large [11], we extend this insight further: in VOC, the salient patterns of image patches extracted from images of objects are finite. With this insight, it is believed that the primary neighborhoods of image patches in feature space are also finite; hence they can be depicted in the training phase instead of the testing phase, which saves a huge amount of time.

To describe the possible neighborhoods of a specific VOC problem, we cluster the feature space of image patches to approximate the results. Assume the number of possible salient neighborhoods of the training data Y and Y_d is K; then all feature vectors in F are clustered into K groups Y_t (1 ≤ t ≤ K), and the counterpart density map cluster Y_dt is produced by putting the density map patches together according to the index set of the corresponding elements in Y_t. In Fig. 3, some centroids of the clusters on the UCSD dataset are visualized. It is obvious that these centroids grasp the prominent crowd patterns, such as the outlines of heads, bodies, and other parts. With Y_t and Y_dt, their embedding matrix E_t can be formulated as

E_t = Y_dt (Y_tᵀY_t + λI)⁻¹ Y_tᵀ   (7)

where λ = 0.001 for our method in the experiments.

Fig. 3. Partial centroids of the clusters on the UCSD dataset (displayed in the foreground feature). The patch size is 8×8.

With all the embedding matrices precomputed, we need to figure out which neighborhood a test patch belongs to. In LLE, this correlation is described by the similarity of feature vectors in their feature space. Following this idea, we decide the neighborhood of a test patch x_ij based on the similarity between the centroid c_t of each neighborhood and the patch itself:

t* = arg min_t dist(f(x_ij), f(c_t))   (8)

where t* is the desired index of the cluster in which x_ij most likely lies. Here dist(·) is the Euclidean distance metric, so the classification of the input patch is finally formulated as:

t* = arg min_t || f(x_ij) − f(c_t) ||²   (9)

Thus, the density map of x_ij can be quickly calculated by E_t* and Eqn. (6).

The whole FE-VOC algorithm is summarized as Algorithm 1. It is also illustrated in Fig. 2.

Algorithm 1 (FE-VOC)
Input: test image X, training example sets Y and Y_d
Output: the density map X_d, the estimated count C(X)
1: for each input patch x_ij extracted by R_ij in test image X, where R_ij is a projection matrix that extracts the (i, j)th patch from X, do
2:   Find the index t* of the desired neighborhood based on f(x_ij) and Eqn. (9).
3:   Compute the density map patch x_dij = E_t* f(x_ij). Put x_dij into X_d based on R_ij.
4: end for
5: Get the estimated density map X_d of X, and the estimated count of X: C(X) = Σ_{p ∈ X} X_d(p).

4. EXPERIMENT

We evaluate our method on cell and pedestrian data from two public benchmark datasets: the bacterial cell dataset [6] and UCSD [3]. The details of these two datasets can be found in Table 1, and example frames are shown in Fig. 1. For the comparison of different methods, mean absolute error (MAE) is employed as the evaluation metric. Unless otherwise specified, the patch size used is 4×4, and the patch step is set to 2 for both training and testing in our method.

4.1. Performance on bacterial cell dataset

On this dataset, we adhere to the experimental protocols in [6], hence the performance of the comparative methods is directly comparable. Each time, 5 different random subsets containing n (n = 1, 2, 4, ..., 32) samples from the training set are generated for calculating the MAE and its standard deviation; the first 100 images are reserved for training and the rest are for validation. In our method, the clustering number K is estimated empirically on the union of the training and testing sets. For features, RR [1], KRR [12] and Density-MESA [6] employ dense SIFT coded by bag of words, while E-VOC and FE-VOC only use the raw data extracted from the blue channel of the images.

From Fig. 4, it is worth noting that the MAE of all methods descends gradually as the training number rises, and the MAE produced by FE-VOC stays lowest among all with no more than 8 training images. When the training size is larger than 8, Density-MESA [6] performs best, but only slightly better than FE-VOC.

4.2. Performance on pedestrian datasets

On the UCSD pedestrian dataset, the experimental protocols we use are the same as those in [6]. Specifically, the dataset is divided into 4 different training and testing sets: 1) 'maximal': training on frames 600:5:1400; 2) 'downscale':
training on frames 1205:5:1600; 3) 'upscale': training on frames 805:5:1100; 4) 'minimal': training on frames 640:80:1360. The frames that do not appear in the training procedure are tested. For our method, only 32 random images are taken as the training set for 'maximal' and 'downscale', and all training data are used for 'upscale' and 'minimal'. The K used is set by trials, and the features used in FE-VOC are simple foreground features of the images (as shown in Fig. 3), while the other methods employ fused features [6] and a feature selection technique. E-VOC is not evaluated on this dataset, as there are more than a thousand testing images and E-VOC is very slow in testing. The experimental results are reported in Table 2.

It is clear that Density-MESA [6] performs best in all settings, and our method gives fairly good predictions, second only to Density-MESA in the settings 'downscale' and 'minimal'.

Fig. 4. Mean absolute error (MAE) of different methods (RR [1], KRR [12], Density MESA [6], E-VOC, FE-VOC) on the cell dataset, plotted against the size of the training set.

Table 1. Statistics of the datasets (total number of frames, resolution, mean number of objects presented in a single frame, and color channel)

Dataset | Frames | Resolution | Objects per frame | Color
Cell    | 200    | 256×256    | 171±64            | RGB
UCSD    | 2000   | 158×238    | 29±9              | Gray

Table 2. Mean absolute errors (MAE) on UCSD dataset

Method           | max  | down | up   | min
Regression [2]   | 2.07 | 2.66 | 2.78 | N/A
Regression [4]   | 1.8  | 2.34 | 2.52 | 4.46
Density-MESA [6] | 1.7  | 1.28 | 1.59 | 2.02
Density-RF [8]   | 1.7  | 2.16 | 1.61 | 2.2
FE-VOC           | 1.98 | 1.82 | 2.74 | 2.10

4.3. The impact of clustering number

Here the number of possible primary neighborhoods K in FE-VOC is evaluated on the cell dataset. We still adhere to the protocols used in subsection 4.1 and set the training size n = 16. The clustering number K is set from 128 to 1024 with step 128. Fig. 5 indicates the trend of counting accuracy with different K. It is noted that the MAE computed by FE-VOC reaches its minimum at K = 512: the MAE decreases smoothly from K = 128 to 512, and increases gradually from K = 512 to 1024. Thus, for FE-VOC, a smaller MAE would be obtained with a more reasonable setting of K, and vice versa.

Fig. 5. Mean absolute error (MAE) of FE-VOC with different clustering numbers K (128 to 1024) on the cell dataset.

4.4. Computational efficiency evaluation

In this part, the computational cost of the original E-VOC, Density-MESA [6] and FE-VOC is compared on the bacterial cell dataset. All experimental settings of these methods are the same, and they all employ the same 16 images for training. The final time consumed is calculated as the mean processing time over 100 test images.

Table 3. Computational cost on cell dataset

Method      | Feature extraction | Density map reconstruction | Total time
Density [6] | 9.499 s            | 0.006 s                    | 9.505 s
E-VOC       | 0.084 s            | 201.242 s                  | 201.326 s
FE-VOC      | 0.083 s            | 1.569 s                    | 1.652 s

As shown in Table 3, FE-VOC is one to two orders of magnitude faster than the other two. Specifically, FE-VOC and E-VOC spend much less time than Density-MESA on feature extraction, as they use only simple features. Owing to the precomputed embedding matrices, FE-VOC runs much faster than E-VOC on density map reconstruction.

5. CONCLUSION

In this paper, we propose a fast example-based method for the VOC problem. It runs fast while making almost no compromise on counting accuracy. The method is developed under the intuition that the distinguishable object distribution patterns are finite, so all the embeddings of the counterpart neighborhoods can be computed in the training phase instead of the testing phase. Extensive experiments validate the effectiveness and efficiency of our method even with simple geometric features. In the future, we will try to estimate the quantity of salient object distribution patterns automatically.

6. ACKNOWLEDGEMENT

This work was partially supported by the Shenzhen Science & Technology Fundamental Research Program (No: JCYJ20150430162332418).


7. REFERENCES

[1] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, 2012, p. 3.
[2] D. Kong, D. Gray, and H. Tao, "A viewpoint invariant approach for crowd counting," in ICPR, 2006, pp. 1187-1190.
[3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in CVPR, 2008, pp. 1-7.
[4] D. Ryan, S. Denman, C. Fookes, and S. Sridharan, "Crowd counting using multiple local features," in DICTA, 2009, pp. 81-88.
[5] K. Chen, S. Gong, T. Xiang, and C. C. Loy, "Cumulative attribute space for age and crowd density estimation," in CVPR, 2013, pp. 2467-2474.
[6] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Advances in Neural Information Processing Systems, 2010, pp. 1324-1332.
[7] Y. Zhou and J. Luo, "A practical method for counting arbitrary target objects in arbitrary scenes," in ICME, 2013, pp. 1-6.
[8] L. Fiaschi, R. Nair, U. Koethe, and F. Hamprecht, "Learning to count with regression forest and structured labels," in ICPR, 2012, pp. 2685-2688.
[9] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.
[10] H. Chang, D.-Y. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," in CVPR, 2004, pp. I-I.
[11] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert, "Data-driven crowd analysis in videos," in ICCV, 2011, pp. 1235-1242.
[12] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in CVPR, 2007, pp. 1-7.


