Interactive Object Counting
1 Introduction
few seconds for each iteration of the relevance feedback that includes recomputation of
the features, discriminative (re)training, and visualization.
The approaches that cast the counting problem as one of object density estima-
tion [5,8,10,12] have so far been more accurate and also faster than approaches that de-
tect individual instances. For this reason, we base our system around object density esti-
mation. Counting through density estimation works by learning a mapping F : X → Y
between local image features X and object density Y, which then allows the derivation
of an estimated object density map for unseen images (or non-annotated parts of an im-
age in our case). If the learning is successful, integrating over regions in the estimated
density map provides an object count within such regions. The accuracy of the density
estimation depends crucially on the choice of local image features. Another important
aspect of density-based counting is that density per se is not informative for a human
user and cannot be used directly to verify the accuracy of the counting (and to provide
feedback).
To address this aspect, and to achieve the processing speed demanded by the interac-
tive scenario, we make the following three contributions. First, we propose a simplified
approach for supervised density learning based on ridge regression (i.e. simple linear
algebra operations). We show that even in the traditional batch mode (i.e. when learn-
ing from a set of annotated images and applying to a number of similar images) this
approach achieves similar counting accuracy to the constraint-generation based learn-
ing in [12] (using the same features), while being dramatically faster to train, which is
crucial for an interactive system. Furthermore, we propose two complementary ways
to visualize the estimated density, so that counting mistakes are easily identifiable by
the user. This allows our system to incorporate user feedback in an interactive regime
according to the user's own criterion of goodness. Finally, we propose an online code-
book learning [20] that, in real-time, re-estimates low-level feature encoding as the user
annotates progressively larger parts of an image. Such feature re-estimation (together
with fast supervised density estimation) allows our system to avoid severe under- or
over-fitting when either a small or a large part of the image has been annotated.
In summary, this work presents a system for counting multiple instances of an ob-
ject class in crowded scenes that (i) can be used interactively as it is fast to compute
and can re-use the results from previous iterations, (ii) uses the simplicity and ease of
computation of ridge regression, and (iii) presents the density estimation results in in-
tuitive ways such that it is easy to see where further annotations may be required. We
show the performance of the method on some of the typical applications for counting
and detection methods, such as counting cells in microscopy images and various objects
of interest in aerial images.
[Figure 1 panels: three Input/Output image pairs, one per interaction iteration; the per-iteration percentages read 11%, 4%, and 1%.]
Fig. 1: (Interactive Framework Overview) Given an input image (or set of images) con-
taining multiple instances of an object, our framework learns from regions with dot-
annotations placed by the user in order to compute a map of object density for the
non-annotated regions. The left column shows the annotated pixels provided by the
user; all regions in the annotation images outside the green mask (i.e. with or without
dot-annotations) are used as observations in the regression. The right column shows the
intuitive visualization tool of the density estimation that allows the user to inspect the
results and add further annotations where required in order to refine the output of the
counting framework.
applications. Interactive image segmentation methods have also been widely adopted in
medical imaging [9, 19, 20], where not only is it highly expensive to prepare datasets,
but users also need the possibility of refining results in real-time. More recently,
interactive methods have been applied to object detection [25] and learning of edge
models for image partitioning [21], which is also highly relevant for biological image
analysis. We expect this paper to bring the general benefits of interactive learning to
counting problems.
Counting through object density estimation: Counting objects in images without ex-
plicit object detection is a simplified alternative for cases where only the number of
objects is required; it has proved especially useful in tasks too challenging for object
detectors, such as crowded scenes. Initial efforts in this class of methods attempted to
learn global counts through regressors from image-level [6, 11] or region-level [5, 13, 17]
features to the total number of objects in the image or region. However, these approaches
ignored the local arrangement of the objects of interest. To address this, [12] proposed
learning a pixel-level object density through the minimization of the MESA distance,
a cost function specially tailored for the counting problem. Following [12], [8]
proposed a simplified version of the object density estimator based on a random forest,
which represented an improvement in training time. Generating a pixel-level object den-
sity estimation not only has the advantage of estimating object counts in any region of
the image, but can also be used to boost the confidence of object detectors as shown
in [15]. Our approach follows the ideas in [8, 12] and further simplifies the learning to
a few algebraic operations using ridge regression, thus enabling an interactive appli-
cation. Note that the learning approach proposed here should not be confused with the
ridge regression used as a baseline in [12], which belongs to the category of image-level
regressors.
Given an image I, the counting proceeds within a feedback loop. At each iteration,
the user marks a certain portion of the image using a freehand selection tool (we refer
to the pixels included in such regions as annotated pixels). Then the user dots the
objects by clicking once on each object of interest within this annotation region. In the
first iteration, the user also marks the diameter of a typical object in the image by
placing a line segment over one such object.
At each iteration, given the set of dotted pixels P placed by the user on top of ob-
jects of interest in I, our system aims to (1) build a codebook X of low-level features,
(2) learn a mapping F : X → Y from the entries in the codebook X to an object
density Y, (3) use the learned mapping F to estimate the object density in the entire
image I, and (4) present the estimated object density map to the user through an intu-
itive visualization. The estimated object density map produced is such that integrating
over a region of interest gives the estimated number of objects in it (e.g. integrating
over the entire map gives an estimate of the total number of objects of interest in I). By
using the density visualization, the user can easily spot significant errors in the object
density estimate, and can proceed to provide further annotations to refine the results in
a next iteration of the process. An example with three iterations is shown in Figure 1.
In practice, only a small portion, or a few small portions, of a potentially large image has
to be inspected in order to validate the correctness of the learned model or to identify
parts with gross errors.
The first design requirement for an interactive counting system, such as the above, is
that the codebook needs to be fast to compute, and to adapt its size to the amount
of annotation in order to prevent under- or over-fitting. To address these problems, we
propose in Section 3 a simple progressive codebook learning procedure, which builds a
kd-tree that can grow from its current state as the user provides further annotations. The
second requirement is a fast computation of the mapping F. Our proposal is a pixel-
level ridge-regression presented in Section 4, which has a closed-form solution. Thus,
the mapping F can be computed extremely fast through a few algebraic operations on
sparse matrices. Finally, the system needs to present its current estimates to the user in
such a way that identifying errors can be done through a quick visual inspection, which
is not possible with the raw object density map and/or the global count. Therefore, we
propose in Section 5 two complementary methods to visualize object density maps by
generating local summaries of them.
We show the performance of the interactive counting system through a series of
examples in the experimental section, and videos of the interactive process are provided
at [1].
1. In the i-th iteration, find the partitions with more than N descriptors z_p assigned to
them (only annotated pixels are taken into account).
2. For each of those partitions, find the feature dimension t of maximum variance
(among the d dimensions), as well as the median of the values of all annotated
pixels corresponding to this dimension.
3. Split each such partition into two according to whether the pixel value at dimension
t is greater or smaller than the median.
4. Repeat until every partition has no more than N annotated pixels assigned to it.
The proposed algorithm thus constructs the kd-tree (w.r.t. the annotated pixels).
Note, however, that we also maintain the partition assignments of the unannotated pixels
(and the resulting partitions can be unbalanced w.r.t. the unannotated pixels). We finally
note that there is no need to store the resulting kd-tree explicitly because the algorithm
maintains the assignments of pixels to the leaves of the kd-tree (i.e. partitions). The
partitioning algorithm is resumed whenever new annotations are added by the user. At
this point, the codebook can grow from its current state by continuing the splitting (and
is not re-learned from scratch).
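To make the procedure concrete, here is a minimal sketch of the progressive partitioning in Python (the function name, array layout, and the degenerate-split guard are our illustrative choices, not the original implementation):

import numpy as np

def grow_partitions(Z, labels, annotated, N):
    # Z: (P, d) descriptors z_p; labels: (P,) current partition of each pixel;
    # annotated: (P,) boolean mask of annotated pixels; N: maximum number of
    # annotated descriptors allowed per partition. Splitting decisions use
    # annotated pixels only, but every pixel keeps an up-to-date partition
    # assignment, so no explicit tree needs to be stored.
    labels = labels.copy()
    changed = True
    while changed:
        changed = False
        for k in np.unique(labels):
            members = labels == k
            ann = members & annotated
            if ann.sum() <= N:
                continue
            t = np.argmax(Z[ann].var(axis=0))   # dimension of maximum variance
            thr = np.median(Z[ann, t])          # median over annotated pixels
            half = members & (Z[:, t] > thr)    # split the whole partition
            if not half.any() or half.sum() == members.sum():
                continue                        # degenerate split; skip
            labels[half] = labels.max() + 1
            changed = True
    return labels

Re-running grow_partitions with the previous labels and an enlarged annotated mask resumes the splitting from the current state, exactly as described above.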
Once the codebook has been learned, each pixel p in the image I is represented by a
sparse k-dimensional vector x_p, in which all entries are zero except the one corresponding
to the partition to which the image descriptor z_p was assigned (one-hot encoding).
The representation x_p is then used as the pixel feature within the learning framework
proposed in Section 4. Once again, we emphasize that the vector x_p changes (and
becomes higher-dimensional) between the learning rounds as more user annotations
become available.
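For illustration, the one-hot features can be held as a sparse matrix with one row per pixel; a minimal sketch assuming SciPy (function name illustrative):

import numpy as np
from scipy.sparse import csr_matrix

def one_hot(labels, k):
    # Row p is the sparse vector x_p: a single 1 at the partition (kd-tree
    # leaf) to which pixel p's descriptor z_p was assigned.
    P = labels.shape[0]
    return csr_matrix((np.ones(P), (np.arange(P), labels.astype(int))),
                      shape=(P, k))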
We now introduce our simple alternative to [12] for learning an object density estimator.
Our method here is similar to (and, arguably, simpler than) [8]. Most importantly, com-
pared to [12], our approach reduces the training time from several dozen seconds to
a few seconds (for heavily annotated images), thus enabling the interaction.
Similarly to [8, 12], we define the ground truth density from the set of user dot-
annotations P as the sum of delta functions centred on each of the annotation dots:
F^0(p) = \sum_{p' \in P} \delta(p - p')   (1)
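As a small sketch, the ground truth density of (1) can be rasterized by placing a unit of mass at each dot (function name illustrative):

import numpy as np

def dots_to_density(dots, shape):
    # Eq. (1): one delta function per user dot, so integrating the result
    # over any region counts the dots that fall inside it.
    F0 = np.zeros(shape)
    for r, c in dots:          # dots: iterable of (row, col) clicks
        F0[r, c] += 1.0
    return F0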
The weight vector w is fitted by ridge regression over the per-pixel features:

w = \arg\min_w \|Xw - Y\|_2^2 + \lambda \|w\|_2^2   (2)

Here, X is the matrix of predictors (feature vectors) with each row containing x_p,
Y is the vector of corresponding density values from the ground truth density F^0, and λ
controls the balance between the prediction error and the regularization of w. Although
computationally simple and efficient, the fitting procedure (2) tries to match the ground
truth density values exactly at every pixel, which is unnecessary and leads to severe
overfitting. Instead, as was argued in [12], the estimated densities Xw should match
the ground truth densities Y when integrated over extended image regions (i.e. we do
not care about very local deviations between Xw and Y as long as these deviations are
unbiased and sum approximately to zero over extended regions).
Based on this motivation, [12] replaces the L2 distance between Xw and Y in (2)
with the so-called MESA distance, which looks at integrals over all possible box regions.
While this change of distance dramatically improves the generalization, the learning
with the MESA distance is costly, as it employs constraint generation and has to solve a
large number of quadratic programs. Here, we propose another, much simpler, al-
ternative to the L2 distance in (2). Namely, we minimize a smoothed version of the
objective, obtained by convolving the difference between the ground truth density and
the estimated density with a Gaussian kernel G:

w = \arg\min_w \|G * (Xw) - G * Y\|_2^2 + \lambda \|w\|_2^2
Similarly, the smoothed (but still sparse) matrix of predictors G * X can be obtained
by convolving each dimension of the feature vectors independently (i.e. each column of
X), that is, by spatially blurring each of the feature channels.
Using the vertically concatenated smoothed maps X_s = [G * X] and Y_s = [G * Y]
respectively, w can be expressed using the standard ridge regression solution formula:

w = (X_s^T X_s + \lambda I)^{-1} X_s^T Y_s

The density estimate at any pixel p is then

F(p) = w^T x_p   (7)
It can be seen that learning the mapping vector w only involves simple matrix op-
erations (mostly on sparse matrices) and Gaussian smoothing. Thus, w can be learned
on-the-fly and with a low memory footprint.
We have found that the generalization performance improves slightly if non-negativity
of the estimated w is enforced (which, for non-negative x_p, results in physically
meaningful non-negative object densities at test time). Since it is computationally
more expensive to include a non-negativity constraint within ridge regression in a prin-
cipled manner (compared to the closed-form solution that we have now), we use the
simple trick of iteratively re-running the ridge regression, clipping the negative compo-
nents of w to zero after each iteration.
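The whole learning step fits in a few lines. The sketch below assumes dense feature arrays for simplicity (sparse matrices would be used in practice); the parameter values and the restricted re-solve implementing the clipping trick are our illustrative choices:

import numpy as np
from scipy.ndimage import gaussian_filter

def learn_density_weights(X, Y0, shape, sigma=3.0, lam=1e-2, nn_iters=3):
    # X: (P, k) pixel features, Y0: (P,) ground truth density, shape: (H, W).
    P, k = X.shape
    # G * X: spatially blur each feature channel (each column of X)
    Xs = np.stack([gaussian_filter(X[:, j].reshape(shape), sigma).ravel()
                   for j in range(k)], axis=1)
    # G * Y: blur the delta-function ground truth in the same way
    Ys = gaussian_filter(Y0.reshape(shape), sigma).ravel()
    A = Xs.T @ Xs + lam * np.eye(k)   # only k x k, cheap for any image size
    b = Xs.T @ Ys
    w = np.linalg.solve(A, b)
    # non-negativity trick: zero out negative weights, re-solve over the rest
    for _ in range(nn_iters):
        keep = w > 0
        w = np.zeros(k)
        w[keep] = np.linalg.solve(A[np.ix_(keep, keep)], b[keep])
        w = np.clip(w, 0.0, None)
    return w

The density estimate for every pixel then follows (7) as a codebook lookup: for one-hot features, F(p) is simply the entry of w indexed by the partition of p.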
It can be seen from Table 1 and Table 2 that using the simple ridge regression never
results in a significant drop in performance compared to the more complex approach
in [12]; thus, we use it as part of our interactive system (described next) without
any compromise on the counting accuracy. Crucially for sustaining interactivity, our
simplification yields a dramatic speedup of the learning procedure.
To avoid confusion, we note again that ridge regression was proposed as a baseline
for counting in [12], as shown in Table 2, and was tested with the same set of features we
have used, but resulted in much poorer performance. This is because the regression in
that baseline was learned at the image level, i.e. it treated each of the training images as
a single training example, which resulted in severe overfitting due to the limited number
of examples. This differs from our learning of w to match the pixel-wise object densities
(the image-level baseline can be seen as an extreme case of an infinitely wide Gaussian
kernel G). Finally, we note a connection between the proposed approach and [8], which
also performs smoothing of the output density function (at the testing stage) using
structured-output random forests.
The visualization of the object density estimate plays a key role in our interactive count-
ing system, as it assists the user in identifying the parts of the image where the system
has estimated the counts with large errors, and thus where to add further annotations.
While the predicted densities are sufficient to estimate the counts in any region, the
accuracy of these densities cannot be verified by the user without further post-processing,
due to the mismatch between the continuous nature of the densities and the discrete
nature of the objects. To address this problem, we propose two density visualization
methods, which convert the estimated density into representations that are intuitive for
the user. The first method is based on non-overlapping extremal regions, is algorithmi-
cally similar to [3], and aims to localize the objects from the density estimate. The second
method is based on recursive image partitioning, and aims to split the image into a set
of small regions where the number of objects can be easily eyeballed and compared to
the density-based estimates.
\max_{y \in Y} \; \sum_{i=1}^{N} y_i (V_i + \alpha)   (9)
where α is a constant that prevents the selection of the trivial solution (one biggest region
containing the whole image) and biases the solution towards a set of small regions. The
objective (9) is optimized efficiently using dynamic programming, owing to the tree
structure of the regions, as in [3]. We now discuss the details of the two approaches and
the difference between them.
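A minimal sketch of this dynamic program (our own illustrative formulation): each region is either selected, or replaced by the best selection among its children, and nestedness guarantees that the outcome is non-overlapping.

def select_regions(root, children, score):
    # children: dict region -> list of nested child regions (the tree/forest)
    # score:    dict region -> V_i + alpha, the per-region term in (9)
    def best(r):
        sel, val = [], 0.0
        for c in children.get(r, []):
            s, v = best(c)
            sel += s
            val += v
        # keep r itself iff its own score beats the best selection among
        # its children
        return ([r], score[r]) if score[r] >= val else (sel, val)
    selected, _ = best(root)
    return selected

For a partition tree, whose children exactly tile their parent, this recursion automatically returns a cover of the image; for the extremal-region variant, candidate filtering (described below) restricts which regions may be selected.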
Visualization using non-overlapping extremal regions. Extremal regions are the con-
nected components of the binary images obtained by thresholding a grayscale image I
at an arbitrary threshold τ. A key property of extremal regions is their nestedness,
as described above. Therefore, the set of extremal regions of an image can be arranged
into a tree (or a forest) according to this nestedness.
Following [3], we use extremal regions as candidates for object detection. In this
case, extremal regions are extracted from the estimated object density map, and the
ones selected by the optimization (9) should delineate entire objects or entire clusters
of objects (Figure 2-c).
In practice, we collect these candidate regions using the method of Maximally Sta-
ble Extremal Regions (MSER) [14]. This method keeps only those extremal regions that
are stable, in the sense that they do not change abruptly between consecutive thresholds
of the image (i.e. regions with strong edges). During the inference, we exclude from
consideration the regions whose density integral is smaller than 0.5, as we
have found that allowing any extremal region to be selected can result in very cluttered
visualizations. Instead, this visualization aims to show only regions containing entire
objects.
Visualization using hierarchical image partitioning. In this approach, we build a
hierarchical image partition driven by the density. To obtain the partition, we iteratively
apply spectral graph clustering, dividing image regions into two (akin to normalized
cuts [18]). Unlike the extremal-region visualization, and unlike the traditional use of
normalized cuts, we encourage the boundaries of this partition to pass through regions of
low density, thus creating a tiling of regions that enclose entire objects (Figure 2-d). To
achieve this, we build a 4-connected weighted graph G = (V, E) with the adjacency
matrix W defining the weights of the edges based on the estimated density map F(p),
as w_{i,j} = 0.5 (F(i) + F(j)) for (i, j) ∈ E.
The normalized cuts then tend to pass through the parts of the image where the den-
sity is near zero, and, as usual, have a bias towards equal-size partitions (which is
desirable for our purpose).
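A sketch of this graph construction, assuming NumPy/SciPy (names illustrative):

import numpy as np
from scipy.sparse import coo_matrix

def density_graph(F):
    # 4-connected grid graph over the pixels of the (H, W) density map F,
    # with symmetric edge weights w_ij = 0.5 * (F(i) + F(j)).
    H, W = F.shape
    idx = np.arange(H * W).reshape(H, W)
    f = F.ravel()
    rows, cols, vals = [], [], []
    for a, b in ((idx[:, :-1], idx[:, 1:]),    # horizontal neighbours
                 (idx[:-1, :], idx[1:, :])):   # vertical neighbours
        i, j = a.ravel(), b.ravel()
        w = 0.5 * (f[i] + f[j])
        rows += [i, j]; cols += [j, i]; vals += [w, w]
    rows, cols, vals = map(np.concatenate, (rows, cols, vals))
    return coo_matrix((vals, (rows, cols)), shape=(H * W, H * W)).tocsr()

Since edges incident to near-zero-density pixels receive near-zero weight, cutting through them is almost free, which is exactly the behaviour this visualization relies on.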
Once the tree-structured hierarchy of regions is built, the inference selects the set of
non-overlapping regions by maximizing the sum of the integrality scores of the regions,
as explained above. Additionally, we enforce at inference time that every pixel in the es-
timated density map belongs to one of the selected regions (i.e. that the selected subset
of regions represents a cover), so that the entire density distribution is explained.
Accordingly, all regions from the hierarchical partitioning of the image are considered,
including those with near-zero density integrals.
Compared to the visualization using extremal regions, the visualization based on
recursive partitioning does not tend to outline object boundaries, but it represents the
underlying ground truth density with greater fidelity, since the whole image ends up
being covered by the selected regions.
We show the qualitative performance of the interactive counting system. The aim is to
give a sense of the amount of annotation (and effort) required to obtain an object count
that closely approximates the ground truth (i.e. with an absolute counting error of
approximately 10% or less). This section is complemented by [1], where a video of the
system in use is shown, along with additional example results, including counting
elongated objects using a scribble variant of the annotation. Note that, although results
are shown here using the same images as input and output, it is possible to propagate
the density estimation to other images in a batch.
Figure 3 shows example results of the interactive counting system, indicating the
number of annotations added and the estimated object count for that amount of annota-
tion. Some of the examples (Figures 3-a,c) have been taken from the benchmark dataset
of [24], and we use their results as a reference in Figure 3. We do not attempt a direct
comparison of performance with [24] because, in the cases where a single image is
given, our interactive method requires annotations on this image in order to produce
results, which precludes a fair comparison of performance.
Moreover, due to the nature of the low-level features, our system crops the borders of
Fig. 2: (Density Visualization) In order to assess the density estimate (b) of the original
image (a), we propose two visualization methods. The first method (c) is based on
non-overlapping extremal regions (ER) and aims to localize objects in the estimated
density map (more intuitive but biased towards undercounting). The second method (d)
is based on hierarchical image partitioning with spectral clustering (SC) and aims to
explain the distribution of the density estimate across the entire image (higher fidelity
but less intuitive visualization of the density). See text for details. In (c) and (d), the
numbers indicate the number of objects contained within each region. Green regions contain a single
object, but the number has been omitted for clarity. Non-outlined regions in (d) have
zero counts.
the image by half the size of the texture patches (see implementation details), resulting
in a possible difference in the ground truth count w.r.t. the original image. The addi-
tional examples (Figure 3-b and examples in [1]) correspond to aerial images extracted
from Google Maps.
The same set of parameters has been used for all the examples shown, with the
most relevant ones indicated in the implementation details. Due to space limitations, we
show a single visualization method for each of the examples in Figure 3; the two methods
are complementary, and, depending on the image, one visualization can be more
convenient than the other.
Low-level features. We compute the initial (low-level) pixel descriptor z_p based on two
types of local features on the Lab color space. First, we use the contrast-normalized
lightness values (L channel) of the pixels in a patch of size n × n centred at p [22].
The patches are rotated such that their dominant gradients are aligned, in order to be
invariant to the rotation of the objects in the image. Secondly, we collect the raw L, a and
b values of the centre pixel. The descriptor z_p ∈ R^d is the concatenation of the two
local features. Therefore, the dimensionality d of the pixel descriptor for a color image
is n^2 + 3. In the case of grayscale images, we do the feature computation on the given
intensity channel, which results in d = n^2 + 1.
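A simplified sketch of the descriptor for the color case (the rotation of the patch to its dominant gradient is omitted, and the exact form of the contrast normalization is our assumption):

import numpy as np

def pixel_descriptor(L, a, b, p, n=9):
    # d = n^2 + 3: a contrast-normalized n x n patch of the L channel centred
    # at p, concatenated with the raw (L, a, b) values of the centre pixel.
    r = n // 2
    y, x = p                          # assumes p is at least r from the border
    patch = L[y - r:y + r + 1, x - r:x + r + 1].astype(float)
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)
    return np.concatenate([patch.ravel(), [L[y, x], a[y, x], b[y, x]]])

The border assumption corresponds to the cropping by half a patch size mentioned earlier.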
Collecting extremal regions. Extremal regions are extracted from the estimated density
map using the MSER implementation from VLFeat [23]. In order to collect enough
(a) Synthetic cells. Number of dot-annotations = 16. Estimated count/GT = 476/484. Ref-
erence results from [24] = 482/500.
Fig. 3: (Example results) For the experiments shown, a large image (first column) is
annotated interactively (second column) until qualitatively reasonable results are pro-
duced (fourth column). For all examples, the green masks on the annotation images
(second column) indicate regions that have not been annotated by the user. All regions
outside the green mask, with or without dot-annotations, are used as observations in
the regression (Section 4). Annotated regions without dots can be seen as zero annota-
tions. As expected, the number of annotations required increases with the difficulty of
the problem. In cases where the background is complex, such as in aerial images (b),
residual density tends to appear all over the image, which can be seen in the visualiza-
tion. However, this can be easily fixed interactively using zero annotations, which are
very fast and simple to add. Many more examples, as well as videos of the interactive
process, are provided at [1].
candidate regions for the inference to select from, we set a low stability threshold in the
MSER algorithm.
Building a binary tree with spectral clustering. Computing traditional spectral
clustering as in [18] can be too slow for our interactive application, the reason being the
expensive computation of eigenvectors. Therefore, in practice we use the method of
Dhillon et al. [7], which solves the equivalent problem of weighted kernel k-means, thus
greatly reducing the computation time. We use the implementation from the authors
of [7].
Setting object-size dependent parameters. Some of the parameters used in the im-
plementation of the interactive counting system are best set relative to the size of
the object of interest. These are the size n × n of the patches for the low-level features
and the standard deviation σ of the Gaussian kernel used to smooth the dot-annotations
and feature channels for the ridge regression. As discussed in Section 2, we chose to
request an additional input from the user: the approximate diameter of the object
of interest, given by drawing a line segment over a single object in the image. The
image is then rescaled according to this object scale. For the experiments of the in-
teractive system shown in the experimental section, we use an object size of 10 pixels,
patches of 9 × 9 pixels, and σ = 3 pixels.
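As an illustrative sketch, the rescaling step amounts to the following (TARGET_DIAMETER and the use of scikit-image are our assumptions, not details given in the text):

from skimage.transform import rescale

TARGET_DIAMETER = 10.0   # object size, in pixels, the parameters are tuned for

def normalize_scale(image, user_diameter):
    # user_diameter: length in pixels of the line segment drawn by the user
    return rescale(image, TARGET_DIAMETER / user_diameter,
                   anti_aliasing=True,
                   channel_axis=-1 if image.ndim == 3 else None)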
Region visualization. The boundaries of the regions chosen to visualize the density
are superimposed on the original images. Alongside the boundaries, we show the
density integrals over the highlighted regions, rounded to the nearest integer (recall
that regions are chosen so that such integrals tend to be near-integer). We also color-code
the boundaries according to the counts (e.g. green for regions containing one object,
blue for two objects, etc.).
References
1. http://www.robots.ox.ac.uk/~vgg/research/counting/
2. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Learning to detect cells using non-
overlapping extremal regions. In: Proc. MICCAI (2012)
3. Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Learning to detect partially overlap-
ping instances. In: CVPR (2013)
4. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation
of objects in N-D images. In: Proc. ICCV. vol. 2, pp. 105–112 (2001)
5. Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting
people without people models or tracking. In: CVPR (2008)
6. Cho, S.Y., Chow, T.W.S., Leung, C.T.: A neural-based crowd estimation by hybrid global
learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics 29(4), 535–541 (Aug 1999)
7. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel
approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(11),
1944–1957 (2007)
8. Fiaschi, L., Nair, R., Koethe, U., Hamprecht, F.: Learning to count with regression forest and
structured labels. In: Proc. ICPR (2012)
9. Grady, L.: Random walks for image segmentation. TPAMI 28(11), 1768–1783 (2006)
10. Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely
dense crowd images. In: CVPR (2013)
11. Kong, D., Gray, D., Tao, H.: A viewpoint invariant approach for crowd counting. In: Proc.
ICPR (2006)
12. Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: NIPS (2010)
13. Ma, W., Huang, L., Liu, C.: Crowd density analysis using co-occurrence texture features.
In: Proc. 5th International Conference on Computer Sciences and Convergence Information
Technology (ICCIT). pp. 170–175 (Nov 2010)
14. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally
stable extremal regions. Image and Vision Computing (2004)
15. Rodriguez, M., Laptev, I., Sivic, J., Audibert, J.Y.: Density-aware person detection and
tracking in crowds. In: ICCV (2011)
16. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using
iterated graph cuts. Proc. ACM SIGGRAPH 23(3), 309–314 (2004)
17. Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Crowd counting using multiple local fea-
tures. In: Proc. DICTA (2009)
18. Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI (2000)
19. Singaraju, D., Grady, L., Vidal, R.: Interactive image segmentation of quadratic energies on
directed graphs. In: Proc. of CVPR 2008. IEEE Computer Society, IEEE (June 2008)
20. Sommer, C., Straehle, C.N., Koethe, U., Hamprecht, F.A.: ilastik: Interactive learning and
segmentation toolkit. In: ISBI. pp. 230–233 (2011)
21. Straehle, C., Koethe, U., Hamprecht, F.A.: Weakly supervised learning of image partitioning
using decision trees with structured split criteria. In: ICCV (2013)
22. Varma, M., Zisserman, A.: Texture classification: Are filter banks necessary? In: CVPR
(2003)
23. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision
algorithms. In: ACM Multimedia (2010)
24. Verdie, Y., Lafarge, F.: Detecting parametric objects in large scenes by Monte Carlo
sampling. International Journal of Computer Vision pp. 1–19 (2013)
25. Yao, A., Gall, J., Leistner, C., Van Gool, L.: Interactive object detection. In: CVPR.
pp. 3242–3249 (2012)