
IET Image Processing

Research Article

Scene invariant crowd counting using multi-scales head detection in video surveillance

ISSN 1751-9659
Received on 17th November 2017
Revised 24th April 2018
Accepted on 28th August 2018
E-First on 20th September 2018
doi: 10.1049/iet-ipr.2018.5368
www.ietdl.org

Tianjun Ma1,2, Qingge Ji1,2, Ning Li1,2


1School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, People's Republic of China
2Guangdong Province Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, People's Republic of China
E-mail: [email protected]

Abstract: With the soaring increase in the use of video surveillance in daily life, the estimation of crowd density has become a hot field. Crowd counting is closely related to traffic planning, pedestrian analysis and emergency warning. Here, a novel crowd counting method based on multi-scales head detection is proposed. The authors' approach first uses a gradients difference to extract the foreground of the images and applies overlapped patches of different scales to split the input images. Then, the patches are selected and classified into different groups corresponding to their gradient distributions, and features are extracted for training. Finally, from the prediction results, density maps of different scales are computed and summed with the perspective map. In particular, the authors' method overcomes the traditional detection methods' deficiency of low accuracy when facing perspective transformation. Experiments demonstrate that the proposed method not only achieves high accuracy in counting but also shows outstanding robustness on the tested data sets.

1 Introduction

Automated crowd counting has become a hot field of computer vision research in recent years. As the urban population has risen sharply, intelligent video surveillance is required in many situations. Automated crowd counting not only frees up staff but also makes it possible to record real-time information that people could hardly capture by hand. In return, this information plays a significant role in transportation planning, in analysing crowd congestion patterns and in pedestrian behaviour learning, which can give an early warning of mass panic or emergency situations. In addition, merchants also have a strong interest in crowd counting, as the technique provides surveillance of consumer flow so that goods can be arranged sensibly. In a word, crowd counting has wide application scenarios.

At the same time, owing to illumination variation, camera perspective, occlusion and the practical requirements of high accuracy and real-time operation, crowd counting remains a challenging research topic. In recent research, crowd counting algorithms are generally categorised into two groups [1]: holistic and local. Holistic methods use the whole image to obtain global features and try to find a direct mapping between those features and the size of the crowd in the image sequence. One approach extracts edge features and trains a Gaussian process regressor [2]; another utilises the two-dimensional discrete wavelet transform as a basis for extracting textural features [3]. Fiaschi et al. [4] propose a method that estimates a density map over the image region. Although various feature extraction and modelling methods are used, they all obtain the total crowd number from the map directly.

Like the holistic methods, local methods also extract features from edges [5], size [6], key points [7], shape [8] and texture [9], but the features are analysed independently after grouping, and the total number is the sum of every separate part. A sliding window is often used in local methods. Chen et al. [10] adopt a strategy that extracts concatenated local features from images and trains a regressor. Ryan et al. [11] propose an approach that uses local features to count the number of people in each foreground blob segment. Instead of a linear mapping, Pham et al. [12] formulate the problem of estimating density in a structured learning framework applied to random decision forests. In a comparison of holistic and local methods [1], local methods perform better in general. We think this result is reasonable: in a local method, each patch is regarded as a single individual, and compared with the whole image a single patch is less sensitive to noise. In addition, the low dimension of a patch makes it easier to train a proper model that can classify or regress the features at a state-of-the-art level.

For counting crowds in videos, adjacent frames are strongly related. To take advantage of the relationship between consecutive frames, researchers have proposed clustering the trajectories of tracked visual features. For instance, Hashemzadeh et al. [13] propose a method combining velocity and location-specific spatial clues in trajectories to count moving crowds, and Antonini and Thiran [14] use the ICA (independent component analysis) transformed domain and clustering techniques to count crowds. Mahdi et al. [15] propose a method using motion statistics of feature points to estimate the number of moving people in a crowd. However, such tracking-based methods do not work for estimating crowds from individual still images.

More recently, with the upsurge of deep learning, many convolutional neural network (CNN)-based algorithms [16-19] have been proposed for crowd counting. Shang et al. [20] propose an end-to-end deep CNN regression model for counting people in images of extremely dense crowds. In Zhang et al.'s work [21], a CNN is trained alternately with two related learning objectives, crowd density and crowd count; moreover, to handle an unseen target crowd scene, they present a data-driven method to fine-tune the trained CNN model for the target scene. Zhang et al. [22] propose a multi-column architecture to train convolution kernels at different scales.

Although CNNs can indeed achieve higher accuracy on the crowd counting problem, these methods require a large number of training samples as well as a very long time to train and predict. The main data sets we use, Mall [23] and UCSD [24], have a limited number of frames (2000), so a traditional method is more suitable for such small data sets. In terms of practicality, CNNs require large RAM (random-access memory) for training and for storing the high-dimensional data, making such methods unsuitable for embedding on cameras in large-scale systems.

In this study, we come up with a novel method for real-time, embedded, scene-invariant crowd counting, called multi-scales head detection (MSHD) crowd counting. Aiming at a crowd counting method with low computational complexity and high accuracy, the multi-scale function is designed to address the issue of perspective occlusion, which is a major obstacle to improving detection accuracy (Fig. 1 illustrates how ideal multi-scale detection works). Experiments prove that our method obtains state-of-the-art results on the Mall and UCSD data sets.

Fig. 1 Final predicted state in our proposed method by the multi-scale approach

2 Methodology

As previously stated, local methods outperform holistic ones because it is easier to train an effective model that regresses the features towards the ground-truth. Our method is based on features extracted by an overlapped sliding window of different sizes, and the whole algorithm framework is shown in Fig. 2. The workflow of our algorithm is as follows. First, the differenced gradients are computed. Second, we use various scales to divide the image into patches; the patches are then classified, and HoG (histogram of oriented gradients) and size features are extracted. Finally, after training and prediction, we generate the density maps for the different scales and accumulate the crowd count with the perspective map.

Fig. 2 Framework of our proposed method

In this paper, we divide the method description into three parts: foreground segmentation (Section 2.1); multi-scale feature extraction (Section 2.2); calculation and addition of multi-scale density maps (Section 2.3).

2.1 Foreground segmentation

Effective extraction of the foreground can reduce the interference of the noisy background with the crowd information and improve the cross-scene prediction ability of the model. The background model should not only be robust to illumination, noise and the shadows caused by occlusion, but should also meet the requirements of real-time processing of camera video data as well as of dynamic scene changes.

In this paper, we use a difference-of-mean-gradients method to extract the foreground. As demonstrated in Fig. 3, we maintain a queue to obtain a running mean gradient and take the difference from the gradients of the current frame. Formally, given an input image frame sequence S, S_t denotes the frame at time t. We first clip the frame to the region of interest. Then, the LoG (Laplacian of Gaussian) [25] operator is introduced to detect the edges of the image, as it is sufficiently accurate and computationally efficient.

Fig. 3 The final differenced gradients D are computed by differencing the current image's gradients against the mean gradients of an l-length sequence, filtered with the variance of each pixel as the threshold

For the frame S_t, the corresponding greyscale gradients are defined as E^t = (E_x^t, E_y^t), where E_x^t is the horizontal gradient magnitude and E_y^t is the vertical gradient magnitude. The gradient direction is also computed, for the HoG features used later in the process:

\| E^t \| = \sqrt{ (E_x^t)^2 + (E_y^t)^2 }   (1)

O = \arctan\left( \frac{E_y^t}{E_x^t} \right)   (2)

E_{mean}^t = \frac{1}{l} \sum_{i \in S_l^t} E^i   (3)

We use l to represent the queue length; setting l = 10 gave the best results on our test data. The mean gradient is given by (3), where S_l^t refers to the set of the l frames before time t. Obviously, S_l^t changes with time t. At the beginning of the experiment, there may not be enough frames to fill a queue of length l, so we use all the frames before time t until the queue is full.
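For concreteness, the following is a minimal sketch of (1)-(3) in NumPy/SciPy. The LoG scale `log_sigma` and the use of Sobel filters to obtain E_x^t and E_y^t are our assumptions; the paper cites the LoG operator [25] without giving parameters.

```python
from collections import deque

import numpy as np
from scipy import ndimage

QUEUE_LEN = 10  # queue length l in (3); l = 10 worked best on our test data


def frame_gradients(gray: np.ndarray, log_sigma: float = 1.0):
    """Per-frame gradient magnitude (1) and direction (2).

    log_sigma is an assumed LoG scale; the paper only cites the LoG
    operator [25] for edge detection without giving its parameters.
    """
    edges = ndimage.gaussian_laplace(gray.astype(np.float64), sigma=log_sigma)
    ex = ndimage.sobel(edges, axis=1)  # horizontal gradients E_x^t
    ey = ndimage.sobel(edges, axis=0)  # vertical gradients E_y^t
    magnitude = np.hypot(ex, ey)       # (1)
    direction = np.arctan2(ey, ex)     # (2); arctan2 avoids division by zero
    return magnitude, direction


class GradientQueue:
    """Queue of the last l gradient maps, giving E_mean^t of (3)."""

    def __init__(self, maxlen: int = QUEUE_LEN):
        self.maps = deque(maxlen=maxlen)

    def push(self, magnitude: np.ndarray) -> None:
        self.maps.append(magnitude)

    def mean(self) -> np.ndarray:
        # Before the queue fills, all frames seen so far are used,
        # as described above for the start of the experiment.
        return np.mean(np.stack(self.maps), axis=0)

    def var(self) -> np.ndarray:
        # Per-pixel variance over the queue, i.e. a_var^{mn} in (5).
        return np.var(np.stack(self.maps), axis=0)
```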
Before taking the gradients difference, we introduce a variable E_{var}^t to improve the result. E_{var}^t is a matrix of the same size as the gradients matrix, and every element of E_{var}^t is the variance of the gradients at the same pixel position over the queue:

E_{var}^t = \begin{pmatrix} a_{var}^{11} & \cdots & a_{var}^{1n} \\ \vdots & \ddots & \vdots \\ a_{var}^{m1} & \cdots & a_{var}^{mn} \end{pmatrix}   (4)

a_{var}^{mn} = \sigma( e_{t-l}^{mn}, \ldots, e_{t-2}^{mn}, e_{t-1}^{mn} )   (5)

As shown in (4) and (5), e_t^{mn} refers to the element at row m and column n of the gradients matrix at time t, and a_{var}^{mn} refers to the variance of the gradients in the queue at the same position. We introduce this as a threshold to preprocess the mean gradients. The main idea of the gradients difference is that the background gradients are essentially unchanged compared with the foreground, so the background can be eliminated by differencing. However, the difference relies on the fact that the gradients of the foreground often vary strongly. Although averaging the gradients can weaken the effect of nearby gradients, in scenes with a high flow of pedestrians people may appear continuously in a certain area, which can keep the mean gradients at a high level. To deal with this problem, we utilise E_{var}^t as a threshold, since the variance is low in background areas and high in densely populated areas. Each element of the differenced matrix is computed by

d_t^{mn} = \begin{cases} \| e_t^{mn} - e_{mean}^{mn} \| & \text{if } a_{var}^{mn} < T \\ \| e_t^{mn} - \min( E_{mean}^t ) \| & \text{otherwise} \end{cases}   (6)

When a_{var}^{mn} is below the threshold T, we regard the pixel as background. Otherwise, it is a foreground pixel, and we replace e_{mean}^{mn} with min(E_{mean}^t) so that the current frame's foreground gradients are preserved. Finally, the differenced matrix D_t is computed and stands for the difference result at time t.
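Building on the sketch above, the thresholded difference (6) can be written as follows; `var_threshold` plays the role of T (set to 3.8 in Section 3.2), and the absolute value stands in for the norm in (6).

```python
import numpy as np


def differenced_gradients(current: np.ndarray,
                          queue: "GradientQueue",
                          var_threshold: float = 3.8) -> np.ndarray:
    """Compute D_t as in (6).

    Low-variance (background) pixels are differenced against the
    per-pixel mean gradient, which cancels static structure; high-
    variance (densely populated) pixels are differenced against the
    global minimum of the mean map so that the foreground gradients of
    the current frame survive the subtraction.
    """
    e_mean = queue.mean()
    a_var = queue.var()
    background = a_var < var_threshold
    d = np.abs(current - e_mean)
    d[~background] = np.abs(current[~background] - e_mean.min())
    return d
```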

2.2 Multi-scale HoG patch extraction

The perspective transformation has always been a difficulty in the crowd counting problem: different cameras cause different perspective effects. For crowd counting, the ideal setup is a tilted camera without perspective effects; yet, in practice, camera angles are basically 45° because of the limitations of the roof and the requirements of the usage scenario. There have been many methods to tackle perspective transformation, such as camera calibration [26] and normalisation [27]. However, camera calibration needs multiple cameras to collect data, and simply applying a weight does not help much to improve the counting result, especially in detection-based methods. Therefore, we come up with multi-scale detection to handle perspective transformation, and the result turns out to work well in comparison with the state of the art.

2.2.1 Multi-scale overlapped patch: Given a difference result D_t, our method cuts it into patches. An overlapped partitioning method is adopted for the following reasons:

i. A traditional sliding window without overlap is prone to cutting the detected objects apart, so its matching accuracy is much lower than that of the overlapped method. To reduce the computational cost, we choose half of the sliding window as the step length.

ii. Compared with direct division, overlapping is more flexible. We are able to adjust the sliding step length to the needs of a diverse data set, which is in line with our practical application goals.

According to the perspective distance, we designed three patch scales to correspond to the various pedestrian positions in the training images. The scale is defined as ps ∈ {ps_n, ps_m, ps_f}, where ps_n, ps_m and ps_f refer to the scales from near to far. The patch scale ps also differs between data sets. In the Mall [23] data set, we use a 30 × 30 px patch for the near distance, a 20 × 20 px patch for the middle distance and a 10 × 10 px patch for the far distance. The main principle is that the size of the near-scale patch fits the size of the head of a nearby person, and likewise for the middle and distant ones. To meet these demands, the scales are calculated manually with a scale testing script.

Note that although the scale sizes are judged manually at the present stage of the experiment, we already have an idea for choosing scales automatically. Briefly, we first build a classifying model trained with head samples of different sizes. Then, we scan the whole image and utilise the model to obtain a score for the different sizes; the higher the score, the better the scale fits. We will realise fully automatic prediction in further research.
2.2.2 HoG extraction: After the gradients difference, HoG features are extracted from each patch. Compared with other feature descriptors, HoG has many advantages. First of all, because HoG operates on local grid elements of the image, it handles geometric and photometric deformations of the image very well. Secondly, HoG is robust under coarse spatial sampling, fine orientation sampling and strong local photometric normalisation. As long as pedestrians generally maintain an upright posture, even with some subtle body movements, HoG retains its detection accuracy. Therefore, HoG features are especially suitable for human detection in images.

A histogram component for θ_i is defined as (7), where e(p) denotes a single pixel's gradient in the differenced gradients matrix of a patch and O(p) represents the gradient direction at that pixel. In a patch window W, e(p) is accumulated for θ_i if O(p) lies between θ_{i-1} and θ_i:

h(\theta_i) = \sum_{p \in W \mid O(p) \in (\theta_{i-1}, \theta_i]} e(p)   (7)

In our experiments, there are nine gradient orientations, so θ_i = 2iπ/9, i ∈ {1, 2, ..., 9}. We utilise the fhog [28] function in Matlab. The computed HoG features are 3 × 9 + 5 dimensional: there are 2 × 9 contrast-sensitive orientation channels, nine contrast-insensitive orientation channels, four texture channels and one all-zeros channel (used as a 'truncation' feature). In summary, it gives a 32-dimensional feature vector for each cell.

Besides, we define the size feature of a single patch as s = Σ d(p), where d(p) denotes the differenced gradient at pixel p. We also obtain H by stringing the h(θ_i) together. Finally, the feature vector f extracted from a patch can be represented as

f = \{ H, s \}   (8)
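A sketch of the per-patch feature vector f = {H, s} of (7) and (8) follows. The plain orientation histogram below implements (7) directly and is a simplified stand-in for the 32-dimensional fhog [28] features used in the actual experiments.

```python
import numpy as np

N_BINS = 9  # nine gradient orientations, theta_i = 2*i*pi/9


def patch_features(d_patch: np.ndarray, o_patch: np.ndarray) -> np.ndarray:
    """Feature vector f = {H, s} for one patch, per (7) and (8).

    Each pixel's differenced gradient e(p) is accumulated into the
    orientation bin containing O(p), as in (7); the size feature s is
    the sum of the differenced gradients d(p) over the patch.
    """
    angles = np.mod(o_patch, 2 * np.pi)  # map (-pi, pi] to [0, 2*pi)
    bins = np.minimum((angles / (2 * np.pi / N_BINS)).astype(int), N_BINS - 1)
    hist = np.bincount(bins.ravel(), weights=d_patch.ravel(), minlength=N_BINS)
    size_feature = d_patch.sum()  # s = sum of d(p)
    return np.concatenate([hist, [size_feature]])
```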
2.2.3 Sample selection and training: At the beginning of the experiment, the classification result obtained from training was not satisfying. We figured out that the poor performance was due to our training samples. The UCSD and Mall data sets both provide head-based ground-truth as dotted points, and we label each sample according to whether there is a ground-truth point within the range of the patch. This method may produce samples whose ground-truth lies at a very marginal position; such samples have poor gradient values or are even almost empty. To tackle the problem, we set limits to filter out samples lacking information. A positive sample is selected only if the following requirements are met:

i. The Euclidean distance between the patch centre and the ground-truth point is less than a set threshold T_d.

ii. The total gradients of the patch are greater than a set threshold T_g.

Following these rules, we can filter out a number of bad samples and then classify the remaining samples recursively into the different scale sets depending on (9):

P_s \in P_{s+1} \quad \text{if} \quad \sum E(P_{s+1}) \ge \frac{(L_{s+1})^2}{(L_s)^2} \cdot \sum E(P_s)   (9)

Here P_s denotes a patch at scale level s, and L_s refers to the side length of P_s; level s + 1 is the scale smaller than s. In the classifying procedure, we first obtain a patch set from the gradients difference result with the largest patch size. Then, if patches in the set meet condition (9), they are included in the next smaller level of the patch set. We repeat this process until the minimum patch granularity is reached, as shown in Fig. 4.

Fig. 4 Training samples are classified recursively from the larger scale to the smaller scale
Note that although the scales size is judged manually at the level of the patch set. We repeat this process until the minimum
present stage of the experiment. We already have the idea which patch granularity is reached as shown in Fig. 4.
can make scales auto-chosen. Briefly, we first build a classifying Due to our strict selection and the data set's sparsity itself, the
model trained by different size head samples. Then, we scan the positive samples we obtained is far less than the negative ones. To
whole image and utilise the model to obtain a score for the avoid unbalanced sampling, we simply apply undersampling to
different sizes. The higher the score is, the more suitably the scale adjust to training data set. We randomly picked out a certain
fits. We will realise fully automatic prediction in further research. amount of negative samples according to the number of positive
samples. Together with positive samples, the training data set is
2.3 Crowd counting

The crowd counting procedure begins after the classification results are acquired. Based on the overlapped sliding window, we choose a density map to evaluate the total crowd count and then sum up the density maps of the different scales with the weight of the perspective map. The perspective weight is positively correlated with the distance between an object in the scene and the camera. The whole counting process is demonstrated in Fig. 5.

Fig. 5 Process of how a positive sample is turned into values in the sum map Ms. Note that the centre clustered by k-means is a single point, but it contributes a patch of values in Ms (with Gaussian convolution)
2.3.1 Generating density map: Given a sequence of detection results at a certain scale, we calculate the density map in several steps. Firstly, if a prediction is positive (a head is detected), we use k-means to cluster a centre as the predicted head position; the object to be clustered is the binarised gradients patch that corresponds to the prediction. Secondly, the relative coordinates of the prediction centre in the patch are transformed into absolute coordinates on the whole picture. Thirdly, we use a Gaussian kernel to spread the location information of the point, which effectively solves the problem of repeated detections of the same target.

We construct two matrices called the sum map M_s and the count map M_c, both the same size as the original image. The sum map M_s records the values of every predicted head position after Gaussian convolution. The count map M_c records how many times the elements at the corresponding positions of M_s were updated. After all positive predictions of an image are processed, we obtain the density map M_d at the present scale with (10), where ./ denotes element-wise division:

M_d = M_s \;./\; M_c   (10)
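A sketch of the sum-map/count-map construction of (10); the kernel radius and sigma are assumed values, since the text states only that the Gaussian variance differs per scale (Section 3.2).

```python
import numpy as np


def gaussian_patch(radius: int, sigma: float) -> np.ndarray:
    """Normalised 2D Gaussian stamp of side 2*radius + 1."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(shape, head_points, radius=5, sigma=2.0):
    """Build M_d = M_s ./ M_c of (10) for one scale.

    Every predicted head centre stamps a Gaussian patch into the sum
    map M_s, while the count map M_c counts how often each pixel was
    updated; the element-wise ratio merges repeated detections of the
    same target. radius and sigma are assumed, scale-dependent values.
    """
    m_s, m_c = np.zeros(shape), np.zeros(shape)
    kernel = gaussian_patch(radius, sigma)
    for row, col in head_points:
        top, left = max(row - radius, 0), max(col - radius, 0)
        bottom = min(row + radius + 1, shape[0])
        right = min(col + radius + 1, shape[1])
        m_s[top:bottom, left:right] += kernel[
            top - row + radius:bottom - row + radius,
            left - col + radius:right - col + radius]
        m_c[top:bottom, left:right] += 1
    return np.divide(m_s, m_c, out=np.zeros(shape), where=m_c > 0)
```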
2.3.2 Summing with perspective: We obtain density maps corresponding to the different scales. In our experiment, three scales from small to large are chosen, so we obtain three density maps M_d^n, M_d^m and M_d^f for each image. The total count N in an image is computed by (11):

N = \sum_{m \in M} \left( \frac{1}{\delta(p)} \cdot m_d^n + \beta \cdot m_d^m + \left( 1 - \beta - \frac{1}{\delta(p)} \right) \cdot m_d^f \right)   (11)

where δ(p) refers to the perspective weight at pixel p. Values in the perspective map rise from near to far; in other words, people in the distant area receive more weight than near ones. The variable m presents the density map's value at pixel p, and β is the parameter that adjusts the weights of the different scales. The equation reduces the contribution from large-scale detection when a pixel belongs to the distant area and, conversely, reduces the contribution from small-scale detection when the pixel belongs to the near area. We find that although the smaller scale produces a few more false detections than the larger ones and tends to overcount, it truly achieves better performance on distant pedestrians.
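Equation (11) translates directly into the following sketch; it assumes the three density maps and the perspective map δ are given as arrays of equal shape, with β = 0.35 for Mall and 0.55 for UCSD (Section 3.2).

```python
import numpy as np


def total_count(md_near, md_mid, md_far, perspective, beta):
    """Accumulate the total count N as in (11).

    perspective holds delta(p), rising from near to far, so the
    near-scale weight 1/delta(p) shrinks in distant regions while the
    far-scale weight 1 - beta - 1/delta(p) grows there.
    """
    w_near = 1.0 / perspective
    w_far = 1.0 - beta - w_near
    return float(np.sum(w_near * md_near + beta * md_mid + w_far * md_far))
```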
3 Experiments

3.1 Data set

We choose two common data sets in crowd counting: Mall [23] and UCSD [24]. Fig. 6 gives a glance at both data sets and their ground-truth distributions, and Table 1 gives the detailed information of the two data sets. The Mall data set has high resolution and a very obvious perspective effect, which is in line with our purpose of applying multiple scales. Both data sets provide dotted ground-truth at the head of each person; this way of annotation is the starting point of our head-based detection. In an early experiment, our detection target was the whole human body instead of the head. The result turned out to be unsatisfactory, as people in the image are easily cut apart at detection time, and some of them are obscured by objects so that only their head can be seen.

Fig. 6 Input image and count distribution in (a) Mall data set, (b) UCSD data set. It can be seen that the perspective effect in Mall is more evident, so we choose a relatively hierarchical scale set for Mall and a smooth scale set for UCSD

Table 1 Basic information for the UCSD and Mall data sets

Data   Frames  Resolution  FPS  Count  Total
UCSD   2000    238 × 158   10   11-46  49,885
Mall   2000    640 × 480   2    13-53  62,325

3.2 Training method

In the experiments, the patch scales are set as 30 × 30, 20 × 20 and 10 × 10 in the Mall data set, and 12 × 12, 10 × 10 and 8 × 8 in UCSD. Gaussian kernels with different variances correspond to these scales. Taking into account both computational efficiency and counting accuracy, the overlap threshold is set to 0.5. The variance threshold used in the differenced gradients is set to 3.8. The parameter β that adjusts the perspective weight is 0.35 in Mall and 0.55 in UCSD.

Owing to the Mall data set's low FPS (2), the count distribution in Mall basically has no regularity, so we use the k-fold method to split the data set, with k = 5: 400 frames for testing and 1600 frames for training. The testing set is selected from the start of the frames and switches among the five folds; once the testing set is confirmed, the remaining frames make up the training set. Unlike the Mall data set, UCSD has a high FPS (10), so it has a smoother count distribution. Chan et al. [24] propose to split the data set by density, dividing UCSD into four portions: 'max', 'down', 'up' and 'min'. In particular, 'max' picks one frame for every five frames from 600 to 1400 and contains the most frames of the four. 'down' selects frames from 1205 to 1600, which is the most crowded part. On the contrary, 'up' chooses the most sparse part of the distribution, frames 805-1100. 'min' contains only 10 frames, 640:80:1360. Besides the k-fold method, we also use this splitting method to construct new training data for a comprehensive comparison with other crowd counting algorithms.
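The two splitting schemes can be sketched as follows; the frame steps for 'down' and 'up' are not given in the text and are assumed here to match the step of 5 used by 'max'.

```python
def mall_kfold(n_frames=2000, k=5):
    """Contiguous fivefold split for Mall: 400 test frames per fold,
    taken from the front of the sequence, the rest for training."""
    fold = n_frames // k
    for i in range(k):
        test = set(range(i * fold, (i + 1) * fold))
        train = [f for f in range(n_frames) if f not in test]
        yield train, sorted(test)


# Density-based UCSD splits after Chan et al. [24]; the text gives the
# step only for 'max' (every 5th frame) and 'min' (640:80:1360), so the
# steps for 'down' and 'up' are assumptions.
UCSD_SPLITS = {
    'max': range(600, 1400, 5),    # most frames
    'down': range(1205, 1600, 5),  # most crowded portion (step assumed)
    'up': range(805, 1100, 5),     # most sparse portion (step assumed)
    'min': range(640, 1361, 80),   # 10 frames, 640:80:1360
}
```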

3.3 Counting performance

Crowd counting performance is evaluated by three universal quantitative metrics: mean absolute error (MAE), mean square error (MSE) and mean deviation error (MDE):

\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} | y_i - \hat{y}_i |   (12)

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} ( y_i - \hat{y}_i )^2   (13)

\text{MDE} = \frac{1}{N} \sum_{i=1}^{N} \frac{| y_i - \hat{y}_i |}{y_i}   (14)
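The three metrics of (12)-(14) translate directly into the following sketch.

```python
import numpy as np


def counting_metrics(y_true, y_pred):
    """MAE (12), MSE (13) and MDE (14) over N test frames."""
    y = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_pred, dtype=float)
    err = np.abs(y - y_hat)
    return {'MAE': err.mean(),
            'MSE': ((y - y_hat) ** 2).mean(),
            'MDE': (err / y).mean()}
```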

We choose several related algorithms to compare against: a recent pedestrian detector (Detector [29]), histogram of moving gradients (HoMG [30]), least square support vector regression (LSSVR [31]), kernel ridge regression (KRR [32]), random forest regression (RFR [33]), Gaussian process regression (GPR [24]), ridge regression (RR [34]) and cumulative attribute ridge regression (CA-RR [10]). In particular, we regard HoMG [30] as a very appropriate comparison because, like our method, it is a local method that uses an overlapped sliding window and extracts HoG features from differenced gradients. The major difference is that HoMG's detection target is the whole body of the pedestrian, while our goal is to detect the head at multiple scales.

Table 2 shows the estimation results of the experiments. It can be seen that in the Mall data set, our method achieves the best performance among all compared methods. The much lower MSE and MDE prove that our algorithm has high stability in prediction. Fig. 7 shows the comparison of the count prediction and the ground-truth on the two data sets. In the UCSD data set, despite a slight decrease in performance, we still obtain a suboptimal result. We think the unsatisfactory performance on the 'down' split (Table 3) is due to the low resolution of UCSD: it is hard for our method to give an accurate prediction when plenty of pedestrians overlap in such a low-resolution region.

Fig. 7 Estimation result in (a) Mall, (b) UCSD

4 Conclusion

In this paper, a novel crowd counting method called MSHD is proposed. The method aims to evaluate crowd density in an effective and accurate way. We apply multiple scales to enable MSHD to deal with perspective transformation in video surveillance, which greatly improves detection accuracy. Then, to obtain the count result, the density maps for the different scales are accumulated with the weight of the perspective map. Experimental results showed that our method achieves state-of-the-art performance on the two test data sets. In particular, the algorithm has remarkable stability across all test data sets, which proves the robustness of the multi-scales method. Besides, our method has the advantages of a small memory footprint and fast training, which we believe is the key point for large-scale commercial production.

Table 2 Comparison results for the Mall and UCSD data sets. Our method is the best in the Mall data set and demonstrates outstanding stability in the UCSD data set

Method           Mall [23]              UCSD [24]
                 MAE    MSE    MDE      MAE    MSE    MDE
Detector [29]    20.55  439.1  0.641    —      —      —
HoMG [30]        5.34   38.8   0.18     —      —      —
LSSVR [31]       3.51   18.2   0.108    2.20   7.3    0.107
KRR [32]         3.51   18.1   0.108    2.16   7.5    0.107
RFR [33]         3.91   21.5   0.121    2.42   8.5    0.116
GPR [24]         3.72   20.1   0.115    2.24   8.0    0.112
RR [34]          3.59   19.0   0.110    2.25   7.8    0.110
CA-RR [10]       3.43   17.7   0.105    2.07   6.9    0.102
ours             2.90   14.4   0.091    2.10   6.5    0.097

Table 3 MAEs in the four density-based splits of UCSD

Method              'max'   'down'  'up'    'min'
MESA [35]           1.70    1.28    1.59    2.02
RF [4]              1.70    2.16    1.61    2.20
Arteta et al. [36]  1.24    1.31    1.69    1.49
ours                1.26    1.93    1.54    1.61

5 Acknowledgments

This research was supported by the Natural Science Foundation of Guangdong Province, China (No. 2016A030313288). We thank the anonymous reviewers for their suggestions and comments.

6 References

[1] Ryan, D., Denman, S., Sridharan, S., et al.: 'An evaluation of crowd counting methods, features and regression models', Comput. Vis. Image Underst., 2015, 130, pp. 1-17
[2] Kong, D., Gray, D., Tao, H.: 'A viewpoint invariant approach for crowd counting'. Proc. 18th Int. Conf. on Pattern Recognition, Hong Kong, China, August 2006, pp. 1187-1190
[3] Li, X., Shen, L., Li, H.: 'Estimation of crowd density based on wavelet and support vector machine', Trans. Inst. Meas. Control, 2006, 28, (3), pp. 299-308
[4] Fiaschi, L., Nair, R., Koethe, U., et al.: 'Learning to count with a regression forest and structured labels'. Proc. Int. Conf. Pattern Recognition, Tsukuba, Japan, November 2012, pp. 2685-2688
[5] Ryan, D., Denman, S., Fookes, C., et al.: 'Crowd counting using multiple local features'. Proc. 2009 Digital Image Computing: Techniques and Applications, Melbourne, Australia, December 2009, pp. 81-88
[6] Celik, H., Hanjalic, A., Hendriks, E.A.: 'Towards a robust solution to people counting'. 2006 IEEE Int. Conf. Image Processing, Atlanta, USA, October 2006, pp. 2401-2404
[7] Donatello, C., Pasquale, F., Gennaro, P., et al.: 'A method for counting moving people in video surveillance videos', EURASIP J. Adv. Signal Process., 2010, (1), pp. 231-240
[8] Kilambi, P., Masoud, O., Papanikolopoulos, N.: 'Crowd analysis at mass transit sites'. IEEE Intelligent Transportation Systems Conf. (ITSC), September 2006, pp. 753-758
[9] Meynberg, O., Cui, S., Reinartz, P., et al.: 'Detection of high-density crowds in aerial images using texture classification', Remote Sens., 2016, 8, (6)
[10] Chen, K., Loy, C.C., Gong, S., et al.: 'Feature mining for localised crowd counting'. Proc. British Machine Vision Conf., Guildford, UK, September 2012, pp. 21.1-21.11
[11] Ryan, D., Denman, S., Fookes, C., et al.: 'Crowd counting using multiple local features', Digital Image Comput. Tech. Appl., 2009, pp. 81-88
[12] Pham, V., Kozakaya, T., Yamaguchi, O., et al.: 'Count forest: co-voting uncertain number of targets using random forest for crowd density estimation'. Int. Conf. Computer Vision, Santiago, Chile, December 2015, pp. 3253-3261
[13] Hashemzadeh, M., Pan, G., Wang, Y., et al.: 'Combining velocity and location-specific spatial clues in trajectories for counting crowded moving objects', Int. J. Pattern Recognit. Artif. Intell., 2013, 27, (2), pp. 1-31
[14] Antonini, G., Thiran, J.: 'Trajectories clustering in ICA space: an application to automatic counting of pedestrians in video sequences'. Proc. Advanced Concepts for Intelligent Vision Systems, Brussels, Belgium, August 2004, pp. 1-17
[15] Mahdi, H., Gang, P., Min, Y.: 'Counting moving people in crowds using motion statistics of feature-points', Multimed. Tools Appl., 2014, 72, (1), pp. 453-487
[16] Wang, C., Zhang, H., Yang, L., et al.: 'Deep people counting in extremely dense crowds'. Proc. 23rd ACM Int. Conf. Multimedia, Brisbane, Australia, October 2015, pp. 1299-1302
[17] Marsden, M., McGuinness, K., Little, S., et al.: 'Fully convolutional crowd counting on highly congested scenes'. Int. Conf. Computer Vision Theory and Applications, Porto, Portugal, February 2017
[18] Marsden, M., McGuinness, K., Little, S., et al.: 'ResnetCrowd: a residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification'. arXiv preprint arXiv:1705.10698, 2017
[19] Walach, E., Wolf, L.: 'Learning to count with CNN boosting', in 'ECCV' (Springer, Berlin, 2016), pp. 660-676
[20] Shang, C., Ai, H., Bai, B.: 'End-to-end crowd counting via joint learning local and global count'. IEEE ICIP, Phoenix, USA, September 2016, pp. 1215-1219
[21] Zhang, C., Li, H., Wang, X., et al.: 'Cross-scene crowd counting via deep convolutional neural networks'. Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, USA, June 2015, pp. 833-841
[22] Zhang, Y., Zhou, D., Chen, S., et al.: 'Single-image crowd counting via multi-column convolutional neural network'. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016, pp. 589-597
[23] Chen, K., Loy, C.C., Gong, S., et al.: 'From semi-supervised to transfer counting of crowds'. Proc. IEEE Int. Conf. Computer Vision (ICCV), Sydney, Australia, December 2013, pp. 2256-2263
[24] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: 'Privacy preserving crowd monitoring: counting people without people models or tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Anchorage, USA, June 2008, pp. 1-7
[25] 'Laplace of Gaussian'. Available at http://fourier.eng.hmc.edu/e161/lectures/gradients/node9.html, accessed 27 October 2017
[26] Ryan, D., Denman, S., Fookes, C., et al.: 'Scene invariant multi camera crowd counting', Pattern Recognit. Lett., 2014, 44, (15), pp. 98-112
[27] Loy, C.C., Chen, K., Gong, S., et al.: 'Crowd counting and profiling: methodology and evaluation', in 'Modeling, simulation and visual analysis of crowds' (Springer, New York, 2013), pp. 347-382
[28] 'Matlab fhog'. Available at http://www.cs.berkeley.edu/~rbg/latent/index.html
[29] Benenson, R., Omran, M., Hosang, J., et al.: 'Ten years of pedestrian detection, what have we learned?'. Proc. Eur. Conf. Computer Vision, CVRSUAD Workshop, Zurich, Switzerland, September 2014
[30] Siva, P., Shafiee, M.J., Jamieson, M., et al.: 'Real-time, embedded scene invariant crowd counting using scale-normalized histogram of moving gradients (HoMG)'. Computer Vision and Pattern Recognition, Las Vegas, USA, June 2016, pp. 885-892
[31] Van Gestel, T., Suykens, J.A.K., De Moor, B., et al.: 'Automatic relevance determination for least squares support vector machine regression'. Int. Joint Conf. Neural Networks, Washington DC, USA, July 2001, pp. 2416-2421
[32] An, S., Liu, W., Venkatesh, S.: 'Face recognition using kernel ridge regression'. Proc. 2007 IEEE Conf. on Computer Vision and Pattern Recognition, Minneapolis, USA, June 2007, pp. 1-7
[33] Liaw, A., Wiener, M.: 'Classification and regression by random forest', R News, 2002, 2, (3), pp. 18-22
[34] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: 'Privacy preserving crowd monitoring: counting people without people models or tracking'. Proc. IEEE Conf. Computer Vision and Pattern Recognition, Anchorage, USA, June 2008, pp. 1-7
[35] Lempitsky, V., Zisserman, A.: 'Learning to count objects in images'. Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, December 2010, pp. 1324-1332
[36] Arteta, C., Lempitsky, V., Noble, J.A., et al.: 'Interactive object counting'. Proc. Eur. Conf. Computer Vision, Zurich, Switzerland, September 2014, pp. 504-518

