Scene invariant crowd counting using multi-scales head detection in video surveillance
IET Image Process., 2018, Vol. 12, Iss. 12, pp. 2258–2263
Research Article
Abstract: With the soaring increase in the application of video surveillance in daily life, crowd density estimation has become an active research field. Crowd counting is closely related to traffic planning, pedestrian analysis and emergency warning. Here, a novel crowd counting method based on multi-scales head detection is proposed. The authors' approach first uses the gradients difference to extract the foreground of the images and applies overlapped patches at different scales to split the input images. Then, the patches are selected and classified into different groups according to their gradient distributions, and features are extracted for training. Finally, from the prediction results, density maps at different scales are computed and summed with the perspective map. In particular, the authors' method overcomes the traditional detection methods' deficiency of low accuracy under perspective transformation. Experiments demonstrate that the proposed method not only achieves high counting accuracy but also exhibits outstanding robustness on the test data sets.
… occlusion, which is a major obstacle to improving detection accuracy (Fig. 1 demonstrates how ideal multi-scale detection works). Our method is shown to obtain state-of-the-art results on the Mall and UCSD data sets.

Fig. 1 Final predicted state in our proposed method by multi-scale approach

2 Methodology

As previously stated, local methods outperform holistic ones, since it is easier to train an effective model that regresses local features towards the ground truth. Our method is based on features extracted by overlapped sliding windows of different sizes (a sketch of such patch splitting follows below), and the whole algorithm framework is shown in Fig. 2. The workflow of our algorithm is as follows. First, the differenced gradients are computed. Second, we divide the image into patches at various scales; the patches are then classified, and HoG (histogram of oriented gradients) and size features are extracted. Finally, after training and prediction, we generate density maps for the different scales and, with the perspective map, accumulate the crowd count.
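To make this overlapped multi-scale splitting concrete, here is a minimal Python sketch; the window sizes, the 50% overlap and the function name are illustrative assumptions rather than values taken from the paper:

```python
def multi_scale_patches(image, scales=(32, 64, 128), overlap=0.5):
    """Yield overlapped square patches at several window sizes.
    The scale values and the 50% overlap are illustrative assumptions."""
    h, w = image.shape[:2]
    for size in scales:
        stride = max(1, int(size * (1 - overlap)))  # overlapped sliding-window step
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                yield size, (y, x), image[y:y + size, x:x + size]
```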
Fig. 2 Framework of our proposed method

We divide the method description into three parts: foreground segmentation (Section 2.1), multi-scale feature extraction (Section 2.2), and calculation and addition of multi-scale density maps (Section 2.3).

2.1 Foreground segmentation

Effective extraction of the foreground reduces the interference of the noisy background with the crowd information and improves the cross-scene prediction ability of the model. The background model should not only be robust to illumination, noise and the shadows caused by occlusion, but also meet the requirements of real-time processing of camera video data as well as dynamic scene changes.

In this paper, we use the difference mean gradients method to extract the foreground. As demonstrated in Fig. 3, we build a queue to maintain a real-time mean gradient and take its difference from the gradients of the current frame. Formally, given an input frame sequence S, let S_t denote the frame at time t. We first clip each frame to the region of interest. Then the LoG (Laplacian of Gaussian) [25] operator is applied to detect the edges of the image, as it is sufficiently accurate and computationally efficient.

Fig. 3 The final differenced gradients D_t are computed by subtracting the mean gradients of a length-l sequence from the current image's gradients, filtered with the variance of each pixel as the threshold

For the frame S_t, the corresponding grayscale gradients are defined as E_t = (E_x^t, E_y^t), where E_x^t is the horizontal gradient magnitude and E_y^t is the vertical gradient magnitude. The gradient direction is also computed, since it is needed for the HoG features later:

$E_t = \sqrt{(E_x^t)^2 + (E_y^t)^2}$ (1)

$O = \arctan\left(\frac{E_y^t}{E_x^t}\right)$ (2)

$E_{\mathrm{mean}}^t = \frac{1}{l} \sum_{i \in S_l^t} E_i$ (3)

We use l to denote the queue length; setting l to 10 gave the best results on our test data. The mean gradient is given by (3), where S_l^t is the set of the l frames before time t; S_l^t therefore changes with t. At the beginning of a sequence there may not be enough frames to fill a queue of length l, so we use all frames before time t until the queue is full.
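As a rough illustration of this stage, the following sketch computes the gradient fields of (1) and (2) and maintains the rolling mean of (3). It assumes OpenCV and NumPy; a Gaussian blur followed by Sobel derivatives stands in for the paper's LoG-based edge gradients, and the function names are ours:

```python
from collections import deque

import cv2
import numpy as np

QUEUE_LEN = 10  # l in (3); the paper reports the best results with l = 10

def gradient_fields(frame_gray):
    """Gradient magnitude E_t (1) and direction O (2) for one frame."""
    blurred = cv2.GaussianBlur(frame_gray, (5, 5), 0)   # Gaussian smoothing
    ex = cv2.Sobel(blurred, cv2.CV_64F, 1, 0)           # horizontal gradients E_x^t
    ey = cv2.Sobel(blurred, cv2.CV_64F, 0, 1)           # vertical gradients E_y^t
    magnitude = np.sqrt(ex ** 2 + ey ** 2)              # (1)
    direction = np.arctan2(ey, ex)                      # (2); arctan2 avoids division by zero
    return magnitude, direction

queue = deque(maxlen=QUEUE_LEN)  # S_l^t: gradient maps of the last l frames

def update_mean_gradient(magnitude):
    """Push the current magnitude map and return E_mean^t as in (3).
    Until the queue is full, the mean is taken over the frames seen so far."""
    queue.append(magnitude)
    return np.mean(np.stack(list(queue)), axis=0)
```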
Before taking the gradients difference, we introduce a variable E_var^t to improve the result. E_var^t is a matrix of the same size as the gradients matrix, and each of its elements is the variance of the gradients at the same pixel position across the queue:

$E_{\mathrm{var}}^t = \begin{pmatrix} a_{\mathrm{var}}^{11} & \cdots & a_{\mathrm{var}}^{1n} \\ \vdots & \ddots & \vdots \\ a_{\mathrm{var}}^{m1} & \cdots & a_{\mathrm{var}}^{mn} \end{pmatrix}$ (4)

$a_{\mathrm{var}}^{mn} = \sigma\left(e_{t-l}^{mn}, \ldots, e_{t-2}^{mn}, e_{t-1}^{mn}\right)$ (5)

As shown in (4) and (5), e_t^{mn} denotes the element at row m and column n of the gradients matrix at time t, and a_var^{mn} is the variance of the gradients in the queue at that position. We introduce it as a threshold to preprocess the mean gradients. The main idea of gradients difference is that the background gradients are essentially unchanged compared with the foreground, so the background can be eliminated by differencing. However, differencing relies on the foreground gradients varying strongly. Although averaging the gradients can weaken the effect of similar consecutive gradients, in scenes with high pedestrian flow people may appear continuously in a certain area, which keeps the mean gradients at a high level. To deal with this problem, we use E_var^t as a threshold, since the variance is low in background areas and high in densely populated areas. Each element of the differenced matrix is computed by

$d_t^{mn} = \begin{cases} \lVert e_t^{mn} - e_{\mathrm{mean}}^{mn} \rVert & \text{if } a_{\mathrm{var}}^{mn} < T \\ \lVert e_t^{mn} - \min(E_{\mathrm{mean}}^t) \rVert & \text{otherwise} \end{cases}$ (6)

When a_var^{mn} is below the threshold T, we regard the pixel as background. Otherwise it is a foreground pixel, and we replace e_mean^{mn} with min(E_mean^t) so that the current frame's foreground gradients are preserved. Finally, the differenced matrix D_t is computed and stands for the difference result at time t.
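A compact NumPy sketch of this variance-gated differencing (the function name is ours; `queue` is the gradient-map queue from the previous sketch, and `threshold` plays the role of T in (6)):

```python
import numpy as np

def differenced_gradients(e_t, queue, threshold):
    """Compute D_t following (4)-(6)."""
    stack = np.stack(list(queue))            # shape (l, m, n)
    e_var = np.var(stack, axis=0)            # (4)-(5): per-pixel variance a_var^{mn}
    e_mean = np.mean(stack, axis=0)          # E_mean^t from (3)
    background = e_var < threshold           # low variance -> stable background pixel
    # Background pixels subtract their own mean gradient; foreground pixels
    # subtract the global minimum of the mean map, preserving their gradients.
    return np.where(background,
                    np.abs(e_t - e_mean),
                    np.abs(e_t - e_mean.min()))
```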
A histogram component of θ_i is defined in (7), where e(p) denotes a single pixel's gradient in the differenced gradients matrix of a patch and O(p) represents the gradient direction at that pixel. Within a patch window W, e(p) is accumulated into the bin for θ_i if O(p) lies between θ_{i−1} and θ_i.
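Although (7) itself falls outside this excerpt, the description maps onto a straightforward orientation-binning routine. A sketch under that reading (the bin count and function name are our assumptions):

```python
import numpy as np

def patch_orientation_histogram(d_patch, o_patch, n_bins=9):
    """Accumulate differenced-gradient magnitudes e(p) into direction bins θ_i."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)   # bin boundaries θ_0 .. θ_n
    idx = np.digitize(o_patch.ravel(), edges) - 1    # which (θ_{i-1}, θ_i] holds O(p)
    idx = np.clip(idx, 0, n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx, d_patch.ravel())            # e(p) accumulated into bin θ_i
    return hist
```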
3.3 Counting performance

Crowd counting performance is evaluated by three universal quantitative metrics: mean absolute error (MAE), mean square error (MSE) and mean deviation error (MDE):

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ (12)

$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (13)

$\mathrm{MDE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}$ (14)
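These three metrics translate directly into a few lines of NumPy (a minimal sketch; the function name is ours):

```python
import numpy as np

def counting_metrics(y_true, y_pred):
    """MAE (12), MSE (13) and MDE (14) over N frames."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean()                      # (12)
    mse = ((y_true - y_pred) ** 2).mean()     # (13)
    mde = (abs_err / y_true).mean()           # (14): per-frame relative error
    return mae, mse, mde
```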
Fig. 7 Estimation result on (a) Mall, (b) UCSD
Table 2 Comparison results for the Mall and UCSD data sets. Our method is the best on the Mall data set and demonstrates outstanding stability on the UCSD data set

Method          Mall [23]               UCSD [24]
                MAE    MSE    MDE       MAE    MSE    MDE
Detector [29]   20.55  439.1  0.641     —      —      —
HoMG [30]       5.34   38.8   0.18      —      —      —
LSSVR [31]      3.51   18.2   0.108     2.20   7.3    0.107
KRR [32]        3.51   18.1   0.108     2.16   7.5    0.107
RFR [33]        3.91   21.5   0.121     2.42   8.5    0.116
GPR [24]        3.72   20.1   0.115     2.24   8.0    0.112
RR [34]         3.59   19.0   0.110     2.25   7.8    0.110
CA-RR [10]      3.43   17.7   0.105     2.07   6.9    0.102
ours            2.90   14.4   0.091     2.10   6.5    0.097
4 Conclusion

In this paper, a novel crowd counting method called MSHD is proposed. The method aims to evaluate crowd density in an effective and accurate way. We apply multiple scales to enable MSHD to deal with perspective transformation in video surveillance, which greatly improves detection accuracy. Then, to obtain the count result, the density maps for the different scales are accumulated with the weights of the perspective map. Experimental results show that our method achieves state-of-the-art performance on the two test data sets. In particular, the algorithm is remarkably stable on all test data sets, which proves the robustness of the multi-scales method. Besides, our method has the advantages of a small memory footprint and fast training, which we believe is the key point for large-scale commercial production.

5 Acknowledgments

This research was supported by the Natural Science Foundation of Guangdong Province, China (No. 2016A030313288). We thank the anonymous reviewers for their suggestions and comments.

6 References

[1] Ryan, D., Denman, S., Sridharan, S., et al.: 'An evaluation of crowd counting methods, features and regression models', J. Comp. Vis. Image Und., 2015, 130, pp. 1–17
[2] Kong, D., Gray, D., Tao, H.: 'A viewpoint invariant approach for crowd counting'. Proc. 18th Int. Conf. Pattern Recognition, Hong Kong, China, August 2006, pp. 1187–1190
[3] Li, X., Shen, L., Li, H.: 'Estimation of crowd density based on wavelet and support vector machine', Trans. Inst. Meas. Control, 2006, 28, (3), pp. 299–308
[4] Fiaschi, L., Nair, R., Koethe, U., et al.: 'Learning to count with a regression forest and structured labels'. Proc. Int. Conf. Pattern Recognition, Tsukuba, Japan, November 2012, pp. 2685–2688
[5] Ryan, D., Denman, S., Fookes, C., et al.: 'Crowd counting using multiple local features'. Proc. 2009 Digital Image Computing: Techniques and Applications, Melbourne, Australia, December 2009, pp. 81–88
[6] Celik, H., Hanjalic, A., Hendriks, E.A.: 'Towards a robust solution to people counting'. 2006 IEEE Int. Conf. Image Processing, Atlanta, USA, October 2006, pp. 2401–2404
[7] Donatello, C., Pasquale, F., Gennaro, P., et al.: 'A method for counting moving people in video surveillance videos', EURASIP J. Adv. Sig. Process., 2010, (1), pp. 231–240
[8] Kilambi, P., Masoud, O., Papanikolopoulos, N.: 'Crowd analysis at mass transit sites'. IEEE Intelligent Transportation Systems Conf. (ITSC), September 2006, pp. 753–758
[9] Meynberg, O., Cui, S., Reinartz, P., et al.: 'Detection of high-density crowds in aerial images using texture classification', Remote Sens., 2016, 8, (6)
[10] Chen, K., Loy, C.C., Gong, S., et al.: 'Feature mining for localised crowd counting'. Proc. British Machine Vision Conf., Guildford, UK, September 2012, pp. 21.1–21.11
[11] Ryan, D., Denman, S., Fookes, C., et al.: 'Crowd counting using multiple local features', Digital Image Comput. Tech. Appl., 2009, pp. 81–88
[12] Pham, V., Kozakaya, T., Yamaguchi, O., et al.: 'Count forest: co-voting uncertain number of targets using random forest for crowd density estimation'. Int. Conf. Computer Vision, Santiago, Chile, December 2015, pp. 3253–3261
[13] Hashemzadeh, M., Pan, G., Wang, Y., et al.: 'Combining velocity and location-specific spatial clues in trajectories for counting crowded moving objects', Int. J. Patt. Rec. Artific. Intel., 2013, 27, (2), pp. 1–31
[14] Antonini, G., Thiran, J.: 'Trajectories clustering in ICA space: an application to automatic counting of pedestrians in video sequences'. Proc. Advanced Concepts for Intelligent Vision Systems, Brussels, Belgium, August 2004, pp. 1–17
[15] Mahdi, H., Gang, P., Min, Y.: 'Counting moving people in crowds using motion statistics of feature-points', Multi. Tool. Appl., 2014, 72, (1), pp. 453–487
[16] Wang, C., Zhang, H., Yang, L., et al.: 'Deep people counting in extremely dense crowds'. Proc. 23rd ACM Int. Conf. Multimedia, Brisbane, Australia, October 2015, pp. 1299–1302
[17] Marsden, M., McGuinness, K., Little, S., et al.: 'Fully convolutional crowd counting on highly congested scenes'. Int. Conf. Computer Vision Theory Appl., Porto, Portugal, February 2017
[18] Marsden, M., McGuinness, K., Little, S., et al.: 'ResnetCrowd: a residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification'. arXiv preprint arXiv:1705.10698, 2017
[19] Walach, E., Wolf, L.: 'Learning to count with CNN boosting'. ECCV (Springer, Berlin, 2016), pp. 660–676
[20] Shang, C., Ai, H., Bai, B.: 'End-to-end crowd counting via joint learning local and global count'. IEEE ICIP, Phoenix, USA, September 2016, pp. 1215–1219
[21] Zhang, C., Li, H., Wang, X., et al.: 'Cross-scene crowd counting via deep convolutional neural networks'. Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition, Boston, USA, June 2015, pp. 833–841
[22] Zhang, Y., Zhou, D., Chen, S., et al.: 'Single-image crowd counting via multi-column convolutional neural network'. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, June 2016, pp. 589–597
[23] Chen, K., Loy, C.C., Gong, S., et al.: 'From semi-supervised to transfer counting of crowds'. Proc. IEEE Int. Conf. Computer Vision (ICCV), Sydney, Australia, December 2013, pp. 2256–2263
[24] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: 'Privacy preserving crowd monitoring: counting people without people models or tracking'. Proc. Computer Vision and Pattern Recognition, Anchorage, USA, June 2008, pp. 1–7
[25] 'Laplacian of Gaussian'. Available at http://fourier.eng.hmc.edu/e161/lectures/gradients/node9.html, accessed 27 October 2017
[26] Ryan, D., Denman, S., Fookes, C., et al.: 'Scene invariant multi camera crowd counting', Patt. Recogn. Lett., 2014, 44, (15), pp. 98–112
[27] Loy, C.C., Chen, K., Gong, S., et al.: 'Crowd counting and profiling: methodology and evaluation', in 'Modeling, Simulation and Visual Analysis of Crowds' (Springer, New York, 2013), pp. 347–382
[28] 'Matlab-fhog'. Available at http://www.cs.berkeley.edu/~rbg/latent/index.html
[29] Benenson, R., Omran, M., Hosang, J., et al.: 'Ten years of pedestrian detection, what have we learned?'. Proc. Eur. Conf. Computer Vision, CVRSUAD Workshop, Zurich, Switzerland, September 2014
[30] Siva, P., Shafiee, M.J., Jamieson, M., et al.: 'Real-time, embedded scene invariant crowd counting using scale-normalized histogram of moving gradients (HoMG)'. Computer Vision and Pattern Recognition, Las Vegas, USA, June 2016, pp. 885–892
[31] Van Gestel, T., Suykens, J.A.K., De Moor, B., et al.: 'Automatic relevance determination for least squares support vector machine regression'. Int. Joint Conf. Neural Networks, Washington DC, USA, July 2001, pp. 2416–2421
[32] An, S., Liu, W., Venkatesh, S.: 'Face recognition using kernel ridge regression'. Proc. 2007 IEEE Conf. Computer Vision and Pattern Recognition, Minneapolis, USA, June 2007, pp. 1–7
[33] Liaw, A., Wiener, M.: 'Classification and regression by random forest', R News, 2002, 2, (3), pp. 18–22
[34] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: 'Privacy preserving crowd monitoring: counting people without people models or tracking'. Proc. Computer Vision and Pattern Recognition, Anchorage, USA, June 2008, pp. 1–7
[35] Lempitsky, V., Zisserman, A.: 'Learning to count objects in images'. Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, December 2010, pp. 1324–1332
[36] Arteta, C., Lempitsky, V., Noble, J.A., et al.: 'Interactive object counting'. Proc. Eur. Conf. Computer Vision, Zurich, Switzerland, September 2014, pp. 504–518