Deep Spatial Regression Model For Image Crowd Counting
Abstract
Computer vision techniques have been used to produce accurate and generic crowd count
estimators in recent years. Crowd counting remains a very challenging task because of severe
occlusions, appearance variations, perspective distortions and varying illumination conditions.
To address these challenges, we propose a deep spatial regression model (DSRM) for counting the
number of individuals present in a still image with arbitrary perspective and arbitrary
resolution. Our proposed model is based on a Convolutional Neural Network (CNN) and long short
term memory (LSTM). First, we feed the images into a pretrained CNN to extract a set of
high-level features. Then the features in adjacent regions are used to regress the local counts
with an LSTM structure which takes the spatial information into consideration. The final global
count is obtained by summing the local counts of the patches. We apply our framework to several
challenging crowd counting datasets, and the experimental results show that our method
outperforms state-of-the-art methods on the crowd counting and density estimation problem in
terms of reliability and effectiveness.
Keywords: crowd counting, convolutional neural network, long short term memory (LSTM),
spatial regression.
1. Introduction
In recent years, with increasing urbanization, more and more people
choose to live in cities. This trend enriches cultural life and makes
full use of convenient urban infrastructure. At the same time, large numbers of people gather
to take part in various activities, such as the Olympic Games, religious rallies, festival
celebrations, strikes, marathons and concerts. When tens of thousands of people gather
in a limited space, a tragedy is likely to happen. At the Shanghai Bund on New Year’s
Eve of 2015, 36 people were killed and 49 were injured in a massive stampede. To
avoid such deadly accidents, research on automatic detection, counting and
density estimation in large-scale crowds plays a significant role in city security and city
management.
In computer vision, many studies have focused on building models that can
accurately estimate the number of pedestrians in images and videos. These models can be
extended to other domains, such as vehicle counting at traffic junctions or on
highways [3], animal counting in wildlife migration, and quantification of specific
populations of cells for precision diagnostics in laboratory medicine [1].
The challenges in crowd counting and density estimation are severe occlusions,
appearance variations, perspective distortions and illumination conditions, which affect the
performance of a model to different degrees. In particular, the density and distribution of
crowds vary significantly across images. This can be observed in the
datasets available to us. Figure 1 shows some examples from the datasets used in our experiments.
To tackle these challenges, we propose a new framework for counting the number of individuals
present in a still image with arbitrary perspective and arbitrary resolution. First, we feed the
images into a pre-trained CNN to extract a set of high-level features. Then the features in
adjacent regions are used to regress the local counts with a long short term memory (LSTM) [8]
structure which takes the spatial information into consideration. The final global count is
obtained by summing the local counts of the patches. Our approach achieves state-of-the-art
results on all of these challenging datasets, which demonstrates its effectiveness and reliability.
The main contributions of our study are as follows. We propose a deep spatial regression
model to estimate the number of people in images. Due to the variability of camera viewpoint
and density, a strong correlation exists between local counts in the transverse direction. The
overlapping-region strategy makes this correlation stronger. A novel deep feature matrix is
constructed which contains the spatial information. Our deep spatial regression model can
effectively learn the spatial constraint relation of local counts in adjacent regions and
significantly improve the accuracy.
The rest of this paper is organized as follows. In Section 2, we briefly review the related
work of crowd counting and density estimation. Then a novel DSRM estimation model is
proposed in Section 3. Experimental results for our proposed framework on datasets of different
density distribution are presented in Section 4. Finally, conclusions are drawn in Section 5.
2. Related work
Crowd counting has been studied extensively in the literature. Many early
methods adopted a counting-by-detection strategy. Li et al. [9] used a count estimation method
which combines foreground segmentation and head-shoulder detection. They first detect active
areas, then detect heads and count them within the foreground areas. Cheriyadat et al. [10]
proposed an object detection system based on coherent motion region detection for
counting and locating objects in the presence of high object density and inter-object occlusions.
Brostow and Cipolla [11] utilized an unsupervised data-driven Bayesian clustering algorithm to
detect individual entities. These counting-by-detection methods attempt to determine the number of
people by detecting individual entities and their locations simultaneously. However, the
performance of the detectors degrades dramatically when dealing with dense crowds and severe
occlusion. Most of these works experiment on datasets containing sparse crowd scenes, such as
the UCSD dataset [12], the Mall dataset [13] and the PETS dataset [14].
Loy et al. [15] proposed semi-supervised regression and data transfer approaches
to reduce the amount of training data required. Idrees et al. [20] presented a method
which combines different kinds of hand-crafted features, i.e. HOG-based head detections,
Fourier analysis, and interest-point-based counting. After estimating density and counts in
each patch from the combined features, they place them in a multi-scale Markov Random Field to
smooth the counts among nearby patches. Although this method improves
the accuracy, it still depends on traditional hand-engineered representations, e.g. SIFT [4],
HOG [5] and LBP [6].
In recent years, deep learning has attracted considerable attention. Some studies [7, 16] have
shown that features extracted from deep models are more effective than hand-crafted
features for many applications. For example, deep learning methods have remarkably
improved the state of the art in visual object recognition, speech recognition, object detection
and many other domains [17]. To adapt to changes in crowd density and perspective,
Zhang et al. [22] introduced a multi-column CNN (MCNN) model to estimate the density map of a still
image. Each column has filters with receptive fields of different sizes. They pre-train each
single column separately and then fine-tune the multi-column CNN. Zhang et al. [21]
proposed a CNN model with an iterative switchable training scheme with two objectives: estimating
the global count and the density map. They first pre-train their CNN model on the whole training
set. Then they retrieve the samples with a distribution similar to that of the test scene and add
them to the training data to fine-tune the CNN model. Perspective maps of frames are used in this
process, which can significantly improve the performance. Unfortunately, generating
perspective maps for both training scenes and test scenes is computationally complex and
time-consuming, which limits the applicability of this method. Generally speaking, these neural
networks contain fewer than seven layers.
Currently, many deep neural networks produce impressive results on classification, object
detection, localization and segmentation tasks. Several attempts have been made to apply these
deep models to crowd counting and density estimation. Boominathan et al. [24] used a
combination of deep (VGG-16 [31]) and shallow fully convolutional networks to predict the
density map of a dense crowd image. They evaluated their approach on only one dataset, and the
experimental result is not competitive. Shang et al. [23] introduced an end-to-end CNN
that directly maps the whole image to the counting result. A pre-trained GoogLeNet model [32]
is used to extract high-level deep features, LSTM decoders produce the local counts, and fully
connected layers produce the final count. The authors resize images to 640×480 pixels before
feeding them to the network, which introduces errors.
Our approach builds on the ResNet model [2], which is trained on the ImageNet dataset and
won first place in the ILSVRC 2015 classification task. The 152-layer ResNet is used to extract
deep features from patches cropped from the whole image with overlaps. Due to the 50% overlap,
the crowd counts of adjacent patches are highly correlated. We therefore use an LSTM structure
that considers the spatial information to regress the local counts. Finally, the total count of a
still image is the sum of the local counts.
3. Method
3.1 System overview
In this section, we give a general overview of the proposed method; details are provided
in the following sections. We propose a deep spatial regression model for crowd
counting and density estimation, which is shown in Figure 2.
First, we feed the patches cropped from the whole image to a pre-trained CNN,
ResNet [2]. At the time of its publication, the 152-layer residual net was the deepest network
presented on ImageNet, and it still has lower complexity than VGG nets. The purpose is to obtain
1000-dimensional high-level features. We crop 100×100 patches with 50% overlap from every image.
This data augmentation helps us to address the problem of the limited training set. Meanwhile,
because the crowd distribution is irregular and non-uniform at large scale, the distribution
within smaller patches is approximately uniform.
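The cropping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the 50% overlap is expressed as a stride of 50 pixels, and the handling of leftover border pixels (one extra patch flush with the border) is an assumption, since the text does not say how image borders are treated.

```python
def patch_origins(length, patch=100, stride=50):
    """Offsets along one axis for patches of size `patch` taken every `stride` pixels."""
    if length < patch:
        raise ValueError("image side is smaller than the patch size")
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: if pixels remain at the border, add one patch flush with it.
    # That last patch overlaps its neighbour by more than 50%.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def crop_boxes(height, width, patch=100, stride=50):
    """Return (top, left, bottom, right) boxes in row-major order."""
    return [
        (top, left, top + patch, left + patch)
        for top in patch_origins(height, patch, stride)
        for left in patch_origins(width, patch, stride)
    ]


# Example: a 768x1024 image, the resolution listed for Shanghaitech Part_B.
boxes = crop_boxes(768, 1024)
rows = len(patch_origins(768))
cols = len(patch_origins(1024))
print(rows, cols, len(boxes))
```

Keeping the boxes in row-major order preserves the left-to-right adjacency of patches, which is what a spatially aware regressor would need.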
Moreover, to obtain accurate local counts, a novel deep feature matrix which
contains the features extracted by the ResNet is learned by an LSTM neural network. In this
process, the spatial constraint relation of local counts in adjacent regions is taken into
account to improve the accuracy of the estimate.
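To illustrate how an LSTM can regress local counts from the features of adjacent patches, the following is a minimal sketch of a single-layer LSTM forward pass in plain Python. It is not the architecture used in this paper: the single layer, the left-to-right scan over one row of patches, the linear read-out, the toy dimensions and the random weights are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step.

    W has 4*hidden rows over concat(x, h); the rows are ordered as
    input gate, forget gate, output gate, candidate cell state.
    """
    hidden = len(h)
    z = x + h  # list concatenation of the input and the previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:hidden]]
    f = [sigmoid(p) for p in pre[hidden:2 * hidden]]
    o = [sigmoid(p) for p in pre[2 * hidden:3 * hidden]]
    g = [math.tanh(p) for p in pre[3 * hidden:4 * hidden]]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def regress_row_counts(features, W, b, w_out, b_out):
    """Scan the patch features of one row and emit one count estimate per patch."""
    hidden = len(w_out)
    h, c = [0.0] * hidden, [0.0] * hidden
    counts = []
    for x in features:
        h, c = lstm_step(x, h, c, W, b)
        # Assumed linear read-out from the hidden state to a scalar local count.
        counts.append(sum(w * v for w, v in zip(w_out, h)) + b_out)
    return counts


random.seed(0)
input_dim, hidden = 4, 3  # toy sizes; the features in the text are 1000-dimensional
W = [[random.uniform(-0.1, 0.1) for _ in range(input_dim + hidden)]
     for _ in range(4 * hidden)]
b = [0.0] * (4 * hidden)
w_out = [random.uniform(-0.1, 0.1) for _ in range(hidden)]
row_features = [[random.random() for _ in range(input_dim)] for _ in range(5)]
local_counts = regress_row_counts(row_features, W, b, w_out, 0.0)
print(len(local_counts))  # one value per patch in the row
```

Because the hidden state is carried from one patch to the next, the estimate for each patch depends on the patches scanned before it, which is how the correlation between adjacent regions can enter the regression.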
Figure 2. Overview of our proposed DSRM crowd counting method. In the dashed box, the
block diagram of the LSTM structure considering the spatial information to regress the counts
is shown.
Finally, we obtain the local count matrix consisting of the local counts of all patches. The
final count of the whole image is the sum of the local counts. Furthermore, we obtain a more
intuitive density map, which can be seen in Figure 2.
The Euclidean distance is used to measure the difference between the ground truth and the
predicted count. The loss function is defined as follows.
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(X_i; \theta) - z_i \right\|^2 \qquad (1)$$
where $N$ is the number of training image patches in the dataset and $\theta$ denotes the
parameters of the framework. $L(\theta)$ is the loss between the regressed count
$F(X_i; \theta)$ produced by the network and the ground truth $z_i$ of the image patch $X_i$
($i = 1, 2, \ldots, N$). The loss is minimized using mini-batch gradient descent and
back-propagation.
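As a small illustration of Eq. (1), the loss over a batch of patches can be computed as below. This is a sketch under the assumption that each prediction and each ground truth is a scalar count, so the squared Euclidean norm reduces to a squared difference; the function name and the sample values are hypothetical.

```python
def mean_squared_loss(predicted, truth):
    """Eq. (1): mean over patches of the squared difference between F(X_i) and z_i."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("predicted and truth must be non-empty and of equal length")
    return sum((p - z) ** 2 for p, z in zip(predicted, truth)) / len(truth)


# Two hypothetical patches: predicted counts 10 and 12, ground truth 8 and 15.
loss = mean_squared_loss([10.0, 12.0], [8.0, 15.0])
print(loss)  # (2**2 + 3**2) / 2 = 6.5
```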
4. Experiment
We evaluate our algorithm on four crowd counting datasets. Compared with most CNN-based
methods in the literature, the proposed DSRM model achieves excellent performance on
all the datasets. The implementation of the proposed model and its training are based on the
TensorFlow deep learning framework, using an NVIDIA Tesla K20 GPU.
Mean absolute error (MAE), mean squared error (MSE) and mean normalized absolute
error (MNAE) are utilized to evaluate and compare the performance of different methods.
These three metrics are defined as follows:
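$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \left| z_i - \hat{z}_i \right| \qquad (3)$$

$$\mathrm{MSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( z_i - \hat{z}_i \right)^2} \qquad (4)$$

$$\mathrm{MNAE} = \frac{1}{M} \sum_{i=1}^{M} \frac{\left| z_i - \hat{z}_i \right|}{z_i} \qquad (5)$$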
where $M$ is the number of images in the test set, $z_i$ is the ground truth count of people in
the $i$th image and $\hat{z}_i$ is the predicted count for the $i$th image. MAE indicates the
accuracy of the estimation; MNAE is related to MAE and represents the average deviation rate.
MSE indicates the robustness of the estimates. Lower MAE, MNAE and MSE values mean
more accurate and more robust estimates.
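The three metrics can be written directly from Eqs. (3) to (5). The sketch below uses hypothetical counts for three test images; note that MSE as defined in Eq. (4) includes the square root, and that MNAE is undefined for an image whose ground truth count is zero.

```python
import math


def mae(truth, predicted):
    """Eq. (3): mean absolute error."""
    return sum(abs(z - p) for z, p in zip(truth, predicted)) / len(truth)


def mse(truth, predicted):
    """Eq. (4): square root of the mean squared error."""
    return math.sqrt(sum((z - p) ** 2 for z, p in zip(truth, predicted)) / len(truth))


def mnae(truth, predicted):
    """Eq. (5): mean absolute error normalized by the ground truth count.

    Raises ZeroDivisionError if any ground truth count is zero.
    """
    return sum(abs(z - p) / z for z, p in zip(truth, predicted)) / len(truth)


# Hypothetical ground truth and predicted counts for three images.
truth = [100, 200, 400]
predicted = [90, 210, 420]
print(round(mae(truth, predicted), 3))   # (10 + 10 + 20) / 3
print(round(mse(truth, predicted), 3))   # sqrt((100 + 100 + 400) / 3)
print(round(mnae(truth, predicted), 4))  # (0.10 + 0.05 + 0.05) / 3
```

In this example the largest error (20) falls on the densest image, so it dominates MSE but contributes only 5% to MNAE, which shows why the three metrics are reported together.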
[Figure 3 panels. Part_A sample: GT 416, Count 415, Error 1. Part_B sample: GT 159, Count 159, Error 0.]
Figure 3. Our counting results and density maps on the Shanghaitech Part_A, Shanghaitech
Part_B datasets. (Left) sample selected from each test scene. (Middle) ground truth density map
on the sample. (Right) estimated density map on the sample. The ground truth, estimated count
and absolute error are shown at the right of the maps.
Figure 4. Comparison of the ground truth and the estimated count on the Shanghaitech Part_A
dataset (left) and the Shanghaitech Part_B dataset (right). The absolute count on the vertical
axis is the average crowd count of the images in each group.
Since the images in the UCF_CC_50 dataset are grayscale, we extend each image to three channels
by copying the single channel. We compare our proposed method with five existing approaches on
the UCF_CC_50 dataset in Table 2. Our proposed method achieves the best MAE and MSE. The
counting results and density maps selected from our experiment can be seen in Figure 5. As in
Figure 4, the comparison of the ground truth and the estimated count is illustrated in Figure 6.
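The channel replication described above can be sketched as follows. This is an illustration only: the nested-list image representation, the channels-last layout (height × width × 3) and the function name are assumptions, and an array library would normally be used in practice.

```python
def gray_to_three_channels(gray):
    """Copy each grayscale value into three identical channels (H x W -> H x W x 3)."""
    return [[[pixel, pixel, pixel] for pixel in row] for row in gray]


# A hypothetical 2x2 grayscale image.
gray = [[0, 255],
        [128, 64]]
rgb_like = gray_to_three_channels(gray)
print(rgb_like[0][1])  # [255, 255, 255]
```

The result has the input shape that a network pre-trained on color images expects, while carrying no more information than the original grayscale image.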
[Figure 5 panel. UCF_CC_50 sample: GT 562, Count 560, Error 2.]
Figure 5. Our counting results and density maps on the UCF_CC_50 dataset. (Left) sample
selected from each test scene. (Middle) ground truth density map on the sample. (Right)
estimated density map on the sample. The ground truth, estimated count and absolute error are
shown at the right of the maps.
Figure 6. Comparison of the ground truth and the estimated count on the UCF_CC_50
dataset. The absolute count on the vertical axis is the average crowd count of the images in
each group.
[Figure 7 panel. AHU-CROWD sample: GT 935, Count 933, Error 2.]
Figure 7. Our counting results and density maps on the AHU-CROWD dataset. (Left) sample
selected from each test scene. (Middle) ground truth density map on the sample. (Right)
estimated density map on the sample. The ground truth, estimated count and absolute error are
shown at the right of the maps.
Figure 8. Comparison of the ground truth and the estimated count on the AHU-CROWD
dataset. The absolute count on the vertical axis is the average crowd count of the images in
each group.
For a fair comparison, our experiment on the WorldExpo’10 dataset follows the protocol of [21].
We use half of the test set in each scene as the training set and test on the remaining frames.
Zhang et al. [21] compare the performance of different methods on five scenes. The mean
absolute errors are good in Scene 1, Scene 3, Scene 4 and Scene 5. However, for Scene 2, a
worse result is obtained because the large number of stationary people prevents the foreground
from being segmented accurately. Our algorithm overcomes this problem and obtains competitive
results, especially in the challenging Scene 2. Note that our method does not rely on foreground
segmentation and is tested on the whole image rather than just the region of interest (ROI). We
achieve the best average mean absolute error and comparable results in the five test scenes.
Details can be seen in Table 4. The density estimation and counting results on the five test
scenes are shown in Figure 9.
[Figure 9 panels. Scene 1: GT 46, Count 48, Error 2. Scene 2: GT 204, Count 207, Error 3. Scene 3: GT 82, Count 79, Error 3. Scene 4: GT 102, Count 103, Error 1. Scene 5: GT 84, Count 77, Error 7.]
Figure 9. Our counting results and density maps on the WorldExpo’10 counting dataset. (Left)
sample selected from each test scene. (Middle) ground truth density map on the sample. (Right)
estimated density map on the sample. The ground truth, estimated count and absolute error are
shown at the right of the maps.
Figure 10. Histograms of crowd counts of different datasets: (a) Shanghaitech Part_A dataset,
(b) UCF_CC_50 dataset, (c) AHU-CROWD dataset, (d) Shanghaitech Part_B dataset, (e) test set of
the WorldExpo’10 dataset.
Table 5. Statistics of the different datasets. N is the number of images in the dataset, Ntrain
is the number of training images, Ntest is the number of test images, and Npeople is the count of
people in the images: Max is the maximal count, Min is the minimal count and Average is the
average count in the dataset. “5-fold” means 5-fold cross validation.
Dataset               N      Ntrain/Ntest   Npeople Max   Npeople Min   Npeople Average   Resolution
UCF_CC_50             50     5-fold         4543          94            1279.5            different
Shanghaitech Part_A   482    300/182        3139          33            501.4             different
AHU-CROWD             107    5-fold         2201          58            420.6             different
Shanghaitech Part_B   716    400/316        578           9             123.6             768×1024
WorldExpo’10          3974   3374/600       253           1             50.2              576×720
Analyzing these widely used crowd counting datasets reveals differences in their count
distributions: Shanghaitech Part_A, AHU-CROWD and UCF_CC_50 have similar count distributions,
while Shanghaitech Part_B and WorldExpo’10 have a similar distribution of pedestrian numbers.
The total numbers of images in the datasets also differ. We assume that using more training data
will result in a better model, so we split these five datasets into source domains and target
domains. The dataset with the largest number of images is chosen as the source domain. The model
is first trained on the source dataset, and the learned model is then transferred to a target
dataset which has a similar level of density.
To test this idea, we first train our model on Shanghaitech Part_A
and then fine-tune it on AHU-CROWD. The experimental results, shown in Table 6, support this
hypothesis. It can be seen that the MAE, MSE and MNAE are
reduced to different degrees on AHU-CROWD. This improvement in accuracy is also
observed on the challenging UCF_CC_50 dataset, as shown in Table 7.
Table 6. Transfer learning on AHU-CROWD
Method                   MAE   MNAE    MSE
DSRM                     81    0.199   129
DSRM trained on Part_A   77    0.179   128.03
Further experiments are conducted on the WorldExpo’10 crowd counting dataset. By fine-tuning
the three layers of the LSTM structure with training data from the WorldExpo’10 crowd counting
dataset, the accuracy can be greatly improved. The reason is that the knowledge of both the
source and target data can be combined to improve the performance. First, we fine-tune the
model trained on Shanghaitech Part_A; the results are listed in Table 8 and Table 9. The two
metrics MAE and MSE are used to evaluate the performance, respectively. Second, we fine-tune
the model trained on Shanghaitech Part_B; the results are listed in the same tables.
It is clear that the second model is better than the first in both MAE and MSE. The
reason is that the data distribution of Shanghaitech Part_B is similar to that of the target
domain WorldExpo’10. These results support our hypothesis, and better accuracy is obtained.
Table 8. MAE on the WorldExpo’10 crowd counting dataset. ModelA denotes the model trained
on Shanghaitech Part_A, and ModelB denotes the model trained on Shanghaitech Part_B.
Method                 Scene 1   Scene 2   Scene 3   Scene 4   Scene 5   Average
DSRM                   3.0       10.3      9.8       14.4      4.8       8.4
Fine-tune the modelA   4.2       9.8       11.1      8.8       3.4       7.46
Fine-tune the modelB   3.5       10.9      9.0       8.3       4.1       7.16
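To make the gain from fine-tuning concrete, the relative reduction in average MAE can be computed from the Average column of Table 8. The sketch below uses only the reported averages; the helper name is hypothetical.

```python
# Average MAE values as reported in Table 8.
reported_average_mae = {
    "DSRM": 8.4,
    "Fine-tune the modelA": 7.46,
    "Fine-tune the modelB": 7.16,
}


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


baseline = reported_average_mae["DSRM"]
for name in ("Fine-tune the modelA", "Fine-tune the modelB"):
    gain = relative_reduction(baseline, reported_average_mae[name])
    print(f"{name}: {gain:.1%} lower average MAE than DSRM")
```

With the reported values, fine-tuning from Part_A lowers the average MAE by roughly 11% and fine-tuning from Part_B by roughly 15%, consistent with the observation that the source dataset closer in density to WorldExpo’10 transfers better.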
5. Conclusion
We have presented a deep spatial regression model (DSRM) to estimate the number of people in
still images. Our general model is based on a Convolutional Neural Network (CNN) and long short
term memory (LSTM) and takes spatial information into consideration for crowd counting. With
the overlapping-patch strategy, adjacent local counts are highly correlated. We
feed the image patches into a pre-trained convolutional neural network to extract high-level
features. The features in adjacent regions are then used to regress the local counts with an
LSTM structure that considers the spatial information. The final global count of a single image
is obtained by summing the local counts of the patches. We evaluate our approach on several
challenging crowd counting datasets, and the experimental results show that our deep spatial
regression model outperforms state-of-the-art methods in terms of reliability and effectiveness.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (61373084) and the
Shanghai Science and Technology Committee Research Plan Project (17511106802).
References
[1] Yao Xue et al. Cell Counting by Regression Using Convolutional Neural Network.
Springer International Publishing, 2016.
[2] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: (2015), pp. 770–
778.
[3] Yunsheng Zhang, Chihang Zhao, and Qiuge Zhang. “Counting vehicles in urban traffic
scenes using foreground time-spatial images”. In: Iet Intelligent Transport Systems 11.2
(2017), pp. 61–67.
[4] D. G. Lowe. “Object recognition from local scale-invariant features”. In: The Proceedings
of the Seventh IEEE International Conference on Computer Vision. 2002, p. 1150.
[5] Navneet Dalal and Bill Triggs. “Histograms of Oriented Gradients for Human Detection”.
In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on. 2005, pp. 886–893.
[6] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. “Gray Scale and Rotation Invariant
Texture Classification with Local Binary Patterns”. In: IEEE Transactions on Pattern
Analysis & Machine Intelligence 1842.7 (2000), pp. 404–420.
[7] Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. “Describing people: A poselet-based
approach to attribute classification”. In: IEEE International Conference on Computer
Vision. 2012, pp. 1543–1550.
[8] K Greff et al. “LSTM: A Search Space Odyssey”. In: IEEE Transactions on Neural
Networks & Learning Systems PP.99 (2016), pp. 1–11.
[9] Min Li et al. “Estimating the number of people in crowded scenes by MID based
foreground segmentation and head-shoulder detection”. In: International Conference on
Pattern Recognition. 2012, pp. 1–4.
[10] Anil M. Cheriyadat, Budhendra L. Bhaduri, and Richard J. Radke. “Detecting multiple
moving objects in crowded environments with coherent motion regions”. In: Computer
Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society
Conference on. 2010, pp. 1–8.
[11] G. J Brostow and R Cipolla. “Unsupervised Bayesian Detection of Independent Motion in
Crowds”. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on. 2006, pp. 594–601.
[12] Antoni B. Chan, Zhang Sheng John Liang, and Nuno Vasconcelos. “Privacy preserving
crowd monitoring: Counting people without people models or tracking”. In: Computer
Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. 2008, pp. 1–7.
[13] Kang Han et al. “Image Crowd Counting Using Convolutional Neural Network and
Markov Random Field”. In: (2017).
[14] A. Ellis and J. Ferryman. “PETS2010 and PETS2009 Evaluation of Results Using
Individual Ground Truthed Single Views”. In: IEEE International Conference on
Advanced Video and Signal Based Surveillance. 2010, pp. 135–142.
[15] Change Loy Chen, Shaogang Gong, and Tao Xiang. “From Semi-supervised to Transfer
Counting of Crowds”. In: IEEE International Conference on Computer Vision. 2014, pp.
2256–2263.
[16] Pierre Sermanet et al. “OverFeat: Integrated Recognition, Localization and Detection
using Convolutional Networks”. In: Eprint Arxiv (2013).
[17] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553
(2015), pp. 436–444.
[18] Mikel Rodriguez et al. “Density-aware person detection and tracking in crowds”. In:
International Conference on Computer Vision. 2011, pp. 2423–2430.
[19] Victor Lempitsky and Andrew Zisserman. “Learning To count objects in images”. In:
International Conference on Neural Information Processing Systems. 2010, pp. 1324–1332.
[20] Haroon Idrees et al. “Multi-source Multi-scale Counting in Extremely Dense Crowd
Images”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp.
2547–2554.
[21] Cong Zhang et al. “Cross-scene crowd counting via deep convolutional neural networks”.
In: IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 833–841.
[22] Yingying Zhang et al. “Single-Image Crowd Counting via Multi-Column Convolutional
Neural Network”. In: Computer Vision and Pattern Recognition. 2016, pp. 589–597.
[23] Chong Shang, Haizhou Ai, and Bo Bai. “End-to-end crowd counting via joint learning
local and global count”. In: IEEE International Conference on Image Processing. 2016, pp.
1215–1219.
[24] Lokesh Boominathan, Srinivas S S Kruthiventi, and R. Venkatesh Babu. “CrowdNet: A
Deep Convolutional Network for Dense Crowd Counting”. In: (2016), pp. 640–644.
[25] L. Fiaschi et al. “Learning to count with regression forest and structured labels”. In: (2012),
pp. 2685–2688.
[26] Yaocong Hu et al. “Dense crowd counting from still images with convolutional neural
networks”. In: Journal of Visual Communication & Image Representation 38.C (2016), pp.
530–539.
[27] Michael Oren et al. “Pedestrian Detection Using Wavelet Templates”. In: Conference on
Computer Vision and Pattern Recognition. 1997, p. 193.
[28] Pedro Felzenszwalb, David Mcallester, and Deva Ramanan. “A discriminatively trained,
multiscale, deformable part model”. In: IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. 2008, pp. 1–8.
[29] Gabriella Csurka et al. “Visual categorization with bags of keypoints”. In: Workshop on
Statistical Learning in Computer Vision Eccv 44.247 (2004), pp. 1–22.
[30] Ke Chen et al. “Feature Mining for Localised Crowd Counting”. In: British Machine
Vision Conference. 2013.
[31] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale
image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[32] Christian Szegedy et al. “Going deeper with convolutions”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2015.
[33] Diederik Kingma and J. Ba. “Adam: A Method for Stochastic Optimization”. In: Computer
Science (2014).