2017 Synthetic Data Generation For Deep Learning in Counting Pedestrians
Keywords: Synthetic Data Generation, Deep Convolutional Neural Network, Deep Learning, Computer Vision.
Abstract: One of the main limitations of Deep Learning (DL) algorithms arises when dealing with problems for which only a small amount of data is available. One workaround is the use of synthetic data generators. In this framework, we explore the benefits of synthetic data generation as a surrogate for the lack of large data sets when applying DL algorithms. In this paper, we pose the problem of learning to count the number of pedestrians using synthetic images as a substitute for real images. To this end, we introduce an algorithm that creates synthetic images to be fed to a purpose-designed Deep Convolutional Neural Network (DCNN) to learn from. The model is capable of accurately counting the number of individuals in a real scene.
Ekbatani, H., Pujol, O. and Segui, S.
Synthetic Data Generation for Deep Learning in Counting Pedestrians.
DOI: 10.5220/0006119203180323
In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2017), pages 318-323
ISBN: 978-989-758-222-6
Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
Figure 1: A schematic of our proposal. In this paper, we show that by creating realistic synthetic images, we are able to train
a DCNN that is able to count the number of pedestrians in similar but real images.
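The idea summarized in Figure 1 relies on composited images whose pedestrian count is known by construction. As a minimal sketch of that idea (not the authors' implementation), the following Python code pastes cut-out pedestrians onto a masked background so that each person's center falls inside the region of interest, and returns the image together with its count label. The function name `synthesize`, the nested-list image representation, the binary `alpha` silhouettes and the unrestricted overlap between pasted people are assumptions made for illustration; normalization and resizing are omitted.

```python
import random


def synthesize(background, roi_mask, patches, count, rng=None):
    """Paste `count` pedestrian patches onto a masked background.

    background: 2-D list of gray values.
    roi_mask:   2-D list of 0/1 flags; each person's center must land on a 1.
    patches:    list of (pixels, alpha) pairs of equally sized 2-D lists,
                where alpha is 1 on the person's silhouette and 0 elsewhere.
    Returns (image, count); the count serves as the training label.
    """
    rng = rng or random.Random()
    height, width = len(background), len(background[0])
    # Mask the background with the ROI (pixels outside the ROI become 0).
    image = [[v if m else 0 for v, m in zip(b_row, m_row)]
             for b_row, m_row in zip(background, roi_mask)]
    # Candidate centers: every pixel inside the ROI.
    centers = [(r, c) for r in range(height) for c in range(width)
               if roi_mask[r][c]]
    if count > 0 and not centers:
        raise ValueError("the ROI mask contains no valid center position")
    for _ in range(count):
        pixels, alpha = rng.choice(patches)
        p_height, p_width = len(pixels), len(pixels[0])
        center_r, center_c = rng.choice(centers)
        top, left = center_r - p_height // 2, center_c - p_width // 2
        for i in range(p_height):
            for j in range(p_width):
                r, c = top + i, left + j
                # Copy only silhouette pixels that fall inside the frame.
                if alpha[i][j] and 0 <= r < height and 0 <= c < width:
                    image[r][c] = pixels[i][j]
    return image, count


# Toy example with hypothetical values: an 8 x 6 background, a central ROI,
# and a single 2 x 2 "pedestrian" patch.
rng = random.Random(0)
background = [[100] * 8 for _ in range(6)]
roi_mask = [[1 if 2 <= c <= 5 else 0 for c in range(8)] for _ in range(6)]
patch = ([[10, 10], [10, 10]], [[1, 1], [1, 1]])
image, label = synthesize(background, roi_mask, [patch], count=3, rng=rng)
```

Because the number of pasted people is chosen by the generator, every synthetic image comes with an exact label at no annotation cost, which is the property the proposal exploits.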
as a surrogate for replacing small training sets when applying deep architectures.

2 BACKGROUND AND RELATED WORKS

2.1 Synthetic Data Generation

The main purpose of generating synthetic datasets has been to protect the privacy and confidentiality of the actual data (Phua et al., 2010), (Yao et al., 2013), since synthetic data does not hold any personal information and cannot be traced back to any individual. Problems such as fraud detection (Phua et al., 2010) or health care (Yao et al., 2013) are normally tackled by the use of synthetic data. However, most of the previously mentioned approaches to synthetic data generation are not applicable to synthetic image generation, because standard methods such as Probability Density Function (PDF) or interpolation operate element-wise. The need for generating and synthesizing images using object-wise operations led researchers to use image processing tools for creating synthetic images to tackle vision problems.

In computer vision, the use of synthetic images has a long history. In 2000, Cappelli et al. (Cappelli et al., 2000) presented an approach to synthetic fingerprint generation based on mathematical models that describe the main features of real fingerprints. More recently, after the success of deep convolutional neural networks in various vision tasks concerning object detection or classification, the generation and use of synthetic datasets has been frequently considered. For example, in (Eggert et al., 2015), synthetic images are generated and fed to a DCNN in order to learn how to detect company logos in the absence of a large training set.

Moreover, as one of the most recent approaches, Seguí et al. (Seguí et al., 2015) proposed synthetic data generation to counter the lack of data when learning to count the number of objects in images using deep convolutional neural networks. In their work, they took advantage of existing unlabeled and labeled datasets to generate synthetic images representative of the actual images. The authors introduce two counting problems: counting the number of even digits in images, and counting the number of pedestrians in a walkway.

2.2 Crowd Counting

Learning to count the objects of interest in an image can be approached from two different perspectives: either training an object detector or training an object counter. In the field of object detection, numerous works have been proposed (Kong et al., 2005), (Marana et al., 1998). Furthermore, Wu and Nevatia (Wu and Nevatia, 2005) proposed edgelet features (an edgelet is a short segment of a line or curve) as a new type of silhouette-oriented feature to deal with the problem of detecting individuals in crowded still images.

As a similar line of work in object counting, and more specifically crowd counting, (Leibe et al., 2007) and (Rabaud and Belongie, 2006) took different object tracking approaches to detect and count moving objects in the scene. However, most object tracking approaches were met with skepticism by society, given the perception that they infringe individuals' privacy rights.

More recently, Chan et al. (Chan et al., 2008) presented a novel approach with no explicit object segmentation or tracking to estimate the number of people moving in each direction (towards and away from the camera) in a privacy-preserving manner.

On the other hand, in the case of feature learning, Seguí et al. (Seguí et al., 2015) proposed a novel approach for counting object representations using deep object features. In their work, objects' features
are learned by a counting DCNN and are used to understand the underlying representation. Contrary to previous approaches, their proposal is the first in which the counting problem is handled by learning deep features. Additionally, no hint about the object of interest was given besides its occurrence multiplicity.

3 SYNTHETIC IMAGE GENERATION

The main hypothesis of this work is that synthetic data generation algorithms can be used as a workaround for problems with little or no training data. To this end, we propose an algorithm for creating highly realistic synthetic images of pedestrians in a walkway. We used the unlabeled UCSD Anomaly Detection dataset of pedestrians collected by Chan et al. and used in (Mahadevan et al., 2010) and (Chan et al., 2009). The UCSD Anomaly Detection dataset contains clips of groups of people walking towards and away from the camera, and consists of 34 training video samples and 36 testing video samples. Each video has 200 frames of 238 × 158 pixels each.

3.1 Image Generation

In our dataset, we employed all 70 training and testing video samples to generate the synthetic pedestrian dataset. We constrained each image to contain up to 29 pedestrians in the walkway. The generation process consists of the following steps, which are illustrated in Figure 2.

1. Background Extraction. First, we subtract the background from each video frame and, from there, extract the median background of each video (70 different backgrounds in total).
2. Pedestrian Extraction. Subtracting each image from the mean background, we are able to label the connected regions (each individual, in the case of our images) using morphological labeling methods.
3. Background Generation. In this step, we try to make the backgrounds of the images as realistic as possible by:
   • making a sparse combination of median backgrounds.
   • changing the global illumination of the images randomly.
   • adding some random Gaussian noise to the backgrounds.
4. Region Of Interest (ROI). Then, for training and comparison purposes, images are masked with a Region Of Interest (ROI) filter.
5. Creating Synthetic Images. Afterwards, pedestrians are added to the masked background in such a way that the center of each person is placed inside the white area of the mask. Finally, images are normalized (between 0 and 255) and resized to 158 × 158 in order to be fed to the convolutional layers.

3.2 Image Improvement

Although we managed to create synthetic images of people in the street, the generated images were still quite distinguishable from the real dataset. Thus, in order to make the images as realistic as possible, we improved the dataset as explained below. Figure 3 depicts this procedure.

1. Remove Non-pedestrians. Among the extracted regions, some contained objects rather than pedestrians, and others contained more than one person. We therefore removed these outliers manually. After this editing, we were left with 426 samples of people.
2. Lack of Pedestrians. For the sake of generalization, we needed a decent variety of pedestrians in the images to train with. For this purpose, we created 2 versions of the current list of pedestrians, each darkened by a factor of 20% from each other.
3. Halos Around the Pedestrians. Due to the limited accuracy of the region measuring method, a fine layer of the background from which the pedestrians were extracted still remained around the pedestrians. In the created images, depending on where the person was placed, these thin layers appeared as a halo around the person. We used morphological erosion on the pedestrians' masks and also Poisson image editing to remove the halos.
4. Image Perspective. Finally, since pedestrians of different sizes were placed randomly in the images, we took the perspective of people's height in the images into account. Human height approximately follows a Gaussian distribution (Subramanian et al., 2011). Therefore, following (Subramanian et al., 2011), we mapped individuals' heights to the length of the walkway in the image, adding Gaussian noise with mean µ = 0 and σ = 3.5.

4 EXPERIMENTS AND RESULTS

To learn to count the number of pedestrians in a walkway, we synthetically generated a set of 1 million images of size 158 × 158 with up to 29 pedestrians in each image. Maximum overlapping was considered in the creation of the images. We divided this dataset into a training set of 800k images and a validation set of 200k images. To test our model, we used the UCSD crowd counting dataset with 3375 manually labeled images of pedestrians. The selected UCSD images contain from 11 to 29 pedestrians each.

We designed a seven-layer DCNN architecture with four convolutional layers and three fully connected layers. The architecture is shown in Table 1.

Table 1: Proposed DCNN for counting pedestrians.

Convolutions                   Fully connected
10 × 15 × 15 & ×2 pooling      128
10 × 11 × 11 & ×2 pooling      64
20 × 9 × 9                     1
20 × 5 × 5

The algorithm is trained using the Caffe package[11] on an NVIDIA Tesla K40 GPU. Training of the network was set to 400,000 iterations. The output layer is configured as a classification problem.

On the validation set, the model achieves a mean absolute error of 0.70 and a mean squared error of 0.94. These results improve on those achieved in a similar experiment by (Seguí et al., 2015) (the comparison is shown in Table 2). On the other hand, on the real test set, we obtained a mean absolute error of 1.38 and a mean squared error of 3.61, which closely follow the results in (Chan et al., 2008), which were obtained by hand-crafting highly specialized image features that depend on the object class. This comparison is shown in Table 3. The confusion matrix of the model's performance is illustrated in Figure 4. As can be seen, due to the inevitable differences between real and synthetic samples, the model mostly over-predicts. Moreover, as the number of pedestrians in the images increases, the prediction accuracy of the model decreases.

Table 2: Performance comparison on the synthetic data between our proposal and related work in (Seguí et al., 2015).

Experiments                        MSE      MAE
Our approach (29 peds)             0.942    0.707
(Seguí et al., 2015) (25 peds)     1.12     0.74

Table 3: Performance comparison on the real data between our proposal and related work in (Chan et al., 2008).

Experiments                        MSE      MAE
Proposed method                    3.61     1.38
(Chan et al., 2008) approach       2.73     1.24

As can be observed in Table 2, in the case of synthetic images, although our images contain more pedestrians, our results beat the previous approach in (Seguí et al., 2015). This indicates the improvement made by our synthetic data generation process and the designed deep architecture.

In the case of real images, although we could not improve on the work in (Chan et al., 2008), our results follow theirs closely.
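The error measures reported in Tables 2 and 3 can be computed from per-image counts as in the following minimal Python sketch. The function name `count_errors` and the example counts are hypothetical and are not taken from the experiments; the signed mean error is an addition included only to show how systematic over-prediction would appear as a positive value.

```python
def count_errors(true_counts, predicted_counts):
    """Return (MAE, MSE, mean signed error) for predicted pedestrian counts."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("both sequences must have the same length")
    if not true_counts:
        raise ValueError("at least one image is required")
    # Positive difference means the model predicted too many pedestrians.
    diffs = [p - t for t, p in zip(true_counts, predicted_counts)]
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n   # mean absolute error
    mse = sum(d * d for d in diffs) / n    # mean squared error
    bias = sum(diffs) / n                  # mean signed error
    return mae, mse, bias


# Hypothetical ground-truth and predicted counts for five test images.
true_counts = [11, 15, 20, 24, 29]
predicted_counts = [12, 15, 22, 23, 31]
mae, mse, bias = count_errors(true_counts, predicted_counts)
```

For these made-up counts the differences are 1, 0, 2, -1 and 2, so the MAE is 1.2, the MSE is 2.0 and the mean signed error is 0.8.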
Figure 4: Confusion matrix of the model's performance on the real test set. The graph starts at 11 because the minimum number of pedestrians in the real test set is 11.

Note that the experiment of Chan et al. in (Chan et al., 2008) relied on hand-crafted, highly specialized features and exhaustive labeling. These results support the suitability of synthetic data as a surrogate for scarce real data when using a DCNN.

5 CONCLUSIONS

In this paper we explore the benefits of synthetic data generation for the application of deep convolutional neural networks to a crowd counting problem with a small training set. We propose an algorithm for creating a highly realistic synthetic dataset of pedestrians in a walkway with which to train the proposed DCNN. Moreover, we provide a system trained with synthetic images that is capable of predicting the number of pedestrians in an image to a satisfactory extent. The obtained results suggest that synthetic data can serve as a well-suited surrogate for missing real data while alleviating the need for exhaustive labeling.

There are still many open questions to be addressed, such as: when and to what extent are synthetic images applicable as a substitute for solving real-world problems? Which is the best network architecture for counting the crowd?

ACKNOWLEDGEMENTS

This work has been partially funded by the Spanish MINECO Grants TIN2013-43478-P and TIN2012-38187-C03. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research.

REFERENCES

Cappelli, R., Erol, A., Maio, D., and Maltoni, D. (2000). Synthetic fingerprint-image generation. In Pattern Recognition, 2000. Proceedings. 15th International Conference on. IEEE.
Chan, A. B., Liang, Z.-S. J., and Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE.
Chan, A. B., Morrow, M., and Vasconcelos, N. (2009). Analysis of crowded scenes using holistic properties. In Performance Evaluation of Tracking and Surveillance workshop at CVPR.
Ciregan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE.
Eggert, C., Winschel, A., and Lienhart, R. (2015). On the benefit of synthetic data for company logo detection. In Proceedings of the 23rd ACM international conference on Multimedia. ACM.
Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256 object category dataset. California Institute of Technology.
Kong, D., Gray, D., and Tao, H. (2005). Counting pedestrians in crowds using viewpoint invariant training. In BMVC. Citeseer.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems.
LeCun, Y. and Bengio, Y. (2005). Convolutional networks for images, speech, and time series. In BMVC. Citeseer.
Leibe, B., Schindler, K., and Van Gool, L. (2007). Coupled detection and trajectory estimation for multi-object tracking. In 2007 IEEE 11th International Conference on Computer Vision. IEEE.
Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N. (2010). Anomaly detection in crowded scenes. In CVPR.
Marana, A., Costa, L. d. F., Lotufo, R., and Velastin, S. (1998). On the efficacy of texture analysis for crowd monitoring. In Computer Graphics, Image Processing, and Vision, 1998. Proceedings. SIBGRAPI'98. International Symposium on. IEEE.
Phua, C., Lee, V., Smith, K., and Gayler, R. (2010). A comprehensive survey of data mining-based fraud detection research. In arXiv preprint arXiv:1009.6119.
Rabaud, V. and Belongie, S. (2006). Counting crowded moving objects. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE.
Seguí, S., Pujol, O., and Vitrià, J. (2015). Learning to count with deep object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.