Security - Tracking People via CCTV
ABSTRACT
The capability to track individuals across CCTV cameras is important for surveillance applications in large areas such as
train stations, airports and shopping centers. However, it is laborious to track and trace people over multiple cameras. In
this paper, we present a system for real-time tracking and fast interactive retrieval of persons in video streams from
multiple static surveillance cameras. This system is demonstrated in a shopping mall, where the cameras are positioned
without overlapping fields-of-view and have different lighting conditions. The results show that the system allows an
operator to find the origin or destination of a person more efficiently. The misses are reduced with 37%, which is a
significant improvement.
Keywords: Surveillance, tracking, CCTV, person re-identification, multi-sensor fusion.
1. INTRODUCTION
The capability to track individuals across CCTV cameras is important for surveillance applications at locations such as train stations,
airports and shopping centers. For the camera operators, however, it is laborious to track and trace people over multiple
cameras. In this paper, we present a semi-autonomous system for real-time tracking and fast interactive retrieval of
persons in video streams from multiple surveillance cameras. This system is demonstrated in a shopping mall. We
describe our system, which consists of tracklet generation, re-identification and a graphical man-machine interface. The
system is tested in a shopping mall with eight static cameras, six of which were selected for an operator-efficiency
experiment. These cameras have non-overlapping (or hardly overlapping) fields of view and different lighting
conditions. All video streams are processed in parallel on a distributed system and tracks and detections are continuously
stored in a database. The operator can use these tracks and detections to quickly answer questions such as “where did a
particular person come from?” or “where did he go to?”. The interface enables live tracking in current streams and
interactive searches in historic data. The results show that the system allows an operator to find the origin or destination
of a person more efficiently, with fewer misses.
The outline of this paper is as follows. Section 2 describes our system, Section 3 describes the experiments and results
and finally Section 4 summarizes the conclusions.
2. METHOD
The system overview is shown in Figure 1. The main components are tracklet generation, a re-identification engine
and a graphical man-machine interface. Tracklet generation continuously processes the incoming video
streams to detect persons and track them within a single camera. The resulting tracklets are stored in a tracklet database.
This database allows our system to quickly retrieve similar candidates after human interaction without
computationally intensive video processing. In order to track a person in a large environment over multiple non-overlapping cameras, the
separate tracklets of a certain person from different cameras need to be combined. The re-identification engine compares
* [email protected]; phone +31 888 66 4054; http://www.tno.nl
Henri Bouma, Jan Baan, Sander Landsmeer, Chris Kruszynski, Gert Van Antwerpen, Judith Dijk, “Real-time tracking and fast retrieval of persons in
multiple surveillance cameras of a shopping mall”, Proc. SPIE, Vol. 8756, (2013).
Copyright 2013 Society of Photo-Optical Instrumentation Engineers (SPIE). One print or electronic copy may be made for personal use only.
Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes,
or modification of the content of the paper are prohibited.
the query with tracklets in the database and presents the most likely candidates. This engine consists of two components:
appearance-based matching and space-time localization. The combination of both is used to present the best matching
candidate. The human-machine interface allows the operator to interact with the system, by selecting queries and
candidates. Each component is described in more detail in the following subsections.
Figure 1: The system consists of tracklet generation, a re-identification engine and a graphical user interface.
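As an illustration of how the two re-identification components might be fused, the following sketch ranks candidate tracklets by the product of an appearance score and a space-time score. The histogram-intersection measure and the Gaussian transit-time model are illustrative assumptions, not the exact method of this paper:

```python
import math

def appearance_score(hist_a, hist_b):
    """Histogram intersection between two normalized appearance histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))

def spacetime_score(travel_time, expected_time, sigma):
    """Gaussian likelihood that the observed travel time between two
    cameras matches the expected transit time."""
    return math.exp(-((travel_time - expected_time) ** 2) / (2 * sigma ** 2))

def rank_candidates(query_hist, candidates, expected_time, sigma):
    """Rank candidate tracklets by the product of appearance and
    space-time scores, best match first."""
    scored = []
    for cand in candidates:
        s = (appearance_score(query_hist, cand["hist"])
             * spacetime_score(cand["travel_time"], expected_time, sigma))
        scored.append((s, cand["id"]))
    scored.sort(reverse=True)
    return [cid for _, cid in scored]
```

A candidate with a less similar appearance, or an implausible transit time between two cameras, drops down the ranked list that is presented to the operator.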
Our video-processing framework consists of robust re-usable video-processing components that can handle different
types of inputs (e.g. standard IP cameras), flexible parallel and distributed processing over multiple computers in a
network, and reliable data transfer by using FIFO buffers in the real-time processing environment.
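The buffered hand-off between processing stages can be sketched with a bounded FIFO queue connecting two threads. This is a minimal illustration of the principle only; the stage names and the trivial "processing" step are placeholders, not the actual framework:

```python
import queue
import threading

def producer(frames, buf):
    """Camera-reader stage: push frames into a bounded FIFO buffer;
    put() blocks when the buffer is full, so a slow consumer throttles
    the producer instead of frames being silently dropped."""
    for frame in frames:
        buf.put(frame)
    buf.put(None)  # sentinel: end of stream

def consumer(buf, results):
    """Processing stage: pull frames in arrival order and process them."""
    while True:
        frame = buf.get()
        if frame is None:
            break
        results.append(frame * 2)  # placeholder for detection/tracking work

buf = queue.Queue(maxsize=4)  # bounded FIFO between the two stages
results = []
t1 = threading.Thread(target=producer, args=(range(8), buf))
t2 = threading.Thread(target=consumer, args=(buf, results))
t1.start(); t2.start()
t1.join(); t2.join()
```

In a distributed deployment the same idea applies across machines, with a network transport replacing the in-process queue.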
Figure 2: The graphical user interface includes the following: map (top-left), spot-camera view (top-center), other cameras (top-right),
candidate view (bottom). The spot-view camera is displayed in red on the map. The candidate view displays time on the horizontal
axis and cameras on the vertical axis.
Tracklet generation consists of pedestrian detection and within-camera tracking. An example of generated tracklets in a
camera view is shown in Figure 5 and a visualization on the map is shown in Figure 6.
Figure 5: Example of generated tracklets in approximately 2 minutes. All lines indicate different tracklets, i.e. different persons.
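The within-camera tracking step can be sketched as greedy nearest-neighbor data association of per-frame detections; this is a deliberately simplified stand-in for a real tracker (no motion model, no occlusion handling), meant only to show how detections accumulate into tracklets:

```python
def associate(tracklets, detections, max_dist):
    """Greedy nearest-neighbor association for one frame: each detection
    (x, y) extends the closest tracklet whose last position is within
    max_dist pixels, otherwise it starts a new tracklet.  A real tracker
    would also use a motion model and handle missed detections."""
    for det in detections:
        best, best_d = None, max_dist
        for tr in tracklets:
            last = tr[-1]
            d = ((det[0] - last[0]) ** 2 + (det[1] - last[1]) ** 2) ** 0.5
            if d < best_d:
                best, best_d = tr, d
        if best is not None:
            best.append(det)
        else:
            tracklets.append([det])
    return tracklets

# Two frames of detections in one camera view (pixel coordinates).
tracks = associate([], [(10, 10), (100, 50)], max_dist=20)
tracks = associate(tracks, [(14, 12), (104, 55)], max_dist=20)
```

After the second frame, each of the two persons has its own two-point tracklet, which would then be stored in the tracklet database.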
Figure 7: CMC of different color models on data from the shopping center (77 person pairs): multi color-height histograms (MCHH),
transformed (tr), normalized (n), opponent (opp), and RGB-rank (r). The figure shows that a combination of MCHH with transformed-
normalized RGB results in the best performance.
Figure 8: CMC of different color models on data from VIPeR [18] (10 randomly selected sample sets of 316 person pairs out of the
632 pairs): multi color-height histograms (MCHH), transformed (tr), normalized (n), RGB (rgb), opponent (opp), and RGB-rank
(rgbr). The figure shows that a combination of MCHH with transformed-normalized RGB results in the best performance.
The results are shown in Figure 7 and Figure 8. The figures show similar results on VIPeR data and in the shopping mall.
Furthermore, the combination of MCHH with transformed-normalized RGB (both described in [4]) performs even better
than each of them separately.
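The descriptors themselves are defined in [4]; purely to illustrate the idea behind a color-height histogram, the following simplified sketch splits a person's bounding box into horizontal bands and builds a coarse color histogram per band (the bin counts and pixel format here are arbitrary choices, not those of [4]):

```python
def color_height_histogram(pixels, n_height_bins=4, n_color_bins=4):
    """Simplified color-height histogram: the bounding box is split into
    n_height_bins horizontal bands and a coarse RGB histogram is built
    per band, so color is described as a function of body height.
    `pixels` is a list of (y_rel, r, g, b), with y_rel in [0, 1) measured
    from the top of the box and channels in [0, 256)."""
    hist = [[0] * (n_color_bins ** 3) for _ in range(n_height_bins)]
    for y_rel, r, g, b in pixels:
        band = min(int(y_rel * n_height_bins), n_height_bins - 1)
        c = ((r * n_color_bins // 256) * n_color_bins ** 2
             + (g * n_color_bins // 256) * n_color_bins
             + (b * n_color_bins // 256))
        hist[band][c] += 1
    total = sum(sum(band) for band in hist) or 1
    # Flatten and normalize so histograms of different-sized boxes compare.
    return [v / total for band in hist for v in band]
```

Such a descriptor distinguishes, for example, a person with a red shirt and blue trousers from one with the opposite arrangement, which a single global color histogram cannot.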
The size of the database has a large effect on the matching performance. A larger database is more likely to contain
images that are highly similar to the query image but do not actually constitute a correct match. In this
experiment, performed on the shopping mall data, the standard set of database images was enlarged by adding varying
numbers of images from a different period than the standard images.
[Plot: cumulative recognition rate versus absolute rank (0-100) for database sizes 77, 100, 500, 1000 and 4000.]
Figure 9: CMC curves for varying database sizes, where the rank on the horizontal axis is an absolute value.
[Plot: cumulative recognition rate versus rank as a percentage of the database size, for database sizes 77, 100, 500, 1000 and 4000.]
Figure 10: CMC curves for varying database sizes, where the percentage on the horizontal axis is the rank relative to the database size.
Figure 9 shows that the CMC curve decreases for larger database sizes, as expected. The results in Figure 10 show that
the CMC remains approximately constant on different database sizes, when the horizontal axis is rescaled to 100% of the
database size (instead of the absolute value of the database size on the horizontal axis). This allows us to estimate how
well the system will perform in sparse or crowded environments.
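The rescaling of the horizontal axis can be sketched as follows; the function names are illustrative, but the computation follows the standard CMC definition (fraction of queries whose correct match appears at or below a given rank):

```python
def cmc_curve(ranks, db_size):
    """Cumulative Matching Characteristic: for each rank r (1-based),
    the fraction of queries whose correct match is at rank <= r.
    `ranks` holds the 1-based rank of the correct match per query."""
    n = len(ranks)
    return [sum(1 for k in ranks if k <= r) / n
            for r in range(1, db_size + 1)]

def cmc_at_percent(ranks, db_size, percent):
    """Recognition rate at a rank expressed as a percentage of the
    database size; this is the rescaling that makes CMC curves from
    differently sized databases comparable."""
    r = max(1, round(db_size * percent / 100))
    return sum(1 for k in ranks if k <= r) / len(ranks)
```

Evaluating `cmc_at_percent` at the same percentage for a sparse and a crowded scene gives the approximately constant performance described above, even though the absolute-rank curves differ.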
Table 2: Comparison of operator efficiency with and without our system, based on 8 volunteers who received 17 query pedestrians.
The absolute number of false negatives (FN) is shown for each pedestrian. The results show a 37% reduction in misses when the system is
used.
Figure 11: Missed pedestrians with and without the candidate view, as a function of the number of possible detections. The
improvement from the system is largest for pedestrians with a higher number of possible detections.
ACKNOWLEDGEMENT
The work for this paper was supported by the ‘Maatschappelijke Innovatie Agenda - Veiligheid’ in the project:
“Watching People Security Services” (WPSS). This project is a collaboration between TNO, Eagle Vision, Vicar Vision,
Noldus IT, Cameramanager.com and Borking Consultancy. This consortium acknowledges the “Centrum voor Innovatie
en Veiligheid” (CIV) and the “Diensten Centrum Beveiliging” (DCB) in Utrecht for providing the fieldlab facilities and
support. The development of the demonstrator was partially funded by the EU FP7-Security project PROTECTRAIL.
REFERENCES
[1] Barbu, A., Michaux, A., Narayanaswamy, S., Siskind, J.M., “Simultaneous object detection, tracking and event
recognition,” Advances in Cognitive Systems, (2012).
[2] Bialkowski, A., Denman, S., Sridharan, S., Fookes, C., “A database for person re-identification in multi-camera
surveillance networks,” IEEE Int. Conf. Digital Image Computing Techniques and Appl. DICTA, (2012).
[3] Benenson, R., Mathias, M., Timofte, R., Van Gool, L., “Pedestrian detection at 100 frames per second,” IEEE
Conf. Computer Vision and Pattern Recognition CVPR, (2012).