stationary_objects_detection
Filip Leiding
1 Introduction  4
1.1 Purpose  4
1.2 Related work  4
3 Implementation  20
3.1 System overview  20
3.1.1 Graphical user interface  21
3.1.2 Background subtraction  25
3.1.3 Foreground sampling  26
3.1.4 Detection and notification  26
3.1.5 Saving to file  27
3.2 Program and libraries  27
3.2.1 Python  28
3.2.2 Tkinter  28
3.2.3 OpenCV  28
3.2.4 MKVmerge  29
4 Results  30
4.1 Test Results  30
4.2 Comparison Between Languages  30
5 Conclusions  34
5.1 Conclusion  34
5.2 Future Work  35
Chapter 1
Introduction
1.1 Purpose
This thesis is a project executed together with CodeMill AB [16], an IT consulting firm located in Umeå in northern Sweden. CodeMill specialises in the field of media and broadcast. The thesis was formed out of a demand for automatic notification when certain objects have been left in improper places. Such situations can occur, for example, at loading docks where newly delivered cargo needs to be taken care of, or at emergency exits where objects block the way and must be removed immediately for safety reasons. The last and most critical example is abandoned bags at airports, which can potentially be very dangerous. The questions to be answered in this thesis are as follows:
3. How can one filter out static data from non-static data?
over time. Now the focus lies on improving the existing models to fit the modern needs of automated surveillance. Many researchers, such as S.C. Cheung et al. [2], Medha Bhargava et al. [11] and Sen-Ching S. Cheung et al. [7], are trying to improve the responsiveness and accuracy of background models in crowded areas such as train stations or rush-hour traffic sites. Another modern need is the ability to distinguish objects from their shadows, to reduce the number of faulty detections when the environment lighting changes or when an object's shadow enters the scene but the object itself does not. KaewTraKulPong et al. [12] and Thanarat Horprasert et al. [13] have been working on this kind of improvement. Other researchers have tried to improve the overall function of background subtraction with new and different approaches. Rubén Heras Evangelio et al. [10] use a dual background model together with a finite-state machine to detect static objects, and Antonio Albiol et al. [5] use spatio-temporal maps to detect stationary objects within predefined areas of a scene. Thi Thi Zin et al. [6] and YingLi Tian et al. [8] have done research on object detection suited for real surveillance applications where security is the main purpose. In the paper Wallflower: Principles and practice of background maintenance, K. Toyama et al. [3] developed a new background subtraction algorithm which they claim was one of the best algorithms to use, at least when the paper was written.
Chapter 2
Video content analysis is the study of visual changes and events in video streams and video files. A video file is a collection of images put together in a specified sequence and shown at a certain speed to create the illusion of motion to the human eye. To analyse video, one must go down to the image level and perform the analysis on one image at a time. Video content analysis, and object detection in particular, is done at an even smaller scale than the image, namely the pixel level. To be able to understand the concepts in this paper, an introduction to digital image representation follows.
In the digital world an image is represented by pixels, and inside a computer an image is stored as a 2-dimensional matrix where each element is one pixel. The colourspace of the image determines the dimensionality of the representation. If the image is in grayscale, each pixel contains a single value from 0 to 255 (in an 8-bit colourspace per channel) (see Figure 2.2 on the facing page). In RGB (Red, Green, Blue) colourspace, each pixel has three values, one per channel, which makes the image representation 3-dimensional. As in the grayscale representation, each channel in RGB colourspace can take a value between 0 and 255 (see Figure 2.1 on the next page).
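These representations can be sketched with NumPy arrays; OpenCV, used later in this thesis, stores its images in exactly this layout (though with the channels in BGR rather than RGB order):

```python
import numpy as np

# A grayscale image is a 2-D matrix: one intensity value (0-255) per pixel.
gray = np.zeros((240, 320), dtype=np.uint8)
gray[100, 50] = 255            # set a single pixel to white

# A colour image adds a third dimension: three channel values per pixel.
rgb = np.zeros((240, 320, 3), dtype=np.uint8)
rgb[100, 50] = (255, 0, 0)     # full intensity in the first channel

print(gray.ndim, rgb.ndim)     # 2 3
```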
Figure 2.1: RGB colour space
Figure 2.2: grayscale colour space
2.3 Non-recursive background modelling
Non-recursive modelling is based on a buffer of size N that stores previous frames of the scene. The stored images are used to estimate the background image from the variation of the temporal values of each pixel across the frames in the buffer. These techniques vary in adaptiveness depending on the size of the buffer: a large buffer means that adaptation takes longer and that the storage requirements of the equipment increase, and vice versa for a smaller buffer.
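A minimal sketch of such buffer-based modelling, here using the per-pixel temporal median as the background estimator (the buffer size and pixel values are illustrative):

```python
import numpy as np
from collections import deque

# Non-recursive (buffer-based) modelling: keep the last N frames and
# estimate the background as the per-pixel temporal median.
N = 5
buffer = deque(maxlen=N)   # frames older than N fall out automatically

def update_background(frame):
    buffer.append(frame)
    return np.median(np.stack(buffer), axis=0).astype(frame.dtype)

# Toy 2x2 grayscale frames: a transient "object" (value 200) passes
# through one pixel; the median suppresses it.
frames = [np.full((2, 2), 50, dtype=np.uint8) for _ in range(5)]
frames[2][0, 0] = 200
for f in frames:
    bg = update_background(f)
print(bg[0, 0])   # 50: the transient value was filtered out
```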
2.3.3 Non-parametric modelling
Non-parametric modelling is similar to the other non-recursive methods, but instead of having a fixed buffer of frames for estimating the background update, it uses all of the previous frames to make the estimate parameter-independent. This method was constructed by Elgammal et al. [14], who use the previous frames to estimate the pixel density function f(P_t = u):
f(P_t = u) = (1/N) Σ_{i=t−N}^{t−1} K(u − P_i)    (2.2)
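Equation (2.2) can be sketched for a single pixel. A Gaussian kernel is assumed here for K, and the bandwidth sigma is an illustrative value; a pixel whose current value has low estimated density under f would be classified as foreground:

```python
import numpy as np

def pixel_density(u, history, sigma=10.0):
    """Kernel density estimate f(P_t = u) over a pixel's previous
    intensity values, using a Gaussian kernel K as in equation (2.2)."""
    d = u - np.asarray(history, dtype=float)
    K = np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return K.mean()            # (1/N) * sum of the kernel values

history = [100, 102, 99, 101, 100]   # recent intensities of one pixel
# A value close to the history has a much higher density than an outlier:
print(pixel_density(100, history) > pixel_density(200, history))   # True
```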
every pixel in the frame, so the same frame can fall on either side of the line depending on which pixel is considered.
where the values a_{1,2} = a_{2,2} = 0.7 are used, as in [4]. H(t_i) is called the measurement matrix and is also a constant matrix.
H = [1 0]    (2.6)
z(t_i) is the input matrix, the current frame from the camera, and K(t_i) is the gain matrix. The gain matrix is derived from the error covariance matrix: if the gain is high, the noise of the input is low, and vice versa. The Kalman filter procedure thus obtains its estimated matrix by weighing the difference between the prediction matrix and the current input matrix. Values with a large difference from the input matrix get a lower weight, which means that errors do not linger for a long time. See a graphical explanation of the process in Figures 2.3 on the facing page, 2.4 on the next page and 2.5 on page 12.
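The update can be sketched as a scalar simplification of the idea above: the background estimate is corrected by a gain-weighted difference against the current frame, and pixels flagged as foreground get a smaller gain so their errors do not linger. The gain values below are illustrative, not the constants from [4], and a full Kalman filter would also propagate the error covariance:

```python
import numpy as np

ALPHA_BG, ALPHA_FG = 0.1, 0.01   # illustrative gains (assumed values)

def kalman_style_update(background, frame, threshold=30):
    """Move the background estimate toward the frame; foreground pixels
    (large difference) receive a much smaller gain."""
    fg_mask = np.abs(frame - background) > threshold
    gain = np.where(fg_mask, ALPHA_FG, ALPHA_BG)
    return background + gain * (frame - background)

bg = np.full((2, 2), 100.0)
frame = np.array([[100.0, 100.0], [100.0, 250.0]])  # one foreground pixel
bg = kalman_style_update(bg, frame)
print(bg[1, 1])   # 101.5: moved only slightly toward the outlier
```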
Figure 2.3
Figure 2.4
Figure 2.5
a set of distributions, mostly ranging from 3 to 5. More than one distribution is used to be able to ignore objects that belong to the background but are not stationary, such as swinging leaves, snow and rain, and even to reduce the detection of shadows. Several distributions make a multi-modal background possible, which means that a pixel can take several different colour values and still not be classified as a foreground object. Every pixel in each frame is compared to the set of mean values
µ(K) = {µ_1, µ_2, ..., µ_K}    (2.7)
and a set of variances
σ(K) = {σ_1², σ_2², ..., σ_K²}    (2.8)
where K is the number of Gaussian distributions used. The distribution created from the K Gaussian distributions can be described as
Z ~ N( Σ_{i=1}^{K} µ_i , Σ_{i=1}^{K} σ_i² )    (2.9)
This new distribution can look similar to the illustration below (see Figure 2.6).
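The per-pixel test can be sketched as follows. A pixel value is considered background if it lies within a few standard deviations of any of the K Gaussians; the 2.5-sigma rule is a common choice in the literature, and the means and variances below are illustrative:

```python
import numpy as np

# K = 3 background Gaussians for one pixel, as in eqs. (2.7) and (2.8).
mu    = np.array([ 80.0, 120.0, 200.0])   # mean values (illustrative)
sigma = np.array([  5.0,   8.0,  10.0])   # standard deviations

def is_background(u, k_sigma=2.5):
    """Match the pixel value against each Gaussian; one match suffices.
    This multi-modality lets several colour values count as background."""
    return bool(np.any(np.abs(u - mu) < k_sigma * sigma))

print(is_background(118))   # True: matches the second Gaussian
print(is_background(160))   # False: matches none of them
```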
the stationary object in the scene. This method is very basic and therefore has some restrictions. It cannot distinguish permanently stationary objects from temporarily stationary objects, for example a person who stops to tie a shoe. It would alert the system for every object that stands still for a moderate period of time. To solve this problem, a similar but slightly more advanced model was invented, described in the next section.
2.6 2-model background subtraction
In this method, the subtraction of the background occurs at two different rates. One of the background images is updated every frame and the other one every L frames (see Figure 2.7 on the previous page). Masks for both backgrounds are created at their respective rates. The short-term background SB_t(x, y) is compared to the current frame CF_t(x, y), and every pixel either increases or decreases in intensity depending on the result of the comparison.
|CF_t(x, y) − SB_t(x, y)| > T    (2.13)
Equality between pixels leaves the pixel unchanged. This enables SB_t(x, y) to change quickly in scenes where the lighting conditions change rapidly. The long-term background LB_t(x, y) goes through the same process every L frames and is compared to the current SB_t(x, y) to gradually adapt to the environment of the scene. By increasing or decreasing the intensity of the pixels, stationary objects slowly become part of the background while moving objects remain part of the moving foreground. By having two backgrounds updating at different intervals, it is possible to detect temporarily stationary objects as well as objects that were part of the background but have been moved. This method can only be applied to frames in the grayscale colourspace.
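The dual-rate update can be sketched as follows. Every frame, each short-term background pixel moves one intensity step toward the current frame (equal pixels stay unchanged); every L frames the long-term background moves one step toward the short-term one. L, T and the pixel values are illustrative:

```python
import numpy as np

L = 10   # long-term update interval (illustrative)

def nudge(reference, target):
    """Move each grayscale pixel of `reference` one intensity step toward
    `target`; pixels that are already equal stay unchanged."""
    step = np.sign(target.astype(int) - reference.astype(int))
    return (reference.astype(int) + step).clip(0, 255).astype(np.uint8)

sb = np.full((2, 2), 100, dtype=np.uint8)      # short-term background
lb = sb.copy()                                 # long-term background
frame = np.full((2, 2), 160, dtype=np.uint8)   # a stationary object appears
for t in range(1, 31):
    sb = nudge(sb, frame)
    if t % L == 0:
        lb = nudge(lb, sb)

# Foreground test of eq. (2.13) with T = 20:
mask = np.abs(frame.astype(int) - sb.astype(int)) > 20
print(int(sb[0, 0]), int(lb[0, 0]), bool(mask[0, 0]))   # 130 103 True
```

The short-term background has closed much of the gap after 30 frames while the long-term one lags far behind, which is exactly what allows temporarily stationary objects to be told apart from permanent ones.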
2.7 Comparison
In this section a comparison between the different background modelling methods is made and summarised in a table describing each method's important features (see Table 2.1 on page 19). To start the comparison, I will first separate recursive and non-recursive methods, compare the methods within these two categories, and then compare the two groups.
The non-recursive models are very similar to each other, since both median filtering and non-parametric modelling are based on the frame differencing technique. The frame differencing model and the median filtering model both compare the current frame with the previous frame or a buffer of frames, while the non-parametric modelling technique uses all of the previous frames to make an estimate and then compares the estimate to the current frame.
The base function is, as said, very similar between these three, and the biggest difference is that only the non-parametric model works on colour images while the other two only work on grayscale images (with a small exception for median filtering, where the medoid can be used for colour images). The non-parametric model is the most sophisticated, but it requires a large buffer for storing all of the previous frames, which can be quite many if the recording goes on for a long period of time.
Non-recursive modelling
Pros:
Cons:
• Only works for grayscale images (except for median filtering using
medoid).
The recursive models are not as similar to each other as the non-recursive models are. These techniques all take different approaches, which makes them harder to compare. Approximated median filtering has the same base as median filtering, but no storage at all is used; instead, a threshold determines whether the background pixel value is increased or decreased by one. This unfortunately makes the model useless for colour images. Kalman filtering uses a prediction matrix and the current matrix to update the background reference matrix, and thus does not need to store frames for the estimate. To get this technique to work, some initialisation and predefined variables are required. The Gaussian mixture model is the most sophisticated model presented in this paper because of its multi-modal property and its ability to work on both grayscale and colour images. It uses a statistical approach to determine the similarity of the current pixel and the background pixel. This model also requires some predefined variables, which must be set to determine the sensitivity of the model. The Kalman and Gaussian models have the colour image capability in common, while approximated median filtering only works for grayscale. The only thing all three models have in common is that they only compare the background frame with the current frame and that the background image is updated after each frame.
Recursive modelling
Pros:
Cons:
Technique                  Multi-modal  Shadow Det.  Adaptive Rate  Category
Frame Differencing         No           No           Fast           Non-recursive
Median filtering           No           No           Fast           Non-recursive
Non-parametric modelling   No           Yes          Slow           Non-recursive
Approx. median filtering   No           No           Fast           Recursive
Kalman filtering           No           No           Slow           Recursive
Gaussian mixtures          Yes          Yes          Slow           Recursive
Chapter 3
Implementation
3.1.1 Graphical user interface
The implemented GUI is very simple, with only some basic functionality. Upon startup the user can choose to save the captured frames; this feature can be switched on and off during runtime (see Figure 3.2). Another setting which can be changed before or during runtime is the learning rate of the background, where a high number represents a slow background update. The last setting of the startup window is the pixel area detection, which controls the threshold on the area of objects that should be detected as stationary; with a lower number, smaller objects can be detected. This unit is measured in square pixels and is calculated after the contours of the detected object have been found. When the record button is pressed, the program starts to capture from the source the user has entered in the video input source setting (see Figure 3.3 on the next page), which can be found under File in the menu bar.
The program can take any type of video file, or a number representing the source of a connected camera; 0 is the default number for a connected or built-in camera. When a valid source is entered, the options to pause, exit or watch the capturing live (see Figure 3.2) become available to the user. Pressing the pause button stops the capturing and also the
saving of the captured frames, but the program keeps running. The background reference is reset to the last frame before the pause button was pressed. This enables the user to pause the capturing, ignore the changes made during the pause, and then continue as if the pause never happened. The exit button exits the program, as does the exit button in the menu bar. The view button enlarges the window and displays the captured images in real time (see Figure 3.4 on the next page).
When live view is enabled, another option appears for the user, namely to extend the window further into an extended view mode (see Figure 3.5 on page 24). This mode shows not only the current frame but also the applied binary mask and the sampled images when an object has been stationary for a period of time.
The last setting the user can make is to type in the destination for the notification mails (see Figure 3.6 on page 24). These mails are sent to the user when a new detection occurs. The user can specify which mail address it should be sent to and which mail address the sender should have. Below that, the subject and the actual message can be specified according to the user's preferences. If the mail server used requires login credentials, the optional input fields for username and password can be used. The last input field is the address of the mail server with a target port.
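Such a notification mail can be sketched with Python's standard smtplib and email modules. All addresses, the host and the port below are placeholders for the values entered in the GUI fields, and login is only attempted when credentials were given:

```python
import smtplib
from email.message import EmailMessage

def build_notification(sender, recipient, subject, body):
    """Assemble the notification mail from the user's GUI settings."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
    msg.set_content(body)
    return msg

def send_notification(msg, host, port, username=None, password=None):
    """Deliver the mail; credentials are optional, as in the GUI."""
    with smtplib.SMTP(host, port) as server:
        if username and password:
            server.starttls()
            server.login(username, password)
        server.send_message(msg)

# Placeholder addresses; send_notification would need a reachable server.
msg = build_notification("detector@example.com", "guard@example.com",
                         "Stationary object detected",
                         "A new stationary object was detected.")
print(msg["Subject"])
```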
Figure 3.4: Live view of the capturing
Figure 3.5: Extended live view with binary mask applied
3.1.2 Background subtraction
The purpose of the program is to read the next frame from the input source and perform a background subtraction against the background reference image. This is done by simply taking the current frame and subtracting the background reference image. Since the images are stored as matrices, this is a simple binary operation performed for every pixel in the matrices. A plain subtraction between the matrices results in a matrix with a lot of noise in it, since the subtraction is absolute (see Figure 3.7): even if two pixels differ only by decimals, the difference will be visible.
This program uses a predefined method for subtracting the images. The method uses morphological filters, which filter out pixels with very small differences (see Figure 3.8 on the next page). These filters provide improved accuracy and fewer faulty detections when the binary masks are applied.
When the subtraction has been made, the resulting matrix is used to make a binary mask, which is then used for the foreground separation and the detection of objects not belonging to the background.
Figure 3.8: Background subtraction of a colour image with morphological
filter
Figure 3.9: Foreground sampling compared with logical AND
When the contours have been found, another built-in function calculates the area within each contour. This area is compared with the threshold specified by the user in the GUI. If the area is larger than the threshold, a rectangle around the object is drawn on the current frame. When the rectangle has been drawn, a notification is sent to the specified email address with the user-specified message and subject (see Figure 3.6 on page 24).
3.2.1 Python
Python is a very diverse and multifunctional language that is supported almost everywhere, on every platform. It can be used as a scripting language, as an object-oriented language [19] (which is how this program uses it), or as a mixture of both. The possibility to embed other languages inside Python makes its usage almost limitless. The language was chosen for its diversity and its support on both major and minor platforms, to make the program as universal as possible without compromising too much.
3.2.2 Tkinter
Tkinter is a graphical user interface (GUI) toolkit which comes embedded in the Python environment when installed. This framework for creating GUIs is very similar to the Swing package in Java; it is very easy to use, and creating something small but functional takes very little time. The toolkit is mostly built for small GUIs, and its functions are therefore limited to the most basic needs; for creating large-scale and advanced GUIs, another library is recommended. This toolkit was chosen for its simplicity, since the GUI only consists of a few buttons and input lines for the user to change some variables and settings during runtime. The toolkit is well documented, with examples of every button and item it supports [20], as well as the input and output of every method, for easy understanding and usage.
3.2.3 OpenCV
OpenCV (Open Source Computer Vision) is a set of libraries written in optimised C and C++ with the intention of being computationally efficient for real-time programs and applications. The libraries are released under the BSD licence, which gives the user the right to use them freely for commercial as well as academic purposes. OpenCV has many well-designed methods for VCA, constructed after well-known research papers with robust techniques. The methods used in this paper are mostly for the background subtraction and the background update. Documentation of all the methods and their attributes can be found on the OpenCV webpage [21].
3.2.4 MKVmerge
An MKV file, or Matroska file, is a media container which can hold video, audio and subtitles in a single file [17]. These files are not a video or audio format in themselves, just a container for streams. To create this container, a program called MKVmerge [18] is used, which takes a video file and, optionally, a text file containing chapter marks, and creates a new MKV file. This program is used to mark new detections from the implemented program, making it possible to search the file for detection events instead of having to go through the whole video.
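This workflow can be sketched as follows. The simple OGM-style chapter syntax shown is one of the formats mkvmerge accepts; the timestamps and file names are illustrative, and mkvmerge itself must be installed separately:

```python
import subprocess

def chapters_text(detection_times):
    """Format detection timestamps ("HH:MM:SS.mmm") in the simple
    chapter syntax accepted by mkvmerge's --chapters option."""
    lines = []
    for i, t in enumerate(detection_times, start=1):
        lines.append(f"CHAPTER{i:02d}={t}")
        lines.append(f"CHAPTER{i:02d}NAME=Detection {i}")
    return "\n".join(lines) + "\n"

text = chapters_text(["00:01:12.000", "00:07:45.500"])
print(text.splitlines()[0])   # CHAPTER01=00:01:12.000

# Merging the recording with the chapter file (illustrative file names):
# with open("chapters.txt", "w") as f:
#     f.write(text)
# subprocess.run(["mkvmerge", "-o", "out.mkv",
#                 "--chapters", "chapters.txt", "capture.avi"])
```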
Chapter 4
Results
Figure 4.1: Tests made with CAVIAR test files; the left side shows the scene without objects, the right side marks the abandoned objects.
Language  Resolution (pixels)  Average frame rate (FPS)  Live View
Python    320x240              30.02 / 14.95             No / Yes
Python    640x480              30.0 / 10.77              No / Yes
Python    1280x720             18.9 / 6.22               No / Yes
C++       320x240              30.3 / 15.87              No / Yes
C++       640x480              30.3 / 12.20              No / Yes
C++       1280x720             20.41 / 6.49              No / Yes
Figure 4.2: Performance difference between Python and C++ with OpenCL
faster than 30 frames per second, so the only significant difference appeared when the frame rate reached the maximum of the camera's capability at 1280x720 pixels. Here the C++ program performs 8% better than the Python program. When the live view was enabled, differences could be seen at all resolutions; here the difference was between 4.3 and 13.3%.
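These percentages can be checked directly against the frame rates in Figure 4.2:

```python
# Relative C++ advantage, computed from the frame rates in Figure 4.2.
no_view_720p = (20.41 - 18.9) / 18.9 * 100    # without live view, 1280x720

# Live-view pairs (Python FPS, C++ FPS) per resolution:
live = [(14.95, 15.87), (10.77, 12.20), (6.22, 6.49)]
live_diffs = [round((cpp - py) / py * 100, 1) for py, cpp in live]

print(round(no_view_720p), live_diffs)   # 8 [6.2, 13.3, 4.3]
```

The live-view differences span 4.3% to 13.3%, matching the range stated above.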
Chapter 5
Conclusions
5.1 Conclusion
The intention of this thesis was to answer the three questions stated in the beginning of the report:
3. How can one filter out static data from non-static data?
34
begin is created. This text file can be merged with the video file using the external program MKVmerge to create a video container. When the new video file is played, the user is able to search the video via the chapters, which represent the times of detection. This is just one of many ways to make a video file searchable for times of detection.
Bibliography
[2] S.C. Cheung and C. Kamath, ”Robust techniques for background subtraction in urban traffic video”, in Proc. Video Communications and Image Processing, SPIE Electronic Imaging, San Jose, Calif., USA, January 2004.
[5] Antonio Albiol, Laura Sanchis, Alberto Albiol, and Jose M Mossi ”De-
tection of Parked vehicles using SpatioTemporal Maps” vol.12, pp.1277-
1291, December 2011.
[6] Thi Thi Zin, Member, IAENG, Pyke Tin, Takashi Toriu, and Hiromitsu
Hama ”A Novel Probabilistic Video Analysis for Stationary Object De-
tection in Video Surveillance Systems” IAENG International Journal of
Computer Science, 39:3, IJCS 39 3 09
entific Computing Lawrence Livermore National Laboratory 7000 East
Avenue, Livermore, CA 94550
[8] YingLi Tian, Rogerio Feris, Haowei Liu, Arun Humpapur, and Ming-
Ting Sun ”Robust Detection of Abandoned and Removed Objects in Com-
plex Surveillance Videos”
[9] Thi Thi Zin, Pyke Tin, Takashi Toriu and Hiromitsu Hama ”A
Probability-based Model for Detecting Abandoned Objects in Video
Surveillance Systems” Proceedings of the World Congress on Engineer-
ing 2012 Vol II WCE 2012, July 4 - 6, 2012, London, U.K.
[10] Rubén Heras Evangelio and Thomas Sikora ”Static Object Detec-
tion Based on a Dual Background Model and a Finite-State Ma-
chine” Hindawi Publishing Corporation EURASIP Journal on Im-
age and Video Processing Volume 2011, Article ID 858502, 11 pages
doi:10.1155/2011/858502
[12] P. Kaew Tra Kul Pong and R. Bowden ”An Improved Adaptive Back-
ground Mixture Model for Real- time Tracking with Shadow Detection”
In Proc. 2nd European Workshop on Advanced Video Based Surveil-
lance Systems, AVBS01. Sept 2001. VIDEO BASED SURVEILLANCE
SYSTEMS: Computer Vision and Distributed Processing, Kluwer Aca-
demic Publishers
[16] http://www.codemill.se
[17] http://www.matroska.org/technical/whatis/index.html
[18] http://www.matroska.org/node/50
[19] https://www.python.org
[21] http://www.opencv.org
[22] http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
Image reference
[23] Figure 2.2 on page 7 borrowed from
http://inperc.com/wiki/index.php?title=Images_as_functions_of_two_variables