Automatic Detection, Tracking and Counting of Birds in Marine Video Content
Abstract— Robust automatic detection of moving objects in a marine context is a multi-faceted problem due to the complexity of the observed scene. The dynamic nature of the sea, caused by waves, boat wakes, and weather conditions, poses huge challenges for the development of a stable background model. Moreover, camera motion, reflections, lightning and illumination changes may contribute to false detections. Dynamic background subtraction (DBGS) is widely considered as a solution to tackle this issue in the scope of vessel detection for maritime traffic analysis. In this paper, the DBGS techniques suggested for ships are investigated and optimized for the monitoring and tracking of birds in marine video content. In addition to background subtraction, foreground candidates are filtered by a classifier based on their feature descriptors in order to remove non-bird objects. Different types of classifiers have been evaluated, and results on a ground truth labeled dataset of challenging video fragments show similar levels of precision and recall of about 95% for the best performing classifier. The remaining foreground items are counted and birds are tracked along the video sequence using spatio-temporal motion prediction. This allows marine scientists to study the presence and behavior of birds.

Keywords— dynamic background subtraction, texture analysis, image classification, object detection, tracking, seabirds, marine environment.

I. INTRODUCTION

The Flanders Marine Institute (VLIZ)1 has evolved into the central coordination and information platform for marine and coastal scientific research in Flanders, Belgium. The objective of VLIZ is to support and promote Flemish marine scientific research and to participate in local, national and international projects. The proposed research is linked to the European LifeWatch project2 for monitoring biodiversity on earth. In the context of LifeWatch, VLIZ installed two PTZ cameras in a marine setting for visual surveillance. The first camera was placed at the Spuikom (Ostend), a local water mass attracting birds and people alike. The second camera is set upon the railing of a wind mill on the Thornton (sand)bank. The resulting video feeds, shown in Fig. 1, allow marine biologists to track the presence and behavior of birds on those sites. Manual operation of these camera systems, however, is not efficient due to fatigue, stress and the limited ability of human operators to perform this kind of task. To aid the scientists and avoid manual tallying, automatic processing of the imagery is investigated.

Automated object detection in a maritime and marine environment is a complex problem due to various factors that complicate the general video analysis approach. Camera motion, the variety of objects and their appearance, a highly dynamic background, meteorological circumstances, geographical location and the direction of the camera make the detection process challenging [1]. In order to deal with all these issues, an appropriate background model [2] is needed in combination with a classifier to discriminate between the objects of interest (i.e. birds) and other moving foreground objects. Furthermore, being able to track the objects of interest across consecutive video frames facilitates detection, spatio-temporal behavior analysis and recognition. In this paper, we propose a methodology combining these three techniques and we particularly focus on the specific problem of bird detection in dynamic scenes. First of all, we build a background model to remove as much of the sea as possible, without removing any flying birds or birds resting on the water surface. Secondly, an image texture analysis is performed to classify the foreground candidates as water or bird. We investigate whether false detections (like water) can be eliminated, while maintaining the true detections (i.e. birds). This results in a number of validated foreground items per frame. Finally, a spatio-temporal prediction technique is used to track these items along the video sequence.

The remainder of this paper is organized as follows. Related work in marine/maritime video analysis is discussed in Section 2. Subsequently, the three main building blocks of our algorithm are described in Sections 3 to 5. Next, Section 6 presents our manually annotated VLIZ dataset (shown in Fig. 1), the evaluation process and results. Finally, Section 7 lists the conclusions and points out directions for future work.

1 http://www.vliz.be/en/mission
2 http://www.lifewatch.eu/

Fig. 1. VLIZ dataset with manually annotated bird labels.
The pixel-based TMF approach keeps a limited set of video images in memory using a frame buffer. Based on this frame buffer, we continuously update the background model with each incoming frame. For each pixel in the new frame, we determine the median value of the corresponding pixels in the frame buffer in order to perform a threshold-based background/foreground detection.

Three different approaches for updating the buffer are investigated. By default, the buffer is a sliding window in which the oldest frame is deleted first, i.e. a first-in-first-out strategy. An alternative approach uses a memoryless buffer in which frames are replaced at random, i.e. a Random Temporal Median Filter (RTMF) [13]. In this manner, the background model covers a longer period of time/motion without enlarging the buffer. The technique is called memoryless because there is no link between the buffer index and the moment in time the frame was added to the buffer. The third option that we investigated is recursive, as only one frame is kept in memory and adjusted to each new frame. Approximate Temporal Median Filtering (ATMF) [14] saves the first frame received. If a new pixel value is higher than the model, the model is incremented by one; if it is lower, the model pixel is decreased by one. For each of the investigated TMF approaches, all new frames are thresholded with the background model, resulting in the foreground mask. Pixels which differ more than the experimentally defined optimal threshold are seen as foreground.

Again, three different approaches were investigated. The first type uses a static threshold. However, as video content changes frequently (e.g. due to lighting and waves), the threshold should change accordingly. An alternative is a relative threshold, in which the margin is expressed as a percentage instead of an absolute pixel value difference. The last and best approach is a normalized threshold based on the mean and standard deviation of all pixel differences.

Median filtering is very accurate for detecting moving objects (see Fig. 4). However, for stationary objects, such as birds resting on the water, pixel values can be within the threshold of the median value. As a result, the foreground mask might not fully encompass all foreground objects. This is solved by dilating the foreground mask. In this way, stationary objects can be detected as foreground, as shown in Fig. 5.

IV. BIRD/WATER CLASSIFICATION

To detect as many birds as possible, i.e. to limit the number of false negatives, some false positives (such as waves) must be allowed in the dynamic background subtraction. In order to filter out these false positives, we propose a classification mechanism based on image texture analysis (Fig. 6). Keypoints and corresponding gradient features are extracted from each detection result and are transformed into lower-dimensional code words using the k-means clustering results of our training samples. Next, these code words are classified as 'water' or 'bird' by a linear SVM (Support Vector Machine) classifier. During an offline training phase, labeled (bird, water) code words are fed to the classifier.

SIFT, i.e. the Scale-Invariant Feature Transform method of Lowe [15], is used as our baseline feature detector to which other approaches will be compared. SIFT describes an image by its most representative local characteristics and can be used for image stitching, object detection and object classification. The SIFT keypoints, i.e. circular image regions with an orientation, are represented by 128-dimensional vectors in which the fastest varying dimension is the orientation. Some examples of VLIZ images with SIFT keypoints and orientations are shown in Fig. 7. Important to mention is that this technique is scale and rotation invariant in the 2D picture plane and, to some degree, invariant to rotations in 3D.
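The code-word quantization at the heart of this classification step can be illustrated in isolation. The following is a minimal NumPy sketch under stated assumptions: the toy vocabulary and descriptors are random stand-ins for the k-means cluster centres and SIFT features described above, and the resulting histogram would be the input to the SVM classifier, which is omitted here:

```python
import numpy as np

def code_word_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual code word
    (a cluster centre) and return the normalized code-word histogram
    that would be fed to the classifier."""
    # Pairwise distances: (n_descriptors, n_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy data: a 5-word vocabulary of 128-D centres and 10 descriptors
# generated near those centres (stand-ins for real SIFT features).
rng = np.random.default_rng(1)
vocab = rng.normal(size=(5, 128))
desc = vocab[rng.integers(0, 5, size=10)] + 0.01 * rng.normal(size=(10, 128))
bow_hist = code_word_histogram(desc, vocab)
print(bow_hist.shape, bow_hist.sum())
```

The dictionary size of 5 matches the best configuration reported in Section VI; all other values here are illustrative.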
Fig. 6. General overview of our image texture analysis workflow.
Fig. 8. General overview of our spatio-temporal tracking workflow.
For a classifier to work properly, all input data should have the same size and be sufficiently small to attain high performance. As our detected regions vary in size, the number of keypoints will greatly differ. Furthermore, a SIFT feature vector contains 128 bins, which is also rather big. Both issues complicate the training/classification process. However, features can be transformed into a combination of a limited number of visual code words [16]. Features of our training set are clustered with k-means, creating a small vocabulary of visual code words. The classifier is then trained on pictures described as a combination of these code words. For our baseline SIFT feature detector, a nonlinear SVM classifier with a histogram intersection kernel [17] and only five code words is found to achieve the highest accuracy, with a precision over 90% (as discussed in Section VI, Table 1).

Alternative feature extraction approaches based on SURF and HOG features have also been investigated. As shown in Section VI, Table 2, HOG features [18] seem to perform better than SIFT for the bird classification task. HOG is a dense feature extraction method that extracts features for all locations in the image, as opposed to only the local neighborhood of keypoints as SIFT does. Since the detected regions in our set-up can be very small, the number of SIFT features can be too low to discriminate between birds and water. In contrast, the HOG descriptor technique counts occurrences of gradient orientation in localized portions, i.e. cells of 8x8 pixels, over the entire region of interest, leading to a higher accuracy. Furthermore, it is important to mention that we do not use k-means clustering in the HOG-based approach, which drastically decreases the computational cost (by up to 80%) compared to our baseline SIFT-based approach.

V. SPATIO-TEMPORAL BIRD TRACKING

In order to track the detected birds across the video frames, individual detections must be linked to detections in previous frames. For each registered object, the new location is predicted based on its last position and movement. If the bounding box of a newly found detection intersects with this prediction, the new detection is assumed to be the same object and the trajectory of the object is extended with the path to the new detection (shown in Fig. 9). Detections with no corresponding objects are registered as new foreground objects. Objects which have not been detected in a series of consecutive frames are deleted. Before removal, the object tracking information is written to a JSON-structured output file, as shown in Fig. 10. For each object, we log its first and last appearance and the spatio-temporal information of all its observations. This kind of logging facilitates the querying of the video data and allows direct access to objects of interest, supporting VLIZ scientists in their study of the presence and behavior of birds.

VI. EVALUATION PROCESS AND RESULTS

The proposed methodology is objectively evaluated using a ground truth labeled dataset of five videos coming from the VLIZ set-up, containing data recorded in different scenarios with varying light and weather conditions (as shown in Fig. 1). Different dictionary sizes, feature extraction methods and SVM kernels have been investigated.
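The prediction-and-intersection linking described in Section V can be sketched as follows. This is a simplified stand-alone sketch: the Track fields, the (x, y, w, h) box format and the max_missed retention threshold are our own illustrative assumptions, not details taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    box: tuple            # (x, y, w, h) of the last linked detection
    vel: tuple = (0, 0)   # displacement between the last two detections
    path: list = field(default_factory=list)
    missed: int = 0       # consecutive frames without a matching detection

def predict(t):
    """Predicted bounding box: last position shifted by last movement."""
    x, y, w, h = t.box
    return (x + t.vel[0], y + t.vel[1], w, h)

def intersects(a, b):
    """Axis-aligned bounding-box overlap test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def step(tracks, detections, max_missed=5):
    """Link detections to predicted positions; unmatched detections
    start new tracks; tracks missed too long are dropped."""
    unmatched = list(detections)
    kept = []
    for t in tracks:
        hit = next((d for d in unmatched if intersects(predict(t), d)), None)
        if hit is not None:
            unmatched.remove(hit)
            t.vel = (hit[0] - t.box[0], hit[1] - t.box[1])
            t.path.append(hit)
            t.box, t.missed = hit, 0
        else:
            t.missed += 1
        if t.missed <= max_missed:
            kept.append(t)
    kept += [Track(box=d, path=[d]) for d in unmatched]
    return kept

tracks = step([], [(10, 10, 8, 8)])      # first frame: one new object
tracks = step(tracks, [(14, 11, 8, 8)])  # moved right: linked to same track
print(len(tracks), tracks[0].vel)        # one track, velocity (4, 1)
```

A production tracker would also handle multiple candidates per prediction; this sketch greedily takes the first intersecting detection.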
Fig. 7. Gradient analysis of bird and water examples from the VLIZ dataset.
Fig. 9. Results of our spatio-temporal tracking of flying birds.
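The per-cell counting of gradient orientations that HOG performs (Section IV) can be illustrated on a toy 8x8 cell. This shows only the histogram-building step, without block normalization or the full descriptor; the 9 orientation bins and magnitude weighting are conventional HOG choices and not parameters confirmed by the paper:

```python
import numpy as np

def cell_orientation_histogram(cell, bins=9):
    """Count occurrences of gradient orientation in one 8x8 cell,
    weighted by gradient magnitude - the building block of HOG."""
    gy, gx = np.gradient(cell.astype(float))
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180   # unsigned orientation
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist

# Toy 8x8 cell with a horizontal intensity ramp: all gradients point
# along x, so all weight falls into the first orientation bin (~0 deg).
cell = np.tile(np.arange(8), (8, 1)) * 10
cell_hist = cell_orientation_histogram(cell)
print(cell_hist.argmax())  # 0
```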
Fig. 10. Output of the proposed algorithm: JSON output file and annotated video stream.
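A hypothetical sketch of the per-object JSON record described in Section V (first and last appearance plus the spatio-temporal information of all observations). The field names and layout are our assumption; the paper does not spell out its exact schema:

```python
import json

# Hypothetical per-object record: two observations of one tracked bird,
# each an (frame, bounding box) pair.
observations = [
    {"frame": 120, "box": [34, 58, 12, 9]},
    {"frame": 121, "box": [38, 57, 12, 9]},
]
record = {
    "object_id": 7,
    "first_frame": observations[0]["frame"],
    "last_frame": observations[-1]["frame"],
    "observations": observations,
}
print(json.dumps(record, indent=2))
```

Structuring the log this way makes "first/last appearance" queries direct lookups instead of scans over the raw video.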
Table 1 lists the results for our baseline SIFT-based approach with three types of SVM kernels (linear, intersection and RBF) and dictionary sizes of 5 and 10. A balanced training set of approx. 600 bird/water images and a test set of 10000 bird/water images were used in the evaluation process. In general, we achieve a high precision and recall. The best configuration consists of a dictionary size of 5 and an intersection kernel, resulting in 92% and 90% bird and water accuracy, respectively. Performance results of the classification step show that this configuration is also the fastest. We have also studied the impact of the type of median filter for dynamic background subtraction. No particular median filter can be seen as overall best, but a normalized threshold is clearly the better option in terms of accuracy and performance.

Precision, recall and F1-scores of alternative approaches to our baseline SIFT-based approach are presented in Table 2. In this test, grid search and shuffle-split cross-validation were used to avoid overfitting and to obtain the best parameters. A set of 3065 bird samples and 3483 water samples was used in this evaluation. In addition to SIFT, SURF and HOG have been tested, because both are mentioned quite often in the literature. Again, different types of SVM kernels were evaluated. HOG features in combination with an SVM RBF kernel perform best, with an average precision, recall and F1-score of 0.95. Important to mention is that no k-means clustering is used in the HOG-based classification, reducing the computational cost by 80% compared to the SIFT-based approach and making it more suitable for real-time monitoring.

The confusion matrix of the best performing HOG-RBF approach contains 2892 true positives, 173 false negatives, 150 false positives and 3333 true negatives. In order to further decrease the number of false positives/negatives, we will investigate incorporating temporal tracking information in the classification process. Furthermore, first tests have already been performed on state-of-the-art CNN architectures [19] for bird/water classification. The growing importance of CNNs for object detection and recognition will also be part of future work. Preliminary results of the Pyfaster object detection and recognition (shown in Fig. 11) with the COCO dataset (http://mscoco.org) show the feasibility of this approach.

The proposed detection results can be computed in quasi real-time and with low memory requirements when the number of objects is low. However, the performance diminishes greatly if a lot of detections need to be classified. Redesigning the feature selection process in the bird/water classification, i.e. our computational bottleneck, will be further investigated.
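The reported confusion matrix can be checked against the sample counts and the reported scores: 2892 + 173 = 3065 bird samples and 150 + 3333 = 3483 water samples, and the derived precision, recall and F1-score are all approximately 0.95:

```python
# Deriving precision, recall and F1 from the reported confusion matrix
# of the best HOG-RBF configuration (TP=2892, FN=173, FP=150, TN=3333).
tp, fn, fp, tn = 2892, 173, 150, 3333

assert tp + fn == 3065  # matches the reported bird sample count
assert fp + tn == 3483  # matches the reported water sample count

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```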
Table 1. Accuracy and performance results for our baseline SIFT algorithm.
Table 2. Precision, recall and F1-score for different feature extractors.
REFERENCES
[1] M. Hartemink. Robust Automatic Object Detection in a Maritime Environment. Master of Science Thesis, Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), Delft University of Technology, Delft, the Netherlands, 2012.
[2] M. Piccardi. Background subtraction techniques: a review. In IEEE International Conference on Systems, Man and Cybernetics, proceedings, vol. 4, pp. 3099-3104, The Hague, Netherlands, October 2004.
[3] D. Bloisi, L. Iocchi, M. Fiorini and G. Graziano. Camera Based Target Recognition for Maritime Awareness. In 15th International Conference on Information Fusion (FUSION), proceedings, pp. 1982-1987, Singapore, July 2012.