
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2019.2937076, IEEE Transactions on Vehicular Technology.

Video Foreground Extraction Using Multi-view Receptive Field and Encoder-Decoder DCNN for Traffic and Surveillance Applications
Thangarajah Akilan, Member, IEEE, Q.M. Jonathan Wu, Senior Member, IEEE, and
Wandong Zhang, Student Member, IEEE

Abstract—The automatic detection of foreground (FG) objects in videos is a demanding area of computer vision, with essential applications in video-based traffic analysis and surveillance. New solutions have attempted to exploit deep neural networks (DNN) for this purpose. In a DNN, learning agents, i.e., features, for video FG object segmentation is nontrivial, unlike in image segmentation. It is a temporally processed decision-making problem, where the agents involved are the spatial and temporal correlations of the FG objects and the background (BG) of the scene. To handle this, and to overcome the poor delineation of conventional DL models at the borders of FG regions caused by fixed-view receptive-field-based learning, this work introduces a Multi-view Receptive Field Encoder-Decoder Convolutional Neural Network called MvRF-CNN. The main contribution of the model is harnessing multiple views of convolutional (conv) kernels with residual feature fusions at early, mid, and late stages in an encoder-decoder (EnDec) architecture. This enhances the ability of the model to learn condition-invariant agents, resulting in highly delineated FG masks compared to the existing approaches, from heuristic- to DL-based techniques. The model is trained with sequence-specific labeled samples to predict scene-specific pixel-level labels of FG objects in near-static scenes with minute dynamism. An experimental study on 37 video sequences from traffic and surveillance scenarios that include complex environments, viz. dynamic background, camera jitter, intermittent object motion, scenes with cast shadows, night videos, and lousy weather, proves the effectiveness of the model. The study covers two input configurations: a 3-channel (RGB) single frame and a 3-channel double frame with a BG, such that two consecutive grayscale frames are stacked with a prior BG model. Ablation investigations are also conducted to show the importance of transfer learning (TL) and mid-fusion approaches for enhancing the segmentation performance and the model's robustness in failure modes: when there is a lack of manually annotated hard ground truths (HGT) and when testing the model on non-scene-specific videos. Overall, the model achieves a figure-of-merit of 95% and 42 FPS of mean average performance.

Keywords — Background subtraction, Encoder-decoder network, Foreground extraction, Transfer learning

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. The work is supported in part by the Canada Research Chair Program and the NSERC Discovery Grant. T. Akilan is with the Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada (e-mail: [email protected]). Q.M. Jonathan Wu and W. Zhang are with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada (e-mail: {jwu, zhang1lq}@uwindsor.ca). Manuscript received March 25, 2019; revised June 22, 2019; accepted August 19, 2019.

I. Introduction

IN Internet of Things (IoT)-based traffic and surveillance systems, the recorded videos are full of temporal redundancies. Besides, more than 70% of the pixels in every frame do not carry any useful information for high-level applications [1]. Such redundancy and unwanted data eventually affect the tracking or detection performance of the desired applications. Researchers have demonstrated that a reliable and effective scheme that can get rid of these issues and improve the results of video content analysis is background subtraction and foreground segmentation [1]–[12]. Recent advancements in Artificial Intelligence (AI), IoT, Intelligent Vehicle and Transportation Systems (IVTS), and surveillance cameras have shown that extracting clues of dynamic objects or foreground regions in video data plays a crucial role in urban traffic and surveillance video processing [2]–[6]. The extracted low-level foreground information from video sequences can be used for myriad high-level applications, including, but not limited to, vehicle behavior analysis [5], traffic monitoring, viz. vehicle, bicycle, and pedestrian detection [3], object counting [13], visual tracking [7], obstacle avoidance, autonomous driving, advanced driver-assistance systems (ADAS), and abandoned/removed object detection in the cabin of a Robo-Taxi [4].

Hence, FG segmentation has also become a vital subsystem in public and private video surveillance applications and other machine understanding problems, for instance, object segmentation [10], salient-content-preserved data coding for IoT [6], [10], image quality assessment [9], rediscovery of objects [11], [12], and human-robot/machine interaction [14].

The primary objective of FG extraction is to place a tight binary mask on the most probable salient regions, wherein moving objects, generally humans and vehicles, can be identified. Such an FG mask is more informative than a simple detection with a bounding box, as it closely localizes the objects. However, it has been an open problem, since the video data generated from mounted cameras pose some issues. The distributed monitoring devices in urban cities capture video sources in different scenarios, even in a constrained environment. It requires robust algorithms to extract the FG regions entirely under varying motion styles, like walking pedestrians, running bicyclists, and moving vehicles, and changing environmental factors, such as illumination and lighting changes, cast shadows, non-static BG scenes, night times, and
bad weather conditions [7], [8]. Thus, deriving a performant model for FG extraction becomes an intriguing task.

[Fig. 1: High-level FG extraction flow of the MvRF-CNN.]

Over the past three decades, much interest has been paid to automating the FG extraction process, from heuristic methods to neural network (NN)-based learning schemes; to name a few, Gaussian mixture models (GMM) [1], [15], Bayesian optimization-based background modeling [16]–[18], graph-based algorithms like Markov random fields (MRF) [2], and NN models like self-organizing maps (SOM) [19]. Recently, deep neural network (DNN)-based approaches, including convolutional neural networks (CNN) [20] and reinforcement learning (RL) [21]–[23], have also been successfully adopted for FG segmentation tasks. The main challenge in DCNN-based methods using a fixed-view receptive field is the dithering effect at the bordering pixels of FG objects and dealing with varying object sizes.

To overcome these challenges, this work proposes a new model inspired by the Inception module [24], which performs convolution with multiple filters of different scales on the same input, simulating human cognitive processes in perceiving multi-scale information, and by ResNet [25], which acts as a lost-feature recovery module. To improve the learning ability, we exploit intra-domain TL, which boosts the correct prediction of the FG pixels. Thus, the key insight of this work is to propose a CNN that improves feature learning for better FG object/region identification based on novel DL strategies.

The contribution of the work is tri-fold: application side, architectural side, and experimental side.

i. Application side: Traffic monitoring and surveillance are crucial for urban planning to reduce congestion and enhance the safety of all kinds of users. This work focuses on an encoder-decoder model for dealing with this problem based on a DNN. Figure 1 depicts the ideal four-step process of the model on the application side. The application of this work is very important for various areas of intelligent vehicles, transportation systems, and smart cities.

ii. Architectural side: The proposal is a multi-view receptive field fully convolutional neural network. From an architectural perspective, the key contribution of the model is harnessing multiple views of convolutional (conv) kernels with residual feature fusions at early, mid, and late stages in the EnDec architecture. This enhances the ability of the model to learn condition-invariant agents, resulting in highly delineated FG masks compared to the existing approaches, from heuristic- to DNN-based techniques.

iii. Experimental side: Extensive ablation studies are carried out to support the benefits of the architectural improvements introduced by this work for video FG extraction. Here, the network is trained end-to-end with FG-BG segmentation samples to predict the most likely FG objects in a given input sequence. The investigations include qualitative and quantitative analyses that can be summarized as follows. (a) The model is trained with sequence-specific labeled samples to predict scene-specific pixel-level labels of FG objects in near-static scenes with minute dynamism in the BG. It covers thirty-seven video sequences from traffic and surveillance cases that include complex environments, viz. dynamic background, camera jitter, intermittent object motion, scenes with cast shadows, night videos, and lousy weather. (b) The study considers two input configurations: a 3-channel (RGB) single frame and a 3-channel double frame with a prior background, such that two consecutive grayscale frames are stacked with a BG model. (c) Ablation investigations are also conducted to show the importance of the multi-view receptive field, transfer learning, and mid-fusion approaches for enhancing the performance and the model's robustness in failure modes: when there is a lack of manually annotated hard ground truths (HGT) and when testing the model on non-scene-specific videos.

The rest of this paper is organized as follows. The literature review in Section II provides sufficient information on FG extraction. The proposed model is described in Section III. Section IV elaborates the experimental setup and results with discussion. Finally, Section VII concludes the paper with directions for future work.

II. Literature Review

A. CNNs for Segmentation

DCNNs have shown state-of-the-art performance over traditional approaches for computer-vision-based applications, like object segmentation [26], 3D face pose estimation [27], and image retrieval/restoration [28]. Here, the fully convolutional network (FCN) is used effectively for semantic segmentation [29]. It enhances pixel predictions through feature-level augmentation with a skip network that fuses the feature hierarchy to combine deep, coarse, semantic information and shallow, fine, appearance information from selected mid-level layers [29]. It favours having fewer filters in the convolutional layers while carrying the previous layer's features forward intact. In contrast, the MvRF-CNN performs coarse-level feature fusion inspired by ResNet [30], as shown in Fig. 2-(a). The features are fused as H(X) = F(X) \oplus X, where \oplus denotes depthwise feature-map concatenation. It is built upon the intuition of increasing the depth of the NN rather than widening it, via residual feature flows, to provide a rich data representation. Following the success of ResNet, the Inception module was introduced by Szegedy et al. [30]. It is a kind of micro-architecture that computes multiple features through different-scale convolution and average-pooling operations on the same input, where all the conv and pooling branches maintain the same spatial dimension as the output of the previous layer by using a stride of 1 (S1).

Fig. 2: (a) The ResNet-like and (b) the Inception-like feature embedding exploited in MvRF-CNN.
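To make the two fusion styles of Fig. 2 concrete, the following minimal Keras sketch illustrates a concatenation-based residual fusion, H(X) = F(X) \oplus X, and a multi-branch block with 3 × 3, 5 × 5, and 9 × 9 views. It is an illustration under assumed filter counts and is not the authors' released implementation.

```python
# Sketch of the two feature-embedding styles in Fig. 2 (TensorFlow/Keras).
# Filter counts and stride settings are illustrative assumptions.
from tensorflow.keras import layers

def concat_residual(x, filters=32):
    """Fig. 2-(a): ResNet-like fusion H(X) = F(X) (+) X via depthwise concatenation."""
    f = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    f = layers.Conv2D(filters, 3, padding='same', activation='relu')(f)
    return layers.Concatenate(axis=-1)([f, x])   # channel concatenation instead of addition

def multi_view_block(x, filters=32, stride=1):
    """Fig. 2-(b): Inception-like parallel 3x3 / 5x5 / 9x9 views, fused along channels.
    Unlike the classic Inception module, the stride may be set > 1 as long as the
    branch outputs keep matching spatial dimensions."""
    b3 = layers.Conv2D(filters, 3, strides=stride, padding='same', activation='relu')(x)
    b5 = layers.Conv2D(filters, 5, strides=stride, padding='same', activation='relu')(x)
    b9 = layers.Conv2D(filters, 9, strides=stride, padding='same', activation='relu')(x)
    return layers.Concatenate(axis=-1)([b3, b5, b9])
```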

Thus, all the outcomes are combined along the channel dimension to generate a multi-level feature representation. However, our module allows a stride rate > 1 in each branch and performs the feature concatenation whenever the spatial dimensions of the branches match, as in the example shown in Fig. 2-(b). Consequently, the FCN has been extended into an encoder-decoder NN named U-net for an application of bio-medical cell segmentation [31], wherein the activation maps at the encoding stage are combined with the activation maps at the decoding phase.

The proposed MvRF-CNN inherits the following variations from the U-net:
• To capture spatio-temporal agents, the model's input layer is configured as 3-channel, expecting two consecutive frames and a scene-specific prior background.
• To handle object size variations, due to objects coming towards the cameras or going towards vanishing points and camera view changes due to mount jitteriness, the model harnesses multi-view receptive field feature fusions (micro-inception modules) at the early, mid, and late stages.
• U-net uses a conventional subsampling layer with a max-pooling operation; however, this takes a toll on the segmentation accuracy [32]. To address this, the proposed network carries out the subsampling process through 2D convolution with a stride rate of 2, a kernel size of 3 × 3, and zero padding, as sketched below.
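As a concrete illustration of the last point, the following sketch contrasts max-pooling-based subsampling with the learnable stride-2, zero-padded 3 × 3 convolution adopted here. The filter count is an assumption.

```python
# Subsampling alternatives (TensorFlow/Keras); the filter count of 32 is an assumption.
from tensorflow.keras import layers

def downsample_maxpool(x):
    # U-net style: 3x3 convolution followed by 2x2 max-pooling.
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    return layers.MaxPooling2D(pool_size=2)(x)

def downsample_strided(x):
    # MvRF-CNN style: subsampling learned by a stride-2, 3x3, zero-padded convolution.
    return layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(x)
```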
B. Video Foreground Segmentation: From Conventional Models to Deep Neural Networks

Moving object segmentation in videos is a more challenging task than image segmentation or frame-level object segmentation, since it has to handle the variations and the dynamism in the background. Thus, most of the learning-based solutions are scene-specific modelings, and they perform well since changes in the background context are limited in a specific scene [33]–[35]. However, some attempts have been made at non-scene-specific FG segmentation as well. The basic model in this category is the frame-differencing-based approach [3], [36], [37]. It takes the absolute pixel difference between two adjacent frames and applies a threshold to segment the salient part of the scene. The result of this technique may not necessarily be the moving object, but rather an illumination change between the two frames. Improved versions of this approach use an adaptive strategy, whereby a few past frames are aggregated through heuristic analysis, like running average, normal distribution, and median filtering. Then, the aggregated frame is subtracted from the current frame, followed by a thresholding procedure as described earlier. The number of frames aggregated varies from three to twelve, depending on the real-time requirement of the surveillance system [37], [38]. With time, researchers introduced advanced online background estimation and subtraction methods, such as [39], [38], [40], and [41]. These methods speed up the computation by updating the model parameters incrementally, one frame at a time, on the fly; however, they are prone to low performance when the initial condition is not optimal.

To tackle this problem, scene-specific initialization using collected prior information of the surveillance region was introduced. Here, samples are collected from the specific zone before the actual surveillance operation [1], [15], [33], [42]. The same phenomenon holds for DNN-based solutions as well. Recent studies have shown that different video surveillance conditions contain various types of BG changes that hinder the learning ability of DNNs [4], [34], [35], [43]. To overcome this, Minematsu et al. [43] suggest an integrated model that combines multiple scene-specific networks. Nonetheless, such integration results in a large number of parameters to be trained and low inference speed. Thus, having scene-specific models for moving object segmentation is a plausible solution for traffic video surveillance, since the cameras are fixed on near-static mounts.

In this direction, Zhang et al. [44] develop an NN-based model that has a stacked denoising auto-encoder (SDAE) learning network and a binary scene modeling based on density analysis, whereby the SDAE encodes the essential structural information of the scene. The encoded features of image patches are then hashed in Hamming space, and a hash-based binary scene is modeled by density analysis, which captures the temporally distributed spatial information. Similarly, Zhao et al. [19] exploit a stacked multi-layer SOM, wherein the initial training is carried out using some BG samples; then, during the FG detection for a test sample, the BG model is maintained through online updates. Gemignani and Rozza [45] improve the basic SOM model with a self-balancing multi-layered network that tracks long-time pixel dynamics for better FG extraction. The authors in [46] approach BG modeling as an evidence collection of each pixel in a scene with a weighted sample-based method.
[Fig. 3: Layer connectivity diagram of the proposed MvRF-CNN. Legend: input; Conv/ConvT with kernel sizes of 3, 5, and 9; concatenation-based feature fusion; dropout; data/feature flow. The description of each layer (L<ID>) with its respective inputs and output dimension is given in Table I.]

They also use a minimum-weight and a reward/penalty scheme that takes into account sudden changes in the scene, in such a way that the most irrelevant sample is replaced instead of the oldest or a random one. Then, the FG/BG classification is carried out as an application of a threshold. Although their method has computational speed, they record poor accuracy of FG detection. On the other hand, Varadarajan et al. [47] come up with an algorithm where the spatial relationship between neighboring pixels is considered during classification through a region-based GMM, in contrast to the traditional GMM that models the distributions at pixel level [1], [15].

Local features, like the local binary pattern (LBP), are also used for BG modeling. For instance, Charles et al. [48] coined a system, the Self-Balanced SENsitivity SEgmenter (SuBSENSE), that adopts the local binary similarity pattern (LBSP) as additional features to pixel intensities in a non-parametric BG model. The pixel-level BG model is updated using feedback loops. Likewise, the authors in [49] utilize the LBP with a local singular value decomposition (SVD) operator to extract invariant-feature selection. They then use the SAmple CONsensus (SACON) model for creating the BG based on statistics of the past 300-frame pixel process, and apply the Hamming distance measure, similar to [44], with a predetermined threshold value to separate the FG pixels. These models demand static and clean samples for creating a background dictionary; thus, they lack suitability for real-time applications. Similarly, Allebosch et al. [50] also utilize local cues, namely RGB color intensity values and local ternary pattern (LTP)-based edge descriptors. They form two backgrounds and create two FG masks. Then, using a pixel-wise AND operation, they refine the primarily detected FG masks and obtain a rectified FG region.

Meanwhile, DeepBS [20] exploits a conventional CNN, trains the network with randomly selected video frames and their corresponding patch-based ground truth segmentation samples, like in [44], and carries out a post-processing stage that performs spatial median filtering on the outputs. The downfall of their method is the random selection of frames for training, the dependency of background library generation on external algorithms (SuBSENSE [48] and Flux Tensor [51]), and the patch-based processing. The random selection creates the possibility of a frame being in the training set and a temporally nearest (next or previous) frame being in the test set. It literally results in the model memorizing the sequences, rather than being trained to predict unknown scenarios. Hence, relying on complex external algorithms for BG estimation should not be required for deep learning. On the other hand, patch-based training and prediction is a slow processing strategy and is not applicable for real-time traffic and surveillance applications. Besides, they only tested the model on a mere 20 frames per video, while taking 90% of the samples for training. The proposed MvRF-CNN overcomes these shortcomings with whole-frame processing, temporal-median-filtering-based BG generation, and temporally exclusive training and test samples.

The LSTM-based CNN model [4] attempts to handle the spatio-temporal agents between moving objects and background. This model requires 4 consecutive frames, including the current one, to extract the FG region. Although the model achieves an average of 94% FoM, it suffers a low throughput of 24 FPS due to computationally heavy LSTM modules integrated with 3D convolutions (3D Conv). To subdue this, the proposed model exploits a background prior and a previous frame stacked with the current frame under a grayscale setting, and it avoids using 3D Conv and LSTM modules. This approach produces on average 1.75× the FPS and 1% more FoM than the 3D Conv-LSTM method in [4].

Although many ideas have been introduced over the past few decades, due to the challenging nature of FG extraction there is not a single method that can be claimed as the ultimate solution. Therefore, Bianco et al. [52] attempt to harness multiple state-of-the-art FG extraction algorithms under one umbrella. They use genetic programming (GP) to obtain a solution tree. Likewise, Sajid et al. [53] introduce a multi-framework that computes a background model bank (BMB) with multiple BG models. Then, to extract the FG, they use spatial de-noising based on Mega-Pixel (MP) to pixel-level probability and employ a fusion technique to define a refined FG region. Nonetheless, these models also cannot be the universal model for the FG identification problem.


III. Proposed MvRF-CNN Architecture

A. Network Formation

The approach to modeling the network is based on intuition and an ablation study.

I. Purpose: To have a scene-specific model for moving object detection. The model is to be trained on manually delineated FG (moving objects) from a set of training frames; then it should automatically identify the FG objects in the remaining frames of the same video sequence. The segmentation results must be sufficiently accurate that they do not need any post-processing.

II. Input dimension: Firstly, we determine an appropriate dimension of the input layer through analytical reasoning. The target dataset considered for this study, tabulated in Table III, has various frame sizes (width × height) ranging from 320 × 240 to 720 × 576, with a median of 320 × 240. This median value is chosen as the input layer dimension, and all the samples are resized to this dimensionality.

III. The structure: Table I provides detailed information on the layer connectivity patterns, kernel sizes, interconnected activators, and output dimensions, while Fig. 3 presents an intuitive schematic of the proposed MvRF-CNN. It subsumes two learning phases, encoding and decoding, like any other encoder-decoder (EnDec) model used in neural network and learning systems.

TABLE I: Layer detail and connectivity pattern of the proposed MvRF-CNN.
Layer ID | Layer type A(k, s) | Output shape [b, H, W, D] | Input
Input | Input layer | [b, 240, 320, 3] | mini-batch
Encoding phase:
L1 | Conv (3, 1) → ReLU | [b, 240, 320, 32] | Input
L2 | Conv (5, 1) → ReLU | [b, 240, 320, 32] | Input
L3 | Conv (9, 1) → ReLU | [b, 240, 320, 32] | Input
L4 | Conv (3, 2) → ReLU | [b, 120, 160, 32] | L1
L5 | Conv (3, 2) → ReLU | [b, 60, 80, 32] | L4
L6 | Conv (5, 4) → ReLU | [b, 60, 80, 32] | L2
L7 | Cat | [b, 60, 80, 64] | L5, L6
L8 | Conv (3, 2) → ReLU | [b, 30, 40, 32] | L7
L9 | Conv (3, 2) → ReLU | [b, 30, 40, 32] | L6
L10 | Cat | [b, 30, 40, 96] | L8, L9, L11
L11 | Conv (9, 8) → ReLU | [b, 30, 40, 32] | L3
L12 | Conv (3, 2) → ReLU | [b, 15, 20, 32] | L10
L13 | Conv (5, 4) → ReLU | [b, 15, 20, 32] | L6
L14 | Cat | [b, 15, 20, 96] | L12, L13, L15
L15 | Conv (3, 2) → ReLU | [b, 15, 20, 32] | L11
L16 | Conv (5, 1) → ReLU | [b, 15, 20, 32] | L17
L17 | Conv (3, 1) → ReLU | [b, 15, 20, 64] | L14
L18 | Conv (3, 1) → ReLU | [b, 15, 20, 32] | L17
Decoding phase:
L19 | ConvT (3, 2) → ReLU | [b, 30, 40, 32] | L17
L20 | Cat | [b, 30, 40, 128] | L10, L19
L21 | Drop out (0.3) | — | L20
L22 | Conv (3, 1) → ReLU | [b, 30, 40, 32] | L21
L23 | ConvT (5, 4) → ReLU | [b, 60, 80, 32] | L16
L24 | ConvT (3, 2) → ReLU | [b, 60, 80, 32] | L22
L25 | ConvT (3, 2) → ReLU | [b, 30, 40, 32] | L18
L26 | Conv (5, 1) → ReLU | [b, 60, 80, 32] | L23
L27 | Cat | [b, 60, 80, 128] | L7, L24, L26
L28 | Conv (9, 1) → ReLU | [b, 30, 40, 32] | L25
L29 | ConvT (5, 4) → ReLU | [b, 240, 320, 32] | L26
L30 | Drop out (0.3) | — | L27
L31 | ConvT (9, 8) → ReLU | [b, 240, 320, 32] | L28
L32 | Conv (3, 1) → ReLU | [b, 60, 80, 32] | L30
L33 | ConvT (3, 2) → ReLU | [b, 120, 160, 32] | L32
L34 | Cat | [b, 120, 160, 64] | L4, L33
L35 | Drop out (0.3) | — | L34
L36 | Conv (3, 1) → ReLU | [b, 120, 160, 32] | L35
L37 | ConvT (3, 2) → ReLU | [b, 240, 320, 32] | L36
Top:
L38 | Cat → BN | [b, 240, 320, 96] | L29, L31, L37
L39 | Drop out (0.3) | — | L38
L40 | Conv (3, 1) → ReLU | [b, 240, 320, 128] | L39
L41 | Conv (3, 1) → f(·) | [b, 240, 320, NC] | L40
Total number of trainable parameters: 864,385
A(k, s): A - type of convolutional operation, k - kernel size, s - stride rate. Output shape [b, H, W, D]: b - mini-batch size, H - height, W - width, D - number of channels. f(·) - classifier (Sigmoid), NC - number of output channels.

Encoding phase - It consists of nineteen layers, beginning from the input layer to the subsampling Conv where the spatial dimension of the feature maps reaches 15 × 20; at this stage the encoding process completes. Note that it is important to terminate the encoding phase when one of the feature-map dimensions, either width or height, reaches an odd integer value. Otherwise, the decoding process through up-sampling with transposed convolution may not reconstruct feature maps spatially aligned with the intermediate feature maps generated in the encoding phase, resulting in complex feature-level fusions. Since the smallest kernel in this work is 3 × 3, we stop the encoding sub-network when the feature-map dimension becomes 15 × 20.

Decoding phase - It is initiated by a ConvT layer (L19) that up-samples the output of the encoding sub-network and subsequently ends at the last ConvT layer (L37), producing a feature-map dimension of 240 × 320. Similar to the encoding phase, there are nineteen layers networked to complete the up-sampling process.

Fusion of feature flows - The MvRF-CNN integrates three sub-networks: a pivotal feature flow (PFF) and two complementary feature flows (CFF-1 and CFF-2). The kernel sizes for the sub-networks are determined empirically. This work finds that filter sizes of 5 × 5 and 9 × 9 for the two CFFs perform robustly when a kernel size of 3 × 3 is used in the PFF. The PFF is essentially an EnDec network that takes the following layers: L1, L4, L5, L7, L8, L10, L12, L14, L17, L19 - L22, L24, L27, L30, and L32 - L41. This sub-network performs the core operations, i.e., down-sampling and up-sampling, using 3 × 3 Conv filters. The learning ability of this sub-network is complemented by CFF-1 and CFF-2, as their key operations employ different receptive fields, as stated earlier. This enhances the robustness of the model by learning scale-invariant FG agents. Here, CFF-1 includes the layers L2, L6, L9, L13, L16, L23, L26, and L29, while CFF-2 consists of the layers L3, L11, L15, L18, L25, L28, and L31. The agents learned throughout the CFF sub-networks are used for feature-level augmentation under three principles: early fusion, mid fusion, and late fusion. The early fusion occurs in layers L7, L10, and L14. The mid fusion happens in layers L20 and L27. Finally, the late fusion is set in layers L34 and L38. Notably, the mid and late fusions are facilitated by two mini decoders placed in the CFFs, consisting of L16, L23, L26, and L29 in CFF-1 and L18, L25, L28, and L31 in CFF-2. These mini decoders also perform two levels of the up-sampling process, besides the PFF. The feature-level fusions at every level are carried out through residual feature forwarding from the encoding phase to the decoding phase. There are seven such layers in total, as described earlier.
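A minimal sketch of how the early layers of Table I wire the three receptive-field views together (PFF with 3 × 3 kernels, CFF-1 with 5 × 5, CFF-2 with 9 × 9) is given below. It follows rows L1-L11 of Table I only approximately and is not the authors' implementation; filter counts and the Keras API usage are the only assumptions beyond the table itself.

```python
# Approximate sketch of Table I rows L1-L11 (TensorFlow/Keras).
from tensorflow.keras import Input, Model, layers

inp = Input(shape=(240, 320, 3))                                    # 3-channel input
l1 = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)   # L1: PFF,   3x3, stride 1
l2 = layers.Conv2D(32, 5, padding='same', activation='relu')(inp)   # L2: CFF-1, 5x5, stride 1
l3 = layers.Conv2D(32, 9, padding='same', activation='relu')(inp)   # L3: CFF-2, 9x9, stride 1

l4 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(l1)   # L4: 120x160
l5 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(l4)   # L5: 60x80
l6 = layers.Conv2D(32, 5, strides=4, padding='same', activation='relu')(l2)   # L6: 60x80
l7 = layers.Concatenate()([l5, l6])                                            # L7: early fusion, 64 ch

l8  = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(l7)  # L8: 30x40
l9  = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(l6)  # L9: 30x40
l11 = layers.Conv2D(32, 9, strides=8, padding='same', activation='relu')(l3)  # L11: 30x40
l10 = layers.Concatenate()([l8, l9, l11])                                      # L10: early fusion, 96 ch

encoder_stage = Model(inp, l10)   # the remaining encoder/decoder layers continue analogously
```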


The residual feature forwarding is the reuse of activations of a downstream layer to amplify the features in an upstream layer while skipping a single intermediate layer or a set of such layers [25], [54], [55]. It functions well when a single non-linear layer is stepped over, or when the set of middle layers are all linear. Otherwise, an explicit weight matrix should be learned for the skipped connection. The amplification can be in the form of a mathematical operation, like addition, or a channel expansion by concatenation, as used in this work. An example of the residual connection is shown in Fig. 2-(b). Thus, the fusion reuses feature maps from encoding layers that hold high-frequency detail for finer FG boundaries [54], [56], [57]. In this way, the heterogeneous Conv kernels used in the sub-networks and the feature-level augmentation carried out via residual connections capture scale-invariant, intricate information. Hence, the MvRF-CNN learns a well-generalized representation of FG objects based on local and global contextual cues.

In summary, the MvRF-CNN contains 864,385 trainable parameters that subsume convolution (conv), transpose convolution (convT), concatenation (cat), dropout, and batch normalization (BN) layers. The encoding phase includes 18 layers that end the sub-sampling when the spatial dimension reaches 15 × 20. It is immediately followed by the decoding phase that systematically up-samples the encoded feature maps back to the same spatial dimension as the input. The symmetric decoding path contains 19 layers that consist of inception- and residual-feature fusions to capture high-resolution contextual cues of the FG objects. The network does not have any max-pooling or hidden densely connected layers. It accepts input frames with any spatial dimensions and resizes them to 240 × 320 using nearest-neighbor re-scaling.

B. Convolutional Layer

The heart of a CNN is the convolutional operation, where the filter weights function like a dictionary of visual patterns or system memory. With respect to a kernel ω, a bias vector b, and an input image/patch x, it is computed as

C(m, n) = b + \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} \omega(k, l) * x(m + k, n + l),   (1)

where *, K, {m, n}, and {k, l} denote, respectively, the conv operation, the filter size, the origin of the image patch, and the kernel's element index [58].

The transpose convolution (convT) layers, on the other hand, perform the up-sampling process while maintaining the connectivity pattern. In contrast to the standard image resizing operation, the convT has trainable parameters, and they are updated during training. It is achieved by inserting zeros between consecutive neurons in the receptive field of the input, then sliding the conv kernel with unit strides [59].

C. Optimizer

The MvRF-CNN is trained with the Adam optimizer, which minimizes the binary cross-entropy loss E defined by (2); the optimizer takes a base learning rate of 0.0002 with a learning rate scheduler that reduces the learning rate by a factor of 0.8 over the training.

E(\hat{p}) = -\frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ p_{n,m} \log \hat{p}_{n,m} + (1 - p_{n,m}) \log(1 - \hat{p}_{n,m}) \right],   (2)

where N is the number of samples in the mini-batch, M is the number of pixels (H × W) in the unrolled image, and \hat{p}_{n,m} is the output from the final layer of the network (layer 41 in Table I) for the m-th location in the n-th image. \hat{p}_{n,m} maps the FG pixel probabilities using the Sigmoid function (σ(x_{n,m}) ∈ [0, 1]), and the target p_{n,m} ∈ [0, 1] is a supervised signal that is set to 0 for the BG class and 1 for the FG class. Both \hat{p}_{n,m} and p_{n,m} have the same dimensionality.

D. Transfer Learning

To improve the network's generalization calibre, intra-class domain transfer is incorporated. Here, a network is trained on a source domain, then its weights are used to initialize a fresh model in the target domain. Thereafter, all the layers of the new model are fine-tuned end-to-end via back-propagation. Table II lists some sample video sequence pairs used to fine-tune the scene-specific MvRF-CNN model. For example, the network pre-trained on Turnpike_0_5fps is retrained for Highway. Theoretical and philosophical expositions of the transfer learning strategy can be found in [60] and [61].

TABLE II: Sample pairs used for transfer learning (fine-tuned for ← transferred from):
Highway ← Turnpike_0_5fps; Office ← CopyMachine; Canoe ← Boats; Boats ← Canoe; Overpass ← Pedestrians; Traffic ← Highway; Skating ← SnowFall; Boulevard ← Highway; CopyMachine ← Office; PeopleInShade ← Pedestrians; Turnpike_0_5fps ← Highway; Boulevard ← Turnpike_0_5fps; Fall ← Blizzard; SnowFall ← Blizzard.

E. Binary FG Mask

To transform the salient or probability map generated by the MvRF-CNN into a binary mask, we apply a dataset-specific global threshold in the range [0.05, 0.75] and the Otsu-based automatic segmentation method. The noisy artifacts in the binary mask are then cleaned by a neighbourhood-pixel-connectivity analysis that removes regions with fewer than 50 pixels. Otsu's procedure iteratively finds a threshold that lies between the two peaks of the intensity histogram of a bi-modal image such that the intra-class variances of the FG and BG classes are minimum. An explicit derivation of the method can be found in [62].
Fig. 4: General framework of the proposed foreground extraction model.
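The training configuration of Sections III-C and III-D (Adam with a base learning rate of 0.0002, binary cross-entropy, a scheduler shrinking the learning rate by a factor of 0.8, and intra-domain weight transfer) can be sketched as follows. The scheduler patience, epoch counts, file names, and the model constructor are assumptions, not the authors' published settings.

```python
# Sketch of the training setup in Sections III-C and III-D (TensorFlow/Keras).
# Scheduler patience, epoch counts, and file paths are illustrative assumptions.
import tensorflow as tf

def compile_model(model):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
                  loss='binary_crossentropy')   # eq. (2), averaged over pixels and batch
    return model

# Reduce the learning rate by a factor of 0.8 when the monitored loss plateaus.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                   factor=0.8, patience=3)

# Step 3.1, random initialization: train on the source sequence and keep the best model.
#   model = compile_model(build_mvrf_cnn())      # build_mvrf_cnn() is hypothetical
#   model.fit(src_train, validation_data=src_val, epochs=n_epochs, callbacks=[lr_schedule])
#   model.save_weights('source_scene.h5')

# Step 3.2 / Section III-D, transfer learning: initialize a fresh model with the source
# weights and fine-tune all layers end-to-end on the target sequence (e.g., Turnpike -> Highway).
#   target = compile_model(build_mvrf_cnn())
#   target.load_weights('source_scene.h5')
#   target.fit(tgt_train, validation_data=tgt_val, epochs=m_epochs, callbacks=[lr_schedule])
```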


F. Algorithmic Summary

The entire process of the proposed FG extraction model, from training to inferencing, can be summarized as shown in Fig. 4.
1. Video sequences: This is the collection of scene-specific raw data. The details of each sequence are listed in Table III.
2. Preprocessing: It consists of scene-specific prior BG model estimation and dataset splitting (refer to Section IV-A-II), as depicted in Fig. 5.
2.1. The BG model b0 is computed through temporal median filtering over the first k frames of the video sequence. Note that the temporal median filter is an effective tool that has been reported to outperform average-filtering-based approaches for background estimation [3].
2.2. The training set is rearranged such that, at time t, the input takes two consecutive frames ft and ft−1 in grayscale and stacks them depth-wise, keeping b0 as the last channel (input = [ft; ft−1; b0]), with respect to the ground truth G.Tt, as depicted in Fig. 5-b.
3. Training MvRF-CNN: The training is carried out as described in Section III-C.
3.1. Random initialization - Train the network with randomly initialized layers drawn from a normal distribution N(µ = 0, σ² = 0.01) for n epochs and store the best model.
3.2. Retrain - Initialize the model with the weights learned in 3.1 and fine-tune it for another video sequence for m epochs, storing the best model (refer to Section III-D).
4. Testing MvRF-CNN:
4.1. Test the trained model on the exclusive test set described in Section IV-A and store the FG sigmoid probability score maps.
4.2. Convert the score maps into binary fore-/background segmentations as elaborated in Section III-E and compute the figure of merit (f-measure) using eqn. (3).

IV. Experimental Setup, Results, and Discussion

A. Dataset

I. Summary: The ablation study is carried out on the change detection net (CDnet) 2014 benchmark database for FG detection research [42]. This database consists of eleven diversified video categories, each containing 4 to 6 sequences. From that, this study considers 37 videos, as summarized in Table III.

TABLE III: 37 video sequences used in the ablation study.
Dataset | Frame size (W × H) | Video category | N frames (a) | N frames for BG (b)
Highway (HW) | 320 × 240 | BaseLine (BL) | 1229 | 520
Office (OF) | 360 × 240 | BaseLine (BL) | 1447 | 580
Pedestrians (PD) | 320 × 240 | BaseLine (BL) | 598 | 300
PETS2006 (PE) | 720 × 576 | BaseLine (BL) | 989 | 299
Canoe (CA) | 320 × 240 | Dynamic Background (DB) | 342 | 840
Boats (BO) | 320 × 240 | Dynamic Background (DB) | 6026 | 1930
Fountain01 (F1) | 432 × 288 | Dynamic Background (DB) | 148 | 500
Fountain02 (F2) | 432 × 288 | Dynamic Background (DB) | 187 | 500
Overpass (OV) | 320 × 240 | Dynamic Background (DB) | 440 | 2330
Fall (FA) | 720 × 480 | Dynamic Background (DB) | 560 | 1467
Traffic (TR) | 320 × 240 | Camera Jitter (CJ) | 609 | 900
Boulevard (BV) | 320 × 240 | Camera Jitter (CJ) | 1004 | 790
Badminton (BM) | 720 × 480 | Camera Jitter (CJ) | 350 | 799
Sidewalk (SW) | 320 × 240 | Camera Jitter (CJ) | 400 | 799
AbandonedBox (AB) | 432 × 288 | Intermittent Object Motion (IOM) | 2050 | 2499
Parking (PA) | 320 × 240 | Intermittent Object Motion (IOM) | 489 | 1306
StreetLight (SL) | 320 × 240 | Intermittent Object Motion (IOM) | 2140 | 179
Sofa (SO) | 320 × 240 | Intermittent Object Motion (IOM) | 2240 | 499
TramStop (TS) | 432 × 288 | Intermittent Object Motion (IOM) | 1879 | 1320
WinterDriveway (WD) | 320 × 240 | Intermittent Object Motion (IOM) | 1499 | 999
CopyMachine (CM) | 720 × 480 | ShaDow (SD) | 1401 | 499
Backdoor (BD) | 320 × 240 | ShaDow (SD) | 531 | 365
Bungalows (BG) | 320 × 240 | ShaDow (SD) | 531 | 365
BusStation (BS) | 320 × 240 | ShaDow (SD) | 830 | 299
Cubicle (CC) | 320 × 240 | ShaDow (SD) | 531 | 365
PeopleInShade (PP) | 380 × 244 | ShaDow (SD) | 558 | 280
Corridor (CO) | 320 × 240 | THermal (TH) | 1490 | 570
Library (LI) | 320 × 240 | THermal (TH) | 3600 | 860
Park (PK) | 352 × 288 | THermal (TH) | 279 | 249
IntermittentPan (IP) | 570 × 368 | PTZ camera (PTZ) | 270 | 1198
FluidHighway (FH) | 700 × 450 | Night Video (NV) | 456 | 399
TramStation (TN) | 480 × 295 | Night Video (NV) | 1249 | 499
WinterStreet (WS) | 624 × 420 | Night Video (NV) | 439 | 899
Turnpike_0_5fps (TP) | 320 × 240 | Low Framerate (LF) | 349 | 799
Blizzard (BL) | 720 × 480 | Bad Weather (BW) | 1805 | 899
Skating (SK) | 540 × 360 | Bad Weather (BW) | 806 | 899
Snowfall (SF) | 720 × 480 | Bad Weather (BW) | 656 | 660
(a) Number of frames considered with GTs, in which both the FG and BG are present in the same frame. (b) The number of initial frames that do not have GTs, used to estimate a scene-specific BG. The row markers of the original table distinguish traffic-related from in-/outdoor surveillance-related videos.

The baseline benchmark represents a mixture of mild challenges, like subtle background motion, isolated shadows, swaying tree branches, and natural illumination changes. The dynamic background category has scenes with strong (parasitic) BG motion, like a shimmery water body and swaying tree branches in the fall. The camera jitter sequences contain outside videos captured by cameras vibrating due to strong wind or unstable camera mounts; the jitter magnitude differs from one frame to another. The intermittent object motion class includes scenes with background objects moving away, abandoned items, and objects stopping for a short while and then going away. The shadow category consists of indoor and outdoor videos exhibiting sharp as well as faint cast shadows. Similarly, the thermal indoor and outdoor sequences contain videos shot by a thermal camera.

In the PTZ camera recordings, the motion of the camera changes the background. It breaks the assumption that the recording devices are relatively static; thus, it is a challenging condition for the FG extraction task.

Fig. 5: Sequence-specific training and test sets: (a) temporally exclusive training and test datasets, and (b) input data configuration: G.T - ground truth, I - input frame, Med(·) - temporal median filtering, b0 - the pre-computed BG model, ft - the current scene at time t, and ft−1 - the previous scene at time t − 1.
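A minimal sketch of the input preparation of Fig. 5-b (temporal median filtering over the initial frames to obtain b0, then stacking [ft; ft−1; b0]) is shown below; function and variable names are assumptions.

```python
# Sketch of the Fig. 5-b input configuration; names and scaling are illustrative.
import numpy as np

def estimate_bg(frames_gray):
    """b0: pixel-wise temporal median over the initial frames that have no ground truth."""
    return np.median(np.stack(frames_gray, axis=0), axis=0).astype(np.uint8)

def make_input(f_t, f_t_minus_1, b0):
    """3-channel input = [f_t; f_{t-1}; b0], each a (H, W) grayscale image."""
    x = np.stack([f_t, f_t_minus_1, b0], axis=-1)   # shape (H, W, 3)
    return x.astype(np.float32) / 255.0             # scale to [0, 1]

# Example usage:
#   b0  = estimate_bg(first_k_frames)
#   x_t = make_input(frame_t, frame_t_minus_1, b0)
```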

This category of videos is not suitable for the double-frame-based experiments, since the viewpoint of the sequence changes from time to time as the cameras span around. So the motion detail captured by taking two consecutive frames along with a generalized BG model will be entirely different when the camera pans from one point to another. On the other hand, the night video set includes traffic flow videos shot at night; there is strong lighting variation in all the sequences in this category. Likewise, the low frame-rate category consists of traffic flow scenes shot at a low frame rate. Lastly, the bad weather set contains videos shot in adverse snow conditions.

It is worth mentioning that, among the thirty-seven videos, more than ten contain continuous dynamism in the background due to vibrating camera mounts, swaying leaves, a shimmery water body, fluid dynamics, jumping water, non-monotonic snowfall, a flying flag, and so forth. For instance, such minute dynamism can be observed in the Highway (HW), Canoe (CA), Boats (BO), Fountain01 (F1), Fountain02 (F2), Overpass (OV), Fall (FA), Parking (PA), Blizzard (BL), Skating (SK), and Snowfall (SF) datasets.

II. Temporally exclusive training and test sets: The CDnet 2014 datasets were particularly arranged for the traditional background subtraction task. Here, the task is defined as follows: a model has to be trained on a set of training samples from a sequence, and then the trained model must automatically label the rest of the frames of the same sequence with adequate accuracy. Thus, the CDnet provides a minimum of 299 initial frames without ground truths (refer to Table III) for BG formation, targeting the conventional models (e.g., GMM), and the rest of the frames are provided with hand-segmented foreground objects. From this, creating appropriate training and test datasets is crucial, as the performance of deep learning models hugely depends on the correctness of the training set. For that, in the literature [20], the authors create a training set using 150 randomly collected frames from every video in the CDnet 2014 and extract patches with a size of 48 × 48 at a stride of 10. Then, they test it on very few samples, a set of 20 frames for each sequence. Such dataset preparation exposes two downfalls.
i. A random selection of frames may assign a frame ft to the training set and a temporally closest frame, like ft+1 or ft−1, to the test set. Consequently, there can be many such samples in the training and test sets, resulting in mere exclusiveness. The random selection is acceptable for an object recognition or classification task; however, for video foreground segmentation it is a questionable approach, since the test sequence has to be temporally unknown, i.e., excluded from the training set.
ii. The training to test set ratio is ≈ 88% to 12%, which, we suppose, is too little to validate the model. A similar issue is found in the CNN-based FG detection model presented in the literature [63], where 90% of the samples are used for training and only a set of 20 randomly selected frames is used for evaluating the model.
To address these issues, we create temporally exclusive training and test sets by splitting each video sequence in an orderly manner, whereby the training set takes the first 70% while the test set takes the last 30% of the input frames that have ground truths, as illustrated in Fig. 5-a. During training, the model takes advantage of data transformation, a.k.a. augmentation, which includes vertical and horizontal translation with a fraction of 0.1 of the actual input image's height and width, random rotation within 10 degrees, and a zooming factor in the range of 0.1 inside the input image. Hence, the training procedure uses a generator (in this case, a Keras ImageDataGenerator) that loops over the data (in batches) and applies the aforesaid transformations to batches of images randomly on-the-fly. These data transformations are applied to the training samples and to their corresponding foreground truths simultaneously. The model handles the temporal information of the video sequence as shown in Fig. 5-b, where the input is two consecutive grayscale frames stacked with a generic grayscale BG model estimated from the prior samples listed in Table III that have no ground truths (GT).
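The paired frame/ground-truth augmentation described above can be reproduced roughly as follows. This is a sketch: the shared seed and batch size are assumptions, and the exact generator arguments used by the authors are not published here.

```python
# Sketch of paired frame/ground-truth augmentation with Keras ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug = dict(width_shift_range=0.1, height_shift_range=0.1,  # 0.1 of width/height
           rotation_range=10,                              # up to 10 degrees
           zoom_range=0.1)                                 # zoom factor of 0.1

img_gen, mask_gen = ImageDataGenerator(**aug), ImageDataGenerator(**aug)

seed = 7  # identical seed so frames and FG truths receive the same random transformation
img_flow  = img_gen.flow(x_train, batch_size=8, seed=seed)
mask_flow = mask_gen.flow(y_train, batch_size=8, seed=seed)
train_flow = zip(img_flow, mask_flow)   # yields (augmented frames, augmented masks)
```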


B. Step-by-Step Quantitative Analysis

In FG extraction, the standard performance metric considered is the f-measure. It is a similarity count between the extracted FG mask and the ground truth, a weighted mean measure of precision (Pr) and recall (Re) defined as

F\text{-}measure = \frac{2 \times (Pr \times Re)}{Pr + Re}.   (3)

The recall and precision are given by Re = TP/(TP + FN) and Pr = TP/(TP + FP), where TP, FN, and FP refer to true positives, false negatives, and false positives, respectively.

I. Impact of complementary feature flows: To verify the intuition of the multi-view receptive field, we conduct a sanity check with different combinations of complementary feature flows, as described in Table IV. The test is carried out on the Office sequence from the baseline category and the Traffic sequence from the camera-jitter category, where the input is a single-frame RGB. The investigation reveals that the fusion of all the complementary feature flows provides a valuable improvement in the FG segmentation. For instance, the average performance of the all-feature combination gains ≈ 3.5% improvement over the basic structure (PFF).

TABLE IV: F-measure vs combination of PFF and CFF.
Combination of feature flows | Office | Traffic
A: PFF | 87.99 | 82.27
B: PFF + CFF2 | 89.02 | 75.01
C: PFF + CFF1 | 90.06 | 82.23
D: PFF + CFF1 + CFF2 | 91.27 | 84.88

A sample input frame from the Traffic video sequence and its corresponding salient maps generated by different combinations of feature flows are shown in Fig. 6. It is found that, when all the feature flows are combined, the discrimination between BG and FG is more salient (referring to Fig. 6: (e)). When they are uncombined, there is noticeable fussiness in the score maps, especially around the borders of the FG object (referring to Fig. 6: (b) - (d)), which will mislead the FG extraction. This is clearly evident as the f-measure of the combined feature flows of MvRF-CNN is higher than that of the basic model (tabulated in Table IV). Considering these results as empirical proofs, further experiments are carried out to analyse the impact of TL and of using BG information as one of the input channels.

[Fig. 6: Visualizing the impact of complementary feature flows: (a) - (e) are the input frame and the salient maps of PFF, the combined feature flows of PFF + CFF2, PFF + CFF1, and PFF + CFF1 + CFF2, respectively. The white boxes highlight a sample region of fussiness variation between FG and BG.]

[Fig. 7: Impact of transfer learning and background model using MvRF-CNN. (a) Performance vs input configurations: SF - single frame, SF + TL - single frame with transfer learning, and BG + TL - input data using the BG model as in Fig. 5-b with transfer learning. (b) MvRF-CNN salient maps and their corresponding histograms for an input frame when trained from scratch and when fine-tuned with intra-class transfer learning.]

[Fig. 8: F-measure vs method (MvRF-CNN compared with PAWCS, IUTIS-5, MBS, DeepBS, LSTM, BoO, and other baselines).]
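For reference, the binarization of Section III-E and the F-measure of (3) used throughout this comparison can be computed as in the following sketch. The threshold choice and the 50-pixel cleanup follow Section III-E; the library choice (NumPy/scikit-image) and the epsilon guards are assumptions.

```python
# Sketch: binarize a sigmoid score map (Section III-E) and score it with eq. (3).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects

def binarize(score_map, global_th=None):
    """Apply a dataset-specific global threshold or Otsu, then drop regions < 50 pixels."""
    th = global_th if global_th is not None else threshold_otsu(score_map)
    mask = score_map > th
    return remove_small_objects(mask, min_size=50)

def f_measure(pred, gt):
    """Eq. (3): F = 2*Pr*Re / (Pr + Re) from boolean predicted and ground-truth masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    pr = tp / (tp + fp + 1e-12)
    re = tp / (tp + fn + 1e-12)
    return 2 * pr * re / (pr + re + 1e-12)
```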


[Fig. 9: Category-wise performance analysis of the MvRF-CNN. (a) Best f-measure vs dataset category for PAWCS, IUTIS-5, MBS, DeepBS, LSTM, BoO, and the proposed model. (b) Improvements (gain in %) of the MvRF-CNN compared to the others, per category: BL, DB, CJ, IOM, SD, TH, BW, LF, NV, PTZ.]

Dataset | MvRF-CNN G-th | MvRF-CNN O-th | PAWCS [48]-2015 | IUTIS-5 [52]-2017 | MBS [53]-2017 | DeepBS [20]-2018 | LSTM [4]-2019 | Others (2013, 2015, 2016, 2017)
Pedestrians (PD) 0.9705 0.9602 0.9461 0.9692 0.9566 0.9459 0.9553 0.9282 [19], 0.8352 [44], 0.9656 [45], 0.9569 [46]
PETS2006 (PE) 0.9599 0.9474 0.9315 0.9354 0.8648 0.9425 0.9379 0.8717 [19], 0.7922 [44], 0.8237 [45], 0.9359 [46]
Highway (HW) 0.9793 0.9765 0.9436 0.9535 0.9217 0.9655 0.9771 0.9466 [19], 0.8780 [44], 0.9523 [45], 0.9410 [46]
Office (OF) 0.9807 0.9779 0.9375 0.9686 0.9719 0.9780 0.9686 0.9605 [19], 0.9606 [45], 0.9312 [46]
Fountain01 (F1) 0.8430 0.8296 0.7778 0.8229 0.5235 0.7685 - 0.6102 [19], 0.2264 [45], 0.7329 [46]
Fountain02 (F2) 0.9286 0.9250 0.9415 0.9553 0.9214 0.9254 0.8619 0.8321 [19], 0.7374 [45], 0.9434 [46]
Canoe (CA) 0.9590 0.9546 0.9379 0.9462 0.9345 0.9794 0.9539 0.6337 [19], 0.7258 [44], 0.9628 [45], 0.6131 [46]
Boats (BO) 0.9445 0.9420 0.8416 0.7532 0.9041 0.8121 0.9260 0.6107 [19], 0.6560 [44], 0.9576 [45], 0.6401 [46]
Overpass (OV) 0.9584 0.9577 0.9590 0.9272 0.8990 0.9416 0.9578 0.5977 [19], 0.9461 [45], 0.7209 [46]
Fall (FA) 0.9704 0.9685 0.9052 0.9361 0.5668 0.8294 0.9573 0.3055 [45], 0.8138 [46], 0.6731 [47]
Sidewalk (SW) 0.9519 0.9518 0.6904 0.8140 0.8994 0.9034 0.8970 0.3458 [19], 0.6860 [45], 0.8460 [46]
Traffic (TF) 0.9507 0.8734 0.8278 0.8302 0.6781 0.8776 0.8897 0.7750 [19], 0.8120 [45], 0.7983 [46]
Badminton (BM) 0.9202 0.9152 0.8920 0.9204 0.9021 0.9527 0.9197 0.7216 [19], 0.9648 [45], 0.8304 [46]
Boulevard (BV) 0.9647 0.9654 0.8444 0.7680 0.8672 0.8623 0.9570 0.6892 [19], 0.8532 [45], 0.7157 [46]
Parking (PA) 0.9673 0.9669 0.8190 0.6482 0.6185 0.5971 0.9619 0.6800 [45], 0.8216 [46], 0.4585 [47]
WinterDriveway (WD) 0.9161 0.9006 0.4721 0.4389 0.3760 0.2997 0.6800 0.1471 [45], 0.4076 [46], 0.4320 [47]
AbandonedBox (AB) 0.9646 0.9625 0.9135 0.9019 0.8232 0.5567 0.9271 0.6736 [45], 0.8927 [46], 0.6805 [47]
StreetLight (SL) 0.9393 0.9300 0.9864 0.9892 0.9921 0.9161 0.9439 0.9780 [45], 0.9911 [46], 0.7260 [47]
TramStop (TS) 0.9830 0.9827 0.7428 0.6080 0.8856 0.4754 0.9883 0.9780 [45], 0.5630 [46], 0.4156 [47],
Sofa (SO) 0.9481 0.9430 0.7247 0.7915 0.8455 0.8134 0.9285 0.7759 [45], 0.7591 [46], 0.5459 [47]
Backdoor (BD) 0.9497 0.9492 0.9523 0.9704 0.8296 0.9800 0.8879 0.6722 [45], 0.9586 [46], 0.7800 [47]
BusStation (BS) 0.8954 0.8937 0.8729 0.8826 0.8695 0.9374 0.8785 0.9087 [45], 0.8628 [46], 0.7820 [47]
Bungalows (BG) 0.9541 0.9491 0.8387 0.8392 0.7475 0.8492 0.9525 0.7035 [45], 0.8374 [46], 0.8060 [47]
PeopleInShade (PP) 0.9754 0.9746 0.8986 0.9103 0.9016 0.9197 0.9533 0.9178 [45], 0.8948 [46], 0.7716 [47]
Cubicle (CC) 0.9619 0.9588 0.8713 0.9219 0.5613 0.9427 0.9434 0.6932 [45], 0.9243 [46], 0.6390 [47]
CopyMachine (CM) 0.9660 0.9632 0.9143 0.9260 0.8711 0.9534 0.9683 0.9039 [45], 0.9217 [46], 0.5482 [47]
Park (PK) 0.9249 0.9209 0.8286 0.7652 0.7099 0.8741 0.8780 0.5916 [19], 0.7461 [45], 0.7973 [46]
Corridor (CO) 0.9509 0.9498 0.8802 0.9087 0.9208 0.8896 0.9230 0.8119 [19], 0.8407 [45], 0.8944 [46]
Library (LI) 0.9778 0.9774 0.9390 0.9547 0.9594 0.4773 0.9724 0.9462 [19], 0.9395 [45], 0.7901 [46]
Blizzard (BL) 0.9748 0.9732 0.7737 0.8540 0.8572 0.6115 0.9535 0.8572 [44], 0.8584 [46], 0.5814 [47]
Snowfall (SF) 0.9490 0.9430 0.8393 0.8453 0.8782 0.8648 0.8906 0.8979 [46], 0.7561 [47], 0.8588 [50]
Skating (SK) 0.9727 0.9702 0.8984 0.9156 0.9223 0.9669 0.9498 0.8732 [46], 0.7921 [47], 0.9048 [50]
Turnpike_0_5fps (TP) 0.9632 0.9596 0.9146 0.8802 0.8901 0.4917 0.9558 0.9130 [46], 0.7528 [47], 0.9386 [50]
FluidHighway (FH) 0.8721 0.8641 0.4019 0.4133 0.3536 0.6600 0.8110 0.4483 [46], 0.2885 [47], 0.5880 [50]
WinterStreet (WS) 0.8792 0.8786 0.3863 0.5999 0.4938 0.7489 0.8181 0.5854 [46], 0.5095 [47], 0.7095 [50]
TramStation (TN) 0.8963 0.8889 0.7129 0.7506 0.7369 0.1551 0.8611 0.7932 [44], 0.7946 [46], 0.6270 [47]
IntermittentPan (IP) 0.9442 0.9132 0.5453 0.5103 0.5409 0.2063 0.9008 0.4118 [46], 0.2147 [47], 0.9424 [50]
Note: the listed sequences are either traffic/pedestrian-related or indoor/outdoor surveillance-related videos.
TABLE V: F-measure performance comparison. G-th and O-th stand for the two thresholds used, global and Otsu, respectively. Values in red are the best figures, while those in blue are the second best.
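The two binarization modes reported in Table V can be reproduced, in spirit, with standard tooling. The sketch below assumes OpenCV and a Sigmoid confidence map rescaled to 8-bit; the 0.5 global threshold is an assumed value for illustration, since the exact G-th setting is not restated here.

import cv2
import numpy as np

def binarize(confidence: np.ndarray, mode: str = "otsu", g_th: float = 0.5) -> np.ndarray:
    # confidence: float map in [0, 1] produced by the network's Sigmoid layer.
    conf_u8 = (confidence * 255).astype(np.uint8)
    if mode == "global":
        _, mask = cv2.threshold(conf_u8, int(g_th * 255), 255, cv2.THRESH_BINARY)
    else:
        # Otsu picks the threshold automatically from the map's intensity histogram [62].
        _, mask = cv2.threshold(conf_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask  # uint8 {0, 255}: bright = FG, dark = BG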
II. Impact of transfer learning and scene-specific prior background model: To validate the effectiveness of TL and the impact of using the temporally median-filtered BG model, b0 (referring to Fig. 5-b), in FG extraction, experiments are conducted on a subset of the datasets given in Table III. The experiments are carried out under three different input configurations: a) a single-frame RGB image (SF), b) a single-frame RGB image with transfer learning (SF + TL), and c) the input data using the BG model as in Fig. 5-b with transfer learning (BG + TL). Figure 7a compares the results, where the f-measure is computed on the binary FG segmentation generated by Otsu's automatic threshold. The analysis leads to the conclusion that the fine-tuning approach with intraclass domain transfer improves the FG extraction results by ≈ 4%, while the input configuration with the BG model boosts the segmentation results by an additional ≈ 10%. Figure 7b visualizes the impact of TL via intensity histograms. From the salient maps, we can see that when the network is fine-tuned with pre-trained weights, the model generates probability scores with near-zero fuzziness between FG and BG. This is evident as the intensity histogram of the salient map has multiple peaks when the model is trained from scratch, while the intensity distribution of the salient map falls around two distinct peaks, generally at the intensity values of 0 (dark, BG) and 255 (bright, FG), when it is fine-tuned via TL. Thus, the TL-based parameter initialization allows the MvRF-CNN to produce a stronger discrimination of FG regions from BG.
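For concreteness, a minimal sketch of the third input configuration (BG + TL) is given below: two consecutive grayscale frames stacked with the temporally median-filtered BG model b0 into a single three-channel input. The channel order, the [0, 1] scaling, and the helper name are assumptions for illustration, not the authors' data pipeline.

import cv2
import numpy as np

def make_input(frame_prev_bgr, frame_curr_bgr, bg_model_gray):
    # Convert the two consecutive frames to grayscale and stack them with b0.
    g_prev = cv2.cvtColor(frame_prev_bgr, cv2.COLOR_BGR2GRAY)
    g_curr = cv2.cvtColor(frame_curr_bgr, cv2.COLOR_BGR2GRAY)
    x = np.dstack([g_prev, g_curr, bg_model_gray]).astype(np.float32) / 255.0
    return x[np.newaxis, ...]  # (1, H, W, 3) batch for the network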

(a) Sample qualitative results for the Pedestrians, PETS2006, Highway, Office, Fountain02, Canoe, Boats, and Overpass video sequences (rows 1 to 8). Columns 1 to 5: input frames, ground truths, Sigmoid confidence maps, and binary FG-BG segmentations generated through G-th and O-th, respectively.
(b) Sample qualitative results for the BusStation, Bungalows, PeopleInShade, Cubicle, CopyMachine, Park, Corridor, and Library video sequences (rows 1 to 8). Columns 1 to 5: input frames, ground truths, Sigmoid maps, and binary FG-BG segmentations generated through G-th and O-th, respectively.
Fig. 10: Part I sample visual results of the MvRF-CNN FG extractor: bright and dark pixels represent FG and BG respectively.
Fig. 11: Impact of mid-fusion: FoM for the HW, BZ, FH, TP, WS, TN, and IP video sequences with mid-fusion ON versus OFF.

III. Impact of the number of frames used in background estimation: There is no rule of thumb for picking an optimal number of frames that leads to the best performance. Generally, most statistical approaches take about 200 frames to estimate the initial BG and then update it over time using a parametric formulation [1], [15]. On the whole, however, if the selected number of frames covers all the variabilities, such as illumination variation and the dynamics in the scene, then the generalization will be better. In this case, we observe an interesting trend within challenging video sequences: as the number of frames used in the BG estimation increases, the performance of FG detection improves. For instance, when the FluidHighway (FH), TramStation (TN), Snowfall (SF), Turnpike_0_5fps (TP), Blizzard (BL), and Skating (SK) sequences are considered, there is more than a 6% improvement when the number of frames used in the background estimation is increased by ≈ 2 times. Thus, it is important to have a well-generalized background model if FG extraction accuracy is the primary goal.
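A minimal sketch of the temporally median-filtered background model discussed above follows: b0 is taken as the per-pixel median over the first n_frames grayscale frames of a sequence. The OpenCV-based reader and the 200-frame default are illustrative choices consistent with the discussion, not the paper's exact code.

import cv2
import numpy as np

def estimate_bg(video_path: str, n_frames: int = 200) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break  # fewer frames available than requested
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    # The per-pixel temporal median suppresses moving FG objects, leaving the BG.
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)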

IV. Impact of the mid-fusion: Many researchers have already proven the effectiveness of early and late fusions in deep learning [64]–[69]. However, we have not come across a work that provides theoretical or experimental support for mid-fusion. In our case, the intuition for using it is that the features extracted through the multi-view receptive fields may contain complementary spatiotemporal agents at every stage in the network, so fusing features at the early, mid, and late stages should lead the model to learn well-generalized features of the targeted domain. To investigate this, an additional experiment is conducted without the mid-fusion, while all the training and input configurations are kept the same as in the experiments with mid-fusion whose results are tabulated in Table V. This experiment is conducted on seven traffic video surveillance sequences, namely Highway (HW), Blizzard (BZ), FluidHighway (FH), Turnpike_0_5fps (TP), WinterStreet (WS), TramStation (TN), and IntermittentPan (IP). The experimental results (refer to Fig. 11) show that the mid-level fusion improves the FoM of the foreground extraction system by approximately 3% compared to the model without mid-level fusion. A small illustrative sketch of this fusion scheme is given at the end of this subsection.
V. Comparison of results to the literature: Taking the above outcomes in Parts I and IV as the foundation of the MvRF-CNN, extensive experiments are carried out on all the datasets listed in Table III with the input configuration described in Fig. 5-b. Thus, this subsection provides an analysis of the performance of the MvRF-CNN in comparison to the prior-art and state-of-the-art techniques. It includes probabilistic approaches and NN-based learning systems from recent years: PAWCS (2015) [48], IUTIS-5 (2017) [52], MBS (2017) [53], DeepBS (2018) [20], LSTM (2019) [4], and a few others. Figure 8 summarizes the results across all the video sequences via box-plots, where Ours and BoO are based on the best results of our method and the best results of the other methods listed in Table V, respectively. Figure 9 compares the category-wise achievements of the MvRF-CNN with the other methods. When the average performance across categories is considered, the proposed model produces far more consistent results than all the other methods. Although all the other approaches show competitive results in the baseline (BL) category, they fail to exhibit the same for the rest. Most of the other techniques record poor outcomes in challenging conditions, like dynamic background (DB) and night videos (NV). Overall, the MvRF-CNN gains 14.7%, 14.2%, 16.4%, 24.1%, 2.9%, and 6.8% more f-measure than PAWCS, IUTIS-5, MBS, DeepBS, LSTM, and BoO, respectively.
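To make the early/mid/late fusion idea of item IV concrete, the following is an illustrative sketch, not the exact MvRF-CNN graph: feature maps from parallel receptive-field branches (3x3 and 5x5 views) are merged by concatenation followed by a 1x1 convolution at three stages of a small encoder-decoder. The framework (tensorflow.keras), layer widths, and input resolution are placeholders, not the paper's hyper-parameters.

from tensorflow.keras import layers, Model, Input

def fuse(branches, filters):
    # Channel-wise concatenation + 1x1 convolution acts as a learnable fusion.
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(
        layers.Concatenate()(branches))

inp = Input(shape=(240, 320, 3))
b3 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)   # 3x3 view
b5 = layers.Conv2D(16, 5, padding="same", activation="relu")(inp)   # 5x5 view
early = fuse([b3, b5], 16)                                           # early fusion

enc = layers.MaxPooling2D()(early)
m3 = layers.Conv2D(32, 3, padding="same", activation="relu")(enc)
m5 = layers.Conv2D(32, 5, padding="same", activation="relu")(enc)
mid = fuse([m3, m5], 32)                                             # mid-fusion

dec = layers.UpSampling2D()(mid)
late = fuse([dec, early], 16)                                        # late fusion with early features
out = layers.Conv2D(1, 1, activation="sigmoid")(late)                # FG confidence map
model = Model(inp, out)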

(a) Sample qualitative results for Fall, Traffic, Badminton, Boulevard, (b) Sample qualitative results for Blizzard, Snowfall, Skating, Turn-
Parking, StreetLight, Sofa, and backDoor video sequences (row 1 to pike 5 fps, FluidHighway, WinterStreet, TramStation, and Intermit-
8). Where, from column 1 to 5: input frames, ground truths, Sigmoid tentPan video sequences (row 1 to 8). Where, column 1 to 5: input
confidence maps, binary FG-BG segmentations generated through G- frames, ground truths, Sigmoid maps, binary FG-BG segmentations
th and O-th, respectively. generated through G-th and O-th, respectively.
Fig. 12: Part II sample visual results of the MvRF-CNN FG extractor: bright and dark pixels represent FG and BG respectively.

C. Qualitative Analysis
A few sample visual results are shown in Figs. 10 and 12 in support of the quantitative results listed in Table V. These visual comparisons against the corresponding ground truths show that the proposed MvRF-CNN segments the moving FG objects in videos very tightly.

Cat. Ours+ Ours⋆ Ours• PAWCS IUTIS-5 MBS DeepBS BoO
BL 0.9726 0.9298 0.9609 0.9397 0.9567 0.9288 0.9580 0.9522
DB 0.9340 0.8688 0.9250 0.8938 0.8902 0.7916 0.8761 0.8928
CJ 0.9471 0.7419 0.8984 0.8137 0.8332 0.8367 0.8990 0.8690
IOM 0.9531 0.9125 0.9566 0.7764 0.7296 0.7568 0.6097 0.8152
SD 0.9504 0.8531 0.9099 0.8914 0.9084 0.7968 0.9304 0.9114
Ave. 0.9514 0.8612 0.9302 0.8630 0.8632 0.8222 0.8546 0.8881
+ Original results: trained on the first 70% of hard GTs that are temporally exclusive to the test sequence. ⋆ Trained on GMM-annotated GTs (noisy and lacking delineation), then fine-tuned on the next 10% of hard GTs that are temporally exclusive to the test sequence. • Trained on automatically (poorly) annotated GTs, then fine-tuned on the next 30% of hard GTs that are temporally exclusive to the test sequence.
TABLE VI: Failure analysis: training on poor ground truths.
V. Failure Analysis
This section analyses the robustness of the MvRF-CNN under two conditions: i) a lack of precisely delineated ground truths for training, and ii) non-scene-specific model testing.

A. Poorly Annotated Ground Truths
Deep learning models require a warehouse of precisely labelled data for classifier training. For better performance on the video-sequence FG labelling task, researchers take 80% to 90% [20], [63] of the human-annotated (hard ground truth, HGT) samples to train a model. However, consider a scenario where only about 30% of the HGT data is available. The following steps can handle such a case.
i. Generating GTs using an automatic method. In this case, we use a basic 3-component GMM that models a BGM from the first 120 frames and annotates the rest; GMM-based BG-FG segmentation is openly available in the native OpenCV and MATLAB platforms (a minimal sketch is given at the end of this subsection). The CDnet datasets leave the first hundreds of frames without GTs for BG initialization, and in this experiment we generate GTs within those initial frames. We do not utilize any additional tools to remove noise or to improve the annotations, so the automatically annotated GTs are very poor in quality. Training on such cluttered GTs results in a weak model, and the overall performance falls to around 76% FoM.
ii. The issue above is addressed by fine-tuning the model on a few HGTs. For this experiment, we fine-tune the model with two different sets: the first 10%⋆ and the first 30%• of HGTs. The analysis includes nineteen sequences from five categories: BL, DB, CJ, IOM, and SD. The results are listed in Table VI.
The above analysis shows that the model performs very competitively even after being pre-trained on poorly annotated ground truths, provided it is fine-tuned on a few hard GTs. For example, regardless of the quality of the automatically generated GTs used during pre-training, if the MvRF-CNN is fine-tuned on 10% of fine hard GTs, it achieves an average FoM of 86%. When the amount of fine-tuning samples is increased to 30%, its performance becomes better than PAWCS, IUTIS-5, MBS, DeepBS, and BoO by 6.7%, 6.7%, 4.1%, 7.5%, and 4.2%, respectively.
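The sketch below illustrates step i, using OpenCV's MOG2 background subtractor as a stand-in for the basic 3-component GMM; the parameter choices (a 120-frame history, three mixtures, no shadow detection) follow the description above, but the exact implementation used in the paper may differ.

import cv2

def auto_annotate(video_path: str, n_model_frames: int = 120):
    gmm = cv2.createBackgroundSubtractorMOG2(history=n_model_frames, detectShadows=False)
    gmm.setNMixtures(3)                      # 3 Gaussian components per pixel
    cap = cv2.VideoCapture(video_path)
    masks, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = gmm.apply(frame)                # 0 = BG, 255 = FG
        if idx >= n_model_frames:            # the first frames only build the BG model
            masks.append(fg)
        idx += 1
    cap.release()
    return masks                             # noisy pseudo-GTs, used as-is (no clean-up)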
Cat. Ours+ Ours♣ Ours♠ PAWCS IUTIS-5 MBS DeepBS BoO
DB 0.9590 0.9194 0.9357 0.9379 0.9492 0.9345 0.9794 0.9628
CJ 0.9507 0.7321 0.9403 0.8302 0.8302 0.6781 0.8786 0.8120
IOM 0.9660 0.7145 0.9109 0.7751 0.7751 0.7209 0.5769 0.8572
SD 0.9579 0.8453 0.9324 0.9144 0.9144 0.7524 0.9313 0.9105
NV 0.8963 0.7833 0.8558 0.7506 0.7506 0.7369 0.1551 0.7946
LF 0.9632 0.7946 0.9587 0.8802 0.8802 0.8901 0.4917 0.9386
BW 0.9490 0.8150 0.9483 0.8453 0.8453 0.8782 0.8648 0.8979
Ave. 0.9489 0.8006 0.9264 0.8561 0.8488 0.7987 0.6967 0.8819
+ Original results: trained on the scene-specific first 70% of HGTs that are temporally exclusive to the test set of the same video. ♣ Testing on unknown video sequences. ♠ The model in ♣ fine-tuned on the scene-specific first 30% of HGTs that are temporally exclusive to the test sequence of the same video.
TABLE VII: Failure analysis: non-scene-specific testing.

B. Non-Scene-Specific Model Testing
Although the scope of this work is limited to a scene-specific model for the FG-BG labelling task, it is important to evaluate its robustness in unknown environments as well. Thus, the MvRF-CNN is trained on 20 randomly selected video sequences (IP, PD, FH, PE, SK, OF, BL, HW, PK, F1, PP, F2, BS, BO, SO, FA, TS, BM, WD, and SL) from the videos in Table III and tested on 11 randomly selected sequences from the rest (CA, TR, PA, CM, AB, BD, BG, CC, SF, TP, and TN). The results with and without fine-tuning are summarized categorically in Table VII. The study shows that the model is able to label the FG with an average FoM of 80% on the unknown datasets. However, when it is fine-tuned with a mere 30% of a temporally exclusive training set and tested on the last 30% of the frames from the same sequence (i.e., the training and test sets are temporally separated by 40% of the frames in the time domain; a sketch of this split is given below), it achieves an average FoM of 93%, which is just 2% less than our original result, yet still better than the existing methods. For instance, the fine-tuned MvRF-CNN achieves 7%, 8%, 13%, 23%, and 4% more FoM than PAWCS, IUTIS-5, MBS, DeepBS, and BoO, respectively.
The above failure analyses show that the model is robust enough both for conditions lacking hard ground truths and for deployment for FG labelling on unknown videos.
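The temporal split used in the fine-tuning experiment above can be sketched as follows: the first 30% of a sequence's frames are reserved for scene-specific fine-tuning and the last 30% for testing, leaving a 40% gap so that the two sets are temporally exclusive. The helper name and frame-index interface are ours, not the paper's.

def temporal_split(frame_indices, train_frac=0.3, test_frac=0.3):
    n = len(frame_indices)
    train = frame_indices[: int(n * train_frac)]
    test = frame_indices[n - int(n * test_frac):]
    return train, test

# e.g. a 1000-frame sequence -> frames 0-299 for fine-tuning, 700-999 for testing
train_ids, test_ids = temporal_split(list(range(1000)))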
VI. Future Work
The future direction of the work is dedicated mainly to network optimization. Despite the advancements of DNNs and their state-of-the-art performances in countless applications, the decision on their hyper-parameters, viz. depth (number of layers), width (number of filters per layer), and type of layers, remains an empirical process. Similarly, the current work also sets these parameters empirically, as discussed in Section III. However, for real-world applications, like traffic monitoring and surveillance, these parameters often become a bottleneck when deploying on resource-constrained hardware. Thus, this work identifies the following as potential methodologies for optimizing the proposed MvRF-CNN: i. network training acceleration strategies [57], [70], [71]; ii. bit-quantization [72]–[74]; iii. model compression via knowledge distillation [75], [76]; iv. tensor factorization [77], [78]; and v. network pruning [79]–[83] (a small illustrative sketch of ii and v is given below). Hence, the future work will also consider the TensorRT API for achieving a high-performance inference model so that it can be deployed on embedded devices such as the Jetson TX2 (https://developer.nvidia.com/embedded/buy/jetson-tx2) and the autonomous car development platform NVIDIA Drive PX (https://www.nvidia.com/en-us/self-driving-cars/drive-platform/).
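As a framework-agnostic illustration of two of the directions listed above (bit-quantization and network pruning), the sketch below applies magnitude-based pruning and uniform 8-bit quantization to a single weight tensor; it is not part of the paper and omits the retraining step such methods normally require.

import numpy as np

def prune_by_magnitude(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    thr = np.quantile(np.abs(w), sparsity)      # drop the smallest |w| values
    return np.where(np.abs(w) < thr, 0.0, w)

def quantize_uniform(w: np.ndarray, bits: int = 8) -> np.ndarray:
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    if scale == 0:
        scale = 1.0                              # constant tensor: nothing to quantize
    q = np.round((w - w_min) / scale)            # integer code in [0, levels]
    return q * scale + w_min                     # de-quantized approximation

w = np.random.randn(64, 128).astype(np.float32)  # stand-in for a layer's weights
w_small = quantize_uniform(prune_by_magnitude(w, 0.5), bits=8)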

VII. Conclusion
This work introduces a scene-specific FG extraction model inspired by recent innovations in deep learning. The proposed model utilizes a heterogeneous set of convolutions to capture scale-invariant FG object representations. We believe that the proposed MvRF-CNN is a unique addition to the NN-based scene-specific FG extraction systems. Extensive experiments are conducted to analyse the proposed model's performance under the following conditions: i. training with random-state model initialization; ii. model fine-tuning with transferred parameters; iii. an input configuration with a single frame in the RGB color space; iv. an input configuration that takes two consecutive grayscale frames and a temporally median-filtered BG model; v. training with poorly annotated GTs; and vi. testing on unknown datasets (non-scene-specific FG extraction). The qualitative and quantitative performance analyses of the proposed MvRF-CNN on 37 challenging video sequences collected from traffic and surveillance videos demonstrate that the model performs robustly and consistently better than, or very competitively with, the prior- and state-of-the-art methods. It records a real-time speed of ≈ 42 frames per second (FPS), i.e., ≈ 22 ms mean prediction time per frame, on a GeForce GTX 1080 Ti GPU. However, the limitation of the network lies in its high number of trainable parameters; the optimization of the trainable parameters and the architecture is left for future work. From an application point of view, the MvRF-CNN can be exploited for various applications, like path segmentation for autonomous vehicles and traffic analysis for video surveillance. Finally, it is understandable that modelling a perfect FG extraction algorithm is still an open and intriguing task.
References
[1] T. Akilan, Q. J. Wu, and Y. Yang, "Fusion-based foreground enhancement for background subtraction using multivariate multi-model gaussian distribution," Information Sciences, vol. 430, pp. 414–431, 2018.
[2] K. Wang, Y. Liu, C. Gou, and F.-Y. Wang, "A multi-view learning approach to foreground detection for traffic surveillance applications," IEEE Transactions on Vehicular Technology, vol. 65, no. 6, pp. 4144–4158, 2016.
[3] M. Vargas, J. M. Milla, S. L. Toral, and F. Barrero, "An enhanced background estimation algorithm for vehicle detection in urban traffic scenes," IEEE Transactions on Vehicular Technology, vol. 59, no. 8, pp. 3694–3709, 2010.
[4] T. Akilan, Q. J. Wu, A. Safaei, J. Huo, and Y. Yang, "A 3d cnn-lstm-based image-to-image foreground segmentation," IEEE Transactions on Intelligent Transportation Systems, pp. 1–13, March 2019.
[5] H.-S. Song, S.-N. Lu, X. Ma, Y. Yang, X.-Q. Liu, and P. Zhang, "Vehicle behavior analysis using target motion trajectories," IEEE Transactions on Vehicular Technology, vol. 63, no. 8, pp. 3580–3591, 2014.
[6] L. Tian, H. Wang, Y. Zhou, and C. Peng, "Video big data in smart city: Background construction and optimization for surveillance video processing," Future Generation Computer Systems, vol. 86, pp. 1371–1382, 2018.
[7] C. Tang and A. Hussain, "Robust vehicle surveillance in night traffic videos using an azimuthally blur technique," IEEE Transactions on Vehicular Technology, vol. 64, no. 10, pp. 4432–4440, 2014.
[8] T. Bouwmans, "Traditional and recent approaches in background modeling for foreground detection: An overview," Computer Science Review, vol. 11, pp. 31–66, 2014.
[9] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2018.
[10] Z. Zhang, S. Fidler, and R. Urtasun, "Instance-level segmentation for autonomous driving with deep densely connected mrfs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 669–677, 2016.
[11] K.-H. Jo et al., "Cumulative dual foreground differences for illegally parked vehicles detection," IEEE Transactions on Industrial Informatics, vol. 13, no. 5, pp. 2464–2473, 2017.
[12] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid, "Unsupervised object discovery and tracking in video collections," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3173–3181, 2015.
[13] Y. Zhang, C. Zhao, and Q. Zhang, "Counting vehicles in urban traffic scenes using foreground time-spatial images," IET Intelligent Transport Systems, vol. 11, no. 2, pp. 61–67, 2017.
[14] A. I. Guha and S. Tellex, "Towards meaningful human-robot collaboration on object placement," in RSS Workshop on Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics, 2016.
[15] A. Thangarajah, Q. J. Wu, and J. Huo, "A unified threshold updating strategy for multivariate gaussian mixture based moving object detection," in 2016 International Conference on High Performance Computing & Simulation (HPCS), pp. 570–574, IEEE, 2016.
[16] F. Porikli and O. Tuzel, "Bayesian background modeling for foreground detection," in Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, pp. 55–58, ACM, 2005.
[17] M. Imani, S. F. Ghoreishi, D. Allaire, and U. Braga-Neto, "Mfbo-ssm: Multi-fidelity bayesian optimization for fast inference in state-space models," AAAI, 2019.
[18] M. Imani, S. F. Ghoreishi, and U. M. Braga-Neto, "Bayesian control of large mdps with unknown dynamics in data-poor environments," in Advances in Neural Information Processing Systems, pp. 8146–8156, 2018.
[19] Z. Zhao, X. Zhang, and Y. Fang, "Stacked multilayer self-organizing map for background modeling," IEEE Transactions on Image Processing, vol. 24, no. 9, pp. 2841–2850, 2015.
[20] M. Babaee, D. T. Dinh, and G. Rigoll, "A deep convolutional neural network for video sequence background subtraction," Pattern Recognition, vol. 76, pp. 635–649, 2018.
[21] V. Goel, J. Weng, and P. Poupart, "Unsupervised video object segmentation for deep reinforcement learning," in Advances in Neural Information Processing Systems, pp. 5683–5694, 2018.
[22] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364, IEEE, 2017.
[23] J. Han, L. Yang, D. Zhang, X. Chang, and X. Liang, "Reinforcement cutting-agent learning for video object segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9080–9089, 2018.
[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

[26] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in European Conference on Computer Vision, pp. 534–549, Springer, 2016.
[27] G. Trigeorgis, P. Snape, I. Kokkinos, and S. Zafeiriou, "Face normals "in-the-wild" using fully convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 340–349, July 2017.
[28] D. Eigen, D. Krishnan, and R. Fergus, "Restoring an image taken through a window covered with dirt or rain," in Proceedings of the IEEE International Conference on Computer Vision, pp. 633–640, 2013.
[29] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[31] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[32] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[33] M. Sultana, A. Mahmood, S. Javed, and S. K. Jung, "Unsupervised deep context prediction for background estimation and foreground segmentation," Machine Vision and Applications, vol. 30, no. 3, pp. 375–395, 2019.
[34] Z. Hu, T. Turki, N. Phan, and J. T. Wang, "A 3d atrous convolutional long short-term memory network for background subtraction," IEEE Access, vol. 6, pp. 43450–43459, 2018.
[35] M. Braham and M. V. Droogenbroeck, "Deep background subtraction with scene-specific convolutional neural networks," 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4, 2016.
[36] G. Thapa, K. Sharma, and M. Ghose, "Moving object detection and segmentation using frame differencing and summing technique," International Journal of Computer Applications, vol. 975, p. 8887, 2014.
[37] X. Zang, G. Li, J. Yang, and W. Wang, "Adaptive difference modelling for background subtraction," in 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4, Dec 2017.
[38] Y. Tian, Y. Wang, Z. Hu, and T. Huang, "Selective eigenbackground for background modeling and subtraction in crowded scenes," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 11, pp. 1849–1864, 2013.
[39] H. Yong, D. Meng, W. Zuo, and L. Zhang, "Robust online matrix factorization for dynamic background subtraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, pp. 1726–1740, 2017.
[40] N. Wang, T. Yao, J. Wang, and D.-Y. Yeung, "A probabilistic approach to robust matrix factorization," in European Conference on Computer Vision, pp. 126–139, Springer, 2012.
[41] J. Xu, V. K. Ithapu, L. Mukherjee, J. M. Rehg, and V. Singh, "Gosus: Grassmannian online subspace updates with structured-sparsity," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3376–3383, 2013.
[42] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, "Cdnet 2014: an expanded change detection benchmark dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 387–394, 2014.
[43] T. Minematsu, A. Shimada, and R.-i. Taniguchi, "Analytics of deep neural network in change detection," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2017.
[44] Y. Zhang, X. Li, Z. Zhang, F. Wu, and L. Zhao, "Deep learning driven blockwise moving object detection with binary scene modeling," Neurocomputing, vol. 168, pp. 454–463, 2015.
[45] G. Gemignani and A. Rozza, "A novel background subtraction approach based on multi layered self-organizing maps," in 2015 IEEE International Conference on Image Processing (ICIP), pp. 462–466, IEEE, 2015.
[46] S. Jiang and X. Lu, "Wesambe: a weight-sample-based method for background subtraction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2105–2115, 2017.
[47] S. Varadarajan, P. Miller, and H. Zhou, "Spatial mixture of gaussians for dynamic background modelling," in 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 63–68, IEEE, 2013.
[48] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, "Subsense: A universal change detection method with local adaptive sensitivity," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 359–373, 2015.
[49] L. Guo, D. Xu, and Z. Qiang, "Background subtraction using local svd binary pattern," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 86–94, 2016.
[50] G. Allebosch, D. Van Hamme, F. Deboeverie, P. Veelaert, and W. Philips, "C-efic: color and edge based foreground background segmentation with interior classification," in International Joint Conference on Computer Vision, Imaging and Computer Graphics, pp. 433–454, Springer, 2016.
[51] R. Wang, F. Bunyak, G. Seetharaman, and K. Palaniappan, "Static and moving object detection using flux tensor with split gaussian models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 414–418, 2014.
[52] S. Bianco, G. Ciocca, and R. Schettini, "How far can you get by combining change detection algorithms?," in International Conference on Image Analysis and Processing, pp. 96–107, Springer, 2017.
[53] H. Sajid and S.-C. S. Cheung, "Universal multimode background subtraction," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3249–3260, 2017.
[54] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
[55] J. Winterer, N. Maier, C. Wozny, P. Beed, J. Breustedt, R. Evangelista, Y. Peng, T. D'Albis, R. Kempter, and D. Schmitz, "Excitatory microcircuits within superficial layers of the medial entorhinal cortex," Cell Reports, vol. 19, no. 6, pp. 1110–1116, 2017.
[56] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[57] Y. Yang, J. Q. Wu, X. Feng, and A. Thangarajah, "Recomputation of dense layers for the performance improvement of dcnn," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[58] P. Li, Z. Chen, L. T. Yang, Q. Zhang, and M. J. Deen, "Deep convolutional computation model for feature learning on big data in internet of things," IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 790–798, 2017.
[59] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," stat, vol. 1050, p. 23, 2016.
[60] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
[61] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[62] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[63] L. Yang, J. Li, Y. Luo, Y. Zhao, H. Cheng, and J. Li, "Deep background modeling using fully convolutional network," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 254–262, 2018.
[64] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 399–402, ACM, 2005.
[65] M. Silvestri, A. Elia, D. Bertelli, E. Salvatore, C. Durante, M. L. Vigni, A. Marchetti, and M. Cocchi, "A mid level data fusion strategy for the varietal classification of lambrusco pdo wines," Chemometrics and Intelligent Laboratory Systems, vol. 137, pp. 181–189, 2014.
[66] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust rgb-d object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 681–687, IEEE, 2015.
[67] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[68] H. T. Mustafa, J. Yang, and M. Zareapoor, "Multi-scale convolutional neural network for multi-focus image fusion," Image and Vision Computing, 2019.
[69] T. Akilan, Q. J. Wu, A. Safaei, and W. Jiang, "A late fusion approach for harnessing multi-cnn model high-level features," in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 566–571, IEEE, 2017.

[70] Z. Allen-Zhu, Y. Li, and Z. Song, "A convergence theory for deep learning via over-parameterization," in Proceedings of the 36th International Conference on Machine Learning (K. Chaudhuri and R. Salakhutdinov, eds.), vol. 97 of Proceedings of Machine Learning Research, pp. 242–252, PMLR, 2019.
[71] S. Arora, N. Cohen, and E. Hazan, "On the optimization of deep networks: Implicit acceleration by overparameterization," in ICML, 2018.
[72] S. Shin, Y. Boo, and W. Sung, "Fixed-point optimization of deep neural networks with adaptive step size retraining," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1203–1207, March 2017.
[73] D. Lin, S. Talathi, and S. Annapureddy, "Fixed point quantization of deep convolutional networks," in International Conference on Machine Learning, pp. 2849–2858, 2016.
[74] X. He and J. Cheng, "Learning compression from limited unlabeled data," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 752–769, 2018.
[75] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, "Semi-supervised knowledge transfer for deep learning from private training data," arXiv preprint arXiv:1610.05755, 2016.
[76] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, 2015.
[77] B. Peng, W. Tan, Z. Li, S. Zhang, D. Xie, and S. Pu, "Extreme network compression via filter group approximation," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316, 2018.
[78] J. Chien and Y. Bao, "Tensor-factorized neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1998–2011, 2018.
[79] N. Passalis and A. Tefas, "Training lightweight deep convolutional neural networks using bag-of-features pooling," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 6, pp. 1705–1715, 2019.
[80] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, "Nisp: Pruning networks using neuron importance score propagation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203, 2018.
[81] T. Cui, S. Han, and C. Tellambura, "Probability-distribution-based node pruning for sphere decoding," IEEE Transactions on Vehicular Technology, vol. 62, no. 4, pp. 1586–1596, 2013.
[82] M. Bortman and M. Aladjem, "A growing and pruning method for radial basis function networks," IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 1039–1045, 2009.
[83] N. Rathi, P. Panda, and K. Roy, "Stdp-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 4, pp. 668–677, 2019.

Thangarajah Akilan (S'13-M'18) received his Ph.D. degree in electrical and computer engineering from the University of Windsor, Windsor, Ontario, Canada. He is currently an Assistant Professor with the Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada. His research interests include object/action recognition, image/video processing and segmentation, and data fusion using statistical techniques, machine/deep learning, and natural language processing. He is a recipient of the 2015-2016 Golden Key premier Graduate Scholar Award and the 2013-2014 His Majesty the King's Scholarship of the Royal Thai Government. He serves as secretary of the IEEE Windsor Section, Canada, and as a reviewer for several journals, including the IEEE Transactions on Multimedia, IEEE Transactions on Intelligent Transportation Systems, and IEEE Transactions on Industrial Informatics.

Q.M. Jonathan Wu (M'92-SM'09) received his Ph.D. in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was affiliated with the National Research Council of Canada for ten years beginning in 1995, where he became a Senior Research Officer and a Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He is a Visiting Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. He has published more than 300 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu holds the Tier 1 Canada Research Chair in Automotive Sensors and Information Systems. He is an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems and Cognitive Computation. He has served on technical program committees and international advisory boards for many prestigious conferences.

Wandong Zhang (S'17) received his M.S. degree from the Ocean University of China in 2018. He has been awarded a scholarship under the State Scholarship Fund by the China Scholarship Council (CSC) for pursuing his Ph.D. degree in the Department of Electrical and Computer Engineering at the University of Windsor, ON, Canada. His current research interests include radar image processing, remote sensing data processing, feature learning and representation, and deep neural networks as well as their applications in computer vision.
