Fig. 2: (a) The ResNet-like and (b) the Inception-like feature embedding exploited in MvRF-CNN.
Thus, all the outcomes are combined along the channel dimension to generate a multi-level feature representation. However, our module allows the use of a stride rate > 1 in each branch and performs the feature concatenation whenever the spatial dimensions of the branches match, as in the example shown in Fig. 2-(b). Consequently, the FCN was extended into an encoder-decoder NN named U-net for an application of biomedical cell segmentation [31], wherein the activation maps at the encoding stage are combined with the activation maps at the decoding phase.

The proposed MvRF-CNN inherits the following variations from the U-net:
• To capture spatio-temporal agents, the model's input layer is configured as 3-channel and expects two consecutive frames and a scene-specific prior background.
• To handle object size variations due to objects coming towards cameras or going towards vanishing points, and camera view changes due to mount jitteriness, the model harnesses multi-view receptive field feature fusions (micro-inception modules) in early, mid, and late stages.
• U-net uses a conventional subsampling layer with the max-pooling operation. However, it takes a toll on the segmentation accuracy [32]. To address this, the proposed network carries out the subsampling process through 2D convolution with a stride rate of 2, a kernel size of 3 × 3, and zero padding, as sketched in the example below.
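For illustration only (not the authors' implementation), the following hedged Keras sketch shows how such a micro-inception block with multi-view receptive fields (3 × 3, 5 × 5, and 9 × 9 kernels) and strided-convolution subsampling in place of max-pooling could be wired; the filter count of 32 follows Table I, while the exact block composition is an assumption.

```python
# Hedged sketch: a micro-inception block with multi-view receptive fields and
# strided-convolution subsampling (instead of max-pooling), loosely following Table I.
import tensorflow as tf
from tensorflow.keras import layers

def micro_inception_downsample(x, filters=32):
    # Three parallel branches with different receptive fields, each halving the spatial size.
    b3 = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
    b5 = layers.Conv2D(filters, 5, strides=2, padding='same', activation='relu')(x)
    b9 = layers.Conv2D(filters, 9, strides=2, padding='same', activation='relu')(x)
    # The branches share the same spatial size, so they can be fused along the channel axis.
    return layers.Concatenate(axis=-1)([b3, b5, b9])

inp = layers.Input(shape=(240, 320, 3))        # two gray frames + prior background
out = micro_inception_downsample(inp)          # (120, 160, 96)
tf.keras.Model(inp, out).summary()
```

In the actual network (Table I), the three kernel sizes are kept in separate feature flows and concatenated only where their spatial dimensions coincide.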
B. Video Foreground Segmentation: From Conventional Models to Deep Neural Networks

Moving object segmentation in videos is a more challenging task than image segmentation or frame-level object segmentation, since it has to handle the variations and the dynamism in the background. Thus, most of the learning-based solutions are scene-specific modelings, and they perform well since changes in the background context are limited in a specific scene [33]–[35]. However, some attempts were carried out for non-scene-specific FG segmentation as well. The basic model in this category is the frame differencing-based approach [3], [36], [37]. It takes the absolute pixel-difference between two adjacent frames and applies a threshold to segment the salient part of the scene. The results of this technique may not necessarily be the moving object, but rather an illumination change between the two frames. The improved versions of this approach use an adaptive strategy, whereby a few past frames are aggregated through heuristic analysis, like running average, normal distribution, and median filtering. Then, the aggregated frame is subtracted from the current frame, followed by a thresholding procedure as described earlier. The number of frames aggregated varies from three to twelve depending on the real-time requirement of the surveillance system [37], [38]. With time, researchers introduced advanced online background estimation and subtraction methods, such as [39], [38], [40], and [41]. These methods speed up the computation by updating the model parameters incrementally, one frame at a time on the fly. However, these methods are prone to low performance when the initial condition is not optimal.

To tackle the problem, scene-specific initialization using collected prior information of the surveillance region is introduced. Here, samples are collected from the specific zone before the actual surveillance operation [1], [15], [33], [42]. The same phenomenon is true for DNN-based solutions as well. Recent studies have shown that different video surveillance conditions contain various types of BG changes that hinder the learning ability of the DNNs [4], [34], [35], [43]. To overcome this, Minematsu et al. [43] suggest an integrated model that combines multiple scene-specific networks. Nonetheless, such integration results in a large number of parameters to be trained and low inference speed. Thus, it is a plausible solution for traffic video surveillance to have scene-specific models for moving object segmentation, since the cameras are fixed on near-static mounts.

In this direction, Zhang et al. [44] develop a NN-based model that has a stacked denoising auto-encoder (SDAE) learning network and a binary scene modeling based on density analysis, whereby the SDAE encodes the essential structural information of the scene. The encoded features of image patches are then hashed in Hamming space, and a hash-based binary scene is modeled by density analysis, which captures the temporally distributed spatial information. Similarly, Zhao et al. [19] also exploit a stacked multi-layer SOM, wherein the initial training is carried out using some BG samples; then, during the FG detection for a test sample, the BG model is maintained through online updates. Gemignani and Rozza [45] improve the basic SOM model with a self-balancing multi-layered network that tracks long-time pixel dynamics for better FG extraction. The authors in [46] approach the BG modeling as an evidence collection of each pixel in a scene with a weighted sample-based method.
Fig. 3: Layer connectivity diagram of the proposed MvRF-CNN, showing the input, the Conv/ConvT layers with kernel sizes of 3, 5, and 9, the concatenation-based feature fusions at the early, mid, and late fusion stages, the drop out layers, and the data/feature flows. Description of each layer (L<ID>) with respective inputs and output dimensions is given in Table I.
They also use a minimum-weight and a reward/penalty scheme that takes into account the sudden changes in the scene in such a way that the most irrelevant sample is replaced instead of the oldest or a random sample. Then, the FG/BG classification is carried out by applying a threshold. Although their method has computational speed, they record a poor accuracy of FG detection. On the other hand, Varadarajan et al. [47] come up with an algorithm where the spatial relationship between neighboring pixels is considered during classification through a region-based GMM, in contrast to the traditional GMM that models the distributions at the pixel level [1], [15].

Local features, like the local binary pattern (LBP), are also used for BG modeling. For instance, St-Charles et al. [48] coined a system called Self-Balanced SENsitivity SEgmenter (SuBSENSE) that adopts the local binary similarity pattern (LBSP) as additional features to pixel intensities in a non-parametric BG model. The pixel-level BG model is updated using feedback loops. Likewise, the authors in [49] utilize the LBP with a local singular value decomposition (SVD) operator to extract an invariant-feature selection. Then, they use the SAmple CONsensus (SACON) model for creating the BG based on statistics of the past 300-frame pixel process, and they use the Hamming distance measure, similar to [44], with a predetermined threshold value to separate the FG pixels. These models demand static and clean samples for creating a background dictionary; thus, they lack suitability for real-time applications. Similarly, Allebosch et al. [50] also utilize local cues, namely RGB color intensity values and local ternary pattern (LTP)-based edge descriptors. They form two backgrounds and create two FG masks. Then, using a pixel-wise AND operation, they refine the primarily detected FG masks and get a rectified FG region.

Meanwhile, DeepBS [20] exploits a conventional CNN, trains the network with randomly selected video frames and their corresponding patch-based ground truth segmentation samples, like in [44], and carries out a post-processing stage that performs spatial median filtering on the outputs. The downfalls of their method are the random selection of frames for training, the dependency of background library generation on external algorithms (SuBSENSE [48] and Flux Tensor [51]), and the patch-based processing. The random selection creates the possibility of a frame being in the training set while a temporally nearest (next or previous) frame is in the test set. It literally results in the model memorizing the sequences rather than being trained to predict unknown scenarios. Hence, relying on complex external algorithms for BG estimation should not be required for deep learning. On the other hand, the patch-based training and prediction is a slow processing strategy, and it is not applicable for real-time traffic and surveillance applications. Besides, they only tested the model on a mere 20 frames per video, while taking 90% of the samples for training. The proposed MvRF-CNN overcomes these shortcomings with whole-frame processing, temporal median filtering-based BG generation, and temporally exclusive training and test samples.

The LSTM-based CNN model [4] attempts to handle the spatio-temporal agents between moving objects and background. This model requires 4 consecutive frames including the current one to extract the FG region. Although the model achieves an average of 94% FoM, it suffers a low throughput of 24 FPS due to the heavy computational LSTM modules integrated with 3D convolutions (3D Conv). To subdue this, the proposed model exploits a background prior and a previous frame stacked with the current frame under a grayscale setting, and the model avoids using 3D Conv and LSTM modules. This approach produces, on average, 1.75× higher FPS and 1% more FoM than the 3D Conv-LSTM method in [4].

Although many ideas have been introduced over the past few decades, due to the challenging nature of FG extraction there is not a single method that can be claimed as the ultimate solution. Therefore, Bianco et al. [52] attempt harnessing multiple state-of-the-art FG extraction algorithms under one umbrella. They use genetic programming (GP) to obtain a solution tree. Likewise, Sajid et al. [53] introduce a multi-framework that computes a background model bank (BMB) with multiple BG models. Then, to extract the FG, they use a spatial de-noising based on Mega-Pixel (MP) to pixel-level probability, and they employ a fusion technique to define a refined FG region. Nonetheless, these models also cannot be the universal model for the FG identification problem.
III. Proposed MvRF-CNN Architecture

A. Network Formation

The approach to model the network is based on intuition and an ablation study.

I. Purpose: To have a scene-specific model for moving object detection. The model is to be trained on manually delineated FG (moving objects) from a set of training frames; then it should automatically identify the FG objects on the remaining frames of the same video sequence. The segmentation results must be sufficiently accurate that they do not need any post-processing.

II. Input dimension: Firstly, we determine an appropriate dimension of the input layer through analytical reasoning. The target datasets considered for this study, tabulated in Table III, have various frame sizes (width×height) ranging from 320×240 to 720×576, with a median of 320×240. This median value is chosen as the input layer dimension, and all the samples are resized to this dimensionality.

III. The structure: Table I provides detailed information on the layer connectivity patterns, kernel sizes, interconnected activators, and output dimensions, while Fig. 3 presents an intuitive schematic of the proposed MvRF-CNN. It subsumes two learning phases, Encoding and Decoding, like any other Encoder-Decoder (EnDec) models used in neural network and learning systems.

TABLE I: Layer detail and connectivity pattern of the proposed MvRF-CNN. A(k, s): A - type of convolutional operation, k - kernel size, s - stride rate; output shape as [b, H, W, D]: b - mini-batch size, H - height, W - width, D - number of channels; f(·) - classifier (Sigmoid), NC - number of output channels.

Encoding phase
| Layer ID | Layer type A(k, s)  | Output Shape [b, H, W, D] | Input         |
| Input    | Input Layer         | [b, 240, 320, 3]          | mini-batch    |
| L1       | Conv (3, 1) → ReLU  | [b, 240, 320, 32]         | Input         |
| L2       | Conv (5, 1) → ReLU  | [b, 240, 320, 32]         | Input         |
| L3       | Conv (9, 1) → ReLU  | [b, 240, 320, 32]         | Input         |
| L4       | Conv (3, 2) → ReLU  | [b, 120, 160, 32]         | L1            |
| L5       | Conv (3, 2) → ReLU  | [b, 60, 80, 32]           | L4            |
| L6       | Conv (5, 4) → ReLU  | [b, 60, 80, 32]           | L2            |
| L7       | Cat                 | [b, 60, 80, 64]           | L5, L6        |
| L8       | Conv (3, 2) → ReLU  | [b, 30, 40, 32]           | L7            |
| L9       | Conv (3, 2) → ReLU  | [b, 30, 40, 32]           | L6            |
| L10      | Cat                 | [b, 30, 40, 96]           | L8, L9, L11   |
| L11      | Conv (9, 8) → ReLU  | [b, 30, 40, 32]           | L3            |
| L12      | Conv (3, 2) → ReLU  | [b, 15, 20, 32]           | L10           |
| L13      | Conv (5, 4) → ReLU  | [b, 15, 20, 32]           | L6            |
| L14      | Cat                 | [b, 15, 20, 96]           | L12, L13, L15 |
| L15      | Conv (3, 2) → ReLU  | [b, 15, 20, 32]           | L11           |
| L16      | Conv (5, 1) → ReLU  | [b, 15, 20, 32]           | L17           |
| L17      | Conv (3, 1) → ReLU  | [b, 15, 20, 64]           | L14           |
| L18      | Conv (3, 1) → ReLU  | [b, 15, 20, 32]           | L17           |
Decoding phase
| L19      | ConvT (3, 2) → ReLU | [b, 30, 40, 32]           | L17           |
| L20      | Cat                 | [b, 30, 40, 128]          | L10, L19      |
| L21      | Drop out (0.3)      |                           | L20           |
| L22      | Conv (3, 1) → ReLU  | [b, 30, 40, 32]           | L21           |
| L23      | ConvT (5, 4) → ReLU | [b, 60, 80, 32]           | L16           |
| L24      | ConvT (3, 2) → ReLU | [b, 60, 80, 32]           | L22           |
| L25      | ConvT (3, 2) → ReLU | [b, 30, 40, 32]           | L18           |
| L26      | Conv (5, 1) → ReLU  | [b, 60, 80, 32]           | L23           |
| L27      | Cat                 | [b, 60, 80, 128]          | L7, L24, L26  |
| L28      | Conv (9, 1) → ReLU  | [b, 30, 40, 32]           | L25           |
| L29      | ConvT (5, 4) → ReLU | [b, 240, 320, 32]         | L26           |
| L30      | Drop out (0.3)      |                           | L27           |
| L31      | ConvT (9, 8) → ReLU | [b, 240, 320, 32]         | L28           |
| L32      | Conv (3, 1) → ReLU  | [b, 60, 80, 32]           | L30           |
| L33      | ConvT (3, 2) → ReLU | [b, 120, 160, 32]         | L32           |
| L34      | Cat                 | [b, 120, 160, 64]         | L4, L33       |
| L35      | Drop out (0.3)      |                           | L34           |
| L36      | Conv (3, 1) → ReLU  | [b, 120, 160, 32]         | L35           |
| L37      | ConvT (3, 2) → ReLU | [b, 240, 320, 32]         | L36           |
Top
| L38      | Cat → BN            | [b, 240, 320, 96]         | L29, L31, L37 |
| L39      | Drop out (0.3)      |                           | L38           |
| L40      | Conv (3, 1) → ReLU  | [b, 240, 320, 128]        | L39           |
| L41      | Conv (3, 1) → f(·)  | [b, 240, 320, NC]         | L40           |
Total number of trainable parameters: 864,385

Encoding phase - It consists of nineteen layers, beginning from the input layer and ending at the subsampling Conv where the spatial dimension of the feature maps reaches 15 × 20; at this stage the encoding process completes. Note that it is important to terminate the encoding phase when either the width or the height of the feature maps reaches an odd integer. Otherwise, the decoding process through up-sampling with transposed convolution may not reconstruct feature maps spatially aligned with the intermediate feature maps generated in the encoding phase, resulting in complex feature-level fusions. Since the smallest kernel in this work is 3 × 3, we stop the encoding sub-network when the feature map dimension becomes 15 × 20.

Decoding phase - It is initiated by a ConvT layer (L19) that up-samples the output of the encoding sub-network and subsequently ends at the last ConvT layer (L37), producing a feature map dimension of 240 × 320. Similar to the encoding phase, there are nineteen layers networked to complete the up-sampling process.

Fusion of feature flows - The MvRF-CNN integrates three sub-networks: a pivotal feature flow (PFF) and two complementary feature flows (CFF-1 and CFF-2). The kernel sizes for the sub-networks are determined empirically. This work finds that filter sizes of 5 × 5 and 9 × 9 for the two CFFs perform robustly when a kernel size of 3 × 3 is used in the PFF. The PFF is essentially an EnDec network that takes the following layers: L1, L4, L5, L7, L8, L10, L12, L14, L17, L19 - L22, L24, L27, L30, and L32 - L41. This sub-network performs the core operations, i.e., down-sampling and up-sampling using 3 × 3 Conv filters. The learning ability of this sub-network is complemented by CFF-1 and CFF-2, as their key operations employ different receptive fields as stated earlier. It enhances the robustness of the model by learning scale-invariant FG agents. Here, CFF-1 includes the layers L2, L6, L9, L13, L16, L23, L26, and L29, while CFF-2 consists of the layers L3, L11, L15, L18, L25, L28, and L31. The agents learned throughout the CFF sub-networks are used for feature-level augmentation under three principles: early fusion, mid fusion, and late fusion. The early fusion occurs in layers L7, L10, and L14. The mid fusion happens in layers L20 and L27. Finally, the late fusion is set in layers L34 and L38. Notably, the mid and late fusions are facilitated by two mini decoders placed in the CFFs that consist of L16, L23, L26, and L29 in CFF-1 and L18, L25, L28, and L31 in CFF-2. These mini decoders also perform two levels of the up-sampling process, besides the PFF. The feature-level fusions at every level are carried out through residual feature forwarding from the encoding phase to the decoding phase. There are seven such layers in total, as described earlier. A minimal wiring sketch of the early-fusion portion of these flows is given below.
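As a rough illustration (not the authors' code), the first early-fusion rows of Table I can be expressed with the Keras functional API as follows; the layer names mirror the table's IDs, and `padding='same'` is assumed so that the listed output shapes are reproduced.

```python
# Hedged sketch of the first early-fusion stage of Table I (Input, L1, L2, L4-L7),
# assuming 'same' padding; the shapes follow the table, so L7 -> (60, 80, 64).
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(240, 320, 3), name='Input')
l1 = layers.Conv2D(32, 3, strides=1, padding='same', activation='relu', name='L1')(inp)
l2 = layers.Conv2D(32, 5, strides=1, padding='same', activation='relu', name='L2')(inp)
l4 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu', name='L4')(l1)
l5 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu', name='L5')(l4)
l6 = layers.Conv2D(32, 5, strides=4, padding='same', activation='relu', name='L6')(l2)
l7 = layers.Concatenate(name='L7')([l5, l6])   # early fusion: (60, 80, 64)

stub = tf.keras.Model(inp, l7, name='mvrf_cnn_early_fusion_stub')
stub.summary()
```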
The residual feature forwarding is a way of reusing the activations of a downstream layer to amplify the features in an upstream layer while skipping a single intermediate layer or a set of such layers [25], [54], [55]. It functions well when a single non-linear layer is stepped over, or when the set of middle layers are all linear; otherwise, an explicit weight matrix should be learned for the skipped connection. The amplification can be in the form of a mathematical operation, like addition, or a channel expansion by concatenation as used in this work. An example of the residual connection is shown in Fig. 2-(b). Thus, the fusion of feature maps from encoding layers that hold high-frequency detail yields finer FG boundaries [54], [56], [57], and the heterogeneous Conv kernels used in the sub-networks together with the feature-level augmentation carried out via residual connections capture scale-invariant, intricate information. Hence, the MvRF-CNN learns a well-generalized representation of FG objects based on local and global contextual cues.

In summary, the MvRF-CNN contains 864,385 trainable parameters that subsume convolution (conv), transpose convolution (convT), concatenation (cat), drop out, and batch normalization (BN) layers. The encoding phase includes 18 layers that end the sub-sampling when the spatial dimension reaches 15 × 20. It is immediately followed by the decoding phase that up-samples the encoded feature maps systematically back to the same spatial dimension as the input. The symmetric decoding paths contain 19 layers that consist of inception and residual-feature fusions to capture high-resolution contextual cues of the FG objects. The network does not have any max pooling or hidden densely connected layers. It accepts input frames with any spatial dimensions and resizes them into 240 × 320 using nearest-neighbor re-scaling.

B. Convolutional Layer

The heart of the CNNs is the convolutional operation, where the filter weights function like a dictionary of visual patterns or a system memory. With respect to a kernel ω, a bias vector b, and an input image/patch x, it is computed as

C(m, n) = b + \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} \omega(k, l) \ast x(m + k, n + l),   (1)

where ∗, K, {m, n}, and {k, l} denote, respectively, the conv operation, the filter size, the origin of the image-patch, and the kernel's element index [58].
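A direct NumPy transcription of (1), given only as a sketch for clarity (square kernel, valid output region assumed), looks like this:

```python
# Minimal NumPy sketch of Eq. (1): slide a K x K kernel over the patch and
# accumulate the element-wise products plus a bias (valid region only).
import numpy as np

def conv2d_single(x, w, b=0.0):
    K = w.shape[0]                       # square kernel of size K x K
    H, W = x.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for m in range(out.shape[0]):        # origin of the image patch
        for n in range(out.shape[1]):
            out[m, n] = b + np.sum(w * x[m:m + K, n:n + K])
    return out

x = np.random.rand(240, 320)
w = np.random.rand(3, 3)
print(conv2d_single(x, w).shape)         # (238, 318)
```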
The transpose convolution (convT) layers, on the other hand, perform the up-sampling process while maintaining the connectivity pattern. In contrast to the standard image resizing operation, the convT has trainable parameters, and they are updated during training. It is achieved by inserting zeros between consecutive neurons in the receptive field of the input and then sliding the conv kernel with unit strides [59].
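In Keras this corresponds to `Conv2DTranspose`; the sketch below, offered only as an assumed illustration, reproduces the behaviour of layer L19 in Table I (ConvT(3, 2): 15 × 20 → 30 × 40).

```python
# Hedged sketch: learnable up-sampling with a transposed convolution,
# mirroring Table I's L19 (kernel 3, stride 2): (15, 20, 64) -> (30, 40, 32).
import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(15, 20, 64))
y = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)
print(tf.keras.Model(x, y).output_shape)   # (None, 30, 40, 32)
```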
C. Optimizer

The MvRF-CNN is trained through the Adam optimizer that minimizes the binary cross-entropy loss E defined by (2), where the optimizer takes a base learning rate of 0.0002 with a learning rate scheduler that reduces the learning rate by a factor of 0.8 over the training:

E(\hat{p}_n) = -\frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ p_{n,m} \log \hat{p}_{n,m} + (1 - p_{n,m}) \log(1 - \hat{p}_{n,m}) \right],   (2)

where N is the number of samples in the mini-batch, M is the number of pixels (H × W) in the unrolled image, and \hat{p}_{n,m} is the output from the final layer of the network (layer L41 in Table I) for the m-th location in the n-th image. \hat{p}_{n,m} maps the FG pixel probabilities using the Sigmoid function (σ(x_{n,m}) ∈ [0, 1]), and the target p_{n,m} ∈ [0, 1] is a supervised signal that is set to 0 for the BG class and 1 for the FG class. Both \hat{p}_{n,m} and p_{n,m} have the same dimensionality.
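A compile-and-schedule sketch consistent with this description is shown below; the choice of `ReduceLROnPlateau` as the concrete scheduler is an assumption, since the text only states that the learning rate decays by a factor of 0.8 during training, and the stand-in model is a placeholder for the full graph.

```python
# Hedged sketch: Adam with a 2e-4 base learning rate, binary cross-entropy,
# and a scheduler that shrinks the learning rate by a factor of 0.8.
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in single-layer model; in practice this would be the full MvRF-CNN graph.
inp = layers.Input(shape=(240, 320, 3))
out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(inp)
model = tf.keras.Model(inp, out)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
              loss='binary_crossentropy')

# The paper does not name the scheduler; ReduceLROnPlateau is one plausible realization.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.8, patience=3)
# model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=[lr_schedule])
```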
TABLE II: Sample pairs used for transfer learning.
| Fine-tuned for | Transferred from | Fine-tuned for  | Transferred from |
| Highway        | Turnpike_0_5fps  | Boulevard       | Highway          |
| Office         | CopyMachine      | CopyMachine     | Office           |
| Canoe          | Boats            | PeopleInShade   | Pedestrians      |
| Boats          | Canoe            | Turnpike_0_5fps | Highway          |
| Overpass       | Pedestrians      | Boulevard       | Turnpike_0_5fps  |
| Traffic        | Highway          | Fall            | Blizzard         |
| Skating        | SnowFall         | SnowFall        | Blizzard         |

D. Transfer Learning

To improve the network's generalization calibre, intra-class domain transfer is incorporated. Here, a network is trained on a source domain; then its weights are used to initialize a fresh model in the target domain. Thereafter, all the layers of the new model are fine-tuned end-to-end via back-propagation. Table II lists some sample video sequence pairs used to fine-tune the scene-specific MvRF-CNN model. For example, the network pre-trained with Turnpike_0_5fps is retrained for Highway. The theoretical and philosophical expositions of the transfer learning strategy can be referred to in [60] and [61].
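Procedurally, the intra-class transfer amounts to cloning the source model's weights and continuing training on the target sequence; a hedged sketch is given below, where the weight file and data generators are hypothetical placeholders.

```python
# Hedged sketch of the intra-class transfer step: initialize the target-domain
# model from the source-domain weights, then fine-tune all layers end-to-end.
# 'turnpike_0_5fps.h5' and the generators are hypothetical placeholders.
import tensorflow as tf

source = tf.keras.models.load_model('turnpike_0_5fps.h5')   # trained on the source scene
target = tf.keras.models.clone_model(source)                # same MvRF-CNN topology
target.set_weights(source.get_weights())                    # transfer the parameters

for layer in target.layers:                                 # all layers stay trainable
    layer.trainable = True

target.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
               loss='binary_crossentropy')
# target.fit(highway_train_gen, validation_data=highway_val_gen)
```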
E. Binary FG Mask

To transform the salient map, or probability map, generated by the MvRF-CNN into a binary mask, we apply a dataset-specific global threshold in the range [0.05, 0.75] and the Otsu-based automatic segmentation method. Then the noisy artifacts in the binary mask are cleaned by a neighbourhood pixel connectivity analysis that removes regions with fewer than 50 pixels. Otsu's procedure iteratively finds a threshold that lies between the two peaks of the intensity histogram of a bi-modal image such that the intra-class variances of the FG and BG classes are minimum. An explicit derivation of the method can be found in [62].
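The mask post-processing described here can be sketched with scikit-image as below; the exact global threshold is dataset-specific, so the 0.5 used here is only a placeholder.

```python
# Hedged sketch: binarize the Sigmoid salient map via a global or Otsu threshold,
# then drop connected components smaller than 50 pixels.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects

def binarize(salient_map, global_th=0.5, use_otsu=False):
    th = threshold_otsu(salient_map) if use_otsu else global_th   # 0.5 is a placeholder
    mask = salient_map > th
    return remove_small_objects(mask, min_size=50)                # neighbourhood cleanup

salient = np.random.rand(240, 320)
fg_mask = binarize(salient, use_otsu=True)
print(fg_mask.dtype, fg_mask.shape)
```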
F. Algorithmic Summary

The entire process of the proposed FG extraction model, from training to inferencing, can be summarized as shown in Fig. 4.
1. Video sequences: It is the collection of scene-specific raw data. The detail of each sequence is listed in Table III.
2. Preprocessing: It consists of scene-specific prior BG model estimation and dataset splitting (refer to Section IV-A-II) as
Fig. 5: Sequence-specific training and test sets: (a) temporally exclusive training and test datasets and (b) input data configuration: G.T - ground truth, I - input frame, Med(·) - temporal median filtering, b0 - the pre-computed BG model, ft - the current scene at time t, and ft−1 - the previous scene at time t − 1.
as the cameras pan around. So the motion detail captured by taking two consecutive frames along with a generalized BG model will be entirely different when the camera pans from one point to another. On the other hand, the night video set includes traffic flow videos shot at night; there is strong lighting variation in all the sequences in this category. Likewise, the low frame-rate category consists of traffic flow scenes shot at a low frame rate. Lastly, the bad weather set contains videos shot in adverse snow conditions.

It is worth mentioning that, among the thirty-seven videos, more than ten of them contain continuous dynamism in the background due to vibrating camera mounts, swaying leaves, shimmery water bodies, fluid dynamics, jumping water, non-monotonic snowfall, flying flags, and so forth. For instance, such minute dynamism can be observed in the Highway (HW), Canoe (CA), Boats (BO), Fountain01 (F1), Fountain02 (F2), Overpass (OV), Fall (FA), Parking (PA), Blizzard (BL), Skating (SK), and Snowfall (SF) datasets.

II. Temporally exclusive training and test sets: The CDnet 2014 datasets were particularly arranged for the traditional background subtraction task. Here, the task is defined as follows: a model has to be trained on a set of training samples from a sequence, and then the trained model must automatically label the rest of the frames of the same sequence with adequate accuracy. Thus, the CDnet has a minimum of 299 initial frames without ground truths (refer to Table III) for BG formation, targeting the conventional models (e.g., GMM), and the rest of the frames are provided with hand-segmented foreground objects. From this, creating appropriate train and test datasets is crucial, as the performance of deep learning models hugely depends on the correctness of the training set. For that, in the literature [20] the authors create a training set using 150 randomly collected frames from every video in the CDnet 2014 and extract patches with a size of 48 × 48 at a stride of 10. Then, they test it on very few samples, a set of 20 frames for each sequence. Such dataset preparation exposes two downfalls.
i. A random selection of frames may assign a frame ft to the training set and a temporally closest frame, like ft+1 or ft−1, to the test set. Consequentially, there can be many such samples in the training and test sets, resulting in mere exclusiveness. The random selection is acceptable for an object recognition or classification task; however, for video foreground segmentation it is a questionable approach, since the test sequence has to be temporally unknown or excluded from the training set.
ii. The training to test set ratio is ≈ 88% to 12%, which is too little to validate the model, we suppose. A similar issue is found in the CNN-based FG detection model presented in the literature [63], where 90% of the samples are used for training and only a set of 20 randomly selected frames is used for evaluating the model.

To address these issues, we create temporally exclusive training and test sets by splitting each video sequence in an orderly manner, whereby the training set takes the first 70% while the test set takes the last 30% of the input frames that have ground truths, as illustrated in Fig. 5-a. During training, the model takes advantage of data transformation, a.k.a. augmentation, that includes vertical and horizontal translation with a fraction of 0.1 of the actual input image's height and width, random rotation within 10 degrees, and a zooming factor in the range of 0.1 inside the input image. Hence, the training procedure uses a generator (in this case, a Keras ImageDataGenerator) that loops over the data (in batches) and applies the aforesaid transformations to batches of images randomly on-the-fly. These data transformations are applied to the training samples and to their corresponding foreground truths simultaneously.
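A hedged Keras sketch of such a generator pair, using the stated translation fraction (0.1), rotation (10 degrees), and zoom (0.1) and a shared seed so that frames and masks receive identical transforms, is given below; the directory names are hypothetical.

```python
# Hedged sketch: paired ImageDataGenerators for frames and FG masks with identical
# random transforms (shared seed). Directory names are placeholders.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug = dict(width_shift_range=0.1, height_shift_range=0.1,
           rotation_range=10, zoom_range=0.1)

frame_gen = ImageDataGenerator(**aug).flow_from_directory(
    'train/frames', target_size=(240, 320), class_mode=None, batch_size=8, seed=7)
mask_gen = ImageDataGenerator(**aug).flow_from_directory(
    'train/masks', target_size=(240, 320), color_mode='grayscale',
    class_mode=None, batch_size=8, seed=7)

train_gen = zip(frame_gen, mask_gen)   # yields (augmented frames, augmented masks)
```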
The model handles the temporal information of the video sequence as shown in Fig. 5-b, where the input is two consecutive gray-scale frames stacked with a generic gray-scale BG model estimated from the prior samples listed in Table III that have no ground truths (GT).
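A sketch of this input construction, with the generic BG model taken as the temporal median of the GT-free prior frames (consistent with the Med(·) block of Fig. 5-b) and an assumed channel ordering, could look like this:

```python
# Hedged sketch of Fig. 5-b's input configuration: a temporal-median BG model b0
# from the GT-free prior frames, stacked with the previous and current gray frames.
# The (b0, f_prev, f_curr) channel order is an assumption.
import numpy as np

def build_input(prior_frames, f_prev, f_curr):
    """prior_frames: (N, H, W) gray frames without GT; f_prev, f_curr: (H, W)."""
    b0 = np.median(prior_frames, axis=0)              # Med(.) background model
    x = np.stack([b0, f_prev, f_curr], axis=-1)       # 3-channel network input
    return x.astype(np.float32) / 255.0

prior = np.random.randint(0, 256, (299, 240, 320))
x = build_input(prior, prior[-2], prior[-1])
print(x.shape)   # (240, 320, 3)
```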
sequence. Such dataset preparation exposes to two downfalls. 2 × (Pr × Re)
F − measure = . (3)
i. A random selection of frames may assign a frame ft to Pr + Re
the training set and a temporally closest frame like ft+1 The recall and precision are given by Re = T P/(T P + FN),
or ft−1 to the test set. Consequentially, there can be many and Pr = T P/(T P + FP), where T P, FN, and FP refer to true
such samples in the training and test sets resulting in a positive, false negative, and false positive respectively.
mere exclusiveness. The random selection is acceptable for I. Impact of complimentary feature flows: To verify the
object recognition or classification task; however, for video intuition of the multi-view receptive field, we conduct a sanity
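For reference, a minimal pixel-wise implementation of (3) is sketched below (binary masks assumed, with 1 marking FG).

```python
# Minimal sketch of Eq. (3): pixel-wise precision, recall, and f-measure
# between a predicted binary FG mask and its ground truth (1 = FG, 0 = BG).
import numpy as np

def f_measure(pred, gt, eps=1e-8):
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    pr = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    return 2 * pr * re / (pr + re + eps)

pred = np.random.randint(0, 2, (240, 320))
gt = np.random.randint(0, 2, (240, 320))
print(round(f_measure(pred, gt), 4))
```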
Fig. 6: Visualizing the impact of complementary feature flows: (a) an input frame from Traffic, and (b)-(e) the salient maps of the PFF (high fuzziness), the combined feature flows of PFF + CFF2 (moderate fuzziness), PFF + CFF1 (moderate fuzziness), and PFF + CFF1 + CFF2 (low fuzziness), respectively. The white boxes highlight a sample region of fuzziness variation between FG and BG.
Fig. 7: Impact of transfer learning and the background model using MvRF-CNN. (a) Performance vs. input configurations: SF - single frame, SF + TL - single frame with transfer learning, and BG + TL - input data using the BG model as in Fig. 5-b with transfer learning. (b) MvRF-CNN salient maps and their corresponding histograms for an input frame when trained from scratch and when fine-tuned with intra-class transfer learning.
TABLE IV: Performance of different combinations of feature flows on the Office and Traffic datasets (in %).
| Combination of feature flows | Office | Traffic |
| A: PFF                       | 87.99  | 82.27   |
| B: PFF + CFF2                | 89.02  | 75.01   |
| C: PFF + CFF1                | 90.06  | 82.23   |
| D: PFF + CFF1 + CFF2         | 91.27  | 84.88   |
I. Impact of complementary feature flows: To verify the intuition of the multi-view receptive field, we conduct a sanity check with different combinations of complementary feature flows as described in Table IV. The test is carried out on the Office sequence from the baseline category and the Traffic sequence from the camera-jitter category, where the input is a single-frame RGB. The investigation reveals that the fusion of all the complementary feature flows provides a valuable improvement in the FG segmentation. For instance, the average performance
N
-5
S
CS
TM
pB
BS
O
S
TI
Bo
ee
LS
M
PA
vR
IU
0018-9545 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2019.2937076, IEEE
Transactions on Vehicular Technology
10
Fig. 9: Category-wise performance analysis of the MvRF-CNN: (a) best f-measure vs. dataset category and (b) improvements (gain in %) of the MvRF-CNN compared to others.
Fig. 10: Part I sample visual results of the MvRF-CNN FG extractor (bright and dark pixels represent FG and BG, respectively). (a) Sample qualitative results for the Pedestrians, PETS2006, Highway, Office, Fountain02, Canoe, Boats, and OverPass video sequences (rows 1 to 8), where columns 1 to 5 show input frames, ground truths, Sigmoid confidence maps, and binary FG-BG segmentations generated through G-th and O-th, respectively. (b) Sample qualitative results for the BusStation, Bungalows, PeopleInShade, Cubicle, CopyMachine, Park, Corridor, and Library video sequences with the same column layout.
configuration with the BG model boosts the segmentation
when the model is trained from scratch, while the intensity distribution of the salient map falls around two distinct peaks, generally with the intensity values of 0 (dark as BG) and
Snowfall (SF), Turnpike_0_5fps (TP), Blizzard (BL), and Skating (SK) sequences are considered, there is more than 6% improvement as the number of frames in the background estimation is increased by ≈ 2 times. Thus, it is important to have a well-generalized background model if FG extraction accuracy is the primary goal.

IV. Impact of the mid-fusion: Many researchers have already proven the effectiveness of early and late fusions in deep learning [64]–[69]. However, we have not come across a work that provides theoretical or experimental support for mid-fusion. In our case, the intuition for using them is that the features extracted through multi-view receptive fields may contain complementary spatiotemporal agents at every stage in the network. So, a fusion operation of features at early, mid, and late stages would lead the model to learn well-generalized features of the targeted domain. To investigate this, an additional experiment is conducted without the mid-fusion; however, all the training and input configurations are maintained to be the same as in the experiments with mid-fusion, whose results are tabulated in Table V. This experiment is conducted on seven traffic video surveillance sequences, namely Highway (HW), Blizzard (BZ), FluidHighway (FH), Turnpike_0_5fps (TP), WinterStreet (WS), TramStation (TN), and IntermittentPan (IP). The experimental results show (refer to Fig. 11) that the mid-level fusion has the attribute of improving the FoM of the foreground extraction system by approximately 3% when compared to the model without mid-level fusion.

V. Comparison of results to the literature: Taking the above outcomes in Parts I and IV as the foundation of the MvRF-CNN, extensive experiments are carried out on all the datasets listed in Table III with the input configuration described in Fig. 5-b. Thus, this subsection provides an analysis of the performance of the MvRF-CNN in comparison to the prior-art and state-of-the-art techniques. It includes probabilistic approaches and NN-based learning systems from recent years: PAWCS (2015) [48], IUTIS-5 (2017) [52], MBS (2017) [53], DeepBS (2018) [20], LSTM (2019) [4], and a few others. Figure 8 summarizes the results across all the video sequences via box-plots, where Ours and BoO are based on the best results of our method and the best results of the other methods listed in Table V, respectively. Figure 9 compares the category-wise achievements of the MvRF-CNN with the other methods. When the average performance across categories is considered, the proposed model has a more consistent result than all other methods. Although all other approaches show competitive results in the baseline (BL) category, they fail to exhibit the same for the rest. Most of the other techniques record poor outcomes in the challenging conditions, like dynamic background (DB) and night videos (NV). In overall, the

Fig. 12: Part II sample visual results of the MvRF-CNN FG extractor (bright and dark pixels represent FG and BG, respectively). (a) Sample qualitative results for the Fall, Traffic, Badminton, Boulevard, Parking, StreetLight, Sofa, and backDoor video sequences (rows 1 to 8), where columns 1 to 5 show input frames, ground truths, Sigmoid confidence maps, and binary FG-BG segmentations generated through G-th and O-th, respectively. (b) Sample qualitative results for the Blizzard, Snowfall, Skating, Turnpike 5 fps, FluidHighway, WinterStreet, TramStation, and IntermittentPan video sequences with the same column layout.
A few sample visual results are shown in Figs. 10 and 12 in support of the quantitative results listed in Table V. These visual comparisons against the corresponding ground truths show that the proposed MvRF-CNN segments the moving FG objects in videos very tightly.

Notes to Table V: + - Original results, trained on the first 70% of hard GTs that are temporally exclusive to the test sequence. ⋆ - Trained on GMM-annotated GTs (noisy and lacking delineation), then fine-tuned on the next 10% of hard GTs that are temporally exclusive to the test sequence. • - Trained on automatically (poorly) annotated GTs, then fine-tuned on the next 30% of hard GTs that are temporally exclusive to the test sequence.
remains an empirical process. Similarly, the current work also sets the parameters empirically, as discussed in Section III. However, for real-world applications, like traffic monitoring and surveillance, these parameters often become a bottleneck when deploying on resource-constrained hardware. Thus, this work identifies the following as potential methodologies for optimizing the proposed MvRF-CNN: i. network training acceleration strategies [57], [70], [71], ii. bit-quantization [72]–[74], iii. model compression via knowledge distillation [75], [76], iv. tensor factorization [77], [78], and v. network pruning [79]–[83]. Hence, the future work will also consider the TensorRT API for achieving a high-performance inference model so that it can be deployed on embedded devices such as the Jetson TX2 (https://developer.nvidia.com/embedded/buy/jetson-tx2) and the autonomous car development platform NVIDIA Drive PX (https://www.nvidia.com/en-us/self-driving-cars/drive-platform/).

VII. Conclusion

This work introduces a scene-specific FG extraction model inspired by recent innovations in deep learning. The proposed model utilizes a heterogeneous set of convolutions to capture scale-invariant FG object representations. We believe that the proposed MvRF-CNN is a unique addition to the NN-based scene-specific FG extraction systems. Extensive experiments are conducted to analyse the proposed model's performance under the following conditions: i. training with random-state model initialization, ii. model fine-tuning with transferred parameters, iii. an input configuration with a single frame in RGB color space, iv. an input configuration that takes two consecutive gray-scale frames and a temporally median-filtered BG model, v. training with poorly annotated GTs, and vi. testing on an unknown dataset (non-scene-specific FG extraction). The qualitative and quantitative performance analyses of the proposed MvRF-CNN on 37 challenging video sequences collected from traffic and surveillance videos demonstrate that the model performs robustly and consistently better than, or very competitively with, the prior- and state-of-the-art methods. It records a real-time speed of ≈ 42 frames per second (FPS), or ≈ 22 ms (mean average) prediction time, on a GeForce GTX 1080 Ti GPU. However, the limitation of the network comes with a high number of trainable parameters. The optimization of trainable parameters and architecture is left for future work. From an application point of view, the MvRF-CNN can be exploited for various applications, like path segmentation for autonomous vehicles and traffic analysis for video surveillance. Finally, it is understandable that modelling a perfect FG extraction algorithm is still an open and intriguing task.

References

[1] T. Akilan, Q. J. Wu, and Y. Yang, "Fusion-based foreground enhancement for background subtraction using multivariate multi-model gaussian distribution," Information Sciences, vol. 430, pp. 414–431, 2018.
[2] K. Wang, Y. Liu, C. Gou, and F.-Y. Wang, "A multi-view learning approach to foreground detection for traffic surveillance applications," IEEE Transactions on Vehicular Technology, vol. 65, no. 6, pp. 4144–4158, 2016.
[3] M. Vargas, J. M. Milla, S. L. Toral, and F. Barrero, "An enhanced background estimation algorithm for vehicle detection in urban traffic scenes," IEEE Transactions on Vehicular Technology, vol. 59, no. 8, pp. 3694–3709, 2010.
[4] T. Akilan, Q. J. Wu, A. Safaei, J. Huo, and Y. Yang, "A 3D CNN-LSTM-based image-to-image foreground segmentation," IEEE Transactions on Intelligent Transportation Systems, pp. 1–13, March 2019.
[5] H.-S. Song, S.-N. Lu, X. Ma, Y. Yang, X.-Q. Liu, and P. Zhang, "Vehicle behavior analysis using target motion trajectories," IEEE Transactions on Vehicular Technology, vol. 63, no. 8, pp. 3580–3591, 2014.
[6] L. Tian, H. Wang, Y. Zhou, and C. Peng, "Video big data in smart city: Background construction and optimization for surveillance video processing," Future Generation Computer Systems, vol. 86, pp. 1371–1382, 2018.
[7] C. Tang and A. Hussain, "Robust vehicle surveillance in night traffic videos using an azimuthally blur technique," IEEE Transactions on Vehicular Technology, vol. 64, no. 10, pp. 4432–4440, 2014.
[8] T. Bouwmans, "Traditional and recent approaches in background modeling for foreground detection: An overview," Computer Science Review, vol. 11, pp. 31–66, 2014.
[9] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2018.
[10] Z. Zhang, S. Fidler, and R. Urtasun, "Instance-level segmentation for autonomous driving with deep densely connected MRFs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 669–677, 2016.
[11] K.-H. Jo et al., "Cumulative dual foreground differences for illegally parked vehicles detection," IEEE Transactions on Industrial Informatics, vol. 13, no. 5, pp. 2464–2473, 2017.
[12] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid, "Unsupervised object discovery and tracking in video collections," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3173–3181, 2015.
[13] Y. Zhang, C. Zhao, and Q. Zhang, "Counting vehicles in urban traffic scenes using foreground time-spatial images," IET Intelligent Transport Systems, vol. 11, no. 2, pp. 61–67, 2017.
[14] A. I. Guha and S. Tellex, "Towards meaningful human-robot collaboration on object placement," in RSS Workshop on Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics, 2016.
[15] A. Thangarajah, Q. J. Wu, and J. Huo, "A unified threshold updating strategy for multivariate gaussian mixture based moving object detection," in 2016 International Conference on High Performance Computing & Simulation (HPCS), pp. 570–574, IEEE, 2016.
[16] F. Porikli and O. Tuzel, "Bayesian background modeling for foreground detection," in Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, pp. 55–58, ACM, 2005.
[17] M. Imani, S. F. Ghoreishi, D. Allaire, and U. Braga-Neto, "MFBO-SSM: Multi-fidelity bayesian optimization for fast inference in state-space models," AAAI, 2019.
[18] M. Imani, S. F. Ghoreishi, and U. M. Braga-Neto, "Bayesian control of large MDPs with unknown dynamics in data-poor environments," in Advances in Neural Information Processing Systems, pp. 8146–8156, 2018.
[19] Z. Zhao, X. Zhang, and Y. Fang, "Stacked multilayer self-organizing map for background modeling," IEEE Transactions on Image Processing, vol. 24, no. 9, pp. 2841–2850, 2015.
[20] M. Babaee, D. T. Dinh, and G. Rigoll, "A deep convolutional neural network for video sequence background subtraction," Pattern Recognition, vol. 76, pp. 635–649, 2018.
[21] V. Goel, J. Weng, and P. Poupart, "Unsupervised video object segmentation for deep reinforcement learning," in Advances in Neural Information Processing Systems, pp. 5683–5694, 2018.
[22] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364, IEEE, 2017.
[23] J. Han, L. Yang, D. Zhang, X. Chang, and X. Liang, "Reinforcement cutting-agent learning for video object segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9080–9089, 2018.
[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[26] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in European Conference on Computer Vision, pp. 534–549, Springer, 2016.
[27] G. Trigeorgis, P. Snape, I. Kokkinos, and S. Zafeiriou, "Face normals "in-the-wild" using fully convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 340–349, July 2017.
[28] D. Eigen, D. Krishnan, and R. Fergus, "Restoring an image taken through a window covered with dirt or rain," in Proceedings of the IEEE International Conference on Computer Vision, pp. 633–640, 2013.
[29] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[31] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[32] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[33] M. Sultana, A. Mahmood, S. Javed, and S. K. Jung, "Unsupervised deep context prediction for background estimation and foreground segmentation," Machine Vision and Applications, vol. 30, no. 3, pp. 375–395, 2019.
[34] Z. Hu, T. Turki, N. Phan, and J. T. Wang, "A 3D atrous convolutional long short-term memory network for background subtraction," IEEE Access, vol. 6, pp. 43450–43459, 2018.
[35] M. Braham and M. V. Droogenbroeck, "Deep background subtraction with scene-specific convolutional neural networks," 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4, 2016.
[36] G. Thapa, K. Sharma, and M. Ghose, "Moving object detection and segmentation using frame differencing and summing technique," International Journal of Computer Applications, vol. 975, p. 8887, 2014.
[37] X. Zang, G. Li, J. Yang, and W. Wang, "Adaptive difference modelling for background subtraction," in 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4, Dec 2017.
[38] Y. Tian, Y. Wang, Z. Hu, and T. Huang, "Selective eigenbackground for background modeling and subtraction in crowded scenes," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 11, pp. 1849–1864, 2013.
[39] H. Yong, D. Meng, W. Zuo, and L. Zhang, "Robust online matrix factorization for dynamic background subtraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, pp. 1726–1740, 2017.
[40] N. Wang, T. Yao, J. Wang, and D.-Y. Yeung, "A probabilistic approach to robust matrix factorization," in European Conference on Computer Vision, pp. 126–139, Springer, 2012.
[41] J. Xu, V. K. Ithapu, L. Mukherjee, J. M. Rehg, and V. Singh, "Gosus: Grassmannian online subspace updates with structured-sparsity," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3376–3383, 2013.
[42] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, "CDnet 2014: An expanded change detection benchmark dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 387–394, 2014.
[43] T. Minematsu, A. Shimada, and R.-i. Taniguchi, "Analytics of deep neural network in change detection," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2017.
[44] Y. Zhang, X. Li, Z. Zhang, F. Wu, and L. Zhao, "Deep learning driven blockwise moving object detection with binary scene modeling," Neurocomputing, vol. 168, pp. 454–463, 2015.
[45] G. Gemignani and A. Rozza, "A novel background subtraction approach based on multi layered self-organizing maps," in 2015 IEEE International Conference on Image Processing (ICIP), pp. 462–466, IEEE, 2015.
[46] S. Jiang and X. Lu, "Wesambe: A weight-sample-based method for background subtraction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2105–2115, 2017.
[47] S. Varadarajan, P. Miller, and H. Zhou, "Spatial mixture of gaussians for dynamic background modelling," in 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 63–68, IEEE, 2013.
[48] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, "Subsense: A universal change detection method with local adaptive sensitivity," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 359–373, 2015.
[49] L. Guo, D. Xu, and Z. Qiang, "Background subtraction using local SVD binary pattern," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 86–94, 2016.
[50] G. Allebosch, D. Van Hamme, F. Deboeverie, P. Veelaert, and W. Philips, "C-EFIC: Color and edge based foreground background segmentation with interior classification," in International Joint Conference on Computer Vision, Imaging and Computer Graphics, pp. 433–454, Springer, 2016.
[51] R. Wang, F. Bunyak, G. Seetharaman, and K. Palaniappan, "Static and moving object detection using flux tensor with split gaussian models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 414–418, 2014.
[52] S. Bianco, G. Ciocca, and R. Schettini, "How far can you get by combining change detection algorithms?," in International Conference on Image Analysis and Processing, pp. 96–107, Springer, 2017.
[53] H. Sajid and S.-C. S. Cheung, "Universal multimode background subtraction," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3249–3260, 2017.
[54] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
[55] J. Winterer, N. Maier, C. Wozny, P. Beed, J. Breustedt, R. Evangelista, Y. Peng, T. D'Albis, R. Kempter, and D. Schmitz, "Excitatory microcircuits within superficial layers of the medial entorhinal cortex," Cell Reports, vol. 19, no. 6, pp. 1110–1116, 2017.
[56] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[57] Y. Yang, J. Q. Wu, X. Feng, and A. Thangarajah, "Recomputation of dense layers for the performance improvement of DCNN," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[58] P. Li, Z. Chen, L. T. Yang, Q. Zhang, and M. J. Deen, "Deep convolutional computation model for feature learning on big data in internet of things," IEEE Transactions on Industrial Informatics, vol. 14, no. 2, pp. 790–798, 2017.
[59] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," stat, vol. 1050, p. 23, 2016.
[60] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
[61] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[62] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[63] L. Yang, J. Li, Y. Luo, Y. Zhao, H. Cheng, and J. Li, "Deep background modeling using fully convolutional network," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 254–262, 2018.
[64] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 399–402, ACM, 2005.
[65] M. Silvestri, A. Elia, D. Bertelli, E. Salvatore, C. Durante, M. L. Vigni, A. Marchetti, and M. Cocchi, "A mid level data fusion strategy for the varietal classification of lambrusco PDO wines," Chemometrics and Intelligent Laboratory Systems, vol. 137, pp. 181–189, 2014.
[66] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 681–687, IEEE, 2015.
[67] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[68] H. T. Mustafa, J. Yang, and M. Zareapoor, "Multi-scale convolutional neural network for multi-focus image fusion," Image and Vision Computing, 2019.
[69] T. Akilan, Q. J. Wu, A. Safaei, and W. Jiang, "A late fusion approach for harnessing multi-CNN model high-level features," in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017.