

Received March 8, 2022, accepted March 21, 2022, date of publication March 28, 2022, date of current version April 4, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3162827

Deep Learning for Object Detection and Segmentation in Videos: Toward an Integration With Domain Knowledge

ATHINA ILIOUDI, AZITA DABIRI, BEN J. WOLF, AND BART DE SCHUTTER, (Fellow, IEEE)
Delft Center for Systems and Control, Delft University of Technology, 2628 Delft, The Netherlands
Corresponding author: Athina Ilioudi ([email protected])

This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 871295 (SeaClear), and in part by the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme under Grant 101018826-CLariNet.

ABSTRACT Deep learning has enabled the rapid expansion of computer vision tasks from image frames to video segments. This paper reviews the latest research on computer vision tasks in general, and on localizing objects and identifying their associated pixels in video frames in particular. After a systematic analysis of the existing methods, the challenges related to computer vision tasks are presented. To address these challenges, a hybrid framework is proposed in which deep learning methods are coupled with domain knowledge. An additional feature of this survey is a review of the currently existing approaches that integrate domain knowledge with deep learning techniques. Finally, some conclusions on the implementation of hybrid architectures for computer vision tasks are discussed.

INDEX TERMS Computer vision, object detection, deep learning, theory-guided data science.

I. INTRODUCTION
Just as motion perception is essential to our visual system, allowing us to interpret the world, to detect the presence of creatures [25], and to avoid danger [34], video computer vision helps artificial intelligence agents to decipher their surrounding environment and to synthesize actionable information. Inspired by the human visual system and enabled by the latest advancements in deep learning (DL), novel video processing methods are emerging that achieve remarkable results and that seek to revolutionize how computer vision tasks are implemented. Yet, similarly to human perception, computer vision is quite prone to illusions.

The fast pace of DL breakthroughs, in combination with the improvement in hardware capabilities in terms of computation power, memory capacity, and sensor resolution, has accelerated the spread of data-driven methods over conventional computer vision techniques. Contrary to classical techniques, DL reaches human-level accuracy, requires less expert analysis, and provides superior flexibility, including allowing re-training whenever new data are available [115].

The objective of this work is to investigate the advancements of deep learning techniques for computer vision tasks in videos, as well as the research perspectives to address their current weaknesses. More specifically, the contributions of our study are threefold:
• We present an analysis of the existing DL techniques for detection and segmentation of objects in videos.
• We present an overview of the challenges with the existing data-driven approaches.
• We outline new directions for research in video processing.

The paper is organized in seven sections. Section II presents an overview of necessary preliminary knowledge. Section III gives a comprehensive overview of DL-based video computer vision methods. In Section IV the current challenges are presented and analyzed. To address these challenges, Section V presents an overview of approaches that couple DL methods with domain knowledge. Section VI highlights the most prominent topics that are expected to draw major interest from the research community in the following years, and Section VII gives concluding remarks.

A list of abbreviations mentioned in this paper and their definitions is presented in Table 1.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo.


TABLE 1. List of abbreviations.

II. PRELIMINARIES
In this section, we introduce the most typical tasks of computer vision and we present a brief, comparative analysis between deep learning and conventional techniques in the domain of computer vision, as well as an overview of basic deep learning methods such as convolutional neural networks, restricted Boltzmann machines, and auto-encoders, which constitute the core of DL architectures in computer vision.

A. COMPUTER VISION TASKS
Computer vision tasks can be categorised into 4 major fields: (1) semantic segmentation, (2) classification & localization, (3) object detection, and (4) instance segmentation. The task of semantic segmentation refers to the process of assigning a class label to every pixel in an image [72]. One of the shortcomings of this task is the fact that semantic segmentation does not differentiate between instances of the same class. On the other hand, the classification & localization task aims to predict the class of a specific object in an image and to draw a bounding box around the region of the classified object in that image [126]. This task refers to a single object. However, most images in real-world settings contain multiple objects of different shapes and sizes. Therefore, object detection [37] refers to a more general approach where a varying number of predicted objects can be extracted for every input image, since it is unknown how many objects are expected to be detected in each image. Object detection systems strive to find every instance of an object and estimate the spatial extent of each one. Nevertheless, the detected objects are located only with bounding boxes.

The task of instance segmentation refers to the problem of detecting all the instances of a category in an image and marking the pixels that belong to each one of them [39]. Extending this task to the video domain results in simultaneous detection, segmentation, and tracking of the instances [121]. The instance segmentation task combines object detection, where individual objects are classified and localized with a bounding box, and semantic segmentation, where each pixel is classified into the given classes.

The task of object classification & localization is included in object detection. At the same time, in semantic segmentation, each pixel of an image is associated with a class label like road, tree, pedestrian, etc. In other words, all objects of an image that belong to the same class are treated as a single entity. On the other hand, with instance segmentation each object of the same class is treated as a distinct individual instance. Hence, instance segmentation can be considered as a more elaborate implementation of semantic segmentation. Since all the computer vision tasks are similar, in this work mainly object detection and instance segmentation techniques will be examined, as they are the most dominant techniques required in extensive applications such as autonomous driving [69], video surveillance [100], face recognition [108], and robot navigation [120].

B. DEEP LEARNING VS. TRADITIONAL COMPUTER VISION TECHNIQUES
Traditional computer vision methods are based on hard-coded, rigid-rule algorithms to apply feature extraction on images [80]. Several algorithms have been developed to extract properties such as corners, edges, and regions of interest from images [2], [12], [40], [74], [88]. These algorithms showcase advantages such as transparency, in terms of allowing to trace back all steps of how a decision was made, and performance that is independent of the training dataset. At the same time, however, they have been criticised as inflexible, difficult to improve or adapt, and highly time-consuming to develop manually for each additional object to be detected [83]. Moreover, the performance of these methods significantly deteriorates when the number of classes to be detected increases. By contrast, DL utilizes massive data sets and numerous training cycles to learn how an object looks, following a process during which relevant features of an object of interest are extracted automatically. The DL architecture can then be applied to previously unseen images and make accurate predictions. DL-based methods perform remarkably better than traditional methods, albeit with trade-offs regarding computational requirements and training time [83]. As a result, they have vastly replaced traditional computer vision techniques, thanks to their strong ability to be easily adjusted, to extract complex features in much more detail, and to be much more efficient in terms of accuracy and versatility [83]. Tremendous advancements in research have taken place in this domain, resulting in the development of numerous methods. The fundamental DL methods implemented in image computer vision applications are discussed in Section II-C.


C. IMAGE-BASED DEEP LEARNING METHODS
1) CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) have been widely used in image processing applications over the past decades [62], [66], [133]. Their structure consists of a number of convolutional and pooling layers, stacked one after another [5]. The convolutional layer can be visualized as a square matrix W of weights, called a kernel [87]. The kernel slides over the image looking for patterns; when it distinguishes a part of the image that is similar to its pattern, it returns a large positive value, otherwise it returns a small value. The input image is represented as a pixel matrix of size length × width × number of color channels (i.e. an RGB image has 3 color channels).

The convolutional layer is utilized for feature extraction and the pooling layer to downsample the resolution of the convolutional layer output. In this way, a dimension reduction is accomplished, which reduces the number of necessary parameters in the next layer, resulting in a less complex architecture. During the training process, the training samples are fed through the CNN and the error with respect to the desired output is calculated. The error and its gradient are then backpropagated through the network layers and the weights are updated.
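As an illustration of this structure and training loop, the following minimal PyTorch sketch (our example, not code from the surveyed works) stacks two convolution-pooling blocks and performs a single forward-backward-update step; the layer sizes, the 32 × 32 input resolution, and the hyperparameters are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Minimal CNN: convolution extracts features, pooling downsamples,
# a final linear layer maps the features to class scores.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # kernel W slides over the RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsampling: dimension reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step: forward pass, error w.r.t. the desired output,
# backpropagation of the gradient, weight update.
images = torch.randn(4, 3, 32, 32)      # dummy batch: 4 RGB images, 32x32
labels = torch.randint(0, 10, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```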
CNN-based image object detectors can be separated into two main categories [105], [127]:
• Two-stage approach: In the two-stage method, the first stage extracts region proposals and the second stage classifies those region proposals and determines the bounding boxes of the classified objects. In the region proposal part, sliding window techniques such as Deformable Part Models [20] are adopted. An additional region proposal technique, employed in region-based convolutional neural networks (R-CNNs) [27], is selective search [111]. R-CNNs extract around 2000 region proposals from each input image, which is a significantly reduced number of regions compared to other sliding window methods. At the second stage of this architecture, a CNN is used for object detection over the region proposals. The size of the proposed regions is arbitrary, while the CNN requires a fixed-size input. Hence, a major drawback of R-CNNs is due to the fact that images need to be cropped or resized to satisfy the requirement of a fixed-size input. Spatial pyramid pooling [31], [42], [64] is a method used in order to achieve a fixed-size output irrespective of the input image size. Hence, spatial pyramid pooling networks can be trained and tested on varying-size images, which reduces overfitting of the model. Both R-CNNs and spatial pyramid pooling networks are particularly slow during training. Fast R-CNN [27] tries to solve this drawback by passing the original image through the CNN instead of using the region proposals. As a result, fast R-CNN is faster than R-CNN because the convolutional operation is implemented only once on the original image instead of 2000 times on the region proposals. Fast R-CNNs can train detection networks whose architecture involves multiple layers like VGG-16 [99], as they are 9 times faster compared to R-CNNs and 3 times faster than spatial pyramid pooling networks [105]. The drawback of the high time cost has been further addressed by faster R-CNNs [92]. In faster R-CNNs the time-consuming selective-search algorithm is replaced with a fully convolutional network that learns the region proposals of an image with arbitrary size. A major additional development over the previous R-CNNs is achieved by Mask R-CNNs [41]. Mask R-CNNs extend the previous architectures by labeling the pixels corresponding to each object instance. The Mask R-CNN inherits the region proposal network from faster R-CNNs and employs an additional branch that outputs a binary mask classifying whether or not a given pixel is part of an object. Two-stage approaches yield a high accuracy since each stage performs one specific task. However, in terms of real-time applications, two-stage approaches show weaknesses in computational time.
• One-stage approach: One-stage approaches skip the first stage of region proposals and simply run detection directly on the input image. This simpler architecture allows them to have faster inference. Some networks can achieve a processing speed of up to 150 frames per second (fps). There is a trade-off, however, in terms of accuracy. Notable one-stage methods are the "you only look once" (YOLO) network [91], which extracts class and bounding box predictions directly from an input image using a CNN, and the single-shot detector (SSD) [71], which takes an input image and passes it through multiple convolutional layers with different sizes of filters.

2) RESTRICTED BOLTZMANN MACHINES
The Restricted Boltzmann Machine (RBM) is a two-layer undirected graphical model [6] that was introduced in 1986 [46]. It consists of a set of visible nodes and a set of hidden nodes. RBMs are in essence a variant of Boltzmann machines, but in RBMs there are no intralayer connections between the nodes in the visible layer and the hidden layer (i.e. no visible node is connected to any other visible node and no hidden node is connected to any other hidden node, respectively). In this way, RBMs are easier to implement and more efficient to train compared to Boltzmann Machines. Their visible nodes receive the input, combine it with weights and a bias, and pass it to the hidden nodes. The value generated at the hidden nodes is combined accordingly with weights and a bias and the result is passed back to the visible nodes to reconstruct the input.

If we consider the visible vector V, the hidden vector H, and the weight parameters α_i, b_j, w_ij, an RBM configuration can be assigned an energy E given by [24]:

E(V, H) = -\sum_i \alpha_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j. (1)

Given this energy function, a probability P is assigned to every pair (V, H):

P(V, H) = \frac{1}{Z} e^{-E(V,H)}, (2)

where Z is equal to the sum over all pairs of visible and hidden vectors:

Z = \sum_{(V,H)} e^{-E(V,H)}. (3)

For a given visible vector V, the probability assigned to the hidden node h_j is

P(h_j = 1 \mid V) = \sigma\big(b_j + \sum_i v_i w_{ij}\big), (4)

where \sigma(\cdot) is the logistic sigmoid function [38]. For a hidden vector H, the probability assigned to a visible node v_i is, respectively:

P(v_i = 1 \mid H) = \sigma\big(\alpha_i + \sum_j h_j w_{ij}\big). (5)

The weight parameters are optimized with the aim to maximize the likelihood of the visible and hidden vectors (V, H).
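A minimal NumPy sketch of Eqs. (1), (4), and (5) follows. It is our illustration with assumed toy dimensions and random weights: it evaluates the energy of a configuration and runs one visible-hidden-visible Gibbs sampling step, the building block used inside likelihood-maximizing training schemes such as contrastive divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
a = rng.normal(size=n_visible)               # visible biases alpha_i
b = rng.normal(size=n_hidden)                # hidden biases b_j
W = rng.normal(size=(n_visible, n_hidden))   # weights w_ij

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # Eq. (1): E(V,H) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
    return -a @ v - b @ h - v @ W @ h

def sample_hidden(v):
    # Eq. (4): P(h_j = 1 | V) = sigma(b_j + sum_i v_i w_ij)
    p = sigmoid(b + v @ W)
    return (rng.random(n_hidden) < p).astype(float)

def sample_visible(h):
    # Eq. (5): P(v_i = 1 | H) = sigma(a_i + sum_j h_j w_ij)
    p = sigmoid(a + W @ h)
    return (rng.random(n_visible) < p).astype(float)

# One Gibbs step: visible -> hidden -> reconstructed visible.
v0 = rng.integers(0, 2, n_visible).astype(float)
h0 = sample_hidden(v0)
v1 = sample_visible(h0)
print(energy(v0, h0), energy(v1, h0))
```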
The intuition behind RBMs is based on the association of a scalar energy to each combination of the variables of interest. Learning is achieved, therefore, by calculating the combination that has the lowest energy.

RBMs are useful for dimensionality reduction, classification, regression, and feature learning. However, due to the fact that RBMs consist of only two layers, the complexity of the data representation that they can achieve is limited [24]. For this reason, a number of extended architectures has been developed. An example of such an architecture is the Deep Belief Network [44], which consists of multiple stacked RBMs. Deep Belief Networks are used for feature extraction in many computer vision applications. Apart from Deep Belief Networks, another RBM-based architecture is the Deep Boltzmann Machine [95], [96]. Deep Boltzmann Machines are similar to Deep Belief Networks, although the former have only undirected connections between their layers, which makes them more robust to noisy observations, while the latter have bidirectional connections in the last layer [104].

3) AUTO-ENCODERS
Auto-encoders [8], [45] refer to a specific type of neural networks that aim to compress the input image data into a lower-dimensional (latent) representation and then reconstruct the original image from this representation. Their architecture consists of two main parts, namely, the encoder and the decoder. The encoder maps an input vector of images X into a compressed, lower-dimensional vector Z, while the decoder part maps the latent variable Z to a reconstruction of the input image. The encoder and decoder mappings \phi : X \to Z and \psi : Z \to X are given by:

(\phi, \psi) = \arg\min_{(\phi,\psi)} \| X - (\psi \circ \phi)(X) \|^2, (6)

where the operator \circ refers to function composition: (\psi \circ \phi)(X) = \psi(\phi(X)). The autoencoder is trained with the objective to select the optimal encoder and decoder functions so that the minimum amount of information is required to encode the image in order for it to be regenerated on the decoder side.
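The objective in Eq. (6) can be made concrete with a short PyTorch sketch; the fully connected encoder and decoder, the 784-dimensional flattened input, and the latent size of 32 are illustrative assumptions rather than an architecture from the literature.

```python
import torch
import torch.nn as nn

# Minimal fully connected autoencoder (sizes are illustrative):
# phi maps X to the latent Z, psi maps Z back to a reconstruction.
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())     # phi: X -> Z
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())  # psi: Z -> X

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(16, 784)            # dummy batch of flattened images
x_hat = decoder(encoder(x))        # (psi o phi)(X)
loss = ((x - x_hat) ** 2).mean()   # Eq. (6): ||X - (psi o phi)(X)||^2
loss.backward()
optimizer.step()
```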
III. DEEP LEARNING METHODS FOR DETECTION AND SEGMENTATION OF OBJECTS IN VIDEOS
Due to the similarity between video detection and image detection, some methods of image detection are often used for video detection. The methods described above can be extended to the video domain by running detection on each image in a sequence of frames [7]. In this way, however, the temporal correlation between frames is not taken into account. In addition, running a detection algorithm on each frame results in computational inefficiency, since there might be feature extraction redundancies between sequential frames. Furthermore, in a video sequence there might be poor-quality frames which could lead to low inference accuracy. One obvious reason that this extension is not trivial is the fact that a video sequence introduces an additional dimension: the temporal one. In other words, instead of being considered as a sequence of frames, a video should rather be regarded as a sequence of related frames.

Due to the complexity of video data and the computation cost for training, research has been limited in this field. However, more and more video-related research works have surfaced lately, due to the release of ImageNet VID [93] and other massive video datasets. Depending on the architecture, DL-based techniques for video object detection can be broadly divided into six categories, namely (1) optical flow, (2) tracking, (3) long short-term memory, (4) gated recurrent unit, (5) self-attention mechanism, and (6) generative learning. In the following subsections a critical appraisal of these architecture paradigms is presented.

A. OPTICAL FLOW
One of the most fundamental concepts in video processing is optical flow. Optical flow was originally introduced in [25], referring to human perception and the changing pattern of light that reaches our eyes. In computer vision applications, optical flow refers to the problem of estimating the displacement vector for each pixel in subsequent image frames [48].

A key assumption in optical flow is brightness constancy. This practically means that a pixel at the position (x, y) of an image at time t moves to the position (x + \Delta x, y + \Delta y) at time t + \Delta t and the brightness I(x, y, t) remains constant:

I(x + \Delta x, y + \Delta y, t + \Delta t) \approx I(x, y, t). (7)


The Taylor series expansion of the left-hand side of (7) is

I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \ldots
\Rightarrow I(x + \Delta x, y + \Delta y, t + \Delta t) - I(x, y, t) = I_x \Delta x + I_y \Delta y + I_t \Delta t, (8)

where I_x, I_y, I_t are the partial derivatives of the intensity function I with respect to x, y, and t respectively. Hence, if we substitute (7) into (8) we can derive:

\nabla I \cdot v^T + I_t = 0, (9)

where \nabla I = (I_x, I_y) and v = (\Delta x / \Delta t, \Delta y / \Delta t) are the components of the optical flow, and I_t is the temporal gradient of the intensity function.
function. Visual tracking can be described as the problem of estimating
Optical flow can be applied to estimate the motion of an unknown target trajectory over a sequence of image
detected objects in video segments by assigning an optical frames [78]. Traditional methods employ a variety of
flow vector to the pixels corresponding to the detected object. tracking algorithms, such as mean shift algorithm [14],
Optical flow can be either ‘‘sparse’’ or ‘‘dense’’. Sparse particle filtering [30], and Kalman filtering [54]. With the
optical flow estimates the flow vectors of some specific advancements in data science in recent years, novel DL-based
features, such as corners or edges of an object within an image visual trackers have been developed.
frame. Dense optical flow, on the other hand, includes the Object tracking outperforms optical flow in accu-
flow vectors of all the pixels in an image frame. The latter racy [129]. This can be explained by the fact that tracking
method achieves higher accuracy than the former, although uses shared networks to achieve feature extraction for
at the cost of increased computational requirements. detection and tracking. Hence, the requirements in terms
Recently, modern CNN architectures have been success- of computational power are limited and at the same
fully used for optical flow estimation applications [18]. CNNs time, the fusion between the two tasks is performed in a
can be trained to run on pairs of images and to predict the more straightforward way, which achieves higher accuracy
optical flow field. These flow networks are employed in compared to optical flow based models.
computer vision tasks for videos according to two different CNN is the first architecture that was adopted for DL-based
approaches. In the first approach, one neural network is visual tracking. In [19], a region-based fully convolutional
responsible for the task of object detection and it is applied neural network [15] is used for jointly performing detection
on sparse key frames. The extracted feature maps from these and tracking in an integrated framework. The model is fed
key frames are then propagated to the next frames with a with a set of two consecutive image frames, from which the
flow network. This technique is called Deep Feature Flow convolutional feature maps are computed. Object detection is
(DFF) [132] and it achieves great computational efficiency run on each frame and a regressor is employed to compute
due to the fact that it implements the object detection task the box transformation from one frame to the other. CNN-
only on key frames. based object tracking models showcase some weaknesses in
The second approach involving flow networks is known performance though, due to the scarcity of labeled data in
as flow-guided feature aggregation (FGFA) [131]. In FGFA, terms of including sets of two consecutive frames, which are
a feature extraction network is run on all individual frames to necessary for their training, as well as their speed limitations
create the respective feature maps per frame. The inference with respect to real-time applications [79].
at a reference frame is enhanced with an optical flow A baseline approach presented in [121] extends the Mask
network that predicts the motion between the neighbor R-CNN to include an additional tracking branch with
frames and the adjacent frames. The propagated feature an external memory for tracking object instances across
maps from neighbor frames are aggregated with the feature frames. The proposed architecture extracts the classification,
map from the reference frame in an adaptive weighting the bounding boxes, and the segmentation predictions of
method. FGFA achieves higher inference accuracy but at a Mask R-CNN, and it takes into account the past frame
higher computation time compared to DFF. For this reason, information only for tracking. In this way, the task of instance
an impression network [43] is another proposed architecture segmentation is extended to videos. CrossVIS [122] presents
that combines the two abovementioned techniques, with a novel, cross-frame learning approach that uses the features
the objective to take advantage of both methods. Sparse of an instance in the current frame to segment the same
key frame feature maps are then aggregated with other instance in other frames. Crossover learning is integrated
key frames feature maps and at the same time they are with the instance segmentation loss as an objective to obtain
propagated to other non-key frames. The impression network cross-frame instance segmentation consistency, achieving a


low computational cost. CrossVIS outperforms MaskTrack R-CNN [121] in terms of both accuracy and speed [122].

An additional DL-based method for tracking arbitrary objects involves Siamese Neural Networks (SNNs) [109]. SNNs have been extensively implemented in visual tracking applications in the past years [4]. An SNN is basically a two-stream network that takes as input pairs of the target and search image and outputs a similarity map. In other words, SNNs learn a function f : (z, x) \to f(z, x) which compares an image z with a candidate image x, returning a high score when the two images are similar to each other. The position of an object can thus be tracked by checking all possible locations and selecting the one that corresponds to an image with the highest similarity to the previous frame. SNNs can learn the function f from a training video dataset with labeled object trajectories and they are one of the most promising methods for object tracking due to their performance and efficiency.
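The similarity function f(z, x) can be sketched in a few lines of PyTorch: one shared embedding network is applied to both the target exemplar z and the search image x, and the similarity map is obtained by cross-correlating the two feature maps. The network and image sizes below are illustrative assumptions, not a specific tracker from the literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared embedding applied to both inputs (the "two streams").
embed = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)

z = torch.randn(1, 3, 32, 32)    # target exemplar patch
x = torch.randn(1, 3, 128, 128)  # larger search region

fz, fx = embed(z), embed(x)
# The exemplar embedding acts as a correlation kernel over the search
# features; peaks in the response map mark candidate target locations.
response = F.conv2d(fx, fz)
print(response.shape)            # (1, 1, 97, 97) similarity map
```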
Recurrent neural networks (RNNs) [28] are an alternative architecture employed in visual object tracking applications. RNNs can be considered to operate on a sequence that contains vectors x(t), where each vector can describe e.g. an image frame from a video at time step t. In other words, an RNN is a neural network that is specialized for processing a sequence of values x(1), \ldots, x(n), where n is the length of the sequence, in a similar way as a convolutional network is specialized for processing a tensor representing an image. The same update rule is applied to each part of the output, resulting in the sharing of parameters through a deep computational graph. RNN-based methods can be considered a suitable approach for visual object tracking since they take into account both spatial and temporal features of video frames [124]. The RNN-based methods aim to improve the tracking performance by utilizing temporal information such as past states of the target's position. However, their implementation is limited because their complex architecture involves a significant number of parameters that need to be determined [68].

C. LONG SHORT-TERM MEMORY
Although RNNs are naturally suited to time-series data, like videos, their implementation suffers from various weaknesses. First of all, while they take into consideration information from the previous time stamp, their performance deteriorates when storing information for a longer time period [60]. Sometimes, certain information stored at a long-past time step might be required to accurately predict the current output. RNNs in these cases are incapable of utilizing such "long-term" dependencies. In addition, RNNs do not have the possibility to keep part of the past time-stamp information and to discard the rest. An additional challenge in RNNs is that gradients propagated through the network tend to either vanish or explode because of the repetition of the weight matrix over all recurrent units. At the same time, optical flow techniques make use of temporal information only on two adjacent frames, without using temporal information from other previous frames. Long short-term memory (LSTM) [47] is an improved type of RNN that is capable of utilizing long-term dependencies.

FIGURE 1. LSTM cell structure, adapted from [125].

The architecture of an LSTM cell is depicted in Figure 1. LSTM cells consist of three parts which are known as gates. The first gate determines what part of the information coming from past time steps needs to be "remembered" or can be "forgotten". The second gate inputs information of the current time step to the cell. Finally, the third gate passes updated information from the current time step to the next one. The first gate is called the forget gate, while the second and third ones are called the input and output gates respectively.

In the following equations f(t), i(t), o(t) represent the forget, input, and output gate vectors respectively, \sigma is the sigmoid function, W^{(j)} and b^{(j)} refer to the weights and biases corresponding to the j-th gate's neurons, h(t-1) refers to the output of the previous cell at time stamp t-1, and x(t) represents the input at time t [49].

• Forget gate:
f(t) = \sigma\big(W^{(f)} [h(t-1), x(t)] + b^{(f)}\big) (10)

• Input gate:
i(t) = \sigma\big(W^{(i)} [h(t-1), x(t)] + b^{(i)}\big) (11)

• Output gate:
o(t) = \sigma\big(W^{(o)} [h(t-1), x(t)] + b^{(o)}\big) (12)

Moreover, an additional vector \bar{C} is used that modifies the cell's state C:

\bar{C}(t) = \tanh\big(W^{(c)} [h(t-1), x(t)] + b^{(c)}\big) (13)
C(t) = f(t) \odot C(t-1) + i(t) \odot \bar{C}(t), (14)

where the operator \odot corresponds to elementwise multiplication. The hidden state is equal to:

h(t) = o(t) \odot \tanh C(t). (15)
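Equations (10)-(15) translate directly into code. The following NumPy sketch (ours, with arbitrary toy sizes and random weights) runs one LSTM cell over a short sequence of feature vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Concatenated input [h(t-1), x(t)], as used in Eqs. (10)-(13).
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])      # Eq. (10): forget gate
    i = sigmoid(W["i"] @ z + b["i"])      # Eq. (11): input gate
    o = sigmoid(W["o"] @ z + b["o"])      # Eq. (12): output gate
    c_bar = np.tanh(W["c"] @ z + b["c"])  # Eq. (13): candidate state
    c = f * c_prev + i * c_bar            # Eq. (14): elementwise update
    h = o * np.tanh(c)                    # Eq. (15): hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                        # toy dimensions (assumptions)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fioc"}
b = {k: np.zeros(n_hid) for k in "fioc"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                        # process a short feature sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape)
```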
LSTMs can maintain important information over a long sequence of data. [33] presents an extensive analysis of variants of LSTM as well as a review of the impact of the involved hyperparameters. In [75] an LSTM framework is developed as an extension to an SSD architecture in order to associate detected object instances across consecutive frames.


The proposed method outperforms other RNN architectures [110] and it can be applied online. However, the weakness of this approach is that the SSD architecture involved is pre-trained in advance, and thus the SSD features do not get updated in response to the output of the LSTMs. In [70], an approach is suggested where LSTM is used in combination with interleaving conventional feature extractors with extremely lightweight ones. The main advantage of this approach is that minimal computation is required to produce accurate detections. In other words, an interleaved model framework is proposed, where multiple feature extractors are run sequentially or concurrently. A memory mechanism is then proposed to aggregate these frame-level features. A modified LSTM cell is used in [130] to achieve faster results with low computational requirements. The proposed architecture connects fast single-image object detection frameworks in series with convolutional LSTM layers in order to propagate frame-level information over time. This architecture inputs one single frame of the video at a time and is quite simple. Hence, it achieves reduced computational cost as well as enhanced inference speed.
D. GATED RECURRENT UNIT
Similarly to LSTMs, gated recurrent units (GRUs) [13] are another type of RNN. However, GRUs have fewer parameters than LSTMs, since they only have two gates: the update gate and the reset gate. As seen in Figure 2, in contrast to LSTMs, a GRU cell does not have an output gate, and it combines the input and the forget gate of LSTMs into the update gate. Due to their simplicity, GRUs are significantly faster than LSTMs.

FIGURE 2. GRU structure, adapted from [125].

The update and reset gates in a GRU cell are defined as in equations (16) and (17) respectively. In the following equations z(t), r(t) represent the update and reset gate vectors respectively, and W^{(j)}, b^{(j)} refer to the weights and biases corresponding to the j-th gate's neurons [49].

• Update gate:
z(t) = \sigma\big(W^{(z)} [h(t-1), x(t)] + b^{(z)}\big) (16)

• Reset gate:
r(t) = \sigma\big(W^{(r)} [h(t-1), x(t)] + b^{(r)}\big) (17)

The update gate determines the amount of previous time-step information that passes along to the next state, while the reset gate is responsible for deciding what part of the past information is neglected. After multiplying the input vector and the hidden state with the weights of the reset gate as presented in (17), the element-wise product between the reset gate and the previous time-step hidden state is calculated. Then, a non-linear activation function is applied to the result, leading to the candidate hidden state:

\bar{h}(t) = \tanh\big(W^{(h)} [r(t) \odot h(t-1), x(t)] + b^{(h)}\big). (18)

The hidden state then reads as:

h(t) = (1 - z(t)) \odot h(t-1) + z(t) \odot \bar{h}(t). (19)
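As with the LSTM above, Eqs. (16)-(19) admit a direct NumPy sketch; the toy sizes and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ z_in + b["z"])   # Eq. (16): update gate
    r = sigmoid(W["r"] @ z_in + b["r"])   # Eq. (17): reset gate
    # Eq. (18): candidate state built from the reset-scaled history
    h_bar = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])
    return (1 - z) * h_prev + z * h_bar   # Eq. (19): hidden state

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                        # toy dimensions (assumptions)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}

h = np.zeros(n_hid)
for t in range(5):
    h = gru_step(rng.normal(size=n_in), h, W, b)
print(h.shape)
```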

In [9] an SSD-based architecture is extended to multi-frame data. Convolutional GRUs are employed in order to fuse features across multiple frames and to enhance the accuracy of object detection. From a mathematical perspective, this architecture replaces the dot product operator in the standard gated recurrent unit definition in (16)-(18) with the convolution operator. As reported in [23], this approach improves the existing SSD architecture by 2.7% in terms of the mean average precision on the KITTI dataset [22]. An additional example is provided in [110], where first a pseudo-labeler is trained on individual labeled frames. The pseudo-labeler assigns labels to all video frames and then a recurrent architecture with GRUs is trained, which takes sequences of pseudo-labeled frames as input. The standard cost function used for the training of the RNN is augmented with an additional term to ensure consistency across consecutive frames. In [112] a human activity recognition technique is proposed, where skip connections are introduced among GRU layers to ensure that even in a deep architecture with multiple layers, there is no vanishing-gradient impact on the performance.

Both LSTM and GRU can ensure that important information is maintained along long time-series data. GRU is faster than LSTM in terms of training speed [123]. Their performance is comparable, although on small datasets GRU slightly outperforms LSTM.

E. SELF-ATTENTION MECHANISM
RNNs, LSTMs, and GRUs have been widely adopted in sequence modeling applications. However, due to the fact that they process the data in a sequential manner, they do not allow for parallel computation, which could critically affect long sequences of frames, due to memory constraints limiting the batch size of samples during training.

The self-attention mechanism [58] relates different elements of a sequence to generate a representation of this sequence. Contrary to the architectures mentioned above, it supports parallel processing of sequential data. Originally it was proposed for machine translation [113] and then its application was extended to video data [26].

Three vectors are involved in the self-attention mechanism. These vectors are used for the representation of features (key vector), values (value vector), and the values to be

determined (query vector). Let us assume that we have a sequence of n elements (x_1, x_2, \ldots, x_n), collected in X \in R^{n \times d}, with d being the embedding dimension for the representation of each element [57]. We can then define three learnable weight matrices in order to transform the queries (W^q \in R^{d \times d_q}), keys (W^k \in R^{d \times d_k}), and values (W^v \in R^{d \times d_v}). In this way, the input X is first transformed with the weight matrices and projected onto Q = X W^q, K = X W^k, and V = X W^v. A similarity function is used to calculate the similarity between the query and the key vector. The self-attention layer outputs Z \in R^{n \times d_v}, which is equal to

Z = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_q}}\right) V, (20)

where the softmax function is defined by

\mathrm{softmax}(X)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}, (21)

for i = 1, \ldots, k and X = (x_1, \ldots, x_k) \in R^k. The self-attention determines the similarity between the key and the query vector by computing their dot product. The dot product is then normalized using the softmax so that the sum of all the scores becomes equal to 1. Each element is then given by the weighted sum of all elements in the sequence, where the weights correspond to the attention scores. The most well-known self-attention architecture is the transformer [113].
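Equations (20)-(21) can be reproduced in a few lines of NumPy; the sequence length and projection dimensions below are arbitrary toy choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Eq. (21), applied row-wise with the usual max-shift for stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 16             # n sequence elements, embedding dimension d
d_q = d_k = d_v = 8      # projection dimensions (illustrative choices)

X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d_q))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_v))

Q, K, V = X @ Wq, X @ Wk, X @ Wv   # project the input sequence
scores = Q @ K.T / np.sqrt(d_q)    # dot-product similarity
Z = softmax(scores) @ V            # Eq. (20): attention output
print(Z.shape)                     # (n, d_v)
```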
In [26] a transformer framework is developed to recognize and localize human actions in a video. A person feature is represented as the query (Q) and the features from adjacent video frames correspond to the keys (K) and the values (V). A video instance segmentation architecture built upon transformers is proposed in [116]. Four modules are included in the developed architecture: a backbone CNN to extract features over the video frames, an encoder-decoder transformer that determines the similarity of features on pixel and instance level, an instance-sequence matching module, and a segmentation module. The overall performance of this framework is competitive compared to the single-model approaches tested on the YouTube-VIS dataset [121], although it is somewhat lower in comparison to other complex CNN-based models [3].

In [35] a constrained self-attention architecture is proposed for video object detection that captures motion cues under the assumption that moving objects follow a continuous trajectory. An additional self-attention-based architecture is proposed in [36], which is applied in the temporal-spatial domain towards aligning two feature maps of consecutive frames. The proposed method features a low number of parameters, while it achieves higher accuracy in comparison to optical-flow-based methods such as DFF and FGFA. A related, efficient, and simplified architecture for video object detection via aggregating semantic features across frames is presented in [118]. Cosine similarity is implemented to compute the semantic similarities of the extracted proposals across frames, which are then aggregated accordingly. In [16] an object relation module is employed as part of a multi-stage architecture, in order to extract object relations in both spatial and temporal context. The relations are then further distilled with refined supportive object proposals and propagated across frames. Finally, in [98] an attention-based module is developed to learn long-range temporal relations between objects, in order to propagate the extracted features. The proposed architectures in [16], [118], and [98] outperform optical-flow-based approaches in accuracy.

F. GENERATIVE LEARNING
The objective of generative learning is to approximate a complex, high-dimensional probabilistic distribution that generates a class of data, in order to generate similar data. Developing generative architectures to understand complicated data distributions has been a long-standing research problem [84]. Recent works in this area [29], [59] have provided a new set of generative algorithms that can efficiently generate video segments or extract features from them. The most outstanding generative algorithms are variational autoencoders (VAEs) and generative adversarial networks (GANs).

• Variational auto-encoders: Their architecture resembles an auto-encoder, with the difference that their latent variable distribution is regularised during training. VAEs stemmed from the limitation of auto-encoders to generate new, unseen data, due to the fact that the distribution of the latent variable is unknown. To alleviate this issue, VAEs are trained to learn the distribution of the latent variable, assuming that it follows a Gaussian distribution with mean \mu and variance \sigma^2 [50]. One example of a VAE-based architecture for video object detection is presented in [67], where a modified VAE architecture, built on top of a Mask R-CNN, is proposed in order to detect and segment multiple instances in diverse videos. The proposed architecture outperforms MaskTrack R-CNN [121], because the MaskTrack R-CNN architecture depends entirely on the Mask R-CNN to perform predictions, resulting in difficulties in handling false negative proposals of the Mask R-CNN in highly diverse videos with occlusions, deformations, and pose variations of objects. By contrast, the architecture proposed in [67] merges a VAE with a Mask R-CNN network in a topology consisting of one encoder and three decoders. This results in three parallel branches that provide strong complements for predictions about bounding boxes and mask features, and they significantly reduce the number of false negatives in the Mask R-CNN module.

• Generative adversarial networks: Generative adversarial networks are built on the basis of a two-player, min-max game. The generator network G and the discriminator network D correspond to the first and the second player respectively. The generator's objective is to mislead the discriminator by generating natural-looking data (e.g. images, videos, etc.) from a random latent vector z. The discriminator, on the other


hand, tries to distinguish whether the data are real or fake (generated). The game is modeled as the following optimization problem:

\min_G \max_D V(G, D) = E_{x \sim p_{data}(x)} [\log D(x)] + E_{z \sim p_z(z)} [\log(1 - D(G(z)))]. (22)
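A minimal PyTorch sketch of the alternating optimization of Eq. (22) follows; the toy MLPs, the vector-shaped stand-in "data", and the learning rates are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Generator and discriminator as toy MLPs on vector data (assumptions).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8  # numerical guard for the logarithms

real = torch.randn(8, 32)   # stand-in for samples x ~ p_data(x)
z = torch.randn(8, 16)      # latent vectors z ~ p_z(z)

# Discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))].
d_loss = -(torch.log(D(real) + eps).mean()
           + torch.log(1 - D(G(z).detach()) + eps).mean())
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: descend E[log(1 - D(G(z)))], i.e. try to fool D.
g_loss = torch.log(1 - D(G(z)) + eps).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```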
to the weakness of DL methods to deal with hierarchical
A generative adversarial approach is developed in [102] to randomly generate masks that correspond to object appearance variations in time. The masks are then applied to reduce overfitting via adaptively dropping out input features. The developed architecture identifies the mask that maintains the most robust features of the target objects over a long period of time. In [106] a GAN is trained on color and depth information in order to generate backgrounds similar to the test samples. The generated background samples are then subtracted from the given test samples to detect foreground moving objects. Finally, in [11] the encoder-decoder architecture of [82], which is limited to processing information between only two adjacent frames, is extended with a GAN, to enforce temporal and spatial coherence of the generated object masks and to exploit information within a longer temporal window. The developed architecture exhibits similar accuracy as other state-of-the-art computer vision methods, while it is almost four times faster.

IV. CHALLENGES IN DEEP-LEARNING-BASED COMPUTER VISION
Despite the tremendous advances in deep learning and the fast pace of its breakthroughs over the last years, there are still challenges that prevent it from reaching its full potential. This section illustrates a set of major challenges related to computer vision tasks on video analysis with DL techniques.

DL-based methods have succeeded in achieving even human-level performance in complex computer vision tasks. However, this is possible only when massive datasets are available for training. Data are the core of any DL-based process and hence their shortage is often responsible for poor performance. Large-scale amounts of data are not available for all video applications though.

The impact of data scarcity is further escalated by the stand-alone approach of DL. A typical workflow for developing a DL module consists of creating a training set of inputs associated with outputs and learning the relations between them. In this way, however, the architecture becomes free-standing and isolated from prior, useful knowledge. Hence, the DL performance is highly determined by the existence of big-volume datasets, while at the same time applications that are more related to common-sense reasoning and less to categorization cannot be sufficiently targeted with purely DL methods [76].

Generalizability is an additional major challenge, concerning the performance of a data-driven model trained on one dataset when applied to other datasets. When training deep neural networks with high complexity and numerous parameters, the cost function might have multiple minima, which minimize the training error but may not generalize well to unseen data. The presence of noise and outliers in the training dataset is an additional reason for poor generalizability. Generalizability also deteriorates due to the weakness of DL methods in dealing with hierarchical structures, since DL modules tend to fail when generalization depends on compositional processes [63].

At the same time, although correlation does not imply causation, the two do not seem to be distinguishable for DL. Numerous neural network architectures have surfaced over the last decades that are highly capable of discovering complex correlations in data, yet they lack in reasoning about cause-effect relations or environment changes.

Finally, deep learning has delivered new, highly performing approaches to computer vision tasks, whose dominance, however, remains inversely proportional to their explanatory power. Rationalizing the output of data-driven techniques is a critical issue since more and more data-driven systems are adopted in safety-critical and high-impact applications.

V. INTEGRATING DEEP LEARNING WITH DOMAIN KNOWLEDGE
A. MOTIVATION
A prudent approach to address the abovementioned challenges is to expand the current methods and to merge them with principles that govern the dynamic behavior of systems over time, enabling an adaptation to new, unseen scenarios. Combining DL-based techniques with equation-based dynamic models (DMs) in a complementary way, or in other words, integrating common-sense understanding into artificial intelligence, constitutes a particularly interesting challenge for computer vision systems.

Enabling data-driven vision systems to understand the principles that govern the behavior of objects is essential for the development of autonomous systems that understand observed scenarios and have the ability to apply these principles to never-seen situations. Leveraging domain knowledge to identify equation-based models that describe how the properties of objects and entities change over time, and embedding them into DL techniques, can lead to novel, highly robust, and performant architectures. Such models could be developed for instance from well-known first principles in order to describe how an object moves, and they could be coupled with DL methods, forming a hybrid computer vision architecture. It is straightforward to conclude that hybrid architectures are more efficient compared to purely data-driven or model-based techniques, as they harness the benefits of both disciplines. Hybrid methods that combine scientific domain knowledge with data-driven models allow for accurate inference even with imperfect models and limited amounts of data.

The integration of the two disciplines in a hybrid architecture can be realized either by infusing mathematical rules into


FIGURE 3. Hybrid architecture. The dynamic model could refer to a first-principle model or any other mathematical or computer model that is derived from domain knowledge and that describes how the properties of objects and entities change over time.

a DL architecture or by combining the operation of the two separate modules in a complementary manner. An advantage of this second version of a hybrid architecture is the fact that an easy and straightforward recalibration of the DM module is feasible if a bidirectional interaction between the two modules is enabled. More specifically, the DL module, which can be re-trained incrementally when new data become available, can also enable the recalibration of the DM module. This results in a hybrid architecture which is highly flexible and easily adaptable to different scenarios.

Hybrid architectures merging data-driven techniques with domain knowledge, such as from physics, have been developed recently, introducing a novel research field which is still in its infancy [55], [90]. As a result, their applications are limited mainly to topics related to climate science and geology. Their expansion to other disciplines like computer vision remains a challenging research topic but would undoubtedly contribute towards addressing the abovementioned impediments in purely data-driven methods.
B. HYBRID ARCHITECTURES
A taxonomy of four general classes of integrated data-driven and model-based techniques can be derived. This classification is based on the level at which the integration takes place [55], [90]. More specifically, the four classes are: (1) preprocessing level, (2) initialization, (3) design of architecture, and (4) regularization. This section presents an analysis of these different methodologies.

1) DATA PREPROCESSING LEVEL
Data preprocessing is essential in all data-driven techniques before passing the data through the DL module. The reason is straightforward: the quality of the data determines the information that can be extracted and hence it directly influences the learning process of the DL algorithm. As a result, it is vital to apply a preprocessing technique before passing the data through the DL model.

The concept of data preprocessing is a major area in the field of deep learning. There are three main steps involved in data preprocessing, that is: (1) data cleaning, (2) data transformation, and (3) data reduction. Data cleaning refers to the handling of missing data as well as to noise removal. Data transformation may include normalization of the data, band-pass filtering, downsampling, and feature selection. When the input involves time-series signals, the data can be converted to the frequency domain via the fast Fourier transform (FFT), as sketched below. This transformation can be applied in anomaly detection, such as e.g. in the bearings of a rotating machine [94]. Finally, reducing the dimension of the feature set is another technique widely applied when preprocessing data. A thorough analysis of data preprocessing techniques is presented in [21].
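The following NumPy sketch illustrates such an FFT-based preprocessing step on a synthetic 1-D vibration signal; the sampling rate, the 50 Hz tone, and the noise level are assumptions chosen for the example.

```python
import numpy as np

# Toy preprocessing pipeline for a vibration signal, e.g. from a
# bearing: normalize, then move to the frequency domain with the FFT
# so that fault-related spectral lines become explicit features.
fs = 1000.0                                   # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 50 * t) + 0.3 * rng.normal(size=t.size)

signal = (signal - signal.mean()) / signal.std()   # normalization step
spectrum = np.abs(np.fft.rfft(signal))             # FFT-based transformation
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)

top = freqs[spectrum.argmax()]
print(f"dominant frequency: {top:.1f} Hz")         # ~50 Hz for this toy signal
```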
2) INITIALIZATION
One important design choice when building a neural network architecture is related to the parameter initialization [117]. Iterative optimization algorithms such as gradient descent are used during the process of training a neural network in order to estimate the network's parameters. In this process, an initial value for the parameters is required as a first step to start the optimization process. Quite often the initialization of the parameters is done based on a random distribution. Random initialization, though, can make the optimization algorithm that is employed for the calculation of the network weights converge to local minima or saddle points.

An approach towards this issue would be to use a technique called transfer learning [85]. The basic idea of transfer learning is based on pretraining a neural network on a simpler, related problem. This pretraining task takes place under the assumption that a big quantity of data is available. This pretrained neural network can then be used as the initial state for the training on the original problem, as it is closer to the optimal parameter values than random initialization. Transfer learning is a widely used technique in complex DL applications such as natural language processing and computer vision. However, the performance of this technique is highly dependent on the availability of big-scale data. An alternative approach is to employ domain-specific knowledge to assist the selection of the initial values of the parameters involved [55]. In this way, first-principle models can be used to generate approximate simulations for the initialization of the parameters of the neural network. Domain knowledge can ensure a reliable initialization of the parameters, which can assist in achieving generalizable, interpretable, and physically consistent architectures.

3) DESIGN OF ARCHITECTURE
Data-driven techniques have made a major impact at realizing highly performing systems for solving hard problems related to pattern recognition, prediction, etc. However, a major impediment to their wide adoption in critical applications is their "black box" nature, since our understanding of their complexity is limited. Hence, domain knowledge can be infused in a DL architecture to ensure its interpretability. One possible approach for this integration is to feed the output of the equation-based model f_{DM} as input to the DL module f_{DL}, i.e. f_{hybrid} : (X, P_{DM}, P_{DL}) \to Y, where X is the input, Y is the output, P_{DM}, P_{DL} the parameters of the


dynamic model and the DL model respectively, and f_{hybrid} the composition of the two functions, f_{hybrid} = f_{DL} \circ f_{DM} [90].
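A minimal sketch of this composition follows; the constant-velocity dynamic model, the state layout, and all sizes are our illustrative assumptions, not a system from the cited works.

```python
import torch
import torch.nn as nn

# Sketch of f_hybrid = f_DL o f_DM: a known dynamic model produces a
# physically grounded prediction, which is appended to the raw input
# as an extra feature for the DL module.
def f_dm(state):
    # Dynamic model (assumption): constant-velocity motion, x' = x + v * dt.
    pos, vel = state[:, :2], state[:, 2:]
    return pos + 0.1 * vel                  # predicted next position

f_dl = nn.Sequential(nn.Linear(4 + 2, 32), nn.ReLU(), nn.Linear(32, 2))

def f_hybrid(state):
    # The DM output enters the DL module as an additional input feature.
    dm_pred = f_dm(state)
    return f_dl(torch.cat([state, dm_pred], dim=1))

state = torch.randn(8, 4)                   # batch of [x, y, vx, vy] states
print(f_hybrid(state).shape)                # refined position estimates
```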
Two main categories of architectures can result from is achieved, which is essential for training generalizable
merging DL with dynamic models, founded on prior domain models. In addition, the physics-based loss function fPhy
knowledge. In the first category, the output of the model is requires no labeled data which allows the training of the
fed through the DL module at the first or at an additional DL module to be expanded to non-labeled data. A plethora
layer. In the second category, the model is embedded into of implementations that impose physics-based constraints
the DL module. Many architectures with respect to the first on the training of DL modes has surfaced recently [81],
class have surfaced lately in the field of climate and geology [103], [107]. In [56] a physics-based loss function is
applications. In [52], [56], the output of a physics-based used for the training of a temperature lake predictor. The
model is provided as an additional input feature to the DL loss function encompasses a constraint resulting from the
module in an application related to predicting the temperature relationship between the temperature, the density, and the
of a lake based on the depth. In [86], a physics-based depth of the lake water. In this way, the trained predictor
neural network architecture is used in order to simulate achieves enhanced generalizability while at the same time
broadband earthquake ground motions. The DL module is consistency with first-principle laws is ensured for the
used to predict the ground motion in the short term, including results. In [51], the application of lake temperature prediction
transient effects, which are particularly complex to model is extended to include temporal physical processes. More
mathematically. The DM module is then used to simulate the specifically, a physics-based RNN is developed that involves
response in a long-term period. energy conservation constraints. Standard LSTM models
In the second class, the DM module is embedded into store specific information at each time step, which feeds to
the DL module architecture. An example of this class is a the next time step. However, when the models are trained
physics-based model with an RNN including LSTMs [101] on data from specific seasons or from multiple years, it is
where the sensor data as well as the DM generated output are difficult to generalize to data from different time periods
ingested as input to the RNN architecture. since the time profiles vary significantly between each
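
To make the first category concrete, the following minimal sketch (not taken from the cited works) augments the input of a DL module with the output of a given equation-based model f_DM, realizing f_hybrid = f_DL([x, f_DM(x)]); the layer sizes and the toy physics model are illustrative assumptions:

```python
import torch
from torch import nn

class HybridNet(nn.Module):
    """First-category hybrid: the dynamic-model output is an extra feature."""

    def __init__(self, f_dm, in_dim: int, dm_dim: int, out_dim: int):
        super().__init__()
        self.f_dm = f_dm  # equation-based model, assumed given and fixed
        self.f_dl = nn.Sequential(
            nn.Linear(in_dim + dm_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # the physics model itself is not trained
            x_dm = self.f_dm(x)
        return self.f_dl(torch.cat([x, x_dm], dim=-1))

# Hypothetical physics model: temperature as a decaying function of depth.
f_dm = lambda x: 4.0 + 10.0 * torch.exp(-x[:, :1])
model = HybridNet(f_dm, in_dim=3, dm_dim=1, out_dim=1)
```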

4) REGULARIZATION
Deep neural networks can involve numerous parameters. However, when no large amounts of data are available, deep neural networks tend to overfit or, in other words, they fail to discover the underlying relationship described by the training data, and hence they cannot extrapolate to observed data outside the training set. One way to handle this issue is to apply physical constraints on the loss function of the neural network. Several regularization techniques have been developed in this way to prevent neural networks from overfitting. This is achieved by applying penalties to layer parameters and by integrating these penalties in the loss function that is minimized during training. The loss function in that case will be of the following form [117]:

f_Loss = f_Trn(Y, Ŷ) + λ R(W) + γ f_Phy(Ŷ),  (23)

where f_Trn corresponds to a function that represents the error between the predicted value Ŷ and the true value Y. This function can be, for example, the mean squared error or the cross entropy. In addition, λ represents a hyperparameter determining the weight of the regularization term R(W). The first two terms of (23) describe the standard loss function used when training a neural network. The additional term f_Phy corresponds to the physics-based constraint, and it aims to ensure the consistency of the trained system with first-principle laws or dynamic models. The weight of this function is represented by the hyperparameter γ. Given the true value Y, the following is considered as the general optimization problem to solve for (23):

argmin_W f_Trn(Y, Ŷ) + λ R(W) + γ f_Phy(Ŷ).  (24)

By introducing model-based constraints in the loss function for the training of DL modules, scientific consistency is achieved, which is essential for training generalizable models. In addition, the physics-based loss function f_Phy requires no labeled data, which allows the training of the DL module to be expanded to non-labeled data. A plethora of implementations that impose physics-based constraints on the training of DL models has surfaced recently [81], [103], [107]. In [56], a physics-based loss function is used for the training of a lake temperature predictor. The loss function encompasses a constraint resulting from the relationship between the temperature, the density, and the depth of the lake water. In this way, the trained predictor achieves enhanced generalizability while at the same time consistency of the results with first-principle laws is ensured. In [51], the application of lake temperature prediction is extended to include temporal physical processes. More specifically, a physics-based RNN is developed that involves energy conservation constraints. Standard LSTM models store specific information at each time step, which feeds into the next time step. However, when the models are trained on data from specific seasons or from multiple years, it is difficult to generalize to data from different time periods, since the time profiles vary significantly between each other. By including the energy flux changes, however, which determine the temperature changes, the architecture can successfully predict the lake temperature, even on unseen data. Another example is given in [53], where the data-driven model is penalized with the equation describing the time evolution of waves in order to identify the location of underwater obstacles from acoustic measurements. In this way, the accuracy of the model outside the training dataset is enhanced. Finally, [10] presents a case where multiple physics-based terms are present in a loss function. These might be competing loss terms with multiple local minima that correspond to different physics equations that need to be minimized together. Hence, an approach is presented where the contribution of each term is adaptively tuned during the training phase in order to improve the generalizability of the developed architecture.
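
As a rough sketch of how the composite loss in (23) can be implemented, consider the following PyTorch fragment. The density-based constraint is only in the spirit of the lake-temperature example of [56]; the constant and the monotonicity form are illustrative assumptions, not the exact formulation of that work:

```python
import torch
from torch import nn

def physics_residual(y_hat: torch.Tensor) -> torch.Tensor:
    # Assumed constraint: predictions are ordered by increasing depth, and
    # the water density implied by the predicted temperature must not
    # decrease with depth (density peaks at about 4 degrees Celsius).
    rho = 1000.0 * (1.0 - 6.63e-6 * (y_hat - 4.0) ** 2)
    return torch.relu(rho[:-1] - rho[1:]).mean()  # penalize violations only

def total_loss(model: nn.Module, y: torch.Tensor, y_hat: torch.Tensor,
               lam: float = 1e-4, gamma: float = 1.0) -> torch.Tensor:
    """Composite loss of (23): training error + weight penalty + physics term."""
    f_trn = nn.functional.mse_loss(y_hat, y)               # f_Trn(Y, Y_hat)
    r_w = sum((w ** 2).sum() for w in model.parameters())  # R(W), an L2 penalty
    f_phy = physics_residual(y_hat)                        # f_Phy(Y_hat), label-free
    return f_trn + lam * r_w + gamma * f_phy
```

Note that f_Phy only uses the predictions Ŷ, so, as discussed above, it can also be evaluated on unlabeled samples.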
C. HYBRID ARCHITECTURE IMPLEMENTATION IN COMPUTER VISION
Integrating useful domain knowledge into DL-based computer vision tasks is essential to build robust, generalizable systems and to compensate for the lack of large-volume training data. An example of such a hybrid architecture is proposed in [103], where the height of a free-falling object is estimated in each frame of a video by training a CNN to detect and track objects obeying the laws of free fall. The training of this CNN is based on a loss function in which first-principle laws are encoded. In [1], physics is blended with DL in the framework of a two-stage encoder with the aim to recover the shape of an object from polarized photos. In [61], an LSTM architecture is combined with a dynamics model in order to acquire a proposal distribution over an object's state. Finally, in [119], a generative vision system is proposed for estimating physical features of objects by integrating the output of a multi-physics simulation engine in the loop.

Integrating DL techniques with domain knowledge is a recently introduced research topic [55], [90]. As a result, using domain knowledge to derive first-principle models (or, from a broader perspective, any dynamic mathematical or computer model [73] that describes how the properties of objects and entities change over time, see Figure 3) and merging these models with existing DL architectures constitutes an especially promising research direction for addressing the challenges of DL in computer vision.
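
As an illustration of how such a physics-encoded loss can look for the free-fall example, the sketch below penalizes per-frame height predictions whose discrete second derivative deviates from the gravitational acceleration; the frame rate and the discretization are assumptions and not the original implementation of [103]:

```python
import torch

def freefall_loss(h: torch.Tensor, dt: float = 1.0 / 30.0) -> torch.Tensor:
    """Label-free physics loss: h holds the heights predicted by the CNN for
    T consecutive frames; under free fall their discrete second difference
    should equal -g * dt**2."""
    g = 9.81
    accel = h[2:] - 2.0 * h[1:-1] + h[:-2]       # discrete second derivative
    return ((accel + g * dt ** 2) ** 2).mean()   # deviation from free fall

# Usage sketch: h = cnn(video_frames).squeeze(); loss = freefall_loss(h)
```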

VI. OUTLOOK: FUTURE DIRECTIONS IN DEEP LEARNING FOR OBJECT DETECTION AND SEGMENTATION IN VIDEOS
Deep learning has brought a catalytic effect in the field of computer vision for video analysis. Although nobody knows with certainty how DL will evolve over the coming decades, it is expected that much of the future research will revolve around the following critical areas [32], [77], [114]:

• Out-of-distribution generalization: Future computer vision systems should be able to make accurate predictions not only in a known context but also for data with different distributions than the ones learned from the training samples. The main reason behind the difficulty of DL systems to accurately generalize and predict on unseen data is the fundamental assumption that training and test data are independent and identically distributed (IID) [97], [128]. In many real-life cases, however, the IID assumption is hardly satisfied. The ability to generalize under distribution shifts is of critical significance, and hence, the investigation of out-of-distribution generalization is expected to attract enormous research interest in the academic field.

• Deep learning systems with causal structures: Causality is expected to be a central strand of DL research in the coming years [89]. Developing DL systems that can represent causal relationships can increase their safety and reliability, and introducing a causal understanding of basic concepts in DL methods could certainly be the key to achieving robustness in complex real-world environments.

• Effective representation learning with few or no labeled data: While techniques for representation learning when massive labeled datasets are available have become remarkably powerful, various challenges remain in the case of limited labeled data. Developing approaches for addressing the issue of labeled data scarcity is an emerging popular direction of research.

• Adaptation in time-varying environments: Adapting to time-varying environments and other dynamic-behavior-related problems has been under examination for many years, and it is expected to gain massive attention from the DL research community over the coming years. Allowing the integration of new knowledge online and at the same time being capable of preserving the knowledge learned during previous interactions are only a few of the desirable features of future vision mechanisms.

• Multi-modal learning: Ultimately, major emphasis in research is expected to be placed upon developing methods that can process and link information combining modalities from various architectures [65], [76], since unimodal DL methods seem to fail to fulfill all the desirable future DL capabilities. In particular, combined architectures that integrate DL modules with domain knowledge could provide a suitable answer to most research questions arising from the DL directions listed above.

VII. CONCLUSION
In this paper, a study of the detection and segmentation of objects in video segments has been presented. A review of the currently existing techniques has been given, as well as the major challenges that data-driven techniques face. Then an extension of the data-driven techniques to a hybrid architecture that fuses data-driven techniques with equation-based models describing the dynamic behavior of objects and entities over time has been proposed, in order to address issues like data scarcity, generalizability, and interpretability of purely data-driven architectures. Finally, a survey of the current developments in hybrid architectures has been presented. We hope that this work will assist in better understanding the current status of DL in computer vision for video analysis, as well as in presenting interesting directions as guidelines for future work.

REFERENCES
[1] Y. Ba, A. Ross Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi, ''Deep shape from polarization,'' 2019, arXiv:1903.10210.
[2] H. Bay, T. Tuytelaars, and L. Van Gool, ''SURF: Speeded up robust features,'' in Computer Vision (ECCV). Berlin, Germany: Springer, 2006, pp. 404–417.
[3] G. Bertasius and L. Torresani, ''Classifying, segmenting, and tracking object instances in video with mask propagation,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9739–9748.
[4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, ''Fully-convolutional Siamese networks for object tracking,'' in Proc. Eur. Conf. Comput. Vis. (ECCV Workshops), 2016, pp. 850–865.
[5] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, ''CNN variants for computer vision: History, architecture, application, challenges and future scope,'' Electronics, vol. 10, no. 20, p. 2470, Oct. 2021.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[7] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, ''YOLACT++: Better real-time instance segmentation,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 1108–1121, Feb. 2022.
[8] H. Bourlard and Y. Kamp, ''Auto-association by multilayer perceptrons and singular value decomposition,'' Biol. Cybern., vol. 59, nos. 4–5, pp. 291–294, Sep. 1988.
[9] A. Broad, M. Jones, and T. Y. Lee, ''Recurrent multi-frame single shot detector for video object detection,'' in Proc. BMVC, 2018, pp. 1–14.
[10] M. Elhamod, J. Bu, C. Singh, M. Redell, A. Ghosh, V. Podolskiy, W.-C. Lee, and A. Karpatne, ''CoPhy-PGNN: Learning physics-guided neural networks with competing loss functions for solving eigenvalue problems,'' 2020, arXiv:2007.01420.
[11] S. Caelles, A. Pumarola, F. Moreno-Noguer, A. Sanfeliu, and L. Van Gool, ''Fast video object segmentation with spatio-temporal GANs,'' 2019, arXiv:1903.12161.
[12] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, ''BRIEF: Binary robust independent elementary features,'' in Computer Vision (ECCV). Berlin, Germany: Springer, 2010, pp. 778–792.
[13] K. Cho, B. van Merrienboer, C. C. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ''Learning phrase representations using RNN encoder-decoder for statistical machine translation,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014.
[14] D. Comaniciu, V. Ramesh, and P. Meer, ''Real-time tracking of non-rigid objects using mean shift,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2000, pp. 142–149.
[15] J. Dai, Y. Li, K. He, and J. Sun, ''R-FCN: Object detection via region-based fully convolutional networks,'' 2016, arXiv:1605.06409.
[16] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ''Relation distillation networks for video object detection,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 7022–7031.
[17] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ''Single shot video object detector,'' IEEE Trans. Multimedia, vol. 23, pp. 846–858, 2021.
[18] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, ''FlowNet: Learning optical flow with convolutional networks,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[19] C. Feichtenhofer, A. Pinz, and A. Zisserman, ''Detect to track and track to detect,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3057–3065.
[20] P. Felzenszwalb, D. McAllester, and D. Ramanan, ''A discriminatively trained, multiscale, deformable part model,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[21] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, ''Big data preprocessing: Methods and prospects,'' Big Data Anal., vol. 1, no. 1, pp. 1–22, Dec. 2016.
[22] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, ''Vision meets robotics: The KITTI dataset,'' Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013.
[23] A. Geiger, P. Lenz, and R. Urtasun, ''Are we ready for autonomous driving? The KITTI vision benchmark suite,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[24] T. Georgiou, Y. Liu, W. Chen, and M. Lew, ''A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision,'' Int. J. Multimedia Inf. Retr., vol. 9, no. 3, pp. 135–170, Sep. 2020.
[25] J. J. Gibson, The Ecological Approach to Visual Perception. Boston, MA, USA: Houghton Mifflin, 1979.
[26] R. Girdhar, J. Joao Carreira, C. Doersch, and A. Zisserman, ''Video action transformer network,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 244–253.
[27] R. Girshick, ''Fast R-CNN,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (Adaptive Computation and Machine Learning Series). Cambridge, MA, USA: MIT Press, 2016.
[29] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial nets,'' in Proc. 27th Int. Conf. Neural Inf. Process. Syst., vol. 2. Cambridge, MA, USA: MIT Press, 2014, pp. 2672–2680.
[30] N. Gordon, ''Novel approach to nonlinear/non-Gaussian Bayesian state estimation,'' IEE Proc. F-Radar Signal Process., vol. 140, no. 6, pp. 107–113, Apr. 1993.
[31] K. Grauman and T. Darrell, ''The pyramid match kernel: Discriminative classification with sets of image features,'' in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2005, pp. 1458–1465.
[32] H. S. Greenwald and C. K. Oertel, ''Future directions in machine learning,'' Frontiers Robot. AI, vol. 3, p. 79, Jan. 2017.
[33] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, ''LSTM: A search space Odyssey,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[34] R. L. Gregory, Eye and Brain: The Psychology of Seeing. New York, NY, USA: McGraw-Hill, 1978.
[35] Y. Gu, L. Wang, Z. Wang, Y. Liu, M.-M. Cheng, and S.-P. Lu, ''Pyramid constrained self-attention network for fast video salient object detection,'' in Proc. AAAI Conf. Artif. Intell., 2020, pp. 10869–10876.
[36] C. Guo, B. Fan, J. Gu, Q. Zhang, S. Xiang, V. Prinet, and C. Pan, ''Progressive sparse local attention for video object detection,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3909–3918.
[37] D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. Angelova, and N. Sunderhauf, ''Probabilistic object detection: Definition and evaluation,'' in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1020–1029.
[38] J. Han and C. Moraga, ''The influence of the sigmoid function parameters on the speed of backpropagation learning,'' in From Natural to Artificial Neural Computation. Berlin, Germany: Springer, 1995, pp. 195–201.
[39] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, ''Simultaneous detection and segmentation,'' 2014, arXiv:1407.1808.
[40] C. Harris and M. Stephens, ''A combined corner and edge detector,'' in Proc. Alvey Vis. Conf., 1988, pp. 147–151.
[41] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ''Mask R-CNN,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, Feb. 2020.
[42] K. He, X. Zhang, S. Ren, and J. Sun, ''Spatial pyramid pooling in deep convolutional networks for visual recognition,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014.
[43] C. Hetang, H. Qin, S. Liu, and J. Yan, ''Impression network for video object detection,'' 2017, arXiv:1712.05896.
[44] G. Hinton, S. Osindero, and Y.-W. Teh, ''A fast learning algorithm for deep belief nets,'' Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[45] G. E. Hinton and R. R. Salakhutdinov, ''Reducing the dimensionality of data with neural networks,'' Science, vol. 313, no. 5786, pp. 504–507, 2006.
[46] G. E. Hinton and T. J. Sejnowski, Learning and Relearning in Boltzmann Machines. Cambridge, MA, USA: MIT Press, 1986, pp. 282–317.
[47] S. Hochreiter and J. Schmidhuber, ''Long short-term memory,'' Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[48] B. K. P. Horn and B. G. Schunck, ''Determining optical flow,'' Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, Aug. 1980.
[49] B. J. Hou and Z. H. Zhou, ''Learning with interpretable structure from gated RNN,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2267–2279, Jul. 2020.
[50] A. Jabbar, X. Li, and B. Omar, ''A survey on generative adversarial networks: Variants, applications, and training,'' ACM Comput. Surv., vol. 54, no. 8, pp. 1–49, Nov. 2022.
[51] X. Jia, J. Willard, A. Karpatne, J. Read, J. Zwart, M. S. Steinbach, and V. Kumar, ''Physics guided RNNs for modeling dynamical systems: A case study in simulating lake temperature profiles,'' in Proc. SIAM Int. Conf. Data Mining, May 2019, pp. 558–566.
[52] X. Jia, J. Willard, A. Karpatne, J. S. Read, J. A. Zwart, M. Steinbach, and V. Kumar, ''Physics-guided machine learning for scientific discovery: An application in simulating lake temperature profiles,'' 2020, arXiv:2001.11086.
[53] A. Kahana, E. Turkel, S. Dekel, and D. Givoli, ''Obstacle segmentation based on the wave equation and deep learning,'' J. Comput. Phys., vol. 413, Jul. 2020, Art. no. 109458.
[54] R. E. Kalman, ''A new approach to linear filtering and prediction problems,'' J. Basic Eng., vol. 82, no. 1, pp. 35–45, Mar. 1960.
[55] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, ''Theory-guided data science: A new paradigm for scientific discovery from data,'' IEEE Trans. Knowl. Data Eng., vol. 29, no. 10, pp. 2318–2331, Jun. 2017.
[56] A. Daw, A. Karpatne, W. Watkins, J. Read, and V. Kumar, ''Physics-guided neural networks (PGNN): An application in lake temperature modeling,'' 2017, arXiv:1710.11431.
[57] S. Khan, M. Naseer, M. Hayat, S. Waqas Zamir, F. Shahbaz Khan, and M. Shah, ''Transformers in vision: A survey,'' 2021, arXiv:2101.01169.
[58] Y. Kim, C. Denton, L. Hoang, and A. M. Rush, ''Structured attention networks,'' 2017, arXiv:1702.00887.
[59] D. P. Kingma and M. Welling, ''Auto-encoding variational Bayes,'' in Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada, Apr. 2014.
[60] J. F. Kolen and S. C. Kremer, ''Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,'' in A Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001, pp. 237–243, doi: 10.1109/9780470544037.ch14.
[61] J. Kossen, K. Stelzner, M. Hussing, C. Voelcker, and K. Kersting, ''Structured object-aware physics prediction for video modeling and planning,'' in Proc. Int. Conf. Learn. Represent., 2020.
[62] A. Kumar and S. Srivastava, ''Object detection system based on convolution neural networks using single shot multi-box detector,'' Proc. Comput. Sci., vol. 171, pp. 2610–2617, Jan. 2020.
[63] B. M. Lake and M. Baroni, ''Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,'' in Proc. ICML, 2018, pp. 2879–2888.
[64] S. Lazebnik, C. Schmid, and J. Ponce, ''Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 2169–2178.
[65] Y. LeCun, Y. Bengio, and G. Hinton, ''Deep learning,'' Nature, vol. 521, no. 7553, pp. 436–444, Feb. 2015.
[66] K. Li, W. Ma, U. Sajid, Y. Wu, and G. Wang, ''Object detection with convolutional neural network,'' 2019, arXiv:1912.01844.
[67] C.-C. Lin, Y. Hung, R. Feris, and L. He, ''Video instance segmentation tracking with a modified VAE architecture,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13144–13154.
[68] Z. C. Lipton, J. Berkowitz, and C. Elkan, ''A critical review of recurrent neural networks for sequence learning,'' 2015, arXiv:1506.00019.
[69] D. Liu, Y. Cui, Y. Chen, J. Zhang, and B. Fan, ''Video object detection for autonomous driving: Motion-aid feature calibration,'' Neurocomputing, vol. 409, pp. 1–11, Oct. 2020.
[70] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, ''Looking fast and slow: Memory-guided mobile video object detection,'' 2019, arXiv:1903.10172.
[71] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ''SSD: Single shot multibox detector,'' in Computer Vision (ECCV). Cham, Switzerland: Springer, 2016, pp. 21–37.
[72] X. Liu, Z. Deng, and Y. Yang, ''Recent progress in semantic image segmentation,'' Artif. Intell. Rev., vol. 52, no. 2, pp. 1089–1106, Aug. 2019.
[73] L. Ljung, System Identification: Theory for the User, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1999.
[74] D. G. Lowe, ''Object recognition from local scale-invariant features,'' in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999, pp. 1150–1157.
[75] Y. Lu, C. Lu, and C.-K. Tang, ''Online video object detection using association LSTM,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2363–2371.
[76] G. Marcus, ''Deep learning: A critical appraisal,'' 2018, arXiv:1801.00631.
[77] G. Marcus, ''The next decade in AI: Four steps towards robust artificial intelligence,'' 2020, arXiv:2002.06177.
[78] S. Mojtaba Marvasti-Zadeh, L. Cheng, H. Ghanei-Yakhdan, and S. Kasaei, ''Deep learning for visual tracking: A comprehensive survey,'' 2019, arXiv:1912.00535.
[79] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem, ''TrackingNet: A large-scale dataset and benchmark for object tracking in the wild,'' 2018, arXiv:1803.10794.
[80] W. K. Mutlag, S. K. Ali, Z. M. Aydam, and B. H. Taher, ''Feature extraction methods: A review,'' J. Phys., Conf. Ser., vol. 1591, no. 1, Jul. 2020, Art. no. 012028.
[81] M. Amin Nabian and H. Meidani, ''Physics-driven regularization of deep neural networks for enhanced engineering design and analysis,'' 2018, arXiv:1810.05547.
[82] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim, ''Fast video object segmentation by reference-guided mask propagation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7376–7385.
[83] N. O'Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh, ''Deep learning vs. traditional computer vision,'' in Proc. Comput. Vis. Conf. (CVC). Cham, Switzerland: Springer, 2020, pp. 128–144.
[84] A. Oussidi and A. Elhassouny, ''Deep generative models: Survey,'' in Proc. Int. Conf. Intell. Syst. Comput. Vis. (ISCV), Apr. 2018, pp. 1–8.
[85] S. J. Pan and Q. Yang, ''A survey on transfer learning,'' IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[86] R. Paolucci, F. Gatti, and M. Infantino, ''Broadband ground motions from 3D physics-based numerical simulations using artificial neural networks,'' Bull. Seismolog. Soc. Amer., vol. 108, no. 3A, pp. 1272–1286, Feb. 2018.
[87] C. Patel, D. Bhatt, U. Sharma, R. Patel, S. Pandya, K. Modi, N. Cholli, A. Patel, U. Bhatt, M. A. Khan, S. Majumdar, M. Zuhair, K. Patel, S. A. Shah, and H. Ghayvat, ''DBGC: Dimension-based generic convolution block for object recognition,'' Sensors, vol. 22, no. 5, p. 1780, Feb. 2022.
[88] C. I. Patel, S. Garg, T. Zaveri, and A. Banerjee, ''Top-down and bottom-up cues based moving object detection for varied background video sequences,'' Adv. Multimedia, vol. 2014, pp. 1–20, Jan. 2014.
[89] J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect, 1st ed. New York, NY, USA: Basic Books, 2018.
[90] R. Rai and C. K. Sahu, ''Driven by data or derived through physics? A review of hybrid physics guided machine learning techniques with cyber-physical system (CPS) focus,'' IEEE Access, vol. 8, pp. 71050–71073, 2020.
[91] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ''You only look once: Unified, real-time object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[92] S. Ren, K. He, R. Girshick, and J. Sun, ''Faster R-CNN: Towards real-time object detection with region proposal networks,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[93] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. F. Li, ''ImageNet large scale visual recognition challenge,'' Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[94] M. Sadoughi and C. Hu, ''Physics-based convolutional neural network for fault diagnosis of rolling element bearings,'' IEEE Sensors J., vol. 19, no. 11, pp. 4181–4192, Jun. 2019.
[95] R. Salakhutdinov and G. Hinton, ''Deep Boltzmann machines,'' in Proc. 12th Int. Conf. Artif. Intell. Statist., vol. 5, Clearwater Beach, FL, USA, Apr. 2009, pp. 448–455.
[96] R. Salakhutdinov and H. Larochelle, ''Efficient learning of deep Boltzmann machines,'' in Proc. AISTATS, 2010, pp. 693–700.
[97] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, ''Towards out-of-distribution generalization: A survey,'' 2021, arXiv:2108.13624.
[98] M. Shvets, W. Liu, and A. Berg, ''Leveraging long-range temporal relationships between proposals for video object detection,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9755–9763.
[99] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' 2014, arXiv:1409.1556.
[100] S. Singh, A. Prasad, K. Srivastava, and S. Bhattacharya, ''Object motion detection methods for real-time video surveillance: A survey with empirical evaluation,'' in Smart Systems and IoT: Innovations in Computing. Singapore: Springer, 2020, pp. 663–679.
[101] S. K. Singh, R. Yang, A. Behjat, R. Rai, S. Chowdhury, and I. Matei, ''PI-LSTM: Physics-infused long short-term memory network,'' in Proc. 18th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2019, pp. 34–41.
[102] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang, ''VITAL: Visual tracking via adversarial learning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8990–8999.
[103] R. Stewart and S. Ermon, ''Label-free supervision of neural networks with physics and domain knowledge,'' in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2576–2582.
[104] H. Suk, An Introduction to Neural Networks and Deep Learning. Amsterdam, The Netherlands: Elsevier, Jan. 2017, pp. 3–24.
[105] F. Sultana, A. Sufian, and P. Dutta, ''A review of object detection models based on convolutional neural network,'' in Intelligent Computing: Image Processing Based Applications. Singapore: Springer, 2020, pp. 1–16.
[106] M. Sultana, A. Mahmood, S. Javed, and S. Ki Jung, ''Unsupervised RGBD video object segmentation using GANs,'' 2018, arXiv:1811.01526.
[107] L. Sun, H. Gao, S. Pan, and J.-X. Wang, ''Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data,'' Comput. Methods Appl. Mech. Eng., vol. 361, Apr. 2020, Art. no. 112732.
[108] X. Sun, P. Wu, and S. C. H. Hoi, ''Face detection using deep learning: An improved faster RCNN approach,'' Neurocomputing, vol. 299, pp. 42–50, Jul. 2018.
[109] R. Tao, E. Gavves, and A. W. M. Smeulders, ''Siamese instance search for tracking,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1420–1429.
[110] S. Tripathi, Z. Lipton, S. Belongie, and T. Nguyen, ''Context matters: Refining object detection in video with recurrent neural networks,'' in Proc. Brit. Mach. Vis. Conf., 2016, pp. 1–12.
[111] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, ''Selective search for object recognition,'' Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[112] A. Ullah, K. Muhammad, W. Ding, V. Palade, I. U. Haq, and S. W. Baik, ''Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications,'' Appl. Soft Comput., vol. 103, May 2021, Art. no. 107102.
[113] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. 31st Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2017, pp. 6000–6010.
[114] R. Verschae and J. Ruiz-del-Solar, ''Object detection: Current and future directions,'' Frontiers Robot. AI, vol. 2, p. 29, Nov. 2015.
[115] J. Wang, E. Sezener, D. Budden, M. Hutter, and J. Veness, ''A combinatorial perspective on transfer learning,'' in Proc. Adv. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2020, pp. 918–929.
[116] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, ''End-to-end video instance segmentation with transformers,'' 2020, arXiv:2011.14503.
[117] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, ''Integrating scientific knowledge with machine learning for engineering and environmental systems,'' 2020, arXiv:2003.04919.
[118] H. Wu, Y. Chen, N. Wang, and Z.-X. Zhang, ''Sequence level semantics aggregation for video object detection,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9216–9224.
[119] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, ''Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 28. Red Hook, NY, USA: Curran Associates, 2015, pp. 1–9.
[120] C. Xie, Y. Xiang, A. Mousavian, and D. Fox, ''Unseen object instance segmentation for robotic environments,'' 2020, arXiv:2007.08073.
[121] L. Yang, Y. Fan, and N. Xu, ''Video instance segmentation,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5187–5196.
[122] S. Yang, Y. Fang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, ''Crossover learning for fast online video instance segmentation,'' in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 8043–8052.
[123] S. Yang, X. Yu, and Y. Zhou, ''LSTM and GRU neural network performance comparison study: Taking Yelp review dataset as an example,'' in Proc. Int. Workshop Electron. Commun. Artif. Intell. (IWECAI), Jun. 2020, pp. 98–101.
[124] T. Yang and A. B. Chan, ''Recurrent filter learning for visual tracking,'' 2017, arXiv:1708.03874.
[125] Y. Yu, X. Si, C. Hu, and Z. Jianxun, ''A review of recurrent neural networks: LSTM cells and network architectures,'' Neural Comput., vol. 31, no. 7, pp. 1235–1270, Jul. 2019.
[126] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, ''Weakly supervised object localization and detection: A survey,'' IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 20, 2021, doi: 10.1109/TPAMI.2021.3074313.
[127] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ''Object detection with deep learning: A review,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[128] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. Change Loy, ''Domain generalization in vision: A survey,'' 2021, arXiv:2103.02503.
[129] H. Zhu, H. Wei, B. Li, X. Yuan, and N. Kehtarnavaz, ''A review of video object detection: Datasets, metrics and methods,'' Appl. Sci., vol. 10, no. 21, p. 7834, Nov. 2020.
[130] M. Zhu and M. Liu, ''Mobile video object detection with temporally-aware feature maps,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5686–5695.
[131] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, ''Flow-guided feature aggregation for video object detection,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 408–417.
[132] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, ''Deep feature flow for video recognition,'' 2016, arXiv:1611.07715.
[133] I. Ševo and A. Avramović, ''Convolutional neural network based automatic object detection on aerial images,'' IEEE Geosci. Remote Sens. Lett., vol. 13, no. 5, pp. 740–744, May 2016.

ATHINA ILIOUDI received the joint M.S. degree in smart electrical networks and systems from the KTH Royal Institute of Technology and the Eindhoven University of Technology, in 2018. She is currently pursuing the Ph.D. degree with the Delft Center for Systems and Control, Delft University of Technology. Her research interests include deep learning methods with first-principle modeling techniques and physics-informed neural networks for computer vision applications.

AZITA DABIRI received the Ph.D. degree from the Automatic Control Group, Chalmers University of Technology, in 2016. She was a Postdoctoral Researcher with the Department of Transport and Planning, TU Delft, from 2017 to 2019. In 2019, she received an ERCIM Fellowship and a Marie Curie Individual Fellowship, which allowed her to perform research at the Norwegian University of Science and Technology (NTNU) as a Postdoctoral Researcher. In 2020, she joined the Delft Center for Systems and Control, TU Delft, as an Assistant Professor. Her research interests include the integration of model-based and learning-based control.

BEN J. WOLF received the Ph.D. degree (cum laude) in artificial intelligence from the University of Groningen, The Netherlands, in 2020, on the topic of hydrodynamic imaging. He is currently a Postdoctoral Researcher at the Delft Center for Systems and Control, Delft University of Technology. His research interests include machine learning, neural networks, robotics, and hydrodynamic sensing.

BART DE SCHUTTER (Fellow, IEEE) received the Ph.D. degree (summa cum laude) in applied sciences from Katholieke Universiteit Leuven, Belgium, in 1996. He is currently a Full Professor and the Head of Department at the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. His current research interests include reinforcement learning, learning-based control, multi-level and multi-agent control, and control of hybrid systems. He is a Senior Editor of the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS and an Associate Editor of the IEEE TRANSACTIONS ON AUTOMATIC CONTROL.