Deep Learning For Object Detection and Segmentation in Videos Toward An Integration With Domain Knowledge
Received March 8, 2022, accepted March 21, 2022, date of publication March 28, 2022, date of current version April 4, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3162827
ABSTRACT Deep learning has enabled the rapid expansion of computer vision tasks from image frames to
video segments. This paper reviews the latest research on computer vision tasks in general, and in particular on the localization of objects and the identification of their associated pixels in video frames.
After performing a systematic analysis of the existing methods, the challenges related to computer vision
tasks are presented. In order to address the existing challenges, a hybrid framework is proposed, where deep
learning methods are coupled with domain knowledge. An additional feature of this survey is that a review of
the currently existing approaches integrating domain knowledge with deep learning techniques is presented.
Finally, some conclusions on the implementation of hybrid architectures to perform computer vision tasks
are discussed.
INDEX TERMS Computer vision, object detection, deep learning, theory-guided data science.
TABLE 1. List of abbreviations.

II. PRELIMINARIES
In this section, we introduce the most typical tasks of computer vision and we present a brief, comparative analysis between deep learning and conventional techniques in the domain of computer vision, as well as an overview of basic deep learning methods such as convolutional neural networks, restricted Boltzmann machines, and auto-encoders, which constitute the core for DL architectures in computer vision.

A. COMPUTER VISION TASKS
Computer vision tasks can be categorised into 4 major fields: (1) semantic segmentation, (2) classification & localization, (3) object detection, and (4) instance segmentation. The task of semantic segmentation refers to the process of assigning a class label to every pixel in an image [72]. One of the shortcomings of this task is the fact that semantic segmentation does not differentiate between instances of the same class. On the other hand, the classification & localization task aims to predict the class of a specific object in an image and to draw a bounding box around the region of the classified object in an image [126]. This task refers to a single object. However, most images in real-world settings contain multiple objects of different shapes and sizes. Therefore, object detection [37] refers to a more general approach where a varying number of predicted objects for every input image can be extracted, since it is unknown how many objects are expected to be detected in each image. Object detection systems strive to find every instance of an object and estimate the spatial extent of each one. Nevertheless, the detected objects are located just with bounding boxes.

The task of instance segmentation refers to the problem of detecting all the instances of a category in an image and marking the pixels that belong to each one of them [39]. Extending this task to the video domain results in simultaneous detection, segmentation, and tracking of the instances [121]. The instance segmentation task combines object detection, where individual objects are classified and localized with a bounding box, and semantic segmentation, where each pixel is classified into the given classes.

The task of object classification & localization is included in object detection. At the same time, in semantic segmentation, each pixel of an image is associated with a class label like a road, tree, pedestrian, etc. In other words, all objects of an image that belong to the same class are treated as a single entity. On the other hand, each object of the same class is treated as a distinct individual instance with instance segmentation. Hence, instance segmentation can be considered as a more elaborate implementation of semantic segmentation. Since all the computer vision tasks are similar, in this work mainly object detection and instance segmentation techniques will be examined, as they are the most dominant techniques required in extensive applications such as autonomous driving [69], video surveillance [100], face recognition [108], and robot navigation [120].

B. DEEP LEARNING VS. TRADITIONAL COMPUTER VISION TECHNIQUES
Traditional computer vision methods are based on hard-coded, rigid-rule algorithms to apply feature extraction on images [80]. Several algorithms have been developed to extract properties such as corners, edges, and regions of interest from images [2], [12], [40], [74], [88]. These algorithms showcase advantages such as transparency, in terms of allowing to trace back all steps of how a decision was made, and performance that is independent of the training dataset. At the same time, however, they have been criticised as inflexible, difficult to improve or adapt, and highly time-consuming to develop manually for each additional object to be detected [83]. Moreover, the performance of these methods significantly deteriorates when the number of classes to be detected increases. By contrast, DL utilizes massive data sets and numerous training cycles to learn how an object looks, following a process during which relevant features of an object of interest are extracted automatically. The DL architecture can then be implemented on previously unseen images and make accurate predictions. DL-based methods perform remarkably better than traditional methods, albeit with trade-offs regarding computational requirements and training time [83]. As a result, they have vastly replaced traditional computer vision techniques, thanks to their strong ability to be easily adjusted, to extract complex features in much more detail, and to be much more efficient in terms of accuracy and versatility [83]. Tremendous advancements in research have taken place in this domain, resulting in the development of numerous methods. The fundamental DL methods implemented on image computer vision applications are discussed in Section II-C.
C. IMAGE-BASED DEEP LEARNING METHODS
1) CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) have been widely used in image processing applications over the past decades [62], [66], [133]. Their structure consists of a number of convolutional and pooling layers, stacked one after another [5]. The convolutional layer can be visualized as a square matrix W of weights, called a kernel [87]. The kernel slides over the image looking for patterns; when it distinguishes a part of the image that is similar to its pattern, it returns a large positive value, otherwise it returns a small value. The input image is represented as a pixel matrix of size length × width × number of color channels (i.e. an RGB image has 3 color channels).

The convolutional layer is utilized for feature extraction and the pooling layer to downsample the resolution of the convolutional layer output. In this way, a dimension reduction is accomplished, which reduces the number of necessary parameters in the next layer, resulting in a less complex architecture. During the training process, the training samples are fed through the CNN and the error with respect to the desired output is calculated. The error and its gradient are then backpropagated through the network layers and the weights are updated.
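To make the convolution–pooling–backpropagation loop above concrete, the following is a minimal sketch of such a pipeline. It assumes PyTorch; the input resolution, channel counts, and number of classes are illustrative choices rather than values taken from the surveyed works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal CNN: convolution layers look for local patterns, pooling downsamples
# the feature maps, and a final linear layer maps the features to class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 RGB channels -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),                 # assumes 224x224 inputs and 10 classes
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def training_step(images, labels):
    """One iteration: forward pass, error, backpropagation, weight update."""
    logits = model(images)                  # images: (batch, 3, 224, 224)
    loss = F.cross_entropy(logits, labels)  # error with respect to the desired output
    optimizer.zero_grad()
    loss.backward()                         # backpropagate the error gradient
    optimizer.step()                        # update the weights
    return loss.item()

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
print(training_step(images, labels))
```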
CNN-based image object detectors can be separated into two main categories [105], [127]:
• Two-stage approach: In the two-stage method, the first stage extracts region proposals and the second stage classifies those region proposals and determines the bounding boxes of the classified objects. In the region proposal part, sliding window techniques such as Deformable Part Models [20] are adopted. An additional region proposal technique, employed in region-based convolutional neural networks (R-CNNs) [27], is selective search [111]. R-CNNs extract around 2000 region proposals on each input image, which is a significantly reduced number of regions compared to other sliding window methods. At the second stage of this architecture, a CNN is used for object detection over the region proposals. The size of the proposed regions is arbitrary, while the CNN requires a fixed-size input. Hence, a major drawback of R-CNNs is due to the fact that images need to be cropped or resized to meet the requirement for a fixed-size input. Spatial pyramid pooling [31], [42], [64] is a method used in order to achieve a fixed-size output irrespective of the input image size. Hence, spatial pyramid pooling networks can be trained and tested on varying-size images, which reduces overfitting of the model.
Both R-CNNs and spatial pyramid pooling networks are particularly slow during training. Fast R-CNN [27] tries to solve this drawback by passing the original image through the CNN instead of using the region proposals. As a result, Fast R-CNN is faster than R-CNN because the convolutional operation is implemented only once on the original image instead of 2000 times on the region proposals. Fast R-CNNs can train detection networks whose architecture involves multiple layers like VGG-16 [99], as they are 9 times faster compared to R-CNNs and 3 times faster than spatial pyramid pooling networks [105]. The drawback of the high time cost has been further addressed by Faster R-CNNs [92]. In Faster R-CNNs the time-consuming selective-search algorithm is replaced with a fully convolutional network that learns the region proposals of an image with arbitrary size. A major additional development of the previous R-CNNs is achieved by Mask R-CNNs [41]. Mask R-CNNs extend the previous architectures by labeling the pixels corresponding to each object instance. The Mask R-CNN inherits the region proposal network from Faster R-CNNs and employs an additional branch that outputs a binary mask classifying whether or not a given pixel is part of an object. Two-stage approaches yield a high accuracy since each stage performs one specific task. However, in terms of real-time applications, two-stage approaches show weaknesses in computational time.
• One-stage approach: One-stage approaches skip the first stage of region proposal and simply run detection directly on the input image. This simpler architecture allows them to have faster inference. Some networks can achieve a processing speed of up to 150 frames per second (fps). There is a trade-off, however, in terms of accuracy. Notable one-stage methods are the ‘‘you only look once’’ (YOLO) network [91], which extracts class and bounding-box predictions directly from an input image using a CNN, and the single-shot detector (SSD) [71], which takes an input image and passes it through multiple convolutional layers with different sizes of filters.
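As an illustration of the practical difference between the two families, the snippet below loads one representative pretrained detector of each kind from torchvision and runs them on a single frame. The weights argument and the returned dictionary keys follow recent torchvision releases, so exact names may differ slightly across versions, and the confidence threshold is an arbitrary choice.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, ssd300_vgg16

# Two-stage detector: region proposals followed by per-region classification.
two_stage = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
# One-stage detector: direct prediction over a fixed grid of default boxes.
one_stage = ssd300_vgg16(weights="DEFAULT").eval()

frame = torch.rand(3, 480, 640)  # a single RGB frame with values in [0, 1]

with torch.no_grad():
    for name, detector in [("two-stage", two_stage), ("one-stage", one_stage)]:
        # Both models take a list of image tensors and return one dictionary per
        # image with 'boxes', 'labels', and 'scores'.
        output = detector([frame])[0]
        keep = output["scores"] > 0.5
        print(name, int(keep.sum()), "detections above 0.5 confidence")
```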
2) RESTRICTED BOLTZMANN MACHINES
The Restricted Boltzmann Machine (RBM) is a two-layer undirected graphical model [6] that was introduced in 1986 [46]. It consists of a set of visible nodes and a set of hidden nodes. RBMs are in essence a variant of Boltzmann machines, but in RBMs there are no intralayer connections between the nodes in the visible layer and the hidden layer (i.e. no visible node is connected to any other visible node and no hidden node is connected to any other hidden node, respectively). In this way, RBMs are easier to implement and more efficient in training compared to Boltzmann machines. Their visible nodes receive the input, combine it with weights and a bias, and pass it to the hidden nodes. The value generated at the hidden nodes is combined accordingly with weights and a bias and the result is passed to the visible nodes to reconstruct the input.

If we consider the visible vector V, the hidden vector H, and the weight parameters α_i, b_j, w_{ij}, an RBM configuration can be assigned an energy E given by [24]:

E(V, H) = -\sum_i \alpha_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j.   (1)

Given this energy function, a probability P is assigned to every pair (V, H):

P(V, H) = \frac{1}{Z} e^{-E(V, H)},   (2)

where Z is equal to the sum over the energies of all the pairs of visible and hidden vectors:

Z = \sum_{(V, H)} e^{-E(V, H)}.   (3)

For a given visible vector V, the probability that is assigned to the hidden node h_j is

P(h_j = 1 \mid V) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big),   (4)

where σ(·) is the logistic sigmoid function [38]. For a hidden vector H, the assigned probability of a visible node v_i is, respectively,

P(v_i = 1 \mid H) = \sigma\Big(\alpha_i + \sum_j h_j w_{ij}\Big).   (5)

The weight parameters are optimized with the aim to maximize the likelihood of the visible and hidden vectors (V, H).

The intuition behind RBMs is based on the association of a scalar energy to each combination of the variables of interest. Learning is achieved, therefore, by calculating the combination that has the lowest energy.

RBMs are useful for dimensionality reduction, classification, regression, and feature learning. However, due to the fact that RBMs consist of only two layers, the complexity of the data representation that they can achieve is limited [24]. For this reason, a number of extended architectures has been developed. An example of such an architecture is the Deep Belief Network [44], which consists of multiple stacked RBMs. Deep Belief Networks are used for feature extraction in many computer vision applications. Apart from Deep Belief Networks, another RBM-based architecture is the Deep Boltzmann Machine [95], [96]. Deep Boltzmann Machines are similar to Deep Belief Networks, although the former have only undirected connections between their layers, which makes them more robust to noisy observations, while the latter have bidirectional connections in the last layer [104].
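A compact NumPy sketch of the quantities defined in (1), (4), and (5) is given below; the layer sizes and the random initialization are illustrative, and the single Gibbs step shown is only the building block of contrastive-divergence training, not a full training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3                            # illustrative sizes
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                               # visible biases alpha_i
b = np.zeros(n_hidden)                                # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # Eq. (1): E(V, H) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
    return -a @ v - b @ h - v @ W @ h

def gibbs_step(v):
    # Eq. (4): P(h_j = 1 | V), then sample the hidden layer.
    p_h = sigmoid(b + v @ W)
    h = (rng.random(n_hidden) < p_h).astype(float)
    # Eq. (5): P(v_i = 1 | H), then reconstruct the visible layer.
    p_v = sigmoid(a + W @ h)
    v_recon = (rng.random(n_visible) < p_v).astype(float)
    return h, v_recon

v = rng.integers(0, 2, n_visible).astype(float)
h, v_recon = gibbs_step(v)
print("E(v, h) =", energy(v, h))
```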
3) AUTO-ENCODERS
Auto-encoders [8], [45] refer to a specific type of neural networks that aim to compress the input image data into a lower-dimension (latent) representation and then reconstruct the original image from this representation. Their architecture consists of two main parts, namely, the encoder and the decoder. The encoder maps an input vector of images X into a compressed, lower dimensional vector Z, while the decoder part maps the latent variable Z to a reconstruction of the input image. The encoder and decoder mappings φ : X → Z and ψ : Z → X are given by:

(\phi, \psi) = \arg\min_{(\phi, \psi)} \| X - (\psi \circ \phi)(X) \|^2,   (6)

where the operator ∘ refers to the composition function: ψ ∘ φ(X) = ψ(φ(X)). The autoencoder is trained with the objective to select the optimal encoder and decoder functions so that the minimum amount of information is required to encode the image in order to be regenerated on the decoder side.
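The reconstruction objective (6) translates almost directly into code. The sketch below assumes PyTorch, treats images as flattened vectors, and uses an arbitrary latent dimension purely for illustration.

```python
import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32   # e.g. 28x28 images; sizes are illustrative

encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())     # phi: X -> Z
decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())  # psi: Z -> X

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(16, input_dim)     # a batch of flattened images
x_hat = decoder(encoder(x))       # (psi o phi)(X)
loss = ((x - x_hat) ** 2).mean()  # || X - (psi o phi)(X) ||^2, as in (6)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```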
III. DEEP LEARNING METHODS FOR DETECTION AND SEGMENTATION OF OBJECTS IN VIDEOS
Due to the similarity between video detection and image detection, some methods of image detection are often used for video detection. The methods described above can be extended to the video domain by running detection for each image in a sequence of frames [7]. In this way, however, the temporal correlation between frames is not taken into account. In addition, running a detection algorithm for each frame results in computational inefficiency, since there might be feature extraction redundancies between sequential frames. Furthermore, in a video sequence there might be poor-quality frames which could lead to low inference accuracy. One obvious reason that this extension is not trivial is due to the fact that a video sequence introduces an additional dimension: the temporal one. In other words, instead of being considered as a sequence of frames, a video should rather be regarded as a sequence of related frames.

Due to the complexity of video data and the computation cost for training, research has been limited in this field. However, more and more video-related research works have surfaced lately, due to the release of ImageNet VID [93] and other massive video datasets. Depending on the architecture, DL-based techniques for video object detection can be broadly diversified into six categories, namely (1) optical flow, (2) tracking, (3) long short-term memory, (4) gated recurrent unit, (5) self-attention mechanism, and (6) generative learning. In the following subsections a critical appraisal of these architecture paradigms is presented.

A. OPTICAL FLOW
One of the most fundamental concepts in video processing is optical flow. Optical flow was originally introduced in [25], referring to human perception and the changing pattern of light that reaches our eyes. In computer vision applications, optical flow refers to the problem of estimating the displacement vector for each pixel in subsequent image frames [48].

A key assumption in optical flow is brightness constancy. This practically means that a pixel at the position (x, y) of an image at time t moves to the position (x + Δx, y + Δy) at time t + Δt and the brightness I(x, y, t) remains constant:

I(x + \Delta x, y + \Delta y, t + \Delta t) \approx I(x, y, t).   (7)

The Taylor series expansion of the left-hand side of (7) is

I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \ldots
\Rightarrow I(x + \Delta x, y + \Delta y, t + \Delta t) - I(x, y, t) = I_x \Delta x + I_y \Delta y + I_t \Delta t,   (8)

where I_x, I_y, I_t are the partial derivatives of the intensity function I with respect to x, y, and t, respectively. Hence, if we substitute (7) into (8) we can derive:

\nabla I \cdot v^{T} + I_t = 0,   (9)

where ∇I = (I_x, I_y) and v = (Δx/Δt, Δy/Δt) are the components of the optical flow, and I_t is the temporal gradient of the intensity function.

Optical flow can be applied to estimate the motion of detected objects in video segments by assigning an optical flow vector to the pixels corresponding to the detected object. Optical flow can be either ‘‘sparse’’ or ‘‘dense’’. Sparse optical flow estimates the flow vectors of some specific features, such as corners or edges of an object within an image frame. Dense optical flow, on the other hand, includes the flow vectors of all the pixels in an image frame. The latter method achieves higher accuracy than the former, although at the cost of increased computational requirements.
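In practice, classical dense and sparse flow estimators are readily available in OpenCV. The sketch below computes Farnebäck dense flow between two consecutive frames and, for comparison, tracks a handful of corner features with the pyramidal Lucas–Kanade method; the file name and the parameter values are placeholders rather than settings used in the surveyed works.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")   # placeholder path
_, frame1 = cap.read()
_, frame2 = cap.read()
prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) displacement vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean displacement (pixels):", magnitude.mean())

# Sparse optical flow: track corner features only (cheaper, less complete).
corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.3, minDistance=7)
moved, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, corners, None)
print("tracked features:", int(status.sum()))
```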
Recently, modern CNN architectures have been successfully used for optical flow estimation applications [18]. CNNs can be trained to run on pairs of images and to predict the optical flow field. These flow networks are employed in computer vision tasks for videos according to two different approaches. In the first approach, one neural network is responsible for the task of object detection and it is applied on sparse key frames. The extracted feature maps from these key frames are then propagated to the next frames with a flow network. This technique is called Deep Feature Flow (DFF) [132] and it achieves great computational efficiency due to the fact that it implements the object detection task only on key frames.

The second approach involving flow networks is known as flow-guided feature aggregation (FGFA) [131]. In FGFA, a feature extraction network is run on all individual frames to create the respective feature maps per frame. The inference at a reference frame is enhanced with an optical flow network that predicts the motion between the reference frame and the adjacent frames. The propagated feature maps from neighbor frames are aggregated with the feature map from the reference frame in an adaptive weighting method. FGFA achieves higher inference accuracy but at a higher computation time compared to DFF. For this reason, an impression network [43] is another proposed architecture that combines the two abovementioned techniques, with the objective to take advantage of both methods. Sparse key frame feature maps are aggregated with other key frame feature maps and at the same time they are propagated to other non-key frames. The impression network overcomes DFF both in terms of accuracy and inference speed. It is also faster than FGFA, although it achieves a slightly lower accuracy level. An alternative architecture, which outperforms FGFA, is proposed in [17], where a two-stream feature aggregation approach is integrated into a one-stage detector to achieve video object detection. In particular, the first stream applies optical flow to estimate the motion and to aggregate the features along the motion path, while the second stream predicts the features of the frame of interest by spatio-temporal sampling and aggregation of features from the adjacent frames. The final predictions result from blending the outcomes from the two streams.

B. TRACKING
Visual tracking can be described as the problem of estimating an unknown target trajectory over a sequence of image frames [78]. Traditional methods employ a variety of tracking algorithms, such as the mean shift algorithm [14], particle filtering [30], and Kalman filtering [54]. With the advancements in data science in recent years, novel DL-based visual trackers have been developed.

Object tracking outperforms optical flow in accuracy [129]. This can be explained by the fact that tracking uses shared networks to achieve feature extraction for detection and tracking. Hence, the requirements in terms of computational power are limited and, at the same time, the fusion between the two tasks is performed in a more straightforward way, which achieves higher accuracy compared to optical flow based models.

The CNN was the first architecture adopted for DL-based visual tracking. In [19], a region-based fully convolutional neural network [15] is used for jointly performing detection and tracking in an integrated framework. The model is fed with a set of two consecutive image frames, from which the convolutional feature maps are computed. Object detection is run on each frame and a regressor is employed to compute the box transformation from one frame to the other. CNN-based object tracking models showcase some weaknesses in performance though, due to the scarcity of labeled data in terms of including sets of two consecutive frames, which are necessary for their training, as well as their speed limitations with respect to real-time applications [79].

A baseline approach presented in [121] extends the Mask R-CNN to include an additional tracking branch with an external memory for tracking object instances across frames. The proposed architecture extracts the classification, the bounding boxes, and the segmentation predictions of Mask R-CNN, and it takes into account the past frame information only for tracking. In this way, the task of instance segmentation is extended to videos. CrossVIS [122] presents a novel, cross-frame learning approach that uses the features of an instance in the current frame to segment the same instance in other frames. Crossover learning is integrated with the instance segmentation loss as an objective to obtain cross-frame instance segmentation consistency, achieving a
determined (query vector). Let us assume that we have a sequence of n elements (x_1, x_2, . . . , x_n) of X ∈ R^{n×d}, with d being the embedding dimension for the representation of each element [57]. We can then define three learnable weight matrices in order to transform the queries (W^q ∈ R^{n×d_q}), keys (W^k ∈ R^{n×d_k}), and values vectors (W^v ∈ R^{n×d_v}). In this way, the input X is first transformed with the weight matrices and projected onto Q = XW^q, K = XW^k, and V = XW^v. A similarity function is used to calculate the similarity between the query and the key vector. The self-attention layer outputs Z ∈ R^{n×d_v}, which is equal to

Z = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_q}}\right) V,   (20)

where the softmax function is defined by

\mathrm{softmax}(X)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}},   (21)

for i = 1 . . . k and X = (x_1, . . . , x_k) ∈ R^k. The self-attention determines the similarity between the key and the query vector by computing their dot product. The dot product is then normalized using softmax so that the sum of all the scores becomes equal to 1. Each element is then given by the weighted sum of all elements in the sequence. The weights in this case correspond to the attention scores. The most well-known self-attention architecture is the transformer [113].
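The computation in (20)–(21) fits in a few lines. The NumPy sketch below uses small, arbitrary dimensions and random weight matrices purely to show the data flow from X to Z; the projection matrices are sized d × d_q (respectively d × d_v) so that the products Q = XW^q, K = XW^k, and V = XW^v are well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_q, d_v = 5, 8, 4, 4          # sequence length and embedding sizes (illustrative)

X = rng.standard_normal((n, d))      # n input elements, each of dimension d
W_q = rng.standard_normal((d, d_q))  # learnable projections (random placeholders here)
W_k = rng.standard_normal((d, d_q))
W_v = rng.standard_normal((d, d_v))

def softmax(scores):
    # Eq. (21), applied row-wise with the usual max-shift for numerical stability.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
attention = softmax(Q @ K.T / np.sqrt(d_q))  # similarity of every query with every key
Z = attention @ V                            # Eq. (20): weighted sum of the values
print(Z.shape)                               # (n, d_v)
```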
In [26] a transformer framework is developed to recognize and localize human actions in a video. A person feature is represented as the query (Q) and the features from adjacent video frames correspond to the keys (K) and the values (V). A video instance segmentation architecture built upon transformers is proposed in [116]. Four modules are included in the developed architecture: a backbone CNN to extract features over the video frames, an encoder-decoder transformer that determines the similarity of features on pixel and instance level, an instance-sequence matching module, and a segmentation module. The overall performance of this framework is competitive compared to the single-model approaches tested on the YouTube-VIS dataset [121], although it is somewhat lower in comparison to other complex CNN-based models [3].

In [35] a constrained self-attention architecture is proposed for video object detection that captures motion cues under the assumption that moving objects follow a continuous trajectory. An additional self-attention based architecture is proposed in [36], which is applied in the temporal-spatial domain towards aligning two feature maps of consecutive frames. The proposed method features a low number of parameters, while it achieves higher accuracy in comparison to optical flow-based methods such as DFF and FGFA. A related, efficient, and simplified architecture for video object detection via aggregating semantic features across frames is presented in [118]. Cosine similarity is implemented to compute the semantic similarities of the extracted proposals across frames, which are then aggregated accordingly. In [16] an object relation module is employed as part of a multi-stage architecture, in order to extract object relations in both spatial and temporal context. The relations are then further distilled with refined supportive object proposals and propagated across frames. Finally, in [98] an attention-based module is developed to learn long-range temporal relations between objects, in order to propagate the extracted features. The proposed architectures in [16], [118], and [98] outperform optical flow-based approaches in accuracy.

F. GENERATIVE LEARNING
The objective of generative learning is to approximate a complex, high-dimensional probabilistic distribution that generates a class of data, in order to generate similar data. Developing generative architectures to understand complicated data distributions has been a long-standing research problem [84]. Recent works in this area [29], [59] have provided a new set of generative algorithms that can efficiently generate video segments or extract features from them. The most outstanding generative algorithms are the variational autoencoders (VAEs) and generative adversarial networks (GANs).
• Variational auto-encoders: Their architecture resembles an auto-encoder, with the difference that their latent variable distribution is regularised during the training. VAEs stemmed from the limitation of auto-encoders to generate new, unseen data, due to the fact that the distribution of the latent variable is unknown. To alleviate this issue, VAEs are trained to learn the distribution of the latent variable, assuming that it follows a Gaussian distribution with a mean µ and variance σ² [50].
One example of a VAE-based architecture for video object detection is presented in [67], where a modified VAE architecture, built on top of a Mask R-CNN, is proposed in order to detect and to segment multiple instances in diverse videos. The proposed architecture outperforms MaskTrack R-CNN [121], because the MaskTrack R-CNN architecture depends entirely on the Mask R-CNN to perform predictions, resulting in difficulties to handle false negative proposals of the Mask R-CNN in highly diverse videos with occlusions, deformations, and pose variations of objects. By contrast, the architecture proposed in [67] merges a VAE with a Mask R-CNN network in a topology consisting of one encoder and three decoders. This results in three parallel branches that provide strong complements for predictions about bounding boxes and mask features, and they significantly reduce the number of false negatives in the Mask R-CNN module.
• Generative adversarial networks: Generative adversarial networks are built on the basis of a two-player, min-max game. The generator network G and the discriminator network D correspond to the first and the second player respectively. The generator’s objective is to mislead the discriminator by generating natural-looking data (e.g. images, videos, etc.) from a random, latent vector z. The discriminator on the other
hand, tries to distinguish whether the data are real or fake (generated). The game is modeled as the following optimization problem:

\min_{G} \max_{D} (G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))].   (22)

A generative adversarial approach is developed in [102] to randomly generate masks that correspond to object appearance variations in time. The masks are then applied to reduce overfitting via adaptively dropping out input features. The developed architecture identifies the mask that maintains the most robust features of the target objects over a long period of time. In [106] a GAN is trained on color and depth information in order to generate similar backgrounds to the test samples. The generated background samples are then subtracted from the given test samples to detect foreground moving objects. Finally, in [11] the encoder-decoder architecture of [82], which is limited to processing information between only two adjacent frames, is extended with a GAN to enforce temporal and spatial coherence of the generated object masks and to exploit information within a longer temporal window. The developed architecture exhibits similar accuracy as other state-of-the-art computer vision methods, while it is almost four times faster.
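A minimal PyTorch rendering of the two-player objective (22) is sketched below. The generator and discriminator are deliberately tiny multilayer perceptrons over vector data and all sizes are arbitrary; the point is only to show how the min–max game of (22) splits into alternating discriminator and generator updates (the generator update uses the common non-saturating form of the same game).

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64   # illustrative sizes
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real):
    batch = real.shape[0]
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))), i.e. minimize this BCE.
    d_loss = (bce(D(real), torch.ones(batch, 1))
              + bce(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator label generated data as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.randn(32, data_dim)))  # one update on a random "real" batch
```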
IV. CHALLENGES IN DEEP-LEARNING-BASED COMPUTER VISION
Despite the tremendous advances in deep learning and the fast pace of its breakthroughs over the last years, there are still challenges that prevent it from reaching its full potential. This section illustrates a set of major challenges related to computer vision tasks on video analysis with DL techniques.

DL-based methods have succeeded in achieving even human-level performance in complex computer vision tasks. However, this is possible only when massive datasets are available for training. Data are the core of any DL-based process and hence their shortage is often responsible for poor performance. Large-scale amounts of data are not available for all video applications though.

The impact of data scarcity is further escalated by the stand-alone approach of DL. A typical workflow for developing a DL module consists of creating a training set of inputs associated with outputs and learning the relations between them. In this way, however, the architecture becomes free-standing and isolated from prior, useful knowledge. Hence, the DL performance is highly determined by the existence of big-volume datasets while, at the same time, applications that are more related to common sense reasoning and less to categorization cannot be sufficiently targeted with purely DL methods [76].

Generalizability is an additional major challenge, concerning the performance of a data-driven model trained on one dataset when applied to other datasets. When training deep neural networks with high complexity and numerous parameters, the cost function might have multiple minima, which minimize the training error but may not generalize well to unseen data. The presence of noise and outliers in the training dataset is an additional reason for poor generalizability. Generalizability also deteriorates due to the weakness of DL methods in dealing with hierarchical structures, since DL modules tend to fail when generalization depends on compositional processes [63].

At the same time, although correlation does not imply causation, the two do not seem to be distinguishable for DL. Numerous neural network architectures have surfaced over the last decades that are highly capable of discovering complex correlations in data, yet they lack in reasoning about cause-effect relations or environment changes.

Finally, deep learning has delivered new, highly performing approaches in computer vision tasks, whose dominance, however, remains inversely proportional to their explanatory power. Rationalizing the output of data-driven techniques is a critical issue, since more and more data-driven systems are adopted in safety-critical and high-impact applications.

V. INTEGRATING DEEP LEARNING WITH DOMAIN KNOWLEDGE
A. MOTIVATION
A prudent approach to address the abovementioned challenges is to expand the current methods and to merge them with principles that govern the dynamic behavior of systems over time, enabling an adaptation to new, unseen scenarios. Combining DL-based techniques with equation-based dynamic models (DMs) in a complementary way, or, in other words, integrating common sense understanding into artificial intelligence, constitutes a particularly interesting challenge for computer vision systems.

Enabling data-driven vision systems to understand the principles that govern the behavior of objects is essential for the development of autonomous systems that understand observed scenarios and have the ability to adapt these principles to a never seen situation. Leveraging domain knowledge to identify equation-based models that describe how the properties of objects and entities change over time and embedding them into DL techniques can lead to novel, highly robust, and performing architectures. Such models could be developed, for instance, from well-known first principles in order to describe how an object moves, and they could be coupled with DL methods forming a hybrid computer vision architecture. It is straightforward to conclude that hybrid architectures are more efficient compared to purely data-driven or model-based techniques as they harness the benefits of both disciplines. Hybrid methods that combine scientific domain knowledge with data-driven models allow for accurate inference even with imperfect models and limited amounts of data.

The integration of the two disciplines in a hybrid architecture can be realized either by infusing mathematical rules to
dynamic model and the DL model, respectively, and f_hybrid the composition of the two functions, f_hybrid = f_DL ∘ f_DM [90].

Two main categories of architectures can result from merging DL with dynamic models founded on prior domain knowledge. In the first category, the output of the model is fed through the DL module at the first or at an additional layer. In the second category, the model is embedded into the DL module. Many architectures with respect to the first class have surfaced lately in the field of climate and geology applications. In [52], [56], the output of a physics-based model is provided as an additional input feature to the DL module in an application related to predicting the temperature of a lake based on the depth. In [86], a physics-based neural network architecture is used in order to simulate broadband earthquake ground motions. The DL module is used to predict the ground motion in the short term, including transient effects, which are particularly complex to model mathematically. The DM module is then used to simulate the response in a long-term period.

In the second class, the DM module is embedded into the DL module architecture. An example of this class is a physics-based model with an RNN including LSTMs [101], where the sensor data as well as the DM-generated output are ingested as input to the RNN architecture.

4) REGULARIZATION
Deep neural networks can involve numerous parameters. However, when no large amounts of data are available, deep neural networks tend to overfit or, in other words, they fail to discover the underlying relationship described by the training data and hence they cannot extrapolate to observed data outside the training set. One way to handle this issue is to apply physical constraints on the loss function of the neural network. Several regularization techniques have been developed in this way, to prevent neural networks from overfitting. This is achieved by applying penalties to layer parameters and by integrating these penalties in the loss function that is minimized during training. The loss function in that case will be of the following form [117]:

f_{\mathrm{Loss}} = f_{\mathrm{Trn}}(Y, \hat{Y}) + \lambda R(W) + \gamma f_{\mathrm{Phy}}(\hat{Y}),   (23)

where f_Trn corresponds to a function that represents the error between the predicted value Ŷ and the true value Y. This function can be, for example, the mean squared error or the cross entropy. In addition, λ represents a hyperparameter determining the weight of the regularization term R(W). The first two terms of (23) describe the standard loss function used when training a neural network. The additional term f_Phy corresponds to the physics-based constraint and it aims to ensure the consistency of the trained system with first-principle laws or dynamic models. The weight of this function is represented by the hyperparameter γ. Given the true value Y, the following is considered as the general optimization problem to solve for (23):

\arg\min_{W} \; f_{\mathrm{Trn}}(Y, \hat{Y}) + \lambda R(W) + \gamma f_{\mathrm{Phy}}(\hat{Y}).   (24)

By introducing model-based constraints in the loss function for the training of DL modules, scientific consistency is achieved, which is essential for training generalizable models. In addition, the physics-based loss function f_Phy requires no labeled data, which allows the training of the DL module to be expanded to non-labeled data. A plethora of implementations that impose physics-based constraints on the training of DL models has surfaced recently [81], [103], [107]. In [56] a physics-based loss function is used for the training of a lake temperature predictor. The loss function encompasses a constraint resulting from the relationship between the temperature, the density, and the depth of the lake water. In this way, the trained predictor achieves enhanced generalizability, while at the same time consistency with first-principle laws is ensured for the results. In [51], the application of lake temperature prediction is extended to include temporal physical processes. More specifically, a physics-based RNN is developed that involves energy conservation constraints. Standard LSTM models store specific information at each time step, which feeds to the next time step. However, when the models are trained on data from specific seasons or from multiple years, it is difficult to generalize to data from different time periods, since the time profiles vary significantly between each other. By including the energy flux changes, however, which determine the temperature changes, the architecture can successfully predict the lake temperature, even on unseen data. Another example is given in [53], where the data-driven model is penalized with the equation describing the time evolution of waves in order to identify the location of underwater obstacles from acoustic measurements. In this way, the accuracy of the model outside the training dataset is enhanced. Finally, [10] presents a case where multiple physics-based terms are present in a loss function. These might be competing loss terms with multiple local minima that correspond to different physics equations that need to be minimized together. Hence, an approach is presented where the contribution of each term is adaptively tuned during the training phase in order to improve the generalizability of the developed architecture.
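The composite loss (23) can be written directly as a training criterion. The sketch below assumes PyTorch and uses a generic, user-supplied physics residual g(Ŷ) (for instance, the violation of a known algebraic or dynamic relation between outputs) as f_Phy, with λ and γ as tunable hyperparameters; the monotonicity residual shown is a placeholder, not one of the constraints from the cited works.

```python
import torch

def physics_guided_loss(y_true, y_pred, model, physics_residual, lam=1e-4, gamma=1.0):
    """f_Loss = f_Trn(Y, Y_hat) + lambda * R(W) + gamma * f_Phy(Y_hat), as in (23)."""
    f_trn = torch.mean((y_true - y_pred) ** 2)              # e.g. mean squared error
    r_w = sum(p.pow(2).sum() for p in model.parameters())   # L2 penalty on the weights
    f_phy = torch.mean(physics_residual(y_pred) ** 2)       # penalize constraint violation
    return f_trn + lam * r_w + gamma * f_phy

# Placeholder constraint: predictions at consecutive indices should not decrease
# (in the spirit of a monotonic physical relation); purely illustrative.
def monotonicity_residual(y_pred):
    return torch.relu(y_pred[:, :-1] - y_pred[:, 1:])

model = torch.nn.Linear(10, 5)
x, y = torch.randn(8, 10), torch.randn(8, 5)
loss = physics_guided_loss(y, model(x), model, monotonicity_residual)
loss.backward()
print(loss.item())
```

Note that the physics term depends only on the predictions, which is why, as stated above, it can also be evaluated on unlabeled inputs.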
C. HYBRID ARCHITECTURE IMPLEMENTATION IN COMPUTER VISION
Integrating useful domain knowledge into DL-based computer vision tasks is essential to build robust, generalizable systems and to compensate for the lack of large-volume training data. An example of such a hybrid architecture is proposed in [103], where the height of a free-falling object is estimated on each frame of a video by training a CNN to detect and track objects obeying free-falling laws of physics. The training of this CNN is based on a loss function in which first-principle laws are encoded. In [1], physics is blended with DL in the framework of a two-stage encoder with the aim to recover the shape of an object based on polarized photos. In [61] an LSTM architecture is combined with a dynamics model in order to acquire a
proposal distribution over an object’s state. Finally, in [119], a generative vision system is proposed for estimating physical features of objects by integrating the output of a multi-physics simulation engine in the loop.

Integrating DL techniques with domain knowledge is a recently introduced research topic [55], [90]. As a result, using domain knowledge to derive first-principle models or, on a broader perspective, any dynamic mathematical or computer model [73] that describes how the properties of objects and entities change over time (Figure 3), and merging them with existing DL architectures, constitutes an especially promising research task to address the challenges of DL in computer vision.

VI. OUTLOOK: FUTURE DIRECTIONS IN DEEP LEARNING FOR OBJECT DETECTION AND SEGMENTATION IN VIDEOS
Deep learning has brought a catalytic effect in the field of computer vision for video analysis. Although nobody knows with certainty how DL will evolve over the coming decades, it is expected that much of the future research will revolve around the following critical areas [32], [77], [114]:
• Out-of-distribution generalization: Future computer vision systems should be able to make accurate predictions not only in a known context but also for data with different distributions than the ones learned from the training samples. The main reason behind the difficulty of DL systems to accurately generalize and predict on unseen data lies in the fundamental assumption that training and test data are independent and identically distributed (IID) [97], [128]. In many real-life cases, however, the IID assumption is hardly satisfied. The ability to generalize under distribution shifts is of critical significance, and hence, the investigation of out-of-distribution generalization is expected to attract enormous research interest in the academic field.
• Deep learning systems with causal structures: Causality is expected to be a central strand of DL research in the coming years [89]. Developing DL systems that can represent causal relationships can increase their safety and reliability, and introducing a causal understanding of basic concepts in DL methods could certainly be the key to achieving robustness in complex real-world environments.
• Effective representation learning with few or no labeled data: While techniques for representation learning when massive labeled datasets are available have become remarkably powerful, various challenges remain in the case of limited labeled data. Developing approaches for addressing the issue of labeled data scarcity is an emerging popular direction of research.
• Adaptation in time-varying environments: Adapting to time-varying environments and other dynamic-behavior-related problems has been under examination for many years and it is expected to gain massive attention by the DL research community over the coming years. Allowing integration of new knowledge online and at the same time being capable of preserving the knowledge learned during previous interactions are only a few of the desirable features of future vision mechanisms.
• Multi-modal learning: Ultimately, major emphasis in research is expected to be placed upon developing methods that can process and link information combining modalities from various architectures [65], [76], since unimodal DL methods seem to fail to fulfill all the desirable future DL capabilities. In particular, combined architectures that integrate DL modules with domain knowledge could provide a suitable answer to most research questions arising from the DL directions listed above.

VII. CONCLUSION
In this paper a study is presented about detection and segmentation of objects applied to video segments. A review of the currently existing techniques has been presented, as well as the major challenges that data-driven techniques face. Then, an extension of the data-driven techniques to a hybrid architecture that fuses data-driven techniques with equation-based models describing the dynamic behavior of objects and entities over time has been proposed in order to address issues like data scarcity, generalizability, and interpretability of the purely data-driven architectures. Finally, a survey of the current developments in hybrid architectures has been presented. We hope that this work will assist in better understanding the current status of DL in computer vision for video analysis, as well as in presenting interesting directions as guidelines for future work.

REFERENCES
[1] Y. Ba, A. Ross Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi, ‘‘Deep shape from polarization,’’ 2019, arXiv:1903.10210.
[2] H. Bay, T. Tuytelaars, and L. Van Gool, ‘‘SURF: Speeded up robust features,’’ in Computer Vision–(ECCV). Berlin, Germany: Springer, 2006, pp. 404–417.
[3] G. Bertasius and L. Torresani, ‘‘Classifying, segmenting, and tracking object instances in video with mask propagation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9739–9748.
[4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, ‘‘Fully-convolutional Siamese networks for object tracking,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV Workshops), 2016, pp. 850–865.
[5] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, ‘‘CNN variants for computer vision: History, architecture, application, challenges and future scope,’’ Electronics, vol. 10, no. 20, p. 2470, Oct. 2021.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[7] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, ‘‘YOLACT++: Better real-time instance segmentation,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 1108–1121, Feb. 2022.
[8] H. Bourlard and Y. Kamp, ‘‘Auto-association by multilayer perceptrons and singular value decomposition,’’ Biol. Cybern., vol. 59, nos. 4–5, pp. 291–294, Sep. 1988.
[9] A. Broad, M. Jones, and T. Y. Lee, ‘‘Recurrent multi-frame single shot detector for video object detection,’’ in Proc. BMVC, 2018, pp. 1–14.
[10] M. Elhamod, J. Bu, C. Singh, M. Redell, A. Ghosh, V. Podolskiy, W.-C. Lee, and A. Karpatne, ‘‘CoPhy-PGNN: Learning physics-guided neural networks with competing loss functions for solving eigenvalue problems,’’ 2020, arXiv:2007.01420.
[11] S. Caelles, A. Pumarola, F. Moreno-Noguer, A. Sanfeliu, and L. Van Gool, ‘‘Fast video object segmentation with spatio-temporal GANs,’’ 2019, arXiv:1903.12161.
[12] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, ‘‘BRIEF: Binary robust independent elementary features,’’ in Computer Vision–(ECCV). Berlin, Germany: Springer, 2010, pp. 778–792.
[13] K. Cho, B. van Merrienboer, C. C. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using RNN encoder-decoder for statistical machine translation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014.
[14] D. Comaniciu, V. Ramesh, and P. Meer, ‘‘Real-time tracking of non-rigid objects using mean shift,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2000, pp. 142–149.
[15] J. Dai, Y. Li, K. He, and J. Sun, ‘‘R-FCN: Object detection via region-based fully convolutional networks,’’ 2016, arXiv:1605.06409.
[16] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ‘‘Relation distillation networks for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 7022–7031.
[17] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ‘‘Single shot video object detector,’’ IEEE Trans. Multimedia, vol. 23, pp. 846–858, 2021.
[18] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, ‘‘FlowNet: Learning optical flow with convolutional networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[19] C. Feichtenhofer, A. Pinz, and A. Zisserman, ‘‘Detect to track and track to detect,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3057–3065.
[20] P. Felzenszwalb, D. McAllester, and D. Ramanan, ‘‘A discriminatively trained, multiscale, deformable part model,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[21] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, ‘‘Big data preprocessing: Methods and prospects,’’ Big Data Anal., vol. 1, no. 1, pp. 1–22, Dec. 2016.
[22] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, ‘‘Vision meets robotics: The KITTI dataset,’’ Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013.
[23] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[24] T. Georgiou, Y. Liu, W. Chen, and M. Lew, ‘‘A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision,’’ Int. J. Multimedia Inf. Retr., vol. 9, no. 3, pp. 135–170, Sep. 2020.
[25] J. J. Gibson, The Ecological Approach to Visual Perception. Boston, MA, USA: Houghton Mifflin, 1979.
[26] R. Girdhar, J. Joao Carreira, C. Doersch, and A. Zisserman, ‘‘Video action transformer network,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 244–253.
[27] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (Adaptive Computation and Machine Learning Series). Cambridge, MA, USA: MIT Press, 2016.
[29] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Proc. 27th Int. Conf. Neural Inf. Process. Syst., vol. 2. Cambridge, MA, USA: MIT Press, 2014, pp. 2672–2680.
[30] N. Gordon, ‘‘Novel approach to nonlinear/non-Gaussian Bayesian state estimation,’’ IEE Proc. F-Radar Signal Process., vol. 140, no. 6, pp. 107–113, Apr. 1993.
[31] K. Grauman and T. Darrell, ‘‘The pyramid match kernel: Discriminative classification with sets of image features,’’ in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2005, pp. 1458–1465.
[32] H. S. Greenwald and C. K. Oertel, ‘‘Future directions in machine learning,’’ Frontiers Robot. AI, vol. 3, p. 79, Jan. 2017.
[33] K. Greff, R. K. Srivastava, J. Koutnìk, B. R. Steunebrink, and J. Schmidhuber, ‘‘LSTM: A search space Odyssey,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[34] R. L. Gregory, Eye and Brain: The Psychology of Seeing. New York, NY, USA: McGraw-Hill, 1978.
[35] Y. Gu, L. Wang, Z. Wang, Y. Liu, M.-M. Cheng, and S.-P. Lu, ‘‘Pyramid constrained self-attention network for fast video salient object detection,’’ in Proc. AAAI Conf. Artif. Intell., 2020, pp. 10869–10876.
[36] C. Guo, B. Fan, J. Gu, Q. Zhang, S. Xiang, V. Prinet, and C. Pan, ‘‘Progressive sparse local attention for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3909–3918.
[37] D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. Angelova, and N. Sunderhauf, ‘‘Probabilistic object detection: Definition and evaluation,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1020–1029.
[38] J. Han and C. Moraga, ‘‘The influence of the sigmoid function parameters on the speed of backpropagation learning,’’ in From Natural to Artificial Neural Computation. Berlin, Germany: Springer, 1995, pp. 195–201.
[39] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Simultaneous detection and segmentation,’’ 2014, arXiv:1407.1808.
[40] C. Harris and M. Stephens, ‘‘A combined corner and edge detector,’’ in Proc. Alvey Vis. Conf., 1988, pp. 147–151.
[41] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ‘‘Mask R-CNN,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, Feb. 2020.
[42] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014.
[43] C. Hetang, H. Qin, S. Liu, and J. Yan, ‘‘Impression network for video object detection,’’ 2017, arXiv:1712.05896.
[44] G. Hinton, S. Osindero, and Y.-W. Teh, ‘‘A fast learning algorithm for deep belief nets,’’ Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[45] G. E. Hinton and R. R. Salakhutdinov, ‘‘Reducing the dimensionality of data with neural networks,’’ Science, vol. 313, no. 5786, pp. 504–507, 2006.
[46] G. E. Hinton and T. J. Sejnowski, Learning and Relearning in Boltzmann Machines. Cambridge, MA, USA: MIT Press, 1986, pp. 282–317.
[47] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[48] B. K. P. Horn and B. G. Schunck, ‘‘Determining optical flow,’’ Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, Aug. 1980.
[49] B. J. Hou and Z. H. Zhou, ‘‘Learning with interpretable structure from gated RNN,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2267–2279, Jul. 2020.
[50] A. Jabbar, X. Li, and B. Omar, ‘‘A survey on generative adversarial networks: Variants, applications, and training,’’ ACM Comput. Surv., vol. 54, no. 8, pp. 1–49, Nov. 2022.
[51] X. Jia, J. Willard, A. Karpatne, J. Read, J. Zwart, M. S. Steinbach, and V. Kumar, ‘‘Physics guided RNNs for modeling dynamical systems: A case study in simulating lake temperature profiles,’’ in Proc. SIAM Int. Conf. Data Mining, May 2019, pp. 558–566.
[52] X. Jia, J. Willard, A. Karpatne, J. S. Read, J. A. Zwart, M. Steinbach, and V. Kumar, ‘‘Physics-guided machine learning for scientific discovery: An application in simulating lake temperature profiles,’’ 2020, arXiv:2001.11086.
[53] A. Kahana, E. Turkel, S. Dekel, and D. Givoli, ‘‘Obstacle segmentation based on the wave equation and deep learning,’’ J. Comput. Phys., vol. 413, Jul. 2020, Art. no. 109458.
[54] R. E. Kalman, ‘‘A new approach to linear filtering and prediction problems,’’ J. Basic Eng., vol. 82, no. 1, pp. 35–45, Mar. 1960.
[55] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, ‘‘Theory-guided data science: A new paradigm for scientific discovery from data,’’ IEEE Trans. Knowl. Data Eng., vol. 29, no. 10, pp. 2318–2331, Jun. 2017.
[56] A. Daw, A. Karpatne, W. Watkins, J. Read, and V. Kumar, ‘‘Physics-guided neural networks (PGNN): An application in lake temperature modeling,’’ 2017, arXiv:1710.11431.
[57] S. Khan, M. Naseer, M. Hayat, S. Waqas Zamir, F. Shahbaz Khan, and M. Shah, ‘‘Transformers in vision: A survey,’’ 2021, arXiv:2101.01169.
[58] Y. Kim, C. Denton, L. Hoang, and A. M. Rush, ‘‘Structured attention networks,’’ 2017, arXiv:1702.00887.
[59] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’ in Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada, Apr. 2014.
[60] J. F. Kolen and S. C. Kremer, ‘‘Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,’’ in A Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001, pp. 237–243, doi: 10.1109/9780470544037.ch14.
[61] J. Kossen, K. Stelzner, M. Hussing, C. Voelcker, and K. Kersting, ‘‘Structured object-aware physics prediction for video modeling and planning,’’ in Proc. Int. Conf. Learn. Represent., 2020.
[62] A. Kumar and S. Srivastava, ‘‘Object detection system based on convolution neural networks using single shot multi-box detector,’’ Proc. Comput. Sci., vol. 171, pp. 2610–2617, Jan. 2020.
[63] B. M. Lake and M. Baroni, ‘‘Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,’’ in Proc. ICML, 2018, pp. 2879–2888.
[64] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 2169–2178.
[65] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, no. 7553, pp. 436–444, Feb. 2015.
[66] K. Li, W. Ma, U. Sajid, Y. Wu, and G. Wang, ‘‘Object detection with convolutional neural network,’’ 2019, arXiv:1912.01844.
[67] C.-C. Lin, Y. Hung, R. Feris, and L. He, ‘‘Video instance segmentation tracking with a modified VAE architecture,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13144–13154.
[68] Z. C. Lipton, J. Berkowitz, and C. Elkan, ‘‘A critical review of recurrent neural networks for sequence learning,’’ 2015, arXiv:1506.00019.
[69] D. Liu, Y. Cui, Y. Chen, J. Zhang, and B. Fan, ‘‘Video object detection for autonomous driving: Motion-aid feature calibration,’’ Neurocomputing, vol. 409, pp. 1–11, Oct. 2020.
[70] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, ‘‘Looking fast and slow: Memory-guided mobile video object detection,’’ 2019, arXiv:1903.10172.
[71] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Computer Vision (ECCV). Cham, Switzerland: Springer, 2016, pp. 21–37.
[72] X. Liu, Z. Deng, and Y. Yang, ‘‘Recent progress in semantic image segmentation,’’ Artif. Intell. Rev., vol. 52, no. 2, pp. 1089–1106, Aug. 2019.
[73] L. Ljung, System Identification: Theory for the User, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1999.
[74] D. G. Lowe, ‘‘Object recognition from local scale-invariant features,’’ in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999, pp. 1150–1157.
[75] Y. Lu, C. Lu, and C.-K. Tang, ‘‘Online video object detection using association LSTM,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2363–2371.
[76] G. Marcus, ‘‘Deep learning: A critical appraisal,’’ 2018, arXiv:1801.00631.
[77] G. Marcus, ‘‘The next decade in AI: Four steps towards robust artificial intelligence,’’ 2020, arXiv:2002.06177.
[78] S. Mojtaba Marvasti-Zadeh, L. Cheng, H. Ghanei-Yakhdan, and S. Kasaei, ‘‘Deep learning for visual tracking: A comprehensive survey,’’ 2019, arXiv:1912.00535.
[79] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem, ‘‘TrackingNet: A large-scale dataset and benchmark for object tracking in the wild,’’ 2018, arXiv:1803.10794.
[80] W. K. Mutlag, S. K. Ali, Z. M. Aydam, and B. H. Taher, ‘‘Feature extraction methods: A review,’’ in Proc. J. Phys., Conf., Jul. 2020, vol. 1591, no. 1, Art. no. 012028.
[81] M. Amin Nabian and H. Meidani, ‘‘Physics-driven regularization of deep neural networks for enhanced engineering design and analysis,’’ 2018, arXiv:1810.05547.
[82] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim, ‘‘Fast video object segmentation by reference-guided mask propagation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7376–7385.
[83] N. O’Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh, ‘‘Deep learning vs. traditional computer vision,’’ in Proc. Comput. Vis. Conf. (CVC). Cham, Switzerland: Springer, 2020, pp. 128–144.
[84] A. Oussidi and A. Elhassouny, ‘‘Deep generative models: Survey,’’ in Proc. Int. Conf. Intell. Syst. Comput. Vis. (ISCV), Apr. 2018, pp. 1–8.
[85] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[86] R. Paolucci, F. Gatti, and M. Infantino, ‘‘Broadband ground motions from 3D physics-based numerical simulations using artificial neural networks,’’ Bull. Seismolog. Soc. Amer., vol. 108, no. 3A, pp. 1272–1286, Feb. 2018.
[87] C. Patel, D. Bhatt, U. Sharma, R. Patel, S. Pandya, K. Modi, N. Cholli, A. Patel, U. Bhatt, M. A. Khan, S. Majumdar, M. Zuhair, K. Patel, S. A. Shah, and H. Ghayvat, ‘‘DBGC: Dimension-based generic convolution block for object recognition,’’ Sensors, vol. 22, no. 5, p. 1780, Feb. 2022.
[88] C. I. Patel, S. Garg, T. Zaveri, and A. Banerjee, ‘‘Top-down and bottom-up cues based moving object detection for varied background video sequences,’’ Adv. Multimedia, vol. 2014, pp. 1–20, Jan. 2014.
[89] J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect, 1st ed. New York, NY, USA: Basic Books, 2018.
[90] R. Rai and C. K. Sahu, ‘‘Driven by data or derived through physics? A review of hybrid physics guided machine learning techniques with cyber-physical system (CPS) focus,’’ IEEE Access, vol. 8, pp. 71050–71073, 2020.
[91] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[92] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[93] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. F. Li, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[94] M. Sadoughi and C. Hu, ‘‘Physics-based convolutional neural network for fault diagnosis of rolling element bearings,’’ IEEE Sensors J., vol. 19, no. 11, pp. 4181–4192, Jun. 2019.
[95] R. Salakhutdinov and G. Hinton, ‘‘Deep Boltzmann machines,’’ in Proc. 12th Int. Conf. Artif. Intell. Statist., vol. 5, Clearwater Beach, FL, USA, Apr. 2009, pp. 448–455.
[96] R. Salakhutdinov and H. Larochelle, ‘‘Efficient learning of deep Boltzmann machines,’’ in Proc. AISTATS, 2010, pp. 693–700.
[97] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, ‘‘Towards out-of-distribution generalization: A survey,’’ 2021, arXiv:2108.13624.
[98] M. Shvets, W. Liu, and A. Berg, ‘‘Leveraging long-range temporal relationships between proposals for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9755–9763.
[99] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556.
[100] S. Singh, A. Prasad, K. Srivastava, and S. Bhattacharya, ‘‘Object detection methods for real-time video surveillance: A survey with empirical evaluation,’’ in Smart Systems and IoT: Innovations in Computing. Singapore: Springer, 2020, pp. 663–679.
[101] S. K. Singh, R. Yang, A. Behjat, R. Rai, S. Chowdhury, and I. Matei, ‘‘PI-LSTM: Physics-infused long short-term memory network,’’ in Proc. 18th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2019, pp. 34–41.
[102] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang, ‘‘VITAL: Visual tracking via adversarial learning,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8990–8999.
[103] R. Stewart and S. Ermon, ‘‘Label-free supervision of neural networks with physics and domain knowledge,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2576–2582.
[104] H. Suk, An Introduction to Neural Networks and Deep Learning. Amsterdam, The Netherlands: Elsevier, Jan. 2017, pp. 3–24.
[105] F. Sultana, A. Sufian, and P. Dutta, ‘‘A review of object detection models based on convolutional neural network,’’ in Intelligent Computing: Image Processing Based Applications. Singapore: Springer, 2020, pp. 1–16.
[106] M. Sultana, A. Mahmood, S. Javed, and S. Ki Jung, ‘‘Unsupervised RGBD video object segmentation using GANs,’’ 2018, arXiv:1811.01526.
[107] L. Sun, H. Gao, S. Pan, and J.-X. Wang, ‘‘Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data,’’ Comput. Methods Appl. Mech. Eng., vol. 361, Apr. 2020, Art. no. 112732.
[108] X. Sun, P. Wu, and S. C. H. Hoi, ‘‘Face detection using deep learning: An improved faster RCNN approach,’’ Neurocomputing, vol. 299, pp. 42–50, Jul. 2018.
[109] R. Tao, E. Gavves, and A. W. M. Smeulders, ‘‘Siamese instance search for tracking,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1420–1429.
[110] S. Tripathi, Z. Lipton, S. Belongie, and T. Nguyen, ‘‘Context matters: Refining object detection in video with recurrent neural networks,’’ in Proc. Brit. Mach. Vis. Conf., 2016, pp. 1–12.
[111] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, ‘‘Selective search for object recognition,’’ Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[112] A. Ullah, K. Muhammad, W. Ding, V. Palade, I. U. Haq, and S. W. Baik, ‘‘Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications,’’ Appl. Soft Comput., vol. 103, May 2021, Art. no. 107102.
[113] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. 31st Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2017, pp. 6000–6010.
[114] R. Verschae and J. Ruiz-del-Solar, ‘‘Object detection: Current and future directions,’’ Frontiers Robot. AI, vol. 2, p. 29, Nov. 2015.
[115] J. Wang, E. Sezener, D. Budden, M. Hutter, and J. Veness, ‘‘A combinatorial perspective on transfer learning,’’ in Proc. Adv. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2020, pp. 918–929.
[116] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, ‘‘End-to-end video instance segmentation with transformers,’’ 2020, arXiv:2011.14503.
[117] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, ‘‘Integrating scientific knowledge with machine learning for engineering and environmental systems,’’ 2020, arXiv:2003.04919.
[118] H. Wu, Y. Chen, N. Wang, and Z.-X. Zhang, ‘‘Sequence level semantics aggregation for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9216–9224.
[119] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, ‘‘Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 28, Red Hook, NY, USA: Curran Associates, 2015, pp. 1–9.
[120] C. Xie, Y. Xiang, A. Mousavian, and D. Fox, ‘‘Unseen object instance segmentation for robotic environments,’’ 2020, arXiv:2007.08073.
[121] L. Yang, Y. Fan, and N. Xu, ‘‘Video instance segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5187–5196.
[122] S. Yang, Y. Fang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, ‘‘Crossover learning for fast online video instance segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 8043–8052.
[123] S. Yang, X. Yu, and Y. Zhou, ‘‘LSTM and GRU neural network performance comparison study: Taking Yelp review dataset as an example,’’ in Proc. Int. Workshop Electron. Commun. Artif. Intell. (IWECAI), Jun. 2020, pp. 98–101.
[124] T. Yang and A. B. Chan, ‘‘Recurrent filter learning for visual tracking,’’ 2017, arXiv:1708.03874.
[125] Y. Yu, X. Si, C. Hu, and Z. Jianxun, ‘‘A review of recurrent neural networks: LSTM cells and network architectures,’’ Neural Comput., vol. 31, no. 7, pp. 1235–1270, Jul. 2019.
[126] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, ‘‘Weakly supervised object localization and detection: A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 20, 2021, doi: 10.1109/TPAMI.2021.3074313.
[127] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ‘‘Object detection with deep learning: A review,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[128] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. Change Loy, ‘‘Domain generalization in vision: A survey,’’ 2021, arXiv:2103.02503.
[129] H. Zhu, H. Wei, B. Li, X. Yuan, and N. Kehtarnavaz, ‘‘A review of video object detection: Datasets, metrics and methods,’’ Appl. Sci., vol. 10, no. 21, p. 7834, Nov. 2020.
[130] M. Zhu and M. Liu, ‘‘Mobile video object detection with temporally-aware feature maps,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5686–5695.
[131] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, ‘‘Flow-guided feature aggregation for video object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 408–417.
[132] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, ‘‘Deep feature flow for video recognition,’’ 2016, arXiv:1611.07715.
[133] I. Ševo and A. Avramović, ‘‘Convolutional neural network based automatic object detection on aerial images,’’ IEEE Geosci. Remote Sens. Lett., vol. 13, no. 5, pp. 740–744, May 2016.

ATHINA ILIOUDI received the joint M.S. degree in smart electrical networks and systems from the KTH Royal Institute of Technology and the Eindhoven University of Technology, in 2018. She is currently pursuing the Ph.D. degree with the Delft Center for Systems and Control, Delft University of Technology. Her research interests include combining deep learning methods with first-principles modeling techniques and physics-informed neural networks for computer vision applications.

AZITA DABIRI received the Ph.D. degree from the Automatic Control Group, Chalmers University of Technology, in 2016. She was a Postdoctoral Researcher with the Department of Transport and Planning, TU Delft, from 2017 to 2019. In 2019, she received an ERCIM Fellowship and a Marie Curie Individual Fellowship, which allowed her to perform research at the Norwegian University of Science and Technology (NTNU) as a Postdoctoral Researcher. In 2020, she joined the Delft Center for Systems and Control, TU Delft, as an Assistant Professor. Her research interests include the integration of model-based and learning-based control.

BEN J. WOLF received the Ph.D. degree (cum laude) in artificial intelligence from the University of Groningen, The Netherlands, in 2020, on the topic of hydrodynamic imaging. He is currently a Postdoctoral Researcher at the Delft Center for Systems and Control, Delft University of Technology. His research interests include machine learning, neural networks, robotics, and hydrodynamic sensing.

BART DE SCHUTTER (Fellow, IEEE) received the Ph.D. degree (summa cum laude) in applied sciences from Katholieke Universiteit Leuven, Belgium, in 1996. He is currently a Full Professor and the Head of Department at the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. His current research interests include reinforcement learning, learning-based control, multi-level and multi-agent control, and control of hybrid systems. He is a Senior Editor of the IEEE Transactions on Intelligent Transportation Systems and an Associate Editor of the IEEE Transactions on Automatic Control.