Deep Learning For Object Detection and Segmentation in Videos Toward An Integration With Domain Knowledge
Received March 8, 2022, accepted March 21, 2022, date of publication March 28, 2022, date of current version April 4, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3162827
ABSTRACT Deep learning has enabled the rapid expansion of computer vision tasks from image frames to
video segments. This paper reviews the latest research on computer vision tasks in general, and in particular on the localization of objects and the identification of their associated pixels in video frames.
After performing a systematic analysis of the existing methods, the challenges related to computer vision
tasks are presented. In order to address the existing challenges, a hybrid framework is proposed, where deep
learning methods are coupled with domain knowledge. An additional feature of this survey is that a review of
the currently existing approaches integrating domain knowledge with deep learning techniques is presented.
Finally, some conclusions on the implementation of hybrid architectures to perform computer vision tasks
are discussed.
INDEX TERMS Computer vision, object detection, deep learning, theory-guided data science.
TABLE 1. List of abbreviations.

II. PRELIMINARIES
In this section, we introduce the most typical tasks of computer vision and we present a brief, comparative analysis between deep learning and conventional techniques in the domain of computer vision, as well as an overview of basic deep learning methods such as convolutional neural networks, restricted Boltzmann machines, and auto-encoders, which constitute the core for DL architectures in computer vision.

A. COMPUTER VISION TASKS
Computer vision tasks can be categorised into 4 major fields: (1) semantic segmentation, (2) classification & localization, (3) object detection, and (4) instance segmentation. The task of semantic segmentation refers to the process of assigning a class label to every pixel in an image [72]. One of the shortcomings of this task is the fact that semantic segmentation does not differentiate between instances of the same class. On the other hand, the classification & localization task aims to predict the class of a specific object in an image and to draw a bounding box around the region of the classified object in an image [126]. This task refers to a single object. However, most images in real-world settings contain multiple objects of different shapes and sizes. Therefore, object detection [37] refers to a more general approach where a varying number of predicted objects for every input image can be extracted, since it is unknown how many objects are expected to be detected in each image. Object detection systems strive to find every instance of an object and estimate the spatial extent of each one. Nevertheless, the detected objects are located just with bounding boxes.

The task of instance segmentation refers to the problem of detecting all the instances of a category in an image and marking the pixels that belong to each one of them [39]. Extending this task to the video domain results in simultaneous detection, segmentation, and tracking of the instances [121]. The instance segmentation task combines object detection, where individual objects are classified and localized with a bounding box, and semantic segmentation, where each pixel is classified into the given classes.

The task of object classification & localization is included in object detection. At the same time, in semantic segmentation, each pixel of an image is associated with a class label like a road, tree, pedestrian, etc. In other words, all objects of an image that belong to the same class are treated as a single entity. On the other hand, each object of the same class is treated as a distinct individual instance with instance segmentation. Hence, instance segmentation can be considered as a more elaborate implementation of semantic segmentation. Since all the computer vision tasks are similar, in this work mainly object detection and instance segmentation techniques will be examined, as they are the most dominant techniques required in extensive applications such as autonomous driving [69], video surveillance [100], face recognition [108], and robot navigation [120].

B. DEEP LEARNING VS. TRADITIONAL COMPUTER VISION TECHNIQUES
Traditional computer vision methods are based on hard-coded, rigid-rule algorithms to apply feature extraction on images [80]. Several algorithms have been developed to extract properties such as corners, edges, and regions of interest from images [2], [12], [40], [74], [88]. These algorithms showcase advantages such as transparency, in terms of allowing to trace back all steps of how a decision was made, and performance that is independent of the training dataset. At the same time, however, they have been criticised as inflexible, difficult to improve or adapt, and highly time-consuming to develop manually for each additional object to be detected [83]. Moreover, the performance of these methods significantly deteriorates when the number of classes to be detected increases. By contrast, DL utilizes massive data sets and numerous training cycles to learn how an object looks, following a process during which relevant features of an object of interest are extracted automatically. The DL architecture can then be implemented on previously unseen images and make accurate predictions. DL-based methods perform remarkably better than traditional methods, albeit with trade-offs regarding computational requirements and training time [83]. As a result, they have vastly replaced traditional computer vision techniques, thanks to their strong ability to be easily adjusted, to extract complex features in much more detail, and to be much more efficient in terms of accuracy and versatility [83]. Tremendous advancements in research have taken place in this domain, resulting in the development of numerous methods. The fundamental DL methods implemented on image computer vision applications are discussed in Section II-C.
C. IMAGE-BASED DEEP LEARNING METHODS
1) CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) have been widely used in image processing applications over the past decades [62], [66], [133]. Their structure consists of a number of convolutional and pooling layers, stacked one after another [5]. The convolutional layer can be visualized as a square matrix W of weights, called a kernel [87]. The kernel slides over the image looking for patterns; when it distinguishes a part of the image that is similar to its pattern, it returns a large positive value, otherwise it returns a small value. The input image is represented as a pixel matrix of size length × width × number of color channels (i.e. an RGB image has 3 color channels).

The convolutional layer is utilized for feature extraction and the pooling layer to downsample the resolution of the convolutional layer output. In this way, a dimension reduction is accomplished, which reduces the number of necessary parameters in the next layer, resulting in a less complex architecture. During the training process, the training samples are fed through the CNN and the error with respect to the desired output is calculated. The error and its gradient are then backpropagated through the network layers and the weights are updated.
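To make the convolution–pooling–backpropagation loop above concrete, the following is a minimal sketch of such a pipeline. It assumes PyTorch; the input resolution, channel counts, and number of classes are illustrative choices rather than values taken from the surveyed works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal CNN: convolution layers look for local patterns, pooling downsamples
# the feature maps, and a final linear layer maps the features to class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 RGB channels -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),                 # assumes 224x224 inputs and 10 classes
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def training_step(images, labels):
    """One iteration: forward pass, error, backpropagation, weight update."""
    logits = model(images)                  # images: (batch, 3, 224, 224)
    loss = F.cross_entropy(logits, labels)  # error with respect to the desired output
    optimizer.zero_grad()
    loss.backward()                         # backpropagate the error gradient
    optimizer.step()                        # update the weights
    return loss.item()

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
print(training_step(images, labels))
```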
CNN-based image object detectors can be separated into two main categories [105], [127]:
• Two-stage approach: In the two-stage method, the first stage extracts region proposals and the second stage classifies those region proposals and determines the bounding boxes of the classified objects. In the region proposal part, sliding window techniques such as Deformable Part Models [20] are adopted. An additional region proposal technique, employed in region-based convolutional neural networks (R-CNNs) [27], is selective search [111]. R-CNNs extract around 2000 region proposals on each input image, which is a significantly reduced number of regions compared to other sliding window methods. At the second stage of this architecture, a CNN is used for object detection over the region proposals. The size of the proposed regions is arbitrary, while the CNN requires a fixed-size input. Hence, a major drawback of R-CNNs is due to the fact that images need to be cropped or resized to meet the requirement for a fixed-size input. Spatial pyramid pooling [31], [42], [64] is a method used in order to achieve a fixed-size output irrespective of the input image size. Hence, spatial pyramid pooling networks can be trained and tested on varying-size images, which reduces overfitting of the model.
Both R-CNNs and spatial pyramid pooling networks are particularly slow during training. Fast R-CNN [27] tries to solve this drawback by passing the original image through the CNN instead of using the region proposals. As a result, Fast R-CNN is faster than R-CNN because the convolutional operation is implemented only once on the original image instead of 2000 times on the region proposals. Fast R-CNNs can train detection networks whose architecture involves multiple layers like VGG-16 [99], as they are 9 times faster compared to R-CNNs and 3 times faster than spatial pyramid pooling networks [105]. The drawback of the high time cost has been further addressed by Faster R-CNNs [92]. In Faster R-CNNs the time-consuming selective-search algorithm is replaced with a fully convolutional network that learns the region proposals of an image with arbitrary size. A major additional development of the previous R-CNNs is achieved by Mask R-CNNs [41]. Mask R-CNNs extend the previous architectures by labeling the pixels corresponding to each object instance. The Mask R-CNN inherits the region proposal network from Faster R-CNNs and employs an additional branch that outputs a binary mask classifying whether or not a given pixel is part of an object. Two-stage approaches yield a high accuracy since each stage performs one specific task. However, in terms of real-time applications, two-stage approaches show weaknesses in computational time.
• One-stage approach: One-stage approaches skip the first stage of region proposal and simply run detection directly on the input image. This simpler architecture allows them to have faster inference. Some networks can achieve a processing speed of up to 150 frames per second (fps). There is a trade-off, however, in terms of accuracy. Notable one-stage methods are the ‘‘you only look once’’ (YOLO) network [91], which extracts class and bounding-box predictions directly from an input image using a CNN, and the single-shot detector (SSD) [71], which takes an input image and passes it through multiple convolutional layers with different sizes of filters.
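As an illustration of the practical difference between the two families, the snippet below loads one representative pretrained detector of each kind from torchvision and runs them on a single frame. The weights argument and the returned dictionary keys follow recent torchvision releases, so exact names may differ slightly across versions, and the confidence threshold is an arbitrary choice.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, ssd300_vgg16

# Two-stage detector: region proposals followed by per-region classification.
two_stage = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
# One-stage detector: direct prediction over a fixed grid of default boxes.
one_stage = ssd300_vgg16(weights="DEFAULT").eval()

frame = torch.rand(3, 480, 640)  # a single RGB frame with values in [0, 1]

with torch.no_grad():
    for name, detector in [("two-stage", two_stage), ("one-stage", one_stage)]:
        # Both models take a list of image tensors and return one dictionary per
        # image with 'boxes', 'labels', and 'scores'.
        output = detector([frame])[0]
        keep = output["scores"] > 0.5
        print(name, int(keep.sum()), "detections above 0.5 confidence")
```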
2) RESTRICTED BOLTZMANN MACHINES
The Restricted Boltzmann Machine (RBM) is a two-layer undirected graphical model [6] that was introduced in 1986 [46]. It consists of a set of visible nodes and a set of hidden nodes. RBMs are in essence a variant of Boltzmann machines, but in RBMs there are no intralayer connections between the nodes in the visible layer and the hidden layer (i.e. no visible node is connected to any other visible node and no hidden node is connected to any other hidden node, respectively). In this way, RBMs are easier to implement and more efficient in training compared to Boltzmann machines. Their visible nodes receive the input, combine it with weights and a bias, and pass it to the hidden nodes. The value generated at the hidden nodes is combined accordingly with weights and a bias and the result is passed to the visible nodes to reconstruct the input.

If we consider the visible vector V, the hidden vector H, and the weight parameters α_i, b_j, w_{ij}, an RBM configuration can be assigned an energy E given by [24]:

E(V, H) = -\sum_i \alpha_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j.   (1)

Given this energy function, a probability P is assigned to every pair (V, H):

P(V, H) = \frac{1}{Z} e^{-E(V, H)},   (2)

where Z is equal to the sum over the energies of all the pairs of visible and hidden vectors:

Z = \sum_{(V, H)} e^{-E(V, H)}.   (3)

For a given visible vector V, the probability that is assigned to the hidden node h_j is

P(h_j = 1 \mid V) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big),   (4)

where σ(·) is the logistic sigmoid function [38]. For a hidden vector H, the assigned probability of a visible node v_i is, respectively,

P(v_i = 1 \mid H) = \sigma\Big(\alpha_i + \sum_j h_j w_{ij}\Big).   (5)

The weight parameters are optimized with the aim to maximize the likelihood of the visible and hidden vectors (V, H).

The intuition behind RBMs is based on the association of a scalar energy to each combination of the variables of interest. Learning is achieved, therefore, by calculating the combination that has the lowest energy.

RBMs are useful for dimensionality reduction, classification, regression, and feature learning. However, due to the fact that RBMs consist of only two layers, the complexity of the data representation that they can achieve is limited [24]. For this reason, a number of extended architectures has been developed. An example of such an architecture is the Deep Belief Network [44], which consists of multiple stacked RBMs. Deep Belief Networks are used for feature extraction in many computer vision applications. Apart from Deep Belief Networks, another RBM-based architecture is the Deep Boltzmann Machine [95], [96]. Deep Boltzmann Machines are similar to Deep Belief Networks, although the former have only undirected connections between their layers, which makes them more robust to noisy observations, while the latter have bidirectional connections in the last layer [104].
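A compact NumPy sketch of the quantities defined in (1), (4), and (5) is given below; the layer sizes and the random initialization are illustrative, and the single Gibbs step shown is only the building block of contrastive-divergence training, not a full training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3                            # illustrative sizes
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                               # visible biases alpha_i
b = np.zeros(n_hidden)                                # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # Eq. (1): E(V, H) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
    return -a @ v - b @ h - v @ W @ h

def gibbs_step(v):
    # Eq. (4): P(h_j = 1 | V), then sample the hidden layer.
    p_h = sigmoid(b + v @ W)
    h = (rng.random(n_hidden) < p_h).astype(float)
    # Eq. (5): P(v_i = 1 | H), then reconstruct the visible layer.
    p_v = sigmoid(a + W @ h)
    v_recon = (rng.random(n_visible) < p_v).astype(float)
    return h, v_recon

v = rng.integers(0, 2, n_visible).astype(float)
h, v_recon = gibbs_step(v)
print("E(v, h) =", energy(v, h))
```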
3) AUTO-ENCODERS
Auto-encoders [8], [45] refer to a specific type of neural networks that aim to compress the input image data into a lower-dimension (latent) representation and then reconstruct the original image from this representation. Their architecture consists of two main parts, namely, the encoder and the decoder. The encoder maps an input vector of images X into a compressed, lower dimensional vector Z, while the decoder part maps the latent variable Z to a reconstruction of the input image. The encoder and decoder mappings φ : X → Z and ψ : Z → X are given by:

(\phi, \psi) = \arg\min_{(\phi, \psi)} \| X - (\psi \circ \phi)(X) \|^2,   (6)

where the operator ∘ refers to the composition function: ψ ∘ φ(X) = ψ(φ(X)). The autoencoder is trained with the objective to select the optimal encoder and decoder functions so that the minimum amount of information is required to encode the image in order to be regenerated on the decoder side.
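The reconstruction objective (6) translates almost directly into code. The sketch below assumes PyTorch, treats images as flattened vectors, and uses an arbitrary latent dimension purely for illustration.

```python
import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32   # e.g. 28x28 images; sizes are illustrative

encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())     # phi: X -> Z
decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())  # psi: Z -> X

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(16, input_dim)     # a batch of flattened images
x_hat = decoder(encoder(x))       # (psi o phi)(X)
loss = ((x - x_hat) ** 2).mean()  # || X - (psi o phi)(X) ||^2, as in (6)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```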
III. DEEP LEARNING METHODS FOR DETECTION AND SEGMENTATION OF OBJECTS IN VIDEOS
Due to the similarity between video detection and image detection, some methods of image detection are often used for video detection. The methods described above can be extended to the video domain by running detection for each image in a sequence of frames [7]. In this way, however, the temporal correlation between frames is not taken into account. In addition, running a detection algorithm for each frame results in computational inefficiency, since there might be feature extraction redundancies between sequential frames. Furthermore, in a video sequence there might be poor-quality frames which could lead to low inference accuracy. One obvious reason that this extension is not trivial is due to the fact that a video sequence introduces an additional dimension: the temporal one. In other words, instead of being considered as a sequence of frames, a video should rather be regarded as a sequence of related frames.

Due to the complexity of video data and the computation cost for training, research has been limited in this field. However, more and more video-related research works have surfaced lately, due to the release of ImageNet VID [93] and other massive video datasets. Depending on the architecture, DL-based techniques for video object detection can be broadly diversified into six categories, namely (1) optical flow, (2) tracking, (3) long short-term memory, (4) gated recurrent unit, (5) self-attention mechanism, and (6) generative learning. In the following subsections a critical appraisal of these architecture paradigms is presented.

A. OPTICAL FLOW
One of the most fundamental concepts in video processing is optical flow. Optical flow was originally introduced in [25], referring to human perception and the changing pattern of light that reaches our eyes. In computer vision applications, optical flow refers to the problem of estimating the displacement vector for each pixel in subsequent image frames [48].

A key assumption in optical flow is brightness constancy. This practically means that a pixel at the position (x, y) of an image at time t moves to the position (x + Δx, y + Δy) at time t + Δt and the brightness I(x, y, t) remains constant:

I(x + \Delta x, y + \Delta y, t + \Delta t) \approx I(x, y, t).   (7)

The Taylor series expansion of the left-hand side of (7) is

I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \ldots
\Rightarrow I(x + \Delta x, y + \Delta y, t + \Delta t) - I(x, y, t) = I_x \Delta x + I_y \Delta y + I_t \Delta t,   (8)

where I_x, I_y, I_t are the partial derivatives of the intensity function I with respect to x, y, and t, respectively. Hence, if we substitute (7) into (8) we can derive:

\nabla I \cdot v^{T} + I_t = 0,   (9)

where ∇I = (I_x, I_y) and v = (Δx/Δt, Δy/Δt) are the components of the optical flow, and I_t is the temporal gradient of the intensity function.

Optical flow can be applied to estimate the motion of detected objects in video segments by assigning an optical flow vector to the pixels corresponding to the detected object. Optical flow can be either ‘‘sparse’’ or ‘‘dense’’. Sparse optical flow estimates the flow vectors of some specific features, such as corners or edges of an object within an image frame. Dense optical flow, on the other hand, includes the flow vectors of all the pixels in an image frame. The latter method achieves higher accuracy than the former, although at the cost of increased computational requirements.
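In practice, classical dense and sparse flow estimators are readily available in OpenCV. The sketch below computes Farnebäck dense flow between two consecutive frames and, for comparison, tracks a handful of corner features with the pyramidal Lucas–Kanade method; the file name and the parameter values are placeholders rather than settings used in the surveyed works.

```python
import cv2

cap = cv2.VideoCapture("video.mp4")   # placeholder path
_, frame1 = cap.read()
_, frame2 = cap.read()
prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: one (dx, dy) displacement vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean displacement (pixels):", magnitude.mean())

# Sparse optical flow: track corner features only (cheaper, less complete).
corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.3, minDistance=7)
moved, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, corners, None)
print("tracked features:", int(status.sum()))
```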
Recently, modern CNN architectures have been successfully used for optical flow estimation applications [18]. CNNs can be trained to run on pairs of images and to predict the optical flow field. These flow networks are employed in computer vision tasks for videos according to two different approaches. In the first approach, one neural network is responsible for the task of object detection and it is applied on sparse key frames. The extracted feature maps from these key frames are then propagated to the next frames with a flow network. This technique is called Deep Feature Flow (DFF) [132] and it achieves great computational efficiency due to the fact that it implements the object detection task only on key frames.

The second approach involving flow networks is known as flow-guided feature aggregation (FGFA) [131]. In FGFA, a feature extraction network is run on all individual frames to create the respective feature maps per frame. The inference at a reference frame is enhanced with an optical flow network that predicts the motion between the reference frame and the adjacent frames. The propagated feature maps from neighbor frames are aggregated with the feature map from the reference frame in an adaptive weighting method. FGFA achieves higher inference accuracy but at a higher computation time compared to DFF. For this reason, an impression network [43] is another proposed architecture that combines the two abovementioned techniques, with the objective to take advantage of both methods. Sparse key frame feature maps are aggregated with other key frame feature maps and at the same time they are propagated to other non-key frames. The impression network overcomes DFF both in terms of accuracy and inference speed. It is also faster than FGFA, although it achieves a slightly lower accuracy level. An alternative architecture, which outperforms FGFA, is proposed in [17], where a two-stream feature aggregation approach is integrated into a one-stage detector to achieve video object detection. In particular, the first stream applies optical flow to estimate the motion and to aggregate the features along the motion path, while the second stream predicts the features of the frame of interest by spatio-temporal sampling and aggregation of features from the adjacent frames. The final predictions result from blending the outcomes from the two streams.

B. TRACKING
Visual tracking can be described as the problem of estimating an unknown target trajectory over a sequence of image frames [78]. Traditional methods employ a variety of tracking algorithms, such as the mean shift algorithm [14], particle filtering [30], and Kalman filtering [54]. With the advancements in data science in recent years, novel DL-based visual trackers have been developed.

Object tracking outperforms optical flow in accuracy [129]. This can be explained by the fact that tracking uses shared networks to achieve feature extraction for detection and tracking. Hence, the requirements in terms of computational power are limited and, at the same time, the fusion between the two tasks is performed in a more straightforward way, which achieves higher accuracy compared to optical flow based models.

The CNN was the first architecture adopted for DL-based visual tracking. In [19], a region-based fully convolutional neural network [15] is used for jointly performing detection and tracking in an integrated framework. The model is fed with a set of two consecutive image frames, from which the convolutional feature maps are computed. Object detection is run on each frame and a regressor is employed to compute the box transformation from one frame to the other. CNN-based object tracking models showcase some weaknesses in performance though, due to the scarcity of labeled data in terms of including sets of two consecutive frames, which are necessary for their training, as well as their speed limitations with respect to real-time applications [79].

A baseline approach presented in [121] extends the Mask R-CNN to include an additional tracking branch with an external memory for tracking object instances across frames. The proposed architecture extracts the classification, the bounding boxes, and the segmentation predictions of Mask R-CNN, and it takes into account the past frame information only for tracking. In this way, the task of instance segmentation is extended to videos. CrossVIS [122] presents a novel, cross-frame learning approach that uses the features of an instance in the current frame to segment the same instance in other frames. Crossover learning is integrated with the instance segmentation loss as an objective to obtain cross-frame instance segmentation consistency, achieving a
determined (query vector). Let us assume that we have a sequence of n elements (x_1, x_2, . . . , x_n) of X ∈ R^{n×d}, with d being the embedding dimension for the representation of each element [57]. We can then define three learnable weight matrices in order to transform the queries (W^q ∈ R^{n×d_q}), keys (W^k ∈ R^{n×d_k}), and values vectors (W^v ∈ R^{n×d_v}). In this way, the input X is first transformed with the weight matrices and projected onto Q = XW^q, K = XW^k, and V = XW^v. A similarity function is used to calculate the similarity between the query and the key vector. The self-attention layer outputs Z ∈ R^{n×d_v}, which is equal to

Z = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_q}}\right) V,   (20)

where the softmax function is defined by

\mathrm{softmax}(X)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}},   (21)

for i = 1 . . . k and X = (x_1, . . . , x_k) ∈ R^k. The self-attention determines the similarity between the key and the query vector by computing their dot product. The dot product is then normalized using softmax so that the sum of all the scores becomes equal to 1. Each element is then given by the weighted sum of all elements in the sequence. The weights in this case correspond to the attention scores. The most well-known self-attention architecture is the transformer [113].
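The computation in (20)–(21) fits in a few lines. The NumPy sketch below uses small, arbitrary dimensions and random weight matrices purely to show the data flow from X to Z; the projection matrices are sized d × d_q (respectively d × d_v) so that the products Q = XW^q, K = XW^k, and V = XW^v are well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_q, d_v = 5, 8, 4, 4          # sequence length and embedding sizes (illustrative)

X = rng.standard_normal((n, d))      # n input elements, each of dimension d
W_q = rng.standard_normal((d, d_q))  # learnable projections (random placeholders here)
W_k = rng.standard_normal((d, d_q))
W_v = rng.standard_normal((d, d_v))

def softmax(scores):
    # Eq. (21), applied row-wise with the usual max-shift for numerical stability.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
attention = softmax(Q @ K.T / np.sqrt(d_q))  # similarity of every query with every key
Z = attention @ V                            # Eq. (20): weighted sum of the values
print(Z.shape)                               # (n, d_v)
```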
In [26] a transformer framework is developed to recognize and localize human actions in a video. A person feature is represented as the query (Q) and the features from adjacent video frames correspond to the keys (K) and the values (V). A video instance segmentation architecture built upon transformers is proposed in [116]. Four modules are included in the developed architecture: a backbone CNN to extract features over the video frames, an encoder-decoder transformer that determines the similarity of features on pixel and instance level, an instance-sequence matching module, and a segmentation module. The overall performance of this framework is competitive compared to the single-model approaches tested on the YouTube-VIS dataset [121], although it is somewhat lower in comparison to other complex CNN-based models [3].

In [35] a constrained self-attention architecture is proposed for video object detection that captures motion cues under the assumption that moving objects follow a continuous trajectory. An additional self-attention based architecture is proposed in [36], which is applied in the temporal-spatial domain towards aligning two feature maps of consecutive frames. The proposed method features a low number of parameters, while it achieves higher accuracy in comparison to optical flow-based methods such as DFF and FGFA. A related, efficient, and simplified architecture for video object detection via aggregating semantic features across frames is presented in [118]. Cosine similarity is implemented to compute the semantic similarities of the extracted proposals across frames, which are then aggregated accordingly. In [16] an object relation module is employed as part of a multi-stage architecture, in order to extract object relations in both spatial and temporal context. The relations are then further distilled with refined supportive object proposals and propagated across frames. Finally, in [98] an attention-based module is developed to learn long-range temporal relations between objects, in order to propagate the extracted features. The proposed architectures in [16], [118], and [98] outperform optical flow-based approaches in accuracy.

F. GENERATIVE LEARNING
The objective of generative learning is to approximate a complex, high-dimensional probabilistic distribution that generates a class of data, in order to generate similar data. Developing generative architectures to understand complicated data distributions has been a long-standing research problem [84]. Recent works in this area [29], [59] have provided a new set of generative algorithms that can efficiently generate video segments or extract features from them. The most outstanding generative algorithms are the variational autoencoders (VAEs) and generative adversarial networks (GANs).
• Variational auto-encoders: Their architecture resembles an auto-encoder, with the difference that their latent variable distribution is regularised during the training. VAEs stemmed from the limitation of auto-encoders to generate new, unseen data, due to the fact that the distribution of the latent variable is unknown. To alleviate this issue, VAEs are trained to learn the distribution of the latent variable, assuming that it follows a Gaussian distribution with a mean µ and variance σ² [50].
One example of a VAE-based architecture for video object detection is presented in [67], where a modified VAE architecture, built on top of a Mask R-CNN, is proposed in order to detect and to segment multiple instances in diverse videos. The proposed architecture outperforms MaskTrack R-CNN [121], because the MaskTrack R-CNN architecture depends entirely on the Mask R-CNN to perform predictions, resulting in difficulties to handle false negative proposals of the Mask R-CNN in highly diverse videos with occlusions, deformations, and pose variations of objects. By contrast, the architecture proposed in [67] merges a VAE with a Mask R-CNN network in a topology consisting of one encoder and three decoders. This results in three parallel branches that provide strong complements for predictions about bounding boxes and mask features, and they significantly reduce the number of false negatives in the Mask R-CNN module.
• Generative adversarial networks: Generative adversarial networks are built on the basis of a two-player, min-max game. The generator network G and the discriminator network D correspond to the first and the second player respectively. The generator’s objective is to mislead the discriminator by generating natural-looking data (e.g. images, videos, etc.) from a random, latent vector z. The discriminator on the other
hand, tries to distinguish whether the data are real or fake (generated). The game is modeled as the following optimization problem:

\min_{G} \max_{D} (G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))].   (22)

A generative adversarial approach is developed in [102] to randomly generate masks that correspond to object appearance variations in time. The masks are then applied to reduce overfitting via adaptively dropping out input features. The developed architecture identifies the mask that maintains the most robust features of the target objects over a long period of time. In [106] a GAN is trained on color and depth information in order to generate similar backgrounds to the test samples. The generated background samples are then subtracted from the given test samples to detect foreground moving objects. Finally, in [11] the encoder-decoder architecture of [82], which is limited to processing information between only two adjacent frames, is extended with a GAN to enforce temporal and spatial coherence of the generated object masks and to exploit information within a longer temporal window. The developed architecture exhibits similar accuracy as other state-of-the-art computer vision methods, while it is almost four times faster.
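A minimal PyTorch rendering of the two-player objective (22) is sketched below. The generator and discriminator are deliberately tiny multilayer perceptrons over vector data and all sizes are arbitrary; the point is only to show how the min–max game of (22) splits into alternating discriminator and generator updates (the generator update uses the common non-saturating form of the same game).

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64   # illustrative sizes
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real):
    batch = real.shape[0]
    z = torch.randn(batch, latent_dim)
    fake = G(z)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))), i.e. minimize this BCE.
    d_loss = (bce(D(real), torch.ones(batch, 1))
              + bce(D(fake.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator label generated data as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(gan_step(torch.randn(32, data_dim)))  # one update on a random "real" batch
```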
IV. CHALLENGES IN DEEP-LEARNING-BASED COMPUTER VISION
Despite the tremendous advances in deep learning and the fast pace of its breakthroughs over the last years, there are still challenges that prevent it from reaching its full potential. This section illustrates a set of major challenges related to computer vision tasks on video analysis with DL techniques.

DL-based methods have succeeded in achieving even human-level performance in complex computer vision tasks. However, this is possible only when massive datasets are available for training. Data are the core of any DL-based process and hence their shortage is often responsible for poor performance. Large-scale amounts of data are not available for all video applications though.

The impact of data scarcity is further escalated by the stand-alone approach of DL. A typical workflow for developing a DL module consists of creating a training set of inputs associated with outputs and learning the relations between them. In this way, however, the architecture becomes free-standing and isolated from prior, useful knowledge. Hence, the DL performance is highly determined by the existence of big-volume datasets while, at the same time, applications that are more related to common sense reasoning and less to categorization cannot be sufficiently targeted with purely DL methods [76].

Generalizability is an additional major challenge, concerning the performance of a data-driven model trained on one dataset when applied to other datasets. When training deep neural networks with high complexity and numerous parameters, the cost function might have multiple minima, which minimize the training error but may not generalize well to unseen data. The presence of noise and outliers in the training dataset is an additional reason for poor generalizability. Generalizability also deteriorates due to the weakness of DL methods in dealing with hierarchical structures, since DL modules tend to fail when generalization depends on compositional processes [63].

At the same time, although correlation does not imply causation, the two do not seem to be distinguishable for DL. Numerous neural network architectures have surfaced over the last decades that are highly capable of discovering complex correlations in data, yet they lack in reasoning about cause-effect relations or environment changes.

Finally, deep learning has delivered new, highly performing approaches in computer vision tasks, whose dominance, however, remains inversely proportional to their explanatory power. Rationalizing the output of data-driven techniques is a critical issue, since more and more data-driven systems are adopted in safety-critical and high-impact applications.

V. INTEGRATING DEEP LEARNING WITH DOMAIN KNOWLEDGE
A. MOTIVATION
A prudent approach to address the abovementioned challenges is to expand the current methods and to merge them with principles that govern the dynamic behavior of systems over time, enabling an adaptation to new, unseen scenarios. Combining DL-based techniques with equation-based dynamic models (DMs) in a complementary way, or, in other words, integrating common sense understanding into artificial intelligence, constitutes a particularly interesting challenge for computer vision systems.

Enabling data-driven vision systems to understand the principles that govern the behavior of objects is essential for the development of autonomous systems that understand observed scenarios and have the ability to adapt these principles to a never seen situation. Leveraging domain knowledge to identify equation-based models that describe how the properties of objects and entities change over time and embedding them into DL techniques can lead to novel, highly robust, and performing architectures. Such models could be developed, for instance, from well-known first principles in order to describe how an object moves, and they could be coupled with DL methods forming a hybrid computer vision architecture. It is straightforward to conclude that hybrid architectures are more efficient compared to purely data-driven or model-based techniques as they harness the benefits of both disciplines. Hybrid methods that combine scientific domain knowledge with data-driven models allow for accurate inference even with imperfect models and limited amounts of data.

The integration of the two disciplines in a hybrid architecture can be realized either by infusing mathematical rules to
dynamic model and the DL model, respectively, and f_hybrid the composition of the two functions, f_hybrid = f_DL ∘ f_DM [90].

Two main categories of architectures can result from merging DL with dynamic models founded on prior domain knowledge. In the first category, the output of the model is fed through the DL module at the first or at an additional layer. In the second category, the model is embedded into the DL module. Many architectures with respect to the first class have surfaced lately in the field of climate and geology applications. In [52], [56], the output of a physics-based model is provided as an additional input feature to the DL module in an application related to predicting the temperature of a lake based on the depth. In [86], a physics-based neural network architecture is used in order to simulate broadband earthquake ground motions. The DL module is used to predict the ground motion in the short term, including transient effects, which are particularly complex to model mathematically. The DM module is then used to simulate the response in a long-term period.

In the second class, the DM module is embedded into the DL module architecture. An example of this class is a physics-based model with an RNN including LSTMs [101], where the sensor data as well as the DM-generated output are ingested as input to the RNN architecture.

4) REGULARIZATION
Deep neural networks can involve numerous parameters. However, when no large amounts of data are available, deep neural networks tend to overfit or, in other words, they fail to discover the underlying relationship described by the training data and hence they cannot extrapolate to observed data outside the training set. One way to handle this issue is to apply physical constraints on the loss function of the neural network. Several regularization techniques have been developed in this way, to prevent neural networks from overfitting. This is achieved by applying penalties to layer parameters and by integrating these penalties in the loss function that is minimized during training. The loss function in that case will be of the following form [117]:

f_{\mathrm{Loss}} = f_{\mathrm{Trn}}(Y, \hat{Y}) + \lambda R(W) + \gamma f_{\mathrm{Phy}}(\hat{Y}),   (23)

where f_Trn corresponds to a function that represents the error between the predicted value Ŷ and the true value Y. This function can be, for example, the mean squared error or the cross entropy. In addition, λ represents a hyperparameter determining the weight of the regularization term R(W). The first two terms of (23) describe the standard loss function used when training a neural network. The additional term f_Phy corresponds to the physics-based constraint and it aims to ensure the consistency of the trained system with first-principle laws or dynamic models. The weight of this function is represented by the hyperparameter γ. Given the true value Y, the following is considered as the general optimization problem to solve for (23):

\arg\min_{W} \; f_{\mathrm{Trn}}(Y, \hat{Y}) + \lambda R(W) + \gamma f_{\mathrm{Phy}}(\hat{Y}).   (24)

By introducing model-based constraints in the loss function for the training of DL modules, scientific consistency is achieved, which is essential for training generalizable models. In addition, the physics-based loss function f_Phy requires no labeled data, which allows the training of the DL module to be expanded to non-labeled data. A plethora of implementations that impose physics-based constraints on the training of DL models has surfaced recently [81], [103], [107]. In [56] a physics-based loss function is used for the training of a lake temperature predictor. The loss function encompasses a constraint resulting from the relationship between the temperature, the density, and the depth of the lake water. In this way, the trained predictor achieves enhanced generalizability, while at the same time consistency with first-principle laws is ensured for the results. In [51], the application of lake temperature prediction is extended to include temporal physical processes. More specifically, a physics-based RNN is developed that involves energy conservation constraints. Standard LSTM models store specific information at each time step, which feeds to the next time step. However, when the models are trained on data from specific seasons or from multiple years, it is difficult to generalize to data from different time periods, since the time profiles vary significantly between each other. By including the energy flux changes, however, which determine the temperature changes, the architecture can successfully predict the lake temperature, even on unseen data. Another example is given in [53], where the data-driven model is penalized with the equation describing the time evolution of waves in order to identify the location of underwater obstacles from acoustic measurements. In this way, the accuracy of the model outside the training dataset is enhanced. Finally, [10] presents a case where multiple physics-based terms are present in a loss function. These might be competing loss terms with multiple local minima that correspond to different physics equations that need to be minimized together. Hence, an approach is presented where the contribution of each term is adaptively tuned during the training phase in order to improve the generalizability of the developed architecture.
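The composite loss (23) can be written directly as a training criterion. The sketch below assumes PyTorch and uses a generic, user-supplied physics residual g(Ŷ) (for instance, the violation of a known algebraic or dynamic relation between outputs) as f_Phy, with λ and γ as tunable hyperparameters; the monotonicity residual shown is a placeholder, not one of the constraints from the cited works.

```python
import torch

def physics_guided_loss(y_true, y_pred, model, physics_residual, lam=1e-4, gamma=1.0):
    """f_Loss = f_Trn(Y, Y_hat) + lambda * R(W) + gamma * f_Phy(Y_hat), as in (23)."""
    f_trn = torch.mean((y_true - y_pred) ** 2)              # e.g. mean squared error
    r_w = sum(p.pow(2).sum() for p in model.parameters())   # L2 penalty on the weights
    f_phy = torch.mean(physics_residual(y_pred) ** 2)       # penalize constraint violation
    return f_trn + lam * r_w + gamma * f_phy

# Placeholder constraint: predictions at consecutive indices should not decrease
# (in the spirit of a monotonic physical relation); purely illustrative.
def monotonicity_residual(y_pred):
    return torch.relu(y_pred[:, :-1] - y_pred[:, 1:])

model = torch.nn.Linear(10, 5)
x, y = torch.randn(8, 10), torch.randn(8, 5)
loss = physics_guided_loss(y, model(x), model, monotonicity_residual)
loss.backward()
print(loss.item())
```

Note that the physics term depends only on the predictions, which is why, as stated above, it can also be evaluated on unlabeled inputs.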
C. HYBRID ARCHITECTURE IMPLEMENTATION IN COMPUTER VISION
Integrating useful domain knowledge into DL-based computer vision tasks is essential to build robust, generalizable systems and to compensate for the lack of large-volume training data. An example of such a hybrid architecture is proposed in [103], where the height of a free-falling object is estimated on each frame of a video by training a CNN to detect and track objects obeying free-falling laws of physics. The training of this CNN is based on a loss function in which first-principle laws are encoded. In [1], physics is blended with DL in the framework of a two-stage encoder with the aim to recover the shape of an object based on polarized photos. In [61] an LSTM architecture is combined with a dynamics model in order to acquire a
proposal distribution over an object’s state. Finally, in [119], a generative vision system is proposed for estimating physical features of objects by integrating the output of a multi-physics simulation engine in the loop.

Integrating DL techniques with domain knowledge is a recently introduced research topic [55], [90]. As a result, using domain knowledge to derive first-principle models or, on a broader perspective, any dynamic mathematical or computer model [73] that describes how the properties of objects and entities change over time (Figure 3), and merging them with existing DL architectures, constitutes an especially promising research task to address the challenges of DL in computer vision.

VI. OUTLOOK: FUTURE DIRECTIONS IN DEEP LEARNING FOR OBJECT DETECTION AND SEGMENTATION IN VIDEOS
Deep learning has brought a catalytic effect in the field of computer vision for video analysis. Although nobody knows with certainty how DL will evolve over the coming decades, it is expected that much of the future research will revolve around the following critical areas [32], [77], [114]:
• Out-of-distribution generalization: Future computer vision systems should be able to make accurate predictions not only in a known context but also for data with different distributions than the ones learned from the training samples. The main reason behind the difficulty of DL systems to accurately generalize and predict on unseen data lies in the fundamental assumption that training and test data are independent and identically distributed (IID) [97], [128]. In many real-life cases, however, the IID assumption is hardly satisfied. The ability to generalize under distribution shifts is of critical significance, and hence, the investigation of out-of-distribution generalization is expected to attract enormous research interest in the academic field.
• Deep learning systems with causal structures: Causality is expected to be a central strand of DL research in the coming years [89]. Developing DL systems that can represent causal relationships can increase their safety and reliability, and introducing a causal understanding of basic concepts in DL methods could certainly be the key to achieving robustness in complex real-world environments.
• Effective representation learning with few or no labeled data: While techniques for representation learning when massive labeled datasets are available have become remarkably powerful, various challenges remain in the case of limited labeled data. Developing approaches for addressing the issue of labeled data scarcity is an emerging popular direction of research.
• Adaptation in time-varying environments: Adapting to time-varying environments and other dynamic-behavior-related problems has been under examination for many years and it is expected to gain massive attention by the DL research community over the coming years. Allowing integration of new knowledge online and at the same time being capable of preserving the knowledge learned during previous interactions are only a few of the desirable features of future vision mechanisms.
• Multi-modal learning: Ultimately, major emphasis in research is expected to be placed upon developing methods that can process and link information combining modalities from various architectures [65], [76], since unimodal DL methods seem to fail to fulfill all the desirable future DL capabilities. In particular, combined architectures that integrate DL modules with domain knowledge could provide a suitable answer to most research questions arising from the DL directions listed above.

VII. CONCLUSION
In this paper a study is presented about detection and segmentation of objects applied to video segments. A review of the currently existing techniques has been presented, as well as the major challenges that data-driven techniques face. Then, an extension of the data-driven techniques to a hybrid architecture that fuses data-driven techniques with equation-based models describing the dynamic behavior of objects and entities over time has been proposed in order to address issues like data scarcity, generalizability, and interpretability of the purely data-driven architectures. Finally, a survey of the current developments in hybrid architectures has been presented. We hope that this work will assist in better understanding the current status of DL in computer vision for video analysis, as well as in presenting interesting directions as guidelines for future work.

REFERENCES
[1] Y. Ba, A. Ross Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi, ‘‘Deep shape from polarization,’’ 2019, arXiv:1903.10210.
[2] H. Bay, T. Tuytelaars, and L. Van Gool, ‘‘SURF: Speeded up robust features,’’ in Computer Vision–(ECCV). Berlin, Germany: Springer, 2006, pp. 404–417.
[3] G. Bertasius and L. Torresani, ‘‘Classifying, segmenting, and tracking object instances in video with mask propagation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9739–9748.
[4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, ‘‘Fully-convolutional Siamese networks for object tracking,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV Workshops), 2016, pp. 850–865.
[5] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, ‘‘CNN variants for computer vision: History, architecture, application, challenges and future scope,’’ Electronics, vol. 10, no. 20, p. 2470, Oct. 2021.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[7] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, ‘‘YOLACT++: Better real-time instance segmentation,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 1108–1121, Feb. 2022.
[8] H. Bourlard and Y. Kamp, ‘‘Auto-association by multilayer perceptrons and singular value decomposition,’’ Biol. Cybern., vol. 59, nos. 4–5, pp. 291–294, Sep. 1988.
[9] A. Broad, M. Jones, and T. Y. Lee, ‘‘Recurrent multi-frame single shot detector for video object detection,’’ in Proc. BMVC, 2018, pp. 1–14.
[10] M. Elhamod, J. Bu, C. Singh, M. Redell, A. Ghosh, V. Podolskiy, W.-C. Lee, and A. Karpatne, ‘‘CoPhy-PGNN: Learning physics-guided neural networks with competing loss functions for solving eigenvalue problems,’’ 2020, arXiv:2007.01420.
[11] S. Caelles, A. Pumarola, F. Moreno-Noguer, A. Sanfeliu, and L. Van Gool, ‘‘Fast video object segmentation with spatio-temporal GANs,’’ 2019, arXiv:1903.12161.
[12] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, ‘‘BRIEF: Binary robust independent elementary features,’’ in Computer Vision–(ECCV). Berlin, Germany: Springer, 2010, pp. 778–792.
[13] K. Cho, B. van Merrienboer, C. C. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, ‘‘Learning phrase representations using RNN encoder-decoder for statistical machine translation,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014.
[14] D. Comaniciu, V. Ramesh, and P. Meer, ‘‘Real-time tracking of non-rigid objects using mean shift,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2000, pp. 142–149.
[15] J. Dai, Y. Li, K. He, and J. Sun, ‘‘R-FCN: Object detection via region-based fully convolutional networks,’’ 2016, arXiv:1605.06409.
[16] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ‘‘Relation distillation networks for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 7022–7031.
[17] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, ‘‘Single shot video object detector,’’ IEEE Trans. Multimedia, vol. 23, pp. 846–858, 2021.
[18] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, ‘‘FlowNet: Learning optical flow with convolutional networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[19] C. Feichtenhofer, A. Pinz, and A. Zisserman, ‘‘Detect to track and track to detect,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3057–3065.
[20] P. Felzenszwalb, D. McAllester, and D. Ramanan, ‘‘A discriminatively trained, multiscale, deformable part model,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[21] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, ‘‘Big data preprocessing: Methods and prospects,’’ Big Data Anal., vol. 1, no. 1, pp. 1–22, Dec. 2016.
[22] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, ‘‘Vision meets robotics: The KITTI dataset,’’ Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013.
[23] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[24] T. Georgiou, Y. Liu, W. Chen, and M. Lew, ‘‘A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision,’’ Int. J. Multimedia Inf. Retr., vol. 9, no. 3, pp. 135–170, Sep. 2020.
[25] J. J. Gibson, The Ecological Approach to Visual Perception. Boston, MA, USA: Houghton Mifflin, 1979.
[26] R. Girdhar, J. Joao Carreira, C. Doersch, and A. Zisserman, ‘‘Video action transformer network,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 244–253.
[27] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (Adaptive Computation and Machine Learning Series). Cambridge, MA, USA: MIT Press, 2016.
[29] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Proc. 27th Int. Conf. Neural Inf. Process. Syst., vol. 2. Cambridge, MA, USA: MIT Press, 2014, pp. 2672–2680.
[30] N. Gordon, ‘‘Novel approach to nonlinear/non-Gaussian Bayesian state estimation,’’ IEE Proc. F-Radar Signal Process., vol. 140, no. 6, pp. 107–113, Apr. 1993.
[31] K. Grauman and T. Darrell, ‘‘The pyramid match kernel: Discriminative classification with sets of image features,’’ in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2005, pp. 1458–1465.
[32] H. S. Greenwald and C. K. Oertel, ‘‘Future directions in machine learning,’’ Frontiers Robot. AI, vol. 3, p. 79, Jan. 2017.
[33] K. Greff, R. K. Srivastava, J. Koutnìk, B. R. Steunebrink, and J. Schmidhuber, ‘‘LSTM: A search space Odyssey,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[34] R. L. Gregory, Eye and Brain: The Psychology of Seeing. New York, NY, USA: McGraw-Hill, 1978.
[35] Y. Gu, L. Wang, Z. Wang, Y. Liu, M.-M. Cheng, and S.-P. Lu, ‘‘Pyramid constrained self-attention network for fast video salient object detection,’’ in Proc. AAAI Conf. Artif. Intell., 2020, pp. 10869–10876.
[36] C. Guo, B. Fan, J. Gu, Q. Zhang, S. Xiang, V. Prinet, and C. Pan, ‘‘Progressive sparse local attention for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3909–3918.
[37] D. Hall, F. Dayoub, J. Skinner, H. Zhang, D. Miller, P. Corke, G. Carneiro, A. Angelova, and N. Sunderhauf, ‘‘Probabilistic object detection: Definition and evaluation,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1020–1029.
[38] J. Han and C. Moraga, ‘‘The influence of the sigmoid function parameters on the speed of backpropagation learning,’’ in From Natural to Artificial Neural Computation. Berlin, Germany: Springer, 1995, pp. 195–201.
[39] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, ‘‘Simultaneous detection and segmentation,’’ 2014, arXiv:1407.1808.
[40] C. Harris and M. Stephens, ‘‘A combined corner and edge detector,’’ in Proc. Alvey Vis. Conf., 1988, pp. 147–151.
[41] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ‘‘Mask R-CNN,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386–397, Feb. 2020.
[42] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Spatial pyramid pooling in deep convolutional networks for visual recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014.
[43] C. Hetang, H. Qin, S. Liu, and J. Yan, ‘‘Impression network for video object detection,’’ 2017, arXiv:1712.05896.
[44] G. Hinton, S. Osindero, and Y.-W. Teh, ‘‘A fast learning algorithm for deep belief nets,’’ Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[45] G. E. Hinton and R. R. Salakhutdinov, ‘‘Reducing the dimensionality of data with neural networks,’’ Science, vol. 313, no. 5786, pp. 504–507, 2006.
[46] G. E. Hinton and T. J. Sejnowski, Learning and Relearning in Boltzmann Machines. Cambridge, MA, USA: MIT Press, 1986, pp. 282–317.
[47] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[48] B. K. P. Horn and B. G. Schunck, ‘‘Determining optical flow,’’ Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, Aug. 1980.
[49] B. J. Hou and Z. H. Zhou, ‘‘Learning with interpretable structure from gated RNN,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2267–2279, Jul. 2020.
[50] A. Jabbar, X. Li, and B. Omar, ‘‘A survey on generative adversarial networks: Variants, applications, and training,’’ ACM Comput. Surv., vol. 54, no. 8, pp. 1–49, Nov. 2022.
[51] X. Jia, J. Willard, A. Karpatne, J. Read, J. Zwart, M. S. Steinbach, and V. Kumar, ‘‘Physics guided RNNs for modeling dynamical systems: A case study in simulating lake temperature profiles,’’ in Proc. SIAM Int. Conf. Data Mining, May 2019, pp. 558–566.
[52] X. Jia, J. Willard, A. Karpatne, J. S. Read, J. A. Zwart, M. Steinbach, and V. Kumar, ‘‘Physics-guided machine learning for scientific discovery: An application in simulating lake temperature profiles,’’ 2020, arXiv:2001.11086.
[53] A. Kahana, E. Turkel, S. Dekel, and D. Givoli, ‘‘Obstacle segmentation based on the wave equation and deep learning,’’ J. Comput. Phys., vol. 413, Jul. 2020, Art. no. 109458.
[54] R. E. Kalman, ‘‘A new approach to linear filtering and prediction problems,’’ J. Basic Eng., vol. 82, no. 1, pp. 35–45, Mar. 1960.
[55] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, ‘‘Theory-guided data science: A new paradigm for scientific discovery from data,’’ IEEE Trans. Knowl. Data Eng., vol. 29, no. 10, pp. 2318–2331, Jun. 2017.
[56] A. Daw, A. Karpatne, W. Watkins, J. Read, and V. Kumar, ‘‘Physics-guided neural networks (PGNN): An application in lake temperature modeling,’’ 2017, arXiv:1710.11431.
[57] S. Khan, M. Naseer, M. Hayat, S. Waqas Zamir, F. Shahbaz Khan, and M. Shah, ‘‘Transformers in vision: A survey,’’ 2021, arXiv:2101.01169.
[58] Y. Kim, C. Denton, L. Hoang, and A. M. Rush, ‘‘Structured attention networks,’’ 2017, arXiv:1702.00887.
[59] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’ in Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada, Apr. 2014.
[60] J. F. Kolen and S. C. Kremer, ‘‘Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,’’ in A Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001, pp. 237–243, doi: 10.1109/9780470544037.ch14.
[61] J. Kossen, K. Stelzner, M. Hussing, C. Voelcker, and K. Kersting, ‘‘Structured object-aware physics prediction for video modeling and planning,’’ in Proc. Int. Conf. Learn. Represent., 2020.
[62] A. Kumar and S. Srivastava, ‘‘Object detection system based on convolution neural networks using single shot multi-box detector,’’ Proc. Comput. Sci., vol. 171, pp. 2610–2617, Jan. 2020.
[63] B. M. Lake and M. Baroni, ‘‘Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,’’ in Proc. ICML, 2018, pp. 2879–2888.
[64] S. Lazebnik, C. Schmid, and J. Ponce, ‘‘Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 2169–2178.
[65] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, no. 7553, pp. 436–444, Feb. 2015.
[66] K. Li, W. Ma, U. Sajid, Y. Wu, and G. Wang, ‘‘Object detection with convolutional neural network,’’ 2019, arXiv:1912.01844.
[67] C.-C. Lin, Y. Hung, R. Feris, and L. He, ‘‘Video instance segmentation tracking with a modified VAE architecture,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13144–13154.
[68] Z. C. Lipton, J. Berkowitz, and C. Elkan, ‘‘A critical review of recurrent neural networks for sequence learning,’’ 2015, arXiv:1506.00019.
[69] D. Liu, Y. Cui, Y. Chen, J. Zhang, and B. Fan, ‘‘Video object detection for autonomous driving: Motion-aid feature calibration,’’ Neurocomputing, vol. 409, pp. 1–11, Oct. 2020.
[70] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, ‘‘Looking fast and slow: Memory-guided mobile video object detection,’’ 2019, arXiv:1903.10172.
[71] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Computer Vision (ECCV). Cham, Switzerland: Springer, 2016, pp. 21–37.
[72] X. Liu, Z. Deng, and Y. Yang, ‘‘Recent progress in semantic image segmentation,’’ Artif. Intell. Rev., vol. 52, no. 2, pp. 1089–1106, Aug. 2019.
[73] L. Ljung, System Identification: Theory for the User, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1999.
[74] D. G. Lowe, ‘‘Object recognition from local scale-invariant features,’’ in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999, pp. 1150–1157.
[75] Y. Lu, C. Lu, and C.-K. Tang, ‘‘Online video object detection using association LSTM,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2363–2371.
[76] G. Marcus, ‘‘Deep learning: A critical appraisal,’’ 2018, arXiv:1801.00631.
[77] G. Marcus, ‘‘The next decade in AI: Four steps towards robust artificial intelligence,’’ 2020, arXiv:2002.06177.
[78] S. Mojtaba Marvasti-Zadeh, L. Cheng, H. Ghanei-Yakhdan, and S. Kasaei, ‘‘Deep learning for visual tracking: A comprehensive survey,’’ 2019, arXiv:1912.00535.
[79] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem, ‘‘TrackingNet: A large-scale dataset and benchmark for object tracking in the wild,’’ 2018, arXiv:1803.10794.
[80] W. K. Mutlag, S. K. Ali, Z. M. Aydam, and B. H. Taher, ‘‘Feature extraction methods: A review,’’ in Proc. J. Phys., Conf., Jul. 2020, vol. 1591, no. 1, Art. no. 012028.
[81] M. Amin Nabian and H. Meidani, ‘‘Physics-driven regularization of deep neural networks for enhanced engineering design and analysis,’’ 2018, arXiv:1810.05547.
[82] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim, ‘‘Fast video object segmentation by reference-guided mask propagation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7376–7385.
[83] N. O’Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh, ‘‘Deep learning vs. traditional computer vision,’’ in Proc. Comput. Vis. Conf. (CVC). Cham, Switzerland: Springer, 2020, pp. 128–144.
[84] A. Oussidi and A. Elhassouny, ‘‘Deep generative models: Survey,’’ in Proc. Int. Conf. Intell. Syst. Comput. Vis. (ISCV), Apr. 2018, pp. 1–8.
[85] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[86] R. Paolucci, F. Gatti, and M. Infantino, ‘‘Broadband ground motions from 3D physics-based numerical simulations using artificial neural networks,’’ Bull. Seismolog. Soc. Amer., vol. 108, no. 3A, pp. 1272–1286, Feb. 2018.
[87] C. Patel, D. Bhatt, U. Sharma, R. Patel, S. Pandya, K. Modi, N. Cholli, A. Patel, U. Bhatt, M. A. Khan, S. Majumdar, M. Zuhair, K. Patel, S. A. Shah, and H. Ghayvat, ‘‘DBGC: Dimension-based generic convolution block for object recognition,’’ Sensors, vol. 22, no. 5, p. 1780, Feb. 2022.
[88] C. I. Patel, S. Garg, T. Zaveri, and A. Banerjee, ‘‘Top-down and bottom-up cues based moving object detection for varied background video sequences,’’ Adv. Multimedia, vol. 2014, pp. 1–20, Jan. 2014.
[89] J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect, 1st ed. New York, NY, USA: Basic Books, 2018.
[90] R. Rai and C. K. Sahu, ‘‘Driven by data or derived through physics? A review of hybrid physics guided machine learning techniques with cyber-physical system (CPS) focus,’’ IEEE Access, vol. 8, pp. 71050–71073, 2020.
[91] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[92] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[93] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. F. Li, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[94] M. Sadoughi and C. Hu, ‘‘Physics-based convolutional neural network for fault diagnosis of rolling element bearings,’’ IEEE Sensors J., vol. 19, no. 11, pp. 4181–4192, Jun. 2019.
[95] R. Salakhutdinov and G. Hinton, ‘‘Deep Boltzmann machines,’’ in Proc. 12th Int. Conf. Artif. Intell. Statist., vol. 5, Clearwater Beach, FL, USA, Apr. 2009, pp. 448–455.
[96] R. Salakhutdinov and H. Larochelle, ‘‘Efficient learning of deep Boltzmann machines,’’ in Proc. AISTATS, 2010, pp. 693–700.
[97] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, ‘‘Towards out-of-distribution generalization: A survey,’’ 2021, arXiv:2108.13624.
[98] M. Shvets, W. Liu, and A. Berg, ‘‘Leveraging long-range temporal relationships between proposals for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9755–9763.
[99] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556.
[100] S. Singh, A. Prasad, K. Srivastava, and S. Bhattacharya, ‘‘Object detection methods for real-time video surveillance: A survey with empirical evaluation,’’ in Smart Systems and IoT: Innovations in Computing. Singapore: Springer, 2020, pp. 663–679.
[101] S. K. Singh, R. Yang, A. Behjat, R. Rai, S. Chowdhury, and I. Matei, ‘‘PI-LSTM: Physics-infused long short-term memory network,’’ in Proc. 18th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2019, pp. 34–41.
[102] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang, ‘‘VITAL: Visual tracking via adversarial learning,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8990–8999.
[103] R. Stewart and S. Ermon, ‘‘Label-free supervision of neural networks with physics and domain knowledge,’’ in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2576–2582.
[104] H. Suk, An Introduction to Neural Networks and Deep Learning. Amsterdam, The Netherlands: Elsevier, Jan. 2017, pp. 3–24.
[105] F. Sultana, A. Sufian, and P. Dutta, ‘‘A review of object detection models based on convolutional neural network,’’ in Intelligent Computing: Image Processing Based Applications. Singapore: Springer, 2020, pp. 1–16.
[106] M. Sultana, A. Mahmood, S. Javed, and S. Ki Jung, ‘‘Unsupervised RGBD video object segmentation using GANs,’’ 2018, arXiv:1811.01526.
[107] L. Sun, H. Gao, S. Pan, and J.-X. Wang, ‘‘Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data,’’ Comput. Methods Appl. Mech. Eng., vol. 361, Apr. 2020, Art. no. 112732.
[108] X. Sun, P. Wu, and S. C. H. Hoi, ‘‘Face detection using deep learning: An improved faster RCNN approach,’’ Neurocomputing, vol. 299, pp. 42–50, Jul. 2018.
[109] R. Tao, E. Gavves, and A. W. M. Smeulders, ‘‘Siamese instance search for tracking,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1420–1429.
[110] S. Tripathi, Z. Lipton, S. Belongie, and T. Nguyen, ‘‘Context matters: Refining object detection in video with recurrent neural networks,’’ in Proc. Brit. Mach. Vis. Conf., 2016, pp. 1–12.
[111] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, ‘‘Selective search for object recognition,’’ Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[112] A. Ullah, K. Muhammad, W. Ding, V. Palade, I. U. Haq, and S. W. Baik, ‘‘Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications,’’ Appl. Soft Comput., vol. 103, May 2021, Art. no. 107102.
[113] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. 31st Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2017, pp. 6000–6010.
[114] R. Verschae and J. Ruiz-del-Solar, ‘‘Object detection: Current and future directions,’’ Frontiers Robot. AI, vol. 2, p. 29, Nov. 2015.
[115] J. Wang, E. Sezener, D. Budden, M. Hutter, and J. Veness, ‘‘A combinatorial perspective on transfer learning,’’ in Proc. Adv. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2020, pp. 918–929.
[116] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, ‘‘End-to-end video instance segmentation with transformers,’’ 2020, arXiv:2011.14503.
[117] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, ‘‘Integrating scientific knowledge with machine learning for engineering and environmental systems,’’ 2020, arXiv:2003.04919.
[118] H. Wu, Y. Chen, N. Wang, and Z.-X. Zhang, ‘‘Sequence level semantics aggregation for video object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9216–9224.
[119] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, ‘‘Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 28, Red Hook, NY, USA: Curran Associates, 2015, pp. 1–9.
[120] C. Xie, Y. Xiang, A. Mousavian, and D. Fox, ‘‘Unseen object instance segmentation for robotic environments,’’ 2020, arXiv:2007.08073.
[121] L. Yang, Y. Fan, and N. Xu, ‘‘Video instance segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5187–5196.
[122] S. Yang, Y. Fang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu, ‘‘Crossover learning for fast online video instance segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 8043–8052.
[123] S. Yang, X. Yu, and Y. Zhou, ‘‘LSTM and GRU neural network performance comparison study: Taking Yelp review dataset as an example,’’ in Proc. Int. Workshop Electron. Commun. Artif. Intell. (IWECAI), Jun. 2020, pp. 98–101.
[124] T. Yang and A. B. Chan, ‘‘Recurrent filter learning for visual tracking,’’ 2017, arXiv:1708.03874.
[125] Y. Yu, X. Si, C. Hu, and Z. Jianxun, ‘‘A review of recurrent neural networks: LSTM cells and network architectures,’’ Neural Comput., vol. 31, no. 7, pp. 1235–1270, Jul. 2019.
[126] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, ‘‘Weakly supervised object localization and detection: A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 20, 2021, doi: 10.1109/TPAMI.2021.3074313.
[127] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ‘‘Object detection with deep learning: A review,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[128] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. Change Loy, ‘‘Domain generalization in vision: A survey,’’ 2021, arXiv:2103.02503.
[129] H. Zhu, H. Wei, B. Li, X. Yuan, and N. Kehtarnavaz, ‘‘A review of video object detection: Datasets, metrics and methods,’’ Appl. Sci., vol. 10, no. 21, p. 7834, Nov. 2020.
[130] M. Zhu and M. Liu, ‘‘Mobile video object detection with temporally-aware feature maps,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5686–5695.
[131] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, ‘‘Flow-guided feature aggregation for video object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 408–417.
[132] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, ‘‘Deep feature flow for video recognition,’’ 2016, arXiv:1611.07715.
[133] I. Ševo and A. Avramović, ‘‘Convolutional neural network based automatic object detection on aerial images,’’ IEEE Geosci. Remote Sens. Lett., vol. 13, no. 5, pp. 740–744, May 2016.

ATHINA ILIOUDI received the joint M.S. degree in smart electrical networks and systems from the KTH Royal Institute of Technology and the Eindhoven University of Technology, in 2018. She is currently pursuing the Ph.D. degree with the Delft Center for Systems and Control, Delft University of Technology. Her research interests include combining deep learning methods with first-principles modeling techniques and physics-informed neural networks for computer vision applications.

AZITA DABIRI received the Ph.D. degree from the Automatic Control Group, Chalmers University of Technology, in 2016. She was a Postdoctoral Researcher with the Department of Transport and Planning, TU Delft, from 2017 to 2019. In 2019, she received an ERCIM Fellowship and a Marie Curie Individual Fellowship, which allowed her to perform research at the Norwegian University of Science and Technology (NTNU) as a Postdoctoral Researcher. In 2020, she joined the Delft Center for Systems and Control, TU Delft, as an Assistant Professor. Her research interests include the integration of model-based and learning-based control.

BEN J. WOLF received the Ph.D. degree (cum laude) in artificial intelligence from the University of Groningen, The Netherlands, in 2020, on the topic of hydrodynamic imaging. He is currently a Postdoctoral Researcher at the Delft Center for Systems and Control, Delft University of Technology. His research interests include machine learning, neural networks, robotics, and hydrodynamic sensing.

BART DE SCHUTTER (Fellow, IEEE) received the Ph.D. degree (summa cum laude) in applied sciences from Katholieke Universiteit Leuven, Belgium, in 1996. He is currently a Full Professor and the Head of Department at the Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. His current research interests include reinforcement learning, learning-based control, multi-level and multi-agent control, and control of hybrid systems. He is a Senior Editor of the IEEE Transactions on Intelligent Transportation Systems and an Associate Editor of the IEEE Transactions on Automatic Control.