3 - Deep Learning For Vision-Based
Research Proposal
Università di Trento
Department of Information Engineering and Computer Science
Doctoral Program
Title: PhD Computer Science
Reserved Topic
Scholarship Title: Deep Learning for Vision-Based
Scene Understanding
Topic Preference: 2
Research Area A B C D
Abstract
Visual understanding of the surroundings is a fundamental objective of Artificial Intelligence (AI) and
Robotics. Deep Learning has enabled computers to distinguish between different objects, scenes, and
landscapes in digital imagery. However, a Deep Neural Network (DNN) must be trained for each case,
and the scope of a computer's scene understanding is limited to the dataset provided to the Deep
Learning model. Outside of that dataset, the AI cannot distinguish related items by itself. To overcome
this limitation, this study introduces a novel approach in which the AI is taught the geometric
information of the scene, object, or element in digital media. A Geometric Deep Neural Network
(GDNN) will be used to extract both feature and structural information of the objects in the dataset,
whereas traditional Deep Learning learns feature information only. Through this structural information,
the AI will be able to generalize to similarly structured objects and scenes. A reinforcement reward will
be given on successfully understanding a scene, so the AI learns incrementally about different scenes
and objects.
Introduction
MOTIVATION
The scope of scene understanding is immense. Its applications range from semantic analysis of
social media images to the visual capabilities of robots. A domain-adaptive AI for scene
understanding would allow robots to learn about their surroundings from previously learned
datasets, and incremental learning would allow the AI to predict similarly structured objects.
Literature Review
Scene understanding is a broad area, and the many methods used for it are trained on widely
available datasets. Most of these datasets are specific to one task: rail scene understanding [6],
road scene understanding [7], [8], indoor scene understanding [9], [10], human crowd scene
understanding [11], post-flood scene understanding [11], human action recognition [1]–[3], etc.
Some benchmark datasets were introduced for the perception tasks of multi-object segmentation,
detection, and recognition. Microsoft COCO [5] is a leading, continuously evolving dataset with
330K images labeled with 80 object classes, 91 stuff classes, and 5 captions per image. Other
multi-object scene understanding datasets exist [12], but Microsoft COCO remains in the lead.
Most scene understanding today uses traditional machine learning algorithms and semantic
object analysis. X. Huang et al. [13] used a Support Vector Machine (SVM) to classify spatial
features of images for scene understanding. Similarly, the approach proposed by G. Cheng et al.
[14] constructs object-based relationships that can be related to real objects. These object-oriented
methods include segmentation and classification, and the object classification makes them more
effective. Deep Learning's automated feature extraction took scene understanding to the next
level: the extracted features are of higher quality, and classification becomes more accurate.
Nikita Dvornik et al. [15] took advantage of Deep Learning to create a real-time scene
understanding classifier, training a Deep Neural Network on the VOC and Microsoft COCO
datasets for scene classification in real time. Xiaoxu Liu et al. [16] proposed a Deep Learning
scene understanding approach for autonomous vehicles.
Several more scene understanding algorithms serve as benchmarks. Mohammad Javad Shafiee
et al. [17] proposed the Fast YOLO framework for real-time object detection, which is widely
used for detecting objects in real time. These scene understanding algorithms focus only on
qualitative scene understanding, through classification and segmentation of scenes. X. Zhang
et al. [18] proposed a quantitative classification method, measuring changes in the economic
structure of the Beijing and Zhuhai districts.
Research Questions
How can a semantic and structural feature network be generated using a Geometric Neural Network?
How can a classifier be created to classify similar structures?
How can a reinforcement agent be trained for incremental scene understanding?
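To make the first question concrete, the sketch below shows one graph-convolution step, H' = ReLU(A_hat · H · W), which is the standard way a graph neural network combines structural information (the adjacency matrix) with semantic node features. The tiny three-node graph, features, and identity weights are hypothetical examples, not part of the proposal; this is a minimal illustration, not the proposed GDNN.

```python
def matmul(a, b):
    """Plain-Python matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weights):
    """One graph-convolution layer: aggregate neighbour features over the
    self-looped, row-normalised adjacency, then apply a linear map and ReLU."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Row-normalise so aggregation averages rather than sums.
    for i in range(n):
        deg = sum(a_hat[i])
        a_hat[i] = [v / deg for v in a_hat[i]]
    h = matmul(matmul(a_hat, feats), weights)
    return [[max(0.0, v) for v in row] for row in h]

# A toy "object structure" graph: node 0 connected to nodes 1 and 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
feats = [[1.0, 0.0],    # per-node semantic features (hypothetical)
         [0.0, 1.0],
         [0.5, 0.5]]
weights = [[1.0, 0.0],  # identity weights, for readability
           [0.0, 1.0]]

out = gcn_layer(adj, feats, weights)  # 3 nodes x 2 structural-semantic features
```

Because the output of each node mixes its own features with those of its neighbours, stacked layers of this form encode the geometry of the graph, which is the structural signal the first research question asks for.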
Research Gap
Although there are several scene understanding classifiers available using different methods. However,
those methods are not incremental and cannot work outside of the scope of trained dataset. Available
methods usually do not keep structural information along with semantics. Proposed approach will be an
incremental, and domain adaptive method. The proposed method will able to understand objects in
scenes, which are not available in training dataset and are similar.
Problem Statement
There are many methods available for semantic and object classification in a scene. This allows
AI to understand the scene. However, those methods are not domain adaptive and the AI trained
on those classifiers cannot predict out of the trained dataset. Incremental learning from
surrounding of the observer agent is an important objective as, the available classifiers do not
classify similar structural objects and scenes. Through Geometric Neural Network, we can train
the classifier to classify objects and scene based upon similar structures and allow the AI to learn
more about surroundings incrementally.
Statement of Purpose
Study purpose is to develop a novel method, for incremental and domain adaptive modal, that can
learn incrementally about the scene and objects itself, without being trained on the same object. A
classifier which predicts and train itself with similar structural dataset and learn from experience.
Research Methodology
There are many ways to research out different techniques and methods but to find out the
answers to the research related questions apply the different research methods by which gives the
answers related to the research questions. To find out more facts and figures and identify
variables related to the research we will be using both qualitative and quantitative research
methods and then analyze the research problem by applying analytical research on it. The
purpose is to find a solution to the research problems. We can use either Microsoft COCO [5]
dataset, or VOC Object classes dataset [19] to train Geometric Deep Neural Network (GDNN)
for creating a feature network of semantic and structural information of the scenes. A classifier to
classify objects that are similar in structure. Different structures will be created using Generated
Adversarial Network (GAN) [20] and classifier will train on more similar structures. While
deciding on predicting the object or a scene, the classifier will compare the geometric structure
of the subject and identify the images by itself. Re-enforcement network will be used for training
the network.
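The classify-by-structure-and-reinforce loop described above can be sketched as follows. As a hedged stand-in for GDNN features, each object's structure is summarised by its graph's sorted degree sequence; the `IncrementalClassifier`, its class labels, and the reward bookkeeping are hypothetical illustrations of the proposed behaviour, not a committed design.

```python
def degree_sequence(adj):
    """Sorted node degrees: a crude structural signature of a graph."""
    return sorted(sum(row) for row in adj)

def structural_similarity(adj_a, adj_b):
    """1 / (1 + L1 distance) between degree sequences of equal-size graphs."""
    da, db = degree_sequence(adj_a), degree_sequence(adj_b)
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(da, db)))

class IncrementalClassifier:
    """Predicts by comparing geometric structure to stored class prototypes,
    and grows incrementally as rewarded examples are added."""

    def __init__(self):
        self.prototypes = {}  # class label -> list of structure graphs
        self.rewards = {}     # class label -> accumulated reward signal

    def predict(self, adj):
        best, best_sim = None, -1.0
        for label, graphs in self.prototypes.items():
            sim = max(structural_similarity(adj, g) for g in graphs)
            if sim > best_sim:
                best, best_sim = label, sim
        return best

    def learn(self, adj, label, reward=1.0):
        """Store the structure under its label and credit the reward."""
        self.prototypes.setdefault(label, []).append(adj)
        self.rewards[label] = self.rewards.get(label, 0.0) + reward

clf = IncrementalClassifier()
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # path graph, degrees [1, 1, 2]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # triangle,   degrees [2, 2, 2]
clf.learn(chain, "chain-like")
clf.learn(triangle, "ring-like")

# An unseen path graph is matched to the structurally similar class.
pred = clf.predict([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

The point of the sketch is the control flow, prediction by structural comparison followed by a reward update, rather than the degree-sequence similarity itself, which a trained GDNN would replace.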
Data Gathering:
Data gathering is a basic part of the research. We extract datasets covering different scenes with
the maximum number of object classes, i.e., benchmark datasets such as Microsoft COCO [5].
The VOC Object Classes dataset [19] is also publicly available with multiple object classes.
Moreover, some data for scene understanding will be generated manually. The complete
methodology, from data gathering to analysis and results, is described in the diagram below.
[Methodology diagram: Data Gathering → Pre-Processing → ...]
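As a small pre-processing illustration, the sketch below reads COCO-style JSON annotations (the "images", "annotations", and "category_id" fields follow the COCO annotation format) and counts the annotated instances per category, a typical first step when selecting classes from a benchmark dataset. The inline sample data is a made-up stand-in for a real annotation file.

```python
import json
from collections import Counter

def count_categories(annotation_json):
    """Count annotated object instances per category_id in a
    COCO-format annotation document."""
    data = json.loads(annotation_json)
    return Counter(ann["category_id"] for ann in data.get("annotations", []))

# A tiny inline stand-in for a real COCO annotation file.
sample = json.dumps({
    "images": [{"id": 1, "file_name": "scene_0001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3},
        {"id": 11, "image_id": 1, "category_id": 3},
        {"id": 12, "image_id": 1, "category_id": 7},
    ],
})
counts = count_categories(sample)  # e.g. two instances of category 3
```

Against the real COCO files, the same function would read the annotation JSON from disk; only the sample data here is fabricated.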
References:
[1] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions
Classes From Videos in The Wild,” ArXiv, no. November, 2012.
[2] C. Xu and J. J. Corso, Actor-Action Semantic Segmentation with Grouping-Process
Models. 2016.
[3] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” 2009 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR
Workshops 2009, vol. 2009 IEEE, no. i, pp. 2929–2936, 2009, doi:
10.1109/CVPRW.2009.5206557.
[4] L. Hoyer, D. Dai, and L. van Gool, “HRDA: Context-Aware High-Resolution Domain-
Adaptive Semantic Segmentation,” Apr. 2022, [Online]. Available:
http://arxiv.org/abs/2204.13132
[5] T.-Y. Lin et al., “LNCS 8693 - Microsoft COCO: Common Objects in Context,” 2014.
[6] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, S. Abbasi, and C. Beleznai,
“RailSem19: A Dataset for Semantic Rail Scene Understanding.” [Online]. Available:
www.wilddash.cc
[7] C. Sakaridis, D. Dai, L. van Gool, and E. Zürich, “ACDC: The Adverse Conditions
Dataset with Correspondences for Semantic Driving Scene Understanding.” [Online].
Available: https://acdc.vision.ee.ethz.ch
[8] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, “Toward Driving Scene
Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning.” [Online].
Available: https://usa.honda-ri.com/HDD
[9] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-Semantic Data for Indoor
Scene Understanding,” Feb. 2017, [Online]. Available: http://arxiv.org/abs/1702.01105
[10] M. Roberts et al., “Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene
Understanding.” [Online]. Available: http://github.com/apple/ml-hypersim
[11] A. Zheng, Y. Zhang, X. Zhang, X. Qi, and J. Sun, “Progressive End-to-End Object
Detection in Crowded Scenes.” [Online]. Available: https://github.com/megvii-model/Iter-
E2EDET.
[12] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An Image Database
for Deep Scene Understanding,” Oct. 2016, [Online]. Available:
http://arxiv.org/abs/1610.02055
[13] X. Huang, L. Zhang, and P. Li, “A multiscale feature fusion approach for classification of
very high resolution satellite imagery based on wavelet transform,” Int J Remote Sens, vol.
29, no. 20, pp. 5923–5941, Oct. 2008, doi: 10.1080/01431160802139922.
[14] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, “Effective and Efficient Midlevel
Visual Elements-Oriented Land-Use Classification Using VHR Remote Sensing Images,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 8, pp. 4238–4249,
Aug. 2015, doi: 10.1109/TGRS.2015.2393857.
[15] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, “BlitzNet: A Real-Time Deep
Network for Scene Understanding.”
[16] X. Liu, “Vehicle-Related Scene Understanding Using Deep Learning.”
[17] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A Fast You Only Look Once
System for Real-time Embedded Object Detection in Video,” Sep. 2017, [Online].
Available: http://arxiv.org/abs/1709.05943
[18] X. Zhang and S. Du, “A Linear Dirichlet Mixture Model for decomposing scenes:
Application to analyzing urban functional zonings,” Remote Sens Environ, vol. 169, pp.
37–49, Nov. 2015, doi: 10.1016/j.rse.2015.07.017.
[19] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal
visual object classes (VOC) challenge,” Int J Comput Vis, vol. 88, no. 2, pp. 303–338, Jun.
2010, doi: 10.1007/s11263-009-0275-4.
[20] I. Goodfellow et al., “Generative adversarial nets,” Adv Neural Inf Process Syst, vol. 3, no.
January, pp. 2672–2680, 2014.