3 - Deep Learning For Vision-Based
Research Proposal
Università di Trento
Department of Information Engineering and Computer Science
Doctoral Program
Title: PhD Computer Science
Reserved Topic
Scholarship Title: Deep Learning for Vision-Based
Scene Understanding
Topic Preference: 2
Research Area A B C D
Abstract
Visual understanding of the surroundings is a fundamental objective of Artificial Intelligence (AI) and
Robotics. Deep Learning has enabled computers to distinguish between different objects, scenes, and
landscapes in digital imagery. However, a Deep Neural Network (DNN) must be trained for each case,
and the scope of a computer's scene understanding is limited to the dataset provided to the Deep
Learning model. Outside of that dataset, the AI cannot distinguish related items by itself. To overcome
this limitation, this study introduces a novel approach in which the AI is taught the geometric
information of the scene, object, or element in digital media. A Geometric Deep Neural Network
(GDNN) will be used to extract both feature and structural information of the objects in the dataset,
whereas traditional Deep Learning learns feature information only. Through this structural information,
the AI will be able to generalize to similarly structured objects and scenes. A reinforcement reward will
be given on successfully understanding a scene, so the AI learns incrementally about different scenes
and objects.
Introduction
MOTIVATION
The scope of scene understanding is immense. Its applications range from semantic analysis of
social media images to the visual capabilities of robots. A domain-adaptive AI for scene
understanding would allow robots to learn about their surroundings from previously learned
datasets, and incremental learning would allow the AI to predict similarly structured objects.
Literature Review
Scene understanding is a broad area, and the many methods used for it are trained on widely
available datasets. Most of these datasets are specific to one task: rail scene understanding [6],
road scene understanding [7], [8], indoor scene understanding [9], [10], human crowd scene
understanding [11], post-flood scene understanding [11], human action recognition [1]–[3], etc.
Some benchmark datasets were introduced for the perception tasks of multi-object segmentation,
detection, and recognition. Microsoft COCO [5] is a leading, continuously evolving dataset with
330K images labeled with 80 object classes, 91 stuff classes, and 5 captions per image. Other
multi-object scene understanding datasets exist [12], but Microsoft COCO remains in the lead.
Most scene understanding today uses traditional machine learning algorithms and semantic
object analysis. X. Huang et al. [13] used a Support Vector Machine (SVM) to classify spatial
features of images for scene understanding. Similarly, the approach proposed by G. Cheng et al.
[14] constructs object-based relationships that can be related to real objects. These object-oriented
methods include segmentation and classification, and the object classification makes them more
effective. Deep Learning's automated feature extraction took scene understanding to the next
level: the extracted features are of higher quality, and classification becomes more accurate.
Nikita Dvornik et al. [15] took advantage of Deep Learning to create a real-time scene
understanding classifier, training a Deep Neural Network on the VOC and Microsoft COCO
datasets for scene classification in real time. Xiaoxu Liu et al. [16] proposed a Deep Learning
scene understanding approach for autonomous vehicles.
Several more scene understanding algorithms serve as benchmarks. Mohammad Javad Shafiee
et al. [17] proposed the Fast YOLO framework for real-time object detection, which is widely
used for detecting objects in real time. These scene understanding algorithms focus only on
qualitative scene understanding, through classification and segmentation of scenes. X. Zhang
et al. [18] proposed a quantitative classification method, measuring changes in the economic
structure of the Beijing and Zhuhai districts.
Research Questions
How can a semantic and structural feature network be generated using a Geometric Neural Network?
How can a classifier be created to classify similar structures?
How can a reinforcement agent be trained for incremental scene understanding?
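To make the first question concrete, the sketch below shows one graph-convolution step, H' = ReLU(A_hat · H · W), which is the standard way a graph neural network combines structural information (the adjacency matrix) with semantic node features. The tiny three-node graph, features, and identity weights are hypothetical examples, not part of the proposal; this is a minimal illustration, not the proposed GDNN.

```python
def matmul(a, b):
    """Plain-Python matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weights):
    """One graph-convolution layer: aggregate neighbour features over the
    self-looped, row-normalised adjacency, then apply a linear map and ReLU."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Row-normalise so aggregation averages rather than sums.
    for i in range(n):
        deg = sum(a_hat[i])
        a_hat[i] = [v / deg for v in a_hat[i]]
    h = matmul(matmul(a_hat, feats), weights)
    return [[max(0.0, v) for v in row] for row in h]

# A toy "object structure" graph: node 0 connected to nodes 1 and 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
feats = [[1.0, 0.0],    # per-node semantic features (hypothetical)
         [0.0, 1.0],
         [0.5, 0.5]]
weights = [[1.0, 0.0],  # identity weights, for readability
           [0.0, 1.0]]

out = gcn_layer(adj, feats, weights)  # 3 nodes x 2 structural-semantic features
```

Because the output of each node mixes its own features with those of its neighbours, stacked layers of this form encode the geometry of the graph, which is the structural signal the first research question asks for.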
Research Gap
Although there are several scene understanding classifiers available using different methods. However,
those methods are not incremental and cannot work outside of the scope of trained dataset. Available
methods usually do not keep structural information along with semantics. Proposed approach will be an
incremental, and domain adaptive method. The proposed method will able to understand objects in
scenes, which are not available in training dataset and are similar.
Problem Statement
There are many methods available for semantic and object classification in a scene. This allows
AI to understand the scene. However, those methods are not domain adaptive and the AI trained
on those classifiers cannot predict out of the trained dataset. Incremental learning from
surrounding of the observer agent is an important objective as, the available classifiers do not
classify similar structural objects and scenes. Through Geometric Neural Network, we can train
the classifier to classify objects and scene based upon similar structures and allow the AI to learn
more about surroundings incrementally.
Statement of Purpose
Study purpose is to develop a novel method, for incremental and domain adaptive modal, that can
learn incrementally about the scene and objects itself, without being trained on the same object. A
classifier which predicts and train itself with similar structural dataset and learn from experience.
Research Methodology
There are many ways to research out different techniques and methods but to find out the
answers to the research related questions apply the different research methods by which gives the
answers related to the research questions. To find out more facts and figures and identify
variables related to the research we will be using both qualitative and quantitative research
methods and then analyze the research problem by applying analytical research on it. The
purpose is to find a solution to the research problems. We can use either Microsoft COCO [5]
dataset, or VOC Object classes dataset [19] to train Geometric Deep Neural Network (GDNN)
for creating a feature network of semantic and structural information of the scenes. A classifier to
classify objects that are similar in structure. Different structures will be created using Generated
Adversarial Network (GAN) [20] and classifier will train on more similar structures. While
deciding on predicting the object or a scene, the classifier will compare the geometric structure
of the subject and identify the images by itself. Re-enforcement network will be used for training
the network.
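The classify-by-structure-and-reinforce loop described above can be sketched as follows. As a hedged stand-in for GDNN features, each object's structure is summarised by its graph's sorted degree sequence; the `IncrementalClassifier`, its class labels, and the reward bookkeeping are hypothetical illustrations of the proposed behaviour, not a committed design.

```python
def degree_sequence(adj):
    """Sorted node degrees: a crude structural signature of a graph."""
    return sorted(sum(row) for row in adj)

def structural_similarity(adj_a, adj_b):
    """1 / (1 + L1 distance) between degree sequences of equal-size graphs."""
    da, db = degree_sequence(adj_a), degree_sequence(adj_b)
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(da, db)))

class IncrementalClassifier:
    """Predicts by comparing geometric structure to stored class prototypes,
    and grows incrementally as rewarded examples are added."""

    def __init__(self):
        self.prototypes = {}  # class label -> list of structure graphs
        self.rewards = {}     # class label -> accumulated reward signal

    def predict(self, adj):
        best, best_sim = None, -1.0
        for label, graphs in self.prototypes.items():
            sim = max(structural_similarity(adj, g) for g in graphs)
            if sim > best_sim:
                best, best_sim = label, sim
        return best

    def learn(self, adj, label, reward=1.0):
        """Store the structure under its label and credit the reward."""
        self.prototypes.setdefault(label, []).append(adj)
        self.rewards[label] = self.rewards.get(label, 0.0) + reward

clf = IncrementalClassifier()
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # path graph, degrees [1, 1, 2]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # triangle,   degrees [2, 2, 2]
clf.learn(chain, "chain-like")
clf.learn(triangle, "ring-like")

# An unseen path graph is matched to the structurally similar class.
pred = clf.predict([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

The point of the sketch is the control flow, prediction by structural comparison followed by a reward update, rather than the degree-sequence similarity itself, which a trained GDNN would replace.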
Data Gathering:
Data gathering is a basic part of the research. We extract datasets covering different scenes with
the maximum number of object classes, i.e., benchmark datasets such as Microsoft COCO [5].
The VOC Object Classes dataset [19] is also publicly available with multiple object classes.
Moreover, some data for scene understanding will be generated manually. The complete
methodology, from data gathering to analysis and results, is described in the diagram below.
[Methodology diagram: Data Gathering → Pre-Processing → ...]
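As a small pre-processing illustration, the sketch below reads COCO-style JSON annotations (the "images", "annotations", and "category_id" fields follow the COCO annotation format) and counts the annotated instances per category, a typical first step when selecting classes from a benchmark dataset. The inline sample data is a made-up stand-in for a real annotation file.

```python
import json
from collections import Counter

def count_categories(annotation_json):
    """Count annotated object instances per category_id in a
    COCO-format annotation document."""
    data = json.loads(annotation_json)
    return Counter(ann["category_id"] for ann in data.get("annotations", []))

# A tiny inline stand-in for a real COCO annotation file.
sample = json.dumps({
    "images": [{"id": 1, "file_name": "scene_0001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3},
        {"id": 11, "image_id": 1, "category_id": 3},
        {"id": 12, "image_id": 1, "category_id": 7},
    ],
})
counts = count_categories(sample)  # e.g. two instances of category 3
```

Against the real COCO files, the same function would read the annotation JSON from disk; only the sample data here is fabricated.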
References:
[1] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions
Classes From Videos in The Wild,” ArXiv, no. November, 2012.
[2] C. Xu and J. J. Corso, Actor-Action Semantic Segmentation with Grouping-Process
Models. 2016.
[3] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” 2009 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR
Workshops 2009, vol. 2009 IEEE, no. i, pp. 2929–2936, 2009, doi:
10.1109/CVPRW.2009.5206557.
[4] L. Hoyer, D. Dai, and L. van Gool, “HRDA: Context-Aware High-Resolution Domain-
Adaptive Semantic Segmentation,” Apr. 2022, [Online]. Available:
http://arxiv.org/abs/2204.13132
[5] T.-Y. Lin et al., “LNCS 8693 - Microsoft COCO: Common Objects in Context,” 2014.
[6] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, S. Abbasi, and C. Beleznai,
“RailSem19: A Dataset for Semantic Rail Scene Understanding.” [Online]. Available:
www.wilddash.cc
[7] C. Sakaridis, D. Dai, L. van Gool, and E. Zürich, “ACDC: The Adverse Conditions
Dataset with Correspondences for Semantic Driving Scene Understanding.” [Online].
Available: https://acdc.vision.ee.ethz.ch
[8] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, “Toward Driving Scene
Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning.” [Online].
Available: https://usa.honda-ri.com/HDD
[9] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-Semantic Data for Indoor
Scene Understanding,” Feb. 2017, [Online]. Available: http://arxiv.org/abs/1702.01105
[10] M. Roberts et al., “Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene
Understanding.” [Online]. Available: http://github.com/apple/ml-hypersim
[11] A. Zheng, Y. Zhang, X. Zhang, X. Qi, and J. Sun, “Progressive End-to-End Object
Detection in Crowded Scenes.” [Online]. Available: https://github.com/megvii-model/Iter-
E2EDET.
[12] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An Image Database
for Deep Scene Understanding,” Oct. 2016, [Online]. Available:
http://arxiv.org/abs/1610.02055
[13] X. Huang, L. Zhang, and P. Li, “A multiscale feature fusion approach for classification of
very high resolution satellite imagery based on wavelet transform,” Int J Remote Sens, vol.
29, no. 20, pp. 5923–5941, Oct. 2008, doi: 10.1080/01431160802139922.
[14] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, “Effective and Efficient Midlevel
Visual Elements-Oriented Land-Use Classification Using VHR Remote Sensing Images,”
IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 8, pp. 4238–4249,
Aug. 2015, doi: 10.1109/TGRS.2015.2393857.
[15] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, “BlitzNet: A Real-Time Deep
Network for Scene Understanding.”
[16] X. Liu, “Vehicle-Related Scene Understanding Using Deep Learning.”
[17] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, “Fast YOLO: A Fast You Only Look Once
System for Real-time Embedded Object Detection in Video,” Sep. 2017, [Online].
Available: http://arxiv.org/abs/1709.05943
[18] X. Zhang and S. Du, “A Linear Dirichlet Mixture Model for decomposing scenes:
Application to analyzing urban functional zonings,” Remote Sens Environ, vol. 169, pp.
37–49, Nov. 2015, doi: 10.1016/j.rse.2015.07.017.
[19] M. Everingham, L. van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal
visual object classes (VOC) challenge,” Int J Comput Vis, vol. 88, no. 2, pp. 303–338, Jun.
2010, doi: 10.1007/s11263-009-0275-4.
[20] I. Goodfellow et al., “Generative adversarial nets,” Adv Neural Inf Process Syst, vol. 3, no.
January, pp. 2672–2680, 2014.