
Proceedings of the Second International Conference on Electronics and Sustainable Communication Systems (ICESC-2021)

IEEE Xplore Part Number: CFP21V66-ART; ISBN: 978-1-6654-2867-5

Graph Neural Network (GNN) in Image and Video Understanding Using Deep Learning for Computer Vision Applications

Pradhyumna P, Shreya G P, Mohana
Electronics and Telecommunication Engineering, RV College of Engineering®, Bengaluru, Karnataka, India.

2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC) | 978-1-6654-2867-5/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICESC51422.2021.9532631

Abstract— Graph neural networks (GNNs) are information-processing systems that use message passing among graph nodes. In recent years, GNN variants including the graph attention network (GAT), graph convolutional network (GCN), and graph recurrent network (GRN) have shown revolutionary performance in computer vision applications using deep learning and artificial intelligence. These neural network extensions collect information in the form of graphs. GNNs may be divided into three groups based on the challenges they solve: link prediction, node classification, and graph classification. Machines can differentiate and recognise objects in images and video using standard CNNs, but an extensive amount of research work remains before robots can have the same visual intuition as humans. GNN architectures, on the other hand, may be used to solve various image categorization and video challenges. The number of GNN applications in computer vision is not limited and continues to expand: human-object interaction, action understanding, image categorization from a few shots, and many more. In this paper, the use of GNNs in image and video understanding, design aspects, architecture, applications, and implementation challenges towards computer vision are described. GNN is a strong tool for analysing graph data and is still a relatively active area that needs further research attention to solve many computer vision applications.

Keywords— Graph Neural Networks (GNNs), Convolutional Neural Network (CNN), Gated Adversarial Transformer (GAT), Atomic Visual Action (AVA), Human-Object Interactions (HOI), Graph Parsing Neural Network (GPNN).

I. INTRODUCTION

Graphs are one of the most useful data structures for various application areas such as learning cell fingerprints, exploring traffic networks, recommending companions in interpersonal interaction, and modeling physical systems. These activities involve non-Euclidean graphs containing details of interpersonal relationships, which can be mishandled by traditional deep learning models. Nodes in graphs regularly contain valuable data that is ignored by largely unsupervised learning techniques. GNNs are proposed to consolidate the graph structure and feature details in order to pursue better representations on graphs, integrating structure and feature information. Because of their convincing performance and high interpretability, GNNs have of late become broadly utilized. GNNs can model the relationships among the nodes of a graph and generate a numerical representation of it. GNNs are particularly essential since so much factual information can be expressed as graphs: social networks, chemical compounds, maps, and transportation systems are just a few examples. Advancements in DL have driven progress in a variety of image comprehension tasks, including image categorization, object recognition, and semantic segmentation. Deep learning algorithms' capacity to learn at many levels of abstraction from data accounts for this resounding success. Different levels of abstraction are required for different activities. The generalized image classification challenge is to determine which object classes are included in a given image (usually from a collection of predefined classes). Image classification methodology based on GNN models is evolving, since GNNs, which derive their motivation from CNNs, are used in this domain. Given a large training dataset of labelled classes, many of these models, including GNNs, provide interesting results. Recent video understanding datasets such as AVA and Charades, on the other hand, have lagged behind in comparison. Understanding the interactions between actors, objects, and other context in a scene is one of the many reasons why video comprehension is so important. Furthermore, because these interactions aren't always visible in a single frame, reasoning over large time intervals is required. Because video has an additional temporal axis, it carries a much higher-dimensional signal than single images, and learning these unlabeled interactions directly from current datasets with huge convolutional networks is not feasible. Effective video comprehension necessitates long-term reasoning about the links between objects, actors, and their surroundings. Because graph-structured data is ubiquitous, it can be employed in a wide range of scenarios.

II. LITERATURE SURVEY

Anurag Arnab [1] describes how to make proposals when supervision is not available; they suggest a message-passing GNN that can use explicit representations of objects and clearly represents their spatio-temporal interconnections. This strategy is demonstrated on two different challenges in video spatio-temporal action identification that require relational thinking. It also demonstrates how this method may more successfully model relationships between significant things in images, numerically and qualitatively. Natural video events are usually the outcome of spatio-temporal engagements among actors and objects, and they frequently involve a large number of object types. Yubo Zhang [2] proposes a study that


argues that action detection is a difficult problem to solve, since the models which must be trained are massive and obtaining tagged data is costly. To overcome this constraint, they recommend incorporating domain knowledge into the model's structure to make optimization easier. Santiago Castro [3] notes that the majority of research work on language-assisted video understanding has centered on two tasks: first, the use of multiple-choice questions for video question answering, where models perform well because candidate solutions are easily available; second, video captioning, which uses an open-ended assessment framework. They suggest fill-in-the-blanks as a video comprehension assessment framework that corrects previous assessment problems and more closely matches real-life circumstances in which many possibilities are unavailable. Using the associated text and video, the model must predict a concealed noun phrase inside the video description, which assesses the system's knowledge of the film. The dataset is built from the VATEX dataset, with blanked captions generated by masking noun phrases in the English captions of VATEX. To construct an instance, they select the first English caption that usually contains one noun phrase as recognized by spaCy, then blank these nouns at random. As a result, they start with the VATEX v1.1 training set and a randomized subset of size 1000 from the test dataset, respectively, to construct the training, validation, and testing data. To acquire additional correct answers for each blank in the validation and test sets, they used a crowd annotation technique. The key aim of gathering such additional annotations, as previously said, is to account for the diversity of words and to have several alternatives for each blank. K. Sasabuchi [4] presents a learning-from-observation framework for extracting precise action sequences from a video of a human demonstration, split and understood with vocal instructions. Splitting is based on local minima in hand velocity, which link human daily movements with the object-centered contact transitions required for robot motion generation. They first established that hand-velocity motion splitting is a reliable signal for partitioning daily tasks. Second, they generated a new motion description dataset with the goal of better understanding everyday human actions. Building on attention-based models, the researchers also developed the Gated Adversarial Transformer (GAT). To increase the model's performance even further, they applied adversarial training methodologies: a regularization term was added to the loss function, giving the model adversarial robustness in both the attention mappings and the final output space. Matthew Hutchinson [5] describes video understanding as a natural extension of deep learning research in computer vision. The application of artificial neural network (ANN) machine learning (ML) approaches has tremendously benefitted the field of image interpretation. They grouped deep learning model building blocks and state-of-the-art model families, and specified standard metrics for assessing models. They also listed datasets suitable as benchmarks and pre-training sources, and discussed data preparation stages and techniques. Ishan Dave [6] proposes a temporal contrastive learning framework that outperforms state-of-the-art results in a variety of downstream video comprehension tasks, including action recognition, limited-label action classification, and nearest-neighbor video retrieval, on several videos and datasets. On three different datasets, this paper gives comprehensive experimental evidence and achieves best-in-class results in a range of downstream video interpretation tasks. Beyond instance discrimination, the success of the methodology demonstrates the advantages of contrastive learning. A. Gupta [7] proposes a simple idea which offers world features, a basic concept in which each feature at each layer has its own spatial transformation, and the feature map is only altered when needed. Results show that a network created with these world features may be utilised to mimic eye movements, including saccades, fixation, and smooth pursuit, on pre-recorded video in a batch environment. It finds that numerous eye movements are achievable, allowing for a wide range of augmentations without sacrificing relative feature position. They utilised these concepts to supply the model with a transformation for each video in the experiments, but learnt transformations might also be employed. Yubo Zhang [8] proposes the notion that action detection is a difficult problem to solve, based on the fact that the models to be trained are large and obtaining labelled data is costly. To overcome this issue, they suggest incorporating domain expertise into the model's structure to make optimization easier. This model surpasses existing best practices, proving the method's utility in modelling spatial correlation and reasoning about linkages. Most significantly, the performance of this model highlights the need to incorporate relational and temporal knowledge into design methods for action detection. H. Huang [9] describes a new dynamic hidden graph module for modelling complicated object-object interactions in videos, with two instantiations: a visual graph for capturing appearance/motion changes among objects, and a location graph to capture relative spatiotemporal position changes among objects. The suggested graph module may explicitly capture interactions among objects in streaming video contexts by evaluating object relations in both the time domain and the frequency domain at the same time, which distinguishes the work from prior techniques. Y. Chen [10] presents a novel approach that aggregates a collection of features globally across the coordinate space before moving to an interaction space in which relational reasoning may be computed successfully. J. Zhou [11] describes many learning tasks that involve working with graph data, which offers a wealth of relational information between parts. A model that can learn from graph inputs is required for simulating physical systems, establishing fingerprints, predicting protein interfaces, and diagnosing illnesses. They did a thorough examination of graph neural networks, dividing GNN models into variants based on compute modules, graph kinds, and training kinds. They also describe a number of general frameworks and present a number of theoretical studies. F. Scarselli [12] proposed the GNN model, which extends existing neural network approaches for processing data to the graph domain. The GNN model can handle most common graph topologies directly, including acyclic, cyclic, directed, and undirected graphs. The paradigm revolves around information diffusion and relaxation processes. The generic framework, prior generative methods for structured data processing, and approaches based on random-walk models are all incorporated into the approach. K. Simonyan [13] proposes the two-stream ConvNet design that includes both spatial and


temporal networks. First, they show that despite minimal training samples, the results of a ConvNet trained on multi-frame dense optical flow can be very good. Finally, they demonstrate how multitask learning may be utilised to increase the amount of data exploited from two separate action classification datasets. R. Girdhar [14] introduces a method for recognizing and localizing human actions in video footage using the Action Transformer model. They employ a transformer architecture to collect characteristics from the spatiotemporal environment around the individual whose behaviour is being classified. They demonstrated that the Action Transformer network can acquire spatiotemporal information from other human behaviour and items inside a film clip and use it to recognize and localize human activities. Siyuan Qi [15] describes the challenge of identifying and distinguishing human-object interactions (HOI) in photos and videos. A Graph Parsing Neural Network (GPNN) is introduced, an end-to-end differentiable architecture that incorporates structural information. They test this model on various datasets such as V-COCO, HICO-DET and CAD-120, which are all HOI recognition standards on photos and videos. This technique outperforms current methods, indicating that GPNN is adaptable to large datasets and can be used in both spatial and temporal scenarios. Boncelet [16] describes testing of the proposed method's performance on two chosen picture understanding tasks: emotion recognition at the group level and incident identification. Both tasks are extremely meaningful, and synthesizing multiple cues necessitates the interplay of numerous deep models. Understanding an image includes not just recognising the items in it, but also grasping their fundamental relationships and interconnections. GNNs may take advantage of such linkages during the feature learning and forecasting phases by spreading nodal messages through the network and aggregating the outputs. Danfei Xu [17] suggests the use of scene graphs, a visually grounded graphical framework for a picture, to formally model objects and their interactions. They also propose a revolutionary end-to-end paradigm for creating structured scene representations from an input image. They developed a novel end-to-end model that solves the challenge of automatically constructing a visually grounded scene graph from an image by continuous passing of messages between the primal and dual sub-graphs along the topological structure of the scene graph. Yubo Zhang [18] describes action detection as an example of a difficult problem: the models that must be trained are enormous, yet labelled data is difficult to come by. To overcome this constraint, they recommend incorporating domain knowledge into the model's structure to make optimization easier. The suggested methodology outperforms the I3D baseline by 5.5 percent mAP and by 4.8 percent mAP on the AVA dataset.

III. GRAPH NEURAL NETWORK MODEL AND ARCHITECTURE FOR IMAGE AND VIDEO UNDERSTANDING

Image categorization is a classic computer vision problem, with convolutional neural networks (CNNs) being the most prominent approach. GNNs, which get their motivation from CNNs, have been used in this arena as well. The main goal is to improve the performance of models on zero-shot and few-shot learning tasks. Zero-shot learning (ZSL) is the process of training a model to recognize classes it has never seen before. The key to ZSL image categorization is to exploit structural information, and as a result GNNs appear highly tempting in this regard. The information needed to guide the ZSL task may be found in knowledge graphs; the type of information that each technique represents in the graph varies. Graphs of this kind may be built on commonalities between the photos themselves, or between the objects recovered using object recognition in the photos. Semantic information from embeddings of the image class labels may also be included in the graphs. GNNs may then be used to enhance the ZSL picture classification-recognition process by applying them to this structured data. The process of creating a label for a video based on its frames is video classification. A strong video-level classification not only delivers correct frame labels, but also best represents the entire movie based on the characteristics and annotation of the individual frames. Paper [1] proposes a message-passing GNN for spatio-temporal interactions; for object representations it uses explicit objects if monitoring is available, else implicit objects are used. Their approach broadens earlier structured models for video comprehension, allowing investigation of how varied graph representation and structure choices impact the model's performance. It shows how to apply the strategy to two separate tasks in videos that require relational reasoning: on AVA and UCF101-24 it uses a spatio-temporal relation action detection model, and on the recently released Action Genome dataset it performs video scene graph categorization. It also demonstrates that this strategy may more successfully model relationships between significant things in the picture, both numerically and qualitatively.

Fig.1. Graph-structured data representation [19]

Figure 1 shows graph-structured data for understanding the activity in a video. The visual ST graph, with unique edge types (actor-to-actor temporal, object-to-actor spatial, etc.) and unique node types, is a heterogeneous graph with varied semantics and dimensions. First, it describes objects' and actors' visual spatio-temporal interactions. Second, co-occurrences, for example, are a common connection between labels. These signals can be represented visually and


symbolically in mixed spatial-temporal and symbolic attributed graphs. This hybrid graph is used to conduct supervised learning on recognized semantic elements, such as objects and actors, to create perception models that may be used to tackle subsequent video processing tasks [19].

Fig.2. Action detection in video sequence [2]

Figure 2 shows action detection in a video sequence (AVA dataset): a person rising from their seat and collecting a letter from some other person seated beside a table. For recognising and localising this activity out of the 2359296 pixels inside the 36 frames of this snippet, the actor's motion, location, and interactions with other actors and context are all important indications. The rest of the video's data, such as the wall color or the light on the table, is extraneous and should be ignored. Action region detection builds on such intuitive insights: it is vital to collect both deep temporal features and spatial interactions across actors and objects when detecting actions [2].

Fig.3. Action understanding [5]

Action issues, video action data, data processing approaches, deep learning models, and assessment measures all fall under the umbrella of action understanding, as shown in figure 3 [5]. The ideas of computing performance, data variety, traceability, model resilience, and readability underpin these processes in computer vision and deep learning. It summarizes the action phases (dataset selection, problem formation, model construction, dataset preparation, and metrics-based evaluation) as well as core assumptions (data diversity, computational performance, robustness, transferability, and understandability).

Fig.4. Object-to-object interaction [9]

Figure 4 shows object-to-object interaction. Mainly two relations should be considered for recognising such interactions: first, interactions between various objects within a single frame; second, transitions of such interactions between different items, and of the same item, across successive frames [9]. The former is referred to as a spatial relationship, whereas the latter is referred to as a temporal relationship. Both are necessary for recognising multi-object operations. An efficient recognition model will accurately and concurrently capture both relationships.

Fig.5. Action detection and interaction framework [2]

Figure 5 shows the action detection and interaction framework. It receives video frames and runs them through an I3D network; concurrently, each frame is subjected to an object identification model to generate person and object confidence scores. Tubelets are created by combining person bounding boxes. Following this, tubelets and object pieces (as nodes) are utilised to create an actor-centric graph for each actor in the video [2].

Fig.6. 2D and 3D convolution [5]

The backbone for many of the state-of-the-art models is a CNN: 1-dimensional CNNs (C1D) use 1D kernels, 2-dimensional CNNs (C2D) use 2D kernels, and 3-dimensional CNNs (C3D) use 3D kernels. C1D is generally used for


convolutions of embedded-type features along the time dimension, whilst C2D and C3D are used to extract feature vectors from single frames or stacked frames. Figure 6 shows single-channel examples of 2-dimensional and 3-dimensional convolutions. Accurate hyperspectral image classification has been a crucial yet difficult task; convolutional neural networks (CNNs) in two dimensions (2D) and three dimensions (3D) have been used to collect spectral or spatial information in multispectral photographs [5].

A graph is a form of structured arrangement of information that represents objects and their relationships. New research on graph analysis using ML has aroused a lot of interest because graphs have such high expressive potential. GNNs are deep learning based algorithms in the graph field. GNN has lately gained popularity as a graph processing system due to its superior performance.
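The message passing underlying all of these models (each node repeatedly exchanging and aggregating information with its neighbours, as described in the introduction) can be sketched in a few lines. The following is an illustrative toy in plain Python with fixed mixing weights rather than learned parameters, not the implementation of any particular GNN variant surveyed here:

```python
# Minimal sketch of one GNN message-passing layer (illustrative only):
# each node aggregates its neighbours' feature vectors (element-wise mean)
# and blends the result with its own features to produce an updated state.

def message_passing_layer(features, adjacency, w_self=0.5, w_neigh=0.5):
    """features: {node: [floats]}, adjacency: {node: [neighbour nodes]}."""
    updated = {}
    for node, feat in features.items():
        neighbours = adjacency.get(node, [])
        if neighbours:
            # aggregate: element-wise mean of neighbour features
            agg = [sum(features[n][i] for n in neighbours) / len(neighbours)
                   for i in range(len(feat))]
        else:
            agg = [0.0] * len(feat)
        # update: weighted combination of self features and aggregated message
        updated[node] = [w_self * s + w_neigh * a for s, a in zip(feat, agg)]
    return updated

# A 3-node path graph a - b - c with 2-dimensional node features.
feats = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
out = message_passing_layer(feats, adj)
# Node "b" now mixes its own features with the mean of "a" and "c".
```

Stacking several such layers lets information from multi-hop neighbourhoods reach each node, which is what gives GCN-style models their relational power over per-node classifiers.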

Fig.7. Genetic model for GNN architecture [11]

Figure 7 shows the genetic model of a GNN architecture. An initial population of GNN structures (S0) is initialized first, with every individual being a multilayer graph neural network in which every layer is made up of randomly chosen components, such as the activation function, hidden embedding size, and aggregator. With respect to S0, the GNN parameter population (P0) is then initialized, setting the parameters that evolve towards the best fit (e.g., learning rate and dropout rate). Following that, the structures are optimized using the best parameter setting from P0, evolving the architecture from S0 to S1 [11]. After each round of alternating evolution between structure and parameters, it produces a GNN architecture with ideal design and optimum parameter settings from Si and Pi. Six states of the GNN architecture are encoded: hidden dimension, attention head, attention function, activation function, aggregation function, and skip connection.

Fig.8. Design pipeline of GNN model [11]

Figure 8 shows the design pipeline for a GNN model. The propagation module is one of the most often utilized computational modules. Data is propagated across nodes using the propagation module, which allows the aggregation to record both topology and feature information. Convolution and recurrent operators are typically employed in propagation modules to gather information from neighbours, while the skip-connection operation is typically employed to acquire detailed information from previous node representations. Sampling modules are frequently required to carry out propagation on large graphs, and sampling and propagation modules are frequently combined. Pooling modules are used to extract information from nodes when high-level representations of subgraphs or whole graphs are needed [11]. A GNN model is usually developed by mixing these computing components: the recurrent operator, convolutional operator, skip connection, and sampling module spread information within individual layers, and pooling modules are added to retrieve high-level information, as shown in figure 8. To get better representations, these layers are generally stacked. The architecture can also express GNN variants and outliers, such as NDCN, which mixes GNNs with ordinary differential equation systems.

Transformers in video understanding: in Action Transformers, 3D CNN features are pooled and delivered to a transformer head to exploit the spatio-temporal data. By combining the patches generated per frame at several time-steps, a transformer may be used to classify video. GAT models weigh the significance of a frame based on local and global context using an intra-attention gate. This allows the network to comprehend the video at multiple levels of granularity.
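As a concrete, deliberately simplified sketch of the pipeline just described, the fragment below composes the named modules (a propagation step, a skip connection, and a pooling readout) over scalar node features. The function names and the 0.5 blending weight are illustrative assumptions, not part of any surveyed framework, where these would be learned operators:

```python
# Illustrative composition of the GNN design pipeline described above:
# propagation (neighbour mean) -> skip connection -> graph-level pooling.
# Weights are fixed here; real frameworks use learned parameters.

def propagate(h, adjacency):
    """One propagation step: each node averages its neighbours' scalars
    (assumes every node has at least one neighbour)."""
    return {v: sum(h[u] for u in adjacency[v]) / len(adjacency[v])
            for v in h}

def skip_connect(h_new, h_prev):
    """Skip connection: blend the new representation with the previous one."""
    return {v: 0.5 * (h_new[v] + h_prev[v]) for v in h_new}

def pool(h):
    """Pooling module (readout): mean over all node representations."""
    return sum(h.values()) / len(h)

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # path graph a - b - c
h0 = {"a": 1.0, "b": 0.0, "c": 2.0}

h1 = skip_connect(propagate(h0, adj), h0)   # layer 1
h2 = skip_connect(propagate(h1, adj), h1)   # layer 2 (stacked)
graph_embedding = pool(h2)                  # graph-level representation
```

The final `graph_embedding` is the kind of fixed-size summary a graph classification head would consume, while the per-node values in `h2` would feed node classification or link prediction.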

IV. USAGE OF GNN IN VARIOUS DOMAINS

Practical applications of graph neural networks include traffic control [31] [32], human behavior detection [24] [25], adversarial attack prevention, recommender systems, program verification, logical reasoning, molecular structure study, and social influence prediction. Most GNN architectures can be classed as structural or non-structural depending on the information they process. Following are some intriguing applications from each category. In the graph-like structure of nano-scale molecules, the nodes are ions and the edges are the bonds connecting them. In both cases, GNNs can be used to learn about existing molecular structures and to discover novel chemical structures; this would have a great effect on the development of computer-assisted medication. CNNs are the most popular machine learning techniques, with remarkable answers to image categorization, a classic computer vision problem [20] [21] [22] [23]. GNNs can then be used to enhance the ZSL picture classification-recognition task by applying them to this structured data. Text, like images, does not have an obvious relational structure.
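The molecular setting above can be made concrete with a toy encoding. This is hypothetical and chemically simplified (bond orders, charges, and real feature vectors are omitted); it only shows the node-and-edge structure a GNN would consume:

```python
# Hypothetical, chemically simplified encoding of a molecule as a graph,
# as described above: nodes are atoms, edges are bonds. A GNN would run
# message passing over exactly this structure; here we only derive
# trivial degree features.

atoms = {0: "C", 1: "O", 2: "H", 3: "H"}   # a formaldehyde-like toy molecule
bonds = [(0, 1), (0, 2), (0, 3)]           # undirected bond list

# Build an adjacency list from the bond list.
adjacency = {i: [] for i in atoms}
for u, v in bonds:
    adjacency[u].append(v)
    adjacency[v].append(u)

# A trivial per-node feature: (element symbol, number of bonded neighbours).
node_features = {i: (atoms[i], len(adjacency[i])) for i in atoms}
# The carbon atom (node 0) has three bonded neighbours in this toy graph.
```

A message-passing layer over `adjacency` would then let each atom's representation reflect its bonded neighbourhood, which is the basis of the fingerprint-learning and drug-discovery applications mentioned above.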

V. APPLICATIONS OF GNN IN COMPUTER VISION

Some of the applications of GNN in computer vision are [26] [27] [28] [29] [30] [33] [34]:
• Object Localization
• Human-Object Interactions
• Question Answering
• Object Detection
• Feature Learning
• Image Classification
• Relationships in a Photo
• Visual Question Answering
• Action Recognition
• Point Clouds
• 3D Classification and Segmentation
• RGBD Semantic Segmentation
• Situation Recognition
• Social Relationship Understanding
• Zero-Shot Action Recognition

Graph neural networks can analyze input graphs directly, thereby incorporating their connectivity into the learning objective. Most popular approaches to graph learning are instead based on a preprocessing stage that translates each graph into a simpler data type, such as a vector or a sequence of reals.

Human-object interactions: GPNN repeatedly updates adjacency matrices and node labels within a message-passing inference framework. It is tested on three HOI recognition benchmarks on images and videos, using the V-COCO, HICO-DET, and CAD-120 datasets. GPNN is adaptable to large datasets and can be used in both spatial and spatiotemporal scenarios. Visual QA is a graph-based way to address visual questions. Object detection: the Spatial-Temporal Graph Convolutional Network (ST-GCN) is a dynamic-skeleton model that overcomes the constraints of earlier methods by learning both spatial and temporal variation directly from the data. Model-based situation recognition uses a GNN to effectively record the joint interdependence between tasks with neural network models built on a graph.

VI. RESOURCES AND PLATFORMS FOR GRAPH COMPUTING IN CV APPLICATIONS

Table I shows the repositories and web links of standard graph learning resources. The various platforms that can be used for graph computing applications are tabulated in Table II, and Table III lists popular graph models for computer vision applications.

TABLE I. STANDARD GRAPH LEARNING RESOURCES
TABLE II. PLATFORMS FOR GRAPH COMPUTING
TABLE III. GRAPH MODELS FOR COMPUTER VISION APPLICATIONS

VII. RESEARCH AND IMPLEMENTATION CHALLENGES

Despite the positive outcomes, previous works are continually hampered by the following two flaws:
Hyperparameters: aside from the GNN structure itself, even a small change in hyperparameters can affect the convergence and performance of the structural model. Currently available approaches that simply optimize structural variables with fixed hyperparameter values may result in an unsatisfactory model.
Scalability: the time it takes to train each candidate network dominates the search time. Run-time computation is required both for training the controller and for training each single GNN model. Furthermore, the controller often produces and analyses potential GNN structures in a sequential fashion, which makes scaling to a vast search space problematic.
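The models surveyed in Section V (GPNN, ST-GCN, and related architectures) all build on some form of neighbourhood aggregation. As a generic illustration only, and not the exact update rule of any cited model, one message-passing layer can be sketched with NumPy as follows:

```python
import numpy as np

def message_passing_layer(A, X, W):
    """One generic message-passing layer: each node averages the features of
    itself and its neighbours, then applies a linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])           # adjacency with self-loops
    deg = A_hat.sum(axis=1, keepdims=True)   # per-node degree for mean pooling
    H = (A_hat @ X) / deg                    # aggregate neighbour features
    return np.maximum(H @ W, 0.0)            # linear transform + nonlinearity

# Tiny example: a 3-node path graph (0 - 1 - 2) with 2-dim node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
W = np.eye(2)                                # identity weights, for readability
H = message_passing_layer(A, X, W)           # updated node representations
```

Stacking several such layers lets information propagate over multi-hop neighbourhoods; the surveyed models specialize this mechanism for human-object interaction graphs and skeleton sequences.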
VIII. CONCLUSION
Design and implementation of Graph Neural Networks (GNNs) for computer vision (CV) applications is an active research topic, in CV and in many other application domains, and several questions remain open. A spatio-temporal graph neural network architecture has been described that explicitly models interactions between


actors, objects, and their environment. This approach can implicitly or explicitly characterize objects, and it generalizes the existing structured models for video comprehension. The use of an adaptive approach is proposed for better scores on distinct tasks across distinct datasets. On AVA, considerable additional work still needs to be done to better utilize explicit object representations.

Further, the use of GNNs in image and video understanding is elaborated in detail, covering architectures, applications, resources, platforms, graph models, and implementation challenges for computer vision.

REFERENCES

[1] Anurag Arnab et al., "Unified Graph Structured Models for Video Understanding," CVPR, arXiv:2103.15662, 2021.
[2] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2019.
[3] Santiago Castro et al., "Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework," CVPR, arXiv:2104.04182, 2021.
[4] Saurabh Sahu et al., "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training," CVPR, arXiv:2103.10043, 2021.
[5] Matthew Hutchinson et al., "Video Action Understanding: A Tutorial," CVPR, arXiv:2010.06647, 2020.
[6] Ishan Dave et al., "TCLR: Temporal Contrastive Learning for Video Representation," CVPR, arXiv:2101.07974, 2021.
[7] Gunnar A. Sigurdsson et al., "Beyond the Camera: Neural Networks in World Coordinates," CVPR, arXiv:2003.05614, 2020.
[8] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2018.
[9] Hao Huang et al., "Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition," CVPR, arXiv:1812.05637, 2018.
[10] Yunpeng Chen et al., "Graph-Based Global Reasoning Networks," CVPR, arXiv:1811.12814, 2018.
[11] Jie Zhou et al., "Graph Neural Networks: A Review of Methods and Applications," arXiv:1812.08434, 2021.
[12] F. Scarselli et al., "The Graph Neural Network Model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61-80, 2009.
[13] Karen Simonyan et al., "Two-Stream Convolutional Networks for Action Recognition in Videos," Visual Geometry Group (VGG), University of Oxford.
[14] Rohit Girdhar et al., "Video Action Transformer Network," CVPR, arXiv:1812.02707, 2018.
[15] Siyuan Qi et al., "Learning Human-Object Interactions by Graph Parsing Neural Networks," CVPR, arXiv:1808.07962, 2018.
[16] Xin Guo et al., "Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases," CVPR, arXiv:1909.12911, 2020.
[17] Danfei Xu et al., "Scene Graph Generation by Iterative Message Passing," CVPR, arXiv:1701.02426, 2017.
[18] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2019.
[19] E. Mavroudi et al., "Representation Learning on Visual-Symbolic Graphs for Video Understanding," arXiv:1905.07385, 2020.
[20] A. Biswas et al., "Survey on Edge Computing–Key Technology in Retail Industry," Lecture Notes on Data Engineering and Communications Technologies, vol. 58, Springer, Singapore, 2021.
[21] R. J. Franklin et al., "Anomaly Detection in Videos for Video Surveillance Applications using Neural Networks," Fourth International Conference on Inventive Systems and Control (ICISC), 2020.
[22] Mohana et al., "Performance Evaluation of Background Modeling Methods for Object Detection and Tracking," 4th International Conference on Inventive Systems and Control (ICISC), 2020.
[23] A. Biswas et al., "Classification of Objects in Video Records using Neural Network Framework," International Conference on Smart Systems and Inventive Technology (ICSSIT), 2018.
[24] H. Jain et al., "Weapon Detection using Artificial Intelligence and Deep Learning for Security Applications," International Conference on Electronics and Sustainable Communication Systems (ICESC), 2020.
[25] M. R. Nehashree et al., "Simulation and Performance Analysis of Feature Extraction and Matching Algorithms for Image Processing Applications," International Conference on Intelligent Sustainable Systems (ICISS), 2019.
[26] R. K. Meghana et al., "Background-modelling techniques for foreground detection and Tracking using Gaussian Mixture Model," 3rd International Conference on Computing Methodologies and Communication (ICCMC), 2019.
[27] V. P. Korakoppa et al., "Implementation of highly efficient sorting algorithm for median filtering using FPGA Spartan 6," International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017.
[28] D. Akash et al., "Interfacing of flash memory and DDR3 RAM memory with Kintex 7 FPGA board," International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2017.
[29] N. Jain et al., "Performance Analysis of Object Detection and Tracking Algorithms for Traffic Surveillance Applications using Neural Networks," International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2019.
[30] C. Kumar B et al., "YOLOv3 and YOLOv4: Multiple Object Detection for Surveillance Applications," International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.
[31] C. Kumar B et al., "Performance Analysis of Object Detection Algorithm for Intelligent Traffic Surveillance System," International Conference on Inventive Research in Computing Applications (ICIRCA), 2020.
[32] R. J. Franklin et al., "Traffic Signal Violation Detection using Artificial Intelligence and Deep Learning," International Conference on Communication and Electronics Systems (ICCES), 2020.
[33] Mohana et al., "Object Detection and Tracking using Deep Learning and Artificial Intelligence for Video Surveillance Applications," International Journal of Advanced Computer Science and Applications (IJACSA), 10(12), 2019.
[34] Manoharan Samuel et al., "Improved Version of Graph-Cut Algorithm with CT Images of Lung Cancer with Clinical Property Condition," Journal of Artificial Intelligence and Capsule Networks, 2(4), 201–206.
