Graph Neural Network (GNN) in Image and Video Understanding Using Deep Learning for Computer Vision Applications

Abstract- Graph neural networks (GNNs) are information-processing systems that use message passing among graph nodes. In recent years, GNN variants including the graph attention network (GAT), graph convolutional network (GCN), and graph recurrent network (GRN) have shown revolutionary performance in computer vision applications using deep learning and artificial intelligence. These neural network extensions collect information in the form of graphs. GNNs may be divided into three groups based on the challenges they solve: link prediction, node classification, and graph classification. Machines can differentiate and recognise objects in images and video using standard CNNs, but an extensive amount of research work remains before robots can have the same visual intuition as humans. GNN architectures, on the other hand, may be used to solve various image categorization and video challenges. The number of GNN applications in computer vision is not limited and continues to expand: human-object interaction, action understanding, image categorization from a few shots, and many more. In this paper, the use of GNNs in image and video understanding, their design aspects, architecture, applications, and implementation challenges for computer vision are described. The GNN is a strong tool for analysing graph data and is still a very active area that needs further research attention to solve many computer vision applications.

Keywords— Graph Neural Networks (GNNs), Convolutional Neural Network (CNN), Gated Adversarial Transformer (GAT), Atomic Visual Action (AVA), Human-Object Interactions (HOI), Graph Parsing Neural Network (GPNN).

I. INTRODUCTION

Graphs are one of the most useful data structures for various application areas such as learning chemical fingerprints, exploring traffic networks, recommending companions in interpersonal interaction, and modeling physical systems. These tasks involve non-Euclidean graph data containing details of interpersonal relationships, which can be mishandled by traditional deep learning models. Nodes in graphs regularly contain valuable data that is ignored by purely unsupervised learning techniques. GNNs are proposed to consolidate the graph structure and feature details in order to learn better representations on graphs through integration and feature propagation. Because of their convincing performance and high interpretability, GNNs have lately become widely used. GNNs can model the relationships among the nodes of a graph and generate a numerical representation of it. GNNs are particularly essential since so much factual information can be expressed as graphs. Social networks, chemical compounds, maps, and transportation systems are just a few examples. Due to advancements in deep learning, a variety of image-comprehension tasks, including image categorization, object recognition, and semantic segmentation, have seen resounding success; deep learning algorithms' capacity to learn at many levels of abstraction from data accounts for this. Different levels of abstraction are required for different activities. The generalized image-classification challenge is to determine which object classes are included in a given image (usually from a collection of predefined classes). Image-classification methodology based on GNN models is evolving, since GNNs, which derive their motivation from CNNs, are used in this domain. Given a large training dataset of labelled classes, many of these models, including GNNs, provide interesting results. Recent video-understanding datasets such as AVA and Charades, on the other hand, have lagged behind in comparison. Understanding the interactions between actors, objects, and other context in a scene is one of the many reasons why video comprehension is very important. Furthermore, because these interactions aren't always visible in a single frame, reasoning over large time intervals is required. Because video has an additional temporal axis, it is a much higher-dimensional signal than single images, and we believe that learning these unlabeled interactions directly from current datasets with huge convolutional networks is not possible. Effective video comprehension necessitates long-term reasoning about the links between objects, actors, and their surroundings. Because graph-structured data is ubiquitous, it can be employed in a wide range of scenarios.

II. LITERATURE SURVEY

Anurag Arnab [1] suggests a message-passing GNN that can use explicit representations of objects, falling back to proposals when object supervision is not available, and that clearly represents their spatio-temporal interconnections. This strategy is demonstrated on two different challenges in video spatio-temporal action identification that require relational thinking. It also demonstrates that this method may more successfully model relationships between significant things in images, both numerically and qualitatively. Natural video events are usually the outcome of spatio-temporal engagements among actors and objects, and they frequently involve a large number of object types.
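The message passing that such GNN models build on can be sketched in a few lines. The following is a minimal, illustrative sketch only, assuming mean-neighbour aggregation with a fixed projection; the function name, toy graph, and weights are assumptions for illustration, not details taken from [1]:

```python
import numpy as np

def message_passing_layer(node_feats, adj, weight):
    """One round of message passing with mean aggregation.

    node_feats: (N, D) array of node features.
    adj:        (N, N) 0/1 adjacency matrix.
    weight:     (D, D_out) projection (fixed here; learned in practice).
    """
    # Add self-loops so a node's own features survive aggregation.
    adj_hat = adj + np.eye(adj.shape[0])
    # Row-normalize: each node averages messages from its neighbours.
    deg = adj_hat.sum(axis=1, keepdims=True)
    messages = (adj_hat / deg) @ node_feats
    # Project and apply a ReLU nonlinearity.
    return np.maximum(messages @ weight, 0.0)

# Tiny actor-object graph: 3 nodes, edges 0-1 and 1-2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
weight = np.eye(2)  # identity projection keeps the sketch easy to inspect

out = message_passing_layer(feats, adj, weight)
print(out.shape)  # (3, 2)
```

Stacking several such layers lets information propagate over multi-hop neighbourhoods, which is what allows a node (e.g., an actor) to accumulate context from objects it is not directly connected to.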
Yubo Zhang [2] proposes a study that argues that action detection is a difficult problem to solve, since the models which must be trained are massive and obtaining tagged data is costly. To overcome this constraint, they recommend incorporating domain knowledge into the model's structure to make optimization easier. Santiago Castro [3] notes that the majority of research work on language-assisted video understanding has centered on two tasks: first, multiple-choice video question answering, where models perform well because candidate solutions are easily available; and second, video captioning, which uses an open-ended assessment framework. They suggest fill-in-the-blanks as a video-comprehension assessment framework that corrects previous assessment problems and more closely matches real-life circumstances in which many candidate possibilities are unavailable. Using the associated text and video, the model must predict a concealed noun phrase inside the video description, which assesses the system's knowledge of the film. The dataset is built from the VATEX dataset, with blanked captions generated by masking noun phrases in the English captions in VATEX. To construct an instance, they select the first English caption that contains at least one noun phrase as recognized by spaCy, then blank one of these noun phrases at random. They start with the VATEX v1.1 training set and randomized subsets of size 1000 from the test dataset to construct the training, validation, and test data, respectively. To acquire additional correct answers for each blank in the validation and test sets, they used a crowd-annotation technique. The key aim of gathering such additional annotations, as previously said, is to account for the diversity of words and to have several alternatives for each blank. K. Sasabuchi [4] presents a learning-from-observation framework for extracting precise action sequences from a video of a human demonstration, split and understood with vocal instructions. Splitting is based on local minima in hand velocity, which link human daily movements with the object-centered face-contact transitions required for robot motion generation. They first established that hand-velocity-based motion splitting is a reliable signal for partitioning daily tasks. Second, they generated a new motion-description dataset with the goal of better understanding everyday human actions. Building on attention-based models, the researchers also developed the Gated Adversarial Transformer (GAT). To increase the model's performance even further, they applied adversarial training methodologies: a regularization term was added to the loss function, giving the model adversarial robustness in both the attention mappings and the final output space. Matthew Hutchinson [5] describes video understanding as a natural extension of deep learning research in computer vision. The application of artificial neural network (ANN) machine learning (ML) approaches has tremendously benefitted the field of image interpretation. They grouped deep learning model building blocks and state-of-the-art model families, and specified standard metrics for assessing models. They also listed datasets suitable as benchmarks and pre-training sources and discussed data preparation stages and techniques. Ishan Dave [6] proposes a temporal contrastive learning framework that outperforms state-of-the-art results in a variety of downstream video-comprehension tasks, including action recognition, limited-label action classification, and nearest-neighbor video retrieval, on several videos and datasets. On three different datasets, this paper gives comprehensive experimental evidence and achieves best-in-class results in a range of downstream video-interpretation tasks. Beyond instance discrimination, the success of this methodology demonstrates the advantages of contrastive learning.

A. Gupta [7] proposes a simple idea which offers world features, a basic concept in which each feature at each layer has its own spatial transformation, and the feature map is only altered when needed. Results show that a network created with these world features may be utilised to mimic eye movements, including saccades, fixation, and smooth pursuit, on pre-recorded video in a batch environment. It finds that numerous eye movements are achievable, allowing for a wide range of augmentations without sacrificing relative feature position. They utilised these concepts to supply the model with a transformation for each video in the experiments, but learnt transformations might also be employed. Yubo Zhang [8] proposes the notion that action detection is a difficult problem to solve, based on the fact that the models to be trained are large and obtaining labelled data is costly. To overcome this issue, they suggest incorporating domain expertise into the model's structure to make optimization easier. This model surpasses existing best practices, proving the method's utility in modelling spatial correlation and reasoning about linkages. Most significantly, the performance of this model highlights the need to incorporate relational and temporal knowledge into design methods for action detection. H. Huang [9] describes a new dynamic hidden graph module for modelling complicated object-object interactions in videos, with two instantiations: a visual graph for capturing appearance/motion changes among objects and a location graph for capturing relative spatiotemporal position changes among objects. The suggested graph module may explicitly capture interactions among objects in streaming-video contexts by evaluating object relations simultaneously in both the time domain and the frequency domain, which distinguishes this work from prior techniques.

Y. Chen [10] presents a novel reasoning approach that aggregates a collection of features globally across the coordinate space and projects them into an interaction space where relational reasoning may be computed efficiently. J. Zhou [11] describes many learning activities that involve working with graph data, which offers a wealth of relational information between parts. A model that can learn from graph inputs is required for simulating physical systems, establishing fingerprints, predicting protein interfaces, and diagnosing illnesses. They did a thorough examination of graph neural networks, dividing GNN models into variants based on compute modules, graph kinds, and training kinds. They also describe a number of general frameworks and present a number of theoretical studies. F. Scarselli [12] proposed the GNN model, which extends existing neural network approaches to processing data in the graph domain. This GNN model can handle most common graph topologies directly, including acyclic, cyclic, directed, and undirected graphs. The paradigm revolves around information-diffusion and relaxation processes. The generic framework, prior generative methods for organized data processing, and approaches based on random-walk models are all incorporated into the approach.
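The diffusion-and-relaxation process in [12] iterates a node-state update until the states reach a fixed point. A minimal sketch of that idea follows; the scalar states, weights, and contractive linear update below are assumptions chosen so a fixed point provably exists, not the parametrization actually used in [12]:

```python
import numpy as np

def gnn_fixed_point(adj, labels, w_self=1.0, w_neigh=0.5, tol=1e-9, max_iter=500):
    """Iterate x <- w_self * labels + w_neigh * (mean of neighbour states)
    until the states stop changing. With |w_neigh| < 1 the update is a
    contraction, so the iteration converges to a unique fixed point.
    """
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    prop = adj / deg                      # row-normalized neighbour averaging
    x = np.zeros((adj.shape[0], 1))
    for _ in range(max_iter):
        x_new = w_self * labels + w_neigh * (prop @ x)
        if np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x_new

# 3-node chain 0-1-2 with scalar node labels.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
labels = np.array([[1.0], [0.0], [2.0]])
states = gnn_fixed_point(adj, labels)
print(states.ravel())  # converges to the fixed point [1.5, 1.0, 2.5]
```

The converged states mix each node's own label with its neighbourhood, which is the "encoding" that [12] then feeds to an output network for node- or graph-level prediction.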
K. Simonyan [13] proposes the two-stream ConvNet design that includes both spatial and temporal networks. First, they show that despite minimal training samples, the results of a ConvNet trained on multi-frame dense optical flow can be very good. Finally, they demonstrate how multitask learning may be utilised to increase the quantity of data collected from two separate action-classification datasets.

R. Girdhar [14] introduces a method for recognizing and localizing human actions in video footage using the Action Transformer model. They employ a transformer architecture to collect characteristics from the spatiotemporal environment around the individual whose behaviors are being classified. They demonstrated that the Action Transformer network can acquire spatiotemporal information from other human behaviour and items inside a film clip and use it to recognize and localize human activities. Siyuan Qi [15] describes the challenge of identifying and distinguishing human-object interactions (HOI) in photos and videos. A Graph Parsing Neural Network (GPNN) is introduced, an end-to-end differentiable architecture that incorporates structural information. They test this model on various datasets such as V-COCO, HICO-DET, and CAD-120, which are all HOI recognition benchmarks on photos and videos. This technique outperforms current methods, indicating that GPNN is adaptable to large datasets and can be used in both spatial and temporal scenarios. Boncelet [16] describes testing the proposed method's performance on two chosen picture-understanding tasks: group-level emotion recognition and event identification. Both tasks are extremely meaningful, and synthesizing multiple cues necessitates the interplay of numerous deep models. Understanding an image includes not just recognising the items in it, but also grasping their fundamental relationships and interconnections. GNNs may take advantage of such linkages during the feature-learning and forecasting phases by spreading nodal messages through the network and aggregating the outputs. Danfei Xu [17] suggests the use of scene graphs, a visually grounded graphical framework for a picture, to formally model objects and their interactions. They propose a novel end-to-end model for creating structured scene representations from an input image, solving the challenge of automatically constructing a visually grounded scene graph by continuously passing messages between the primal and dual sub-graphs along the topological structure of the scene graph. Yubo Zhang [18] describes action detection as an example of a difficult problem: the models that must be trained are enormous, yet labelled data is difficult to come by. To overcome this constraint, they recommend incorporating domain knowledge into the model's structure to make optimization easier. The suggested methodology outperforms the I3D baseline by 5.5 percent mAP, and by 4.8 percent mAP on the AVA dataset.

III. GRAPH NEURAL NETWORK MODEL AND ARCHITECTURE FOR IMAGE AND VIDEO UNDERSTANDING

Image categorization is a classic computer vision problem, with convolutional neural networks (CNNs) being the most prominent approach. GNNs, which get their motivation from CNNs, have been used in this arena as well. The main goal is to improve model performance on zero-shot and few-shot learning tasks. Zero-shot learning (ZSL) is the process of training a model to recognize classes it has never seen before. ZSL image categorization requires controlling structural information; as a result, GNNs appear to be highly tempting in this regard. The information needed to guide the ZSL task may be found in knowledge graphs, and the type of information that each technique represents in the graph varies. Graphs of this kind may be built on commonalities between the photos themselves or between the objects recovered using object recognition in the photos. Semantic information from embeddings of the image class labels may also be included in the graphs. GNNs may then be used to enhance the ZSL picture classification-recognition process by applying them to this structured data. The process of creating a label for a video based on its frames is video classification. A strong video-level classification not only delivers correct frame labels but also best represents the entire movie based on the characteristics and annotations of the individual frames. Paper [1] proposes a message-passing GNN for spatio-temporal interactions; for object representations it uses explicit objects if detections are available, and implicit objects otherwise. Their approach broadens earlier structured models for video comprehension, allowing investigation of how varied graph-representation and structure choices impact the model's performance. It shows how to apply the strategy to two separate video tasks that require relational reasoning: on AVA and UCF101-24 it uses a spatio-temporal relational action-detection model, and on the recently released Action Genome dataset it performs video scene-graph classification. It also demonstrates that this strategy may more successfully model relationships between significant things in the picture, both numerically and qualitatively.

Fig.1. Graph-structured data representation [19]

Figure 1 shows the graph-structured data used to understand activity in video. The visual spatio-temporal (ST) graph, with unique edge types (actor-to-actor temporal, object-to-actor spatial, etc.) and unique node types, is a heterogeneous graph with varied semantics and dimensions. First, it describes the visual spatio-temporal interactions of objects and actors. Second, co-occurrences, for example, are a common connection between labels. These signals can be represented visually among actors, objects, and their environment. This approach can implicitly or explicitly characterize objects, and it generalizes the existing structured models for video comprehension. The use of an adaptive approach for better scores on distinct tasks across distinct datasets is proposed. On AVA, a lot of additional work still needs to be done to better utilize explicit object representations.

Further, the use of GNNs in image and video understanding, along with architectures, applications, resources, platforms, graph models, and implementation challenges for computer vision, is elaborated in detail.

REFERENCES

[1] Anurag Arnab et al., "Unified Graph Structured Models for Video Understanding," CVPR, arXiv:2103.15662, 2021.
[2] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2019.
[3] Santiago Castro et al., "Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework," CVPR, arXiv:2104.04182, 2021.
[4] Saurabh Sahu et al., "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training," CVPR, arXiv:2103.10043, 2021.
[5] Matthew Hutchinson et al., "Video Action Understanding: A Tutorial," CVPR, arXiv:2010.06647, 2020.
[6] Ishan Dave et al., "TCLR: Temporal Contrastive Learning for Video Representation," CVPR, arXiv:2101.07974, 2021.
[7] Gunnar A. Sigurdsson et al., "Beyond the Camera: Neural Networks in World Coordinates," CVPR, arXiv:2003.05614, 2020.
[8] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2018.
[9] Hao Huang et al., "Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition," CVPR, arXiv:1812.05637, 2018.
[10] Yunpeng Chen et al., "Graph-Based Global Reasoning Networks," CVPR, arXiv:1811.12814, 2018.
[11] Jie Zhou et al., "Graph Neural Networks: A Review of Methods and Applications," arXiv:1812.08434, 2021.
[12] F. Scarselli et al., "The Graph Neural Network Model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61-80, 2009.
[13] Karen Simonyan et al., "Two-Stream Convolutional Networks for Action Recognition in Videos," Visual Geometry Group (VGG), University of Oxford.
[14] Rohit Girdhar et al., "Video Action Transformer Network," CVPR, arXiv:1812.02707, 2018.
[15] Siyuan Qi et al., "Learning Human-Object Interactions by Graph Parsing Neural Networks," CVPR, arXiv:1808.07962, 2018.
[16] Xin Guo et al., "Graph Neural Networks for Image Understanding Based on Multiple Cues: Group Emotion Recognition and Event Recognition as Use Cases," CVPR, arXiv:1909.12911, 2020.
[17] Danfei Xu et al., "Scene Graph Generation by Iterative Message Passing," CVPR, arXiv:1701.02426, 2017.
[18] Yubo Zhang et al., "A Structured Model For Action Detection," CVPR, arXiv:1812.03544, 2019.
[19] E. Mavroudi et al., "Representation Learning on Visual-Symbolic Graphs for Video Understanding," arXiv:1905.07385, 2020.
[20] A. Biswas et al., "Survey on Edge Computing–Key Technology in Retail Industry," Lecture Notes on Data Engineering and Communications Technologies, vol. 58, Springer, Singapore, 2021.
[21] R. J. Franklin et al., "Anomaly Detection in Videos for Video Surveillance Applications using Neural Networks," Fourth International Conference on Inventive Systems and Control (ICISC), 2020.
[22] Mohana et al., "Performance Evaluation of Background Modeling Methods for Object Detection and Tracking," 4th International Conference on Inventive Systems and Control (ICISC), 2020.
[23] A. Biswas et al., "Classification of Objects in Video Records using Neural Network Framework," International Conference on Smart Systems and Inventive Technology (ICSSIT), 2018.
[24] H. Jain et al., "Weapon Detection using Artificial Intelligence and Deep Learning for Security Applications," International Conference on Electronics and Sustainable Communication Systems (ICESC), 2020.
[25] M. R. Nehashree et al., "Simulation and Performance Analysis of Feature Extraction and Matching Algorithms for Image Processing Applications," International Conference on Intelligent Sustainable Systems (ICISS), 2019.
[26] R. K. Meghana et al., "Background-modelling techniques for foreground detection and Tracking using Gaussian Mixture Model," 3rd International Conference on Computing Methodologies and Communication (ICCMC), 2019.
[27] V. P. Korakoppa et al., "Implementation of highly efficient sorting algorithm for median filtering using FPGA Spartan 6," International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017.
[28] D. Akash et al., "Interfacing of flash memory and DDR3 RAM memory with Kintex 7 FPGA board," International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2017.
[29] N. Jain et al., "Performance Analysis of Object Detection and Tracking Algorithms for Traffic Surveillance Applications using Neural Networks," International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2019.
[30] C. Kumar B et al., "YOLOv3 and YOLOv4: Multiple Object Detection for Surveillance Applications," International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.
[31] C. Kumar B et al., "Performance Analysis of Object Detection Algorithm for Intelligent Traffic Surveillance System," International Conference on Inventive Research in Computing Applications (ICIRCA), 2020.
[32] R. J. Franklin et al., "Traffic Signal Violation Detection using Artificial Intelligence and Deep Learning," International Conference on Communication and Electronics Systems (ICCES), 2020.
[33] Mohana et al., "Object Detection and Tracking using Deep Learning and Artificial Intelligence for Video Surveillance Applications," International Journal of Advanced Computer Science and Applications (IJACSA), 10(12), 2019.
[34] Manoharan Samuel et al., "Improved Version of Graph-Cut Algorithm for CT Images of Lung Cancer with Clinical Property Condition," Journal of Artificial Intelligence and Capsule Networks, 2(4), 201-206.