
Integrating Spatial and Temporal Dependencies: A Hybrid Approach for Video Analysis Using GAT-CNN

Akanksha Raj
School of Computing Science and Engineering and Artificial Intelligence
VIT Bhopal University
Madhya Pradesh, India

Siddharth Singh Chouhan
School of Computing Science and Engineering and Artificial Intelligence
VIT Bhopal University
Madhya Pradesh, India

[email protected]

Abstract—The increasing demand for effective video analysis across fields such as surveillance, healthcare, and smart environments calls for models that can efficiently capture both spatial structures and temporal dynamics. This study investigates a hybrid deep learning architecture that integrates Convolutional Neural Networks (CNNs) enhanced with Squeeze-and-Excitation (SE) blocks for spatial feature extraction, and Graph Attention Networks (GATs) for temporal modeling. SE blocks are employed to adaptively recalibrate channel-wise features, improving the quality of frame-level representations. These enriched spatial features are sequentially aggregated and fed into a graph-based attention mechanism, where video frames are treated as nodes and their temporal relationships are learned through dynamic attention weighting. Through experiments on benchmark video datasets, the proposed framework is evaluated across key tasks such as action recognition, video summarization, and anomaly detection. Results indicate improved performance in identifying salient actions and detecting temporal patterns, with attention mechanisms enhancing interpretability by focusing on relevant frames and transitions. Ablation studies confirm the contribution of SE-enhanced CNNs in improving feature quality and overall model robustness. This research highlights the benefits of combining channel-attentive spatial extraction with attention-driven temporal reasoning, offering insights into the development of more effective video analysis pipelines.

Keywords—Graph Attention Network (GAT), Convolutional Neural Network (CNN), Attention Mechanism, Computer Vision (CV), Graph Neural Networks (GNN), Squeeze-and-Excitation (SE) Block, Spatiotemporal Modeling, Anomaly Detection

I. INTRODUCTION

The domain of video classification has undergone remarkable advancements in recent years, primarily fuelled by the transformative capabilities of deep learning models. At the forefront of these innovations, Convolutional Neural Networks (CNNs) have emerged as powerful tools for extracting intricate spatial features from individual video frames [1]. Their unparalleled capacity to learn hierarchical representations allows them to capture complex visual patterns, textures, and spatial relationships, positioning CNNs as a cornerstone of modern video analysis systems [2]. These networks excel in identifying the rich visual details present in each frame, enabling the detection of objects, actions, and other significant elements in the video content.

However, despite their extraordinary ability to process spatial information, CNNs face significant challenges when it comes to modelling the temporal dynamics inherent in video sequences. Unlike static images, videos consist of a series of frames that evolve over time, and understanding the relationships between these frames is crucial for accurate video classification [3]. This temporal aspect is essential for identifying motion, tracking the flow of events, and interpreting context across frames. While Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), have been employed for this purpose, they often struggle to capture long-term dependencies [4]. These networks are susceptible to the vanishing and exploding gradient problems, which hinder their ability to model relationships across extended time intervals [5].

As a result, the video classification community has turned its attention to more advanced and robust architectures that can bridge the gap between spatial and temporal information. A groundbreaking solution has emerged in the form of Transformer networks, which leverage the power of self-attention mechanisms to effectively model long-range dependencies [6]. Transformers allow the network to consider every frame in relation to every other frame in the sequence, regardless of their position, enabling a more holistic understanding of temporal dynamics [7]. By addressing the limitations of both convolutional and recurrent architectures, Transformer-based models offer a promising approach to overcoming the challenges posed by the complex interplay of spatial and temporal information in video data.

With their ability to capture intricate temporal relationships and spatial patterns simultaneously, Transformer models are rapidly gaining traction in video classification tasks [8]. Their superior capacity to model long-term dependencies and their scalability across diverse video lengths and contexts make them a promising avenue for pushing the boundaries of what is possible in video analysis [9]. As research in this area continues to evolve, Transformer-based architectures are expected to play a pivotal role in revolutionizing how we understand and classify video content in a wide range of applications, from autonomous systems to entertainment and beyond.

II. LITERATURE REVIEW

Video classification is a significant area in computer vision, aiming to understand and categorize video content. Researchers have explored various approaches to tackle the challenges inherent in this task, such as capturing both the spatial appearance within individual frames and the temporal dynamics across these frames.



A. Spatial Feature Extraction

Early approaches often relied on handcrafted features. However, the advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized spatial feature extraction from video frames [10]. Deep CNNs have demonstrated remarkable capabilities in learning hierarchical representations from visual data. Architectures like ResNet have shown significant success in image recognition tasks.

To enhance the representational power and efficiency of CNNs, techniques like depthwise separable convolutions have been introduced. These convolutions decompose standard convolution operations into a depthwise convolution and a pointwise convolution, reducing the number of parameters and computational cost while maintaining competitive performance. The proposed Spatial CNN utilizes depthwise convolutional layers, suggesting an aim for efficient feature extraction [11].

Furthermore, to emphasize the importance of different feature channels, channel attention mechanisms, such as the Squeeze-and-Excitation (SE) block, have been proposed. The core idea of an SE block is to model the interdependencies between feature channels, allowing the network to adaptively recalibrate channel-wise feature responses by explicitly learning to weight each channel based on its importance. By doing so, SE blocks can enhance the sensitivity of the network to more informative features [11, 12]. The Spatial CNN incorporates Squeeze-and-Excitation blocks, indicating an intention to leverage channel-wise attention for improved feature quality.
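To make the squeeze-and-excitation idea concrete, the following is a minimal Keras sketch of an SE block (an illustration, not code from the paper): global average pooling performs the squeeze, two dense layers perform the excitation, and the resulting per-channel weights rescale the input feature map. The reduction ratio is a tunable hyperparameter.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(feature_map, reduction=8):
    """Squeeze-and-Excitation: recalibrate the channels of a (H, W, C) feature map."""
    channels = feature_map.shape[-1]
    squeeze = layers.GlobalAveragePooling2D()(feature_map)             # squeeze: (batch, C)
    excite = layers.Dense(channels // reduction, activation="relu")(squeeze)
    excite = layers.Dense(channels, activation="sigmoid")(excite)      # per-channel weights in [0, 1]
    excite = layers.Reshape((1, 1, channels))(excite)
    return layers.Multiply()([feature_map, excite])                    # recalibrated feature map
```

Because the block only adds two small dense layers per stage, it increases the parameter count and compute only marginally relative to the convolutions it recalibrates.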
To obtain a fixed-size feature vector for each frame, CNNs often employ a Global Average Pooling (GAP) layer at the end of the spatial feature extraction stage. This layer reduces the spatial dimensions of the feature maps by averaging the values across each channel, resulting in a compact representation suitable for subsequent temporal modeling. Finally, dropout is a commonly used regularization technique in deep learning to prevent overfitting by randomly setting a fraction of neurons to zero during training [12]. The Spatial CNN utilizes dropout for regularization, which is a standard practice to improve the generalization ability of the model.

B. Temporal Dependency Modeling

After extracting spatial features from individual frames, capturing the temporal relationships between these features is crucial for video classification. Various methods have been explored for this purpose. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have been widely used to model sequential data due to their ability to maintain a hidden state that captures information from previous time steps. These models can learn long-range temporal dependencies by processing the sequence of frame-level features. Three-dimensional Convolutional Neural Networks (3D CNNs) offer another approach by directly processing video data as a spatiotemporal volume. These networks apply 3D convolutional filters that can simultaneously learn spatial and temporal features [13].

More recently, Transformer networks, initially proposed for natural language processing, have shown remarkable success in capturing long-range dependencies in various domains, including video analysis. The core of the Transformer architecture is the Multi-Head Self-Attention mechanism, which allows the model to weigh the importance of different time steps (in this case, different frame features in the sequence) when computing the representation for a specific time step. This mechanism enables the model to capture complex temporal interactions effectively. Transformer networks are typically composed of multiple encoder layers, each containing a multi-head self-attention mechanism followed by a Feed-Forward Network (FFN) for further processing [14]. The proposed architecture employs a Temporal Transformer to process sequences of extracted spatial features, indicating an aim to leverage the strengths of self-attention in capturing temporal dependencies. The use of Multi-Head Self-Attention and Feed-Forward Network layers in the Temporal Transformer aligns with the standard Transformer architecture.

C. Graph Neural Networks for Video Analysis

Graph Neural Networks (GNNs) have also gained traction in video analysis by representing videos as graphs, where nodes can represent objects, frames, or regions, and edges represent their relationships [15]. GNNs can learn node representations by aggregating information from their neighbors in the graph, allowing them to model complex spatial and temporal relationships. Different types of GNNs, such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), have been applied to tasks like video classification, video anomaly detection, video summarization, and Video Question Answering (VideoQA). For instance, in VideoQA, graph-based methods can model the relationships between visual objects and the question to perform relational reasoning [15, 16]. Some works explore multi-view graph embedding to capture different types of relationships within a network. However, some graph-based methods may introduce irrelevant noise when capturing high-order relations through multiple convolutions over pairwise relations.

D. Hybrid Architectures

Combining the strengths of CNNs for spatial feature extraction and sequence models like RNNs or Transformers for temporal modeling has become a common paradigm in video analysis. These two-stage architectures aim to leverage the ability of CNNs to learn robust visual features and the capacity of sequence models to capture temporal dynamics. The proposed two-stage architecture, consisting of a Spatial CNN and a Temporal Transformer, falls under this category, seeking to effectively integrate spatial and temporal information for video classification [16].

E. Attention Mechanisms in Video Analysis

Attention mechanisms have become integral to many video analysis tasks. Inspired by human perception, attention mechanisms allow models to focus on the most relevant parts of the input data. In video analysis, attention can be applied spatially to highlight important regions within a frame, temporally to weigh crucial time segments, or channel-wise as seen in SE blocks. The self-attention mechanism in Transformers is a powerful form of attention that allows the model to learn relationships between different parts of the input sequence. Question-guided attention has also been used in tasks like VideoQA to focus on video features relevant to a specific question [17].

The techniques employed in video classification are often relevant to other video analysis tasks. For example, video anomaly detection aims to identify unusual events in videos. Both spatial and temporal features are crucial for this task, and methods involving CNNs and sequence models or GNNs have been explored. Video summarization aims to generate concise summaries of longer videos. Attention mechanisms and temporal modeling are important for selecting keyframes or key segments. Video Question Answering (VideoQA) requires understanding both visual and textual information in videos to answer questions. This task often involves sophisticated methods for cross-modal interaction and relational reasoning, sometimes utilizing graph-based approaches [17]. Human action recognition, which involves identifying and classifying actions performed in videos, also relies heavily on capturing spatiotemporal features. Pose estimation, the task of predicting the location of key body joints in images or videos, can provide valuable spatial information for action recognition and other video understanding tasks. By drawing upon the advancements in spatial feature extraction using CNNs, the ability of Transformer networks to model long-range temporal dependencies through self-attention, and the broader applications of attention mechanisms in video analysis, the proposed two-stage architecture presents a novel approach to video classification.
III. METHODOLOGY

The methodology employed in this research involves a two-pronged approach for video classification: a Spatial Convolutional Neural Network (CNN) for feature extraction from individual frames, followed by a Temporal Transformer model for sequence modeling. Depthwise convolutional layers followed by Squeeze-and-Excitation blocks are used to extract features from the training and testing images.

A. Data Preprocessing and Augmentation

A dataset of cropped video frames, organized into training and testing sets with a predefined class structure, undergoes preprocessing. All frames are resized to 128×128 pixels and their pixel values are normalized to the range [0, 1]. Online data augmentation is applied exclusively to the training set to improve robustness and generalization. This includes random horizontal flipping, rotations up to 0.1 radians, zooms up to 10%, contrast adjustments up to 10%, and translations up to 10% in height and width. A 20% validation split is held out from the training data.
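This preprocessing and augmentation pipeline maps naturally onto Keras preprocessing layers, as in the sketch below. It is an illustrative reconstruction, not the authors' code: the dataset directory name and batch size are placeholders, and Keras interprets the RandomRotation factor as a fraction of 2π rather than radians, so the value shown is only an approximation of the stated setting.

```python
import tensorflow as tf
from tensorflow.keras import layers

SEED = 123
IMG_SIZE = (128, 128)

# Placeholder directory layout: one sub-folder of cropped frames per class.
train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "frames/train", image_size=IMG_SIZE, batch_size=32,
    validation_split=0.2, subset="both", seed=SEED)

normalize = layers.Rescaling(1.0 / 255)          # pixel values into [0, 1]
augment = tf.keras.Sequential([                  # applied to the training set only
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1 / (2 * 3.14159)),  # ~0.1 rad; Keras factor = fraction of 2*pi
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
    layers.RandomTranslation(0.1, 0.1),          # up to 10% shift in height and width
])

train_ds = train_ds.map(lambda x, y: (augment(normalize(x), training=True), y))
val_ds = val_ds.map(lambda x, y: (normalize(x), y))
```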
B. Spatial Feature Extraction using CNN and SE Block

The spatial module is a custom-designed CNN comprising three sequential convolutional blocks:

• Each block contains a 3×3 Depthwise Convolution followed by a 1×1 Convolution.
• Squeeze-and-Excitation (SE) blocks with a reduction ratio of 8 are integrated post-convolution to recalibrate feature responses.
• Dropout layers (rates: 0.2, 0.3, 0.5) are inserted progressively.
• L2 regularization (λ = 0.001) is applied to all convolutional and dense layers.

After the final convolutional block:

• A Global Average Pooling (GAP) layer produces a 128-dimensional feature vector.
• This is passed through a Dense layer with 64 ReLU units and a 0.5 dropout.
• The final layer is a Softmax classifier matching the number of classes.
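A minimal Keras sketch of this spatial module, under the stated hyperparameters (reduction ratio 8, dropout rates 0.2/0.3/0.5, L2 of 0.001, 128-dimensional GAP output), is given below. The per-block filter counts other than the final 128, the pooling between blocks, and `num_classes` are assumptions not specified in the paper; the `se_block` helper repeats the pattern sketched in Section II.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

L2 = regularizers.l2(1e-3)

def se_block(x, reduction=8):                    # squeeze-and-excitation (see Section II)
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(c // reduction, activation="relu")(s)
    s = layers.Dense(c, activation="sigmoid")(s)
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(s)])

def spatial_block(x, filters, dropout_rate):
    x = layers.DepthwiseConv2D(3, padding="same", activation="relu",
                               depthwise_regularizer=L2)(x)                     # 3x3 depthwise
    x = layers.Conv2D(filters, 1, activation="relu", kernel_regularizer=L2)(x)  # 1x1 pointwise
    x = se_block(x, reduction=8)
    x = layers.MaxPooling2D()(x)                 # downsampling assumed, not stated in the paper
    return layers.Dropout(dropout_rate)(x)

def build_spatial_cnn(num_classes, input_shape=(128, 128, 3)):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters, rate in zip((32, 64, 128), (0.2, 0.3, 0.5)):   # three blocks; 32/64 assumed
        x = spatial_block(x, filters, rate)
    x = layers.GlobalAveragePooling2D()(x)       # 128-dimensional frame descriptor
    x = layers.Dense(64, activation="relu", kernel_regularizer=L2,
                     name="frame_features")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

The 64-unit layer is given a name so that the trained network can later be cut at that point and reused as the frozen frame-level feature extractor (Section III-C).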
Training uses the Adam optimizer (learning rate: 0.001) and sparse categorical cross-entropy loss. Training is monitored on the validation split with the following callbacks:

• EarlyStopping (patience: 10 epochs on validation loss)
• ReduceLROnPlateau (factor: 0.2, patience: 4 epochs, min_lr: 1e-6)
• ModelCheckpoint, which saves the best weights based on validation accuracy.
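In Keras these settings correspond to a compile-and-fit call along the following lines, assuming the `spatial_cnn` model built above; this is a sketch rather than the authors' script, and the checkpoint filename and epoch budget are placeholders.

```python
from tensorflow import keras

spatial_cnn.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                                      patience=4, min_lr=1e-6),
    keras.callbacks.ModelCheckpoint("spatial_cnn_best.keras",     # placeholder path
                                    monitor="val_accuracy", save_best_only=True),
]

spatial_cnn.fit(train_ds, validation_data=val_ds,
                epochs=100, callbacks=callbacks)                  # epoch budget assumed
```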
C. Feature Sequence Construction

Once trained, the CNN up to the 64-unit dense layer serves as a fixed spatial feature extractor. Frame-wise features are extracted for all training, validation, and test sets. These features are grouped into sequences:

• Sequence length: 15 frames
• Overlap: 5 frames (training), no overlap (validation/test)
• Sequence label: most frequent label in the sequence
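The sketch below illustrates this grouping under stated assumptions: the frame features come from the frozen 64-unit layer, a 5-frame overlap corresponds to a stride of 10 during training, and, for simplicity, the window is slid over a single time-ordered array of frames (in practice sequences would be built per video so that windows do not cross clip boundaries).

```python
import numpy as np
from collections import Counter
from tensorflow import keras

# Frozen frame-level extractor: everything up to the 64-unit dense layer.
feature_extractor = keras.Model(spatial_cnn.input,
                                spatial_cnn.get_layer("frame_features").output)

def build_sequences(frame_features, frame_labels, seq_len=15, stride=10):
    """Group frame features into (seq_len, 64) windows.

    stride=10 gives a 5-frame overlap (training); use stride=seq_len for
    validation/test (no overlap). The sequence label is the most frequent
    frame label inside the window."""
    X, y = [], []
    for start in range(0, len(frame_features) - seq_len + 1, stride):
        window = slice(start, start + seq_len)
        X.append(frame_features[window])
        y.append(Counter(frame_labels[window].tolist()).most_common(1)[0][0])
    return np.asarray(X), np.asarray(y)
```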

D. Temporal Modeling using Transformer

The Temporal Transformer model employed in this research is designed to capture sequential dependencies across frame-wise spatial features. It begins with an optional input projection layer to align input dimensions with the embedding dimension, followed by the addition of positional encoding to retain sequence order information. The core of the model consists of two Transformer encoder blocks. Each block features a Multi-Head Self-Attention (MHSA) mechanism with four heads, where the key dimension is defined as the embedding dimension divided by the number of heads.

This is followed by a dropout layer with a rate of 0.1 and layer normalization. Subsequently, a Feed-Forward Network (FFN) with 128 ReLU units projects the features to an intermediate representation before linearly mapping them back to the embedding dimension (128), along with another dropout layer (0.1) and layer normalization for added regularization. The output of the final encoder block is processed through a Global Average Pooling 1D layer to produce a fixed-length sequence representation. This is followed by a dropout layer (0.3), a dense layer with 64 ReLU units and L2 regularization (λ = 0.01), another dropout layer (0.3), and a softmax output layer that produces class probabilities. The model is trained using the Adam optimizer with a learning rate of 0.0005 and sparse categorical cross-entropy loss, with training monitored based on validation loss. To prevent overfitting and ensure optimal performance, the training process includes an EarlyStopping callback with a patience of 15 epochs and best-weight restoration, along with a ReduceLROnPlateau callback that reduces the learning rate by a factor of 0.5 if no improvement is observed over 5 consecutive epochs, down to a minimum learning rate of 1e-7.
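The following Keras sketch assembles an encoder along these lines. It is a reconstruction under assumptions: residual connections are added around both sub-layers (standard in Transformer encoders but not stated explicitly above), a learned positional embedding stands in for the unspecified positional-encoding variant, and `num_classes` is a placeholder.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def encoder_block(x, embed_dim=128, num_heads=4, ffn_units=128, rate=0.1):
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=embed_dim // num_heads)(x, x)  # MHSA, 4 heads
    attn = layers.Dropout(rate)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)      # residual assumed
    ffn = layers.Dense(ffn_units, activation="relu")(x)        # FFN: 128 ReLU units
    ffn = layers.Dense(embed_dim)(ffn)                         # back to the embedding dim
    ffn = layers.Dropout(rate)(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(x + ffn)    # residual assumed

def build_temporal_transformer(num_classes, seq_len=15, feat_dim=64, embed_dim=128):
    inputs = keras.Input(shape=(seq_len, feat_dim))
    x = layers.Dense(embed_dim)(inputs)                        # input projection to 128
    positions = tf.range(seq_len)
    x = x + layers.Embedding(seq_len, embed_dim)(positions)    # learned positional embedding
    for _ in range(2):                                         # two encoder blocks
        x = encoder_block(x, embed_dim)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Compilation would mirror the spatial stage, using Adam at a learning rate of 0.0005 with sparse categorical cross-entropy, EarlyStopping (patience 15, best-weight restoration) and ReduceLROnPlateau (factor 0.5, patience 5, min_lr 1e-7) monitoring validation loss.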

E. Evaluation Metrics

The performance of the trained temporal model is thoroughly evaluated on the test set using several key metrics. These include overall accuracy and loss, which provide a measure of the model's general performance. Additionally, a detailed classification report is generated, which includes precision, recall, and F1-score for each class, offering insights into the model's ability to correctly identify each class while balancing false positives and false negatives. To further assess the model's performance across different classes, a confusion matrix is used to analyze the true and predicted class distributions, providing a more granular view of its classification behavior. All experiments are conducted with a fixed random seed (123) to ensure the reproducibility of results, enabling consistent and comparable outcomes across different runs.
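These metrics can be produced with scikit-learn as sketched below; the array and model names refer to the hypothetical test sequences, labels, and temporal model from the previous steps.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow import keras

keras.utils.set_random_seed(123)                 # fixed seed (set before splitting and training in practice)

test_loss, test_acc = temporal_model.evaluate(X_test_seq, y_test_seq, verbose=0)
y_pred = np.argmax(temporal_model.predict(X_test_seq), axis=-1)

print(f"test accuracy: {test_acc:.3f}, test loss: {test_loss:.3f}")
print(classification_report(y_test_seq, y_pred))   # per-class precision, recall, F1
print(confusion_matrix(y_test_seq, y_pred))        # rows: true classes, columns: predictions
```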

IV. RESULTS

A. Quantitative Performance Evaluation

To evaluate the effectiveness of our two-stage model architecture, we conducted a comprehensive assessment using classification accuracy and loss on a held-out test set. Despite employing advanced architectural components, including Squeeze-and-Excitation (SE) blocks, and rigorous regularization techniques, the model exhibited limited performance, with a test accuracy of ~46% (Fig. 1) and a test loss of ~2.6 (Fig. 2); removing the SE blocks produced no change in either test accuracy or test loss.

Fig. 1. Temporal Model Accuracy

The inclusion of SE blocks did not yield measurable improvements in accuracy or loss, suggesting that the model's limitations extend beyond channel-wise feature recalibration. The high test loss indicates poor generalization, likely due to either insufficient feature extraction or inherent dataset challenges.

Fig. 2. Temporal Model Test

The ineffectiveness of SE blocks implies that the model's bottleneck lies in the quality of extracted spatial features rather than their refinement. Since SE mechanisms rely on meaningful feature maps to enhance discriminative channels, their failure to improve performance suggests that the base convolutional layers may not be capturing sufficiently discriminative patterns.
B. Feature Embedding Analysis

To further investigate the spatial feature extractor's performance, we visualized the learned 64-dimensional embeddings using UMAP and t-SNE, two widely used dimensionality reduction techniques.

The UMAP projection revealed significant overlap between different classes, with no clear separation and poor clustering, indicating that the model fails to generate distinct feature representations for different action classes (see Fig. 3).

Fig. 3. UMAP on Train Features

Consistent with UMAP, the t-SNE projection exhibited scattered embeddings without discernible class boundaries and no meaningful clustering, reinforcing the hypothesis that the feature space lacks discriminative power (see Fig. 4).

Fig. 4. t-SNE on Train Features

The lack of separation in both UMAP and t-SNE suggests that the spatial feature extractor is unable to encode class-specific patterns effectively. The model may be overfitting to noise or irrelevant background features rather than learning meaningful action representations.
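Projections of this kind can be reproduced with umap-learn and scikit-learn as sketched below; `train_features` and `train_labels` are the hypothetical frame-level arrays from the frozen extractor, and the UMAP/t-SNE hyperparameters are library defaults rather than values reported in the paper.

```python
import matplotlib.pyplot as plt
import umap                                   # umap-learn package
from sklearn.manifold import TSNE

emb_umap = umap.UMAP(n_components=2, random_state=123).fit_transform(train_features)
emb_tsne = TSNE(n_components=2, random_state=123).fit_transform(train_features)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, title in zip(axes, (emb_umap, emb_tsne), ("UMAP", "t-SNE")):
    ax.scatter(emb[:, 0], emb[:, 1], c=train_labels, s=4, cmap="tab20")
    ax.set_title(f"{title} on train features")
plt.tight_layout()
plt.show()
```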
C. Model Interpretability via Grad-CAM

To understand where the model focuses its attention, we employed Gradient-weighted Class Activation Mapping (Grad-CAM) on sample frames. Fig. 5 shows an original "Abuse" frame and Fig. 6 shows the corresponding Grad-CAM heatmap.

Fig. 5. Original Abuse

Fig. 6. Grad-CAM

The heatmap highlights background regions rather than the central action. The model appears to ignore motion cues or key spatial details that define the class (see Fig. 7). This misalignment between model attention and semantically relevant regions suggests weak localization capability, possibly due to inadequate receptive field size or insufficient spatial resolution, and points to an over-reliance on spurious correlations (e.g., background textures) rather than genuine action features.

Fig. 7. Confusion Matrix
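The heatmaps follow the standard Grad-CAM recipe, which the sketch below reproduces for the Keras spatial model; the convolutional layer name is an assumption and should be read from the actual model summary.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Grad-CAM: weight the chosen convolutional feature maps by the
    spatially averaged gradient of the predicted class score."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))            # channel importance
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    cam = tf.nn.relu(cam)
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()         # normalized heatmap

# Usage (layer name is hypothetical; check spatial_cnn.summary()):
# heatmap = grad_cam(spatial_cnn, frame, conv_layer_name="conv2d_2")
```

For visualisation, the low-resolution heatmap is typically upsampled to the input resolution and overlaid on the frame.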
The SE mechanism was introduced to adaptively recalibrate channel-wise feature importance, yet it failed to improve performance. If the initial convolutional layers do not extract meaningful representations, SE cannot enhance them. The dataset may also lack sufficient variation for SE to exploit, and the Transformer may compensate for weak spatial features, diminishing the impact of SE. Grad-CAM attention is unchanged with and without SE blocks, and the UMAP/t-SNE clustering patterns are identical, indicating no improvement in feature discriminability.

Several factors likely contribute to the model's suboptimal performance:

• Dataset-related issues: an uneven sample distribution may bias the model toward majority classes, and overlapping or noisy annotations could hinder learning.
• Architectural limitations: the model may fail to capture long-term temporal dependencies, and the lightweight depthwise blocks, while efficient, may sacrifice spatial richness, especially in low-resolution settings.
• Training dynamics: Grad-CAM suggests the model relies on non-discriminative regions, and the spatial backbone may struggle to encode dynamic action cues.
V. CONCLUSION

Our analysis reveals that the current model struggles due to weak spatial feature extraction and ineffective temporal integration. While SE blocks did not help, alternative approaches, such as stronger backbones, richer temporal modeling, and multi-modal inputs, could unlock significant improvements. Future work should focus on enhancing feature discriminability while ensuring robustness to dataset noise and variability.
VI. REFERENCES
[1] Agarwal, A., Lalit, M., Bansal, A., & Seeja, K. (2023). iSGAN: An Improved SGAN for Crowd Trajectory Prediction from Surveillance Videos. Procedia Computer Science, 218, 2319–2327. https://doi.org/10.1016/j.procs.2023.01.207
[2] Han, X., Xu, G., Zhang, M., Yang, Z., Yu, Z., Huang, W., & Meng, C. (2024). DE-GNN: Dual embedding with graph neural network for fine-grained encrypted traffic classification. Computer Networks, 245, 110372. https://doi.org/10.1016/j.comnet.2024.110372
[3] El-Nagar, G., El-Sawy, A., & Rashad, M. (2024). A deep audio-visual model for efficient dynamic video summarization. Journal of Visual Communication and Image Representation, 100, 104130. https://doi.org/10.1016/j.jvcir.2024.104130
[4] Zhang, B., Yuan, C., Wang, T., & Liu, H. (2021). STENet: A hybrid spatio-temporal embedding network for human trajectory forecasting. Engineering Applications of Artificial Intelligence, 106, 104487. https://doi.org/10.1016/j.engappai.2021.104487
[5] Liu, J., Zheng, S., & Wang, C. (2023). Causal Graph Attention Network with Disentangled Representations for Complex Systems Fault Detection. Reliability Engineering & System Safety, 235, 109232. https://doi.org/10.1016/j.ress.2023.109232
[6] He, T., Li, H., Qian, Z., Niu, C., & Huang, R. (2024). Research on weakly supervised pavement crack segmentation based on defect location by generative adversarial network and target re-optimization. Construction and Building Materials, 411, 134668. https://doi.org/10.1016/j.conbuildmat.2023.134668
[7] Qin, C., Zhang, Y., Liu, Y., Coleman, S., Du, H., & Kerr, D. (2021). A visual place recognition approach using learnable feature map filtering and graph attention networks. Neurocomputing, 457, 277–292. https://doi.org/10.1016/j.neucom.2021.06.038
[8] Wang, Z., Wu, B., Ota, K., & Dong, M. (2023). A multi-scale self-supervised hypergraph contrastive learning framework for video question answering. Neural Networks, 168, 272–286. https://doi.org/10.1016/j.neunet.2023.08.057
[9] Wu, G., Song, S., & Zhang, J. (2024). Global–local spatio-temporal graph convolutional networks for video summarization. Computers and Electrical Engineering, 118, 109445. https://doi.org/10.1016/j.compeleceng.2024.109445
[10] Wang, Z., Li, F., Ota, K., Dong, M., & Wu, B. (2023). ReGR: Relation-aware graph reasoning framework for video question answering. Information Processing and Management, 60, 103375. https://doi.org/10.1016/j.ipm.2023.103375
[11] Jia, A., Zhang, Y., Uprety, S., & Song, D. (2024). Learning interactions across sentiment and emotion with graph attention network and position encodings. Pattern Recognition Letters, 180, 33–40. https://doi.org/10.1016/j.patrec.2024.02.013
[12] Ding, C., Sun, S., & Zhao, J. (2023). MST-GAT: A multimodal spatial–temporal graph attention network for time series anomaly detection. Information Fusion, 89, 527–536. https://doi.org/10.1016/j.inffus.2022.08.011
[13] Rani, C. J., & Devarakonda, N. (2022). An effectual classical dance pose estimation and classification system employing Convolution Neural Network–Long Short-Term Memory (CNN-LSTM) network for video sequences. Microprocessors and Microsystems, 95, 104651. https://doi.org/10.1016/j.micpro.2022.104651
[14] Zhang, H., Li, Y., Zhuang, Z., & Xie, L. (2021). 3D-GAT: 3D-guided adversarial transform network for person re-identification in unseen domains. Pattern Recognition, 112, 107799. https://doi.org/10.1016/j.patcog.2020.107799
[15] Yao, Y., Jiang, X., Fujita, H., & Fang, Z. (2022). A sparse graph wavelet convolution neural network for video-based person re-identification. Pattern Recognition, 129, 108708. https://doi.org/10.1016/j.patcog.2022.108708
[16] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations (ICLR).
[17] Chen, H., Mei, X., Ma, Z., Wu, X., & Wei, Y. (2023). Spatial–temporal graph attention network for video anomaly detection. Image and Vision Computing, 131, 104629. https://doi.org/10.1016/j.imavis.2023.104629
