Integrating Spatial and Temporal Dependencies
Abstract—The increasing demand for effective video analysis across fields such as surveillance, healthcare, and smart environments calls for models that can efficiently capture both spatial structures and temporal dynamics. This study investigates a hybrid deep learning architecture that integrates Convolutional Neural Networks (CNNs) enhanced with Squeeze-and-Excitation (SE) blocks for spatial feature extraction, and Graph Attention Networks (GATs) for temporal modeling. SE blocks are employed to adaptively recalibrate channel-wise features, improving the quality of frame-level representations. These enriched spatial features are sequentially aggregated and fed into a graph-based attention mechanism, where video frames are treated as nodes and their temporal relationships are learned through dynamic attention weighting. Through experiments on benchmark video datasets, the proposed framework is evaluated across key tasks such as action recognition, video summarization, and anomaly detection. Results indicate improved performance in identifying salient actions and detecting temporal patterns, with attention mechanisms enhancing interpretability by focusing on relevant frames and transitions. Ablation studies confirm the contribution of SE-enhanced CNNs in improving feature quality and overall model robustness. This research highlights the benefits of combining channel-attentive spatial extraction with attention-driven temporal reasoning, offering insights into the development of more effective video analysis pipelines.

Keywords—Graph Attention Network (GAT), Convolutional Neural Network (CNN), Attention Mechanism, Computer Vision (CV), Graph Neural Networks (GNN), Squeeze-and-Excitation (SE) Block, Spatiotemporal Modeling, Anomaly Detection

I. INTRODUCTION

The domain of video classification has undergone remarkable advancements in recent years, primarily fuelled by the transformative capabilities of deep learning models. At the forefront of these innovations, Convolutional Neural Networks (CNNs) have emerged as powerful tools for extracting intricate spatial features from individual video frames [1]. Their unparalleled capacity to learn hierarchical representations allows them to capture complex visual patterns, textures, and spatial relationships, positioning CNNs as a cornerstone of modern video analysis systems [2]. These networks excel in identifying the rich visual details present in each frame, enabling the detection of objects, actions, and other significant elements in the video content.

However, despite their extraordinary ability to process spatial information, CNNs face significant challenges when it comes to modelling the temporal dynamics inherent in video sequences. Unlike static images, videos consist of a series of frames that evolve over time, and understanding the relationships between these frames is crucial for accurate video classification [3]. This temporal aspect is essential for identifying motion, tracking the flow of events, and interpreting context across frames. While Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), have been employed for this purpose, they often struggle to capture long-term dependencies [4]. These networks are susceptible to the vanishing and exploding gradient problems, which hinder their ability to model relationships across extended time intervals [5].

As a result, the video classification community has turned its attention to more advanced and robust architectures that can bridge the gap between spatial and temporal information. A groundbreaking solution has emerged in the form of Transformer networks, which leverage the power of self-attention mechanisms to effectively model long-range dependencies [6]. Transformers allow the network to consider every frame in relation to every other frame in the sequence, regardless of their position, enabling a more holistic understanding of temporal dynamics [7]. By addressing the limitations of both convolutional and recurrent architectures, Transformer-based models offer a promising approach to overcoming the challenges posed by the complex interplay of spatial and temporal information in video data.
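As a brief illustration of this mechanism, scaled dot-product self-attention over a sequence of frame embeddings can be sketched as follows. This is a minimal PyTorch-style example; the function name and tensor shapes are illustrative assumptions and are not drawn from the cited works.

    import math
    import torch

    def frame_self_attention(frames, w_q, w_k, w_v):
        """frames: (T, d) frame embeddings; w_q, w_k, w_v: (d, d) projection matrices."""
        q, k, v = frames @ w_q, frames @ w_k, frames @ w_v   # queries, keys, values, each (T, d)
        scores = q @ k.T / math.sqrt(q.size(-1))             # (T, T): every frame scored against every other frame
        weights = torch.softmax(scores, dim=-1)              # attention distribution per frame
        return weights @ v                                    # (T, d) temporally mixed frame features

Because the attention weights are computed for all frame pairs at once, distant frames contribute to each other's representation just as directly as adjacent ones, which is the property that motivates attention-based temporal modeling in this work.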
With their ability to capture intricate temporal relationships and spatial patterns simultaneously, Transformer models are rapidly gaining traction in video classification tasks [8]. Their superior capacity to model long-term dependencies and their scalability across diverse video lengths and contexts make them a promising avenue for pushing the boundaries of what is possible in video analysis [9]. As research in this area continues to evolve, Transformer-based architectures are expected to play a pivotal role in revolutionizing how we understand and classify video content in a wide range of applications, from autonomous systems to entertainment and beyond.
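Before turning to related work, a rough sketch may help make the hybrid design outlined in the abstract concrete. The PyTorch-style listing below pairs a Squeeze-and-Excitation block, which recalibrates channel-wise CNN features, with a single graph-attention layer that treats per-frame feature vectors as nodes of a fully connected temporal graph. Class names, layer sizes, and the reduction ratio are illustrative assumptions rather than the exact configuration evaluated in the experiments.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SEBlock(nn.Module):
        """Squeeze-and-Excitation: adaptively recalibrates channel-wise feature responses."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc1 = nn.Linear(channels, channels // reduction)
            self.fc2 = nn.Linear(channels // reduction, channels)

        def forward(self, x):                                   # x: (batch, channels, H, W)
            s = x.mean(dim=(2, 3))                              # squeeze: global average pooling
            s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))    # excitation: per-channel gates in (0, 1)
            return x * s.unsqueeze(-1).unsqueeze(-1)            # rescale each channel of the feature map

    class FrameGraphAttention(nn.Module):
        """Single GAT-style layer: video frames are nodes of a fully connected temporal graph."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim, bias=False)
            self.attn = nn.Linear(2 * out_dim, 1, bias=False)

        def forward(self, frames):                              # frames: (num_frames, in_dim)
            h = self.proj(frames)                               # (T, out_dim)
            T = h.size(0)
            pairs = torch.cat([h.unsqueeze(1).expand(T, T, -1), # every ordered pair of frames
                               h.unsqueeze(0).expand(T, T, -1)], dim=-1)
            e = F.leaky_relu(self.attn(pairs).squeeze(-1))      # raw pairwise attention scores, (T, T)
            alpha = torch.softmax(e, dim=-1)                    # normalized attention over all frames
            return alpha @ h                                     # attention-weighted aggregation per frame

In such a pipeline, each frame would first pass through the SE-enhanced CNN backbone to produce one feature vector per frame, and the resulting (num_frames, feature_dim) matrix would then be processed by the graph-attention layer before classification.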
II. LITERATURE REVIEW

Video classification is a significant area in computer vision, aiming to understand and categorize video content. Researchers have explored various approaches to tackle the challenges inherent in this task, such as capturing both the spatial appearance within individual frames and the temporal dynamics across these frames.

A. Spatial Feature Extraction

Early approaches often relied on handcrafted features. However, the advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized spatial feature extraction from video frames [10]. Deep CNNs