PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

Sardari, Faegheh; Mustafa, Armin; Jackson, Philip J. B.; Hilton, Adrian

arXiv:2308.05051 (cs)

[Submitted on 9 Aug 2023]

Abstract: We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent transformer-based approaches that use a hierarchical structure. We argue that combining the self-attention mechanism with the multiple sub-sampling processes of hierarchical approaches increases the loss of positional information. We evaluate our approach on two challenging dense multi-label benchmark datasets and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, achieving new state-of-the-art mAP of 26.5% and 44.6%. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
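
To illustrate point (i), the following is a minimal sketch of self-attention with a relative positional term. It is not the authors' implementation: the abstract does not give PAT's exact formulation, so this assumes one common variant, a learned scalar bias per clipped relative offset added to the attention logits. The names `relative_attention`, `rel_bias`, and `max_dist` are hypothetical, and queries, keys, and values are taken to be the input itself to keep the example short.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(x, rel_bias, max_dist):
    """Single-head self-attention over x (T tokens, each a list of d floats).

    Assumption: queries, keys, and values are x itself (no learned
    projections). rel_bias has 2 * max_dist + 1 entries; entry k holds
    the bias for relative offset (k - max_dist).
    """
    T, d = len(x), len(x[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(T):
        logits = []
        for j in range(T):
            content = sum(a * b for a, b in zip(x[i], x[j])) * scale
            # Clip the offset so all distant frames share one bias value.
            offset = max(-max_dist, min(max_dist, j - i))
            logits.append(content + rel_bias[offset + max_dist])
        weights = softmax(logits)
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out


# Toy input: 6 "frames" with 4 features each, plus a shuffled copy.
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
perm = [3, 0, 5, 1, 4, 2]
shuffled = [frames[p] for p in perm]

no_bias = [0.0] * 5                       # plain self-attention
near_bias = [-2.0, -1.0, 0.0, -1.0, -2.0]  # favors temporally close frames

plain = relative_attention(frames, no_bias, 2)
plain_shuffled = relative_attention(shuffled, no_bias, 2)
aware = relative_attention(frames, near_bias, 2)
aware_shuffled = relative_attention(shuffled, near_bias, 2)
```

With the zero bias, shuffling the frames only shuffles the outputs, which is the sense in which plain self-attention ignores temporal order. With the relative bias, each frame's output depends on which frames are its temporal neighbors, so the shuffled sequence yields different per-frame features.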

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.05051 [cs.CV]
	(or arXiv:2308.05051v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.05051

Submission history

From: Faegheh Sardari
[v1] Wed, 9 Aug 2023 16:29:31 UTC (20,404 KB)
