DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Wu, Wenhao; Zhao, Yuxiang; Xu, Yanwu; Tan, Xiao; He, Dongliang; Zou, Zhikang; Ye, Jin; Li, Yingying; Yao, Mingde; Dong, Zichao; Shi, Yifeng

arXiv:2105.12085v2 (cs)

[Submitted on 25 May 2021 (v1), revised 7 Jul 2021 (this version, v2), latest version 17 Aug 2021 (v3)]

Abstract: Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Their video-level prediction therefore ignores how spatio-temporal features evolve along the temporal dimension of the video. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture the relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation that adaptively aggregates long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module that can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We call the resulting video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (Mini-Kinetics-200, Kinetics-400, Something-Something V1, and ActivityNet) to show its superiority. The proposed DSA module significantly benefits various video recognition models. For example, equipped with DSA modules, I3D ResNet-50 improves from 74.9% to 78.2% top-1 accuracy on Kinetics-400. Code is available at this https URL.
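
The abstract's core idea, a convolution over the snippet axis whose kernel is generated from the input itself, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `dynamic_segment_aggregate`, the single linear projection `weight`, the softmax normalization, and the zero padding at the sequence ends are all assumptions made for illustration. It assumes each snippet has already been reduced to a C-dimensional feature vector.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_segment_aggregate(features, weight):
    """Aggregate snippet features with an input-dependent temporal kernel.

    features: list of U snippets, each a list of C channel values.
    weight:   hypothetical learned projection, K rows of U values, mapping a
              channel's U snippet values to K kernel logits (K odd).
    Returns a U x C list in which every channel has been convolved along the
    snippet axis with its own softmax-normalized kernel.
    """
    num_snippets = len(features)
    num_channels = len(features[0])
    kernel_size = len(weight)
    pad = kernel_size // 2

    out = [[0.0] * num_channels for _ in range(num_snippets)]
    for c in range(num_channels):
        # Values of channel c across all snippets.
        series = [features[u][c] for u in range(num_snippets)]
        # The kernel is computed from the input, so it differs per video
        # and per channel (the "dynamic" part).
        logits = [sum(w * x for w, x in zip(row, series)) for row in weight]
        kernel = softmax(logits)
        # Depthwise 1D convolution over snippets; out-of-range neighbours
        # contribute zero (zero padding).
        for u in range(num_snippets):
            acc = 0.0
            for k in range(kernel_size):
                src = u + k - pad
                if 0 <= src < num_snippets:
                    acc += kernel[k] * series[src]
            out[u][c] = acc
    return out
```

With an all-zero `weight` the kernel becomes uniform and the sketch reduces to a fixed moving average over neighbouring snippets. A non-zero `weight` makes the mixing depend on the video content, which is the contrast with plain averaging of snippet-level predictions that the abstract draws.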

Comments:	Accepted by ACM MM 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2105.12085 [cs.CV]
	(or arXiv:2105.12085v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2105.12085

Submission history

From: Wenhao Wu
[v1] Tue, 25 May 2021 17:09:57 UTC (4,181 KB)
[v2] Wed, 7 Jul 2021 13:43:37 UTC (4,181 KB)
[v3] Tue, 17 Aug 2021 07:11:18 UTC (4,189 KB)
