Video captioning with stacked attention and semantic hard pull

Rahman, Md. Mushfiqur; Abedin, Thasin; Prottoy, Khondokar S. S.; Moshruba, Ayana; Siddiqui, Fazlul Hasan

doi:10.7717/peerj-cs.664

arXiv:2009.07335 (cs)

[Submitted on 15 Sep 2020 (v1), last revised 16 Jul 2021 (this version, v3)]

Abstract: Video captioning, i.e. the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches: "stacked attention" and "spatial hard pull". As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS Score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
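For readers unfamiliar with the BLEU metric named in the abstract, the following is a minimal sketch of unsmoothed sentence-level BLEU (clipped n-gram precision with a brevity penalty). The abstract does not say which BLEU variant, n-gram order, or smoothing the authors used, so the function names `ngrams` and `sentence_bleu`, the uniform weights, and the default `max_n=4` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights.

    Illustrative only: the paper's exact BLEU configuration is not
    given in the abstract, so max_n and the weighting are assumptions.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

A caption identical to its reference scores 1.0, while a caption sharing no words with any reference scores 0.0. Such a score measures n-gram overlap only, which is the kind of limitation the abstract's human-rated Semantic Sensibility score is meant to complement.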

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2009.07335 [cs.CV]
	(or arXiv:2009.07335v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2009.07335
Related DOI:	https://doi.org/10.7717/peerj-cs.664

Submission history

From: Md. Mushfiqur Rahman
[v1] Tue, 15 Sep 2020 19:34:37 UTC (1,730 KB)
[v2] Fri, 30 Oct 2020 15:10:44 UTC (1,836 KB)
[v3] Fri, 16 Jul 2021 18:06:58 UTC (2,199 KB)
