Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering

Huang, Hantao; Han, Tao; Han, Wei; Yap, Deep; Chiang, Cheng-Ming

arXiv:2010.08708 (cs)

[Submitted on 17 Oct 2020]

Abstract: Visual Question Answering (VQA) is challenging because of the complex cross-modal relations involved, and it has received extensive attention from the research community. From the human perspective, answering a visual question requires reading the question and then referring to the image to generate an answer. This answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, we propose an answer-checking module that performs unified attention on the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves state-of-the-art accuracy of 71.57% on the VQA-v2.0 test-standard split while using fewer parameters.
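
As a rough illustration of the answer-checking idea described in the abstract, the sketch below concatenates an answer vector with question-token and image-region vectors, runs one round of scaled dot-product self-attention over the joint sequence, and uses the attended answer row to update the answer. Everything here is an assumption made for illustration: the function names (`unified_attention`, `check_answer`), the single head, the identity query/key/value projections, and the residual update are not taken from the paper, which uses learned layers (including transferred BERT layers) whose details are not given on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def unified_attention(tokens):
    """Single-head scaled dot-product self-attention over a joint sequence.

    Assumption: identity projections are used in place of learned
    query/key/value matrices to keep the sketch short.
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # Weighted sum of all token vectors.
        attended.append([sum(w * v[j] for w, v in zip(weights, tokens))
                         for j in range(d)])
    return attended


def check_answer(answer, question_tokens, image_regions):
    """Update the answer vector in the context of question and image.

    Hypothetical helper: the answer is placed first in the joint sequence,
    and its attended row is added back as a residual update.
    """
    joint = [answer] + question_tokens + image_regions
    attended = unified_attention(joint)
    return [a + u for a, u in zip(answer, attended[0])]


if __name__ == "__main__":
    answer = [0.5, 0.1]
    question = [[0.2, 0.9], [0.4, 0.3]]
    image = [[0.7, 0.2], [0.1, 0.8], [0.6, 0.6]]
    print(check_answer(answer, question, image))
```

Because the answer attends to question tokens and image regions in a single joint sequence, its update depends on both modalities at once, which is the sense in which the attention is "unified" in this sketch.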

Comments:	Accepted in ICPR2020
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2010.08708 [cs.CV]
	(or arXiv:2010.08708v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2010.08708

Submission history

From: Wei Han
[v1] Sat, 17 Oct 2020 03:37:16 UTC (10,868 KB)
