Jointly learning attentions with semantic cross-modal correlation for visual question answering
L Cao, L Gao, J Song, X Xu, HT Shen
Databases Theory and Applications: 28th Australasian Database Conference, ADC …, 2017, Springer
Abstract
Visual Question Answering (VQA) has emerged as a prominent multi-discipline research problem in artificial intelligence. A number of recent studies have proposed attention mechanisms such as visual attention (“where to look”) or question attention (“what words to listen to”), and these have proven effective for VQA. However, such methods focus on modeling the prediction error while ignoring the semantic correlation between image attention and question attention, which inevitably yields suboptimal attentions. In this paper, we argue that in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, so that the two attentions are learned jointly and their joint representation learning for VQA is facilitated. We propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem efficiently. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on benchmark datasets demonstrate that our model outperforms several state-of-the-art VQA techniques.
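The following is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: visual and question attentions are computed separately, the attended features are projected into a shared joint space, and a cross-modal consistency term encourages the two embeddings to agree. All module names, dimensions, and the weighting hyperparameter (JointAttentionVQA, lambda_corr, etc.) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of joint attention learning with a cross-modal
# consistency term; names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, joint_dim=512, n_answers=1000):
        super().__init__()
        self.img_att = nn.Linear(img_dim, 1)           # scores each image region
        self.q_att = nn.Linear(q_dim, 1)               # scores each question word
        self.img_proj = nn.Linear(img_dim, joint_dim)  # maps attended image feature
        self.q_proj = nn.Linear(q_dim, joint_dim)      # maps attended question feature
        self.classifier = nn.Linear(joint_dim, n_answers)

    def forward(self, img_regions, q_words):
        # img_regions: (B, R, img_dim) region features
        # q_words:     (B, T, q_dim) word features
        a_v = F.softmax(self.img_att(img_regions), dim=1)  # visual attention ("where to look")
        a_q = F.softmax(self.q_att(q_words), dim=1)        # question attention ("what words to listen to")
        v = (a_v * img_regions).sum(dim=1)                 # attended image vector
        q = (a_q * q_words).sum(dim=1)                     # attended question vector
        v_j = F.normalize(self.img_proj(v), dim=-1)        # joint-space embeddings
        q_j = F.normalize(self.q_proj(q), dim=-1)
        logits = self.classifier(v_j * q_j)                # fused representation -> answer
        # consistency term: pull the two attended embeddings together in the joint space
        corr_loss = (1.0 - F.cosine_similarity(v_j, q_j)).mean()
        return logits, corr_loss

# Training would combine the usual answer-prediction loss with the
# correlation term (lambda_corr is an assumed weighting hyperparameter):
#   logits, corr_loss = model(img_regions, q_words)
#   loss = F.cross_entropy(logits, answers) + lambda_corr * corr_loss
```

Coupling the two attended embeddings through a shared space like this is one plausible way to realize the paper's stated goal of guaranteeing semantic consistency between image and question attention, rather than supervising each attention with prediction error alone.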