MAT: A Multimodal Attentive Translator for Image Captioning

Liu, Chang; Sun, Fuchun; Wang, Changhu; Wang, Feng; Yuille, Alan

arXiv:1702.05658 (cs)

[Submitted on 18 Feb 2017 (v1), last revised 10 Aug 2017 (this version, v3)]

Abstract: In this work we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Unlike most existing work, where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects, which serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated to a sequence of words, which forms the target sequence of the RNN model. To represent the image in a sequential way, we extract object features from the image and arrange them in an order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are relevant to generating the corresponding words in the sentences. Extensive experiments are conducted to validate the proposed approach on a popular benchmark dataset, i.e., MS COCO, and the proposed model surpasses state-of-the-art methods on all metrics when following the dataset splits of previous work. The proposed approach is also evaluated on the evaluation server of the MS COCO captioning challenge, and achieves very competitive results, e.g., a CIDEr of 1.029 (c5) and 1.064 (c40).
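
As an illustration of the idea of attending over a sequence of encoded objects, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state how attention scores are computed, so dot-product scoring is an assumption here, and the names `attend`, `object_states`, and `decoder_state` are hypothetical. Each encoded object is scored against the current decoder state, the scores are normalized with a softmax, and the weighted sum of object states gives a context vector for predicting the next word.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, object_states):
    """Soft attention over encoded objects.

    Assumption: dot-product scoring. The abstract does not specify
    the scoring function used by the sequential attention layer.
    """
    # One relevance score per encoded object.
    scores = [
        sum(o * d for o, d in zip(obj, decoder_state))
        for obj in object_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the object states.
    dim = len(object_states[0])
    context = [
        sum(w * obj[i] for w, obj in zip(weights, object_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three encoded objects with 2-dimensional states.
object_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, object_states)
```

In this toy example the first object aligns best with the decoder state, so it receives the largest attention weight. In a full model the object and decoder states would come from the encoder and decoder RNNs, and the context vector would feed into the word prediction at each step.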

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1702.05658 [cs.CV]
	(or arXiv:1702.05658v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1702.05658

Submission history

From: Chang Liu
[v1] Sat, 18 Feb 2017 21:35:06 UTC (740 KB)
[v2] Wed, 5 Jul 2017 18:39:02 UTC (760 KB)
[v3] Thu, 10 Aug 2017 14:29:19 UTC (760 KB)
