Image Captioners Sometimes Tell More Than Images They See
Table 2. Accuracies (%) of image-based classifiers: MobileNetV2 (MobNetV2), EfficientNet (EffNet) and text-based classifiers: BERT combined with InceptionV3+RNN (IV3+RNN), BLIP (BLIP), CLIP Interrogator (CLIP-I).
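As the Table 2 caption indicates, the text-based classifiers never see the pixels directly: an image captioner (InceptionV3+RNN, BLIP, or CLIP Interrogator) first turns the image into a sentence, and a BERT classifier then labels that sentence. The sketch below is a minimal, illustrative version of this caption-then-classify pipeline, not the authors' exact setup; it assumes off-the-shelf Hugging Face BLIP checkpoints, and the fine-tuned BERT classifier path is a hypothetical placeholder.

```python
# Caption-then-classify sketch (illustrative, not the authors' exact implementation).
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor, pipeline

# 1) Image -> caption with an off-the-shelf BLIP captioner.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("disaster_photo.jpg").convert("RGB")   # any input image
inputs = processor(images=image, return_tensors="pt")
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# 2) Caption -> class label with a BERT-style text classifier.
#    "path/to/bert-finetuned-on-captions" is a hypothetical checkpoint
#    fine-tuned on captions labeled with the disaster-image classes.
text_clf = pipeline("text-classification", model="path/to/bert-finetuned-on-captions")
print(caption)
print(text_clf(caption))
```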
Figure 4. Score-level fusion results using MobileNetV2 as the image-based classifier: the horizontal axis stands for the fusion weight w for the text-based classifier; w = 0 and w = 1 correspond to the image-based and text-based single-modal systems, respectively.

Figure 5. Score-level fusion results using EfficientNet B0 (rather than MobileNetV2 in Figure 4) as the image-based classifier.
When we fused the text-based classifier with another image-based classifier, EfficientNet (B0), we could still see a synergistic effect, as indicated in Figure 5. We should note that, since the classification accuracy of EfficientNet is higher than that of the text-based classifier using BLIP (unlike MobileNetV2), the synergistic effect is not as large as in the MobileNetV2 case.
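Figures 4 and 5 sweep the fusion weight w between the two single-modal systems. The paper does not spell out the fusion rule beyond this weight, so the following is a minimal sketch assuming the usual convex combination of per-class posteriors (w = 0 image-only, w = 1 text-only); the scores and labels are made-up toy values.

```python
import numpy as np

def fuse_scores(image_probs: np.ndarray, text_probs: np.ndarray, w: float) -> np.ndarray:
    """Score-level (late) fusion: weighted sum of per-class posteriors."""
    return (1.0 - w) * image_probs + w * text_probs

# Toy posteriors standing in for MobileNetV2/EfficientNet (image) and BERT-on-captions (text).
image_probs = np.array([[0.7, 0.2, 0.1],
                        [0.3, 0.5, 0.2]])
text_probs = np.array([[0.6, 0.3, 0.1],
                       [0.1, 0.8, 0.1]])
labels = np.array([0, 1])

# Sweep w from image-only (0.0) to text-only (1.0), as in Figures 4 and 5.
for w in np.linspace(0.0, 1.0, 11):
    fused = fuse_scores(image_probs, text_probs, w)
    accuracy = float((fused.argmax(axis=1) == labels).mean())
    print(f"w = {w:.1f}  accuracy = {accuracy:.2f}")
```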
4 Summary

Starting from the question of what kind of information image captioners can extract from images, we performed image classification using only the linguistic information contained in captions, and showed that it is possible to achieve classification accuracy surpassing that of standard image-based classifiers using CNNs. Further, we confirmed that synergistic effects can be obtained by fusion with those image-based classifiers. It can be said that image captioners based on large-scale foundation models are effective feature extractors for image classification. Image captioning is a rapidly evolving area, and we need to keep up with new technologies that are being released one after another [7]⁸.

In this study, we experimented with a simple system configuration in which very basic components were combined. In future work, we intend to introduce more advanced methods, such as feature-level fusion and knowledge distillation [17], in order to search for better answers to the original question. It should be further noted that the behavior of image captioners seems to depend heavily on the images to be captioned. While scenes of disasters are relatively easy to describe, the characteristics of a human face would not be so easy. Studies on other datasets are planned as future work.

⁸ https://docs.midjourney.com/docs/describe
Acknowledgment

This work was partially supported by JSPS KAKENHI Grant Number 21K11967.

References

[1] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. CrisisMMD: Multimodal Twitter datasets from natural disasters. In 12th International AAAI Conference on Web and Social Media (ICWSM), 2018.
[2] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. Deep learning benchmarks and datasets for social media image classification for disaster response. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020.
[3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Bińkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning. In 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In IEEE International Conference on Computer Vision (ICCV), 2015.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[7] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
[8] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086, 2022.
[9] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[10] M. R. Makiuchi, K. Uto, and K. Shinoda. Multimodal emotion recognition with high-level speech and text features. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
[11] E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
[12] C. Narisetty, E. Tsunoo, X. Chang, Y. Kashiwagi, M. Hentschel, and S. Watanabe. Joint speech recognition and audio captioning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[15] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] S. Srinivasan, Z. Huang, and K. Kirchhoff. Representation learning through cross-modal conditional teacher-student training for speech emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence-to-sequence learning with neural networks. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.
[21] Y. Ushiku, Y. Mukuta, M. Yamaguchi, and T. Harada. Common subspace for model and similarity: Phrase learning for sentence generation from images. In IEEE International Conference on Computer Vision (ICCV), 2015.
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
[24] Y. C. Yoon, S. Y. Park, S. M. Park, and H. Lim. Image classification and captioning model considering a CAM-based disagreement loss. ETRI Journal, 42(1), 2020.