Image Captioners Sometimes Tell More Than Images They See

Honori Udo∗ Takafumi Koshinaka


The School of Data Science, Yokohama City University
22-2 Seto, Kanazawa-ku, Yokohama 236-0027 Japan
* Now with NTT Comware Corporation

arXiv:2305.02932v2 [cs.CV] 11 May 2023

Abstract

Image captioning, a.k.a. "image-to-text," which generates descriptive text from given images, has been developing rapidly throughout the era of deep learning. To what extent is the information in the original image preserved in the descriptive text generated by an image captioner? To answer that question, we have performed experiments involving the classification of images from descriptive text alone, without referring to the images at all, and have compared the results with those from standard image-based classifiers. We have evaluated several image captioning models with respect to a disaster image classification task, CrisisNLP, and show that descriptive-text classifiers can sometimes achieve higher accuracy than standard image-based classifiers. Further, we show that fusing an image-based classifier with a descriptive-text classifier can provide an improvement in accuracy.

1. Introduction

Advances in neural network-based representation learning have enabled "embedding," which maps any type of data into a latent space with a certain number of dimensions. Numerous methodologies have been proposed for handling media data of different modalities [10, 12]. In particular, significant research results have recently been reported in the field dealing with images and texts, referred to as "Vision and Language" (V&L) [3]. Typical tasks include visual question answering (VQA) for answering questions about images [4], image captioning for assigning descriptions to images [21], and image generation for generating images from descriptions [11].

Image captioning (image-to-text), which is addressed in this paper, was inspired by sequence-to-sequence learning in neural machine translation [18] and has shown remarkable progress employing a similar approach, one in which a context vector obtained by encoding an input image is decoded to generate descriptive text in an auto-regressive manner [22]. It has a two-sided relationship with image generation (text-to-image), which has attracted public attention with the advent of DALL-E 2 and subsequent approaches to generation [14, 15]. Further development is expected with the support of such foundation models for Vision and Language as CLIP [13] and BLIP [8], which are trained with a large number of images and a large amount of text found on the Internet.

Image captioning has something in common with automatic speech recognition (ASR, a.k.a. speech-to-text), which has long received much research attention. While ASR converts acoustic signals caused by air vibration into text, image captioning converts light signals into text. ASR extracts only linguistic information from input speech and discards such non-linguistic information as tone, emotion, and the gender and age of the speaker. In this sense, ASR can be viewed as a kind of feature extraction. Image captioning is similar in that it extracts certain information from an input image and discards other information. We aim to clarify what that "certain information" actually is.

In speech emotion recognition, Srinivasan et al. [17] employ linguistic information (text) obtained from ASR as features and show that the resulting emotion-recognition accuracy is superior to that of conventional methods that use only acoustic features. This may also apply to image captioning. Could an image captioner also serve as a feature extractor that complements image recognition? In this paper, we examine how accurately image classification can be performed using descriptive text obtained by an image captioner as a feature. We also try to improve classification accuracy by combining it with standard image classifiers based on convolutional neural networks (CNNs). Our experimental setup is built around CrisisNLP [2, 1], a benchmark dataset for disaster image classification. There have been earlier studies that attempted multi-task learning with image classification and captioning as two parallel tasks [24], but, to the best of our knowledge, this is the first attempt to use an image captioner as a feature extractor for image classification.

The remainder of this paper is organized as follows. Section 2 describes the configuration of our image classification system using image- and text-based classifiers combined with an image captioner. The experimental setup and results for image- and text-based single-modal systems, as well as for fused multi-modal systems, are presented in Section 3. Section 4 summarizes our work.
2 System Configuration

Figure 1 shows the configuration of the final form of the image classification system considered in this paper. The left half shows a standard image classifier based on a neural network such as a CNN. The right half shows an image captioner connected in tandem with a text classifier, which classifies images on the basis of the linguistic information extracted from them.

Figure 1. System configuration: a standard image-based classifier (left half) and a text-based classifier combined with an image captioner (right half).

2.1 Image and Text Classifiers

We used pre-trained models throughout. MobileNetV2 [16] and EfficientNet [20] were used for the image-based classifier, and BERT-BASE [6] was used for the text-based classifier, each fine-tuned with the training data for the target task. For hyper-parameter settings in fine-tuning, such as the learning rate, mini-batch size, number of epochs, data augmentation, and dropout, we follow publicly available tutorial codes.¹ ² ³

¹ Transfer learning and fine-tuning: https://www.tensorflow.org/tutorials/images/transfer_learning
² Image classification via fine-tuning with EfficientNet: https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/
³ Classify text with BERT: https://www.tensorflow.org/text/tutorials/classify_text_with_bert
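As a concrete illustration of the text branch, the following minimal sketch fine-tunes a BERT-base classifier on generated captions. It is an illustration only, not the exact training script used here: the experiments follow the TensorFlow "Classify text with BERT" tutorial, whereas this sketch uses the equivalent Hugging Face models, and the captions, labels, and hyper-parameters shown are placeholders.

    # Sketch of the text-based classifier: BERT-base fine-tuned to map captions
    # to disaster classes. Captions, labels, and hyper-parameters are illustrative.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    NUM_CLASSES = 7  # e.g., the "Disaster types" task
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_CLASSES)

    captions = ["a group of people standing on top of a building",
                "a street flooded with muddy water"]   # captions from an image captioner
    labels = torch.tensor([0, 2])                       # hypothetical class indices

    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    loss = model(**batch, labels=labels).loss   # cross-entropy computed internally
    loss.backward()
    optimizer.step()

    # At test time, a softmax over model(**batch).logits gives the class
    # posteriors that are later fused with the image classifier's scores.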
2.2 Image Captioners

We focused on three image captioning models. None of them were fine-tuned on data for the target task (because no image-description text was available for the target task); the original models were used as is.

InceptionV3+RNN: This is a basic, small-scale model that encodes an input image into a vector using InceptionV3 [19] and then decodes it to generate a caption with a recurrent neural network (Gated Recurrent Unit; GRU [5]). An attention mechanism [23] between the encoder and decoder selectively sends features of each part of the image to the decoder. The entire system was trained on the MS-COCO dataset [9]. We followed TensorFlow's tutorial implementation.⁴

⁴ Image captioning with visual attention: https://www.tensorflow.org/tutorials/text/image_captioning. The architecture of the decoder is Embed(256)–GRU(512)–FC(512)–FC.
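As a rough sketch of the encoder side of this model (following the tutorial's overall design rather than its exact code), InceptionV3 without its classification head turns an image into an 8x8x2048 feature map, whose 64 spatial positions the attention mechanism weights at each decoding step; the GRU decoder and attention layers are omitted here.

    # Encoder side of the InceptionV3+RNN captioner: extract a spatial feature
    # map for the attention-equipped GRU decoder (omitted) to attend over.
    import tensorflow as tf

    encoder = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

    image = tf.random.uniform((1, 299, 299, 3))                    # stand-in for a real image
    image = tf.keras.applications.inception_v3.preprocess_input(image * 255.0)
    features = encoder(image)                                      # shape (1, 8, 8, 2048)
    features = tf.reshape(features, (1, -1, features.shape[-1]))   # (1, 64, 2048): 64 attendable regions
    print(features.shape)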
BLIP: This is a foundation model trained on a large number of images and a large amount of text, and it is applicable to a wide range of Vision and Language tasks [8]. When used as an image captioner, it takes the form of an encoder-decoder configuration based on the Transformer model. Users can easily run the sample code (demo.ipynb) on GitHub⁵ to obtain captions for their own images. BLIP is a relatively advanced, large-scale model that is capable of producing quite accurate captions.

⁵ https://github.com/salesforce/BLIP
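The sketch below shows one way to obtain a BLIP caption programmatically. The experiments used the repository's demo notebook; this example instead uses the Hugging Face port of the same captioning model, which is assumed to behave comparably, and the image path and generation settings are illustrative.

    # Caption an image with BLIP via the Hugging Face port of the captioning model.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("disaster.jpg").convert("RGB")   # hypothetical input image
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, num_beams=3, max_new_tokens=30)  # beam search; values illustrative
    print(processor.decode(out[0], skip_special_tokens=True))
    # e.g., "a group of people standing on top of a building"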
CLIP Interrogator: Given an image, this model infers prompts for such AI image generators as Stable Diffusion and Midjourney so as to generate similar images. Since the text generated by the CLIP Interrogator is not meant to be read by humans, it may not be an image captioner in the strict sense, but we tested it as a model that can generate richer text than BLIP can.

Although the technical specifications of the CLIP Interrogator have not been published as a paper and there is no literature to refer to, it may be presumed from the code⁶ and its operation that it first generates a base caption using BLIP and then selects and adds phrases that match the target image from a predefined set of phrases called Flavors. CLIP image/text encoders [13] are used to measure the degree of matching between a target image and the phrases in Flavors. Flavors contains approximately 100,000 words and phrases, including those referring to objects and entities (e.g., motorcycle, building, young woman), image styles (e.g., photo-realistic), and artist names (e.g., greg rutkowski). We used the code released by the developer (clip_interrogator.ipynb, version 2.2).

⁶ https://github.com/pharmapsychotic/clip-interrogator
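To make the presumed mechanism concrete, the following simplified sketch (not the actual clip-interrogator code) scores a handful of stand-in Flavors phrases against an image with CLIP and appends the best matches to a BLIP base caption.

    # Simplified CLIP-Interrogator-style prompt building: rank candidate phrases
    # by CLIP image-text similarity and append the top ones to a base caption.
    # The phrase list, image path, and base caption are placeholders.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("disaster.jpg").convert("RGB")
    flavors = ["collapsed building", "burning building", "flooded street",
               "photo-realistic", "videogame still"]   # stand-ins for the ~100,000 real Flavors

    inputs = processor(text=flavors, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)   # one similarity score per phrase

    top = [flavors[i] for i in scores.topk(3).indices.tolist()]
    base_caption = "a group of people standing on top of a building"   # e.g., from BLIP
    print(", ".join([base_caption] + top))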

It should be noted that the default settings differ between the standalone BLIP and the BLIP running within the CLIP Interrogator; the latter emphasizes caption quality by changing search parameters (e.g., num_beams). In our experiments, in order to make their operations comparable, the parameters of the standalone BLIP were matched with those of the CLIP Interrogator.

2.3 System Fusion

As previously indicated in Figure 1, we fuse the outputs of an image-based classifier with those of a text-based classifier to improve classification accuracy. Fusion methods include feature-level fusion (early fusion), which feeds the hidden-layer states of each classifier into another neural-network classifier, and score-level fusion (late fusion), which averages the classification results of the individual classifiers. Here we choose the latter for simplicity. That is, if the number of classes is $C$ and the softmax-normalized outputs of the image- and text-based classifiers are $y^{(I)} = \bigl(y_1^{(I)}, \cdots, y_C^{(I)}\bigr)$ and $y^{(T)} = \bigl(y_1^{(T)}, \cdots, y_C^{(T)}\bigr)$, respectively, then the classification result obtained with score-level fusion is calculated as $y = (1 - w)\,y^{(I)} + w\,y^{(T)}$, where $0 \leq w \leq 1$ is the weight coefficient for the text-based classifier.
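A minimal sketch of this late-fusion rule follows; the score vectors are illustrative, and in practice they are the softmax outputs of the fine-tuned image and text classifiers.

    # Score-level (late) fusion: y = (1 - w) * y_img + w * y_txt, with 0 <= w <= 1.
    import numpy as np

    def fuse(y_img: np.ndarray, y_txt: np.ndarray, w: float) -> np.ndarray:
        """Weighted average of the two classifiers' softmax outputs."""
        assert 0.0 <= w <= 1.0
        return (1.0 - w) * y_img + w * y_txt

    y_img = np.array([0.70, 0.20, 0.10])   # image-based classifier posteriors (illustrative)
    y_txt = np.array([0.25, 0.65, 0.10])   # text-based classifier posteriors (illustrative)
    for w in (0.0, 0.5, 1.0):
        y = fuse(y_img, y_txt, w)
        print(f"w={w:.1f}  fused={y}  predicted class={int(np.argmax(y))}")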
3 Experiments

Figure 2. Example images for the different disaster types included in the CrisisNLP dataset (cited from [2]). There is another type, referred to as "not disaster," which is not shown here.

We used the CrisisNLP dataset [2], which is a collection of natural-disaster images shared on such social media as Twitter.⁷ The dataset provides four image classification tasks, for each of which training (Train), development (Dev), and test (Test) data partitions are defined. Among them, we focus on two tasks: 1) "Disaster types," for predicting the type of disaster, such as earthquake, flood, etc.; and 2) "Damage severity," for predicting the degree of damage caused by a disaster in terms of three stages (see Table 1 and Figure 2).

Table 1. Number of images and classes contained in the two tasks defined in the CrisisNLP dataset.

    Task          Disaster types   Damage severity
    Train         12,724           26,898
    Dev           1,574            2,898
    Test          3,213            5,100
    Num. classes  7                3

    Disaster types classes: earthquake, fire, flood, hurricane, landslide, other disaster, not disaster.
    Damage severity classes: severe damage, mild damage, little or none.

We first show the classification accuracies of single-modal systems using only an image-based classifier or only a text-based classifier (Table 2). To reduce the randomness of model-parameter initialization in fine-tuning, each of those accuracies is averaged over five trials.

Regarding the image-based classifiers, the EfficientNet models of three sizes (B0, B1, B2) outperformed MobileNetV2, while no significant difference was observed among the three. Comparing the text-based classifiers built on three different image captioners, we can first see that the most basic image captioner, InceptionV3+RNN, falls far short of the classification accuracy of standard image-based classifiers.

⁷ https://crisisnlp.qcri.org/crisis-image-datasets-asonam20
Table 2. Accuracies (%) of the image-based classifiers, MobileNetV2 (MobNetV2) and EfficientNet (EffNet), and of the text-based classifiers, BERT combined with InceptionV3+RNN (IV3+RNN), BLIP, or the CLIP Interrogator (CLIP-I).

    System         Disaster types   Damage severity
    Image-based
      MobNetV2          69.79            73.37
      EffNet (B0)       76.48            76.84
      EffNet (B1)       75.05            76.71
      EffNet (B2)       76.33            76.31
    Text-based
      IV3+RNN           42.40            52.36
      BLIP              71.14            72.17
      CLIP-I            85.28            78.67
A look at the captions generated by InceptionV3+RNN reveals that most of them are seemingly irrelevant to the images, and it seems difficult to predict either the type of disaster or the degree of damage from these captions (as indicated later in Figure 3). By way of contrast, BLIP did generate good captions for many images. The caption previously shown in Figure 1 is one actually generated by BLIP, and it notes such important elements in the image as "people" and "rubble." The text-based classifier with BLIP consequently achieved much better accuracy, roughly comparable to that of standard image-based classifiers. The accuracy of the text-based classifier using the CLIP Interrogator (CLIP-I) is even higher, and its results significantly exceed those of standard image-based classifiers. These results suggest that Vision and Language foundation models trained on a large amount of image/text data can serve as effective feature extractors for image classification.

Figure 3. Example results with an image to be classified as "earthquake."
    Input: [image]
    MobNetV2 → not disaster
    EffNet B0 → earthquake
    IV3+RNN: "a restaurant that are shining that is lined up with lots of people." → not disaster
    BLIP: "a group of people standing on top of a building" → not disaster
    CLIP-I: "a group of people standing on top of a building, collapsed building, buildings collapsed, collapsed buildings, videogame still, burning building, building destroyed, background of resident evil game, 19xx :2 akira movie style : 8, photo", big impact hit on the building, damaged buildings, earthquake, unreal engine. film still, test, building on fire" → earthquake
Figure 3 shows an example of the image classification results for an image that should be classified as "earthquake." As previously noted, InceptionV3+RNN (IV3+RNN), the most basic image captioner, produced a description that was far from the actual content of the image. BLIP's descriptions were generally much more accurate. The CLIP Interrogator (CLIP-I) behaved quite differently from these two. After beginning with a common sentence from BLIP, it continued the explanation with a large number of phrases selected by CLIP. Although those phrases make no sense as sentences, we can observe some that reflect the true class, such as "earthquake" and "collapsed building." On the other hand, there are also such completely irrelevant phrases as "akira movie style" and "videogame still"; they would, however, be good clues for Stable Diffusion to use to generate images.

Figure 4 shows the results of score-level fusion, which averaged the outputs of the image- and text-based classifiers with weight w, where MobileNetV2 was used for the image-based classifier. The two tasks (Disaster types and Damage severity) show similar trends: when sufficiently good models, such as BLIP and the CLIP Interrogator (CLIP-I), are used for image captioning, the classification accuracy can be improved by appropriately choosing the weight w. It seems that image captioning models extract features different from those of CNN models; in other words, they look at images from a different perspective than CNNs do. Unfortunately, when InceptionV3+RNN was used for image captioning, the effect of fusion was nearly unseen because the difference in performance between the two modalities was too large.
Figure 4. Score-level fusion results using MobileNetV2 as the image-based classifier, with panels (a) Disaster types and (b) Damage severity. The horizontal axis is the fusion weight w for the text-based classifier; w = 0 and w = 1 correspond to the image-based and text-based single-modal systems, respectively.

Figure 5. Score-level fusion results using EfficientNet B0 (rather than MobileNetV2 as in Figure 4) as the image-based classifier, with panels (a) Disaster types and (b) Damage severity.
When we fused the text-based classifier with another image-based classifier, EfficientNet (B0), we could still see a synergistic effect, as indicated in Figure 5. We should note that, because the classification accuracy of EfficientNet is higher than that of the text-based classifier using BLIP (unlike MobileNetV2), the synergistic effect is not as large as in the case of MobileNetV2.

4 Summary

Starting from the question of what kind of information image captioners can extract from images, we performed image classification using only the linguistic information contained in captions, and have shown that it is possible to achieve classification accuracy that surpasses that of standard image-based classifiers using CNNs. Further, we have confirmed that synergistic effects can be obtained by fusion with those image-based classifiers. It can be said that image captioners based on large-scale foundation models are effective feature extractors for image classification. Image captioning is a rapidly evolving area, and we need to keep up with new technologies that are being released one after another [7].⁸

⁸ https://docs.midjourney.com/docs/describe

In this study, we experimented with a simple system configuration, in which very basic parts were combined. In future work, we intend to introduce more advanced methods, such as feature-level fusion and knowledge distillation [17], in order to search for better answers to the original question. It should further be noted that the behavior of image captioners would seem to depend heavily on the images to be captioned.
While scenes of disasters are relatively easy to describe, the characteristics of a human face would not be so easy. Studies on other datasets are to be included in our future work.

Acknowledgment

This work was partially supported by JSPS KAKEN Grant Number 21K11967.

References

[1] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. CrisisMMD: Multimodal twitter datasets from natural disasters. In 12th International AAAI Conference on Web and Social Media (ICWSM), 2018.
[2] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. Deep learning benchmarks and datasets for social media image classification for disaster response. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020.
[3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Bińkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning. In 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In IEEE International Conference on Computer Vision (ICCV), 2015.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[7] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
[8] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086, 2022.
[9] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[10] M. R. Makiuchi, K. Uto, and K. Shinoda. Multimodal emotion recognition with high-level speech and text features. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
[11] E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
[12] C. Narisetty, E. Tsunoo, X. Chang, Y. Kashiwagi, M. Hentschel, and S. Watanabe. Joint speech recognition and audio captioning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[15] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] S. Srinivasan, Z. Huang, and K. Kirchhoff. Representation learning through cross-modal conditional teacher-student training for speech emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence-to-sequence learning with neural networks. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.
[21] Y. Ushiku, Y. Mukuta, M. Yamaguchi, and T. Harada. Common subspace for model and similarity: Phrase learning for sentence generation from images. In IEEE International Conference on Computer Vision (ICCV), 2015.
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
[24] Y. C. Yoon, S. Y. Park, S. M. Park, and H. Lim. Image classification and captioning model considering a CAM-based disagreement loss. ETRI Journal, 42(1), 2020.
