Image Captioners Sometimes Tell More Than Images They See

Honori Udo∗ Takafumi Koshinaka


The School of Data Science, Yokohama City University
22-2 Seto, Kanazawa-ku, Yokohama 236-0027 Japan
* Now with NTT Comware Corporation

arXiv:2305.02932v2 [cs.CV] 11 May 2023

Abstract

Image captioning, a.k.a. "image-to-text," which generates descriptive text from given images, has been developing rapidly throughout the era of deep learning. To what extent is the information in the original image preserved in the descriptive text generated by an image captioner? To answer that question, we have performed experiments involving the classification of images from descriptive text alone, without referring to the images at all, and have compared the results with those from standard image-based classifiers. We have evaluated several image captioning models with respect to a disaster image classification task, CrisisNLP, and show that descriptive-text classifiers can sometimes achieve higher accuracy than standard image-based classifiers. Further, we show that fusing an image-based classifier with a descriptive-text classifier can provide an improvement in accuracy.

1. Introduction

Advances in neural network-based representation learning have enabled "embedding," which maps any type of data into a latent space with a certain number of dimensions. Numerous methodologies have been proposed for handling media data of different modalities [10, 12]. In particular, significant research results have recently been reported in the field dealing with images and texts, referred to as "Vision and Language" (V&L) [3]. Typical tasks include visual question answering (VQA) for answering questions about images [4], image captioning for assigning descriptions to images [21], and image generation for generating images from descriptions [11].

Image captioning (image-to-text), which is addressed in this paper, was inspired by sequence-to-sequence learning in neural machine translation [18] and has shown remarkable progress employing a similar approach, one in which a context vector obtained by encoding an input image is decoded to generate descriptive text in an auto-regressive manner [22]. It has a two-sided relationship with image generation (text-to-image), which has attracted public attention with the advent of DALL-E 2 and subsequent approaches to generation [14, 15]. Further development is expected with the support of such foundation models for Vision and Language as CLIP [13] and BLIP [8], which are trained with a large number of images and a large amount of text found on the Internet.

Image captioning has something in common with automatic speech recognition (ASR, a.k.a. speech-to-text), which has long received much research attention. While ASR converts acoustic signals caused by air vibration into text, image captioning converts light signals into text. ASR extracts only linguistic information from input speech and discards such non-linguistic information as tone, emotion, and the gender and age of the speaker. In this sense, ASR can be viewed as a kind of feature extraction. Image captioning is similar in that it extracts certain information from an input image and discards other information. We aim to clarify what that "certain information" actually is.

In speech emotion recognition, Srinivasan et al. [17] employ linguistic information (text) obtained from ASR as features and show that the resulting emotion-recognition accuracy is superior to that of conventional methods that use only acoustic features. This may also apply to image captioning. Could an image captioner also serve as a feature extractor that complements image recognition? In this paper, we examine how accurately image classification can be performed using descriptive text obtained by an image captioner as a feature. We also try to improve classification accuracy by combining it with standard image classifiers based on convolutional neural networks (CNNs). Our experimental setup is built around CrisisNLP [2, 1], a benchmark dataset for disaster image classification. There have been earlier studies that attempted multi-task learning with image classification and captioning as two parallel tasks [24], but, to the best of our knowledge, this is the first attempt to use an image captioner as a feature extractor for image classification.

The remainder of this paper is organized as follows. Section 2 describes the configuration of our image classification system using image- and text-based classifiers combined with an image captioner. The experimental setup and results for image- and text-based single-modal systems, as well as for fused multi-modal systems, are presented in Section 3. Section 4 summarizes our work.
2 System Configuration

Figure 1 shows the configuration of the final form of the image classification system considered in this paper. The left half shows a standard image classifier based on a neural network such as a CNN. The right half shows an image captioner connected in tandem with a text classifier, which classifies images on the basis of the linguistic information extracted from them.

Figure 1. System configuration: a standard image-based classifier (left half) and a text-based classifier combined with an image captioner (right half).

2.1 Image and Text Classifiers

We used pre-trained models throughout. MobileNetV2 [16] and EfficientNet [20] were used for the image-based classifier, and BERT-BASE [6] was used for the text-based classifier, each fine-tuned with the training data for the target task. For hyper-parameter settings in fine-tuning, such as the learning rate, mini-batch size, number of epochs, data augmentation, and dropout, we follow publicly available tutorial codes.¹ ² ³

¹ Transfer learning and fine-tuning: https://www.tensorflow.org/tutorials/images/transfer_learning
² Image classification via fine-tuning with EfficientNet: https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/
³ Classify text with BERT: https://www.tensorflow.org/text/tutorials/classify_text_with_bert
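As a concrete illustration of the text branch, the following minimal sketch fine-tunes a BERT-base classifier on generated captions. It is an illustration only, not the exact training script used here: the experiments follow the TensorFlow "Classify text with BERT" tutorial, whereas this sketch uses the equivalent Hugging Face models, and the captions, labels, and hyper-parameters shown are placeholders.

    # Sketch of the text-based classifier: BERT-base fine-tuned to map captions
    # to disaster classes. Captions, labels, and hyper-parameters are illustrative.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    NUM_CLASSES = 7  # e.g., the "Disaster types" task
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_CLASSES)

    captions = ["a group of people standing on top of a building",
                "a street flooded with muddy water"]   # captions from an image captioner
    labels = torch.tensor([0, 2])                       # hypothetical class indices

    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    loss = model(**batch, labels=labels).loss   # cross-entropy computed internally
    loss.backward()
    optimizer.step()

    # At test time, a softmax over model(**batch).logits gives the class
    # posteriors that are later fused with the image classifier's scores.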
2.2 Image Captioners

We focused on three image captioning models. None of them were fine-tuned on data for the target task (because no image-description text was available for the target task); the original models were used as is.

InceptionV3+RNN: This is a basic, small-scale model that encodes an input image into a vector using InceptionV3 [19] and then decodes it to generate a caption with a recurrent neural network (Gated Recurrent Unit; GRU [5]). An attention mechanism [23] between the encoder and decoder selectively sends features of each part of the image to the decoder. The entire system was trained on the MS-COCO dataset [9]. We followed TensorFlow's tutorial implementation.⁴

⁴ Image captioning with visual attention: https://www.tensorflow.org/tutorials/text/image_captioning. The architecture of the decoder is Embed(256)–GRU(512)–FC(512)–FC.
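As a rough sketch of the encoder side of this model (following the tutorial's overall design rather than its exact code), InceptionV3 without its classification head turns an image into an 8x8x2048 feature map, whose 64 spatial positions the attention mechanism weights at each decoding step; the GRU decoder and attention layers are omitted here.

    # Encoder side of the InceptionV3+RNN captioner: extract a spatial feature
    # map for the attention-equipped GRU decoder (omitted) to attend over.
    import tensorflow as tf

    encoder = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

    image = tf.random.uniform((1, 299, 299, 3))                    # stand-in for a real image
    image = tf.keras.applications.inception_v3.preprocess_input(image * 255.0)
    features = encoder(image)                                      # shape (1, 8, 8, 2048)
    features = tf.reshape(features, (1, -1, features.shape[-1]))   # (1, 64, 2048): 64 attendable regions
    print(features.shape)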
BLIP: This is a foundation model trained on a large number of images and a large amount of text, and it is applicable to a wide range of Vision and Language tasks [8]. When used as an image captioner, it takes the form of an encoder-decoder configuration based on the Transformer model. Users can easily run the sample code (demo.ipynb) on GitHub⁵ to obtain captions for their own images. BLIP is a relatively advanced, large-scale model that is capable of producing quite accurate captions.

⁵ https://github.com/salesforce/BLIP
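The sketch below shows one way to obtain a BLIP caption programmatically. The experiments used the repository's demo notebook; this example instead uses the Hugging Face port of the same captioning model, which is assumed to behave comparably, and the image path and generation settings are illustrative.

    # Caption an image with BLIP via the Hugging Face port of the captioning model.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("disaster.jpg").convert("RGB")   # hypothetical input image
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, num_beams=3, max_new_tokens=30)  # beam search; values illustrative
    print(processor.decode(out[0], skip_special_tokens=True))
    # e.g., "a group of people standing on top of a building"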
CLIP Interrogator: Given an image, this model infers prompts for such AI image generators as Stable Diffusion and Midjourney so as to generate similar images. Since the text generated by the CLIP Interrogator is not meant to be read by humans, it may not be an image captioner in the strict sense, but we tested it as a model that can generate richer text than BLIP can.

Although the technical specifications of the CLIP Interrogator have not been published as a paper and there is no literature to refer to, it may be presumed from the code⁶ and its operation that it first generates a base caption using BLIP and then selects and adds phrases that match the target image from a predefined set of phrases called Flavors. CLIP image/text encoders [13] are used to measure the degree of matching between a target image and the phrases in Flavors. Flavors contains approximately 100,000 words and phrases, including those referring to objects and entities (e.g., motorcycle, building, young woman), image styles (e.g., photo-realistic), and artist names (e.g., greg rutkowski). We used the code released by the developer (clip_interrogator.ipynb, version 2.2).

⁶ https://github.com/pharmapsychotic/clip-interrogator
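To make the presumed mechanism concrete, the following simplified sketch (not the actual clip-interrogator code) scores a handful of stand-in Flavors phrases against an image with CLIP and appends the best matches to a BLIP base caption.

    # Simplified CLIP-Interrogator-style prompt building: rank candidate phrases
    # by CLIP image-text similarity and append the top ones to a base caption.
    # The phrase list, image path, and base caption are placeholders.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("disaster.jpg").convert("RGB")
    flavors = ["collapsed building", "burning building", "flooded street",
               "photo-realistic", "videogame still"]   # stand-ins for the ~100,000 real Flavors

    inputs = processor(text=flavors, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)   # one similarity score per phrase

    top = [flavors[i] for i in scores.topk(3).indices.tolist()]
    base_caption = "a group of people standing on top of a building"   # e.g., from BLIP
    print(", ".join([base_caption] + top))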

It should be noted that the default settings differ between the standalone BLIP and the BLIP running within the CLIP Interrogator; the latter emphasizes caption quality by changing search parameters (e.g., num_beams). In our experiments, in order to make their operations comparable, the parameters of the standalone BLIP were matched with those of the CLIP Interrogator.

2.3 System Fusion

As previously indicated in Figure 1, we fuse the outputs of an image-based classifier with those of a text-based classifier to improve classification accuracy. Fusion methods include feature-level fusion (early fusion), which feeds the hidden-layer states of each classifier into another neural-network classifier, and score-level fusion (late fusion), which averages the classification results of the individual classifiers. Here we choose the latter for simplicity. That is, if the number of classes is $C$ and the softmax-normalized outputs of the image- and text-based classifiers are $y^{(I)} = \bigl(y_1^{(I)}, \cdots, y_C^{(I)}\bigr)$ and $y^{(T)} = \bigl(y_1^{(T)}, \cdots, y_C^{(T)}\bigr)$, respectively, then the classification result obtained with score-level fusion is calculated as $y = (1 - w)\,y^{(I)} + w\,y^{(T)}$, where $0 \leq w \leq 1$ is the weight coefficient for the text-based classifier.
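A minimal sketch of this late-fusion rule follows; the score vectors are illustrative, and in practice they are the softmax outputs of the fine-tuned image and text classifiers.

    # Score-level (late) fusion: y = (1 - w) * y_img + w * y_txt, with 0 <= w <= 1.
    import numpy as np

    def fuse(y_img: np.ndarray, y_txt: np.ndarray, w: float) -> np.ndarray:
        """Weighted average of the two classifiers' softmax outputs."""
        assert 0.0 <= w <= 1.0
        return (1.0 - w) * y_img + w * y_txt

    y_img = np.array([0.70, 0.20, 0.10])   # image-based classifier posteriors (illustrative)
    y_txt = np.array([0.25, 0.65, 0.10])   # text-based classifier posteriors (illustrative)
    for w in (0.0, 0.5, 1.0):
        y = fuse(y_img, y_txt, w)
        print(f"w={w:.1f}  fused={y}  predicted class={int(np.argmax(y))}")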
3 Experiments

Figure 2. Example images for the different disaster types included in the CrisisNLP dataset (cited from [2]). There is another type, referred to as "not disaster," which is not shown here.

We used the CrisisNLP dataset [2], which is a collection of natural-disaster images shared on such social media as Twitter.⁷ The dataset provides four image classification tasks, for each of which training (Train), development (Dev), and test (Test) data partitions are defined. Among them, we focus on two tasks: 1) "Disaster types," for predicting the type of disaster, such as earthquake, flood, etc.; and 2) "Damage severity," for predicting the degree of damage caused by a disaster in terms of three stages (see Table 1 and Figure 2).

Table 1. Number of images and classes contained in the two tasks defined in the CrisisNLP dataset.

    Task          Disaster types   Damage severity
    Train         12,724           26,898
    Dev           1,574            2,898
    Test          3,213            5,100
    Num. classes  7                3

    Disaster types classes: earthquake, fire, flood, hurricane, landslide, other disaster, not disaster.
    Damage severity classes: severe damage, mild damage, little or none.

We first show the classification accuracies of single-modal systems using only an image-based classifier or only a text-based classifier (Table 2). To reduce the randomness of model-parameter initialization in fine-tuning, each of those accuracies is averaged over five trials.

Regarding the image-based classifiers, the EfficientNet models of three sizes (B0, B1, B2) outperformed MobileNetV2, while no significant difference was observed among the three. Comparing the text-based classifiers built on three different image captioners, we can first see that the most basic image captioner, InceptionV3+RNN, falls far short of the classification accuracy of standard image-based classifiers.

⁷ https://crisisnlp.qcri.org/crisis-image-datasets-asonam20
Table 2. Accuracies (%) of the image-based classifiers, MobileNetV2 (MobNetV2) and EfficientNet (EffNet), and of the text-based classifiers, BERT combined with InceptionV3+RNN (IV3+RNN), BLIP, or the CLIP Interrogator (CLIP-I).

    System         Disaster types   Damage severity
    Image-based
      MobNetV2          69.79            73.37
      EffNet (B0)       76.48            76.84
      EffNet (B1)       75.05            76.71
      EffNet (B2)       76.33            76.31
    Text-based
      IV3+RNN           42.40            52.36
      BLIP              71.14            72.17
      CLIP-I            85.28            78.67
A look at the captions generated by InceptionV3+RNN reveals that most of them are seemingly irrelevant to the images, and it seems difficult to predict either the type of disaster or the degree of damage from these captions (as indicated later in Figure 3). By way of contrast, BLIP did generate good captions for many images. The caption previously shown in Figure 1 is one actually generated by BLIP, and it notes such important elements in the image as "people" and "rubble." The text-based classifier with BLIP consequently achieved much better accuracy, roughly comparable to that of standard image-based classifiers. The accuracy of the text-based classifier using the CLIP Interrogator (CLIP-I) is even higher, and its results significantly exceed those of standard image-based classifiers. These results suggest that Vision and Language foundation models trained on a large amount of image/text data can serve as effective feature extractors for image classification.

Figure 3. Example results with an image to be classified as "earthquake."
    Input: [image]
    MobNetV2 → not disaster
    EffNet B0 → earthquake
    IV3+RNN: "a restaurant that are shining that is lined up with lots of people." → not disaster
    BLIP: "a group of people standing on top of a building" → not disaster
    CLIP-I: "a group of people standing on top of a building, collapsed building, buildings collapsed, collapsed buildings, videogame still, burning building, building destroyed, background of resident evil game, 19xx :2 akira movie style : 8, photo", big impact hit on the building, damaged buildings, earthquake, unreal engine. film still, test, building on fire" → earthquake
Figure 3 shows an example of the image classification results for an image that should be classified as "earthquake." As previously noted, InceptionV3+RNN (IV3+RNN), the most basic image captioner, produced a description that was far from the actual content of the image. BLIP's descriptions were generally much more accurate. The CLIP Interrogator (CLIP-I) behaved quite differently from these two. After beginning with a common sentence from BLIP, it continued the explanation with a large number of phrases selected by CLIP. Although those phrases make no sense as sentences, we can observe some that reflect the true class, such as "earthquake" and "collapsed building." On the other hand, there are also such completely irrelevant phrases as "akira movie style" and "videogame still"; they would, however, be good clues for Stable Diffusion to use to generate images.

Figure 4 shows the results of score-level fusion, which averaged the outputs of the image- and text-based classifiers with weight w, where MobileNetV2 was used for the image-based classifier. The two tasks (Disaster types and Damage severity) show similar trends: when sufficiently good models, such as BLIP and the CLIP Interrogator (CLIP-I), are used for image captioning, the classification accuracy can be improved by appropriately choosing the weight w. It seems that image captioning models extract features different from those of CNN models; in other words, they look at images from a different perspective than CNNs do. Unfortunately, when InceptionV3+RNN was used for image captioning, the effect of fusion was nearly unseen because the difference in performance between the two modalities was too large.
Figure 4. Score-level fusion results using MobileNetV2 as the image-based classifier, with panels (a) Disaster types and (b) Damage severity. The horizontal axis is the fusion weight w for the text-based classifier; w = 0 and w = 1 correspond to the image-based and text-based single-modal systems, respectively.

Figure 5. Score-level fusion results using EfficientNet B0 (rather than MobileNetV2 as in Figure 4) as the image-based classifier, with panels (a) Disaster types and (b) Damage severity.
When we fused the text-based classifier with another image-based classifier, EfficientNet (B0), we could still see a synergistic effect, as indicated in Figure 5. We should note that, because the classification accuracy of EfficientNet is higher than that of the text-based classifier using BLIP (unlike MobileNetV2), the synergistic effect is not as large as in the case of MobileNetV2.

4 Summary

Starting from the question of what kind of information image captioners can extract from images, we performed image classification using only the linguistic information contained in captions, and have shown that it is possible to achieve classification accuracy that surpasses that of standard image-based classifiers using CNNs. Further, we have confirmed that synergistic effects can be obtained by fusion with those image-based classifiers. It can be said that image captioners based on large-scale foundation models are effective feature extractors for image classification. Image captioning is a rapidly evolving area, and we need to keep up with new technologies that are being released one after another [7].⁸

⁸ https://docs.midjourney.com/docs/describe

In this study, we experimented with a simple system configuration, in which very basic parts were combined. In future work, we intend to introduce more advanced methods, such as feature-level fusion and knowledge distillation [17], in order to search for better answers to the original question. It should further be noted that the behavior of image captioners would seem to depend heavily on the images to be captioned.
While scenes of disasters are relatively easy to describe, the characteristics of a human face would not be so easy. Studies on other datasets are to be included in our future work.

Acknowledgment

This work was partially supported by JSPS KAKEN Grant Number 21K11967.

References

[1] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. CrisisMMD: Multimodal twitter datasets from natural disasters. In 12th International AAAI Conference on Web and Social Media (ICWSM), 2018.
[2] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi. Deep learning benchmarks and datasets for social media image classification for disaster response. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020.
[3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Bińkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning. In 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In IEEE International Conference on Computer Vision (ICCV), 2015.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[7] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
[8] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv:2201.12086, 2022.
[9] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.
[10] M. R. Makiuchi, K. Uto, and K. Shinoda. Multimodal emotion recognition with high-level speech and text features. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
[11] E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
[12] C. Narisetty, E. Tsunoo, X. Chang, Y. Kashiwagi, M. Hentschel, and S. Watanabe. Joint speech recognition and audio captioning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[15] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] S. Srinivasan, Z. Huang, and K. Kirchhoff. Representation learning through cross-modal conditional teacher-student training for speech emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[18] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence-to-sequence learning with neural networks. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019.
[21] Y. Ushiku, Y. Mukuta, M. Yamaguchi, and T. Harada. Common subspace for model and similarity: Phrase learning for sentence generation from images. In IEEE International Conference on Computer Vision (ICCV), 2015.
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
[24] Y. C. Yoon, S. Y. Park, S. M. Park, and H. Lim. Image classification and captioning model considering a CAM-based disagreement loss. ETRI Journal, 42(1), 2020.
