Image2tweet: Datasets in Hindi and English For Generating Tweets From Images
4 Dataset
The data consists of image-tweet pairs. We provide two datasets: English and Hindi. The Hindi data is collected by crawling tweets from two well-known Hindi newspapers, Dainik Bhaskar and Dainik Jagran. The English data is crawled from the Twitter handle of the Times of India. We use the Twitter API to crawl the tweets, collecting a total of 70k tweets for English Image2Tweet and 51k for Hindi Image2Tweet.
Table 1 gives the data statistics.
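For illustration, the crawling step can be sketched with the tweepy client as below. The credential placeholders, the number of tweets requested, and the exact fields kept are assumptions for the sketch, not the authors' released pipeline.

```python
import tweepy

# Placeholder credentials; the crawl uses the standard Twitter API.
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

pairs = []  # (image_url, tweet_text)
# Crawl one handle's timeline, e.g. the Times of India for the English data.
for status in tweepy.Cursor(api.user_timeline,
                            screen_name="timesofindia",
                            tweet_mode="extended",
                            include_rts=False).items(5000):
    media = getattr(status, "extended_entities", {}).get("media", [])
    photos = [m["media_url_https"] for m in media if m["type"] == "photo"]
    if photos:  # keep only tweets that carry at least one image
        pairs.append((photos[0], status.full_text))
```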
5 Baseline
We develop our baseline using BERT (Devlin et al., 2018) and VGG-19 (Simonyan and Zisserman, 2014). The BERT model is pre-trained on English Wikipedia and the BooksCorpus with the masked language modelling and next sentence prediction objectives.
We design a two-branch model (see Figure 8). During training, the image embedding obtained from VGG-19 is passed to one branch and the text is passed to the other. In the first branch, the image embedding is passed to a dense layer. In the second branch, the text is sent to the BERT tokenizer and its output is passed to the pre-trained BERT model. The output from the last layer of BERT is passed to an LSTM layer, whose output is given as input to both max pooling and average pooling. The output vectors of max pooling and average pooling are concatenated. After this, we concatenate the outputs from both branches and give the concatenated vector as input to an LSTM followed by a dense layer. The output vector of this dense layer is used to generate the words of the tweet.

Figure 8: Architecture diagram of the baseline model.
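As a rough illustration of this design, here is a minimal Keras sketch of the two-branch model. The hidden sizes, the vocabulary size, and the bert-base-uncased checkpoint (a multilingual checkpoint would be needed for Hindi) are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from transformers import TFBertModel

VOCAB_SIZE = 30522   # assumed: size of the BERT vocabulary
MAXLEN = 34          # max caption length from the paper

# Image branch: a pre-extracted 4096-d VGG-19 fc2 embedding -> dense layer.
img_in = tf.keras.Input(shape=(4096,), name="vgg19_embedding")
img_vec = tf.keras.layers.Dense(256, activation="relu")(img_in)

# Text branch: tokenised partial tweet -> pre-trained BERT -> LSTM ->
# max pooling and average pooling, concatenated.
ids_in = tf.keras.Input(shape=(MAXLEN,), dtype=tf.int32, name="input_ids")
mask_in = tf.keras.Input(shape=(MAXLEN,), dtype=tf.int32, name="attention_mask")
bert = TFBertModel.from_pretrained("bert-base-uncased")
seq = bert(input_ids=ids_in, attention_mask=mask_in).last_hidden_state
seq = tf.keras.layers.LSTM(256, return_sequences=True)(seq)
txt_vec = tf.keras.layers.Concatenate()([
    tf.keras.layers.GlobalMaxPooling1D()(seq),
    tf.keras.layers.GlobalAveragePooling1D()(seq),
])

# Fuse the two branches, then LSTM + dense to predict the next word.
fused = tf.keras.layers.Concatenate()([img_vec, txt_vec])
fused = tf.keras.layers.RepeatVector(1)(fused)   # the LSTM expects a sequence
hidden = tf.keras.layers.LSTM(256)(fused)
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = tf.keras.Model([img_in, ids_in, mask_in], next_word)
# Adam with learning rate 1e-5 as in the paper; train with batch_size=32.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
```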
For training we use the Adam optimiser with a mini-batch size of 32. The learning rate is set to 1e-5 and the maximum caption length is set to 34. At test time, we pass the image vector and the sequence of words generated so far, and the model predicts the next word. We repeat this until the end token appears. We use greedy search to generate the whole tweet.
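The decoding loop described above might look as follows. The helper name, the use of numpy tensors, and treating BERT's [SEP] as the end token are assumptions for illustration.

```python
import numpy as np
from transformers import BertTokenizer

def greedy_decode(model, tokenizer, img_vec, maxlen=34):
    """Generate a tweet one word at a time, always taking the argmax token."""
    words = []
    for _ in range(maxlen):
        enc = tokenizer(" ".join(words), padding="max_length",
                        truncation=True, max_length=maxlen,
                        return_tensors="np")
        probs = model.predict(
            [img_vec, enc["input_ids"], enc["attention_mask"]], verbose=0)
        token = tokenizer.convert_ids_to_tokens(int(np.argmax(probs[0])))
        if token == "[SEP]":      # treat BERT's separator as the end token
            break
        words.append(token)
    return " ".join(words)

# Usage (the tokenizer must match the one used in training):
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# tweet = greedy_decode(model, tokenizer, vgg19_features[None, :])
```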
The baseline code is available at https://github.com/git-rishabh-jha/Image2Tweet.
6 Results
Table 2 shows the results of the baseline system. The results are poor since we use a relatively simple approach to establish the baseline. There is large scope for improvement in the results, and we encourage more innovative approaches.

System            CIDEr   BLEU-4  METEOR   ROUGE
Baseline-English  0.0003  0.02    0.00013  0.00013
Baseline-Hindi    0.0004  0.03    0.00023  0.00023

Table 2: Results of the baseline systems.
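As a pointer for reproducing such scores, a generated tweet can be compared against its cluster of gold tweets with, for example, BLEU-4 from nltk as below. The example sentences are made up; CIDEr, METEOR, and ROUGE can be computed analogously (e.g. with the pycocoevalcap toolkit).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# references: the cluster of similar gold tweets for an image (tokenised);
# hypothesis: the system-generated tweet (tokenised). Both are made up here.
references = [["pm", "inaugurates", "new", "metro", "line", "in", "delhi"]]
hypothesis = ["new", "metro", "line", "opens", "in", "delhi"]

smooth = SmoothingFunction().method1
bleu4 = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.4f}")
```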
7 Conclusion

In this paper we define the Image2Tweet task and release datasets in Hindi and English for it. The English and Hindi datasets consist of 70k and 51k image-tweet pairs respectively. We cluster similar tweets in our dataset for better evaluation of the system-generated tweets. Generated tweets are evaluated using CIDEr. Further, we provide a VGG-19 + BERT based baseline system for our data. Image2Tweet is more difficult than traditional image captioning and we believe it needs further research attention. Future work includes collecting data for more languages and building more complex systems for the task.