

Image2tweet: Datasets in Hindi and English for Generating Tweets from Images

Rishabh Jha1 Varshith Kaki1 Varuna Krishna Kolla1 Shubham Bhagat1
Parth Patwa2 Amitava Das3,4 Santanu Pal3

1 Indian Institute of Information Technology Sri City, India
2 University of California Los Angeles, USA
3 Wipro AI Labs, India
4 AI Institute, University of South Carolina, USA

1 {rishabh.j19, vishnusaivarshith.k19, varunakrishna.k19, shubham.b18}@iiits.in
2 [email protected]
3 {amitava.das2, santanu.pal2}@wipro.com

Abstract

Image captioning is a task that has seen major updates over time. Recent methods leverage visual-linguistic grounding of the image-text pair, either generating a constrained textual description of the objects and entities present within the image, or generating a detailed, paragraph-length description of these entities. But there is still a long way to go towards generating text that is not only semantically richer but also contains real-world knowledge. This is the motivation behind exploring image2tweet generation through the lens of existing image-captioning approaches. At the same time, there is little research on image captioning in Indian languages like Hindi. In this paper, we release Hindi and English datasets for the task of generating a tweet given an image. The aim is to generate a specialized text like a tweet that is not a direct result of the visual-linguistic grounding usually leveraged in similar tasks, but that conveys a message factoring in not only the visual content of the image but also additional real-world contextual information associated with the event depicted in the image. Further, we provide baseline deep learning models on our data and invite researchers to build more sophisticated systems for the problem.

1 Introduction

Generating a textual description of an image is called image captioning. It is an easy task for most adults, but generating a rich and vivid description is difficult for a machine. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. The task therefore involves knowledge of both computer vision and natural language processing.

Image captioning has been a very popular research area over the last decade. Even before the boom of neural network based techniques, people tried various hand-crafted features such as Local Binary Patterns (LBP) (Ojala et al., 2000), the Scale-Invariant Feature Transform (SIFT) (Lowe, 2004), and the Histogram of Oriented Gradients (HOG) (De Marneffe et al., 2006), along with classical ML methods like SVM, for image captioning. With neural network based techniques, on the other hand, features are learned automatically from training data, and such models can handle large and diverse sets of images (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015). Moreover, the availability of large new datasets has made learning-based image captioning an interesting research area. The popular datasets for English image captioning are the Flickr30K dataset (Young et al., 2014), MS COCO (Lin et al., 2014), and the Google Conceptual Captions dataset (Sharma et al., 2018). However, there is almost no research on image captioning in Hindi or other Indian languages.

Image captioning is important for many reasons. For example, it can be used for automatic image indexing. Image indexing is important for Content-Based Image Retrieval (CBIR) and can therefore be applied to many areas, including biomedicine, commerce, the military, education, digital libraries, and web searching.
Image2Tweet takes one step beyond the regular image captioning task. It involves generating captions that are not only semantically rich but also contain some real-world knowledge (Sharma, 2020). The task is that, given an image, the machine has to generate a tweet from it. An example is provided in figure 1. Generating tweets at this level of detail requires person identification (Sachin Tendulkar), object detection (the BJP logo), etc.

In this paper, we describe the image2tweet task, release a new dataset for it, and also release a novel Hindi dataset to ignite image captioning research for Indian languages.

Figure 1:
COCO style: A man in front of a crowd.
Conceptual Captions style: A closeup of a middle-aged man, and a parade.
Expected Image2Tweet: Sachin Tendulkar and BJP parade.

2 Related Work

There are quite a few popular image captioning datasets. Flickr30k (Young et al., 2014) consists of 30K images and each image has 5 captions. The COCO dataset (Lin et al., 2014) consists of 330K images and each image has 5 captions. Google Conceptual Captions (Sharma et al., 2018) has approximately 3.3 million images and each image has only one caption. However, such datasets use images commonly found on the web and couple them with alt-text descriptions. Most of the descriptions use proper nouns (such as characters, places, locations, organizations, etc.). Such proper nouns pose problems because it is difficult for an image captioning model to learn such fine-grained proper-noun inference from the input image pixels. At the same time, there is very little research on Hindi image captioning. To the best of our knowledge, ours is the first dataset for generating tweets from images and the first to release a Hindi dataset for this task.

Deep learning methods are the most popular way to solve the image captioning task. Jiang et al. (2018) proposed a novel Recurrent Fusion Network (RFNet), which exploits complementary information from multiple encoders to tackle image captioning. Xu et al. (2015) propose an encoder-decoder method which incorporates a spatial attention mechanism to help the model determine which regions of an image to focus on. Yang et al. (2016) propose a framework called ReviewNet. Zhou et al. (2020) proposed a Unified Vision-Language Pre-Training model for image captioning which can be easily fine-tuned.

Similar to caption generation, meme generation has also been an eye-catching task for researchers. The task is to generate memes based on an image; unlike captioning, meme generation may have to produce text for multiple persons if multiple persons are involved in the meme image. Kurochkin (2020) released a dataset consisting of 650K meme instances. They applied the GPT-2 (Radford et al., 2019) model for meme generation and observed that machine-generated meme texts are not as engaging as human-generated ones.

3 Task Description

Image captioning for English is a well-studied paradigm, and researchers have tried various methods such as hand-crafted features (Ojala et al., 2000; Lowe, 2004; De Marneffe et al., 2006) along with classical ML methods like SVM. During the last decade numerous big datasets have been released and quite a few efforts can be noticed, but there is still a shortage of work in Indic languages.

Image2Tweet is a shared task where we move a step forward from image captioning. The task is to generate a tweet, as a human or news reporter would, given an image. We release datasets for two languages - English and Hindi. Figures 2 and 3 show an instance from the English and Hindi data respectively.

3.1 Evaluation Metric

For image captioning, the most used metrics are n-gram based matching metrics such as BLEU, ROUGE, METEOR, and CIDEr.
Popular image captioning datasets like Flickr30k (Young et al., 2014), COCO (Lin et al., 2014), and Google Conceptual Captions (Sharma et al., 2018) provide multiple captions per image, as the same image can be described in many different ways. So, in these datasets, evaluation computes the score between the system-generated caption and all the reference captions in the gold data. In our task, collecting multiple tweets for a given image is difficult, and having only one reference tweet would affect the evaluation score.

Since having multiple tweets per image would be difficult, we assume that similar images may have similar tweets. With this in mind we apply content-based similarity matching on the collected data and keep all the similar images in one cluster. The released data is preprocessed accordingly, and all the clusters are marked along with image ids. For evaluation, we use CIDEr, where the score is calculated between the system-generated tweet and all the tweets belonging to the similar-image cluster provided in the dataset.

Figure 2: Tweet: Finance Minister Nirmala Sitharaman presents the full Budget of the second term of the Narendra Modi government #BudgetSession2020 #BudgetWithTimes #UnionBudget2020
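To make the cluster-based evaluation concrete, the sketch below scores a system tweet against every reference tweet in its image cluster. It is a minimal illustration that assumes the pycocoevalcap package and pre-tokenized (lowercased, space-separated) text; the variable names and example tweets are hypothetical, not taken from the released evaluation script.

# Minimal sketch: score one generated tweet against all reference tweets
# in its similarity cluster, assuming the pycocoevalcap package.
from pycocoevalcap.cider.cider import Cider

# Hypothetical cluster of reference tweets for a single test image.
cluster_refs = {
    "img_001": [
        "sachin tendulkar waves to the crowd at the bjp rally",
        "huge crowd gathers as sachin tendulkar joins the parade",
    ]
}
# Hypothetical system output for the same image.
generated = {"img_001": ["sachin tendulkar attends a bjp parade"]}

scorer = Cider()
# CIDEr compares the candidate against every reference in the cluster.
corpus_score, per_image_scores = scorer.compute_score(cluster_refs, generated)
print(f"CIDEr: {corpus_score:.4f}")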

4 Dataset

The data consists of image-tweet pairs. We provide two datasets - English and Hindi. The Hindi data is collected by crawling tweets from two well-known Hindi newspapers - Dainik Bhaskar and Dainik Jagran. The English data is crawled from the Twitter handle of the Times of India. We use the Twitter API (https://developer.twitter.com/en/docs/twitter-api) to crawl the tweets. We collect a total of 70K tweets for English Image2Tweet and 51K for Hindi Image2Tweet. Table 1 gives the data statistics.

Dataset      English   Hindi
Training       48792   35701
Validation     10209    7652
Test           10411    7652
Total          69412   51005

Table 1: Train, validation, and test data split for the English and Hindi datasets.
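The paper does not spell out the crawler itself, so the following is only a rough illustration of how image-tweet pairs could be collected from a news handle with the Twitter API v2 via the tweepy client. The bearer token, the handle, and the field choices are placeholders, not the authors' actual pipeline.

# Illustrative sketch (not the authors' crawler): collect image-tweet pairs
# from a news handle using tweepy and the Twitter API v2.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")   # placeholder credential
user = client.get_user(username="timesofindia")            # example handle

pairs = []
for page in tweepy.Paginator(client.get_users_tweets, id=user.data.id,
                             max_results=100,
                             tweet_fields=["attachments"],
                             expansions=["attachments.media_keys"],
                             media_fields=["url", "type"]):
    media = {m.media_key: m for m in (page.includes or {}).get("media", [])}
    for tweet in page.data or []:
        keys = (tweet.attachments or {}).get("media_keys", [])
        photos = [media[k].url for k in keys
                  if k in media and media[k].type == "photo"]
        if photos:                                          # keep only tweets carrying an image
            pairs.append({"tweet": tweet.text, "image": photos[0]})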
Figure 3: Tweet: पिंकसिटी में सुबह से हो रही झमाझम बारिश किसी के लिए राहत तो कहीं आफत #jaipur #Monsoon2017. (English: Heavy rain has been falling in Pink City since morning, a relief for some and trouble elsewhere.)

Figures 4 and 5 show the word clouds of the English and Hindi tweets respectively. We observe that most of the words are related to politics and Covid-19.

Figure 4: Word cloud of the English dataset. Most of the words are related to politics and Covid-19.

Figure 5: Word cloud of the Hindi dataset. Most of the words are related to politics and Covid-19.
Clustering is a necessary part of building the dataset, as mentioned in the previous section. For clustering we first remove unnecessary links, symbols, and numbers. However, keeping the hashtags and mentions (with the '@' and '#' symbols stripped) helps in clustering similar tweets. In the next step we remove stopwords, i.e., words which do not add meaning to the sentence.

Figures 6 and 7 show an example of an English and a Hindi cluster respectively. We can see that the tweet objects within a cluster are related or similar to each other. Our aim is to do multimodal (image + text) clustering. Hence, we implement an algorithm in which the similarity score between every pair of tweet objects is calculated and stored in a 2D dictionary, with each row sorted by similarity: the i-th row contains the similarity scores of the i-th tweet object with every other object, in decreasing order. Since every pair of tweet objects has two entries (dict[i][j] and dict[j][i]), we eliminate the entry that sits at the lower relative position within its row. After that, for every row we consider at most the 5 elements with the highest similarity scores and combine them to form a cluster group. Hence, each cluster has at most 6 tweet objects (1 tweet object and its 5 neighbours). The formula to calculate the similarity between two tweet objects is:

Sim(i, j) = w1 * textSim(i, j) + w2 * imgSim(i, j)

textSim(i, j) calculates the similarity between the textual parts of the tweet objects using a weighted average of the overlap of unigrams, bigrams, and trigrams. imgSim(i, j) calculates the cosine similarity between the feature vectors of the images, extracted using DenseNet (Huang et al., 2017). The overall similarity is the weighted average of the two, where w1 + w2 = 1 and w1, w2 ∈ [0, 1].

Figure 6: An example of a cluster from English data. All the tweet objects are related to political elections in India.

Figure 7: An example of a cluster from English data. All the tweet objects are related to cricket.

The datasets are available at https://competitions.codalab.org/competitions/35702.
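As a rough restatement of the preprocessing and clustering procedure above, the sketch below computes pairwise similarities and groups each tweet object with its top-5 neighbours. The helper names, the n-gram weights, and the assumption that DenseNet image features are precomputed are all illustrative; the duplicate-entry elimination step is omitted for brevity.

# Approximate sketch of the tweet-object clustering described above.
# Assumes image feature vectors were already extracted with DenseNet and
# stored in `img_feats`; weights and helper names are illustrative only.
import numpy as np

W1, W2 = 0.5, 0.5          # text vs. image weight, with W1 + W2 = 1

def ngrams(tokens, n):
    return set(zip(*[tokens[i:] for i in range(n)]))

def text_sim(a, b, weights=(0.5, 0.3, 0.2)):
    """Weighted average of unigram/bigram/trigram overlap between two tweets."""
    ta, tb = a.lower().split(), b.lower().split()
    sims = []
    for n, w in zip((1, 2, 3), weights):
        ga, gb = ngrams(ta, n), ngrams(tb, n)
        sims.append(w * (len(ga & gb) / max(1, len(ga | gb))))
    return sum(sims)

def img_sim(fa, fb):
    """Cosine similarity between two image feature vectors."""
    return float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9))

def build_clusters(tweets, img_feats, k=5):
    """Group each tweet object with at most its k most similar neighbours."""
    n = len(tweets)
    sim = {i: {j: W1 * text_sim(tweets[i], tweets[j]) +
                  W2 * img_sim(img_feats[i], img_feats[j])
               for j in range(n) if j != i}
           for i in range(n)}
    clusters = []
    for i in range(n):
        neighbours = sorted(sim[i], key=sim[i].get, reverse=True)[:k]
        clusters.append([i] + neighbours)   # at most k + 1 tweet objects
    return clusters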

5 Baseline

We develop our baseline using BERT (Devlin et al., 2018) and VGG-19 (Simonyan and Zisserman, 2014). The BERT model is pre-trained on the whole of English Wikipedia and the Brown corpus with the next-sentence-prediction objective.

We design a two-branch model (see figure 8). During training, the image embedding obtained from VGG-19 is passed to one branch and the text is passed to the other. In the first branch the image embedding is passed through a dense layer. In the second branch the text is sent to the BERT tokenizer and its output is passed to the pre-trained BERT model. The output from the last layer of BERT is then passed to an LSTM layer, whose output is fed to both max pooling and average pooling; the resulting vectors are concatenated. After this, we concatenate the outputs from both branches and give the concatenated vector as input to an LSTM followed by a dense layer. The output vector of this dense layer is used to generate the words of the tweet.

Figure 8: Architecture diagram of the baseline model.
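Since the architecture is described only at a high level, a rough PyTorch sketch of the two-branch design is given below. The layer sizes, activation, and the use of a length-1 sequence for the final LSTM are assumptions on our part; the released code linked later in this section is authoritative.

# Rough sketch of the two-branch baseline (dimensions are illustrative guesses).
import torch
import torch.nn as nn
from transformers import BertModel

class Image2TweetBaseline(nn.Module):
    def __init__(self, vocab_size, img_dim=4096, hidden=256):
        super().__init__()
        # Branch 1: VGG-19 image embedding -> dense layer
        self.img_dense = nn.Linear(img_dim, hidden)
        # Branch 2: BERT -> LSTM -> max/average pooling
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_lstm = nn.LSTM(self.bert.config.hidden_size, hidden, batch_first=True)
        # Fusion: concatenated branches -> LSTM -> dense layer over the vocabulary
        self.fusion_lstm = nn.LSTM(hidden + 2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_emb, input_ids, attention_mask):
        img = torch.relu(self.img_dense(img_emb))                     # (B, hidden)
        text = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        text, _ = self.text_lstm(text)                                # (B, T, hidden)
        pooled = torch.cat([text.max(dim=1).values, text.mean(dim=1)], dim=-1)
        fused = torch.cat([img, pooled], dim=-1).unsqueeze(1)         # length-1 sequence
        fused, _ = self.fusion_lstm(fused)
        return self.out(fused.squeeze(1))                             # next-word logits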
For training we use the Adam optimiser with a mini-batch size of 32. The learning rate is set to 1e-5 and the maximum caption length is set to 34. At test time, we pass the image vector and the sequence of words generated so far, and the model predicts the next word. We repeat this until the end token appears, using greedy search to generate the whole tweet.

The baseline code is available at https://github.com/git-rishabh-jha/Image2Tweet.
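The greedy decoding loop described above can be pictured as follows. The model interface matches the earlier sketch, and word2idx, idx2word, and the special tokens are placeholders; the actual baseline tokenizes with BERT rather than a plain word-index map.

# Illustrative greedy decoding loop for generating a tweet word by word.
# `model`, `word2idx`, `idx2word`, and the special tokens are placeholders.
import torch

def generate_tweet(model, img_emb, word2idx, idx2word, max_len=34):
    words = ["<start>"]
    for _ in range(max_len):
        input_ids = torch.tensor([[word2idx[w] for w in words]])
        attention_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            logits = model(img_emb, input_ids, attention_mask)
        next_word = idx2word[int(logits.argmax(dim=-1))]
        if next_word == "<end>":            # stop at the end token
            break
        words.append(next_word)
    return " ".join(words[1:])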

6 Results

Table 2 shows the results of the baseline system on both datasets. The results are poor since we use a relatively simple approach to establish the baseline. There is huge scope for improvement, and we encourage more innovative approaches.
System             CIDEr    BLEU-4   METEOR    ROUGE
Baseline-English   0.0003   0.02     0.00013   0.00013
Baseline-Hindi     0.0004   0.03     0.00023   0.00023

Table 2: Results of the baseline systems on the English and Hindi datasets.

7 Conclusion

In this paper we define the image2tweet task and release datasets in Hindi and English for it. The English and Hindi datasets consist of 70K and 51K image-tweet pairs respectively. We cluster similar tweets in our dataset for better evaluation of the system-generated tweets, which are evaluated using CIDEr. Further, we provide a VGG-19 + BERT based baseline system for our data. Image2tweet is more difficult than traditional image captioning and we believe it needs further research attention. Future work includes collecting data for more languages and building more sophisticated systems for the task.

References

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In LREC, volume 6, pages 449–454.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708.

Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. 2018. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 499–515.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Andrew Kurochkin. 2020. Meme generation for social media audience engagement.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. 2000. Gray scale and rotation invariant texture classification with local binary patterns. In European Conference on Computer Vision, pages 404–420. Springer.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.

Shivam Sharma. 2020. Generating tweet-like text from images: where we are… and where we need to be.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR.

Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Russ R. Salakhutdinov. 2016. Review networks for caption generation. Advances in Neural Information Processing Systems, 29:2361–2369.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13041–13049.
