
MUTE: A Multimodal Dataset for Detecting Hateful Memes

Eftekhar Hossain^f, Omar Sharif^Ψ and Mohammed Moshiul Hoque^Ψ
^f Department of Electronics and Telecommunication Engineering
^Ψ Department of Computer Science and Engineering
Chittagong University of Engineering & Technology, Chattogram-4349, Bangladesh
{eftekhar.hossain, omar.sharif, moshiul_240}@cuet.ac.bd

Abstract

The exponential surge of social media has enabled information propagation at an unprecedented rate. However, it has also led to the generation of a vast amount of malign content, such as hateful memes. To eradicate the detrimental impact of this content, the hateful memes detection problem has grabbed the attention of researchers over the last few years. However, most past studies were conducted primarily for English memes, while memes in resource-constrained languages (i.e., Bengali) remain understudied. Moreover, current research considers memes with captions written in monolingual (either English or Bengali) form; memes may instead have code-mixed captions (English + Bangla), and existing models cannot provide accurate inference in such cases. Therefore, to facilitate research in this arena, this paper introduces a multimodal hate speech dataset (named MUTE) consisting of 4158 memes having Bengali and code-mixed captions. A detailed annotation guideline is provided to aid dataset creation in other resource-constrained languages. Additionally, extensive experiments have been carried out on MUTE considering only the visual, only the textual, and both modalities. The results demonstrate that joint evaluation of visual and textual features significantly improves (≈3%) hateful memes classification compared to unimodal evaluation.

WARNING: This paper contains meme examples and words that are offensive in nature.

Figure 1: Examples of hateful memes having (a) only a Bengali caption (attacking religious beliefs) and (b) a code-mixed (Bengali + English) caption (insulting a person).

1 Introduction

With the advent of the Internet, social media platforms (i.e., Facebook, Twitter, Instagram) significantly impact people's day-to-day life. As a result, many users communicate by posting various content in these mediums. This content includes promulgated hate speech, misinformation, and aggressive and offensive views. While some contents are beneficial and enrich our knowledge, they can also trigger human emotions that can be considered harmful. Among them, the propagation of hateful content can directly or indirectly attack social harmony based on race, gender, religion, nationality, political support, immigration status, and personal beliefs. In recent years, memes have become a popular form of circulating hate speech (Kiela et al., 2020). These memes on social media have a pernicious impact on societal polarization as they can instigate hateful crimes. Therefore, to restrain the interaction through hateful memes, an automated system is required to quickly flag this content and lessen the inflicted harm to readers. Several works (Davidson et al., 2017; Waseem and Hovy, 2016) have accomplished hateful content detection, most of which were for the English language. Unfortunately, no significant studies have been conducted on memes in low-resource languages, especially Bengali. In recent years, an increasing trend has been observed among people to use Bengali memes. As a result, identifying Bengali hateful memes has become crucial to mitigate the spread of negativity. However, meme analysis is complicated, as it requires a holistic understanding of visual and textual content to make an inference (Zhou et al., 2021). The visual content of a meme alone may not be harmful (Figure 1 (a)).
However, it becomes hateful with the incorporation of textual content, as it then directly attacks religious beliefs. A meme's caption can also be written in a mixed language (both English and Bengali, as in Figure 1 (b)), which can evade surveillance engines in such cases. Developing a hateful meme detection system for such a scenario is complicated, as no standard dataset is available. Moreover, developing an intelligent multimodal meme analysis system for Bengali is challenging due to the unavailability of a benchmark corpus, the lack of reliable NLP tools (such as OCR), and the complex morphological structure of the Bengali language. Therefore, this work aims to develop a multimodal dataset for Bangla hate speech detection and to investigate various models for the task. The critical contributions of this work are summarized as follows:

• Created a multimodal hate speech dataset (MUTE) in Bengali consisting of 4158 memes annotated with Hate and Not-Hate labels.

• Performed extensive experiments with state-of-the-art visual and textual models, and then integrated the features of both modalities using the early fusion approach.

2 Related Work

This section discusses past studies on hate speech detection based on unimodal (i.e., image or text) and multimodal data.

Unimodal hate speech detection: Hate speech detection is a prominent research issue among researchers of different languages (Ross et al., 2016; Lekea and Karampelas, 2018). Most hate speech detection works were accomplished based on text data. For example, both Davidson et al. (2017) and Waseem and Hovy (2016) developed hate speech datasets considering Twitter posts. Similarly, De Gibert et al. (2018) constructed a dataset that considers hate speech posted in a white supremacy forum. Some works have also addressed low-resource languages; for instance, Fortuna et al. (2019) and Ousidhoum et al. (2019) introduced hate speech datasets for Portuguese and Arabic. A few works have also been done on Bengali hate speech detection (Romim et al., 2021; Mathew et al., 2021; Ishmam and Sharmin, 2019). Several architectures have been employed over the last few years to classify hateful texts. Earlier researchers widely used Recurrent Neural Networks (Gröndahl et al., 2018), Long Short Term Memory (LSTM) networks (Badjatiya et al., 2017), and combinations of RNNs and convolutional neural networks (CNN) (Zhang et al., 2018b). Recently, Bidirectional Encoder Representations from Transformers (BERT)-based models (Pamungkas and Patti, 2019; Fortuna et al., 2021) have been applied and have achieved superior performance compared to the earlier deep learning-based methods.

Multimodal hate speech detection: In contrast to text-based analysis, in recent years a few works have considered multimodal information (i.e., image + text) for hate speech detection. For example, Kiela et al. (2020) introduced a multimodal memes dataset for detecting hate speech. Gomez et al. (2020) developed a large-scale multimodal dataset (MMHS150k) for detecting hateful memes. In another work, Rana and Jha (2022) introduced a multimodal hate speech dataset concerning three modalities (i.e., image, text, and audio). However, few works have addressed multimodal hate speech detection for resource-constrained languages. Perifanos and Goutsos (2021) introduced a multimodal dataset for detecting hate speech in Greek social media. Likewise, Karim et al. (2022) developed a dataset for multimodal hate speech detection from Bengali memes. Several approaches have been employed for detecting hate speech using multimodal learning. Some researchers exploited different fusion techniques (i.e., early and late fusion) (Sai et al., 2022; Perifanos and Goutsos, 2021) to evaluate image and textual features jointly. Others employed bi-linear pooling (Chandra et al., 2021; Choi and Lee, 2019) and transformer-based methods (Kiela et al., 2020) such as MMBT, ViLBERT, and Visual-BERT. Despite the availability of state-of-the-art multimodal transformer architectures, these models have only been applied to high-resource languages (i.e., English).

Differences with existing research: Though a considerable amount of work has been accomplished on multimodal hate speech detection, only a few works have studied low-resource languages (i.e., Bengali). In our exploration, we found one work (Karim et al., 2022) that detects hate speech from multimodal memes for the Bengali language. However, they did not curate social media memes for analysis; instead, they artificially created a memes dataset for Bengali by conjoining hateful texts with various images. Moreover, current works overlooked memes whose captions are written cross-lingually.
Considering these drawbacks, the proposed research differs from existing studies in three ways: (i) it develops a multimodal hate speech dataset (i.e., MUTE) for Bengali considering Internet memes, (ii) it provides a detailed annotation guideline that can be followed for resource creation in other low-resource languages, and (iii) it considers memes that contain code-mixed (English + Bangla) and code-switched (Bengali dialect written in English letters) captions.

3 MUTE: A New Benchmark Dataset

This work developed MUTE, a novel multimodal dataset for Bengali hateful memes detection. MUTE considers memes with code-mixed and code-switched captions. For developing the dataset, we followed the guidelines provided by Kiela et al. (2020). This section briefly describes the dataset development process with detailed statistics.

3.1 Data Accumulation

For dataset construction, we manually collected memes from various social media platforms such as Facebook, Twitter, and Instagram. We searched for memes using a set of keywords such as Bengali Memes, Bangla Troll Memes, Bangla Celebrity Troll Memes, and Bangla Funny Memes. Besides, some popular public meme pages were also considered for data collection, such as Keu Amare Mairala and Ovodro Memes. We accumulated 4210 memes from January 10, 2022, to April 15, 2022. During data collection, some inappropriate memes were discarded by following the guidelines provided by Pramanick et al. (2021). The criteria for discarding data are: (i) memes containing only unimodal data, (ii) memes whose textual or visual information is unclear, and (iii) memes containing cartoons. In this filtering process, 52 memes were removed, leaving a dataset of 4158 memes. Afterwards, the captions of the memes were manually extracted, as Bengali has no standard OCR. Finally, the memes and their corresponding captions were given to the annotators for annotation.

3.2 Dataset Annotation

The collected memes are manually labelled into two distinct categories: Hate and Not-Hate. However, to ensure the dataset's quality, it is essential to follow a standard definition for segregating the two categories. After exploring some existing works on multimodal hate speech detection (Kiela et al., 2020; Gomez et al., 2020; Perifanos and Goutsos, 2021), we define the classes as follows:

Hate: A meme is considered hateful if it intends to vilify, denigrate, bully, insult, or mock an entity based on characteristics including gender, race, religion, caste, and organizational status.

Not-Hate: A meme is reckoned not hateful if it does not express any inappropriate cogitation and conveys positive emotions (i.e., affection, gratitude, support, and motivation) explicitly or implicitly.

3.2.1 Process of Annotation

We instructed the annotators to follow the class definitions when performing the annotation. We also asked them to mention the reasons for assigning a meme to a particular class; this explanation aids the expert in selecting the correct label in case of contradiction. Initially, we trained the annotators with some sample memes. Four annotators (computer science graduate students) performed the manual annotation process, and an expert (a professor conducting NLP research for more than 20 years) verified the labels. Annotators were equally divided into two groups, each annotating a subset of the memes. In case of disagreement, the expert decided the final label. In total, the expert re-labelled 113 memes from non-hateful to hateful and 217 memes from hateful to non-hateful. Inter-annotator agreement was measured using Cohen's Kappa coefficient (Cohen, 1960) to ensure the annotation quality. We achieved a mean Kappa score of 0.714, which indicates a moderate agreement between the annotators. As mentioned earlier, this work is the very first attempt at multimodal hate speech detection that considers social media memes in the Bengali language. Therefore, more extensive scrutiny with more diverse data and a higher level of annotator agreement is required before deploying a model trained on this dataset. The agreement score illustrates how difficult it is for humans to identify potentially hateful memes and raises a question of biases, thus limiting the broader impact of this work.
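The agreement computation is straightforward to reproduce. The following is a minimal sketch using scikit-learn, where `labels_a` and `labels_b` are hypothetical stand-ins for the two annotators' labels over the same subset of memes, not the actual annotation files:

```python
# Minimal sketch of the inter-annotator agreement computation.
# labels_a / labels_b are hypothetical placeholders (1 = Hate, 0 = Not-Hate).
from sklearn.metrics import cohen_kappa_score

labels_a = [1, 0, 0, 1, 1, 0, 1, 0]
labels_b = [1, 0, 1, 1, 0, 0, 1, 0]

kappa = cohen_kappa_score(labels_a, labels_b)
print(f"Cohen's kappa: {kappa:.3f}")  # the paper reports a mean of 0.714
```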
3.3 Dataset Statistics

For training and evaluation, MUTE is split into train (80%), test (10%), and validation (10%) sets. Table 1 presents the class-wise distribution of the dataset. It is observed that the dataset is slightly imbalanced, as the 'Not-Hate' class contains ≈60% of the data.

Class      Train  Test  Valid  Total
Hate       1275   159   152    1586
Not-Hate   2092   257   223    2572

Table 1: Number of instances in the train, test, and validation sets for each class.

Table 2 shows the statistics of the training set, which contains a total of 483 memes with code-mixed captions. It also illustrates that the 'Not-Hate' class has a higher number of words and unique words than the 'Hate' class; however, the average caption length is almost identical in both classes.

                      Hate    Not-Hate
#Code-mixed texts     345     138
#Words                12854   22885
#Unique words         5781    8627
Max. caption length   51      87
Avg. #words/caption   10.08   10.94

Table 2: Training set statistics of the meme captions.

Apart from this, we carried out a quantitative analysis using the Jaccard similarity index to figure out the fraction of overlapping words between the classes. We obtained a score of 0.391, indicating that some common words exist between the classes.
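The overlap statistic is the standard Jaccard index over the two class vocabularies. A minimal sketch follows, in which `hate_captions` and `not_hate_captions` are hypothetical placeholders for the real caption lists:

```python
# Jaccard similarity between the vocabularies of the two classes:
# |V_hate ∩ V_not_hate| / |V_hate ∪ V_not_hate|.
# The caption lists below are hypothetical placeholders.
def jaccard(captions_a, captions_b):
    vocab_a = {w for caption in captions_a for w in caption.split()}
    vocab_b = {w for caption in captions_b for w in caption.split()}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

hate_captions = ["tumi ekta joker", "what a stupid idea"]
not_hate_captions = ["tumi great", "what a lovely idea"]
print(f"Jaccard index: {jaccard(hate_captions, not_hate_captions):.3f}")
```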
4 Methodology

Several computational models have been explored to identify hateful memes considering a single modality (i.e., image or text) and the combination of both modalities (image and text). This section briefly discusses the methods and parameters utilized to construct the models.

4.1 Baselines for Visual Modality

This work employed convolutional neural networks (CNN) to classify hateful memes based on visual information. Initially, the images are resized to 150 × 150 × 3 and then fed into pre-trained CNN models. Specifically, we used the VGG19, VGG16 (Simonyan and Zisserman, 2015), and ResNet50 (He et al., 2016) architectures, fine-tuned on the MUTE dataset using the transfer learning (Tan et al., 2018) approach. Beforehand, the top two layers of each model are replaced with a sigmoid layer for classification.
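As an illustration of this pipeline, the following Keras sketch fine-tunes a pre-trained VGG16 backbone with a sigmoid head on 150 × 150 × 3 inputs. The pooling layer and optimizer settings are assumptions for the sketch, not the paper's exact configuration:

```python
# Sketch of a visual baseline: pre-trained VGG16 backbone, top layers
# replaced by a sigmoid classification head, fine-tuned on meme images.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

backbone = VGG16(weights="imagenet", include_top=False,
                 input_shape=(150, 150, 3))

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),          # assumed pooling head
    layers.Dense(1, activation="sigmoid"),    # Hate vs. Not-Hate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_imgs, val_lbls))
```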
4.2 Baselines for Textual Modality

For text-based hateful meme analysis, various deep learning models are employed, including BiLSTM + CNN (Sharif et al., 2020), BiLSTM + Attention (Zhang et al., 2018a), and Transformers (Vaswani et al., 2017).

BiLSTM + CNN: At first, the word embedding (Mikolov et al., 2013) vectors are fed to a BiLSTM layer consisting of 64 hidden units. Following this, a convolution layer with 32 filters of kernel size two is added, followed by a max-pooling layer to extract the significant contextual features. Finally, a sigmoid layer is used for the classification. The final time step's output of the BiLSTM network provides the contextual information of the overall text.

BiLSTM + Attention: We applied the additive attention (Bahdanau et al., 2015) mechanism to the individual word representations of the BiLSTM cell; here, the CNN is replaced with an attention layer, which tries to give higher weight to the words most significant for inferring a particular class.

Transformers: Pre-trained transformer models have recently obtained remarkable performance on almost every NLP task (Naseem et al., 2020; Yang et al., 2020; Cao et al., 2020). As MUTE contains cross-lingual text, this work employed three transformer models, namely Multilingual Bidirectional Encoder Representations from Transformers (M-BERT (Devlin et al., 2019)), Bangla-BERT (Sarker, 2020), and the Cross-Lingual Representation Learner (XLM-R (Conneau et al., 2020)). All models are downloaded from the HuggingFace transformers library (https://huggingface.co/), and we follow its preprocessing and encoding technique (https://huggingface.co/docs/tokenizers/index) for preparing the texts. Each transformer model provides a sentence representation vector of size 768. This vector is passed to a dense layer of 32 neurons, and, starting from the pre-trained weights, the models are retrained on the developed dataset with a sigmoid output layer.
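To make the recurrent pipeline concrete, the following is a hedged Keras sketch of the BiLSTM + CNN variant (embedding, 64-unit BiLSTM, 32-filter Conv1D with kernel size two, max pooling, sigmoid). VOCAB_SIZE, EMBED_DIM, and MAX_LEN are assumed values, not the paper's settings:

```python
# Sketch of the BiLSTM + CNN textual baseline described above.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 64  # hypothetical values

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Conv1D(32, kernel_size=2, activation="relu"),
    layers.GlobalMaxPooling1D(),               # max pooling over time steps
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```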
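The transformer variants can be sketched similarly. The example below, shown for M-BERT ("bert-base-multilingual-cased" is a real HuggingFace checkpoint; the sequence length and learning rate are assumptions), wires the 768-dimensional [CLS] vector into the 32-neuron dense layer and sigmoid output described above; Bangla-BERT and XLM-R plug in the same way:

```python
# Sketch of the transformer baseline: encoder -> 768-d sentence vector
# -> 32-neuron dense layer -> sigmoid, fine-tuned end to end.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MAX_LEN = 64                                  # assumed sequence length
name = "bert-base-multilingual-cased"         # M-BERT; others are analogous
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = TFAutoModel.from_pretrained(name)

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32,
                                name="attention_mask")

# take the [CLS] token's hidden state as the sentence representation
sentence_vec = encoder(input_ids, attention_mask=attention_mask)[0][:, 0, :]
hidden = tf.keras.layers.Dense(32, activation="relu")(sentence_vec)
output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model([input_ids, attention_mask], output)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

# encoding a caption with the checkpoint's own tokenizer:
enc = tokenizer(["sample caption"], padding="max_length", truncation=True,
                max_length=MAX_LEN, return_tensors="tf")
```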
4.3 Baselines for Multimodal Data

In recent years, the joint evaluation of visual and textual data has proven superior in solving many complex NLP problems (Hori et al., 2017; Yang et al., 2019; Alam et al., 2021). This work investigates the joint learning of multimodal data for hateful memes classification. For multimodal feature representation, we employed the feature fusion (Nojavanasghari et al., 2016) approach. In the experiments, all the visual models and two textual models (i.e., Bangla-BERT and XLM-R) are used to construct the multimodal models. For the model construction, we added a dense layer of 100 neurons on each modality side and then concatenated their outputs to make a combined visual and textual representation. Finally, this combined feature is passed to a dense layer of 32 neurons, followed by a sigmoid layer for the classification task.
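A minimal sketch of this early-fusion head follows, assuming a 512-dimensional visual vector and a 768-dimensional textual vector; the actual dimensions depend on the chosen backbones:

```python
# Sketch of the early-fusion model: each modality's feature vector passes
# through a 100-neuron dense layer, the outputs are concatenated, and a
# 32-neuron dense layer plus sigmoid performs the classification.
from tensorflow.keras import layers, Model

visual_in = layers.Input(shape=(512,), name="visual_features")   # assumed dim
text_in = layers.Input(shape=(768,), name="textual_features")    # assumed dim

visual_h = layers.Dense(100, activation="relu")(visual_in)
text_h = layers.Dense(100, activation="relu")(text_in)

fused = layers.Concatenate()([visual_h, text_h])
hidden = layers.Dense(32, activation="relu")(fused)
output = layers.Dense(1, activation="sigmoid")(hidden)

model = Model([visual_in, text_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```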
5 MUTE: Benchmark Evaluation

The training set is used to train the models, whereas the validation set is used for tuning the hyperparameters. We empirically tried several hyperparameter settings to obtain better model performance and report the best one. The final evaluation of the models is done on the test set. This work selects the weighted F1-score (WF) as the primary evaluation metric due to the class-imbalanced nature of the dataset. Apart from this, we used the class weighting technique (Sun et al., 2009) to give equal priority to the minority class (Hate) during model training.
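Both choices are straightforward to reproduce. A minimal sketch with scikit-learn follows, using hypothetical placeholder arrays rather than the actual predictions:

```python
# Sketch of the evaluation setup: class weights inversely proportional to
# class frequency during training, and weighted F1 as the test metric.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score

train_labels = np.array([0, 0, 0, 1, 1])  # placeholder: 0 = Not-Hate, 1 = Hate
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(train_labels),
                               y=train_labels)
class_weight = dict(enumerate(weights))   # e.g. passed to model.fit(...)

test_labels = np.array([0, 1, 1, 0])       # placeholder ground truth
predictions = np.array([0, 1, 0, 0])       # placeholder model outputs
print("WF:", f1_score(test_labels, predictions, average="weighted"))
```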
5.1 Results

Table 3 illustrates the outcomes of the visual, textual, and multimodal models for hateful memes classification. Among the visual models, ResNet50 obtained the maximum WF of 0.641. For the text modality, the B-BERT model obtained the highest WF (0.649); the outcomes of the other textual models (i.e., BiLSTM + Attention, BiLSTM + CNN, and XLM-R) do not exhibit significant differences compared to the best model (B-BERT). On the other hand, the multimodal information does not improve the outcomes of most models: almost all of the multimodal models' WF lies around 0.60, except the VGG19 + B-BERT model (0.641). However, the VGG16 + B-BERT model outperformed all the models by achieving the highest WF of 0.672, which is approximately 2% higher than the best unimodal model, B-BERT (0.649).

Approach     Models               P      R      WF
Visual       VGG19                0.594  0.579  0.584
             VGG16                0.636  0.644  0.638
             ResNet50             0.643  0.639  0.641
Textual      BiLSTM + CNN         0.617  0.663  0.608
             BiLSTM + Attention   0.647  0.653  0.642
             M-BERT               0.627  0.644  0.620
             B-BERT               0.645  0.658  0.649
             XLM-R                0.646  0.656  0.648
Multimodal   VGG19 + B-BERT       0.639  0.649  0.641
             VGG16 + B-BERT       0.676  0.670  0.672
             ResNet50 + B-BERT    0.606  0.620  0.609
             VGG16 + XLM-R        0.594  0.581  0.586
             VGG19 + XLM-R        0.515  0.605  0.489
             ResNet50 + XLM-R     0.651  0.600  0.604

Table 3: Performance comparison of the visual, textual, and multimodal models on the test set, where P, R, and WF denote precision, recall, and weighted F1-score, respectively.

5.2 Error Analysis

We conducted a quantitative error analysis to investigate the models' mistakes across the two classes. To illustrate the errors, the number of misclassified instances is reported in Figure 2 for the best unimodal (ResNet50 and B-BERT) and multimodal (VGG16 + B-BERT) models. It is observed that, going from the visual to the textual model, the misclassification rate (MR) increased by ≈10% for the 'Hate' class and decreased by ≈9% for the 'Not-Hate' class. However, the joint evaluation of multimodal features significantly reduced the MR in the Hate class to 38% (from 44% and 54%) and thus improved the overall performance. Though the multimodal model showed superior performance compared to the unimodal models, there is still room for improvement. We point out several reasons behind the models' mistakes. Among them, identical words appearing in different written formats (code-mixed, code-switched) made it difficult for the models to identify accurate labels. Moreover, the discrepancy between some memes' visual and textual information creates confusion for the multimodal model. These are significant factors that should be tackled to develop a more sophisticated model for Bengali hateful memes classification.

Figure 2: Misclassification rate across the two classes by different models.
6 Conclusion

This paper presented a multimodal framework for hateful memes classification and investigated its performance on a newly developed multimodal dataset (MUTE) with Bengali and code-mixed (Bangla + English) captions. For benchmarking the framework, this work exploited several computational models for detecting hateful content. The key finding of the experiments is that the joint evaluation of multimodal features is more effective than using the memes' visual or textual information alone. Moreover, the cross-lingual embeddings (XLM-R) did not provide the expected performance compared to the monolingual embeddings (Bangla-BERT) when jointly evaluated with the visual features. The error analysis reveals that the models' performance gets biased towards a particular class due to the class imbalance. In future, we aim to alleviate this problem by extending the dataset to a larger scale and framing the task as a multi-class classification problem. Secondly, for robust inference, advanced fusion techniques (i.e., co-attention) and multitask learning approaches will be explored. Finally, future research will explore the impact of dataset sampling and conduct ablation studies (i.e., experimenting with only English, only Bangla, code-mixed, and code-switched text) to convey valuable insights about the models' performance.

References

Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimiter Dimitrov, Giovanni Da San Martino, Shaden Shaar, Hamed Firooz, and Preslav Nakov. 2021. A survey on multimodal disinformation detection. arXiv preprint arXiv:2103.12541.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, and Niranjan Balasubramanian. 2020. DeFormer: Decomposing pre-trained transformers for faster question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4487–4497.

Mohit Chandra, Dheeraj Pailla, Himanshu Bhatia, Aadilmehdi Sanchawala, Manish Gupta, Manish Shrivastava, and Ponnurangam Kumaraguru. 2021. "Subverting the Jewtocracy": Online antisemitism detection using multimodal deep learning. In 13th ACM Web Science Conference 2021, pages 148–157.

Jun-Ho Choi and Jong-Seok Lee. 2019. EmbraceNet: A robust deep learning architecture for multimodal classification. Information Fusion, 51:259–270.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11, pages 512–515.

Ona De Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. arXiv preprint arXiv:1809.04444.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Paula Fortuna, Joao Rocha da Silva, Leo Wanner, Sérgio Nunes, et al. 2019. A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the Third Workshop on Abusive Language Online, pages 94–104.

Paula Fortuna, Juan Soler-Company, and Leo Wanner. 2021. How well do hate speech, toxicity, abusive and offensive language classification models generalize across datasets? Information Processing & Management, 58(3):102524.

Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas. 2020. Exploring hate speech detection in multimodal publications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1470–1478.

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, pages 2–12.

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In Proceedings of the IEEE International Conference on Computer Vision, pages 4193–4202.

Alvi Md Ishmam and Sadia Sharmin. 2019. Hateful speech detection in public Facebook pages for the Bengali language. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 555–560. IEEE.

Md Karim, Sumon Kanti Dey, Tanhim Islam, Bharathi Raja Chakravarthi, et al. 2022. Multimodal hate speech detection from Bengali memes and texts. arXiv preprint arXiv:2204.10196.

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624.

Ioanna K. Lekea and Panagiotis Karampelas. 2018. Detecting hate speech within the terrorist argument: A Greek case. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 1084–1091. IEEE.

Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2021. HateXplain: A benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14867–14875.

Tomas Mikolov, Kai Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.

Usman Naseem, Imran Razzak, Katarzyna Musial, and Muhammad Imran. 2020. Transformer based deep intelligent contextual embedding for Twitter sentiment analysis. Future Generation Computer Systems, 113:58–69.

Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. 2016. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 284–288.

Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4675–4684.

Endang Wahyu Pamungkas and Viviana Patti. 2019. Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 363–370, Florence, Italy. Association for Computational Linguistics.

Konstantinos Perifanos and Dionysis Goutsos. 2021. Multimodal hate speech detection in Greek social media. Multimodal Technologies and Interaction, 5(7):34.

Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Md Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty. 2021. Detecting harmful memes and their targets. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2783–2796.

Aneri Rana and Sonali Jha. 2022. Emotion based hate speech detection using multimodal learning. arXiv preprint arXiv:2202.06218.

Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, Saiful Islam, et al. 2021. Hate speech detection in the Bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence, pages 457–468. Springer.

Bjorn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In 3rd Workshop on Natural Language Processing for Computer-Mediated Communication/Social Media, pages 6–9. Ruhr-Universität Bochum.

Siva Sai, Naman Deep Srivastava, and Yashvardhan Sharma. 2022. Explorative application of fusion techniques for multimodal hate speech detection. SN Computer Science, 3(2):1–13.

Sagor Sarker. 2020. BanglaBERT: Bengali mask language model for Bengali language understanding.

Omar Sharif, Eftekhar Hossain, and Mohammed Moshiul Hoque. 2020. TechTexC: Classification of technical texts using convolution and bidirectional long short term memory network. In Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task, pages 35–39, Patna, India. NLP Association of India (NLPAI).

K. Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Yanmin Sun, Andrew K. C. Wong, and Mohamed S. Kamel. 2009. Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719.

Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. 2018. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Fan Yang, Xiaochang Peng, Gargi Ghosh, Reshef Shilon, Hao Ma, Eider Moore, and Goran Predovic. 2019. Exploring deep multimodal fusion of text and photo for hate speech classification. In Proceedings of the Third Workshop on Abusive Language Online, pages 11–18.

Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9378–9385.

You Zhang, Jin Wang, and Xuejie Zhang. 2018a. YNU-HPCC at SemEval-2018 Task 1: BiLSTM with attention based sentiment analysis for affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 273–278.

Ziqi Zhang, David Robinson, and Jonathan Tepper. 2018b. Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.

Yi Zhou, Zhenhao Chen, and Huiyuan Yang. 2021. Multimodal learning for hateful memes detection. In 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 1–6. IEEE.