Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation
Abstract. Social media sites such as YouTube and Facebook have become an integral part of everyone's life, and in the last few years hate speech in social media comment sections has increased rapidly. Detecting hate speech on social media websites faces a variety of challenges, including small imbalanced datasets, finding an appropriate model, and choosing a feature analysis method. The problem is more severe for the Bengali-speaking community due to the lack of gold-standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All the user comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok & meme. A total of 50 annotators annotated each comment three times, and the majority vote was taken as the final annotation. We have also conducted baseline experiments with several deep learning models along with extensive pretrained Bengali word embeddings such as Word2Vec, FastText, and BengFastText on this dataset to facilitate future research opportunities. The experiments illustrate that although all the deep learning models performed well, the SVM baseline achieved the best result with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of Bengali hate speech detection.
1 Introduction
Social media has become an essential part of every person's day-to-day life. It enables fast communication and easy access to sharing and receiving ideas and views from around the world. However, at the same time, this freedom of expression has led to a continuous rise of hate speech and offensive language on social media.
Part of this problem has been created by the corporate social media model and the gap between documented community policy and the real-life implications of hate speech [4]. Hate speech language is also very diverse [17]. The language used in social media is often very different from traditional print media and has its own linguistic features. Thus, automatically detecting hate speech is very challenging [20].
Even though much work has been done on hate speech prevention in the English language, there is a significant lack of resources for hate speech detection in Bengali social media. Meanwhile, problems like online abuse, and especially online abuse towards women, are continuously on the rise in Bangladesh [19]. For a low-resource language like Bengali, developing and deploying machine learning models to tackle such real-life problems is very difficult, since there is a shortage of datasets and other tools for Bengali text classification [6]. So, the need for research on the nature and prevention of social media hate speech has never been higher.
This paper describes our attempt to address this problem. Our dataset comprises 30,000 Bengali comments from YouTube and Facebook comment sections, of which 10,000 are hate speech. We selected comments from seven different categories: sports, entertainment, crime, religion, politics, celebrity, and TikTok & meme, making the dataset diverse. We ran several deep learning models along with word embedding models such as Word2Vec, FastText, and the pretrained BengFastText on our dataset to obtain benchmark results. Lastly, we analyzed our findings and explained the challenges of detecting hate speech.
2 Literature review
Much work has been done on detecting hate speech using deep learning models [7], [10]. There have also been efforts to increase the accuracy of hate speech prediction specifically by extracting unique semantic features of hate speech comments [22]. Researchers have also utilized fastText to build models that can be trained on billions of words in less than ten minutes and classify millions of sentences among hundreds of classes [14]. Other research indicates how an annotator's bias and worldview affect the performance of a dataset [21]. State-of-the-art research on hate speech detection has now reached the point of utilizing advanced architectures such as transfer learning. For example, in [9], researchers compared deep learning, transfer learning, and multitask learning architectures on an Arabic hate speech dataset. There is also research on identifying hate speech in a multilingual dataset using pretrained state-of-the-art models such as mBERT and XLM-RoBERTa [3].
Unfortunately, very little work has been done on hate speech detection in Bengali social media. The main challenge is the lack of sufficient data. To the best of our knowledge, many of the existing datasets contain around 5,000 comments [5], [8], [12]. There is a publicly available corpus of around 10,000 comments annotated into five different classes [2].
However, one limitation its authors faced was that they could only use sentences labeled as toxic in their experiments, since the other labels were low in number. There is also a dataset of 2,665 comments translated from an English hate speech dataset [1]. Another study used a rule-based stemmer to obtain linguistic features [8]. Studies are also emerging that use deep learning models such as CNN and LSTM to obtain better results [2], [5], [8]. One of the biggest challenges is that Bengali is a low-resource language. Research has been done to create a word embedding specifically for Bengali, called BengFastText, trained on a 250-million-word Bengali corpus [15]. However, one thing is clear: there is a lack of datasets that are both large and diverse. This paper tries to tackle the problem by presenting a large dataset with comments from seven different categories, making it diverse. To our knowledge, this is the first Bengali social media hate speech dataset that is both this large and this diverse.
3 Dataset
Our primary goal while creating the dataset was to cover a wide variety of data. For this reason, we extracted comments from YouTube and Facebook in seven different categories: sports, entertainment, crime, religion, politics, celebrity, and a seventh category of memes, TikTok, and miscellaneous content.
We extracted comments from the public Facebook page of Dr. Muhammad Zafar Iqbal, a prominent Bangladeshi science fiction author, physicist, academic, and activist. These comments belong to the celebrity category. However, due to Facebook's restrictions on its Graph API, we had to focus on YouTube as the primary source of data.
From YouTube, we looked for the most scandalous and controversial topics in Bangladesh between 2017 and 2020. We reasoned that since these topics were controversial, videos were made more frequently, people participated more in the comment sections, and the comments might contain more hate speech. We searched YouTube for videos using keywords related to these controversial events. For example, we searched for the divorce of the renowned singer and actor couple Mithila and Tahsan, i.e., the Mithila controversy of 2020. We then selected videos with at least 700k views and extracted comments from them. In this way, we extracted comments from videos on controversial topics covering five categories: sports, entertainment, crime, religion, and politics. Finally, we searched for meme and TikTok videos and other keywords likely to attract hate speech in their comment sections. This is our seventh category.
We extracted all the comments using the open-source software FacePager (https://github.com/strohne/Facepager). After extraction, we labeled each comment with a number defining the category it belongs to. We also labeled each comment with an additional number defining its keyword. In this paper, a keyword means a controversial event
that falls under a category. For example, mithila is a keyword that belongs to the entertainment category. We labeled every comment with its corresponding category and keyword to support future research.
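For illustration, a labeled record could look like the following sketch; the field names and numeric codes are hypothetical, not the exact schema of the released files.

```python
# Hypothetical record layout; field names and codes are illustrative only.
sample_row = {
    "comment": "...",   # the raw Bengali comment text
    "category": 2,      # e.g. 2 = entertainment
    "keyword": 7,       # e.g. 7 = the 'mithila' controversy
}
```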
After extracting the comments, we manually checked the whole dataset and removed all comments made up entirely of English or other non-Bengali text, emoji, punctuation, or numerical values. However, we kept a comment if it consisted primarily of Bengali words with some emoji, numbers, or non-Bengali words mixed in. We deleted non-Bengali comments because our research focuses on Bengali hate speech, so non-Bengali comments are out of scope. We kept the impure Bengali comments because people rarely write pure Bengali on social media; emoji, numbers, punctuation, and English words are characteristic features of social media comments. We therefore believe our dataset can prove very rich for future research. In the end, we collected a total of 30,000 comments. In a nutshell, our dataset has 30k comments that are mostly Bengali sentences with some emoji, punctuation, numbers, and English letters mixed in.
– If a comment contains a slang word but does not dehumanize any person, it is not considered hate speech. For example:
কি বালের মুভি
(roughly: "what a crappy movie"). Here the slang word is not used to dehumanize any person, so this is not hate speech.
– If a comment does not itself dehumanize a person but directly supports another idea that clearly dehumanizes a person or a community, it is considered hate speech. For example:
গর্তযোদ্ধা
Our dataset has a total of 30,000 comments, of which 10,000 are hate speech, as shown in Table 2. It is therefore clear that our dataset is heavily skewed towards not hate speech. Looking closely at each category in Figure 1 makes this even more apparent: in all categories, not hate speech comments dominate. In celebrity and politics in particular, the number of hate speech comments is very low, even compared to the hate speech in other categories. During data collection, we observed that there were many hateful comments in the celebrity section, i.e., on Dr. Muhammad Zafar Iqbal's Facebook page, but they were out of context. As discussed in Section 3.2, we only considered the text itself, without context, when labeling a comment as hate speech, so many such comments were labeled as not hate speech. For the politics category, we observed that people tend not to attack any person or group directly; rather, they tend to add their own take on the current political environment. So the number of direct attacks is lower in the politics category.
Looking at the mean text lengths in Table 3, we find a couple of interesting observations. First, meme comments are very short. This makes sense: when a person comments on a meme video, they are likely expressing a momentary state of mind, which requires shorter sentences. The opposite is true for the celebrity category, which has the longest average text length. This is largely because when people comment on Dr. Zafar Iqbal's Facebook page, they add a lot of their own opinion and analysis, regardless of whether the comment is hate speech or not. This shows how distinctive the comment section of an individual celebrity page can be. Lastly, we see that hate speech comments tend to be shorter on average than not hate speech comments.
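As a sketch of how such per-category statistics can be computed, assuming the dataset is loaded as a pandas DataFrame with one row per comment (the column names and toy rows below are our own, not the released schema):

```python
import pandas as pd

# Toy stand-in for the real dataset; in practice this would be read from
# the released files, e.g. with pd.read_csv(...).
df = pd.DataFrame({
    "category": ["meme", "meme", "celebrity", "politics"],
    "hate":     [1, 0, 0, 0],  # 1 = hate speech, 0 = not hate speech
    "text":     ["ha ha", "lol", "a long opinionated comment ...", "my take ..."],
})
df["length"] = df["text"].str.split().str.len()

print(df.groupby("category")["hate"].mean())    # hate-speech ratio per category
print(df.groupby("category")["length"].mean())  # mean text length per category
```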
In Table 4, we compare all state-of-the-art Bengali hate speech datasets. The table includes the total size of each dataset and the number of classes into which it was annotated. As can be seen, some datasets have multiple classes. In this paper, we focused on the total size of the dataset and extracted comments from different categories so that we could ensure linguistic variation.
4 Experiment
4.1 Preprocessing
Our 30k dataset consists of raw Bengali comments with emoji, punctuation, and English letters mixed in. For the baseline evaluation, we removed all emoji, punctuation, numerical values, non-Bengali letters, and symbols from every comment. After that, we created a train set of 24,000 comments and a test set of 6,000 comments. The dataset was then ready for evaluation.
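A minimal sketch of this preprocessing step follows; the regular expression, variable names, toy data, and random seed are our own choices, not specified in the paper:

```python
import re

from sklearn.model_selection import train_test_split

# Toy stand-ins; the real lists would hold all 30,000 comments and labels.
raw_comments = ["ভালো video 👍!!", "খুব খারাপ 😡 100%"]
labels = [0, 1]

# Keep only characters in the Bengali Unicode block (U+0980-U+09FF) plus spaces.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")

def clean_comment(text: str) -> str:
    """Remove emoji, punctuation, digits, and non-Bengali symbols."""
    text = NON_BENGALI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

comments = [clean_comment(c) for c in raw_comments]
# On the full dataset, an 80/20 split gives the 24,000 / 6,000 sets used here.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42)
```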
Table 4. A comparison of all state-of-the-art datasets on Bengali hate speech
4.3 Models
Support Vector Machine (SVM) We used a Support Vector Machine [11] for the baseline evaluation. We used the linear kernel and kept all other parameters at their default values.
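The paper fixes the linear kernel and default parameters but not the feature representation, so the TF-IDF features in the sketch below are our assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Linear-kernel SVM over TF-IDF features (the feature choice is assumed).
svm_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
svm_clf.fit(X_train, y_train)   # split from the preprocessing sketch above
pred = svm_clf.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))
```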
Long Short Term Memory (LSTM) For our experiment, we used an LSTM layer with 100 units, set both the dropout and recurrent dropout rates to 0.2, and used Adam as the optimizer.
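A Keras sketch of this configuration follows; the vocabulary size, embedding dimension, and sequence handling are assumptions, since the paper fixes only the LSTM units, dropout rates, optimizer, epochs, and batch size:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(input_dim=50_000, output_dim=300),    # vocab size / dim assumed
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),  # as specified above
    Dense(1, activation="sigmoid"),                 # hate vs. not hate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stand-ins for the integer-encoded, padded training comments.
X_train_seq = np.random.randint(1, 50_000, size=(256, 100))
y_train_bin = np.random.randint(0, 2, size=(256,))
model.fit(X_train_seq, y_train_bin, epochs=5, batch_size=64)  # values from the text
```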
In each case, we kept all other parameters at their standard values. The number of epochs and the batch size were set to 5 and 64, respectively. We then tested all the trained models on the test dataset and measured accuracy and F1 score. Below are all the model configurations we tested on our dataset, followed by a sketch of how the pretrained embeddings can be plugged into the recurrent models.
– Baseline evaluation: Support Vector Machine (SVM)
– FastText Embedding with LSTM
– FastText Embedding with Bi-LSTM
– BengFastText Embedding with LSTM
– BengFastText Embedding with Bi-LSTM
– Word2Vec Embedding with LSTM
– Word2Vec Embedding with Bi-LSTM
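One way the pretrained embeddings listed above can be wired into the LSTM and Bi-LSTM models is sketched below; the file path is a placeholder, and word_index is assumed to come from a Keras tokenizer fitted on the training comments:

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.layers import Embedding

# Placeholder path: any word2vec-format vectors (Word2Vec, FastText,
# or BengFastText exports) can be loaded this way.
wv = KeyedVectors.load_word2vec_format("bengali_vectors.vec")

word_index = {"ভালো": 1, "খারাপ": 2}  # stand-in for a Keras tokenizer's word_index

emb_matrix = np.zeros((len(word_index) + 1, wv.vector_size))
for word, idx in word_index.items():
    if word in wv:                      # unknown words stay as zero vectors
        emb_matrix[idx] = wv[word]

emb_layer = Embedding(len(word_index) + 1, wv.vector_size,
                      weights=[emb_matrix], trainable=False)
```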
4.5 Results
As Table 5 shows, all the models achieved good accuracy. SVM achieved the overall best result, with an accuracy of 87.5% and an F1 score of 0.911. BengFastText with LSTM and Bi-LSTM had the relatively worst accuracy and F1 scores. Their low F1 scores indicate that the deep learning models with BengFastText embeddings overfitted the most. BengFastText was not trained on any YouTube data [18], while our dataset contains a huge number of YouTube comments. This might be a reason for its drop in performance.
We then looked at the performance of the Word2Vec and FastText embeddings. FastText performed better in terms of accuracy, although it had a lower F1 score than Word2Vec; overall, Word2Vec overfitted more than FastText. FastText has one distinct advantage over Word2Vec: it learns from the words of a corpus and their substrings. Thus FastText can tell that 'love' and 'beloved' are similar words [13]. This might be one reason why FastText outperformed Word2Vec.
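This subword effect can be illustrated with a tiny gensim example; the toy corpus below is ours, so the printed numbers are not meaningful beyond showing that morphological variants and even out-of-vocabulary words receive vectors:

```python
from gensim.models import FastText

# Train on a toy corpus just to demonstrate the API and subword behavior.
sentences = [["love", "beloved", "lovely"], ["hate", "hateful", "hated"]]
ft = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=10)

print(ft.wv.similarity("love", "beloved"))  # related via shared character n-grams
print(ft.wv["loveless"][:5])                # an OOV word still gets a vector
```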
We manually cross-checked all labels of the test set against the predictions of the SVM model. We looked at the false negative and false positive cases, aiming to find which types of sentences the model failed to predict accurately. We found that some of the labels were in fact wrong and the model had predicted them correctly. Nevertheless, there were some unusual cases. For example:
ওরা চাইবো কেন তোর কাছে তুই বিসিবি প্রেসিডেন্ট হয়ে কি বাল ফালাস তোর তো খেয়াল রাখা দরকার ছিল ক্রিকেটারদের ফিসিলিস নিয়ে
(roughly: "Why would they come to you? What the hell are you doing as BCB president? You should have looked after the cricketers' facilities.")
This is not hate speech, but the model predicted it to be hate speech. There are several other similar examples. The reason is that this sentence contains aggressive words that are normally used in hate speech, but here they are not used to dehumanize another person, and the model failed to understand that. This type of mistake was common in false negative cases. It demonstrates
that words in the Bengali language can be used in complicated contexts, and it is a tremendous challenge for machine learning models to understand the proper context of a word used in a sentence.
5 Conclusion
Bengali social media users often spell the same word in several different ways. A human brain can understand that these variants are the same word, but to a deep learning model they are different. Another difference is the use of emoji to convey meaning: often people express a specific emotion with only an emoji. Emoji are a recent phenomenon, absent from blog posts, newspaper articles, and books. Currently, there is no dataset or pretrained model that classifies the sentiment of emoji used in social media.
One of the critical challenges for accurate hate speech detection is to create models that can extract the necessary information from an unbalanced dataset and predict the minority class with reasonable accuracy. Our experiment demonstrated that standard deep learning models are not sufficient for this task. Advanced models like mBERT and XLM-RoBERTa can be of great use in this regard, as they are trained on large multilingual datasets and use attention mechanisms. Embedding models based on extensive and diverse social media comment datasets can also be of great help.
6 Acknowledgements
This work would not have been possible without the kind support of the SUST NLP Research Group and the SUST Research Center. We would also like to express our heartfelt gratitude to all the annotators and volunteers who made this journey possible.
References
1. Awal, M.A., Rahman, M.S., Rabbi, J.: Detecting Abusive Comments in Discus-
sion Threads Using Naïve Bayes. 2018 International Conference on Innovations
in Science, Engineering and Technology, ICISET 2018 (October), 163–167 (2018).
https://doi.org/10.1109/ICISET.2018.8745565
2. Banik, N.: Toxicity Detection on Bengali Social Media Comments using Supervised
Models (November) (2019). https://doi.org/10.13140/RG.2.2.22214.01608
3. Baruah, A., Das, K., Barbhuiya, F., Dey, K.: Aggression identification in english,
hindi and bangla text using bert, roberta and svm. In: Proceedings of the Second
Workshop on Trolling, Aggression and Cyberbullying. pp. 76–82 (2020)
4. Ben-David, A., Fernández, A.M.: Hate speech and covert discrimination on social
media: Monitoring the facebook pages of extreme-right political parties in spain.
International Journal of Communication 10, 27 (2016)
5. Chakraborty, P., Seddiqui, M.H.: Threat and Abusive Language Detection on So-
cial Media in Bengali Language. 1st International Conference on Advances in Sci-
ence, Engineering and Robotics Technology 2019, ICASERT 2019 2019(Icasert),
1–6 (2019). https://doi.org/10.1109/ICASERT.2019.8934609
6. Chakravarthi, B.R., Arcan, M., McCrae, J.P.: Improving wordnets for under-
resourced languages using machine translation. In: Proceedings of the 9th Global
WordNet Conference (GWC 2018). p. 78 (2018)
7. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detec-
tion and the problem of offensive language. Proceedings of the 11th International
Conference on Web and Social Media, ICWSM 2017 (Icwsm), 512–515 (2017)
8. Emon, E.A., Rahman, S., Banarjee, J., Das, A.K., Mittra, T.: A Deep Learn-
ing Approach to Detect Abusive Bengali Text. 2019 7th International Confer-
ence on Smart Computing and Communications, ICSCC 2019 pp. 1–5 (2019).
https://doi.org/10.1109/ICSCC.2019.8843606
9. Farha, I.A., Magdy, W.: Multitask learning for arabic offensive language and hate-
speech detection. In: Proceedings of the 4th Workshop on Open-Source Arabic
Corpora and Processing Tools, with a Shared Task on Offensive Language Detec-
tion. pp. 86–90 (2020)
10. Gambäck, B., Sikdar, U.K.: Using convolutional neural networks to classify hate-speech. In: Proceedings of the First Workshop on Abusive Language Online. pp. 85–90 (2017). https://doi.org/10.18653/v1/w17-3013
11. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector
machines. IEEE Intelligent Systems and their applications 13(4), 18–28 (1998)
12. Ishmam, A.M., Sharmin, S.: Hateful speech detection in public facebook pages
for the bengali language. Proceedings - 18th IEEE International Conference
on Machine Learning and Applications, ICMLA 2019 pp. 555–560 (2019).
https://doi.org/10.1109/ICMLA.2019.00104
13. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fast-
text.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651
(2016)
14. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
classification. 15th Conference of the European Chapter of the Association for
Computational Linguistics, EACL 2017 - Proceedings of Conference 2(2), 427–431
(2017). https://doi.org/10.18653/v1/e17-2068
15. Karim, M.R., Chakravarthi, B.R., McCrae, J.P., Cochez, M.: Classification
Benchmarks for Under-resourced Bengali Language based on Multichannel
Convolutional-LSTM Network (2020), http://arxiv.org/abs/2004.07807
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in neural
information processing systems. pp. 3111–3119 (2013)
17. Mondal, M., Silva, L.A., Benevenuto, F.: A measurement study of hate speech
in social media. In: Proceedings of the 28th ACM Conference on Hypertext and
Social Media. pp. 85–94 (2017)
18. Rezaul Karim, M., Raja Chakravarthi, B., Arcan, M., McCrae, J.P., Cochez, M.:
Classification benchmarks for under-resourced bengali language based on multi-
channel convolutional-lstm network. arXiv pp. arXiv–2004 (2020)
19. Sambasivan, N., Batool, A., Ahmed, N., Matthews, T., Thomas, K., Gaytán-Lugo, L.S., Nemer, D., Bursztein, E., Churchill, E., Consolvo, S.: "They don't leave us alone anywhere we go": Gender and digital abuse in South Asia. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. pp. 1–14 (2019)
20. Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural lan-
guage processing. In: Proceedings of the Fifth International Workshop on Natural
Language Processing for Social Media. pp. 1–10 (2017)
21. Waseem, Z.: Are You a Racist or Am I Seeing Things? Annota-
tor Influence on Hate Speech Detection on Twitter pp. 138–142 (2016).
https://doi.org/10.18653/v1/w16-5618
22. Zhang, Z., Luo, L.: Hate speech detection: A solved problem? The chal-
lenging case of long tail on Twitter. Semantic Web 10(5), 925–945 (2019).
https://doi.org/10.3233/SW-180338