Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation
Abstract. Social media sites such as YouTube and Facebook have become an integral part of everyone's life, and in the last few years hate speech in social media comment sections has increased rapidly. Detecting hate speech on social media websites faces a variety of challenges, including small imbalanced datasets, finding an appropriate model, and choosing a feature analysis method. The problem is more severe for the Bengali-speaking community due to the lack of gold-standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All the user comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok & meme. A total of 50 annotators annotated each comment three times, and the majority vote was taken as the final annotation. We have also conducted baseline experiments with several deep learning models along with extensive pretrained Bengali word embeddings such as Word2Vec, FastText, and BengFastText on this dataset to facilitate future research opportunities. The experiments illustrate that although all the deep learning models performed well, the SVM baseline achieved the best result with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of Bengali hate speech detection.
1 Introduction
Social media has become an essential part of every person's day-to-day life. It enables fast communication and easy access to sharing and receiving ideas and views from around the world. However, at the same time, this freedom of expression has led to a continuous rise of hate speech and offensive language on social media.
Part of this problem has been created by the corporate social media model and the gap between documented community policy and the real-life implications of hate speech [4]. Hate speech language is also very diverse [17]. The language used in social media is often very different from traditional print media and has its own linguistic features. Thus, automatically detecting hate speech is very challenging [20].
Even though much work has been done on hate speech prevention in the English language, there is a significant lack of resources for hate speech detection in Bengali social media. Meanwhile, problems like online abuse, and especially online abuse towards women, are continuously on the rise in Bangladesh [19]. For a low-resource language like Bengali, developing and deploying machine learning models to tackle such real-life problems is very difficult, since there is a shortage of datasets and other tools for Bengali text classification [6]. So, the need for research on the nature and prevention of social media hate speech has never been higher.
This paper describes our attempt to address this problem. Our dataset comprises 30,000 Bengali comments from YouTube and Facebook comment sections, of which 10,000 are hate speech. We selected comments from seven different categories: sports, entertainment, crime, religion, politics, celebrity, and TikTok & meme, making the dataset diverse. We ran several deep learning models along with word embedding models such as Word2Vec, FastText, and the pretrained BengFastText on our dataset to obtain benchmark results. Lastly, we analyzed our findings and explained the challenges of detecting hate speech.
2 Literature review
Much work has been done on detecting hate speech using deep learning models [7], [10]. There have also been efforts to increase the accuracy of hate speech prediction specifically by extracting unique semantic features of hate speech comments [22]. Researchers have also utilized fastText to build models that can be trained on billions of words in less than ten minutes and classify millions of sentences among hundreds of classes [14]. Other research indicates how an annotator's bias and worldview affect the performance of a dataset [21]. State-of-the-art research on hate speech detection has now reached the point of utilizing advanced architectures such as transfer learning. For example, in [9], researchers compared deep learning, transfer learning, and multitask learning architectures on an Arabic hate speech dataset. There is also research on identifying hate speech in a multilingual dataset using pretrained state-of-the-art models such as mBERT and XLM-RoBERTa [3].
Unfortunately, very little work has been done on hate speech detection in Bengali social media. The main challenge is the lack of sufficient data. To the best of our knowledge, many of the existing datasets contain around 5,000 comments [5], [8], [12]. There is a publicly available corpus of around 10,000 comments annotated into five different classes [2].
However, one limitation its authors faced was that they could only use sentences labeled as toxic in their experiments, since the other labels were low in number. There is also a dataset of 2,665 comments translated from an English hate speech dataset [1]. Another study used a rule-based stemmer to obtain linguistic features [8]. Studies are also emerging that use deep learning models such as CNN and LSTM to obtain better results [2], [5], [8]. One of the biggest challenges is that Bengali is a low-resource language. Research has been done to create a word embedding specifically for Bengali, called BengFastText, trained on a 250-million-word Bengali corpus [15]. However, one thing is clear: there is a lack of datasets that are both large and diverse. This paper tries to tackle the problem by presenting a large dataset with comments from seven different categories, making it diverse. To our knowledge, this is the first Bengali social media hate speech dataset that is both this large and this diverse.
3 Dataset
Our primary goal while creating the dataset was to cover a wide variety of data. For this reason, we extracted comments from YouTube and Facebook in seven different categories: sports, entertainment, crime, religion, politics, celebrity, and a seventh category of memes, TikTok, and miscellaneous content.
We extracted comments from the public Facebook page of Dr. Muhammad Zafar Iqbal, a prominent Bangladeshi science fiction author, physicist, academic, and activist. These comments belong to the celebrity category. However, due to Facebook's restrictions on its Graph API, we had to focus on YouTube as the primary source of data.
From YouTube, we looked for the most scandalous and controversial topics in Bangladesh between 2017 and 2020. We reasoned that since these topics were controversial, videos were made more frequently, people participated more in the comment sections, and the comments might contain more hate speech. We searched YouTube for videos using keywords related to these controversial events. For example, we searched for the divorce of the renowned singer and actor couple Mithila and Tahsan, i.e., the Mithila controversy of 2020. We then selected videos with at least 700k views and extracted comments from them. In this way, we extracted comments from videos on controversial topics covering five categories: sports, entertainment, crime, religion, and politics. Finally, we searched for meme and TikTok videos and other keywords likely to attract hate speech in their comment sections. This is our seventh category.
We extracted all the comments using the open-source software FacePager (https://github.com/strohne/Facepager). After extraction, we labeled each comment with a number defining the category it belongs to. We also labeled each comment with an additional number defining its keyword. In this paper, a keyword means a controversial event
that falls under a category. For example, mithila is a keyword that belongs to the entertainment category. We labeled every comment with its corresponding category and keyword to support future research.
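For illustration, a labeled record could look like the following sketch; the field names and numeric codes are hypothetical, not the exact schema of the released files.

```python
# Hypothetical record layout; field names and codes are illustrative only.
sample_row = {
    "comment": "...",   # the raw Bengali comment text
    "category": 2,      # e.g. 2 = entertainment
    "keyword": 7,       # e.g. 7 = the 'mithila' controversy
}
```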
After extracting the comments, we manually checked the whole dataset and removed all comments made up entirely of English or other non-Bengali text, emoji, punctuation, or numerical values. However, we kept a comment if it consisted primarily of Bengali words with some emoji, numbers, or non-Bengali words mixed in. We deleted non-Bengali comments because our research focuses on Bengali hate speech, so non-Bengali comments are out of scope. We kept the impure Bengali comments because people rarely write pure Bengali on social media; emoji, numbers, punctuation, and English words are characteristic features of social media comments. We therefore believe our dataset can prove very rich for future research. In the end, we collected a total of 30,000 comments. In a nutshell, our dataset has 30k comments that are mostly Bengali sentences with some emoji, punctuation, numbers, and English letters mixed in.
– If a comment contains a slang word but does not dehumanize any person, it is not considered hate speech. For example:
কি বালের মুভি
(roughly: "what a crappy movie"). Here the slang word is not used to dehumanize any person, so this is not hate speech.
– If a comment does not itself dehumanize a person but directly supports another idea that clearly dehumanizes a person or a community, it is considered hate speech. For example:
গর্তযোদ্ধা
Our dataset has a total of 30,000 comments, of which 10,000 are hate speech, as shown in Table 2. It is therefore clear that our dataset is heavily skewed towards not hate speech. Looking closely at each category in Figure 1 makes this even more apparent: in all categories, not hate speech comments dominate. In celebrity and politics in particular, the number of hate speech comments is very low, even compared to the hate speech in other categories. During data collection, we observed that there were many hateful comments in the celebrity section, i.e., on Dr. Muhammad Zafar Iqbal's Facebook page, but they were out of context. As discussed in Section 3.2, we only considered the text itself, without context, when labeling a comment as hate speech, so many such comments were labeled as not hate speech. For the politics category, we observed that people tend not to attack any person or group directly; rather, they tend to add their own take on the current political environment. So the number of direct attacks is lower in the politics category.
Looking at the mean text lengths in Table 3, we find a couple of interesting observations. First, meme comments are very short. This makes sense: when a person comments on a meme video, they are likely expressing a momentary state of mind, which requires shorter sentences. The opposite is true for the celebrity category, which has the longest average text length. This is largely because when people comment on Dr. Zafar Iqbal's Facebook page, they add a lot of their own opinion and analysis, regardless of whether the comment is hate speech or not. This shows how distinctive the comment section of an individual celebrity page can be. Lastly, we see that hate speech comments tend to be shorter on average than not hate speech comments.
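As a sketch of how such per-category statistics can be computed, assuming the dataset is loaded as a pandas DataFrame with one row per comment (the column names and toy rows below are our own, not the released schema):

```python
import pandas as pd

# Toy stand-in for the real dataset; in practice this would be read from
# the released files, e.g. with pd.read_csv(...).
df = pd.DataFrame({
    "category": ["meme", "meme", "celebrity", "politics"],
    "hate":     [1, 0, 0, 0],  # 1 = hate speech, 0 = not hate speech
    "text":     ["ha ha", "lol", "a long opinionated comment ...", "my take ..."],
})
df["length"] = df["text"].str.split().str.len()

print(df.groupby("category")["hate"].mean())    # hate-speech ratio per category
print(df.groupby("category")["length"].mean())  # mean text length per category
```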
In Table 4, we compare all state-of-the-art Bengali hate speech datasets. The table includes the total size of each dataset and the number of classes into which it was annotated. As can be seen, some datasets have multiple classes. In this paper, we focused on the total size of the dataset and extracted comments from different categories so that we could ensure linguistic variation.
4 Experiment
4.1 Preprocessing
Our 30k dataset consists of raw Bengali comments with emoji, punctuation, and English letters mixed in. For the baseline evaluation, we removed all emoji, punctuation, numerical values, non-Bengali letters, and symbols from every comment. After that, we created a train set of 24,000 comments and a test set of 6,000 comments. The dataset was then ready for evaluation.
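A minimal sketch of this preprocessing step follows; the regular expression, variable names, toy data, and random seed are our own choices, not specified in the paper:

```python
import re

from sklearn.model_selection import train_test_split

# Toy stand-ins; the real lists would hold all 30,000 comments and labels.
raw_comments = ["ভালো video 👍!!", "খুব খারাপ 😡 100%"]
labels = [0, 1]

# Keep only characters in the Bengali Unicode block (U+0980-U+09FF) plus spaces.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")

def clean_comment(text: str) -> str:
    """Remove emoji, punctuation, digits, and non-Bengali symbols."""
    text = NON_BENGALI.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

comments = [clean_comment(c) for c in raw_comments]
# On the full dataset, an 80/20 split gives the 24,000 / 6,000 sets used here.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=42)
```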
Table 4. A comparison of all state-of-the-art datasets on Bengali hate speech
4.3 Models
Support Vector Machine (SVM) We used a Support Vector Machine [11] for the baseline evaluation. We used the linear kernel and kept all other parameters at their default values.
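The paper fixes the linear kernel and default parameters but not the feature representation, so the TF-IDF features in the sketch below are our assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Linear-kernel SVM over TF-IDF features (the feature choice is assumed).
svm_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
svm_clf.fit(X_train, y_train)   # split from the preprocessing sketch above
pred = svm_clf.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))
```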
Long Short Term Memory (LSTM) For our experiment, we used an LSTM layer with 100 units, set both the dropout and recurrent dropout rates to 0.2, and used Adam as the optimizer.
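A Keras sketch of this configuration follows; the vocabulary size, embedding dimension, and sequence handling are assumptions, since the paper fixes only the LSTM units, dropout rates, optimizer, epochs, and batch size:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(input_dim=50_000, output_dim=300),    # vocab size / dim assumed
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),  # as specified above
    Dense(1, activation="sigmoid"),                 # hate vs. not hate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stand-ins for the integer-encoded, padded training comments.
X_train_seq = np.random.randint(1, 50_000, size=(256, 100))
y_train_bin = np.random.randint(0, 2, size=(256,))
model.fit(X_train_seq, y_train_bin, epochs=5, batch_size=64)  # values from the text
```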
In each case, we kept all other parameters at their standard values. The number of epochs and the batch size were set to 5 and 64, respectively. We then tested all the trained models on the test dataset and measured accuracy and F1 score. Below are all the model configurations we tested on our dataset, followed by a sketch of how the pretrained embeddings can be plugged into the recurrent models.
– Baseline evaluation: Support Vector Machine (SVM)
– FastText Embedding with LSTM
– FastText Embedding with Bi-LSTM
– BengFastText Embedding with LSTM
– BengFastText Embedding with Bi-LSTM
– Word2Vec Embedding with LSTM
– Word2Vec Embedding with Bi-LSTM
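One way the pretrained embeddings listed above can be wired into the LSTM and Bi-LSTM models is sketched below; the file path is a placeholder, and word_index is assumed to come from a Keras tokenizer fitted on the training comments:

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.layers import Embedding

# Placeholder path: any word2vec-format vectors (Word2Vec, FastText,
# or BengFastText exports) can be loaded this way.
wv = KeyedVectors.load_word2vec_format("bengali_vectors.vec")

word_index = {"ভালো": 1, "খারাপ": 2}  # stand-in for a Keras tokenizer's word_index

emb_matrix = np.zeros((len(word_index) + 1, wv.vector_size))
for word, idx in word_index.items():
    if word in wv:                      # unknown words stay as zero vectors
        emb_matrix[idx] = wv[word]

emb_layer = Embedding(len(word_index) + 1, wv.vector_size,
                      weights=[emb_matrix], trainable=False)
```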
4.5 Results
As Table 5 shows, all the models achieved good accuracy. SVM achieved the overall best result, with an accuracy of 87.5% and an F1 score of 0.911. BengFastText with LSTM and Bi-LSTM had the relatively worst accuracy and F1 scores. Their low F1 scores indicate that the deep learning models with BengFastText embeddings overfitted the most. BengFastText was not trained on any YouTube data [18], while our dataset contains a huge number of YouTube comments. This might be a reason for its drop in performance.
We then looked at the performance of the Word2Vec and FastText embeddings. FastText performed better in terms of accuracy, although it had a lower F1 score than Word2Vec; overall, Word2Vec overfitted more than FastText. FastText has one distinct advantage over Word2Vec: it learns from the words of a corpus and their substrings. Thus FastText can tell that 'love' and 'beloved' are similar words [13]. This might be one reason why FastText outperformed Word2Vec.
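This subword effect can be illustrated with a tiny gensim example; the toy corpus below is ours, so the printed numbers are not meaningful beyond showing that morphological variants and even out-of-vocabulary words receive vectors:

```python
from gensim.models import FastText

# Train on a toy corpus just to demonstrate the API and subword behavior.
sentences = [["love", "beloved", "lovely"], ["hate", "hateful", "hated"]]
ft = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=10)

print(ft.wv.similarity("love", "beloved"))  # related via shared character n-grams
print(ft.wv["loveless"][:5])                # an OOV word still gets a vector
```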
We manually cross-checked all labels of the test set against the predictions of the SVM model. We looked at the false negative and false positive cases, aiming to find which types of sentences the model failed to predict accurately. We found that some of the labels were in fact wrong and the model had predicted them correctly. Nevertheless, there were some unusual cases. For example:
ওরা চাইবো কেন তোর কাছে তুই বিসিবি প্রেসিডেন্ট হয়ে কি বাল ফালাস তোর তো খেয়াল রাখা দরকার ছিল ক্রিকেটারদের ফিসিলিস নিয়ে
(roughly: "Why would they come to you? What the hell are you doing as BCB president? You should have looked after the cricketers' facilities.")
This is not hate speech, but the model predicted it to be hate speech. There are several other similar examples. The reason is that this sentence contains aggressive words that are normally used in hate speech, but here they are not used to dehumanize another person, and the model failed to understand that. This type of mistake was common in false negative cases. It demonstrates
that words in the Bengali language can be used in complicated contexts, and it is a tremendous challenge for machine learning models to understand the proper context of a word used in a sentence.
5 Conclusion
Bengali social media users often spell the same word in several different ways. A human brain can understand that these variants are the same word, but to a deep learning model they are different. Another difference is the use of emoji to convey meaning: often people express a specific emotion with only an emoji. Emoji are a recent phenomenon, absent from blog posts, newspaper articles, and books. Currently, there is no dataset or pretrained model that classifies the sentiment of emoji used in social media.
One of the critical challenges for accurate hate speech detection is to create models that can extract the necessary information from an unbalanced dataset and predict the minority class with reasonable accuracy. Our experiment demonstrated that standard deep learning models are not sufficient for this task. Advanced models like mBERT and XLM-RoBERTa can be of great use in this regard, as they are trained on large multilingual datasets and use attention mechanisms. Embedding models based on extensive and diverse social media comment datasets can also be of great help.
6 Acknowledgements
This work would not have been possible without the kind support of the SUST NLP Research Group and the SUST Research Center. We would also like to express our heartfelt gratitude to all the annotators and volunteers who made this journey possible.
References
1. Awal, M.A., Rahman, M.S., Rabbi, J.: Detecting Abusive Comments in Discus-
sion Threads Using Naïve Bayes. 2018 International Conference on Innovations
in Science, Engineering and Technology, ICISET 2018 (October), 163–167 (2018).
https://doi.org/10.1109/ICISET.2018.8745565
2. Banik, N.: Toxicity Detection on Bengali Social Media Comments using Supervised
Models (November) (2019). https://doi.org/10.13140/RG.2.2.22214.01608
3. Baruah, A., Das, K., Barbhuiya, F., Dey, K.: Aggression identification in english,
hindi and bangla text using bert, roberta and svm. In: Proceedings of the Second
Workshop on Trolling, Aggression and Cyberbullying. pp. 76–82 (2020)
4. Ben-David, A., Fernández, A.M.: Hate speech and covert discrimination on social
media: Monitoring the facebook pages of extreme-right political parties in spain.
International Journal of Communication 10, 27 (2016)
5. Chakraborty, P., Seddiqui, M.H.: Threat and Abusive Language Detection on So-
cial Media in Bengali Language. 1st International Conference on Advances in Sci-
ence, Engineering and Robotics Technology 2019, ICASERT 2019 2019(Icasert),
1–6 (2019). https://doi.org/10.1109/ICASERT.2019.8934609
6. Chakravarthi, B.R., Arcan, M., McCrae, J.P.: Improving wordnets for under-
resourced languages using machine translation. In: Proceedings of the 9th Global
WordNet Conference (GWC 2018). p. 78 (2018)
7. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detec-
tion and the problem of offensive language. Proceedings of the 11th International
Conference on Web and Social Media, ICWSM 2017 (Icwsm), 512–515 (2017)
8. Emon, E.A., Rahman, S., Banarjee, J., Das, A.K., Mittra, T.: A Deep Learn-
ing Approach to Detect Abusive Bengali Text. 2019 7th International Confer-
ence on Smart Computing and Communications, ICSCC 2019 pp. 1–5 (2019).
https://doi.org/10.1109/ICSCC.2019.8843606
9. Farha, I.A., Magdy, W.: Multitask learning for arabic offensive language and hate-
speech detection. In: Proceedings of the 4th Workshop on Open-Source Arabic
Corpora and Processing Tools, with a Shared Task on Offensive Language Detec-
tion. pp. 86–90 (2020)
10. Gambäck, B., Sikdar, U.K.: Using convolutional neural networks to classify hate-speech. In: Proceedings of the First Workshop on Abusive Language Online. pp. 85–90 (2017). https://doi.org/10.18653/v1/w17-3013
11. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector
machines. IEEE Intelligent Systems and their applications 13(4), 18–28 (1998)
12. Ishmam, A.M., Sharmin, S.: Hateful speech detection in public facebook pages
for the bengali language. Proceedings - 18th IEEE International Conference
on Machine Learning and Applications, ICMLA 2019 pp. 555–560 (2019).
https://doi.org/10.1109/ICMLA.2019.00104
13. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fast-
text.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651
(2016)
14. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
classification. 15th Conference of the European Chapter of the Association for
Computational Linguistics, EACL 2017 - Proceedings of Conference 2(2), 427–431
(2017). https://doi.org/10.18653/v1/e17-2068
15. Karim, M.R., Chakravarthi, B.R., McCrae, J.P., Cochez, M.: Classification
Benchmarks for Under-resourced Bengali Language based on Multichannel
Convolutional-LSTM Network (2020), http://arxiv.org/abs/2004.07807
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in neural
information processing systems. pp. 3111–3119 (2013)
17. Mondal, M., Silva, L.A., Benevenuto, F.: A measurement study of hate speech
in social media. In: Proceedings of the 28th ACM Conference on Hypertext and
Social Media. pp. 85–94 (2017)
18. Rezaul Karim, M., Raja Chakravarthi, B., Arcan, M., McCrae, J.P., Cochez, M.:
Classification benchmarks for under-resourced bengali language based on multi-
channel convolutional-lstm network. arXiv pp. arXiv–2004 (2020)
19. Sambasivan, N., Batool, A., Ahmed, N., Matthews, T., Thomas, K., Gaytán-Lugo, L.S., Nemer, D., Bursztein, E., Churchill, E., Consolvo, S.: "They don't leave us alone anywhere we go": Gender and digital abuse in South Asia. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. pp. 1–14 (2019)
20. Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural lan-
guage processing. In: Proceedings of the Fifth International Workshop on Natural
Language Processing for Social Media. pp. 1–10 (2017)
21. Waseem, Z.: Are You a Racist or Am I Seeing Things? Annota-
tor Influence on Hate Speech Detection on Twitter pp. 138–142 (2016).
https://doi.org/10.18653/v1/w16-5618
22. Zhang, Z., Luo, L.: Hate speech detection: A solved problem? The chal-
lenging case of long tail on Twitter. Semantic Web 10(5), 925–945 (2019).
https://doi.org/10.3233/SW-180338