Experiments on IndoBERT Implementation for Detecting Multi-Label Hate Speech with Data Resampling through Synonym Replacement Method
Computer Science Department
School of Computer Science
Bina Nusantara University
Jakarta, Indonesia 11480
[email protected]
Abstract—… This work also tried to solve the imbalanced dataset problem in the multi-label dataset through the synonym replacement method. As a result, the IndoBERT model of this research achieved an improved accuracy of 88.23%. It was also implemented into a web-based application to show its potential usage.

Keywords—Indonesian hate speech detection, deep learning, IndoBERT, synonym replacement
I. INTRODUCTION

The internet has become an essential part of daily life, where online interactions are inevitable as part of the globalization era [1]. This also affects the increasing number of hate speech incidents. In research by Ibrohim and Budi [2], it was agreed through a Focus Group Discussion (FGD) with the Criminal Investigation Agency of the Indonesian National Police (Badan Reserse Kriminal) that the characteristics of hate speech are that it has a specific target, a category of hate speech, and a level indicating the severity of the hate speech.

In their research, Ibrohim and Budi [2] created a multi-label hate speech dataset using the Twitter API and gathered 13,169 tweets. The research uses Machine Learning (ML) approaches such as Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT).

While the research in [2] used ML approaches, other researchers such as [3] and [4] show better evaluation results for Deep Learning (DL) models compared to ML models. The capability of DL to learn more complex patterns and capture the underlying meaning in the data makes DL a more promising method for detecting hate speech than the conventional ML used previously.

Therefore, this paper develops a DL multi-label hate speech detection model, trained on the same dataset as the [2] research, using a BERT-based model. BERT is a pre-trained model for language understanding created by a team of Google engineers in 2018 [5]. It was built on the transformer architecture from [6]. This research utilizes the Indonesian version of BERT, namely IndoBERT. IndoBERT addresses the lack of available NLP tools for the Indonesian language. It was created by the IndoNLU team and was trained on a sizeable Indonesian dataset called Indo4B [7].

By leveraging the more advanced capability of the DL method in learning complex patterns, this paper aims to produce a superior multi-label hate speech detector compared to the conventional ML methods. To maximize the potential of the IndoBERT model, this research experimented with several word processing methods, such as text normalization, case folding, and word stemming. This research also conducted the experiments on 3 different dataset splits, each with a different distribution of training, validation, and testing data. Finally, to address the imbalanced dataset problem within the dataset, this research performed data resampling using the synonym replacement method.
II. RELATED WORKS

In creating an IndoBERT hate speech model, it is essential first to understand what can be identified as hate speech. According to Gelber [8], hate speech can be described as a form of speech that causes harm by emphasizing a speaker's capability to do harm, a listener's vulnerability as a member of the target group, or both. The target group in this context is usually based on characteristics such as religion, gender, race, ethnicity, and sexual orientation. Another study conducted on Indonesian social media [9] states that hate speech aims to incite, order, forbid, criticize, and warn.

A. Machine Learning Approaches to Detect Indonesian Hate Speech
The [2] research focused on solving the hate speech detection problem using ML approaches such as SVM, NB, and RFDT. Ibrohim and Budi [2] first created the dataset, which is made up of 13,169 Twitter data points. The Twitter data was collected using the Twitter Search API over seven months, and the dataset was also combined with data from related research on hate speech detection in the Indonesian language. The data gathering was followed by a data annotation process, which aims to classify each data point into 12 labels. The data annotation process was separated into two parts. The first part was to annotate whether the data is classified as hate speech, abusive language, or both. The second part was to annotate the target of the hate speech, the categories of the hate speech, and the level of the hate speech.
The data was then fed into the training process, and accuracy was used for the evaluation. The accuracy is calculated by summing the number of correct predictions and dividing it by the total number of all predictions. Equation (1) was utilized to calculate the accuracy [2].

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (1)
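For a multi-label dataset such as the one in [2], this accuracy can be computed independently for each label and then averaged across labels, which is one way to read the macro accuracy reported later in this paper. The following is a minimal illustrative sketch, not the authors' code; the label names in the example are assumptions.

import numpy as np

def per_label_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Equation (1) applied column-wise to binary indicator matrices.

    For 0/1 indicators, TP + TN is simply the number of matching entries,
    so per-label accuracy reduces to the column-wise match rate.
    """
    return (y_true == y_pred).mean(axis=0)

# Toy example: 3 tweets, 3 of the 12 labels (names are illustrative only).
labels = ["HS", "Abusive", "HS_Religion"]
y_true = np.array([[1, 1, 0], [0, 1, 0], [1, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

acc = per_label_accuracy(y_true, y_pred)
print(dict(zip(labels, acc.round(2))))         # accuracy per label
print("macro accuracy:", acc.mean().round(2))  # average over labels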
The experiment was conducted in two scenarios. The hate speech target, categories, and level are all taken into consideration in the first scenario to identify hate speech and abusive language. The second scenario used MLC to find abusive language and hate speech in the data without identifying the target, categories, and level of the hate speech. Both scenarios were carried out to determine the most suitable classifier, transformation method, and features, and both used several feature extraction methods such as word n-gram, character n-gram, orthography, and lexicon features. The first scenario produced its best result when the combination of RFDT, LP, and word unigram was used, with an accuracy of 66.12%, whereas the second scenario produced its best result of 77.36% accuracy when the combination of word unigram, character quadgram, positive sentiment, and abusive lexicon features was used along with RFDT and LP.

In summary, the research by [2] achieved 66.12% accuracy for predicting the Multi-Labeled Dataset (MLD). Moreover, the authors propose several ideas to improve the results. The first method is to collect more data. However, the drawback of this method is that it could be a very lengthy and expensive process. Other feasible methods for this problem are data resampling and data augmentation. However, data balancing for an MLD could be complex due to the nature of the MLD itself, where a single entry can contain both majority and minority labels simultaneously. The last method mentioned by Ibrohim and Budi [2] uses a conventional machine learning approach by implementing hierarchical MLC.

Another study by Aulia and Budi [10] was able to produce a better result for detecting hate speech in the Indonesian language, evaluated with the F1-score, by using a dataset gathered from Facebook. The [10] research yielded its best model by combining an SVM classifier with TF-IDF features, reaching an F1-score of 85%. However, different from the dataset in [2], which is an MLD, this dataset contains only a single label, where a data point is either hate speech or not hate speech. Another study [11] on Indonesian hate speech detection also produced a best model with 98.13% accuracy by using a K-Nearest Neighbor (KNN) classifier. The research by [11], just like the research by [10], also only used a single-label dataset instead of the MLD used in [2].

B. Deep Learning Approach to Detect Multilingual Hate Speech

The research in [12] examined hate speech detection in multiple languages. The paper analyzed nine languages, including Arabic, English, German, Indonesian, Italian, Polish, Portuguese, Spanish, and French, with 159,753 data points collected from 16 datasets. The experiment in this paper was conducted in two scenarios. The first scenario used two techniques, LASER embedding with Logistic Regression (LR) and BERT, to create a model using the same language for the training and testing data. The second scenario involved training the model using multiple languages and testing the result on one language. All the datasets used in this research were first transformed into only two labels: hate speech and normal speech. The research found that LASER embedding with LR performed best in a low-resource condition [12]. However, when training data in the target language was available, the BERT-based model performed better.
C. BERT-Based Model Performance on an Unpreprocessed Dataset

Alzahrani and Jololian [13] analyzed the effects of text preprocessing using the BERT model. The research utilized a Twitter dataset that consists of 363,031 distinct tweets. The authors aimed to use the BERT model to distinguish the gender of the tweets' authors using this dataset. The training process of the BERT model was implemented in five experiments.

From the five experiments by Alzahrani and Jololian [13], the best result was achieved in the fifth experiment with 86.67% accuracy, where no text preprocessing methods were applied. On the contrary, the fourth experiment, where all the text preprocessing methods were applied, resulted in the worst model with 78.86% accuracy. The authors suspected that applying text preprocessing such as stop word removal and punctuation mark removal negatively affected the overall accuracy of the model because pre-trained models such as BERT need more extensive texts to perform better. In the end, [13] concluded that BERT performs better when no text preprocessing techniques are applied because BERT can take advantage of longer texts.
III. METHODOLOGY

The methodology of this work is illustrated in Fig. 1. The green boxes represent the processes, whereas the yellow diamond and the red parallelogram represent the decision and the output, respectively.

Fig. 1: Methodology flowchart (Dataset Gathering, Dataset Preparation, IndoBERT Model Design, Train Model)

The dataset used in this research is the Twitter dataset from Ibrohim and Budi [2], which was manually annotated into 12 different hate speech related labels: HS, Abusive, HS Individual, HS Group, HS Religion, HS Race, HS Physical, HS Gender, HS Other, HS Weak, HS Moderate, and HS Strong.

Fig. 2: Distribution of Hate Speech Types Occurrences

The dataset contains 5,561 HS entries, consisting of 2,266 labeled with HS only and 3,295 labeled with both HS and Abusive [2]. The Twitter dataset by Ibrohim and Budi [2] also contains 5,043 abusive entries, of which 1,748 are not associated with HS. As shown in Fig. 2, several labels such as HS Religion, HS Race, HS Physical, HS Gender, and HS Strong have significantly lower occurrences compared to the other labels. This suggests that the dataset is imbalanced. The data imbalance problem in an MLD was also addressed and explained in [14]. Both [2] and [14] suggested several methods to improve model performance and accuracy on imbalanced data. Ibrohim and Budi [2] mentioned that balancing the dataset by adding more new data could be expensive and time-consuming, so the feasible methods proposed in [2] are to use an ML approach implementing hierarchical MLC and to perform data resampling or data augmentation. However, for the scope of this research, no hierarchical MLC is implemented; instead, this paper aims to improve on the performance and accuracy of the previously used ML approach through IndoBERT. In addition, this paper also tries to balance the data using a heuristic resampling method mentioned in [14], which generates synthetic data through augmentation. Similarly, synthetic data can be generated by upsampling the data through text-based synonym replacement, as seen in Table III.
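As an illustration only, the imbalance described above can be inspected by counting the positive occurrences of each label. The sketch below assumes the dataset has been exported to a local CSV with one 0/1 column per label; the file name and column names are assumptions, not taken from the paper.

import pandas as pd

LABELS = ["HS", "Abusive", "HS_Individual", "HS_Group", "HS_Religion", "HS_Race",
          "HS_Physical", "HS_Gender", "HS_Other", "HS_Weak", "HS_Moderate", "HS_Strong"]

# Assumed local copy of the multi-label Twitter dataset from [2]:
# one row per tweet, one binary column per label.
df = pd.read_csv("multi_label_hate_speech.csv")

counts = df[LABELS].sum().sort_values(ascending=False)
summary = pd.DataFrame({"positives": counts, "ratio": (counts / len(df)).round(3)})
print(summary)  # minority labels such as HS_Religion or HS_Strong show far fewer positives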
The different experiment setups used in the dataset preparation are listed in Table I.

TABLE I: Experiment Setups

Experiment | Preprocessing | Dataset Distribution | Stratify
1  | No  | 80/10/10 | No
2  | Yes | 80/10/10 | No
3  | No  | 80/10/10 | Yes
4  | Yes | 80/10/10 | Yes
5  | No  | 70/15/15 | No
6  | Yes | 70/15/15 | No
7  | No  | 70/15/15 | Yes
8  | Yes | 70/15/15 | Yes
9  | No  | 60/20/20 | No
10 | Yes | 60/20/20 | No
11 | No  | 60/20/20 | Yes
12 | Yes | 60/20/20 | Yes
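The Stratify option in Table I requires a split that preserves the proportions of all 12 labels at once, which is harder than single-label stratification. The paper does not name the tool it used; the sketch below shows one possible way, using the iterative stratification utility from the scikit-multilearn package and the 80/10/10 distribution.

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# df and LABELS as in the earlier sketch; X holds row indices, Y the 12-label matrix.
X = np.arange(len(df)).reshape(-1, 1)
Y = df[LABELS].to_numpy()

# 80/10/10: first hold out 20%, then split that half-and-half into validation and test.
X_train, y_train, X_hold, y_hold = iterative_train_test_split(X, Y, test_size=0.2)
X_val, y_val, X_test, y_test = iterative_train_test_split(X_hold, y_hold, test_size=0.5)
print(len(X_train), len(X_val), len(X_test))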
Text normalization is the process of replacing informal words with their formal counterparts. The following process is case folding, which works by converting all the letters in a sentence to either lowercase or uppercase. Word stemming, which reduces each word to its base form, is also applied, as mentioned earlier.
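A minimal sketch of these preprocessing steps is given below. The normalization dictionary contains only toy entries, and the stemmer assumes the Sastrawi package for Indonesian; the paper does not state which normalization list or stemmer it actually used.

import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory  # assumed Indonesian stemmer

# Toy normalization dictionary mapping informal to formal words (illustrative only).
NORMALIZATION = {"gk": "tidak", "yg": "yang", "bgt": "sangat"}

stemmer = StemmerFactory().create_stemmer()

def preprocess(sentence: str) -> str:
    sentence = sentence.lower()                         # case folding
    tokens = re.findall(r"\w+", sentence)               # simple tokenization
    tokens = [NORMALIZATION.get(t, t) for t in tokens]  # text normalization
    return stemmer.stem(" ".join(tokens))               # word stemming

print(preprocess("Yg begini gk pantas BGT"))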
The synonym replacement process works by replacing a single word or multiple words in a sentence with a corresponding synonym. The authors of this research conducted two different variations: the first replaces a random word from a sentence with its synonym, and the second replaces all replaceable words in a sentence with their synonyms. Examples of sentences before and after synonym replacement and resampling are shown in Table III.

TABLE III: Before and After Synonym Replacement and Resampling

Before | After
berat wakil setya novanto dpr ri pecat setya novanto dpr ri | beban agen setya novanto dpr ri melengserkan setya novanto dpr ri
berat wakil setya novanto dpr ri pecat setya novanto dpr ri | bahara distributor setya novanto dpr ri memberhentikan setya novanto dpr ri
berat wakil setya novanto dpr ri pecat setya novanto dpr ri | bahara leveransir setya novanto dpr ri memberhentikan setya novanto dpr ri
berat wakil setya novanto dpr ri pecat setya novanto dpr ri | bahara pemasok setya novanto dpr ri memberhentikan setya novanto dpr ri
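The two variations can be sketched as follows. The synonym table here is a stand-in filled with the replacements visible in Table III; the paper does not specify the synonym source it used.

import random

# Stand-in synonym table based on the replacements visible in Table III.
SYNONYMS = {
    "berat": ["beban", "bahara"],
    "wakil": ["agen", "distributor", "leveransir", "pemasok"],
    "pecat": ["melengserkan", "memberhentikan"],
}

def replace_random_word(sentence: str) -> str:
    """Variation 1: replace one randomly chosen word that has a synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if candidates:
        i = random.choice(candidates)
        words[i] = random.choice(SYNONYMS[words[i]])
    return " ".join(words)

def replace_all_words(sentence: str) -> str:
    """Variation 2: replace every word that has a synonym."""
    return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w
                    for w in sentence.split())

original = "berat wakil setya novanto dpr ri pecat setya novanto dpr ri"
print(replace_random_word(original))
print(replace_all_words(original))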
For the web-based implementation, sentences are first extracted from a given page using web scraping techniques. Then, the model will detect the occurrences of hate speech within each sentence.
IV. RESULT AND DISCUSSION

A. Experiment Results

In total, 12 experiments were conducted based on the combination of dataset distribution, dataset preprocessing, and dataset stratification. In addition, this paper also tried two additional variations of data upsampling using the synonym replacement method to augment the dataset, covering the preprocessed and un-preprocessed versions of the dataset and using the best-produced IndoBERT hate speech model. These experiments use the following hyperparameter values (a fine-tuning sketch with these values follows the discussion below):
• Batch size = 32
• Epochs = 15
• Learning rate = 2 × 10⁻⁵

Fig. 3: Loss and Accuracy Graph for Training and Validation Set

For the validation loss, it can be seen that it was not until around epochs 13 and 14 that it started to rise, whereas the training loss showed a stable decreasing trend throughout the 15 epochs. This shows that after epoch 13 the model started to slightly overfit the training data, which is further supported by the model's training and validation accuracy. The training accuracy achieved in epoch 15 is 91.30%, whereas the validation accuracy is not far from the testing accuracy, with a value of 88.11%. Therefore, it can be derived from the graph that the best model obtained from this dataset and combination is at epoch 13, with both the training and validation curves taken into account.
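As referenced above, a minimal fine-tuning sketch with these hyperparameter values is shown below. It assumes the Hugging Face Transformers library, the publicly released indobenchmark/indobert-base-p1 checkpoint, and already-tokenized train_ds/val_ds datasets; the paper does not state which IndoBERT variant or training framework was used, so these specifics are assumptions.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed IndoBERT checkpoint
NUM_LABELS = 12                                # the 12 hate speech labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid output + BCE loss per label
)

args = TrainingArguments(
    output_dir="indobert-hate-speech",
    per_device_train_batch_size=32,   # batch size = 32
    num_train_epochs=15,              # epochs = 15
    learning_rate=2e-5,               # learning rate = 2 x 10^-5
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the lowest validation loss
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)  # tokenized datasets (assumed)
trainer.train()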
For individual labels, the accuracy is high; the HS Strong label, for example, reaches 96.50%. However, this high accuracy could be attributed to the high number of negative data existing in the dataset, especially for the hate speech category labels such as HS Religion, HS Race, HS Physical, and HS Gender. This creates many True Negative (TN) predictions, resulting in a high accuracy. The existence of a large amount of TN data can mask the IndoBERT model's capability on True Positive (TP) data. This is the effect of the data imbalance in the dataset, where most instances belong to the negative class, contributing to a higher overall accuracy in the performance evaluation.

B. Experiment Results for Synonym Replacement

The first experiment results for synonym replacement are listed in Table VI. This experiment replaces a random word in each sentence with a synonym.
For the un-preprocessed dataset, the macro accuracy after upsampling is 86.03% (see Table VI). The same pattern also occurs in the preprocessed dataset, where the accuracy dropped from 84.13% (see Table IV) to 78.54%. This shows a fairly significant decrease in the IndoBERT model's performance when the upsampling is applied.

TABLE VI: Experiment Results for the First Synonym Replacement Variation

Experiment | Preprocessing | Dataset Distribution | Stratify | Training Time | Macro Accuracy
1 | No  | 80/10/10 | Yes | 48m 52s | 86.03%
2 | Yes | 80/10/10 | Yes | 59m 39s | 78.54%

In the second variation of the experiment, the result of upsampling by replacing all words in a sentence can be seen in Table VII.

Overall, the experiments demonstrate that upsampling the data using synonym replacement for data augmentation worsens the IndoBERT model's performance, as indicated by the decrease in macro accuracy. This result supports the argument in [2], where the authors mention the difficulty of balancing an MLD. In the [2] MLD, a single data entry can contain both minority and majority labels. During the synonym replacement and upsampling process, the minority label data is upsampled; however, since it is an MLD, where an entry can carry multiple labels, the majority labels also get upsampled. This leads to another dataset imbalance problem. In addition, based on the two variations of experiments reported in Table VI and Table VII, the synonym replacement process can cause a sentence to lose its original meaning, introducing noisy data.

The best IndoBERT model was also implemented into a web-based application platform. As shown in Fig. 4, the application first extracts sentences from a given URL using web scraping techniques, and then the IndoBERT model detects and labels each sentence.

Fig. 4: IndoBERT Detection Result for Hate Speech using Web-Based Application
In Fig. 4, using a YouTube URL, the IndoBERT model detected that 35.5% of the sentences extracted from the URL are categorized as hate speech. The IndoBERT model also detected that 25.5% of the sentences contain abusive language.
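A sketch of this detection flow is given below. It assumes the requests and BeautifulSoup libraries for scraping plain HTML, and reuses the fine-tuned model, tokenizer, and LABELS from the earlier sketches; the actual web stack and scraping approach of the application are not described in this excerpt (a YouTube page would in practice need a dedicated transcript or comment API rather than plain HTML scraping).

import requests
import torch
from bs4 import BeautifulSoup

def extract_sentences(url: str) -> list[str]:
    """Scrape visible text from a page and split it into rough sentences."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return [s.strip() for s in text.split(".") if s.strip()]

@torch.no_grad()
def label_sentences(sentences, model, tokenizer, labels, threshold=0.5):
    """Return, for each sentence, the labels whose sigmoid probability passes the threshold."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    probs = torch.sigmoid(model(**enc).logits)  # one probability per label per sentence
    return [[l for l, p in zip(labels, row.tolist()) if p >= threshold] for row in probs]

sentences = extract_sentences("https://example.com/some-page")  # placeholder URL
results = label_sentences(sentences, model, tokenizer, LABELS)
hs_ratio = sum("HS" in r for r in results) / max(len(results), 1)
print(f"{hs_ratio:.1%} of the extracted sentences are flagged as hate speech")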
D. Evaluation

As shown in Fig. 5, the IndoBERT hate speech model generated in this research reached an optimized result with 88.23% accuracy, outperforming the conventional ML technique used in [2], that is, the RFDT classifier combined with LP data transformation and the word unigram feature, which detected multi-label hate speech data with only 66.12% accuracy on the same MLD.

Fig. 5: Accuracy Comparison for Detecting Multi-Label Indonesian Hate Speech
V. CONCLUSION AND SUGGESTION

The authors summarize the research findings into three main conclusions. First, in balancing the multi-label hate speech dataset, the data resampling method through synonym replacement to augment the minority data does not resolve the imbalanced dataset problem, because the ratio between the minority labels and the majority labels is not changed. Second, resampling minority labels such as HS Religion, HS Race, HS Physical, HS Gender, and HS Strong does not improve the accuracy, since a data entry can contain both majority and minority labels at the same time, interconnecting the labels with each other; data resampling on the minority labels therefore also affects the data frequency of the other labels. Third, the synonym replacement method produced noisy and biased data that further worsened the detection capability of the IndoBERT model. Therefore, the authors suggest adding more new minority data, thereby improving the model's accuracy and reliability when predicting multi-label Indonesian hate speech.

REFERENCES

[1] A. Yohanna, "The influence of social media on social interactions among students," Indonesian Journal of Social Sciences, vol. 12, no. 2, p. 34, 2020.
[2] M. O. Ibrohim and I. Budi, "Multi-label hate speech and abusive language detection in Indonesian Twitter," in Proceedings of the Third Workshop on Abusive Language Online, 2019, pp. 46–57.
[3] R. Brehar, D.-A. Mitrea, F. Vancea, T. Marita, S. Nedevschi, M. Lupsor-Platon, M. Rotaru, and R. I. Badea, "Comparison of deep-learning and conventional machine-learning methods for the automatic recognition of the hepatocellular carcinoma areas from ultrasound images," Sensors, vol. 20, no. 11, p. 3085, 2020.
[4] M. Nabipour, P. Nayyeri, H. Jabani, S. Shahab, and A. Mosavi, "Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis," IEEE Access, vol. 8, pp. 150199–150212, 2020.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2023.
[7] B. Wilie, K. Vincentio, G. I. Winata, S. Cahyawijaya, X. Li, Z. Y. Lim, S. Soleman, R. Mahendra, P. Fung, S. Bahar et al., "IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding," arXiv preprint arXiv:2009.05387, 2020.
[8] K. Gelber, "Differentiating hate speech: a systemic discrimination approach," Critical Review of International Social and Political Philosophy, 2019.
[9] U. Ubaidillah and I. D. P. Wijana, "A directive speech act of hate speech on Indonesian social media," LiNGUA: Jurnal Ilmu Bahasa dan Sastra, vol. 16, no. 1, pp. 125–138, 2021.
[10] N. Aulia and I. Budi, "Hate speech detection on Indonesian long text documents using machine learning approach," in Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence, 2019.