Conference Paper, December 2023. DOI: 10.1109/ICRAIE59459.2023.10468099


Experiments on IndoBERT Implementation for
Detecting Multi-Label Hate Speech with Data
Resampling through Synonym Replacement Method
1st Michael Adriel Darmawan
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
[email protected]

2nd Nathanael William Boentoro
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
[email protected]

3rd Kevin Christian Surya
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
[email protected]

4th Rhio Sutoyo
Computer Science Department, School of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480
[email protected]

Abstract—The freedom to share information online leads to increased online hate speech. Addressing this issue is crucial to creating better online communities. A study in 2019 used machine learning approaches to detect multi-label hate speech in the Indonesian language, achieving an accuracy of 66.12%. Recently, several researchers have shown improved performance when using deep learning methods. Looking at this opportunity, this work develops an Indonesian hate speech model using IndoBERT. Various word preprocessing techniques and dataset stratification options were explored for optimal results. In addition, this work also tried to solve the imbalanced dataset problem in the multi-label dataset through the synonym replacement method. As a result, the IndoBERT model of this research achieved an improved accuracy of 88.23%. It was also implemented into a web-based application to show its potential usage.

Keywords—Indonesian hate speech detection, deep learning, IndoBERT, synonym replacement

I. INTRODUCTION

The internet has become an essential part of daily life, where online interactions are inevitable as part of the globalization era [1]. This also contributes to the increasing number of hate speech incidents. In research by Ibrohim and Budi [2], it was agreed through a Focus Group Discussion (FGD) with the Criminal Investigation Agency of the Indonesian National Police (Badan Reserse Kriminal) that the characteristics of hate speech are that it has a specific target, a category of hate speech, and a level indicating the severity of the hate speech.

In their research, Ibrohim and Budi [2] created a multi-label hate speech dataset using the Twitter API and gathered 13,169 tweets. The research uses Machine Learning (ML) approaches such as Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT). The research produced its best result in predicting multi-label hate speech occurrences with an accuracy of 66.12% by using RFDT. Despite this result, the dataset created and used in [2] is imbalanced, meaning that the data in specific labels or classes are underrepresented. In their research, Ibrohim and Budi [2] did not implement any methods to solve this issue. They did, however, mention several possible methods, such as collecting new data and data resampling.

While the [2] research used ML approaches, other researchers such as [3] and [4] show better evaluation results for Deep Learning (DL) models compared to ML models. The capability of DL to learn more complex patterns and capture the underlying meaning in the data makes DL a more promising method for detecting hate speech compared to the previous approach of using conventional ML.

Therefore, this paper develops a DL multi-label hate speech detection model, trained on the same dataset as the [2] research, using a BERT-based model. BERT is a pre-trained model for language understanding created by a team of Google engineers in 2018 [5]. It was built on the transformer architecture from [6]. This research utilizes the Indonesian version of BERT, namely IndoBERT. IndoBERT addresses the lack of available NLP tools for the Indonesian language. It was created by the IndoNLU team and was trained on a sizeable Indonesian dataset called Indo4B [7].

By leveraging the more advanced capability of the DL method in learning complex patterns, this paper aims to produce a superior multi-label hate speech detector compared to the conventional ML methods. To maximize the potential of the IndoBERT model, this research experimented with several word preprocessing methods, such as text normalization, case folding, and word stemming. This research also conducted the experiments on three different dataset splits, each with a different distribution of training, validation, and testing data. Finally, to address the imbalanced dataset problem within the dataset, this research performed data resampling using the synonym replacement method.
II. RELATED WORKS

In creating an IndoBERT hate speech model, it is essential first to understand what can be identified as hate speech. According to Gelber [8], hate speech can be described as a form of speech that causes harm by emphasizing a speaker's capability to do harm, a listener's vulnerability as a member of the target group, or both. The target group in this context is usually based on characteristics such as religion, gender, race, ethnicity, and sexual orientation. Another study on Indonesian social media [9] found that hate speech aims to incite, order, forbid, criticize, and warn.

A. Machine Learning Approaches to Detect Indonesian Hate Speech

The [2] research focused on solving the hate speech detection problem using ML approaches such as SVM, NB, and RFDT. Ibrohim and Budi [2] first created the dataset, which is made up of 13,169 Twitter data. The Twitter data was collected using the Twitter Search API over seven months, and the dataset was also combined with data from related research on hate speech detection in the Indonesian language. After the data gathering, the data annotation process followed, with the aim of classifying each data point into 12 labels. The annotation process was separated into two parts. The first part was to annotate whether the data is classified as hate speech, abusive language, or both. The second part was to annotate the target, category, and level of the hate speech. Before feeding the data to the training model, Ibrohim and Budi [2] applied the Binary Relevance (BR), Label Power-set (LP), and Classifier Chains (CC) data transformation methods. To evaluate the models, the [2] research uses the 10-fold cross-validation technique with accuracy as the metric. The accuracy is calculated by summing the number of correct predictions and dividing it by the total number of all predictions. Equation (1) was used to calculate accuracy [2].

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
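As a concrete illustration of how (1) can be applied to every label of a multi-label prediction and then macro-averaged, the short Python sketch below computes per-label accuracy from binary matrices; the function name and the toy matrices are illustrative and not taken from [2].

import numpy as np

def per_label_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Per-label accuracy: (TP + TN) / (TP + TN + FP + FN) for each column."""
    # For 0/1 matrices, correct predictions are simply equal entries, and the
    # denominator is the number of samples, so a column-wise mean suffices.
    return (y_true == y_pred).mean(axis=0)

# Toy example: 4 samples, 3 labels (e.g. HS, Abusive, HS_Individual).
y_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 0]])

acc = per_label_accuracy(y_true, y_pred)
print(acc)         # accuracy of each label
print(acc.mean())  # macro accuracy across labels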
The experiment was conducted in two scenarios. The hate speech target, category, and level are all taken into consideration in the first scenario to identify hate speech and abusive language. The second scenario used multi-label classification (MLC) to find abusive language and hate speech in the data without identifying the target, category, and level of the hate speech. Both scenarios were carried out to determine the most suitable classifier, transformation method, and features, and both used several feature extraction methods such as word n-gram, character n-gram, orthography, and lexicon features. The first scenario produced the best result when the combination of RFDT, LP, and word unigram was used, producing an accuracy of 66.12%, whereas the second scenario produced the best result of 77.36% accuracy when the combination of word unigram, character quadgram, positive sentiment, and abusive lexicon features was used along with RFDT and LP.

In summary, the research by [2] achieved 66.12% accuracy for predicting the Multi-Labeled Dataset (MLD). Moreover, the authors propose several ideas to improve the results. The first method is to collect more data; however, the drawback of this method is that it could be a very lengthy and expensive process. Other feasible methods for this problem are data resampling and data augmentation. However, another important consideration is that data balancing for an MLD can be complex due to the nature of the MLD itself, where a data point can contain both majority and minority labels simultaneously. The last method mentioned by Ibrohim and Budi [2] uses a conventional machine learning approach by implementing hierarchical MLC.

Another study, by Aulia and Budi [10], was able to produce a better result for detecting hate speech in the Indonesian language, measured by the F1-score, using a dataset gathered from Facebook. The [10] research yielded its best model by combining an SVM classifier with TF-IDF features, reaching an F1-score of 85%. However, different from the dataset in [2], which is an MLD, this dataset contains only a single label, where a data point is either hate speech or not hate speech. Another study [11] on Indonesian hate speech detection was also able to produce a best model with 98.13% accuracy by using the K-Nearest Neighbor (KNN) classifier. The research by [11], just like the research by [10], also used only a single-label dataset instead of the MLD that is used in [2].

B. Deep Learning Approach to Detect Multilingual Hate Speech

Research by Aluru, Mathew, Saha, and Mukherjee [12] focused on developing DL models for detecting hate speech in multiple languages. The paper analyzed nine languages, including Arabic, English, German, Indonesian, Italian, Polish, Portuguese, Spanish, and French, with 159,753 data points collected from 16 datasets. The experiment in that paper was conducted in two scenarios. The first scenario used two techniques, LASER embeddings with Logistic Regression (LR) and BERT, to create a model using the same language for the training and testing data. The second scenario involved training the model using multiple languages and testing the result on one language. All the datasets used in that research were first transformed into only two labels: hate speech and normal speech. The research found that LASER embeddings with LR performed best in a low-resource condition [12]. However, when training data in the target language was available, the BERT-based model performed better.
C. BERT-Based Model Performance on an Unpreprocessed Dataset

Alzahrani and Jololian [13] analyzed the effects of text preprocessing when using the BERT model. The research utilized a Twitter dataset that consists of 363,031 distinct tweets. The authors aimed to use the BERT model to distinguish the gender of the tweets' authors using this dataset. The training process of the BERT model was implemented in five experiments.

From the five experiments by Alzahrani and Jololian [13], the best result was achieved in the fifth experiment with 86.67% accuracy, where no text preprocessing methods were applied. On the contrary, the fourth experiment, where all the text preprocessing methods were applied, resulted in the worst model with 78.86% accuracy. The authors suspected that applying text preprocessing such as stop word removal and punctuation mark removal negatively affected the overall accuracy of the model because pre-trained models such as BERT need more extensive texts to perform better. In the end, [13] concluded that BERT performs better when no text preprocessing techniques are applied because BERT can take advantage of longer texts.

III. METHODOLOGY

The methodology of this work is illustrated in Fig. 1. The green boxes represent the processes, whereas the yellow diamond and the red parallelogram represent the decision and the output, respectively.

[Fig. 1: Research Methodology. Flowchart: Dataset Gathering -> Dataset Preparation -> Adjust IndoBERT Model Design -> Train Model -> Evaluate Performance -> more experiments to be conducted? (Yes: return to Adjust IndoBERT Model Design; No: Select Model with the Best Performance -> Optimized IndoBERT Model).]

A. Dataset Gathering

This paper used the Indonesian multi-label hate speech dataset by Ibrohim and Budi [2]. Ibrohim and Budi [2] collected 13,169 Indonesian tweets for the dataset, which were manually annotated into 12 different hate speech related labels: HS; Abusive; HS Individual; HS Group; HS Religion; HS Race; HS Physical; HS Gender; HS Other; HS Weak; HS Moderate; and HS Strong.

[Fig. 2: Distribution of Hate Speech Types Occurrences]

The dataset contains 5,561 HS data, consisting of 2,266 data with the HS label only and 3,295 with both the HS and Abusive labels [2]. The Twitter dataset by Ibrohim and Budi [2] also contains 5,043 abusive data, with 1,748 not associated with HS. As shown in Fig. 2, several labels such as HS Religion, HS Race, HS Physical, HS Gender, and HS Strong have significantly lower occurrences compared to the other labels. This suggests that the dataset is imbalanced. The data imbalance problem in MLD was also addressed and explained in the [14] paper. Both [2] and [14] suggested several methods to improve model performance and accuracy on imbalanced data. Ibrohim and Budi [2] mentioned that balancing the dataset by adding more new data could be expensive and time-consuming, so the feasible methods proposed by the [2] paper are to use an ML approach by implementing hierarchical MLC and to perform data resampling or data augmentation. However, for the scope of this research, no hierarchical MLC will be done, since this paper uses a more sophisticated DL approach, IndoBERT, that could improve the overall performance and accuracy over the previously used ML approach. In addition, this paper also tries to balance the data using a heuristic resampling method mentioned in [14] to perform data augmentation. The details of this method will be explained further in the following subsection.

According to previous literature [2], adding new data and annotating that new data manually could help improve the model's performance in the case of an imbalanced dataset. Thus, this research attempted to scavenge hate speech data1 from YouTube and TikTok comment sections. However, the process was lengthy and did not turn out as expected. After three weeks of gathering new data from YouTube and TikTok, there was very little usable data. Most of the hate speech comments found are made up of fewer than three words. This is not ideal for the dataset, as it does not align with the basic structure of a sentence, which should have at least three elements: subject, verb, and object. In the end, since only 90 instances of data were found, this research excluded those 90 data from the following process due to the insufficient amount of data.

1 https://github.com/asianjack19/Additional-Hate-Speech-Data
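A minimal Python sketch of loading this dataset with pandas is given below; the file name, encoding, and column names are assumptions about the public CSV release of [2] and may need to be adjusted.

import pandas as pd

LABELS = ["HS", "Abusive", "HS_Individual", "HS_Group", "HS_Religion", "HS_Race",
          "HS_Physical", "HS_Gender", "HS_Other", "HS_Weak", "HS_Moderate", "HS_Strong"]

# Placeholder path and encoding for the public CSV release of the dataset in [2].
df = pd.read_csv("re_dataset.csv", encoding="latin-1")

texts = df["Tweet"].astype(str).tolist()   # raw tweet text (assumed column name)
labels = df[LABELS].values                 # 12 binary label columns

print(len(texts), labels.shape)            # expected: 13,169 rows, 12 labels
print(labels.sum(axis=0))                  # per-label frequencies, showing the imbalance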
B. Dataset Preparation

In this phase, three variants of dataset distributions from [2] will be further processed with word preprocessing, data distribution, and data stratification.
1) The first dataset duplicate consists of 80% training data, 10% validation data, and 10% testing data.
2) The second dataset duplicate uses 70% training data, 15% validation data, and 15% testing data.
3) Lastly, the third dataset duplicate uses 60% training data, 20% validation data, and 20% testing data.

Then, there are twelve experiments to be conducted on the three dataset duplicates. The dataset splitting includes a random state option to ensure that each experiment is conducted fairly: by setting the same random state for each experiment, the dataset consistently splits into the same training, validation, and testing sets across different runs. The experiment setups are listed in Table I, and a sketch of this reproducible splitting follows the table.

TABLE I: Experiment Setups

Experiment  Preprocessing  Dataset Distribution  Stratify
1           No             80/10/10              No
2           Yes            80/10/10              No
3           No             80/10/10              Yes
4           Yes            80/10/10              Yes
5           No             70/15/15              No
6           Yes            70/15/15              No
7           No             70/15/15              Yes
8           Yes            70/15/15              Yes
9           No             60/20/20              No
10          Yes            60/20/20              No
11          No             60/20/20              Yes
12          Yes            60/20/20              Yes
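The following Python sketch shows one way to derive the 80/10/10 split with a fixed random state; scikit-learn's train_test_split is used as a stand-in, since the paper does not state which splitting utility was used, and the seed value is illustrative.

from sklearn.model_selection import train_test_split

SEED = 42  # illustrative random state; the paper does not report the value used

# texts and labels as loaded in the Dataset Gathering sketch above.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.20, random_state=SEED)   # 80% training data
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=SEED)    # 10% validation / 10% testing

print(len(X_train), len(X_val), len(X_test))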
The first step in word preprocessing is text normalization using typo and slang word dictionaries [2]; it involves replacing informal words with formal words. The following process is case folding. Case folding works by converting all the letters in a sentence to either lowercase or uppercase; this research uses case folding to change all letters to lowercase. Standardizing each sentence to lowercase reduces the dimensionality of the data and ensures consistency for the training process. Unique tags, "user" and "URL," are also removed from each sentence to eliminate specific uninformative tokens for the hate speech model. To further clean the dataset of noisy elements in a sentence, characters such as punctuation marks, newline characters, and alphanumeric patterns are eliminated.

Then, each sentence is checked for the existence of stopwords, which are common words like dan, adalah, oleh, and pada. These words are removed since they do not carry significant meaning. The last process involves stemming using PySastrawi; this stemming process took approximately 25 minutes. Examples of sentences before and after the full word preprocessing pipeline are shown in Table II, and a condensed sketch of the pipeline follows the table.

TABLE II: Before and After Word Preprocessing

Before: disaat semua cowok berusaha melacak perhatian gue. loe lantas remehkan perhatian yg gue kasih khusus ke elo. basic elo cowok bego ! ! !
After: cowok usaha lacak perhati gue lantas remeh perhati gue kasih khusus elo basic cowok bego

Before: RT USER: USER siapa yang telat ngasih tau elu?edan sarap gue bergaul dengan cigax jifla calis licew juga
After: telat tau elu edan sarap gue gaul cigax jifla cal licew sama siapa noh

Before: USER USER Kaum cebong kapir udah keliatan dongoknya dari awal tambah dongok lagi hahahah
After: kaum cebong kafir dongok dungu hahahah
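The sketch below condenses this preprocessing pipeline into a single function. PySastrawi is used for stemming as stated above, while the slang dictionary and stopword list are small placeholders for the resources from [2]; the regular expressions are likewise illustrative.

import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Placeholder resources; the paper uses the typo/slang dictionaries from [2]
# and a full Indonesian stopword list.
SLANG = {"yg": "yang", "loe": "kamu", "elo": "kamu"}
STOPWORDS = {"dan", "adalah", "oleh", "pada", "yang", "ke", "di", "semua"}

stemmer = StemmerFactory().create_stemmer()  # PySastrawi stemmer

def preprocess(text: str) -> str:
    text = text.lower()                                  # case folding
    text = re.sub(r"\buser\b|\burl\b", " ", text)        # drop the user/URL tags
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation, digits, newlines
    tokens = [SLANG.get(t, t) for t in text.split()]     # typo/slang normalization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return stemmer.stem(" ".join(tokens))                # stemming

print(preprocess("RT USER: USER siapa yang telat ngasih tau elu?"))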
Furthermore, the next step in preparing the dataset is stratification, which ensures that each set contains an equal class label distribution. It is essential to maintain a proportional representation of each class label in each set for a dataset that is imbalanced. In stratifying the dataset, a total of 13 data points with a unique label sequence are removed to maintain the consistency of each class label's presence in every dataset split.
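One way to realize this stratification is sketched below, under the assumption (not stated in the paper) that stratification is performed on the full combination of the 12 labels, using the texts and labels arrays from the earlier loading sketch: each unique label sequence is treated as a stratum, sequences that occur only once are dropped, and the stratum is passed to the splitter.

import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Each row's label combination (e.g. "100100000010") acts as a stratum.
combos = np.array(["".join(map(str, row)) for row in labels])
counts = Counter(combos)

# Drop rows whose label combination is unique, mirroring the removal of the
# 13 data points with a unique label sequence.
keep = np.array([counts[c] > 1 for c in combos])
texts_kept = [t for t, k in zip(texts, keep) if k]
labels_kept, combos_kept = labels[keep], combos[keep]

X_train, X_rest, y_train, y_rest = train_test_split(
    texts_kept, labels_kept, test_size=0.20,
    random_state=42, stratify=combos_kept)
# A second stratified split of X_rest/y_rest would then yield the
# validation and testing sets.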
To address the data imbalance problem mentioned before, this research experimented with heuristic resampling for data augmentation: synthetic data can be generated by upsampling the data through text-based synonym replacement, as seen in Table III.

TABLE III: Before and After Synonym Replacement and Resampling

Before: berat wakil setya novanto dpr ri pecat setya novanto dpr ri
After: beban agen setya novanto dpr ri melengserkan setya novanto dpr ri

Before: berat wakil setya novanto dpr ri pecat setya novanto dpr ri
After: bahara distributor setya novanto dpr ri memberhentikan setya novanto dpr ri

Before: berat wakil setya novanto dpr ri pecat setya novanto dpr ri
After: bahara leveransir setya novanto dpr ri memberhentikan setya novanto dpr ri

Before: berat wakil setya novanto dpr ri pecat setya novanto dpr ri
After: bahara pemasok setya novanto dpr ri memberhentikan setya novanto dpr ri

The process works by replacing a single word or multiple words in a sentence with a corresponding synonym. The authors of this research implemented two different variations: the first replaces one random word in a sentence with its synonym, and the second replaces all the words in a sentence with their synonyms. To do the synonym replacement, the first step is to obtain a dictionary containing Indonesian words and their synonyms; the dictionary used here is based on the 2008 Indonesian thesaurus [15]. The number of data points to be resampled is dictated by the ratio of the minority labels.
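Both replacement variations are illustrated in the Python sketch below. The thesaurus is represented by a small placeholder dictionary, since the lookup structure actually built from [15] is not described in the paper.

import random

# Placeholder thesaurus; in the paper this is built from the 2008 Indonesian thesaurus [15].
SYNONYMS = {
    "berat": ["beban", "bahara"],
    "wakil": ["agen", "distributor", "leveransir", "pemasok"],
    "pecat": ["melengserkan", "memberhentikan"],
}

def replace_random_word(sentence: str, rng: random.Random) -> str:
    """Variation 1: replace one randomly chosen word that has a synonym."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        tokens[i] = rng.choice(SYNONYMS[tokens[i]])
    return " ".join(tokens)

def replace_all_words(sentence: str, rng: random.Random) -> str:
    """Variation 2: replace every word that has a synonym."""
    return " ".join(rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
                    for t in sentence.split())

rng = random.Random(0)
s = "berat wakil setya novanto dpr ri pecat setya novanto dpr ri"
print(replace_random_word(s, rng))
print(replace_all_words(s, rng))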
C. Adjust IndoBERT Model Design and Train Model

The model is created using the Indonesian version of BERT from IndoNLU [7], IndoBERT. The version used in this research is the IndoBERT base model, since the larger IndoBERT model requires more computational resources and time, with minimal improvement in accuracy when tested on the [2] dataset. The use of IndoBERT large could also lead to overfitting, because the model would have too many parameters relative to the amount of data.

In determining the hyperparameters for the model, experiments were conducted by trying different hyperparameter combinations to get an optimal result. In the end, the hyperparameters of epoch = 15, learning rate = 2e-05, and batch size = 32 result in slightly better accuracy compared to other values for this particular model and dataset. In addition to the IndoBERT base model, this research added two extra linear layers with 1,024 nodes and 72 nodes, with a ReLU activation function in between, so that the model can detect multi-label hate speech downstream.
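A minimal PyTorch sketch of this architecture is given below. The checkpoint name, the use of the [CLS] representation, and the final projection from the 72-node layer to the 12 labels are assumptions made for illustration; the paper only states the two extra linear layers and the ReLU between them.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "indobenchmark/indobert-base-p1"  # assumed IndoBERT base checkpoint

class IndoBERTHateSpeech(nn.Module):
    def __init__(self, num_labels: int = 12):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(CHECKPOINT)
        hidden = self.encoder.config.hidden_size          # 768 for the base model
        self.head = nn.Sequential(
            nn.Linear(hidden, 1024),    # first extra linear layer (1,024 nodes)
            nn.ReLU(),                  # ReLU between the two extra layers
            nn.Linear(1024, 72),        # second extra linear layer (72 nodes)
            nn.Linear(72, num_labels),  # assumed projection to the 12 labels
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                 # [CLS] token representation
        return self.head(cls)                             # logits; sigmoid gives per-label probabilities

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
batch = tokenizer(["kalimat contoh"], return_tensors="pt", padding=True, truncation=True)
logits = IndoBERTHateSpeech()(batch["input_ids"], batch["attention_mask"])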
D. Evaluate Performance and Select Model with the Best Performance

The results are evaluated using the accuracy metric obtained from (1). This research uses accuracy as the evaluation metric to directly compare the model's performance with the previous work in [2]. This paper also implemented the best-produced IndoBERT model into a web-based application that extracts sentences from online web pages using web scraping techniques; the model then detects the occurrences of hate speech within each sentence.

IV. RESULT AND DISCUSSION
A. Experiment Results

In total, 12 experiments were conducted based on the combinations of dataset distribution, dataset preprocessing, and dataset stratification. In addition, this paper also tried two additional variations of data upsampling using the synonym replacement method to augment the dataset, covering both the preprocessed and the un-preprocessed versions of the dataset, with the best-produced IndoBERT hate speech model. These experiments use the following hyperparameter values (a training-step sketch using these values is given after the list):
• Batch size = 32
• Epoch = 15
• Learning rate = 2 x 10^-5
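For concreteness, a hedged training-loop sketch with these hyperparameters is shown below. AdamW and the binary cross-entropy loss (BCEWithLogitsLoss) are assumptions, as the paper does not name the optimizer or the loss function; the model and tokenizer come from the sketch in Section III-C and the training texts and labels from the splitting sketch in Section III-B.

import torch
from torch.utils.data import DataLoader, TensorDataset

EPOCHS, BATCH_SIZE, LR = 15, 32, 2e-5

model = IndoBERTHateSpeech()                              # sketch from Section III-C
criterion = torch.nn.BCEWithLogitsLoss()                  # assumed multi-label loss
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)  # assumed optimizer

enc = tokenizer(X_train, return_tensors="pt", padding=True, truncation=True)
train_ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(y_train, dtype=torch.float32))
loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for input_ids, attention_mask, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(input_ids, attention_mask), targets)
        loss.backward()
        optimizer.step()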
From Table IV, based on the macro accuracy on the testing set, the best-produced IndoBERT model is achieved with the combination of the un-preprocessed dataset and the stratified dataset with a data distribution of 80% training set, 10% validation set, and 10% testing set, reaching 88.23%.

TABLE IV: Accuracy Comparison of Different Experiment Combinations

Experiment  Preprocessing  Dataset Distribution  Stratify  Training Time  Macro Accuracy
1           No             80/10/10              No        42m 34s        86.56%
2           Yes            80/10/10              No        42m 10s        82.96%
3           No             80/10/10              Yes       42m 8s         88.23%
4           Yes            80/10/10              Yes       42m 10s        84.13%
5           No             70/15/15              No        41m 4s         86.13%
6           Yes            70/15/15              No        38m 17s        86.54%
7           No             70/15/15              Yes       38m 22s        87.52%
8           Yes            70/15/15              Yes       37m 50s        85.89%
9           No             60/20/20              No        34m 7s         86.72%
10          Yes            60/20/20              No        33m 19s        85.26%
11          No             60/20/20              Yes       35m 21s        86.19%
12          Yes            60/20/20              Yes       33m 16s        85.73%

This research also monitored the model performance during the training process of IndoBERT by measuring the average training loss and validation loss, updated at each step of every epoch. The loss values per epoch are plotted in Fig. 3a and Fig. 3b. Both graphs are taken from the best-produced IndoBERT model mentioned previously.

[Fig. 3: Loss and Accuracy Graphs for the Training and Validation Sets. (a) Training loss and validation loss for the best-produced IndoBERT model; (b) training and validation accuracy for the best-produced IndoBERT model.]

For the validation loss, it can be seen that it was not until around epochs 13 and 14 that it started to rise, whereas the training loss showed a stable decreasing trend throughout the 15 epochs. This shows that after epoch 13 the model started to slightly overfit the training data, which is further supported by the model's training accuracy and validation accuracy.

The training accuracy achieved at epoch 15 is 91.30%, whereas the validation accuracy is not far from the testing accuracy, with a value of 88.11%. Therefore, it can be derived from the graph that the best model obtained from this dataset and combination is at epoch 13, with both the training loss and the validation loss at 0.297. However, since the training, validation, and testing accuracy differ only slightly from one another, it is accepted that the model has achieved a consistent level of robustness and reliability.

When comparing experiments with the same data distribution and stratification option, the models trained on the un-preprocessed datasets result in slower training times but better accuracy, by about 1.85% on average, compared to the preprocessed ones. One possible reason is that IndoBERT was pre-trained on multiple preprocessed Indonesian datasets that still maintain the original form of the words, whereas in this research's preprocessing step each word is checked for stemming, stopword removal, and text normalization using typo and slang word dictionaries. These additional steps may have led to differences in the representation of each sentence, especially during the stemming process, which reduces a word to its base form. Thus, when trained on preprocessed datasets where stopwords are removed and words are reduced to their base form, the IndoBERT model could be affected on the testing data due to differences in the representation of the meaning of a particular sentence.
The accuracy per label of the best model, that is, the one with the combination of the un-preprocessed dataset and the stratified dataset with the data distribution of 80% training set, 10% validation set, and 10% testing set, is also shown to be consistent and relatively high across all labels in Table V.

TABLE V: Accuracy of Each Label of the Best Model

Label          Accuracy
HS             88.67%
Abusive        73.02%
HS Individual  78.03%
HS Group       88.82%
HS Religion    93.76%
HS Race        95.82%
HS Physical    97.49%
HS Gender      97.94%
HS Other       80.85%
HS Weak        79.78%
HS Moderate    88.06%
HS Strong      96.50%

However, this high accuracy could be attributed to the high number of negative data in the dataset, especially for the hate speech category labels such as HS Religion, HS Race, HS Physical, and HS Gender. This creates many True Negative (TN) predictions, resulting in a high accuracy, and the existence of a large amount of TN data can mask the IndoBERT model's capability on True Positive (TP) data. This is an effect of the data imbalance in the dataset, where most instances belong to the negative class, contributing to a higher overall accuracy in the performance evaluation.

B. Experiment Results for Synonym Replacement

The first set of experiment results for synonym replacement is listed in Table VI. This experiment replaces one random word with its synonym. Based on the experiment, the macro accuracy decreases for the un-preprocessed dataset: it drops from 88.23% (see Table IV) to 86.03%. The same pattern also occurs in the preprocessed dataset, where the accuracy drops from 84.13% (see Table IV) to 78.54%. This shows a fairly significant decrease in the IndoBERT model's performance when applying this upsampling method to both the un-preprocessed and the preprocessed dataset. The decrease in macro accuracy suggests that replacing a random word in a sentence for upsampling may introduce noise through unfamiliar synonyms, altering the original meaning of that sentence.

TABLE VI: Synonym Replacement for a Random Word

Experiment  Preprocessing  Dataset Distribution  Stratify  Training Time  Macro Accuracy
1           No             80/10/10              Yes       48m 52s        86.03%
2           Yes            80/10/10              Yes       59m 39s        78.54%

In the second variation of the experiment, the result of upsampling by replacing all words in a sentence can be seen in Table VII. In this variation, both the un-preprocessed and the preprocessed datasets show a more significant decrease in macro accuracy than in the first variation. The un-preprocessed dataset experiences a drop from 88.23% to 78.49%, while the preprocessed dataset drops from 84.13% to 68.86%. This significant decrease in accuracy suggests that replacing all words with synonyms negatively impacts the model's performance even more than the first variation.

TABLE VII: Synonym Replacement for All Words

Experiment  Preprocessing  Dataset Distribution  Stratify  Training Time  Macro Accuracy
3           No             80/10/10              Yes       1h 20m 21s     78.49%
4           Yes            80/10/10              Yes       1h 17m 7s      68.86%

Overall, the experiments demonstrate that upsampling data using synonym replacement for data augmentation worsens the IndoBERT model's performance, as indicated by the decrease in macro accuracy. This result supports the argument in [2], where the authors mentioned the difficulty of balancing an MLD. In the [2] MLD, a single data entry can contain both minority and majority labels. During the synonym replacement and upsampling process, the minority label data is upsampled; however, since it is an MLD, where a data point can carry multiple labels, the majority labels also get upsampled, which leads to another dataset imbalance problem. Also, in general, based on the two variations of experiments reported in Table VI and Table VII, the synonym replacement process causes a sentence to lose its original meaning by introducing noisy data.

C. Implementation of the IndoBERT Model

The best-produced IndoBERT model was implemented into a web-based application platform. As shown in Fig. 4, the application first extracts sentences from a given URL using web scraping techniques, and then the IndoBERT model detects and labels each sentence.

[Fig. 4: IndoBERT Detection Result for Hate Speech using the Web-Based Application]
In Fig. 4, using a YouTube URL, the IndoBERT model detected that 35.5% of the sentences extracted from the URL are categorized as hate speech sentences. The IndoBERT model also detected that 25.5% of the sentences contain abusive language.
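A condensed sketch of this detection pipeline is given below; requests and BeautifulSoup are stand-ins for the unspecified scraping stack, the sentence extraction is deliberately naive, and the 0.5 decision threshold is an assumption.

import requests
import torch
from bs4 import BeautifulSoup

LABELS = ["HS", "Abusive", "HS_Individual", "HS_Group", "HS_Religion", "HS_Race",
          "HS_Physical", "HS_Gender", "HS_Other", "HS_Weak", "HS_Moderate", "HS_Strong"]

def detect_from_url(url: str, model, tokenizer, threshold: float = 0.5):
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    # Very rough sentence extraction from the page text.
    sentences = [s.strip() for s in text.split(".") if len(s.split()) >= 3]

    results = []
    model.eval()
    with torch.no_grad():
        for sentence in sentences:
            enc = tokenizer(sentence, return_tensors="pt", truncation=True)
            probs = torch.sigmoid(model(enc["input_ids"], enc["attention_mask"]))[0]
            results.append((sentence, [l for l, p in zip(LABELS, probs) if p >= threshold]))
    return results

# Example usage with the model and tokenizer from the Section III-C sketch:
# for sentence, tags in detect_from_url("https://example.com/page", model, tokenizer):
#     print(tags, sentence)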
D. Evaluation

As shown in Fig. 5, the IndoBERT hate speech model generated in this research reached an optimized result with 88.23% accuracy, outperforming the conventional ML technique used in [2], namely the RFDT classifier combined with the LP data transformation and the word unigram feature, which detects multi-label hate speech data with only 66.12% accuracy on the same MLD.

[Fig. 5: Accuracy Comparison for Detecting Multi-Label Indonesian Hate Speech]

V. CONCLUSION AND SUGGESTION

The authors condensed the research findings into three main conclusions. First, in balancing the multi-label hate speech dataset, the data resampling method through synonym replacement to augment the minority data does not resolve the imbalanced dataset problem, because the ratio of data between the minority labels and the majority labels is not changed. Second, preprocessing the tweet data in the multi-label hate speech dataset resulted in a decrease in macro accuracy of around 2%. The decrease in accuracy could result from the IndoBERT model having been trained on more extended texts, while preprocessing techniques such as punctuation and stopword removal reduce the meaning of a given sentence. Lastly, the IndoBERT hate speech model has been proven to outperform the previous ML method (RFDT + word unigram) with a 22.11% difference in macro accuracy, with IndoBERT achieving 88.23% accuracy.

For future research, the authors suggest an alternative approach to balancing the multi-label dataset other than data resampling, for instance, augmenting the dataset by organically adding more new positive data for the minority labels of the current dataset. Through several experiments using synonym replacement to balance the dataset, this research has shown that the result was not an improved performance. Performing either upsampling or downsampling on minority labels such as HS Religion, HS Race, HS Physical, HS Gender, and HS Strong does not improve the accuracy, since a data point can contain both majority and minority labels at the same time, interconnecting the labels with each other; data resampling on the minority labels therefore also affects the data frequency of the other labels. In addition, the synonym replacement method produced noisy and biased data that further worsened the detection capability of the IndoBERT model. Therefore, the goal of adding more new minority data is to improve the model's accuracy and reliability when predicting multi-label Indonesian hate speech.

REFERENCES

[1] A. Yohanna, "The influence of social media on social interactions among students," Indonesian Journal of Social Sciences, vol. 12, no. 2, p. 34, 2020.
[2] M. O. Ibrohim and I. Budi, "Multi-label hate speech and abusive language detection in Indonesian Twitter," in Proceedings of the Third Workshop on Abusive Language Online, 2019, pp. 46–57.
[3] R. Brehar, D.-A. Mitrea, F. Vancea, T. Marita, S. Nedevschi, M. Lupsor-Platon, M. Rotaru, and R. I. Badea, "Comparison of deep-learning and conventional machine-learning methods for the automatic recognition of the hepatocellular carcinoma areas from ultrasound images," Sensors, vol. 20, no. 11, p. 3085, 2020.
[4] M. Nabipour, P. Nayyeri, H. Jabani, S. Shahab, and A. Mosavi, "Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis," IEEE Access, vol. 8, pp. 150199–150212, 2020.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2023.
[7] B. Wilie, K. Vincentio, G. I. Winata, S. Cahyawijaya, X. Li, Z. Y. Lim, S. Soleman, R. Mahendra, P. Fung, S. Bahar et al., "IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding," arXiv preprint arXiv:2009.05387, 2020.
[8] K. Gelber, "Differentiating hate speech: a systemic discrimination approach," Critical Review of International Social and Political Philosophy, 2019.
[9] U. Ubaidillah and I. D. P. Wijana, "A directive speech act of hate speech on Indonesian social media," LiNGUA: Jurnal Ilmu Bahasa dan Sastra, vol. 16, no. 1, pp. 125–138, 2021.
[10] N. Aulia and I. Budi, "Hate speech detection on Indonesian long text documents using machine learning approach," in Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence, 2019, pp. 164–169.
[11] A. Briliani, B. Irawan, and C. Setianingsih, "Hate speech detection in Indonesian language on Instagram comment section using K-nearest neighbor classification method," in 2019 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS), IEEE, 2019, pp. 98–104.
[12] S. S. Aluru, B. Mathew, P. Saha, and A. Mukherjee, "Deep learning models for multilingual hate speech detection," arXiv preprint arXiv:2004.06465, 2020.
[13] E. Alzahrani and L. Jololian, "How different text-preprocessing techniques using the BERT model affect the gender profiling of authors," arXiv preprint arXiv:2109.13890, 2021.
[14] A. N. Tarekegn, M. Giacobini, and K. Michalak, "A review of methods for imbalanced multi-label classification," Pattern Recognition, vol. 118, p. 107965, 2021.
[15] T. Redaksi, "Tesaurus Bahasa Indonesia Pusat Bahasa," Pus. Bahasa, Dep. Pendidik. Nas., 2008.