Fine-Tuning Transformer Models Using Transfer Learning For Multilingual Threatening Text Identification
ABSTRACT Threatening content detection on social media has recently gained attention. There is very
limited work regarding threatening content detection in low-resource languages, especially in Urdu.
Furthermore, previous work explored only mono-lingual approaches, and multi-lingual threatening content
detection was not studied. This research addressed the task of Multi-lingual Threatening Content Detection
(MTCD) in Urdu and English languages by exploiting transfer learning methodology with fine-tuning
techniques. To address the multi-lingual task, we investigated two methodologies: 1) Joint multi-lingual,
and 2) Joint-translated method. The former approach employs the concept of building a universal classifier
for different languages whereas the latter approach applies the translation process to transform the text
into one language and then performs classification. We explore the Multilingual Representations for Indian
Languages (MuRIL) and Robustly Optimized BERT Pre-Training Approach (RoBERTa) models with fine-tuning,
which have already demonstrated state-of-the-art performance in capturing the contextual and semantic
characteristics within the text. For hyper-parameters, manual search and grid search strategies are utilized to find the optimum values.
Various experiments are performed on bi-lingual English and Urdu datasets and findings revealed that the
proposed methodology outperformed the baselines and showed benchmark performance. The RoBERTa
model achieved the highest performance by demonstrating 92% accuracy and 90% macro F1-score with the
joint multi-lingual approach.
Many languages are spoken worldwide, which leads to diversity in several multi-lingual NLP tasks, such as threatening text and hate speech detection on social media. The task of ‘‘categorization of a set of posts written in different languages (Urdu, English, Roman-Urdu, etc.) into predefined groups across languages’’ is referred to as Multi-lingual Text Classification (MTC). In contrast, ‘‘content written in one language but categorized by a classification model learned in another language’’ is defined as cross-language content classification. Threatening content identification was not handled earlier as a cross-language or MTC methodology; prior approaches addressed it as a mono-lingual paradigm. We are interested in handling it as an MTC task, so MTCD is a binary classification framework that needs to be trained and tested on the multi-lingual threatening content dataset in English and Urdu languages.
In the literature, three types of techniques were explored to handle the task of MTC. The first approach enforces the design of a separate classification framework for each language [7], [8]. The second approach employs the step of translation for all languages into one universal language before applying a single classification system [9], [10]. The third approach emphasizes the development of one classification framework for different languages. In the literature, MTC was explored in a limited context despite its importance. In this study, we dig into the second and third types of approaches to address the design of MTCD in Urdu and English languages. For this purpose, joint-translation and joint multi-lingual methodologies are investigated [11], and RoBERTa and MuRIL transformer models are considered to employ the MTC paradigm. The RoBERTa model has exhibited state-of-the-art performance for cross-lingual and multi-lingual NLP tasks [12]. Likewise, the MuRIL model already demonstrated benchmark performance in mono-lingual and multi-lingual classification tasks for Indian languages [13].
In this study, a robust solution for multi-lingual threatening content identification is proposed by exploiting the MTC paradigm with the strength of the transfer learning technique. The RoBERTa and MuRIL transformer models are used with fine-tuning hyper-parameters. Joint multi-lingual (XLM-RoBERTa, MuRIL) and joint translation-based MTC approaches are explored. Furthermore, the joint-translated approach is further divided into two parts: 1) Urdu-RoBERTa and Urdu-MuRIL, and 2) English-RoBERTa and English-MuRIL. The main contributions of this study addressing the MTCD are presented below:
1. It is the first attempt to propose a multi-lingual threatening content identification framework for English and Urdu languages.
2. We explored joint multi-lingual and joint-translated methodologies for MTC, and an algorithm is also designed to make the methodology easy to understand and the results easy to reproduce.
3. Contextual embeddings offered by state-of-the-art RoBERTa and MuRIL transformer models with fine-tuning are employed for threatening content identification in English and Urdu languages.
4. The experiments showed the effectiveness of the proposed framework, which achieved benchmark performance and outperformed the baselines.
5. The proposed framework demonstrated the highest performance with the joint multi-lingual approach by obtaining 92% accuracy and 90% macro F1-score.
6. The joint multi-lingual approach improved performance by 3.3% in threatening, 10.31% in not-threatening, 6.81% in macro, and 5.3% in weighted F1-score.
The remainder of the paper is organized as follows: Related work is described in section II, containing the summary of threatening content identification works and multi-lingual approaches. Section III presents the proposed methodology in detail, followed by section IV, which describes the detail of the experimental setup. Section V presents the results and their analysis. Discussion and limitations are presented in section VI. Finally, the conclusion is presented in section VII.
II. RELATED WORK
This section briefly reviews the works related to threatening text detection from social media as a mono-lingual approach. In addition, some state-of-the-art multi-lingual approaches addressing low-resource and high-resource languages are reviewed.
A. THREATENING TEXT DETECTION
Due to the increasing volume of abusive content on social media, it is difficult to discriminate threatening material from abusive content manually. The first detection model for threatening content was proposed by [14] to detect threats in the Dutch language using n-gram representation. Then, the authors proposed another threatening text detection model [15] in the Dutch language using a shallow parsing mechanism. Later in 2016, a study [16] proposed a threatening content detection method for YouTube comments. They compared lexical, syntactic, and semantic features and concluded that lexical features outperformed the others. Then, another study [17] proposed a detection model for threats on Twitter. The authors used GloVe embeddings with a Convolutional Neural Network (CNN) and their model demonstrated effective performance. Later, [18] explored bag-of-words features with the logistic regression model for threatening text identification from tweets. Their model achieved 98% F1-score.
Likewise, a large annotated corpus containing violent threat material was released by [19]. The dataset contains 10,000 YouTube comments. A novel approach was proposed [3] by addressing threat detection and then target identification on Twitter for the English language. They explored bag-of-words and FastText features. Moreover, GloVe embeddings, CNN, and Long Short-Term Memory (LSTM) models are used. Their framework achieved an 85% F1-score. The first identification model for threatening language and its target in Urdu was proposed by [6]. The authors explored FastText, char, and word n-gram models with Machine Learning (ML) and
Deep Learning (DL) models. The experiments revealed that word n-gram and FastText are the best feature models. Then, another study [20] addressed the task of threatening content detection in the Urdu language by utilizing the transformer model. However, their dataset is not balanced.
Likewise, a framework for abusive and threatening content detection in Urdu is proposed by [21]. The authors explored several benchmark features like Bidirectional Encoder Representation Transformer (BERT), mBERT with XGBoost, and other ML models. Their framework achieved an 88% F1-score. Recently, a study [5] developed a detection model for Urdu threatening language by exploring n-grams, TF-IDF, and BOW features with the stacking ensemble model. Their model obtained 73.99% F1-score. Then, another study handled the task of abusive and threatening content identification [22]. They explored word2vec and TF-IDF features with several ML models, but their model did not achieve significant performance. The semantic network-based pipeline [23] is designed for threatening text and target identification in tweets. Their proposed model outperformed the ML models and obtained 76% accuracy. More recently, a robust approach is introduced to identify threatening text and its target [4] from Urdu tweets. The BERT model is fine-tuned using important hyperparameters and it outperformed the benchmarks. The summary of all the works for threatening content detection is presented in Table 1.
B. MTC APPROACHES
In the last decade, multi-lingual content classification got attention from the research community, but the research in this area is restricted. Only a few works proposed MTC solutions to develop classification systems for low-resource and high-resource languages. The first work was presented in 2006 [24], in which English and Chinese content was classified using the latent semantic indexing approach, but they addressed it as mono-lingual. Later, a study [9] explored two WordNet methods to handle the MTC. One method did not consider translation and directly used the WordNet related to each language, whereas another approach adopted a translation phase to access WordNet. Then, [25] developed an MTC system for Spanish, Italian, and English languages by exploring n-gram features. They used the Naïve Bayes algorithm for classification. Later, the MTC system for English and Hindi languages is developed [26] by exploring several ML models including genetic algorithm and self-organizing map, and they achieved benchmark accuracy by using feature selection algorithms.
Recently, contextual embeddings based on transformer models and deep neural networks are introduced for English [27] and Arabic languages [28]. Furthermore, some multi-lingual transformer models (pre-trained in many languages) like mBERT [29], XLM-RoBERTa [30], and MuRIL [31] are developed. The study [11] proposed an MTC pipeline for offensive text identification in English and Arabic languages. They explored joint multi-lingual and joint-translation approaches and achieved the best performance with the translation-based method (Arabic-BERT). Later, [32] addressed the offensive language detection problem using the MTC framework for English and Bengali languages. The authors proposed a Deep-BERT model and obtained effective performance.
Summarizing the literature, to the best of our knowledge, we did not find any work on multi-lingual threatening content identification. Furthermore, threatening language detection was only addressed in Dutch, English, and Urdu languages. The following limitations related to threatening content detection are identified in the literature:
■ Mono-lingual Methodology: To the best of our knowledge, prior works used only mono-lingual techniques for designing identification (threatening text) systems for each language.
■ Feature Engineering Methods: The prior works mainly explored lexical, syntactic, and semantic features for threatening content identification.
■ MTC Methodology: To the best of our knowledge, we did not find any work on multi-lingual threatening content identification.
III. PROPOSED METHODOLOGY
This section describes the proposed methodology in detail. We address multi-lingual threatening text detection as a binary classification task. The steps employed in the proposed pipeline are presented in Fig. 1. Each part of the pipeline is described step-by-step in this section.
A. TRANSLATION PHASE
For the joint-translation approach, we need to translate our multi-lingual Twitter dataset into a single language. This process includes two steps: first we transform the multi-lingual dataset into English; second, we transform it into Urdu language. The detail of the steps is provided below:
❖ Universal Urdu Corpus: We already have an Urdu corpus [4] in the multi-lingual dataset. The other corpus (English) [3] is translated into Urdu using the Google translator API. The translated data is edited manually to resolve the issues and inconsistencies. After that, both corpora are combined to get a single Urdu corpus.
❖ Universal English Corpus: The Urdu corpus is translated into English using the services of Google translator. After the manual editing of translated data, we combined it with the English part of the multi-lingual dataset to finalize a single English corpus.
B. PRE-PROCESSING PHASE
Pre-processing is an important phase for automated text classification tasks. After pre-processing, it is convenient to extract precise information from the dataset. We employed the following steps to pre-process the multi-lingual dataset.
■ Punctuations, mentions, hashtags, numbers, HTML tags, and URLs are removed.
■ Emoji/Emoticons are replaced with relevant text, since these hold important information.
■ English text is converted to lowercase.
■ Misspelled words in English text are corrected.
■ English abbreviations (pls, thx, etc.) are expanded.
After the phase of pre-processing, we employ two approaches of MTC: a joint-translated approach and a joint multi-lingual approach. In the joint multi-lingual approach, we combine the tweets of English and Urdu languages, whereas in joint-translated, we use two corpora one by one (Universal Urdu and Universal English) described in section III-A. The algorithmic pseudo-code of our pipeline is described in Fig. 2.
C. TRANSFORMER MODELS
In this study, we explore two transformer models with fine-tuning to design an effective multi-lingual threatening content detection in English and Urdu languages. The detail of two state-of-the-art transformer models is presented below:
1) RoBERTa
The RoBERTa transformer model was introduced by [33] in 2019. The difference between BERT and RoBERTa is that RoBERTa eliminates the next sentence prediction pre-training objective, re-visits the significant hyperparameters, and applies them with larger ranges, but it is mainly based on the architecture of the BERT model. XLM-RoBERTa is trained on 100 languages. Here, we are interested in exploring RoBERTa with fine-tuning of significant hyper-parameters to design a robust multi-lingual threatening text detection model. For this purpose, we investigate XLM-RoBERTa for the joint multi-lingual task and RoBERTa and Urdu-RoBERTa for the joint-translation approach.
2) MuRIL
The MuRIL transformer is the most recent multi-lingual model from Google. It is pre-trained on 17 Indian languages to promote some downstream NLP operations like spelling differences and transliteration to enhance linguistic interoperability. According to the cross-lingual XTREME test, MuRIL performed better than the BERT model [34]. In this study, we are using three models of MuRIL: 1) For multi-lingual text, 2) For English text, and 3) For Urdu text.
D. TOKENIZATION & REPRESENTATION
To make the input data compatible with the two transformer models (RoBERTa and MuRIL), a few steps are needed. First of all, we have to perform tokenization of English and Urdu posts to transform them into a unified format. For this purpose, the [CLS] token is added at the start of every post and the [SEP] token is added at the end of every post. These tokens mark where each post starts and ends, and the resultant entity is a single vector for the whole post. It results in a universal vector to input for the MuRIL and RoBERTa classifiers.
TABLE 2. Search space of hyper-parameters for RoBERTa and MuRIL transformers.
Regarding the RoBERTa transformer, we use RoBERTaTokenizer [33] for English text, RoBERTa-UrduTokenizer (https://huggingface.co/urduhack/roberta-urdu-small) for Urdu language, and XLMRoBERTaTokenizer [35] for multi-lingual text. For the MuRIL transformer model, we again use three tokenizers [31] for Urdu, English, and multi-lingual text. After that, each token of a tweet is mapped to an index corresponding to the transformer model vocabulary (RoBERTa or MuRIL).
E. FINE-TUNING
The next task is the fine-tuning of both transformer models (RoBERTa and MuRIL). We applied two methods to explore suitable values of hyper-parameters for fine-tuning, i.e. manual search and grid search. The number of hyper-parameters is eight; the search space for the eight parameters is presented in Table 2. The maximum number of characters supported by a tweet is 280, therefore 128 could be the maximum value of sequence length. Thus two sequence lengths are investigated to analyze their impact on binary classification, i.e. 64 and 128. Three batch sizes (16, 32, 64) are evaluated one by one to investigate the impact of each batch size. Likewise, three learning rates are explored to see their impact on the training, validation, and test part of the multi-lingual dataset. The other parameters and their corresponding ranges are presented in Table 2. These parameters are used to fine-tune RoBERTa and MuRIL transformer models.
For RoBERTa, we are using its un-cased pre-trained base model for Urdu, English, and multi-lingual dataset. The hidden size is 768, attention_heads are 12 and hidden_layers are also 12. For the MuRIL transformer, we are using its cased pre-trained base model, trained on 17 Indian languages with the MLM layer intact. We are interested in exploring the strengths of RoBERTa and MuRIL models by fine-tuning eight hyper-parameters.
For dataset splitting, we employed the stratified data splitting method and split our datasets into 80-20, in which 20% is used for testing the validated model and the remaining 80% is further split into 90-10, where 90% is actually used for training and 10% is used for validation. The optimizer function is utilized for updating the parameters for each epoch and the output of each training and validating cycle is measured using training loss, validation loss, accuracy, and macro
F1-score. Google Colab and a High-Performance Computing (HPC) local cloud are used to conduct the experiments of fine-tuning two transformers. The transformer models are initialized in their pre-trained settings and then annotated datasets are used to fine-tune important parameters.
F. CLASSIFICATION
For the classification task, we appended a single layer containing the Softmax function on the top of the RoBERTa and MuRIL transformer models. This layer is used to classify the tweets as threatening or non-threatening. The representation of each tweet produced by the fine-tuned model is forwarded to the softmax function and the transformer model is trained to optimize the cross-entropy loss.
G. CATASTROPHIC FORGETTING
The literature reveals that the transformer models encounter the problem of forgetting already learned knowledge while attempting to learn new knowledge by fine-tuning the hyperparameters. This concept is termed as ‘‘Catastrophic forgetting in transfer learning’’ [4]. We exhaustively tried several learning rates to analyze the risk of catastrophic forgetting while fine-tuning the MuRIL and RoBERTa models. After several trials, we found that higher learning rates lead to poor convergence and result in failure most of the time. Thus, we obtained the best results with learning rates ≤ 1e-5.
H. OVERFITTING
For deep learning models, it is a common issue to choose the appropriate number of epochs because choosing very few epochs results in under-fitting, and choosing too many results in over-fitting. We investigated the impact of the number of epochs (5 or 10) on the validation and test parts of the multi-lingual threatening content dataset using the loss function. The performance of RoBERTa and MuRIL transformer models is monitored, and we concluded that 10 epochs are suitable for getting the desired results by evaluating the trained model on the validation and test parts of the dataset.
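The epoch comparison described in Section III-H can be illustrated with a minimal Python sketch. It assumes that one validation loss is recorded per epoch and that a candidate epoch budget is judged by the lowest validation loss reached within it. The function names and the loss values below are hypothetical and serve only as illustration; they are not taken from the reported experiments.

```python
from typing import Optional, Sequence


def lowest_val_loss_within(val_losses: Sequence[float], budget: int) -> float:
    """Lowest validation loss reached within the first `budget` epochs."""
    if not 1 <= budget <= len(val_losses):
        raise ValueError("budget must be between 1 and the number of recorded epochs")
    return min(val_losses[:budget])


def choose_epoch_budget(val_losses: Sequence[float], budgets: Sequence[int]) -> int:
    """Pick the candidate budget with the lowest reachable validation loss.

    Ties go to the smaller budget, since it needs less training.
    """
    return min(sorted(budgets), key=lambda b: lowest_val_loss_within(val_losses, b))


def first_overfit_epoch(val_losses: Sequence[float], patience: int = 2) -> Optional[int]:
    """1-based epoch after which validation loss rose `patience` times in a row.

    Returns None when no such run of increases exists.
    """
    rises = 0
    for i in range(1, len(val_losses)):
        rises = rises + 1 if val_losses[i] > val_losses[i - 1] else 0
        if rises == patience:
            return i - patience + 1
    return None


# Hypothetical per-epoch validation losses for a 10-epoch run (illustrative only).
HYPOTHETICAL_VAL_LOSSES = [0.62, 0.47, 0.40, 0.36, 0.34, 0.33, 0.31, 0.30, 0.29, 0.29]

if __name__ == "__main__":
    print(choose_epoch_budget(HYPOTHETICAL_VAL_LOSSES, [5, 10]))
    print(first_overfit_epoch(HYPOTHETICAL_VAL_LOSSES))
```

With a loss curve that keeps decreasing, the larger budget is selected and no over-fitting epoch is reported. If the validation loss starts rising early, the two budgets tie on their best loss and the smaller one is preferred.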
TABLE 4. Fine-tuning results of XLM-RoBERTa and MuRIL on the test part of the multi-lingual dataset.
hyper-parameters and for the F1-score metric, class-wise performance is presented (threatening, not-threatening, macro, and weighted average). The reported results contain learning rate, hidden-dropout, and weight-decay parameters, and the sequence length is 128. It is evident that the best performance by the MuRIL model is obtained on 1e-5 learning rate, 0.1 hidden-dropout, and 0.1 weight-decay, resulting in 90.35% accuracy and 90.17% weighted F1-score. Likewise, the best performance with XLM-RoBERTa is 91.89% accuracy and 91.80% weighted F1-score on 1e-5 learning rate, 0.05 hidden-dropout, and 0.01 weight-decay. Thus, this experiment concluded that the best performance is obtained by the fine-tuned XLM-RoBERTa model as compared to fine-tuned MuRIL.
The next experiment compared the performances of the fine-tuned MuRIL and XLM-RoBERTa models with ten baselines and the results are shown in Table 5. The results are reported in accuracy, precision, recall, and class-wise performances in F1-score. In baselines, word2vec with the RF model and XLM-RoBERTa with the CNN model performed better in comparison to other combinations. In the proposed models, fine-tuned MuRIL performed better than all baselines, and fine-tuned XLM-RoBERTa performed better than all baselines including fine-tuned MuRIL. The XLM-RoBERTa improved accuracy by 5.02%, precision by 5.21%, and recall by 2.185 in comparison with baselines. Furthermore, it improved 3.3% in threatening, 10.31% in not-threatening, 6.81% in macro, and 5.3% in weighted F1-score compared to baselines as shown in Fig. 3. The highest improvement is observed for detecting not-threatening class instances in the F1-score. The word2vec embeddings combined with ML and DL models did not perform well. Thus, the joint multi-lingual methodology proved itself by demonstrating benchmark performance and also outperformed the baselines in class-wise, macro, and weighted evaluation metrics.
B. JOINT-TRANSLATED RESULTS
In this section, we performed two sets of experiments: one for the joint-translated English model and the other for the joint-translated Urdu model. The objective here is to evaluate the strength of the joint-translated approach for multi-lingual threatening content identification. We already devised two corpora for the joint-translated approach: 1) English, and 2) Urdu. First, we evaluate the effectiveness of the joint-translated English approach and then the joint-translated Urdu approach.
The fine-tuning of English-RoBERTa and MuRIL is performed using the hyper-parameters listed in Table 2 and the best results are reported in Table 6. The joint-translated English corpus is used for the experimental setup and the
performance of classifiers is reported using four metrics. Furthermore, the performance of ten baselines is also added to compare them with the proposed joint-translated English models.
The word2vec and XLM-RoBERTa are combined with three ML and two DL models to create comparable models. Considering baselines, the superior performance is demonstrated by RoBERTa+CNN by achieving 84.33% weighted F1-score. In contrast, the MuRIL fine-tuned model obtained 87.20% weighted F1-score and the fine-tuned RoBERTa model demonstrated 89.74% weighted F1-score. Thus, the proposed joint-translated English methodology outperformed the baselines and demonstrated benchmark performance. The class-wise, macro, and weighted F1-scores of the baselines and proposed frameworks are presented in Fig. 4. It is observable that English-RoBERTa improved accuracy by 5.76%, precision by 4.98%, and recall by 5.76% in comparison with the baseline. Furthermore, the proposed framework improved the threatening class by 4.43%, not-threatening by 7.82%, macro F1-score by 6.13%, and weighted F1-score by 5.41%. We noticed substantial improvement achieved by the proposed joint-translated English model in comparison with baselines, indicating the effectiveness of the proposed methodology. The largest improvement is observed in identifying not-threatening class instances. This proves the strength of the proposed methodology for the joint-translated English approach to address the problem of multi-lingual threatening content identification.
The last set of experiments is performed to investigate the effectiveness of the proposed joint-translated Urdu models for the multi-lingual threatening content identification task. The proposed models are Urdu-RoBERTa and MuRIL. We performed fine-tuning of Urdu-RoBERTa and MuRIL using the same parameters mentioned earlier. After fine-tuning,
the best results are reported. In addition, for state-of-the-art comparison, we compared the fine-tuned models with ten baselines, and the results are presented in Table 7. Among baselines, the best performance is demonstrated by the Urdu-RoBERTa+CNN model by achieving 83.77% accuracy and 77.84% macro F1-score. On the other end, the fine-tuned Urdu-RoBERTa transformer achieved 86.37% accuracy and 82.23% macro F1-score. Furthermore, the fine-tuned MuRIL model demonstrated the best performance by obtaining 87.17% accuracy and 84.51% macro F1-score. Thus the proposed fine-tuned MuRIL model outperformed the baselines including the fine-tuned Urdu-RoBERTa model and showed benchmark performance. It is important to note that the fine-tuned MuRIL model beat the fine-tuned Urdu-RoBERTa for the joint-translated Urdu approach. The class-wise performance of all classification models in the F1-score is presented in Fig. 5. It is visible that the proposed joint-translated Urdu approach obtained substantial improvement in performance, i.e. 3.4% in accuracy, 0.77% in precision, and 9.32% in recall.
Furthermore, we observed 1.63% improvement in threatening class, 11.69% improvement in not-threatening, 6.67% improvement in macro F1-score, and 4.56% improvement in weighted F1-score.
After an extensive set of various experiments, we conclude that the proposed methodology is very helpful in identifying multi-lingual threatening content in English and Urdu. It outperformed the ten baseline models while testing both types (joint multi-lingual and joint-translated) of approaches. A substantial improvement is observed while evaluating the classifiers in macro and weighted F1-scores.
VI. DISCUSSION AND LIMITATIONS
To extract new insights and knowledge from the plethora of data and to automate the business process, content
classification is a significant task. Furthermore, an accurate and efficient multi-lingual NLP framework is in high demand due to the large multilingualism on social media. To handle part of these issues, this study conducted a comparative analysis and evaluation of a proposed multi-lingual threatening content detection framework on a bi-lingual dataset. The task is of high business and research interest, and the developed framework is evaluated in terms of accuracy, precision, recall, and macro & weighted F1-score. Moreover, the findings of this research help to unveil important characteristics that are useful in the identification of threatening expressions from Twitter and YouTube comments. The current study has brought out valuable insights for social media users and our society to protect them and promote peace and harmony. The research on threatening content detection in low-resource languages is very restricted and the main emphasis was on mono-lingual techniques. To the best of our knowledge, work on multi-lingual threatening content detection is missing in the literature. We attempted to fill this gap by designing a multi-lingual pipeline using two techniques (joint multi-lingual and joint-translated approach) to identify threatening content in English and Urdu languages.
The proposed pipeline is evaluated on a bi-lingual semi-supervised setup containing English and Urdu corpora. The transfer learning methodology in the form of fine-tuning of RoBERTa and MuRIL transformers is utilized. The experiments help us to discover insights for further research and aid in making a practical approach. The findings disclosed that the proposed pipeline obtained state-of-the-art performance by getting 92% accuracy and 90% macro F1-score with the joint multi-lingual approach. Furthermore, it outperformed the baselines and proved itself a benchmark approach for multi-lingual threatening content detection. Therefore, these findings suggest that our approach can be applied to similar MTC in NLP tasks.
The current study has some limitations. First of all, only two languages (English and Urdu) are considered to test the performance of the proposed framework; more languages can be incorporated to deal with the task of MTC, especially low-resource languages like Russian, Chinese, Roman Urdu, and Hindi. Second, the proposed framework can be tested on a larger corpus to make the framework more generalizable. Third, threatening content identification is addressed here as a binary classification task. It would be more appropriate to also address the task of ‘‘who is being threatened, individual or community’’. It will be helpful to locate the targeted community.
VII. CONCLUSION
In this paper, we developed a multi-lingual text classification framework to handle the task of multi-lingual threatening content detection in English and Urdu languages. We took advantage of the transfer learning approach to deal with the complexity and overhead of designing a separate classification system for each language. Joint multi-lingual and joint-translated techniques are explored to design the robust MTCD system. The proposed MTCD system is based on the RoBERTa and MuRIL transformer models, which were fine-tuned on a bilingual semi-supervised threatening content detection corpus. The proposed methodology is also transformed into an algorithm for readers and researchers to reproduce the results and understand the methodology easily. Two benchmark transformer models (RoBERTa and MuRIL) are chosen and their effectiveness for the MTCD task is explored extensively by fine-tuning. The proposed pipeline is comprised of four modules (pre-processing, tokenization,
fine-tuning, and classification). Experiments on a bilingual semi-supervised corpus revealed that
the proposed methodology outperformed ten baselines for both the joint multi-lingual and
joint-translated approaches. The best performance was observed with the joint multi-lingual
approach: 92% accuracy and 90% macro F1-score.
In future work, the proposed pipeline can be extended to other low-resource languages such as
Russian, Chinese, and Roman Urdu. In addition, the proposed framework can be readily applied to
other binary and multi-class classification tasks in the NLP field. Another possible direction
would be to revisit the methodology by hybridizing the transformer architecture with other robust
algorithms to improve performance.
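The two figures reported above, accuracy and macro F1-score, can diverge when the classes are imbalanced, because macro F1 gives every class equal weight regardless of its size. The following is a minimal sketch of both metrics in plain Python. It is not the authors' evaluation code; the function names and the toy labels (1 = threatening, 0 = non-threatening) are assumptions made purely for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted wrongly
            fn[t] += 1      # class t was missed
    scores = []
    for label in set(y_true) | set(y_pred):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels (hypothetical; not taken from the paper's corpus).
gold = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(f"accuracy={accuracy(gold, pred):.2f}  macro-F1={macro_f1(gold, pred):.2f}")
```

On these toy labels the accuracy is 0.90 while the macro F1-score is about 0.80, since the single missed example of the minority class lowers that class's F1 to 2/3. This is one reason a macro F1-score can sit below accuracy on an imbalanced threat-detection corpus.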

ACKNOWLEDGMENT
This work was supported in part by Princess Nourah bint Abdulrahman University, Riyadh, Saudi
Arabia, under project PNURSP2023R104. Moreover, this research was supported in part by the
Computational Resources of High Performance Computing (HPC) facilities at National Research
University Higher School of Economics. This article is an output of a research project
implemented as part of the Basic Research Program at the National Research University Higher
School of Economics.

FUNDING
This research did not receive any specific grant from funding agencies in the public, commercial,
or not-for-profit sectors.
REFERENCES [21] M. Das, S. Banerjee, and P. Saha, ‘‘Abusive and threatening language
detection in Urdu using boosting based and BERT based models: A com-
[1] S. Malliga, K. Shanmugavadivel, R. Chinnasamy, N. Subbarayan,
parative approach,’’ 2021, arXiv:2111.14830.
A. Ganesan, D. Ravi, V. Palanikumar, and B. R. Chakravarthi, ‘‘On fine-
[22] M. Humayoun, ‘‘Abusive and threatening language detection in Urdu
tuning adapter-based transformer models for classifying abusive social
using supervised machine learning and feature combinations,’’ 2022,
media Tamil comments,’’ Kongu Eng. College, Perundurai, India, Tech.
arXiv:2204.03062.
Rep., 2023.
[23] F. Fkih and G. Al-Turaif, ‘‘Threat modelling and detection using semantic
[2] M. Anand, K. B. Sahay, M. A. Ahmed, D. Sultan, R. R. Chandan, and network for improving social media safety,’’ Int. J. Comput. Netw. Inf.
B. Singh, ‘‘Deep learning and natural language processing in computation Secur., vol. 15, no. 1, pp. 39–53, Feb. 2023.
for offensive language detection in online social networks by feature
[24] C.-H. Lee, H.-C. Yang, and S.-M. Ma, ‘‘A novel multilingual text cate-
selection and ensemble classification techniques,’’ Theor. Comput. Sci.,
gorization system using latent semantic indexing,’’ in Proc. 1st Int. Conf.
vol. 943, pp. 203–218, Jan. 2023.
Innov. Comput., Inf. Control (ICICIC), vol. 2, Sep. 2006, pp. 503–506.
[3] N. Ashraf, R. Mustafa, G. Sidorov, and A. Gelbukh, ‘‘Individual vs. group [25] P. P. Dhyani and S. Mittal, ‘‘Multilingual text classification,’’ Int. J. Eng.
violent threats classification in online discussions,’’ in Proc. Companion Res., vol. 4, no. 3, pp. 99–101, Mar. 2015.
Web Conf., Apr. 2020, pp. 629–633.
[26] K. Rani, ‘‘Satvika: Text categorization on multiple languages based on
[4] M. S. I. Malik, U. Cheema, and D. I. Ignatov, ‘‘Contextual embeddings classification technique,’’ Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 3,
based on fine-tuned Urdu-BERT for Urdu threatening content and tar- pp. 1578–1581, 2016.
get identification,’’ J. King Saud Univ.-Comput. Inf. Sci., vol. 35, no. 7, [27] G. Liu and J. Guo, ‘‘Bidirectional LSTM with attention mechanism and
Jul. 2023, Art. no. 101606. convolutional layer for text classification,’’ Neurocomputing, vol. 337,
[5] A. Mehmood, M. S. Farooq, A. Naseem, F. Rustam, M. G. Villar, pp. 325–338, Apr. 2019.
C. L. Rodríguez, and I. Ashraf, ‘‘Threatening Urdu language detection [28] W. Antoun, F. Baly, and H. Hajj, ‘‘AraBERT: Transformer-based model for
from tweets using machine learning,’’ Appl. Sci., vol. 12, no. 20, p. 10342, Arabic language understanding,’’ 2020, arXiv:2003.00104.
Oct. 2022. [29] J. D. M.-W. C. Kenton and L. K. Toutanova, ‘‘Bert: Pre-training of deep
[6] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga, and A. Gelbukh, bidirectional transformers for language understanding,’’ in Proc. NAACL-
‘‘Threatening language detection and target identification in Urdu tweets,’’ HLT, 2019, p. 2.
IEEE Access, vol. 9, pp. 128302–128313, 2021. [30] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek,
[7] T. Gonalves and P. Quaresma, ‘‘Multilingual text classification through F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov,
combination of monolingual classifiers,’’ in Proc. 4th Workshop Legal ‘‘Unsupervised cross-lingual representation learning at scale,’’ 2019,
Ontologies Artif. Intell. Techn., vol. 605, 2010, pp. 29–38. arXiv:1911.02116.
[8] M. R. Amini, C. Goutte, and N. Usunier, ‘‘Combining coregularization [31] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan,
and consensus-based self-training for multilingual text categorization,’’ D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, S. Gupta,
in Proc. 33rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2010, S. C. B. Gali, V. Subramanian, and P. Talukdar, ‘‘MuRIL: Multilingual
pp. 475–482. representations for Indian languages,’’ 2021, arXiv:2103.10730.

MUHAMMAD SHAHID IQBAL MALIK received the master’s degree in computer engineering in 2011, and
the Ph.D. degree in data mining from International Islamic University, Islamabad, Pakistan,
in 2018. He is currently a Postdoc Fellow with the Lab for Models and Methods of Computational
Pragmatics, National Research University Higher School of Economics, Moscow, Russia. Previously,
he served more than three years as an Assistant Professor at the Capital University of Science
and Technology, and four years as a Lecturer at COMSATS University Islamabad, Pakistan. In
addition, he served 12 years in the HVAC industry in Islamabad and developed several embedded
systems solutions for air-conditioning systems. He has authored more than 23 research papers
published in leading international journals and conferences. His research interests include
social media mining, natural language processing, predictive analytics, and social computing.
He is a reviewer for well-known international journals.