0% found this document useful (0 votes)
47 views27 pages

Detection of Hate Speech and Offensive Language CodeMix Text in Dravidian Languages Using Cost-Sensitive Learning Approach

This paper explores the detection of hate speech and offensive language in CodeMix Dravidian languages using various multilingual transformer-based embedding models and machine learning classifiers. The study identifies MuRIL as the most effective pre-trained model, achieving high accuracy across multiple datasets, while also addressing class imbalance through a cost-sensitive learning approach. Additionally, a novel annotated test set for Malayalam-English CodeMix is introduced to enhance existing datasets for future research.

Uploaded by

jayaseelan.work
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views27 pages

Detection of Hate Speech and Offensive Language CodeMix Text in Dravidian Languages Using Cost-Sensitive Learning Approach

This paper explores the detection of hate speech and offensive language in CodeMix Dravidian languages using various multilingual transformer-based embedding models and machine learning classifiers. The study identifies MuRIL as the most effective pre-trained model, achieving high accuracy across multiple datasets, while also addressing class imbalance through a cost-sensitive learning approach. Additionally, a novel annotated test set for Malayalam-English CodeMix is introduced to enhance existing datasets for future research.

Uploaded by

jayaseelan.work
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 27

Received 29 December 2023, accepted 14 January 2024, date of publication 5 February 2024, date of current version 9 February 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3358811

Detection of Hate Speech and Offensive Language


CodeMix Text in Dravidian Languages Using
Cost-Sensitive Learning Approach
K. SREELAKSHMI 1, B. PREMJITH 1, BHARATHI RAJA CHAKRAVARTHI 2, AND K. P. SOMAN1
1 Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham, Coimbatore 641112, India
2 School of Computer Science, University of Galway, Galway, H91 TK33 Ireland

Corresponding author: B. Premjith ([email protected])

ABSTRACT Recently, the emergence of social media has opened the way for online harassment in the
form of hate speech and offensive language. An automated approach is needed to detect hate and offensive
content from social media, which is indispensable. This task is challenging in the case of social media
posts or comments in low-resourced CodeMix languages. This paper investigates the efficacy of various
multilingual transformer-based embedding models with machine learning classifiers for detecting hate
speech and offensive language (HOS) content in social media posts in CodeMix Dravidian languages that
belong to the low-resource language group. Experiments were conducted on six sets of openly available
datasets in Kannada-English, Malayalam-English and Tamil-English languages. The objective is to identify
a single pre-trained embedding model that commonly works well for HOS tasks in the above mentioned
languages. For this, a comprehensive study of various multilingual transformer embedding models, such as
BERT, DistilBERT, LaBSE, MuRIL, XLM, IndicBERT, and FNET for HOS detection was conducted. Our
experiments revealed that MuRIL pre-trained embedding performed consistently well for all six datasets
using Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel. In a set of experiments
conducted on six datasets, the highest accuracy results for each dataset are as follows: DravidianLangTech
2021 achieved 96% accuracy for Malayalam, 72% accuracy for Tamil, and 66% accuracy for Kannada.
For HASOC 2021 Tamil, the accuracy reached 76%, and for HASOC 2021 Malayalam, it reached 68%.
Additionally, HASOC 2020 demonstrated an accuracy of 92% for Malayalam. Moreover, we performed an
in-depth error analysis and a comparative study, presenting a tabulated summary of our work compared to
other top-performing studies. In addition, we employed a cost-sensitive learning approach to address the
class imbalance problem in the dataset, in which minority classes get higher classification weights than
the majority classes. The weights were initialized and fine-tuned to obtain the best balance between all the
classes. The results showed that incorporating the cost-sensitive learning strategy avoided class bias in the
trained model. In addition to the aforementioned points, a significant contribution of our research presented
in this paper is introducing a novel annotated test set for Malayalam-English CodeMix. This new dataset
serves as an extension to our existing data, known as the Hate Speech and Offensive Content Identification
in English and Indo-Aryan Languages (HASOC) 2021 Malayalam-English dataset.

INDEX TERMS Natural language processing, CodeMix, hate speech, offensive language, bidirectional
encoder representations from transformers, language-agnostic BERT sentence embedding, multilingual
representations for Indian languages, machine learning, IndicBERT.

I. INTRODUCTION
The associate editor coordinating the review of this manuscript and The emergence of social media platforms has helped people
approving it for publication was Maria Chiara Caschera . communicate across borders and opine easily [29], [36].
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
20064 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 12, 2024
K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

It has not only paved the way for networking and information media text in Kannada-English, Malayalam-English and
exchange but also resulted in the proliferation of hate and Tamil-English languages. We conducted experiments using
offensive content. Under International Human Rights Law, six different corpora collected from various shared tasks,
there is no universal definition of hate and offensive speech containing the Dravidian script and native language words
(HOS). However, in [10] and [42], it is stated that HOS is written using the Roman script (CodeMix). Unlike various
the advocacy or incitement in any form, defamation, hatred, neural network embedding models trained on a single
or vilification of a person or group, along with insulting or language, multilingual models have the advantage of a
abusing anyone on the grounds of their colour, race, religion, single model that can handle the characteristics of multiple
caste, gender, sex, or financial status, which is pervasive on languages together, which motivates us to focus on generating
social media platforms. In most cases, hate speech contains sentence embedding using multilingual models rather than
much offensive content, which soberly damages our society an embedding model trained exclusively for one language.
by leading to civil war, undermining vulnerable groups, and Transformer-based embedding models are trained on both
even spoiling new research inventions like chatbots, which subword-level and word-level information, which also helps
learn from the inputs it experience [46]. the models learn representations for Out-of-Vocabulary
Automatic detection of hate and offensive content from (OOV) words. Furthermore, it is observed that the class
social media posts/comments is very relevant in recent times imbalance problem in the data has not yet been addressed
due to the adverse effect of the increasing amount of HOS in the literature, which could affect the performance of
content on social media. Automatic HOS detection is a the ML/DL models. Therefore, a cost-sensitive learning
relevant area of work mainly because the spread of HOS in approach was incorporated in the classifiers, making them not
social media can lead to the usage of offensive words by biased towards any particular class. In addition, an extension
children, can result in religious or community conflicts and of the Hate Speech and Offensive Content Identification
even lead to the failure of conversational chatbots due to a in English and Indo-Aryan Languages (HASOC) 2021
lack of knowledge for the bots to identify offensive words. Malayalam-English dataset is proposed in this paper, which
Various research has been conducted to detect HOS using is an annotated CodeMix Malayalam-English HOS corpus.
Natural Language Processing (NLP) approaches. However, In summary, the main contributions of this paper are as
the complexity of the language and the nature of the social follows:
media comments and posts caused the ML/DL models • A comprehensive study of various multilingual trans-
to detect the HOS effectively. The challenge increases former embedding models for detecting HOS from
in the case of CodeMix Dravidian languages due to the CodeMix social media texts in Dravidian languages.
usage of vocabulary from multiple languages, the usage • A single multilingual embedding model that performs
of different language scripts, and non-standard grammar, well on all CodeMix datasets of the three languages was
spelling and abbreviation variation. These factors affected the identified.
development of a gold-standard annotated corpus, hence the • Cost-sensitive learning approach was used to deal with
research in detecting HOS contents [13]. However, shared the class imbalance problem in the dataset to avoid the
tasks for offensive language identification from Kannada, class-bias.
Malayalam and Tamil by Chakravarthi et al. [13], [14] • An annotated HOS detection CodeMix Malayalam-
paved the way for more research on CodeMix Dravidian English dataset was developed
languages. HOS detection began by using manually extracted • Error analysis and performance comparison of our work
features namely punctuation count, emoticon count, negation with the state-of-the-art approach.
words, lexicon words [11], [33], [78], followed by machine
learning classifiers. Further research used character N-grams, The rest of the paper is organised as follows: Section II
word N-grams, term frequency-inverse document frequency highlights the previous work done in this area; Section III
(TF-IDF), Bag of Words vectors (BOW) features [45], [60]. gives the details of the dataset used and the experiments and
The advent of neural network-based embedding algorithms the results are provided in Section IV. Section V gives a
motivated the researchers to use pre-trained domain-specific thorough discussion of the error analysis conducted and the
embedding such as Word2vec [55], fastText [56], GloVe [62] paper is concluded in section VI.
for generating word vectors followed by Machine learn-
ing/Deep learning classifiers [30], [72], [73] for detecting II. RELATED WORK
HOS contents. The availability of multilingual pre-trained Research in HOS detection and classification has advanced
models and their efficient performance on various NLP in the past decade. Attentiveness to this field has increased as
tasks increased the use of transformer models for HOS the influence and user adoption of social media and social
detection [27], [38], [49], [50], [63]. platforms have expanded. Research in HOS identification
This paper investigates the performance of various mainly focuses on two approaches: those based on hand-
multilingual transformer embedding models with Machine made features such as punctuation count, emoticon count,
Learning classifiers for detecting HOS contents from social negation words, or lexicon features, and those on neural

VOLUME 12, 2024 20065


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

network-based pre-trained embeddings such as fastText, data by researchers [15], [24], [40] opened the way for more
GloVe and transformers [4]. research on Dravidian languages.
Most of the HOS detection works on social media Most of the papers found in the literature regarding the
comments and posts are highly challenging due to their non- HOS detection in Dravidian languages were related to the
standard formats in spelling and grammar [28]. Generally, teams participating in shared tasks on offensive language
the comments and posts in Dravidian languages appear in identification in Kannada, Malayalam and Tamil Dravidi-
CodeMix form [8]. Several works have recently been reported anLangTech 2021 and HASOC-Dravidian-CodeMix shared
for detecting HOS from social media texts in the CodeMix tasks [12], [48]. Most of these works used transformer models
Dravidian language. This section reviews the research on because of the availability of multilingual pre-trained models,
HOS detection in CodeMix social media texts. their capability to capture context information, and the ease
Apart from English, there were also significant research of fine-tuning. The top-performing teams used various deep
contributions in HOS detection in European languages using learning and transformer methodologies. Saha et al. [67] used
machine learning [18], [19], [23], [39], [78]. Initial imple- various models, namely the XLM-RoBERTa-large model,
mentations used TF-IDF scores [57], BOW Vectors [44], a fusion model with a Bidirectional Encoder Representations
[51], [52], [69], N-grams [26], [58], [78], meta-information from Transformers- Convolutional Neural Network (BERT-
such as user account information and network structure CNN) where a single classification head was trained on
information [19], [59], [70] for representing the text. These the concatenated embedding from different BERT and CNN
features are fed to various machine-learning classifiers models. The BERT models were initialized with fine-tuned
to detect HOS contents. The popularity of deep learning weights. The CNN models were trained on skip-gram word
algorithms attracted interest in HOS detection due to their vectors using fastText. They also experimented using MuRIL
ability to automatically learn input representations which can and IndicBERT pre-trained specifically on low resource
be used for detecting HOS [1], [6], [32], [34], [61], [81]. The languages. Balouchzahi et al. [7] used a COOLI Ensemble
negative impact of the posted contents, their possible severe model, which takes the CountVectors of words and character
consequences and the lack of annotated data paved the way sequence as features and classifies them using a voting clas-
for many academic events and shared tasks on HOS detection. sifier with three estimators Multi-Layer Perceptron, extreme
Some of the tasks are: Gradient Boosting, Logistic Regression. Tula et al. [75]
• The shared task on aggression identification included used an ensemble model of DistilmBERT and ULMFiT
in the First Workshop on Trolling, Aggression and and inverse weighting and focal loss strategies to solve the
Cyberbullying [47]. class imbalance issue. Vasantharajan and Thayasivam [76]
• The first, second and third editions of the Workshop on used pre-trained multilingual transformer models and used
Abusive Language [65]. Negative Log Likelihood (NLL) Loss with class weights
• The first and second edition of GermEval Shared Task and self-adjusting dice loss to resolve the class imbalance
on the Identification of Offensive Language [74]. issue. A few researchers have worked on the comparison
• The MEX-A3T track at IberLEF 2019 on Authorship of pre-trained Embedding to identify Hate Speech [5]. The
and Aggressiveness Analysis [3]. paper compares BERT [25], XLNet [79], DistilBERT [68],
• The PolEval 2019 shared Task 6 on Automatic Cyber- RoBERTA [21] and Ensemble model for classification.
bullying Detection in Polish Twitter [64]. Hande et al. trained multi-task learning models for Sentiment
• The first edition of the HASOC track at FIRE 2019 on analysis and offensive language [35]. The other works in
HOS and Offensive Content Identification in Indo- this area are on Offensive Language Identification using
European Languages [54]. Pseudo-labeling [37] and an approach that uses selective
• The SemEval shared subtask 5 on the detection of translation and transliteration techniques to reap better results
HOS against immigrants and women (HatEval) [9] and by fine-tuning and ensembling multilingual transformer [76].
subtask 6 on Identifying and categorizing offensive Sivalingam and Thavareesan [71] used Support Vector
language in social media (OffensEval) [80]. Machine, random forest, k- Nearest Neighbour and Naive
Bayes classifiers with chi-square, BOW, TF-IDF feature
representation techniques. Apart from these, there are works
A. DRAVIDIAN CodeMix on pre-trained models specific to Indian Languages. Raj
Recently, there has been an increase in research focus on Dabre et al. and team developed a multilingual, sequence-
HOS detection in CodeMix Dravidian languages, particularly to-sequence pre-trained model IndicBART on 11 Indian
Kannada-English, Malayalam-English and Tamil-English. languages and English. The model is smaller in size than
However, the diverse nature of the grammar, polysemous mBERT but gives comparative results for Neural Machine
words and unavailability of the tools and annotated data Translation and summarization [22]. From the literature
limited the research in Dravidian languages [2], [16]. Shared review, we observed that various pre-trained models are
tasks on offensive language identification in Kannada, explored on the different datasets, including class imbalanced
Malayalam Tamil [13], [14] and contribution of annotated data.

20066 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 1. Detailed dataset description. The dataset statistics including the train-test count, the details about the dataset source, the scripts involved and
the labels.

Our literature review revealed that no recognized pre- model. We removed all the comments from the collected
trained model is available for detecting HOS in CodeMix comments that were not Malayalam-English CodeMix. These
Dravidian languages. It inspired us to investigate different comments were then used to create a dataset for the
multilingual transformer embeddings for HOS detection on offensive language classification task. This test set is an
six datasets from three language pairs: Kannada-English, extension of the HASOC 2021 Malayalam dataset [17].
Malayalam-English, and Tamil-English. Moreover, we aimed In this work, we annotated the data into two classes - hate
to find a single multilingual embedding that performs effec- and Non-hate, with each class having 670 and 330 test
tively on all CodeMix datasets across the three languages. sentences, respectively. One example from each class is given
Furthermore, it was noted that most of the existing data below.
exhibits an imbalance. It motivated us to utilize a cost- • Non-hate: A comment is annotated as Non-hate if
sensitive learning strategy to address the issue. In addition, it doesn’t contain any offensive words or is not
we noticed a scarcity of annotated corpus, which inspired us sarcastically insulting a person or a group.
to create a novel annotated Malayalam-English HOS corpora. Text: Cbz nalla quality ulla bike aanu..Silencer
ilaki pokunnathu oru safety mechanism aanu..Impact
III. DATASET DESCRIPTION kurakkan..
We conducted our experiments on six datasets belonging Translation: Cbz is a good quality bike. Silencer getting
to three Dravidian CodeMix languages, Kannada-English, removed is a part of safety mechanism to reduce the
Malayalam-English and Tamil-English, collected from the impact.
shared task on Offensive Language Identification in Kannada, • Hate: Comments with offensive or abusive language
Malayalam and Tamil, HASOC track at FIRE 2020: which is used to insult people or group are labelled as
Hate Speech and Offensive Content Identification in Indo- Hate.
European Languages and HASOC track at FIRE 2021:Offen- Text: Yep. . . ennittum telegram inne free aayi kaa-
sive Language Identification for Dravidian Languages in nunna oolakal
CodeMix Text. Detailed dataset description is given in Translation: Yep.. Still seeing telegram free fools.
Table 1. The classwise dissemination of the datasets is
specified in Table 2. 1) ANNOTATION
The dataset annotation was done by three annotators who
A. IN-HOUSE TESTSET were proficient in Malayalam and English. The annotation
The lack of annotated data is still a hindrance to research was based on whole sentence meaning, usage of words,
in the area of HOS. So, we have contributed Malayalam- sarcasm in the speech and emoticons. All three annotators
English annotated HOS detection data for shared tasks separately annotated the whole test set. The annotated
HASOC 2020 Malayalam and HASOC 2021 Malayalam. dataset is in the form of an excel sheet, with the first
As a part of this work, we collected 1000 Malayalam-English column having the texts and the second column having the
CodeMix YouTube comments to validate our top-performing tags.

VOLUME 12, 2024 20067


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 2. Class-wise distribution of the datasets.

2) INTER ANNOTATOR AGREEMENT fall into two main categories, offensive and not-offensive,
Inter Annotator Agreement measures the accord between the or into multiple classes offensive-targeted-insult-individual,
annotators during data annotation. All three annotators follow offensive-targeted-insult-group, offensive-targeted-insult-
the same set of guidelines for annotation. After the annotation other, offensive untargeted and not-in-intended-language.
by the three annotators, the tag for each comment is chosen by The work investigates various multilingual pre-trained
the majority voting scheme. That is, for a comment, the label transformer-based models to find the best fit for Dravidian
that the majority of the annotators choose is assigned to it. CodeMix datasets of Kannada, Malayalam and Tamil.
The annotation is validated using Cohen’s kappa score, [20], In this section, we talk about several multilingual pre-trained
a statistical measure for assessing the reliability of agreement transformer-based models. We obtained the transformer
between a fixed number of raters over 2. Cohen’s Kappa embedding using https://www.sbert.net/. The transformer
score for the proposed corpus is 0.8967. Equation 1 shows models, the experiments conducted, and the results are
the formula used to calculate Cohen’s Kappa score, where P̄ explained in detail in the algorithm 1 and workflow
is the observed agreement, and P̄e is the expected agreement. diagram 1.
As described in the dataset description section, our
P̄ − P̄e dataset consists of CodeMix and Indic script sentences with
k= (1)
1 − P̄e emoticons and punctuation. These emoticons and punctuation
Since, the number of annotators is three the expected were not removed, and the dataset was not subjected to any
agreement P̄e = 1/3. other preprocessing as they can contribute to the emotion the
sentence carries.
IV. EXPERIMENTS AND RESULTS The first step in the workflow is to extract features.
This section explores the capabilities and the limitations of Since the data is diverse and multilingual, multilingual
the different transformer embedding and machine learning transformer models trained on a large Wikipedia dump
approaches for hate speech detection using the workflow with a sizeable vocabulary were used. We used the python
given in Figure 2. We considered six sets of Indian CodeMix framework SentenceTransformers to convert these sentences
datasets which consist of a single sentence that either to numbers. Transformer models such as BERT, DistilBERT,

20068 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

Algorithm 1 An Algorithm Explaining the Workflow Used


for Building a HOS Detection Classifier. Features Are
Extracted Using Multilingual Models and Machine Learning
Classifier Is Built Using Training Set
[1]
Input: X = CodeMix social media text in the languages
Kannada-English, Malayalam-English and Tamil-English
Texts with labels.
Output: Trained machine learning models
Multilingual Transformer models M = {BERT, Distil-
BERT, XLM-RoBERTa, LaBSE, MuRIL, FNet, IndicBERT
} FIGURE 2. An illustration of the transformer architecture. It has 12 blocks
Machine Learning Classifiers C={Random Forest,Linear each consisting of Multi head Attention, Add and Norm, point wise FFN
and Add and Norm. The embedding of the input words w1, w2 are
Regression, Naive Bayes, K-Nearest neighbour, Decision combined with the positional encoding and fed to the block.
Tree, Adaboost }

for x in X do
Read each text require capable models connecting the previous information
for m in M do to the present. For example, predicting the next word from
Generate embedding the previous words in a sentence or present video frames
Assign class weights using the below given formula from the previous frame, etc. Recurrent neural network
(RNN) and Long short-term memory (LSTM) are such neural
N networks capable of remembering previous information.
(2) However, these models face the vanishing gradient problem
n×b
whenever such long-term dependencies are involved. Since
where, N = Total Number of text in the dataset,
these are seq2seq models, they handle the sentences word by
n = Number of classes and b = Total number of
word or character by character forming a hindrance towards
occurrences of each class in the dataset
parallelization. Especially in the case of long sentences, these
for c in C do
Train the machine learning classifier models tend to forget the information of the initial positions.
Save the model These flaws of RNN and LSTM led to the introduction of the
end Transformer model [77]. Figure 2 gives the illustration of the
end transformer architecture.
end The transformer is the first-ever model that entirely
depends on self-attention to obtain the input representation.
As illustrated in Figure 2. It consists of encoder-decoder
stacks. The encoder consists of a multi-head self-attention
layer and a feed-forward layer. In addition to those layers,
the decoder consists of masked multi-head attention. Around
each of these sub-layers of both encoder and decoder, a resid-
ual connection is employed, followed by normalization.
Apart from this, the multi-head attention in the decoder takes
the encoder output as well [77].
Attention mainly involves six steps.
1) Calculation of the vectors Query, Key, and Value.
These vectors are calculated by multiplying the word
FIGURE 1. An illustration of the workflow. The dataset is divided into
embedding by three matrices we trained during the
train and test sets and the embeddings of the data are obtained. machine training process. The architecture is designed in such a
learning classifier is trained and used to predict the labels. way that the embedding vector is of length 512, and the
Query, Key and Value vectors have a smaller dimension
of 64.
XLM-RoBERTa, LaBSE, MuRIL, FNet and IndicBERT are 2) Finding the self-attention score. This score is calculated
used to extract embedding from the sentences. by taking the dot product of the query vector of one
word to the key vector of every other word. For a
A. TRANSFORMER MODELS particular word in a sentence, this score gives how
Most NLP tasks, for instance, Machine Translation (MT), much focus has to be given to other words of the input
Topic Classification, Named Entity Recognition (NER), etc., sentence.

VOLUME 12, 2024 20069


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

3) Divide each score with 8, which is the square root of decisions. We used the ‘‘bert-base-multilingual-cased’’ from
the dimension of the key vector. hugging face for our experiments.
4) These values are passed through a softmax layer, which
converts them to a positive value. 1) DistilBERT
5) Multiply the softmax score with each value vector, DistilBERT is a smaller language representation model,
which forms the weighted value vector. a distilled version of the BERT base multilingual model.
6) Sum the weighted value vectors. It can perform all the tasks a BERT model can do but at
twice the speed of mBERT-base. It is a case-sensitive model
1) MULTI-HEAD ATTENTION with six layers, 768 dimensions and 12 attention heads which
The model uses several attention layers parallelly, forming makes a total of 134 million parameters [68].
the multi-head attention. It is mainly used to attend all the
positions in the previous layer. So when attention is calculated 2) XLM-RoBERTa
multiple times, we get multiple z matrices for a single XLM-R is a transformer-based masked multilingual language
sentence. These matrices are concatenated and then fed to the model trained on 100 languages. It was trained using more
feed-forward network. than two terabytes of filtered CommonCrawl data. It has
significantly improved performance on various cross-lingual
2) POSITIONAL ENCODINGS transfer tasks and outperformed the mBERT model. XLM-
The absence of recurrence or convolutional layers is over- R has 24 layers and 16 attention heads, making a total of
come by the positional encodings. It feeds the information 550 million parameters [21].
about the tokens’ relative or absolute position, which takes
care of the succession of sequence order in the model. 3) LaBSE
LaBSE was proposed [31] to produce a language-agnostic
B. BERT sentence embedding by following the multilingual BERT
BERT is a case-sensitive transformer model pre-trained on model. It is trained on 109 languages.
a large unlabelled Wikipedia corpus of 104 languages using
deep bidirectional representations [25]. BERT-base multilin- 4) MuRIL
gual (mBERT-base) architecture has 12 layers, 768 hidden Multilingual Representations for Indian Languages (MuRIL)
sizes and 12 attention heads, making a total of 177 million is a transformer-based multilingual language model built
parameters. It has been trained on 11 NLP tasks with two for Indian languages. The model is trained using both
main objectives: translated and transliterated document pairs of 16 Indian
1) Masked language modeling (MLM) BERT architec- languages and English. It uses BERT architecture which
ture enables the model to learn the bidirectional is pre-trained from scratch using datasets collected from
representation of the sentence, in contrast to other Wikipedia, CommonCrawl, PMINDIA, and Dakshina. It has
Language models, which are either trained from outperformed mBERT on a lot of downstream tasks [43].
forward or backward, the sequential deep learning
models, or autoregressive text generation models such 5) IndicBERT
as GPT. Even so, the word might unintentionally spot IndicBERT is a multilingual model trained based on the
itself due to its bidirectional nature. To prevent this the ALBERT model. It is trained on IndicCorp of 12 Indian
model masks 15% of the input words and the objective languages, including Assamese, Bengali, English, Gujarati,
is to predict the masked words. Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil,
2) Next-Sentence Prediction (NSP) Language models’ Telugu and evaluated on IndicGLUE. Compared to other
failure to accurately represent the relationship between multilingual models, namely IndicBERT mBERT, XLM-R
two sentences is one of their fundamental weaknesses. IndicBERT has fewer parameters [41].
However, this is crucial for NLP tasks like question
answering and interference with natural language. 6) FNet
The task of Next Sentence Prediction is carried out FNet is a transformer model pre-trained on an English dataset
to get around this. In this, two masked sentences for masked language modeling and next sentence prediction.
are combined as pretraining inputs by the model. It is very similar to other transformer models except for
Sometimes they line up with sentences that are adjacent using Fourier transforms instead of attention. It is trained
to one another in the original text, and other times they on a massive corpus of English raw texts without labels.
don’t. The next step is for the model to determine if the The model can be used for many downstream tasks such as
two sentences are in order. classification.
BERT can be fine-tuned on tasks such as sequence The models mentioned above were chosen because they
classification, token classification or question answering were trained on multilingual data. Among these models,
that use the whole sentence (potentially masked) to make LaBSE, MuRIL and IndicBERT are explicitly trained in

20070 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

Indian languages but not on CodeMix text. The majority This way of training the LaBSE model on Indian scripts
of our data is in Roman script. Hence we chose FNet, helped the model grab the meaning of CodeMix texts.
which is an English pre-trained model. In addition, FNet Compared to all the other models, MuRIL gave a consistent
is smaller in size and hence computationally efficient. Our performance on all the datasets irrespective of language or
dataset being CodeMix with Roman and Indic script, we used script. We also obtained comparable results to the state-of-
them to get the embedding using the transformer models the-art models. This is due to the MuRIL model’s pre-training
BERT, DistilBERT, XLM-RoBERTa, LaBSE, MuRIL and on parallel and monolingual segments. The model was
IndicBERT. At the same time, we transliterated the data to trained using monolingual data collected from Wikipedia
English before using FNet to extract the embedding. and Common Crawl corpora for 17 Indian languages and
parallel corpora obtained by translation and transliteration of
C. DISCUSSION OF RESULTS the monolingual corpora mentioned above. MuRIL exploits
As shown in the workflow algorithm 1, the third step the characteristics of the transformer model to generate the
is to classify. For classification, we chose the following embeddings of words in Indian languages.
set of Machine learning classifiers. Random Forest, Linear
Regression, Naive Bayes, K-Nearest neighbor, Decision Tree, D. COMPARATIVE STUDY WITH THE STATE-OF-THE-ART
Adaboost, SVM Rbf, SVM Linear, SVM Ploy. RESULTS
In total, we conducted 63 experiments on six different In this subsection, we compare the results obtained
datasets. The results of the experiments conducted are given by our approach with the state-of-the-art models. The
in Tables 3, 4, 5, 6, 7, 8. We used the metrics Accuracy, comparison details are given in Table 9. The state-
Precision (macro), Recall (macro), F1-score (macro) and F1- of-the-art results are taken from the overview papers
score (weighted) to evaluate our models. One of the major of each shared task. Out of the four datasets, our
issues with available data is the class imbalance. We resolved approach showed comparable results with the state-of-the-
the dataset imbalance issue during the experiment using a art models in three datasets without any data translation.
cost-sensitive learning approach in which the class weights In DravidianLangTech 2021 Malayalam data, our MuRIL
are computed using Equation 3. embedding+Machine Learning classifier got the same F1-
|X | score compared to the top-performing ULMFiT model of
wci = (3) the state-of-the-art model. However, our other embedding
n |ci |
BERT+Machine Leaning, DistilBERT+Machine Leaning,
where wci is the class weight for the class ci , |X | is the and XLM-R+Machine Leaning had better performance than
total number of data points in the corpus, n is the total the BERT, DistilBERT, XLM-R based classifiers of the
number of classes, and |ci | is the total number of data points state-of-the-art work. For DravidianLangTech 2021 Tamil
in the class ci . This would assign a large weight to the data, the MuRIL embedding+Machine Learning classifier
minority classes and small weights to the majority classes, and XLM-R embedding+Machine Learning classifier per-
subsequently restricting the classifiers to bias towards the formed better than their MuRIL and XLM-R classifiers.
majority classes. For DravidianLangTech 2021 Kannada data, our MuRIL
The code for the experiments is given in the GitHub embedding+Machine Learning classifier crossed the results
repository. obtained by the MuRIL classifier of the state-of-the-art work.
On observing the results, we noted that for Malayalam In addition, for HASOC 2020 Malayalam data, our approach
DravidianLangTech data DistilBERT, MuRIL, LaBSE gave got extremely high performance compared to the state-of-the-
comparatively high accuracy and F1-score; for Tamil Dravid- art model. This indicates that BERT-based embedding with
ianLangTech data DistilBERT, MuRIL gave comparatively Machine Learning classifiers has the upper hand compared
high accuracy and F1-score and for Kannada Dravidian- to BERT-based classifiers in HOS from Dravidian language
LangTech data MuRIL gave high accuracy and F1-score. tasks.
These three datasets had sentences in Roman as well as
Indic scripts, and they also had other non-Indian language
sentences. Out of these, for two datasets, DistilBERT was E. VALIDATION ON IN-HOUSE DATA
able to perform well because of its masking model, which The performance of the collected test set was validated using
grabbed the complete sentence meaning in addition to it being the MuRIL models trained on the three binary datasets. The
trained in more than 100 languages. For HASOC Malayalam details of this are given in Table 10.
2021 and HASOC Tamil 2021 data, which are in pure Roman
script and CodeMix, MuRIL gave the highest accuracy and V. ERROR ANALYSIS
F1-score. For HASOC Malayalam 2020, LaBSE and MuRIL This section provides a comprehensive error analysis of the
gave comparatively high accuracy and F1-score. This data, experimental outputs, exploring the misclassification errors
having both Roman and Devanagari script LaBSE gave encountered by the model during the classification task across
high performance due to its Dual encoder structure, which diverse datasets in different languages. The evaluation is
considers source text and targets text simultaneously as input. based on a weighted F1-score for performance comparison.

VOLUME 12, 2024 20071


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 3. Results for Kannada DravidianLangTech dataset.

20072 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 4. Results for Malayalam DravidianLangTech dataset.

VOLUME 12, 2024 20073


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 5. Results for Malayalam HASOC 2021 dataset.

20074 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 6. Results for Malayalam HASOC 2020 dataset.

VOLUME 12, 2024 20075


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 7. Results for Tamil DravidianLangTech dataset.

20076 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 8. Results for Tamil HASOC 2021 dataset.

VOLUME 12, 2024 20077


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 9. Comparison of the weighted F1-scores of the proposed work and the state of the art work. In the table the BERT+Machine Learning models that
excel the BERT-based classifiers are highlighted.

TABLE 10. Performance of each embedding trained on three sets of binary data over the In-house test set.

A. DravidianLangTech DATA • The first level error analysis was done on the lengths
1) KANNADA of the sentences. We checked for the length of the
All the models applied to this CodeMix data achieved misclassified sentences and the rightly classified sen-
exceptional results. MuRIL embedding with SVM (RBF) tences. Figures 3 and 4 show the histogram plots of the
classifier obtained the highest results. The error analysis of sentence lengths for misclassified and rightly classified
this data is based on the MuRIL embedding results with sentences. It is evident from the two figures that the
the Random Forest classifier. Out of the 2001 test set, only sentence lengths have not affected the predictions as
89 were misclassified. we observe that most of the misclassified sentences and

20078 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

FIGURE 3. The histogram plot of the sentence lengths for the


misclassified sentences for the top performing model on
DravidianLangTech Kannada data.

FIGURE 5. 3D TSNE plot for Kannada DravidianLangTech dataset.The


figure has the scatter plot for 6 different classes.

The above sentence doesn’t contain any Kannada words


hence as per the definition the sentence is labelled as
non_Kannada. But it doesn’t have any offensive words
and it gives a positive sentiment which resulted in the
sentence getting misclassified as Not_offensive.
Table 12 gives the examples which has high chance of
falling into multiple classes due to the lack of offensive
FIGURE 4. The histogram plot of the sentence lengths for the correctly words.
classified sentences for the top performing model on DravidianLangTech
kannada data. • Text: Found 806 rashmika mangannas. . . .
Translation: Found 806 Rashmika monkeys
This is a very confusing sentence as the word manganna
sounds like ‘‘Mandanna’’. Though the word manganna
the rightly classified sentences have sentence lengths
means monkeys in Kannada the replaced character
between 25 and 50.
can be considered a spelling error. Hence the model
• The data is a mixture of multiple Indian and non-
misclassified it as ‘‘Not-offensive’’.
Indian languages written in Roman as well as Language
• Figures 5 and 6 show the dataset’s 3D TSNE and PCA
specific text. We manually analysed the effect of lan-
scatter plots. The 768-dimensional sentence embedding
guage on misclassification by comparing the languages
is mapped to 3D and plotted. It is clear from the Figures
in misclassified and correctly classified sentences. It was
that the dataset is clumsy, there are many overlapping
observed that the difference in language does not affect
data points, and there is no clear separation between the
the prediction.
dataset of each class. This has made the classification
• Among the misclassified sentences, we observed that a
strenuous.
few of the sentences were mislabelled. Table 11 shows
the test sentences that are mislabelled but predicted
rightly by the model. Most of the sentences are in pure 2) MALAYALAM
Roman script but are labeled as Not_offensive but have All the models applied to this CodeMix data achieved
some offensive content. This is one of the reasons for the exceptional results. DistilBERT embedding with Random
model’s low performance. Forest classifier and MuRIL obtained the highest results. The
• Certain comments do not have any offensive words but error analysis on this data is done on the results obtained for
are written in Latin script and do not contain any Kan- DistilBERT embedding with Random Forest classifier. Out
nada words, which have to fall into the non_Kannada of the 2001 test set, only 89 were misclassified. This section
class, but the model predicted it as Not_offensive. discusses the misclassification errors of the model for the
• Text: D boss fans inda full support iden #jaidboss classification task on various datasets of different languages.
Translation: D boss fan’s full support is there #jaidboss The performance is compared using a weighted F1-score.

VOLUME 12, 2024 20079


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 11. These are test sentences from Kannada DravidianLangTech dataset that are mislabelled but predicted rightly by the model.

TABLE 12. These are test sentences from Kannada DravidianLangTech dataset that are misclassified as Not_offensive as they do not contain any
offensive words but are labelled as non_Kannada as they do not contain any Kannada words.

• The first level error analysis was done on the length of misclassified and correctly classified sentences. It was
sentences. We checked the length of the misclassified observed that the difference in language does not affect
sentences and the rightly classified sentences. Figures 7 the prediction.
and 8 show the histogram plots of the sentence lengths • Among the misclassified sentences, we observed that a
for misclassified and rightly classified sentences. It is few of the sentences were mislabelled. Table 13 shows
evident from the two figures that the sentences’ length the test sentences that are mislabelled but predicted
has not affected the predictions as we observe that most rightly by the model. Most of the sentences are in
of the misclassified sentences and the rightly classified pure Roman script but are labeled as non-Malayalam.
sentences have sentences length between 25 and 50. However, as per the definition, sentences that do not
• The data is a mixture of multiple Indian and non- have Malayalam words written in Malayalam script
Indian languages written in Roman and language- or Latin script are labeled as non-Malayalam [14].
specific text. We manually analsed the effect of language This is one of the reasons for the model’s low
on misclassification by comparing the languages in performance.

20080 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 13. These are test sentences from Malayalam DravidianLangTech dataset that are mislabelled but predicted rightly by the model.

FIGURE 8. The histogram plot of the sentence lengths for the correctly
classified sentences for the top performing model.

• Text: OUR BROTHER IS COMING ’BIG BROTHER’


The above sentence does not contain any Malayalam
words. So as per the definition, the sentence is labeled
FIGURE 6. 3D PCA plot for Kannada DravidianLangTech dataset.The figure
has the scatter plot for 6 different classes. as non_Malayalam. But it does not have any offensive
words, and it gives a positive sentiment which resulted
in the sentence getting misclassified as Not_offensive
Table 14 gives the examples which has high chance of
falling into multiple classes due to the lack of offensive
words.
• Text: ithokke comedy pole und . . . aa bahubaliyude
manam kalayo ?
Translation: This seems like comedy. Will Bagubali’s
fame be ruined.
This comment is labelled as ‘‘Offensive_Targeted_
Insult_Group’’ as it is in a way trying to insult a
group. But as it doesn’t contain any offensive words the
comment got misclassified as Not_offensive.
• Text: Aye kuura trailer oola padam chali mohan-
FIGURE 7. The histogram plot of the sentence lengths for the
misclassified sentences for the top performing model on lal*************
DravidianLangTech Malayalam data. Translation: Yuk bad trailer worst movie dirty mohan-
lal*************
In this comment, the author is trying to insult a
• There are specific comments that do not have any group of people who worked behind the movie
offensive words but are written in Latin script and do by writing bad comments about the trailer, movie
not contain any Malayalam words, which have to fall and the actor Mohanlal, so the comment is Offen-
into the non_Malayalam class, but the model predicted sive_Targeted_Insult_Group, but due to the lack of any
it as Not_offensive. offensive words, it is misclassified as Not_offensive.

VOLUME 12, 2024 20081


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 14. These are test sentences from Malayalam DravidianLangTech dataset that are misclassified as Not_offensive as they do not contain any
offensive words but are labelled as non_Malayalam as they do not contain any Malayalam words.

FIGURE 9. 3D TSNE plot for Malayalam DravidianLangTech dataset.The


figure has the scatter plot for 5 different classes. FIGURE 10. 3D PCA plot for Malayalam DravidianLangTech dataset.The
figure has the scatter plot for 5 different classes.

• The dataset has sentences that do not use any direct


offensive words but are sarcastic or insulting which overlapping data points, and there is no clear separation
belong to the offensive classes. between the dataset of each class. This has made the
Text: Ella oollapadathinteyum stiram cheruva. 8nilayil classification strenuous.
padam pottum
Translation: cliche ingredient of all flop movies. This 3) TAMIL
movie is going to be a failure The highest results on this data were obtained using
Though this comment does not contain any offensive DistilBERT embedding with SVM (RBF) classifier and
words, the sentence as a whole is meant to insult the MuRIL embedding. The error analysis of this data is done
director or any person behind that movie. The author on the results of this high-performing model.
gives negative comments about the movie. Hence the • The first level error analysis was done on the length of
sentence is actually ‘‘Offensive Targeted Individual’’ but the sentences. We checked for the length of the mis-
is misclassified as ‘‘Not Offensive.’’ classified sentences and the rightly classified sentences.
• It is also observed that the significant misclassification Figures 11 and 12 show the histogram plots of the
happens to the Not_offensive class as the data is highly sentence lengths for misclassified and rightly classified
imbalanced. The Not_offensive class has a total of sentences. It is evident from the two figures that the
17697 data points, on the other hand, the rest of the data sentence lengths have not affected the predictions as
points from other classes sum up to 2313. we observe that most of the misclassified sentences and
• Figures 9 and 10 show the 3D TSNE and PCA scatter the rightly classified sentences have sentence lengths
plots of the dataset. The 768-dimensional sentence between 10 and 100.
embedding is mapped to 3D and plotted. It is clear from • The data is a mixture of multiple Indian and non-
the Figures that the dataset is clumsy, there are many Indian languages written in Roman as well as Language

20082 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

FIGURE 11. The histogram plot of the sentence lengths for the
misclassified sentences for the top performing model on
DravidianLangTech Tamil data.

FIGURE 13. 3D TSNE plot for Tamil DravidianLangTech dataset.The figure


has the scatter plot for 6 different classes.

FIGURE 12. The histogram plot of the sentence lengths for the correctly
classified sentences for the top performing model on DravidianLangTech
Tamil data.

specific text. We manually analysed the effect of lan-


guage on misclassification by comparing the languages
in misclassified and correctly classified sentences. It was
observed that the difference in language does not affect
the prediction.
• Among the misclassified sentences, we observed that a
few of the sentences were mislabelled. Table 15 shows
a few test sentences that are mislabelled but predicted FIGURE 14. 3D PCA plot for Tamil DravidianLangTech dataset.The figure
has the scatter plot for 6 different classes.
rightly by the model. This is one of the reasons for the
model’s low performance.
• Some of the wrong predictions are that the non-Tamil
sentences do not have any offensive content. These was captured by the model and hence misclassified as
sentences are misclassified as not offensive. Table 16 Offensive_Targeted_Insult_Individual.
shows a few sentences that are not-Tamil but do not • Figures 13 and 14 show the 3D TSNE and PCA scatter
contain any offensive words and are hence misclassified plots of the dataset. The 768-dimensional sentence
as Not_offensive. embedding is mapped to 3D and plotted. It is clear from
• Text Rajin political entry dailog 1996/2016 10 years the Figures that the dataset is clumsy there are many
one dailog naaku baag nachindi ——— Offen- overlapping data points, and there is no clear separation
sive_Targeted_Insult_Individual between the dataset of each class. This has made the
Translation: Political entry dialog of Rajnikanth between classification strenuous.
1996-2016. I liked it.
The above sentence is written in Telugu, so it is B. HASOC DATA
labeled as ‘‘not-Tamil.’’ This sentence does not have any 1) MALAYALAM 2021
offensive words; instead, it has positive words such as ‘I • Initially, the sentences were analysed based on the
liked it,’ but the sentence has a sarcastic meaning which length of the sentences. On comparing the length of

VOLUME 12, 2024 20083


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

TABLE 15. These are test sentences from Tamil DravidianLangTech dataset that are mislabelled but predicted rightly by the model.

TABLE 16. These are test sentences from Tamil DravidianLangTech dataset that are misclassified as Not_offensive as they do not contain any offensive
words but are labelled as not-Tamil as they do not contain any Tamil words.

FIGURE 16. The histogram plot of the sentence lengths for the correctly
FIGURE 15. The histogram plot of the sentence lengths for the classified sentences for the top performing model of Malayalam HASOC
misclassified sentences for the top performing model Malayalam HASOC 2021 dataset.
2021 dataset.

• The next level of analysis was based on word frequency.


the misclassified sentences and the rightly classified Figure 19 shows the plot of the most frequent 100 words.
sentences, it was observed that most of the sentence The plot shows that the words are very close to the
lengths were between 40 and 150. The Figures 15 straight line. This shows that most of these 100 words
and 16 show the histogram plots of the sentence lengths fall into both classes, which makes it difficult for the
for misclassified sentences and the rightly classified model to classify. Further, on analysing, we observe
sentences. that in a total of 6156 words, 579 words fall into both
• The dataset did not have any mislabelling. As a next classes. A few of the overlapping words are ‘‘’oru’,
level, we tried to analyse the data behavior by plotting ‘aa’, ‘ee’, ‘ennu’, ‘user’, ‘nalla’, ‘aanu’, ‘sir’, ‘anu’,
TSNE and PCA plots. Figures 17 and 18 show the 3D ‘pole’, ‘athu’, ‘e’, ‘avan’, ‘kondu’, ‘ethu’, ‘kollanam’,
TSNE and PCA scatter plots of the dataset. From the ‘okke’, ‘poyi’, ‘ulla’, ‘thanne’, ‘amma’, ‘video’, ‘onnu’,
plots, we can observe that no clusters formed, and most ‘alle’, ‘bro’, ‘avane’, ‘alla’, ‘nee’, ‘ariyam’, ‘interview’,
of the data points are close or above each other, which ‘boby’, ‘vere’, ‘pinne’, ‘onnum’, ‘koodi’, ‘illa’, ‘enna’,
makes the classification difficult. ‘undu’, ‘ningal’, ’thalla’’’.

20084 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

FIGURE 19. Word frequency plot for HASOC Malayalam 2021 dataset.
FIGURE 17. 3D PCA plot for Malayalam 2021 HASOC dataset.The figure
has the scatter plot for 2 different classes.

FIGURE 20. The histogram plot of the sentence lengths for the
misclassified sentences for the top performing model of HASOC
Malayalam 2020 dataset.

FIGURE 18. 3D TSNE plot for Malayalam 2021 HASOC dataset.The figure
has the scatter plot for 2 different classes.

2) MALAYALAM 2020
• Initially, the sentences were analysed based on the
length of the sentences. Comparing the length of the
misclassified and the rightly classified sentences, it was
observed that most of the sentence lengths were between
40 and 150. The Figures 20 and 21 show the histogram
plots of the sentence lengths for misclassified sentences
and the rightly classified sentences.
• The dataset did not have any mislabelling. As a next
level, we tried to analyse the data behavior by plotting
FIGURE 21. The histogram plot of the sentence lengths for the correctly
TSNE and PCA plots. Figure 22 and Figure 23 shows classified sentences for the top performing model of HASOC Malayalam
the 3D TSNE and PCA scatter plots of the dataset. From 2020 dataset.
the plots, we can observe that no clusters were formed,
and most data points are close to or above each other,
making the classification difficult. The plot shows that the words are very close to the
• The next level of analysis was based on word frequency. straight line. This shows that most of these 100 words
Figure 24 shows the plot of the most frequent 100 words. fall into both classes, which makes it difficult for the

VOLUME 12, 2024 20085


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

FIGURE 24. Word frequency plot for HASOC Malayalam 2020 dataset.

FIGURE 22. 3D PCA plot for Malayalam 2020 HASOC dataset.The figure
has the scatter plot for 2 different classes.

FIGURE 25. The histogram plot of the sentence lengths for the
misclassified sentences for the top performing model of Tamil HASOC
data.

FIGURE 23. 3D TSNE plot for Malayalam 2020 HASOC dataset.The figure
has the scatter plot for 2 different classes.

model to classify. Further, on analysing, we observe


that in a total of 2012 words, 191 words fall into
both classes. A few of the overlapping words are
‘‘’ivide’, ‘pakshe’, ‘kanda’, ‘eee’, ‘release’, ‘trailor’,
‘kure’, ‘comment’, ‘illa’, ‘cinemaye’, ‘hit’, ‘onnum’,
‘china’, ‘collection’, ‘kaanan’, ‘ella’, ‘poyi’, ‘undo’,
‘cheyyu’, ‘kooduthal’, ‘undallo’, ‘ne’, ‘paranju’, ‘okke’,
FIGURE 26. The histogram plot of the sentence lengths for the correctly
‘poi’, ‘views’, ‘million’, ‘vere’, ‘polum’, ‘ningalude’, classified sentences for the top performing model of Tamil HASOC data.
‘look’, ‘ulla’, ‘unlike’, ‘akumo’, ‘ethu’, ‘ellam’,.

3) TAMIL of the misclassified sentences and the rightly classified


• The first level of error analysis was done based on the sentences showed that most of the sentence lengths were
length of the sentences. Comparing the sentence lengths between 60 and 110. The Figures 25 and 26 show the

20086 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

FIGURE 29. Word frequency plot for Tamil HASOC dataset.

FIGURE 27. 3D PCA plot for HASOC Tamil dataset. The figure has the The plot shows that the words are very close to the
scatter plot for 2 different classes. straight line. This shows that most of these 100 words
fall into both classes, making it difficult for the model to
classify. Further, on analysing, we observe that in a total
of 4827 words, 632 words fall into both classes. A few of
the overlapping words are ‘‘’eppadi’, ‘la’, ‘da’, ‘veetla’,
‘nu’, ‘loosu’, ‘paarunga’, ‘tag’, ‘ah’, ‘tag rt’, ‘oru’, ‘rt’,
‘ku’, ‘nee’, ‘nalla’, ‘illa’, ‘tag tag’, ‘panna’, ‘bro’, ‘ena’,
‘enna’, ‘unga’, ‘irukku’, ‘na’, ‘tweet’, ‘anna’, ‘avan’,
‘intha’, ‘poi’, ‘fan’, ‘thaan’, ‘dei’, ‘panni’, ‘tha’, ‘enga’,
‘pola’, ‘vera’, ’dhan’’’.
From all the above analysis on the three HASOC datasets,
we observe that word frequency, sentence length, or mis-
labelling are not the reasons for the misclassification. This
requires further analysis of the linguistics and embedding,
which will be done as future work.

VI. CONCLUSION
The increasing spread of abusive language on social media
platforms, lack of annotated CodeMix data, and relatively
fewer approaches that address CodeMixing in Dravidian
languages have impelled us to study various multilingual
FIGURE 28. 3D TSNE plot for HASOC Tamil dataset. The figure has the
transformer models and find a single model that works well
scatter plot for 2 different classes. for CodeMix data. Another major problem in this area is
the class imbalance issue. Through this paper, we devel-
oped machine learning classifiers for HOS detection using
histogram plots of the sentence length for misclassified various multilingual transformer-based embedding models
sentences and the rightly classified sentences. by employing a cost-sensitive learning approach to address
• The dataset did not have any mislabelling. As a next the class imbalance problem. We compared seven different
level, we tried to analyze the data behavior by plotting transformer embedding with Machine Learning classifiers
TSNE and PCA plots. Figures 27 and 28 show the 3D on six CodeMix datasets. Individually observing the perfor-
PCA and TSNE scatter plots of the dataset. From the mance of embedding on each dataset, DistilBERT was top
plots, we can observe that no clusters formed, and most performing for Tamil DravidianLangTech and Malayalam
of the data points are close or above each other, which DravidianLangTech data with F1-scores (weighted) 72 %
makes the classification difficult. and 96% respectively, LaBSE for Malayalam 2020 data
• The next level of analysis was based on word frequency. with a F1-score (weighted) of 92% and MuRIL for Kan-
Figure 29 shows the plot of the most frequent 100 words. nada DravidianLangTech, Tamil HASOC and Malayalam

VOLUME 12, 2024 20087


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

2021 datasets with F1-scores(weighted) 66%, 76% and [12] B. Raja Chakravarthi, D. Chinnappa, R. Priyadharshini, A. Kumar
68% respectively. Apart from the three datasets, Kannada Madasamy, S. Sivanesan, S. Chinnaudayar Navaneethakrishnan,
S. Thavareesan, D. Vadivel, R. Ponnusamy, and P. Kumar Kumaresan,
DravidianLangTech, Tamil HASOC and Malayalam 2021, ‘‘Developing successful shared tasks on offensive language identification
for which MuRIL gave top performance, the model also for Dravidian languages,’’ 2021, arXiv:2111.03375.
had a comparable result on the remaining three datasets. [13] B. R. Chakravarthi, A. Kumar, J. P. McCrae, B. Premjith, K. P. Soman,
and T. Mandl, ‘‘Overview of the track on hasoc-offensive language
Hence, we observed that MuRIL embedding worked well identification-dravidiancodemix,’’ in Proc. FIRE, 2020, pp. 112–120.
for all six datasets. We also compared our results with the [14] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan,
state-of-the-art models. Out of the compared four datasets, R. Ponnusamy, R. L. Hariharan, J. P. McCrae, and E. Sherly, ‘‘Findings of
our approach exhibited comparable results with the state-of- the shared task on offensive language identification in Tamil, Malayalam,
and Kannada,’’ in Proc. 1st Workshop Speech Language Technol.
the-art work in three datasets without any data translation. Dravidian Lang., 2021, pp. 133–145.
We also noticed that MuRIL gave consistent results; hence, [15] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose,
we elucidate that MuRIL embedding works well for the S. Suryawanshi, E. Sherly, and J. P. McCrae, ‘‘DravidianCodeMix:
Sentiment analysis and offensive language identification dataset for
CodeMix Dravidian text. Compared with the state-of-the- Dravidian languages in code-mixed text,’’ Lang. Resour. Eval., vol. 56,
art works, we obtained better results for two datasets and no. 3, pp. 765–806, Sep. 2022.
comparable results for the remaining two. In addition, for [16] B. Raja Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa,
all the data, BERT-based embedding with Machine Learning D. Thenmozhi, E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy,
S. Banerjee, and C. Vasantharajan, ‘‘Findings of the sentiment analysis of
classifiers performed better than BERT-based classifiers. Dravidian languages in code-mixed text,’’ 2021, arXiv:2111.09811.
Hence, BERT-based embedding with Machine Learning [17] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy,
classifiers has the upper hand over BERT-based classifiers S. Thavareesan, S. C. Navaneethakrishnan, J. P. McCrae, and T. Mandl,
‘‘Overview of the HASOC-DravidianCodeMix shared task on offensive
in HOS from Dravidian language tasks. Through the paper,
language detection in Tamil and Malayalam,’’ in Proc. Work. Notes FIRE
we also introduce a new Malayalam-English CodeMix test Forum Inf. Retr. Eval., 2021, pp. 1–14.
set which is an extension of the Hate Speech and Offensive [18] D. Chatzakou, N. Kourtellis, J. Blackburn, E. De Cristofaro, G. Stringhini,
Content Identification in English and Indo-Aryan Languages and A. Vakali, ‘‘Mean birds: Detecting aggression and bullying on
Twitter,’’ in Proc. ACM Web Sci. Conf., New York, NY, USA, Jun. 2017,
(HASOC) 2021 Malayalam-English dataset. pp. 13–22.
[19] Y. Chen, Y. Zhou, S. Zhu, and H. Xu, ‘‘Detecting offensive language
REFERENCES in social media to protect adolescent online safety,’’ in Proc. Int. Conf.
Privacy, Secur., Risk Trust Int. Conf. Social Comput., Sep. 2012, pp. 71–80.
[1] S. Agrawal and A. Awekar, ‘‘Deep learning for detecting cyberbullying
across multiple social media platforms,’’ in Proc. Eur. Conf. Inf. Retr. [20] J. Cohen, ‘‘A coefficient of agreement for nominal scales,’’ Educ. Psychol.
Cham, Switzerland: Springer, 2018, pp. 141–153. Meas., vol. 20, no. 1, pp. 37–46, Apr. 1960.
[2] S. Anbukkarasi and S. Varadhaganapathy, ‘‘Deep learning-based hate [21] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek,
speech detection in code-mixed Tamil text,’’ IETE J. Res., pp. 1–6, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov,
Mar. 2022. ‘‘Unsupervised cross-lingual representation learning at scale,’’ 2019,
[3] M. E. Aragon, M. A. A. Carmona, M. M.-Y. Gomez, H. J. Escalante, arXiv:1911.02116.
L. V. Pineda, and D. Moctezuma, ‘‘Overview of MEX-A3T at IberLEF [22] R. Dabre, H. Shrotriya, A. Kunchukuttan, R. Puduppully, M. Khapra, and
2019: Authorship and aggressiveness analysis in Mexican Spanish tweets,’’ P. Kumar, ‘‘IndicBART: A pre-trained model for indic natural language
in Proc. IberLEF@ SEPLN, 2019, pp. 478–494. generation,’’ in Proc. Findings Assoc. Comput. Linguistics: ACL, 2022,
[4] A. Arango, J. Pérez, and B. Poblete, ‘‘Hate speech detection is not as easy pp. 1849–1863.
as you may think: A closer look at model validation (extended version),’’ [23] T. Davidson, D. Warmsley, M. Macy, and I. Weber, ‘‘Automated hate
Inf. Syst., vol. 105, Mar. 2022, Art. no. 101584. speech detection and the problem of offensive language,’’ in Proc. Int.
[5] G. Arora, ‘‘INLTK: Natural language toolkit for Indic languages,’’ 2020, AAAI Conf. Web Social Media, vol. 11, 2017, pp. 512–515.
arXiv:2009.12534. [24] S. L. Devi, ‘‘Anaphora resolution from social media text in Indian
[6] P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, ‘‘Deep learning for hate languages (SocAnaRes-IL): 2nd edition-overview,’’ in Proc. Forum Inf.
speech detection in tweets,’’ in Proc. 26th Int. Conf. World Wide Web Retr. Eval., Dec. 2020, pp. 9–13.
Companion, 2017, pp. 759–760. [25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training
[7] F. Balouchzahi, B. K. Aparna, and H. L. Shashirekha, of deep bidirectional transformers for language understanding,’’ 2018,
‘‘MUCS@DravidianLangTech-EACL2021: COOLI-code-mixing arXiv:1810.04805.
offensive language identification,’’ in Proc. 1st Workshop Speech [26] K. Dinakar, R. Reichart, and H. Lieberman, ‘‘Modeling the detection of
Lang. Technol. Dravidian Lang., 2021, pp. 323–329. textual cyberbullying,’’ in Proc. 5th Int. AAAI Conf. Weblogs Social Media,
[8] S. Banerjee, B. Raja Chakravarthi, and J. P. McCrae, ‘‘Comparison of 2011, pp. 11–17.
pretrained embeddings to identify hate speech in Indian code-mixed text,’’
[27] S. Doddapaneni, G. Ramesh, M. M. Khapra, A. Kunchukuttan, and
in Proc. 2nd Int. Conf. Adv. Comput., Commun. Control Netw. (ICACCCN),
P. Kumar, ‘‘A primer on pretrained multilingual language models,’’ 2021,
Dec. 2020, pp. 21–25.
arXiv:2107.00676.
[9] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso,
and M. Sanguinetti, ‘‘SemEval-2019 task 5: Multilingual detection of [28] S. Dowlagar and R. Mamidi, ‘‘Hate speech detection on code-mixed
hate speech against immigrants and women in Twitter,’’ in Proc. 13th dataset using a fusion of custom and pre-trained models with profanity
Int. Workshop Semantic Eval. Minneapolis, MN, USA: Association for vector augmentation,’’ Social Netw. Comput. Sci., vol. 3, no. 4, pp. 1–17,
Computational Linguistics, 2019, pp. 54–63. Jul. 2022.
[10] D. Benikova, M. Wojatzki, and T. Zesch, ‘‘What does this imply? [29] S. Edosomwan, S. K. Prakasan, D. Kouame, J. Watson, and T. Seymour,
Examining the impact of implicitness on the perception of hate speech,’’ ‘‘The history of social media and its impact on business,’’ J. Appl. Manag.
in Proc. Int. Conf. German Soc. for Comput. Linguistics Lang. Technol. Entrepreneurship, vol. 1, no. 3, pp. 79–91, 2011.
Cham, Switzerland: Springer, 2017, pp. 171–179. [30] H. Faris, I. Aljarah, M. Habib, and P. Castillo, ‘‘Hate speech detection using
[11] A. Bohra, D. Vijay, V. Singh, S. S. Akhtar, and M. Shrivastava, ‘‘A dataset word embedding and deep learning in the Arabic language context,’’ in
of Hindi-English code-mixed social media text for hate speech detection,’’ Proc. 9th Int. Conf. Pattern Recognit. Appl. Methods, 2020, pp. 453–460.
in Proc. 2nd Workshop Comput. Modeling People’s Opinions, Personality, [31] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, ‘‘Language-
Emotions Social Media, 2018, pp. 36–41. agnostic BERT sentence embedding,’’ 2020, arXiv:2007.01852.

20088 VOLUME 12, 2024


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

[32] B. Gambäck and U. K. Sikdar, ‘‘Using convolutional neural networks to [54] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, and
classify hate-speech,’’ in Proc. 1st Workshop Abusive Lang. Online, 2017, A. Patel, ‘‘Overview of the HASOC track at FIRE 2019: Hate speech and
pp. 85–90. offensive content identification in indo-European languages,’’ in Proc. 11th
[33] N. D. Gitari, Z. Zhang, H. Damien, and J. Long, ‘‘A lexicon-based approach Forum Inf. Retr. Eval., Dec. 2019, pp. 14–17.
for hate speech detection,’’ Int. J. Multimedia Ubiquitous Eng., vol. 10, [55] T. Mikolov, K. Chen, G. Corrado, and J. Dean, ‘‘Efficient estimation of
no. 4, pp. 215–230, Apr. 2015. word representations in vector space,’’ 2013, arXiv:1301.3781.
[34] T. Gr¨ndahl, L. Pajola, M. Juuti, M. Conti, and N. Asokan, ‘‘All you need is [56] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, ‘‘Advances
‘love’ evading hate speech detection,’’ in Proc. 11th ACM Workshop Artif. in pre-training distributed word representations,’’ 2017, arXiv:1712.09405.
Intell. Secur., 2018, pp. 2–12. [57] V. Mujadia, P. Mishra, and D. M. Sharma, ‘‘IIIT-hyderabad at HASOC
[35] A. Hande, S. U. Hegde, R. Priyadharshini, R. Ponnusamy, P. K. Kumare- 2019: Hate speech detection,’’ in Proc. FIRE, 2019, pp. 271–278.
san, S. Thavareesan, and B. R. Chakravarthi, ‘‘Benchmarking multi-task
[58] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang, ‘‘Abusive
learning for sentiment analysis and offensive language identification in
language detection in online user content,’’ in Proc. 25th Int. Conf. World
under-resourced Dravidian languages,’’ 2021, arXiv:2108.03867.
Wide Web, 2016, pp. 145–153.
[36] A. Hande, K. Puranik, R. Priyadharshini, S. Thavareesan, and
[59] E. Papegnies, V. Labatut, R. Dufour, and G. Linares, ‘‘Graph-based
B. R. Chakravarthi, ‘‘Evaluating pretrained transformer-based models
features for automatic online abuse detection,’’ in Proc. Int. Conf. Stat.
for COVID-19 fake news detection,’’ in Proc. 5th Int. Conf. Comput.
Lang. Speech Process. Cham, Switzerland: Springer, 2017, pp. 70–81.
Methodol. Commun. (ICCMC), Apr. 2021, pp. 766–772.
[37] A. Hande, K. Puranik, K. Yasaswini, R. Priyadharshini, S. Thavareesan, [60] B. Pariyani, K. Shah, M. Shah, T. Vyas, and S. Degadwala, ‘‘Hate speech
A. Sampath, K. Shanmugavadivel, D. Thenmozhi, and B. Raja detection in Twitter using natural language processing,’’ in Proc. 3rd Int.
Chakravarthi, ‘‘Offensive language identification in low-resourced Conf. Intell. Commun. Technol. Virtual Mobile Netw. (ICICV), Feb. 2021,
code-mixed Dravidian languages using pseudo-labeling,’’ 2021, pp. 1146–1152.
arXiv:2108.12177. [61] J. Ho Park and P. Fung, ‘‘One-step and two-step classification for abusive
[38] M. A. Hedderich, D. Adelani, D. Zhu, J. Alabi, U. Markus, and D. Klakow, language detection on Twitter,’’ 2017, arXiv:1706.01206.
‘‘Transfer learning and distant supervision for multilingual transformer [62] J. Pennington, R. Socher, and C. Manning, ‘‘Glove: Global vectors for
models: A study on African languages,’’ 2020, arXiv:2010.03179. word representation,’’ in Proc. Conf. Empirical Methods Natural Lang.
[39] H. Hosseinmardi, S. Arredondo Mattson, R. Ibn Rafiq, R. Han, Q. Lv, and Process. (EMNLP), 2014, pp. 1532–1543.
S. Mishra, ‘‘Detection of cyberbullying incidents on the Instagram social [63] F. M. Plaza-del-Arco, M. D. Molina-González, L. A. Ureña-López,
network,’’ 2015, arXiv:1503.03909. and M. T. Martín-Valdivia, ‘‘Comparing pre-trained language models for
[40] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, and J. P. McCrae, Spanish hate speech detection,’’ Exp. Syst. Appl., vol. 166, Mar. 2021,
‘‘A survey of current datasets for code-switching research,’’ in Proc. 6th Int. Art. no. 114120.
Conf. Adv. Comput. Commun. Syst. (ICACCS), Mar. 2020, pp. 136–141. [64] M. Ptaszynski, A. Pieciukiewicz, and P. Dybała, ‘‘Results of the PolEval
[41] D. Kakwani, A. Kunchukuttan, S. Golla, A. Bhattacharyya, M. M. Khapra, 2019 shared task 6: First dataset and open shared task for automatic
and P. Kumar, ‘‘IndicNLPSuite: Monolingual corpora, evaluation bench- cyberbullying detection in Polish Twitter,’’ Inst. Comput. Sci., Polish Acad.
marks and pre-trained multilingual language models for Indian languages,’’ Sci., Warsaw, Poland, Tech. Rep., 2019, pp. 89–110.
in Proc. Findings Assoc. Comput. Linguistics, 2020, pp. 4948–4961. [65] S. T. Roberts, J. Tetreault, V. Prabhakaran, and Z. Waseem, ‘‘Proceedings
[42] T. Keipi, M. Nasi, A. Oksanen, and P. Räsänen, Online Hate and Harmful of the third workshop on abusive language online,’’ in Proc. 3rd Workshop
Content: Cross-National Perspectives. London, U.K.: Taylor & Francis, Abusive Lang. Online, 2019, pp. 1–16.
2016. [66] J. P. McCrae, S. Banerjee, and B. R. Chakravarthi, ‘‘Comparison of
[43] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, pretrained embeddings to identify hate speech in Indian codemixed text,’’
D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, S. Gupta, in Proc. 2nd Int. Conf. Adv. Comput., Commun. Control Netw. (ICACCCN),
S. C. B. Gali, V. Subramanian, and P. Talukdar, ‘‘MuRIL: Multilingual 2020, pp. 21–25.
representations for Indian languages,’’ 2021, arXiv:2103.10730. [67] D. Saha, N. Paharia, D. Chakraborty, P. Saha, and A. Mukherjee,
[44] S. Koffer, D. M. Riehle, S. Hohenberger, and J. Becker, ‘‘Discussing the ‘‘Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies
value of automatic hate speech detection in online debates,’’ in Proc. for transformer-based offensive language detection,’’ 2021,
Multikonferenz Wirtschaftsinformatik Data Driven X-Turning Data in arXiv:2102.10084.
Value, Leuphana, Germany, 2018, pp. 1–12.
[68] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, ‘‘DistilBERT, a
[45] G. Koushik, K. Rajeswari, and S. K. Muthusamy, ‘‘Automated hate speech distilled version of BERT: Smaller, faster, cheaper and lighter,’’ 2019,
detection on Twitter,’’ in Proc. 5th Int. Conf. Comput., Commun., Control arXiv:1910.01108.
Autom. (ICCUBEA), Sep. 2019, pp. 1–4.
[69] S. Sharma, S. Agrawal, and M. Shrivastava, ‘‘Degree based classification
[46] G. Kovács, P. Alonso, and R. Saini, ‘‘Challenges of hate speech detection
of harmful speech using Twitter data,’’ 2018, arXiv:1806.04197.
in social media,’’ Social Netw. Comput. Sci., vol. 2, no. 2, pp. 1–15,
Apr. 2021. [70] V. K. Singh, S. Ghosh, and C. Jose, ‘‘Toward multimodal cyberbullying
detection,’’ in Proc. CHI Conf. Extended Abstr. Hum. Factors Comput.
[47] R. Kumar, A. K. Ojha, M. Zampieri, and S. Malmasi, ‘‘Proceedings
Syst., May 2017, pp. 2090–2099.
of the first workshop on trolling, aggression and cyberbullying (TRAC-
2018),’’ in Proc. 1st Workshop Trolling, Aggression Cyberbullying, 2018, [71] D. Sivalingam and S. Thavareesan, ‘‘OffTamil@DravideanLangTech-
pp. 1–14. EASL2021: Offensive language identification in Tamil text,’’ in Proc. 1st
[48] P. K. Kumaresan, R. Sakuntharaj, S. Thavareesan, S. Navaneethakrishnan, Workshop Speech Lang. Technol. Dravidian Lang., 2021, pp. 346–351.
A. K. Madasamy, B. R. Chakravarthi, and J. P. McCrae, ‘‘Findings of [72] K. Sreelakshmi, B. Premjith, and K. P. Soman, ‘‘Amrita cen at HASOC
shared task on offensive language identification in Tamil and Malayalam,’’ 2019: Hate speech detection in Roman and devanagiri scripted text,’’ in
in Proc. Forum Inf. Retr. Eval., Dec. 2021, pp. 16–18. Proc. FIRE, 2019, pp. 366–369.
[49] A. Kunchukuttan, D. Kakwani, S. Golla, A. Bhattacharyya, M. M. Khapra, [73] K. Sreelakshmi, B. Premjith, and K. P. Soman, ‘‘Detection of hate speech
and P. Kumar, ‘‘AI4Bharat-IndicNLP corpus: Monolingual corpora and text in hindi-english code-mixed data,’’ Proc. Comput. Sci., vol. 171,
word embeddings for Indic languages,’’ 2020, arXiv:2005.00085. pp. 737–744, Jan. 2020.
[50] Y. Kuratov and M. Arkhipov, ‘‘Adaptation of deep bidirectional multilin- [74] J. M. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, and M. Klenner,
gual transformers for Russian language,’’ 2019, arXiv:1905.07213. ‘‘Overview of GermEval task 2, 2019 shared task on the identification
[51] I. Kwok and Y. Wang, ‘‘Locate the hate: Detecting tweets against blacks,’’ of offensive language,’’ in Proc. 15th Conf. Natural Lang. Process.
in Proc. 27th AAAI Conf. Artif. Intell., 2013, pp. 1621–1622. (KONVENS). Nuremberg, Germany: German Society for Computational
[52] S. Malmasi and M. Zampieri, ‘‘Detecting hate speech in social media,’’ Linguistics, 2019, pp. 354–365.
2017, arXiv:1712.06427. [75] D. Tula, P. Potluri, S. Ms, S. Doddapaneni, P. Sahu, R. Sukumaran,
[53] T. Mandl, S. Modha, A. Kumar, and B. R. Chakravarthi, ‘‘Overview of and P. Patwa, ‘‘Bitions@DravidianLangTech-EACL2021: Ensemble of
the HASOC track at FIRE 2020: Hate speech and offensive language multilingual language models with pseudo labeling for offence detection
identification in Tamil, Malayalam, Hindi, English and German,’’ in Proc. in Dravidian languages,’’ in Proc. 1st Workshop Speech Lang. Technol.
Forum Inf. Retr. Eval., Dec. 2020, pp. 29–32. Dravidian Lang., 2021, pp. 291–299.

VOLUME 12, 2024 20089


K. Sreelakshmi et al.: Detection of HOS CodeMix Text in Dravidian Languages

[76] C. Vasantharajan and U. Thayasivam, ‘‘Towards offensive language BHARATHI RAJA CHAKRAVARTHI is currently
identification for Tamil code-mixed Youtube comments and posts,’’ Social a permanent Lecturer above the bar with the
Netw. Comput. Sci., vol. 3, no. 1, pp. 1–13, Jan. 2022. School of Computer Science, University of Gal-
[77] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, way, Ireland. He involved on multimodal machine
Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. learning, abusive/offensive language detection,
Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008. bias in natural language processing tasks, inclusive
[78] Z. Waseem and D. Hovy, ‘‘Hateful symbols or hateful people? Predictive language detection, and multilingualism. He has
features for hate speech detection on Twitter,’’ in Proc. NAACL Student published papers in highly reputed journal articles
Res. Workshop, 2016, pp. 88–93.
(LRE, CSL, MTAP, SNAM, JDSA, JDIM, and
[79] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
IJIM Data Insights) and multiple international
Q. V. Le, ‘‘XLNet: Generalized autoregressive pretraining for language
understanding,’’ in Proc. Adv. Neural Inf. Process. Syst., 32, 2019, conference papers (COLING, LREC, MTSUMMIT, DSAA, LDK, GWC,
pp. 1–11. AICS, and FIRE). He received the Best Application Paper Award at
[80] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, DSAA 2020 IEEE and ACM-funded conference. He is the Area Chair
‘‘SemEval-2019 task 6: Identifying and categorizing offensive language in of the 17th Conference of the European Chapter of the Association for
social media (OffensEval),’’ 2019, arXiv:1903.08983. Computational Linguistics 2023 and the General Chair of SPELLL 2022 and
[81] Z. Zhang, D. Robinson, and J. Tepper, ‘‘Detecting hate speech on Twitter 2023. He is an Associate Editor of Expert Systems with Applications
using a convolution-GRU based deep neural network,’’ in Proc. Eur. (Elsevier) and an Editorial Board Member of Computer Speech and
Semantic Web Conf. Cham, Switzerland: Springer, 2018, pp. 745–760. Language (Elsevier).

K. SREELAKSHMI is currently pursuing the


Ph.D. degree in offensive language identifica-
tion with Social Media Text. She is also an
Assistant Professor with the Centre for Com-
putational Engineering and Networking, Amrita K. P. SOMAN is currently the Head and a Profes-
Vishwa Vidyapeetham, Coimbatore, India. She has sor with the Center for Computational Engineering
authored several national and international pub- and Networking (CEN) and the Dean of the Amrita
lications. Her research interests include offensive School of Artificial Intelligence, Amrita Vishwa
language detection from code-mixed social media Vidyapeetham, Coimbatore, India. He has more
text, machine learning, deep learning, and artificial than 25 years of research and teaching experience
intelligence. in artificial intelligence and data science related
subjects with the Amrita School of Engineering,
Coimbatore. He has around 450 publications to
his credit in reputed journals, such as IEEE
B. PREMJITH is currently an Assistant Professor TRANSACTIONS, IEEE ACCESS, Applied Energy, and conference proceedings.
(Senior Grade) with the Amrita School of Artifi- He published four books, namely, Insight Into Wavelets: from Theory to
cial Intelligence, Amrita Vishwa Vidyapeetham, Practice, Insight Into Data Mining: Theory and Practice, Support Vector
Coimbatore, India. He has published papers in Machines and Other Kernel Methods, and Signal and Image Processing-The
reputed journals and conferences. His research Sparse Way. He is the most cited author in Amrita Vishwa Vidyapeetham
interests include natural language processing, in the areas of artificial intelligence and data science. He was listed among
computational linguistics, and computational the Top-10 Computer Science Faculty by DST, Government of India,
social science. He secured first place in interna- from 2009 to 2013, the Career 360 and MHRD, from 2017 to 2018, and
tional competitions on factuality analysis of the also in the list of the most prolific authors in the world, prepared by Springer
text in Iberian languages and abusive language Nature.
detection in Tamil.

20090 VOLUME 12, 2024

You might also like