Toxic Comment Classification Using S-BERT Vectorization and Random Forest Algorithm
S. Tresa Sangeetha
Department of Engineering
University of Technology and Applied Sciences
Al Musannah, Sultanate of Oman
Abstract— The growing popularity of social media platforms and microblogging websites has led to an increase in the expression of views and opinions. However, conversations and debates on these platforms often lead to toxic comments, which consist of insulting and hateful remarks. To address this issue, it is important for social media systems to be able to recognize harmful comments. With the rising incidence of cyberbullying, it is crucial to study the classification of toxic comments using various algorithms. This study compares the effectiveness of different word and sentence embedding methods, including TF-IDF, InferSent, S-BERT, and T5, for toxic comment classification. A comparative study is also conducted on the impact of using SMOTE to balance the highly imbalanced dataset. The results of these models are compared and analysed. It is observed that T5 embeddings with a Random Forest classifier work best, with an F1-score of 0.91.

Keywords—comments, toxic, TF-IDF, InferSent, S-BERT, T5, classification

I. INTRODUCTION

Social media refers to online platforms where users can share and exchange information, content, and opinions. It has become a ubiquitous part of modern life, connecting people from all over the world. However, the increased use of social media has also led to a rise in negative and harmful activities, such as cyberbullying, hate speech, and the spreading of misinformation. These actions can have a significant impact on people's mental and emotional well-being, causing anxiety, depression, and other mental health issues. As a result, it is important to monitor and address the negative impact of social media and ensure that it remains a positive and safe space for everyone. With the growing use of social media and the internet, it is crucial to monitor harmful activities and take appropriate measures. Abusive comments, messages, and discussions on these platforms have had a negative impact on people's mental health.

Nair et al. [1] have suggested that sentiment analysis is nothing more than using text analysis and natural language processing (NLP) technologies to find and extract a writer's emotions from a piece of text, in this case a tweet. These emotions or sentiments might be favourable, unfavourable, or neutral. The author's feelings are the fundamental concept of sentiment, and these can differ from person to person. Husnain et al. [2] have employed machine learning algorithms to analyse the sentiment of text and determine its polarity (positive, negative, or neutral). Such categorization may be achieved with a number of techniques, such as information filtering, word embedding, and subject categorization. These days, sentiment analysis is highly valued across a variety of industries, from marketing to forensics.

Vectorization of comments refers to the process of representing text data in a numerical form that can be used by machine learning models. The goal of vectorization is to transform text data into a format that can be easily understood and processed by machines, without losing important information. One common approach to vectorization is the "bag-of-words" representation. Another approach is to use pre-trained language models such as BERT, RoBERTa, or GPT-3 to encode each comment into a dense vector representation.

Word and sentence embeddings are two ways of representing words and sentences, respectively, in vector form. Word embeddings are mathematical representations of words that reflect their meanings and allow words with similar meanings to be close in the vector space. They are learned from text data and can be used for various NLP tasks. Sentence embeddings, on the other hand, represent entire sentences and capture their meanings. They are usually obtained by combining the word embeddings of the words in a sentence. Sentence embeddings can be used for tasks such as text classification and semantic search.
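To make the sentence-embedding idea concrete, the minimal Python sketch below encodes a few example comments with a pre-trained Sentence-BERT model and compares them with cosine similarity. The sentence-transformers library, the all-MiniLM-L6-v2 checkpoint, and the sample comments are illustrative choices and are not prescribed by this paper.

from sentence_transformers import SentenceTransformer, util

# Any pre-trained Sentence-BERT checkpoint can be used; this one is a small,
# commonly used model chosen only for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

comments = [
    "Thanks for the help, this edit looks great!",
    "Thank you, that explanation was really useful.",
    "You are an idiot and nobody wants you here.",
]
embeddings = model.encode(comments)  # one dense vector per comment

# Comments with similar meanings should lie close together in the vector space,
# so the first pair is expected to score higher than the first/third pair.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))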
In this paper, the aim is to classify toxic comments based on their level of toxicity into different categories such as toxic, insult, identity hate, obscene, threat, severe toxic, and high toxic. Before classification, the comments are pre-processed, and different vectorization methods, namely TF-IDF, S-BERT, InferSent, and the T5 Transformer, are used. This paper aims to classify the toxic comments from the Kaggle dataset of toxic comments on Wikipedia, prepared by Jigsaw, and determine their level of toxicity. The comments are categorized into different levels of toxicity, including severe toxic, toxic, threat, insult, obscene, and identity hate. The novelty of this research lies in conducting a comparative analysis of four different vectorization techniques, namely S-BERT, TF-IDF, InferSent, and T5, in addition to evaluating the performance of six classification algorithms on the vectorized data.

II. RELATED WORK

Kumar & Kanisha [3] conducted a study showing that there are two approaches to classifying comments based on toxicity. In the first approach, the dataset is classified according to whether a comment is toxic or not, whereas in the second, comments are classified into a range of categories such as toxic, severely toxic, threat, insult, hate, and obscene. It was observed that logistic regression outperforms other classification algorithms such as Naïve Bayes and decision trees.

Husnain et al. [2] proposed two strategies for detecting comment toxicity on social media platforms. The first approach trains individual classifiers for each aspect of comment toxicity. The second approach treats it as a multi-label classification problem. Various ML techniques such as Naive Bayes, decision trees, and logistic regression were used, and the results were evaluated with 10-fold cross-validation on a dataset from Kaggle. The multi-label classification problem was converted into a multi-class problem using a unique pre-processing method, which improved accuracy for simple classification models. The experimental results showed that logistic regression was effective for both multi-class and binary classification.

Venugopalan & Gupta [4] have used SVM with a linear kernel and a decision tree classifier. It is reported that the model performs with high accuracy when compared with previous models. Garlapati et al. [5] reported that for binary and multi-label classification, the best models are CNN and LSTM. The goal is to evaluate the type of statement and identify the various classes of toxic language, including obscene, identity-hateful, poisonous, insulting, and severely toxic language. The system takes comments from websites, labelled as harmful or non-toxic, as input, and the goal of the model is to identify the toxicity class. Phased analysis is the goal of that research, with Phase I assessing the toxicity.

Rupapara et al. [6] reported that the Regression Vector Voting Classifier (RVVC) is a new methodology for detecting harmful comments on social media platforms. RVVC combines logistic regression and a support vector classifier using soft voting rules. The method was evaluated on both balanced and imbalanced datasets using bag-of-words and TF-IDF features. The results showed that RVVC outperforms other machine learning classifiers in terms of accuracy, recall, precision, and F1-score. When combined with the SMOTE-applied dataset and TF-IDF features, RVVC reached an accuracy of 0.97. RVVC is a promising method for detecting harmful comments on social media platforms, and its effectiveness has been demonstrated through experiments.

Nair et al. [7] discussed that BERT is a PyTorch pre-trained model that provides contextual embeddings. Semantic textual similarity and semantic search can both benefit from S-BERT. Additionally, BERT aids in deciphering unclear terminology in a written document or a query.

Brundha & Meera [8] used different word embeddings such as Word2vec, GloVe, and Sentence-BERT to find relevant documents. After applying PCA and Factor Analysis, there is a significant increase of 2-3% in Mean Average Precision.

Ajallouda et al. [9] discussed that InferSent, a supervised sentence embedding method, gives English sentences semantic representations. It was trained on corpus data from the Stanford Natural Language Inference (SNLI) dataset.

Ibrahim et al. [10] presented an ensemble approach using three models (CNN, LSTM, and GRU) to find toxicity in user-generated content on online platforms. The researchers used data from Wikipedia talk page modifications and addressed the data imbalance problem by employing data augmentation techniques. The approach involved two stages: determining input toxicity and identifying the categories of toxicity in the content. According to the study's evaluation, the proposed ensemble approach achieved the highest accuracy, with an F1-score of 0.80 for toxic versus non-toxic classification and an F1-score of 0.872 for predicting the type of toxicity.

Chavali et al. [11] stated that, due to the rising use of brief text messages for communication, the spam problem in text messages has worsened. To identify unwanted messages, automatic spam filters use either content-based or heuristic-based methods. The paper focuses on imbalanced datasets, which reflect real-world spam detection settings, and proposes a content-based ML approach with an emphasis on them. To discover the best classifier for unbalanced datasets, a number of machine learning techniques including SVM, Bagging, AdaBoost, and J48 were investigated. The problem of imbalanced datasets was addressed using the SMOTE technique, and the combination of SVM and SMOTE performed best, increasing the average class accuracy by 7 points on the JSC dataset and 3 points on the UCI dataset [11].

Rahul et al. [12] stated that, to achieve accurate analysis of the toxicity of content, the prevalence of online harassment must be rigorously evaluated. Six machine learning algorithms are utilized to address the text classification problem and identify the best algorithm based on the assessment metrics for categorizing harmful comments. The aim is to evaluate toxicity with precision in order to mitigate its adverse effects, which will motivate organizations to take appropriate actions.

Murali et al. [13] discussed that the primary objective is to improve the conversational abilities of a robot receptionist by enhancing three key modules: named entity recognition, sentiment analysis, and hazardous comment classification. The named entity recognition module helps the chatbot retrieve specific information about a company. The sentiment analysis module aims to enhance the chatbot's responsiveness to user feedback and conversation style. The hazardous comment classifier ensures that the self-learning chatbot generates only positive and uplifting comments. The effectiveness of these modules is evaluated by comparing them to previous methods, and the code required to reproduce the results is included for future development.
Sumanth et al. [14] discussed the issue of harmful language on social media platforms, along with the reasons why it is a pressing concern. The primary objective of the research is to apply machine learning to identify and remove harmful speech from these platforms. To accomplish binary and multi-class classification, various machine learning methods are studied, including standard machine learning and ensemble methods. The paper examines two approaches to identifying harmful speech: (a) a technique that extracts word embeddings and builds a model, and (b) a method that enhances pre-existing models like Decision Tree, Logistic Regression, Random Forest, Voting Classifier, and K-Nearest Neighbors. The suggested method can be used to analyse any social media comments.

Swetha et al. [15] set out to investigate the pervasiveness of online bullying and classify the level of harm in the content. An optimal machine learning algorithm is selected from a set of six based on evaluation metrics to categorize harmful statements. The main focus is to conduct a thorough assessment of the harmful nature of online content to reduce its negative impact and encourage organizations to take appropriate action.

Ibrahim et al. [16] reported that the suggested method involves two prediction stages. In the initial stage, a classifier is used to identify whether a comment is normal or exhibits any form of toxicity. If the comment is identified as toxic, a second classifier is employed to detect the specific type of toxicity. To develop these classifiers, various deep learning models were used, such as convolutional neural networks (CNN), bidirectional gated recurrent units (GRU), and bidirectional long short-term memory (LSTM) networks.

III. PROPOSED METHODOLOGY
A. Understanding the Data

The dataset consists of 159,571 observations with 8 attributes; the attributes and sample observations are listed in Table 1. From the data presented in Table 1, it is inferred that there are no missing values for any attribute. The 8 attributes are labelled Id, comment_text, toxic, severe toxic, obscene, threat, insult, and identity hate. While the Id is an indexing element, the comment_text field contains the text that needs to be analysed and categorized. The remaining six fields are labels collected by manual labelling and are used for deciding on the category [17].

The bar graph in Fig. 1 displays the number of comments that fall into each category. Based on this information, the toxic category contains the largest number of flagged comments, at above 14,000, while a much smaller number of comments contain threats, at around 50.

Fig. 1. Distribution of comments across different categories.

TABLE II. PERCENTAGE OF COMMENTS IN EACH CATEGORY

Category         Percentage
Toxic            9.6
Severe toxic     1.0
Obscene          5.3
Threat           0.3
Insult           4.9
Identity hate    0.9

Table II shows the proportion of each type of comment in the dataset. For instance, 9.6% of the comments are labelled toxic. The number of non-toxic comments far outweighs each of the other categories, and categories such as identity hate are minimal; most categories contain over 90% fewer comments than the non-toxic class. The Synthetic Minority Over-sampling Technique (SMOTE) is therefore deployed to balance the data in the machine learning process.
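The class distribution summarized in Fig. 1 and Table II can be reproduced with a few lines of pandas. The sketch below assumes the train.csv file of the Kaggle challenge [17] has been downloaded locally; the file path is illustrative.

import pandas as pd

# train.csv from the Jigsaw Toxic Comment Classification Challenge [17];
# the local path is an assumption of this sketch.
df = pd.read_csv("train.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

print(df.shape)                                  # (159571, 8)
print(df.isna().sum())                           # no missing values in any column
print(df[label_cols].sum().sort_values())        # raw counts per category (cf. Fig. 1)
print((df[label_cols].mean() * 100).round(1))    # percentage per category (cf. Table II)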
B. Data Pre-processing

As discussed in the previous section, the data is highly imbalanced; in this section the data is pre-processed. Fig. 3 illustrates the pre-processing pipeline, and the steps involved are detailed below.

As depicted in Fig. 3, the text undergoes several preprocessing steps: punctuation, special characters, and numbers are removed, the text is tokenized into words, and stopwords are removed. Stemming is then performed to reduce words to their base form.

Fig. 3. Pre-processing of comments.
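A minimal Python sketch of these cleaning steps is shown below. It uses NLTK's stopword list and Porter stemmer, and adds lowercasing and a simple whitespace tokenizer as convenient implementation choices that the paper does not prescribe.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_comment(text: str) -> str:
    # Remove punctuation, special characters and numbers (keep letters only).
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    # Tokenize into words, drop stopwords, then stem each remaining token.
    tokens = [STEMMER.stem(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_comment("You ALL are a bunch of COWARDS!!!"))  # -> "bunch coward"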
The resulting cleaned text is then transformed into numerical vectors using various methods: TF-IDF, S-BERT, InferSent, and T5. The TF-IDF representation converts the raw text of a document into a numerical vector in which each dimension corresponds to a specific term in the corpus vocabulary. InferSent is a sentence embedding model developed by Facebook AI; it is trained on the SNLI and MultiNLI datasets, generates sentence embeddings that capture semantic meaning, and can be used for NLP tasks such as semantic similarity, text classification, and question answering. Sentence-BERT (S-BERT) is a pre-trained model based on the transformer architecture that is designed to capture the contextual relationships between the words in a sentence. T5, the Text-to-Text Transfer Transformer, is a pre-trained language model developed by Google Research; it is designed as a generic text-to-text model that can be fine-tuned for different NLP tasks such as machine translation, text classification, and question answering. The comments are passed through each of these vectorization methods so that a comparative study can be performed on the data.
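The sketch below shows one way these representations could be produced. The specific pre-trained checkpoints (all-MiniLM-L6-v2 and t5-base) are illustrative assumptions, since the paper does not name the models used, and the InferSent encoder (which requires Facebook Research's separate code and pre-trained word vectors) is omitted.

import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, T5EncoderModel

comments = ["you are brilliant", "you are pathetic"]    # already cleaned text

# 1) TF-IDF: one sparse dimension per vocabulary term.
X_tfidf = TfidfVectorizer().fit_transform(comments)     # shape (2, vocab_size)

# 2) S-BERT: dense sentence vectors from a pre-trained Sentence-BERT model.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
X_sbert = sbert.encode(comments)                        # shape (2, 384)

# 3) T5: mean-pool the encoder hidden states to get one vector per comment.
tok = AutoTokenizer.from_pretrained("t5-base")
t5 = T5EncoderModel.from_pretrained("t5-base")
with torch.no_grad():
    batch = tok(comments, return_tensors="pt", padding=True)
    X_t5 = t5(**batch).last_hidden_state.mean(dim=1)    # shape (2, 768)

print(X_tfidf.shape, X_sbert.shape, X_t5.shape)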
C. Model Training and Testing

In this section, models are built using various classification algorithms. Before that, as mentioned in the previous section, SMOTE is applied because the data is highly imbalanced. SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method used in imbalanced classification problems. The technique generates synthetic examples of the minority class, instead of simply duplicating the existing examples, to balance the distribution of the target classes.
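A minimal sketch of this step with imbalanced-learn is given below. The toy data generated with make_classification merely stands in for the vectorized comments and derived class labels, and the 80/20 split and random seeds are illustrative.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the real pipeline: X would be the TF-IDF / S-BERT /
# InferSent / T5 vectors and y the derived toxicity class.
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           n_classes=3, weights=[0.90, 0.07, 0.03],
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print("before SMOTE:", Counter(y_train))
smote = SMOTE(random_state=42)                  # oversample the training split only
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print("after SMOTE: ", Counter(y_train_bal))    # minority classes synthesized up to the majority count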
After applying SMOTE, models are built with the various classification algorithms, and a comparative study is conducted between the models built with and without SMOTE. The models are trained and tested using six classification algorithms: Logistic Regression, MLP, K-NN, SVM, Random Forest, and Gaussian Naïve Bayes. Logistic regression is a popular method for multiclass classification because it is easy to implement, is a good choice for large datasets, can be trained quickly, and reduces overfitting. The Multilayer Perceptron is used because of its capacity to learn a wide range of representations and complex non-linear relationships, its robustness to noise, and its ability to handle large datasets. K-NN is used because no training is needed, it makes no assumptions about the data distribution, and it is a flexible algorithm that can be applied directly to multiclass classification.

SVMs maximize the margin between different classes, which helps to separate the data and reduce overfitting; they handle large datasets efficiently and work well for high-dimensional data. Random Forest is a versatile algorithm that can handle multi-class classification problems, makes no assumptions about the distribution of the data, and can be trained and used for prediction quickly. Gaussian Naive Bayes is a simple and fast algorithm that requires few computational resources, can be used for large datasets, and requires relatively little training data to make accurate predictions.

With these algorithms, the comments are categorized into 7 classes: toxic, severe toxic, insult, threat, identity hate, obscene, and high toxic. In cases where a comment falls under multiple of these categories, it is classified as high toxic.
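Continuing from the SMOTE sketch above (reusing X_train_bal, y_train_bal, X_test, and y_test), the sketch below trains the six classifiers and compares their macro F1-scores. The hyperparameters are scikit-learn defaults or illustrative guesses, and LinearSVC stands in for the SVM.

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=300),
    "K-NN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gaussian NB": GaussianNB(),
}

# X_train_bal / y_train_bal come from the SMOTE step; X_test / y_test were held
# out before oversampling, so evaluation reflects the original class distribution.
for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
    preds = model.predict(X_test)
    print(f"{name}: macro F1 = {f1_score(y_test, preds, average='macro'):.2f}")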
D. Performance Evaluation and Comparison Metrics

The metrics employed for assessing performance are Precision, Recall, F1-score, and ROC score.

Precision: Precision is denoted as P and is given in Eq. (1):

P = TP / (TP + FP)    (1)

Recall: Recall is denoted as R and is given in Eq. (2):

R = TP / (TP + FN)    (2)

F1-score: The F1-score is denoted as F and is given in Eq. (3):

F = 2PR / (P + R)    (3)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.

ROC score: The model's effectiveness is indicated by the ROC Area Under the Curve (AUC) score, which reflects how well the model distinguishes between the positive and negative categories. The higher the AUC value, the better the performance of the model. If the classifier achieves an AUC score of 1, it is capable of perfectly separating all points belonging to the positive class from those of the negative class.
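These metrics can be computed with scikit-learn as sketched below. The toy labels and scores are illustrative, and for the seven-class setting the averaged variants of the same functions (e.g. average="macro") apply.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy binary ground truth, hard predictions, and predicted scores
# (the scores are needed for the ROC-AUC computation).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print("P   =", precision_score(y_true, y_pred))   # Eq. (1): TP / (TP + FP)
print("R   =", recall_score(y_true, y_pred))      # Eq. (2): TP / (TP + FN)
print("F1  =", f1_score(y_true, y_pred))          # Eq. (3): 2PR / (P + R)
print("AUC =", roc_auc_score(y_true, y_score))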
IV. RESULTS & DISCUSSIONS

In Fig. 4 and Fig. 5, the F1-scores of the models trained without SMOTE and with SMOTE are compared.

According to Figs. 4 and 5, the highest F1-score of 0.71 for the TF-IDF feature representation without SMOTE was obtained by Random Forest and SVM, while MLP achieved the highest F1-score of 0.91 for TF-IDF with SMOTE. In the case of S-BERT, SVM outperformed all other models, with the highest F1-score both with and without SMOTE. For InferSent, SVM had the highest F1-score of 0.57 without SMOTE, while Random Forest had the highest F1-score of 0.91 with SMOTE. For T5, MLP performed best without SMOTE with a maximum F1-score of 0.76, whereas Random Forest achieved the highest F1-score of 0.91 when SMOTE was used. Overall, the Random Forest classifier proved to be the most precise and dependable among the classifiers that were evaluated.
REFERENCES

[1] Anu J. Nair, Aadithya Vinayak and Veena G., "Comparative study of Twitter Sentiment On COVID-19 Tweets," 5th International Conference on Computing Methodologies and Communication, 2021.
[2] Muhammad Husnain, Adnan Khalid and Numan Shafi, "A Novel Preprocessing Technique for Toxic Comment Classification," 2021 International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, April 05-07, 2021.
[3] KGSSV Akhil Kumar and B. Kanisha, "Analysis of Multiple Toxicities using ML Algorithms to detect toxic comments," 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 2022.
[4] Manju Venugopalan and Deepa Gupta, "A reinforced active learning approach for optimal sampling in aspect term extraction for sentiment analysis," Expert Systems with Applications, vol. 209, 2022.
[5] Anusha Garlapati, Neeraj Malisetty and Gayathri Narayanan, "Classification of Toxicity in Comments using NLP and LSTM," 8th International Conference on Advanced Computing and Communication Systems, 2022.
[6] Vaibhav Rupapara, Furqan Rustam, Hina Fatima Shahzad, Arif Mehmood, Imran Ashraf and Gyu Sang Choi, "Impact of SMOTE on Imbalanced Text Features for Toxic Comments Classification Using RVVC Model," IEEE Access, 2021.
[7] Rohan Nair, Vadla Niranjan Vara Prasad, Sreenadh A. and Jyothisha J. Nair, "Coreference Resolution for Ambiguous Pronoun with BERT and MLP," 10th International Conference on Advances in Computing and Communications, 2021.
[8] Brundha J. and K. N. Meera, "Vector Model Based Information Retrieval System With Word Embedding Transformation," 10th International Conference on Emerging Trends in Engineering and Technology, 2022.
[9] Lahbib Ajallouda, Kawtar Najmani, Ahmed Zellou and El Habib Benlahmar, "Doc2Vec, SBERT, InferSent, and USE: Which embedding technique for noun phrases?," 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2022.
[10] M. Ibrahim, M. Torki and N. El-Makky, "Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 875-878, doi: 10.1109/ICMLA.2018.00141.
[11] S. T. Chavali, C. Tej Kandavalli, S. T. M. and D. Gupta, "A Study on Named Entity Recognition with Different Word Embeddings on GMB Dataset using Deep Learning Pipelines," 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2022, pp. 1-5, doi: 10.1109/ICCCNT54827.2022.9984220.
[12] Rahul, H. Kajla, J. Hooda and G. Saini, "Classification of Online Toxic Comments Using Machine Learning Algorithms," 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2020, pp. 1119-1123, doi: 10.1109/ICICCS48265.2020.9120939.
[13] S. R. Murali, S. Rangreji, S. Vinay and G. Srinivasa, "Automated NER, Sentiment Analysis and Toxic Comment Classification for a Goal-Oriented Chatbot," 2020 Fourth International Conference on Intelligent Computing in Data Sciences (ICDS), Fez, Morocco, 2020, pp. 1-7, doi: 10.1109/ICDS50568.2020.9268680.
[14] P. Sumanth, S. Samiuddin, K. Jamal, S. Domakonda and P. Shivani, "Toxic Speech Classification using Machine Learning Algorithms," 2022 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 2022, pp. 257-263, doi: 10.1109/ICESIC53714.2022.9783475.
[15] V. Swetha, R. Anuhya, E. S. Sowmya and A. Geethanjali, "Building a Toxic Comments Classification Model," 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2021, pp. 1519-1523, doi: 10.1109/ICECA52323.2021.967
[16] M. Ibrahim, M. Torki and N. El-Makky, "Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 875-878, doi: 10.1109/ICMLA.2018.00141.
[17] Cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum and Will Cukierski, "Toxic Comment Classification Challenge," Kaggle, 2017. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge