Toxic Comment Classification Using S-BERT Vectorization and Random Forest Algorithm
S. Tresa Sangeetha
Department of Engineering
University of Technology and Applied Sciences
Al Musannah, Sultanate of Oman
Abstract— The growing popularity of social media platforms and microblogging websites has led to an increase in the expression of views and opinions. However, conversations and debates on these platforms often lead to toxic comments, which consist of insulting and hateful remarks. To address this issue, it is important for social media systems to be able to recognize harmful comments. With the rising incidence of cyberbullying, it is crucial to study the classification of toxic comments using various algorithms. This study compares the effectiveness of different word and sentence embedding methods, including TF-IDF, InferSent, S-BERT, and T5, for toxic comment classification. A comparative study is also conducted on the impact of using SMOTE to balance the highly imbalanced dataset. The results of these models are compared and analysed. It is observed that T5 embeddings with a Random Forest classifier work best, with an F1-score of 0.91.

Keywords—comments, toxic, TF-IDF, InferSent, S-BERT, T5, classification

I. INTRODUCTION

Social media refers to online platforms where users can share and exchange information, content, and opinions. It has become a ubiquitous part of modern life, connecting people from all over the world. However, the increased use of social media has also led to a rise in negative and harmful activities, such as cyberbullying, hate speech, and the spreading of misinformation. These actions can have a significant impact on people's mental and emotional well-being, causing anxiety, depression, and other mental health issues. As a result, it is important to monitor and address the negative impact of social media and ensure that it remains a positive and safe space for everyone. With the growing use of social media and the internet, it is crucial to monitor harmful activities and take appropriate measures. Abusive comments, messages, and discussions on these platforms have had a negative impact on people's mental health.

Nair et al. [1] have suggested that sentiment analysis is nothing more than using text analysis and natural language processing (NLP) technologies to find and extract a writer's emotions from a piece of text, in this case a tweet. These emotions or sentiments might be favourable, unfavourable, or neutral. The author's feelings are the fundamental concept of sentiment, and these can differ from person to person. Husnain et al. [2] have employed machine learning algorithms to analyse the sentiment of text and determine its polarity (positive, negative, or neutral). Such categorization may be achieved with a number of techniques, such as information filtering, word embedding, and subject categorization. These days, sentiment analysis is highly valued across a variety of industries, from marketing to forensics.

Vectorization of comments refers to the process of representing text data in a numerical form that can be used by machine learning models. The goal of vectorization is to transform text data into a format that can be easily understood and processed by machines, without losing important information. One common approach to vectorization is the "bag-of-words" representation. Another approach is to use pre-trained language models such as BERT, RoBERTa, or GPT-3 to encode each comment into a dense vector representation.

Word and sentence embeddings are two ways of representing words and sentences, respectively, in vector form. Word embeddings are mathematical representations of words that reflect their meanings and allow words with similar meanings to be close in the vector space. They are learned from text data and can be used for various NLP tasks. Sentence embeddings, on the other hand, represent entire sentences and capture their meanings. They are usually obtained by combining the word embeddings of the words in a sentence. Sentence embeddings can be used for tasks such as text classification and semantic search.
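To make the sentence-embedding idea concrete, the minimal Python sketch below encodes a few example comments with a pre-trained Sentence-BERT model and compares them with cosine similarity. The sentence-transformers library, the all-MiniLM-L6-v2 checkpoint, and the sample comments are illustrative choices and are not prescribed by this paper.

from sentence_transformers import SentenceTransformer, util

# Any pre-trained Sentence-BERT checkpoint can be used; this one is a small,
# commonly used model chosen only for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

comments = [
    "Thanks for the help, this edit looks great!",
    "Thank you, that explanation was really useful.",
    "You are an idiot and nobody wants you here.",
]
embeddings = model.encode(comments)  # one dense vector per comment

# Comments with similar meanings should lie close together in the vector space,
# so the first pair is expected to score higher than the first/third pair.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))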
In this paper, the aim is to classify toxic comments based on their level of toxicity into different categories such as toxic, insult, identity hate, obscene, threat, severe toxic, and high toxic. Before classification, the comments are pre-processed, and different vectorization methods, namely TF-IDF, S-BERT, InferSent, and the T5 Transformer, are used. This paper aims to classify the toxic comments from the Kaggle dataset of toxic comments on Wikipedia, prepared by Jigsaw, and determine their level of toxicity. The comments are categorized into different levels of toxicity, including severe toxic, toxic, threat, insult, obscene, and identity hate. The novelty of this research lies in conducting a comparative analysis of four different vectorization techniques, namely S-BERT, TF-IDF, InferSent, and T5, in addition to evaluating the performance of six classification algorithms on the vectorized data.

II. RELATED WORK

Kumar & Kanisha [3] conducted a study showing that there are two approaches to classifying comments based on toxicity. In the first approach, the dataset is classified according to whether a comment is toxic or not, whereas in the second, comments are classified into a range of categories such as toxic, severely toxic, threat, insult, hate, and obscene. It was observed that logistic regression outperforms other classification algorithms such as Naïve Bayes and decision trees.

Husnain et al. [2] proposed two strategies for detecting comment toxicity on social media platforms. The first approach trains individual classifiers for each aspect of comment toxicity. The second approach treats it as a multi-label classification problem. Various ML techniques such as Naive Bayes, decision trees, and logistic regression were used, and the results were evaluated with 10-fold cross-validation on a dataset from Kaggle. The multi-label classification problem was converted into a multi-class problem using a unique pre-processing method, which improved accuracy for simple classification models. The experimental results showed that logistic regression was effective for both multi-class and binary classification.

Venugopalan & Gupta [4] have used SVM with a linear kernel and a decision tree classifier. It is reported that the model performs with high accuracy when compared with previous models. Garlapati et al. [5] reported that for binary and multi-label classification, the best models are CNN and LSTM. The goal is to evaluate the type of statement and identify the various classes of toxic language, including obscene, identity-hateful, poisonous, insulting, and severely toxic language. The system takes comments from websites, labelled as harmful or non-toxic, as input, and the goal of the model is to identify the toxicity class. Phased analysis is the goal of that research, with Phase I assessing the toxicity.

Rupapara et al. [6] reported that the Regression Vector Voting Classifier (RVVC) is a new methodology for detecting harmful comments on social media platforms. RVVC combines logistic regression and a support vector classifier using soft voting rules. The method was evaluated on both balanced and imbalanced datasets using bag-of-words and TF-IDF features. The results showed that RVVC outperforms other machine learning classifiers in terms of accuracy, recall, precision, and F1-score. When combined with the SMOTE-applied dataset and TF-IDF features, RVVC reached an accuracy of 0.97. RVVC is a promising method for detecting harmful comments on social media platforms, and its effectiveness has been demonstrated through experiments.

Nair et al. [7] discussed that BERT is a PyTorch pre-trained model that provides contextual embeddings. Semantic textual similarity and semantic search can both benefit from S-BERT. Additionally, BERT aids in deciphering unclear terminology in a written document or a query.

Brundha & Meera [8] used different word embeddings such as Word2vec, GloVe, and Sentence-BERT to find relevant documents. After applying PCA and Factor Analysis, there is a significant increase of 2-3% in Mean Average Precision.

Ajallouda et al. [9] discussed that InferSent, a supervised sentence embedding method, gives English sentences semantic representations. It was trained on corpus data from the Stanford Natural Language Inference (SNLI) dataset.

Ibrahim et al. [10] presented an ensemble approach using three models (CNN, LSTM, and GRU) to find toxicity in user-generated content on online platforms. The researchers used data from Wikipedia talk page modifications and addressed the data imbalance problem by employing data augmentation techniques. The approach involved two stages: determining input toxicity and identifying the categories of toxicity in the content. According to the study's evaluation, the proposed ensemble approach achieved the highest accuracy, with an F1-score of 0.80 for toxic versus non-toxic classification and an F1-score of 0.872 for predicting the type of toxicity.

Chavali et al. [11] stated that, due to the rising use of brief text messages for communication, the spam problem in text messages has worsened. To identify unwanted messages, automatic spam filters use either content-based or heuristic-based methods. The paper focuses on imbalanced datasets, which reflect real-world spam detection settings, and proposes a content-based ML approach with an emphasis on them. To discover the best classifier for unbalanced datasets, a number of machine learning techniques including SVM, Bagging, AdaBoost, and J48 were investigated. The problem of imbalanced datasets was addressed using the SMOTE technique, and the combination of SVM and SMOTE performed best, increasing the average class accuracy by 7 points on the JSC dataset and 3 points on the UCI dataset [11].

Rahul et al. [12] stated that, to achieve accurate analysis of the toxicity of content, the prevalence of online harassment must be rigorously evaluated. Six machine learning algorithms are utilized to address the text classification problem and identify the best algorithm based on the assessment metrics for categorizing harmful comments. The aim is to evaluate toxicity with precision in order to mitigate its adverse effects, which will motivate organizations to take appropriate actions.

Murali et al. [13] discussed that the primary objective is to improve the conversational abilities of a robot receptionist by enhancing three key modules: named entity recognition, sentiment analysis, and hazardous comment classification. The named entity recognition module helps the chatbot retrieve specific information about a company. The sentiment analysis module aims to enhance the chatbot's responsiveness to user feedback and conversation style. The hazardous comment classifier ensures that the self-learning chatbot generates only positive and uplifting comments. The effectiveness of these modules is evaluated by comparing them to previous methods, and the code required to reproduce the results is included for future development.
Sumanth et al. [14] discussed the issue of harmful language on social media platforms, along with the reasons why it is a pressing concern. The primary objective of the research is to apply machine learning to identify and remove harmful speech from these platforms. To accomplish binary and multi-class classification, various machine learning methods are studied, including standard machine learning and ensemble methods. The paper examines two approaches to identifying harmful speech: (a) a technique that extracts word embeddings and builds a model, and (b) a method that enhances pre-existing models like Decision Tree, Logistic Regression, Random Forest, Voting Classifier, and K-Nearest Neighbors. The suggested method can be used to analyse any social media comments.

Swetha et al. [15] set out to investigate the pervasiveness of online bullying and classify the level of harm in the content. An optimal machine learning algorithm is selected from a set of six based on evaluation metrics to categorize harmful statements. The main focus is to conduct a thorough assessment of the harmful nature of online content to reduce its negative impact and encourage organizations to take appropriate action.

Ibrahim et al. [16] reported that the suggested method involves two prediction stages. In the initial stage, a classifier is used to identify whether a comment is normal or exhibits any form of toxicity. If the comment is identified as toxic, a second classifier is employed to detect the specific type of toxicity. To develop these classifiers, various deep learning models were used, such as convolutional neural networks (CNN), bidirectional gated recurrent units (GRU), and bidirectional long short-term memory (LSTM) networks.

III. PROPOSED METHODOLOGY
A. Understanding the Data

The dataset consists of 159,571 observations with 8 attributes; the attributes and sample observations are listed in Table 1. From the data presented in Table 1, it is inferred that there are no missing values for any attribute. The 8 attributes are labelled Id, comment_text, toxic, severe toxic, obscene, threat, insult, and identity hate. While the Id is an indexing element, the comment_text field contains the text that needs to be analysed and categorized. The remaining six fields are labels collected by manual labelling and are used for deciding on the category [17].

The bar graph in Fig. 1 displays the number of comments that fall into each category. Based on this information, the toxic category contains the largest number of flagged comments, at above 14,000, while a much smaller number of comments contain threats, at around 50.

Fig. 1. Distribution of comments across different categories.

TABLE II. PERCENTAGE OF COMMENTS IN EACH CATEGORY

Category         Percentage
Toxic            9.6
Severe toxic     1.0
Obscene          5.3
Threat           0.3
Insult           4.9
Identity hate    0.9

Table II shows the proportion of each type of comment in the dataset. For instance, 9.6% of the comments are labelled toxic. The number of non-toxic comments far outweighs each of the other categories, and categories such as identity hate are minimal; most categories contain over 90% fewer comments than the non-toxic class. The Synthetic Minority Over-sampling Technique (SMOTE) is therefore deployed to balance the data in the machine learning process.
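The class distribution summarized in Fig. 1 and Table II can be reproduced with a few lines of pandas. The sketch below assumes the train.csv file of the Kaggle challenge [17] has been downloaded locally; the file path is illustrative.

import pandas as pd

# train.csv from the Jigsaw Toxic Comment Classification Challenge [17];
# the local path is an assumption of this sketch.
df = pd.read_csv("train.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

print(df.shape)                                  # (159571, 8)
print(df.isna().sum())                           # no missing values in any column
print(df[label_cols].sum().sort_values())        # raw counts per category (cf. Fig. 1)
print((df[label_cols].mean() * 100).round(1))    # percentage per category (cf. Table II)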
B. Data Pre-processing

As discussed in the previous section, the data is highly imbalanced; in this section the data is pre-processed. Fig. 3 illustrates the pre-processing pipeline, and the steps involved are detailed below.

As depicted in Fig. 3, the text undergoes several preprocessing steps: punctuation, special characters, and numbers are removed, the text is tokenized into words, and stopwords are removed. Stemming is then performed to reduce words to their base form.

Fig. 3. Pre-processing of comments.
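A minimal Python sketch of these cleaning steps is shown below. It uses NLTK's stopword list and Porter stemmer, and adds lowercasing and a simple whitespace tokenizer as convenient implementation choices that the paper does not prescribe.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_comment(text: str) -> str:
    # Remove punctuation, special characters and numbers (keep letters only).
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    # Tokenize into words, drop stopwords, then stem each remaining token.
    tokens = [STEMMER.stem(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_comment("You ALL are a bunch of COWARDS!!!"))  # -> "bunch coward"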
The resulting cleaned text is then transformed into numerical vectors using various methods: TF-IDF, S-BERT, InferSent, and T5. The TF-IDF representation converts the raw text of a document into a numerical vector in which each dimension corresponds to a specific term in the corpus vocabulary. InferSent is a sentence embedding model developed by Facebook AI; it is trained on the SNLI and MultiNLI datasets, generates sentence embeddings that capture semantic meaning, and can be used for NLP tasks such as semantic similarity, text classification, and question answering. Sentence-BERT (S-BERT) is a pre-trained model based on the transformer architecture that is designed to capture the contextual relationships between the words in a sentence. T5, the Text-to-Text Transfer Transformer, is a pre-trained language model developed by Google Research; it is designed as a generic text-to-text model that can be fine-tuned for different NLP tasks such as machine translation, text classification, and question answering. The comments are passed through each of these vectorization methods so that a comparative study can be performed on the data.
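The sketch below shows one way these representations could be produced. The specific pre-trained checkpoints (all-MiniLM-L6-v2 and t5-base) are illustrative assumptions, since the paper does not name the models used, and the InferSent encoder (which requires Facebook Research's separate code and pre-trained word vectors) is omitted.

import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, T5EncoderModel

comments = ["you are brilliant", "you are pathetic"]    # already cleaned text

# 1) TF-IDF: one sparse dimension per vocabulary term.
X_tfidf = TfidfVectorizer().fit_transform(comments)     # shape (2, vocab_size)

# 2) S-BERT: dense sentence vectors from a pre-trained Sentence-BERT model.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
X_sbert = sbert.encode(comments)                        # shape (2, 384)

# 3) T5: mean-pool the encoder hidden states to get one vector per comment.
tok = AutoTokenizer.from_pretrained("t5-base")
t5 = T5EncoderModel.from_pretrained("t5-base")
with torch.no_grad():
    batch = tok(comments, return_tensors="pt", padding=True)
    X_t5 = t5(**batch).last_hidden_state.mean(dim=1)    # shape (2, 768)

print(X_tfidf.shape, X_sbert.shape, X_t5.shape)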
C. Model Training and Testing

In this section, models are built using various classification algorithms. Before that, as mentioned in the previous section, SMOTE is applied because the data is highly imbalanced. SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method used in imbalanced classification problems. The technique generates synthetic examples of the minority class, instead of simply duplicating the existing examples, to balance the distribution of the target classes.
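A minimal sketch of this step with imbalanced-learn is given below. The toy data generated with make_classification merely stands in for the vectorized comments and derived class labels, and the 80/20 split and random seeds are illustrative.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the real pipeline: X would be the TF-IDF / S-BERT /
# InferSent / T5 vectors and y the derived toxicity class.
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           n_classes=3, weights=[0.90, 0.07, 0.03],
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print("before SMOTE:", Counter(y_train))
smote = SMOTE(random_state=42)                  # oversample the training split only
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print("after SMOTE: ", Counter(y_train_bal))    # minority classes synthesized up to the majority count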
After applying SMOTE, models are built with the various classification algorithms, and a comparative study is conducted between the models built with and without SMOTE. The models are trained and tested using six classification algorithms: Logistic Regression, MLP, K-NN, SVM, Random Forest, and Gaussian Naïve Bayes. Logistic regression is a popular method for multiclass classification because it is easy to implement, is a good choice for large datasets, can be trained quickly, and reduces overfitting. The Multilayer Perceptron is used because of its capacity to learn a wide range of representations and complex non-linear relationships, its robustness to noise, and its ability to handle large datasets. K-NN is used because no training is needed, it makes no assumptions about the data distribution, and it is a flexible algorithm that can be applied directly to multiclass classification.

SVMs maximize the margin between different classes, which helps to separate the data and reduce overfitting; they handle large datasets efficiently and work well for high-dimensional data. Random Forest is a versatile algorithm that can handle multi-class classification problems, makes no assumptions about the distribution of the data, and can be trained and used for prediction quickly. Gaussian Naive Bayes is a simple and fast algorithm that requires few computational resources, can be used for large datasets, and requires relatively little training data to make accurate predictions.

With these algorithms, the comments are categorized into 7 classes: toxic, severe toxic, insult, threat, identity hate, obscene, and high toxic. In cases where a comment falls under multiple of these categories, it is classified as high toxic.
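Continuing from the SMOTE sketch above (reusing X_train_bal, y_train_bal, X_test, and y_test), the sketch below trains the six classifiers and compares their macro F1-scores. The hyperparameters are scikit-learn defaults or illustrative guesses, and LinearSVC stands in for the SVM.

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=300),
    "K-NN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gaussian NB": GaussianNB(),
}

# X_train_bal / y_train_bal come from the SMOTE step; X_test / y_test were held
# out before oversampling, so evaluation reflects the original class distribution.
for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
    preds = model.predict(X_test)
    print(f"{name}: macro F1 = {f1_score(y_test, preds, average='macro'):.2f}")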
D. Performance Evaluation and Comparison Metrics

The metrics employed for assessing performance are Precision, Recall, F1-score, and ROC score.

Precision: Precision is denoted as P and is given in Eq. (1):

P = TP / (TP + FP)    (1)

Recall: Recall is denoted as R and is given in Eq. (2):

R = TP / (TP + FN)    (2)

F1-score: The F1-score is denoted as F and is given in Eq. (3):

F = 2PR / (P + R)    (3)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.

ROC score: The model's effectiveness is indicated by the ROC Area Under the Curve (AUC) score, which reflects how well the model distinguishes between the positive and negative categories. The higher the AUC value, the better the performance of the model. If the classifier achieves an AUC score of 1, it is capable of perfectly separating all points belonging to the positive class from those of the negative class.
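These metrics can be computed with scikit-learn as sketched below. The toy labels and scores are illustrative, and for the seven-class setting the averaged variants of the same functions (e.g. average="macro") apply.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy binary ground truth, hard predictions, and predicted scores
# (the scores are needed for the ROC-AUC computation).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print("P   =", precision_score(y_true, y_pred))   # Eq. (1): TP / (TP + FP)
print("R   =", recall_score(y_true, y_pred))      # Eq. (2): TP / (TP + FN)
print("F1  =", f1_score(y_true, y_pred))          # Eq. (3): 2PR / (P + R)
print("AUC =", roc_auc_score(y_true, y_score))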
IV. RESULTS & DISCUSSIONS

In Fig. 4 and Fig. 5, the F1-scores of the models trained without SMOTE and with SMOTE are compared.

According to Figs. 4 and 5, the highest F1-score of 0.71 for the TF-IDF feature representation without SMOTE was obtained by Random Forest and SVM, while MLP achieved the highest F1-score of 0.91 for TF-IDF with SMOTE. In the case of S-BERT, SVM outperformed all other models, with the highest F1-score both with and without SMOTE. For InferSent, SVM had the highest F1-score of 0.57 without SMOTE, while Random Forest had the highest F1-score of 0.91 with SMOTE. For T5, MLP performed best without SMOTE with a maximum F1-score of 0.76, whereas Random Forest achieved the highest F1-score of 0.91 when SMOTE was used. Overall, the Random Forest classifier proved to be the most precise and dependable among the classifiers that were evaluated.
REFERENCES

[1] Anu J. Nair, Aadithya Vinayak and Veena G., "Comparative study of Twitter Sentiment On COVID-19 Tweets," 5th International Conference on Computing Methodologies and Communication, 2021.
[2] Muhammad Husnain, Adnan Khalid and Numan Shafi, "A Novel Preprocessing Technique for Toxic Comment Classification," 2021 International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, April 05-07, 2021.
[3] KGSSV Akhil Kumar and B. Kanisha, "Analysis of Multiple Toxicities using ML Algorithms to detect toxic comments," 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 2022.
[4] Manju Venugopalan and Deepa Gupta, "A reinforced active learning approach for optimal sampling in aspect term extraction for sentiment analysis," Expert Systems with Applications, vol. 209, 2022.
[5] Anusha Garlapati, Neeraj Malisetty and Gayathri Narayanan, "Classification of Toxicity in Comments using NLP and LSTM," 8th International Conference on Advanced Computing and Communication Systems, 2022.
[6] Vaibhav Rupapara, Furqan Rustam, Hina Fatima Shahzad, Arif Mehmood, Imran Ashraf and Gyu Sang Choi, "Impact of SMOTE on Imbalanced Text Features for Toxic Comments Classification Using RVVC Model," IEEE Access, 2021.
[7] Rohan Nair, Vadla Niranjan Vara Prasad, Sreenadh A. and Jyothisha J. Nair, "Coreference Resolution for Ambiguous Pronoun with BERT and MLP," 10th International Conference on Advances in Computing and Communications, 2021.
[8] Brundha J. and K. N. Meera, "Vector Model Based Information Retrieval System With Word Embedding Transformation," 10th International Conference on Emerging Trends in Engineering and Technology, 2022.
[9] Lahbib Ajallouda, Kawtar Najmani, Ahmed Zellou and El Habib Benlahmar, "Doc2Vec, SBERT, InferSent, and USE: Which embedding technique for noun phrases?," 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), 2022.
[10] M. Ibrahim, M. Torki and N. El-Makky, "Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 875-878, doi: 10.1109/ICMLA.2018.00141.
[11] S. T. Chavali, C. Tej Kandavalli, S. T. M. and D. Gupta, "A Study on Named Entity Recognition with Different Word Embeddings on GMB Dataset using Deep Learning Pipelines," 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2022, pp. 1-5, doi: 10.1109/ICCCNT54827.2022.9984220.
[12] Rahul, H. Kajla, J. Hooda and G. Saini, "Classification of Online Toxic Comments Using Machine Learning Algorithms," 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 2020, pp. 1119-1123, doi: 10.1109/ICICCS48265.2020.9120939.
[13] S. R. Murali, S. Rangreji, S. Vinay and G. Srinivasa, "Automated NER, Sentiment Analysis and Toxic Comment Classification for a Goal-Oriented Chatbot," 2020 Fourth International Conference on Intelligent Computing in Data Sciences (ICDS), Fez, Morocco, 2020, pp. 1-7, doi: 10.1109/ICDS50568.2020.9268680.
[14] P. Sumanth, S. Samiuddin, K. Jamal, S. Domakonda and P. Shivani, "Toxic Speech Classification using Machine Learning Algorithms," 2022 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 2022, pp. 257-263, doi: 10.1109/ICESIC53714.2022.9783475.
[15] V. Swetha, R. Anuhya, E. S. Sowmya and A. Geethanjali, "Building a Toxic Comments Classification Model," 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2021, pp. 1519-1523, doi: 10.1109/ICECA52323.2021.967
[16] M. Ibrahim, M. Torki and N. El-Makky, "Imbalanced Toxic Comments Classification Using Data Augmentation and Deep Learning," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 875-878, doi: 10.1109/ICMLA.2018.00141.
[17] Cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum and Will Cukierski, "Toxic Comment Classification Challenge," Kaggle, 2017. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge