Survey Paper
1. Introduction
2. Literature Review
[1] Tofayet Sultan, Nusrat Jahan, Ritu Basak, Mohammed Shaheen Alam Jony, and Rashidul Hasan Nabil proposed a model
for cyberbullying detection from social-media images or screenshots using Optical Character Recognition (OCR) and Natural
Language Processing (NLP) techniques. EasyOCR performed the text extraction, while Bag of Words and TF-IDF models
supplied the NLP features. The model was trained on a Kaggle dataset of online bullying and toxicity and reached 96%
accuracy. Several machine learning algorithms, such as Logistic Regression, Decision Tree, Random Forest, SGD, Linear SVC,
and AdaBoost, were applied in the classification step, making it a robust yet improvable approach. The system faced
challenges such as line breaks in the extracted text, which caused broken words to be classified as separate words, and
difficulty handling unnecessary phrases and social-network interface instructions.
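The Bag of Words and TF-IDF representations mentioned in [1] can be illustrated with a minimal, from-scratch sketch. The toy documents, the function names `bag_of_words` and `tf_idf`, and the unsmoothed weighting (term frequency times ln(N / document frequency)) are assumptions made for illustration; [1] does not state which exact variant it used, and library implementations typically apply a smoothed formula.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(doc.lower().split())

def tf_idf(docs):
    """Return one {token: weight} dict per document.

    weight = (token count / document length) * ln(N / document frequency)
    """
    counts = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for c in counts for tok in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in c.items()
        })
    return weights

# Hypothetical toy corpus, not taken from any of the surveyed datasets.
docs = ["you are so stupid", "you are so kind", "have a nice day"]
weights = tf_idf(docs)
```

In this toy corpus a token that appears in only one document ("stupid") receives a larger weight than one shared by two documents ("you"), which is the property that makes TF-IDF more discriminative than raw counts.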
[2] Andrea Perera and Pumudu Fernando proposed a model for detecting and preventing cyberbullying on social
media using natural language processing, supervised machine learning with Support Vector Machines, and feature extraction
methods such as TF-IDF, sentiment analysis, profanity detection, and pronoun analysis. The model was trained on a Twitter
dataset retrieved from the Internet Archive and achieved 74.50% accuracy, with 74% precision, recall, and F1-score using
TF-IDF features. The main challenges are handling sarcastic content, which the system does not detect as cyberbullying,
and adapting to language patterns that change over time, which limits robustness in a dynamic online
environment.
[3] Aigerim Altayeva, Rustam Abdrakhmanov, Aigerim Toktarova, and Abdimukhan Tolep proposed a model for
cyberbullying detection on social networks using a hybrid deep learning architecture. The approach combines Convolutional
Neural Networks (CNNs) for feature extraction and Long Short-Term Memory (LSTM) networks for sequence processing to
enhance detection accuracy. The dataset, sourced from Kaggle, includes structured textual content from various social media
platforms, annotated with categories for different types of cyberbullying. The model achieved a training accuracy of 95.4%
and a validation accuracy of 87.5%, showing strong performance but some difficulty generalizing. The reported
challenges were imbalanced datasets, which made the less common types of cyberbullying hard to detect, and privacy
constraints, which limit the detail and diversity of the data available for analysis.
[4] Bandeh Ali Talpur and Declan O'Sullivan conducted research on cyberbullying severity detection. Features were
extracted using Pointwise Mutual Information (PMI) and passed to several machine learning
classifiers: Naïve Bayes, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector
Machine (SVM). The dataset consists of 24,189 annotated tweets drawn from a publicly available harassment
dataset on GitHub; all tweets were posted between December 2016 and January 2017. It is a multi-class classification
study, and the True Positive Rates achieved were: Random Forest 91.2%, SVM 90.3%, KNN 87.9%, Naïve Bayes
67.2%, and Decision Tree 88.4%. The major limitation was the scarcity of user data, as the dataset lacked most
social and user-related features. The study therefore concentrated on text content to detect cyberbullying.
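As a hedged illustration of the PMI measure named in [4], the sketch below computes the standard definition PMI(w, c) = log2(P(w, c) / (P(w) P(c))) from document counts. The function name `pmi`, its parameters, and the counts are hypothetical; how [4] turns PMI scores into classifier features is not described here, so this shows only the measure itself.

```python
import math

def pmi(n_word_and_class, n_word, n_class, n_total):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ), estimated from counts.

    n_word_and_class: documents that contain the word AND carry the class label
    n_word:           documents that contain the word
    n_class:          documents that carry the class label
    n_total:          all documents
    """
    p_joint = n_word_and_class / n_total
    p_word = n_word / n_total
    p_class = n_class / n_total
    return math.log2(p_joint / (p_word * p_class))

# Hypothetical counts: a word occurs in 40 of 1000 tweets, 250 tweets carry the
# class label, and 30 tweets both contain the word and carry the label.
score = pmi(n_word_and_class=30, n_word=40, n_class=250, n_total=1000)
```

A positive score means the word and the class co-occur more often than they would if independent; a score near zero means the word carries little information about the class.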
[5] Vedadri Yoganand Bharadwaj, Vasamsetti Likhitha, Vootnoori Vardhini, Adari Uma Sree Asritha, Saurabh Dhyani, and
M. Lakshmi Kanth proposed a machine learning framework for detecting cyberbullying in online platforms. The system
employs algorithms such as Multinomial Naïve Bayes, Linear SVC, Logistic Regression, and K-Nearest Neighbors to classify
abusive text messages. The dataset used, "Suspicious Communications on Social Platforms," consists of 20,000 entries tagged
for sentiment. Key steps include preprocessing the data through tokenization and extracting relevant features for training. The
study showcases the system's effectiveness with high accuracy rates and emphasizes its scalability for real-time application. A
user-friendly interface further enhances the tool’s accessibility and practicality in mitigating harmful online interactions.
[6] Aljwharah Alabdulwahab, Mohd Anul Haq, and Mohammed Alshehri explored innovative techniques for cyberbullying
detection, integrating both traditional machine learning and deep learning models. By combining Natural Language
Processing (NLP) for feature extraction with algorithms like KNN, SVM, and Naïve Bayes, as well as advanced neural
networks (CNN and LSTM), the study provided a comprehensive approach to text classification. Using a dataset of nearly
48,000 tweets, the researchers achieved accuracy levels of 90% with KNN, 92% with SVM, and 96% with their deep learning
model. They highlighted the model's adaptability for multilingual datasets and its potential to detect evolving abusive
language. While challenges such as dataset size and overlapping data remain, the authors suggested expanding the dataset and
incorporating additional channels for further improvement.
[7] Khalid M. O. Nahar, Mohammad Alauthman, Saud Yonbawi, and Ammar Almomani studied cyberbullying detection using
machine learning techniques. The models considered include Logistic Regression, SVM, Naïve Bayes, Random
Forest, and LDA. All models were trained on a Twitter dataset of 32,000 tweets labelled as hatred or
non-hatred. Logistic Regression achieved the highest accuracy, 94.97%,
while SVM and Random Forest reached slightly lower accuracies of 94.66% and 93.1%, respectively. The
authors identify dynamic streaming data, complex raw text, and
inadequate feature extraction as some of the biggest challenges for real-time applications, pointing to the need for better
optimization algorithms.
[8] Peiling Yi and Arkaitz Zubiaga published the study "Session-based Cyberbullying Detection on Social Media" in 2023. The
research combined deep learning, hybrid machine learning with fuzzy logic, and multimodal strategies to detect cyberbullying
on Instagram and Vine datasets, ranging from 970 to 534,000 samples. XBully achieved the best Instagram performance
(0.878), while MMCD excelled on Vine (0.841). Key challenges included dataset imbalance, limited session-based
cyberbullying detection datasets, and varying dataset sizes. The study highlighted the need for balanced, representative
datasets and models capable of generalizing across diverse social media platforms.
3. Methodology:
1. DATA COLLECTION:
Twitter Datasets:
[2] A Twitter dataset was obtained from the Internet Archive, an online repository offering open access
to archived websites, software, and multimedia. The dataset focused on tweets regarding
cyberbullying; only the relevant fields were stored in a SQL database: tweet text, tweet ID,
in-reply-to status ID, and retweeted status ID.
[6] Cyberbullying detection was implemented using a tweets dataset. This dataset contains 47,692
tweets grouped into two classes, cyberbullying and non-cyberbullying, and served as the basis for the data
analysis techniques employed in the study.
TABLE 1
[4] The authors chose a quality-annotated corpus for harassment research. The dataset was already categorized
into different topics of harassment content: i) sexual, ii) racial, iii) appearance-related, iv) intelligence,
and v) political.
TABLE 2
[4] To perform the severity-assessment experiment on the harassment dataset, the authors categorized the
annotated cyberbullying tweets into four levels: low, medium, high, and non-cyberbullying. Sexual and
appearance-related tweets were categorized as high-level cyberbullying
severity; political and racial tweets as medium-level; and intelligence tweets as low-level cyberbullying
severity. All tweets labelled ‘non-cyberbullying’ in each category were consolidated
into a single non-cyberbullying category. This resulted in a dataset with the characteristics shown in
Table 2.
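The severity relabelling described for [4] amounts to a lookup from harassment topic to severity level. A minimal sketch follows; the topic keys, the dictionary name `SEVERITY_BY_TOPIC`, and the function `severity_label` are hypothetical names chosen for illustration, and only the topic-to-level assignments come from the description above.

```python
# Topic-to-severity assignment as described for [4]; identifiers are illustrative.
SEVERITY_BY_TOPIC = {
    "sexual": "high",
    "appearance": "high",
    "political": "medium",
    "racial": "medium",
    "intelligence": "low",
}

def severity_label(topic, is_cyberbullying):
    """Map a (topic, binary annotation) pair to one of the four severity classes."""
    if not is_cyberbullying:
        # Non-cyberbullying tweets from every topic share a single class.
        return "non-cyberbullying"
    return SEVERITY_BY_TOPIC[topic]

labels = [
    severity_label("sexual", True),
    severity_label("racial", True),
    severity_label("intelligence", True),
    severity_label("political", False),
]
```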
Text Extraction from Visual Content
[1] FIG:OCR
Various techniques, including traditional machine learning and deep learning methods, have
been used in cyberbullying detection to classify textual data. These models classify offensive
content across platforms such as Twitter, Facebook, and YouTube using the linguistic patterns
and contextual meaning of the text.
2. Linear SVC: A linear classification algorithm that separates classes with a hyperplane; works
well on data that is linearly separable.
3. Multinomial Naive Bayes: A probabilistic model that uses word frequencies to estimate the probability
that a text belongs to each of the cyberbullying categories.
4. Logistic Regression: Uses the logistic function to classify a text as either offensive or not offensive.
5. K-Nearest Neighbours (KNN): Classifies a text according to the majority class of its K nearest samples. Highly
sensitive to noise and local patterns.
6. Decision Tree: A tree-like structure used for making decisions based on feature values. It is easy to
interpret but prone to overfitting on complex datasets.
7. Gradient Boosting Classifier: This ensemble method combines the predictions of multiple weak
learners to create a strong classifier. It is effective for improving the accuracy of classification.
8. Random Forest Classifier: This ensemble method builds multiple decision trees and merges their
results to improve accuracy and reduce overfitting.
9. Bagging Classifier: Utilizes the technique of Bootstrap Aggregating (bagging) to reduce variance and
prevent overfitting by training multiple models on different subsets of the data.
10. SGD Classifier: Stochastic Gradient Descent (SGD) is used for large-scale linear classification tasks
and is suitable for datasets with many features.
11. AdaBoost Classifier: An ensemble learning method that combines weak classifiers to create a strong
classifier, focusing on misclassified instances to improve accuracy.
These models work well for simpler datasets but struggle with the complexity of social media data.
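To make one of the listed classifiers concrete, the following is a minimal from-scratch sketch of Multinomial Naive Bayes (item 3) with Laplace (add-one) smoothing over whitespace tokens. The class name, the `alpha` parameter, and the four training sentences are assumptions for illustration; the surveyed papers do not give their implementations and most likely relied on library versions.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = add-one smoothing)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / n_docs)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training set, not drawn from any surveyed dataset.
train_docs = ["you are stupid and ugly", "nobody likes you loser",
              "have a great day", "thanks for your help"]
train_labels = ["offensive", "offensive", "not-offensive", "not-offensive"]
model = MultinomialNB().fit(train_docs, train_labels)
prediction = model.predict("you are a loser")
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the smoothing term keeps a single unseen token from driving a class probability to zero.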
1. LSTM Layer: Captures sequence order and long-term dependencies, which are critical for
identifying subtle patterns in cyberbullying samples.
2. CNN Layer: Combines the local patterns in the LSTM output, strengthening the
identification of very subtle indicators of bullying.
This type of approach proves effective for detecting
cyberbullying in ever-changing, multi-modal statements.
4. Evaluation Strategy:
[1]FIG:Precision
[1]FIG:Recall
1. Accuracy: Logistic Regression achieved the highest accuracy, 96%, for both the Bag of Words and
TF-IDF models, well ahead of Decision Tree, Gradient Boosting, Random Forest,
and AdaBoost, which showed the lowest accuracies.
2. F1-score: Linear SVC achieved the highest F1-score, 93.13%, for the TF-IDF model, while
Logistic Regression led for the Bag of Words model. Other classifiers such as Bagging and SGD also
showed strong F1-scores. Conversely, Decision Tree, Gradient Boosting, and
Random Forest remained weak across all metrics compared with the other
classifiers.
3. Precision: Logistic Regression again ranked first at 96% for both models,
and Linear SVC matched this for the TF-IDF model. Decision Tree, Gradient
Boosting, and Random Forest exhibited the lowest precision.
4. Recall: The highest recall, 96%, was achieved by Logistic Regression for both models, while
Linear SVC performed well for the TF-IDF model and the SGD Classifier for the Bag of Words model.
Decision Tree, Gradient Boosting, and Random Forest again lagged in recall.
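For reference, the four metrics compared above can be computed from the confusion counts of a binary classifier as sketched below. The function name `classification_metrics` and the example label vectors are hypothetical and do not reproduce any result reported in the surveyed papers.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = cyberbullying, 0 = non-cyberbullying.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision penalizes false alarms, recall penalizes missed cyberbullying cases, and the F1-score is their harmonic mean, which is why a classifier can rank differently on each metric.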
5. Result
Logistic Regression performed best, reaching a maximum accuracy of 96% for both the
Bag of Words and TF-IDF models. In precision, Logistic Regression and
Linear SVC tied at the top with 96%, while classifiers such as Decision Tree
and Random Forest had precision values far below the rest. Linear SVC achieved the highest
F1-score, 93.13%, with the TF-IDF model. For recall, Logistic Regression
and Linear SVC (with TF-IDF) both performed strongly at 96%. However, Decision
Tree, Gradient Boosting, and Random Forest were suboptimal on every
evaluation metric.
6. Conclusion
7. References
[1] Tofayet Sultan, Nusrat Jahan, Ritu Basak, Mohammed Shaheen Alam Jony, Rashidul Hasan Nabil: Machine Learning in
Cyberbullying Detection from Social-Media Image or Screenshot with Optical Character Recognition, 2023.
https://www.mecs-press.org/ijisa/ijisa-v15-n2/IJISA-V15-N2-1.pdf
[2] Andrea Perera, Pumudu Fernando: Accurate Cyberbullying Detection and Prevention on Social Media, 2021.
https://www.researchgate.net/publication/349531243_Accurate_Cyberbullying_Detection_and_Prevention_on_Social_Media
[3] Aigerim Altayeva, Rustam Abdrakhmanov, Aigerim Toktarova, Abdimukhan Tolep: Cyberbullying Detection on
Social Networks Using a Hybrid Deep Learning Architecture Based on Convolutional and Recurrent Models, 2024.
https://thesai.org/Publications/ViewPaper?Volume=15&Issue=10&Code=IJACSA&SerialNo=18
[4] Bandeh Ali Talpur, Declan O'Sullivan: Cyberbullying Severity Detection, 2020.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0240924
[5] Vedadri Yoganand Bharadwaj, Vasamsetti Likhitha, Vootnoori Vardhini, Adari Uma Sree Asritha, Saurabh Dhyani, M.
Lakshmi Kanth: Automated Cyberbullying Activity Detection using Machine Learning Algorithm, 2023.
https://www.researchgate.net/publication/374505635_Automated_Cyberbullying_Activity_Detection_using_Machine_Learning_Algorithm
[6] Aljwharah Alabdulwahab, Mohd Anul Haq, Mohammed Alshehri: Cyberbullying Detection using Machine Learning and
Deep Learning, International Journal of Advanced Computer Science and Applications, Vol. 14, No. 10, 2023.
https://www.researchgate.net/publication/375120231_Cyberbullying_Detection_using_Machine_Learning_and_Deep_Learning
[7] Khalid M. O. Nahar, Mohammad Alauthman, Saud Yonbawi, Ammar Almomani: Cyberbullying Detection and Recognition
with Type Determination Based on Machine Learning, 2022.
https://acrobat.adobe.com/id/urn:aaid:sc:AP:fa992068-14bc-4eb5-8054-d81296f10bed