Cyber Bullying Detection
1,2,3 Student, Department of Computer Engineering, Ramrao Adik Institute of Technology, Nerul
4 Assistant Professor, Department of Computer Engineering, Ramrao Adik Institute of Technology, Nerul
Abstract. The growth of the internet and social media has brought with it the sending, receiving, and posting of negative, harmful, false, or mean content about other individuals, which is what cyberbullying means. Bullying over social media works the same way as threatening, slandering, and chastising an individual in person. Cyberbullying has led to a severe increase in mental health problems, especially among the young generation; it has resulted in lower self-esteem and increased suicidal ideation. Unless some measure against cyberbullying is taken, self-esteem and mental health issues will affect an entire generation of young adults. Many traditional machine learning models have been implemented in the past for the automatic detection of cyberbullying on social media, but these models have not considered all the necessary features that can be used to identify or classify a statement or post as bullying. In this paper, we propose a model based on the various features that should be considered while detecting cyberbullying, and we implement a few of these features with the help of a bidirectional deep learning model called BERT.
* Corresponding author: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0
(http://creativecommons.org/licenses/by/4.0/).
ITM Web of Conferences 40, 03038 (2021) https://doi.org/10.1051/itmconf/20214003038
ICACC-2021
results. The authors then tested the YouTube dataset with three different machine learning models: a Naive Bayes classifier, a Decision Tree classifier (C4.5), and a Support Vector Machine (SVM) with a linear kernel. It was observed that the clustering results for the hate posts had a lower precision on the YouTube dataset than in the FormSpring tests, as textual analysis and syntactical features perform differently on the two platforms. When this hybrid approach was applied to the Twitter dataset, it resulted in a weak recall and F1 score. The model proposed by the authors can be improved and used in building constructive applications to mitigate cyberbullying issues.

J. Yadav, et al. [2] propose a new approach to cyberbullying detection on social media platforms, using the BERT model with a single linear neural network layer on top as a classifier. The model is trained and evaluated on the Formspring forum and Wikipedia datasets. The proposed model gave an accuracy of 98% for the Formspring dataset and 96% for the Wikipedia dataset, which is relatively high compared to the previously used models. The proposed model gave better results for the Wikipedia dataset due to its large size, without the need for oversampling, whereas the Formspring dataset needed oversampling.

R. R. Dalvi, et al. [3] suggest a method to detect and prevent Internet exploitation on Twitter using supervised machine learning classification algorithms. In this research, the live Twitter API is used to collect tweets and form datasets. The proposed model tests both a Support Vector Machine and Naive Bayes on the collected datasets. To extract features, they used the TF-IDF vectorizer. The results show that the accuracy of the cyberbullying model based on the Support Vector Machine is about 71.25%, which is better than Naive Bayes at about 52.75%.

The goal of Trana R.E., et al. [4] was to design a machine learning model to detect harassment in text extracted from image memes. The authors compiled a database containing approximately 19,000 text comments published on YouTube. This study discusses the effectiveness of three machine learning approaches, Naive Bayes, the Support Vector Machine, and a convolutional neural network, on the YouTube database, and compares the results with the existing Formspring databases. The authors further investigated the algorithms on cyberbullying sub-categories within the YouTube database. Naive Bayes surpassed SVM and CNN in the following four categories: race, ethnicity, politics, and generalism. SVM performed better than Naive Bayes and CNN in the gender category, and all three algorithms showed equal accuracy in the body-type category. The results of this study provided data that can be used to distinguish between incidents of abuse and non-abuse. Future work could focus on the creation of a binary classification scheme applied to text extracted from images, to see whether the YouTube database provides a better context for aggression-related clusters.

In N. Tsapatsoulis, et al. [5], a detailed review of cyberbullying on Twitter is presented, and the importance of identifying different abusers on Twitter is given. In the paper, the various practical steps required for the development of an effective and efficient application for cyberbullying detection are described thoroughly. The trends involved in the categorization and labeling of data platforms, machine learning models and feature types, and case studies that made use of such tools are explained. This paper serves as an initial step for a project in cyberbullying detection using machine learning.

G. A. León-Paredes et al. [6] have explained the development of a cyberbullying detection model using Natural Language Processing (NLP) and Machine Learning (ML). A Spanish Cyberbullying Prevention System (SPC) was developed by applying the machine learning techniques Naïve Bayes, Support Vector Machine, and Logistic Regression. The dataset used for this research was extracted from Twitter. A maximum accuracy of 93% was achieved with the three techniques used, and the cases of cyberbullying detected by this system showed an accuracy of 80% to 91% on average. Stemming and lemmatization techniques from NLP can be implemented to further increase the accuracy of the system, and such a model can also be implemented for detection in English and other local languages.

P. K. Roy, et al. [7] detail the creation of an application for the detection of hate speech on Twitter with the help of a deep convolutional neural network. Machine learning algorithms such as Logistic Regression (LR), Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Gradient Boosting (GB), and K-Nearest Neighbors (KNN) were used to identify tweets related to hate speech on Twitter, with features extracted using the TF-IDF process. The best ML model was SVM, but it managed to predict only 53% of hate speech tweets on a 3:1 train-test split; the reason behind the low prediction rate was the imbalanced data. Advanced deep learning models based on the Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and their combination with Contextual LSTM (CLSTM) showed similar behavior on the same imbalanced dataset. 10-fold cross-validation was used along with the proposed DCNN model and obtained a very good recall rate: 0.88 for hate speech and 0.99 for non-hate speech. Test results confirmed that the k-fold cross-validation process is a better choice with imbalanced data. In the future, the current database can be expanded to achieve better accuracy.

S. M. Kargutkar, et al. [8] proposed a system to perform binary classification for cyberbullying. The system uses a Convolutional Neural Network (CNN) and Keras for content examination, as the strategies prevalent at that time provided less precision. This research involved data from Twitter and YouTube; the CNN accuracy was 87%. Deep learning-based models have found their way into identifying digital harassment episodes, as they can overcome the limitations of traditional models and improve on their adoption.
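Several of the surveyed systems ([3], [7]) extract features with TF-IDF before training classical classifiers. As a rough illustration of how such a vectorizer scores terms, the following stdlib-only sketch computes TF-IDF weights for a toy corpus (the corpus and whitespace tokenization are invented for illustration; real systems use a library vectorizer):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute TF-IDF weights for each document in a toy corpus.

    TF is the raw term count normalized by document length;
    IDF is log(N / df) over the N documents.
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = ["you are awful", "you are kind", "have a kind day"]
w = tfidf(corpus)
# "you" appears in 2 of 3 documents, so it is weighted lower
# than "awful", which appears in only 1.
```

The effect is exactly what makes TF-IDF useful for bullying detection: common filler words are down-weighted while rarer, more discriminative terms stand out.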
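Roy et al.'s observation that a single 3:1 split underperforms on skewed classes while k-fold cross-validation fares better can be illustrated with a stratified fold assignment; this is a generic sketch of the idea, not their implementation:

```python
def stratified_folds(labels, k):
    """Assign each sample index to one of k folds so that every fold
    keeps roughly the same class proportions: a generic sketch of the
    k-fold strategy that worked well on imbalanced hate-speech data."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        # deal each class's indices round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# 8 non-hate (0) vs 2 hate (1): each fold still sees both classes,
# so no evaluation round is starved of the minority class.
labels = [0] * 8 + [1] * 2
folds = stratified_folds(labels, 2)
```

With a plain random 3:1 split, the minority class can end up almost entirely in one partition, which is one plausible reading of the low 53% prediction rate reported for SVM.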
Jamil H. et al. [9] have described the implementation of a new social network model and its query language, called GreenShip. They showed that, with its supporting tools, GreenShip users can be more effective in the fight against online bullying than in traditional online social networks. A reputation management model is introduced to restrict access to harmful information, focusing on denying offenders the means for disseminating information to the users associated with the target. GreenShip relies on reputation as a model that provides safe, “green” friends, through the recognition of the different types of friendships on Facebook. The damage resulting from bad friends is thereby very limited, and, since there are many forms of friendship, the communication lines set aside offer the benefits of privacy and control.

Rasel, Risul Islam, et al. [10] focus on the comments made on social networks and the analysis of whether these comments carry an offensive meaning. The comments can be divided into three categories: offensive, hate speech, and neither of the two. The proposed model classifies the comments into these categories with an accuracy of more than 93%. Latent Semantic Analysis (LSA) has been used as a feature selection method to reduce the size of the input data. In addition to standard feature extraction methods such as tokenization and N-grams, TF-IDF was applied to detect the important comments. Three different machine learning models, Random Forest, Logistic Regression, and Support Vector Machines (SVMs), were built to perform the classification and to predict whether a comment is a teasing comment.

3 Proposed Methodology

In this paper, a method to detect cyberbullying on social media is proposed that is not just based on sentiment analysis but also considers the syntactic, semantic, and sarcastic nature of a sentence before classifying it as hate speech. To achieve our goal, we start with traditional sentiment analysis, where we perform contextual mining of text to identify and extract the subjective information in the source material and understand the opinion, emotion, or attitude towards the topic. Later, we introduce a group of “social” features that can strongly affect and guide the cyberbullying detection process. We have divided all the features we have extracted into five categories:
● Sentimental Features
● Sarcastic Features
● Syntactic Features
● Semantic Features
● Social Features
All these features have been categorized based on the literature survey of the existing systems, and each feature uniquely identifies the text. Choosing informative, descriptive, and independent features is a crucial stage for the effectiveness of the algorithms in pattern recognition and classification problems.

With the sentimental features, we try to evaluate the sentiment (positive or negative) of a given text document. Research shows that human analysts tend to agree around 80-85% of the time, and that is the baseline we have tried to consider while training our sentiment scoring system.

With the sarcastic features, we try to capture context incongruity. Incongruity occurs when nonverbal behavior contradicts a person's words: a text may contain half of its objects in a congruent, expected context, whereas the other half of the objects are embedded in incongruent contexts. This can be a major factor in cyberbullying detection, because the hidden nature of a sarcastic comment will not be detected by sentiment analysis precisely because of the context incongruity. We also consider pragmatic features like emojis, mentions, etc. while detecting the sarcastic nature of the source material.

For the syntactic features, beyond the insults we have identified from lists of insults, we also monitor the number of such bad words or insults present in a single sentence and accordingly map a density to it. We have also validated the badness of the entire sentence based on certain parameters like the density range. The emphasis of uppercase characters while making hate statements is also taken into consideration when generating syntactic features, because it can be regarded as an act of shouting or attacking over social media platforms. Similarly, the use of special characters, or patterns formed by them, is also brought into consideration while deriving syntactic features.

Semantic features can be used to determine the lexical relation which exists between two words in a language; the meaning of a word can be represented by its semantic features. Here we have tried to identify the trigrams and bigrams that occur while referencing something in the text. Usually, the negation of the sentence is considered, along with the mapping of the different pronouns that can be implicitly or explicitly used to refer to another individual while harassing someone over social media.

Social features refer to the social behavior of the victim or the bully itself. The post itself will not be sufficient to detect the nature of the text, so we have considered patterns in the behavior of bullies and identified a few features, such as the direct tagging of the victim while using hate speech. We also try to gain information regarding the context of the post based on the previous interactions between the bully and the victim, and profiling of the author can be done to discover their past interactions and involvement in similar malicious activities over social media platforms.

We propose a cyberbullying detection model based on transformers. Similar to RNNs, transformers can be used to solve a wide variety of NLP (Natural Language Processing) problems, like translation and text summarization, as they can take sequential data as input. A recent improvement on natural language tasks introduced BERT, a model published by researchers at Google AI Language. BERT
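The sentimental features described above can be sketched as a minimal lexicon-based scorer. This is only an illustration of the idea, not the trained sentiment scoring system the paper refers to, and the tiny word lists are invented examples:

```python
# Minimal lexicon-based sentiment scorer. The lexicons below are
# illustrative stand-ins; a real system would use a trained model
# or a full sentiment lexicon.
POSITIVE = {"good", "great", "kind", "love", "nice"}
NEGATIVE = {"bad", "awful", "hate", "stupid", "ugly"}

def sentiment_score(text):
    """Return a score in [-1, 1]: +1 all positive, -1 all negative."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.0          # no opinion words found: neutral
    return (pos - neg) / (pos + neg)

print(sentiment_score("You are so stupid and ugly!"))  # -1.0
```

A scorer of this shape can be calibrated against human annotations; the 80-85% inter-annotator agreement cited above is the practical ceiling for such calibration.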
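The syntactic features above (insult density, uppercase emphasis, special-character patterns) reduce to simple per-sentence statistics. The sketch below shows one possible encoding; the insult list is a stand-in for the curated lists mentioned in the text, and the exact thresholds are left to the classifier:

```python
def syntactic_features(sentence, insults):
    """Per-sentence statistics in the spirit of the syntactic features
    described above; the insult list and feature names are illustrative."""
    tokens = [t.strip(".,!?*").lower() for t in sentence.split()]
    insult_count = sum(t in insults for t in tokens)
    letters = [c for c in sentence if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    specials = sum(not c.isalnum() and not c.isspace() for c in sentence)
    return {
        "insult_density": insult_count / max(len(tokens), 1),
        "upper_ratio": upper_ratio,      # ~1.0 reads as shouting
        "special_chars": specials,       # masked insults like "id**t"
    }

INSULTS = {"idiot", "loser", "stupid"}   # stand-in for a curated list
f = syntactic_features("YOU ARE SUCH A LOSER", INSULTS)
# all-caps text yields upper_ratio 1.0; one insult in five tokens
# yields insult_density 0.2
```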
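The semantic features (bigrams/trigrams, negation, and pronoun mapping) can likewise be sketched with stdlib code. The cue-word sets here are hypothetical minimal examples, not the paper's actual lists:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def semantic_features(text):
    """Bigrams/trigrams plus simple negation and second-person pronoun
    cues: a rough sketch of the semantic features described above."""
    tokens = text.lower().split()
    return {
        "bigrams": ngrams(tokens, 2),
        "trigrams": ngrams(tokens, 3),
        "negated": any(t in {"not", "never", "no"} for t in tokens),
        "targets_person": any(t in {"you", "your", "u"} for t in tokens),
    }

f = semantic_features("you are not smart")
# bigrams: ('you','are'), ('are','not'), ('not','smart');
# negation plus a second-person pronoun flags a likely directed remark
```

Capturing the negation together with the n-gram context matters because "not smart" inverts the polarity that a unigram sentiment model would assign to "smart".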
References
1. M. Di Capua, E. Di Nardo and A. Petrosino, Unsupervised cyberbullying detection in social networks, ICPR, pp. 432-437, doi: 10.1109/ICPR.2016.7899672 (2016)
2. J. Yadav, D. Kumar and D. Chauhan, Cyberbullying Detection using Pre-Trained BERT Model, ICESC, pp. 1096-1100, doi: 10.1109/ICESC48915.2020.9155700 (2020)
3. R. R. Dalvi, S. Baliram Chavan and A. Halbe, Detecting A Twitter Cyberbullying Using Machine Learning, ICICCS, pp. 297-301, doi: 10.1109/ICICCS48265.2020.9120893 (2020)
4. Trana R.E., Gomez C.E., Adler R.F., Fighting Cyberbullying: An Analysis of Algorithms Used to Detect Harassing Text (2021)