2022 2nd Asian Conference on Innovation in Technology (ASIANCON)

Pune, India. Aug 26-28, 2022

Cyber Bullying Detection Using Machine Learning


K. Siddhartha, K. Raj Kumar, K. Jayanth Varma, M. Amogh, Mamatha Samson
ECE, GRIET, Hyderabad, India

978-1-6654-6851-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/ASIANCON55314.2022.9909201

Abstract—Cyberbullying has become a severe problem affecting children, teenagers, and young adults as a result of the increasing use of social media. Automatic detection of bullying communications in social media is now possible thanks to machine learning techniques, which could aid in the creation of a healthy and safe social media environment. One major issue in this important research area is robust and discriminative numerical representation learning of text messages. To address this challenge, we offer a new representation learning method in this study. The Semantic-Enhanced Marginalized Denoising Auto-Encoder (SMSDA) is a semantic enhancement of the popular deep learning model, the stacked denoising auto-encoder. The semantic extension is made up of semantic dropout noise and sparsity constraints, with the semantic dropout noise being the most important.

Index Terms—Cyberbullying, machine learning, supervised learning, learning algorithms, Naive Bayes classifier

I. INTRODUCTION

Social media is a platform that allows users to share anything they want, such as images, videos, and documents, and engage with others [1]. Social networking platforms are excellent resources for meeting new people. However, as social networking has grown in popularity, some have begun to exploit it in unlawful and immoral ways. Many teenagers these days share information online. Discussions with criminal thinkers through net-based networking media shed light on encounters, emotions, and concerns about the learning process. In a Symantec study, over a quarter of parents said their child had been involved in a cyberbullying episode that they were aware of [2]. Nowadays, social media is the most common platform for delivering hate speech. As a result, cyber-hate crime has skyrocketed in recent years. Even though social media has many upsides, its downsides cannot be ignored. Malevolent individuals use this medium to carry out unethical and deceptive acts in order to hurt others' feelings and harm their reputation. One of the most prominent recent social media challenges has been cyberbullying.

Cyberbullying, also known as cyber-harassment, is a type of bullying or harassment that takes place through the internet. Online bullying includes cyberbullying and cyber-harassment. Cyberbullying has become more widespread as the digital world has evolved and technology has advanced, particularly among teens. Direct messaging platforms are the most common platforms where bullying occurs. Cyberbullying or harassment refers to troubling internet users through electronic media. This bullying can have adverse effects on the mental health of the victims. In this regard, we propose a machine-learning-based cyberbullying detection model that can determine whether a communication is related to cyberbullying or not. In the proposed cyberbullying detection model, we tested several machine learning methods, including Naive Bayes, Support Vector Machines, Decision Tree, and Random Forest.

II. LITERATURE SURVEY

Despoina C. et al. [3] identified the challenges in detecting cyberbullying, such as the heterogeneity of users, the transient nature of the problem, the anonymity offered by social media, and multiple forms of bullying beyond abusive language. The authors considered user, textual, and network features to detect cyberbullying. Supervised machine learning algorithms were used to classify text as bully or non-bully. In [4], a text-based Convolutional Neural Network (CNN) using fastText word embeddings was built to identify toxic and abusive comments on social media platforms and classify them by toxicity level. The study concluded that fastText provided more accurate results when dealing with slang, jargon, typing mistakes, and short forms used in posts. The model performed well when the dataset was large enough to split into training and testing data. John M. et al. [5] used supervised machine learning algorithms to detect and prevent cyberbullying. Several classifiers were trained to identify bullying actions. When the proposed approach was evaluated on a cyberbullying dataset, the Neural Network performed best with an accuracy of 92.8, while the Support Vector Machine achieved 90.3. The Neural Network also outperformed other classifiers from similar work on the same dataset. Lu Cheng et al. [6] proposed XBully, a multimodal cyberbullying detection system. It was

based on network representation learning. This system took into account the various feature types, which were handled by identifying representative mode hotspots. It was then mapped into a heterogeneous network. During the social session interaction, different roles such as victims or bullies could be identified so that cyberbullying classification can be improved. Cynthia et al. [7] built a classifier to detect indications of cyberbullying on social media platforms, which identifies the different social roles involved in a cyberbullying interaction. The annotation scheme distinguished the roles of victim, bully, bystander-defender, and bystander-assistant. A linear support vector machine was used as the classifier. They demonstrated that the method can easily be applied to different languages. The experiments were performed on English and Dutch datasets.

III. METHODOLOGY

In this section, we discuss the bullying detection framework, which consists of two main parts: natural language processing and machine learning. In the initial step, datasets containing abusive words or texts are collected and prepared for the machine learning algorithms through natural language processing. We split the dataset into two parts, training and testing: 80 percent of the dataset is used for training the model and the remaining 20 percent is used for checking the accuracy of the trained models. The collected dataset is labelled with two categories, bully or not bully.

Natural Language Processing: Natural language processing (NLP) is the field of AI concerned with the processing and understanding of human language. NLP plays a vital role in text classification, spam filters, machine translation, and so on. Bullying texts or sentences on social media contain unnecessary words. Before applying the different machine learning techniques, we need to prepare the texts for the detection phase. This includes the removal of unnecessary characters and stop words. Open-source NLP libraries include Apache OpenNLP, the Natural Language Toolkit (NLTK), Stanford NLP, and MALLET.

IV. MACHINE LEARNING ALGORITHMS

There are several different ways to perform the same task of predicting whether a message is abusive or not. The machine has to be trained first; there are multiple ways to train it, and we can choose one of them based on accuracy. We find dozens of applications of machine learning which we use and interact with in our daily lives. We discuss the decision tree, random forest, naive Bayes, and support vector machine algorithms and propose the best algorithm.

A. Decision tree

A decision tree is a type of classification algorithm that comes under the supervised learning technique. Classification is the process of dividing a dataset into different categories by adding labels; in other words, it is a technique of categorizing observations into different categories. A decision tree is a graphical representation of all the possible solutions to a decision. In this project of detecting abusive messages, we create a decision tree starting with the root node. A decision tree is a tree-shaped diagram used to determine a course of action. Some of the important terms used in decision trees are as follows:
• Entropy: the measure of randomness or unpredictability in the dataset.
• Information gain: the measure of the decrease in entropy after the dataset is split.
• Leaf node: a leaf node carries the classification or the decision.
• Decision node: a decision node has two or more branches.
• Root node: the topmost decision node is known as the root node.

B. Random Forest

This algorithm builds multiple decision trees and merges them together to get a more accurate and stable prediction. The random forest takes the decision of the majority of the trees as its final decision. For instance, if we have two classes X and Y and most of the decision trees predict the class label Y for an instance, then the random forest will decide on the label Y as follows:

$$f(x) = \text{majority vote of all trees} = Y \tag{1}$$

C. Naive Bayes

The naive Bayes classifier works on the principle of conditional probability as given by Bayes' theorem [14]. Bayes' theorem gives the conditional probability of an event A given that another event B has occurred. It is given by

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3}$$

where

$$P(A \mid B) = \text{conditional probability of } A \text{ given } B \tag{4}$$
$$P(A) = \text{probability of event } A \tag{5}$$
$$P(B) = \text{probability of event } B \tag{6}$$
$$P(B \mid A) = \text{conditional probability of } B \text{ given } A \tag{7}$$

D. Support vector machine

The support vector machine is a supervised learning algorithm that looks at data and sorts it into one of two categories. Like a decision tree, this algorithm can be applied to both regression and classification.
To implement it, we follow these steps:
i) Loading the data
ii) Exploring the data
iii) Splitting the data
iv) Generating the model
v) Model evaluation

The following measures are used to evaluate the performance.
a) Precision, accuracy and recall:
Suppose we have a data set of texts in which 6 are bullying

texts and 4 are not bullying texts. This is a binary classification: bullying versus not bullying text. Bullying texts that are correctly predicted as bullying are called true positives (TP), and non-bullying texts that are wrongly predicted as bullying are called false positives (FP). Let us assume TP = 4 and FP = 3, with true negatives (TN) = 1 and false negatives (FN) = 2. A false negative here is a bullying text that is classified as not bullying. With 5 correct predictions (TP + TN) out of 10, the accuracy is 5/10 = 0.5.
Precision (Pr) is given as follows:

$$\mathrm{Pr} = \frac{TP}{TP + FP} \tag{8}$$

With TP = 4 and FP = 3, the precision is 4/7 ≈ 0.57. Recall is the proportion of actual positives that are correctly identified:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

With 6 actual bullying texts and 4 true positives, the recall is 4/6 ≈ 0.67. Precision takes the predictions as its base, whereas recall takes the ground truth as its base.
b) F-Measure: The balanced F-measure (also called the F1 measure) is defined as the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Rc}}{\mathrm{Pr} + \mathrm{Rc}} \tag{10}$$

where Pr is the precision and Rc is the recall. In the above example, precision and recall were found to be 0.57 and 0.67 respectively. Substituting these values into the above equation, we get an F-measure of 0.615.

V. DATASET

The data set was collected from kaggle.com [8]. Its entries are labelled 0 and 1, meaning not abusive and abusive respectively, as in the sample shown in Table II.

TABLE II
SAMPLE DATA

S.No  Tweet                Label
1     How are you doing ?  0
2     Holy Shit !          1
3     Iam Fine             0
4     Bull Shit !          1
5     Free gift for you    0

A. DATA PREPROCESSING

The collected data is not the best fit for the model as it stands, because it may contain unwanted characters, so the data must be cleaned and processed. Unwanted items are detected and removed using a Python script. Stop-words, IP addresses, usernames, new line characters, punctuation, and quotation marks are just a few examples.
The following techniques are used for data preprocessing:
• Punctuation: All punctuation marks are removed.
• Stopwords: Stopwords are commonly used words that occur repeatedly, such as articles, prepositions, and so on. Stopwords are removed from each text.
• Lemmatization: A word can take on numerous inflected forms, such as verb forms and singular/plural forms. The lemma is the base form of a word; for example, go and gone are forms of the lemma go. The technique of grouping these inflected forms together under their lemma is known as lemmatization. Lemmatization is applied to every comment.

The types of texts are:
• Non-bullying text: These are messages or texts which are not offensive. For instance, "Had your dinner?" is decent, can be classified as a non-bullying message, and is considered a positive comment.
• Bullying text: The anonymous electronic publishing of defamatory messages about a person. For example, "Bull shit" can be classified as a bullying text and is considered a negative comment.

Python machine learning packages are used to implement the bullying detection methods. The following measures are used to evaluate the performance.
The classification results are summarised as a confusion matrix, as in Table I. The true-positive count in the upper-left cell is the number of samples predicted positive that were actually positive. The true-negative count in the lower-right cell is the number of samples predicted negative that were actually negative. False negatives are the samples that were predicted negative although they were actually positive. False positives are the samples that were predicted positive although they were actually negative.

TABLE I
THE CONFUSION MATRIX

                              Condition Positive  Condition Negative
Predicted Condition Positive  True Positive       False Positive
Predicted Condition Negative  False Negative      True Negative

$$\mathrm{Precision} = \frac{\sum \text{True Positive}}{\sum \text{Predicted Condition Positive}} \tag{11}$$

$$\mathrm{Recall} = \frac{\sum \text{True Positive}}{\sum \text{Condition Positive}} \tag{12}$$

In the Receiver Operating Characteristic curve (ROC curve), the true-positive rate is plotted against the false-positive rate for several potential cut-points of a diagnostic test. The ROC curve reveals the trade-off between sensitivity and specificity (any increase in sensitivity leads to a decrease in specificity). The more closely the curve follows the left border and then the top border of the ROC space, the more accurate the test.

VI. PROPOSED METHODOLOGY

We chose the naive Bayes algorithm for this project for the following reasons:
i) It is not sensitive to irrelevant features.
ii) It is easy and simple to implement.
iii) It needs less training data.
iv) It is highly scalable with the number of predictors and data points.
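To make the choice of naive Bayes concrete, the following is a minimal sketch of a word-count (multinomial) naive Bayes classifier with add-one smoothing, trained on the five sample tweets of Table II. The paper does not state which naive Bayes variant, features, or library it used, so the tokenizer, the smoothing, and the helper names `tokenize`, `train`, and `predict` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Sample tweets and labels copied from Table II (1 = bullying, 0 = not bullying).
SAMPLES = [
    ("How are you doing ?", 0),
    ("Holy Shit !", 1),
    ("Iam Fine", 0),
    ("Bull Shit !", 1),
    ("Free gift for you", 0),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-posterior under Bayes' theorem."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SAMPLES)
print(predict("holy shit", model))    # 1 (bullying)
print(predict("how are you", model))  # 0 (not bullying)
```

Five tweets are far too few to train a usable model; the sketch only shows how the class prior and the per-word likelihoods combine into a prediction.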

TABLE III
OUTCOMES

              SVM    RF     NB     DT
Precision     0.91   0.92   0.85   0.93
Recall        0.82   0.76   0.79   0.76
F-score       0.86   0.83   0.82   0.78
Accuracy (%)  86.5   82.3   83.2   82.06

VII. RESULT
We implemented and analysed various machine learning algorithms, namely Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Naive Bayes (NB). Naive Bayes was found to be the best algorithm because of the features of this algorithm mentioned above. Table III shows the precision, recall, F-score, and accuracy for all four algorithms.
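As a worked check of the evaluation measures, the sketch below applies the precision, recall, F-measure, and accuracy formulas to the counts of the worked example (TP = 4, FP = 3, TN = 1, FN = 2) and then recomputes the F-score of each classifier from the precision and recall reported in Table III. The function names are illustrative and are not taken from the paper.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def f_measure(pr, rc):
    """Harmonic mean of precision and recall."""
    return 2 * pr * rc / (pr + rc)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the worked example: TP = 4, FP = 3, TN = 1, FN = 2.
pr = precision(4, 3)
rc = recall(4, 2)
print(round(pr, 2), round(rc, 2), round(f_measure(pr, rc), 3), accuracy(4, 1, 3, 2))
# prints: 0.57 0.67 0.615 0.5

# Precision and recall as reported in Table III.
table_iii = {
    "SVM": (0.91, 0.82),
    "RF": (0.92, 0.76),
    "NB": (0.85, 0.79),
    "DT": (0.93, 0.76),
}
recomputed = {name: round(f_measure(p, r), 2) for name, (p, r) in table_iii.items()}
print(recomputed)
# prints: {'SVM': 0.86, 'RF': 0.83, 'NB': 0.82, 'DT': 0.84}
```

For SVM, RF, and NB the recomputed values agree with the F-scores listed in Table III. For DT the harmonic mean of the rounded precision and recall is about 0.84, whereas the table lists 0.78.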
VIII. CONCLUSION
We employed a language-based technique to detect cyberbullying in this paper. By recording the percentage of swear and insult words within a post, we were able to correctly identify 83.2 percent of the messages that contained cyberbullying in a sample of labelled Twitter data. Our findings show that our features are capable of detecting cyberbullying in Twitter posts, but there is still potential for improvement in this crucial application of machine learning to web data.
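The feature mentioned above, the percentage of swear and insult words in a post, can be sketched as follows. The paper does not list its word set or its tokenization, so the lexicon `INSULT_WORDS` and the function `insult_ratio` are hypothetical and serve only to illustrate the idea.

```python
import re

# Hypothetical mini-lexicon for illustration only; the paper does not give its word list.
INSULT_WORDS = {"shit", "idiot", "stupid"}


def insult_ratio(post, lexicon=INSULT_WORDS):
    """Return the fraction of tokens in `post` that appear in `lexicon`."""
    tokens = re.findall(r"[a-z']+", post.lower())
    if not tokens:
        return 0.0  # an empty post contains no insult words
    return sum(tok in lexicon for tok in tokens) / len(tokens)


print(insult_ratio("Holy Shit !"))          # 0.5 (1 of 2 tokens)
print(insult_ratio("How are you doing ?"))  # 0.0 (0 of 4 tokens)
```

A ratio like this could be used as one numeric input to any of the classifiers discussed earlier, alongside other text features.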
REFERENCES
[1] C. Fuchs, Social media: A critical introduction. Sage, 2017.
[2] D. Poeter. (2011) Study: A Quarter of Parents Say Their Child Involved in Cyberbullying. pcmag.com. [Online]. Available: http://www.pcmag.com/article2/0,2817,2388540,00.asp
[3] Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Athena Vakali, Nicolas Kourtellis, "Detecting Cyberbullying and Cyberaggression in Social Media", ACM Transactions on the Web, Vol. 13, No. 3, Article 17, October 2019.
[4] Suresh Mestry, Hargun Singh, Roshan Chauhan, Vishal Bisht, Kaushik Tiwari, "Automation in Social Networking Comments With the Help of Robust fast Text and CNN", 1st International Conference on Innovations in Information and Communication Technology (ICIIACT), June 2019.
[5] John Hani Mounir, Mohamed Nashaat, Mostafaa Ahmed, Eslam A. Amer, "Social Media Cyberbullying Detection using Machine Learning", International Journal of Advanced Computer Science and Applications, Vol. 10, No. 5, January 2019.
[6] Lu Cheng, Jundong Li, Yasin N. Silva, Deborah L. Hall, Huan Liu, "XBully: Cyberbullying Detection within a Multi-Modal Context", Proceedings of the 12th ACM International Conference on Web Search and Data Mining, WSDM, January 2019, pp. 339-347.
[7] Cynthia Van Hee, Gilles Jacobs, Chris Emmery, Bart Desmet, Els Lefever, Ben Verhoeven, Guy De Pauw, Walter Daelemans, Véronique Hoste, "Automatic detection of cyberbullying in social media text", PLOS ONE, October 2018.
[8] Kaggle link for the classification data set: https://www.kaggle.com/datasets/dataturks/dataset-fordetection-of-cybertrolls
