Cyber Bullying Detection Using Machine Learning
Abstract—Cyberbullying has evolved into a severe problem hurting children, teenagers, and young adults as a result of the increasing use of social media. Automatic detection of bullying communications in social media is now possible thanks to machine learning techniques, which could aid in the creation of a healthy and safe social media environment. One major issue in this important research area is robust and discriminative numerical representation learning of text messages. To address this challenge, we offer a new representation learning method in this study. The Semantic-Enhanced Marginalized Denoising Auto-Encoder (SMSDA) is a semantic enhancement of the popular deep learning model, the stacked denoising auto-encoder. The semantic extension is made up of semantic dropout noise and sparsity constraints, with the semantic dropout noise being the most important.
Index Terms—Cyberbullying, Machine learning, Supervised learning, Learning algorithms, Naive Bayes classifier
I. INTRODUCTION
Social media is a platform that allows users to share anything they want, such as images, videos, and documents, and engage with others [1]. Social networking platforms are excellent resources for meeting new people. However, as social networking has grown in popularity, some have begun to exploit it in unlawful and immoral ways. Many teenagers these days share information online. Discussions with ill-intentioned users over social networking media expose their experiences, emotions, and concerns about the learning process. In a Symantec study, over a quarter of parents said their child had been involved in a cyberbullying episode that they were aware of [2]. Nowadays, social media is the most common platform for delivering hate speech. As a result, cyber-hate crime has skyrocketed in recent years. Even though social media has many upsides, its downsides cannot be overlooked. Malevolent individuals utilise this medium to carry out unethical and deceptive acts in order to hurt others' feelings and harm their reputation. One of the most recent prominent social media challenges has been cyberbullying. Cyberbullying, also known as cyber-harassment, is a type of bullying or harassment that takes place through the internet. Online bullying includes cyberbullying and cyber-harassment. Cyberbullying has become more widespread as the digital world has evolved and technology has advanced, particularly among teens. Direct messaging platforms are the most common platforms where bullying occurs. Cyberbullying or harassment refers to troubling internet users through electronic media. This bullying can have adverse effects on the mental health of the victims. In this regard, we propose a machine-learning-based cyberbullying detection model that can determine whether a communication is related to cyberbullying or not. In the proposed cyberbullying detection model, we tested several machine learning methods, including Naive Bayes, Support Vector Machines, Decision Tree, and Random Forest.
II. LITERATURE SURVEY
Despoina C. et al. [3] identified the challenges in detecting cyberbullying, such as the heterogeneity of users, the transient nature of the problem, the anonymity offered by social media, and the multiple forms of bullying beyond abusive language. The authors considered user, textual, and network features to detect cyberbullying. Supervised machine learning algorithms were used to classify the text as bully or non-bully. In [4], a text-based Convolutional Neural Network (CNN) using fastText word embeddings was built to identify toxic and abusive comments on social media platforms and to classify them by toxicity level. The study concluded that fastText gave more accurate results when dealing with slang, jargon, typing mistakes, and short forms used in posts. The model performed best when the dataset was large enough to split into training and testing data. John M. et al. [5] used supervised machine learning algorithms to detect and prevent cyberbullying. Several classifiers were trained to identify bullying actions. When the proposed approach was evaluated on a cyberbullying dataset, the Neural Network performed better, achieving an accuracy of 92.8%, while the Support Vector Machine achieved 90.3%. The Neural Network also outperformed other classifiers from similar work on the same dataset. Lu Cheng et al. [6] proposed XBully, a multimodal cyberbullying detection system. It was
Authorized licensed use limited to: J.R.D. Tata Memorial Library Indian Institute of Science Bengaluru. Downloaded on August 11,2024 at 06:28:30 UTC from IEEE Xplore. Restrictions apply.
texts and 4 are not bullying texts. This is a binary classification problem, i.e., bullying versus non-bullying text. A bullying text that is correctly predicted as bullying is a true positive, whereas a non-bullying text that is wrongly predicted as bullying is a false positive. Let us assume 4 true positives (TP), 3 false positives (FP), 1 true negative (TN), and 2 false negatives (FN), say. The negative class here is the non-bullying text, so a false positive is a non-bullying text classified as bullying text. If 5 of the 10 predictions are correct, the accuracy is 5/10 = 0.5.
Precision (Pr) is given as follows:

$$Pr = \frac{TP}{TP + FP} \tag{8}$$

Recall is the proportion of actual positive samples that are correctly identified:

$$Recall = \frac{TP}{TP + FN} \tag{9}$$

Let the number of actual positive samples be 6 and the number of true positives be 4; the recall is then 4/6 ≈ 0.67. Precision takes the predictions as its base, whereas recall takes the ground truth as its base.
b) F-Measure: The balanced F-measure (also called the F1 measure) is defined as the harmonic mean of precision and recall.

$$F = \frac{2 \cdot Pr \cdot Pc}{Pr + Pc} \tag{10}$$

where Pc denotes the recall of Eq. (9). In the above example, precision and recall were found to be 0.57 and 0.67 respectively. Substituting these values into the above equation gives an F-measure of 0.615.

V. DATASET
The dataset was collected from kaggle.com [8], and its samples are labelled 0 and 1, meaning not abusive and abusive respectively.

TABLE II
SAMPLE DATA
S.No   Tweet                  Label
1      How are you doing ?    0
2      Holy Shit !            1
3      Iam Fine               0
4      Bull Shit !            1
5      Free gift for you      0

A. DATA PREPROCESSING
The collected data is not immediately suitable for the model because it may contain unwanted characters, so the data must be cleaned and preprocessed. Unwanted characters are detected and removed using a Python script. Stop-words, IP addresses, usernames, newline characters, punctuation, and quotation marks are just a few examples.
The following techniques are used for data preprocessing:
• Punctuation: All punctuation marks are removed.
• Stopwords: Stopwords are commonly used words that occur repeatedly, such as articles and prepositions. Stopwords are therefore deleted from each message.
• Lemmatization: A lemma is the base form of a word, which can take on numerous inflected forms, such as verb forms and singular/plural forms. The words go, gone, and going, for example, share the lemma go. The technique of grouping these inflected forms under their lemma is known as lemmatization. Lemmatization is applied to every message.
The types of texts are:
• Non-bullying text: These are messages or texts that are not offensive. For instance, "Had your dinner?" is decent, can be classified as a non-bullying message, and is considered a positive comment.
• Bullying text: The anonymous electronic publishing of defamatory messages about a person. For example, "Bull shit" can be classified as a bullying text and is considered a negative comment.
Python machine learning packages are used to implement the bullying detection methods. The following measures are used to evaluate the performance.
The classification results are shown in Table I as a confusion matrix. The true-positive cell in the upper left corner is the number of samples that are actually positive and were predicted as positive. The true-negative cell in the lower right corner is the number of samples that were correctly labelled as negative. False negative indicates the number of samples that were predicted as negative although they were actually positive. False positive refers to the number of samples that were predicted as positive when they were not, as in Table I.

TABLE I
THE CONFUSION MATRIX
                               Condition Positive   Condition Negative
Predicted Condition Positive   True Positive        False Positive
Predicted Condition Negative   False Negative       True Negative

$$Precision = \frac{\sum \text{True positive}}{\sum \text{Predicted condition positive}} \tag{11}$$

$$Recall = \frac{\sum \text{True positive}}{\sum \text{Condition positive}} \tag{12}$$

In the Receiver Operating Characteristic curve (ROC curve), the true-positive rate is plotted against the false-positive rate for several potential diagnostic test cut-points. The ROC curve reveals the trade-off between sensitivity and specificity (any increase in sensitivity leads to a decrease in specificity). The more closely the curve follows the left border and then the top border of the ROC space, the more accurate the test.

VI. PROPOSED METHODOLOGY
We chose the Naive Bayes algorithm for this project for the following reasons:
i) It is not sensitive to irrelevant features.
ii) It is very easy and simple to implement.
iii) It needs less training data.
iv) It is highly scalable with the number of predictors and data points.
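The preprocessing steps and the Naive Bayes choice described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the class name `NaiveBayes`, and the function `preprocess` are assumptions made for illustration, lemmatization is left out because it needs a linguistic resource, and the training data are the five sample tweets of Table II. The classifier is a plain multinomial Naive Bayes with add-one (Laplace) smoothing written with the Python standard library only.

```python
import math
import string
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller list).
STOPWORDS = {"a", "an", "the", "are", "is", "for", "you", "how", "i", "am"}

# The five sample tweets of Table II, with label 1 = bullying and 0 = non-bullying.
TWEETS = ["How are you doing ?", "Holy Shit !", "Iam Fine",
          "Bull Shit !", "Free gift for you"]
LABELS = [0, 1, 0, 1, 0]


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop-words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.word_counts = defaultdict(Counter)   # token frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(preprocess(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Tokens never seen during training are ignored.
        tokens = [tok for tok in preprocess(text) if tok in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                # Smoothed log likelihood of the token given the class.
                likelihood = (self.word_counts[label][tok] + 1) / (total_words + len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TWEETS, LABELS)
print(model.predict("That is bull shit"))
print(model.predict("I am fine today"))
```

With only five training tweets the model is a toy, but it shows why Naive Bayes is simple to implement: training is counting, and prediction is a sum of log probabilities per class.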
TABLE III
OUTCOMES
               SVM     RF      NB      DT
Precision      0.91    0.92    0.85    0.93
Recall         0.82    0.76    0.79    0.76
F-score        0.86    0.83    0.82    0.78
Accuracy (%)   86.5    82.3    83.2    82.06
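The four measures reported in Table III are all derived from confusion-matrix counts. As a minimal sketch, the function below applies the definitions of Eqs. (8) to (10) to the illustrative counts of the evaluation example (TP = 4, FP = 3, FN = 2, TN = 1). The function name `classification_metrics` is hypothetical, and the counts are the example values, not the counts behind Table III.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure, and accuracy from confusion counts."""
    precision = tp / (tp + fp)                    # Eq. (8)
    recall = tp / (tp + fn)                       # Eq. (9)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (10)
    accuracy = (tp + tn) / (tp + fp + fn + tn)    # correct predictions over all predictions
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Counts from the worked example: 4 TP, 3 FP, 2 FN, 1 TN.
metrics = classification_metrics(tp=4, fp=3, fn=2, tn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes at least one predicted positive and one actual positive; otherwise the divisions are undefined and would need explicit handling.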
VII. RESULT
We implemented and analysed several machine learning algorithms, namely Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Naive Bayes (NB). We found Naive Bayes to be the most suitable algorithm because of the properties listed in Section VI. Table III reports the precision, recall, F-score, and accuracy of all four algorithms.
VIII. CONCLUSION
We employed a language-based technique to detect cyberbullying in this paper. By recording the percentage of swear and insult words inside a post, we were able to correctly identify 83.2 percent of the messages that contained cyberbullying in a sample of labelled Twitter data. Our findings show that our features are capable of detecting cyberbullying in Twitter posts, but there is still potential for improvement in this vital application of machine learning to web data.
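The feature mentioned above, the percentage of swear and insult words inside a post, could be computed along the following lines. This is a sketch under stated assumptions: the word set `INSULT_WORDS` and the function name `insult_word_percentage` are hypothetical, since the paper does not list the lexicon it used.

```python
import string

# Hypothetical lexicon for illustration only; the actual word list is not given in the paper.
INSULT_WORDS = {"shit", "idiot", "stupid", "loser"}


def insult_word_percentage(post):
    """Return the percentage of tokens in the post that appear in the lexicon."""
    cleaned = post.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    if not tokens:
        return 0.0  # an empty post contains no insult words
    hits = sum(1 for tok in tokens if tok in INSULT_WORDS)
    return 100.0 * hits / len(tokens)


print(insult_word_percentage("Bull Shit !"))
print(insult_word_percentage("How are you doing ?"))
```

A value like this can be used as one numerical feature per post, alongside word counts, when training a classifier.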
REFERENCES
[1] C. Fuchs, Social media: A critical introduction. Sage, 2017.
[2] D. Poeter. (2011) Study: A Quarter of Parents Say Their
Child Involved in Cyberbullying. pcmag.com. [Online]. Available:
http://www.pcmag.com/article2/0,2817,2388540,00.asp
[3] Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Emiliano
De Cristofaro, Gianluca Stringhini, Athena Vakali, Nicolas Kourtellis,
“Detecting Cyberbullying and Cyberaggression in Social Media”, ACM
Transactions on the Web, Vol. 13, No. 3, Article 17, October 2019.
[4] Suresh Mestry, Hargun Singh, Roshan Chauhan, Vishal Bisht, Kaushik
Tiwari, “Automation in Social Networking Comments With the Help of
Robust fastText and CNN”, 1st International Conference on Innovations
in Information and Communication Technology (ICIIACT), June 2019.
[5] John Hani Mounir, Mohamed Nashaat, Mostafaa Ahmed, Eslam A.
Amer, “Social Media Cyberbullying Detection using Machine Learn-
ing”, International Journal of Advanced Computer Science and Appli-
cations, Vol. 10, No. 5, January 2019.
[6] Lu Cheng, Jundong Li, Yasin N. Silva, Deborah L. Hall, Huan Liu,
“XBully: Cyberbullying Detection within a Multi-Modal Context”,
Proceedings of the 12th ACM International Conference on Web Search
and Data Mining, WSDM, January 2019, pp. 339-347.
[7] Cynthia Van Hee, Gilles Jacobs, Chris Emmery, Bart Desmet,
Els Lefever, Ben Verhoeven, Guy De Pauw, Walter Daelemans,
Véronique Hoste, “Automatic detection of cyberbullying in social media
text”, PLOS ONE, October 2018.