Paper 2

The document presents a machine learning architecture for detecting cyberbullying on Twitter using a two-step multiclass classification method. It addresses the issue of class imbalance in cyberbullying detection and compares various traditional machine learning algorithms and text embedding methods. The study highlights the effectiveness of traditional ML models over deep learning approaches and emphasizes the need for active detection systems due to the high rate of unreported cyberbullying incidents.

Cyberbullying Detection in Twitter Using Machine Learning

Amirmohammad Shahbandegan, Lakshmi Preethi Kamak, Mohammad Ghadiri
Department of Computer Science, Lakehead University, Thunder Bay, Canada
1172613, 1160111, 1170979

Abstract—Cyberbullying can have serious legal consequences in Canada, including jail time and fines. Social media companies such as Twitter and Facebook provide resources and guides on cyberbullying but rely on passive reporting mechanisms. However, 90% of cyberbullying activities go unreported, which makes an active cyberbullying detection system crucial. Our proposed architecture detects cyberbullying with a two-step multiclass classification method that uses traditional machine learning algorithms on a balanced dataset distributed into six cyberbullying classes. Our model tackles both balanced and imbalanced classes in the dataset and outperforms the current ML and DNN baselines. This work experiments with multiple text embedding methods to find the most suitable strategy for detecting cyberbullying. Our results provide significant insights into the effectiveness of building architectures from traditional ML models rather than deep learning methods to address cyberbullying. We have released our models and code.

Index Terms—Cyberbullying Detection, Social Media, Machine Learning

I. INTRODUCTION

With the growth of social media platforms, cyberbullying is a significant concern among the younger population. Cyberbullying can have adverse effects on vulnerable people and is considered a severe threat. Cyberbullying can be defined as “the use of digital technology to inflict harm repeatedly or to bully” [1]. Statistics show that about 36% of people felt cyberbullied in their lifetime, 60% of teenagers experienced some cyberbullying, and 87% of young people have observed cyberbullying [2]. In Canada, cyberbullying can have serious legal consequences. Cyberbullies can face jail time, have their devices taken away, and may even have to pay their victims [3]. Many social media companies, such as Facebook, Twitter, and Instagram, have resources and guides on cyberbullying. Although social media companies rely on passive reporting mechanisms, 90% of cyberbullying activities go unreported [4]. Therefore, the presence of an active cyberbullying detection system is crucial. In this project we experiment with various classifiers in a proposed two-step classification architecture to achieve higher accuracy on the cyberbullying dataset while using traditional ML models rather than deep learning models.

II. LITERATURE REVIEW

A. SOSNet: A Graph Convolutional Network Approach to Fine-Grained Cyberbullying Detection

Wang et al. [6] developed a Graph Convolutional Neural Network (GCN) based model and used eight different tweet embedding methods with six different classification models as baselines to compare cyberbullying detection performance. They built the GCN model on a tweet graph generated from the cosine similarity of the tweets. They make a case for using the Dynamic Query Expansion (DQE) data mining technique in a novel way to combat severe class imbalance in their dataset curation process; the class distribution was 0.995% for age, 1.64% for ethnicity, 39.1% for gender, 11.7% for religion, and 46.6% for other. This imbalance significantly affects the training process because of the resulting bias towards the Gender and Other classes, discounting Age and Ethnicity.

To solve the class imbalance, they gathered more data via semi-supervised learning by augmenting current datasets, and integrated GetOldTweets3 for real-time updates and new data. By combining real-time queries with separate processes for the fine-grained classes, they built a labeled dataset in a semi-supervised fashion. The baseline evaluation metrics were test accuracy and F1-Score. Wang et al. [6] achieved decent accuracy and F1-Score with simple word embedding methods such as Bag of Words and TF-IDF combined with traditional classifiers such as decision trees.

Fig. 1. Classification architecture of Wang et al. [6]
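As a rough illustration of the tweet-graph idea attributed to Wang et al. [6] above, the sketch below connects two tweets whenever the cosine similarity of their vectors reaches a threshold. It is a minimal sketch under stated assumptions, not their implementation: the plain bag-of-words vectors, the 0.5 threshold, the example tweets, and the helper names `bow_vector`, `cosine_similarity`, and `build_tweet_graph` are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vector(text):
    """Count lowercase whitespace-separated tokens (assumed embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_tweet_graph(tweets, threshold=0.5):
    """Return edges (i, j, similarity) for pairs at or above the threshold.

    The threshold value is an assumption for illustration only.
    """
    vectors = [bow_vector(t) for t in tweets]
    edges = []
    for i, j in combinations(range(len(tweets)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        if sim >= threshold:
            edges.append((i, j, sim))
    return edges


# Made-up example tweets; only the first two share vocabulary.
tweets = [
    "you are so old and slow",
    "so old and so slow",
    "nice weather in thunder bay today",
]
edges = build_tweet_graph(tweets, threshold=0.5)
```

Here only the first two tweets end up connected, because the third shares no tokens with them. A dense embedding could be substituted for `bow_vector` without changing the graph-building loop.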
B. Am I Being Bullied on Social Media? An Ensemble Approach to Categorize Cyberbullying

Ahmed et al. [7] proposed a neural ensemble method of transformer-based architectures with an attention mechanism. Four transformer-based models are combined to achieve higher accuracy. Max voting and probability averaging are the two ensemble methods used in their work. They evaluate their model on the five-class FGCD dataset (40,000 tweets) to find the types of cyberbullying, and also with the ‘not cyberbullying (Not CB)’ class.

Ahmed et al. [7] claim that their probability-averaging ensemble model outperforms the best model of Wang et al. [6] by 1.22% in test accuracy and by 1.15% in F1-Score. Their proposed architecture can learn abstract features with the attention mechanism and performs better on these datasets than the feature-based approaches of Wang et al. [6].

Fig. 2. Classification architecture of Ahmed et al. [7]

C. Comparison with Previous Works

The proposed project is similar to the previously discussed studies [6], [7] in a few ways. We follow the baseline models and text embedding techniques of Wang et al. [6] for our implementation. Ahmed et al. [7] have also done a five-class classification; hence, we compare our results with theirs as the benchmark.

Our work differs from Wang et al. [6] in the approach to handling class imbalance. In the first step of the proposed algorithm, there is a class imbalance between the Cb and NotCb classes when identifying cyberbullying. We perform undersampling on the dataset for the first step using techniques such as random undersampling. In contrast to the previous literature, which has implemented deep learning and pre-trained language models, we use traditional ML methods with state-of-the-art embedding techniques.

1) Similarity:
• Same dataset
• Implementation of the baseline models and text embedding techniques is based on Wang et al. [6]
• We have compared our results with Ahmed et al. [7], who have done both five-class and six-class classification
2) Difference:
• Our work experiments with both five-class and six-class classification
• Traditional ML methods with a two-stage pipeline vs. deep learning implementation by fine-tuning language models (400M parameters)

III. DATASET

A. Data source

Wang et al. [6] developed the Fine-Grained Cyberbullying Dataset (FGCD), released in 2020 [5], by combining six different datasets and labelling and grouping the tweets that belong to the same classes. FGCD is a balanced dataset of about 48,000 Twitter comments distributed into six cyberbullying classes: ‘Age’, ‘Ethnicity’, ‘Gender’, ‘Religion’, ‘Other’, and ‘NotCb (not cyberbullying)’. To remove the class imbalance, they used Dynamic Query Expansion (DQE) and increased the number of samples of each class in a semi-supervised manner. They randomly sampled 8,000 tweets of each class from the different datasets and formed a balanced dataset of size 48,000.

B. Data description

The data has a simple structure with only two columns: tweet and label. It is a fully balanced dataset with almost 8k tweets in each class and an average of 24 words per tweet. The textual data needs to be converted to a feature vector before it can be used with a learning algorithm. There are no missing values in the FGCD dataset. Currently, the dataset has six balanced classes, but there is an imbalance between the Cb and NotCb tweets. We propose to apply oversampling techniques to the NotCb class by generating around 39,750 NotCb tweets before the binary classification step. We might also consider undersampling by removing tweets from the cyberbullying classes to reduce the majority and fix the imbalance.

IV. METHODOLOGY

A. Project Architecture

Our proposed architecture detects cyberbullying using a two-step multiclass classification method with machine learning classifiers such as XGBoost, SVM, and other traditional models. The Fine-Grained Cyberbullying Dataset (FGCD), developed by Wang et al., is a balanced dataset of about 48,000 Twitter comments distributed into six cyberbullying classes: ‘Age’, ‘Ethnicity’, ‘Gender’, ‘Religion’, ‘Other’, and ‘NotCb (not cyberbullying)’. The first step is a binary classification model that identifies cyberbullying in a tweet, with classes cyberbullying (Cb) and not cyberbullying (NotCb). The second step is a fine-grained multiclass classification that determines the characteristics of the target
from the remaining five cyberbullying classes.

Fig. 3. Our proposed two-step classification architecture

B. Environment
• Python 3
• Text processing: gensim, emoji, nltk, spacy, contractions, sentence transformers
• scikit-learn
• imblearn
• Google Colab environment - for model development purposes
• Compute Canada - for high-performance, parallel processing purposes

C. Text pre-processing

Text preprocessing is fundamental for natural language processing (NLP) tasks. It cleans and standardizes the text data and makes it readable by the model. Text data contains noise in various forms, such as emojis, punctuation, and mixed letter case.

The following preprocessing was applied to the tweet text.
• Stripped links, mentions, retweet flags, stop words, and punctuation.
• Kept hashtags.
• Removed extra whitespace, special characters, and numbers.
• Replaced emojis with a corresponding word.
• Expanded contractions and lowercased all text.
• Performed lemmatization only for the Bag of Words and TF-IDF embeddings.

D. Text embedding

A text embedding represents text as a vector that encodes meaning, such that words that are closer in the vector space are expected to be similar in meaning [12]. After the text preprocessing, each tweet is converted to a feature vector that keeps the semantics of the tweet. The following text embedding methods were used in our experimentation.
• Bag of Words (BOW) - BOW represents the text as a set of its words in which the frequency of occurrence of each word is used as a feature.
• TF-IDF - “TF-IDF, short for Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects a word’s importance in a document in a collection or corpus” [8].
• word2vec - The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text [9]. This study uses the Google News pre-trained version of this model.
• GloVe - GloVe obtains word representations in an unsupervised manner by training on aggregated global word-word co-occurrence statistics from a corpus [10].
• fastText - fastText is a word embedding method developed by Facebook Research. It works similarly to word2vec but generalizes better to unseen words.
• Sentence BERT (SBERT) - “SBERT is a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity” [11].

E. Binary Classification

The data is loaded from the embeddings package. Binary labels are generated for the cyberbullying and not cyberbullying classes and are marked True and False respectively. Because the data is balanced among six classes, five of which are cyberbullying classes, the binary model suffers from class imbalance. To overcome this problem, approaches such as the Near Miss algorithm and random undersampling were applied. Multiple classifiers were used in the experiment to find the classifier best suited to this problem. The classifiers were fitted to the data with default hyperparameters, and 5-fold cross-validation was applied to train and test the models. All the experiments are executed in parallel using the multiprocessing module.

The undersampling algorithms in the experiments are discussed below:
• Near Miss - This technique removes a datapoint from the majority class when two points in the distribution that belong to different classes are relatively close to each other, attempting to balance the distribution.
• Random Undersampling - In random undersampling, majority class instances are discarded at random until a more balanced distribution is reached.

F. Fine Grained Classification

The data is loaded from the embeddings package. Not-cyberbullying samples were removed from the dataset. Multiple classifiers were used to find the classifier best suited to this problem. 5-fold cross-validation was applied to train and test the models. The accuracy and F1 score were measured as the average over the 5 folds, and the results were saved.

The 5 classifiers in the experiments are discussed below:
• Logistic Regression (LR) - LR is a statistical method that uses a logistic function to model a binary dependent variable.
• K-Nearest Neighbors (KNN) - KNN is a non-parametric supervised learning algorithm where the function is only
approximated locally and all computation is deferred until function evaluation.
• Support Vector Machine (SVM) - SVMs, among the most robust prediction methods, are supervised learning models based on statistical learning frameworks. An SVM maps training examples to points in space so as to maximize the width of the gap between the two categories.
• eXtreme Gradient Boosting (XGBoost) - “XGBoost is an implementation of gradient boosted decision trees designed towards speed and performance enhancement”.
• Multi-layer Perceptron (MLP) - MLP is a class of neural networks with at least three layers: an input layer, a hidden layer, and an output layer. MLP uses non-linear activation functions and a supervised learning technique called backpropagation for training.

V. EXPERIMENTAL RESULTS

The results are organized into two main sections: individual classifiers and the final pipeline model. In the first stage, the best binary and multi-class models are established using exhaustive search. The results from this first set of experiments are then used to build the final pipeline that undertakes the 6-class classification problem.

A. Individual Classifiers

Multiple embeddings and classifiers are studied and experimented with to find the most suitable model for the binary and multiclass problems. Every possible combination of the following configurations is tested separately using 5-fold cross-validation, resulting in 450 different iterations (3 × 6 × 5 = 90 configurations, each with 5 folds) running concurrently.
• 3 models (5-class, binary without undersampling, and binary with undersampling)
• 6 text embedding methods
• 5 classifiers

Tables I and II show the F1 score and accuracy for the 5-class classifiers with different embeddings. It can be seen that the combination of the Bag of Words embedding and the XGBoost classifier achieves the best results among all of the candidate methods.

TABLE I
F1 SCORES FOR 5-CLASS MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        73.06  94.51  93.88  92.58  94.30  94.75
ft         78.64  88.17  89.58  92.26  92.44  91.59
glove      80.62  88.89  89.45  91.59  92.68  91.71
sbert      89.85  91.94  92.50  92.31  93.33  91.43
tfidf      36.93  94.17  94.33  92.46  94.30  94.64
w2v        NaN    89.31  89.47  92.29  93.08  91.79

TABLE II
ACCURACY SCORES FOR 5-CLASS MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        73.06  94.51  93.88  92.58  94.30  94.75
ft         78.64  88.17  89.58  92.26  92.44  91.59
glove      80.62  88.89  89.45  91.59  92.68  91.71
sbert      89.85  91.94  92.50  92.31  93.33  91.43
tfidf      36.93  94.17  94.33  92.46  94.30  94.64
w2v        NaN    89.31  89.47  92.29  93.08  91.79

TABLE III
F1 SCORES FOR BINARY MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        NaN    92.03  90.67  90.49  92.82  92.93
ft         NaN    91.74  91.86  91.67  92.16  90.62
glove      NaN    92.10  92.11  90.55  92.64  90.85
sbert      92.59  92.43  92.61  90.94  93.16  91.02
tfidf      NaN    92.63  91.75  90.10  92.27  92.81
w2v        NaN    91.88  91.93  90.81  92.40  90.58

TABLE IV
ACCURACY SCORES FOR BINARY MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        NaN    86.37  84.32  84.08  87.35  87.59
ft         NaN    85.32  85.53  85.72  86.05  83.92
glove      NaN    86.15  86.06  84.08  87.04  84.32
sbert      87.25  86.92  87.24  84.75  88.09  84.59
tfidf      NaN    87.09  85.90  83.41  86.47  87.40
w2v        NaN    85.71  85.72  84.50  86.63  83.86

TABLE V
F1 SCORES FOR BINARY WITH UNDERSAMPLING MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        67.36  86.25  84.83  84.10  85.86  86.17
ft         90.12  84.76  84.73  87.20  86.54  85.73
glove      90.26  85.20  85.18  86.77  86.46  85.61
sbert      87.91  86.73  86.70  86.44  85.98  85.44
tfidf      29.27  86.11  85.83  83.02  85.36  85.77
w2v        NaN    84.81  84.65  87.17  86.49  85.71

The results for the binary model without undersampling are shown in Tables III and IV, and the results for the same model with random undersampling can be seen in Tables V and VI. Comparing the results from these two binary models reveals that the best results are achieved without undersampling and that the best combination for this model is the Sentence BERT embedding and the SVM classifier.
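The count of 450 iterations in this subsection follows from crossing the 3 models, 6 embeddings, and 5 classifiers with the 5 cross-validation folds. The sketch below only enumerates that grid; it is an illustrative assumption about how the runs could be listed, not the authors' code. The model labels, the choice of five classifier names, and the `experiment_grid` helper are hypothetical, while the embedding abbreviations follow the table rows.

```python
from itertools import product

# Labels are illustrative; the embedding abbreviations follow the table rows.
MODELS = ["5-class", "binary", "binary-undersampled"]
EMBEDDINGS = ["bow", "tfidf", "w2v", "glove", "ft", "sbert"]
CLASSIFIERS = ["lr", "knn", "svm", "xgb", "mlp"]
N_FOLDS = 5


def experiment_grid():
    """List every (model, embedding, classifier, fold) combination."""
    return [
        {"model": m, "embedding": e, "classifier": c, "fold": f}
        for m, e, c, f in product(MODELS, EMBEDDINGS, CLASSIFIERS, range(N_FOLDS))
    ]


runs = experiment_grid()
configurations = {(r["model"], r["embedding"], r["classifier"]) for r in runs}
print(len(configurations), len(runs))  # 90 configurations, 450 fold-level runs
```

Each dictionary in `runs` describes one independent unit of work, which is what makes the grid straightforward to distribute across worker processes.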
TABLE VI
ACCURACY SCORES FOR BINARY WITH UNDERSAMPLING MODELS.

embedding  knn    lr     lsvm   mlp    svm    xgb
bow        58.24  79.39  77.23  76.12  79.10  79.56
ft         83.30  77.01  76.98  80.38  79.67  78.48
glove      83.74  77.56  77.55  79.73  79.61  78.31
sbert      81.29  79.87  79.85  79.40  79.11  78.07
tfidf      30.02  79.20  78.65  74.46  78.31  79.00
w2v        NaN    77.13  76.92  80.31  79.69  78.48

B. Pipeline Model

Following the results from the previous experiments, the final model incorporates a two-step pipeline in which the tweets are first sent to a binary model built with Sentence BERT and SVM. If the binary model flags the tweet as cyberbullying, the tweet is then sent to the second model in the pipeline, the multi-class model built with Bag of Words and XGBoost, to detect the type of cyberbullying. The final architecture of the pipeline model is depicted in Fig. 4. Table VII shows the class-wise F1 scores of our proposed model and compares them with previous work [7].

TABLE VII
CLASS-WISE F1 SCORES OF THE PIPELINE MODEL COMPARED WITH PREVIOUS WORK

Class      Ahmed et al. [7]  Our pipeline model
Age        0.93              0.98
Ethnicity  0.95              0.98
Gender     0.86              0.87
NotCB      0.56              0.55
Others     0.61              0.67
Religion   0.93              0.95
Average    0.80              0.83

Fig. 4. Our final best classification architecture

VI. DISCUSSION AND CONCLUSION

A. Challenges Faced

1) Long execution time: Running all 450 experiments in the first set was challenging because it required a huge amount of time to complete. To overcome the slow execution time of the models, the experiments were conducted in parallel on high-performance compute nodes in Compute Canada clusters.

2) Random undersampling: The use of random undersampling had a negative effect on the model’s performance. This can be due to the large loss of information that occurs with this undersampling method. It is possible to experiment with oversampling techniques in the future to overcome this issue.

3) Long execution time of the Near Miss algorithm: The authors’ initial plan was to experiment with Near Miss as a potential undersampling method to achieve better results in the binary model. However, due to the large number of samples and the high-dimensional embeddings used, the Near Miss algorithm was not able to find the candidate points in a timely manner. Therefore this method was not used, and experimental results for it are not available.

4) High memory consumption of the KNN classifier: The KNN classifier is a relatively memory-intensive algorithm compared to the other methods used in this work. Because these experiments were running concurrently on a single machine with hundreds of CPU cores and a limited amount of main memory, the results for this classifier are not complete; some of the experiments stopped when they ran out of memory.

B. Future Scope

The authors plan to improve this work in two ways. First, the effect of the hyper-parameters on the final pipeline model is not studied in this work. It is possible to run more experiments to find the optimal hyper-parameters for the binary and multi-class models and thereby increase the performance of the model.

Second, as mentioned before, the undersampling technique used in the binary classifier resulted in poor performance. One way to improve the quality of the binary classifier is to employ oversampling methods instead. There are two general strategies for generating synthetic text. One is to apply general methods such as SMOTE and ADASYN to the embedded text. The other is to generate new synthetic text using methods such as back-translation and word replacement and then generate new embeddings for the synthetic text. Either strategy could improve the performance of the binary model, and both are promising directions for future work.

ACKNOWLEDGEMENTS

REFERENCES

[1] E. Englander, E. Donnerstein, R. Kowalski, C. A. Lin, and K. Parti, “Defining cyberbullying,” Pediatrics, vol. 140, no. Supplement 2, pp. S148–S151, 2017.
[2] All the latest cyber bullying statistics and what they mean in 2022. BroadbandSearch.net. (n.d.). Retrieved April 7, 2022, from https://www.broadbandsearch.net/blog/cyber-bullying-statistics
[3] Canada, P. S. (2021, February 5). Government of Canada. Cyberbullying can be against the law - Canada.ca. Retrieved April 7, 2022, from https://www.canada.ca/en/public-safety-canada/campaigns/cyberbullying/cyberbullying-against-law.html
[4] Hatfield, H. (n.d.). Stop school bullying and cyberbullying. WebMD. Retrieved April 7, 2022, from https://www.webmd.com/parenting/features/prevent-cyberbullying-and-school-bullying
[5] J. Wang, K. Fu and C.-T. Lu, “Fine-grained balanced cyberbullying dataset”, 2020.
[6] J. Wang, K. Fu and C.-T. Lu, “SOSNet: A Graph Convolutional Network Approach to Fine-Grained Cyberbullying Detection,” 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 1699-1708, doi: 10.1109/BigData50022.2020.9378065.
[7] T. Ahmed, M. Kabir, S. Ivan, H. Mahmud and K. Hasan, “Am I Being Bullied on Social Media? An Ensemble Approach to Categorize Cyberbullying,” 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 2442-2453, doi: 10.1109/BigData52589.2021.9671594.
[8] Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.
[9] Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
[10] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global vectors for word representation.” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
[11] Jurafsky, Daniel; Martin, James H. (2000). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0-13-095069-7.
