
Leveraging Tweet Data for Automated Cyberbullying

Detection: A Machine Learning Approach


PROJECT GUIDE: MAYANK GUPTA, Assistant Professor, Department of Computer Science and Engineering, SRMIST NCR Campus, [email protected]

1 KARAN RANA, 2 SUHAIL SAIFI, 3 MOHD. ZUFAR HASAN ALVI, 4 SOHAIL
1 [email protected], Student, CSE, SRMIST NCR Campus
2 [email protected], Student, CSE, SRMIST NCR Campus
3 [email protected], Student, CSE, SRMIST NCR Campus
4 [email protected], Student, CSE, SRMIST NCR Campus

ABSTRACT

The proliferation of social media platforms has led to a significant rise in cyberbullying incidents,
which poses serious challenges for online safety and mental well-being. This paper presents a
comprehensive study on leveraging tweet data for automated cyberbullying detection through
advanced machine learning techniques. We propose a novel framework that employs natural
language processing (NLP) and machine learning algorithms to identify and classify
cyberbullying content within Twitter data. The framework integrates various feature extraction
methods and classification models to enhance detection accuracy. Our experimental results
demonstrate the effectiveness of the proposed approach, achieving high precision and recall
rates in distinguishing between abusive and non-abusive tweets. This research contributes to
the development of automated tools for monitoring and mitigating cyberbullying on social media
platforms, offering insights into the potential for improved online safety through technological
interventions.

I. INTRODUCTION

The pervasive influence of social media platforms has revolutionized communication, yet it has also given rise to significant challenges, among which cyberbullying is a prominent concern. Cyberbullying, characterized by the use of digital platforms to inflict psychological harm on individuals, represents a growing threat that impacts users' mental health and well-being. The anonymous and often unregulated nature of online interactions exacerbates the difficulties in identifying and mitigating such harmful behaviors.

In the digital age, social media platforms have become integral to personal and professional communication, providing unprecedented opportunities for individuals to connect, share, and express their thoughts. Twitter, a widely used microblogging service, exemplifies this shift with its real-time, concise, and dynamic nature. While Twitter facilitates positive interactions and community building, it also serves as a venue for detrimental behaviors, including cyberbullying, a phenomenon with significant implications for mental health and social harmony.

Cyberbullying, defined as the use of electronic communication to bully a person by sending threatening, intimidating, or malicious messages, represents a growing concern in the digital era. Unlike traditional bullying, cyberbullying operates in a virtual environment where the perpetrators can remain anonymous and the victims can experience harassment at any time and from any location. The psychological impact of cyberbullying can be profound, leading to issues such as anxiety, depression, and social withdrawal. The anonymity and scale of social media platforms like Twitter exacerbate these issues, making it challenging to identify and address cyberbullying effectively.

The sheer volume of content generated on Twitter, with over 500 million tweets posted daily, presents a significant challenge for manual detection of cyberbullying. Traditional methods of identifying harmful content, such as human moderation and manual reporting, are insufficient for managing the vast amount of data and the speed at which content is produced. This limitation highlights the need for automated solutions that can efficiently process and analyze large datasets to detect instances of cyberbullying in real time.

Recent advancements in machine learning (ML) and natural language processing (NLP) offer promising approaches for addressing this challenge. Machine learning, a subset of artificial intelligence, involves the development of algorithms that can learn from and make predictions or decisions based on data. When applied to textual data, ML algorithms can identify patterns and anomalies that may indicate cyberbullying. Natural language processing, which enables computers to understand and interpret human language, further enhances these algorithms by providing the ability to analyze the context, sentiment, and intent behind textual content.

The integration of ML and NLP techniques into the automated detection of cyberbullying involves several critical components. Data preprocessing is a foundational step, involving the cleaning and normalization of tweet data to prepare it for analysis. Feature extraction, which includes techniques such as word embeddings and sentiment analysis, plays a crucial role in transforming raw text into meaningful inputs for machine learning models. The choice of algorithms, such as support vector machines, random forests, or deep learning models, impacts the accuracy and efficiency of detection systems. Evaluating these models requires robust metrics and validation methods to ensure that they can generalize well to new and diverse datasets.

This research paper aims to explore the potential of leveraging tweet data for automated cyberbullying detection through a comprehensive machine learning approach. The study begins with an overview of the theoretical framework underpinning ML and NLP techniques, followed by a detailed examination of the methods employed in preprocessing and feature extraction. The core of the research involves developing and evaluating various machine learning models to assess their effectiveness in detecting cyberbullying. The evaluation considers factors such as precision, recall, and F1-score, as well as the ability of the models to handle the inherent variability and complexity of natural language.

In addition to the technical aspects, the paper addresses the challenges associated with automated detection systems. These include dealing with the evolving nature of language, the risk of false positives and negatives, and the ethical considerations surrounding privacy and data
security. The research also explores potential strategies for improving detection accuracy and system robustness, such as incorporating contextual information and leveraging ensemble methods.

By providing a detailed analysis of these methodologies and challenges, this research aims to contribute to the development of effective and scalable solutions for combating cyberbullying on social media platforms. The goal is to enhance the ability of automated systems to identify and mitigate harmful behavior, thereby fostering safer and more supportive online environments. Through this study, we seek to advance the field of cyberbullying detection and offer practical insights for improving the well-being of social media users.

As social media continues to evolve, so too must the strategies and technologies designed to address its associated risks. This research underscores the importance of advancing automated systems for cyberbullying detection, emphasizing the need for ongoing innovation and refinement in machine learning and natural language processing techniques. The findings of this study are anticipated to provide valuable insights not only into the effectiveness of current methodologies but also into potential areas for future research and development. By enhancing the capabilities of automated detection systems, we aim to contribute to a safer digital environment where individuals can engage in online interactions free from the fear of harassment and abuse. Ultimately, the research aspires to support broader efforts to foster a more respectful and empathetic online community, benefiting both individuals and society at large.

LITERATURE REVIEW

Cynthia Van Hee and Gilles Jacobs [4] proposed a model for automatic cyberbullying detection in social media text by modeling posts written by bullies and victims according to standards of online bullying. Two corpora were constructed by collecting data from social networking sites such as Ask.fm. They developed a model using tokenization, PoS-tagging, and lemmatization for pre-processing. Models were developed for English and Dutch to test for language conversion and the resulting accuracy. The SVM algorithm gave an accuracy of 64% for English and 61% for Dutch.

Mohammed Ali Al-Garadi et al. [5] implemented a model to reduce textual cyberbullying, which has become the dominant aggressive behavior on social media sites. They extracted data from Wikipedia, YouTube, Twitter, and Instagram, and developed a model using tokenization and lemmatization; n-grams up to five levels were used to calculate TF-IDF and count vectors for pre-processing. They gave a comparative analysis of ML algorithms using SVM, K-clustering, Random Forest, and Decision Trees, and concluded that SVM worked best among the four machine learning models.

Kshitiz Sahay et al. [6] focused on identifying and classifying bullying in text by analyzing the properties of bullies and aggressors and the features that distinguish them from regular users. The dataset they used was obtained from Wikipedia, YouTube, and Twitter. In preprocessing they removed URLs and tags from the dataset and computed count vectors and TF-IDF vectors. For classification they used Logistic Regression, SVM, Random Forest, and Gradient Boosting.
Michele Di Capua [7] implemented a model inspired by Growing Hierarchical SOMs, which are able to efficiently cluster documents containing bully traces, built upon semantic and syntactic features of textual sentences. They followed an unsupervised approach combining syntactic, semantic, and sentiment analysis. In pre-processing, stop-word removal and punctuation removal were performed to generate word clusters, and social features were extracted. Convolutional neural networks were applied together with a Kohonen map (GHSOM).

Homa Hosseinmardi et al. [8] proposed a model to automatically detect cyberbullying text on Instagram by modeling posts written by bullies, and developed a system for judging posts based on a shortlist of caption words. The paper suggested using image processing on Instagram posts to determine the emotional response, or the textual response in the case of text pictures.

Vijay Banerjee, Jui Telavane et al. [9] developed a cyberbullying detection model using a convolutional neural network and compared its accuracy with previous models. They used a Twitter dataset consisting of 69,874 tweets, which were converted to vectors. The accuracy of this model was 93.97%, which was greater than that of the other models.

Novianto, S. M. Isa and L. Ashianti [3] created a classification model for cyberbullying using the Naive Bayes method and Support Vector Machine (SVM). The dataset, collected from Kaggle, provides 1,600 conversations from Formspring.me in which questions and answers are used as labels. It consists of 12,729 records, of which 11,661 are labeled non-cyberbullying and 1,068 are labeled cyberbullying. In data cleaning they removed words like 'haha', 'hehe', and 'umm'. For balancing the dataset they formed three labeling schemes: 2 classes (cyberbullying and non-cyberbullying), 4 classes (non-cyberbullying, plus cyberbullying with low, middle, and high severity levels), and 11 classes (non-cyberbullying, plus cyberbullying with severity levels 1 to 10). In preprocessing they used tokenization, case transformation, stop-word removal, token filtering, stemming, and n-gram generation. For classification they used Naive Bayes and SVM with linear, polynomial, and sigmoid kernels; SVM with the polynomial kernel gave the highest average accuracy, 97.11%.

H. Watanabe, M. Bouazizi and T. Ohtsuki [10] aimed to detect hate speech on Twitter. Their technique is based on unigrams and patterns automatically collected from the dataset, and their goal was to classify tweets as clean, offensive, or hateful. They used three datasets: the first, from CrowdFlower, contains 14,000 tweets classified as clean, offensive, or hateful; the second, also from CrowdFlower, contains tweets classified as offensive, hateful, or neither; the third, from GitHub, contains tweets classified as sexism, racism, or neither. The three datasets were combined into a larger one. In preprocessing they removed URLs and tags from tweets and performed tokenization, part-of-speech tagging, and lemmatization. They used binary and ternary classification with sentiment-based, semantic, unigram, and pattern features. Their proposed model gave an accuracy of 87.4% for binary classification (offensive vs. non-offensive tweets) and 78.4% for ternary classification (hateful, offensive, or clean).

J. Yadav, D. Kumar and D. Chauhan [11] developed a model to classify cyberbullying using a pre-trained BERT model, a recently developed deep learning language model.

S. E. Vishwapriya, Ajay Gour et al. [12] implemented a model for detecting hate speech and offensive language on Twitter using machine learning. Datasets were taken from CrowdFlower and GitHub. The CrowdFlower dataset had tweets with the labels hateful, offensive, and clean, whereas the GitHub dataset had columns for tweet ID and class (sexism, racism, or neither); tweets were fetched by tweet ID using the Twitter API, and the datasets were then combined. Tweets were converted to lowercase, and space patterns, URLs, Twitter mentions, retweet symbols, and stop words were removed. Stemming was applied to reduce inflectional forms of words. The dataset was then split into 70% training and 30% test samples. N-gram features were extracted from the tweets and weighted according to their TF-IDF values; unigram, bigram, and trigram features along with L1 and L2 normalization of TF-IDF were considered. Logistic Regression, Naive Bayes, and Support Vector Machine algorithms were compared, and 95% accuracy was obtained using Logistic Regression with L2 normalization and n = 3.

Lida Ketsbaia, Biju Issac et al. [13] proposed a model to detect hateful and offensive tweets using machine learning and deep learning. Datasets created by the University of Maryland and Cornell University, of about 35,000 and 24,000 tweets respectively, were used, with tweets labeled as hate and non-hate. Tweets were converted to lowercase; numbers, URLs, user mentions, punctuation, special characters, and stop words were removed; and contractions were replaced. The dataset was then balanced. Logistic Regression, Linear SVC, Multinomial, and Bernoulli classifiers were applied on unigrams, bigrams, and trigrams, and the Word2Vec technique was used to improve accuracy. Accuracies of 95% and 96% were achieved on the two datasets.

Sindhu Abro, Sarang Shaikh et al. [14] implemented a model to detect cyberbullying in text using machine learning on the CrowdFlower dataset. Tweets were converted to lowercase, and URLs, usernames, whitespace, hashtags, punctuation, and stop words were removed; tokenization and lemmatization were then applied. Naive Bayes, Support Vector Machine, K-Nearest Neighbour, Random Forest, and Logistic Regression classifiers were applied, along with n-gram TF-IDF, Word2Vec, and Doc2Vec feature techniques, including SVM with a combination of bigram and TF-IDF features.
RESEARCH METHODOLOGY

Our work is divided into two main phases:
(I) NLP preprocessing techniques
(II) Machine learning

[1] Phase 1: Natural Language Processing (NLP)
The initial phase of this project focuses on Natural Language Processing (NLP), which is crucial for preparing raw tweet data for subsequent machine learning algorithms. This phase encompasses several key sub-steps that convert unstructured text into a structured format suitable for analysis.

Data Extraction
The first step in this phase involves data extraction, where raw text data, specifically tweets, is collected from the Twitter platform. The Twitter API, among other tools, is utilized to retrieve tweet data along with metadata such as timestamps, usernames, and hashtags. The objective is to assemble a comprehensive dataset that includes both cyberbullying and non-cyberbullying content.

Data Cleaning
Following extraction, the data undergoes a cleaning process to eliminate noise and irrelevant information. The cleaning process involves:
· Removing special characters, numerical values, and URLs.
· Converting all text to lowercase to standardize the format.
· Filtering out non-textual elements such as emojis and symbols, unless they hold specific relevance for the analysis.
This cleaning process ensures that the data is uniform and ready for further processing.

1. Tokenization
Tokenization refers to the process of segmenting the text into individual units or tokens, such as words or phrases. For instance, tokenizing the sentence "Stop bullying others!" results in the tokens: ["Stop", "bullying", "others"]. This step is essential as machine learning models analyze patterns based on these individual word units.

2. Lemmatization
Lemmatization involves reducing words to their base or dictionary forms. For example, "running" and "ran" are normalized to the lemma "run." This process ensures that different forms of a word are treated consistently, thereby simplifying the data and enhancing the performance of machine learning algorithms.
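The cleaning, tokenization, and lemmatization steps described above can be sketched as follows. This is a minimal stdlib illustration: the regex rules and the tiny lemma table are stand-ins, and the project itself would use NLTK's word_tokenize and WordNetLemmatizer (which require the corresponding corpus downloads):

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps above: strip URLs, mentions,
    special characters and digits, then lowercase."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+|#", " ", text)             # remove mentions and '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # drop digits, emojis, symbols
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text: str) -> list:
    """Whitespace tokenization; the project can use nltk.word_tokenize instead."""
    return text.split()

# Tiny stand-in lemma table; the actual project would use
# nltk.stem.WordNetLemmatizer.
LEMMAS = {"running": "run", "ran": "run", "bullying": "bully", "bullied": "bully"}

def lemmatize(tokens: list) -> list:
    return [LEMMAS.get(t, t) for t in tokens]

raw = "Stop BULLYING others!!! @user123 http://t.co/abc"
tokens = lemmatize(tokenize(clean_tweet(raw)))
print(tokens)  # ['stop', 'bully', 'others']
```

The same three steps run in sequence over every tweet in the dataset before feature extraction.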
3. Vectorization
Following tokenization and lemmatization, the text data is converted into numerical representations that can be interpreted by machine learning models. This conversion is achieved through vectorization methods such as:
· TF-IDF (Term Frequency-Inverse Document Frequency): This method assesses the significance of words in the context of the dataset by evaluating their frequency in individual documents relative to their frequency across the entire dataset.
· Word Embeddings (Word2Vec, GloVe): These techniques capture semantic relationships between words by representing them as dense vectors in a high-dimensional space.
Through these preprocessing techniques, the raw tweet data is transformed into a structured format suitable for machine learning, thereby facilitating the subsequent phase of the project.

Fig. 2. Functional Block Diagram

[2] Machine Learning Algorithms
A. In the subsequent phase of the project, machine learning algorithms are applied to classify processed tweet data into cyberbullying or non-cyberbullying categories. This phase utilizes various well-established algorithms, each with distinct methodologies and strengths. The following subsections provide a detailed examination of the Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, and Stochastic Gradient Descent (SGD) Classifier.
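The TF-IDF weighting used in the vectorization step can be sketched in miniature as follows. This is an illustrative, unsmoothed variant (tf = count / document length, idf = log(N / df)); a real pipeline would typically use scikit-learn's TfidfVectorizer, which adds smoothing and normalization:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) where each row holds the TF-IDF
    weight of every vocabulary term for one document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([(counts[t] / len(doc)) * math.log(n_docs / df[t])
                     for t in vocab])
    return vocab, rows

docs = ["you are awful", "you are kind", "kind and helpful"]
vocab, X = tfidf_matrix(docs)
# 'you' appears in 2 of 3 documents, so its weight is lower than
# 'awful', which appears in only one document.
```

Terms that are common across the corpus (like "you") are down-weighted, while terms concentrated in few tweets (like "awful") dominate the vector, which is exactly the property the classifiers exploit.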
Support Vector Machine (SVM)
B. Support Vector Machine (SVM) is a powerful supervised learning algorithm employed for classification and regression tasks. SVM operates by constructing a hyperplane in a high-dimensional space that separates different classes with the maximum margin. The primary goal of SVM in cyberbullying detection is to identify an optimal boundary that differentiates cyberbullying tweets from non-cyberbullying ones.
C. SVM's effectiveness stems from its ability to handle both linear and non-linear classifications through the use of kernel functions. A kernel function transforms the data into a higher-dimensional space where a linear separation is possible. Commonly used kernels include the linear, polynomial, and radial basis function (RBF) kernels. The choice of kernel significantly impacts the performance of the SVM model. For this project, hyperparameter tuning, including the selection of the kernel and regularization parameters, is essential to achieving optimal classification accuracy.

K-Nearest Neighbors (KNN)
D. K-Nearest Neighbors (KNN) is a straightforward, instance-based learning algorithm used for both classification and regression. The fundamental principle of KNN is to classify a data point based on the majority class of its k-nearest neighbors in the feature space. In the context of cyberbullying detection, KNN evaluates the similarity of a tweet to its nearest neighbors and assigns a class label based on a majority vote.
E. The performance of KNN is highly dependent on the choice of the parameter k, which determines the number of nearest neighbors considered. Additionally, the distance metric used to measure similarity, such as Euclidean distance, Manhattan distance, or Minkowski distance, can influence the results. KNN's simplicity and interpretability make it a valuable tool, though its performance may degrade with high-dimensional data and large datasets.

Logistic Regression
F. Logistic Regression is a statistical method designed for binary classification problems. It models the probability of a binary outcome by employing a logistic function to estimate the probability that a given input belongs to one of the two classes. In the domain of cyberbullying detection, logistic regression estimates the likelihood that a tweet falls into either the cyberbullying or non-cyberbullying category.
G. The logistic function, or sigmoid function, transforms the linear combination of the input features into a probability value between 0 and 1. The model parameters are estimated using maximum likelihood estimation. Logistic Regression is advantageous for its simplicity, interpretability, and efficiency, especially when the relationship between the predictors and the outcome is approximately linear. Regularization techniques such as L1 (Lasso) and L2 (Ridge) can be employed to prevent overfitting and enhance model generalization.
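As an illustration of the instance-based approach described for KNN, a minimal majority-vote classifier can be written directly. The 2-D feature vectors here are toy stand-ins for real tweet features, and a production pipeline would use library implementations such as scikit-learn's KNeighborsClassifier, SVC, and LogisticRegression instead:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance, one of the metrics mentioned in paragraph E."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(X_train, y_train, x, k=3):
    """Classify vector x by majority vote among its k nearest
    training points, as described in paragraph D."""
    neighbors = sorted(zip(X_train, y_train),
                       key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy feature vectors; 1 = cyberbullying, 0 = non-cyberbullying.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9],
           [0.1, 0.2], [0.2, 0.1], [0.15, 0.25]]
y_train = [1, 1, 1, 0, 0, 0]

print(knn_predict(X_train, y_train, [0.85, 0.85]))  # 1
print(knn_predict(X_train, y_train, [0.10, 0.15]))  # 0
```

Changing k or the distance function in this sketch shows directly why paragraph E flags them as the decisive hyperparameters.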
Stochastic Gradient Descent (SGD)
Classifier
H. The Stochastic Gradient Descent (SGD) Classifier is an optimization method used to train various types of models, including linear classifiers. SGD operates by iteratively updating model parameters based on small, random subsets of the training data, known as mini-batches. This approach enables the classifier to handle large-scale datasets and high-dimensional feature spaces efficiently.
I. In the context of cyberbullying detection, the SGDClassifier approximates the solution to the classification problem by minimizing the loss function through stochastic gradient descent. The choice of loss function (e.g., hinge loss for linear SVM, log loss for logistic regression) and the learning rate are crucial for the convergence and performance of the model. SGD's ability to process large datasets and adapt quickly to new data makes it particularly suitable for applications involving extensive tweet data.
J. Each of these algorithms brings a unique set of advantages to the cyberbullying detection task. The selection of an appropriate algorithm, coupled with rigorous parameter tuning and evaluation, is vital for enhancing the accuracy and robustness of the classification model.

[3] Evaluation Phase
The evaluation phase is crucial in assessing the performance of the machine learning models used for cyberbullying detection. This phase involves comparing the predicted classifications with the true labels to determine the effectiveness of the models. Key evaluation metrics used in this phase include precision, recall, F-measure, and accuracy. These metrics provide a comprehensive understanding of how well the models perform in identifying cyberbullying content.

[4] Precision
Precision measures the accuracy of the positive predictions made by the model. It is the proportion of true positive predictions out of all the instances that were predicted as positive:

Precision = TP / (TP + FP)

where:
· TP (True Positives): The number of correctly predicted positive instances.
· FP (False Positives): The number of instances incorrectly predicted as positive.

[5] Recall
Recall (also known as Sensitivity or True Positive Rate) measures the model's ability to identify all relevant positive instances. It is the proportion of true positive predictions out of all the actual positive instances:

Recall = TP / (TP + FN)

where:
· TP (True Positives): The number of correctly predicted positive instances.
· FN (False Negatives): The number of actual positive instances that were missed by the model.

[6] F-Measure (F1 Score)
The F-Measure, or F1 Score, is the harmonic mean of precision and recall, providing a single metric that balances both. It is particularly useful when dealing with imbalanced datasets where one class is more frequent than the other. The F1 Score is given by the formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

[7] Accuracy
Accuracy measures the proportion of correctly classified instances (both true positives and true negatives) among all instances in the dataset. It provides an overall assessment of the model's performance. Accuracy is given by the formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:
· TP (True Positives) refers to the number of correctly predicted cyberbullying tweets.
· TN (True Negatives) refers to the number of correctly predicted non-cyberbullying tweets.
· FP (False Positives) refers to the number of tweets incorrectly classified as cyberbullying.
· FN (False Negatives) refers to the number of cyberbullying tweets that were missed by the model.

Evaluation Process
To evaluate the performance of each machine learning model, the following steps are typically undertaken:
1. Confusion Matrix Calculation: A confusion matrix is generated to summarize the results of the classification model. It includes counts of true positives, true negatives, false positives, and false negatives.
2. Metric Computation: Precision, recall, F1 Score, and accuracy are calculated based on the confusion matrix values.
3. Model Comparison: The calculated metrics are used to compare the performance of different models and select the best-performing one for the task of cyberbullying detection.
4. Cross-Validation: To ensure the robustness of the model performance, cross-validation techniques such as k-fold cross-validation are used to evaluate the models on different subsets of the data.

By thoroughly evaluating the models using these metrics, one can assess their effectiveness in accurately identifying cyberbullying tweets and ensure that the chosen model performs well across various dimensions of classification quality.
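The confusion-matrix tallies and the four metric formulas above translate directly into code. This is a self-contained sketch; in practice libraries such as scikit-learn provide the same metrics along with k-fold cross-validation utilities:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, FN for the positive class (1 = cyberbullying)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Compute the four metrics exactly as defined above."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
m = evaluate(y_true, y_pred)
# TP=3, TN=5, FP=1, FN=1 -> precision 0.75, recall 0.75, accuracy 0.8
```

Running the same computation on each fold of a k-fold split, and averaging, gives the cross-validated estimates described in step 4.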
Comparison of Algorithms with Count Vectorizer

Comparison of Algorithms with Term Frequency-Inverse Document Frequency

OUTPUT

The user interface (UI) design of the cyberbullying detection project is essential for delivering a user-friendly and efficient experience. The interface leverages Tkinter for desktop application development, NLTK for natural language processing tasks, and Streamlit for interactive web-based visualizations. The design aims to provide an intuitive interaction model and effective data presentation.

1. Overview
The UI design focuses on simplifying the user experience by enabling users to input text, view analysis results, and understand data visualizations seamlessly. The application is designed to be straightforward and accessible, allowing users to perform cyberbullying detection efficiently.

2. Layout and Components
a. Main Interface (Tkinter)
Input Area: A text entry field where users can type or paste tweets for analysis. This input field is central to the application, allowing users to enter data easily.
Submit Button: A button that users click to initiate the analysis. Upon clicking, the text is processed, and results are generated.
Results Display: A section that shows the classification results, indicating whether the tweet is identified as cyberbullying or not. This area may also include additional details such as confidence scores or a brief explanation of the result.
b. Visualization and Analysis (Streamlit)
Graphs and Charts: Streamlit is used to create interactive visualizations, such as bar charts or pie charts, displaying metrics like the distribution of cyberbullying and non-cyberbullying content. These visualizations help users interpret the analysis results more effectively.
Summary Statistics: This section presents key performance metrics of the model, including precision, recall, F1 score, and accuracy. Streamlit enables dynamic updates of these metrics based on the latest analysis.
c. Navigation and Accessibility
Navigation Menu: Tkinter's menu system or Streamlit's sidebar can be used to navigate between different functionalities of the application, such as input analysis, historical data, and settings. This provides a streamlined way to access various features.
Responsive Design: While Tkinter is primarily used for desktop applications, careful design ensures that the UI remains responsive and functional across different screen sizes and resolutions.

3. Visual Design
a. Colour Scheme and Typography
Colour Scheme: The application uses a coherent colour scheme to enhance readability and visual appeal. The colours are chosen to provide clear contrast and highlight important elements such as results and charts.
Typography: Clear and legible fonts are selected to ensure that text is easily readable. Consistent use of font sizes and styles contributes to a professional and cohesive look.
b. Interaction Design
User Feedback: Tkinter provides visual feedback for interactive elements, such as buttons and input fields, through colour changes or messages to indicate actions. Streamlit enhances interaction with real-time updates and feedback.
Error Handling: Informative error messages and prompts guide users in case of invalid inputs or processing issues. This helps users correct errors and proceed with their analysis.

4. Implementation
The UI is implemented using:
Tkinter: For the desktop application interface, including input fields, buttons, and results display.
NLTK: For processing the input text and performing natural language analysis, such as tokenization and lemmatization.
Streamlit: For creating interactive web-based visualizations and displaying performance metrics and charts.

5. User Experience Considerations
Usability: The design emphasizes ease of use, ensuring that users can interact with the application intuitively without needing extensive guidance.
Accessibility: The UI is designed to be accessible, with features like adjustable font sizes and clear navigation paths, to cater to users with different needs.
By integrating Tkinter, NLTK, and Streamlit, the cyberbullying detection project achieves a well-rounded user interface that supports efficient interaction, data processing, and visualization.

User Authentication and Login Window
The cyberbullying detection platform includes a user authentication mechanism to ensure secure and personalized access to the application. This is facilitated through a login window, which serves as the entry point for users to access the platform's features. The design and implementation of this login window are crucial for managing user sessions and safeguarding sensitive data.
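The Tkinter portion of the interface, with the input field, Analyze button, and results label wired together, can be sketched as below. The keyword-list `classify` function is a hypothetical placeholder standing in for the trained model's prediction pipeline:

```python
import tkinter as tk

def classify(text: str) -> str:
    """Placeholder classifier; the real application would call the
    fitted NLP + machine learning pipeline here."""
    flagged = {"stupid", "loser", "hate"}
    tokens = set(text.lower().split())
    return "Cyberbullying" if tokens & flagged else "Not cyberbullying"

def build_app() -> tk.Tk:
    """Assemble the input area, submit button, and results display
    described in section 2a."""
    root = tk.Tk()
    root.title("Cyberbullying Detector")
    entry = tk.Entry(root, width=60)
    entry.pack(padx=10, pady=5)
    result = tk.Label(root, text="Enter a tweet and press Analyze")
    result.pack(padx=10, pady=5)
    tk.Button(
        root, text="Analyze",
        command=lambda: result.config(text=classify(entry.get())),
    ).pack(pady=5)
    return root

if __name__ == "__main__":
    build_app().mainloop()  # opens the desktop window
```

Swapping `classify` for the real pipeline's predict call is the only change needed to connect this interface to the trained model.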
Chat Window
The chat window is a central component of the cyberbullying detection platform, designed to facilitate real-time interaction between users and the system. This feature allows users to input text, view analysis results, and engage with the platform in a conversational manner. The chat window enhances the user experience by providing a dynamic and intuitive interface for cyberbullying detection and analysis.

a. Chat Interface
Text Input Field: A text entry field where users can type or paste their messages (tweets) for analysis. This input field is designed to be easily accessible, allowing users to quickly enter text.
Send Button: A button that users click to submit their input for processing. The send button triggers the analysis of the entered text and updates the chat window with the results.
Message Display Area: A section that shows the conversation history, including user inputs and system responses. This area displays the results of the cyberbullying detection, such as whether the tweet is classified as cyberbullying or not, along with any additional comments or analysis.

b. Interaction Flow
User Input: Users enter their messages into the text input field and click the send button. The system processes the input using the natural language processing (NLP) and machine learning algorithms implemented in the platform.
System Response: After processing the input, the system generates a response that is displayed in the message area. This response includes the results of the analysis and any relevant information, such as confidence scores or feedback.
Conversation History: The chat window maintains a history of interactions, allowing users to review previous inputs and responses. This feature helps users track the results of their analyses and provides context for ongoing interactions.

Fig. 2. Functional Block Diagram
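The interaction flow above can be sketched as a plain function, independent of the GUI toolkit. The preprocessing and classifier below are simplified stand-ins for the platform's NLTK pipeline and trained models, and every name is illustrative.

```python
import re

def preprocess(text: str) -> list[str]:
    """Rough stand-in for the NLTK steps: lowercase, strip mentions and
    URLs, then tokenize. The real pipeline also lemmatizes tokens."""
    text = re.sub(r"(@\w+|https?://\S+)", " ", text.lower())
    return re.findall(r"[a-z']+", text)

# Toy keyword model standing in for the trained classifier.
ABUSIVE_TERMS = {"stupid", "loser", "idiot"}

def classify(tokens: list[str]) -> tuple[str, float]:
    hits = sum(t in ABUSIVE_TERMS for t in tokens)
    score = hits / max(len(tokens), 1)
    label = "cyberbullying" if hits else "not cyberbullying"
    return label, round(score, 2)

def handle_message(text: str, history: list[dict]) -> dict:
    """Process one chat turn: classify the input and append the user
    message plus the system verdict to the conversation history."""
    label, confidence = classify(preprocess(text))
    response = {"user": text, "label": label, "confidence": confidence}
    history.append(response)
    return response

history = []
handle_message("you are such a loser", history)
handle_message("great match today!", history)
```

In the actual chat window, the send button's callback would call `handle_message` and render each entry of `history` in the message display area.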
Fig. 3. Chatbox output

Admin Page for Cyberbullying Detection Platform
The admin page is a critical component of the cyberbullying detection platform, designed to provide administrators with tools to manage and monitor the platform's activities. This page offers functionalities to review detected instances of cyberbullying, disable inappropriate content, and oversee user interactions. The admin page enhances the system's control and moderation capabilities, ensuring a safer and more manageable environment.

a. Admin Interface
Dashboard Overview: The admin page includes an overview dashboard that summarizes key metrics, such as the number of flagged instances, active users, and recent activities. This overview helps administrators quickly assess the platform's status.
Flagged Content List: A list or table displaying all chats or messages flagged by the system as potential instances of cyberbullying. This list includes details such as the message content, user information, and the reason for flagging.
Action Buttons: Each flagged message includes action buttons that allow administrators to:
Review: View the full content of the flagged message and any associated analysis or comments.
Disable: Mark the content as inappropriate and disable it from being visible to users. This action helps in managing and moderating content effectively.

b. Review and Management
Detailed View: Administrators can click on individual flagged messages to access a detailed view, including the full content, detection results, and any notes or contextual information provided by the system.
Content Disabling: Administrators have the option to disable inappropriate content, which removes the flagged messages from user interactions and prevents them from being displayed on the platform.

c. User Management
User Profiles: The admin page provides access to user profiles, allowing administrators to review user activity and manage user permissions. This feature helps in identifying users who may be repeatedly involved in cyberbullying.
Account Actions: Administrators can perform actions such as suspending or deactivating user accounts based on their behaviour or involvement in flagged content.

3. Implementation
The admin page is implemented using a combination of Tkinter for desktop-based management and Streamlit for web-based visualization and interaction:
Tkinter: Provides the graphical interface for the admin page, including the list of flagged content, action buttons, and user management tools. Tkinter's widgets are used to create a functional and organized layout for administrators.
Streamlit: Enhances the admin page with interactive elements and real-time updates. Streamlit's capabilities are used to display metrics, update flagged content status, and visualize data related to user interactions and flagged messages.

4. Security and Access Control
Authentication: Access to the admin page is restricted to authorized personnel only. Administrators must log in with special credentials to access the management tools and features.
Data Security: Sensitive information displayed on the admin page, such as user data and flagged content, is protected through encryption and secure data handling practices.

Fig. 4. Server-side working

Comparative Study of Functionality
The project involves classifying and predicting cyberbullying messages across various social media platforms, such as comments, chats, and tweets, with a specific focus on content written in Hinglish, a hybrid of Hindi and English. As Hinglish is widely used across Indian social media platforms, it presents unique challenges for text processing, requiring specialized handling during feature extraction and classification.
To address this, we implement multiple machine learning algorithms to improve the prediction accuracy of cyberbullying detection. Multinomial Naive Bayes (NB), Decision Tree Classifier, AdaBoost Classifier, and Bagging Classifier are employed as core classification models. Each of these algorithms brings distinct advantages in terms of handling textual data, classification accuracy, and model robustness.
Accuracy is a critical metric in evaluating the performance of machine learning models, particularly in the context of classifying cyberbullying messages in Hinglish. It represents the proportion of correctly classified instances out of the total instances examined. In this study, we rigorously assess the accuracy of the various algorithms employed: Multinomial Naive Bayes (NB), Decision Tree Classifier, AdaBoost Classifier, and Bagging Classifier.

Fig. 5. Table depicting accuracy for the various algorithms

In this research, the dataset utilized for the classification of cyberbullying messages comprises a total of 1,000 instances, of which 360 instances (36%) are classified as bullying data, while the remaining 640 instances (64%) represent non-bullying data. This distribution is crucial for understanding the dynamics of the dataset and its implications for model training and evaluation.

Conclusion
Thus, we have successfully been able to extract the data, clean it, and visualize it using various Python libraries. We also implemented various natural language processing techniques such as tokenization, lemmatization, and vectorization, i.e. feature extraction. After reading various research papers published in this field, we found that in feature extraction, count vectorizer and TF-IDF are the two methods that give very good accuracy compared to word2vec and bag of words. To select the better feature extraction method of the two, we performed a comparative analysis between CountVectorizer and TF-IDF and observed that CountVectorizer provides slightly better accuracy than TF-IDF. We identified various algorithms and applied several of them in our project: Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Random Forest, Bagging Classifier, Decision Tree Classifier, SGD Classifier, Multinomial Naive Bayes, and AdaBoost Classifier. We then trained our models and obtained good accuracy as well as speed while applying these algorithms with CountVectorizer as the feature extraction model. After training, we summarized all the algorithms in one plot with accuracy and F1 score. After observing the results, we noted that Linear SVC and SGD (stochastic gradient descent classifier) give comparatively better results in classifying and predicting bullying messages in Hinglish, and take less time to train and predict than the other algorithms.

Fig. 6. Comparison between CV and TF-IDF
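The CountVectorizer versus TF-IDF comparison described above can be reproduced in outline with scikit-learn. The eight-tweet corpus below is a made-up placeholder, not the paper's 1,000-instance Hinglish dataset, so the resulting scores are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

# Placeholder corpus: 1 = bullying, 0 = non-bullying (illustrative only).
train_texts = [
    "tu idiot hai total loser",
    "shut up stupid fool",
    "nobody likes you loser",
    "you are pathetic idiot",
    "kal match dekha kya",
    "great work on the project",
    "happy birthday bhai",
    "see you at college tomorrow",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["what a stupid loser", "good luck for the exam"]
test_labels = [1, 0]

# Train each classifier with each feature extractor and record accuracy / F1.
results = {}
for vec_name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    for clf_name, clf in [("nb", MultinomialNB()), ("svc", LinearSVC())]:
        clf.fit(X_train, train_labels)
        pred = clf.predict(X_test)
        results[(vec_name, clf_name)] = (
            accuracy_score(test_labels, pred),
            f1_score(test_labels, pred, zero_division=0),
        )
```

On the real dataset, the same loop extended with the remaining classifiers (Decision Tree, AdaBoost, Bagging, SGD, and so on) yields the accuracy/F1 summary plot referred to in the Conclusion.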
Future Scope

While the current models demonstrate promising results in classifying cyberbullying messages, there remains considerable potential for improving accuracy through the application of deep learning techniques. Deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks, have shown exceptional capabilities in natural language processing tasks. These models can capture complex patterns and relationships in data, making them particularly effective for understanding the nuanced language used in cyberbullying. By employing these advanced architectures, we aim to achieve higher classification accuracy, reduce misclassifications, and enhance the overall robustness of the system.

To facilitate the detection of cyberbullying in additional languages, we will need to gather comprehensive and diverse datasets that represent different linguistic features and cultural contexts. This process may involve collaboration with linguistic experts and native speakers to ensure the datasets are rich and representative. Furthermore, the creation of annotated corpora that label instances of cyberbullying will be instrumental in training and validating the machine learning models effectively.

While the current model focuses on classifying cyberbullying solely through textual analysis, future iterations of this research can explore multimodal classification techniques. Cyberbullying often manifests not only in text but also in images and videos. Developing a model that can classify cyberbullying content across these different media types would provide a more holistic approach to detection. For instance, integrating image processing techniques, such as CNNs for visual data, alongside text analysis could enhance the system's ability to detect cyberbullying in scenarios where images or videos convey harmful or abusive messages.
REFERENCES

[1] B. Dean, "How many people use social media in 2021? (65+ statistics)," Sep. 2021. [Online]. Available: https://backlinko.com/social-media-users
[2] J. W. Patchin, "Summary of our cyberbullying research (2004-2016)," Jul. 2019. [Online]. Available: https://cyberbullying.org/summary-of-our-cyberbullying-research
[3] S. M. Novianto, I. Isa, and L. Ashianti, "Cyberbullying classification using text mining," in Proc. 1st Int. Conf. on Informatics and Computational Sciences (ICICoS), 2017, pp. 241–246.
[4] C. Van Hee et al., "Automatic detection of cyberbullying in social media text," PLoS One, vol. 13, no. 10, p. e0203794, 2018.
[5] M. A. Al-Garadi et al., "Predicting cyberbullying on social media in the big data era using machine learning algorithms: Review of literature and open challenges," IEEE Access, vol. 7, pp. 70701–70718, 2019.
[6] K. Sahay, H. S. Khaira, P. Kukreja, and N. Shukla, "Detecting cyberbullying and aggression in social commentary using NLP and machine learning," Int. J. Engineering Technology Science and Research, vol. 5, no. 1, 2018.
[7] M. Di Capua, E. Di Nardo, and A. Petrosino, "Unsupervised cyberbullying detection in social networks," in Proc. 23rd Int. Conf. on Pattern Recognition (ICPR), 2016, pp. 432–437.
[8] H. Hosseinmardi et al., "Detection of cyberbullying incidents on the Instagram social network," arXiv preprint arXiv:1503.03909, 2015.
[9] V. Banerjee, J. Telavane, P. Gaikwad, and P. Vartak, "Detection of cyberbullying using deep neural network," in Proc. 5th Int. Conf. on Advanced Computing & Communication Systems (ICACCS), 2019, pp. 604–607.
[10] H. Watanabe, M. Bouazizi, and T. Ohtsuki, "Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection," IEEE Access, vol. 6, pp. 13825–13835, 2018.
[11] J. Yadav, D. Kumar, and D. Chauhan, "Cyberbullying detection using pre-trained BERT model," in Proc. Int. Conf. on Electronics and Sustainable Communication Systems (ICESC), 2020, pp. 1096–1100.
[12] A. Gaydhani, V. Doma, S. Kendre, and L. Bhagwat, "Detecting hate speech and offensive language on Twitter using machine learning: An n-gram and TF-IDF based approach," arXiv preprint arXiv:1809.08651, 2018.
[13] L. Ketsbaia, B. Issac, and X. Chen, "Detection of hate tweets using machine learning and deep learning," in Proc. 19th Int. Conf. on Trust, Security and Privacy in Computing and Communications (TrustCom), 2020, pp. 751–758.
