Human Value Ethics
INTERNSHIP REPORT
Submitted By
P.SRIRAM - (512221104057)
In partial fulfilment for the award of the degree
of
BACHELOR OF ENGINEERING
(COMPUTER SCIENCE AND ENGINEERING)
THIRUVANNAMALAI – 606601
SEPTEMBER 2024
STRYDO TECHNOLOGIES PVT. LTD.
2nd floor, sapthagiri complex,
Abstract:
Social media has become a platform where many young people are being bullied. As social networking sites grow, toxic and abusive content is increasing day by day. By using NLP to recognise word patterns in the tweets posted by offenders, a machine learning (ML) model can be built to automatically recognise cyberbullying activity on social media. Although several cyberbullying detection techniques have already been implemented, many of them were purely text based. The aim of this work is to demonstrate software that can recognise hateful tweets, posts, and similar content. An ML model is proposed to detect and prevent bullying on Twitter. Random Forest (RF) is used for training and testing on the social media bullying content, and both the Support Vector Machine (SVM) and Random Forest classifiers were able to identify true positives with good accuracy.
TABLE OF CONTENTS
CHAPTER 1 : INTRODUCTION
1.1 GENERAL
1.1.1 THE MACHINE LEARNING SYSTEM
1.1.2 FUNDAMENTAL
1.2 JUPYTER
1.3 MACHINE LEARNING
1.4 CLASSIFICATION TECHNIQUES
1.4.1 NEURAL NETWORK AND DEEP LEARNING
1.4.2 METHODOLOGIES - GIVEN INPUT AND EXPECTED OUTPUT
1.5 OBJECTIVE AND SCOPE OF THE PROJECT
1.6 EXISTING SYSTEM
1.6.1 DISADVANTAGES OF EXISTING SYSTEM
1.6.2 LITERATURE SURVEY
1.7 PROPOSED SYSTEM
1.7.1 PROPOSED SYSTEM ADVANTAGES
INTRODUCTION
1.1 GENERAL
Glossary and Key Terms
This section provides a quick reference for several libraries that are not explicitly mentioned in
this chapter, but may be of interest to the reader. This should provide the reader with some
keywords or useful points of reference for other similar libraries to those discussed in this chapter.
BIDMach GPU accelerated machine learning library for algorithms that are not necessarily
neural network based.
Caret provides a standardised API for many of the most useful machine learning packages for
R. For readers who are more comfortable with R, Caret provides a good substitute for Python’s
SciKit-Learn.
R is used extensively by the statistics community. The software package Caret provides a
standardised API for many of R’s machine learning libraries.
WEKA is short for the Waikato Environment for Knowledge Analysis [6] and has been a very
popular open source tool since its inception in 1993. In 2005 Weka received the SIGKDD Data
Mining and Knowledge Discovery Service
Award: it is easy to learn and simple to use, and provides a GUI to many machine learning
algorithms.
Vowpal Wabbit Microsoft’s machine learning library. Mature and actively developed, with an
emphasis on performance.
Managing Packages
Anaconda comes with its own built in package manager, known as Conda. Using the conda
command from the terminal, you can download, update, and delete Python packages. Conda takes
care of all dependencies and ensures that packages are preconfigured to work with all other
packages you may have installed.
Keeping your Python distribution up to date and well maintained is essential in this fast moving
field. However, Anaconda makes it particularly easy to manage and keep your scientific stack up
to date. Once Anaconda is installed you can manage your Python distribution, and all the scientific
packages installed by Anaconda using the conda application from the command line. To list all
packages currently installed, use conda list. This will output all packages and their version
numbers. Updating all Anaconda packages in your system is performed using the conda update --all command. Conda itself can be updated using the conda update conda command, while Python
can be updated using the conda update python command. To search for packages, use the search
parameter, e.g. conda search stats where stats is the name or partial name of the package you are
searching for.
Machine Learning
We will now move on to the task of machine learning itself. In the following sections we will
describe how to use some basic algorithms, and perform regression, classification, and clustering
on some freely available medical datasets concerning breast cancer and diabetes, and we will also
take a look at a DNA microarray dataset.
SciKit-Learn
SciKit-Learn provides a standardised interface to many of the most commonly used machine
learning algorithms, and is the most popular and frequently used library for machine learning for
Python. As well as providing many learning algorithms, SciKit-Learn has a large number of
convenience functions for common preprocessing tasks (for example, normalisation or k-fold cross
validation).
SciKit-Learn is a very large software library.
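As an illustration (not taken from the original report), the short sketch below shows two of these convenience functions, feature scaling and k-fold cross validation, applied to a small dummy dataset:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Dummy data standing in for a real dataset: 100 samples with 5 features each.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# Normalise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# 5-fold cross validation of a linear SVM on the scaled data.
scores = cross_val_score(SVC(kernel='linear'), X_scaled, y, cv=5)
print(scores.mean())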
Clustering
Clustering algorithms focus on ordering data together into groups. In general clustering algorithms
are unsupervised—they require no y response variable as input. That is to say, they attempt to find
groups or clusters within data where you do not know the label for each sample. SciKit-Learn has
many clustering algorithms, but in this section we will demonstrate hierarchical clustering on a
DNA expression microarray dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using the
SciPy library.
The goal is to cluster the data properly in logical groups, in this case into the cancer types
represented by each sample’s expression data. We do this using agglomerative hierarchical
clustering, using Ward’s linkage method:
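The original listing is not reproduced here; the following is a minimal sketch of the idea, assuming the expression data has already been loaded into a NumPy array X (one row per sample):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder for the microarray expression matrix (samples x genes).
X = np.random.rand(20, 50)

# Agglomerative hierarchical clustering with Ward's linkage method.
Z = linkage(X, method='ward')

# Visualise the resulting cluster hierarchy as a dendrogram.
dendrogram(Z)
plt.show()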
Classification
In the previous section we analysed data that was unlabelled: we did not know to what class a sample belonged (known as unsupervised learning). In contrast to this, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which
class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of
algorithms for classification, in this section we will look at the Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set, train
a Support Vector Machine with a linear kernel, and test the trained model on an unseen dataset.
The Support Vector Machine model should be able to predict if a new sample is malignant or
benign based on the features of a new, unseen sample:
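A minimal sketch of this workflow is given below, using the copy of the Wisconsin breast cancer dataset bundled with SciKit-Learn; the split ratio and random seed are assumptions rather than the report's original settings:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the dataset and hold out 30% of the samples as an unseen test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Support Vector Machine with a linear kernel on the training set only.
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Evaluate on the unseen test set: precision, recall, F1 and support per class.
print(classification_report(y_test, model.predict(X_test)))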
You will notice that the SVM model performed very well at predicting the malignancy of new,
unseen samples from the test set—this can be quantified nicely by printing a number of metrics
using the classification report function. Here, the precision, recall, and F1 score (F1 = 2 · (precision · recall) / (precision + recall)) for each class is shown. The support column is a count of the
number of samples for each class.
Support Vector Machines are a very powerful tool for classification. They work well in high
dimensional spaces, even when the number of features is higher than the number of samples.
However, their running time is quadratic to the number of samples so large datasets can become
difficult to train. Quadratic means that if you increase a dataset in size by 10 times, it will take 100
times longer to train.
Last, you will notice that the breast cancer dataset consisted of 30 features. This makes it difficult
to visualize or plot the data. To aid in visualization of highly dimensional data, we can apply a
technique called dimensionality reduction.
Dimensionality Reduction
Another important method in machine learning, and data science in general, is dimensionality
reduction. For this example, we will look at the Wisconsin breast cancer dataset once again. The
dataset consists of over 500 samples, where each sample has 30 features. The features relate to
images of a fine needle aspirate of breast tissue, and the features describe the characteristics of the
cells present in the images. All features are real values. The target variable is a discrete value
(either malignant or benign) and is therefore a classification dataset.
You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the data, where
each feature was plotted against every other feature in the dataset to look for potential correlations
(Fig. 3). By examining this plot you could probably find features which would separate the dataset
into groups. Because the dataset only had 4 features we were able to plot each feature against each
other relatively easily. However, as the numbers of features grow, this becomes less and less
feasible, especially if you consider the gene expression example in Sect. 9.4 which had over 6000
features.
One method that is used to handle data that is highly dimensional is Principal Component Analysis,
or PCA. PCA is an unsupervised algorithm for reducing the number of dimensions of a dataset.
For example, for plotting purposes you might want to reduce your data down to 2 or 3 dimensions,
and PCA allows
you to do this by generating components, which are combinations of the original features, that you
can then use to plot your data.
PCA is an unsupervised algorithm. You supply it with your data, X, and you specify the number
of components you wish to reduce its dimensionality to. This is known as transforming the data:
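A minimal sketch of this transformation, assuming the 30-feature breast cancer data is loaded as X:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data             # 569 samples x 30 features

# Reduce the 30 original features to 2 components for plotting.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # the "transformed" data
print(X_reduced.shape)                    # (569, 2)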
Again, you would not use this model for new data—in a real world scenario, you would, for
example, perform a 10-fold cross validation on the dataset, choosing the model parameters that
perform best on the cross validation. This model would be much more likely to perform well on
new data. At the very least, you would randomly select a subset, say 30% of the data, as a test set
and train the model on the remaining 70% of the dataset. You would evaluate the model based on
the score on the test set and not on the training set.
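As a hedged illustration of this evaluation strategy, the sketch below performs a 70/30 train/test split and a 10-fold cross validation on the training portion:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the data as a test set; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 10-fold cross validation on the training data to choose and validate the model.
cv_scores = cross_val_score(SVC(kernel='linear'), X_train, y_train, cv=10)
print(cv_scores.mean())

# Final evaluation on the held-out test set, never on the training set.
model = SVC(kernel='linear').fit(X_train, y_train)
print(model.score(X_test, y_test))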
Keras additionally requires either Theano or TensorFlow to be installed. In the examples in this
chapter we are using Theano as a backend, however the code will work identically for either
backend. You can install Theano using pip, but it has a number of dependencies that must be
installed first. Refer to the Theano and TensorFlow documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a stack of modules,
from the input of the neural network, to the output of the neural network, piece by piece until you
have a complete network. Also, Keras can be configured to use your Graphics Processing Unit, or
GPU. This makes training neural networks far faster than if we were to use a CPU. We begin by
importing Keras:
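The original import listing is not included in this copy of the report; a minimal, assumed sketch of importing Keras and stacking a small fully connected network module by module looks like this:

from keras.models import Sequential
from keras.layers import Dense

# Build the network piece by piece, from input to output.
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=30))   # hidden layer; 30 input features assumed
model.add(Dense(1, activation='sigmoid'))                # output layer for binary classification
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])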
We may want to view the network’s accuracy on the test (or its loss on the training set) over time
(measured at each epoch), to get a better idea how well it is learning. An epoch is one complete
cycle through the training data.
Fortunately, this is quite easy to plot as Keras’ fit function returns a history object which we can
use to do exactly this:
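Continuing the sketch above (and assuming X_train, y_train, X_test and y_test already exist), the history object returned by fit can be plotted like this; note that the metric key is 'accuracy' in recent Keras versions and 'acc' in older ones:

import matplotlib.pyplot as plt

# Train the network and keep the per-epoch history of loss and accuracy.
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))

# Plot training and test (validation) accuracy measured at each epoch.
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='test accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()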
This will result in a plot similar to that shown. Often you will also want to plot the loss on the test
set and training set, and the accuracy on the test set and training set.
Plotting the loss and accuracy can be used to see if you are over fitting (you experience tiny loss
on the training set, but large loss on the test set) and to see when your training has plateaued.
OBJECTIVE:
The main aim of the hate speech detection model is to help improve on manual monitoring for unwanted chats on social networks. In this project we fetch tweets from Twitter accounts, preprocess the tweets and images, and apply the generated model to detect whether they contain hate speech or not. The objectives of the system are to: collect a dataset of hate words and preprocess it; apply natural language processing and then machine learning algorithms; generate models with different machine learning algorithms; fetch tweets from a Twitter account and preprocess them; and apply the generated model to the fetched tweets to obtain the final output, hateful or not.
Toxic command is the use of electronic communication to bully a person by sending hateful messages using social media, instant messaging or other digital messages. Toxic command can be very damaging to adolescents and teens: it can lead to anxiety, depression, and even suicide. Also, once things are circulated on the Internet, they may never disappear, resurfacing at later times to renew the pain of the hate speech. To overcome these issues, detecting toxic command is very important nowadays, as it helps to stop toxic conversations on social media networks.
PROBLEM STATEMENT:
While social media networks give us great communication platforms and opportunities, they also increase the vulnerability of young people to threatening situations online. Toxic command on social media networks is a global phenomenon because of their huge volumes of active users. The trend shows that toxic command on social networks is growing rapidly every day. Recent studies report that toxic command constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages, and the information overload on the Web requires intelligent systems to identify potential risks automatically. So, in this project we focus on building a model for automatic toxic command detection in social media text by modelling posts written by bullies on social networks.
EXISTING SYSTEM
The Naïve Bayes model involves a simple conditional independence assumption, i.e. given a
class which may be positive or negative; the words are conditionally independent of each other.
This assumption doesn't much affect the accuracy of text classification but makes really fast classification applicable to the problem.
Tokenization: In this part we take the text as sentences or whole paragraphs and then output the entered text as separated words in a list.
Lowering text: This takes the list of words produced by tokenization and lowercases all the letters, e.g. ’THIS IS AWESOME’ becomes ’this is awesome’.
Stop words and encoding cleaning: This is an essential part of the preprocessing where we clean the text of stop words and encoding characters like \n or \t, which do not provide meaningful information to the classifiers.
Word Correction: In this part we used the Microsoft Bing word correction API [24], which takes a word and returns a JSON object with the most similar words and the distance between these words and the original word.
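A small hedged sketch of the first three preprocessing steps (tokenization, lowering and stop-word cleaning) using NLTK is shown below; the Bing word-correction step is omitted because it requires an API key:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

text = "THIS IS AWESOME\n and full of stop words"

tokens = word_tokenize(text)                   # tokenization: text to a list of words
tokens = [t.lower() for t in tokens]           # lowering text
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stop word / encoding cleaning
print(tokens)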
LITERATURE SURVEY:
1. TITLE: Detection of Toxic command using Text Mining and Natural Language Processing
YEAR: 2020
AUTHOR: G. Priyadharshini
DESCRIPTION:
In today’s modern world, technology connected with humanity is doing wonderful things. On the other hand, the anonymity that people enjoy on social networks brings out the very worst in some of them in the form of hate speech. Social media toxic command is a serious societal problem which can magnify violence, ranging from lynching to ethnic cleansing.
One of the critical tasks of automatic detection of toxic command is differentiating it from the
other context of offensive languages. The existing works to distinguish the two categories using
the lexical methods showed very low performance metrics values which led to major
misclassification. The works with supervised machine learning approaches indeed gave significant
results in distinguishing hate and offensive but the presence or absence of certain words of both
the classes can serve as both merit and demerit to achieve accurate classification. In this paper, a
ternary classification of tweets into hate speech, offensive, and neither is performed using multi-class classifiers. Among the four classifiers considered (Logistic Regression, Random Forest, Support Vector Machine (SVM) and Naïve Bayes), the Random Forest classifier performs significantly well with almost all feature combinations, giving a maximum accuracy of 0.90 with the TF-IDF feature technique.
2. TITLE: Toxic Speech Classification via Deep Learning using Combined Features from BERT &
FastText Embedding
YEAR: 2021
AUTHOR: Asmi P
DESCRIPTION:
With the growing internet usage rate, people are more likely to express their opinions or ideas openly on social media. A lot of discussion platforms are available nowadays, but some users misuse the freedom of speech by spreading toxic speech online. Such toxic speech is intended not just to insult or mock, but to harass and cause lasting pain by attacking something uniquely dear to the target. Thus, automatically detecting and removing toxic speech on social media is very important. We propose a feature-based method that combines the features of TF-IDF, FastText embeddings and BERT embeddings and uses a DNN classifier. We compare the individual features of these three methods with the combined features as a performance analysis.
DESCRIPTION:
The increasing use of online social media and their demand has turned up the rise of cyberbullying
among people. Nowadays cyberbullying has become very frequent. The majority of the people are using
social media to troll and smear others, and the others are being defamed and agitated by unknown
users or friends. So it is necessary to detect these types of comments and prevent them. Our work proposes
an ensemble learning approach to detect cyberbullying comments. Different supervised ensemble
learning techniques are used to classify comments. Here voting classifier trains on an ensemble of
Support Vector Machine, Logistic Regression, and Perceptron models and predicts the output based on
the highest majority of the vote. This model detects cyberbullying comments with 94% accuracy.
AUTHOR: K.H.ChanaChristy
DESCRIPTION:
Cyber bullying on social networking sites is an emerging societal issue that has drawn
significant scholarly attention. The purpose of this study is to consolidate the existing
knowledge through a literature review and analysis. We first discuss the nature, research
patterns, and theoretical foundations. We then develop an integrative framework based on
social cognitive theory to synthesize what is known and identify what remains to be
learned, with a focus on the triadic reciprocal relationships between perpetrators, victims,
and bystanders. We discuss the key findings and highlight opportunities for future research.
We conclude this paper by noting research contributions and limitations.
In the proposed work we address toxic command, which is a huge problem on social media websites like Facebook and
Twitter. A number of life-threatening cyberbullying experiences among young people have been reported
internationally thus drawing attention to its negative impact. In the USA, the problem of cyberbullying has
become increasingly evident and has officially been identified as a social threat. The challenges in fighting
cyberbullying include: detecting online bullying when it occurs; reporting it to law enforcement agencies;
and identifying predators and their victims. No present online community or social media website (for example, Facebook or Twitter, where cyberbullying is most common) incorporates a system to automatically and intelligently identify aggression and instances of online harassment on its platform. Despite the seriousness of the problem, there are very few successful efforts to detect abusive behavior, both from the research community and from social media itself, due to several inherent obstacles such as grammatical and syntactic flaws and fairly limited context. Aggression and bullying against an individual can be performed
in several ways beyond just obviously abusive language – for example, via constant sarcasm, trolling, etc.
4. Evaluation and analysis of the best model. The motivation for the research work is to
learn the application and implementation of Natural Language Processing and Machine
Learning in a real-world problem, i.e., cyberbullying and online harassment.
1. Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing
the noisy data, or resolving the inconsistencies in the data.
2. Data Integration: Data with different representations are put together and conflicts within the
data are resolved.
4. Data Reduction: The step aims to present a reduced representation of the data.
The raw data is first loaded into the memory where it is cleansed of escape sequences like \n, \t
and Unicode characters such as \xc2 with a white space. Colloquial words and phrases used
mostly in text messages are replaced with its corresponding English word.
For example, “u” is replaced with “you”; “em” is replaced with “them”; “da” is replaced with
“the” and so on. Contractions such as “won’t” and “can’t” are replaced with “will not” and
“cannot” respectively along with others. The data is further converted to lowercase format.
Advanced natural language processing techniques are used to further preprocess the data to
ensure a better quality and consistency of data format while building the ground truth and
training a classifier. To build a vocabulary of abusive words and internet slangs, a dictionary of
bad words available at (http://urbanoalvarez.es/blog/2008/04/04/bad-words-list/) is used.
The dictionary contains a list of bad words in a number of variations used on the internet and
its corresponding English dictionary word. Through data preprocessing, different variations of
internet slangs are replaced with its dictionary counterpart for which the bad words file is used.
The data is further stemmed down to its root form and occurrences of special characters are removed using regular expressions.
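A hedged illustration of these cleaning steps is given below; the tiny replacement dictionary stands in for the full bad-words/slang list referenced above:

import re
from nltk.stem.snowball import SnowballStemmer

# Toy slang and contraction dictionary (the real list is much larger).
replacements = {"u": "you", "em": "them", "da": "the", "won't": "will not", "can't": "cannot"}
stemmer = SnowballStemmer('english')

def clean(text):
    text = re.sub(r'[\n\t]|\\x[0-9a-f]{2}', ' ', text)                      # escape sequences and encoding debris
    text = ' '.join(replacements.get(w, w) for w in text.lower().split())   # slang and contractions
    text = re.sub(r'[^a-z0-9\s]', '', text)                                 # special characters
    return ' '.join(stemmer.stem(w) for w in text.split())                  # stem words to their root form

print(clean("u won't like da comments\t"))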
PROJECT DESCRIPTION
2.1 INTRODUCTION
Cyberbullying and cyber aggression are serious and widespread issues affecting
increasingly more Internet users. It is defined as an aggressive, intentional act carried out by
an individual or group, that takes place in cyberspace. In today’s hyper-connected society, bullying, which was once limited to particular places or times of the day (e.g., school hours), can instead occur anytime, anywhere, with just a few clicks of a mouse and taps on a keyboard.
Cyberbullying and cyber aggression can take many forms and definitions, however, the
former typically denotes repeated and hostile behavior performed by a group or an individual
and the latter intentional harm delivered via electronic means to a person or a group of people
who perceive such acts as offensive, derogatory, harmful, or unwanted. The abundance of
public discussion spaces on the Internet has in many ways changed how we communicate
with others. These discussions can often be productive, but the anonymity that comes with
hiding behind a username has allowed users to post insulting or inappropriate comments.
These posts can often create a hostile or uncomfortable environment for other users, one that
may even discourage them from visiting the site. In 2017, about 50% of young social media
users reported being bullied online in various forms. Popular social media platforms like
Twitter and Facebook are not immune, as racist and sexist attacks may even have caused
potential buyers of Twitter to balk. The ambition of this research work is to explore the
possibilities of classifying hate speech, insults and harassment which are one of the various
forms of cyberbullying in social media. The research work extends current research on
cyberbullying and online harassment detection.
The hardware requirements may serve as the basis for a contract for the implementation of the
system and should therefore be a complete and consistent specification of the whole system.
They are used by software engineers as the starting point for the system design. It shows what the system does and not how it should be implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 40 GB
There is no exact answer to the question “How much data is needed?” because each
machine learning problem is unique. In turn, the number of attributes data scientists will
use when building a predictive model depends on the attributes’ predictive value.
‘The more, the better’ approach is reasonable for this phase. Some data scientists suggest
considering that less than one-third of collected data may be useful. It’s difficult to estimate
which part of the data will provide the most accurate results until the model training begins.
That’s why it’s important to collect and store all data — internal and open, structured and
unstructured.
The tools for collecting internal data depend on the industry and business infrastructure.
For example, those who run an online-only business and want to launch a personalization
campaign can try out such web analytic tools as Mixpanel, Hotjar, CrazyEgg, well-known
Google analytics, etc. A web log file, in addition, can be a good source of internal data. It
stores data about users and their online behavior: time and length of visit, viewed pages or
objects, and location.
Companies can also complement their own data with publicly available datasets. For
instance, Kaggle, Github contributors, AWS provide free datasets for analysis.
Data preprocessing:-
The purpose of preprocessing is to convert raw data into a form that fits machine learning.
Structured and clean data allows a data scientist to get more precise results from an applied
machine learning model. The technique includes data formatting, cleaning, and sampling.
Data formatting: - The importance of data formatting grows when data is acquired from
various sources by different people. The first task for a data scientist is to standardize record
formats. A specialist checks whether variables representing each attribute are recorded in
the same way. Titles of products and services, prices, date formats, and addresses are
examples of variables. The principle of data consistency also applies to attributes
represented by numeric ranges.
Data cleaning: - This set of procedures allows for removing noise and fixing
inconsistencies in data. A data scientist can fill in missing data using imputation techniques,
e.g. substituting missing values with mean attributes. A specialist also detects outliers —
observations that deviate significantly from the rest of distribution. If an outlier indicates
erroneous data, a data scientist deletes or corrects them if possible. This stage also includes
removing incomplete and useless data objects.
Data sampling: - Big datasets require more time and computational power for analysis. If
a dataset is too large, applying data sampling is the way to go. A data scientist uses this
technique to select a smaller but representative data sample to build and run models much
faster, and at the same time to produce accurate outcomes.
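A small hedged sketch of the cleaning and sampling steps, using a hypothetical pandas DataFrame with a numeric 'price' column:

import pandas as pd

# Toy data with a missing value and an obvious outlier.
df = pd.DataFrame({'price': [10.0, 12.5, None, 11.0, 950.0]})

# Imputation: fill the missing value with the mean of the attribute.
df['price'] = df['price'].fillna(df['price'].mean())

# Outlier handling: keep only values inside the 5th-95th percentile range.
low, high = df['price'].quantile([0.05, 0.95])
df = df[df['price'].between(low, high)]

# Data sampling: draw a smaller but representative sample of the rows.
sample = df.sample(frac=0.5, random_state=1)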
Featurization:-
Featurization is a way to change some form of data (text data, graph data, time-series data, ...) into a numerical vector.
Featurization is different from feature engineering. Feature engineering just transforms numerical features in some way so that machine learning models work well; in feature engineering, the features are already in numerical form. In featurization, by contrast, the data does not need to already be a numerical vector.
A machine learning model cannot work with raw text data directly. In the end, machine learning models work with numerical (categorical, real, ...) features. So it is important to change other types of data into numerical vectors so that we can leverage the whole power of linear algebra (making the decision boundary between data points) and statistical tools with those types of data as well.
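As a minimal illustration of featurization, the sketch below turns a few raw sentences into numeric bag-of-words vectors (get_feature_names_out is available in scikit-learn 1.0 and later):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["you are awesome", "you are awful", "awful awful comment"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # each row is now a numeric word-count vector

print(vectorizer.get_feature_names_out())  # the vocabulary behind each column
print(X.toarray())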
Data splitting:-
A dataset used for machine learning should be partitioned into three subsets — training,
test, and validation sets.
Training set: -A data scientist uses a training set to train a model and define its optimal
parameters — parameters it has to learn from data.
Test set: - A test set is needed for an evaluation of the trained model and its capability for
generalization. The latter means a model’s ability to identify patterns in new unseen data
after having been trained over a training data. It’s crucial to use different subsets for
training and testing to avoid model over fitting, which is the incapacity for generalization
we mentioned above.
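A hedged sketch of such a three-way split (assuming a feature matrix X and labels y already exist) can be made with two successive calls to train_test_split, here 70% training, 15% validation and 15% test:

from sklearn.model_selection import train_test_split

# First split off 30% of the data, then divide that 30% evenly into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)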
Modeling:-
During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.
Model training:-
After a data scientist has preprocessed the collected data and split it into three subsets, he
or she can proceed with a model training. This process entails “feeding” the algorithm with
training data. An algorithm will process data and output a model that is able to find a target
value (attribute) in new data — an answer you want to get with predictive analysis. The
purpose of model training is to develop a model.
Two model training styles are most common — supervised and unsupervised learning. The
choice of each style depends on whether you must forecast specific attributes or group data
objects by similarities.
Supervised learning: - Supervised learning allows for processing data with target
attributes or labeled data. These attributes are mapped in historical data before the training
begins. With supervised learning, a data scientist can solve classification and regression
problems.
Unsupervised learning: - During this training style, an algorithm analyzes unlabeled data.
The goal of model training is to find hidden interconnections between data objects and
structure objects by similarities or differences. Unsupervised learning aims at solving such
problems as clustering, association rule learning, and dimensionality reduction. For
instance, it can be applied at the data preprocessing stage to reduce data complexity.
Textual Based:
We group features such as cyberbullying keywords, profanity, pronouns, n-grams, Bag-of-Words (BoW), Term Frequency Inverse Document Frequency (TF-IDF), document length, and spelling as content-based features. Content-based features are overwhelmingly used across our sample, with as many as 41 papers utilising content-based features. As cyberbullying messages are often abusive and insulting in nature, it is not surprising that profanity was among the most commonly used features.
Model Testing:-
The goal of this step is to develop the simplest model able to formulate a target value fast and well enough. A data scientist can achieve this goal through model tuning, that is, the optimization of model parameters to achieve the algorithm’s best performance. One of the more efficient methods for model evaluation and tuning is cross-validation.
• Labelling of data
• Generation of vocabulary
RANDOM FOREST:
It is a type of ensemble learning method and is used for both classification and regression tasks. The accuracy it gives is greater than that of many other models, and it can easily handle large datasets. It is a popular ensemble learning method: Random Forest improves on the performance of a single Decision Tree by reducing variance. It operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Algorithm:
1. Select “R” features at random from the total “M” features, where R << M.
2. Among the “R” features, find the node using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat steps 1 to 3 until “l” nodes have been reached.
5. Build the forest by repeating steps 1 to 4 to create “n” trees.
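In practice the report trains a Random Forest through scikit-learn rather than implementing these steps by hand; a hedged sketch with toy data and illustrative parameters is shown below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy labelled comments: 1 = toxic, 0 = not toxic.
texts = ["nice post", "you are an idiot", "have a good day", "nobody likes you", "great work", "go away loser"]
labels = [0, 1, 0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

# An ensemble of 100 decision trees; averaging over many trees reduces variance.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))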
PERFORMANCE METRICS:
The data was divided into two portions, training data and testing data, consisting of 70% and 30% of the data respectively. Both algorithms were applied to the same dataset using Enthought Canopy and the results were obtained.
Prediction accuracy is the main evaluation parameter used in this work. Accuracy is the overall success rate of the algorithm and can be defined as: Accuracy = (TP + TN) / (TP + TN + FP + FN).
CONFUSION MATRIX:
It is the most commonly used evaluation metrics in predictive analysis mainly because it is very
easy to understand and it can be used to compute other essential metrics such as accuracy, recall,
precision, etc. It is an NxN matrix that describes the overall performance of a model when used on
some dataset, where N is the number of class labels in the classification problem.
Accuracy is all predicted true positives and true negatives divided by all positive and negative samples. The True Positive (TP), True Negative (TN), False Negative (FN) and False Positive (FP) counts predicted by the algorithms are presented in the table.
True positive (TP) indicates that a sample that is actually positive is predicted as positive by the model.
False negative (FN) indicates that a sample that is actually positive is predicted as negative by the model.
False positive (FP) indicates that a sample that is actually negative is predicted as positive by the model.
True negative (TN) indicates that a sample that is actually negative is predicted as negative by the model.
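A hedged sketch showing how these counts and the overall accuracy can be computed with scikit-learn for a binary problem:

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by the model

# For binary labels the 2x2 confusion matrix unravels to TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                       # 3 1 1 3

print((tp + tn) / (tp + tn + fp + fn))      # accuracy computed from the matrix: 0.75
print(accuracy_score(y_true, y_pred))       # the same value from scikit-learn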
2.5 SYSTEM DESIGN:
Designing a system is the process of defining the interfaces, modules and data of a system so that it satisfies the specified requirements. System design can be seen as an application of systems theory. The main aim of designing a system is to develop the system architecture by providing the data and information that are necessary for the implementation of the system.
Data flow diagrams are used to graphically represent the flow of data in a business
information system. DFD describes the processes that are involved in a system to transfer data
from the input to the file storage and reports generation. Data flow diagrams can be divided into
logical and physical. The logical data flow diagram describes flow of data through a system to
perform certain functionality of a business. The physical data flow diagram describes the
implementation of the logical data flow.
ER DIAGRAM:
The sequence diagram of a system shows how the entities interact, ordered along the time axis. It depicts the classes and objects involved in the scenario and the sequence of messages exchanged between the objects that must be carried out for the purpose of that scenario.
The below state chart diagram describes the flow of control from one state to another state (event)
in the flow of the events from the creation of an object to its termination.
COLLABORATION DIAGRAM:
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
ANACONDA
It is a free and open-source distribution of the Python and R programming languages for scientific
computing (data science, machine learning applications, large-scale data processing, predictive
analytics, etc.), that aims to simplify package management and deployment.
Anaconda distribution comes with more than 1,500 packages as well as the Conda package
and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical
alternative to the Command Line Interface (CLI).
The big difference between Conda and the pip package manager is in how package dependencies
are managed, which is a significant challenge for Python data science and the reason Conda exists.
Pip installs all Python package dependencies required, whether or not those conflict with other
packages you installed previously.
So your working installation of, for example, Google Tensorflow, can suddenly stop working when
you pip install a different package that needs a different version of the Numpy library. More
insidiously, everything might still appear to work but now you get different results from your data
science, or you are unable to reproduce the same results elsewhere because you didn't pip install
in the same order.
Conda analyzes your current environment, everything you have installed, any version limitations
you specify (e.g. you only want tensorflow >= 2.0) and figures out how to install compatible
dependencies. Or it will tell you that what you want can't be done. Pip, by contrast, will just install
the thing you wanted and any dependencies, even if that breaks other things. Open source packages
can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or
your own private repository or mirror, using the conda install command. Anaconda Inc compiles
and builds all the packages in the Anaconda repository itself, and provides binaries for Windows
32/64 bit, Linux 64 bit and MacOS 64-bit. You can also install anything on PyPI into a Conda
environment using pip, and Conda knows what it has installed and what pip has installed. Custom
packages can be made using the conda build command, and can be shared with others by uploading
them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2
includes Python 2.7 and Anaconda3 includes Python 3.7. However, you can create new
environments that include any version of Python packaged with conda.
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glueviz
Orange
Rstudio
Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating
XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET
Framework is a language-neutral platform for writing programs that can easily and securely
interoperate. There’s no language barrier with .NET: there are numerous languages available to
the developer including Managed C++, C#, Visual Basic and Java Script. The .NET framework
provides the foundation for components to interact seamlessly, whether locally or remotely on
different platforms. It standardizes common data types and communications protocols so that
components created in different languages can easily interoperate.
“.NET” is also the collective name given to various software components built upon the .NET
platform. These will be both products (Visual Studio.NET and Windows.NET Server, for instance)
and services (like Passport, .NET My Services, and so on).
Easy to code
Free and Open Source
Object-Oriented Language
GUI Programming Support
High-Level Language
Extensible feature
Python is Portable language
Python is Integrated language
Interpreted
Large Standard Library
Dynamically Typed Language
PYTHON:
1.Easy to code:
Python is a high-level programming language. Python is very easy to learn compared to other languages like C, C#, JavaScript, Java, etc. It is very easy to code in the Python language and anybody can learn the basics of Python in a few hours or days. It is also a developer-friendly language.
3.Object-Oriented Language:
One of the key features of python is Object-Oriented programming. Python supports object
oriented language and concepts of classes, objects encapsulation etc.
5. High-Level Language:
Python is a high-level language. When we write programs in python, we do not need to
remember the system architecture, nor do we need to manage the memory.
6.Extensible feature:
Python is an extensible language: we can write some of our Python code in the C or C++ language and also compile that code as a C/C++ extension.
9. Interpreted Language:
Python is an interpreted language because Python code is executed line by line. Unlike languages such as C, C++ and Java, there is no need to compile Python code separately, which makes it easier to debug our code. The source code of Python is converted into an intermediate form called bytecode.
WEB APPLICATIONS
You can create scalable Web Apps using frameworks and CMS (Content Management
System) that are built on Python. Some of the popular platforms for creating Web Apps
are: Django, Flask, Pyramid, Plone, Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
There are numerous libraries available in Python for scientific and numeric computing.
There are libraries like: SciPy and NumPy that are used in general purpose computing.
And, there are specific libraries like: EarthPy for earth science, AstroPy for Astronomy
and so on.
Also, the language is heavily used in machine learning, data mining and deep learning.
Python is slow compared to compiled languages like C++ and Java. It might not be a
good choice if resources are limited and efficiency is a must.
However, Python is a great language for creating prototypes. For example: You can use
Pygame (library for creating games) to create your game's prototype first. If you like the
prototype, you can use language like C++ to create the actual game.
GOOD LANGUAGE TO TEACH PROGRAMMING
IMPLEMENTATION
4.1 GENERAL
In this we implement the coding part using anaconda. Below are the coding’s that
are used to generate the domain module for Deep learning. Here the proposed techniques
are used in the coding.
import pandas as pd
import numpy as np
import os
import re
import nltk
import pickle
#from imblearn.over_sampling import SMOTE
import seaborn as sns
from nltk.corpus import stopwords, words
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
data=pd.read_csv('formspring_data.csv',error_bad_lines=False,sep='\t')
data.head()
# Keep only rows where all three annotators agree (assuming the labels ans1-ans3 are columns of the loaded dataframe `data`).
df1 = data[(data.ans1 == data.ans2) & (data.ans2 == data.ans3) & (data.ans1 == data.ans3)].reset_index(drop=True)
print(df1.shape)
df1.head()
# Remove everything except letters, digits and whitespace from the question and answer text.
df1['ques'] = df1['ques'].apply(lambda x: re.sub('[^a-zA-Z\s0-9]', '', x) if isinstance(x, str) else ' ')
df1.head()
df1['ans'] = df1['ans'].apply(lambda x: re.sub('[^a-zA-Z\s0-9]', '', x) if isinstance(x, str) else ' ')
df1.head()
# Stem each word with the Snowball stemmer.
stemmer = SnowballStemmer('english')
df1['ques'] = df1['ques'].apply(lambda x: ' '.join([stemmer.stem(i) for i in x.split(' ')]) if isinstance(x, str) else x)
df1['ans'] = df1['ans'].apply(lambda x: ' '.join([stemmer.stem(i) for i in x.split(' ')]) if isinstance(x, str) else x)
# Lemmatize each word with WordNet.
#stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()
df1['ques'] = df1['ques'].apply(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split(' ')]) if isinstance(x, str) else x)
df1['ans'] = df1['ans'].apply(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split(' ')]) if isinstance(x, str) else x)
# Answers from rows labelled 'Yes' by the first annotator.
ans2 = df1[df1.ans.notna()]
Ans = ans2[ans2.ans1 == 'Yes']
Ans['ans'].values
import matplotlib.pyplot as plt
import random

# Prepare Data
df = data.groupby('label').size().reset_index(name='counts')
n = df['label'].unique().__len__() + 1
all_colors = list(plt.cm.colors.cnames.keys())
random.seed(100)
c = random.choices(all_colors, k=n)

# Plot Bars
plt.figure(figsize=(6, 6), dpi=80)
plt.bar(df['label'], df['counts'], color=c, width=.5)
for i, val in enumerate(df['counts'].values):
    plt.text(i, val, float(val), horizontalalignment='center', verticalalignment='bottom', fontdict={'fontweight': 500, 'size': 12})

# Decoration
plt.gca().set_xticklabels(df['label'], rotation=60, horizontalalignment='right')
plt.ylim(0, 6000)
plt.show()
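The sparse matrix shown below is presumably the TF-IDF feature matrix X built from the cleaned text; the original vectorisation cell is missing from this copy, so the following is only a hedged sketch of how such an X could be produced:

from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the cleaned question and answer text and vectorise it with TF-IDF.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df1['ques'].fillna('') + ' ' + df1['ans'].fillna(''))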
X
<10000x6295 sparse matrix of type '<class 'numpy.float64'>'
with 60929 stored elements in Compressed Sparse Row format>
SCREENSHOTS
CHAPTER 5
5.1 CONCLUSION:
The goal of this project is the automatic detection of hate speech-related posts on social media. Given the information overload on the web, manual monitoring for cyberbullying has become unfeasible, and automatic detection of signals of cyberbullying would enhance moderation and allow a quick response when necessary. Posts from bullies, victims and bystanders alike can indicate that cyberbullying is going on. The main contribution of this project is a system that automatically detects such signals of cyberbullying on social media, including different types of hate speech.
In this project, we develop achievability protocols and outer bounds for the secure
network coding setting, where the edges are subject to packet erasures, and public feedback
of the channel state is available to both Eve and the legitimate network nodes. Secure
network coding assumes that the underlying network channels are error-free; thus, if our
channels introduce errors, we need to first apply a channel code to correct them, and then
build security on top of the resulting error-free network. We show that by leveraging
erasures and feedback, we can achieve secrecy rates that are in some cases multiple times
higher than the alternative of separate channel-error-correction followed by secure network
coding;
5.2 APPLICATION
DETECTION OF HATESPEECH
The dataset based on toxic command was gathered and preprocessing was done. After preprocessing, the data was used for training, various machine learning algorithms were applied and trained on the data, and finally the prediction was obtained together with its accuracy.
[1] D. Poeter. (2011) Study: A Quarter of Parents Say Their Child Involved in Cyberbullying.
pcmag.com. [Online].Available: http://www.pcmag.com/article2/0,2817,2388540,00.asp
[2] J. W. Patchin and S. Hinduja, “Bullies move Beyond the Schoolyard; a Preliminary Look at
Cyberbullying,” Youth Violence and Juvenile Justice, vol. 4, no. 2, pp. 148–169,2006
[4] N. E. Willard, Cyberbullying and Cyberthreats: Responding to the Challenge of Online Social
Aggression, Threats, and Distress. Research Press, 2007.
[5] D. Maher, “Cyberbullying: an Ethnographic Case Study of one Australian Upper Primary
School Class,” Youth Studies Australia, vol. 27, no. 4, pp. 50–57, 2008.
[8] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques,
Second Edition. San Francisco, CA: Morgan Kauffman, 2005.
[9] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kauffman, 1993.
[10] W. W. Cohen, “Fast Effective Rule Induction,” in Proc. Twelfth International Conference on
Machine Learning (ICML’95), Tahoe City, CA, 1995, pp. 115–123.
[11] D. W. Aha and D. Kibler, “Instance-based Learning Algorithms,” Machine Learning, vol. 6,
pp. 37–66, 1991.
[12] J. C. Platt, “Fast Training of Support Vector Machines using Sequential Minimal
Optimization,” Advances in Kernel Methods, pp. 185–208, 1999. [Online]. Available:
http://portal.acm.org/citation.cfm?id=299094.299105
[13] https://www.sciencedirect.com/topics/computerscience/deep-neural-network
[14] An Effective Approach for Cyberbullying Detection and Avoidance, IEEE paper.
[15] Approaches to Automated Detection of Cyberbullying: A Survey, IEEE paper.