Human Value Ethics
INTERNSHIP REPORT
Submitted By
P.SRIRAM - (512221104057)
In partial fulfilment for the award of the degree
of
BACHELOR OF ENGINEERING
(COMPUTER SCIENCE AND ENGINEERING)
THIRUVANNAMALAI – 606601
SEPTEMBER 2024
STRYDO TECHNOLOGIES PVT. LTD.
2nd floor, sapthagiri complex,
Abstract:
Social media has become a platform where many young people are being bullied. As social networking sites grow, toxic and abusive content is increasing day by day. By using NLP to recognise word patterns in the tweets posted by offenders, a machine learning (ML) model can be built to automatically recognise cyberbullying activity on social media. Although several cyberbullying detection techniques have already been implemented, many of them were purely text based. The aim of this work is to demonstrate software that can recognise hateful tweets, posts, and similar content. An ML model is proposed to detect and prevent bullying on Twitter. Random Forest (RF) is used for training and testing on the social media bullying content, and both the Support Vector Machine (SVM) and Random Forest classifiers were able to identify true positives with good accuracy.
TABLE OF CONTENTS
CHAPTER 1 : INTRODUCTION
1.1 GENERAL
1.1.1 THE MACHINE LEARNING SYSTEM
1.1.2 FUNDAMENTAL
1.2 JUPYTER
1.3 MACHINE LEARNING
1.4 CLASSIFICATION TECHNIQUES
1.4.1 NEURAL NETWORK AND DEEP LEARNING
1.4.2 METHODOLOGIES - GIVEN INPUT AND EXPECTED OUTPUT
1.5 OBJECTIVE AND SCOPE OF THE PROJECT
1.6 EXISTING SYSTEM
1.6.1 DISADVANTAGES OF EXISTING SYSTEM
1.6.2 LITERATURE SURVEY
1.7 PROPOSED SYSTEM
1.7.1 PROPOSED SYSTEM ADVANTAGES
INTRODUCTION
1.1 GENERAL
Glossary and Key Terms
This section provides a quick reference for several libraries that are not explicitly mentioned in
this chapter, but may be of interest to the reader. This should provide the reader with some
keywords or useful points of reference for other similar libraries to those discussed in this chapter.
BIDMach GPU accelerated machine learning library for algorithms that are not necessarily
neural network based.
Caret provides a standardised API for many of the most useful machine learning packages for
R. For readers who are more comfortable with R, Caret provides a good substitute for Python’s
SciKit-Learn.
R is used extensively by the statistics community. The software package Caret provides a
standardised API for many of R’s machine learning libraries.
WEKA is short for the Waikato Environment for Knowledge Analysis [6] and has been a very
popular open source tool since its inception in 1993. In 2005 Weka received the SIGKDD Data
Mining and Knowledge Discovery Service
Award: it is easy to learn and simple to use, and provides a GUI to many machine learning
algorithms.
Vowpal Wabbit Microsoft’s machine learning library. Mature and actively developed, with an
emphasis on performance.
Managing Packages
Anaconda comes with its own built in package manager, known as Conda. Using the conda
command from the terminal, you can download, update, and delete Python packages. Conda takes
care of all dependencies and ensures that packages are preconfigured to work with all other
packages you may have installed.
Keeping your Python distribution up to date and well maintained is essential in this fast moving
field. However, Anaconda makes it particularly easy to manage and keep your scientific stack up
to date. Once Anaconda is installed you can manage your Python distribution, and all the scientific
packages installed by Anaconda using the conda application from the command line. To list all
packages currently installed, use conda list. This will output all packages and their version
numbers. Updating all Anaconda packages in your system is performed using the conda update --all command. Conda itself can be updated using the conda update conda command, while Python
can be updated using the conda update python command. To search for packages, use the search
parameter, e.g. conda search stats where stats is the name or partial name of the package you are
searching for.
Machine Learning
We will now move on to the task of machine learning itself. In the following sections we will
describe how to use some basic algorithms, and perform regression, classification, and clustering
on some freely available medical datasets concerning breast cancer and diabetes, and we will also
take a look at a DNA microarray dataset.
SciKit-Learn
SciKit-Learn provides a standardised interface to many of the most commonly used machine
learning algorithms, and is the most popular and frequently used library for machine learning for
Python. As well as providing many learning algorithms, SciKit-Learn has a large number of
convenience functions for common preprocessing tasks (for example, normalisation or k-fold cross
validation).
SciKit-Learn is a very large software library.
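As an illustration (not taken from the original report), the short sketch below shows two of these convenience functions, feature scaling and k-fold cross validation, applied to a small dummy dataset:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Dummy data standing in for a real dataset: 100 samples with 5 features each.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# Normalise each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# 5-fold cross validation of a linear SVM on the scaled data.
scores = cross_val_score(SVC(kernel='linear'), X_scaled, y, cv=5)
print(scores.mean())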
Clustering
Clustering algorithms focus on ordering data together into groups. In general clustering algorithms
are unsupervised—they require no y response variable as input. That is to say, they attempt to find
groups or clusters within data where you do not know the label for each sample. SciKit-Learn has
many clustering algorithms, but in this section we will demonstrate hierarchical clustering on a
DNA expression microarray dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using the
SciPy library.
The goal is to cluster the data properly in logical groups, in this case into the cancer types
represented by each sample’s expression data. We do this using agglomerative hierarchical
clustering, using Ward’s linkage method:
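The original listing is not reproduced here; the following is a minimal sketch of the idea, assuming the expression data has already been loaded into a NumPy array X (one row per sample):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Placeholder for the microarray expression matrix (samples x genes).
X = np.random.rand(20, 50)

# Agglomerative hierarchical clustering with Ward's linkage method.
Z = linkage(X, method='ward')

# Visualise the resulting cluster hierarchy as a dendrogram.
dendrogram(Z)
plt.show()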
Classification
In the previous section we analysed data that was unlabelled: we did not know to what class a sample belonged (known as unsupervised learning). In contrast to this, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which
class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of
algorithms for classification, in this section we will look at the Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set, train
a Support Vector Machine with a linear kernel, and test the trained model on an unseen dataset.
The Support Vector Machine model should be able to predict if a new sample is malignant or
benign based on the features of a new, unseen sample:
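A minimal sketch of this workflow is given below, using the copy of the Wisconsin breast cancer dataset bundled with SciKit-Learn; the split ratio and random seed are assumptions rather than the report's original settings:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the dataset and hold out 30% of the samples as an unseen test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Support Vector Machine with a linear kernel on the training set only.
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Evaluate on the unseen test set: precision, recall, F1 and support per class.
print(classification_report(y_test, model.predict(X_test)))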
You will notice that the SVM model performed very well at predicting the malignancy of new,
unseen samples from the test set—this can be quantified nicely by printing a number of metrics
using the classification report function. Here, the precision, recall, and F1 score (F1 = 2 · (precision · recall) / (precision + recall)) for each class is shown. The support column is a count of the
number of samples for each class.
Support Vector Machines are a very powerful tool for classification. They work well in high
dimensional spaces, even when the number of features is higher than the number of samples.
However, their running time is quadratic to the number of samples so large datasets can become
difficult to train. Quadratic means that if you increase a dataset in size by 10 times, it will take 100
times longer to train.
Last, you will notice that the breast cancer dataset consisted of 30 features. This makes it difficult
to visualize or plot the data. To aid in visualization of highly dimensional data, we can apply a
technique called dimensionality reduction.
Dimensionality Reduction
Another important method in machine learning, and data science in general, is dimensionality
reduction. For this example, we will look at the Wisconsin breast cancer dataset once again. The
dataset consists of over 500 samples, where each sample has 30 features. The features relate to
images of a fine needle aspirate of breast tissue, and the features describe the characteristics of the
cells present in the images. All features are real values. The target variable is a discrete value
(either malignant or benign) and is therefore a classification dataset.
You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the data, where
each feature was plotted against every other feature in the dataset to look for potential correlations
(Fig. 3). By examining this plot you could probably find features which would separate the dataset
into groups. Because the dataset only had 4 features we were able to plot each feature against each
other relatively easily. However, as the numbers of features grow, this becomes less and less
feasible, especially if you consider the gene expression example in Sect. 9.4 which had over 6000
features.
One method that is used to handle data that is highly dimensional is Principal Component Analysis,
or PCA. PCA is an unsupervised algorithm for reducing the number of dimensions of a dataset.
For example, for plotting purposes you might want to reduce your data down to 2 or 3 dimensions,
and PCA allows
you to do this by generating components, which are combinations of the original features, that you
can then use to plot your data.
PCA is an unsupervised algorithm. You supply it with your data, X, and you specify the number
of components you wish to reduce its dimensionality to. This is known as transforming the data:
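A minimal sketch of this transformation, assuming the 30-feature breast cancer data is loaded as X:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X = load_breast_cancer().data             # 569 samples x 30 features

# Reduce the 30 original features to 2 components for plotting.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # the "transformed" data
print(X_reduced.shape)                    # (569, 2)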
Again, you would not use this model for new data—in a real world scenario, you would, for
example, perform a 10-fold cross validation on the dataset, choosing the model parameters that
perform best on the cross validation. This model would be much more likely to perform well on
new data. At the very least, you would randomly select a subset, say 30% of the data, as a test set
and train the model on the remaining 70% of the dataset. You would evaluate the model based on
the score on the test set and not on the training set.
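As a hedged illustration of this evaluation strategy, the sketch below performs a 70/30 train/test split and a 10-fold cross validation on the training portion:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the data as a test set; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 10-fold cross validation on the training data to choose and validate the model.
cv_scores = cross_val_score(SVC(kernel='linear'), X_train, y_train, cv=10)
print(cv_scores.mean())

# Final evaluation on the held-out test set, never on the training set.
model = SVC(kernel='linear').fit(X_train, y_train)
print(model.score(X_test, y_test))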
Keras additionally requires either Theano or TensorFlow to be installed. In the examples in this
chapter we are using Theano as a backend, however the code will work identically for either
backend. You can install Theano using pip, but it has a number of dependencies that must be
installed first. Refer to the Theano and TensorFlow documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a stack of modules,
from the input of the neural network, to the output of the neural network, piece by piece until you
have a complete network. Also, Keras can be configured to use your Graphics Processing Unit, or
GPU. This makes training neural networks far faster than if we were to use a CPU. We begin by
importing Keras:
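The original import listing is not included in this copy of the report; a minimal, assumed sketch of importing Keras and stacking a small fully connected network module by module looks like this:

from keras.models import Sequential
from keras.layers import Dense

# Build the network piece by piece, from input to output.
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=30))   # hidden layer; 30 input features assumed
model.add(Dense(1, activation='sigmoid'))                # output layer for binary classification
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])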
We may want to view the network’s accuracy on the test (or its loss on the training set) over time
(measured at each epoch), to get a better idea how well it is learning. An epoch is one complete
cycle through the training data.
Fortunately, this is quite easy to plot as Keras’ fit function returns a history object which we can
use to do exactly this:
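Continuing the sketch above (and assuming X_train, y_train, X_test and y_test already exist), the history object returned by fit can be plotted like this; note that the metric key is 'accuracy' in recent Keras versions and 'acc' in older ones:

import matplotlib.pyplot as plt

# Train the network and keep the per-epoch history of loss and accuracy.
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test))

# Plot training and test (validation) accuracy measured at each epoch.
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='test accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()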
This will result in a plot similar to that shown. Often you will also want to plot the loss on the test
set and training set, and the accuracy on the test set and training set.
Plotting the loss and accuracy can be used to see if you are over fitting (you experience tiny loss
on the training set, but large loss on the test set) and to see when your training has plateaued.
OBJECTIVE:
The main aim of the hate speech detection model is to help improve on manual monitoring for unwanted chats on social networks. In this project we fetch tweets from Twitter accounts, preprocess the tweets and images, and apply the generated model to detect whether they contain hate speech or not. The objectives of the system are to: collect a dataset of hate words and preprocess it; apply natural language processing and then machine learning algorithms; generate models with different machine learning algorithms; fetch tweets from a Twitter account and preprocess them; and apply the generated model to the fetched tweets to obtain the final output, hateful or not.
Toxic command is the use of electronic communication to bully a person by sending hateful messages using social media, instant messaging or other digital messages. Toxic command can be very damaging to adolescents and teens: it can lead to anxiety, depression, and even suicide. Also, once things are circulated on the Internet, they may never disappear, resurfacing at later times to renew the pain of the hate speech. To overcome these issues, detecting toxic command is very important nowadays, as it helps to stop toxic conversations on social media networks.
PROBLEM STATEMENT:
While social media networks give us great communication platforms and opportunities, they also increase the vulnerability of young people to threatening situations online. Toxic command on social media networks is a global phenomenon because of their huge volumes of active users. The trend shows that toxic command on social networks is growing rapidly every day. Recent studies report that toxic command constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages, and the information overload on the Web requires intelligent systems to identify potential risks automatically. So, in this project we focus on building a model for automatic toxic command detection in social media text by modelling posts written by bullies on social networks.
EXISTING SYSTEM
The Naïve Bayes model involves a simple conditional independence assumption, i.e. given a
class which may be positive or negative; the words are conditionally independent of each other.
This assumption doesn't much affect the accuracy of text classification but makes really fast classification applicable to the problem.
Tokenization: In this part we take the text as sentences or whole paragraphs and then output the entered text as separated words in a list.
Lowering text: This takes the list of words produced by tokenization and lowercases all the letters, e.g. ’THIS IS AWESOME’ becomes ’this is awesome’.
Stop words and encoding cleaning: This is an essential part of the preprocessing where we clean the text of stop words and encoding characters like \n or \t, which do not provide meaningful information to the classifiers.
Word Correction: In this part we used the Microsoft Bing word correction API [24], which takes a word and returns a JSON object with the most similar words and the distance between these words and the original word.
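A small hedged sketch of the first three preprocessing steps (tokenization, lowering and stop-word cleaning) using NLTK is shown below; the Bing word-correction step is omitted because it requires an API key:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

text = "THIS IS AWESOME\n and full of stop words"

tokens = word_tokenize(text)                   # tokenization: text to a list of words
tokens = [t.lower() for t in tokens]           # lowering text
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stop word / encoding cleaning
print(tokens)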
LITERATURE SURVEY:
1. TITLE: Detection of Toxic command using Text Mining and Natural Language Processing
YEAR: 2020
AUTHOR: G. Priyadharshini
DESCRIPTION:
In today’s modern world, technology connected with humanity is doing wonderful things. On the other hand, the anonymity that people enjoy on social networks brings out the very worst in some of them in the form of hate speech. Social media toxic command is a serious societal problem which can magnify violence, ranging from lynching to ethnic cleansing.
One of the critical tasks of automatic detection of toxic command is differentiating it from the
other context of offensive languages. The existing works to distinguish the two categories using
the lexical methods showed very low performance metrics values which led to major
misclassification. The works with supervised machine learning approaches indeed gave significant
results in distinguishing hate and offensive but the presence or absence of certain words of both
the classes can serve as both merit and demerit to achieve accurate classification. In this paper, a
ternary classification of tweets into hate speech, offensive, and neither is performed using multi-class classifiers. Among the four classifiers considered (Logistic Regression, Random Forest, Support Vector Machine (SVM) and Naïve Bayes), the Random Forest classifier performs significantly well with almost all feature combinations, giving a maximum accuracy of 0.90 with the TF-IDF feature technique.
2. TITLE: Toxic Speech Classification via Deep Learning using Combined Features from BERT &
FastText Embedding
YEAR: 2021
AUTHOR: Asmi P
DESCRIPTION:
With the growing internet usage rate, people are more likely to express their opinions or ideas openly on social media. A lot of discussion platforms are available nowadays, but some users misuse the freedom of speech by spreading toxic speech online. Such toxic speech is intended not just to insult or mock, but to harass and cause lasting pain by attacking something uniquely dear to the target. Thus, automatically detecting and removing toxic speech on social media is very important. We propose a feature-based method that combines the features of TF-IDF, FastText embeddings and BERT embeddings and uses a DNN classifier. We compare the individual features of these three methods with the combined features as a performance analysis.
DESCRIPTION:
The increasing use of online social media and their demand has turned up the rise of cyberbullying
among people. Nowadays cyberbullying has become very frequent. The majority of the people are using
social media to troll and smear others, and the others are being defamed and agitated by unknown
users or friends. So it is necessary to detect these types of comments and prevent them. Our work proposes
an ensemble learning approach to detect cyberbullying comments. Different supervised ensemble
learning techniques are used to classify comments. Here voting classifier trains on an ensemble of
Support Vector Machine, Logistic Regression, and Perceptron models and predicts the output based on
the highest majority of the vote. This model detects cyberbullying comments with 94% accuracy.
AUTHOR: K.H.ChanaChristy
DESCRIPTION:
Cyber bullying on social networking sites is an emerging societal issue that has drawn
significant scholarly attention. The purpose of this study is to consolidate the existing
knowledge through a literature review and analysis. We first discuss the nature, research
patterns, and theoretical foundations. We then develop an integrative framework based on
social cognitive theory to synthesize what is known and identify what remains to be
learned, with a focus on the triadic reciprocal relationships between perpetrators, victims,
and bystanders. We discuss the key findings and highlight opportunities for future research.
We conclude this paper by noting research contributions and limitations.
In the proposed work we address toxic command, which is a huge problem on social media websites like Facebook and
Twitter. A number of life-threatening cyberbullying experiences among young people have been reported
internationally thus drawing attention to its negative impact. In the USA, the problem of cyberbullying has
become increasingly evident and has officially been identified as a social threat. The challenges in fighting
cyberbullying include: detecting online bullying when it occurs; reporting it to law enforcement agencies;
and identifying predators and their victims. No present online community or social media website (for example, Facebook or Twitter, where cyberbullying is most common) incorporates a system to automatically and intelligently identify aggression and instances of online harassment on its platform. Despite the seriousness of the problem, there are very few successful efforts to detect abusive behavior, both from the research community and from social media itself, due to several inherent obstacles such as grammatical and syntactic flaws and fairly limited context. Aggression and bullying against an individual can be performed
in several ways beyond just obviously abusive language – for example, via constant sarcasm, trolling, etc.
4. Evaluation and analysis of the best model. The motivation for the research work is to
learn the application and implementation of Natural Language Processing and Machine
Learning in a real-world problem, i.e., cyberbullying and online harassment.
1. Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing
the noisy data, or resolving the inconsistencies in the data.
2. Data Integration: Data with different representations are put together and conflicts within the
data are resolved.
4. Data Reduction: The step aims to present a reduced representation of the data.
The raw data is first loaded into the memory where it is cleansed of escape sequences like \n, \t
and Unicode characters such as \xc2 with a white space. Colloquial words and phrases used
mostly in text messages are replaced with its corresponding English word.
For example, “u” is replaced with “you”; “em” is replaced with “them”; “da” is replaced with
“the” and so on. Contractions such as “won’t” and “can’t” are replaced with “will not” and
“cannot” respectively along with others. The data is further converted to lowercase format.
Advanced natural language processing techniques are used to further preprocess the data to
ensure a better quality and consistency of data format while building the ground truth and
training a classifier. To build a vocabulary of abusive words and internet slangs, a dictionary of
bad words available at (http://urbanoalvarez.es/blog/2008/04/04/bad-words-list/) is used.
The dictionary contains a list of bad words in a number of variations used on the internet and
its corresponding English dictionary word. Through data preprocessing, different variations of
internet slangs are replaced with its dictionary counterpart for which the bad words file is used.
The data is further stemmed down to its root form and occurrences of special characters are removed using regular expressions.
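A hedged illustration of these cleaning steps is given below; the tiny replacement dictionary stands in for the full bad-words/slang list referenced above:

import re
from nltk.stem.snowball import SnowballStemmer

# Toy slang and contraction dictionary (the real list is much larger).
replacements = {"u": "you", "em": "them", "da": "the", "won't": "will not", "can't": "cannot"}
stemmer = SnowballStemmer('english')

def clean(text):
    text = re.sub(r'[\n\t]|\\x[0-9a-f]{2}', ' ', text)                      # escape sequences and encoding debris
    text = ' '.join(replacements.get(w, w) for w in text.lower().split())   # slang and contractions
    text = re.sub(r'[^a-z0-9\s]', '', text)                                 # special characters
    return ' '.join(stemmer.stem(w) for w in text.split())                  # stem words to their root form

print(clean("u won't like da comments\t"))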
PROJECT DESCRIPTION
2.1 INTRODUCTION
Cyberbullying and cyber aggression are serious and widespread issues affecting
increasingly more Internet users. It is defined as an aggressive, intentional act carried out by
an individual or group, that takes place in cyberspace. In today’s hyper-connected society, bullying, which was once limited to particular places or times of the day (e.g., school hours), can instead occur anytime, anywhere, with just a few clicks of a mouse and taps on a keyboard.
Cyberbullying and cyber aggression can take many forms and definitions, however, the
former typically denotes repeated and hostile behavior performed by a group or an individual
and the latter intentional harm delivered via electronic means to a person or a group of people
who perceive such acts as offensive, derogatory, harmful, or unwanted. The abundance of
public discussion spaces on the Internet has in many ways changed how we communicate
with others. These discussions can often be productive, but the anonymity that comes with
hiding behind a username has allowed users to post insulting or inappropriate comments.
These posts can often create a hostile or uncomfortable environment for other users, one that
may even discourage them from visiting the site. In 2017, about 50% of young social media
users reported being bullied online in various forms. Popular social media platforms like
Twitter and Facebook are not immune, as racist and sexist attacks may even have caused
potential buyers of Twitter to balk. The ambition of this research work is to explore the
possibilities of classifying hate speech, insults and harassment which are one of the various
forms of cyberbullying in social media. The research work extends current research on
cyberbullying and online harassment detection.
The hardware requirements may serve as the basis for a contract for the implementation of the
system and should therefore be a complete and consistent specification of the whole system.
They are used by software engineers as the starting point for the system design. It shows what the system does and not how it should be implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 40 GB
There is no exact answer to the question “How much data is needed?” because each
machine learning problem is unique. In turn, the number of attributes data scientists will
use when building a predictive model depends on the attributes’ predictive value.
‘The more, the better’ approach is reasonable for this phase. Some data scientists suggest
considering that less than one-third of collected data may be useful. It’s difficult to estimate
which part of the data will provide the most accurate results until the model training begins.
That’s why it’s important to collect and store all data — internal and open, structured and
unstructured.
The tools for collecting internal data depend on the industry and business infrastructure.
For example, those who run an online-only business and want to launch a personalization
campaign can try out such web analytic tools as Mixpanel, Hotjar, CrazyEgg, well-known
Google analytics, etc. A web log file, in addition, can be a good source of internal data. It
stores data about users and their online behavior: time and length of visit, viewed pages or
objects, and location.
Companies can also complement their own data with publicly available datasets. For
instance, Kaggle, Github contributors, AWS provide free datasets for analysis.
Data preprocessing:-
The purpose of preprocessing is to convert raw data into a form that fits machine learning.
Structured and clean data allows a data scientist to get more precise results from an applied
machine learning model. The technique includes data formatting, cleaning, and sampling.
Data formatting: - The importance of data formatting grows when data is acquired from
various sources by different people. The first task for a data scientist is to standardize record
formats. A specialist checks whether variables representing each attribute are recorded in
the same way. Titles of products and services, prices, date formats, and addresses are
examples of variables. The principle of data consistency also applies to attributes
represented by numeric ranges.
Data cleaning: - This set of procedures allows for removing noise and fixing
inconsistencies in data. A data scientist can fill in missing data using imputation techniques,
e.g. substituting missing values with mean attributes. A specialist also detects outliers —
observations that deviate significantly from the rest of distribution. If an outlier indicates
erroneous data, a data scientist deletes or corrects them if possible. This stage also includes
removing incomplete and useless data objects.
Data sampling: - Big datasets require more time and computational power for analysis. If
a dataset is too large, applying data sampling is the way to go. A data scientist uses this
technique to select a smaller but representative data sample to build and run models much
faster, and at the same time to produce accurate outcomes.
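A small hedged sketch of the cleaning and sampling steps, using a hypothetical pandas DataFrame with a numeric 'price' column:

import pandas as pd

# Toy data with a missing value and an obvious outlier.
df = pd.DataFrame({'price': [10.0, 12.5, None, 11.0, 950.0]})

# Imputation: fill the missing value with the mean of the attribute.
df['price'] = df['price'].fillna(df['price'].mean())

# Outlier handling: keep only values inside the 5th-95th percentile range.
low, high = df['price'].quantile([0.05, 0.95])
df = df[df['price'].between(low, high)]

# Data sampling: draw a smaller but representative sample of the rows.
sample = df.sample(frac=0.5, random_state=1)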
Featurization:-
Featurization is a way to change some form of data (text data, graph data, time-series data, ...) into a numerical vector.
Featurization is different from feature engineering. Feature engineering just transforms numerical features in some way so that machine learning models work well; in feature engineering, the features are already in numerical form. In featurization, by contrast, the data does not need to already be a numerical vector.
A machine learning model cannot work with raw text data directly. In the end, machine learning models work with numerical (categorical, real, ...) features. So it is important to change other types of data into numerical vectors so that we can leverage the whole power of linear algebra (making the decision boundary between data points) and statistical tools with those types of data as well.
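As a minimal illustration of featurization, the sketch below turns a few raw sentences into numeric bag-of-words vectors (get_feature_names_out is available in scikit-learn 1.0 and later):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["you are awesome", "you are awful", "awful awful comment"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # each row is now a numeric word-count vector

print(vectorizer.get_feature_names_out())  # the vocabulary behind each column
print(X.toarray())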
Data splitting:-
A dataset used for machine learning should be partitioned into three subsets — training,
test, and validation sets.
Training set: -A data scientist uses a training set to train a model and define its optimal
parameters — parameters it has to learn from data.
Test set: - A test set is needed for an evaluation of the trained model and its capability for
generalization. The latter means a model’s ability to identify patterns in new unseen data
after having been trained over a training data. It’s crucial to use different subsets for
training and testing to avoid model over fitting, which is the incapacity for generalization
we mentioned above.
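A hedged sketch of such a three-way split (assuming a feature matrix X and labels y already exist) can be made with two successive calls to train_test_split, here 70% training, 15% validation and 15% test:

from sklearn.model_selection import train_test_split

# First split off 30% of the data, then divide that 30% evenly into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)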
Modeling:-
During this stage, a data scientist trains numerous models to define which one of them
provides the most accurate predictions.
Model training:-
After a data scientist has preprocessed the collected data and split it into three subsets, he
or she can proceed with a model training. This process entails “feeding” the algorithm with
training data. An algorithm will process data and output a model that is able to find a target
value (attribute) in new data — an answer you want to get with predictive analysis. The
purpose of model training is to develop a model.
Two model training styles are most common — supervised and unsupervised learning. The
choice of each style depends on whether you must forecast specific attributes or group data
objects by similarities.
Supervised learning: - Supervised learning allows for processing data with target
attributes or labeled data. These attributes are mapped in historical data before the training
begins. With supervised learning, a data scientist can solve classification and regression
problems.
Unsupervised learning: - During this training style, an algorithm analyzes unlabeled data.
The goal of model training is to find hidden interconnections between data objects and
structure objects by similarities or differences. Unsupervised learning aims at solving such
problems as clustering, association rule learning, and dimensionality reduction. For
instance, it can be applied at the data preprocessing stage to reduce data complexity.
Textual Based:
We group features such as cyberbullying keywords, profanity, pronouns, n-grams, Bag-of-Words (BoW), Term Frequency Inverse Document Frequency (TF-IDF), document length, and spelling as content-based features. Content-based features are overwhelmingly used across our sample, with as many as 41 papers utilising content-based features. As cyberbullying messages are often abusive and insulting in nature, it is not surprising that profanity was among the most commonly used features.
Model Testing:-
The goal of this step is to develop the simplest model able to formulate a target value fast and well enough. A data scientist can achieve this goal through model tuning, that is, the optimization of model parameters to achieve the algorithm’s best performance. One of the more efficient methods for model evaluation and tuning is cross-validation.
• Labelling of data
• Generation of vocabulary
RANDOM FOREST:
It is a type of ensemble learning method and is used for both classification and regression tasks. The accuracy it gives is greater than that of many other models, and it can easily handle large datasets. It is a popular ensemble learning method: Random Forest improves on the performance of a single Decision Tree by reducing variance. It operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Algorithm:
1. Select “R” features at random from the total “M” features, where R << M.
2. Among the “R” features, find the node using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat steps 1 to 3 until “l” nodes have been reached.
5. Build the forest by repeating steps 1 to 4 to create “n” trees.
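In practice the report trains a Random Forest through scikit-learn rather than implementing these steps by hand; a hedged sketch with toy data and illustrative parameters is shown below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy labelled comments: 1 = toxic, 0 = not toxic.
texts = ["nice post", "you are an idiot", "have a good day", "nobody likes you", "great work", "go away loser"]
labels = [0, 1, 0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

# An ensemble of 100 decision trees; averaging over many trees reduces variance.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))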
PERFORMANCE METRICS:
The data was divided into two portions, training data and testing data, consisting of 70% and 30% of the data respectively. Both algorithms were applied to the same dataset using Enthought Canopy and the results were obtained.
Prediction accuracy is the main evaluation parameter used in this work. Accuracy is the overall success rate of the algorithm and can be defined as: Accuracy = (TP + TN) / (TP + TN + FP + FN).
CONFUSION MATRIX:
It is the most commonly used evaluation metrics in predictive analysis mainly because it is very
easy to understand and it can be used to compute other essential metrics such as accuracy, recall,
precision, etc. It is an NxN matrix that describes the overall performance of a model when used on
some dataset, where N is the number of class labels in the classification problem.
Accuracy is all predicted true positives and true negatives divided by all positive and negative samples. The True Positive (TP), True Negative (TN), False Negative (FN) and False Positive (FP) counts predicted by the algorithms are presented in the table.
True positive (TP) indicates that a sample that is actually positive is predicted as positive by the model.
False negative (FN) indicates that a sample that is actually positive is predicted as negative by the model.
False positive (FP) indicates that a sample that is actually negative is predicted as positive by the model.
True negative (TN) indicates that a sample that is actually negative is predicted as negative by the model.
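A hedged sketch showing how these counts and the overall accuracy can be computed with scikit-learn for a binary problem:

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by the model

# For binary labels the 2x2 confusion matrix unravels to TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                       # 3 1 1 3

print((tp + tn) / (tp + tn + fp + fn))      # accuracy computed from the matrix: 0.75
print(accuracy_score(y_true, y_pred))       # the same value from scikit-learn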
2.5 SYSTEM DESIGN:
Designing a system is the process of defining the interfaces, modules and data of a system so that it satisfies the specified requirements. System design can be seen as an application of systems theory. The main aim of designing a system is to develop the system architecture by providing the data and information that are necessary for the implementation of the system.
Data flow diagrams are used to graphically represent the flow of data in a business
information system. DFD describes the processes that are involved in a system to transfer data
from the input to the file storage and reports generation. Data flow diagrams can be divided into
logical and physical. The logical data flow diagram describes flow of data through a system to
perform certain functionality of a business. The physical data flow diagram describes the
implementation of the logical data flow.
ER DIAGRAM:
The sequence diagram of a system shows how the entities interact, ordered along the time axis. It depicts the classes and objects involved in the scenario and the sequence of messages exchanged between the objects that must be carried out for the purpose of that scenario.
The below state chart diagram describes the flow of control from one state to another state (event)
in the flow of the events from the creation of an object to its termination.
COLLABORATION DIAGRAM:
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
ANACONDA
It is a free and open-source distribution of the Python and R programming languages for scientific
computing (data science, machine learning applications, large-scale data processing, predictive
analytics, etc.), that aims to simplify package management and deployment.
Anaconda distribution comes with more than 1,500 packages as well as the Conda package
and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical
alternative to the Command Line Interface (CLI).
The big difference between Conda and the pip package manager is in how package dependencies
are managed, which is a significant challenge for Python data science and the reason Conda exists.
Pip installs all Python package dependencies required, whether or not those conflict with other
packages you installed previously.
So your working installation of, for example, Google Tensorflow, can suddenly stop working when
you pip install a different package that needs a different version of the Numpy library. More
insidiously, everything might still appear to work but now you get different results from your data
science, or you are unable to reproduce the same results elsewhere because you didn't pip install
in the same order.
Conda analyzes your current environment, everything you have installed, any version limitations
you specify (e.g. you only want tensorflow >= 2.0) and figures out how to install compatible
dependencies. Or it will tell you that what you want can't be done. Pip, by contrast, will just install
the thing you wanted and any dependencies, even if that breaks other things. Open source packages
can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or
your own private repository or mirror, using the conda install command. Anaconda Inc compiles
and builds all the packages in the Anaconda repository itself, and provides binaries for Windows
32/64 bit, Linux 64 bit and MacOS 64-bit. You can also install anything on PyPI into a Conda
environment using pip, and Conda knows what it has installed and what pip has installed. Custom
packages can be made using the conda build command, and can be shared with others by uploading
them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2
includes Python 2.7 and Anaconda3 includes Python 3.7. However, you can create new
environments that include any version of Python packaged with conda.
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glueviz
Orange
Rstudio
Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating
XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET
Framework is a language-neutral platform for writing programs that can easily and securely
interoperate. There’s no language barrier with .NET: there are numerous languages available to
the developer including Managed C++, C#, Visual Basic and Java Script. The .NET framework
provides the foundation for components to interact seamlessly, whether locally or remotely on
different platforms. It standardizes common data types and communications protocols so that
components created in different languages can easily interoperate.
“.NET” is also the collective name given to various software components built upon the .NET
platform. These will be both products (Visual Studio.NET and Windows.NET Server, for instance)
and services (like Passport, .NET My Services, and so on).
Easy to code
Free and Open Source
Object-Oriented Language
GUI Programming Support
High-Level Language
Extensible feature
Python is Portable language
Python is Integrated language
Interpreted
Large Standard Library
Dynamically Typed Language
PYTHON:
1.Easy to code:
Python is a high-level programming language. Python is very easy to learn compared to other languages like C, C#, JavaScript, Java, etc. It is very easy to code in the Python language and anybody can learn the basics of Python in a few hours or days. It is also a developer-friendly language.
3.Object-Oriented Language:
One of the key features of python is Object-Oriented programming. Python supports object
oriented language and concepts of classes, objects encapsulation etc.
5. High-Level Language:
Python is a high-level language. When we write programs in python, we do not need to
remember the system architecture, nor do we need to manage the memory.
6.Extensible feature:
Python is an extensible language: we can write some of our Python code in the C or C++ language and also compile that code as a C/C++ extension.
9. Interpreted Language:
Python is an interpreted language because Python code is executed line by line. Unlike languages such as C, C++ and Java, there is no need to compile Python code separately, which makes it easier to debug our code. The source code of Python is converted into an intermediate form called bytecode.
WEB APPLICATIONS
You can create scalable Web Apps using frameworks and CMS (Content Management
System) that are built on Python. Some of the popular platforms for creating Web Apps
are: Django, Flask, Pyramid, Plone, Django CMS.
Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
There are numerous libraries available in Python for scientific and numeric computing.
There are libraries like: SciPy and NumPy that are used in general purpose computing.
And, there are specific libraries like: EarthPy for earth science, AstroPy for Astronomy
and so on.
Also, the language is heavily used in machine learning, data mining and deep learning.
Python is slow compared to compiled languages like C++ and Java. It might not be a
good choice if resources are limited and efficiency is a must.
However, Python is a great language for creating prototypes. For example: You can use
Pygame (library for creating games) to create your game's prototype first. If you like the
prototype, you can use language like C++ to create the actual game.
GOOD LANGUAGE TO TEACH PROGRAMMING
IMPLEMENTATION
4.1 GENERAL
In this we implement the coding part using anaconda. Below are the coding’s that
are used to generate the domain module for Deep learning. Here the proposed techniques
are used in the coding.
import pandas as pd
import numpy as np
import os
import re
import nltk
import pickle
#from imblearn.over_sampling import SMOTE
import seaborn as sns
from nltk.corpus import stopwords, words
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
data=pd.read_csv('formspring_data.csv',error_bad_lines=False,sep='\t')
data.head()
# Keep only rows where all three annotators agree (assuming the labels ans1-ans3 are columns of the loaded dataframe `data`).
df1 = data[(data.ans1 == data.ans2) & (data.ans2 == data.ans3) & (data.ans1 == data.ans3)].reset_index(drop=True)
print(df1.shape)
df1.head()
# Remove everything except letters, digits and whitespace from the question and answer text.
df1['ques'] = df1['ques'].apply(lambda x: re.sub('[^a-zA-Z\s0-9]', '', x) if isinstance(x, str) else ' ')
df1.head()
df1['ans'] = df1['ans'].apply(lambda x: re.sub('[^a-zA-Z\s0-9]', '', x) if isinstance(x, str) else ' ')
df1.head()
# Stem each word with the Snowball stemmer.
stemmer = SnowballStemmer('english')
df1['ques'] = df1['ques'].apply(lambda x: ' '.join([stemmer.stem(i) for i in x.split(' ')]) if isinstance(x, str) else x)
df1['ans'] = df1['ans'].apply(lambda x: ' '.join([stemmer.stem(i) for i in x.split(' ')]) if isinstance(x, str) else x)
# Lemmatize each word with WordNet.
#stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()
df1['ques'] = df1['ques'].apply(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split(' ')]) if isinstance(x, str) else x)
df1['ans'] = df1['ans'].apply(lambda x: ' '.join([lemmatizer.lemmatize(i) for i in x.split(' ')]) if isinstance(x, str) else x)
# Answers from rows labelled 'Yes' by the first annotator.
ans2 = df1[df1.ans.notna()]
Ans = ans2[ans2.ans1 == 'Yes']
Ans['ans'].values
import matplotlib.pyplot as plt
import random

# Prepare Data
df = data.groupby('label').size().reset_index(name='counts')
n = df['label'].unique().__len__() + 1
all_colors = list(plt.cm.colors.cnames.keys())
random.seed(100)
c = random.choices(all_colors, k=n)

# Plot Bars
plt.figure(figsize=(6, 6), dpi=80)
plt.bar(df['label'], df['counts'], color=c, width=.5)
for i, val in enumerate(df['counts'].values):
    plt.text(i, val, float(val), horizontalalignment='center', verticalalignment='bottom', fontdict={'fontweight': 500, 'size': 12})

# Decoration
plt.gca().set_xticklabels(df['label'], rotation=60, horizontalalignment='right')
plt.ylim(0, 6000)
plt.show()
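The sparse matrix shown below is presumably the TF-IDF feature matrix X built from the cleaned text; the original vectorisation cell is missing from this copy, so the following is only a hedged sketch of how such an X could be produced:

from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the cleaned question and answer text and vectorise it with TF-IDF.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df1['ques'].fillna('') + ' ' + df1['ans'].fillna(''))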
X
<10000x6295 sparse matrix of type '<class 'numpy.float64'>'
with 60929 stored elements in Compressed Sparse Row format>
SCREENSHOTS
CHAPTER 5
5.1 CONCLUSION:
The goal of this project is the automatic detection of hate speech-related posts on social media. Given the information overload on the web, manual monitoring for cyberbullying has become unfeasible, and automatic detection of signals of cyberbullying would enhance moderation and allow a quick response when necessary. Posts from bullies, victims and bystanders alike can indicate that cyberbullying is going on. The main contribution of this project is a system that automatically detects such signals of cyberbullying on social media, including different types of hate speech.
In this project, we develop achievability protocols and outer bounds for the secure
network coding setting, where the edges are subject to packet erasures, and public feedback
of the channel state is available to both Eve and the legitimate network nodes. Secure
network coding assumes that the underlying network channels are error-free; thus, if our
channels introduce errors, we need to first apply a channel code to correct them, and then
build security on top of the resulting error-free network. We show that by leveraging
erasures and feedback, we can achieve secrecy rates that are in some cases multiple times
higher than the alternative of separate channel-error-correction followed by secure network
coding;
5.2 APPLICATION
DETECTION OF HATESPEECH
The dataset based on toxic command was gathered and preprocessing was done. After preprocessing, the data was used for training, various machine learning algorithms were applied and trained on the data, and finally the prediction was obtained together with its accuracy.
[1] D. Poeter. (2011) Study: A Quarter of Parents Say Their Child Involved in Cyberbullying.
pcmag.com. [Online].Available: http://www.pcmag.com/article2/0,2817,2388540,00.asp
[2] J. W. Patchin and S. Hinduja, “Bullies move Beyond the Schoolyard; a Preliminary Look at
Cyberbullying,” Youth Violence and Juvenile Justice, vol. 4, no. 2, pp. 148–169,2006
[4] N. E. Willard, Cyberbullying and Cyberthreats: Responding to the Challenge of Online Social
Aggression, Threats, and Distress. Research Press, 2007.
[5] D. Maher, “Cyberbullying: an Ethnographic Case Study of one Australian Upper Primary
School Class,” Youth Studies Australia, vol. 27, no. 4, pp. 50–57, 2008.
[8] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques,
Second Edition. San Francisco, CA: Morgan Kauffman, 2005.
[9] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kauffman, 1993.
[10] W. W. Cohen, “Fast Effective Rule Induction,” in Proc. Twelfth International Conference on
Machine Learning (ICML’95), Tahoe City, CA, 1995, pp. 115–123.
[11] D. W. Aha and D. Kibler, “Instance-based Learning Algorithms,” Machine Learning, vol. 6,
pp. 37–66, 1991.
[12] J. C. Platt, “Fast Training of Support Vector Machines using Sequential Minimal
Optimization,” Advances in Kernel Methods, pp. 185–208, 1999. [Online]. Available:
http://portal.acm.org/citation.cfm?id=299094.299105
[13] https://www.sciencedirect.com/topics/computerscience/deep-neural-network
[14] An Effective Approach for Cyberbullying Detection and Avoidance, IEEE paper.
[15] Approaches to Automated Detection of Cyberbullying: A Survey, IEEE paper.