
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 07 Issue: 06 | June 2020 www.irjet.net p-ISSN: 2395-0072

Toxic Comment Classification using Natural Language Processing


A. Akshith Sagar, J. Sai Kiran
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Text classification has become one of the most useful applications of Deep Learning. The process includes techniques such as tokenizing, stemming, and embedding. This paper uses these techniques along with several algorithms to classify online comments based on their level of toxicity. We propose a neural network model to classify the comments and compare its accuracy with other models such as Long Short Term Memory (LSTM), Naive Bayes-Support Vector Machine, Fasttext, and Convolutional Neural Networks. The comments are first passed to a tokenizer or vectorizer to create a dictionary of words, an embedding matrix is then created, and the result is passed to a model for classification. The proposed model achieved an accuracy of 98.15%.

Key Words: Toxic comment detection, Long Short Term Memory, Convolutional Neural Networks, Naive Bayes, Support Vector Machine, Fasttext.

1. INTRODUCTION

Social media is a place where a lot of discussion happens, and being anonymous while doing so has given many people the freedom to express their opinions freely. But people who strongly disagree with a point of view can sometimes misuse this freedom. Sharing things that you care about becomes a difficult task under this constant threat of harassment or toxic comments online. This eventually leads to people not sharing their ideas online and no longer asking for other people's opinions on them. Unfortunately, social media platforms face these issues all the time and find it difficult to identify and stop toxic remarks before they lead to the abrupt end of conversations.

In this paper, we use Natural Language Processing with deep neural networks to solve the problem of identifying the toxicity of online comments. Word embeddings are used in conjunction with recurrent neural networks with Long Short Term Memory (LSTM), Convolutional Neural Networks (CNN), Naive Bayes (NB)-Support Vector Machine (SVM), and Fasttext separately to see which model fits and works best.

2. PRE-PROCESSING TEXT

Pre-processing of the text is the first step performed on the dataset. The dataset is cleaned and prepared for the classification tasks by removing punctuation, imputing missing values, normalisation, etc. Besides these common preprocessing functions, there are other techniques used specifically for deep learning classification.

2.1 Tokenization

Tokenization is the process of converting a text corpus into a set of distinct tokens. These tokens are usually numbers assigned to the words present in the text. Since a computer cannot understand a language directly, this method maps all the words to distinct numbers, which makes the text easier for the computer to process. The result of this process is a dictionary of fixed size that contains a mapping from words to numbers.

2.2 Vectorization

Vectorization is a technique in which words are converted into feature vectors. This paper uses Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. TF-IDF vectorization converts the words in a document into a vector that can be used as input to an estimator. It can be used to learn how important a word is to a document; this is done by assigning a score to each word in the document.

Term Frequency is defined as follows:

TF(w) = (Number of times word w appears in a document) / (Total number of words in the document)

Inverse Document Frequency is defined as follows:

IDF(w) = log(Total number of documents / Number of documents containing word w)

The TF-IDF score is simply the product of these two quantities:

TF-IDF(w) = TF(w) * IDF(w)

2.3 Word Embeddings

Every word in the dataset is embedded into a feature vector; this is done by creating an embedding matrix. An embedding matrix is a list of words and their corresponding embeddings, where the embeddings are n-dimensional dense vectors. The embedding matrix has shape (vocab_size, embed_size). Here vocab_size is the number of words in the dictionary obtained from the tokenization step and embed_size is the number of features into which each word is embedded. Many pre-trained word embeddings with different embedding sizes are available, such as GloVe (Global Vectors for Word Representation), word2vec, and Fasttext-crawl. This paper uses fasttext-crawl-300d-2m for the embedding matrix. This embedding matrix is then passed to the different algorithms.
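
To make the tokenization and embedding-matrix steps concrete, the following is a minimal sketch in Keras, assuming a list of raw comment strings named comments and a local copy of the crawl-300d-2M.vec file. The embedding size of 300 and the padded length of 200 come from the paper; the vocabulary cap and variable names are illustrative assumptions, not the authors' exact code.

# Minimal sketch of Sections 2.1 and 2.3: tokenize comments and build an
# embedding matrix from pre-trained fasttext vectors. File name, max_features
# and variable names are assumptions; embed_size=300 and maxlen=200 follow the paper.
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 20000   # assumed vocabulary cap
maxlen = 200           # padded comment length stated in Section 4.3
embed_size = 300       # dimensionality of fasttext-crawl-300d-2m vectors

# comments: list of raw comment strings from the training data (assumed loaded)
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(comments)
sequences = tokenizer.texts_to_sequences(comments)
X = pad_sequences(sequences, maxlen=maxlen)

# Load pre-trained vectors: each line is "word v1 v2 ... v300"
embeddings_index = {}
with open("crawl-300d-2M.vec", encoding="utf-8") as f:
    for line in f:
        values = line.rstrip().split(" ")
        if len(values) == embed_size + 1:      # skips the file's header line
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the (vocab_size, embed_size) embedding matrix; unknown words stay zero
vocab_size = min(max_features, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((vocab_size, embed_size))
for word, i in tokenizer.word_index.items():
    if i < vocab_size and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]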


3. ALGORITHMS

3.1 Long Short Term Memory (LSTM)

Recurrent Neural Networks (RNNs) are neural networks that contain cyclic connections. The output of a given hidden layer is fed back to itself in an RNN to retain some information as memory from previous computations. This makes RNNs a powerful tool for sequential data such as text, video, and speech.

Long Short Term Memory (LSTM) is an RNN that can learn long-term dependencies, which is something a traditional RNN finds difficult to do. An LSTM model, like an RNN, has a chain-like architecture where each unit of this repeating structure is called an LSTM cell.

An LSTM cell contains an input gate, an output gate, and a forget gate that regulate the data flowing into and out of the cell. The forget gate decides what information to discard from the current cell state. The input gate then decides what new information is added to modify the current state of the memory, and finally, the output gate decides what information leaves the cell.

3.2 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a deep neural network that is usually applied to images. CNNs were inspired by the human brain: like the brain, a CNN consists of interconnected neurons in different layers, where each neuron in a layer is a perceptron that performs some computation on the weights passed to it. Although CNNs are mostly used for image classification, they can also be used for text classification by passing the feature vectors of the input text to the CNN. The CNN then computes weights for different neurons, which are used to determine a function that maps the feature vectors to the output. A CNN usually consists of the following layers:

a) Convolutional Layer: The purpose of a convolutional layer is to extract and learn features from the input vectors. A convolutional layer computes the outputs of its neurons by performing dot product operations with the weights and passes this output to an activation function.

b) Activation Function: The output of a convolutional layer is passed to an activation function, which adds non-linearity to the output of the convolutional layer. The most common activation function is the Rectified Linear Unit (ReLU), defined as follows:

f(x) = max(0, x)

c) Pooling Layer: A pooling layer is used to reduce the dimensions of the input while preserving the important features. A convolutional layer is often followed by a pooling layer to reduce the size and number of parameters from the previous layer.

3.3 SVM with NB features (NBSVM)

The Naive Bayes classifier is based on Bayes' theorem, which determines the posterior probability of an event occurring based on prior knowledge or evidence. Multinomial Naive Bayes (MNB) is an instance of the NB classifier which uses a multinomial distribution for each feature of the data.

Support Vector Machine (SVM): an SVM works on the principle of finding hyperplanes that distinctly separate the data points when they are plotted in an n-dimensional space (here n refers to the number of features of the data).

In Naive Bayes-Support Vector Machine, the probabilities calculated by MNB are fed to an SVM for classification. NBSVM is observed to give better results than a simple NB classifier or an SVM classifier used separately.

3.4 Fasttext

Fasttext is a text classification library developed by Facebook. Fasttext can be used to learn word embeddings and to create supervised or unsupervised classification models from these word embeddings. Fasttext has its own word embeddings, called Fasttext crawl, which are trained on around 600 billion tokens. These word embeddings are open and can be downloaded by anyone. Fasttext has multiple pre-trained models to choose from depending on the nature of the problem. In this paper we use the default supervised classifier model.

4. IMPLEMENTATION

The Keras framework was used to implement the LSTM and CNN models. Keras is an open-source neural network library written in Python which provides user-friendly, high-level APIs that enable easy implementation of different deep neural network algorithms.

The dataset used for classification in this paper was provided as part of the competition hosted by Jigsaw and contains 159,571 comments taken from Wikipedia. Each comment is classified into one of 6 labels based on its level of toxicity. The results are then validated against a test data set of 153,164 new examples.

4.1 Naive Bayes-Support Vector Machines

In this approach, we first compute TF-IDF scores for the words in the training data using the TF-IDF vectorizer. This generates a matrix that contains an array of scores for each training example. We then apply Multinomial Naive Bayes to each label column to generate the NB probabilities from the TF-IDF matrix. These are then fed to an SVM to predict the probabilities of each label. After predicting on the test set, NB-SVM achieved an accuracy of 97.61%.
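
As a concrete illustration, below is a minimal sketch of one common NB-SVM formulation: TF-IDF features scaled by Naive Bayes log-count ratios and then passed to a linear SVM, fitted once per label. This is an interpretation of the approach described above rather than the authors' exact code; train_texts, y, the feature cap, and the use of LinearSVC are assumptions (a calibration or logistic step would be needed to obtain probabilities rather than hard decisions).

# Minimal NB-SVM sketch for Section 4.1 (an interpretation, not the authors'
# exact code). Assumes train_texts (list of strings) and y (binary label
# matrix of shape [n_samples, 6]) are already loaded.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(max_features=50000)   # assumed feature cap
X = vectorizer.fit_transform(train_texts)

def nb_log_count_ratio(X, y_col, alpha=1.0):
    # Naive Bayes log-count ratio: r = log(p / ||p||_1) - log(q / ||q||_1)
    p = alpha + np.asarray(X[y_col == 1].sum(axis=0)).ravel()
    q = alpha + np.asarray(X[y_col == 0].sum(axis=0)).ravel()
    return np.log(p / p.sum()) - np.log(q / q.sum())

models = []
for j in range(y.shape[1]):                 # one classifier per toxicity label
    r = nb_log_count_ratio(X, y[:, j])
    X_nb = X.multiply(r)                    # scale TF-IDF features by NB ratios
    clf = LinearSVC(C=1.0).fit(X_nb, y[:, j])
    models.append((r, clf))                 # keep r to scale test features too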

4.2 Fasttext
The fasttext library takes its input in a text format, so all the comments from the training data are converted into a text document in which each training example starts with '__label__' followed by the respective label of the comment and then the comment itself. This text file is fed to the fasttext model and, after tuning the hyperparameters of number of epochs and learning rate to 5 and 0.1 respectively, the model achieved an accuracy of 95.4%.
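
A minimal sketch of this step with the fasttext Python package is shown below. The file name and example comment are assumptions; the epoch count and learning rate are the values stated above.

# Minimal fasttext sketch for Section 4.2. 'toxic_train.txt' is an assumed file
# name; each line looks like: "__label__toxic this is an example comment"
import fasttext

# Train a supervised classifier with the hyperparameters reported above
model = fasttext.train_supervised(input="toxic_train.txt", epoch=5, lr=0.1)

# Predict the label of a new comment
labels, probs = model.predict("example comment text")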

4.3 Long Short Term Memory


The first step of this algorithm is tokenization, which generates a sequence of numbers for each comment. As the comments vary in length, the output of tokenization is padded to a fixed length of 200. This is then passed to a Keras Embedding Layer, which learns the embeddings; the embedding size used was 300. The output of this embedding layer is fed to an LSTM of 60 units, which returns sequences.

This is then passed to a Pooling Layer, a Dense Layer of 60 units, a Dropout Layer, and finally a Dense layer of 6 units with a sigmoid activation function, which predicts the probabilities of the 6 classes.

Fig. 1: LSTM Architecture

This LSTM model achieved an accuracy of 96.92%.

4.4 Convolutional Neural Networks

The first few steps of this algorithm are the same as those of the LSTM. But here, instead of allowing the embedding layer to learn the weights by itself, we provide the Embedding Layer with a matrix extracted from the fasttext-crawl file. After the embedding layer, we use a series of convolutional layers in conjunction with pooling layers. This paper uses 4 convolutional layers and 4 max-pooling layers. The output of these layers is concatenated and then flattened to an array, which is finally fed to a Dense layer of 6 units with a sigmoid activation function, which predicts the probabilities of each label.

Fig. 2: CNN Architecture


This CNN model achieved an accuracy of 98.13%.
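
To make the two architectures concrete, below is a minimal Keras sketch of models along the lines described in Sections 4.3 and 4.4. Layer sizes follow the text where stated (padded length 200, embedding size 300, LSTM of 60 units, Dense layers of 60 and 6 units, 4 convolution and max-pooling stages); the filter counts, kernel sizes, dropout rate, and choice of pooling layers are illustrative assumptions rather than the authors' exact models.

# Sketch of the LSTM (Section 4.3) and CNN (Section 4.4) models. Values not
# stated in the paper (filters, kernel sizes, dropout rate) are assumptions.
from keras.layers import (Input, Embedding, LSTM, GlobalMaxPooling1D, Dense,
                          Dropout, Conv1D, MaxPooling1D, Concatenate, Flatten)
from keras.models import Model

maxlen, embed_size = 200, 300
vocab_size = embedding_matrix.shape[0]      # matrix built as in the Section 2.3 sketch

# --- LSTM model: the embedding layer learns its weights from scratch ---
inp = Input(shape=(maxlen,))
x = Embedding(vocab_size, embed_size)(inp)
x = LSTM(60, return_sequences=True)(x)
x = GlobalMaxPooling1D()(x)                 # "Pooling Layer"; exact type is an assumption
x = Dense(60, activation="relu")(x)
x = Dropout(0.1)(x)                         # dropout rate is an assumption
out = Dense(6, activation="sigmoid")(x)     # probabilities of the 6 labels
lstm_model = Model(inp, out)
lstm_model.compile(loss="binary_crossentropy", optimizer="adam",
                   metrics=["accuracy"])

# --- CNN model: embedding layer initialised from the fasttext-crawl matrix ---
inp = Input(shape=(maxlen,))
x = Embedding(vocab_size, embed_size, weights=[embedding_matrix],
              trainable=False)(inp)
branches = []
for kernel_size in (2, 3, 4, 5):            # 4 conv + 4 max-pooling stages
    b = Conv1D(64, kernel_size, activation="relu")(x)
    b = MaxPooling1D(pool_size=2)(b)
    branches.append(b)
x = Concatenate(axis=1)(branches)           # concatenate the pooled outputs
x = Flatten()(x)
out = Dense(6, activation="sigmoid")(x)
cnn_model = Model(inp, out)
cnn_model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])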

5. CONCLUSION

With the Internet being a platform accessible to everyone, it is important to make sure that people with different ideas are heard without the fear of toxic and hateful remarks. After analyzing various approaches to the problem of classifying toxic comments online, we find that the CNN model works slightly better than LSTM and NB-SVM, with an accuracy of 98.13%. Future scope for this analysis would be integrating such classification algorithms into social media platforms to automatically classify and censor toxic comments.

