Rushikesh Kamble (Roll No. 20) Siddhesh Ghosalkar (Roll No. 14) Pradnya Dhondge (Roll No. 07)
Bachelor of Engineering
This is to certify that the project work entitled "IDENTIFY & CLASSIFY ONLINE TOXIC
COMMENTS" is the bonafide work of "Rushikesh Kamble", "Siddhesh Ghosalkar" and
"Pradnya Dhondge" (Group No. 25), submitted to the University of Mumbai in partial
fulfillment of the requirement for the award of the degree of "BACHELOR OF
ENGINEERING" in "COMPUTER ENGINEERING".
I declare that this written submission represents my ideas in my own words and
where others' ideas or words have been included, I have adequately cited and
referenced the original sources. I also declare that I have adhered to all principles
of academic honesty and integrity and have not misrepresented or fabricated or
falsified any idea/data/fact/source in my submission. I understand that any
violation of the above will be cause for disciplinary action by the Institute and can
also evoke penal action from the sources which have thus not been properly cited
or from whom proper permission has not been taken when needed.
Pradnya Dhondge (07)
Siddhesh Ghosalkar (14)
Rushikesh Kamble (20)

Date:
TABLE OF CONTENTS
1. Abstract
2. Introduction
3. Preprocessing Text
4. Algorithms
5. Implementation
6. Conclusion
7. References
1. Abstract
3.1 TOKENIZATION
Tokenization is the process of splitting a text corpus into distinct
tokens and assigning each token a number. As a computer cannot
understand a language directly, this method maps all the words to
distinct numbers, which makes the text easier for the computer to
process. The result of this process is a dictionary of fixed size that
contains a mapping from words to numbers.
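The word-to-index mapping described above can be sketched in plain Python (a minimal illustration; the function names and the toy corpus are assumptions, not the report's actual code):

```python
# Minimal sketch of word-level tokenization: build a fixed-size
# dictionary mapping each distinct word to an integer index.
from collections import Counter

def build_vocab(corpus, vocab_size):
    """Map the vocab_size most frequent words to indices 1..vocab_size."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    # Index 0 is conventionally reserved for padding / unknown words.
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(vocab_size), start=1)}

def tokenize(text, vocab):
    """Convert a comment to a sequence of integer tokens (0 = unknown)."""
    return [vocab.get(word, 0) for word in text.lower().split()]

corpus = ["this comment is fine", "this comment is toxic"]
vocab = build_vocab(corpus, vocab_size=10)
print(tokenize("this comment is toxic", vocab))  # [1, 2, 3, 5]
```

The fixed `vocab_size` cap means rare words fall outside the dictionary and map to the unknown index.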
3.2 VECTORIZATION
Vectorization is a technique in which words are converted to feature
vectors. This paper uses Term Frequency Inverse Document Frequency
(TF-IDF) vectorization, which converts the words in a document to a
vector that can be used as input to the estimator. TF-IDF captures how
important a word is to a document by assigning a score to each word in
the document.
3.3 WORD EMBEDDINGS
Every word in the dataset is embedded into a feature vector by creating
an embedding matrix: a lookup table of words and their corresponding
embeddings. Embeddings usually refer to n-dimensional dense vectors.
The embedding matrix is of shape (vocab_size, embed_size), where
vocab_size is the number of words in the dictionary obtained from the
tokenization method and embed_size is the number of features into
which each word is embedded. Many pre-trained word embeddings with
different embedding sizes are available, such as GloVe (Global Vectors
for Word Representation), word2vec, and fasttext-crawl. This paper uses
fasttext-crawl-300d-2m for the embedding matrix, which is then passed
to the different algorithms.
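Building the (vocab_size, embed_size) matrix can be sketched as follows. The tiny `pretrained` dict stands in for vectors actually loaded from a file such as fasttext-crawl-300d-2m; the toy values and embed_size of 3 are assumptions for illustration only:

```python
# Sketch of building an embedding matrix of shape (vocab_size, embed_size):
# row i holds the embedding of the word whose token index is i.
import random

def build_embedding_matrix(vocab, pretrained, embed_size):
    vocab_size = len(vocab) + 1          # +1 for the padding/unknown index 0
    matrix = [[0.0] * embed_size for _ in range(vocab_size)]
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
        else:
            # Words missing from the pre-trained file get small random vectors.
            matrix[idx] = [random.uniform(-0.05, 0.05) for _ in range(embed_size)]
    return matrix

vocab = {"toxic": 1, "comment": 2}
pretrained = {"toxic": [0.1, -0.2, 0.3]}
matrix = build_embedding_matrix(vocab, pretrained, embed_size=3)
print(len(matrix), len(matrix[0]))  # vocab_size and embed_size: 3 3
```

With the real fasttext-crawl-300d-2m file, embed_size would be 300 and vocab_size the dictionary size from tokenization.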
4. ALGORITHMS
4.4 FASTTEXT
Fasttext is a text classification library developed by Facebook. It can
be used to learn word embeddings and to build supervised or
unsupervised classification models from those embeddings. Fasttext
provides its own word embeddings, fasttext-crawl, trained on around
600 billion tokens. These word embeddings are open and can be
downloaded by anyone for their own use.
Fasttext has multiple pre-trained models to choose from depending on
the nature of the problem. In this paper we use the default supervised
classifier model.
5. IMPLEMENTATION
5.2 FASTTEXT
The fasttext library takes its input in a text format, so all the
comments from the train data are converted to a text document in which
each training example starts with '__label__' followed by the
respective label of the comment and then the comment itself. This text
file is fed into the fasttext model; after fine-tuning the
hyperparameters, setting the number of epochs to 5 and the learning
rate to 0.1, the model achieved an accuracy of 95.4%.
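The file preparation described above can be sketched as follows. The helper name, toy comments, and label strings are assumptions for illustration; the commented-out training call mirrors the reported hyperparameters (epoch=5, lr=0.1) and assumes the fasttext package is installed:

```python
# Sketch of preparing the fasttext training file: each line is
# "__label__<label> <comment text>", one training example per line.

def to_fasttext_lines(comments, labels):
    """Format (comment, label) pairs as fasttext supervised training lines."""
    return [f"__label__{label} {comment.strip()}"
            for comment, label in zip(comments, labels)]

comments = ["you are awesome", "you are an idiot"]
labels = ["clean", "toxic"]
lines = to_fasttext_lines(comments, labels)
print(lines[1])  # __label__toxic you are an idiot

# Training would then look like:
# with open("train.txt", "w") as f:
#     f.write("\n".join(lines))
# import fasttext
# model = fasttext.train_supervised(input="train.txt", epoch=5, lr=0.1)
```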
6. CONCLUSION
With the Internet being a platform accessible to everyone, it is
important to make sure that people with different ideas are heard
without the fear of toxic and hateful remarks. After analyzing various
approaches to the problem of classifying toxic comments online, it is
found that the CNN model works slightly better than LSTM and NB-SVM,
with an accuracy of 98.13%. Future scope for this analysis would be
integrating such classification algorithms into social media platforms
to automatically classify and censor toxic comments.
7. REFERENCES
[1] Siwei Lai, Liheng Xu, Kang Liu, Jun Zhao, "Recurrent Convolutional
Neural Networks for Text Classification", Proceedings of the
Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas,
2015.
[2] Sida Wang and Christopher D. Manning, "Baselines and Bigrams:
Simple, Good Sentiment and Topic Classification", Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics,
2012.
[3] Mujahed A. Saif, Alexander N. Medvedev, Maxim A. Medvedev, and
Todorka Atanasova, "Classification of online toxic comments using the
logistic regression and neural networks models", AIP Conference
Proceedings 2048, 060011 (2018).
[4] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory",
Neural Computation 9(8):1735-1780, 1997.
[5] Navaney, P., Dubey, G., & Rana, A., "SMS Spam Filtering Using
Supervised Machine Learning Algorithms", 2018 8th International
Conference on Cloud Computing, Data Science & Engineering (Confluence),
2018.