TOXIC COMMENT ANALYSER
Project Report (Final Draft)
submitted to
Indian Institute of Information Technology, Kalyani
for partial fulfillment of degree of
Bachelor of Technology in
Computer Science and Engineering
By
Ashish Kumar (Reg No. 678)
Amarjit Hore (Reg No. 669)
Roshan Kumar (Reg No. 729)
CERTIFICATE
The project has fulfilled all the requirements as per the regulations of the
Indian Institute of Information Technology Kalyani and, in my opinion, has
reached the standard needed for submission. The work, techniques and
results presented have not been submitted to any other university or
institute for the award of any other degree or diploma.
………………………..
Dr. Anirban Lakshman
Assistant Professor
DECLARATION
We hereby declare that the work being presented in this project entitled
TOXIC COMMENT ANALYSER, submitted to Indian Institute of Information
Technology Kalyani in partial fulfilment for the award of the degree of
Bachelor of Technology in Computer Science and Engineering during the
period from Aug 2023 to Nov 2023 under the supervision of Dr. Anirban
Lakshman, Indian Institute of Information Technology Kalyani, West Bengal
- 741235, India, does not contain any classified information.
Date: 14/11/2023
ACKNOWLEDGEMENT
First of all, we would like to thank our guide, Dr. Anirban Lakshman, for his
encouragement, guidance, and cooperation throughout this project, for giving
us the opportunity to work on it, and for providing a great environment in
which to carry out our work with ease. We also thank the authors of the
resources we have cited in our references.
ABSTRACT
In the ever-expanding digital landscape, online platforms provide spaces
for diverse interactions, yet the prevalence of toxic comments poses
significant challenges to maintaining a healthy online environment. This
project addresses the critical issue of identifying and classifying toxic
comments using advanced machine learning techniques. The Toxic Comment
Classifier employs a state-of-the-art deep learning model, incorporating a
bidirectional LSTM network and embedding layers for effective feature
extraction from textual data.
The project adheres to the rigorous standards set by the Indian Institute of
Information Technology Kalyani, fulfilling all regulatory requirements. Under
the expert supervision and guidance of their mentor, the students
demonstrate a comprehensive understanding of natural language processing
and deep learning techniques.
CONTENTS
CHAPTER – 1 ........................................................... 1
    PROBLEM STATEMENT ................................................. 1
    OBJECTIVE ......................................................... 2
CHAPTER – 2 ........................................................... 1
    LITERATURE SURVEY ................................................. 1
CHAPTER – 3 ........................................................... 3
    PROPOSED SYSTEM ................................................... 3
    METHODOLOGY ....................................................... 3
CHAPTER – 4 ........................................................... 6
    ARCHITECTURE & USER INTERACTION FLOW .............................. 6
CHAPTER – 5 .......................................................... 11
    WORKING .......................................................... 11
CHAPTER – 6 .......................................................... 13
    EVALUATION AND RESULTS ........................................... 13
    MODEL SAVING AND LOADING ......................................... 13
    INTEGRATION WITH GRADIO .......................................... 14
CHAPTER – 7 .......................................................... 15
    RESULTS .......................................................... 15
CHAPTER – 8 .......................................................... 17
    CONCLUSION ....................................................... 17
    FUTURE SCOPE OF WORK ............................................. 18
REFERENCES ........................................................... 21
FIGURES
    Non-Toxic Comment as input, Output shows toxic is false .......... 16
CHAPTER – 1
PROBLEM STATEMENT
To build a prototype of an online hate and abuse comment classifier that can
be used to classify hateful and offensive comments, so that such content can
be controlled and restricted from spreading hatred and enabling cyberbullying.
The aim is to develop a prototype for an online hate and abuse comment
classifier. This classifier will play a crucial role in identifying and
categorizing hate speech and offensive comments. The primary goal is to
enable effective control and restriction of the dissemination of such content,
thereby mitigating the spread of hatred and preventing instances of
cyberbullying. The development of this prototype underscores a
commitment to fostering a safer and more responsible online environment.
OBJECTIVE
❖ Automated Detection of Toxic Comments: flag toxic comments automatically,
without manual moderation of every post.
❖ Multi-Class Categorization: score each comment against multiple toxicity
categories (toxic, severe toxic, obscene, threat, insult, identity hate).
❖ User-Friendly Integration: expose the classifier through a simple
interface (built with Gradio) so it can be tried interactively.
CHAPTER – 2
LITERATURE SURVEY
❖ Early Approaches
Supervised Learning:
The majority of research in toxic comment classification has
adopted supervised learning techniques. Various algorithms,
including Support Vector Machines (SVM), Naive Bayes, and
decision trees, have been employed to train models on labelled
datasets. These models leverage features such as bag-of-words, TF-
IDF, and word embeddings to identify patterns associated with toxic
language.
Deep Learning:
The advent of deep learning has significantly impacted the field,
with recurrent neural networks (RNNs), long short-term memory
networks (LSTMs), and more recently, transformer-based models
such as BERT and GPT, achieving state-of-the-art performance.
These models excel in capturing contextual information and
semantic relationships, enabling them to effectively identify subtle
instances of toxicity.
❖ Challenges and Open Problems
CHAPTER – 3
PROPOSED SYSTEM
In the dynamic realm of online communication, the unrestricted exchange
of ideas on digital platforms has empowered diverse voices. However, this
openness has also given rise to the persistent challenge of toxic comments,
which can undermine the constructive nature of online discussions.
Recognizing the gravity of this issue, our project, the Toxic Comment
Classifier, spearheaded by undergraduate students Ashish Kumar, Amarjit
Hore, and Roshan Kumar from the Department of Computer Science and
Engineering at the Indian Institute of Information Technology Kalyani,
delves into the intricate domain of natural language processing and machine
learning.
METHODOLOGY
❖ Dataset
- Source of Dataset: The dataset for the Toxic Comment Classification
project is obtained from Kaggle, a popular platform for machine
learning datasets and competitions.
- Toxicity Labels: Each comment is annotated with binary labels for the
following categories:
▪ Toxic
▪ Severe toxic
▪ Obscene
▪ Threat
▪ Insult
▪ Identity hate
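To make the layout concrete, here is a minimal loading sketch. It assumes
the Kaggle Jigsaw "train.csv" layout (an id column, the comment text, then
the six label columns); the file path is illustrative.

import pandas as pd

# Load the Kaggle toxic-comment training data (path is illustrative)
df = pd.read_csv('train.csv')

# Columns 0-1 are id and comment_text; the remaining columns are the labels
print(df.columns[2:].tolist())
# ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']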
CHAPTER – 4
ARCHITECTURE & USER INTERACTION FLOW
As we delve into the intricacies of this solution, it becomes evident that the Toxic
Comment Analyzer not only aims to enhance online safety but also embodies a
commitment to fostering responsible and respectful digital communication. Let's
explore how this tool, equipped with a sophisticated architecture, navigates the
challenges posed by toxic comments, ultimately contributing to a healthier
online discourse.
1. Data Preprocessing
- Dataset Loading: Load the toxic comment dataset
(e.g., 'train.csv') containing comments and corresponding toxicity labels.
- Text Vectorization: Utilise the TextVectorization layer to convert raw
text into numerical vectors, allowing for efficient processing by the model.
- Label Preparation: Extract the target labels (toxicity categories) and
format them for model training.
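A hedged sketch of this preprocessing stage follows; the variable names,
vocabulary size, sequence length, and split ratios are illustrative
assumptions, not values prescribed by the report.

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

df = pd.read_csv('train.csv')                  # comments + toxicity labels
X = df['comment_text']
y = df[df.columns[2:]].values                  # the six label columns

# Map raw text to fixed-length integer sequences; sizes are assumptions
vectorizer = TextVectorization(max_tokens=200000,
                               output_sequence_length=1800,
                               output_mode='int')
vectorizer.adapt(X.values)
vectorized_text = vectorizer(X.values)

# Wrap into a tf.data pipeline, then split into train/val/test
dataset = tf.data.Dataset.from_tensor_slices((vectorized_text, y))
dataset = dataset.cache().shuffle(160000).batch(16).prefetch(8)
train = dataset.take(int(len(dataset) * .7))
val = dataset.skip(int(len(dataset) * .7)).take(int(len(dataset) * .2))
test = dataset.skip(int(len(dataset) * .9)).take(int(len(dataset) * .1))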
2. Model Architecture
1. Input Layer:
- Accepts the preprocessed text input, which goes through initial
tokenization and embedding.
2. Embedding Layer:
- Converts words into dense vectors to capture semantic relationships.
- Pre-trained word embeddings (such as Word2Vec, GloVe, or FastText)
may be used to leverage contextual information.
3. Bidirectional LSTM Layer:
- Processes the embedded sequence in both forward and backward directions so
that context on either side of each token is available (see Chapter 5).
4.-6. Fully Connected (Dense) Layers:
- Act as feature extractors that map the LSTM output to the classification
head.
7. Output Layer:
- Uses one sigmoid unit per toxicity category, since a comment can belong to
several categories at once (multi-label); softmax would apply only if the
classes were mutually exclusive.
- Generates a probability score for each toxicity category.
8. Loss Function:
- Binary cross-entropy, applied per label, suits this multi-label setup;
categorical cross-entropy is the usual choice for single-label multi-class
tasks.
9. Optimizer:
- Adam or RMSprop optimizers are often chosen for efficient gradient
descent during model training.
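Putting the pieces above together, a minimal Keras sketch of such a model is
shown below. The layer sizes are illustrative assumptions; the sigmoid
output with binary cross-entropy reflects the multi-label nature of the task.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    # Embedding layer: one dense vector per token (+1 for padding/OOV)
    Embedding(200001, 32),
    # Bidirectional LSTM: reads the sequence in both directions
    Bidirectional(LSTM(32, activation='tanh')),
    # Fully connected feature extractors
    Dense(128, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    # One sigmoid output per toxicity category
    Dense(6, activation='sigmoid'),
])
model.compile(loss='BinaryCrossentropy', optimizer='Adam')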
Baseline Classifier: Logistic Regression
❖ Advantages:
o Interpretability: Results are easily interpretable, providing
probabilities for class membership.
o Efficiency: Computationally efficient and does not require high
computational resources.
o Less Prone to Overfitting: Less susceptible to overfitting
compared to more complex models when the feature space is
small.
❖ Disadvantages:
o Linear Decision Boundary: Limited to linear decision
boundaries, which might be a drawback for complex datasets.
o Assumption of Linearity: Assumes a linear relationship between
independent variables and the log-odds of the dependent
variable.
o Sensitivity to Outliers: Sensitive to outliers, which can impact the
model's performance.
❖ Use Cases:
o Binary Classification: Well-suited for problems with two classes,
such as spam detection or disease diagnosis.
o Probabilistic Predictions: Useful when probability estimates for
class membership are required.
❖ Implementation:
o Algorithm: Uses the logistic function to model the probability of
a particular outcome.
o Optimization: Typically optimized using techniques like gradient
descent.
❖ Scalability:
o Scalability: Scales well with the number of features but may not
be the best choice for large and highly complex datasets.
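For illustration only, here is a minimal scikit-learn sketch of such a
logistic regression baseline (TF-IDF features, one-vs-rest over the six
labels). It is not part of the report's final LSTM pipeline, the feature
settings are assumptions, and df is the DataFrame from the loading sketch
in Chapter 3.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# One binary logistic regression per toxicity label over TF-IDF features
baseline = make_pipeline(
    TfidfVectorizer(max_features=50000),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
baseline.fit(df['comment_text'], df[df.columns[2:]])

# Probability estimates for the first few comments, one column per label
probs = baseline.predict_proba(df['comment_text'][:5])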
CHAPTER – 5
WORKING
❖ Memory Cell:
- The core of an LSTM is its memory cell, which serves as a storage unit
capable of retaining information over long periods. This memory cell is
responsible for keeping track of relevant information from earlier parts of
the sequence.
❖ Three Gates:
- LSTMs employ three gates to regulate the flow of information: the input
gate, the forget gate, and the output gate.
- The input gate determines which values from the input should be stored
in the memory cell.
- The forget gate decides what information to discard from the memory cell.
- The output gate regulates the information that should be output based on
the current input and the memory cell content.
❖ Cell State:
- The memory cell maintains a continuous 'cell state' that runs through the
entire sequence. This state is modified by the gates, allowing the LSTM to
selectively update, add, or remove information from the cell state.
❖ Hidden State:
- The hidden state is the LSTM's way of capturing and storing information
from previous time steps. It acts as a summary or representation of the
relevant information learned from the entire sequence.
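In the standard textbook formulation (not notation taken from this report),
the gates and states described above are computed as follows, where \sigma
is the logistic sigmoid, \odot is element-wise multiplication, x_t is the
current input, and h_{t-1} is the previous hidden state:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    % input gate
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    % forget gate
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    % output gate
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c)    % cell state
h_t = o_t \odot \tanh(c_t)    % hidden state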
❖ Advantages of LSTMs:
- Long-Term Dependencies: LSTMs excel at capturing and learning
dependencies over extended sequences, making them suitable for tasks
requiring an understanding of context over time.
- Gradient Flow: The gating mechanisms help in mitigating the vanishing
and exploding gradient problems that often hinder the training of
traditional RNNs.
- Versatility: LSTMs can be applied to a wide range of sequential data tasks,
including natural language processing, speech recognition, and time series
prediction.
In summary, LSTMs address the limitations of traditional RNNs by
introducing memory cells and gating mechanisms, enabling them to effectively
capture long-term dependencies in sequential data. This makes them a
powerful tool for tasks that involve understanding context and relationships
across extended sequences.
Bidirectional LSTM
One shortcoming of conventional RNNs is that they are only able to make use of
previous context. … Bidirectional RNNs (BRNNs) do this by processing the data
in both directions with two separate hidden layers, which are then fed forwards
to the same output layer. … Combining BRNNs with LSTM gives Bidirectional
LSTM, which can access long-range context in both input directions.
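A small sketch of what the bidirectional wrapper does in practice (shapes
are illustrative): the forward and backward passes are concatenated,
doubling the feature dimension.

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Bidirectional

x = tf.random.normal((1, 10, 8))            # (batch, time steps, features)
print(LSTM(32)(x).shape)                    # (1, 32): forward pass only
print(Bidirectional(LSTM(32))(x).shape)     # (1, 64): forward + backward concatenated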
CHAPTER – 6
EVALUATION AND RESULTS
Precision, Recall, Accuracy: Utilize precision, recall, and categorical accuracy
metrics to evaluate the model's performance on the test set.
model.compile(loss='BinaryCrossentropy', optimizer='Adam')  # one sigmoid per label
history = model.fit(train, epochs=6, validation_data=val)

from tensorflow.keras.metrics import Precision, Recall, CategoricalAccuracy
pre = Precision()
re = Recall()
acc = CategoricalAccuracy()

# Accumulate the metrics over the held-out test batches
for batch in test.as_numpy_iterator():
    X_true, y_true = batch
    yhat = model.predict(X_true)
    pre.update_state(y_true.flatten(), yhat.flatten())
    re.update_state(y_true.flatten(), yhat.flatten())
    acc.update_state(y_true.flatten(), yhat.flatten())

print(f'Precision: {pre.result().numpy()}, Recall: {re.result().numpy()}, Accuracy: {acc.result().numpy()}')

model.save('toxicity.h5')
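MODEL SAVING AND LOADING
The saved 'toxicity.h5' file can later be restored without retraining, as in
the minimal sketch below. One observation (ours, not the report's): the
TextVectorization layer lives outside the saved model in this pipeline, so
the vectorizer must be recreated or persisted alongside it.

import tensorflow as tf

# Reload the trained classifier from disk
model = tf.keras.models.load_model('toxicity.h5')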
INTEGRATION WITH GRADIO
import gradio as gr

def score_comment(comment):
    # Vectorize the raw comment exactly as the training text was processed
    vectorized_comment = vectorizer([comment])
    results = model.predict(vectorized_comment)
    # df.columns[2:] holds the six toxicity label names
    text = ''
    for idx, col in enumerate(df.columns[2:]):
        text += '{}: {}\n'.format(col, results[0][idx] > 0.5)
    return text

interface = gr.Interface(fn=score_comment,
                         inputs=gr.inputs.Textbox(lines=2, placeholder='Comment to score'),
                         outputs='text')
interface.launch(share=True)
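Launching with share=True asks Gradio to create a temporary public URL, so
the demo can be tested from outside the local machine. Note that
gr.inputs.Textbox is the older Gradio API; on Gradio 3.x and later the
equivalent would be gr.Textbox(lines=2, placeholder='Comment to score').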
CHAPTER – 7
RESULTS
1. LOSS GRAPH
2. MODEL SCORES
3. OUTPUT
CHAPTER – 8
CONCLUSION
❖ Collaborative Milestone:
FUTURE SCOPE OF WORK
❖ Multimodal Analysis:
Expand analysis to include multimodal content (images, videos) for a
comprehensive approach to toxicity detection.
Enhance the model's capability to handle diverse forms of media.
The future scope of work for the Toxic Comment Analyzer encompasses
technical advancements, ethical considerations, and collaborative efforts
to create a safer and more inclusive online environment.
REFERENCES
[1] - Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I.
(2019). Language Models are Unsupervised Multitask Learners. OpenAI
Technical Report.
[2] - Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... &
Brew, J. (2020). Transformers: State-of-the-Art Natural Language
Processing. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing: System Demonstrations (pp. 38-45).
[3] - Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal,
P., ... & Agarwal, S. (2020). Language Models are Few-Shot Learners.
arXiv preprint arXiv:2005.14165.
[4] - Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D.,
Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche,
S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer,
M., Schmidt, A., Seidel, T., & Kasneci, G. (2023). ChatGPT for Good? On
Opportunities and Challenges of Large Language Models for Education.
Learning and Individual Differences, 103, 102274.
doi:10.1016/j.lindif.2023.102274.
[5] - Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023).
QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint
arXiv:2305.14314.
[6] - Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In
Advances in Neural Information Processing Systems 30 (pp. 5998-6008).
[8] - Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A.,
Cistac, P., Rault, T., Louf, R., Funtowicz, M., & Davison, J. (2019).
HuggingFace's Transformers: State-of-the-art Natural Language Processing.
arXiv preprint arXiv:1910.03771.
[9] - Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei,
Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., & Bikel, D. (2023).
Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint
arXiv:2307.09288.
[10] - Sun, S., Zhang, Y., Yan, J., Gao, Y., Ong, D., Chen, B., & Su, J.
(2023). Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs
Guanaco vs Bard vs ChatGPT -- A Text-to-SQL Parsing Comparison. arXiv
preprint arXiv:2310.10190.