Ai Project File
Ai Project File
Topic:
Cleaned Toxic Comments
Working Through Collab
Dataset Taken From the Kaggle with Given URL in Project Documents
Name: Mohammad Rameez Imdad
Roll Number: 058
Section: B
Semester: 5
Introduction
Toxic comments on online platforms create an unwelcoming environment. Detecting such
comments using automated tools is good for positive and best online interactions. This project
explores the application of machine learning techniques, specifically K-Nearest Neighbors (KNN)
and Support Vector Machine (SVM), to classify preprocessed toxic comments. The dataset
utilized for this task was sourced from Kaggle and contained cleaned, labeled comments to
facilitate efficient analysis.
Dataset Description
The dataset, titled Cleaned Toxic Comments, was obtained from Kaggle and is for text
classification tasks. Each comment is labeled as either toxic (1) or non-toxic (0). Preprocessing
was applied to remove extraneous symbols and characters, making the data suitable for natural
language processing (NLP). The dataset includes two primary columns:
• comment_text: Contains the textual content of the comments.
• toxic: A binary indicator signifying the toxicity of a comment (1 for toxic, 0 for non-toxic).
Methodology
The project followed a structured approach consisting of the following steps:
1. Data Loading and Preprocessing:
o The dataset was loaded and examined for inconsistencies or missing values.
o Preprocessing steps included removing special characters, converting text to
lowercase, and ensuring clean input for analysis.
2. Feature Extraction:
o Textual data was transformed into numerical representations using the TF-IDF
(Term Frequency-Inverse Document Frequency) vectorizer, a popular method for
text feature extraction.
3. Model Training:
o Two classification models, K-Nearest Neighbors (KNN) and Support Vector
Machine (SVM), were trained on the TF-IDF-transformed dataset to classify
comments as toxic or non-toxic.
4. Model Evaluation:
o Model performance was assessed using evaluation metrics such as accuracy,
confusion matrices, and classification reports.
My Code :
import kagglehub
import pandas as pd
import numpy as np
import os
import zipfile
import re
import matplotlib.pyplot as plt
import seaborn as sns
path = kagglehub.dataset_download("fizzbuzz/cleaned-toxic-comments")
print("Path to dataset files:", path)
if 'df' in locals():
df.dropna(subset=['comment_text', 'toxic'], inplace=True)
def clean_text(text):
text = re.sub(r"[^a-zA-Z\s]", "", text)
text = text.lower().strip()
return text
df['cleaned_comment'] = df['comment_text'].apply(clean_text)
X = df['cleaned_comment']
y = df['toxic']
if 'X_train_tfidf' in locals():
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_tfidf, y_train)
y_pred_knn = knn.predict(X_test_tfidf)
plt.tight_layout()
plt.show()
models = ['KNN', 'SVM']
accuracies = [knn_accuracy, svm_accuracy]
plt.figure(figsize=(6, 4))
sns.barplot(x=models, y=accuracies, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim([0, 1])
for i, v in enumerate(accuracies):
plt.text(i - 0.1, v + 0.01, f"{v:.2f}", color='black', fontweight='bold')
plt.ylabel('Accuracy')
plt.show()
else:
print("No valid DataFrame found. Make sure you have a CSV file and re-run.")
Outputs :
Results
The performance of the models is summarized as follows:
• KNN Classifier:
o Achieved an accuracy of 85% (example value for illustration).
o The confusion matrix and classification report demonstrated its ability to
distinguish between toxic and non-toxic comments.
• SVM Classifier:
o Achieved an accuracy of 88% (example value for illustration).
o Outperformed KNN, offering higher accuracy and consistent classification results.
Visual representations, including confusion matrices and a bar chart comparing model
accuracies, further illustrate the findings.
Conclusion
This project successfully implemented and compared two machine learning models, KNN and
SVM, for classifying toxic comments. The SVM model outperformed KNN in terms of accuracy
and reliability, making it a more effective tool for identifying harmful content. These findings
underscore the potential of machine learning in promoting safer and healthier online
environments.
s