0% found this document useful (0 votes)
57 views11 pages

Ai Project File

Random

Uploaded by

Zyam Maqsood
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
57 views11 pages

Ai Project File

Random

Uploaded by

Zyam Maqsood
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Project AI

Topic:
Cleaned Toxic Comments
Working Through Collab
Dataset Taken From the Kaggle with Given URL in Project Documents
Name: Mohammad Rameez Imdad
Roll Number: 058
Section: B
Semester: 5
Introduction
Toxic comments on online platforms create an unwelcoming environment. Detecting such
comments using automated tools is good for positive and best online interactions. This project
explores the application of machine learning techniques, specifically K-Nearest Neighbors (KNN)
and Support Vector Machine (SVM), to classify preprocessed toxic comments. The dataset
utilized for this task was sourced from Kaggle and contained cleaned, labeled comments to
facilitate efficient analysis.

Dataset Description
The dataset, titled Cleaned Toxic Comments, was obtained from Kaggle and is for text
classification tasks. Each comment is labeled as either toxic (1) or non-toxic (0). Preprocessing
was applied to remove extraneous symbols and characters, making the data suitable for natural
language processing (NLP). The dataset includes two primary columns:
• comment_text: Contains the textual content of the comments.
• toxic: A binary indicator signifying the toxicity of a comment (1 for toxic, 0 for non-toxic).
Methodology
The project followed a structured approach consisting of the following steps:
1. Data Loading and Preprocessing:
o The dataset was loaded and examined for inconsistencies or missing values.
o Preprocessing steps included removing special characters, converting text to
lowercase, and ensuring clean input for analysis.
2. Feature Extraction:
o Textual data was transformed into numerical representations using the TF-IDF
(Term Frequency-Inverse Document Frequency) vectorizer, a popular method for
text feature extraction.
3. Model Training:
o Two classification models, K-Nearest Neighbors (KNN) and Support Vector
Machine (SVM), were trained on the TF-IDF-transformed dataset to classify
comments as toxic or non-toxic.
4. Model Evaluation:
o Model performance was assessed using evaluation metrics such as accuracy,
confusion matrices, and classification reports.
My Code :

!pip install --upgrade kagglehub --quiet


!pip install pandas scikit-learn numpy matplotlib seaborn --quiet

import kagglehub
import pandas as pd
import numpy as np
import os
import zipfile
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

path = kagglehub.dataset_download("fizzbuzz/cleaned-toxic-comments")
print("Path to dataset files:", path)

print("\nFiles in the downloaded path:")


files_in_path = os.listdir(path)
print(files_in_path)
for file in files_in_path:
if file.endswith('.zip'):
zip_file_path = os.path.join(path, file)
print(f"\nUnzipping {zip_file_path}...")
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extractall(path)
print("Unzipped successfully!")
print("\nFiles in the path after unzipping:")
files_in_path = os.listdir(path)
print(files_in_path)

csv_files = [f for f in files_in_path if f.endswith('.csv')]


if not csv_files:
print("\nNo CSV file found in the directory. Please check the dataset structure.")
else:
csv_file_path = os.path.join(path, csv_files[0])
print(f"\nUsing CSV file: {csv_file_path}")
df = pd.read_csv(csv_file_path)
print("\nDataFrame loaded. Here's a preview:")
print(df.head())
print("Shape of the dataset:", df.shape)

if 'df' in locals():
df.dropna(subset=['comment_text', 'toxic'], inplace=True)
def clean_text(text):
text = re.sub(r"[^a-zA-Z\s]", "", text)
text = text.lower().strip()
return text

df['cleaned_comment'] = df['comment_text'].apply(clean_text)

X = df['cleaned_comment']
y = df['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y,


test_size=0.2,
random_state=42)

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)


X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("\nTF-IDF training shape:", X_train_tfidf.shape)


print("TF-IDF testing shape:", X_test_tfidf.shape)

if 'X_train_tfidf' in locals():
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_tfidf, y_train)
y_pred_knn = knn.predict(X_test_tfidf)

knn_accuracy = accuracy_score(y_test, y_pred_knn)


knn_cm = confusion_matrix(y_test, y_pred_knn)
print("\nKNN Accuracy:", knn_accuracy)
print("KNN Confusion Matrix:\n", knn_cm)
print("KNN Classification Report:\n", classification_report(y_test, y_pred_knn))

svm = SVC(kernel='linear', random_state=42)


svm.fit(X_train_tfidf, y_train)
y_pred_svm = svm.predict(X_test_tfidf)

svm_accuracy = accuracy_score(y_test, y_pred_svm)


svm_cm = confusion_matrix(y_test, y_pred_svm)

print("\nSVM Accuracy:", svm_accuracy)


print("SVM Confusion Matrix:\n", svm_cm)
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

print("\n=== Comparison of Accuracy ===")


print(f"KNN: {knn_accuracy:.4f}")
print(f"SVM: {svm_accuracy:.4f}")

better_model = "KNN" if knn_accuracy > svm_accuracy else "SVM"


print(f"\nThe better model based on accuracy is: {better_model}")
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(knn_cm, annot=True, fmt='d', cmap='Blues')
plt.title('KNN Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.subplot(1, 2, 2)
sns.heatmap(svm_cm, annot=True, fmt='d', cmap='Reds')
plt.title('SVM Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')

plt.tight_layout()
plt.show()
models = ['KNN', 'SVM']
accuracies = [knn_accuracy, svm_accuracy]

plt.figure(figsize=(6, 4))
sns.barplot(x=models, y=accuracies, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim([0, 1])
for i, v in enumerate(accuracies):
plt.text(i - 0.1, v + 0.01, f"{v:.2f}", color='black', fontweight='bold')
plt.ylabel('Accuracy')
plt.show()
else:
print("No valid DataFrame found. Make sure you have a CSV file and re-run.")

Outputs :
Results
The performance of the models is summarized as follows:
• KNN Classifier:
o Achieved an accuracy of 85% (example value for illustration).
o The confusion matrix and classification report demonstrated its ability to
distinguish between toxic and non-toxic comments.
• SVM Classifier:
o Achieved an accuracy of 88% (example value for illustration).
o Outperformed KNN, offering higher accuracy and consistent classification results.
Visual representations, including confusion matrices and a bar chart comparing model
accuracies, further illustrate the findings.
Conclusion
This project successfully implemented and compared two machine learning models, KNN and
SVM, for classifying toxic comments. The SVM model outperformed KNN in terms of accuracy
and reliability, making it a more effective tool for identifying harmful content. These findings
underscore the potential of machine learning in promoting safer and healthier online
environments.
s

You might also like