Ai Project File

Random

Uploaded by

Zyam Maqsood

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

57 views11 pages

Ai Project File

Random

Uploaded by

Zyam Maqsood

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Project AI

Topic:
Cleaned Toxic Comments
Working Through Collab
Dataset Taken From the Kaggle with Given URL in Project Documents
Name: Mohammad Rameez Imdad
Roll Number: 058
Section: B
Semester: 5
Introduction
Toxic comments on online platforms create an unwelcoming environment. Detecting such
comments using automated tools is good for positive and best online interactions. This project
explores the application of machine learning techniques, specifically K-Nearest Neighbors (KNN)
and Support Vector Machine (SVM), to classify preprocessed toxic comments. The dataset
utilized for this task was sourced from Kaggle and contained cleaned, labeled comments to
facilitate efficient analysis.

Dataset Description
The dataset, titled Cleaned Toxic Comments, was obtained from Kaggle and is for text
classification tasks. Each comment is labeled as either toxic (1) or non-toxic (0). Preprocessing
was applied to remove extraneous symbols and characters, making the data suitable for natural
language processing (NLP). The dataset includes two primary columns:
• comment_text: Contains the textual content of the comments.
• toxic: A binary indicator signifying the toxicity of a comment (1 for toxic, 0 for non-toxic).
Methodology
The project followed a structured approach consisting of the following steps:
1. Data Loading and Preprocessing:
o The dataset was loaded and examined for inconsistencies or missing values.
o Preprocessing steps included removing special characters, converting text to
lowercase, and ensuring clean input for analysis.
2. Feature Extraction:
o Textual data was transformed into numerical representations using the TF-IDF
(Term Frequency-Inverse Document Frequency) vectorizer, a popular method for
text feature extraction.
3. Model Training:
o Two classification models, K-Nearest Neighbors (KNN) and Support Vector
Machine (SVM), were trained on the TF-IDF-transformed dataset to classify
comments as toxic or non-toxic.
4. Model Evaluation:
o Model performance was assessed using evaluation metrics such as accuracy,
confusion matrices, and classification reports.
My Code :

!pip install --upgrade kagglehub --quiet

!pip install pandas scikit-learn numpy matplotlib seaborn --quiet

import kagglehub
import pandas as pd
import numpy as np
import os
import zipfile
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

path = kagglehub.dataset_download("fizzbuzz/cleaned-toxic-comments")
print("Path to dataset files:", path)

print("\nFiles in the downloaded path:")

files_in_path = os.listdir(path)
print(files_in_path)
for file in files_in_path:
if file.endswith('.zip'):
zip_file_path = os.path.join(path, file)
print(f"\nUnzipping {zip_file_path}...")
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extractall(path)
print("Unzipped successfully!")
print("\nFiles in the path after unzipping:")
files_in_path = os.listdir(path)
print(files_in_path)

csv_files = [f for f in files_in_path if f.endswith('.csv')]

if not csv_files:
print("\nNo CSV file found in the directory. Please check the dataset structure.")
else:
csv_file_path = os.path.join(path, csv_files[0])
print(f"\nUsing CSV file: {csv_file_path}")
df = pd.read_csv(csv_file_path)
print("\nDataFrame loaded. Here's a preview:")
print(df.head())
print("Shape of the dataset:", df.shape)

if 'df' in locals():
df.dropna(subset=['comment_text', 'toxic'], inplace=True)
def clean_text(text):
text = re.sub(r"[^a-zA-Z\s]", "", text)
text = text.lower().strip()
return text

df['cleaned_comment'] = df['comment_text'].apply(clean_text)

X = df['cleaned_comment']
y = df['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y,

test_size=0.2,
random_state=42)

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("\nTF-IDF training shape:", X_train_tfidf.shape)

print("TF-IDF testing shape:", X_test_tfidf.shape)

if 'X_train_tfidf' in locals():
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_tfidf, y_train)
y_pred_knn = knn.predict(X_test_tfidf)

knn_accuracy = accuracy_score(y_test, y_pred_knn)

knn_cm = confusion_matrix(y_test, y_pred_knn)
print("\nKNN Accuracy:", knn_accuracy)
print("KNN Confusion Matrix:\n", knn_cm)
print("KNN Classification Report:\n", classification_report(y_test, y_pred_knn))

svm = SVC(kernel='linear', random_state=42)

svm.fit(X_train_tfidf, y_train)
y_pred_svm = svm.predict(X_test_tfidf)

svm_accuracy = accuracy_score(y_test, y_pred_svm)

svm_cm = confusion_matrix(y_test, y_pred_svm)

print("\nSVM Accuracy:", svm_accuracy)

print("SVM Confusion Matrix:\n", svm_cm)
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

print("\n=== Comparison of Accuracy ===")

print(f"KNN: {knn_accuracy:.4f}")
print(f"SVM: {svm_accuracy:.4f}")

better_model = "KNN" if knn_accuracy > svm_accuracy else "SVM"

print(f"\nThe better model based on accuracy is: {better_model}")
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(knn_cm, annot=True, fmt='d', cmap='Blues')
plt.title('KNN Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.subplot(1, 2, 2)
sns.heatmap(svm_cm, annot=True, fmt='d', cmap='Reds')
plt.title('SVM Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')

plt.tight_layout()
plt.show()
models = ['KNN', 'SVM']
accuracies = [knn_accuracy, svm_accuracy]

plt.figure(figsize=(6, 4))
sns.barplot(x=models, y=accuracies, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim([0, 1])
for i, v in enumerate(accuracies):
plt.text(i - 0.1, v + 0.01, f"{v:.2f}", color='black', fontweight='bold')
plt.ylabel('Accuracy')
plt.show()
else:
print("No valid DataFrame found. Make sure you have a CSV file and re-run.")

Outputs :
Results
The performance of the models is summarized as follows:
• KNN Classifier:
o Achieved an accuracy of 85% (example value for illustration).
o The confusion matrix and classification report demonstrated its ability to
distinguish between toxic and non-toxic comments.
• SVM Classifier:
o Achieved an accuracy of 88% (example value for illustration).
o Outperformed KNN, offering higher accuracy and consistent classification results.
Visual representations, including confusion matrices and a bar chart comparing model
accuracies, further illustrate the findings.
Conclusion
This project successfully implemented and compared two machine learning models, KNN and
SVM, for classifying toxic comments. The SVM model outperformed KNN in terms of accuracy
and reliability, making it a more effective tool for identifying harmful content. These findings
underscore the potential of machine learning in promoting safer and healthier online
environments.
s

Email Spam Detection Final Presentation-21BSCHH010002
No ratings yet
Email Spam Detection Final Presentation-21BSCHH010002
17 pages
Sentimental Analysis
No ratings yet
Sentimental Analysis
3 pages
Progress of GRADIENT BOOSTING ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
No ratings yet
Progress of GRADIENT BOOSTING ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
10 pages
Practical Guide To Large Database Migration (Preston Zhang)
No ratings yet
Practical Guide To Large Database Migration (Preston Zhang)
206 pages
Nitin ML Assignment 1
No ratings yet
Nitin ML Assignment 1
18 pages
DM Practical File
No ratings yet
DM Practical File
21 pages
Machine Learning Code Explanation
No ratings yet
Machine Learning Code Explanation
33 pages
Sample Code
No ratings yet
Sample Code
9 pages
ML3,4
No ratings yet
ML3,4
11 pages
ML Practical Kiranjot 6-10
No ratings yet
ML Practical Kiranjot 6-10
10 pages
ML Practical Lovepreet 6-10
No ratings yet
ML Practical Lovepreet 6-10
10 pages
Bi 6 New
No ratings yet
Bi 6 New
6 pages
Kindle Review Sentiment Analysis - Ipynb - Colab
No ratings yet
Kindle Review Sentiment Analysis - Ipynb - Colab
5 pages
ML Projrct Article 2
No ratings yet
ML Projrct Article 2
6 pages
Assignment No 2 - ML - Output
No ratings yet
Assignment No 2 - ML - Output
4 pages
ML 2 16
No ratings yet
ML 2 16
6 pages
Twitter Sentiment Analysis Dss
No ratings yet
Twitter Sentiment Analysis Dss
14 pages
Analysis For An Open-Source Library in Database Ma
No ratings yet
Analysis For An Open-Source Library in Database Ma
6 pages
ML Prac1-10
No ratings yet
ML Prac1-10
32 pages
Text Classification Using Decision Forests and Pretrained Embeddings - 1716327972920
No ratings yet
Text Classification Using Decision Forests and Pretrained Embeddings - 1716327972920
12 pages
Triggers and Its Type
No ratings yet
Triggers and Its Type
23 pages
ML Lab6
No ratings yet
ML Lab6
4 pages
ML Week10.1
No ratings yet
ML Week10.1
5 pages
Python Code For KNN Classifier 1. Initial Message
No ratings yet
Python Code For KNN Classifier 1. Initial Message
7 pages
Machine Learning Assignment
No ratings yet
Machine Learning Assignment
7 pages
Progress of CATBOOST ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
No ratings yet
Progress of CATBOOST ALGORITHM FOR ELECTRICITY THEFT DETECTION IN POWER UTILITIES
9 pages
Practical File OF Machine Learning
No ratings yet
Practical File OF Machine Learning
31 pages
Towards Industry 4.0
No ratings yet
Towards Industry 4.0
27 pages
Module 5.pptx - 20250608 - 201231 - 0000
No ratings yet
Module 5.pptx - 20250608 - 201231 - 0000
43 pages
Assignment 2 Specification
No ratings yet
Assignment 2 Specification
3 pages
TE Comp 2019 I AY23-24 DBMS UT1
No ratings yet
TE Comp 2019 I AY23-24 DBMS UT1
1 page
KNN-SVM Assignment
No ratings yet
KNN-SVM Assignment
4 pages
The Theatre of Max Reinhardt
No ratings yet
The Theatre of Max Reinhardt
351 pages
DDM Question Bank
No ratings yet
DDM Question Bank
24 pages
FPT Facilities Feedback: Software Design Document
No ratings yet
FPT Facilities Feedback: Software Design Document
12 pages
Gis and Business Intelligence
No ratings yet
Gis and Business Intelligence
15 pages
ML Report
No ratings yet
ML Report
14 pages
AAM PR QB
No ratings yet
AAM PR QB
13 pages
Nmi 111 Group Eight
No ratings yet
Nmi 111 Group Eight
10 pages
Email Spam Detection
No ratings yet
Email Spam Detection
3 pages
Shobit Sharma (2124399) ML Lab File PDF
No ratings yet
Shobit Sharma (2124399) ML Lab File PDF
19 pages
Project Ali Huzaifa
No ratings yet
Project Ali Huzaifa
6 pages
06 Database, Security, CDN, and EI Services
No ratings yet
06 Database, Security, CDN, and EI Services
90 pages
178 hw1
No ratings yet
178 hw1
4 pages
Malignant Comments Classifier Project
No ratings yet
Malignant Comments Classifier Project
30 pages
DBMS Functional Dependency - Transitive, Trivial, Multivalued (Example)
No ratings yet
DBMS Functional Dependency - Transitive, Trivial, Multivalued (Example)
1 page
SVM K NN MLP With Sklearn Jupyter NoteBo
No ratings yet
SVM K NN MLP With Sklearn Jupyter NoteBo
22 pages
Machine Learning Assignment
No ratings yet
Machine Learning Assignment
8 pages
WDM - Week - I
No ratings yet
WDM - Week - I
24 pages
Notebook - Text Classification
No ratings yet
Notebook - Text Classification
7 pages
Aiml 5-8
No ratings yet
Aiml 5-8
19 pages
CBSI Consent Form
No ratings yet
CBSI Consent Form
3 pages
Step by Step Upgrading Oracle 10g To Oracle 11g: Samadhandba
No ratings yet
Step by Step Upgrading Oracle 10g To Oracle 11g: Samadhandba
27 pages
ML With Python Practical
No ratings yet
ML With Python Practical
22 pages
New Chat: 1. Predicting Uber Ride Prices
No ratings yet
New Chat: 1. Predicting Uber Ride Prices
16 pages
Importing Packages: Id Label Tweet 0 1 2 3 4
No ratings yet
Importing Packages: Id Label Tweet 0 1 2 3 4
8 pages
Data Preprocessing
No ratings yet
Data Preprocessing
9 pages
Lecture-13 Indexing and Its Types: Subject: DBMS Subject Code: BCA-S301T Faculty: Saurabh Jha
No ratings yet
Lecture-13 Indexing and Its Types: Subject: DBMS Subject Code: BCA-S301T Faculty: Saurabh Jha
16 pages
ML Manual With Outputs
No ratings yet
ML Manual With Outputs
30 pages
Conceptual View of Memory Cell
No ratings yet
Conceptual View of Memory Cell
64 pages
Q 3
No ratings yet
Q 3
2 pages
UNIT 4 Cte Note
No ratings yet
UNIT 4 Cte Note
12 pages
Email Spam Classifier
No ratings yet
Email Spam Classifier
22 pages
A Comprehensive Guide To Understand and Implement Text Classification in Python
No ratings yet
A Comprehensive Guide To Understand and Implement Text Classification in Python
34 pages
Sentiment Analysis On Tweets
No ratings yet
Sentiment Analysis On Tweets
2 pages
Python CA 4
No ratings yet
Python CA 4
9 pages
8 Esh Narayan 734 Research Article CSIT June 2012
No ratings yet
8 Esh Narayan 734 Research Article CSIT June 2012
9 pages
Project - Machine Learning-Business Report: By: K Ravi Kumar PGP-Data Science and Business Analytics (PGPDSBA.O.MAR23.A)
No ratings yet
Project - Machine Learning-Business Report: By: K Ravi Kumar PGP-Data Science and Business Analytics (PGPDSBA.O.MAR23.A)
38 pages
Jadavpur University: Assignment Submission
No ratings yet
Jadavpur University: Assignment Submission
9 pages
Unit2 ML Programs
No ratings yet
Unit2 ML Programs
7 pages
Interview Questions
No ratings yet
Interview Questions
12 pages
DB01
No ratings yet
DB01
95 pages
E96660695201532
No ratings yet
E96660695201532
5 pages
Internet of Things (IoT) Solved MCQs (Set-3)
100% (1)
Internet of Things (IoT) Solved MCQs (Set-3)
5 pages
Power Bi
100% (1)
Power Bi
27 pages
Capstone Project - Jaro-Prof. Babji
No ratings yet
Capstone Project - Jaro-Prof. Babji
5 pages
Normalization
No ratings yet
Normalization
17 pages
Multi Blast
No ratings yet
Multi Blast
3 pages
Chatbot For Employee Frequently Asked Questio1
No ratings yet
Chatbot For Employee Frequently Asked Questio1
11 pages
Manual
No ratings yet
Manual
48 pages
DF Syllabus
No ratings yet
DF Syllabus
3 pages
List of Shortcut Keys
No ratings yet
List of Shortcut Keys
18 pages
CS178 Homework #1: Problem 0: Getting Connected
No ratings yet
CS178 Homework #1: Problem 0: Getting Connected
4 pages
Data Management Systems: IT Auditing & Assurance, 2e, Hall & Singleton
No ratings yet
Data Management Systems: IT Auditing & Assurance, 2e, Hall & Singleton
24 pages
SQL Queries - Examples - 2 - Practice
No ratings yet
SQL Queries - Examples - 2 - Practice
12 pages
CP4252 Machine Learning Lab Manual
No ratings yet
CP4252 Machine Learning Lab Manual
33 pages
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
César Pérez López
No ratings yet
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
From Everand
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
César Pérez López
No ratings yet
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
From Everand
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
César Pérez López
No ratings yet
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet