0% found this document useful (0 votes)
470 views5 pages

Email Classification: Roll No-41463 (LP-3)

The document discusses classifying emails using binary classification with K-Nearest Neighbors and Support Vector Machine models. It analyzes the performance of these models on an email spam classification dataset from Kaggle. For KNN, the accuracy was 87% and for SVM the accuracy was 93%, showing SVM had slightly better performance on this task.

Uploaded by

fgfsgsg
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
470 views5 pages

Email Classification: Roll No-41463 (LP-3)

The document discusses classifying emails using binary classification with K-Nearest Neighbors and Support Vector Machine models. It analyzes the performance of these models on an email spam classification dataset from Kaggle. For KNN, the accuracy was 87% and for SVM the accuracy was 93%, showing SVM had slightly better performance on this task.

Uploaded by

fgfsgsg
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Roll No- 41463 (LP-3)

Email Classification

Classify the email using binary classification method. Email Spam detection has two
states: a) Normal State Not Spam b) Abnormal State Spam. Use K-Nearest Neighbors and
Support Vector Machine for Classification. Analyze their performance.

Dataset used: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv


(https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)

In [1]: import numpy as np


import pandas as pd

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.metrics import accuracy_score

In [2]: df = pd.read_csv("emails.csv")
df.head()

Out[2]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastructu
No.

Email
0 0 0 1 0 0 0 2 0 0 ... 0 0 0 0
1

Email
1 8 13 24 6 6 2 102 1 27 ... 0 0 0 0
2

Email
2 0 0 1 0 0 0 8 0 0 ... 0 0 0 0
3

Email
3 0 5 22 0 5 1 51 2 10 ... 0 0 0 0
4

Email
4 7 6 17 1 5 2 57 0 9 ... 0 0 0 0
5

5 rows × 3002 columns


In [3]: df.tail()

Out[3]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastru
No.

Email
5167 2 2 2 3 0 0 32 0 0 ... 0 0 0 0
5168

Email
5168 35 27 11 2 6 5 151 4 3 ... 0 0 0 0
5169

Email
5169 0 0 1 1 0 0 11 0 0 ... 0 0 0 0
5170

Email
5170 2 7 1 0 2 1 28 2 0 ... 0 0 0 0
5171

Email
5171 22 24 5 1 6 5 148 8 2 ... 0 0 0 0
5172

5 rows × 3002 columns

In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 5172 entries, 0 to 5171

Columns: 3002 entries, Email No. to Prediction

dtypes: int64(3001), object(1)

memory usage: 118.5+ MB

In [5]: df.describe()

Out[5]:
the to ect and for of

count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.00000

mean 6.640565 6.188128 5.143852 3.075599 3.124710 2.627030 55.51740

std 11.745009 9.534576 14.101142 6.045970 4.680522 6.229845 87.57417

min 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000

25% 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 12.00000

50% 3.000000 3.000000 1.000000 1.000000 2.000000 1.000000 28.00000

75% 8.000000 7.000000 4.000000 3.000000 4.000000 2.000000 62.25000

max 210.000000 132.000000 344.000000 89.000000 47.000000 77.000000 1898.00000

8 rows × 3001 columns


In [6]: df.isnull().sum()

Out[6]: Email No. 0

the 0

to 0

ect 0

and 0

for 0

of 0

a 0

you 0

hou 0

in 0

on 0

is 0

this 0

enron 0

i 0

be 0

that 0

will 0

have 0

with 0

your 0

at 0

we 0

s 0

are 0

it 0

by 0

com 0

as 0

..

decisions 0

produced 0

ended 0

greatest 0

degree 0

solmonson 0

imbalances 0

fall 0

fear 0

hate 0

fight 0

reallocated 0

debt 0

reform 0

australia 0

plain 0

prompt 0

remains 0

ifhsc 0

enhancements 0

connevey 0

jay 0

valued 0

lay 0

infrastructure 0

military 0

allowing 0

ff 0

dry 0

Prediction 0

Length: 3002, dtype: int64

Splitting Train and Test dataset

In [7]: x = df.iloc[:,1:3001]
y = df.iloc[:,-1].values

In [8]: x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2,

a) Using K-Nearest Neighbours


In [9]: knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

In [ ]: ​

Analyzing performance

In [10]: print("MSE: ", mean_squared_error(y_test, y_pred))


print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))

MSE: 0.12560386473429952

MAE: 0.12560386473429952

RMSE: 0.3544063553807966

R2 Score: 0.40780091899790494

Accuracy Score for KNN: 0.8743961352657005

b) Using Support Vector Machine(SVM)


In [11]: svc = SVC(C=1.0, gamma='auto', kernel='rbf')
svc.fit(x_test, y_test)
y_pred = svc.predict(x_test)

Analyzing Performance
In [12]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))

MSE: 0.07149758454106281

MAE: 0.07149758454106281

RMSE: 0.2673903224521464

R2 Score: 0.6629020615834228

Accuracy Score for KNN: 0.9285024154589372

In [ ]: ​

You might also like