Assignment B 2 EmailClassification

Uploaded by

Mahesh Kadam

B.E. (COMP) Sinhgad Institute of Technology, Lonavala LP_III
Name of the Student: __________________________________ Roll No: ____
CLASS: - B. E. [COMP] Division: A, B, C Course: LP-III
Machine Learning
Assignment No. 02
EMAIL SPAM CLASSIFICATION
Marks: /10

Date of Performance: / /2023


2024 Sign with Date:

Title : Classify the email using the binary classification method

Objectives:
• To classify email using binary classification method.
• To analyse performance of KNN and SVM classifiers.

Outcomes:
• Predict the class of user.

PEOs, POs, PSOs and COs satisfied


PEOs: I, III POs: 1, 2, 3, 4, 5 PSOs: 1, 2 COs: 1

Problem Statement:
Classify the email using the binary classification method. Email Spam detection has two states:
a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support
Vector Machine for classification. Analyze their performance.
Dataset link: The emails.csv dataset on Kaggle
https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
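
As a rough orientation, the workflow asked for in the problem statement might look like the sketch below. It is not the assignment solution: it assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in for emails.csv, so the sample count, feature count, split ratio and k=3 are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for emails.csv (assumption, not the real dataset):
# 500 samples, 20 numeric features, label 0 = not spam, 1 = spam
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out 25% of the samples for testing, keeping the class ratio
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)        # train on the training split
    y_pred = model.predict(x_test)     # predict labels for the test split
    results[name] = accuracy_score(y_test, y_pred)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy:", results[name])
```

With the real dataset, X and y would instead be built from the feature and label columns of emails.csv.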

Theory:
K-Nearest Neighbors

KNN is a non-parametric, lazy learning algorithm. Non-parametric means that it makes no
assumption about the underlying data distribution; in other words, the model structure is
determined from the dataset. This is helpful in practice, because most real-world datasets do
not follow theoretical mathematical assumptions. Lazy means that the algorithm does not build
a model from the training data points in advance; instead, all training data is used in the
testing phase. This makes training faster and the testing phase slower and costlier, in both
time and memory. In the worst case, KNN must scan all data points for every prediction, and it
must keep all training data in memory to do so.

1 | Department of Computer Engineering, SIT, Lonavala


How does the KNN algorithm work?


In KNN, K is the number of nearest neighbors, and it is the core deciding factor. K is
generally chosen to be an odd number when the number of classes is 2, so that the vote cannot
tie. When K=1, the algorithm is known as the nearest neighbor algorithm. This is the simplest
case. Suppose P1 is the point whose label needs to be predicted. First, you find the one point
closest to P1, and then the label of that nearest point is assigned to P1.

In the general case, you find the k points closest to P1 and then classify P1 by a majority
vote of its k neighbors. Each neighbor votes for its class, and the class with the most votes
is taken as the prediction. To find the closest points, you compute the distance between
points using a distance measure such as Euclidean distance, Hamming distance, Manhattan
distance or Minkowski distance.
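
The voting procedure described above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements KNN; the function names and the toy points are hypothetical.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=minkowski):
    """Label `query` by majority vote among its k nearest training points."""
    # Lazy learning: no model is built; every training point is scanned here.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, label 0 = not spam, 1 = spam
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2, 2), k=3))  # all 3 nearest neighbors are class 0
print(knn_predict(points, labels, (6, 5), k=1))  # nearest neighbor (6, 6) is class 1
```

Passing distance=hamming instead would suit categorical or binary feature vectors.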

KNN Classifier Building in Scikit-learn

Generating Model

First, import the KNeighborsClassifier class and create a KNN classifier object by passing
the number of neighbors as an argument to KNeighborsClassifier().

Then, fit your model on the train set using fit() and perform prediction on the test set using
predict().

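from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier that votes among the 3 nearest neighbors
model = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training set
# (features and label are assumed to hold the encoded training inputs and targets)
model.fit(features, label)

# Predict the output for one new sample
predicted = model.predict([[0, 2]])  # 0: Overcast, 2: Mild
print(predicted)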

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.
[Figure: green and blue data points plotted against features x1 and x2]

Because this is a 2-D space, a straight line is enough to separate these two classes.
But there can be multiple lines that separate these classes.
[Figure: several candidate lines separating the two classes]

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
is called a hyperplane. The SVM algorithm finds the points from both classes that lie closest
to the boundary. These points are called support vectors. The distance between the support
vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this
margin. The hyperplane with the maximum margin is called the optimal hyperplane.
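
To make the idea of margin concrete, the sketch below measures the margin of two candidate separating lines on a toy dataset and picks the wider one. It only compares two hand-picked lines; a real SVM finds the maximum-margin hyperplane by solving an optimization problem over all possible hyperplanes. The data points, the candidate lines and the function names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Smallest absolute distance from any point to the hyperplane."""
    return min(abs(signed_distance(w, b, x)) for x in points)


# Hypothetical 2-D data: the "green" class sits lower left, "blue" upper right
green = [(1, 1), (2, 1), (1, 2)]
blue = [(4, 4), (5, 4), (4, 5)]
points = green + blue

# Two hypothetical candidate lines (w, b), both of which separate the classes
candidates = {
    "x1 + x2 = 5.5": ((1, 1), -5.5),
    "x1 = 3": ((1, 0), -3.0),
}

for name, (w, b) in candidates.items():
    print(name, round(margin(w, b, points), 3))

# The candidate whose closest points are farthest away has the wider margin
best = max(candidates, key=lambda name: margin(*candidates[name], points))
print("widest margin:", best)
```

The points that attain the minimum distance for the chosen line play the role of the support vectors.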

Support Vector Machine Classifier Building in Scikit-learn

from sklearn.svm import SVC  # "Support vector classifier"
from sklearn.metrics import confusion_matrix

# x_train, y_train, x_test and y_test are assumed to come from an earlier train/test split
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test set result
y_pred = classifier.predict(x_test)

# Creating the confusion matrix
cm = confusion_matrix(y_test, y_pred)

Evaluating Model
Accuracy can be computed by comparing actual test set values and predicted values.

# Model accuracy: how often is the classifier correct?
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
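
To show what the confusion matrix and the accuracy score measure, the following sketch computes them by hand in plain Python, together with precision and recall. The label lists are made-up values for illustration, and the function names are hypothetical; in practice the scikit-learn functions would be used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary classification problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # spam correctly flagged
            else:
                fn += 1  # spam missed
        else:
            if predicted == positive:
                fp += 1  # normal email wrongly flagged
            else:
                tn += 1  # normal email correctly passed
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)


def precision(tn, fp, fn, tp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tn, fp, fn, tp):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration: 1 = spam, 0 = not spam
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_test, y_pred)
print("tn, fp, fn, tp:", counts)
print("Accuracy:", accuracy(*counts))
print("Precision:", precision(*counts))
print("Recall:", recall(*counts))
```

For binary labels 0 and 1, the matrix returned by confusion_matrix is laid out as [[tn, fp], [fn, tp]], so the four counts above correspond to its four cells.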

Conclusion:
Thus, we implemented SVM and KNN classifiers using the Python scikit-learn library.

A. Write short answers to the following questions:

1. Explain the Confusion Matrix with respect to Machine Learning algorithms.
2. Explain the K Nearest Neighbor algorithm.
3. Explain the Support Vector Machine algorithm.
4. Explain the following classification evaluation metrics.
a. Accuracy
b. Precision
c. Recall
d. AUC-ROC
e. F-beta score
