
Module 3 – Classification

• Defining a classification problem with the Iris dataset.
• Mathematical formulation of the K-Nearest Neighbour algorithm for binary classification.
• Implementation of the K-Nearest Neighbour algorithm using scikit-learn.
• Classification using decision trees.
• Construction of decision trees based on entropy.
• Implementation of decision trees for the Iris dataset.
• Classification using Support Vector Machines.
• SVM for binary classification.
• Regulating different functional parameters of SVM using scikit-learn.
• SVM for multi-class classification.
• Implementation of SVM using the Iris dataset.
• Implementation of model evaluation metrics using scikit-learn and the Iris dataset.
- A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.

- Machine learning classification problems are those which require the given data set to be classified into two or more categories.

- Examples:
• Whether a person is suffering from a disease X (answer: yes or no).
• Given an email, classify it as spam or not.
• Given a handwritten character, classify it as one of the known characters.

Every machine learning project begins with understanding the data and defining the objectives. While applying machine learning algorithms to your data set, you are understanding, building and analyzing the data so as to get the end result.

Following are the steps involved in creating a well-defined ML project:

Project steps:
1. Create the dataset
2. Build the model
3. Train the model
4. Make predictions

To understand various machine learning algorithms, let us use the Iris data set, one of the most famous datasets available.

• Defining a classification problem with the Iris dataset.

This data set consists of the physical parameters of three species of Iris flower: Versicolor, Setosa and Virginica. The numeric parameters which the dataset contains are sepal width, sepal length, petal width and petal length. We will predict the species of the flowers based on these parameters. The data consists of continuous numeric values which describe the dimensions of the respective features, and we will train the model on these features.

- The sepal encloses the petals and is typically green and leaf-like, while the petals are typically colored leaves.

Download the Iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris

The data set consists of:

• 150 samples
• 3 labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor)
• 4 features: sepal length, sepal width, petal length and petal width, in cm
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3 1.4 0.2 Iris-setosa
- - - - - -
- - - - - -
49 5.3 3.7 1.5 0.2 Iris-setosa
50 5 3.3 1.4 0.2 Iris-setosa
51 7 3.2 4.7 1.4 Iris-versicolor
52 6.4 3.2 4.5 1.5 Iris-versicolor
- - - - - -
- - - - - -
99 5.1 2.5 3 1.1 Iris-versicolor
100 5.7 2.8 4.1 1.3 Iris-versicolor
101 6.3 3.3 6 2.5 Iris-virginica
102 5.8 2.7 5.1 1.9 Iris-virginica
- - - - - -
- - - - - -
149 6.2 3.4 5.4 2.3 Iris-virginica
150 5.9 3 5.1 1.8 Iris-virginica
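
A quick way to verify these numbers is with scikit-learn's bundled copy of the dataset (a minimal sketch; not part of the original slides):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)   # sepal/petal length and width, in cm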
K-Nearest Neighbor (KNN) Algorithm

KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm, also known as a lazy learning algorithm. It works by calculating the distance of one test observation from all the observations of the training dataset and then finding the K nearest neighbors of it. This happens for each and every test observation, and that is how it finds similarities in the data. For calculating distances, KNN uses a distance metric from the list of available metrics.
Why do we need a K-NN algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
Example 1:
Take the simplest case of binary classification: suppose we have a group of +ve and -ve points in the dataset D, such that the x_i are data points in R^d and the y_i are their labels (+ve or -ve).

From the above image you can conclude that there are several data points in two dimensions, each with a specific label, classified according to the +ve and -ve labels. Notice in the image there is one query point, referred to as x_q, which has an unknown label. The points surrounding x_q are considered its neighbours, and the points closest to x_q are its nearest neighbours.

So how can we conclude whether a point is nearest or not? By finding the distance between the points. This is where the Euclidean distance measure comes in.
The working of the K-NN algorithm for binary classification can be explained with the following steps:

Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new data point to every training point.
Step 3: Take the K nearest neighbors as per the calculated Euclidean distances.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
Step 6: Our model is ready.

See the example in the next slide:


(i) Suppose we have a new data point and we need to put it in the required category. Consider the image:

(ii) The number of neighbors is the core deciding factor. An odd value of K should be preferred over even values in order to ensure that there are no ties in the voting. So we will choose k = 5.

(iii) We will calculate the Euclidean distance between the data points. The Euclidean distance between two points A = (x1, y1) and B = (x2, y2) is:

d(A, B) = √((x2 − x1)² + (y2 − y1)²)

(iv) By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

(v) As the 3 nearest neighbors are from category A, this new data point must belong to category A.
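
The procedure in Steps 1–6 can also be written out in a few lines of plain Python. The following is a minimal from-scratch sketch (not from the slides); the points, labels and query are made-up illustrative data:

import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    # Steps 2-3: Euclidean distance from the query to every training point,
    # sorted so the k nearest come first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 4: count the labels among the k nearest neighbours
    votes = Counter(label for _, label in distances[:k])
    # Step 5: assign the majority category
    return votes.most_common(1)[0][0]

# Hypothetical 2-d points with +ve/-ve labels, mirroring the example above
points = [(1, 1), (2, 1), (1.5, 2), (6, 5), (7, 6), (6.5, 5.5)]
labels = ["+ve", "+ve", "+ve", "-ve", "-ve", "-ve"]
print(knn_predict(points, labels, query=(2, 2), k=3))   # prints "+ve"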
Mathematical formulation of the K-Nearest Neighbour algorithm for binary classification

Here male is denoted with the numeric value 0 and female with 1. Let us find in which class Angelina will lie, whose k factor is 3 and age is 5. We have to find the distances using the Euclidean distance formula given above.
# Load the Iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Assign the data and target to separate variables
x = iris.data
y = iris.target

# Split the dataset:
# x_train contains the training features, x_test the testing features,
# y_train contains the training labels, y_test the testing labels
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Build the model (n_neighbors defaults to 5, matching k = 5 in the example above)
from sklearn import neighbors
classifier = neighbors.KNeighborsClassifier(n_neighbors=5)

# Train the model
classifier.fit(x_train, y_train)

# Make predictions
predictions = classifier.predict(x_test)

# Evaluate accuracy on the held-out test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
Decision Tree

A decision tree is a supervised learning algorithm used for both classification and regression problems. It is a
decision support technique that forms a tree-like structure.

- A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A decision tree
algorithm divides a training dataset into branches, which further segregate into other branches. Each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.

- The nodes in the decision tree represent attributes that are used for predicting the outcome. Decision nodes provide a link to the leaves. The following diagram shows the three types of nodes in a decision tree.
Entropy and information gain are the building blocks of decision trees.

- Entropy is an information theory metric that measures the impurity or uncertainty in a group of observations. It determines how a decision tree chooses to split data. The image below gives a better description of the purity of a set.

- Information gain is a measure of how much uncertainty in the target variable is reduced, given a set of independent variables.
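
Neither formula is shown on the slides, so here is a small sketch under the standard definitions: entropy H(S) = −Σ p_i log2(p_i) over the class proportions p_i, and information gain is the parent's entropy minus the size-weighted entropy of the child splits. The example labels are made up:

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, splits):
    # Parent entropy minus the size-weighted entropy of the child splits
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

parent = ["buy", "buy", "buy", "no", "no", "no"]       # H = 1.0 bit
splits = [["buy", "buy", "buy"], ["no", "no", "no"]]   # pure children, H = 0
print(entropy(parent), information_gain(parent, splits))   # 1.0 1.0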
EXAMPLE:
Suppose we want to predict whether a customer will purchase a mobile phone or not. The features of the phone form the basis of the decision. This analysis can be presented in a decision tree diagram.

The root node and decision nodes of the tree represent the features of the phone mentioned above, and the leaf nodes represent the final output: either buying or not buying. The main features that determine the choice are the price, internal storage, and Random Access Memory (RAM). The decision tree will appear as follows.
# Load the Iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Assign the data and target to separate variables
x = iris.data
y = iris.target

# Split the dataset:
# x_train contains the training features, x_test the testing features,
# y_train contains the training labels, y_test the testing labels
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Build the model using the decision tree algorithm;
# criterion="entropy" selects the entropy-based splitting described above
# (scikit-learn's default criterion is "gini")
from sklearn import tree
classifier = tree.DecisionTreeClassifier(criterion="entropy")

# Train the model
classifier.fit(x_train, y_train)

# Make predictions
predictions = classifier.predict(x_test)

# Evaluate accuracy on the held-out test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
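
To inspect the entropy-based splits the classifier actually learned, scikit-learn can print the fitted tree as text; a small follow-on sketch, reusing the classifier and iris objects from the code above:

from sklearn.tree import export_text
print(export_text(classifier, feature_names=iris.feature_names))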
Support Vector Machine (SVM) Algorithm
- Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
- The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
- SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
- Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
Hyperplane:

- There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

- The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.

- We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of each class.
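
In standard notation (not shown on the slides themselves), the hyperplane is the set of points x satisfying

w · x + b = 0

and the maximum-margin requirement can be written as the hard-margin optimization problem

minimize    ||w||² / 2
subject to  y_i (w · x_i + b) ≥ 1   for every training point (x_i, y_i), with labels y_i ∈ {−1, +1},

whose solution gives a margin of width 2 / ||w||.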
Example:
SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM can be of two types:

• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes by a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified by a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
How does SVM work?
Linear SVM:

(i) Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image:

(ii) As it is a 2-d space, just by using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:

(i) If data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the image:

(ii) To separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as in the below image:

- So now SVM will divide the datasets into classes in the following way. Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes a circular boundary around the data:
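
The topic list mentions regulating the functional parameters of SVM with scikit-learn, but the slides include no SVM code, so here is a minimal sketch mirroring the KNN and decision tree examples above. kernel, C and gamma are the main functional parameters of sklearn.svm.SVC; since SVC trains one-vs-one classifiers internally, the same code also covers multi-class classification over the three Iris species:

# Load and split the Iris dataset, as in the earlier examples
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)

# Functional parameters:
#   kernel: 'linear' for linearly separable data, 'rbf' for non-linear data
#   C:      trades margin width against misclassification of training points
#   gamma:  controls how far the influence of a single RBF sample reaches
classifier = SVC(kernel="rbf", C=1.0, gamma="scale")

classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)
print(accuracy_score(y_test, predictions))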
Model Evaluation Metrics using scikit-learn and the Iris dataset

(i) Precision, recall and F-measures

- Precision is intuitively the ability of the classifier not to label as positive a sample that is negative; equivalently, it is the fraction of the identified samples that are relevant (Equation 1):

Precision = TP / (TP + FP)    (Equation 1)

- Recall is intuitively the ability of the classifier to find all the positive samples; equivalently, it is the fraction of relevant samples that have been identified out of the total number of relevant samples (Equation 2):

Recall = TP / (TP + FN)    (Equation 2)

- The F-measure or F1-score is the harmonic mean of precision and recall (Equation 3):

F1 = 2 · Precision · Recall / (Precision + Recall)    (Equation 3)

(TP = true positives, FP = false positives, FN = false negatives.)
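
As a small sketch of computing these metrics with scikit-learn (reusing y_test and predictions from any of the Iris pipelines above; average='macro' averages the per-class scores, since Iris has three classes):

from sklearn.metrics import precision_score, recall_score, f1_score

# Macro-averaged scores: compute each metric per class, then average
print(precision_score(y_test, predictions, average="macro"))
print(recall_score(y_test, predictions, average="macro"))
print(f1_score(y_test, predictions, average="macro"))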


(ii) Confusion matrix:
The confusion_matrix function computes the confusion matrix to evaluate the accuracy of a classification. By definition, a confusion matrix C is such that C_ij is equal to the number of observations known to be in group i but predicted to be in group j. Here is an example of such a confusion matrix:

Example:

from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
(iii) Classification report
The classification_report function builds a text report showing the main classification metrics.

Here is a small example with custom target_names and inferred labels:

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

     class 0      0.67      1.00      0.80         2
     class 1      0.00      0.00      0.00         1
     class 2      1.00      1.00      1.00         2

 avg / total      0.67      0.80      0.72         5


