
Module 3 – Classification

• Defining a classification problem with the Iris dataset.
• Mathematical formulation of the K-Nearest Neighbour algorithm for binary classification.
• Implementation of the K-Nearest Neighbour algorithm using scikit-learn.
• Classification using decision trees.
• Construction of decision trees based on entropy.
• Implementation of decision trees for the Iris dataset.
• Classification using Support Vector Machines.
• SVM for binary classification.
• Regulating different functional parameters of SVM using scikit-learn.
• SVM for multi-class classification.
• Implementation of SVM using the Iris dataset.
• Implementation of model evaluation metrics using scikit-learn and the Iris dataset.
- A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.

- Machine learning classification problems are those which require the given data set to be classified into two or more categories.

- Examples:
• Whether a person is suffering from a disease X (answer: yes or no).
• Given an email, classify it as spam or not.
• Given a handwritten character, classify it as one of the known characters.

Every machine learning project begins with understanding the data and defining the objectives. While applying machine learning algorithms to your data set, you are understanding, building and analyzing the data so as to get the end result.

Following are the steps involved in creating a well-defined ML project:

Project steps:
1. Create the dataset
2. Build the model
3. Train the model
4. Make predictions

To understand various machine learning algorithms, let us use the Iris data set, one of the most famous datasets available.

• Defining a classification problem with the Iris dataset.

This data set consists of the physical parameters of three species of Iris flower: Versicolor, Setosa and Virginica. The numeric parameters which the dataset contains are sepal width, sepal length, petal width and petal length. We will predict the species of the flowers based on these parameters. The data consists of continuous numeric values which describe the dimensions of the respective features, and we will train the model on these features.

- The sepal encloses the petals and is typically green and leaf-like, while the petals are typically colored leaves.

Download the Iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris

The data set consists of:

• 150 samples
• 3 labels: species of Iris (Iris setosa, Iris virginica and Iris versicolor)
• 4 features: sepal length, sepal width, petal length and petal width, in cm
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3 1.4 0.2 Iris-setosa
- - - - - -
- - - - - -
49 5.3 3.7 1.5 0.2 Iris-setosa
50 5 3.3 1.4 0.2 Iris-setosa
51 7 3.2 4.7 1.4 Iris-versicolor
52 6.4 3.2 4.5 1.5 Iris-versicolor
- - - - - -
- - - - - -
99 5.1 2.5 3 1.1 Iris-versicolor
100 5.7 2.8 4.1 1.3 Iris-versicolor
101 6.3 3.3 6 2.5 Iris-virginica
102 5.8 2.7 5.1 1.9 Iris-virginica
- - - - - -
- - - - - -
149 6.2 3.4 5.4 2.3 Iris-virginica
150 5.9 3 5.1 1.8 Iris-virginica
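
A quick way to verify these numbers is with scikit-learn's bundled copy of the dataset (a minimal sketch; not part of the original slides):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)   # sepal/petal length and width, in cm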
K-Nearest Neighbor (KNN) Algorithm

KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm, also known as a lazy learning algorithm. It works by calculating the distance of one test observation from all the observations of the training dataset and then finding the K nearest neighbors of it. This happens for each and every test observation, and that is how it finds similarities in the data. For calculating distances, KNN uses a distance metric from the list of available metrics.
Why do we need a K-NN algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1: in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
Example 1:
Take the simplest case of binary classification: suppose we have a group of +ve and -ve points in the dataset D, such that the x_i are data points in R^d and the y_i are their labels (+ve or -ve).

From the above image you can conclude that there are several data points in two dimensions, each with a specific label, classified according to the +ve and -ve labels. Notice in the image there is one query point, referred to as x_q, which has an unknown label. The points surrounding x_q are considered its neighbours, and the points closest to x_q are its nearest neighbours.

So how can we conclude whether a point is nearest or not? By finding the distance between the points. This is where the Euclidean distance measure comes in.
The working of the K-NN algorithm for binary classification can be explained with the following steps:

Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new data point to every training point.
Step 3: Take the K nearest neighbors as per the calculated Euclidean distances.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
Step 6: Our model is ready.

See the example in the next slide:


(i) Suppose we have a new data point and we need to put it in the required category. Consider the image:

(ii) The number of neighbors is the core deciding factor. An odd value of K should be preferred over even values in order to ensure that there are no ties in the voting. So we will choose k = 5.

(iii) We will calculate the Euclidean distance between the data points. The Euclidean distance between two points A = (x1, y1) and B = (x2, y2) is:

d(A, B) = √((x2 − x1)² + (y2 − y1)²)

(iv) By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

(v) As the 3 nearest neighbors are from category A, this new data point must belong to category A.
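
The procedure in Steps 1–6 can also be written out in a few lines of plain Python. The following is a minimal from-scratch sketch (not from the slides); the points, labels and query are made-up illustrative data:

import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    # Steps 2-3: Euclidean distance from the query to every training point,
    # sorted so the k nearest come first
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 4: count the labels among the k nearest neighbours
    votes = Counter(label for _, label in distances[:k])
    # Step 5: assign the majority category
    return votes.most_common(1)[0][0]

# Hypothetical 2-d points with +ve/-ve labels, mirroring the example above
points = [(1, 1), (2, 1), (1.5, 2), (6, 5), (7, 6), (6.5, 5.5)]
labels = ["+ve", "+ve", "+ve", "-ve", "-ve", "-ve"]
print(knn_predict(points, labels, query=(2, 2), k=3))   # prints "+ve"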
Mathematical formulation of the K-Nearest Neighbour algorithm for binary classification

Here male is denoted with the numeric value 0 and female with 1. Let us find in which class Angelina will lie, whose k factor is 3 and age is 5. We have to find the distances using the Euclidean distance formula given above.
# Load the Iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Assign the data and target to separate variables
x = iris.data
y = iris.target

# Split the dataset:
# x_train contains the training features, x_test the testing features,
# y_train contains the training labels, y_test the testing labels
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Build the model (n_neighbors defaults to 5, matching k = 5 in the example above)
from sklearn import neighbors
classifier = neighbors.KNeighborsClassifier(n_neighbors=5)

# Train the model
classifier.fit(x_train, y_train)

# Make predictions
predictions = classifier.predict(x_test)

# Evaluate accuracy on the held-out test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
Decision Tree

A decision tree is a supervised learning algorithm used for both classification and regression problems. It is a
decision support technique that forms a tree-like structure.

- A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A decision tree
algorithm divides a training dataset into branches, which further segregate into other branches. Each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.

- The nodes in the decision tree represent attributes that are used for predicting the outcome. Decision nodes provide a link to the leaves. The following diagram shows the three types of nodes in a decision tree.
Entropy and information gain are the building blocks of decision trees.

- Entropy is an information theory metric that measures the impurity or uncertainty in a group of observations. It determines how a decision tree chooses to split data. The image below gives a better description of the purity of a set.

- Information gain is a measure of how much uncertainty in the target variable is reduced, given a set of independent variables.
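
Neither formula is shown on the slides, so here is a small sketch under the standard definitions: entropy H(S) = −Σ p_i log2(p_i) over the class proportions p_i, and information gain is the parent's entropy minus the size-weighted entropy of the child splits. The example labels are made up:

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, splits):
    # Parent entropy minus the size-weighted entropy of the child splits
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

parent = ["buy", "buy", "buy", "no", "no", "no"]       # H = 1.0 bit
splits = [["buy", "buy", "buy"], ["no", "no", "no"]]   # pure children, H = 0
print(entropy(parent), information_gain(parent, splits))   # 1.0 1.0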
EXAMPLE:
Suppose we want to predict whether a customer will purchase a mobile phone or not. The features of the phone form the basis of the decision. This analysis can be presented in a decision tree diagram.

The root node and decision nodes of the tree represent the features of the phone mentioned above, and the leaf nodes represent the final output: either buying or not buying. The main features that determine the choice are the price, internal storage, and Random Access Memory (RAM). The decision tree will appear as follows.
# Load the Iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# Assign the data and target to separate variables
x = iris.data
y = iris.target

# Split the dataset:
# x_train contains the training features, x_test the testing features,
# y_train contains the training labels, y_test the testing labels
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Build the model using the decision tree algorithm;
# criterion="entropy" selects the entropy-based splitting described above
# (scikit-learn's default criterion is "gini")
from sklearn import tree
classifier = tree.DecisionTreeClassifier(criterion="entropy")

# Train the model
classifier.fit(x_train, y_train)

# Make predictions
predictions = classifier.predict(x_test)

# Evaluate accuracy on the held-out test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
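
To inspect the entropy-based splits the classifier actually learned, scikit-learn can print the fitted tree as text; a small follow-on sketch, reusing the classifier and iris objects from the code above:

from sklearn.tree import export_text
print(export_text(classifier, feature_names=iris.feature_names))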
Support Vector Machine (SVM) Algorithm
- Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
- The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
- SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
- Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
Hyperplane:

- There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

- The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.

- We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of each class.
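
In standard notation (not shown on the slides themselves), the hyperplane is the set of points x satisfying

w · x + b = 0

and the maximum-margin requirement can be written as the hard-margin optimization problem

minimize    ||w||² / 2
subject to  y_i (w · x_i + b) ≥ 1   for every training point (x_i, y_i), with labels y_i ∈ {−1, +1},

whose solution gives a margin of width 2 / ||w||.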
Example:
SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors); on the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM can be of two types:

• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes by a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.

• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified by a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
How does SVM work?
Linear SVM:

(i) Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image:

(ii) As it is a 2-d space, just by using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:

(i) If data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the image:

(ii) To separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as in the below image:

- So now SVM will divide the datasets into classes in the following way. Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes a circular boundary around the data:
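
The topic list mentions regulating the functional parameters of SVM with scikit-learn, but the slides include no SVM code, so here is a minimal sketch mirroring the KNN and decision tree examples above. kernel, C and gamma are the main functional parameters of sklearn.svm.SVC; since SVC trains one-vs-one classifiers internally, the same code also covers multi-class classification over the three Iris species:

# Load and split the Iris dataset, as in the earlier examples
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)

# Functional parameters:
#   kernel: 'linear' for linearly separable data, 'rbf' for non-linear data
#   C:      trades margin width against misclassification of training points
#   gamma:  controls how far the influence of a single RBF sample reaches
classifier = SVC(kernel="rbf", C=1.0, gamma="scale")

classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)
print(accuracy_score(y_test, predictions))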
Model Evaluation Metrics using scikit-learn and the Iris dataset

(i) Precision, recall and F-measures

- Precision is intuitively the ability of the classifier not to label as positive a sample that is negative; equivalently, it is the fraction of the identified samples that are relevant (Equation 1):

Precision = TP / (TP + FP)    (Equation 1)

- Recall is intuitively the ability of the classifier to find all the positive samples; equivalently, it is the fraction of relevant samples that have been identified out of the total number of relevant samples (Equation 2):

Recall = TP / (TP + FN)    (Equation 2)

- The F-measure or F1-score is the harmonic mean of precision and recall (Equation 3):

F1 = 2 · Precision · Recall / (Precision + Recall)    (Equation 3)

(TP = true positives, FP = false positives, FN = false negatives.)
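
As a small sketch of computing these metrics with scikit-learn (reusing y_test and predictions from any of the Iris pipelines above; average='macro' averages the per-class scores, since Iris has three classes):

from sklearn.metrics import precision_score, recall_score, f1_score

# Macro-averaged scores: compute each metric per class, then average
print(precision_score(y_test, predictions, average="macro"))
print(recall_score(y_test, predictions, average="macro"))
print(f1_score(y_test, predictions, average="macro"))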


(ii) Confusion matrix:
The confusion_matrix function computes the confusion matrix to evaluate the accuracy of a classification. By definition, a confusion matrix C is such that C_ij is equal to the number of observations known to be in group i but predicted to be in group j. Here is an example of such a confusion matrix:

Example:

from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
(iii) Classification report
The classification_report function builds a text report showing the main classification metrics.

Here is a small example with custom target_names and inferred labels:

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

             precision    recall  f1-score   support

     class 0      0.67      1.00      0.80         2
     class 1      0.00      0.00      0.00         1
     class 2      1.00      1.00      1.00         2

 avg / total      0.67      0.80      0.72         5


