Mod3_Classification
- Machine learning classification problems are those in which the given data set must be assigned to one of two or more categories.
- Examples:
• Whether a person is suffering from a disease X (answer in Yes or No).
• Given an email, classify it as spam or not spam.
• Given a handwritten character, classify it as one of the known characters.
Every machine learning project begins with understanding the data and defining the objectives. While applying machine learning algorithms to your data set, you are understanding, building, and analyzing the data in order to get the end result.
Project steps:
1. Create the dataset
2. Build the model
3. Train the model
4. Make predictions
To understand various machine learning algorithms let us use the Iris data set, one of the most famous datasets
available.
• Defining the Classification Problem with the IRIS Dataset.
This data set consists of the physical parameters of three species of flower: Versicolor, Setosa, and Virginica. The numeric parameters the dataset contains are sepal width, sepal length, petal width, and petal length. Using this data, we will predict the class of each flower from these parameters. The data consists of continuous numeric values which describe the dimensions of the respective features, and we will train the model based on these features.
-The sepal encloses the petals and is typically green and leaf-like, while the petals are typically colored leaves.
Download the IRIS Dataset: http://archive.ics.uci.edu/ml/datasets/Iris
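Scikit-learn ships a copy of this UCI dataset, so step 1 (create the dataset) can be sketched as below; the variable names (x_train, x_test, and so on) are our choice and are reused in the later snippets:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset bundled with scikit-learn
iris = load_iris()
x, y = iris.data, iris.target            # 150 flowers, 4 measurements each

print(iris.feature_names)                # sepal/petal length and width (cm)
print(iris.target_names)                 # ['setosa' 'versicolor' 'virginica']

# Hold out part of the data for evaluating the trained models
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)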
K-Nearest Neighbour (KNN):
KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm, also known as a lazy learning algorithm because it does no work until a prediction is requested. It works by calculating the distance of one test observation from all the observations of the training dataset and then finding the K nearest neighbours of it. This happens for each and every test observation, and that is how it finds similarities in the data. For calculating distances, KNN uses a distance metric chosen from the list of available metrics.
Why do we need a K-NN Algorithm:
Suppose there are two categories, Category A and Category B, and we have a new data point x1. To which of these categories will this data point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
Example_1:
Take the simplest case of binary classification: suppose we have a group of +ve and -ve points in a dataset D, where each x_i is a point in R^d (d-dimensional real space) and each y_i is its label (+ve or -ve).
From the image, you can see that there are several data points in 2 dimensions, each having a specific label; they are classified according to the +ve and -ve labels. Notice in the image there is one query point, referred to as x_q, which has an unknown label. The points surrounding x_q are considered its neighbours, and the points closest to x_q are its nearest neighbours.
So how can we decide whether a point is nearest or not? By finding the distance between the points. This is where the Euclidean distance measure comes into play.
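As a small illustrative sketch of this idea (the toy numbers below are our own, not from the notes): compute the Euclidean distance from the query point x_q to every training point, sort, and keep the k closest.

import numpy as np

def euclidean(a, b):
    # Straight-line distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

# Toy 2-D training points with +1 / -1 labels (hypothetical values)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8]])
y = np.array([+1, +1, +1, -1, -1])
xq = np.array([3, 4])                       # query point with unknown label

dists = np.array([euclidean(xq, xi) for xi in X])
nearest = np.argsort(dists)[:3]             # indices of the 3 nearest neighbours
print(int(np.sign(y[nearest].sum())))       # majority vote -> prints 1 (+ve)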
The working of the K-NN algorithm for binary classification can be explained by the following steps:
(i) The number of neighbours is the core deciding factor. An odd value of K should be preferred over an even value in order to ensure that there are no ties in the voting. So we will choose k = 5.
(ii) We will calculate the Euclidean distance between the data points. The Euclidean distance is the straight-line distance between two points; for points A = (x1, y1) and B = (x2, y2) it can be calculated as:
d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
(iii) By calculating the Euclidean distances we get the nearest neighbours: three nearest neighbours in category A and two nearest neighbours in category B. Consider the below image:
(iv) As the majority (3 of the 5) of the nearest neighbours are from category A, this new data point must belong to category A.
Mathematical formulation of K-Nearest Neighbour Algorithm for binary classification.
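The formulation itself is not reproduced in these notes; the following is a standard reconstruction (the notation choices are ours), reusing the x_i, y_i labels from Example_1. Given a query point $x_q$, let $N_k(x_q)$ be the set of the $k$ training points closest to $x_q$ under the chosen distance metric. The predicted label is the majority vote:

$$\hat{y}_q = \operatorname{sign}\Big(\sum_{x_i \in N_k(x_q)} y_i\Big), \qquad y_i \in \{-1, +1\},$$

with $k$ chosen odd to avoid ties.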
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)  # train on the earlier split
# Make predictions
predictions = classifier.predict(x_test)
Decision Tree:
A decision tree is a supervised learning algorithm used for both classification and regression problems. It is a decision support technique that forms a tree-like structure.
- A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A decision tree
algorithm divides a training dataset into branches, which further segregate into other branches. Each
internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier().fit(x_train, y_train)  # train on the earlier split
# Make predictions
predictions = classifier.predict(x_test)
Support Vector Machine (SVM):
- There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
- The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
- We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Example:
SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), so it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
How does SVM work?
Linear SVM:
(i) Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify each pair (x1, x2) of coordinates as either green or blue. Consider the image:
(ii) Since this is a 2-D space, just by using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
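A minimal scikit-learn sketch of this (continuing with the Iris training split created earlier; the linear kernel and the variable names are our assumptions, not from the notes):

from sklearn.svm import SVC

# Fit a maximum-margin classifier with a linear decision boundary
svm = SVC(kernel='linear')
svm.fit(x_train, y_train)

# The support vectors are the training points closest to the hyperplane
print(svm.support_vectors_)
print(svm.predict(x_test[:5]))   # predicted classes for a few test flowers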
Non-Linear SVM:
(i) If data is linearly arranged, then we can separate it by using a straight line, but
for non-linear data, we cannot draw a single straight line. Consider the image:
(ii) So to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the image:
- So now, SVM will divide the datasets into classes in the following way. Since we are in 3-D space, the separating hyperplane looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
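As an illustrative sketch of this trick (our own toy example using scikit-learn's make_circles, not the figure's data): two concentric rings cannot be separated by a straight line in (x, y), but after adding z = x² + y² a linear classifier separates them.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension z = x^2 + y^2
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3 = np.hstack([X, z])

# In 3-D the two rings sit at different heights, so a plane separates them
clf = SVC(kernel='linear').fit(X3, y)
print(clf.score(X3, y))   # accuracy close to 1.0 on this toy data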
Model Evaluation Metrics using scikit-learn and the IRIS Dataset.
(i) Precision and recall
- Precision is intuitively the ability of the classifier not to label as positive a sample that is negative. (OR) Precision is the fraction of the identified samples that are relevant (Equation 1):
Precision = TP / (TP + FP)   (Equation 1)
- Recall is intuitively the ability of the classifier to find all the positive samples. (OR) Recall is the fraction of relevant samples that have been identified out of the total number of relevant samples (Equation 2):
Recall = TP / (TP + FN)   (Equation 2)
where TP, FP, and FN denote the counts of true positives, false positives, and false negatives.
(ii) Confusion matrix
Example:
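The matrix below can be reproduced with scikit-learn's confusion_matrix; the y_true / y_pred values here are illustrative, chosen so that the output matches the matrix shown:

from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]    # actual class labels (hypothetical)
y_pred = [0, 0, 2, 2, 0, 2]    # predicted class labels (hypothetical)
confusion_matrix(y_true, y_pred)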
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])
(iii) Classification report
The classification_report function builds a text report showing the main classification metrics.
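A minimal usage sketch (assuming the y_test split and the predictions from one of the classifiers trained above):

from sklearn.metrics import classification_report

# Per-class precision, recall and F1-score for the Iris test set
print(classification_report(y_test, predictions,
                            target_names=iris.target_names))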