
Module 2

Supervised Learning Algorithms


Contents
• Regression
-Linear
-Logistic
-Polynomial
• Classification
- KNN Classifier
- Decision Tree
- Random Forest
- SVM
Performance Metrics – Confusion Matrix
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the
relationship between the dependent variable (y) and the independent
variable (x) as an nth degree polynomial.
• The Polynomial Regression equation is given below:

y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
Linear vs Polynomial
• The main steps involved in Polynomial Regression are given below:

K-Nearest Neighbors Algorithm (KNN)
• Intuition behind the KNN Algorithm
Features
• The K-NN algorithm is a versatile and widely used machine learning algorithm,
valued primarily for its simplicity and ease of implementation.
• It does not require any assumptions about the underlying data distribution.
• It can also handle both numerical and categorical data, making it a flexible
choice for various types of datasets in classification and regression tasks.
• It is a non-parametric method that makes predictions based on the
similarity of data points in a given dataset.
• K-NN is less sensitive to outliers compared to other algorithms.

• The K-NN algorithm works by finding the K nearest neighbors to a
given data point based on a distance metric, such as Euclidean
distance.
• The class or value of the data point is then determined by the
majority vote or average of the K neighbors.
• This approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.

Distance Metrics Used in KNN Algorithm

• Euclidean Distance
• Manhattan Distance
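For reference (standard definitions), for two points x = (x1, …, xn) and y = (y1, …, yn):
- Euclidean distance: d(x, y) = sqrt( Σ (xi − yi)² )
- Manhattan distance: d(x, y) = Σ |xi − yi|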

• The K-NN algorithm compares a new data entry to the values in a
given data set (with different classes or categories).
• Based on its closeness or similarity to a given number (K) of neighbors,
the algorithm assigns the new data point to a class or category in the
training data set.
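A minimal sketch of this procedure, assuming scikit-learn's KNeighborsClassifier (the data is illustrative, not from the slides):

```python
# Minimal KNN classification sketch using scikit-learn (assumed library).
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: two features per point, two classes (0 and 1)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# K = 3 neighbors, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# The new entry is assigned the majority class among its 3 nearest neighbors
print(knn.predict([[5, 5]]))
```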

Steps in KNN Algorithm
KNN Example 1
• Since the value of K is 3, the algorithm will only consider the 3 nearest
neighbors to the green point (new entry). This is represented in the
graph above.

KNN Example 2
• Consider following dataset
Assumptions
KNN Algorithm
Decision Tree
• Decision trees are a key tool in machine learning.
• The model predicts outcomes from input data through a tree-like
structure.
• They offer interpretability, versatility, and simple visualization,
making them valuable for both classification and regression tasks.

Concept
• It is a tree-like structure where
- each internal node tests an attribute,
- each branch corresponds to an attribute value, and
- each leaf node represents the final decision or prediction.
• While decision trees have advantages like ease of understanding,
they may face challenges such as overfitting.
• Understanding their terminologies and formation process is essential
for effective application in diverse scenarios.
• Decision trees are drawn upside down, which means the root is at the top
and this root is then split into several nodes.
• In layman's terms, decision trees are nothing but a series of if-else
statements.
• The tree checks whether a condition is true, and if it is, it moves to the next
node attached to that decision.
Example 1:
• Here, it will ask –
• What is the weather?
• Is it sunny, cloudy, or rainy?
• Depending on the answer, it will go to the next feature, which is humidity or wind.
• It will again check whether the wind is strong or weak; if the wind is weak
and it is rainy, then the person may go and play.
We see that if the weather is cloudy, then the decision is to go and play.
Why didn't it split further? Why did it stop there?
• In simple terms,
• the output in the training dataset is always "yes" for cloudy weather; since
there is no disorder here, we don't need to split the node
further.

• The key measures used for splitting are entropy, information gain, and the Gini index.
• The goal of machine learning is to decrease uncertainty or disorder
in the dataset, and for this we use decision trees.
Questions
• How do I know what should be the root node?
• What should be the decision node?
• When should I stop splitting?
• To decide this, there is a metric called “Entropy” which is the amount
of uncertainty in the dataset.
• Decision Tree algorithm works in simpler steps
• Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
• Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
• Branching Out: Based on the answer to that question, it divides the
data into smaller subsets, creating new branches. Each branch
represents a possible route through the tree.
• Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf
nodes,” representing the predicted outcomes or classifications.
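As a minimal sketch of these steps (assuming scikit-learn; the weather-style data below is illustrative and hypothetically encoded, not from the slides):

```python
# Minimal decision tree sketch using scikit-learn (assumed library).
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative encoded data: [weather, wind] -> play (1) / don't play (0)
# weather: 0 = sunny, 1 = cloudy, 2 = rainy; wind: 0 = weak, 1 = strong
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [1, 0, 1, 1, 1, 0]

# criterion="entropy" selects splits by information gain
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)

# Print the learned if-else structure
print(export_text(tree, feature_names=["weather", "wind"]))
```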
Entropy
• Entropy is nothing but the uncertainty in our dataset or measure of
disorder.
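Formally (standard definition), for a node S whose instances fall into classes with proportions p1, …, pc:

E(S) = − Σ pi log2(pi)

so a pure node (all one class) has entropy 0, and a 50/50 node has entropy 1.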
• Examples to understand concept of Entropy
Example 1
Left node Entropy
• Feature 2:
Right Node Entropy
• For Feature 3:
• The left node has lower entropy (more purity) than the right node, since the
left node has a greater proportion of "yes" values and it is easier to decide here.

• The higher the entropy, the lower the purity and the higher
the impurity.
• The goal of machine learning is to decrease the uncertainty or
impurity in the dataset; by using the entropy we measure
the impurity of a particular node.
• However, entropy alone does not tell us whether the entropy of the parent
node has decreased after splitting.

• For this we use a new metric called "Information Gain", which tells us how much
the parent entropy has decreased after splitting on some feature.
Information Gain
• Information gain measures the reduction of uncertainty given some
feature and it is also a deciding factor for which attribute should be
selected as a decision node or root node.

• It is simply the entropy of the full dataset minus the entropy of the dataset
given some feature.
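Written out (standard form), for a dataset S split on a feature A into subsets Sv:

IG(S, A) = E(S) − Σv ( |Sv| / |S| ) · E(Sv)

where the second term is the weighted average entropy of the child nodes.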
Example
• Suppose our entire population has a total of 30 instances.
• The dataset is to predict whether the person will go to the gym or
not. Let's say 16 people go to the gym and 14 people don't.
• Feature 1 is "Energy", which takes two values: "high"
(13 instances) and "low" (17 instances).
• Feature 2 is "Motivation", which takes 3 values: "No motivation",
"Neutral" and "Highly motivated".
• Use Decision Tree
• Use Information gain to decide which feature should be the root
node and which feature should be placed after the split.
• Using Feature 1
- Calculate Entropy
- Calculate Information Gain
• Entropy and Information Gain

• The parent entropy was near 0.99, and looking at this value of information
gain we reach the following conclusion.
Conclusion: the entropy of the dataset will decrease by 0.37 if we make
"Energy" our root node.
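As a quick check of the parent entropy quoted above (16 "go" and 14 "don't go" out of 30 instances), a small sketch:

```python
# Parent entropy for the gym example: 16 "yes" and 14 "no" out of 30 instances.
from math import log2

p_yes, p_no = 16 / 30, 14 / 30
parent_entropy = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(parent_entropy, 3))  # ~0.997, i.e. near the 0.99 quoted on the slide
```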
• Feature 2
• Conclusions:
• The "Energy" feature gives a larger reduction (0.37) than the
"Motivation" feature. Hence we select the feature which has the
highest information gain and then split the node based on that
feature.
• "Energy" will be our root node, and we do the same for the
sub-nodes. Here we can see that when the energy is "high" the
entropy is low, and hence we can say a person will definitely go to
the gym if he has high energy,
• but what if the energy is low? We will again split the node based on
the new feature which is “Motivation”.
Pruning
• Pruning is another method that can help us avoid overfitting. It helps
in improving the performance of the Decision tree by cutting the
nodes or sub-nodes which are not significant. Additionally, it removes
the branches which have very low importance.
• There are mainly 2 ways for pruning:
• Pre-pruning – we can stop growing the tree earlier, which means we
can prune/remove/cut a node if it has low importance while
growing the tree.
• Post-pruning – once our tree is built to its depth, we can start
pruning the nodes based on their significance.
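As a minimal sketch (assuming scikit-learn; the iris dataset is illustrative), pre-pruning can be done by limiting growth up front, and post-pruning via cost-complexity pruning:

```python
# Pruning sketch with scikit-learn's DecisionTreeClassifier (assumed library).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early by capping depth / minimum leaf size
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre_pruned.fit(X, y)

# Post-pruning: grow fully, then prune weak branches with cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
post_pruned.fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```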
Example 3
SVM (Support Vector Machine)
⦿ Concept
⦿ Types
⦿ Linear
⦿ Non-linear
⦿ Use of Dot products
⦿ Examples
⦿ Kernel in SVM
Concept
• SVM is a powerful supervised algorithm that works best on smaller
but complex datasets.
• It can be used for both regression and classification tasks, but generally it
works best in classification problems.
• It is a supervised machine learning algorithm in which we try to find a
hyperplane that best separates the two classes.
• Don’t get confused between SVM and logistic regression.
• Both the algorithms try to find the best hyperplane, but the main
difference is logistic regression is a probabilistic approach whereas
support vector machine is based on statistical approaches.
• Answers to questions like –
- which hyperplane does it select?
- There can be an infinite number of hyperplanes passing through a
point and classifying the two classes perfectly.
- So, which one is the best?
• Depending on the number of features you have you can either
choose Logistic Regression or SVM.
• SVM works best when the dataset is small and complex.
• It is advisable to first use logistic regression and see how it performs;
if it fails to give good accuracy, you can go for SVM without any
kernel.
• Logistic regression and SVM without any kernel have similar
performance but depending on your features, one may be more
efficient than the other.
Types of SVM
• Linear SVM: When the data is perfectly linearly separable only then
we can use Linear SVM. Perfectly linearly separable means that the
data points can be classified into 2 classes by using a single straight
line(if 2D).
• Non-Linear SVM: When the data is not linearly separable then we can
use Non-Linear SVM, which means when the data points cannot be
separated into 2 classes by using a straight line (if 2D) then we use
some advanced techniques like kernel tricks to classify them.
• In most real-world applications we do not find linearly separable
datapoints hence we use kernel trick to solve them.
Important Definitions
• Support Vectors: These are the points that are closest to the
hyperplane. A separating line will be defined with the help of these
data points.
• Margin: it is the distance between the hyperplane and the
observations closest to the hyperplane (support vectors). In SVM
large margin is considered a good margin. There are two types of
margins hard margin and soft margin.
Example – Linear SVM
• We want to classify the new data point as either blue or green.
• To classify these points, we can have many decision boundaries, but
the question is which is the best and how do we find it?
• The best hyperplane is the one that has the maximum distance
from both classes, and this is the main aim of SVM.
• This is done by finding different hyperplanes which classify the labels
in the best way then it will choose the one which is farthest from the
data points or the one which has a maximum margin.
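A minimal sketch of a linear SVM finding such a maximum-margin hyperplane, assuming scikit-learn (the data is illustrative):

```python
# Linear SVM sketch using scikit-learn (assumed library).
from sklearn.svm import SVC

# Illustrative 2D data: two linearly separable classes
X = [[1, 2], [2, 1], [2, 3], [6, 5], [7, 8], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel fits a maximum-margin separating hyperplane w.x + b = 0
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the points closest to the hyperplane
print(clf.coef_, clf.intercept_)  # w and b of the learned hyperplane
print(clf.predict([[3, 4]]))      # classify a new point
```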
How does it work?
⦿ Identify Cat or Dog?
⦿ Support Vectors :
⦿ Linear SVM : Hyperplane
⦿ Non-linear SVM example
Example – Non-linear SVM
⦿ Finding equation for SV :
⦿ Final Classification result
Use of Dot Product in SVM
• The dot product can be defined as the projection of one vector onto
another, multiplied by the magnitude of the other vector.
• Consider a random point X and we want to know whether it lies on
the right side of the plane or the left side of the plane (positive or
negative).
• Assume this point is a vector (X) and then we make a vector (w) which
is perpendicular to the hyperplane. Let’s say the distance of vector w
from origin to decision boundary is ‘c’. Now we take the projection of
X vector on w.
• Criteria for classification based on the dot product:
- The projection of one vector onto another is given by the dot product; we
take the dot product of the x and w vectors.
• If the dot product is greater than ‘c’ point lies on the right side.
• If the dot product is less than ‘c’ then the point is on the left side
• If the dot product is equal to ‘c’ then the point lies on the decision
boundary.
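A small sketch of this criterion (illustrative numbers, assuming NumPy):

```python
# Classifying a point by its dot product with the normal vector w (NumPy assumed).
import numpy as np

w = np.array([2.0, 1.0])   # vector perpendicular to the decision boundary
c = 2.0                    # threshold for the boundary along w (illustrative value)
x = np.array([1.5, 2.0])   # the point to classify

projection = np.dot(x, w)
if projection > c:
    print("right side (positive)")
elif projection < c:
    print("left side (negative)")
else:
    print("on the decision boundary")
```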
Margin in Support Vector Machine
• To classify a point as negative or positive we need to define a decision
rule.
• The equation of a hyperplane is w·x + b = 0, where w is a vector normal
to the hyperplane and b is an offset.
• If w·x + b > 0 then we can say it is a positive point; otherwise
it is a negative point.
• We need (w, b) such that the margin has the maximum distance. Let's
say this distance is 'd'.
• To calculate 'd' we need the equations of L1 and L2.
• For this, we will make the assumption that –
• the equation of L1 is w·x + b = 1, and for
• L2 it is w·x + b = −1.
• Why are the magnitudes equal? Why didn't we take 1 and −2?
• Why did we take only 1 and −1, and not other values like 24 and
−100?
• Why did we assume these lines?
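For reference (standard SVM geometry, not specific to these slides): with L1 and L2 taken as w·x + b = 1 and w·x + b = −1, the distance between them is

d = |1 − (−1)| / ||w|| = 2 / ||w||

so maximizing the margin is equivalent to minimizing ||w||, and the choice of ±1 is just a convenient rescaling of (w, b).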
Example:
• Let’s say the equation of our hyperplane is 2x+y=2
• Create margin for this hyperplane,
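Working this out with the ±1 convention above (a small illustrative calculation):
- Hyperplane: 2x + y − 2 = 0, so w = (2, 1), b = −2 and ||w|| = sqrt(5).
- Margin lines: 2x + y − 2 = 1 (i.e. 2x + y = 3) and 2x + y − 2 = −1 (i.e. 2x + y = 1), each at distance 1/sqrt(5) ≈ 0.45 from the hyperplane.
- Multiplying everything by 10 gives the same hyperplane (20x + 10y − 20 = 0), but the margin lines 20x + 10y − 20 = ±1 are now only 1/sqrt(500) ≈ 0.045 away.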
Summary
• If you multiply these equations by 10,
- the parallel lines (red and green) get closer to our hyperplane.
• If we divide these equations by 10,
- the parallel lines move farther from the hyperplane.
• The parallel lines depend on the (w, b) of our hyperplane:
• If we multiply the equation of the hyperplane by a factor greater than
1, the parallel lines will shrink towards it,
• and if we multiply by a factor less than 1, they expand away from it.
• These lines move as we change (w, b), and this is how the margin
gets optimized.
SVM Error
• SVM Error = Margin Error + Classification Error.
• The larger the margin, the lower the margin error would be, and vice
versa.
• A high value of 'c' (e.g. 1000) would mean that you don't want to focus
on margin error and just want a model which doesn't misclassify any
data point.
• Which is a better model?
- the one where the margin is maximum but has 2 misclassified points,
or
- the one where the margin is very small, but all the points are correctly
classified?
• Increasing 'c' decreases the classification error, but
• if you want the margin to be maximized, then the value of
'c' should be minimized.
• That's why 'c' is a hyperparameter, and we search for the optimal value of
'c'.
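A minimal sketch of searching for this optimal 'c' (the C parameter in scikit-learn, assumed here; the dataset is synthetic and illustrative) using cross-validated grid search:

```python
# Tuning the C hyperparameter of an SVM with grid search (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C -> wider margin, more misclassified points tolerated
# Large C -> narrower margin, fewer misclassified points tolerated
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
search.fit(X, y)

print(search.best_params_)
```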
Kernels in SVM
• Need
• Solution:
• Convert the lower-dimensional space to a higher-dimensional space
using some functions (for example quadratic functions), which allows us to find
a decision boundary that clearly divides the data points.
• The functions which help us do this are called Kernels.
• Which kernel to use is determined purely by hyperparameter tuning.
• Use of Kernel
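A minimal sketch of the kernel trick on data that is not linearly separable (assuming scikit-learn; make_circles is an illustrative dataset, not from the slides):

```python
# Kernel SVM sketch on non-linearly-separable data (scikit-learn assumed).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no straight line can separate the two classes
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # kernel maps data to a higher dimension

# The RBF kernel should score far higher than the linear SVM here
print(linear_svm.score(X_test, y_test), rbf_svm.score(X_test, y_test))
```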
Evaluation Matrix for classification:
Confusion Matrix
• Machine learning models are increasingly used in various applications
to classify data into different categories.
• However, evaluating the performance of these models is crucial to
ensure their accuracy and reliability.
• One essential tool in this evaluation process is the confusion matrix.
• A confusion matrix is a matrix that summarizes the performance of a
machine learning model on a set of test data.
• It is a means of displaying the number of accurate and inaccurate
instances based on the model’s predictions.
• It is often used to measure the performance of classification models,
which aim to predict a categorical label for each input instance.

• The matrix displays the counts of correct and incorrect predictions the model
produced on the test data.

Metrics based on Confusion Matrix Data
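The metrics most commonly derived from the confusion matrix (standard definitions) are:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity) = TP / (TP + FN)
- F1-score = 2 × (Precision × Recall) / (Precision + Recall)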
Confusion Matrix For binary classification

• A 2x2 confusion matrix is shown below for an image recognition task with
the classes "Dog" and "Not Dog".
• Scenario: Example: Confusion Matrix for Dog Image Recognition
with Numbers
• Confusion Matrix
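A minimal sketch of building such a matrix, assuming scikit-learn (the labels below are hypothetical counts, not the figures from the slide):

```python
# Binary confusion matrix sketch for "Dog" vs "Not Dog" (scikit-learn assumed;
# the labels below are illustrative, not taken from the slide's figure).
from sklearn.metrics import confusion_matrix

y_actual    = ["Dog", "Dog", "Dog", "Not Dog", "Not Dog", "Dog", "Not Dog", "Dog"]
y_predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Dog"]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_actual, y_predicted, labels=["Dog", "Not Dog"])
print(cm)
# [[TP FN]
#  [FP TN]]  with "Dog" treated as the positive class
```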
Confusion Matrix For Multi-class
Classification
• In multi-class classification, you have more than two possible classes
for your model to predict. The confusion matrix expands to
accommodate these additional classes.
• Rows: Represent the actual classes (ground truth) in your dataset.
• Columns: Represent the predicted classes by your model.
• Each cell within the matrix shows the count of instances where the
model predicted a particular class (column) when the actual class was
another (row).
• A 3x3 confusion matrix is shown below for an image classification task with three
classes.
• Example: Confusion Matrix for Image Classification (Cat, Dog, Horse)

• True Positive (TP): The image was of a particular animal (cat,
dog, or horse), and the model correctly predicted that animal.
For example, a picture of a cat correctly identified as a cat.
• False Negative (FN): The image was of a particular animal, but
the model incorrectly predicted it as a different animal. For
example, a picture of a dog mistakenly identified as a cat.
• In this scenario:
Cats: 8 were correctly identified, 1 was misidentified as a dog, and 1
was misidentified as a horse.
Dogs: 10 were correctly identified, 2 were misidentified as cats.
Horses: 8 were correctly identified, 2 were misidentified as dogs.
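Put in matrix form (rows = actual class, columns = predicted class; assuming no dogs were misclassified as horses, since no such count is mentioned above):

Actual \ Predicted   Cat   Dog   Horse
Cat                   8     1      1
Dog                   2    10      0
Horse                 0     2      8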
