
UNIT-3

Decision Tree
Using ID3 Algorithm to build a Decision Tree to predict the weather
The ID3 algorithm, which stands for Iterative Dichotomiser 3, is a classification algorithm that
follows a greedy approach to building a decision tree by selecting the attribute that
yields the maximum Information Gain (IG) or minimum Entropy (H).
What is a Decision Tree?
A supervised machine learning algorithm used to build classification and regression models
in the form of a tree structure.
A decision tree is a tree where each -
• Node - represents a feature (attribute)
• Branch - represents a decision (rule)
• Leaf - represents an outcome (categorical or continuous)
There are many algorithms for building decision trees; here we discuss the ID3 algorithm
with an example.
What is an ID3 Algorithm?
ID3 stands for Iterative Dichotomiser 3
It is a classification algorithm that follows a greedy approach, selecting the attribute that
yields the maximum Information Gain (IG) or minimum Entropy (H).
What is Entropy and Information gain?
Entropy is a measure of the amount of uncertainty in the dataset S. Its mathematical
representation is:

H(S) = Σ (over all classes c in C) - p(c) * log2(p(c))

Where,
• S - the current dataset for which entropy is being calculated (it changes at every iteration of
the ID3 algorithm).
• C - the set of classes in S (example: C = {yes, no}).
• p(c) - the proportion of the number of elements in class c to the number of elements in
set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest
entropy is used to split the set S on that particular iteration.
Entropy = 0 implies it is of pure class, that means all are of same category.
Information Gain IG(A) tells us how much uncertainty in S was reduced after splitting
set S on attribute A. Its mathematical representation is:

IG(A, S) = H(S) - Σ (over all values v of A) (|Sv| / |S|) * H(Sv)

where Sv is the subset of S for which attribute A has value v. The weighted sum on the right
is the average entropy information I(A) used in the calculations below.
What are the steps in the ID3 algorithm?
The steps in the ID3 algorithm are as follows:
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for all of its categorical values.
2.2. Calculate the information gain for the feature.
3. Find the feature with the maximum information gain.
4. Repeat until we obtain the desired tree.
(A short code sketch of these calculations is given below.)
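As a quick illustration (not part of the original notes), the entropy and information-gain formulas can be sketched in Python on class counts; the counts in the usage lines are taken from the weather example worked out below.

```python
import math

def entropy(counts):
    """H(S) = -sum over classes of p(c) * log2(p(c)), computed from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    """IG(A) = H(S) minus the weighted average entropy of the subsets
    created by splitting on attribute A (given as class counts per value)."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent_counts) - remainder

# Weather data: 9 "yes" / 5 "no"; Outlook splits into
# sunny = [2, 3], overcast = [4, 0], rain = [3, 2].
print(round(entropy([9, 5]), 2))                                     # 0.94
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```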

The complete entropy of the dataset (14 examples: 9 "yes" and 5 "no") is:

H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))


= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= - (-0.41) - (-0.53)
= 0.94

For each attribute of the dataset, let's follow step 2 of the algorithm:
First Attribute - Outlook

Categorical values - sunny, overcast and rain


H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0

Average Entropy Information for Outlook -


I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) +
p(overcast) * H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693

Information Gain = H(S) - I(Outlook)


= 0.94 - 0.693
= 0.247

Second Attribute - Temperature

Categorical values - hot, mild, cool


H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108

Information Gain = H(S) - I(Temperature)


= 0.94 - 0.9108
= 0.0292

Third Attribute - Humidity

Categorical values - high, normal


H(Humidity=high) = -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.591

Average Entropy Information for Humidity -


I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591
= 0.787

Information Gain = H(S) - I(Humidity)


= 0.94 - 0.787
= 0.153

Fourth Attribute - Wind

Categorical values - weak, strong


H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1

Average Entropy Information for Wind -


I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892

Information Gain = H(S) - I(Wind)


= 0.94 - 0.892
= 0.048

Here, the attribute with maximum information gain is Outlook. So, the
decision tree built so far -

Here, when Outlook == overcast, it is of pure class(Yes).


Now, we have to repeat the same procedure for the rows with Outlook = Sunny and then for the
rows with Outlook = Rain.
First, find the best attribute for splitting the data with Outlook = Sunny
(dataset rows = [1, 2, 8, 9, 11]).

Complete entropy of Sunny is -


H(Sunny) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971

First Attribute - Temperature

Categorical values - hot, mild, cool


H(Sunny, Temperature=hot) = -0-(2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny,
mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny,
Temperature=cool)
= (2/5)*0 + (1/5)*0 + (2/5)*1
= 0.4

Information Gain = H(Sunny) - I(Sunny, Temperature)


= 0.971 - 0.4
= 0.571

Second Attribute - Humidity

Categorical values - high, normal


H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0

Average Entropy Information for Humidity -


I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny,
normal)*H(Sunny, Humidity=normal)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Sunny) - I(Sunny, Humidity)


= 0.971 - 0
= 0.971

Third Attribute - Wind

Categorical values - weak, strong


H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1

Average Entropy Information for Wind -


I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny,
strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1
= 0.9508

Information Gain = H(Sunny) - I(Sunny, Wind)


= 0.971 - 0.9508
= 0.0202

Here, the attribute with maximum information gain is Humidity. So, the
decision tree built so far -

Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And when
Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes". Therefore,
we don't need to do further calculations.

Now, find the best attribute for splitting the data with Outlook = Rain
(dataset rows = [4, 5, 6, 10, 14]).

Complete entropy of Rain is -


H(Rain) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (3/5) * log2(3/5) - (2/5) * log2(2/5)
= 0.971

First Attribute - Temperature

Categorical values - mild, cool


H(Rain, Temperature=cool) = -(1/2)*log(1/2)- (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain,
cool)*H(Rain, Temperature=cool)
= (2/5)*1 + (3/5)*0.918
= 0.9508

Information Gain = H(Rain) - I(Rain, Temperature)


= 0.971 - 0.9508
= 0.0202

Second Attribute - Wind

Categorical values - weak, strong


H(Rain, Wind=weak) = -(3/3)*log(3/3)-0 = 0
H(Rain, Wind=strong) = 0-(2/2)*log(2/2) = 0

Average Entropy Information for Wind -


I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain,
Wind=strong)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Rain) - I(Rain, Wind)


= 0.971 - 0
= 0.971

Here, the attribute with maximum information gain is Wind. So, the decision
tree built so far -

Here, when Outlook = Rain and Wind = Strong, it is a pure class of category "no". And when
Outlook = Rain and Wind = Weak, it is again a pure class of category "yes".
And this is our final desired tree for the given dataset.
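For reference (not part of the original notes), the final tree from the worked example can be written directly as a small Python function; the function name is illustrative.

```python
# The decision tree learned above, expressed as nested conditionals.
def play(outlook, humidity, wind):
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"

print(play("Sunny", "High", "Weak"))   # No
print(play("Rain", "Normal", "Weak"))  # Yes
```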
USING GINI INDEX
In this example, the Parents attribute has the minimum Gini index, so it becomes the root node.
The resulting decision tree is shown in the figure.
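As a supplement (not from the notes, with illustrative class counts), the Gini index of a node and the weighted Gini of a split can be sketched as:

```python
def gini(counts):
    """Gini(S) = 1 - sum over classes of p(c)^2, from the class counts in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_for_split(splits):
    """Weighted Gini index of an attribute, given class counts per value.
    The attribute with the minimum weighted Gini is chosen for the split."""
    total = sum(sum(s) for s in splits)
    return sum(sum(s) / total * gini(s) for s in splits)

print(round(gini([5, 5]), 3))                      # 0.5, a maximally impure two-class node
print(round(gini_for_split([[4, 0], [1, 5]]), 3))  # 0.167, a fairly good split
```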
Ranking and probability estimation trees

Grouping classifiers such as decision trees divide the instance space into segments, and so can
be turned into rankers by learning an ordering on those segments.
The ranking obtained from the empirical probabilities in the leaves of a decision tree yields a
convex ROC curve on the training data.
Adding a split to a decision tree can be interpreted in terms of coverage curves as the following
two-step process:
->Split the corresponding curve segment into two or more segments;
->Reorder the segments on decreasing slope.

• To turn a feature tree into a ranker, we order its leaves on non-increasing empirical
probabilities, which is provably optimal on the training set;
• To turn the tree into a probability estimator, we predict the empirical probabilities in
each leaf, applying Laplace or m-estimate smoothing to make these estimates more
robust for small leaves (a small sketch of the Laplace correction follows this list);
• To turn the tree into a classifier, we choose the operating conditions and find the
operating point that is optimal under those operating conditions.
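A minimal sketch (not from the text) of the Laplace correction for a two-class leaf, where pos and neg are the leaf's training counts and k is the number of classes:

```python
def laplace_probability(pos, neg, k=2):
    """Smoothed estimate of P(positive) for a leaf: (pos + 1) / (pos + neg + k)."""
    return (pos + 1) / (pos + neg + k)

print(laplace_probability(2, 0))    # 0.75 instead of the raw estimate 1.0
print(laplace_probability(40, 10))  # ~0.79, close to the raw estimate 0.8
```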

The operation of merging all leaves in a subtree is called pruning the subtree.

The advantage of pruning is that we can simplify the tree without affecting the chosen operating
point, which is sometimes useful if we want to communicate the tree model to somebody else.
The disadvantage is that we lose ranking performance.

Pruning is therefore not recommended unless (i) you only intend to use the tree for
classification, not for ranking or probability estimation; and (ii) you can define the expected
operating conditions with sufficient precision. One popular algorithm for pruning decision trees
is called reduced-error pruning.
Pruning: the process of adjusting a decision tree in order to minimize the misclassification
error.
Motive: minimize overfitting of the model.
To achieve this, we have TWO techniques
1) Pre Pruning or Early Stopping Rule
2) Post Pruning ( one variation is reduced-error pruning of a decision tree.)

1) Pre Pruning or Early Stopping Rule


-> Halt the subtree construction at some node after checking some measure. ( Gini Index /
Information Gain)
-> Stops tree growth prematurely.
-> Avoids generating overly complex subtrees that overfit the training data.
2) Post Pruning
-> Grows the Decision Tree entirely
-> Trim the nodes of the decision tree in a bottom-up fashion.
-> Replace a node by a leaf.
-> If the error does not increase after trimming, keep the leaf in place of the subtree.
Note: In terms of efficiency, pre-pruning is faster than post-pruning. Post-pruning can take
attribute interaction effects into account.

Reduced-error pruning of a decision tree. (REP)

➔ This approach is only useful if we have a large set of data


In REP, we generate a pruning set by holding out part of the available training instances.
The pruning set provides an estimate of the error rate on future instances (the training data is
used to build the decision tree, while the pruning data is used to estimate the error).

If error(parent) < error(child)
    then prune
Else
    don't prune
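A hedged sketch of reduced-error pruning on a simple nested-dictionary tree representation (the tree format and helper names are assumptions, not from the notes): internal nodes hold a feature, one child per value, and the majority class of the training examples that reached the node; leaves are plain class labels.

```python
def predict(tree, x):
    """Route example x (a dict of feature -> value) down the tree to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x.get(tree["feature"]), tree["majority"])
    return tree

def error(tree, examples):
    """Number of misclassified (x, y) pairs."""
    return sum(predict(tree, x) != y for x, y in examples)

def reduced_error_prune(tree, pruning_set):
    """Bottom-up: replace a subtree by its majority-class leaf whenever that
    does not increase the error on the pruning examples reaching the node."""
    if not isinstance(tree, dict) or not pruning_set:
        return tree
    for value, child in tree["children"].items():
        subset = [(x, y) for x, y in pruning_set if x.get(tree["feature"]) == value]
        tree["children"][value] = reduced_error_prune(child, subset)
    if error(tree["majority"], pruning_set) <= error(tree, pruning_set):
        return tree["majority"]   # prune: the leaf does at least as well
    return tree
```

This sketch prunes when the error does not increase, which matches the spirit of the error(parent) versus error(child) test above.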

Note: Entropy and the Gini index are sensitive to fluctuations in the class distribution;
√Gini is not.
Tree learning as variance reduction

Regression trees

Decision Tree - Regression


Decision tree builds regression or classification models in the form of a tree
structure. It breaks down a dataset into smaller and smaller subsets while at the
same time an associated decision tree is incrementally developed. The final
result is a tree with decision nodes and leaf nodes. A decision node (e.g.,
Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each
representing values for the attribute tested. Leaf node (e.g., Hours Played)
represents a decision on the numerical target. The topmost decision node in a
tree, which corresponds to the best predictor, is called the root node. Decision trees can
handle both categorical and numerical data.
Outlook Temperature Humidity Windy Hours Played
Sunny Hot High FALSE 25
Sunny Hot High TRUE 30
Overcast Hot High FALSE 46
Rainy Mild High FALSE 45
Rainy Cool Normal FALSE 52
Rainy Cool Normal TRUE 23
Overcast Cool Normal TRUE 43
Sunny Mild High FALSE 35
Sunny Cool Normal FALSE 38
Rainy Mild Normal FALSE 46
Sunny Mild Normal TRUE 48
Overcast Mild High TRUE 52
Overcast Hot Normal FALSE 44
Rainy Mild High TRUE 30

Decision Tree Algorithm


The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan,
employs a top-down, greedy search through the space of possible branches with
no backtracking. The ID3 algorithm can be used to construct a decision tree for
regression by replacing Information Gain with Standard Deviation Reduction.

Standard Deviation
A decision tree is built top-down from a root node and involves partitioning
the data into subsets that contain instances with similar values (homogenous).
We use standard deviation to calculate the homogeneity of a numerical sample.
If the numerical sample is completely homogeneous its standard deviation is
zero.

a) Standard deviation for one attribute:


• Standard deviation (S) is used for tree building (branching).
• Coefficient of variation (CV) is used to decide when to stop branching.
We can use the count (n) as well.
• Average (Avg) is the value stored in the leaf nodes.
b) Standard deviation for two attributes (target and predictor):
Standard Deviation Reduction
The standard deviation reduction is based on the decrease in standard deviation
after a dataset is split on an attribute. Constructing a decision tree is all about
finding attribute that returns the highest standard deviation reduction (i.e., the
most homogeneous branches).
Step 1: The standard deviation of the target is calculated.
Standard deviation (Hours Played) = 9.32
Step 2: The dataset is then split on the different attributes. The standard deviation
for each branch is calculated. The resulting standard deviation is subtracted from
the standard deviation before the split. The result is the standard deviation
reduction.

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision
node.

Step 4a: The dataset is divided based on the values of the selected attribute.
This process is run recursively on the non-leaf branches, until all data is
processed.
In practice, we need some termination criteria, for example when the coefficient of
variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%)
and/or when too few instances (n) remain in the branch (e.g., 3).
Step 4b: "Overcast" subset does not need any further splitting because its CV
(8%) is less than the threshold (10%). The related leaf node gets the average of
the "Overcast" subset.

Step 4c: However, the "Sunny" branch has a CV (about 22%) greater than the threshold
(10%), so it needs further splitting. We select "Temp" as the best node after
"Outlook" because it has the largest SDR.
Because the number of data points for all three branches (Hot, Mild and Cool) is
equal to or less than 3, we stop further branching and assign the average of each
branch to the related leaf node.

Step 4d: Moreover, the "Rainy" branch has a CV (about 28%), which is more than the
threshold (10%), so this branch needs further splitting. We select "Windy" as the
best node because it has the largest SDR.

Because the number of data points for both branches (FALSE and TRUE) is
equal to or less than 3, we stop further branching and assign the average of each
branch to the related leaf node.
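As an illustration (not part of the original notes), the standard deviation of the target and the standard deviation reduction for the Outlook attribute can be reproduced from the Hours Played table above:

```python
import numpy as np

outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
hours = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30], dtype=float)

def sdr(values, target):
    """SD of the target minus the weighted SD after splitting on an attribute."""
    values = np.array(values)
    weighted_sd = sum((np.sum(values == v) / len(target)) * np.std(target[values == v])
                      for v in set(values))
    return np.std(target) - weighted_sd

print(round(np.std(hours), 2))        # 9.32, the SD of Hours Played
print(round(sdr(outlook, hours), 2))  # about 1.66, the SDR for Outlook
```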
Linear models

In this chapter we will look at models that can be understood in terms of lines and planes,
commonly called linear models.

1. Linear models are parametric, meaning that they have a fixed form with a small
number of numeric parameters that need to be learned from data. This is different from
tree or rule models, where the structure of the model (e.g., which features to use in the
tree, and where) is not fixed in advance.
2. Linear models are stable, which is to say that small variations in the training data have
only limited impact on the learned model.
3. Linear models are less likely to overfit the training data than some other models,
largely because they have relatively few parameters. The flipside of this is that they
sometimes lead to underfitting.

The last two points can be summarised by saying that linear models have low variance but high
bias. Such models are often preferable when you have limited data and want to avoid
overfitting.
High variance–low bias models such as decision trees are preferable if data is large but
underfitting is a concern.
Linear models exist for all predictive tasks, including classification, probability estimation and
regression.

The least-squares method

➔ It can be used to learn linear models for classification and regression


➔ The differences between the actual and estimated function values on the training
examples are called residuals: εi = f(xi) − f̂(xi).
[ function estimator f̂ : X → R learned from examples (xi, f(xi)), where in this chapter we
assume X = Rd ]
The least-squares method, introduced by Carl Friedrich Gauss in the late eighteenth century,
consists in finding f̂ such that the sum of squared residuals Σi (f(xi) − f̂(xi))² is minimised.

The following example illustrates the method in the simple case of a single feature, which is
called univariate regression.
Rule models:
Learning ordered rule lists
First-order rule learning
First-order rule learning, also known as first-order logic learning or relational learning, is a
branch of machine learning that focuses on learning logical rules from data with relational
structures. Traditional machine learning algorithms often assume that data instances are
independent and identically distributed, which may not hold true for many real-world problems
where relationships and dependencies between instances are crucial.
First-order rule learning extends classical logic and allows for the representation of complex
relationships and dependencies between objects and their attributes. It operates on first-order
logic, which is a formal language for representing knowledge and making logical inferences.
In first-order rule learning, the goal is to learn logical rules from a set of relational data, often
represented in the form of a relational database. These rules describe relationships, patterns,
and dependencies between entities and their attributes. The learned rules can then be used for
various tasks, such as classification, prediction, or generating new knowledge.
There are several approaches to first-order rule learning, including:
1. Inductive Logic Programming (ILP): ILP combines logic programming and machine
learning to induce logical rules from examples and background knowledge. ILP algorithms
typically use a top-down, hypothesis-driven approach to construct rules that accurately describe
the data.
2. Statistical Relational Learning (SRL): SRL methods integrate statistical modeling techniques
with first-order logic to capture both the uncertainty in the data and the relational structure.
SRL algorithms can handle probabilistic reasoning and leverage statistical learning techniques
such as Bayesian inference or Markov logic networks.
3. Relational Reinforcement Learning (RRL): RRL extends reinforcement learning to problems
with relational structures. It aims to learn policies or decision-making rules that consider the
relational dependencies between entities in an environment.
First-order rule learning has applications in various domains, including natural language
processing, social network analysis, bioinformatics, and robotics. By explicitly modeling
relationships and dependencies, it enables the discovery of complex patterns and facilitates
reasoning about structured data.
Linear models:
The least-squares method
The least-squares method is a widely used approach in machine learning for solving regression
problems. It aims to find the best-fitting line or curve that minimizes the sum of the squared
differences between the observed target values and the predicted values.

In the context of linear regression, the least-squares method finds the line that best fits a given
set of data points. The line is defined by the equation:

y = mx + b

where y is the target variable, x is the input variable, m is the slope of the line, and b is the y-
intercept. The goal is to find the values of m and b that minimize the sum of the squared
differences between the observed target values and the predicted values.

To achieve this, the least-squares method minimizes the objective function known as the sum
of squared residuals:

SSR = Σ(yᵢ - (mxᵢ + b))²

where yᵢ is the observed target value for the i-th data point, and xᵢ is the corresponding input
value. The sum is taken over all data points in the dataset.

The least-squares method estimates the values of m and b that minimize the SSR by taking the
partial derivatives of the SSR with respect to m and b, setting them to zero, and solving the
resulting equations. The resulting estimates are often referred to as the ordinary least squares
(OLS) estimates.

Once the values of m and b are determined, the fitted line can be used to make predictions for
new input values by plugging them into the equation.
The least-squares method is not limited to linear regression. It can also be used for fitting other
types of curves, such as polynomial regression or nonlinear regression. In these cases, the
equation defining the curve may have additional parameters, and the least-squares method aims
to find the values of these parameters that minimize the sum of squared differences.
Overall, the least-squares method is a fundamental technique in machine learning for finding
the best-fitting line or curve in regression problems by minimizing the sum of squared
differences between observed and predicted values.
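A minimal sketch (with made-up data points) of univariate ordinary least squares using the closed-form solution obtained from the zero-derivative conditions described above:

```python
import numpy as np

# Illustrative data (assumed, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b = y_mean - m * x_mean                                              # intercept

ssr = np.sum((y - (m * x + b)) ** 2)  # the sum of squared residuals being minimised
print(f"m = {m:.3f}, b = {b:.3f}, SSR = {ssr:.4f}")
```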
Perceptrons
• Frank Rosenblatt, an American psychologist, proposed the classical perceptron model
(1958).
• The perceptron is an algorithm for supervised learning of single-layer binary linear
classifiers. It enables a neuron to learn by processing the elements in the training set one at a
time.

There are two types of Perceptrons: Single layer and Multilayer.


• Single layer - single-layer perceptrons can learn only linearly separable patterns.
• Multilayer - multilayer perceptrons, or feedforward neural networks with two or more
layers, have greater processing power.
How it operates:
It gives an output of 1 if the weighted sum of the inputs is greater than a threshold,
and an output of 0 if the weighted sum is less than the threshold. [ f is an activation
function ]
Why do we need Activation Function (Transfer Function)?
The activation functions are used to map the input between the required values like (0,
1) or (-1, 1). Types of activation functions include the sign, step, and sigmoid functions.
Instead of treating the threshold as a separate quantity, we assume it is just another input
that is always ON, with weight -Ɵ (minus theta). The job of all the other inputs and their
weights is then to make their weighted sum exceed this input (x0 = 1, w0 = -Ɵ). So the
perceptron fires when the summation is greater than or equal to 0; otherwise it does not fire.
w0 is often called the bias, as it represents the prior. Adding this bias to the weighted sum
improves the model's performance.

Why do we need Weights and Bias?


Weights show the strength of a particular input. A bias value allows you to shift
the activation function curve up or down.

Types of Perceptrons:
There are two types of Perceptrons:
1) Single layer Perceptrons
2) Multilayer Perceptrons (Deep feedforward networks/ feedforward neural networks)

The above algorithm is a perceptron learning algorithm.
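A minimal Python sketch of the perceptron learning rule (the toy data, learning rate, and number of epochs are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """X: (n, d) inputs; y: labels in {0, 1}. Returns weights w and bias b."""
    w = np.zeros(X.shape[1])
    b = 0.0                                              # plays the role of w0 = -theta
    for _ in range(epochs):
        for xi, target in zip(X, y):
            output = 1 if np.dot(w, xi) + b >= 0 else 0  # step activation
            error = target - output
            w += lr * error * xi                         # update only on mistakes
            b += lr * error
    return w, b

# Learn the logical AND function, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)   # weights and bias of a separating hyperplane for AND
```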


Going beyond linearity with kernel methods
In machine learning, the kernel perceptron is a variant of the popular perceptron learning
algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel
function to compute the similarity of unseen samples to training samples.
Support vector machines
SVM is a powerful supervised algorithm that works best on smaller but complex datasets.
Support Vector Machine, abbreviated as SVM, can be used for both regression and
classification tasks, but generally it works best for classification problems.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
Support Vectors: SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed
Support Vector Machine. These are the points that are closest to the hyperplane.
Margin: the distance between the hyperplane and the observations closest to the
hyperplane (the support vectors). In SVM, a large margin is considered a good margin. There
are two types of margins: hard margin and soft margin.
Note: Don’t get confused between SVM and logistic regression. Both the algorithms try to find
the best hyperplane, but the main difference is logistic regression is a probabilistic approach
whereas support vector machine is based on statistical approaches.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:

How does Support Vector Machine work?


Suppose we have a dataset that has two classes (green and blue). We want to classify a
new data point as either blue or green. (Fig:1)

Fig:1 Fig:2

There can be multiple lines/decision boundaries to segregate the classes in n-dimensional


space, but we need to find out the best decision boundary that helps to classify the data
points(Fig:2). This best boundary is known as the hyperplane of SVM.

The SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.

Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called the Non-linear SVM classifier.
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
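A small usage sketch (not from the notes) of a linear SVM with scikit-learn on toy 2-D data; the library, data, and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (toy data).
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # linear kernel -> a straight-line hyperplane
clf.fit(X, y)

print(clf.support_vectors_)        # the points closest to the hyperplane
print(clf.predict([[4.0, 4.0]]))   # predicted class for a new data point
```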
Kernels in Support Vector Machine
The most interesting feature of SVM is that it can even work with a non-linear dataset, and for
this we use the "kernel trick", which makes it easier to classify the points. Suppose we have a
dataset like this:

Here we cannot draw a single line (or hyperplane) that classifies the points correctly. So
what we do is convert this lower-dimensional space into a higher-dimensional space using
mapping functions (for example, quadratic functions), which allow us to find a decision
boundary that clearly divides the data points. The functions that help us do this are called
kernels, and which kernel to use is determined by hyperparameter tuning.
Different Kernel functions
Some kernel functions which you can use in SVM are given below:
1. Polynomial kernel
2. Sigmoid kernel
3. RBF (Radial Basis Function) kernel
4. Bessel function kernel
5. Anova Kernel
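For illustration (not from the notes), two of the kernels listed above can be written as plain functions; the parameter values are assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """RBF (Gaussian) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def polynomial_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel: K(x, z) = (x . z + c)^degree."""
    return (np.dot(x, z) + c) ** degree

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(rbf_kernel(x, z))         # similarity of x and z under the RBF kernel
print(polynomial_kernel(x, z))  # similarity under the quadratic kernel
```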
Advantages of SVM
1. SVM works better when the data is linearly separable
2. It is more effective in high dimensions
3. With the help of the kernel trick, we can solve any complex problem
4. SVM is relatively robust to outliers
5. Can help us with Image classification
Disadvantages of SVM
1. Choosing a good kernel is not easy
2. It doesn’t show good results on a big dataset
