
7. Supervised Learning: Classification and Regression
Subject: Machine Learning (3170724)
Dr. Ami Tusharkant Choksi
Associate Professor, Computer Engineering Department,
C.K. Pithawala College of Engineering and Technology
Website: www.ckpcet.ac.in

Supervised vs. Unsupervised Learning
■ Supervised learning (classification)
■ Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations
■ New data is classified based on the training set
■ Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
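To make the contrast concrete, here is a minimal sketch (assuming scikit-learn is available; the Iris data set and the specific estimators are illustrative choices, not part of the slides):

```python
# Minimal sketch: the same feature matrix X is used with labels y for
# supervised classification, and without labels for unsupervised clustering.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the training data are accompanied by class labels y.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("Predicted class of first sample:", clf.predict(X[:1]))

# Unsupervised: labels are unknown; we only look for clusters in X.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignment of first sample:", km.labels_[0])
```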



Prediction Problems: Classification vs.
Numeric Prediction
■ Classification
■ predicts categorical class labels (discrete or nominal)
■ classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and
uses it in classifying new data
■ Numeric Prediction
■ models continuous-valued functions, i.e. predicts unknown or missing numeric values
■ Typical applications
■ Credit/loan approval: whether a loan application should be approved
■ Medical diagnosis: if a tumor is cancerous or benign
■ Fraud detection: if a transaction is fraudulent
■ Web page categorization: which category it is



Definitions

■ Classification is the grouping of existing data into classes using class labels.
■ Prediction is the use of existing data values to
guess a future value.
■ Estimation is the prediction of a continuous value.



Tree Construction (Termination Conditions)

■ All samples for a given node belong to the same class
■ There are no remaining attributes for
further partitioning – majority voting
is employed for classifying the leaf
■ There are no samples left
Summary of ID3 algorithm

■ Calculate the entropy of every attribute a of the data set S.
■ Partition ("split") the set S into subsets using the
attribute for which the resulting entropy after
splitting is minimized; or, equivalently, information
gain is maximum
■ Make a decision tree node containing that attribute.
■ Recurse on subsets using the remaining attributes.
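A minimal Python sketch of this attribute-selection step (the helper names and the toy data are illustrative assumptions, not from the slides):

```python
# Sketch of the ID3 attribute-selection step: entropy and information gain.
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy(S) minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    split_entropy = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - split_entropy

# Toy data: each row is (outlook, windy); labels are the classes.
rows = [("sunny", "false"), ("sunny", "true"), ("rain", "false"), ("rain", "true")]
labels = ["no", "no", "yes", "no"]
# ID3 splits on the attribute with the highest information gain.
best = max(range(2), key=lambda i: information_gain(rows, labels, i))
print("Best attribute index:", best)
```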



Computing Information-Gain for Continuous-
Valued Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■ Sort the values of A in increasing order

■ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
■ (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
■ The point with the minimum expected information requirement for A is selected as the split-point for A
■ Split:
■ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
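A small sketch of this split-point search (illustrative function and variable names; the toy values are assumptions):

```python
# Sketch: choose the best split point for a continuous attribute A.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint of each adjacent pair of sorted values of A and
    return the split point with the minimum expected information requirement."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        midpoint = (pairs[i][0] + pairs[i + 1][0]) / 2  # (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= midpoint]
        d2 = [lab for v, lab in pairs if v > midpoint]
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if info < best_info:
            best_point, best_info = midpoint, info
    return best_point

print(best_split_point([25, 32, 41, 50], ["no", "no", "yes", "yes"]))  # -> 36.5
```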



Disadvantages of ID3 algorithm

■ Data may be over-fitted or over-classified if a small sample is tested.
■ Only one attribute at a time is tested for making a
decision.
■ Does not handle numeric attributes and missing
values.



C4.5 algorithm

■ C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.



Gain Ratio for Attribute Selection (C4.5)
■ Information gain measure is biased towards attributes with a large
number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)

■ GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j|/|D|) · log2(|D_j|/|D|) is the potential information generated by splitting D into v partitions
■ Ex.

■ gain_ratio(income) = 0.029/1.557 = 0.019


■ The attribute with the maximum gain ratio is selected as the splitting
attribute
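A short sketch of this normalisation; the partition sizes below (4, 6 and 4 tuples for the three income values) reproduce the SplitInfo of 1.557 quoted in the example:

```python
# Sketch: GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo penalises
# attributes that split the data into many small partitions.
import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# income splits the 14 tuples into partitions of size 4, 6 and 4.
print(round(split_info([4, 6, 4]), 3))          # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # ~0.019
```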
Disadvantages of C4.5 algorithm

■ C4.5 constructs empty branches with zero values
■ Overfitting happens when the model picks up data with uncommon characteristics, especially when the data is noisy.



Gini Index (CART, IBM IntelligentMiner)
■ If a data set D contains examples from n classes, the gini index, gini(D), is defined as
  gini(D) = 1 − Σ_{j=1}^{n} p_j²
  where p_j is the relative frequency of class j in D
■ If a data set D is split on A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as
  gini_A(D) = (|D1|/|D|) · gini(D1) + (|D2|/|D|) · gini(D2)
■ Reduction in Impurity:
  Δgini(A) = gini(D) − gini_A(D)
■ The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
■ Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”, so
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
■ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}; then
  gini_{income ∈ {low, medium}}(D) = (10/14) · gini(D1) + (4/14) · gini(D2) = 0.443
■ Similarly, Gini{low, high} (and {medium}) is 0.458 and Gini{medium, high} (and {low}) is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
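A sketch that reproduces these numbers (the per-partition class counts of 7 "yes"/3 "no" in D1 and 2 "yes"/2 "no" in D2 follow the standard AllElectronics example and are assumptions beyond what the slide states):

```python
# Sketch: Gini index and the Gini index of a binary split.
def gini(class_counts):
    """gini(D) = 1 - sum of p_j^2 over the classes in D."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts_d1, counts_d2):
    """Weighted Gini index of splitting D into D1 and D2."""
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

# D: 9 "yes" and 5 "no" tuples -> gini(D) = 1 - (9/14)^2 - (5/14)^2 ~ 0.459
print(round(gini([9, 5]), 3))
# Split on income in {low, medium}: D1 has 10 tuples (assumed 7 yes / 3 no),
# D2 has 4 tuples (assumed 2 yes / 2 no) -> gini_split ~ 0.443
print(round(gini_split([7, 3], [2, 2]), 3))
```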

Computation of Gini Index
■ All attributes are assumed continuous-valued
■ May need other tools, e.g., clustering, to get the possible split
values
■ Can be modified for categorical attributes

Disadvantages of CART

■ It can split on only one variable


■ Trees formed may be unstable



Comparing Attribute Selection Measures
■ The three measures, in general, return good results but
■ Information gain:

■ biased towards multivalued attributes

■ Gain ratio:

■ tends to prefer unbalanced splits in which one partition is

much smaller than the others


■ Gini index:

■ biased to multivalued attributes

■ has difficulty when # of classes is large

■ tends to favor tests that result in equal-sized partitions and

purity in both partitions


Other Attribute Selection Measures
■ CHAID: a popular decision tree algorithm, measure based on χ2 test for independence
■ C-SEP: performs better than info. gain and gini index in certain cases
■ G-statistic: has a close approximation to χ2 distribution
■ MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
■ The best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
■ Multivariate splits (partition based on multiple variable combinations)
■ CART: finds multivariate splits based on a linear comb. of attrs.
■ Which attribute selection measure is the best?
■ Most give good results; none is significantly superior to the others



Overfitting and Tree Pruning
■ Overfitting: An induced tree may overfit the training data
■ Too many branches, some may reflect anomalies due to noise or
outliers
■ Poor accuracy for unseen samples

■ Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points. Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study.
■ Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.
Overfitting and Tree Pruning
■ How do you know if your model is overfitting?
■ Consequently, you can detect overfitting by determining
whether your model fits new data as well as it fits the data
used to estimate the model. In statistics, we call this cross-
validation, and it often involves partitioning your data.



Overfitting and Tree Pruning
■ Two approaches to avoid overfitting
■ Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
■ Difficult to choose an appropriate threshold
■ Postpruning: Remove branches from a “fully grown” tree, giving a sequence of progressively pruned trees
■ Use a set of data different from the training data to decide which is the “best pruned tree”
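A hedged scikit-learn sketch of both ideas (the data set and all thresholds are illustrative): prepruning via early-stopping parameters, and postpruning via cost-complexity pruning evaluated on data held out from training:

```python
# Sketch (assumes scikit-learn): prepruning vs. postpruning of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early (threshold values are illustrative choices).
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow fully, then keep the pruned tree that does best on data
# *not* used for training (cost-complexity pruning path).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_valid, y_valid),
)
print("prepruned:", pre.score(X_valid, y_valid), "postpruned:", best.score(X_valid, y_valid))
```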



Enhancements to Basic Decision Tree Induction
■ Allow for continuous-valued attributes
■ Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
■ Handle missing attribute values
■ Assign the most common value of the attribute

■ Assign probability to each of the possible values

■ Attribute construction
■ Create new attributes based on existing ones that are sparsely
represented
■ This reduces fragmentation, repetition, and replication



Distance based classification algorithms

■ Two specific distance-based algorithms are the nearest-neighbour algorithm and the nearest-hyperrectangle algorithm.
■ We will study the nearest-neighbour algorithm.



k-nearest neighbor algorithm (kNN)

■ Step-1: Select the number K of the neighbors


■ Step-2: Calculate the Euclidean distance of K
number of neighbors
■ Step-3: Take the K nearest neighbors as per the
calculated Euclidean distance.
■ Step-4: Among these k neighbors, count the
number of the data points in each category.
■ Step-5: Assign the new data points to that category
for which the number of the neighbor is maximum.
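These steps are what a library implementation carries out internally; below is a minimal scikit-learn sketch (the data set and the value of K are illustrative assumptions):

```python
# Sketch (assumes scikit-learn): the five kNN steps above, using Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: choose K; Steps 2-5 (distances, nearest neighbours, majority vote)
# are handled inside fit/predict.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```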

kNN Numerical example
Find the food type for Tomato (sweet = 6, crunch = 4), given the training data:

Ingredient   Sweet   Crunch   Food Type
Grape        8       5        Fruit
Greenbean    3       7        Vegetable
Nut          3       6        Protein
Orange       7       3        Fruit



kNN example:find distance

D(Tomato, Grape) = sqrt((6−8)² + (4−5)²) = sqrt(5) ≈ 2.2
D(Tomato, Greenbean) = sqrt((6−3)² + (4−7)²) = sqrt(18) ≈ 4.2
D(Tomato, Nut) = sqrt((6−3)² + (4−6)²) = sqrt(13) ≈ 3.6
D(Tomato, Orange) = sqrt((6−7)² + (4−3)²) = sqrt(2) ≈ 1.4

Since the minimum distance is between Tomato and Orange, Tomato is classified as Fruit.
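The same distances can be verified with a few lines of Python (a sketch, assuming NumPy is available):

```python
# Sketch: reproduce the Euclidean distances for the tomato example.
import numpy as np

training = {"Grape": (8, 5), "Greenbean": (3, 7), "Nut": (3, 6), "Orange": (7, 3)}
food_type = {"Grape": "Fruit", "Greenbean": "Vegetable", "Nut": "Protein", "Orange": "Fruit"}
tomato = np.array([6, 4])

distances = {name: float(np.linalg.norm(tomato - np.array(p))) for name, p in training.items()}
nearest = min(distances, key=distances.get)
print(distances)                              # Grape 2.24, Greenbean 4.24, Nut 3.61, Orange 1.41
print(nearest, "->", food_type[nearest])      # Orange -> Fruit, so Tomato is classified as Fruit
```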



Recall classification steps
[Figure: recap of the classification steps]


k=3 numerical example

■ With k = 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered.
■ Out of the three data elements, the class which is predominant is considered as the class label to be assigned to the test data.
■ In case the value of k = 1, only the closest training data element is considered.
■ The class label of that data element is directly assigned to the test data element.


k=3 numerical example
● Name is non-numeric data, so we ignore it.
● Test data given here: Name = Josh, Aptitude = 5, Communication = 4.5, Class = ?
● [Figure: training data table of candidates with Aptitude, Communication and Class values]

k=3 numerical example
● If k = 1, the test data matches Gouri, so the output will be “Intel”.
● If k = 2, the test data matches Gouri and Susant, but the minimum distance among them is to Gouri, so the output will be “Intel”.
k=3 numerical example
● If k = 3, the test data matches Gouri, Susant, and Bobby, with distances of 1.118, 1.414, and 1.5, respectively.
● Gouri and Bobby have class value ‘Intel’, while Susant has class value ‘Leader’.



k=3 numerical example
● In this case, the class value of Josh is decided by majority voting.
● Because the class value ‘Intel’ is held by the majority of the neighbours, the class value of Josh is assigned as ‘Intel’. The same process can be extended for any value of k.
Choosing k for kNN

■ But it is often a tricky decision to decide the value of k. The reasons are as
follows:
■ If the value of k is very large (in the extreme case equal to the total number of
records in the training data), the class label of the majority class of the training
data set will be assigned to the test data regardless of the class labels of the
neighbours nearest to the test data.
■ If the value of k is very small (in the extreme case equal to 1), the class value of a
noisy data or outlier in the training data set which is the nearest neighbour to
the test data will be assigned to the test data.
■ The best k value is somewhere between these two extremes.



Choosing k for kNN

■ A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k (see the sketch after this list).
■ One common practice is to set k equal to the square root of the number of
training records.
■ An alternative approach is to test several k values on a variety of test data sets
and choose the one that delivers the best performance.
■ Another interesting approach is to choose a larger value of k, but apply a
weighted voting process in which the vote of close neighbours is considered
more influential than the vote of distant neighbours.
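A hedged sketch of these strategies (the data set and candidate k values are illustrative assumptions):

```python
# Sketch (assumes scikit-learn): square-root rule and validation over several k values.
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Strategy 1: set k to roughly the square root of the number of training records.
k_sqrt = int(math.sqrt(len(X)))
print("square-root rule suggests k =", k_sqrt)

# Strategy 2: test several k values and keep the one with the best (cross-validated) score.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9, 11, k_sqrt)}
print("best k among candidates:", max(scores, key=scores.get))

# Strategy 3 (weighted voting) corresponds to KNeighborsClassifier(weights="distance").
```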



Why is kNN a lazy learner?
■ Eager learners follow the general steps of machine learning, i.e. perform an
abstraction of the information obtained from the input data and then follow
it through by a generalization step.
■ However, as we have seen in the case of the kNN algorithm, these steps are
completely skipped. It stores the training data and directly applies the
philosophy of nearest neighbourhood finding to arrive at the classification.
So, for kNN, there is no learning happening in the real sense. Therefore, kNN
falls under the category of lazy learner.



Applications of knn
■ One of the most popular areas in machine learning where the
kNN algorithm is widely adopted is recommender systems.
■ As we know, recommender systems recommend users different
items which are similar to a particular item that the user seems
to like. The liking pattern may be revealed from past purchases
or browsing history and the similar items are identified using the
kNN algorithm.
■ Another area where there is widespread adoption of kNN is
searching documents/ contents similar to a given
document/content. This is a core area under information
retrieval and is known as concept search.
Random forest
■ Random forest is an ensemble classifier, i.e. a combining classifier that
uses and combines many decision tree classifiers.
■ Ensembling is usually done using the concept of bagging with different
feature sets.
■ The reason for using a large number of trees in a random forest is to train the trees enough so that each feature contributes to a number of models.
■ After the random forest is generated by combining the trees, majority
vote is applied to combine the output of the different trees.
■ A simplified random forest model is depicted in the figure below.
■ The result from the ensemble model is usually better than that from the
individual decision tree models.
Random forest
[Figure: a simplified random forest model, i.e. an ensemble of decision trees whose majority vote gives the final class]


Random forest algorithm
■ 1.If there are N variables or features in the input data set, select a subset
of ‘m’ (m < N) features at random out of the N features. Also, the
observations or data instances should be picked randomly.
■ 2.Use the best split principle on these ‘m’ features to calculate the
number of nodes ‘d’.
■ 3. Keep splitting the nodes to child nodes till the tree is grown to the
maximum possible extent.
■ 4. Select a different subset of the training data ‘with replacement’ to
train another decision tree following steps (1) to (3). Repeat this to
build and train ‘n’ decision trees.
■ 5. Final class assignment is done on the basis of the majority votes from
the ‘n’ trees.
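This procedure is what bagging-based implementations automate; below is a minimal scikit-learn sketch (all parameter values are illustrative), with the out-of-bag estimate discussed on the next slide switched on:

```python
# Sketch (assumes scikit-learn): 'n' trees, each trained on a bootstrap sample
# and considering a random subset of m < N features at each split; majority vote at the end.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,      # 'n' decision trees
    max_features="sqrt",   # 'm' features considered at each split
    bootstrap=True,        # sample the training data 'with replacement'
    oob_score=True,        # keep the out-of-bag error estimate
    random_state=0,
).fit(X, y)

print("OOB score (1 - OOB error):", rf.oob_score_)
print("First few feature importances:", rf.feature_importances_[:5])
```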
Out-of-bag (OOB) error in random forest

■ In random forests, we have seen that each tree is constructed using a different bootstrap sample from the original data.
■ The samples left out of the bootstrap and not used in the construction of the i-th tree can be used to measure the performance of the model.
■ At the end of the run, the predictions for each such sample evaluated each time are tallied, and the final prediction for that sample is obtained by taking a vote.
■ The total error rate of predictions for such samples is termed the out-of-bag (OOB) error rate.
■ The error rate shown in the confusion matrix reflects the OOB error rate. Because of this, the displayed error rate is often used as an internal estimate of the generalization error, without the need for a separate test set.
Strengths of random forest
■ It runs efficiently on large and expansive data sets.
■ It has a robust method for estimating missing data and maintains
precision when a large proportion of the data is absent.
■ It has a powerful method for balancing errors in class-imbalanced data sets.
■ It gives estimates (or assessments) about which features are the
most important ones in the overall classification.
■ It generates an internal unbiased estimate (gauge) of the
generalization error as the forest generation progresses.
■ Generated forests can be saved for future use on other data.
■ Lastly, the random forest algorithm can be used to solve both
classification and regression problems.



Weaknesses of random forest

■ This model, because it combines a number of decision tree models, is not as easy to understand as a single decision tree model.
■ It is computationally much more expensive than a simple model like a decision tree.



Application of random forest

■ Random forest is a very powerful classifier which combines the versatility of many decision tree models into a single model.
■ Because of its superior results, this ensemble model is gaining wide adoption and popularity among machine learning practitioners for solving a wide range of classification problems.



Support Vector Machine (SVM)

■ SVM is a model that can perform linear classification as well as regression.
■ SVM is based on the concept of a surface, called a
hyperplane, which draws a boundary between data
instances plotted in the multi-dimensional feature space.
■ The output prediction of an SVM is one of two conceivable
classes which are already defined in the training data.
■ In summary, the SVM algorithm builds an N-dimensional
hyperplane model that assigns future instances into one of
the two possible output classes.



SVM: Classification using hyperplanes
■ In SVM, a model is built to discriminate the data instances
belonging to different classes.
■ Let us assume for the sake of simplicity that the data instances
are linearly separable.
■ In this case, when mapped in a two-dimensional space, the data
instances belonging to different classes fall in different sides of a
straight line drawn in the two-dimensional space as depicted in
Figure(a).
■ If the same concept is extended to a multidimensional feature
space, the straight line dividing data instances belonging to
different classes transforms to a hyperplane as depicted in
Figure.



SVM: Classification using hyperplanes

Dr. Ami Tusharkant Choksi@CKPCET Machine Learning(3170724) 49


SVM: Classification using hyperplanes
■ Thus, an SVM model is a representation of the input instances as points in the
feature space, which are mapped so that an apparent gap between them divides
the instances of the separate classes.
■ In other words, the goal of the SVM analysis is to find a plane, or rather a
hyperplane, which separates the instances on the basis of their classes. New
examples (i.e. new instances) are then mapped into that same space and
predicted to belong to a class on the basis of which side of the gap the new
instance will fall on.
■ In summary, in the overall training process, the SVM algorithm analyses input
data and identifies a surface in the multi-dimensional feature space called the
hyperplane. There may be many possible hyperplanes, and one of the challenges
with the SVM model is to find the optimal hyperplane.



SVM: Classification using hyperplanes

■ Training data sets in which the classes form well-separated groups (i.e. have a substantial grouping periphery) work well with SVM.
■ Generalization error, in terms of SVM, is the measure of how accurately and precisely the SVM model can predict values for previously unseen (new) data.
■ A hard margin in terms of SVM means that an SVM model is inflexible in classification and tries to fit the training set exceptionally well, thereby causing overfitting.
SVM: Classification using hyperplanes

■ Support Vectors: Support vectors are the data points (representing classes), the critical component in a data set, which are near the identified set of lines (hyperplane). If support vectors are removed, they will alter the position of the dividing hyperplane.
■ Hyperplane and Margin: For an N-dimensional feature space, a hyperplane is a flat subspace of dimension (N−1) that separates and classifies a set of data. For example, if we consider a two-dimensional feature space (which is nothing but a data set having two features and a class variable), a hyperplane will be a one-dimensional subspace, or a straight line. In the same way, for a three-dimensional feature space (a data set having three features and a class variable), the hyperplane is a two-dimensional subspace, or a simple plane.
SVM: Classification using hyperplane
■ However, quite understandably, it is difficult to visualize a
feature space greater than three dimensions, much like for a
subspace or hyperplane having more than three
dimensions.
■ Mathematically, in a two-dimensional space, a hyperplane can be defined by the equation
  w0 + w1·x1 + w2·x2 = 0
  which is nothing but the equation of a straight line.
■ Extending this concept to an N-dimensional space, a hyperplane can be defined by the equation
  w0 + w1·x1 + w2·x2 + … + wN·xN = 0
■ which, in short, can be represented as
  W · X + b = 0, where W = (w1, w2, …, wN) is the weight vector and b = w0


SVM: Classification using hyperplane

■ Intuitively, the further (i.e. the more distant) from the hyperplane the data points lie, the more confident we can be about correct categorization. So, when a new test data point/data set is added, the side of the hyperplane it lands on will decide the class that we assign to it. The distance between the hyperplane and the data points is known as the margin.
Identifying the correct hyperplane in SVM

■ As we have already discussed, there may be multiple options for hyperplanes dividing the data instances belonging to the different classes. We need to identify which one will result in the best classification. Let us examine a few scenarios before arriving at that conclusion. For the sake of simplicity of visualization, the hyperplanes have been shown as straight lines in most of the diagrams.
Identifying the correct hyperplane in SVM:1

■ As depicted in Figure, in this scenario we have three hyperplanes: A, B, and C.
■ Now, we need to identify the correct hyperplane which better segregates the two classes represented by the triangles and circles.
■ As we can see, hyperplane ‘A’ has performed this task quite well.
Identifying the correct hyperplane in SVM:2
■ As depicted in Figure, we have three hyperplanes: A, B, and C. We have to identify the correct hyperplane which classifies the triangles and circles in the best possible way.
■ Here, maximizing the distances between the nearest data points of both the classes and the hyperplane will help us decide the correct hyperplane. This distance is called the margin.
■ In Figure (b), you can see that the margin for hyperplane A is high as compared to those for both B and C. Hence, hyperplane A is the correct hyperplane.
● Another quick reason for selecting the hyperplane with the higher margin (distance) is robustness: if we select a hyperplane having a lower margin (distance), then there is a high probability of misclassification.
Identifying the correct hyperplane in SVM:3
■ Use the rules as discussed in the previous
section to identify the correct hyperplane in
the scenario shown in Figure.
■ Some of you might have selected hyperplane
B as it has a higher margin (distance from the
class) than A.
■ But, here is the catch; SVM selects the
hyperplane which classifies the classes
accurately before maximizing the margin.
■ Here, hyperplane B has a classification error,
and A has classified all data instances
correctly.
■ Therefore, A is the correct hyperplane.
Identifying the correct hyperplane in SVM:4
■ In this scenario, as shown in Figure (a), it is not possible to distinctly segregate the two classes by using a straight line, as one data instance belonging to one of the classes (triangle) lies in the territory of the other class (circle) as an outlier.
■ One triangle at the other end is like an outlier for the triangle class. SVM has a feature to ignore outliers and find the hyperplane that has the maximum margin (hyperplane A, as shown in Figure (b)). Hence, we can say that SVM is robust to outliers.
Identifying the correct hyperplane in SVM
■ So, by summarizing the observations from the different scenarios, we can say that:
■ 1. The hyperplane should segregate the data instances belonging to the two classes in the best possible way.
■ 2. It should maximize the distances between the nearest data points of both the classes, i.e. maximize the margin.
■ 3. If there is a need to prioritize between a higher margin and less misclassification, the hyperplane should try to reduce misclassifications.
■ Let’s now find a way to identify a hyperplane which maximizes the margin.
Maximum margin hyperplane

■ Finding the Maximum Margin Hyperplane (MMH) is


nothing but identifying the hyperplane which has the
largest separation with the data instances of the two
classes.
■ Though any set of three hyperplanes can do the correct
classification, why do we need to search for the set of
hyperplanes causing the largest separation?
■ The answer is that doing so helps us in achieving more
generalization and hence less number of issues in the
classification of unknown data.
Maximum margin hyperplane(MMH)
■ Support vectors, as can be observed in Figure,
are data instances from the two classes which
are closest to the MMH.
■ Quite understandably, there should be at least
one support vector from each class. The
identification of support vectors requires
intense mathematical formulation, which is
out of scope of this book.
■ However, it is fairly intuitive to understand
that modelling a problem using SVM is nothing
but identifying the support vectors and MMH
corresponding to the problem space.
Identifying the MMH for linearly separable data

■ Finding out the MMH is relatively straightforward for data that is linearly separable. In this case, an outer boundary needs to be drawn for the data instances belonging to the different classes. These outer boundaries are known as the convex hulls, as depicted in Figure.
■ The MMH can be drawn as the perpendicular bisector of the shortest line (i.e. the connecting line having the shortest length) between the convex hulls.
Identifying the MMH for linearly separable data

■ We have already seen that a hyperplane in the N-dimensional feature space can be represented by the equation W · X + b = 0.
■ Using this equation, the objective is to find a set of values for the vector W (and the bias b) such that two hyperplanes, represented by the equations below, can be specified:
  W · X + b = +1 and W · X + b = −1
■ This is to ensure that all the data instances that belong to one class fall above one hyperplane and all the data instances belonging to the other class fall below the other hyperplane.
■ According to vector geometry, the distance between these two hyperplanes is 2 / ||W||, so maximizing the margin is equivalent to minimizing ||W||.
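Written out, this leads to the standard hard-margin optimisation problem (a textbook formulation supplied here for completeness; it is not shown explicitly on the slide):

```latex
% Standard hard-margin SVM formulation (amsmath); y_i in {+1, -1} are the class labels.
\[
\begin{aligned}
  \min_{W,\,b}\quad & \tfrac{1}{2}\lVert W\rVert^{2}
  && \text{(equivalent to maximising the margin } 2/\lVert W\rVert\text{)}\\
  \text{s.t.}\quad & y_i\,(W \cdot X_i + b) \ge 1, \qquad i = 1,\dots,m
\end{aligned}
\]
```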


Linear SVM example

LSVM
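The slide refers to an external linear SVM example (“LSVM”); as a stand-in, here is a minimal, hedged sketch with scikit-learn (the toy points and parameters are assumptions):

```python
# Sketch (assumes scikit-learn): a linear SVM on linearly separable 2-D points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [5, 5], [6, 5], [5, 6]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("Support vectors:\n", clf.support_vectors_)       # the points nearest the MMH
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane w.x + b = 0
print("Prediction for (2, 2):", clf.predict([[2, 2]]))
```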



Non Linear SVM example

Non Linear SVM
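Similarly, a hedged non-linear SVM sketch (the concentric-circles data set and kernel settings are illustrative assumptions):

```python
# Sketch (assumes scikit-learn): a non-linear (RBF-kernel) SVM on data that is
# not linearly separable in the original feature space (two concentric rings).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_train, y_train)
rbf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear.score(X_test, y_test))  # poor: no separating line exists
print("RBF kernel accuracy:", rbf.score(X_test, y_test))        # close to 1.0
```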
