
What is a Support Vector Machine?


A classification technique that has received considerable attention is the support vector machine (SVM). This technique has its roots in statistical learning theory and has shown promising empirical results in many practical applications, from handwritten digit recognition to text categorization.

SVM also works well with high-dimensional data and avoids the curse of dimensionality. A second notable aspect of this approach is that it represents the decision boundary using only a subset of the training examples, known as the support vectors.
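To make this concrete, here is a minimal sketch using scikit-learn (an assumption; the article names no library) on a tiny synthetic two-class data set. It shows that only a few of the training instances end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny synthetic 2-D data set: class 0 ("squares") and class 1 ("circles").
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

# Only a subset of the six training instances defines the decision boundary.
print("Support vectors:\n", clf.support_vectors_)
```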

SVM can be trained to explicitly look for this type of hyperplane in linearly separable data; the method can then be extended to data that is not linearly separable. A data set is linearly separable if we can find a hyperplane such that all the squares reside on one side of the hyperplane and all the circles reside on the other side.
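Continuing the sketch above, a linear-kernel SVM exposes the hyperplane parameters directly, so we can verify that the squares and circles fall on opposite sides of w.x + b = 0 (the attribute names are scikit-learn's):

```python
w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset
print(f"Hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")

# All squares should yield w.x + b < 0 and all circles w.x + b > 0.
print("Signed side of each point:", np.sign(X @ w + b))
```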

The classifier must choose one of these hyperplanes to represent its decision boundary, based on how well they are expected to perform on test instances. Consider two decision boundaries, B1 and B2. Both can separate the training instances into their respective classes without committing any misclassification errors. Each decision boundary Bi is associated with a pair of hyperplanes, denoted bi1 and bi2.

bi1 is obtained by moving a hyperplane parallel to the decision boundary away from it until it touches the closest square(s), whereas bi2 is obtained by moving the hyperplane in the opposite direction until it touches the closest circle(s). The distance between these two hyperplanes is known as the margin of the classifier.
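For a linear SVM the two hyperplanes bi1 and bi2 can be written as w.x + b = -1 and w.x + b = +1, so the margin works out to 2/||w||. Continuing the sketch, it can be read off the fitted coefficients:

```python
# Distance between the hyperplanes touching the closest squares and circles.
margin = 2.0 / np.linalg.norm(w)
print(f"Margin of the classifier: {margin:.3f}")
```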

Decision boundaries with large margins tend to have lower generalization errors than those with small margins. Intuitively, if the margin is small, any slight perturbation of the decision boundary can have a significant impact on how it classifies test examples.

A formal explanation relating the margin of a linear classifier to its generalization error is given by a statistical learning principle known as structural risk minimization (SRM). This principle provides an upper bound on the generalization error of a classifier (R) in terms of its training error (Re), the number of training examples (N), and the model complexity, known as its capacity (h). More specifically, with probability 1 - η, the generalization error of the classifier is at worst

$$R \leq R_e + \varphi\left(\frac{h}{N}, \frac{\log(\eta)}{N}\right)$$

where φ is a monotonically increasing function of the capacity h. The preceding inequality may look familiar, because it resembles the minimum description length (MDL) principle: SRM is another way to express generalization error as a trade-off between training error and model complexity.
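The article leaves φ abstract. One common instantiation, taken from Vapnik's VC theory (an assumption here, not stated in the text), is the so-called VC confidence term; the sketch below evaluates it for a fixed N to show that the bound indeed grows with the capacity h:

```python
import numpy as np

def vc_confidence(h, N, eta=0.05):
    """One standard form of the capacity term phi from VC theory
    (an assumed instantiation; the article leaves phi abstract)."""
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

N = 1000
for h in (5, 50, 500):
    print(f"h = {h:4d}  ->  phi = {vc_confidence(h, N):.3f}")
```

For a fixed training error, a model with larger capacity h pays a larger complexity penalty, which is exactly the trade-off SRM formalizes.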