Q1. Explain Why SVM Is More Efficient Than Logistic Regression?
The support vector machine (SVM) is a model used for both classification and regression problems,
though it is mostly used to solve classification problems. The algorithm creates a hyperplane or
line (decision boundary) that separates the data into classes. It chooses the separator whose
distance to the closest points of both classes is maximal and equal, and it uses the kernel trick
to handle data that is not linearly separable, which makes it a clear and powerful way of learning
complex nonlinear functions.
● SVM tries to find the “best” margin (the distance between the decision boundary and the
support vectors) that separates the classes, which reduces the risk of error on the data.
Logistic regression has no such margin objective and can end up with any of several decision
boundaries whose weights are near the optimal point (see the sketch after this list).
● SVM works well with unstructured and semi-structured data such as text and images,
while logistic regression works with already-identified independent variables.
● SVM is based on the geometrical properties of the data, while logistic regression is
based on a statistical approach.
● The risk of overfitting is lower for SVM, while logistic regression is more vulnerable to
overfitting.
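To make the margin comparison concrete, here is a minimal sketch (assuming scikit-learn; the make_blobs toy data and parameters are assumptions for the example): both models are linear here, but only the SVM exposes its support vectors and a margin width of 2/‖w‖, while logistic regression simply returns the weights that maximize its likelihood.

```python
# Minimal sketch: fit a linear SVM and logistic regression on the same 2-D
# toy data so their decision boundaries can be compared (illustration only).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=0)

log_reg = LogisticRegression().fit(X, y)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

w = svm.coef_[0]
print("SVM margin width:", 2.0 / np.linalg.norm(w))       # explicit margin
print("SVM support vectors:", len(svm.support_vectors_))
print("Logistic regression weights:", log_reg.coef_[0])   # no margin notion
```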
The advantage of logistic regression is that it gives a simpler model and can be
implemented more easily. It determines a weight of importance for each variable, and these
weights feed a probability estimate for each prediction, which is very important for discovering
insights in rating problems such as credit risk scoring. Furthermore, logistic regression can
easily be updated as new data comes in, which makes it well suited to streaming data.
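One common way to get this incremental behaviour (an implementation choice assumed here, not prescribed by the text) is to train logistic regression with stochastic gradient descent and update it batch by batch, e.g. with scikit-learn's SGDClassifier using the logistic loss:

```python
# Sketch: logistic regression updated incrementally on streaming mini-batches
# via SGD with the logistic loss (requires scikit-learn >= 1.1 for "log_loss").
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])            # all labels must be declared up front

rng = np.random.default_rng(0)
for _ in range(10):                   # ten incoming batches of synthetic data
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.coef_)                    # weights keep adapting as data arrives
```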
However, because logistic regression tries to maximize the conditional likelihood of the training
data, it is highly prone to outliers. Standardization as well as collinearity checks are also
fundamental to make sure that no feature's weight dominates over the others.
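A minimal sketch of that preprocessing step (assuming scikit-learn; the dataset is only illustrative): standardizing the features before logistic regression keeps the weight magnitudes comparable, so no feature dominates purely because of its raw scale.

```python
# Sketch: standardize features before logistic regression so that no feature's
# weight dominates merely because of its units or scale.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```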
SVM, on the other hand, is not as prone to outliers because it only cares about the points closest
to the decision boundary; the boundary only changes when new positive or negative examples are
placed near it.
Another reason why SVM is highly popular is that it can easily be kernelized to solve
nonlinear classification problems. The idea is to project the data points into a
higher-dimensional space in which they become separable. This means that rather than separating
2-D data (a plane) with a 1-D separator (a line), you can now separate 3-D data with a 2-D
separator (a plane, i.e. a hyperplane). The kernel takes the data as input and transforms it into
the required form (linear, nonlinear, radial, sigmoid, etc.).
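As an illustration of these kernel choices, here is a minimal sketch (assuming scikit-learn; the concentric-circles data and parameters are assumptions for the example) that trains an SVM with each kernel on data no straight line can separate; the radial (RBF) kernel typically handles the circular structure that the linear kernel cannot.

```python
# Sketch: comparing SVM kernel choices on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))
```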
Q2. Explain the need for a kernel function. Write any 4 kernel functions. What do you
understand by the Kernel Trick? Explain Mercer's Condition and its importance in
SVM.
a)
· A kernel function is defined as a function that corresponds to a dot product of two feature
vectors in some expanded feature space:
K(xa, xb) = φ(xa) · φ(xb)
· Kernel functions provide a way to manipulate data as though it were projected into a higher
dimensional space, by operating on it in its original space.
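For example, the degree-2 polynomial kernel K(xa, xb) = (xa · xb)² on 2-D inputs corresponds to the explicit mapping φ(x) = (x1², √2·x1·x2, x2²); a quick numerical check (plain NumPy, purely illustrative) confirms that computing the kernel in the original space gives the same value as the dot product in the expanded space.

```python
# Numerical check: a degree-2 polynomial kernel equals a dot product in an
# explicitly expanded 3-D feature space (illustration only).
import numpy as np

def phi(x):
    # Explicit mapping for K(xa, xb) = (xa . xb)^2 on 2-D inputs.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xa = np.array([1.0, 2.0])
xb = np.array([3.0, -1.0])

kernel_value = np.dot(xa, xb) ** 2          # computed in the original space
explicit_value = np.dot(phi(xa), phi(xb))   # computed in the expanded space
print(kernel_value, explicit_value)         # both print 1.0
```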
EX:
· It can be impossible to draw a line in a 2-D plot that separates the blue samples from
the red ones.
· The idea is to map the non-linearly-separable data set into a higher-dimensional space
where we can find a hyperplane that separates the samples (a minimal sketch follows).
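A minimal sketch of that mapping (assuming scikit-learn and NumPy, with concentric circles standing in for the blue/red samples): appending x1² + x2² as a third coordinate lifts the 2-D rings into 3-D, where a plane separates them.

```python
# Sketch: lifting 2-D circular data into 3-D so a linear separator suffices.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# No straight line separates the two rings in the original 2-D space...
print("2-D accuracy:", SVC(kernel="linear").fit(X, y).score(X, y))

# ...but adding x1^2 + x2^2 as a third feature makes a plane sufficient.
X3 = np.column_stack([X, (X ** 2).sum(axis=1)])
print("3-D accuracy:", SVC(kernel="linear").fit(X3, y).score(X3, y))
```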
b)
Four commonly used kernel functions (any four may be given; these are the standard choices):
1. Linear kernel: K(xa, xb) = xa · xb
2. Polynomial kernel: K(xa, xb) = (γ xa · xb + r)^d
3. Radial basis function (RBF) kernel: K(xa, xb) = exp(−γ ‖xa − xb‖²)
4. Sigmoid kernel: K(xa, xb) = tanh(γ xa · xb + r)
c)
· We have seen how higher dimensional transformations can allow us to separate data in
order to make classification predictions. It seems that in order to train a support vector classifier
and optimize our objective function, we would have to perform operations with the higher
dimensional vectors in the transformed feature space. In real applications, there might be many
features in the data and applying transformations that involve many polynomial combinations of
these features will lead to extremely high and impractical computational costs.
· The kernel trick provides a solution to this problem. The “trick” is that kernel methods
represent the data only through a set of pairwise similarity comparisons between the original
data observations x (with the original coordinates in the lower dimensional space), instead of
explicitly applying the transformations ϕ(x) and representing the data by these transformed
coordinates in the higher dimensional feature space.
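As a concrete sketch of this idea (assuming scikit-learn; the RBF kernel and data are assumptions for the example), the classifier never needs the expanded coordinates: it can be trained directly on the n × n matrix of pairwise kernel values (the Gram matrix).

```python
# Sketch: training an SVM from pairwise similarities only (the Gram matrix),
# without ever materializing the higher-dimensional coordinates phi(x).
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

gram = rbf_kernel(X, X, gamma=1.0)              # n x n pairwise comparisons
clf = SVC(kernel="precomputed").fit(gram, y)    # trained on similarities only

# Prediction also needs only kernel values against the training points.
print("Training accuracy:", clf.score(rbf_kernel(X, X, gamma=1.0), y))
```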
d)
Mercer's condition states that a kernel function K(xa, xb) corresponds to a dot product
φ(xa) · φ(xb) in some feature space if and only if it is symmetric and positive semidefinite,
i.e. for any finite set of points the Gram matrix Kij = K(xi, xj) has no negative eigenvalues.
Importance:
1. What happens if one uses a kernel that does not satisfy Mercer's condition? In
general, there may exist data for which the Hessian is indefinite and for which the
quadratic programming problem has no solution (the dual objective function can
become arbitrarily large).
2. However, even for kernels that do not satisfy Mercer's condition, one might still find that
a given training set results in a positive semidefinite Hessian, in which case the training
will converge perfectly well. In this case, however, the geometrical interpretation
described above is lacking.
3. So without a kernel that satisfies Mercer's condition, you lose at least some
convergence guarantees (and possibly more, e.g. convergence speed or approximation
accuracy when early stopping). A numerical check of this condition is sketched below.
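To see this numerically, here is a small sketch (assuming NumPy and scikit-learn; the random data and kernel parameters are assumptions): a Mercer kernel such as the RBF must yield a positive semidefinite Gram matrix for any data, whereas the sigmoid (tanh) kernel does not satisfy Mercer's condition in general and its Gram matrix can have negative eigenvalues, i.e. an indefinite Hessian.

```python
# Sketch: checking Mercer's condition empirically via Gram-matrix eigenvalues.
# A Mercer kernel must give a positive semidefinite Gram matrix for ANY data.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

for name, gram in [
    ("rbf", rbf_kernel(X, X, gamma=1.0)),
    ("sigmoid", sigmoid_kernel(X, X, gamma=1.0, coef0=1.0)),
]:
    # RBF: smallest eigenvalue >= 0 up to round-off (positive semidefinite);
    # sigmoid: a clearly negative eigenvalue typically appears here, since it
    # is not a Mercer kernel for general parameter choices.
    print(name, "min eigenvalue:", np.linalg.eigvalsh(gram).min())
```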