
Supervised Algorithms (3)

SUPPORT VECTOR MACHINE (SVM)

1
 The two classes can clearly be separated with a straight line (they are linearly separable).
 The left plot shows the decision boundaries of three possible linear classifiers.
 The model whose decision boundary is represented by the dashed line is so bad that it does not even
separate the classes properly.
 The other two models work perfectly on this training set, but their decision boundaries come so close
to the instances that these models will probably not perform as well on new instances.

 In contrast, the solid line in the plot on the right represents the decision boundary of an SVM
classifier; this line not only separates the two classes but also stays as far away from the
closest training instances as possible.
 You can think of an SVM classifier as fitting the widest possible street (represented by the
parallel dashed lines) between the classes. This is called large margin classification

Margin: the distance between the support vectors and the hyperplane.

2

 A Support Vector Machine is a very powerful ML model, capable of performing linear or
nonlinear classification (SVC), regression (SVR), and even outlier detection.
 The main objective of the SVM algorithm is to find the optimal hyperplane (decision boundary)
in such a way that the separation between the two classes is as wide as possible.
◦ The hyperplane is chosen so that the margin between the closest points/vectors of the different classes is
as large as possible.
◦ These closest data points are called support vectors, and hence the algorithm is termed Support
Vector Machine.

 The SVM produces a real-valued output that is negative or positive depending on which side of
the decision boundary it falls.
◦ To classify a new data point x, we simply determine which side of the plane x falls on.

Linear SVM: This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can
entirely divide the data points into their respective classes.
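A minimal sketch of the sign rule described above (negative vs. positive real-valued output decides the class); the toy data and C value are illustrative, not from the slides:

import numpy as np
from sklearn.svm import LinearSVC

# Two small, well-separated clusters (made-up values)
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = LinearSVC(C=1).fit(X, y)
scores = svm.decision_function(X)   # real-valued output: signed distance to the boundary
print(scores)                       # negative -> class 0 side, positive -> class 1 side
print(svm.predict(X))               # same as (scores > 0).astype(int) here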

3
Hard Margin
 If we strictly impose that all instances be off the street and on the right side (without any
misclassifications), this is called hard margin classification.
 There are two main issues with hard margin classification.
◦ First, it only works if the data is linearly separable,
◦ Second it is quite sensitive to outliers.

An outlier is a labeled point that is unusually close to points with the opposite label.

4
Soft Margin Classification
 To avoid these issues it is preferable to use a more flexible model.
 The objective is to find a good balance between
◦ keeping the street as large as possible and
◦ limiting the margin violations (i.e., instances that end up in the middle of the street or even on the
wrong side).

 This is called soft margin classification.


 This idea is based on a simple premise: allow the SVM to make a certain number of mistakes and
keep the margin as wide as possible so that other points can still be classified correctly.

5
In the soft margin case, we let our model give some relaxation to a few points: instead of
considering them as support vectors, we consider them as error points and give them a certain
penalty, proportional to the amount by which each data point violates the hard constraint.
There can be three cases for a point x(i):
1. It lies beyond the margin: correctly or incorrectly classified.
2. It lies on the margin: then it is a support vector.
3. It lies inside the margin.

6
 In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter: A
smaller C value leads to a wider street but more margin violations.
◦ On the left, using a low C value, the margin is quite large, but many instances end up on the street.
◦ On the right, using a high C value, the classifier makes fewer margin violations but ends up with a smaller
margin.

 However, it seems likely that the first classifier will generalize better: in fact even on this
training set it makes fewer prediction errors, since most of the margin violations are actually on
the correct side of the decision boundary.
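A rough sketch of this trade-off, counting training-set margin violations for a small and a large C; the made-up data and C values are illustrative only:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1.5, rng.randn(50, 2) + 1.5])  # two overlapping clusters
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 100):
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    # With hinge loss, points with t * score < 1 are margin violations (t in {-1, +1}).
    t = np.where(y == 1, 1, -1)
    violations = np.sum(t * clf.decision_function(X) < 1)
    print(f"C={C}: margin violations on training set = {violations}")

The smaller C typically reports many more violations (a wider street), while the larger C reports few (a narrower street).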

7

The Regularization Parameter C in SVM
• C balances margin maximization against the penalty for misclassifications.
 A greater value of C results in a smaller margin and perhaps fewer misclassifications.
 Hard margin SVM generally corresponds to large values of C.
 Soft margin SVM generally corresponds to small values of C.

Hinge Loss function: An SVM uses the hinge loss, which penalizes only those points that are on the
wrong side of the hyperplane or very near it on the correct side.

8

 Hinge Loss Function: The x-axis represents the distance (t ⋅ y)
of a single instance from the boundary, and the y-axis
represents the loss size, or penalty, that the function incurs
depending on that distance.
 When a data point's distance from the boundary is greater than
or equal to 1, the loss is 0.
 If the distance from the boundary is less than 1, we incur a
loss.
◦ At distance 0 (the data point is on the boundary), the loss is 1.
◦ Correctly classified points will have a small loss, while incorrectly
classified instances will have a high loss.
 A high hinge loss indicates data points on the wrong side of
the boundary, and hence misclassified, while a large positive
distance gives a low (or zero) hinge loss and a correct classification.

9

 The loss is defined according to the following formula: ℓ(y) = max(0, 1 − t ⋅ y), where t is the
actual outcome (either +1 or −1), and y is the output of the classifier.
 Examples:
◦ If an observation has an actual outcome of +1 and the SVM produces an output of 1.5, the loss equals 0:
ℓ(1.5) = max(0, 1 − 1 ⋅ 1.5) = 0, so no penalty.
◦ An observation located directly on the boundary incurs a loss of 1 regardless of whether
the real outcome is +1 or −1: ℓ(0) = max(0, 1 − 1 ⋅ 0) = 1.
◦ If the actual outcome is +1 and the classifier predicts 0.5, the corresponding loss is 0.5 even
though the classification is correct: ℓ(0.5) = max(0, 1 − 1 ⋅ 0.5) = 0.5.
◦ If the outcome is −1 and the prediction is 0.5: ℓ(0.5) = max(0, 1 − (−1) ⋅ 0.5) = 1.5.
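These worked examples translate directly into a few lines of Python (a sketch of the formula, not library code):

def hinge_loss(t, y):
    """t is the true label (+1 or -1), y is the classifier's raw output."""
    return max(0, 1 - t * y)

print(hinge_loss(+1, 1.5))   # 0.0 -> beyond the margin, no penalty
print(hinge_loss(+1, 0.0))   # 1.0 -> on the boundary
print(hinge_loss(+1, 0.5))   # 0.5 -> correct side but inside the margin
print(hinge_loss(-1, 0.5))   # 1.5 -> wrong side of the boundary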

10

 The goal of optimization is to minimize ½‖β‖² + C Σᵢ ξᵢ
◦ where β is the coefficient vector that defines the margin, ξᵢ is the loss (slack) of the ith support vector with
respect to the margin, and C is a model hyperparameter that determines the relative contribution of the two terms.
◦ Now if we increase C, we are penalizing the errors more.

 Common optimizers like gradient descent can be used for this loss function.
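One way to see this in practice is Scikit-Learn's SGDClassifier with loss="hinge", which fits a linear SVM by stochastic gradient descent. A minimal sketch with toy data; the alpha value is illustrative (alpha acts roughly like 1/(C · m), where m is the number of training instances):

from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Made-up two-class data
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Hinge loss + L2 regularization, optimized by stochastic gradient descent
sgd_svm = SGDClassifier(loss="hinge", alpha=0.001, max_iter=1000, random_state=0)
sgd_svm.fit(X, y)
print(sgd_svm.predict(X[:5]))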

11
Non-Linear SVM
 Non-Linear SVM can be used to classify data that cannot be separated into two
classes by a straight line.
 What to Do if Data are Not Linearly Separable?
◦ We apply transformations to the data, which map the data from the original
space into a higher dimensional feature space.
◦ The goal is that after the transformation to the higher dimensional space, the
classes are now linearly separable in this higher dimensional feature space.

 A simple way to do this is to add powers of each feature as new features


 In this example, the picture on the top shows our original data points.
◦ In 1-dimension, this data is not linearly separable, but
◦ after applying the transformation ϕ(x) = x² and adding this second dimension
to our feature space, the classes become linearly separable.
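A minimal sketch of this 1-D example with made-up points (the exact values are illustrative): the inner class cannot be split off by one threshold, but after adding x² it can.

import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y = np.array([ 1,  1,  0,  0, 0, 0, 0, 1, 1])   # inner points are class 0

X_1d = x.reshape(-1, 1)      # original 1-D feature space
X_2d = np.c_[x, x ** 2]      # add the second dimension phi(x) = x^2

print(LinearSVC(C=1).fit(X_1d, y).score(X_1d, y))  # < 1.0: not linearly separable in 1-D
print(LinearSVC(C=1).fit(X_2d, y).score(X_2d, y))  # 1.0: separable after the transform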

12

 Scikit-Learn's PolynomialFeatures(degree=k) generates a new feature matrix consisting of
all polynomial combinations of the features with degree less than or equal to the specified
degree.
 For example,
◦ with 2 features a and b and degree=2, the polynomial features are [a, b, a², ab, b²].
◦ with degree=3, the following features will be added: a², ab, b², a³, b³, a²b, and ab².
◦ Beware of the combinatorial explosion of the number of features!
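A quick way to check this expansion (a minimal sketch; the feature names "a" and "b" and the sample values are just for illustration, and get_feature_names_out assumes a recent Scikit-Learn version):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(np.array([[2.0, 3.0]]))                 # any 2-feature data works
print(poly.get_feature_names_out(["a", "b"]))    # ['a' 'b' 'a^2' 'a b' 'b^2']
print(poly.transform(np.array([[2.0, 3.0]])))    # [[2. 3. 4. 6. 9.]]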

 Adding polynomial features is simple to implement and can work great with all sorts of
Machine Learning algorithms (not just SVMs), but:
◦ at a low polynomial degree it cannot deal with very complex datasets,
◦ with a high polynomial degree it creates a huge number of features, making the model too slow.

13
Kernel Trick
 Fortunately, when using SVMs you can apply a mathematical technique called the kernel trick.
 It makes it possible to get the same result as if you added many polynomial features, without
actually having to add them.
◦ So there is no combinatorial explosion of the number of features since you don’t actually add any
features

 Some of the common kernel functions are linear, polynomial, radial basis function(RBF), and
sigmoid.

If your model is overfitting, you might want to reduce the polynomial degree; if it
is underfitting, you can try increasing it.
14
Feature Scaling
 With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very
different scales.
 There are two common ways to get all attributes to have the same scale:
◦ Min-max scaling (normalization) is quite simple: values are shifted and rescaled so that they end up
ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
◦ Scikit-Learn provides a transformer called MinMaxScaler for this.
◦ Standardization: first it subtracts the mean value (so standardized values always have a zero mean), and
then it divides by the standard deviation so that the resulting distribution has unit variance.
◦ Scikit-Learn provides a transformer called StandardScaler for standardization.

Note that scaling the target values is generally not required.

15

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]] # normal list
scaler = MinMaxScaler()
scaler.fit(data) # Compute the minimum and maximum to be used for later scaling.
print(scaler.data_max_)
print(scaler.data_min_)
n = scaler.transform(data) # return ndarray transformed array.
print(n)
print(scaler.transform([[2, 2]]))

# scale just one attribute of a pandas DataFrame df
sss = scaler.fit_transform(df[["median_income"]])

16

from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]] # normal list
scaler = StandardScaler()
scaler.fit(data) # Compute the mean and std to be used for later scaling.
print(scaler.mean_)
n = scaler.transform(data) # return ndarray transformed array.
print(n)
print(scaler.transform([[2, 2]]))

17

 Notes:
◦ Unlike min-max scaling, standardization does not bound values to a specific range, which may be a
problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1).
◦ Standardization is much less affected by outliers. For example, suppose a district had a median income
equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–
0.15, whereas standardization would not be much affected.

SVMs are sensitive to the feature scales. You should center the training set first by subtracting its mean.
This is automatic if you scale the data using the StandardScaler.
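A minimal sketch of one common way to do this: chain StandardScaler and a linear SVM in a Pipeline so the SVM always sees standardized features (the toy data and C value are illustrative):

from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Scaling is fitted on the training data and applied automatically at predict time
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1))
svm_clf.fit(X, y)
print(svm_clf.predict(X[:3]))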

18
Scikit-Learn SVM Classes
 1-from sklearn.svm import LinearSVC
◦ Implements an optimized algorithm for linear SVMs.
◦ It does not support the kernel trick.
◦ The algorithm takes longer if you require very high precision. This is controlled by the tolerance
hyperparameter tol. In most classification tasks, the default tolerance is fine.
◦ The width of the street is controlled by a hyperparameter C : decreasing C will increase the margin.
◦ Similar to SVC with parameter kernel=’linear’

from sklearn.svm import LinearSVC

svm = LinearSVC(C=1)
svm.fit(X_train, y_train)
print(svm.predict(X_test))

• Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class by default.

19

from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=3)
X_poly = trans.fit_transform(X)  # X is the original feature matrix
svm = LinearSVC(C=10)
svm.fit(X_poly, y)

20

 2-from sklearn.svm import LinearSVR
◦ The LinearSVR class is the regression equivalent of the LinearSVC class.
◦ The width of the street is controlled by a hyperparameter ϵ : increasing ϵ will increase the margin.

from sklearn.svm import LinearSVR


svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

21

 3-from sklearn.svm import SVC
◦ implements an algorithm that supports the kernel trick.
◦ it gets extremely slow when the number of training instances gets large (e.g., hundreds of thousands of
instances).
◦ This algorithm is perfect for complex but small or medium training sets.
◦ SVC(kernel="poly", degree=3, coef0=1, C=5))

• The coef0 controls how much the model is influenced by high-degree polynomials versus low-degree
polynomials.
• It is only significant in ‘poly’ and ‘sigmoid’.

22

# import the required classes
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv('Social_Network_Ads.csv')
df.drop(columns=['User ID'], inplace=True)
# Label encoding
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
# Drop insignificant data based on correlation
print(df.corr())
df.drop(columns=['Gender'], inplace=True)
# Split data
X = df.iloc[:, :-1].values  # 2D (400, 2)
y = df.iloc[:, -1].values   # 1D (400,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Train Support Vector Machine model
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
# Prediction
y_pred = classifier.predict(X_test)
# Accuracy
print(accuracy_score(y_test, y_pred))
# Predict purchase with Age(30) and Salary(87000)
print(classifier.predict(sc.transform([[30, 87000]])))

23

 4-from sklearn.svm import SVR
◦ The SVR class is the regression equivalent of the SVC class
◦ SVR(kernel="poly", degree=2, C=100, epsilon=0.1)

24

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv('Position_Salaries.csv')
X = df.iloc[:, 1:-1].values  # numpy (10, 1) (2D)
y = df.iloc[:, -1].values    # numpy (10,)  (1D)
# Feature scaling
# fit_transform expects a 2D array, so reshape y to (10, 1)
y = y.reshape(len(y), 1)
sc_X = StandardScaler()
sc_y = StandardScaler()
X_ = sc_X.fit_transform(X)
y_ = sc_y.fit_transform(y)
regressor = SVR(kernel='rbf')
regressor.fit(X_, y_.ravel())
# Predict the salary for level 7.5 (scale the input, then invert the output scaling)
X_new = sc_X.transform([[7.5]])
y_pred = regressor.predict(X_new)
print(sc_y.inverse_transform([y_pred]))

Note: y_.shape == (10, 1), while y_.ravel().shape == (10,): ravel() flattens the array into 1D, which is what fit expects for the target.

25
OVO
 Scikit-Learn detects when you try to use a binary
classification algorithm for a multiclass classification task,
and it automatically runs OvA (except for SVM classifiers
for which it uses OvO).
 The OvO strategy trains a binary classifier for every pair of
classes: one to distinguish C1 from C2, another to
distinguish C1 from C3, another for C2 and C3, and so on.
 If there are N classes, you need to train N × (N – 1) / 2
classifiers.
 To classify a new instance, you run it through all the classifiers
and see which class wins the most duels.

26

 Classically, this approach is suggested for support vector machines (SVMs) and related kernel-
based algorithms. This is because the training cost of kernel methods does not scale well with the
size of the training dataset, and using subsets of the training data (one pair of classes at a time)
may counter this effect.
 The support vector machine implementation in scikit-learn is provided by the SVC class and
supports the one-vs-one method for multi-class classification problems. This can be selected by
setting the decision_function_shape argument to 'ovo'.
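A minimal sketch on the Iris dataset (the kernel choice is illustrative): with 3 classes, OvO trains 3 × (3 − 1) / 2 = 3 pairwise classifiers, and decision_function exposes one score per pair.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)  # (1, 3): one score per pair of classes
print(clf.predict(X[:1]))                  # the class that wins the most duels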

27
