MLT 07
Algorithm:
Model Overview
The linear kernel in a Support Vector Machine (SVM) is used when the data is linearly separable. The algorithm looks for the hyperplane that best separates the data points into two classes with a clear margin; the decision boundary is a straight line in 2D or a flat hyperplane in higher dimensions. The linear kernel is the simplest kernel and works well whenever the classes can be separated by a linear boundary. For linearly separable data, the SVM with a linear kernel finds the decision boundary (hyperplane) that maximizes the margin between the support vectors of each class, so that the boundary has the largest possible distance from the closest data points (the support vectors) on either side. This leads to better generalization and less overfitting.
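As an illustration of the quantities involved, the learned boundary of a linear SVM can be inspected directly. The sketch below is hypothetical: it assumes a scikit-learn SVC with kernel='linear' named svm_classifier that has already been fitted on 2-D data (as is done later in this report). Its coef_ and intercept_ attributes describe the hyperplane w·x + b = 0, and the width of the margin is 2 / ||w||.

import numpy as np

# Hyperplane parameters of a fitted linear SVC (variable name assumed)
w = svm_classifier.coef_[0]        # weight vector w
b = svm_classifier.intercept_[0]   # bias term b

# Decision rule for a new point x: class 1 if w.x + b > 0, otherwise class 0
x_new = np.array([1.5, -0.5])      # illustrative values
predicted_class = int(np.dot(w, x_new) + b > 0)

# Distance between the two margin hyperplanes w.x + b = +1 and w.x + b = -1
margin_width = 2.0 / np.linalg.norm(w)
print(predicted_class, margin_width)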
Steps Involved
1. Data Loading:
The dataset is loaded into a structured format such as a Pandas DataFrame, containing labeled data for classification. This
dataset should have features that are used to predict the target variable (class labels).
2. Data Exploration:
Initial exploration is conducted to understand the structure of the dataset, identify missing or inconsistent values, and check for
the presence of categorical or numerical features. Basic statistical analysis and visualizations like histograms, box plots, and
pairwise scatter plots are used to examine feature distributions and relationships between them.
3. Data Preprocessing:
Handling Missing Data: Any missing or inconsistent data is either imputed with the mean/median/mode or removed
entirely from the dataset.
Feature Selection: Important features are selected to improve the efficiency and performance of the model. Irrelevant or
redundant features may be dropped.
Feature Scaling: SVM is sensitive to the scale of the features, so numerical features are standardized or normalized onto a common scale. Without scaling, features with larger ranges dominate the distance computations and the hyperplane is placed poorly.
Handling Categorical Data: Categorical variables, if any, are encoded into numerical representations through one-hot
encoding, label encoding, or other relevant methods.
4. Model Creation and Training:
A Support Vector Machine model is created by initializing the model with the desired kernel function (linear, polynomial, or
radial basis function (RBF)). If the data is linearly separable, a linear kernel is chosen. If not, a non-linear kernel like RBF is used.
The model is then trained on the dataset; the algorithm searches for the optimal hyperplane that maximizes the margin between the two classes. The model identifies the support vectors, the points closest to the hyperplane, and uses these to position the boundary. (An end-to-end sketch of steps 1-5 is given after this list.)
5. Model Evaluation:
Since SVM is a supervised algorithm, performance can be evaluated using metrics like accuracy, precision, recall, F1-score,
and confusion matrix on the test dataset.
Cross-validation may also be used to evaluate model performance on different splits of the data and to avoid overfitting.
Additionally, support vectors can be visualized to understand which data points are influencing the decision boundary.
Decision boundaries can be plotted to visually assess how well the model separates the classes, especially in 2D or 3D data.
For higher-dimensional data, dimensionality reduction techniques like PCA or t-SNE can be used to project the data into a
lower dimension for visualization.
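A minimal end-to-end sketch of steps 1-5 with a linear kernel. The file name and column names are purely illustrative (the report does not show the actual dataset), and all features are assumed to be numeric:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# 1. Data loading (hypothetical file and column names)
df = pd.read_csv('data.csv')

# 3. Preprocessing: impute missing numeric values with the column mean, then scale
df = df.fillna(df.mean(numeric_only=True))
X = StandardScaler().fit_transform(df.drop(columns=['target']).values)
y = df['target'].values

# Hold out a test set for evaluation
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Model creation and training with a linear kernel
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(x_train, y_train)

# 5. Evaluation: hold-out metrics plus 5-fold cross-validation
print(classification_report(y_test, svm_classifier.predict(x_test)))
print(cross_val_score(SVC(kernel='linear'), X, y, cv=5).mean())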
Visualization
import matplotlib.pyplot as plt

# Scatter plot of the 2-D training data, coloured by class label
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='winter', marker='o',
            label='Training Data', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
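The excerpt does not show how the 2-D arrays x_train and y_train used above were produced. For a linearly separable demonstration like this one, a common choice is scikit-learn's make_blobs; the snippet below is only an assumed, plausible setup, not the report's actual data:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Two well-separated Gaussian blobs give linearly separable classes (assumed setup)
X, y = make_blobs(n_samples=150, centers=2, cluster_std=1.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)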
Applying SVM
SVC(kernel='linear', random_state=42)
Classification Report:
              precision    recall  f1-score   support
accuracy                              1.00        30
macro avg        1.00      1.00      1.00        30
weighted avg     1.00      1.00      1.00        30
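The grid variables xx and yy used in the plotting code below are not defined in the excerpt. A minimal sketch of how such a grid is typically built for a decision-boundary plot (the padding and step size are arbitrary choices):

import numpy as np

# Regular grid covering the training data, with a small border around it
x_min, x_max = x_train[:, 0].min() - 1, x_train[:, 0].max() + 1
y_min, y_max = x_train[:, 1].min() - 1, x_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))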
# Class predictions over the (xx, yy) grid, reshaped back to the grid shape
Z = svm_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
# Filled regions show which class is predicted on each side of the decision boundary
plt.contourf(xx, yy, Z, alpha=0.3, cmap='winter')
plt.scatter(x_train[y_train == 0, 0], x_train[y_train == 0, 1], c='blue', label='Class 0', edgecolors='k', s=100)
plt.scatter(x_train[y_train == 1, 0], x_train[y_train == 1, 1], c='red', label='Class 1', edgecolors='k', s=100)
# Circle the support vectors, the training points that define the margin
plt.scatter(svm_classifier.support_vectors_[:, 0], svm_classifier.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='k', label='Support Vectors')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Linear SVM Classifier with Decision Boundary, Margin, and Support Vectors')
plt.legend()
plt.show()
Result:
The linear kernel separates the linearly separable data perfectly, reaching an accuracy of 1.00 on the 30 test samples.
Algorithm:
Model Overview
The linear kernel behaves exactly as described in the previous experiment: it searches for a single flat hyperplane that maximizes the margin between the two classes, and therefore performs well only when the classes are (approximately) linearly separable.
The Radial Basis Function (RBF) Kernel is a non-linear kernel used in SVM when the data is not linearly separable. The RBF kernel
maps the input data into a higher-dimensional space using a Gaussian function, enabling SVM to create a non-linear decision
boundary. This allows the algorithm to separate classes that are not linearly separable in the original feature space.
The RBF kernel works by calculating the similarity between data points based on their distance in the feature space. This similarity is
measured using the Gaussian function, which means that data points closer to each other in the original space will have higher
similarity values. The SVM then tries to find a decision boundary in the transformed space that best separates the classes,
maximizing the margin between the support vectors. The advantage of using the RBF kernel is that it can handle complex datasets
with highly non-linear relationships between features.
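Concretely, the similarity between two points x and z under the RBF kernel is k(x, z) = exp(-gamma * ||x - z||^2), where gamma controls how quickly the similarity decays with distance. A minimal sketch of this computation, using the gamma value that appears later in this report:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_similarity(x, z, gamma=0.5):
    # Gaussian similarity: 1.0 for identical points, decaying towards 0 with distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])
print(rbf_similarity(a, b))                                        # manual computation
print(rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=0.5))   # scikit-learn equivalent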
Steps Involved
The workflow is the same as in the previous experiment: data loading, exploration, preprocessing (missing-value handling, feature selection, scaling, and encoding of categorical variables), model creation and training, and evaluation. The only change is the kernel: a linear kernel is fitted first for comparison, and then an RBF kernel is used to handle the non-linearly separable data (a minimal sketch of this comparison follows below).
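A minimal sketch of the comparison, assuming x_train, y_train, x_test, and y_test already hold the non-linearly separable 2-D data, and using the gamma value shown in the output below:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Linear kernel: restricted to a straight-line boundary
linear_svm = SVC(kernel='linear', random_state=42).fit(x_train, y_train)
print('Linear kernel accuracy:', accuracy_score(y_test, linear_svm.predict(x_test)))

# RBF kernel: non-linear boundary via the Gaussian similarity
rbf_svm = SVC(kernel='rbf', gamma=0.5, random_state=42).fit(x_train, y_train)
print('RBF kernel accuracy:', accuracy_score(y_test, rbf_svm.predict(x_test)))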
Visualization
# Scatter plot of the non-linearly separable 2-D training data, coloured by class label
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='winter', marker='o',
            label='Training Data', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
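As in the first experiment, the excerpt does not show how this data was created. Non-linearly separable 2-D toy data is often generated with scikit-learn's make_circles or make_moons; the snippet below is only an assumed, illustrative setup, not the report's actual data:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Two concentric rings cannot be separated by a straight line (assumed setup)
X, y = make_circles(n_samples=150, noise=0.1, factor=0.4, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)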
SVC(kernel='linear', random_state=42)
Classification Report:
              precision    recall  f1-score   support
accuracy                              0.33        30
macro avg        0.17      0.50      0.25        30
weighted avg     0.11      0.33      0.17        30
Visualization
# y_pred_linear holds the linear SVM's class predictions over the (xx, yy) grid
y_pred_linear = y_pred_linear.reshape(xx.shape)
plt.figure(figsize=(8, 6))
# Filled regions show the straight-line boundary found by the linear kernel
plt.contourf(xx, yy, y_pred_linear, alpha=0.3, cmap='winter')
plt.scatter(x_train[y_train == 0, 0], x_train[y_train == 0, 1], c='blue', label='Class 0', edgecolors='k', s=100)
plt.scatter(x_train[y_train == 1, 0], x_train[y_train == 1, 1], c='red', label='Class 1', edgecolors='k', s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Linear SVM on Non-Linearly Separable Data')
plt.legend()
plt.show()
SVC(gamma=0.5, random_state=42)
Classification Report:
              precision    recall  f1-score   support
accuracy                              1.00        30
macro avg        1.00      1.00      1.00        30
weighted avg     1.00      1.00      1.00        30
# y_pred_rbf holds the RBF SVM's class predictions over the (xx, yy) grid
y_pred_rbf = y_pred_rbf.reshape(xx.shape)
plt.figure(figsize=(8, 6))
# Filled regions show the curved decision boundary produced by the RBF kernel
plt.contourf(xx, yy, y_pred_rbf, alpha=0.3, cmap='winter')
plt.scatter(x_train[y_train == 0, 0], x_train[y_train == 0, 1], c='blue', label='Class 0', edgecolors='k', s=100)
plt.scatter(x_train[y_train == 1, 0], x_train[y_train == 1, 1], c='red', label='Class 1', edgecolors='k', s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('RBF SVM Classifier with Decision Boundary, Margin, and Support Vectors')
plt.legend()
plt.show()
Result:
On non-linearly separable 2-dimensional data, the SVM with the Gaussian Radial Basis Function (RBF) kernel (accuracy 1.00) performs far better than the SVM with a linear kernel (accuracy 0.33).