Unit 5 Learning with Algorithm
K-Nearest Neighbors (KNN)
1. Basic Principle
The KNN algorithm classifies a data point based on how its neighbors are classified. It works by finding the k closest data points (neighbors) to the input point and predicting the majority class among those neighbors (for classification) or the average of their values (for regression).
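The idea is simple enough to sketch from scratch; the following is an illustrative toy version (the function name and data are ours, not from any library):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[2, 1], [3, 5], [5, 8], [7, 3]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([4, 6])))  # prints 1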
2. Distance Metrics
To determine the closest neighbors, KNN relies on a distance metric to measure the
similarity between data points. Common distance metrics include:
Euclidean Distance: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²), the straight-line distance and the usual default.
Manhattan Distance: d(x, y) = Σᵢ |xᵢ − yᵢ|, the sum of absolute coordinate differences.
Minkowski Distance: d(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/p), a generalization that reduces to Manhattan for p = 1 and Euclidean for p = 2.
Hamming Distance: the number of positions at which two categorical vectors differ.
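As a quick illustration, these metrics can be computed directly with NumPy (two made-up points):

import numpy as np

x = np.array([2.0, 1.0])
y = np.array([5.0, 8.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)  # Minkowski with p = 3

print(euclidean, manhattan, minkowski)  # ≈ 7.62, 10.0, ≈ 7.18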
3. Choosing k
The value of k (the number of neighbors) is crucial and can significantly affect the
performance of the algorithm:
A small k may be sensitive to noise in the data.
A large k may smooth out the predictions too much and lose important details.
Common practice is to choose k via cross-validation (for binary classification, an odd k also avoids tied votes), as sketched below.
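A minimal sketch of choosing k with scikit-learn's cross-validation utilities; the dataset here is synthetic, purely for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Score each candidate k with 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")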
4. Classification vs Regression
Classification: The output is a class label. The class label is determined by the
majority vote of the nearest neighbors.
Regression: The output is a continuous value. The value is typically the mean (or
sometimes the median) of the nearest neighbors' values.
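For the regression case, scikit-learn provides KNeighborsRegressor; a minimal sketch with toy data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The prediction is the mean of the 3 nearest targets (here x = 2, 3, 4)
print(reg.predict([[3.4]]))  # ≈ 2.97 = (1.9 + 3.1 + 3.9) / 3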
Here’s a simple example of KNN for a classification problem using Python’s scikit-learn
library:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Sample data
data = {
    'Feature1': [2, 3, 5, 7, 1, 6, 4, 8],
    'Feature2': [1, 5, 8, 3, 4, 7, 2, 6],
    'Label': [0, 1, 1, 0, 0, 1, 0, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Feature and target selection
X = df[['Feature1', 'Feature2']]
y = df['Label']

# Split into training and test sets (75-25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predictions
y_pred = knn.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
Explanation of the code:
1. Data Preparation: A sample dataset is created with two features and a binary label.
2. Feature and Target Selection: The features (X) and target (y) are separated.
3. Data Splitting: The data is split into training and test sets using a 75-25 split.
4. Model Creation and Training: A KNeighborsClassifier with k=3 is instantiated and trained on the training data.
5. Predictions: Predictions are made on the test set.
6. Model Evaluation: The accuracy, confusion matrix, and classification report are calculated to evaluate the model's performance.
Applications of KNN
Recommendation Systems: Suggesting items based on the preferences of similar users.
Pattern Recognition: Handwriting and image classification.
Medical Diagnosis: Classifying patients by similarity to previous cases.
Anomaly Detection: Flagging points whose nearest neighbors are unusually far away.
Pros
Simple and Intuitive: Easy to understand and implement.
No Training Phase: KNN is a "lazy" learner; all computation happens at prediction time.
Naturally Non-Linear: Makes no assumption about the shape of the decision boundary.
Cons
Computationally Expensive at Prediction: Every query is compared against the whole training set.
Memory Intensive: The entire training set must be stored.
Sensitive to Feature Scaling: Unscaled features distort the distance calculations.
Curse of Dimensionality: Distances become less informative as the number of features grows.
KNN is a versatile and intuitive algorithm that can be highly effective, especially for small to
medium-sized datasets. However, its performance and efficiency can be significantly affected
by the choice of parameters and the scale of the data.
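Because predictions are driven entirely by distances, feature scaling usually matters in practice; a minimal sketch of a scaled KNN pipeline on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Standardize features to zero mean and unit variance before computing distances
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())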
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks. It is particularly known for its effectiveness in high-dimensional spaces and its ability to create a robust decision boundary between different classes.
1. Hyperplane
A hyperplane is a decision boundary that separates the data points of different classes. In a 2D space it is a line, in a 3D space it is a plane, and the general term for higher-dimensional spaces is a hyperplane.
2. Support Vectors
Support vectors are the data points that are closest to the hyperplane and influence its position
and orientation. These points are critical in defining the optimal hyperplane.
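Once a model is fitted, scikit-learn exposes these points directly; a small illustrative sketch with toy data:

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X, y)
print(clf.support_vectors_)  # the points closest to the hyperplane
print(clf.n_support_)        # number of support vectors per class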
3. Margin
The margin is the distance between the hyperplane and the nearest support vectors from both
classes. SVM aims to maximize this margin, ensuring that the data points are as far away
from the hyperplane as possible, leading to better generalization on new data.
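In the standard formulation (see Mathematical Formulation below), the two margin boundaries are w · x + b = +1 and w · x + b = −1, so the margin width equals 2/‖w‖; maximizing the margin is therefore equivalent to minimizing ‖w‖²/2.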
4. Optimal Hyperplane
The optimal hyperplane is the one that maximizes the margin between the support vectors of
the two classes. This is also known as the maximum-margin hyperplane.
5. Hard Margin and Soft Margin
Hard Margin: Assumes that the data is perfectly linearly separable. It tries to find a hyperplane that completely separates the classes without any misclassification.
Soft Margin: Allows some misclassifications to make the model more robust and handle noisy data better. It introduces a regularization parameter (C) to control the trade-off between maximizing the margin and minimizing the classification error, as illustrated below.
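The effect of C can be seen by fitting the same synthetic data with different values; a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 2-D data with some label noise, purely for illustration
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=1)

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Small C: wider margin, more misclassifications tolerated.
    # Large C: narrower margin, fewer training errors.
    print(f"C={C}: support vectors = {clf.n_support_.sum()}, "
          f"train accuracy = {clf.score(X, y):.2f}")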
6. Kernel Trick
When the data is not linearly separable, SVM uses the kernel trick to map the data into a higher-dimensional space where it can become linearly separable. Common kernels include:
Linear Kernel: K(x, y) = x · y, for data that is already (close to) linearly separable.
Polynomial Kernel: K(x, y) = (x · y + c)^d, which captures polynomial feature interactions.
Radial Basis Function (RBF) Kernel: K(x, y) = exp(−γ‖x − y‖²), a popular default for non-linear problems.
Sigmoid Kernel: K(x, y) = tanh(α x · y + c).
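A minimal sketch of the kernel trick in practice, using scikit-learn's two-moons generator, whose two classes no straight line can separate:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-circles: not linearly separable
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma=1.0).fit(X, y)

print(f"linear kernel accuracy: {linear.score(X, y):.2f}")
print(f"RBF kernel accuracy:    {rbf.score(X, y):.2f}")  # typically much higher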
Mathematical Formulation
For a binary classification problem, the decision function of SVM can be represented as:

f(x) = sign(w · x + b)

where:
w is the weight vector normal to the hyperplane,
x is the input feature vector,
b is the bias term.

In the soft margin formulation, the optimization objective includes a regularization term to penalize misclassifications:

minimize (1/2)‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i,

where ξᵢ are slack variables that allow for misclassification, and C is the regularization parameter.
Implementation of SVM in Python
Here's an example of using SVM for classification with Python's scikit-learn library:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Sample data
data = {
    'Feature1': [2, 3, 5, 7, 1, 6, 4, 8],
    'Feature2': [1, 5, 8, 3, 4, 7, 2, 6],
    'Label': [0, 1, 1, 0, 0, 1, 0, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Feature and target selection
X = df[['Feature1', 'Feature2']]
y = df['Label']

# Split into training and test sets (75-25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model with a linear kernel
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Predictions
y_pred = svm.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
Explanation of the code:
1. Data Preparation: A sample dataset is created with two features and a binary label.
2. Feature and Target Selection: The features (X) and target (y) are separated.
3. Data Splitting: The data is split into training and test sets using a 75-25 split.
4. Model Creation and Training: An SVC (Support Vector Classifier) with a linear kernel is instantiated and trained on the training data.
5. Predictions: Predictions are made on the test set.
6. Model Evaluation: The accuracy, confusion matrix, and classification report are calculated to evaluate the model's performance.
Applications of SVM
Text Classification: Spam detection, sentiment analysis.
Image Classification: Object detection, face recognition.
Bioinformatics: Protein classification, cancer detection.
Finance: Credit risk assessment, fraud detection.
Cons
Computationally Expensive: Training time can grow quickly with dataset size, especially for non-linear kernels.
Memory Intensive: Kernel methods may need to compute and cache a kernel matrix that grows quadratically with the number of training samples.
Sensitive to Noise: Particularly in the case of overlapping classes, SVM can be sensitive to outliers; the C parameter controls how much slack is tolerated.
SVM is a versatile and powerful algorithm that can be highly effective in various
classification and regression tasks, especially when the data is high-dimensional and
separable with an appropriate kernel function. However, it requires careful tuning of
parameters and is computationally intensive for large datasets.
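As a final illustration of that tuning, a minimal sketch of a grid search over C and gamma for an RBF-kernel SVC (synthetic data, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)

# Search the two parameters that matter most for an RBF-kernel SVC
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")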