SVM, KNN, Tree NBC
Instructions:
• Read this study material carefully and make your own handwritten short notes. (Short
notes must not be more than 5-6 pages)
• Revise this material at least 5 times. Once you have prepared your short notes, revise
your short notes twice a week.
• If you are not able to understand any topic or require a detailed explanation,
please mention it in our discussion forum on the website.
• Let me know if there are any typos or mistakes in the study material. Mail
me at [email protected]
1 K-Nearest Neighbors
• K-Nearest Neighbors (KNN) is a simple and intuitive machine-learning algorithm used
for both classification and regression tasks.
• The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be assigned to a
well-suited category using K-NN.
• K-NN is a non-parametric algorithm, which means it does not make any assumptions
about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and performs the computation at the time of
classification.
• KNN predicts the class of a test point by calculating the distance between the test
point and all the training points, selecting the K points closest to the test point, and
choosing the class that occurs most often (i.e. has the highest estimated probability)
among those K training points. In the case of regression, the prediction is the mean of
the target values of the K selected training points.
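To make the procedure above concrete, here is a minimal NumPy sketch of KNN prediction; the dataset, variable names, and the choice of Euclidean distance are illustrative assumptions, not part of the original material.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict the label of x_query from its k nearest training points."""
    # Distance between the query and every training point (Euclidean).
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the k neighbours' classes.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: mean of the k neighbours' target values.
    return y_train[nearest].mean()

# Tiny made-up dataset: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> 0
```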
Choosing the Value of K
• A larger k may lead to better performance, but if we set k too large we may end up
including samples that are not true neighbors (points far away from the query).
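In practice, K is usually chosen by cross-validating over a range of candidate values. The sketch below shows one way to do this with scikit-learn; the dataset and the candidate range 1-20 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Try each candidate k with 5-fold cross-validation and keep the best one.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # the k with the best cross-validated accuracy
```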
1.1 Working
The working of K-NN can be explained on the basis of the following algorithm:
1. Select the number K of neighbors.
2. Calculate the distance (e.g. the Euclidean distance) from the new data point to every
training point.
3. Take the K nearest neighbors according to the calculated distances.
4. Among these K neighbors, count the number of the data points in each category.
5. Assign the new data point to that category for which the number of neighbors is
maximum.
Worked example (K = 3, new point X1 = 2.5, X2 = 2.5):
1. Calculate Distances: Compute the distance from the new point to every training point.
2. Find K Nearest Neighbors: Identify the three nearest neighbors based on the
calculated distances. In this case, the three closest points are A, B, and C.
3. Majority Voting: Determine the majority class among the three nearest neighbors.
Since A and B are Blue, and C is Red, the majority class is Blue.
4. Prediction: Predict that the new point X1 = 2.5, X2 = 2.5 belongs to the majority
class, which is Blue.
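The coordinates of A, B, and C come from a figure that is not reproduced here, so the sketch below re-creates the example with assumed coordinates; only the query point (2.5, 2.5) and the labels (A and B Blue, C Red) are taken from the steps above.

```python
import numpy as np
from collections import Counter

# Assumed coordinates for the three neighbours named in the example.
points = {"A": (2.0, 2.0), "B": (3.0, 3.0), "C": (2.0, 4.0)}
labels = {"A": "Blue", "B": "Blue", "C": "Red"}
query = np.array([2.5, 2.5])

# Step 1: distance from the query to each point.
dist = {name: np.linalg.norm(np.array(p) - query) for name, p in points.items()}
# Step 2: the K = 3 nearest neighbours (here, all three points).
nearest = sorted(dist, key=dist.get)[:3]
# Steps 3-4: majority vote -> 'Blue' (two Blue vs one Red).
print(Counter(labels[n] for n in nearest).most_common(1)[0][0])  # Blue
```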
• Non-parametric: KNN doesn’t make any assumptions about the underlying data
distribution, making it versatile for a wide range of applications.
• Adaptability: KNN can be used for both classification and regression tasks, and it
can handle multi-class problems without modification.
• Determining the Optimal K: Selecting the right value for K is crucial, and choosing
an inappropriate K can lead to underfitting or overfitting. There’s no universally
optimal value, and it often requires experimentation.
• Imbalanced Data: KNN can be biased towards the majority class in imbalanced
datasets. It’s essential to balance the dataset or adjust the class weights when neces-
sary.
2 Naive Bayes Classifier
• Bayes’ Theorem: The classifier is based on Bayes’ theorem, which calculates the
probability of a hypothesis (in this case, a class label) given the evidence (features or
attributes). Mathematically, it is expressed as P(class | evidence) = [P(evidence | class)
× P(class)] / P(evidence).
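As a quick numeric illustration of the formula (all probability values below are made up):

```python
# P(class | evidence) = P(evidence | class) * P(class) / P(evidence)
p_class = 0.3                  # prior P(class)
p_evidence_given_class = 0.8   # likelihood P(evidence | class)
p_evidence = 0.5               # evidence P(evidence)

p_class_given_evidence = p_evidence_given_class * p_class / p_evidence
print(p_class_given_evidence)  # 0.48
```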
Common variants of the Naive Bayes classifier include:
1. Multinomial Naive Bayes: Typically used for text classification, where features
represent word counts.
2. Gaussian Naive Bayes: Suitable for continuous data and assumes a Gaussian
distribution of features.
3. Bernoulli Naive Bayes: Applicable when features are binary, such as presence or
absence.
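These three variants correspond directly to estimators in scikit-learn. The toy feature matrices below are invented purely to show which kind of data each variant expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])     # e.g. word counts
X_real = np.array([[1.2, 0.7], [2.1, 1.9], [0.3, 0.4]])    # continuous features
X_binary = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])     # presence/absence
y = np.array([0, 1, 0])

MultinomialNB().fit(X_counts, y)  # word-count style features
GaussianNB().fit(X_real, y)       # continuous (assumed Gaussian) features
BernoulliNB().fit(X_binary, y)    # binary features
```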
• Classification: To classify a new data point, the classifier calculates the posterior
probabilities for each class and selects the class with the highest probability.
• Works Well with Small Datasets: It can perform reasonably well even with limited
training data.
• Interpretable: The results are easy to interpret, as the classifier provides the probability
of belonging to each class.
• Sensitivity to Feature Distribution: It may not perform well when features have com-
plex, non-Gaussian distributions.
• Requires Sufficient Data: For some cases, Naive Bayes might not perform well when
there is a scarcity of data.
• Zero Probability Problem: If a feature-class combination does not exist in the training
data, the probability will be zero, causing issues. Smoothing techniques are often used
to address this.
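A minimal sketch of add-one (Laplace) smoothing, assuming a simple count-based estimate of P(feature value | class); in scikit-learn this corresponds to the alpha parameter of the Naive Bayes estimators.

```python
def smoothed_probability(count_feature_and_class, count_class, n_feature_values, alpha=1.0):
    """P(feature value | class) with additive (Laplace) smoothing."""
    # Pretend every feature value was seen alpha more times, so an unseen
    # feature/class combination never gets probability exactly zero.
    return (count_feature_and_class + alpha) / (count_class + alpha * n_feature_values)

# Unseen combination: the raw estimate would be 0 / 10 = 0.
print(smoothed_probability(0, 10, n_feature_values=3))  # ~0.077 instead of 0
```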
3 Decision Trees
• A decision tree is a simple model for supervised classification. It is used for classifying
a single discrete target feature.
• Each internal node performs a Boolean test on an input feature (in general, a test may
have more than two options, but these can be converted to a series of Boolean tests).
The edges are labeled with the values of that input feature.
• Classifying an example using a decision tree is very intuitive. We traverse down the
tree, evaluating each test and following the corresponding edge. When a leaf is reached,
we return the classification on that leaf.
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
The tree is a graphical representation for obtaining all the possible solutions to a
problem/decision based on the given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
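A short sketch of building such a tree with scikit-learn's DecisionTreeClassifier, whose implementation is CART-based; the dataset and hyperparameters are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Each internal node is a test on one feature; each leaf holds a class.
print(export_text(tree))
print(tree.predict(X[:1]))  # classify one example by traversing the tree
```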
3.1 Terminologies
• Root Node: The topmost node of the tree; it represents the first choice or feature
from which the tree branches out.
• Internal Nodes (Decision Nodes): Nodes where a decision is made based on the
value of a particular attribute; these nodes have branches that lead to other nodes.
• Leaf Nodes (Terminal Nodes): The ends of the branches, where the final decisions
or predictions are made. Leaf nodes have no further branches.
• Branches (Edges): Links between nodes that represent the outcome of a decision,
i.e. which path is followed for a particular condition.
• Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of
data.
• Parent Node: A node that is split into child nodes. The original node from which a
split originates.
• Decision Criterion: The rule or condition used to determine how the data should
be split at a decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve
its generalization and prevent overfitting.
To decide which feature to split on at each node, a decision tree uses an attribute
selection measure. The two most common measures are:
1. Information Gain
2. Gini Index
• The Gini index is a measure of impurity (or purity) used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm. For a node it is computed
as Gini = 1 − Σj pj², where pj is the proportion of class j at that node.
• An attribute with a low Gini index should be preferred over one with a high
Gini index.
• The CART algorithm uses the Gini index to create binary splits; it only creates
binary splits.
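A minimal sketch of how the Gini index, and entropy for information gain, can be computed for a candidate split; the class labels and the split itself are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_j p_j^2 over the class proportions p_j."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_j p_j log2(p_j), used for information gain."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Made-up parent node and a candidate binary split of its samples.
parent = np.array(["yes"] * 6 + ["no"] * 4)
left, right = parent[:5], parent[5:]

weighted_gini = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
info_gain = entropy(parent) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(round(gini(parent), 3), round(weighted_gini, 3), round(info_gain, 3))
```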
• Feature Selection: They can automatically select the most important features, reducing
the need for feature engineering.
• Versatility: Decision Trees can handle both categorical and numerical data.
• Efficiency: They are relatively efficient during prediction; the prediction time is
proportional to the depth of the tree, which is typically logarithmic in the number of
training points.
• Bias Toward Dominant Classes: In classification tasks, Decision Trees can be biased
toward dominant classes, leading to imbalanced predictions.
• Instability: Small variations in the data can lead to different tree structures, making
them unstable models.
• Greedy Algorithm: Decision Trees use a greedy algorithm, making locally optimal
decisions at each node, which may not lead to the globally optimal tree structure.
4 Support Vector Machines
• Support Vector Machine (SVM) is a supervised learning algorithm that can be used for
both classification and regression, though it is mostly used for classification.
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes, so that we can easily put a new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
• SVMs pick the best separating hyperplane according to some criterion, e.g. the maximum
margin.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine.
Consider two different categories of points that are separated by a decision boundary,
or hyperplane.
Here are the key concepts and characteristics of Support Vector Machines:
• In a binary classification problem, an SVM finds a hyperplane that best separates the
data points of different classes. This hyperplane is the decision boundary.
• The dimensionality of the hyperplane depends on the number of features in the dataset:
with 2 features the hyperplane is a straight line, and with 3 features it is a
2-dimensional plane.
We always create the hyperplane with the maximum margin, i.e. the maximum
distance between the hyperplane and the nearest data points of each class.
• Support Vectors: The data points or vectors that are closest to the hyperplane
and affect its position are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors. They are critical for defining the margin
and determining the location of the hyperplane.
• Margin: The margin is the distance between the support vectors and the decision
boundary. SVM aims to maximize this margin because a larger margin often leads to
better generalization.
• C Parameter: The regularization parameter "C" controls the trade-off between maxi-
mizing the margin and minimizing the classification error. A smaller "C" value results
in a larger margin but may allow some misclassifications, while a larger "C" value
allows for fewer misclassifications but a smaller margin.
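The sketch below illustrates this trade-off by training a linear SVM with different values of C and counting the support vectors; the dataset and the C values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
# A smaller C tolerates more margin violations and typically leaves more
# support vectors; a larger C fits the training data more tightly.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```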
• Multi-Class Classification: SVMs are inherently binary classifiers, but they can be
extended to handle multi-class classification using techniques like one-vs-one (OvO) or
one-vs-all (OvA) classification.
• The Scalar Product: The scalar or dot product is, in some sense, a measure of
similarity: a · b = |a| |b| cos(θ)
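A tiny numeric check of this relationship (the vectors are arbitrary):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = a @ b                                    # scalar product a . b
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cos_theta)                          # 24.0 and 0.96: similar directions
```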
4.1 Kernels
We may use Kernel functions to implicitly map to a new feature space
• Kernel function: K(x1, x2) ∈ R
• A kernel must be equivalent to an inner product in some feature space, i.e.
K(x1, x2) = φ(x1) · φ(x2) for some mapping φ.
• Kernel Trick: SVM can handle non-linearly separable data by using a kernel function
to map the data into a higher-dimensional space where it becomes linearly separable.
Common kernel functions include linear, polynomial, radial basis function (RBF), and
sigmoid kernels.
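As a sketch of the "inner product in some feature space" idea: the degree-2 polynomial kernel K(x, z) = (x · z)² in 2-D matches the explicit feature map φ(x) = (x1², √2·x1·x2, x2²). The vectors below are chosen only for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2 -- the same value, without computing phi explicitly."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(phi(x) @ phi(z), poly_kernel(x, z))  # both print 25.0
```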
The working of the SVM algorithm can be understood using an example. Suppose
we have a dataset with two tags (green and blue) and two features, x1 and x2. We
want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue.
Since this is a 2-D space, we can separate the two classes with a straight line, but
there can be multiple lines that separate these classes.
Hence, the SVM algorithm helps find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the points of both
classes that are closest to the boundary; these points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and the
goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
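A short scikit-learn sketch of this idea: fit a linear SVM on toy blob data and inspect the support vectors and the fitted hyperplane (the dataset and parameters are arbitrary).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The points that define the margin and the position of the hyperplane:
print(clf.support_vectors_)
# The hyperplane w . x + b = 0 in 2-D:
print(clf.coef_, clf.intercept_)
```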
SVM can be of two types:
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes using a single straight line, such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified using a straight line, such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
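A sketch comparing a linear and an RBF-kernel SVM on data that is not linearly separable (two concentric circles); the dataset and parameters are arbitrary.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)
rbf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("linear:", linear.score(X_te, y_te), "rbf:", rbf.score(X_te, y_te))
```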
• Robust to Overfitting: SVMs are less prone to overfitting, especially when the margin
is maximized.
• Accurate for Non-Linear Data: The kernel trick allows SVMs to work effectively on
non-linear data by transforming it into higher dimensions.
• Sensitivity to Kernel Choice: The choice of the kernel function and kernel parameters
can significantly impact the SVM’s performance.
• Challenging for Large Datasets: SVMs may not be suitable for very large datasets,
because training time grows quickly with the number of samples.