QUESTIONS
Naive Bayes classifiers are supervised machine learning algorithms used for
classification tasks. They are based on Bayes' Theorem, which is used to compute class
probabilities. This article gives an overview of Naive Bayes as well as its use and
implementation in machine learning.
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to
classify data based on the probabilities of different classes given the features
of the data. It is mostly used for high-dimensional text classification.
• The Naive Bayes classifier is a simple probabilistic classifier with very few
parameters, so the models it builds can make predictions faster than many other
classification algorithms.
• It is a probabilistic classifier that assumes each feature in the model is
independent of the existence of every other feature. In other words, each feature
contributes to the prediction with no relation to the other features.
• The Naive Bayes algorithm is used in spam filtering, sentiment analysis,
article classification, and many other applications.
Why Is It Called Naive Bayes?
It is called “Naive” because it assumes that the presence of one feature does not
affect the other features.
The “Bayes” part of the name refers to its basis in Bayes’ Theorem.
As an example, consider a fictional dataset that describes the weather conditions for
playing a game of golf. Given the weather conditions, each tuple classifies the
conditions as fit (“Yes”) or unfit (“No”) for playing golf.
Types of Naive Bayes Model
There are three types of Naive Bayes models:
Gaussian Naive Bayes
In Gaussian Naive Bayes, the continuous values associated with each feature are
assumed to be distributed according to a Gaussian distribution. A Gaussian
distribution is also called a normal distribution; when plotted, it gives a bell-shaped
curve that is symmetric about the mean of the feature values.
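For illustration (not part of the original notes), here is a minimal sketch of Gaussian Naive Bayes using scikit-learn; the Iris dataset and the 70/30 split are arbitrary choices for the demo.

```python
# Minimal Gaussian Naive Bayes sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # continuous features suit Gaussian NB
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)            # arbitrary split for the demo

model = GaussianNB()                                  # assumes each feature is normally
model.fit(X_train, y_train)                           # distributed within each class

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```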
Multinomial Naive Bayes
Multinomial Naive Bayes is used when features represent the frequency of
terms (such as word counts) in a document. It is commonly applied in text
classification, where term frequencies are important.
Bernoulli Naive Bayes
Bernoulli Naive Bayes deals with binary features, where each feature
indicates whether a word appears or not in a document. It is suited for
scenarios where the presence or absence of terms is more relevant than their
frequency. Both models are widely used in document classification tasks.
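To make the difference concrete, here is a small, illustrative sketch contrasting MultinomialNB (word counts) and BernoulliNB (word presence/absence) in scikit-learn; the tiny corpus and its spam/ham labels are invented for the example.

```python
# Contrast count-based vs. presence-based Naive Bayes on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["free prize money now", "meeting agenda for monday",
        "win free money", "project status meeting"]   # hypothetical documents
labels = [1, 0, 1, 0]                                  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)                # term-frequency matrix

multinomial = MultinomialNB().fit(counts, labels)      # uses the raw counts
bernoulli = BernoulliNB().fit(counts, labels)          # binarizes counts internally

test = vectorizer.transform(["free money meeting"])
print(multinomial.predict(test), bernoulli.predict(test))
```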
Advantages of Naive Bayes Classifier
• Easy to implement and computationally efficient.
• Effective in cases with a large number of features.
• Performs well even with limited training data.
• It performs well in the presence of categorical features.
• For numerical features, the data is assumed to come from a normal
distribution (as in Gaussian Naive Bayes).
Disadvantages of Naive Bayes Classifier
• Assumes that features are independent, which may not always hold
in real-world data.
• Can be influenced by irrelevant attributes.
• May assign zero probability to unseen events, leading to poor
generalization; this is commonly mitigated with Laplace (additive) smoothing, sketched below.
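As a side note not spelled out above, the zero-probability problem is usually handled with additive (Laplace) smoothing; a common form of the smoothed estimate for a categorical feature value is:

```latex
% Additive (Laplace) smoothing for P(x_i | y):
%   N_{y,x_i} = count of feature value x_i in class y
%   N_y       = total feature count for class y
%   k         = number of distinct values the feature can take
%   \alpha    = smoothing strength (\alpha = 1 gives Laplace smoothing)
P(x_i \mid y) = \frac{N_{y,x_i} + \alpha}{N_y + \alpha k}
```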
Applications of Naive Bayes Classifier
• Spam Email Filtering: Classifies emails as spam or non-spam based
on features.
• Text Classification: Used in sentiment analysis, document
categorization, and topic classification.
• Medical Diagnosis: Helps in predicting the likelihood of a disease
based on symptoms.
• Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
• Weather Prediction: Classifies weather conditions based on various
factors.
2. SVM
Support Vector Machine (SVM) is a supervised machine learning algorithm
used for classification and regression tasks. While it can handle regression
problems, SVM is particularly well-suited for classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to
separate data points into different classes. The algorithm maximizes the
margin between the closest points of different classes.
Support Vector Machine (SVM) Terminology
• Hyperplane: A decision boundary separating different classes in
feature space, represented by the equation wx + b = 0 in linear
classification.
• Support Vectors: The closest data points to the hyperplane, crucial
for determining the hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support
vectors. SVM aims to maximize this margin for better classification
performance.
• Kernel: A function that maps data to a higher-dimensional space,
enabling SVM to handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly
separates the data without misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack
variables, balancing margin maximization and misclassification
penalties when data is not perfectly separable.
• C: A regularization term balancing margin maximization and
misclassification penalties. A higher C value enforces a stricter
penalty for misclassifications.
• Hinge Loss: A loss function penalizing misclassified points or
margin violations, combined with regularization in SVM.
• Dual Problem: Involves solving for Lagrange multipliers associated
with support vectors, facilitating the kernel trick and efficient
computation.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best
separates two classes by maximizing the margin between them. This margin is
the distance from the hyperplane to the nearest data points (support vectors)
on each side.
The best hyperplane, often called the maximum-margin (or hard-margin) hyperplane, is
the one that maximizes the distance between the hyperplane and the nearest data points
from both classes, which ensures a clear separation between the classes.
Now consider a scenario in which one blue ball lies inside the region occupied by the
red balls.
How does SVM classify the data?
The blue ball sitting among the red ones is an outlier of the blue class. The SVM
algorithm can ignore such outliers and still find the hyperplane that maximizes the
margin, which makes SVM robust to outliers.
When the data is not linearly separable in its original space, a new variable can be
created as a function of the existing ones, for example the distance from the origin, so
that the classes become linearly separable in the higher-dimensional space. This is the
intuition behind the kernel trick.
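To illustrate the idea (a sketch, not taken from the original notes), the snippet below maps one-dimensional data that is not linearly separable into two dimensions by adding a squared-distance feature, after which a straight line can separate the classes.

```python
# 1-D data that is not linearly separable: class 1 sits near the origin,
# class 0 sits on both sides of it.
import numpy as np

x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# Add a new feature equal to the squared distance from the origin.
X_mapped = np.column_stack([x, x ** 2])

# In the mapped space, a horizontal threshold such as x^2 < 2 separates the classes.
predictions = (X_mapped[:, 1] < 2).astype(int)
print(predictions)   # matches y: the mapped data is linearly separable
```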
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -
1. We have a training dataset consisting of input feature vectors X and their
corresponding class labels Y.
The equation for the linear hyperplane can be written as:
$$ w^T x + b = 0 $$
Where:
• $w$ is the normal vector to the hyperplane (the direction perpendicular to it).
• $b$ is the offset or bias term, representing the distance of the hyperplane from the
origin along the normal vector $w$.
Distance from a Data Point to the Hyperplane
The distance between a data point x_i and the decision boundary can be
calculated as:
$$ d_i = \frac{w^T x_i + b}{\|w\|} $$
where $\|w\|$ is the Euclidean norm of the weight vector $w$.
Linear SVM Classifier
The class of a data point is predicted as:
$$ \hat{y} = \begin{cases} 1 & \text{if } w^T x + b \ge 0 \\ 0 & \text{if } w^T x + b < 0 \end{cases} $$
where $\hat{y}$ is the predicted label of the data point.
Optimization Problem for SVM
For a linearly separable dataset, the goal is to find the hyperplane that
maximizes the margin between the two classes while ensuring that all data
points are correctly classified. This leads to the following optimization
problem:
$$ \min_{w,\,b} \; \frac{1}{2} \|w\|^2 $$
Subject to the constraint:
$$ y_i (w^T x_i + b) \ge 1 \quad \text{for } i = 1, 2, 3, \dots, m $$
Where:
• $y_i$ is the class label (+1 or −1) for each training instance.
• $x_i$ is the feature vector for the $i$-th training instance.
• $m$ is the total number of training instances.
The condition $y_i (w^T x_i + b) \ge 1$ ensures that each data point is correctly
classified and lies outside the margin.
Soft Margin Linear SVM Classifier
In the presence of outliers or non-separable data, the SVM allows some
misclassification by introducing slack variables $\zeta_i$. The optimization problem
is modified as:
$$ \min_{w,\,b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \zeta_i $$
Subject to the constraints:
$$ y_i (w^T x_i + b) \ge 1 - \zeta_i \quad \text{and} \quad \zeta_i \ge 0 \quad \text{for } i = 1, 2, \dots, m $$
Where:
• $C$ is a regularization parameter that controls the trade-off between margin
maximization and the penalty for misclassifications.
• $\zeta_i$ are slack variables that represent the degree of violation of the margin by
each data point.
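For reference (an equivalent form not written out in the original notes), the soft-margin problem can also be expressed using the hinge loss mentioned in the terminology list above:

```latex
% Soft-margin SVM as regularized hinge-loss minimization (equivalent formulation):
\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{m} \max\!\bigl(0,\; 1 - y_i (w^T x_i + b)\bigr)
```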
Dual Problem for SVM
The dual problem involves maximizing the Lagrange multipliers associated
with the support vectors. This transformation allows solving the SVM
optimization using kernel functions for non-linear classification.
The dual objective function is given by:
$$ \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j t_i t_j K(x_i, x_j) $$
subject to $\sum_{i=1}^{m} \alpha_i t_i = 0$ and $0 \le \alpha_i \le C$.
Where:
• $\alpha_i$ are the Lagrange multipliers associated with the $i$-th training sample.
• $t_i$ is the class label of the $i$-th training sample (+1 or −1).
• $K(x_i, x_j)$ is the kernel function that computes the similarity between data points
$x_i$ and $x_j$. The kernel allows SVM to handle non-linear classification problems by
mapping the data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers $\alpha_i$, and the support
vectors are the training samples with $\alpha_i > 0$.
SVM Decision Boundary
Once the dual problem is solved, the decision function for a test point $x$ is:
$$ f(x) = \sum_{i=1}^{m} \alpha_i t_i K(x_i, x) + b $$
where $x$ is the test data point and $b$ is the bias term; the point is assigned to the
positive class if $f(x) \ge 0$ and to the negative class otherwise.
Finally, the bias term $b$ is determined from the support vectors, which satisfy:
$$ t_i (w^T x_i + b) = 1 \;\Rightarrow\; b = t_i - w^T x_i $$
where $x_i$ is any support vector (in practice, one with $0 < \alpha_i < C$).
This completes the mathematical framework of the Support Vector Machine
algorithm, which allows for both linear and non-linear classification using the
dual problem and kernel trick.
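As an illustration of how these quantities appear in practice (not part of the original notes), scikit-learn's SVC exposes the fitted support vectors and the products $\alpha_i t_i$ through its dual_coef_ attribute; the toy ring-shaped data below is invented for the demo.

```python
# Fit a kernel SVM and inspect the quantities from the dual formulation.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                              # toy 2-D points
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)      # ring-like, non-linear labels

clf = SVC(kernel="rbf", C=1.0)                            # soft-margin SVM with RBF kernel
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("alpha_i * t_i per support vector:", clf.dual_coef_)
print("bias term b:", clf.intercept_)
print("prediction at the origin:", clf.predict([[0.0, 0.0]]))
```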
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines
(SVM) can be divided into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to
separate the data points of different classes. When the data can be
precisely linearly separated, linear SVMs are very suitable. This
means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective
classes. A hyperplane that maximizes the margin between the classes
is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data
when it cannot be separated into two classes by a straight line (in the
case of 2D). By using kernel functions, nonlinear SVMs can handle
nonlinearly separable data. The original input data is transformed by
these kernel functions into a higher-dimensional feature space, where
the data points can be linearly separated. A linear hyperplane found in this
transformed space then corresponds to a non-linear decision boundary in the original
space, as the sketch below illustrates.
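To make the linear versus non-linear distinction concrete, here is a small illustrative comparison (the dataset and parameter choices are arbitrary) of a linear SVM and an RBF-kernel SVM on data that a straight line cannot separate.

```python
# Compare a linear SVM with an RBF-kernel SVM on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # two interleaved half-moons
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# The RBF kernel typically separates the two moons far better than a straight line.
print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:  ", rbf_svm.score(X_test, y_test))
```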
4. KNN
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what is
nearby. Imagine a streaming service wants to predict whether a new user is likely to
cancel their subscription (churn) based on their age. It checks the ages of its existing
users and whether they churned or stayed. If most of the “K” users closest in age to the
new user cancelled their subscription, KNN predicts that the new user is likely to churn
too. The key idea is that users with similar ages tend to behave similarly, and KNN uses
this closeness to make decisions.
Getting Started with K-Nearest Neighbors
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead, it stores the dataset and performs the
computation only at classification time.
As an example, consider a small set of data points described by two features, as in the
sketch below.
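Here is a minimal, illustrative KNN sketch using scikit-learn; the two-feature points, their labels, and the choice of k = 3 are invented for the example.

```python
# Classify a new point by majority vote among its 3 nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: each row is (feature_1, feature_2), with label 0 or 1.
X = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X, y)                               # "lazy": essentially just stores the data

print(knn.predict([[2, 2]]))   # near the first group  -> [0]
print(knn.predict([[7, 7]]))   # near the second group -> [1]
```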
1. Introduction to Clustering
Clustering is an unsupervised machine learning technique used to group similar data points
into clusters. The goal is to ensure that data points within the same cluster are highly
similar to each other, while data points in different clusters are as dissimilar as possible.
2. Types of Clustering Methods
I. Partitioning Methods
These methods divide the dataset into k non-overlapping clusters, where k is user-defined.
1. K-Means Clustering
Algorithm Steps:
1. Randomly select K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose centre it is most similar to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat Steps 2 and 3 until no assignments change.
Output: a dataset partitioned into K clusters. A minimal code sketch of this procedure is given after the advantages and disadvantages below.
Advantages:
• Simple to implement and computationally efficient.
• Scales well to large datasets.
Disadvantages:
• Need to specify k
• Sensitive to outliers
• Assumes equal-sized clusters
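As mentioned above, here is a minimal K-Means sketch with scikit-learn (illustrative; the blob data and k = 3 are arbitrary choices).

```python
# Cluster synthetic data into k = 3 groups with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # toy data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)      # k must be chosen by the user
labels = kmeans.fit_predict(X)                                  # reassign/update until convergence

print("cluster centres:\n", kmeans.cluster_centers_)
print("first ten labels:", labels[:10])
```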
2. K-Medoids (PAM)
Similar to K-Means, but uses actual data points (medoids) as cluster centres, which makes it more robust to outliers.
II. Hierarchical Methods
These methods build a tree-like structure (dendrogram) showing how data points are merged or split at each level.
Types:
1. Agglomerative (bottom-up):
o Start with each data point as a single cluster.
o Iteratively merge the closest clusters.
2. Divisive (top-down):
o Start with all data points in one cluster.
o Recursively split into smaller clusters.
Linkage Criteria:
Advantages:
• No need to specify k
• Dendrogram gives detailed cluster structure
Disadvantages:
• Computationally expensive
• Not suitable for very large datasets
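A minimal sketch of the bottom-up (agglomerative) approach described above, using scikit-learn (the toy points and the choice of two clusters are arbitrary):

```python
# Bottom-up (agglomerative) hierarchical clustering on a tiny dataset.
from sklearn.cluster import AgglomerativeClustering

X = [[1.0, 1.0], [1.5, 1.0], [1.0, 1.2], [5.0, 5.0], [5.5, 5.0], [5.0, 5.3]]   # toy points

# Ward linkage merges, at each step, the pair of clusters that least increases total variance.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(agg.fit_predict(X))   # e.g. [0 0 0 1 1 1]: two groups of nearby points
```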
III. Density-Based Methods
These methods group data based on regions of high density separated by low-density areas.
Parameters:
Disadvantages:
IV. Grid-Based Methods
These methods divide the data space into a finite number of cells to form a grid and perform
clustering on these grids.
Advantages:
• Fast processing
• Suitable for large databases
Disadvantages:
V. Model-Based Clustering
These methods assume the data is generated by a mixture of underlying probability distributions (for example, Gaussian mixture models) and fit such a model to the data.
Advantages:
Disadvantages:
• Computationally expensive
• May overfit with many components
3. Evaluation of Clustering
• Silhouette Coefficient: Measures how similar an object is to its own cluster vs.
others.
• Davies-Bouldin Index: average similarity between each cluster and its most similar cluster; lower values indicate better clustering.
• Dunn Index: ratio of the smallest inter-cluster distance to the largest intra-cluster distance; higher values indicate better clustering.
• Intra-cluster distance (low) vs Inter-cluster distance (high)
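As an illustrative check of cluster quality (not part of the original notes), two of these measures are available directly in scikit-learn:

```python
# Score a clustering with the silhouette coefficient and the Davies-Bouldin index.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("silhouette coefficient:", silhouette_score(X, labels))      # closer to +1 is better
print("Davies-Bouldin index:  ", davies_bouldin_score(X, labels))  # lower is better
```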
4. Applications of Clustering
5. Advantages of Clustering
6. Disadvantages of Clustering