Unit 5-6
Naive Bayes:
The Naive Bayes algorithm is a simple and powerful probabilistic machine learning classifier based on
Bayes' Theorem. It is particularly well-suited for classification tasks, especially in text classification, spam
detection, sentiment analysis, and recommendation systems. "Naive" refers to the assumption that features
are independent of each other, given the class label, which rarely holds true in real-world data but often
works surprisingly well in practice.
The Naive Bayes algorithm is based on Bayes' Theorem, which describes the probability of an event occurring given prior knowledge of conditions related to that event. It provides a way of calculating the probability of a hypothesis given the observed evidence. The formula for Bayes' Theorem is:
P(A | B) = [P(B | A) P(A)] / P(B)
where P(A | B) is the posterior probability of hypothesis A given evidence B, P(B | A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is the probability of the evidence.
Consider the problem of deciding whether to play golf, where the only predictor is Humidity (a numeric attribute) and Play Golf? is the target. Using the formula above, we can calculate the posterior probability for each class if we model Humidity within each class as a Gaussian, i.e. if we know its mean and standard deviation for that class.
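To make this concrete, the sketch below shows how the likelihood P(Humidity | Play Golf?) could be computed under a Gaussian assumption; the mean and standard deviation values are purely illustrative, not taken from a real table.

import math

def gaussian_likelihood(x, mean, std):
    # P(x | class) under a normal distribution with the class's mean and standard deviation
    coeff = 1.0 / (math.sqrt(2 * math.pi) * std)
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# Illustrative per-class statistics for Humidity (made-up numbers)
print(gaussian_likelihood(74, mean=79.0, std=10.2))   # P(Humidity=74 | Play=Yes)
print(gaussian_likelihood(74, mean=86.0, std=9.7))    # P(Humidity=74 | Play=No)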
The Zero-Frequency Problem:
One of the disadvantages of Naïve Bayes is that if a class label and a certain attribute value never occur together in the training data, the frequency-based probability estimate for that combination will be zero. Because all the individual probabilities are multiplied together, this single zero makes the whole posterior for that class zero.
An approach to overcoming this ‘zero-frequency problem’ in a Bayesian setting is Laplace (add-one) smoothing: add one to the count of every attribute value-class combination whenever an attribute value does not occur with every class value.
For example, if in your training data a value such as ‘Overcast’ never appears together with the class ‘No’, you should add one to every count in the frequency table before converting the counts into probabilities, so that no estimate is exactly zero.
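The short sketch below (a hand-rolled illustration, not library code) shows add-one smoothing in action: in the toy data the value 'Overcast' never occurs with class 'No', yet its smoothed probability estimate stays non-zero.

from collections import Counter

def smoothed_likelihood(value, cls, rows, alpha=1):
    # P(attribute = value | class) with add-one (Laplace) smoothing;
    # rows is a list of (attribute_value, class_label) pairs
    distinct_values = {v for v, _ in rows}
    counts = Counter(rows)
    class_total = sum(1 for _, c in rows if c == cls)
    return (counts[(value, cls)] + alpha) / (class_total + alpha * len(distinct_values))

# Toy training data: (Outlook, Play Golf?)
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"), ("Rainy", "No")]
print(smoothed_likelihood("Overcast", "No", data))   # (0 + 1) / (3 + 3) ≈ 0.17 instead of 0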
The Naive Bayes Algorithm is used for various real-world problems like those below:
• Text classification: The Naive Bayes Algorithm is used as a probabilistic learning technique for text
classification. It is one of the best-known algorithms used for document classification of one or many
classes.
• Sentiment analysis: The Naive Bayes Algorithm is used to analyse sentiments or feelings, whether
positive, neutral, or negative.
• Recommendation systems: The Naive Bayes Algorithm, combined with collaborative filtering, is used to build hybrid recommendation systems that help predict whether a user will like a given resource.
• Spam filtering: It is also similar to the text classification process. It is popular for helping you
determine if the mail you receive is spam.
• Medical diagnosis: This algorithm is used in medical diagnosis and helps you to predict the patient’s
risk level for certain diseases.
• Weather prediction: You can use this algorithm to predict whether the weather will be good.
• Face recognition: This helps you identify faces.
Naïve Bayes Example:
Consider a dataset of cars described by the features Color, Type, and Origin, with the target variable Stolen? indicating whether the car was stolen. Concerning this dataset, the assumptions made by the algorithm can be understood as:
• We assume that no pair of features are dependent. For example, the color being ‘Red’ has nothing to
do with the Type or the Origin of the car. Hence, the features are assumed to be Independent.
• Secondly, each feature is given the same influence (or importance). For example, knowing only the Color and the Type cannot predict the outcome perfectly, so no attribute is irrelevant and all of them are assumed to contribute equally to the outcome.
Note: The assumptions made by Naïve Bayes are generally not correct in real-world situations. The
independence assumption is never correct but often works well in practice.
The variable y is the class variable (Stolen?), which represents whether the car is stolen or not given the conditions. The variable X represents the parameters/features.
X is given as:
X = (x1, x2, x3, ..., xn)
Here x1, x2, ..., xn represent the features, i.e. they can be mapped to Color, Type, and Origin. By substituting for X and expanding using the chain rule, we get:
P(y | x1, ..., xn) = [P(x1 | y) P(x2 | y) ... P(xn | y) P(y)] / [P(x1) P(x2) ... P(xn)]
Now, you can obtain the value of each term by looking at the dataset and substituting it into the equation. For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and proportionality introduced:
P(y | x1, ..., xn) ∝ P(y) P(x1 | y) P(x2 | y) ... P(xn | y)
In our case, the class variable y has only two outcomes, Yes or No, though in general the classification could be multiclass. Either way, we have to find the class value y with the maximum posterior probability:
y = argmax_y [ P(y) P(x1 | y) P(x2 | y) ... P(xn | y) ]
Using the above function, we can obtain the class, given the predictors/features.
Frequency and likelihood tables are then built for all three predictors (Color, Type, and Origin): for each predictor we count how often each value occurs with each class and divide by the class totals to obtain the likelihoods.
As per the equations discussed above, we can then calculate the posterior probabilities P(Yes | X) and P(No | X) for a given X by multiplying the corresponding likelihoods from these tables with the class priors P(Yes) and P(No); the class with the larger value is the prediction.
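A minimal scikit-learn sketch of this worked example is shown below; the rows are made-up and merely stand in for the Color / Type / Origin table, so the printed probabilities are only illustrative.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Made-up rows in the spirit of the Color / Type / Origin dataset
X_raw = np.array([["Red", "Sports", "Domestic"],
                  ["Red", "Sports", "Domestic"],
                  ["Yellow", "Sports", "Imported"],
                  ["Yellow", "SUV", "Imported"],
                  ["Red", "SUV", "Imported"],
                  ["Yellow", "SUV", "Domestic"]])
y = np.array(["Yes", "No", "Yes", "No", "No", "Yes"])   # Stolen?

encoder = OrdinalEncoder()                 # CategoricalNB expects integer-coded categories
X = encoder.fit_transform(X_raw)
model = CategoricalNB(alpha=1.0)           # alpha=1.0 is the Laplace smoothing discussed earlier
model.fit(X, y)

query = encoder.transform([["Red", "SUV", "Domestic"]])
print(model.predict(query))                # class with the larger posterior
print(model.predict_proba(query))          # P(No | X) and P(Yes | X)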
K-Nearest Neighbors (KNN):
The K-Nearest Neighbors (KNN) algorithm is a simple, supervised machine learning technique used for both classification and regression tasks. KNN is a non-parametric, instance-based learning algorithm, which means it does not make any assumptions about the data distribution and stores all available cases. When it makes predictions, it uses the similarity (distance) between data points to classify or predict the target value.
In KNN:
• We choose a value for K, the number of neighbors to consider.
• For classification, KNN assigns a class based on the majority class among the K nearest neighbors.
• For regression, KNN assigns an average value from the K nearest neighbors.
KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm and is also known as a lazy learning algorithm, because it defers all computation until prediction time. It works by calculating the distance of a test observation from all the observations in the training dataset and then finding its K nearest neighbours.
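As a rough sketch of how this looks in practice (scikit-learn is used here, and the built-in iris dataset is assumed purely as a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features: KNN is distance-based, so features should be on comparable scales
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# K = 5 neighbours; the majority class among them becomes the prediction
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))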
Distance Metrics:
As we know, the KNN algorithm helps us identify the nearest points or groups for a query point. But to determine the closest groups or the nearest points for a query point, we need some metric. For this purpose, we use the distance metrics below:
Minkowski Distance –
It is a metric intended for real-valued vector spaces. We can calculate the Minkowski distance only in a normed vector space, that is, a space in which every vector has a length and lengths cannot be negative. For two points x = (x1, ..., xn) and y = (y1, ..., yn), the Minkowski distance is defined as:
D(x, y) = ( |x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p )^(1/p)
The p value in the formula can be changed to give us different distances, such as:
• p = 1, when p is set to 1 we get Manhattan distance
• p = 2, when p is set to 2 we get Euclidean distance
Manhattan Distance –
This distance is also known as the taxicab distance or city block distance because of the way it is calculated: the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
As we know, we get the formula for Manhattan distance by substituting p = 1 in the Minkowski distance formula:
D(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
Suppose we have two points, the red point (4, 4) and the green point (1, 1).
We will get, d = |4-1| + |4-1| = 6
This distance is preferred over Euclidean distance when we have a case of high dimensionality.
Euclidean Distance –
This is the most widely used distance, as it is the default metric that the scikit-learn library in Python uses for K-Nearest Neighbours. It is a measure of the true straight-line distance between two points in Euclidean space:
D(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
Now suppose we again take the red point (4, 4) and the green point (1, 1) and calculate the distance between them using the Euclidean metric.
We will get, d = sqrt((4 - 1)^2 + (4 - 1)^2) = sqrt(18) ≈ 4.24
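These formulas can be checked quickly in Python; the small helper below is only an illustration of the Minkowski formula with p = 1 and p = 2.

from math import dist   # built-in Euclidean distance (Python 3.8+)

def minkowski(p1, p2, p):
    # (sum of |xi - yi|^p) ** (1/p)
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

red, green = (4, 4), (1, 1)
print(minkowski(red, green, p=1))   # Manhattan: |4-1| + |4-1| = 6
print(minkowski(red, green, p=2))   # Euclidean: sqrt(9 + 9) ≈ 4.24
print(dist(red, green))             # same Euclidean result from the standard library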
Cross-Validation:
Cross-validation is a technique used to assess the performance of a machine learning model by dividing the
dataset into multiple subsets, or "folds," to ensure that the model generalizes well to unseen data. It’s a
reliable way to test model accuracy and stability, particularly in cases where data is limited.
Cross-validation helps:
• Avoid overfitting by ensuring the model is evaluated on data it hasn't seen.
• Provide a more robust estimate of model performance compared to a single train-test split.
• Enable tuning hyperparameters (such as K in KNN) based on average performance across folds.
Types of Cross-Validation
1. k-Fold Cross-Validation
o The dataset is split into k equal-sized folds.
o The model is trained on k-1 folds and tested on the remaining fold.
o This process repeats k times, with each fold serving as the test set once.
o The final performance is the average of the scores across all folds.
For example, in 5-fold cross-validation, the data is split into 5 parts; each time, 4 parts are used for
training and 1 part for testing, until every part has served as the test set.
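In code, this can be sketched with scikit-learn's cross_val_score; the iris dataset below is only a stand-in, and the loop doubles as a simple way to compare candidate values of K for KNN.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

# 5-fold cross-validation: each candidate K is scored as the average accuracy over the 5 folds
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")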
Optimal K:
Optimal K balances model bias and variance, making the model more generalizable to new data.
Curse of Dimensionality:
The curse of dimensionality refers to the various challenges that arise when working with high-dimensional data (i.e., data with a large number of features or dimensions). This concept is particularly relevant for algorithms like K-Nearest Neighbors (KNN), which rely on distance metrics to make predictions. As the number of dimensions increases, the volume of the feature space grows exponentially and data points become sparse, so distances between points become less informative and several issues arise.
Handling Categorical Data:
Handling categorical data is an essential step in data preprocessing, especially for machine learning algorithms that require numerical input, such as K-Nearest Neighbors (KNN) and many other models. Categorical data consists of values that represent different categories or groups, which can be nominal (no order) or ordinal (with an order). Here's how you can handle categorical data effectively; a short code sketch follows the list below.
A. One-Hot Encoding
• Best for: Nominal categories with no inherent order (e.g., color: red, blue, green).
• Description: One-hot encoding creates binary columns for each category. For example, a “color”
column with values “red,” “blue,” and “green” would be converted into three binary columns:
color_red, color_blue, and color_green.
B. Label Encoding
• Best for: Ordinal categories with a meaningful order (e.g., rating: low, medium, high).
• Description: Label encoding assigns an integer to each category based on its order. For example, the
categories “low,” “medium,” and “high” could be encoded as 0, 1, and 2, respectively.
C. Target Encoding
• Best for: High-cardinality categorical features in regression problems.
• Description: Target encoding replaces categories with the mean value of the target variable for each
category. It is popular in cases where categorical variables have many unique values (e.g., zip codes).
• Implementation: Target encoding is not directly available in Scikit-Learn but can be implemented
using libraries like Category Encoders.
D. Binary Encoding
• Best for: High-cardinality categorical features in both classification and regression.
• Description: Binary encoding converts categories into binary digits, which can be efficient for
categorical variables with a large number of levels. This method reduces dimensionality more
effectively than one-hot encoding.
• Implementation: Also available in the Category Encoders library.
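A small sketch of the two most common cases (one-hot encoding for a nominal column, ordinal/label encoding for an ordered column), using pandas and scikit-learn on made-up data:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "red"],        # nominal
                   "rating": ["low", "high", "medium", "medium"]})  # ordinal

# One-hot encoding: one binary column per colour category
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Ordinal encoding with the category order made explicit: low -> 0, medium -> 1, high -> 2
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit_transform(df[["rating"]])

print(onehot)
print(ordinal)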
Advantages of KNN:
• KNN is known for its simplicity, comprehensibility, and scalability. Learning and implementation are extremely simple and intuitive.
• It is easy to interpret. The mathematical computations are easy to comprehend and understand.
• The calculation time for training is small, since the algorithm simply stores the training data.
• Its predictive power is high, which makes it effective and efficient.
• KNN is very effective for large training sets.
• It is very useful for nonlinear data because the algorithm makes no assumptions about the data.
• It is a versatile algorithm, as we can use it for both classification and regression.
• It has relatively high accuracy.
Disadvantages of KNN:
• KNN can be expensive when the dataset is large, both in determining K and in making predictions, and it requires more memory than eager classifiers and many other supervised learning algorithms.
• In KNN, the prediction phase is slow for larger datasets, and the computation of accurate distances plays a big role in determining the algorithm's accuracy.
• One of the major steps in KNN is determining the parameter K. Sometimes it is also unclear which type of distance to use and which features will give the best result.
• It is very sensitive to the scale of the data and to irrelevant features. Irrelevant or correlated features have a high impact and must be eliminated.
SVM:
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used for classification problems, such as text classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the optimal hyperplane that best separates the two classes.
Support vectors are simply the coordinates of the individual observations that lie closest to the separating frontier, and the SVM classifier is that frontier, i.e. the hyperplane (a line in two dimensions) that best segregates the two classes.
• Support Vectors: Support vectors are the data points that are closest to the hyperplane. These points
are critical because they determine the position and orientation of the hyperplane. If you remove a
support vector, it can change the hyperplane’s position.
1. Linear SVM
Linear SVM is used when the data is linearly separable, which means that the classes can be separated
with a straight line (in 2D) or a flat plane (in 3D). The SVM algorithm finds the hyperplane that best
divides the data into classes.
2. Non-Linear SVM
Non-Linear SVM is used when the data is not linearly separable. In such cases, SVM employs kernel
functions to transform the data into a higher-dimensional space where a linear separation is possible. The
algorithm then finds the optimal hyperplane in this new space.
The C parameter plays a crucial role in controlling the trade-off between margin size and classification
accuracy:
• When C is large:
o The penalty for misclassification is high.
o The model will prioritize correct classification of every training point, resulting in a smaller
margin and potentially overfitting to noise in the data (especially in cases where there are
outliers or overlapping classes).
o The model becomes less tolerant to errors.
• When C is small:
o The penalty for misclassification is low.
o The model will allow more misclassifications to obtain a larger margin between classes,
which can lead to better generalization.
o The model becomes more tolerant to errors.
Thus, C is a hyperparameter that allows the user to control the complexity of the model, as the short sketch after this list illustrates:
• Small C: More regularization, wider margin, but more possible misclassifications.
• Large C: Less regularization, narrower margin, fewer misclassifications but more likely to overfit.
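A quick way to see this trade-off is to cross-validate the same SVM with different C values; the sketch below uses a synthetic, slightly noisy dataset purely for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, slightly noisy data purely for illustration
X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

# Compare a heavily regularized model (small C) with a barely regularized one (large C)
for C in [0.01, 1, 100]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")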
Kernels are functions that take a low-dimensional input space and transform it into a higher-dimensional space. SVM can create complex decision boundaries by using kernel functions: a kernel lets us find a hyperplane in the higher-dimensional space without explicitly computing the transformation, and hence without increasing the computational cost.
Types are as follows:
1. Linear Kernel
K(x, y) = x · y
This is simply the dot product of the two input vectors, used when the data is (approximately) linearly separable.
2. Polynomial Kernel
K(x, y) = (x · y + c)^d
Where c is a constant, and d is the degree of the polynomial. This kernel is useful for classifying data with polynomial relationships.
3. RBF (Gaussian) Kernel
K(x, y) = exp(-γ ||x - y||^2)
Where γ is a parameter that defines the influence of a single training example. This is one of the most popular kernels for non-linear data.
4. Sigmoid Kernel
K(x, y) = tanh(α (x · y) + c)
Where α and c are kernel parameters. It behaves like a neural network’s activation function.
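The kernels can be compared directly through scikit-learn's SVC, whose degree, gamma, and coef0 parameters correspond to the d, γ, and c/α constants above; the two-moons data below is just a toy, non-linearly-separable example.

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: a classic non-linearly-separable toy problem
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

kernels = {
    "linear": SVC(kernel="linear"),
    "poly (d=3)": SVC(kernel="poly", degree=3, coef0=1),    # coef0 plays the role of c
    "rbf": SVC(kernel="rbf", gamma=0.5),                    # gamma controls each point's reach
    "sigmoid": SVC(kernel="sigmoid", gamma=0.1, coef0=0.0),
}
for name, model in kernels.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())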
• Support vectors are the data points that lie closest to the decision boundary (hyperplane) and are
critical in determining the margin between classes. In a sense, these points can be considered the
"landmarks" of the SVM model because they define the decision boundary.
• The role of these points is significant because the SVM maximizes the margin (distance between the
hyperplane and the nearest data points) to improve generalization.
While in SVM the concept of choosing landmark points typically relates to identifying support vectors, there
are some strategies you could consider for selecting them or for visualizing their importance:
• In Linear SVM:
o Landmark points are easy to visualize in 2D or 3D by plotting the support vectors and the
hyperplane. These points lie at the edges of the margin and directly affect the boundary
between classes.
• In Non-linear SVM:
o Landmark points are less directly interpretable due to the transformation of the data into a
higher-dimensional feature space using kernels (e.g., RBF, polynomial). However, they still
correspond to the support vectors that are closest to the decision boundary in this transformed
space.
Similarity Function
A similarity function in SVM is a measure that quantifies how similar two data points are in terms of their
features. The main goal of using similarity functions (via kernel functions) in SVM is to implicitly map the
input data into a higher-dimensional space without explicitly calculating the transformation. This is done
using kernel tricks.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)   # the original dataset is not shown; iris is assumed as a stand-in
X_scaled = StandardScaler().fit_transform(X)   # SVMs are sensitive to feature scale
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Optional: Visualize the decision boundary for the first two features (2D visualization)
X_train_2d = X_train[:, :2]
X_test_2d = X_test[:, :2]
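Continuing the split above, a minimal sketch of fitting and evaluating an RBF-kernel classifier might look like this (the parameter values are arbitrary defaults, not tuned):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm_clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel with default regularization
svm_clf.fit(X_train, y_train)                       # uses the scaled training split from above

y_pred = svm_clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print("Support vectors per class:", svm_clf.n_support_)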
Support Vector Machines (SVM) are widely used in machine learning for classification problems, but they
can also be applied to regression problems through Support Vector Regression (SVR). SVR uses the same
principles as SVM but focuses on predicting continuous outputs rather than classifying data points. This
section explores how SVR works, emphasizing key concepts such as the polynomial (e.g. quadratic), radial basis function, and sigmoid kernels. By leveraging these kernels, SVR can effectively handle complex, non-linear relationships in data.
The SVR model, unlike typical regression models, employs support vector machines (SVMs) principles to
transform input features into high-dimensional spaces to locate the ideal hyperplane that accurately
represents the data. This method enables support vector regression (SVR) to effectively manage both linear
and non-linear relationships, rendering it a versatile tool across different fields, such as financial forecasting
and scientific research.
The problem of regression is to find a function that approximates the mapping from an input domain to real numbers based on a training sample. So, let’s dive deep and understand how SVR actually works.
Consider a hyperplane (the green line in the usual illustration) with two boundary lines (the red lines) drawn parallel to it on either side; these boundary lines form the decision boundary. When we move on with SVR in Machine Learning, our objective is to consider the points that lie within this decision boundary. Our best-fit line is the hyperplane that has the maximum number of points within it.
The first thing to understand is the decision boundary itself. Consider the boundary lines as being at some distance, say ‘a’, from the hyperplane: these are the lines we draw at distance ‘+a’ and ‘-a’ from the hyperplane. This ‘a’ is usually referred to as epsilon.
If the equation of the hyperplane is Y = wx + b, then the two boundary lines are:
wx + b = +a
wx + b = -a
Thus, any hyperplane that satisfies our SVM for Regression model should satisfy:
-a < Y - (wx + b) < +a
that is, the points we fit should lie within an epsilon-wide tube around the hyperplane.
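A minimal SVR sketch with scikit-learn is shown below; epsilon plays the role of ‘a’ (the half-width of the tube), and the sine-shaped data is synthetic, generated only for illustration.

import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression data (for illustration only)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# epsilon sets the width of the tube around the hyperplane; errors inside it are not penalized
svr = SVR(kernel="rbf", C=10, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[2.5]]))   # predicted value near sin(2.5)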
Pros:
• It works really well with a clear margin of separation.
• It is effective in high-dimensional spaces.
• It is effective in cases where the number of dimensions is greater than the number of samples.
• It uses a subset of the training set in the decision function (called support vectors), so it is also
memory efficient.
Cons:
• It doesn’t perform well when we have a large data set because the required training time is higher.
• It also doesn’t perform very well when the data set has more noise, i.e., target classes are
overlapping.
• The SVM algorithm doesn’t directly provide probability estimates; when they are needed, they are calculated using an expensive five-fold cross-validation, which is how the related SVC class of the Python scikit-learn library implements this feature.