Classification and Clustering Algorithm Notes
Classification algorithms can be divided into two main categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification
Lazy Learners
A lazy learner first stores the training dataset and waits until the test dataset arrives. Classification is then carried out using the most closely related data in the stored training set. Less time is spent on training, but more time is spent on prediction. Examples include case-based reasoning and the KNN algorithm.
Eager Learners
An eager learner builds a classification model from the training dataset before receiving the test data. More time is spent on training, but less time is spent on prediction.
1. Logistic Regression
2. Naive Bayes
Naive Bayes determines whether a data point falls into a particular category.
It can be used to classify phrases or words in text analysis as either falling
within a predetermined classification or not. It assumes that predictors in a
dataset are independent. This means that it assumes the features are
unrelated to each other. For example, given a banana, the classifier sees that the fruit is yellow, oblong, and tapered. All of these features contribute independently to the probability of it being a banana and are not dependent on each other. Naive Bayes is based on
Bayes’ theorem, which is given as:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where:
P(A|B) is the posterior probability of class A given predictor B,
P(A) is the prior probability of the class,
P(B|A) is the likelihood (the probability of the predictor given the class), and
P(B) is the prior probability of the predictor.
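A minimal sketch of Naive Bayes classification using scikit-learn's GaussianNB; the toy "fruit" features and labels below are invented purely for illustration, not taken from the notes.

```python
# Naive Bayes sketch: each feature is treated as independent given the class.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Rows: [length_cm, width_cm, yellowness 0-1]; labels: 1 = banana, 0 = not banana
X = np.array([
    [18.0, 3.5, 0.9],
    [17.0, 3.0, 0.8],
    [7.0, 7.0, 0.2],
    [8.0, 7.5, 0.3],
])
y = np.array([1, 1, 0, 0])

model = GaussianNB()
model.fit(X, y)

new_fruit = np.array([[19.0, 3.2, 0.85]])
print(model.predict(new_fruit))        # predicted class
print(model.predict_proba(new_fruit))  # class probabilities from Bayes' theorem
```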
3. K-Nearest Neighbors
It estimates how likely a data point is to belong to a group based on which group the data points closest to it belong to. When using k-NN for classification, you classify a point according to the classes of its nearest neighbors.
Given a point whose class we do not know, we can try to understand which
points in our feature space are closest to it. These points are the k-nearest
neighbors. Since similar things occupy similar places in feature space, it’s
very likely that the point belongs to the same class as its neighbors. Based on
that, it’s possible to classify a new point as belonging to one class or another.
There are a few rules of thumb, as well as more advanced methods, for selecting k. A common starting point is the square root of the total number of samples in the training dataset; an error plot or accuracy plot can then be used to find the most favorable K value. KNN performs well with multi-class data, but if the data contains many outliers it can fail, and you’ll need to use other methods.
Begin with k=1, then perform cross-validation (5- to 10-fold is common practice, as it provides a good balance between computational effort and statistical validity), and evaluate the accuracy. Keep repeating the
same steps until you get consistent results. As k goes up, the error usually
decreases, then stabilizes, and then grows again. The optimal k lies at the
beginning of the stable zone.
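The sketch below shows one way to pick k with cross-validation, following the rules of thumb above (k up to roughly the square root of the sample count, 5-fold CV). The Iris dataset and the exact parameter range are illustrative assumptions.

```python
# Choosing k for k-NN via 5-fold cross-validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
max_k = int(np.sqrt(len(X)))  # square-root heuristic for the upper bound

scores = {}
for k in range(1, max_k + 1):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this k
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```

Plotting scores against k would reproduce the accuracy plot described above: the optimal k sits where the curve first stabilizes.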
K-distance is the distance between data points and a given query point. To
calculate it, we have to pick a distance metric.
Some of the most popular metrics are explained below.
Euclidean distance
The Euclidean distance between two points is the length of the straight line
segment connecting them. This most common distance metric is applied to
real-valued vectors.
Manhattan distance
The Manhattan distance between two points is the sum of the absolute
differences between the x and y coordinates of each point. Used to measure
the minimum distance by summing the length of all the intervals needed to get
from one location to another in a city, it’s also known as the taxicab distance.
Minkowski distance
The Minkowski distance is a generalization of the Euclidean and Manhattan distances. It has a parameter p: with p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
Hamming distance
Hamming distance is used to compare two binary vectors (also called data strings or bitstrings); it counts the number of positions at which the corresponding bits differ. To calculate it, the data first has to be translated into a binary system.
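A short sketch computing each of the metrics above by hand with NumPy; the example vectors are arbitrary.

```python
# Distance metrics used as k-distance in k-NN.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(a - b))                  # taxicab distance
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1 -> Manhattan, p=2 -> Euclidean

# Hamming distance: number of positions where two bitstrings differ
u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])
hamming = np.sum(u != v)

print(euclidean, manhattan, minkowski, hamming)
```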
4. Decision Tree
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
To build a decision tree, we need to calculate two types of entropy using frequency tables, as follows:
a) Entropy using the frequency table of one attribute:
E(S) = Σ −pᵢ log₂(pᵢ)
b) Entropy using the frequency table of two attributes:
E(T, X) = Σ P(c) · E(c), summed over each value c of attribute X
Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches):
Gain(T, X) = Entropy(T) − Entropy(T, X)
Step 1: Calculate the entropy of the target.
Step 2: Split the dataset on the different attributes, calculate the entropy for each branch, and subtract it from the entropy before the split; the result is the information gain.
Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
Step 4a: A branch with entropy of 0 is a leaf node.
Step 4b: A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
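A small sketch of the entropy and information-gain calculations ID3 relies on; the tiny weather/play example dataset is an assumption made up for illustration.

```python
# Entropy and information gain, as used by ID3.
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * np.log2(c / total) for c in counts.values())

def information_gain(values, labels):
    """Entropy of the target minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]

print("entropy(play):", round(entropy(play), 3))            # 1.0 (equally divided)
print("gain(play, outlook):", round(information_gain(outlook, play), 3))
```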
5. Random Forest Algorithm
Step 1: Select random samples from the given dataset.
Step 2: Construct a decision tree for each sample and get a prediction result from each tree.
Step 3: Perform a vote for each predicted result.
Step 4: Finally, select the most voted prediction result as the final prediction result.
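A minimal sketch of this workflow using scikit-learn's RandomForestClassifier, which trains each tree on a bootstrap sample and takes the majority vote; the Iris dataset and parameter values are assumptions for illustration.

```python
# Random forest: many trees on random samples, majority-vote prediction.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```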
6. Support Vector Machine (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
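A sketch of a linear SVM learning such a hyperplane with scikit-learn's SVC; the blob data is generated for illustration.

```python
# Linear SVM: learns the maximum-margin hyperplane w·x + b = 0.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X, y)

print("hyperplane weights w:", clf.coef_)
print("intercept b:", clf.intercept_)
print("prediction for a new point:", clf.predict([[1.0, 2.0]]))
```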
2. Confusion Matrix: A table that summarizes a classifier's predictions against the actual labels in terms of true positives, true negatives, false positives, and false negatives.
3. AUC-ROC curve: The ROC curve plots the true positive rate against the false positive rate at different classification thresholds; the AUC (area under the curve) summarizes how well the classifier separates the classes.
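A brief sketch of computing both metrics for a binary classifier; the labels and predicted scores below are illustrative assumptions.

```python
# Confusion matrix and ROC AUC for a binary classifier.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred   = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard class predictions
y_scores = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

# Rows: actual class, columns: predicted class -> [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# Area under the ROC curve (TPR vs. FPR across thresholds)
print("AUC:", roc_auc_score(y_true, y_scores))
```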