Unit 3 Data
CLUSTERING
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to one another than to data points in other groups. In other
words, it groups objects on the basis of their similarity and dissimilarity.
For example, data points that lie clustered together in a scatter plot can be classified into one single group.
Clustering Methods:
Density-Based Methods: Density-based clustering is one of the most popular unsupervised
learning methodologies used in model building and machine learning algorithms. These methods
treat clusters as dense regions of points that have some similarity and are separated from other dense
regions by areas of low point density; the data points lying in the sparse regions between two clusters
are considered noise. Density-based methods have good accuracy and the ability to merge two
clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
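As a minimal sketch of density-based clustering (using scikit-learn's DBSCAN on a made-up two-moons dataset, with illustrative parameter values):

```python
# Minimal DBSCAN sketch using scikit-learn (illustrative parameter values).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: dense, non-spherical clusters with a little noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps = neighbourhood radius, min_samples = points needed to form a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_          # cluster id per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", np.sum(labels == -1))
```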
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined
groups. The cluster centers are created in such a way that the distance between the data points of one
cluster and their own centroid is minimal compared with the distance to the other cluster centroids.
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA. In this
algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as a dendrogram. The clusters formed in this method have a tree-type structure based on the
hierarchy, and new clusters are formed using the previously formed ones. It is divided into two categories.
Reference: https://www.kdnuggets.com/2019/09/hierarchical-clustering.html
1. Agglomerative:
Initially, consider every data point as an individual cluster and, at every step, merge the nearest
pair of clusters. (It is a bottom-up method.)
The algorithm for Agglomerative Hierarchical Clustering is:
Step 1: Consider every data point as an individual cluster.
Step 2: Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
Step 3: Merge the clusters that are most similar or closest to each other.
Step 4: Recalculate the proximity matrix for the newly formed clusters.
Step 5: Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
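A brief sketch of agglomerative clustering, assuming SciPy's hierarchical-clustering utilities and a small made-up dataset; `linkage` builds the merge tree that a dendrogram would display:

```python
# Agglomerative clustering + dendrogram sketch using SciPy (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six 2-D points forming two loose groups (made-up values for illustration).
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 9]])

# 'ward' merges the pair of clusters that least increases within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters and print the label of each point.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree when a plotting backend (matplotlib) is available.
```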
2. Divisive:
Divisive Hierarchical clustering is precisely the opposite of Agglomerative Hierarchical clustering.
In Divisive Hierarchical clustering, we start by treating all of the data points as a single cluster, and
in every iteration we separate out the data points that are not similar to the rest of their cluster. In the
end, we are left with N clusters. (It is a top-down method.)
As we have seen, the way the distance between two clusters is measured is crucial for hierarchical
clustering. There are various ways to calculate the distance between two clusters, and these choices
define the rule for clustering. These measures are called linkage methods. Some of the popular linkage
methods are given below, followed by a short code sketch:
1. Single Linkage: The distance between two clusters is defined as the shortest distance between
their closest points.
2. Complete Linkage: The distance between two clusters is the farthest distance between two points
in the two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than
single linkage.
3. Average Linkage: The distance between each pair of points (one from each cluster) is added up
and then divided by the total number of such pairs, giving the average distance between the two
clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: The distance between two clusters is defined as the distance between their
centroids.
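To show how the linkage choice is exposed in practice, here is a sketch assuming scikit-learn's AgglomerativeClustering on toy data (centroid linkage is not offered there, so only single, complete, average, and Ward linkage are compared):

```python
# Comparing linkage methods with scikit-learn's AgglomerativeClustering (toy data).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 9]], dtype=float)

for linkage in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, labels)
```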
Partitioning Methods: These methods partition the objects into k clusters, and each partition
forms one cluster. They optimize an objective criterion (a similarity function), for example one in
which distance is the major parameter. Examples: K-means, CLARANS (Clustering Large
Applications based upon Randomized Search), etc.
K-MEANS
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems
in machine learning and data science. Here K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and
so on. The main aim of this algorithm is to minimize the sum of distances between the data points and
their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in this
algorithm.
The following steps describe the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They may be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
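A minimal NumPy sketch of these steps (random centroid selection, assignment, and centroid update), using made-up toy data:

```python
# Minimal K-means sketch in NumPy, following the steps above (illustrative only).
import numpy as np

def kmeans(X, k=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random points from X as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once the centroids no longer move (no reassignment).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two blobs, one around (0, 0) and one around (5, 5).
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```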
Suppose we have two variables, M1 and M2, whose values we plot on an x-y scatter plot.
o Let's take the number of clusters k, i.e., K=2, to identify the dataset and to put the points into different
clusters. It means here we will try to group these data points into two different clusters.
o We need to choose K random points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So, here we select two points as the K points,
which are not part of our dataset.
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute
this by calculating the distance between the two points, and we draw a median line between the two
centroids.
Points on the left side of this line are nearer to the K1 (blue) centroid, and points to the right of the line
are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose
the new centroids, we compute the center of gravity of the points in each cluster and place the new
centroids there.
o Next, we reassign each data point to its new closest centroid. For this, we repeat the same process of
finding a median line between the centroids.
Now one yellow point falls on the left side of the line, and two blue points fall on the right side of the
line, so these three points will be assigned to the other centroid's cluster.
As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points.
o We repeat the process by finding the center of gravity of each cluster, which gives the new centroids.
o As we have the new centroids, we again draw the median line and reassign the data points.
o Now there are no data points on the wrong side of the line, which means the assignments no longer
change and our model has converged.
As our model is ready, we can now remove the assumed centroids, and we are left with the two final
clusters.
USE CASES OF K-MEANS
Common use cases of K-means include customer segmentation, document clustering, image
compression, and anomaly detection.
CLASSIFICATION
What is the Classification Algorithm?
The Classification algorithm is a supervised learning technique that is used to identify the category of
new observations on the basis of training data. In classification, a program learns from the given dataset
or observations and then classifies each new observation into one of a number of classes or groups, such
as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
The main goal of a classification algorithm is to identify the category of a given data point, and these
algorithms are mainly used to predict the output for categorical data.
Classification can be illustrated with two classes, Class A and Class B: observations within each class
have features that are similar to one another and dissimilar to those of the other class.
The algorithm which implements classification on a dataset is known as a classifier. There are two
types of classification:
o Binary Classifier: If the classification problem has only two possible outcomes, it is called a
binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is called a
multi-class classifier.
Examples: classification of types of crops, classification of types of music.
There are two types of learners in classification problems:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in
the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before
receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning
and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
Classification algorithms can be further divided into mainly two categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification
   o Artificial Neural Networks
Confusion Matrix:
o The confusion matrix provides a matrix/table as output and describes the performance of the
model.
o It is also known as the error matrix.
o The matrix contains the prediction results in a summarized form: the total numbers of correct
predictions and incorrect predictions. For a binary problem, it looks like the table below:
                       Actual: Positive        Actual: Negative
Predicted: Positive    True Positive (TP)      False Positive (FP)
Predicted: Negative    False Negative (FN)     True Negative (TN)
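A short sketch of producing a confusion matrix with scikit-learn; the label vectors below are made up purely for illustration:

```python
# Confusion matrix sketch using scikit-learn (made-up labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

# Rows = actual class, columns = predicted class (for labels [0, 1]).
print(confusion_matrix(y_true, y_pred))
```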
Classification algorithms can be used in many different places. Some popular use cases of
classification algorithms are email spam detection, speech recognition, drug classification,
identification of cancer tumour cells, and biometric identification.
SVM
SVM stands for Support Vector Machine. It is a machine learning approach used for classification
and regression analysis. It is based on supervised learning models trained by learning algorithms that
analyze large amounts of data to identify patterns.
An SVM divides the two categories by a clear gap that should be as wide as possible. It does this
partitioning with a plane called a hyperplane. An SVM chooses the hyperplane that has the largest
margin in a high-dimensional space to separate the given data into classes. The margin is determined
by the distance between the hyperplane and the closest data points of the two classes.
The larger the margin, the lower the generalization error of the classifier.
SVM Algorithm
Separable case – infinitely many boundaries are possible that separate the data into two classes.
Non-separable case – the two classes cannot be separated cleanly; they overlap with each other.
What is a separating hyperplane?
Suppose we select the green hyperplane and use it to classify real-life data. This time it makes some
mistakes, as it wrongly classifies three examples. Intuitively, we can see that if we select a hyperplane
which is close to the data points of one class, it might not generalize well.
So we will try to select a hyperplane as far as possible from the data points of each category.
This one looks better: when we use it with real-life data, it still classifies everything correctly.
The black hyperplane classifies more accurately than the green one.
That's why the objective of an SVM is to find the optimal separating hyperplane: it correctly
classifies the training data, and it is the one that will generalize better to unseen data.
What is the margin, and how does it help in choosing the optimal hyperplane?
Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data
point. Once we have this value, doubling it gives what is called the margin.
Basically, the margin is a no man's land: there will never be any data point inside the margin. (Note:
this can cause some problems when the data is noisy, and this is why the soft-margin classifier will be
introduced later.)
Each candidate hyperplane has its own margin; comparing two of them, for example, Margin B may be
smaller than Margin A, in which case the hyperplane with Margin A is the better choice.
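A minimal sketch of training a linear SVM with scikit-learn on toy, linearly separable data; the C parameter shown is the soft-margin trade-off mentioned above:

```python
# Linear SVM sketch using scikit-learn (toy, linearly separable data).
import numpy as np
from sklearn.svm import SVC

# Two small groups of 2-D points, one per class.
X = np.array([[1, 1], [2, 1], [1, 2],
              [6, 6], [7, 6], [6, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel looks for the maximum-margin separating hyperplane;
# C trades margin width against misclassification (soft margin).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)           # the points that define the margin
print(clf.predict([[2, 2], [6.5, 6.5]]))
```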
NAÏVE BAYES
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is
not a single algorithm but a family of algorithms, all of which share a common principle: every pair
of features being classified is independent of each other. Let us go through some of the simple
concepts of probability that we will use. Consider the following example of tossing two coins. If
we toss two coins and look at all the different possibilities, we have the sample space {HH, HT, TH,
TT}.
While calculating probabilities, we usually denote probability as P. Some of the probabilities in this
experiment are: P(getting two heads) = 1/4, P(at least one tail) = 3/4, P(second coin being head given
the first coin is tail) = 1/2, and P(getting exactly one head) = 1/2.
Bayes' theorem gives us the conditional probability of event A, given that event B has occurred. In
this case, the first coin toss is B and the second coin toss is A. This can feel confusing because we have
reversed the order and go from B to A instead of from A to B.
Naïve Bayes is a classification technique based on applying Bayes' theorem with the strong
assumption that all the predictors are independent of each other. In simple words, the assumption is that
the presence of a feature in a class is independent of the presence of any other feature in the same class.
For example, a phone may be considered smart if it has a touch screen, internet facility, a good
camera, etc. Though all these features depend on each other, they contribute independently to the
probability that the phone is a smart phone.
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it
helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.
The Naïve Bayes algorithm is made up of the two words Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each
feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine the
probability of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
The working of the Naïve Bayes classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day, according to the
weather conditions. To solve this problem, we convert the dataset into frequency tables, generate a
likelihood table, and then apply Bayes' theorem.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of weather conditions:
Weather No Yes
Overcast 0 5 5/14=0.35
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3, P(Yes) = 10/14 = 0.71, P(Sunny) = 5/14 = 0.35
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5, P(No) = 4/14 = 0.29, P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
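The same hand calculation can be reproduced in a few lines of plain Python (counts taken from the frequency table above):

```python
# Reproducing the "Play on a sunny day?" Naive Bayes calculation.
total = 14
yes, no = 10, 4
sunny_yes, sunny_no, sunny_total = 3, 2, 5

p_yes, p_no = yes / total, no / total
p_sunny = sunny_total / total
p_sunny_given_yes = sunny_yes / yes
p_sunny_given_no = sunny_no / no

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # ~0.60 vs ~0.40
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```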
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared with many other algorithms.
o It is the most popular choice for text classification problems.
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn
relationships between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, i.e., deciding which
category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequencies of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present or not in
a document. This model is also well known for document classification tasks.
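A brief sketch of the three variants in scikit-learn, with tiny made-up feature matrices chosen only to illustrate which kind of input each variant expects:

```python
# The three common Naive Bayes variants in scikit-learn (made-up data).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [5.2, 6.0], [4.8, 6.3]])  # continuous features
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 1]])    # word counts
X_bool = (X_counts > 0).astype(int)                                  # word present / absent

print(GaussianNB().fit(X_cont, y).predict([[5.0, 6.1]]))
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))
print(BernoulliNB().fit(X_bool, y).predict([[0, 1, 1]]))
```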
KNN
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and
puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a well-suited
category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for
the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumptions about the
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs an action on it at the time of
classification.
o During the training phase, the KNN algorithm just stores the dataset; when it gets new data, it
classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the new
image that are most similar to the cat and dog images and, based on those features, put it in either
the cat or the dog category.
Suppose there are two categories, Category A and Category B, and we have a new data point x1; to
which of these categories will this data point belong? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
o Firstly, we will choose the number of neighbors; here we choose k=5.
o Next, we will calculate the Euclidean distance between the new point and the existing data
points. The Euclidean distance between two points (x1, y1) and (x2, y2) is:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in
Category A and two nearest neighbors in Category B.
o As the majority of the 5 nearest neighbors (3 of them) are from Category A, the new data point is
assigned to Category A.
How to select the value of K in the K-NN Algorithm?
o There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K are good, but it may find some difficulties.
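A minimal sketch with scikit-learn's KNeighborsClassifier on toy 2-D data, using k=5 as suggested above:

```python
# K-Nearest Neighbours sketch using scikit-learn (toy 2-D data, k=5).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Category A around (1, 1), Category B around (6, 6).
X = np.vstack([np.random.randn(20, 2) + 1, np.random.randn(20, 2) + 6])
y = np.array([0] * 20 + [1] * 20)

knn = KNeighborsClassifier(n_neighbors=5)    # Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[1.5, 1.0], [5.5, 6.2]]))  # expected: [0 1]
```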
SEGMENTATION
REGRESSION
REGRESSION NUMERICALS
Example 9.9
Calculate the regression coefficient and obtain the lines of regression for the following
data
Solution:
Regression coefficient of X on Y
(i) Regression equation of X on Y
= 0.929X + 7.284
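Since the data table for this worked example is not reproduced above, here is a generic sketch (with made-up X and Y values) of how the regression coefficients and the two lines of regression are computed:

```python
# Regression coefficients and lines of regression (made-up data for illustration).
import numpy as np

X = np.array([10, 12, 13, 17, 18], dtype=float)   # hypothetical values
Y = np.array([5, 6, 7, 9, 13], dtype=float)

x_mean, y_mean = X.mean(), Y.mean()
cov_xy = ((X - x_mean) * (Y - y_mean)).mean()

b_yx = cov_xy / X.var()   # regression coefficient of Y on X
b_xy = cov_xy / Y.var()   # regression coefficient of X on Y

# Line of Y on X: Y - y_mean = b_yx * (X - x_mean)
# Line of X on Y: X - x_mean = b_xy * (Y - y_mean)
print("b_yx =", round(b_yx, 3), "b_xy =", round(b_xy, 3))
print(f"Y = {round(b_yx, 3)}*X + {round(y_mean - b_yx * x_mean, 3)}")
print(f"X = {round(b_xy, 3)}*Y + {round(x_mean - b_xy * y_mean, 3)}")
```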
INVERTED INDEX
There are two types of inverted indexes: a record-level inverted index contains, for each word, a list of
references to the documents that contain it; a word-level inverted index additionally contains the
positions of each word within a document. The latter form offers more functionality, but needs more
processing power and space to be created.
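A small sketch of building both kinds of index in Python; the toy documents are made up for illustration:

```python
# Record-level vs word-level inverted index (toy documents).
from collections import defaultdict

docs = {
    0: "data clustering groups similar data points",
    1: "classification assigns labels to data",
}

record_index = defaultdict(set)    # word -> set of document ids
word_index = defaultdict(list)     # word -> list of (document id, position)

for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        record_index[word].add(doc_id)
        word_index[word].append((doc_id, pos))

print(record_index["data"])   # {0, 1}
print(word_index["data"])     # [(0, 0), (0, 4), (1, 4)]
```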
DECISION TREE TERMINOLOGIES
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
While implementing a decision tree, the main issue is how to select the best attribute for the root node
and for the sub-nodes. For this, there are two popular attribute selection measures (ASM):
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
formula below:
Information Gain = Entropy(S) − [(Weighted Average) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in
the data. For a binary classification problem, entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = the total number of samples
P(yes) = the probability of yes
P(no) = the probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o The Gini index can be calculated using the formula below:
Gini Index = 1 − Σj Pj²
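A short sketch computing entropy and the Gini index of a node from its class counts (hypothetical counts, just to illustrate the two formulas):

```python
# Entropy and Gini index of a node from class counts (hypothetical counts).
import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# A node with 9 "yes" and 5 "no" samples.
print(round(entropy([9, 5]), 3))  # ≈ 0.940
print(round(gini([9, 5]), 3))     # ≈ 0.459
```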