Module 6 - Unsupervised Learning Algorithms
• Here, we work with unlabeled input data, which means it is not categorized and no corresponding outputs are given.
• This unlabeled input data is fed to the machine learning model in order to train it.
• The model first interprets the raw data to find the hidden patterns in the data and then applies a suitable algorithm such as k-means clustering, hierarchical clustering, etc.
• Once a suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm
• The unsupervised learning algorithm can be further
categorized into two types of problems:
• Clustering: Clustering is a method of grouping objects
into clusters such that objects with the most similarities remain
in one group and have few or no similarities with the objects
of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them according to the
presence and absence of those commonalities.
• Association: An association rule is an unsupervised learning
method used for finding relationships between
variables in a large database. It determines the sets of
items that occur together in the dataset. Association rules
make marketing strategies more effective: for example, people
who buy item X (say, bread) also tend to purchase
item Y (butter or jam). A typical example of an association
rule is Market Basket Analysis, illustrated in the sketch below.
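To make the idea of an association rule concrete, here is a minimal pure-Python sketch of how a rule such as {bread} → {butter} could be scored with support and confidence; the items and transactions are made up purely for illustration:

```python
# Minimal sketch: support and confidence for an association rule such as
# {bread} -> {butter}. The toy transactions are made up for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))      # 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```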
Unsupervised Learning algorithms
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
1 - K-Means Clustering Algorithm
• K-Means Clustering is an Unsupervised Learning algorithm which groups the
unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the
process; e.g., if K=2, there will be two clusters, for K=3 there will be three
clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K different
clusters in such a way that each data point belongs to only one group of points with similar
properties.
• It allows us to cluster the data into different groups and provides a convenient way to
discover the groupings in an unlabeled dataset on its own, without the
need for any labeled training data.
• The main aim of this algorithm is to minimize the sum of distances between each
data point and the centroid of its corresponding cluster.
1 - K-Means Clustering Algorithm
• The algorithm takes the unlabeled dataset as input, divides the dataset into K
clusters, and repeats the process until it finds the best clusters.
The value of K should be predetermined in this algorithm.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an iterative
process.
• Assigns each data point to its closest center. The data points that are near
a particular center form a cluster.
• Hence each cluster contains data points with some commonalities and is distant from
the other clusters.
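As an illustration of these two tasks, below is a minimal sketch using scikit-learn's KMeans; the synthetic blob data, the choice K=3, and the random seeds are assumptions made only for this example:

```python
# Minimal K-means sketch with scikit-learn (synthetic data, K=3, and
# random_state are assumptions made only for this example).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: 300 points drawn around 3 hidden centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index (0..K-1) for each point
centroids = kmeans.cluster_centers_   # the final K center points
print(labels[:10], centroids.shape)   # e.g. first 10 assignments, (3, 2)
```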
1 - K-Means Clustering Algorithm
How to choose the value of "K number of
clusters" in K-means Clustering?
• The performance of the K-means clustering algorithm depends upon how efficient
(compact and well-separated) the clusters it forms are.
• But choosing the optimal number of clusters is a challenging task.
• There are different ways to find the optimal number of clusters, but here we
discuss the most widely used method for finding the number of clusters, or value of
K: the Elbow Method.
Elbow Method
• The Elbow method is one of the most popular ways to find the optimal number of
clusters.
• This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum
of Squares, which measures the total variation within the clusters.
• To measure the distance between data points and a centroid, we can use any method,
such as Euclidean distance or Manhattan distance.
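Written out (a standard formulation, where c_k denotes the centroid of cluster C_k), the quantity being summed is:

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \operatorname{dist}(x, c_k)^2
```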
To find the optimal number of clusters, the elbow method follows the steps below:
• Execute K-means clustering on a given dataset for different K values (e.g., ranging from 1 to 10).
• For each value of K, calculate the WCSS value.
• Plot a curve of the calculated WCSS values against the number of clusters K.
• The sharp point of bend, where the plot looks like an arm's elbow, is considered
the best value of K (see the sketch below).
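A minimal sketch of these steps with scikit-learn, where `inertia_` is scikit-learn's name for the WCSS of a fitted K-means model; the synthetic data and the 1-10 range are assumptions for the example:

```python
# Elbow-method sketch: compute WCSS (sklearn's `inertia_`) for K = 1..10 and
# plot it; the synthetic data below is assumed for illustration.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                            # look for the 'elbow' in this curve
```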
Elbow Method
• Since the graph shows a sharp bend that looks like an elbow, this approach is known as
the elbow method. In the plot of WCSS against K, the curve drops steeply up to the optimal K and then flattens out.
2- Hierarchical Clustering
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
• There is no requirement to predetermine the number of clusters for this algorithm as
we did in the K-Means algorithm.
• The hierarchical clustering technique has two approaches:
• Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm
starts by treating each data point as a single cluster and keeps merging clusters until one cluster
is left.
• Divisive: The divisive algorithm is the reverse of the agglomerative algorithm as it is a
top-down approach.
Why hierarchical clustering?
• As we already have other clustering algorithms such as K-Means, why do we
need hierarchical clustering?
• K-means clustering has some challenges: it requires a predetermined
number of clusters, and it always tries to create clusters of roughly the same size.
• To address these two challenges, we can opt for the hierarchical clustering algorithm
because, in this algorithm, we do not need prior knowledge of the
number of clusters.
Agglomerative Hierarchical clustering
• To group the datasets into clusters, it follows the bottom-up approach.
• This means the algorithm considers each data point as a single cluster at the
beginning and then starts combining the closest pairs of clusters.
• It does this until all the clusters are merged into a single cluster that contains all the
data points.
• This hierarchy of clusters is represented in the form of the dendrogram.
How Does Agglomerative Hierarchical Clustering Work?
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
How Does Agglomerative Hierarchical Clustering Work?
Step-2: Take two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters.
How Does Agglomerative Hierarchical Clustering Work?
Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.
How Does Agglomerative Hierarchical Clustering Work?
Step-4: Repeat Step 3 until only one cluster is left.
Throughout the merging process, we keep track of the hierarchy of clusters by building a
dendrogram, a tree-like diagram that illustrates the order in which clusters were
merged.
Step-5: Once all the clusters are combined into one big cluster, cut the dendrogram
to divide the data into clusters as the problem requires.
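As a hedged illustration, the whole bottom-up procedure and the resulting dendrogram can be sketched with SciPy; the six toy 2-D points below are assumptions for the example:

```python
# Sketch of agglomerative clustering with SciPy: build the merge hierarchy
# and draw the dendrogram (the toy 2-D points are assumed for illustration).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = [[1, 2], [1, 4], [5, 8], [6, 8], [9, 1], [8, 2]]

# 'ward' merges the pair of clusters giving the smallest increase in WCSS.
Z = linkage(X, method="ward")

dendrogram(Z)                 # tree of merges: leaves are the N data points
plt.ylabel("Merge distance")
plt.show()
```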
Measure for the distance between two
clusters
• The way the distance between two clusters is measured is crucial for hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering. These measures are called Linkage methods.
• Some of the popular linkage methods are given in the next slides:
Measure for the distance between two
clusters
1 - Single Linkage:
• It is the shortest distance between the closest points of the two clusters.
• It merges the two clusters whose closest data points are nearest to each other.
Measure for the distance between two
clusters
2 - Complete Linkage: It is the farthest distance between points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single
linkage.
Measure for the distance between two
clusters
3 - Average Linkage: It is the linkage method in which the distance between every pair of
points (one from each cluster) is added up and then divided by the total number of pairs to calculate the
average distance between two clusters.
4 - Centroid Linkage: It is the linkage method in which the distance between the centroids
of the two clusters is calculated.
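In library implementations the linkage rule is typically just a parameter. Below is a small sketch using scikit-learn's AgglomerativeClustering (the toy data and the choice of two clusters are assumptions); note that centroid linkage is not offered by this particular class, though it is available, for example, as method="centroid" in SciPy's linkage function:

```python
# Sketch: trying different linkage methods with scikit-learn's
# AgglomerativeClustering (toy data and n_clusters=2 are assumed).
from sklearn.cluster import AgglomerativeClustering

X = [[1, 2], [1, 4], [5, 8], [6, 8], [9, 1], [8, 2]]

for method in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = model.fit_predict(X)
    print(method, labels)     # cluster assignment under each linkage rule
```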
3 - Dimensionality Reduction Technique
• The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
• In many cases a dataset contains a huge number of input features, which makes the
predictive modelling task more complicated. Because it is very difficult to visualise or
make predictions for a training dataset with a high number of features,
dimensionality reduction techniques need to be used in such cases.
• A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides
similar information." These techniques are widely used in machine learning for
obtaining a better-fitting predictive model while solving classification and regression
problems.
• It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
• Handling the high-dimensional data is very difficult in practice, commonly known as the
curse of dimensionality.
• If the dimensionality of the input dataset increases, any machine learning algorithm and
model becomes more complex.
• As the number of features increases, the number of samples needed to cover the
feature space well also increases, and the chance of overfitting increases.
• If a machine learning model is trained on high-dimensional data, it can become
overfitted and give poor performance.
• Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Approaches of Dimension Reduction
1 - Feature Selection: It is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.
• Filter Methods
• Wrapper Methods
• Embedded Methods
2 - Feature Extraction: It transforms the original features into a lower-dimensional
space. The new features are typically a linear combination of the original features and
are chosen to maximize variance or preserve other desirable properties. This approach is
useful when we want to keep the whole information but use fewer resources while
processing the information.
• Principal Component Analysis
• Linear Discriminant Analysis
• Kernel PCA
• Quadratic Discriminant Analysis
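To make the distinction concrete, here is a hedged sketch contrasting one feature-selection method (the filter-style SelectKBest) with one feature-extraction method (PCA); the iris dataset and the choice of two features/components are assumptions for the example:

```python
# Sketch: feature selection keeps a subset of the original columns, while
# feature extraction (PCA) builds new features as combinations of them.
# The iris dataset and k=2 are assumptions made for this example.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)       # 150 samples x 4 original features

# Feature selection: keep the 2 original features most related to y.
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 new directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

print(X_sel.shape, X_pca.shape)          # (150, 2) (150, 2)
```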
4 - Principal Component Analysis
• It is a statistical process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation.
• These new transformed features are called the Principal Components.
• It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
• It is a technique to draw out strong patterns from the given dataset by finding the directions
of maximum variance (variance being a measure of the spread or dispersion of a set of data points).
• PCA generally tries to find the lower-dimensional surface onto which to project the high-
dimensional data.
• PCA works by considering the variance along each direction, because directions with high
variance carry most of the information about the data; keeping those directions and dropping the rest reduces the dimensionality.
• Some real-world applications of PCA are image processing, movie recommendation
systems, and optimizing the power allocation in various communication channels.
Some common terms used in PCA algorithm
• Dimensionality: It is the number of features or variables present in the given dataset.
More easily, it is the number of columns present in the dataset.
• Correlation: It signifies how strongly two variables are related to each other; for example, if
one changes, the other variable also changes. The correlation value ranges from -1
to +1. Here, -1 occurs when the variables are inversely proportional to each other, and +1
indicates that the variables are directly proportional to each other.
• Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
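A tiny sketch of these correlation values using NumPy (the small arrays are made up purely for illustration):

```python
# Tiny sketch of the correlation values mentioned above, using NumPy's
# corrcoef (the small arrays are made up for illustration).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.corrcoef(x,  2 * x)[0, 1])   # +1.0: directly proportional
print(np.corrcoef(x, -2 * x)[0, 1])   # -1.0: inversely proportional
y = np.array([1.0, -1.0, -1.0, 1.0])
print(np.corrcoef(x, y)[0, 1])        #  0.0: uncorrelated
```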
Principal Components in PCA
• The transformed new features or the output of PCA are the Principal Components.
• The number of these PCs is either equal to or less than the number of original features present
in the dataset.
• Some properties of these principal components are given below:
• The principal component must be the linear combination of the original features.
• These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
• The importance of each component decreases going from 1 to n; this means the 1st
PC has the most importance, and the nth PC has the least importance.
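A minimal sketch illustrating these properties with scikit-learn's PCA; the iris dataset is an assumption for the example, and explained_variance_ratio_ shows the importance decreasing from the 1st to the nth component:

```python
# Sketch: PCA on the iris data (assumed for illustration). The components
# are orthogonal, and explained_variance_ratio_ decreases from PC1 to PCn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=4)                # as many PCs as original features
X_new = pca.fit_transform(X)

print(pca.explained_variance_ratio_)      # decreasing, e.g. ~[0.92 0.05 ...]
# Components are orthogonal: their pairwise dot products are ~0.
print(np.round(pca.components_ @ pca.components_.T, 3))  # ~identity matrix
```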
Applications of Principal Component Analysis
• PCA is mainly used as the dimensionality reduction technique in various AI applications
such as computer vision, image compression, etc.
• It can also be used for finding hidden patterns when the data has high dimensions. Some
fields where PCA is used are finance, data mining, psychology, etc.