
UNIT IV Unsupervised Learning

The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis.

This method falls under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points; that is, unlike supervised
learning, we do not have a target variable.

Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset. It evaluates similarity using a metric such as Euclidean
distance, cosine similarity, or Manhattan distance, and then groups the points
with the highest similarity together.
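For concreteness, the three metrics just mentioned can be computed as follows.
This is a minimal sketch using SciPy's distance helpers; the two points are
made up purely for illustration:

    import numpy as np
    from scipy.spatial import distance

    # Two made-up points, purely for illustration
    p = np.array([2.0, 10.0])
    q = np.array([5.0, 8.0])

    print(distance.euclidean(p, q))  # sqrt((2-5)^2 + (10-8)^2) ~ 3.606
    print(distance.cityblock(p, q))  # Manhattan: |2-5| + |10-8| = 5
    print(distance.cosine(p, q))     # cosine distance = 1 - cosine similarity

Which metric is appropriate depends on the data: Euclidean and Manhattan
distances compare absolute positions, while cosine similarity compares only
the direction of the vectors.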

For example, in the first graph below we can clearly see three circular
clusters forming on the basis of distance. In the second graph, by contrast,
the clusters formed are not circular in shape.

[Figures: two scatter plots, one with three circular clusters and one with
non-circular clusters]
Types of Clustering

Broadly speaking, there are two types of clustering that can be performed to
group similar data points:

Hard Clustering: In this type of clustering, each data point either belongs to
a cluster completely or not at all. For example, suppose there are 4 data
points and we have to cluster them into 2 clusters. Each data point will then
belong to either cluster 1 or cluster 2.

Data Point    Cluster
A             C1
B             C2
C             C2
D             C1

Soft Clustering: In this type of clustering, instead of assigning each data
point to exactly one cluster, a probability (or likelihood) of that point
belonging to each cluster is evaluated. For example, suppose there are 4 data
points and we have to cluster them into 2 clusters. We then evaluate, for
every data point, the probability of it belonging to each of the two clusters.

Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.3                  0.7
C             0.17                 0.83
D             1                    0
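Soft assignments like those in the table above can be produced with a Gaussian
mixture model. The following is a minimal sketch using scikit-learn; the four
2-D points are made up for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Four made-up 2-D points, for demonstration only
    X = np.array([[1.0, 1.2], [5.0, 5.1], [5.2, 4.9], [0.9, 0.8]])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    print(gmm.predict(X))        # hard labels: one cluster index per point
    print(gmm.predict_proba(X))  # soft labels: per-cluster membership probabilities

Note that predict() gives the hard-clustering view (each point gets one
cluster) while predict_proba() gives the soft-clustering view (each point gets
a probability for every cluster).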

K-Means Clustering

What is K-Means Clustering?

K-Means Clustering is an unsupervised machine learning algorithm which groups
an unlabeled dataset into k different clusters. Each cluster is represented by
its centroid, the mean of the points assigned to it.
Example: Cluster the following eight points (with (x, y) representing
locations) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

After assigning each point to its nearest centre, each centre is recomputed as
the mean of its members. For instance, a cluster containing the x-coordinates
2, 2, 4 and y-coordinates 4, 6, 7 gets the new centre
((2+2+4)/3, (4+6+7)/3) ≈ (2.67, 5.67); new c2 and c3 are calculated the same way.

After the second iteration, the centres of the three clusters are:

C1(3, 9.5), C2(6.5, 5.25) and C3(1.5, 3.5)
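The iteration above can be reproduced with a short NumPy sketch of Lloyd's
algorithm. It assumes A1, A4 and A7 as the initial centres (the notes do not
state the initialization, but this common choice reproduces the
second-iteration centres quoted above):

    import numpy as np

    # The eight points A1..A8 from the example above
    X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                  [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

    # Assumed initial centres: A1, A4, A7 (not stated in the notes)
    centres = X[[0, 3, 6]].copy()

    for _ in range(2):  # two iterations, matching the worked example
        # Assign each point to its nearest centre (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each centre to the mean of its assigned points
        centres = np.array([X[labels == k].mean(axis=0) for k in range(3)])

    print(centres)  # [[3. 9.5], [6.5 5.25], [1.5 3.5]]

In practice the loop runs until the assignments stop changing rather than for
a fixed two iterations; two iterations are used here only to match the worked
example.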


K-Medoids

K-medoids clustering is a partitioning technique similar to k-means, but with
some key differences that make it more robust to noise and outliers. Here's a
brief overview:

Key Features of K-Medoids Clustering:

Medoids as Centers: Unlike k-means, which uses the mean of the points in a
cluster as the center, k-medoids selects actual data points as the centers
(medoids). This makes the cluster centers more interpretable.

Robustness: K-medoids minimizes a sum of pairwise dissimilarities instead of
squared Euclidean distances, making it less sensitive to outliers and noise.

Dissimilarity Measures: It can use arbitrary dissimilarity measures, whereas
k-means generally requires Euclidean distance.
Algorithm Steps:

1. Choose k random points from the data and assign them to k clusters. These
   are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid
   and assign each point to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances from all data points to
   their medoids).
4. Select a random non-medoid point and swap it with one of the current
   medoids. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous
   medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than that with the
   previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until no change in the medoids is encountered.
Example: with medoid (3, 4) and the Manhattan metric, the cost of assigning
point (2, 6) to that medoid is

Cost((3, 4), (2, 6)) = |3 - 2| + |4 - 6| = 1 + 2 = 3

Summing such costs over all non-medoid points gives the total cost; in this
example,

Total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

Next, take a new random non-medoid point, e.g. (8, 4), swap it in as a medoid,
and recompute the total cost.
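A minimal NumPy sketch of this swap procedure follows. The ten points are an
assumption (the notes show only fragments of the dataset); they are the
standard textbook dataset that reproduces the total cost of 20 above with
medoids (3, 4) and (7, 4). For simplicity the sketch tries every possible swap
deterministically rather than picking random candidates as in step 4:

    import numpy as np

    # Assumed dataset: ten 2-D points consistent with the cost figures above
    X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                  [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)

    def total_cost(medoid_idx):
        # Manhattan distance of every point to its nearest medoid, summed
        d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
        return d.min(axis=1).sum()

    medoids = [1, 7]              # start from medoids (3,4) and (7,4)
    best = total_cost(medoids)    # 20.0 for this dataset
    improved = True
    while improved:               # steps 4-7: keep swapping while cost drops
        improved = False
        for i in range(len(medoids)):
            for j in range(len(X)):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = j      # swap medoid i for candidate point j
                if total_cost(trial) < best:
                    medoids, best, improved = trial, total_cost(trial), True

    print(medoids, best)

Because only swaps that lower the total cost are kept (step 5), and costly
swaps are undone (step 6), the loop terminates once no swap improves the
clustering (step 7).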


Hierarchical clustering
Hierarchical clustering is a method of cluster analysis in machine learning and
statistics that builds a hierarchy of clusters. It is particularly useful for
discovering the underlying structure in data.

Key Features of Hierarchical Clustering

Types:

Agglomerative (Bottom-Up): Starts with each data point as its own cluster and
merges the closest pairs of clusters iteratively until all points are in a single
cluster or a stopping criterion is met.

Steps:

Consider each point (say, labelled A to F) as a single cluster and calculate
the distance of one cluster from all the other clusters.

In the second step, comparable clusters are merged together to form a single
cluster. Say cluster (B) and cluster (C) are very similar to each other; we
therefore merge them in this step, and likewise clusters (D) and (E). We are
left with the clusters [(A), (BC), (DE), (F)].

We recalculate the proximity (the similarity or dissimilarity between
clusters) according to the algorithm and merge the two nearest clusters
([(DE), (F)]) together to form the new clusters [(A), (BC), (DEF)].

Repeating the same process, the clusters (DEF) and (BC) are comparable and are
merged together to form a new cluster. We are now left with the clusters
[(A), (BCDEF)].

At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].

Divisive (Top-Down): Begins with all data points in one cluster and
recursively splits them into smaller clusters.

Dendrogram: The results are often visualized using a dendrogram, a tree-like
diagram that shows the arrangement of the clusters formed at each step.

Distance Metrics: Various distance metrics (e.g., Euclidean, Manhattan) and
linkage criteria (e.g., single, complete, average) can be used to determine
the similarity between clusters.
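A minimal sketch of agglomerative clustering with SciPy follows. The six 2-D
points labelled A to F are made up, chosen so that the merge order matches the
walkthrough above; the linkage method and metric are the choices just
described:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    # Six made-up 2-D points labelled A..F, for illustration
    X = np.array([[0, -2], [4, 4], [4.2, 4.1],
                  [8, 8], [8.1, 8.2], [9, 10]], dtype=float)

    # Agglomerative clustering: single linkage with Euclidean distance
    Z = linkage(X, method='single', metric='euclidean')

    # The dendrogram shows which clusters were merged at each step
    dendrogram(Z, labels=list('ABCDEF'))
    plt.show()

Swapping method='single' for 'complete' or 'average' changes how the distance
between two clusters is computed, and can change the merge order and the
resulting dendrogram.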

Applications: Hierarchical clustering is used in various fields such as
bioinformatics, image analysis, and market research to group similar items
together and understand the relationships between them.

Advantages and Disadvantages

Advantages:

Does not require the number of clusters to be specified in advance.

Can capture complex cluster structures.

Provides a clear visual representation of the clustering process through
dendrograms.

Disadvantages:

Computationally intensive, especially for large datasets.

Sensitive to noise and outliers.


The choice of distance metric and linkage method can significantly affect the
results.

Multi-view clustering

Multi-view clustering is an exciting area of unsupervised learning that aims
to group unlabeled data points by leveraging multiple views, or feature sets,
of the data.

Here are some key points about it:

Definition: Multi-view clustering involves using different "views", or feature
sets, of the same data to improve clustering performance. Each view might
contain different information about the data points, and combining these views
can lead to more accurate and robust clustering results.

Challenges: One of the main challenges is how to effectively integrate and
align these different views, especially when they have varying levels of noise
and completeness. Another challenge is balancing view consistency (ensuring
the views agree with each other) and view specificity (capturing unique
information from each view).

Methods: Various methods have been proposed to tackle these challenges,
including graph-based approaches, contrastive learning, and deep learning
techniques. For example, some methods use graph learning to capture the
relationships between data points across different views, while others use
contrastive learning to align representations from different views.
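As a point of reference, the simplest multi-view baseline is "early fusion":
standardize each view and concatenate the feature sets before clustering. The
sketch below shows this with scikit-learn; the two views are made-up random
matrices, and this baseline is far simpler than the graph-based and
contrastive methods mentioned above:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Two made-up "views" of the same 100 data points, e.g. colour
    # features and texture features of the same set of images
    view1 = rng.normal(size=(100, 5))
    view2 = rng.normal(size=(100, 8))

    # Early fusion: standardize each view, then concatenate along features
    fused = np.hstack([StandardScaler().fit_transform(v)
                       for v in (view1, view2)])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(fused)
    print(labels[:10])

Standardizing each view before concatenation keeps one view's feature scale
from dominating the distance computation, a simple way to balance view
consistency against view specificity.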

Applications: Multi-view clustering is used in various fields such as image
and video analysis, text mining, and bioinformatics, where data can naturally
be represented in multiple views.
