Unit IV Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm is trained on data that has
no labelled outcomes or target variables. The goal is to identify patterns, structures, or relationships
in the data without explicit supervision. In unsupervised learning, the algorithm tries to group data
points (clustering), reduce dimensions (dimensionality reduction), or discover hidden structures
(association rules) based on the inherent patterns in the data.
Cluster analysis is a technique used in data analysis to group similar data points together based on
certain characteristics or features. The goal is to identify patterns or structures in data, where data
points within the same group (or cluster) are more similar to each other than to those in other
groups. It's widely used in fields like market research, pattern recognition, and machine learning.
Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
Partition methods are clustering techniques that divide a dataset into non-overlapping groups (or
clusters) such that each data point belongs to exactly one cluster. These methods require the user
to specify the desired number of clusters (K) in advance. The goal is to optimize a specific criterion,
like minimizing the variance within clusters or maximizing the distance between clusters.
K-Means: A clustering algorithm that partitions data into a predefined number of clusters by
minimizing the variance within each cluster.
K-Medoids: Similar to K-means, but instead of using the mean of the points in a cluster, it uses an
actual data point (medoid) to represent the cluster.
K-Means:
Method: This algorithm starts by selecting K initial centroids (randomly or using some
strategy). Each data point is assigned to the nearest centroid, forming K clusters. Then, the
centroids are recalculated as the mean of the points in each cluster. The process repeats
until the centroids no longer change significantly.
Pros: Efficient for large datasets, works well when clusters are spherical and equally sized.
Cons: Sensitive to initial centroid placement, struggles with non-spherical or unevenly sized
clusters.
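The steps above map directly onto scikit-learn's KMeans. A minimal sketch (the synthetic blobs, n_clusters=3, and random seeds are illustrative choices, not prescribed values):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init restarts the algorithm from several initial centroid placements,
# which mitigates the sensitivity to initialization noted above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # cluster assignment for each point

print(kmeans.cluster_centers_)     # final centroids (mean of each cluster)
print(kmeans.inertia_)             # within-cluster sum of squared distances
```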
K-Medoids:
Method: Like K-means, K-medoids aims to partition the data into K clusters, but it uses actual
points (medoids) as cluster centers. Instead of minimizing variance, it minimizes the total
pairwise dissimilarity between points in the cluster and their medoid. Algorithms like
Partitioning Around Medoids (PAM) are often used.
Pros: More robust to outliers since the medoid is an actual data point and not an average.
Cons: Computationally more expensive than K-means, especially for large datasets.
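Plain scikit-learn does not ship a k-medoids estimator, so the idea can be sketched directly in NumPy. The following is a simplified Voronoi-iteration version (not the full PAM swap procedure); the function name, seed, and toy data are illustrative:

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)   # random initial medoids
    for _ in range(n_iter):
        # Assign each point to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total dissimilarity to its cluster
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
medoids, labels = k_medoids(X, k=2)
print(X[medoids])   # the two representative points (medoids) are actual data points
```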
Hierarchical Methods: Agglomerative and Divisive Hierarchical Clustering
Agglomerative Hierarchical Clustering: A bottom-up approach where each data point starts as its
own cluster, and clusters are progressively merged based on similarity.
Divisive Hierarchical Clustering: A top-down approach where all data points start in one cluster, and
the cluster is recursively split into smaller clusters.
Agglomerative Hierarchical Clustering:
Method: In this approach, each data point is initially treated as its own cluster. The algorithm
then iteratively merges the closest clusters based on a distance metric (e.g., Euclidean
distance). This process continues until all points belong to a single cluster or a stopping
criterion is met. A dendrogram (tree-like diagram) is often used to visualize the merges.
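A minimal sketch of this bottom-up procedure using SciPy's hierarchical clustering utilities (the toy data, "ward" linkage, and the three-cluster cut are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy 2-D data
Z = linkage(X, method="ward")                    # bottom-up merge history, closest pairs first
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (needs matplotlib)
```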
Divisive Hierarchical Clustering:
Method: This is the opposite of agglomerative clustering. It starts with all data points in one
single cluster and then recursively splits the cluster into smaller clusters. The splits are based
on maximizing the dissimilarity between resulting clusters. Like agglomerative clustering, a
dendrogram can be used to visualize the process.
Pros: Useful for capturing the global structure of the data first; can work better than
agglomerative clustering when the important splits are at the top level.
Cons: More computationally expensive than agglomerative clustering, especially with large
datasets.
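Divisive clustering is less commonly available off the shelf. One rough way to illustrate the top-down idea is recursive bisection, repeatedly splitting the largest cluster with a 2-means step (a bisecting-style sketch, not the classical DIANA algorithm; the splitting rule, names, and toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)          # start with every point in one cluster
    while labels.max() + 1 < k:
        sizes = np.bincount(labels)               # choose the largest cluster to split next
        target = int(np.argmax(sizes))
        idx = np.where(labels == target)[0]
        # 2-means split of the chosen cluster
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[sub == 1]] = labels.max() + 1  # second half becomes a new cluster
    return labels

X = np.vstack([np.random.rand(60, 2), np.random.rand(60, 2) + 3])
print(divisive_clustering(X, k=4))
```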
Dynamic Clustering: A clustering method that adapts to changes in data over time, updating clusters
as new data points become available.
Multi-view Clustering: A clustering approach that combines information from multiple sources or
views to create more robust clusters.
Dynamic Clustering:
Method: Dynamic clustering adjusts clusters over time to reflect changes in the data. It is
commonly used in situations where data evolves, such as in streaming data or when periodic
updates are needed. Algorithms for dynamic clustering continuously adjust cluster members
or centers as new data is added, ensuring that the clustering structure adapts to emerging
patterns or trends.
Pros: Ideal for real-time data or systems where data continuously changes (e.g., sensor data,
online platforms).
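There is no single standard "dynamic clustering" routine, but the incremental idea can be illustrated with scikit-learn's MiniBatchKMeans, whose partial_fit updates the centroids as each new batch arrives (the simulated stream and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, random_state=0)

for _ in range(100):                  # simulate 100 batches arriving over time
    batch = np.random.rand(32, 2)     # newly observed points
    model.partial_fit(batch)          # centroids are updated incrementally

print(model.cluster_centers_)         # current cluster centers after the stream
```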
Multi-view Clustering:
Method: This technique involves combining multiple data representations (or "views") of the
same dataset to create a more accurate and comprehensive clustering solution. Each view
represents different perspectives or feature sets (e.g., text, images, or graph-based data),
and the algorithm seeks to find clusters that are consistent across these views. By fusing
information from these multiple sources, multi-view clustering can improve the quality and
robustness of the clustering outcome.
Pros: Can leverage complementary information from different sources, improving clustering
accuracy in complex datasets (e.g., multimodal data such as images and text).
Cons: Requires multiple data views, which may not always be available or easy to combine;
computational complexity can be high when integrating diverse data sources.
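Dedicated multi-view algorithms are more involved, but a naive "early fusion" baseline conveys the basic idea: standardize each view, concatenate the features, and cluster the joint representation (the view shapes and contents below are illustrative placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

view_text = np.random.rand(200, 50)    # e.g., TF-IDF features of 200 items
view_image = np.random.rand(200, 128)  # e.g., image embeddings of the same 200 items

# Standardize each view separately, then concatenate into one joint representation
fused = np.hstack([
    StandardScaler().fit_transform(view_text),
    StandardScaler().fit_transform(view_image),
])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(fused)
print(labels[:10])
```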
Internal Evaluation Metrics: Measures the quality of clustering based on the data and the clustering
itself, without reference to external ground truth.
External Evaluation Metrics: Compares the clustering results to an external, known classification
(ground truth) to measure its accuracy.
Method: Internal metrics assess the quality of clustering based solely on the data and the
structure of the clusters without external information. They typically focus on cohesion (how
close the points within a cluster are) and separation (how distinct the clusters are from each
other). Common internal metrics include:
o Silhouette Score: Measures how similar a point is to its own cluster compared to
other clusters.
o Davies-Bouldin Index: Evaluates the average similarity ratio of each cluster with the
cluster that is most similar to it.
o Dunn Index: Measures the ratio of the minimum inter-cluster distance to the
maximum intra-cluster distance.
Pros: Useful when no ground truth is available. Helps in fine-tuning clustering algorithms.
Cons: May not fully capture the true quality of clusters, especially in ambiguous cases where
ground truth is unknown.
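A minimal sketch of computing two of these internal metrics with scikit-learn (the Dunn index is not included in scikit-learn; the data and cluster count are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))       # closer to +1 means tighter, better-separated clusters
print(davies_bouldin_score(X, labels))   # lower values indicate better clustering
```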
Method: External metrics compare the predicted clustering against known ground-truth labels, typically by checking how pairs of points are grouped in each. Common external metrics include:
o Rand Index: Measures the fraction of point pairs on which the two groupings agree, i.e., pairs placed in the same cluster in both the predicted and actual classifications, or in different clusters in both.
o Adjusted Rand Index (ARI): A variation of the Rand Index that adjusts for chance
groupings, giving a more accurate measure.
Pros: Provides a clear, quantitative evaluation when ground truth is available, making it
easier to compare clustering algorithms.
Cons: Requires the availability of ground truth labels, which may not always be present or
reliable.
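A minimal sketch of the external metrics with scikit-learn, using small hand-made labelings for illustration:

```python
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]          # known ground-truth classes
pred_labels = [1, 1, 0, 0, 2, 2]          # same grouping, different label names

print(rand_score(true_labels, pred_labels))           # 1.0: every pair is grouped consistently
print(adjusted_rand_score(true_labels, pred_labels))  # chance-corrected; also 1.0 here
```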