Unsupervised Learning
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
What is clustering for? (cont…)
• Example 3: Given a collection of text documents, we want to organize them according to their content similarities to produce a topic hierarchy.
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where each xi is a vector in a real-valued space.
K-means algorithm
• Given k, the k-means algorithm works as follows (a minimal sketch of these steps appears below):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
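The sketch below is one minimal NumPy rendering of these four steps, not the lecture's own code; the function name kmeans and the parameters X (an n-by-d array), k, max_iters, and seed are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch (illustrative, assumed names and defaults)."""
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) Stop when the centroids no longer change (convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, kmeans(np.random.rand(100, 2), k=3) would return a cluster label for each of 100 random 2-D points plus the three final centroids.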
K-means algorithm – (cont…)
K-means summary
• Despite its weaknesses, k-means is still the most popular clustering algorithm due to its simplicity and efficiency; other clustering algorithms have their own lists of weaknesses.
Common ways to represent clusters
• Use the centroid of each cluster to represent the cluster.
Compute the radius and standard deviation of the cluster to determine its spread in each dimension (see the sketch below).
The centroid representation alone works well if the clusters are of hyper-spherical shape.
If clusters are elongated or of other shapes, centroids are not sufficient.
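As a rough illustration of this representation, the sketch below computes a centroid, a radius, and a per-dimension standard deviation for one cluster; the function name cluster_summary and the NumPy-array input are assumptions, not part of the slides.

```python
import numpy as np

def cluster_summary(cluster):
    """Summarize one cluster given as an (m, d) array of its member points."""
    centroid = cluster.mean(axis=0)                            # cluster center
    radius = np.linalg.norm(cluster - centroid, axis=1).max()  # max distance to centroid
    spread = cluster.std(axis=0)                               # std deviation per dimension
    return centroid, radius, spread
```

A large spread in only some dimensions is one sign that the cluster is elongated and the centroid alone is not a sufficient description.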
Hierarchical Clustering
• Produces a nested sequence of clusters, a tree, also called a dendrogram (a minimal sketch of building one is shown below).
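One possible way to produce such a dendrogram is SciPy's agglomerative linkage; the library choice, the toy random data, and the average-linkage setting below are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)         # toy data: 20 points in 2-D
Z = linkage(X, method="average")  # bottom-up merging using average linkage
dendrogram(Z)                     # draw the nested sequence of clusters as a tree
plt.show()
```

Cutting the tree at different heights yields different flat clusterings, which is what makes the nested representation useful.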
Apriori Algorithm Application
• Real-Time Problem: Optimizing product placement in retail.
• Objective: Identify items that are frequently purchased together to improve store layout and product recommendations.
• Dataset: Transaction data from a large retail store.
• Process (a minimal sketch follows this list):
• Apply the Apriori algorithm to find association rules between products (e.g., milk and bread are often bought together).
• Set a minimum support and confidence to filter the rules.
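The following is a hedged, self-contained sketch of this process on toy transactions; the item names, thresholds, and the pairs-only counting pass are illustrative simplifications of the full Apriori algorithm, not the retailer's actual data or the lecture's code.

```python
from itertools import combinations
from collections import Counter

# Toy transaction data (assumed for illustration).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
]
min_support, min_confidence = 0.4, 0.6
n = len(transactions)

# Count support for all item pairs (the 2-itemset pass of Apriori).
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))
item_counts = Counter(item for t in transactions for item in t)

# Keep pairs above min_support, then derive rules above min_confidence.
for (a, b), count in pair_counts.items():
    support = count / n
    if support < min_support:
        continue
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if confidence >= min_confidence:
            print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

On the toy data this prints rules such as "bread -> milk", which is the kind of output a store would use to place related products near each other.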
Conclusion and Key Takeaways
• Unsupervised Learning is powerful for uncovering
hidden patterns in unlabeled data.
• Real-Time Applications:
• Customer segmentation (K-Means)
• Anomaly detection (DBSCAN)
• Market basket analysis (Apriori)
• Case Study: Retail industry benefits from association
rule mining to improve sales and customer
experience.
Summary
• Clustering has a long history and is still an active research area.
There are a huge number of clustering algorithms, and more appear every year.
• We only introduced several main algorithms. There are many others, e.g., density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application-dependent and to some extent subjective.
• Thank You!