The document discusses unsupervised learning, particularly focusing on clustering as a key technique for identifying intrinsic structures in data without predefined labels. It covers various clustering methods, including K-means and DBSCAN, and highlights their applications in real-world scenarios like customer segmentation and anomaly detection. The conclusion emphasizes the ongoing development of clustering algorithms and their practical significance across multiple fields.


ASSIGNMENT 4:

Unsupervised
Learning

Made by: Preyanshi

Enrollment No: 226140307031
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
 These patterns are then utilized to predict the values of the target
attribute in future data instances.

• Unsupervised learning: The data have no target attribute.


 We want to explore the data to find some intrinsic structures in them.

2
Clustering
• Clustering is a technique for finding similarity groups
in data, called clusters. I.e.,
 it groups data instances that are similar to (near) each
other in one cluster and data instances that are very
different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning.
 In fact, association rule mining is also unsupervised.

• This chapter focuses on clustering.


3
An illustration
• The data set has three natural groups of data
points, i.e., 3 natural clusters.

CS583, Bing Liu, UIC 4


What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
 Tailor-made for each person: too expensive.
 One-size-fits-all: does not fit all.

• Example 2: In marketing, segment customers according to their similarities
 To do targeted marketing.

5
What is clustering for?
(cont…)
• Example 3: Given a collection of text documents, we want to
organize them according to their content similarities,
 To produce a topic hierarchy

• In fact, clustering is one of the most utilized data mining techniques.
 It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
 In recent years, due to the rapid increase of online documents, text clustering has become important.

6
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be

{x1, x2, …, xn},


where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ Rr, and r is the number of attributes (dimensions) in the data.

• The k-means algorithm partitions the given data into k clusters.
 Each cluster has a cluster center, called the centroid.
 k is specified by the user.

7
K-means algorithm
• Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).

8
K-means algorithm – (cont
…)

9
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
 Other clustering algorithms have their own lists of weaknesses.

• No clear evidence that any other clustering algorithm performs better in general,
 although they may be more suitable for some specific types of data or applications.

• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!

10
Common ways to represent
clusters
• Use the centroid of each cluster to represent the cluster.
 Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
 The centroid representation alone works well if the clusters are of hyper-spherical shape.
 If clusters are elongated or of other shapes, centroids are not sufficient.

1
Hierarchical Clustering
• Produce a nested sequence of clusters, a tree, also called a dendrogram.

CS583, Bing Liu, UIC 12


Using a classification model
• All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
 Run a supervised learning algorithm on the data to find a classification model.

CS583, Bing Liu, UIC 13


DBSCAN Application
• Real-Time Problem: Anomaly Detection in
Credit Card Transactions
• Objective: Detect fraudulent credit card
transactions.
• Dataset: Transaction records including amount,
location, and time.
• Process:
• Apply DBSCAN to cluster normal transactions while
identifying outliers.
• DBSCAN is effective because it does not assume
spherical clusters and can detect outliers.

• Result: Detect anomalies that may indicate fraudulent activity.

14
Apriori Algorithm
Application
• Real-Time Problem: Optimizing Product
Placement in Retail
• Objective: Identify frequently purchased items
together to improve store layout and product
recommendations.
• Dataset: Transaction data from a large retail store.
• Process:
• Apply the Apriori algorithm to find association rules
between products (e.g., milk and bread are often bought
together).
• Set a minimum support and confidence to filter the rules.

• Result: Store layouts are redesigned to place frequently bought-together items closer, boosting sales by cross-promoting products.

15
Conclusion and Key
Takeaways
• Unsupervised Learning is powerful for uncovering
hidden patterns in unlabeled data.
• Real-Time Applications:
• Customer segmentation (K-Means)
• Anomaly detection (DBSCAN)
• Market basket analysis (Apriori)
• Case Study: Retail industry benefits from association
rule mining to improve sales and customer
experience.

16
Summary
• Clustering has a long history and is still an active area of research.
 There are a huge number of clustering algorithms.
 More are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
 density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to
some extent subjective.
17
•Thank You!

18
