Lect 2 Supervised and Unsupervised Learning
STAT669
Supervised learning vs. unsupervised learning
Clustering
• Clustering is a technique for finding similarity groups
in data, called clusters. That is,
– it groups data instances that are similar to (near) each other
into one cluster and puts data instances that are very different
(far away) from each other into different clusters.
• Clustering is often called an unsupervised learning
task because no class values denoting an a priori grouping
of the data instances are given, as they are in
supervised learning.
• For historical reasons, clustering is often
considered synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised.
• This lecture focuses on clustering.
An illustration
• The data set has three natural groups of data points,
i.e., 3 natural clusters.
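A minimal Python sketch (an illustrative addition with invented cluster centres, not material from the lecture) that generates a comparable 2-D data set with three natural groups:

import random

random.seed(0)

# Three well-separated cluster centres in 2-D (illustrative values)
centres = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

# Draw 50 points around each centre with small Gaussian noise,
# giving a data set with three natural clusters
data = [(cx + random.gauss(0, 0.8), cy + random.gauss(0, 0.8))
        for cx, cy in centres
        for _ in range(50)]

print(len(data))  # 150 points forming 3 natural groups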
Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
• A distance (similarity, or dissimilarity) function,
e.g., Euclidean distance (see the sketch after this list)
• Clustering quality
– Inter-cluster distance maximized
– Intra-cluster distance minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
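For concreteness, here is a minimal Python sketch of one common choice of distance function, Euclidean distance; the function name is illustrative and not taken from the lecture.

import math

def euclidean_distance(x, y):
    # Straight-line distance between two equal-length numeric vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0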
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued
space X ⊆ R^r, and r is the number of attributes
(dimensions) in the data.
• The k-means algorithm partitions the given data
into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user
K-means algorithm
Stopping/convergence criteria:
1. no (or minimum) re-assignments of data points to
different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

   SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2        (1)

where C_j is the j-th cluster, m_j is the centroid of C_j (the
mean vector of all the data points in C_j), and dist(x, m_j) is
the distance between data point x and centroid m_j.
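A minimal Python sketch of the standard k-means loop under the stopping criteria above, assuming Euclidean distance and data points given as numeric tuples; all names are illustrative rather than taken from the lecture.

import math
import random

def dist(x, m):
    # Euclidean distance between a data point and a centroid
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def kmeans(D, k, max_iter=100):
    # 1. Choose k data points at random as the initial centroids
    centroids = random.sample(D, k)
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for x in D:
            j = min(range(k), key=lambda c: dist(x, centroids[c]))
            clusters[j].append(x)
        # 3. Recompute each centroid as the mean of its assigned points
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # 4. Stop when the centroids no longer change (criterion 2 above);
        #    criteria 1 and 3 could be checked instead by tracking the
        #    assignments or the SSE across iterations.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Sum of squared error, as in equation (1)
    sse = sum(dist(x, centroids[j]) ** 2
              for j, c in enumerate(clusters) for x in c)
    return centroids, clusters, sse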
An example
[Figure: k-means iterations on a 2-D data set; '+' marks the cluster centroids]
An example (cont …)
Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are usually small, k-means is
considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that k-means terminates at a local optimum if SSE is
used; finding the global optimum is computationally
intractable.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
– For categorical data, the k-modes algorithm can be used:
the centroid is represented by the most frequent value
(mode) of each attribute.
• The user needs to specify k.
• The algorithm is sensitive to outliers (see the sketch
after this list)
– Outliers are data points that are very far away from
other data points.
– Outliers could be errors in the data recording or
special data points with very different values.
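As a quick illustration of this outlier sensitivity (the values below are invented for the example, not from the lecture), a single extreme point can drag a cluster mean far away from the bulk of the data:

def mean(points):
    # Component-wise mean of a list of 2-D points
    return tuple(sum(vals) / len(points) for vals in zip(*points))

cluster = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0)]
print(mean(cluster))                   # about (1.03, 1.00)
print(mean(cluster + [(50.0, 50.0)]))  # pulled to about (10.8, 10.8)

Common remedies include removing such points during preprocessing or representing a cluster by an actual data point (a medoid) instead of the mean.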
K-means summary
• Despite its weaknesses, k-means is still the most
popular clustering algorithm due to its simplicity and
efficiency.
– Other clustering algorithms have their own lists of
weaknesses.
• No clear evidence that any other clustering
algorithm performs better in general
– although they may be more suitable for some specific
types of data or applications.
• Comparing different clustering algorithms is a
difficult task. No one knows the correct clusters!
Hierarchical Clustering
• Produces a nested sequence of clusters, a tree,
also called a dendrogram.
Agglomerative clustering
It is more popular than divisive methods.
• At the beginning, each data point forms
its own cluster (also called a node).
• Merge the nodes/clusters that have the least
distance between them.
• Continue merging.
• Eventually all nodes belong to one
cluster (a minimal sketch follows this list).
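Here is a minimal Python sketch of this agglomerative (bottom-up) procedure, assuming Euclidean distance between points and single-link (closest-pair) distance between clusters; the names and the single-link choice are illustrative assumptions, not prescribed by the lecture.

import math

def point_dist(x, y):
    # Euclidean distance between two numeric tuples
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def single_link(c1, c2):
    # Distance between two clusters = distance of their closest pair of points
    return min(point_dist(x, y) for x in c1 for y in c2)

def agglomerative(D):
    # Start with every data point in its own cluster (node)
    clusters = [[x] for x in D]
    merges = []  # records the merge order, i.e. the dendrogram structure
    while len(clusters) > 1:
        # Find the pair of clusters with the least distance and merge them
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

For example, agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]) merges the two nearby pairs first and joins the resulting groups last.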
Confusion matrix