Unsupervised Learning Update
Road map
Basic concepts
Hierarchical clustering
K-means algorithm
Representation of clusters
Distance functions
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the
data that relate data attributes to a target
(class) attribute.
These patterns are then utilized to predict the
values of the target attribute in future data
instances.
Unsupervised learning: The data have no
target attribute.
We want to explore the data to find some intrinsic
structures in them.
Clustering
Clustering is a technique for finding similarity groups
in data, called clusters. That is,
it groups data instances that are similar to (near) each other
into one cluster, and data instances that are very different (far
away) from each other into different clusters.
Clustering is often called an unsupervised learning
task because no class values denoting an a priori grouping
of the data instances are given, as there would be in
supervised learning.
For historical reasons, clustering is often
considered synonymous with unsupervised learning;
in fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points,
i.e., 3 natural clusters.
What is clustering for? (cont…)
Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
to produce a topic hierarchy.
In fact, clustering is one of the most widely used
data mining techniques.
It has a long history and is used in almost every
field, e.g., medicine, psychology, botany,
sociology, biology, archeology, marketing,
insurance, libraries, etc.
In recent years, due to the rapid growth of online
documents, text clustering has become important.
K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued
space X ⊆ R^r, and r is the number of attributes
(dimensions) in the data (a concrete array sketch follows below).
The k-means algorithm partitions the given
data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user
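For concreteness, such a data set can be held as an n × r array. A minimal sketch in Python/NumPy, with made-up values:

    import numpy as np

    # A hypothetical data set D with n = 5 data points and r = 2 attributes;
    # each row is one point xi = (xi1, xi2) in R^2.
    D = np.array([[1.0, 2.0],
                  [1.5, 1.8],
                  [5.0, 8.0],
                  [8.0, 8.0],
                  [1.0, 0.6]])
    n, r = D.shape  # n = 5 points, r = 2 dimensions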
K-means algorithm
Given k, the k-means algorithm works as
follows:
1) Randomly choose k data points (seeds) to be the
initial centroids (cluster centers).
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current
cluster memberships.
4) If a convergence criterion is not met, go to 2).
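These four steps can be sketched in a few lines of Python/NumPy. This is a minimal illustration, not the canonical implementation: the names are mine, it uses Euclidean distance, and it assumes no cluster ever becomes empty.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1) Randomly choose k data points (seeds) as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # 2) Assign each data point to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3) Re-compute each centroid as the mean of its current members
            #    (assumes every cluster keeps at least one member).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4) Stop when the centroids no longer move (one convergence criterion).
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids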
Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2 \quad (1)
where Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x
and centroid mj.
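Assuming Euclidean distance, equation (1) can be computed directly from the cluster assignments; a small sketch matching the kmeans() function above:

    import numpy as np

    def sse(X, labels, centroids):
        # Equation (1): for each point, the squared distance to the
        # centroid of its own cluster, summed over all points.
        diffs = X - centroids[labels]
        return float((diffs ** 2).sum())

A run can then be stopped once the decrease in this value between iterations falls below a threshold.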
An example
[Figure: k-means iterations on a two-dimensional data set; '+' marks the centroids.]
Strengths of k-means
Strengths:
Simple: easy to understand and to implement
Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
Since both k and t are usually small, k-means is considered a
linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is
used; the global optimum is hard to find due to the
complexity of the problem.
Weaknesses of k-means
The algorithm is only applicable if the mean is
defined.
For categorical data, the k-modes variant represents the
centroid by the most frequent value of each attribute (see the sketch after this list).
The user needs to specify k.
The algorithm is sensitive to outliers
Outliers are data points that are very far away from
other data points.
Outliers could be errors in the data recording or
some special data points with very different values.
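To illustrate the k-modes idea named above, a mode-based "centroid" for categorical records might be computed as follows (a sketch only; full k-modes also needs a matching-based distance, which is not shown):

    from collections import Counter

    def mode_centroid(rows):
        # Take the most frequent value of each attribute
        # (ties are broken arbitrarily by Counter).
        return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

    # e.g. mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")])
    # returns ['red', 'M']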
Weaknesses of k-means: Problems with outliers
[Exercise: check the players and their attributes (data table omitted).]