Unit 6 Unsupervised Learning
Road map
Basic concepts
K-means algorithm
Representation of clusters
Hierarchical clustering
Distance functions
Which clustering algorithm to use?
Cluster evaluation
Summary
$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2 \qquad (1)$$
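As a small illustration of Eq. (1), the sketch below computes the SSE of a given clustering. It assumes Euclidean distance and takes m_j to be the mean of cluster C_j; the function names and the toy data are made up for this example.

```python
def squared_euclidean(x, m):
    """Squared Euclidean distance between point x and centroid m."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))


def centroid(points):
    """Mean vector of a non-empty list of points (assumed meaning of m_j)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters):
    """Sum of squared errors over all clusters.

    clusters: list of clusters, each a non-empty list of points (tuples).
    """
    total = 0.0
    for points in clusters:
        m = centroid(points)
        total += sum(squared_euclidean(x, m) for x in points)
    return total


# Toy data (hypothetical): two clusters of two points each.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],  # centroid (1, 2)
    [(5.0, 5.0), (7.0, 5.0)],  # centroid (6, 5)
]
print(sse(clusters))  # each point is at distance 1 from its centroid -> 4.0
```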
the cluster.
Compute the radius and
standard deviation of the cluster to determine its
spread in each dimension.
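A minimal sketch of these cluster statistics follows. Two choices here are assumptions rather than definitions from the slides: the radius is taken as the largest Euclidean distance from the centroid to any point in the cluster, and the standard deviation is the population form (dividing by n), computed separately for each dimension.

```python
import math


def centroid(points):
    """Mean vector of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def radius(points):
    """Assumed definition: maximum distance from the centroid to a point."""
    m = centroid(points)
    return max(math.dist(x, m) for x in points)


def std_per_dimension(points):
    """Population standard deviation of the cluster along each dimension."""
    m = centroid(points)
    n = len(points)
    return tuple(
        math.sqrt(sum((x[d] - m[d]) ** 2 for x in points) / n)
        for d in range(len(m))
    )


# Toy cluster (hypothetical): the four corners of a 4-by-2 rectangle.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
print(centroid(cluster))           # (2.0, 1.0)
print(radius(cluster))             # sqrt(5), about 2.236
print(std_per_dimension(cluster))  # (2.0, 1.0)
```

The per-dimension values show that this cluster is spread twice as widely along the first axis as along the second, which a single radius would not reveal.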
distance.
Go on merging
to
run several algorithms using different distance functions
and parameter settings, and
then carefully analyze and compare the results.
The interpretation of the results must be based on
insight into the meaning of the original data together
with knowledge of the algorithms used.
Clustering is highly application dependent and, to a
certain extent, subjective (personal preferences).
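The practice of trying several settings and comparing the results can be sketched as below. This is only an illustration under stated assumptions: it uses a plain k-means with Euclidean distance, varies just the number of clusters k and the random seed, and compares runs by SSE. The function `kmeans`, its parameters, and the toy data are hypothetical; swapping in other algorithms or distance functions is not shown.

```python
import random


def sq_dist(x, m):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def mean(points):
    """Mean vector of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(data, k, seed, max_iter=100):
    """Plain k-means; returns (centroids, assignment, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k data points as initial seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Recompute centroids; keep the old one if a cluster became empty.
        for j in range(k):
            members = [x for x, a in zip(data, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    sse = sum(sq_dist(x, centroids[a]) for x, a in zip(data, assignment))
    return centroids, assignment, sse


# Toy data (hypothetical): three visually separate groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
        (1.0, 9.0), (1.5, 8.5), (2.0, 9.0)]

results = {}
for k in (2, 3, 4):
    for seed in (0, 1, 2):
        _, _, sse = kmeans(data, k, seed)
        results[(k, seed)] = sse
        print(f"k={k} seed={seed} SSE={sse:.3f}")
```

The printed table is only a starting point: SSE tends to fall as k grows, so the numbers still have to be read against what the data mean.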
Cluster quality is hard to evaluate because
we do not know the correct clusters.
Some methods are used:
User inspection
Study centroids and spreads
Rules from a decision tree.
For text documents, one can read some documents in
clusters.
algorithms.
A real-life data set for clustering has no class labels.
Thus, although an algorithm may perform very well on some
labeled data sets, there is no guarantee that it will perform well on
the actual application data at hand.
The fact that it performs well on some labeled data
sets does give us some confidence in the quality of
the algorithm.
This evaluation method is said to be based on
external data or information.
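One way to score a clustering against known class labels is purity: the fraction of points that fall in the majority class of their cluster. Purity is used here only as an example of such an external measure; it is an assumption of this sketch, not a measure named in the text above, and the data are invented.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of points belonging to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical output of a clustering algorithm on a labeled data set.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
class_labels = ["a", "a", "b", "b", "b", "b", "a", "c", "c", "c"]

# Majority counts per cluster are 2, 3 and 3, so purity = 8 / 10.
print(purity(cluster_ids, class_labels))  # 0.8
```

The class labels are used only for scoring after the fact; the clustering itself never sees them.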