Clustering
Single Prototypes
In this case, the patterns of each class tend to
cluster tightly about a typical or
representative pattern for that class. Under
these conditions, minimum-distance classifiers
can constitute a very effective approach to the
classification problem.
Contd..
Consider M pattern classes and assume that these classes are representable by prototype patterns Z_1, Z_2, ..., Z_M. The Euclidean distance between an arbitrary pattern X and the ith prototype is

D_i = ||X − Z_i|| = [(X − Z_i)^t (X − Z_i)]^{1/2} ………………….(1)
Contd..
A minimum-distance classifier computes the
distance from a pattern x of unknown
classification to the prototype of each class,
and assigns the pattern to the class to which it
is closest.
Contd..
In other words, X is assigned to class w_i if D_i < D_j for all j ≠ i. Ties are resolved arbitrarily.
Squaring both sides gives

D_i² = ||X − Z_i||² = (X − Z_i)^t (X − Z_i)
     = X^t X − 2 X^t Z_i + Z_i^t Z_i
     = X^t X − 2 (X^t Z_i − (1/2) Z_i^t Z_i)
Contd..
Choosing the minimum D_i² is equivalent to choosing the minimum D_i, since all distances are positive. Moreover, since the term X^t X is independent of i, choosing the minimum D_i² is equivalent to choosing the maximum (X^t Z_i − (1/2) Z_i^t Z_i). Consequently, we may define the decision functions:
Contd..
d_i(X) = X^t Z_i − (1/2) Z_i^t Z_i,  i = 1, 2, ..., M ……….(2)
where X is assigned to class w_i if d_i(X) > d_j(X) for all j ≠ i.
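The single-prototype rule can be sketched directly from Eq. (2). A minimal illustration follows; the prototype vectors and test patterns below are my own examples, not from the text.

```python
import numpy as np

def classify(X, prototypes):
    """Minimum-distance classifier via the linear decision functions
    d_i(X) = X'Z_i - (1/2) Z_i'Z_i; returns the index of the maximum d_i."""
    X = np.asarray(X, dtype=float)
    scores = [X @ Z - 0.5 * (Z @ Z) for Z in map(np.asarray, prototypes)]
    return int(np.argmax(scores))

# Illustrative prototypes Z_1 and Z_2 (class indices 0 and 1).
prototypes = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
print(classify([1.0, 0.5], prototypes))  # nearest prototype is Z_1 -> 0
print(classify([3.5, 4.2], prototypes))  # nearest prototype is Z_2 -> 1
```

Note that maximizing d_i(X) gives the same assignment as minimizing the Euclidean distance, without computing any square roots.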
When each class w_i is instead represented by multiple prototypes Z_i^l, l = 1, 2, ..., N_i, the distance from X to class w_i is

D_i = min_l ||X − Z_i^l||,  l = 1, 2, ..., N_i
Contd..
Following the development for the single prototype, the decision functions are

d_i(X) = max_l { X^t Z_i^l − (1/2) (Z_i^l)^t Z_i^l },  l = 1, 2, ..., N_i

where, as before, X is placed in class w_i if d_i(X) > d_j(X) for all j ≠ i.
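The multiprototype case only changes the inner step: take the maximum of the linear decision function over each class's prototype set. A short sketch, with made-up prototype sets:

```python
import numpy as np

def classify_multi(X, prototype_sets):
    """d_i(X) = max over prototypes Z_i^l of X'Z - (1/2) Z'Z;
    assign X to the class with the largest d_i."""
    X = np.asarray(X, dtype=float)
    d = [max(X @ Z - 0.5 * (Z @ Z) for Z in map(np.asarray, Zs))
         for Zs in prototype_sets]
    return int(np.argmax(d))

# Class 0 has two prototypes (near the origin and near (0, 5));
# class 1 has one prototype near (5, 0). Values are illustrative.
sets = [[[0.0, 0.0], [0.0, 5.0]], [[5.0, 0.0]]]
print(classify_multi([0.2, 4.6], sets))  # nearest prototype is (0, 5) -> 0
```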
A similarity measure between two patterns X and Z is

S(X, Z) = X^t Z / (X^t X + Z^t Z − X^t Z)
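This measure is straightforward to evaluate; a small sketch follows, with sample vectors of my own choosing.

```python
import numpy as np

def similarity(X, Z):
    """S(X, Z) = X'Z / (X'X + Z'Z - X'Z)."""
    X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
    xz = X @ Z
    return xz / (X @ X + Z @ Z - xz)

# For binary vectors this counts shared 1s over the union of 1s.
print(similarity([1, 1, 0, 1], [1, 0, 0, 1]))  # -> 0.666...
```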
Clustering Criteria
After a measure of pattern similarity has
been adopted, we are still faced with the
problem of specifying a procedure for
partitioning the given data into cluster
domains.
Contd..
The clustering criterion used may represent a
heuristic scheme, or it may be based on the
minimization (or maximization) of a certain
performance index. The heuristic approach is
guided by intuition and experience.
Contd..
A widely used performance index is the sum of squared errors,

J = Σ_{j=1}^{K} Σ_{X ∈ C_j} ||X − z_j||²

where z_j = (1/#C_j) Σ_{X ∈ C_j} X and #C_j represents the number of points in C_j.
Contd..
The number of ways of partitioning n patterns into K nonempty clusters is

S(n, K) = (1/K!) Σ_{j=1}^{K} (−1)^{K−j} (K choose j) j^n
This clearly indicates that exhaustive enumeration cannot lead to the required solution for most practical problems in reasonable computation time. For example, for the crude-oil data, the exact solution requires the examination of S(56, 3) ≈ 10^25 partitions.
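The formula above is easy to evaluate exactly with integer arithmetic, which also confirms the order of magnitude quoted for the crude-oil data:

```python
from math import comb, factorial

def stirling2(n, k):
    """S(n, k) = (1/k!) * sum_{j=1}^{k} (-1)^(k-j) * C(k, j) * j^n:
    the number of partitions of n patterns into k nonempty clusters."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(1, k + 1)) // factorial(k)

print(stirling2(4, 2))   # 7 ways to split 4 patterns into 2 clusters
print(stirling2(56, 3))  # ~8.7e25: hopeless for exhaustive enumeration
```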
Contd..
Thus, approximate heuristic techniques seeking a compromise or looking for an acceptable solution have usually been adopted. One such method is Forgy's K-means algorithm.
Algorithm K-means
Step 1 : Select an initial cluster configuration.
Repeat
  Step 2 : Calculate the cluster centers z_j, j = 1, 2, ..., K of the existing groups.
  Step 3 : Redistribute the patterns among the clusters utilizing the minimum squared Euclidean distance classifier concept:

    x_i ∈ C_j if ||x_i − z_j||² ≤ ||x_i − z_l||² for all l ∈ {1, 2, ..., K}, l ≠ j

Until no pattern changes its cluster membership.
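The three steps above can be sketched in a few lines. This is a minimal illustration of the alternating center-update / reassignment loop; the toy data is my own, not the crude-oil data mentioned earlier.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Forgy-style K-means: initial centers are k random patterns (Step 1),
    then alternate reassignment (Step 3) and center update (Step 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # Step 1
    for _ in range(iters):
        # Step 3: assign each x_i to its nearest center (squared distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: each center becomes the mean of its cluster
        # (assumes no cluster goes empty, which holds for this toy data)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, labels = kmeans(X, 2)
print(labels)  # the two tight pairs land in separate clusters
```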
Step 6. Compute the overall average distance of the samples from their respective cluster centers:

D̄ = (1/N) Σ_{j=1}^{N_C} N_j D̄_j
Contd..
Step 7.
(a) If this is the last iteration, set θ_C = 0 and go to Step 11.
(b) If N_C ≤ K/2, go to Step 8.
(c) If this is an even-numbered iteration, or if N_C ≥ 2K, go to Step 11; otherwise, continue.
Contd..
Step 8. Find the standard deviation vector σ_j = (σ_1j, σ_2j, ..., σ_nj)^t for each sample subset, using the relation

σ_ij = [ (1/N_j) Σ_{X_k ∈ S_j} (x_ik − z_ij)² ]^{1/2},  i = 1, 2, ..., n; j = 1, 2, ..., N_C
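The relation in Step 8 is just a per-coordinate standard deviation about the cluster center; a quick sketch, with an illustrative cluster of my own:

```python
import numpy as np

def std_vector(S_j):
    """sigma_ij = sqrt( (1/N_j) * sum over X in S_j of (x_ik - z_ij)^2 ),
    computed for all coordinates i at once."""
    S_j = np.asarray(S_j, dtype=float)
    z_j = S_j.mean(axis=0)                       # cluster center z_j
    return np.sqrt(((S_j - z_j) ** 2).mean(axis=0))

S1 = [[4.0, 3.0], [5.0, 4.0], [6.0, 5.0]]
print(std_vector(S1))  # per-coordinate spread about the center (5, 4)
```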
Step 1. K = 2, θ_N = 1, θ_S = 1, θ_C = 4, L = 0, I = 4.
Contd..
If no a priori information on the data being
analyzed is available, these parameters are
arbitrarily chosen and then adjusted during
successive iterations through the algorithm.
Step 2. Since there is only one cluster center, S_1 = {X_1, X_2, ..., X_8} and N_1 = 8.
Step 5. Compute D̄_j:

D̄_1 = (1/N_1) Σ_{X ∈ S_1} ||X − Z_1|| = 2.26
Contd..
Step 6. Compute D̄; in this case, D̄ = D̄_1 = 2.26.
Step 7. Since N_C ≤ K/2, go to Step 8.
Splitting Z_1 gives the two cluster centers

Z_1^+ = (4.38, 2.75)^t,  Z_1^− = (2.38, 2.75)^t
Contd..
For convenience these two cluster centers are renamed Z_1 and Z_2, respectively. Also, N_C is increased by 1. Go to Step 2.
Step 2. The sample sets are now S_1 = {X_4, X_5, X_6, X_7, X_8}, S_2 = {X_1, X_2, X_3}, and N_1 = 5, N_2 = 3.
Step 3. Since both N_1 and N_2 are greater than θ_N, no subsets are discarded.
Contd..
Step 4. Update the cluster centers:

Z_1 = (1/N_1) Σ_{X ∈ S_1} X = (4.80, 3.80)^t,  Z_2 = (1/N_2) Σ_{X ∈ S_2} X = (1.00, 1.00)^t

Step 5. Compute D̄_j:

D̄_1 = (1/N_1) Σ_{X ∈ S_1} ||X − Z_1|| = 0.80,  D̄_2 = (1/N_2) Σ_{X ∈ S_2} ||X − Z_2|| = 0.94
Contd..
Step 6. Compute D̄:

D̄ = (1/N) Σ_{j=1}^{N_C} N_j D̄_j = (1/8)[(5)(0.80) + (3)(0.94)] = 0.85
Step 8. The standard deviation vectors are

σ_1 = (0.75, 0.75)^t,  σ_2 = (0.82, 0.82)^t
Contd..
Step 9. In this case σ_1max = 0.75 and σ_2max = 0.82.
Step 10. The conditions for splitting are not
satisfied. Therefore, we proceed to Step 11.
Step 11. We obtain the same result as in the
previous iteration:
D_12 = ||Z_1 − Z_2|| = 4.72
Distances between cluster centers:

Cluster Center   Z_1   Z_2   Z_3   Z_4    Z_5
Z_4                                0.0    49.3
Z_5                                       0.0
Cluster domain variances:

Cluster Domain   1     2     3     4
S_1              1.2   0.7   0.9   1.0
Note:
Since domain S_1 has very similar variances along all coordinate axes, it can be expected to be roughly spherical in nature. Cluster domain S_5, on the other hand, is significantly elongated along the third coordinate axis. A similar analysis can be carried out for the other domains.