An Efficient Method For Active Semi-Supervised Density Based Clustering
Viet-Vu Vu, International Journal of Advances in Computer Science and Technology, 4(4), April 2015, 59 - 62
ABSTRACT
Semi-supervised clustering algorithms rely on side
information, either labeled data (seeds) or pairwise
constraints (must-link or cannot-link) between data objects, to
improve the quality of clustering. This paper proposes to
extend an existing seed-based clustering algorithm with an
active learning mechanism to collect pairwise constraints. My
new semi-supervised algorithm can deal with both seeds and
constraints. Experimental results on real data sets show the
efficiency of my algorithm when compared to the initial
seed-based clustering algorithm.
Key words: semi-supervised clustering, active learning,
seed, constraint.
1. INTRODUCTION
Then, Section 4 presents the experimental protocol and the
preliminary results. Finally, Section 5 concludes and outlines
some perspectives for this research.
Algorithm 2: ActSSDBSCAN
Input: Set of data points X = {x_1, x_2, ..., x_n}, x_i ∈ R^d;
{S_1, S_2, ..., S_K} is the set of seeds
Output: Disjoint K partitioning {X_1, X_2, ..., X_K}
Repeat
Step 1: Build a cluster as in SSDBSCAN; if the stop condition
is true, go to Step 3.
Step 2: For each sorted edge, ask the expert whether the relation
between its vertices is a must-link (ML) or cannot-link (CL)
constraint.
Step 3: While the expert's answer is ML, go to Step 2.
Step 4: If the expert's answer is CL, choose the rDist value of
that edge as the separation distance and obtain a cluster.
Until the set of seeds is empty
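The query loop of Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `find_separation_edge` and `make_simulated_expert` are hypothetical, the edges are assumed to be already collected as (u, v, rDist) triples during cluster construction, and the visiting order (largest rDist first) is an assumption, since the algorithm only says the edges are sorted. The expert is simulated from known class labels.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (vertex u, vertex v, rDist of the edge)


def find_separation_edge(
    edges: Iterable[Edge],
    ask_expert: Callable[[Hashable, Hashable], str],
) -> Tuple[Optional[Edge], int]:
    """Query edges until the expert answers cannot-link (CL).

    Assumption: edges are visited from largest to smallest rDist.
    Returns the first CL edge (or None if every answer is ML)
    together with the number of expert queries that were made.
    """
    queries = 0
    for u, v, r_dist in sorted(edges, key=lambda e: e[2], reverse=True):
        queries += 1
        if ask_expert(u, v) == "CL":
            # The rDist of this edge serves as the separation distance.
            return (u, v, r_dist), queries
        # Answer was ML: move on to the next edge.
    return None, queries


def make_simulated_expert(
    labels: Dict[Hashable, Hashable],
) -> Callable[[Hashable, Hashable], str]:
    """Hypothetical oracle: answers ML when two points share a class label."""
    def ask(u: Hashable, v: Hashable) -> str:
        return "ML" if labels[u] == labels[v] else "CL"
    return ask


# Toy example: points 0-2 belong to class "a", points 3-4 to class "b".
edges = [(0, 1, 0.2), (1, 2, 0.3), (2, 3, 1.5), (3, 4, 0.25)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
cl_edge, n_queries = find_separation_edge(edges, make_simulated_expert(labels))
print(cl_edge, n_queries)
```

In this toy case the longest edge already crosses the class boundary, so a single query is enough; when the longest edges are must-links, the loop keeps querying until it meets a cannot-link.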
should help minimize expert solicitations during the active
learning step.
4. EXPERIMENT RESULTS
I use five real datasets from the Machine Learning Repository
[15], namely Protein, Iris, Glass, Thyroid, and LetterIJL, to
evaluate my algorithm. The details of the datasets are shown in
Table 1.
Table 1: Details of the datasets
Dataset      Objects   Attributes
Protein      115       20
Iris         150
Glass        214
Thyroid      215
LetterIJL    227       16
\[ RI(P_1, P_2) = \frac{2(a + b)}{n(n - 1)} \qquad (1) \]
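As a worked illustration of Equation (1), the following Python sketch computes the Rand Index of two partitions given as label lists. It assumes the standard Rand Index definitions, which are not spelled out in the surviving text: a is the number of pairs of objects placed in the same cluster in both P1 and P2, b is the number of pairs placed in different clusters in both, and n is the number of objects. The function name `rand_index` is illustrative.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(p1: Sequence[Hashable], p2: Sequence[Hashable]) -> float:
    """Rand Index RI(P1, P2) = 2(a + b) / (n(n - 1)).

    p1[i] and p2[i] are the cluster labels of object i in each partition.
    """
    if len(p1) != len(p2):
        raise ValueError("both partitions must label the same objects")
    n = len(p1)
    if n < 2:
        raise ValueError("at least two objects are needed")

    a = 0  # pairs grouped together in both partitions
    b = 0  # pairs separated in both partitions
    for i, j in combinations(range(n), 2):
        same_in_p1 = p1[i] == p1[j]
        same_in_p2 = p2[i] == p2[j]
        if same_in_p1 and same_in_p2:
            a += 1
        elif not same_in_p1 and not same_in_p2:
            b += 1
    return 2 * (a + b) / (n * (n - 1))


# Identical partitions (up to renaming of the labels) give RI = 1.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in n, which is acceptable for datasets of a few hundred objects such as those used here.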
5. CONCLUSION
This paper presents a new active learning density-based
clustering algorithm named ActSSDBSCAN. To the best of
my knowledge, this is the first semi-supervised algorithm to
use both seeds and constraints as side information.
Preliminary results on real data sets show the benefit of my
approach when compared to SSDBSCAN. Future research
Figure 3: Experiment results (panels: Protein, Iris, Thyroid, LetterIJL, Soybean)
REFERENCES