An automatic three-way clustering method based on sample similarity
https://doi.org/10.1007/s13042-020-01255-8
ORIGINAL ARTICLE
Received: 17 July 2020 / Accepted: 9 December 2020 / Published online: 14 January 2021
© Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Three-way clustering is an extension of traditional clustering that adds the concept of a fringe region, which can effectively solve the problem of inaccurate decision-making caused by inaccurate information or insufficient data in traditional two-way clustering methods. Existing three-way clustering works often select the number of clusters and the thresholds for the three-way partition by subjective tuning. However, fixing the number of clusters and the partition thresholds cannot automatically yield the optimal number of clusters and partition thresholds for data sets with different sizes and densities. To address this problem, this paper proposes an improved three-way clustering method. First, we define the roughness degree by introducing the sample similarity to measure the uncertainty of the fringe region. Moreover, based on the roughness degree, we define a novel partitioning validity index to measure the clustering partitions and propose an automatic threshold selection method. Second, based on the concept of sample similarity, we introduce the intra-class similarity and the inter-class similarity to describe the quantitative change of the relationship between the samples and the clusters, and, by integrating these two kinds of similarities, define a novel clustering validity index to measure the clustering performance under different numbers of clusters. Furthermore, we propose an automatic cluster number selection method. Finally, we give an automatic three-way clustering approach by combining the proposed threshold selection method and the cluster number selection method. Comparison experiments demonstrate the effectiveness of our proposal.
* Weiwei Li (corresponding author): [email protected]
Xiuyi Jia: [email protected]
Ya Rao: [email protected]
Sichun Yang: [email protected]
Hong Yu: [email protected]

1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
3 School of Computer Science and Technology, Anhui University of Technology, Ma'anshan 243032, China
4 College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

1 Introduction
with uncertainty information based on the semantic studies of positive rules, boundary rules and negative rules coming from rough set theory. When the decision makers have sufficient confidence based on the given information, the acceptance or rejection in two-way decisions is usually adopted to output an immediate result. However, when the decision makers have insufficient confidence in the prejudgment based on the current information, the deferment decision is often adopted to reduce the risk brought by immediate decisions. For uncertain information, whether one accepts or rejects, there is a high probability of making the wrong decision.

Three-way clustering [32, 35, 36] is an important application of three-way decisions theory, which can effectively solve the problem of inaccurate partition caused by incomplete information or insufficient data in traditional two-way clustering methods. Compared with two-way clustering methods, three-way clustering incorporates the concept of a fringe region (or boundary region) for uncertain samples, and the clustering result is mainly affected by the number of clusters and the thresholds for making three-way decisions. In existing works, people usually choose the number of clusters based on expert opinion, and select the same constant threshold for all data in the iterations of making three-way decisions. However, such fixed choices of thresholds and cluster numbers do not provide a good indication of the differences between clusters or data sets, especially for data sets with different sizes and densities.

Generally, three-way clustering can be regarded as an effective combination of unsupervised clustering and three-way partitioning of clusters [33]. Firstly, from the perspective of clustering validity, it is expected that the samples in different clusters are easy to distinguish after clustering, that is, the distance between different clusters is as large as possible. For the samples in the same cluster, it is expected that the samples are as close as possible, that is, the smaller the distance between the samples within the cluster, the better [6]. Similarly, the three-way partitioning of clusters improves the accuracy of the samples contained in each cluster by introducing the fringe region, which is beneficial for the accuracy of the overall clustering result [32]. However, the introduction of the fringe region also increases the uncertainty of the knowledge representation of the clustering result. Therefore, from the perspective of partition validity, the sample distribution and its uncertainty in the fringe region are the key factors affecting the performance of the cluster partitioning. In view of this, for three-way clustering, it is expected to improve the performance of clustering by selecting an appropriate number of clusters and appropriate thresholds for three-way partitioning. To be specific, the number of clusters should be selected to make the samples within the same cluster closer and the samples in different clusters farther apart, and the thresholds for three-way partitioning should be selected to maximize the accuracy of the samples contained in each cluster and minimize the uncertainty of the knowledge representation of the final clustering result.

To address the above problems, in this paper, we propose an automatic three-way clustering method based on sample similarity, in which the quantitative changes of the relationship between all samples and clusters are considered. First, we define the sample similarity to describe the quantitative relationship among samples. Based on the sample similarity, we redefine the concept of roughness to measure the uncertainty of the clusters after partitioning. Moreover, by considering the roughness and the sample distribution of the fringe region, we also define the partition validity index to measure the performance of the cluster partitions corresponding to different partition thresholds. Second, based on the sample similarity, we introduce the concepts of intra-class sample similarity and inter-class sample similarity to describe the quantitative relationships of samples within the same cluster and samples in different clusters, respectively. Furthermore, by combining the two kinds of sample similarities, we also define the cluster validity index to measure the performance of clustering corresponding to the number of clusters. Third, we design an automatic three-way clustering method based on the proposed partition validity index and cluster validity index, which aims to obtain the appropriate number of clusters and thresholds of three-way partitioning for achieving a high clustering performance. At last, we implement comparative experiments to verify the effectiveness of the proposed method.

The remainder of this paper is organized as follows. Section 2 introduces the preliminary knowledge and related work of three-way decisions theory and three-way clustering. Section 3 gives an automatic threshold selection algorithm based on the proposed sample similarity. Section 4 gives an automatic cluster number selection algorithm. Section 5 integrates the threshold selection algorithm and the cluster number selection algorithm to implement an automatic three-way clustering method. Section 6 reports the comparison experimental results. Section 7 concludes this paper.

2 Preliminary knowledge and related work

In this section, we will introduce the preliminary knowledge and related work on three-way decisions theory and three-way clustering.

2.1 Three-way decisions

In classical two-way decisions theory, one only has two kinds of choices: accept or reject. However, in many real-world applications, if the existing information is insufficient to support decisions that are explicitly accepted or rejected, then no matter which of the two decisions is chosen, it may cause inaccuracy in the decision.
Three-way decisions (3WD) theory [29] is an extension of classical two-way decisions (2WD) theory, in which a deferment decision (or non-commitment decision) is adopted when the current information is limited or insufficient to make an acceptance or rejection decision.

In 3WD theory, a set of samples X can be divided into three pairwise disjoint regions through three-way decisions, namely the positive region POS(X), the boundary region BND(X) and the negative region NEG(X), which respectively correspond to the acceptance decision, the deferment decision and the rejection decision. A simple representation of 3WD is to construct three-way decision rules by incorporating a pair of thresholds. To be specific, for a given sample x and a subset of samples X, let p(X|x) be the conditional probability that x belongs to X. Given a pair of thresholds (α, β) with the assumption 0 ≤ β ≤ α ≤ 1, we can make the following 3WD based rules:

(P): If p(X|x) ≥ α, decide x ∈ POS(X);
(B): If β < p(X|x) < α, decide x ∈ BND(X);
(N): If p(X|x) ≤ β, decide x ∈ NEG(X).
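For illustration, the three rules reduce to a small dispatch on p(X|x). The following is a minimal sketch; the function name and region labels are ours, not notation from the paper:

```python
def three_way_decide(p, alpha, beta):
    """Apply rules (P), (B), (N) to p = p(X|x) with thresholds 0 <= beta <= alpha <= 1."""
    assert 0.0 <= beta <= alpha <= 1.0
    if p >= alpha:   # rule (P): acceptance
        return "POS"
    if p <= beta:    # rule (N): rejection
        return "NEG"
    return "BND"     # rule (B): deferment

# With (alpha, beta) = (0.7, 0.3):
# three_way_decide(0.9, 0.7, 0.3) -> 'POS'
# three_way_decide(0.5, 0.7, 0.3) -> 'BND'
# three_way_decide(0.1, 0.7, 0.3) -> 'NEG'
```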
For the (P) rule, if the probability that x belongs to X is greater than or equal to α, we make an acceptance decision to classify x into the positive region of X. For the (B) rule, if the probability that x belongs to X is greater than β and less than α, we make a deferment decision to classify x into the boundary region of X. For the (N) rule, if the probability that x belongs to X is less than or equal to β, we make a rejection decision to classify x into the negative region of X.

3WD has attracted wide attention from researchers, and many results on 3WD have been reported from the perspectives of theory, model and application. Here we take some representative achievements from the three perspectives to introduce the related work of 3WD. In terms of theory, starting from the 3WD concept proposed by Yao, it has gradually expanded from 3W decisions to 3W + X theories, such as 3W classifications [10, 14, 19, 39, 40], 3W attribute reduction [8, 9, 11, 17], 3W clustering [1, 34, 36], 3W active learning [23], 3W concept learning [13], 3W concept lattices [26, 27], etc. Furthermore, Yao [30] discussed a wide sense of 3WD and proposed a trisecting-acting-outcome (TAO) model of 3WD. In addition, Liu and Liang [22] proposed a function based three-way decisions model to generalize the existing models. Li et al. [18] generalized 3WD models based on subset evaluation. Hu [5] studied three-way decision spaces. In terms of model, 3WD has been combined with existing models to generate many new models, such as the 3WD based Bayesian network [4], the neighbourhood three-way decision-theoretic rough set model [15], the 3WD based intuitionistic fuzzy decision-theoretic rough sets model [20], dynamic 3WD models [38], three-way enhanced convolutional neural networks (3W-CNN) [42], etc. In terms of application, 3WD has been successfully applied in sentiment analysis [41, 42], image processing [12], medical decision making [28], spam filtering [7], software defect detection [16], etc.

2.2 Three-way clustering

As an effective data processing method in machine learning and data mining, clustering can effectively divide samples without class information into multiple clusters composed of similar samples, thereby determining the sample distribution in each cluster and improving the efficiency of data post-processing.

Three-way clustering is an improvement of the K-means algorithm. To address the problem of inaccurate decision-making caused by inaccurate information or insufficient data in traditional clustering methods, three-way clustering adds the concepts of positive domain and fringe domain to the clusters in the traditional K-means-based results in order to further divide the clustering results, which forms the three regions of a cluster: the core region (represented by Co(C) for cluster C), the fringe region (represented by Fr(C)) and the trivial region (represented by Tr(C)). If x ∈ Co(C), the object x definitely belongs to the cluster C; if x ∈ Fr(C), the object x might belong to C; if x ∈ Tr(C), the object x definitely does not belong to C. Moreover, assume π_K = {U_1, U_2, …, U_K} is the set of K clusters of the set of all objects U obtained by the K-means clustering method; then the three regions of all clusters are represented as follows:

$$\pi_K^T = \{Co(\pi_K), Fr(\pi_K), Tr(\pi_K)\}. \tag{1}$$

According to 3WD theory, if x ∈ Co(π_K), then x is determined to belong to a certain cluster in π_K; if x ∈ Fr(π_K), then x may be related to multiple clusters in π_K; if x ∈ Tr(π_K), then x is determined not to belong to any cluster in π_K. Usually, we have Tr(π_K) = U − Co(π_K) − Fr(π_K) = ∅ and Co(π_K) ∩ Fr(π_K) = ∅.

In three-way clustering, each cluster can be represented by a pair of core region and fringe region. For the K-cluster partitioning of the sample set, it satisfies the following properties:

(1) ∀U_i ∈ π_K, Co(U_i) ≠ ∅;
(2) U = ⋃_{i=1}^{K} Co(U_i) ∪ ⋃_{i=1}^{K} Fr(U_i).

Property (1) requires that the core region of each cluster cannot be empty after division, that is, there is at least one sample in each cluster; property (2) is used to ensure that the clustering method implements an effective division for all samples.
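For illustration, a three-way result can be stored as a list of (core, fringe) set pairs and checked against properties (1) and (2). This is a minimal sketch with hypothetical names, not code from the paper:

```python
def is_valid_three_way_partition(U, clusters):
    """Check properties (1) and (2) for a result given as (Co(U_i), Fr(U_i)) pairs."""
    # property (1): no core region may be empty
    if any(not core for core, _ in clusters):
        return False
    # property (2): the union of all core and fringe regions must cover U
    covered = set()
    for core, fringe in clusters:
        covered |= set(core) | set(fringe)
    return covered == set(U)

# Two clusters sharing object 3 in their fringe regions still form a valid division:
# is_valid_three_way_partition({1, 2, 3, 4, 5}, [({1, 2}, {3}), ({4, 5}, {3})]) -> True
```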
Thus, the three-way clustering result can be represented as follows:

$$\pi_K = \{(Co(U_1), Fr(U_1)), \ldots, (Co(U_K), Fr(U_K))\}. \tag{2}$$

For the fringe regions Fr(U_i), if ⋃_{i=1}^{K} Fr(U_i) = ∅, then π_K = {Co(U_1), …, Co(U_K)} is a two-way clustering result. In the traditional clustering method, all samples can only be divided into uniquely determined clusters. In the three-way clustering method, if a sample is closely related to multiple clusters at the same time, it may belong to multiple clusters at the same time. In this case, it is reasonable to classify the sample into the fringe regions of the closely related clusters.

With the in-depth study of clustering methods, more and more scholars believe that traditional clustering methods are based on hard partitioning, which cannot truly reflect the actual relationship between samples and clusters. To address this problem, some approaches, such as fuzzy clustering, rough clustering and interval clustering, have been proposed to deal with this kind of uncertain relationship between samples and clusters. These approaches are also called soft clustering or overlapping clustering, in the sense that a sample can belong to more than one cluster [25, 31]. Using the concept of membership instead of the traditional distance-based methods to measure the degree of membership of each sample to a certain cluster, Friedman and Rubin first proposed the concept of fuzzy clustering [3]. Then, Dunn proposed the widely used fuzzy C-means (FCM) clustering algorithm based on the related concepts of fuzzy clustering [2]. Later, through the extension of the concepts of rough set theory, Lingras et al. proposed the rough C-means (RCM) clustering method [21], which considers the different influences of the positive and fringe regions of each cluster when calculating the corresponding center. Yu introduced 3WD theory into traditional clustering and proposed a framework of three-way cluster analysis [32, 33, 35].
3 An automatic threshold selection method based on sample similarity

In three-way clustering methods, each cluster has a corresponding positive region and fringe region. The thresholds for partitioning the regions of each cluster are the key factor in determining the final clustering performance, since different partition thresholds will generate different regions for the clusters. Three-way decisions theory makes the deferment decision for samples with uncertain information by classifying them into the fringe region. The introduction of such a fringe region can reduce the possibility of classifying a sample directly into the wrong cluster. However, for partitioned clusters, the introduction of the fringe region also brings additional uncertainty to the knowledge representation of the clustering result: the more samples the fringe region contains, the greater the uncertainty of the cluster representation. In view of this, in order to select the optimal thresholds that give the clusters the best partition performance, we define the roughness degree by introducing the sample similarity to measure the quantitative change of the sample distribution in the positive and fringe regions. Moreover, based on the roughness degree, we define a novel partitioning validity index to measure the clustering partitions and propose an automatic threshold selection method for categorical data.

3.1 Partitioning validity index

We first give the definition of sample similarity as follows.

Definition 1 (Sample similarity) [11] Let IS = (U, A) be an information system, where U denotes the set of all samples and A denotes the set of attributes describing each sample. For any two samples x_i ∈ U and x_j ∈ U, the similarity of x_i and x_j is:

$$S_A(x_i, x_j) = \frac{\sum_{k=1}^{|A|} I(x_{ik}, x_{jk})}{|A|}, \tag{3}$$

where |A| represents the number of attributes. For categorical data, the function I(x_{ik}, x_{jk}) is computed by:

$$I(x_{ik}, x_{jk}) = \begin{cases} 1 & x_{ik} = x_{jk} \\ 0 & x_{ik} \neq x_{jk} \end{cases}. \tag{4}$$

As defined above, the sample similarity is an indicator of how similar two samples are: the greater the value, the more similar the two samples.
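For categorical data, Eqs. (3) and (4) amount to the fraction of attributes on which two samples agree. A minimal sketch (the helper name is ours):

```python
import numpy as np

def sample_similarity(xi, xj):
    """Eqs. (3)-(4): fraction of categorical attributes on which xi and xj agree."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    assert xi.shape == xj.shape  # both samples are described by the same attribute set A
    return float(np.mean(xi == xj))

# sample_similarity(['red', 'round', 'small'], ['red', 'round', 'large']) -> 0.666...
```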
Through the introduction of the sample similarity, the related concepts in 3WD theory can be calculated, which are defined as follows.

Definition 2 (Equivalence relation) [11] Given an information system IS = (U, A), the equivalence relation based on a threshold α is represented by E_α:

$$E_\alpha = \{(x, y) \in U \times U \mid S_A(x, y) > \alpha\}. \tag{5}$$

Similarly, for a sample x, its equivalence class is computed by:

$$[x]_\alpha = \{y \in U \mid S_A(x, y) > \alpha\}. \tag{6}$$
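Under Eq. (6), an equivalence class is a simple similarity filter over U; a short sketch reusing sample_similarity from above:

```python
def equivalence_class(x, U, alpha):
    """Eq. (6): all samples in U whose similarity to x strictly exceeds alpha."""
    return [y for y in U if sample_similarity(x, y) > alpha]
```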
The equivalence relation E_α can divide U into several disjoint sample subsets, denoted by π_α. For a given objective subset X ⊆ U, we can define its lower and upper approximations based on α as follows:

$$\underline{apr}_\alpha(X) = \{x \in U \mid [x]_\alpha \subseteq X\}, \qquad \overline{apr}_\alpha(X) = \{x \in U \mid [x]_\alpha \cap X \neq \emptyset\}.$$
As mentioned above, the introduction of the fringe region in three-way clustering is a mixed blessing. On the one hand, it can reduce the classification error by classifying samples with less certainty into the fringe region. On the other hand, it also brings additional uncertainty for knowledge representation. By considering the trade-off relationship between the size of the fringe region and the roughness, we define a novel partitioning validity index as follows.

Definition 4 (Partitioning validity index) Given an information system IS = (U, A) and a subset of samples X ⊆ U, for a specific threshold α, the partitioning validity index PVI_α(X) of X is defined as:

For the partitioning validity index, its value tends to be 0 when either the fringe region is maximized or the roughness is minimized, so we choose the value of the largest PVI as the balance point between the roughness and the size of the fringe region, indicating the best comprehensive performance of the clustering result. Thus, we can construct an optimization problem for the threshold selection problem, in which the largest PVI_α(X) is determined by the optimal threshold α:

$$\arg\max_{\alpha} PVI_\alpha(X). \tag{11}$$

Since PVI_α(X) presents a trade-off relationship between the size of the fringe region and the roughness, there does not exist a monotonicity property between PVI_α(X) and α. In order to obtain the optimal value of α, we traverse all candidate partition thresholds with an appropriate step size. Algorithm 1 shows the description of the proposed algorithm.
does not exist a monotonicity property between PVI𝛼 (X) and
𝛼 . In order to obtain the optimal value of 𝛼 , we traverse
13
1550 International Journal of Machine Learning and Cybernetics (2021) 12:1545–1556
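The search itself is a plain grid scan. A minimal sketch, treating the index of Definition 4 as a callable and using an illustrative step size of 0.05 (the paper does not fix one here):

```python
import numpy as np

def select_threshold(pvi, step=0.05):
    """Eq. (11): alpha* = argmax_alpha PVI_alpha(X), found by grid search.

    pvi: a callable alpha -> PVI_alpha(X) implementing Definition 4.
    PVI is not monotonic in alpha, so every candidate must be evaluated.
    """
    alphas = np.arange(0.0, 1.0 + 1e-9, step)
    scores = [pvi(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```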
Similarly, we can change the number of clusters by changing the inter-class distance between any two clusters and the intra-class distance within each cluster, thereby affecting the final clustering performance. To select an appropriate number of clusters, we first define the intra-class sample similarity and the inter-class sample similarity, two concepts that fully reflect, respectively, the tightness of the sample distribution within each cluster and the difference of the sample distribution between various clusters. Then, by integrating the intra-class sample similarity and the inter-class similarity, we define a clustering validity index to measure the performance of clusterings related to the number of clusters.
Definition 7 (Inter-class similarity for two clusters) Given an information system IS = (U, A), let π_K = {U_1, U_2, …, U_K} be the partition of U with K clusters. The inter-class similarity InterS(U_i, U_j) between any two clusters U_i and U_j under the set of attributes A is computed as:

$$InterS(U_i, U_j) = \frac{\sum_{m=1}^{|U_i|} \sum_{n=1}^{|U_j|} S_A(U_{im}, U_{jn})}{|U_i| \cdot |U_j|}, \tag{14}$$

where U_{im} and U_{jn} are the m-th sample in cluster U_i and the n-th sample in cluster U_j, respectively.

Moreover, we can define the inter-class similarity for the partition π_K as follows.

Definition 8 (Inter-class similarity for all clusters) Given an information system IS = (U, A), the inter-class similarity InterS(π_K) represents the sample similarity of the partition π_K = {U_1, U_2, …, U_K}, which is computed as:

$$InterS(\pi_K) = \frac{2 \cdot \sum_{i=1}^{K-1} \sum_{j>i} InterS(U_i, U_j)}{K \cdot (K-1)}. \tag{15}$$
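Eqs. (14) and (15) translate into nested averages over cluster pairs. A minimal sketch reusing sample_similarity from Sect. 3.1 (helper names are ours):

```python
import numpy as np

def inter_class_similarity(Ci, Cj):
    """Eq. (14): mean pairwise sample similarity between clusters Ci and Cj."""
    return float(np.mean([sample_similarity(x, y) for x in Ci for y in Cj]))

def inter_class_similarity_all(clusters):
    """Eq. (15): average of InterS(U_i, U_j) over all K(K-1)/2 unordered pairs."""
    K = len(clusters)
    total = sum(inter_class_similarity(clusters[i], clusters[j])
                for i in range(K - 1) for j in range(i + 1, K))
    return 2.0 * total / (K * (K - 1))
```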
As discussed above, the intra-class similarity for all clusters and the inter-class similarity for all clusters are two important factors that reflect the clustering performance. Thus, by integrating the two kinds of similarities, we define a novel clustering validity index as follows.

Definition 9 (Clustering validity index) Given an information system IS = (U, A) and the K clusters π_K = {U_1, U_2, …, U_K}, the clustering validity index for π_K is computed as:

$$CVI(\pi_K) = IntraS(\pi_K) - InterS(\pi_K). \tag{16}$$

This definition tells us that a good clustering result with a large CVI(π_K) will have a large intra-class sample similarity and a small inter-class sample similarity.
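Eq. (16) is then a single subtraction. The intra-class similarity IntraS (Definitions 5 and 6) is not reproduced above, so the sketch below assumes the natural form, namely the mean pairwise within-cluster similarity averaged over clusters; treat that assumption, and the helper names, as ours rather than the paper's:

```python
import numpy as np

def intra_class_similarity_all(clusters):
    """Assumed IntraS(pi_K): mean within-cluster pairwise similarity, averaged over clusters."""
    per_cluster = []
    for C in clusters:
        pairs = [sample_similarity(C[m], C[n])
                 for m in range(len(C)) for n in range(m + 1, len(C))]
        per_cluster.append(np.mean(pairs) if pairs else 1.0)  # a singleton cluster is maximally tight
    return float(np.mean(per_cluster))

def clustering_validity_index(clusters):
    """Eq. (16): CVI(pi_K) = IntraS(pi_K) - InterS(pi_K)."""
    return intra_class_similarity_all(clusters) - inter_class_similarity_all(clusters)
```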
4.2 An automatic cluster number selection method

For the clustering validity index, the larger the value of CVI(π_K), the better the performance of the clustering result determined by the current cluster number K. Therefore, to choose the appropriate number of clusters, we can construct an optimization problem based on the clustering validity index, which is represented as follows:

$$\arg\max_{K} CVI(\pi_K). \tag{17}$$

This optimization problem expects to obtain the optimal number of clusters that maximizes the clustering validity index after clustering. Since the distribution of samples (the number of samples) in each cluster is not monotonic with the cluster number K, the clustering validity index does not satisfy the monotonicity property. Thus, we traverse all candidate numbers of clusters to obtain the optimal K. Algorithm 2 shows the description of the proposed algorithm. In the algorithm, based on the suggestion in Ref. [37], the search range of K is set from 2 to √|U| with step size 1. Then, we apply the K-means clustering algorithm to calculate the initial clustering result and compute the corresponding clustering validity index. Finally, the optimal clustering number with respect to the maximal clustering validity index CVI(π_K) is outputted. Assume |U| = n; since the time complexity of K-means is O(nK), the time complexity of Algorithm 2 is O(n^{3/2}K). Usually, K is a constant, thus we can say that the final time complexity is O(n^{3/2}).
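The scan over K can be sketched as follows; scikit-learn's KMeans stands in for the paper's K-means step (an assumption, as is the numeric encoding it requires), and clustering_validity_index is the sketch above:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_cluster_number(X):
    """Algorithm-2-style scan for Eq. (17): K* = argmax_K CVI(pi_K).

    X: samples as a 2-D numeric array; categorical attributes must be
    encoded numerically before the K-means step.
    """
    X = np.asarray(X)
    n = len(X)
    best_k, best_score = None, -np.inf
    for k in range(2, int(np.sqrt(n)) + 1):  # search range [2, sqrt(|U|)] per Ref. [37]
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        clusters = [X[labels == c] for c in range(k)]
        score = clustering_validity_index(clusters)  # CVI is not monotonic in K
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```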
the positive region and the fringe region also reflect different influences on the cluster centers. Therefore, based on the threshold α and the sample similarity, the cluster center in the three-way clustering method can be defined as follows.

1 https://archive.ics.uci.edu/ml/datasets.php.
Table 2 The brief descriptions of the 15 data sets. Columns: ID, Data set, |U|, |A|, |D|.

Table 3 The accuracy of all comparison methods on the 15 data sets. Columns: ID, K-means, FCM, RCM, TWC, Proposed.
Both our algorithm and the comparison algorithms have low accuracy on data sets with a large number of actual classes, such as the data set zoo with 7 classes and the data set primary-tumor with 22 classes, which also indicates that our algorithm performs better on data sets with a small number of actual classes.

Similarly, Table 4 shows that our proposed method can get 11 best results on the 15 data sets. In particular, for the NMI measure, our proposed method can beat TWC on all data sets except the heart-statlog data set.

6.3 Experimental result of number of clusters

We also examined the number of clusters produced by our proposed clustering method. Table 5 compares our proposed method and the ground-truth result. It can be seen that our proposed method can obtain the true number of clusters on 8 data sets and approximate results on 5 data sets.

Another interesting phenomenon is that our algorithm tends to cluster into very few clusters, such as on the data sets zoo and primary-tumor, which actually have 7 and 22 classes.
Fig. 2 The trend graph of the partitioning validity index with the change of the threshold on data set zoo