
International Journal of Machine Learning and Cybernetics (2021) 12:1545–1556

https://doi.org/10.1007/s13042-020-01255-8

ORIGINAL ARTICLE

An automatic three‑way clustering method based on sample similarity


Xiuyi Jia1 · Ya Rao1 · Weiwei Li2 · Sichun Yang3 · Hong Yu4

Received: 17 July 2020 / Accepted: 9 December 2020 / Published online: 14 January 2021
© Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
Three-way clustering is an extension of traditional clustering that adds the concept of a fringe region, which can effectively solve the problem of inaccurate decision-making caused by inaccurate information or insufficient data in traditional two-way clustering methods. Existing three-way clustering works often select the number of clusters and the thresholds for the three-way partition by subjective tuning. However, fixing the number of clusters and the partition thresholds cannot automatically select the optimal number of clusters and partition thresholds for data sets with different sizes and densities. To address this problem, this paper proposes an improved three-way clustering method. First, we define the roughness degree by introducing the sample similarity to measure the uncertainty of the fringe region. Moreover, based on the roughness degree, we define a novel partitioning validity index to measure the clustering partitions and propose an automatic threshold selection method. Second, based on the concept of sample similarity, we introduce the intra-class similarity and the inter-class similarity to describe the quantitative change of the relationship between the sample and the clusters, and define a novel clustering validity index to measure the clustering performance under different numbers of clusters by integrating the two kinds of similarities. Furthermore, we propose an automatic cluster number selection method. Finally, we give an automatic three-way clustering approach by combining the proposed threshold selection method and the cluster number selection method. Comparison experiments demonstrate the effectiveness of our proposal.

Keywords Three-way decisions · Three-way clustering · Sample similarity

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61773208, 61906090, 61876027 and 71671086), the Natural Science Foundation of Jiangsu Province (Grant No. BK20191287), the Natural Science Foundation of Anhui Province of China (Grant No. 1808085MF178), the Fundamental Research Funds for the Central Universities (Grant No. 30920021131), and the China Postdoctoral Science Foundation (Grant No. 2018M632304).

* Weiwei Li, [email protected]
Xiuyi Jia, [email protected]
Ya Rao, [email protected]
Sichun Yang, [email protected]
Hong Yu, [email protected]

1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
3 School of Computer Science and Technology, Anhui University of Technology, Ma’anshan 243032, China
4 College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

1 Introduction

As a research hotspot in the field of rough sets, three-way decisions theory [29] is an extension of classical two-way decisions theory, in which a deferment decision (or non-commitment decision) is adopted when current information is limited or insufficient to make an acceptance or rejection decision.
Three-way decisions is a kind of decision-making theory describing how human beings deal with uncertain information, based on the semantic studies of positive rules, boundary rules and negative rules coming from rough set theory. When the decision makers have sufficient confidence based on the given information, the acceptance or rejection decisions of two-way decisions are usually adopted to output an immediate result. However, when the decision makers have insufficient confidence in the prejudgment based on the current information, the deferment decision is often adopted to reduce the risk brought by immediate decisions. For uncertain information, whether it is accepted or rejected, there is a high probability of making the wrong decision.

Three-way clustering [32, 35, 36] is an important application of the three-way decisions theory, which can effectively solve the problem of inaccurate partition caused by incomplete information or insufficient data in traditional two-way clustering methods. Compared with two-way clustering methods, three-way clustering incorporates the concept of a fringe region (or boundary region) for uncertain samples, and the clustering result is mainly affected by the number of clusters and the thresholds for making three-way decisions. In the existing works, people usually choose the number of clusters based on expert opinion, and select the same constant threshold for all data in the iterations of making three-way decisions. However, such fixed thresholds and numbers of clusters do not reflect the differences between clusters or data sets, especially for data sets with different sizes and densities.

Generally, three-way clustering can be regarded as an effective combination of unsupervised clustering and three-way partitioning of clusters [33]. Firstly, from the perspective of clustering validity, it is expected that the samples in different clusters after clustering are easier to distinguish, that is, the distance between different clusters is as large as possible. For the samples in the same cluster, it is expected that the samples are as close as possible, that is, the smaller the distance between the samples within the cluster, the better [6]. Similarly, the three-way partitioning of clusters improves the accuracy of the samples contained in each cluster by introducing the fringe region, which is beneficial for the accuracy of the overall clustering result [32]. However, the introduction of the fringe region also increases the uncertainty of knowledge representation of the clustering result. Therefore, from the perspective of partition validity, the sample distribution and its uncertainty in the fringe region are the key factors affecting the performance of the cluster partitioning. In view of this, for three-way clustering, it is expected to improve the performance of clustering by selecting an appropriate number of clusters and appropriate thresholds for three-way partitioning. To be specific, an appropriate number of clusters should be selected to make the samples within the same cluster closer and the samples in different clusters farther apart, and appropriate thresholds for three-way partitioning should be selected to maximize the accuracy of the samples contained in each cluster and minimize the uncertainty of knowledge representation of the final clustering result.

To address the above problems, in this paper, we propose an automatic three-way clustering method based on sample similarity, in which the quantitative changes of the relationship between all samples and clusters are considered. First, we define sample similarity to describe the quantitative relationship among samples. Based on the sample similarity, we redefine the concept of roughness to measure the uncertainty of the clusters after partitioning. Moreover, by considering the roughness and the sample distribution of the fringe region, we also define the partition validity index to measure the performance of the cluster partitions corresponding to different partition thresholds. Second, based on the sample similarity, we introduce the concepts of intra-class sample similarity and inter-class sample similarity to describe the quantitative relationships of samples within the same cluster and samples in different clusters, respectively. Furthermore, by combining the two kinds of sample similarities, we also define the cluster validity index to measure the performance of clustering corresponding to the number of clusters. Third, we design an automatic three-way clustering method based on the proposed partition validity index and cluster validity index, which aims to obtain the appropriate number of clusters and thresholds of three-way partitioning for achieving a high clustering performance. At last, we implement comparative experiments to verify the effectiveness of the proposed method.

The remainder of this paper is organized as follows. Section 2 introduces the preliminary knowledge and related work of three-way decisions theory and three-way clustering. Section 3 gives an automatic threshold selection algorithm based on the proposed sample similarity. Section 4 gives an automatic cluster number selection algorithm. Section 5 integrates the threshold selection algorithm and the cluster number selection algorithm to implement an automatic three-way clustering method. Section 6 reports the comparison experimental results. Section 7 concludes this paper.

2 Preliminary knowledge and related work

In this section, we will introduce the preliminary knowledge and related work on three-way decisions theory and three-way clustering.

2.1 Three-way decisions

In classical two-way decisions theory, one only has two kinds of choices: accept it or reject it. However, in many real-world applications, if the existing information is insufficient to support decisions that are explicitly accepted or rejected, then no matter which of the two decisions is chosen, it may cause inaccuracy in the decision.
Three-way decisions (3WD) theory [29] is an extension of classical two-way decisions (2WD) theory, in which a deferment decision (or non-commitment decision) is adopted when current information is limited or insufficient to make an acceptance or rejection decision.

In 3WD theory, a set of samples X can be divided into three pairwise disjoint regions through three-way decisions, namely the positive region POS(X), the boundary region BND(X) and the negative region NEG(X), which respectively correspond to the acceptance decision, the deferment decision and the rejection decision. A simple representation of 3WD is to construct three-way decision rules by incorporating a pair of thresholds. To be specific, for a given sample x and a subset of samples X, let p(X|x) be the conditional probability that x belongs to X. If we have a pair of thresholds (α, β) with the assumption 0 ≤ β ≤ α ≤ 1, we can make the following 3WD based rules:

(P): If p(X|x) ≥ α, decide x ∈ POS(X);
(B): If β < p(X|x) < α, decide x ∈ BND(X);
(N): If p(X|x) ≤ β, decide x ∈ NEG(X).

For the (P) rule, if the probability that x belongs to X is greater than or equal to α, we make an acceptance decision to classify x into the positive region of X. For the (B) rule, if the probability that x belongs to X is greater than β and less than α, we make a deferment decision to classify x into the boundary region of X. For the (N) rule, if the probability that x belongs to X is less than or equal to β, we make a rejection decision to classify x into the negative region of X.
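To make the three rules concrete, the following minimal sketch (ours, not part of the original paper) expresses them as a Python function; the name three_way_decide and the region labels are illustrative only.

def three_way_decide(p, alpha, beta):
    """Apply the (P), (B), (N) rules to a conditional probability p = p(X|x)."""
    assert 0.0 <= beta <= alpha <= 1.0   # the paper assumes 0 <= beta <= alpha <= 1
    if p >= alpha:
        return "POS"    # (P): acceptance decision
    if p <= beta:
        return "NEG"    # (N): rejection decision
    return "BND"        # (B): deferment decision

For example, with (alpha, beta) = (0.7, 0.3), a sample with p(X|x) = 0.5 is deferred to the boundary region.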
3WD has received wide attention from researchers, and many results on 3WD have been reported from the perspectives of theory, model and application. Here we take some representative achievements from the three perspectives to introduce the related work of 3WD. In terms of theory, starting from the 3WD concept proposed by Yao, it has gradually expanded from 3W decisions to 3W + X theories, such as 3W classifications [10, 14, 19, 39, 40], 3W attribute reduction [8, 9, 11, 17], 3W clustering [1, 34, 36], 3W active learning [23], 3W concept learning [13], 3W concept lattices [26, 27], etc. Furthermore, Yao [30] discussed a wide sense of 3WD and proposed a trisecting-acting-outcome (TAO) model of 3WD. In addition, Liu and Liang [22] proposed a function based three-way decisions to generalize the existing models. Li et al. [18] generalized 3WD models based on subset evaluation. Hu [5] studied three-way decision spaces. In terms of model, 3WD was combined with existing models to generate many new models, such as the 3WD based Bayesian network [4], the neighbourhood three-way decision-theoretic rough set model [15], the 3WD based intuitionistic fuzzy decision-theoretic rough sets model [20], dynamic 3WD models [38], three-way enhanced convolutional neural networks (3W-CNN) [42], etc. In terms of application, 3WD has been successfully applied in sentiment analysis [41, 42], image processing [12], medical decision [28], spam filtering [7], software defect detection [16], etc.

2.2 Three-way clustering

As an effective data processing method in machine learning and data mining, clustering can effectively divide samples without class information into multiple clusters composed of similar samples, thereby determining the sample distribution in each cluster and improving the efficiency of data post-processing.

Three-way clustering is an improvement of the K-means algorithm. To address the problem of inaccurate decision-making caused by inaccurate information or insufficient data in traditional clustering methods, three-way clustering adds the concepts of positive domain and fringe domain to the clusters in the traditional K-means-based results to further divide the clustering results, which forms the three regions of a cluster: the core region (represented by Co(C) for cluster C), the fringe region (represented by Fr(C)) and the trivial region (represented by Tr(C)). If x ∈ Co(C), the object x definitely belongs to the cluster C; if x ∈ Fr(C), the object x might belong to C; if x ∈ Tr(C), the object x definitely does not belong to C. Moreover, assume π_K = {U_1, U_2, …, U_K} is the set of K clusters of all objects U based on the K-means clustering method; the three regions of all clusters are represented by the following:

π_K^T = {Co(π_K), Fr(π_K), Tr(π_K)}.  (1)

According to 3WD theory, if x ∈ Co(π_K), then x is determined to belong to a certain cluster in π_K; if x ∈ Fr(π_K), then x may be related to multiple clusters in π_K; if x ∈ Tr(π_K), then x is determined not to belong to a certain cluster in π_K. Usually, we have Tr(π_K) = U − Co(π_K) − Fr(π_K) = ∅ and Co(π_K) ∩ Fr(π_K) = ∅.

In three-way clustering, each cluster can be represented by a pair of core region and fringe region. For the K-cluster partitioning of the sample set, it satisfies the following properties:

(1) ∀U_i ∈ π_K, Co(U_i) ≠ ∅;
(2) U = ⋃_{i=1}^{K} Co(U_i) ∪ ⋃_{i=1}^{K} Fr(U_i).

Property (1) requires that the core region of each cluster cannot be empty after division, that is, there is at least one sample in each cluster; property (2) ensures that the clustering method implements an effective division
for all samples. Thus, the three-way clustering result can be represented as follows:

π_K = {(Co(U_1), Fr(U_1)), …, (Co(U_K), Fr(U_K))}.  (2)

For the fringe region Fr(U_i), if ⋃_{i=1}^{K} Fr(U_i) = ∅, then π_K = {Co(U_1), …, Co(U_K)} is a two-way clustering result. In the traditional clustering method, all samples can only be divided into uniquely determined clusters. In the three-way clustering method, if a sample is closely related to multiple clusters at the same time, it may belong to multiple clusters at the same time. In this case, it is reasonable to classify the sample into the fringe regions of the closely related clusters.

With the in-depth study of clustering methods, more and more scholars believe that traditional clustering methods are based on hard partitioning, which cannot truly reflect the actual relationship between samples and clusters. To address this problem, approaches such as fuzzy clustering, rough clustering and interval clustering have been proposed to deal with this kind of uncertain relationship between samples and clusters. These approaches are also called soft clustering or overlapping clustering, in the sense that a sample can belong to more than one cluster [25, 31]. Using the concept of membership instead of traditional distance-based methods to measure the degree of membership of each sample in a certain cluster, Friedman and Rubin first proposed the concept of fuzzy clustering [3]. Then, Dunn proposed the widely used fuzzy C-means (FCM) clustering algorithm based on the related concepts of fuzzy clustering [2]. Later, through the extension of rough set theory, Lingras et al. proposed a rough C-means (RCM) clustering method [21], which considers the different influences of the positive and fringe regions of each cluster on calculating the corresponding center. Yu introduced 3WD theory into traditional clustering and proposed a framework of three-way cluster analysis [32, 33, 35].

3 An automatic threshold selection method based on sample similarity

In three-way clustering methods, each cluster has a corresponding positive region and fringe region. The thresholds for partitioning the regions of each cluster are the key factor determining the final clustering performance. Different partition thresholds will generate different regions for the clusters. Three-way decisions theory makes the deferment decision for samples with uncertain information by classifying them into the fringe region. The introduction of such a fringe region can reduce the possibility of classifying a sample directly into the wrong cluster. However, for partitioned clusters, the introduction of the fringe region also brings additional uncertainty to the knowledge representation of the clustering result. If the fringe region contains more samples, the uncertainty of the cluster representation is greater. In view of this, in order to select the optimal thresholds that give the clusters the best partition performance, we define the roughness degree by introducing the sample similarity to measure the quantitative change of the sample distribution in the positive and fringe regions. Moreover, based on the roughness degree, we define a novel partitioning validity index to measure the clustering partitions and propose an automatic threshold selection method for categorical data.

3.1 Partitioning validity index

We first give the definition of sample similarity as follows.

Definition 1 (Sample similarity) [11] Let IS = (U, A) be an information system, where U denotes the set of all samples and A denotes the set of attributes describing each sample. For any two samples x_i ∈ U and x_j ∈ U, the similarity of x_i and x_j is:

S_A(x_i, x_j) = (∑_{k=1}^{|A|} I(x_ik, x_jk)) / |A|,  (3)

where |A| represents the number of attributes. For categorical data, the function I(x_ik, x_jk) is computed by:

I(x_ik, x_jk) = 1 if x_ik = x_jk, and I(x_ik, x_jk) = 0 if x_ik ≠ x_jk.  (4)

As defined above, the sample similarity is an indicator of how similar two samples are: the greater the value, the more similar the two samples.
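For categorical data, Eqs. (3) and (4) reduce to the fraction of attributes on which two samples agree. A minimal sketch, ours rather than the authors' code:

def sample_similarity(xi, xj):
    """S_A(x_i, x_j) of Eq. (3): the fraction of matching attribute values.

    xi and xj are equal-length sequences of categorical attribute values.
    """
    assert len(xi) == len(xj) > 0
    matches = sum(1 for a, b in zip(xi, xj) if a == b)   # I(x_ik, x_jk) of Eq. (4)
    return matches / len(xi)

# e.g. sample_similarity([1, 0, 1, 0, 1, 1], [1, 1, 1, 1, 1, 1]) == 4/6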
Through the introduction of sample similarity, the related concepts in 3WD theory can be calculated, which are defined as follows.

Definition 2 (Equivalence relation) [11] Given an information system IS = (U, A), the equivalence relation based on a threshold α is represented by E_α:

E_α = {(x, y) ∈ U × U | S_A(x, y) > α}.  (5)

Similarly, for a sample x, its equivalence class is computed by:

[x]_α = {y ∈ U | S_A(x, y) > α}.  (6)

The equivalence relation E_α can divide U into several disjoint sample subsets, denoted by π_α. For a given objective subset X ⊆ U, we can define its lower and upper approximations based on α as follows:
apr_α(X) = {x ∈ U | [x]_α ⊆ X},
apr̄_α(X) = {x ∈ U | [x]_α ∩ X ≠ ∅},  (7)

where apr_α(X) denotes the lower approximation and apr̄_α(X) the upper approximation. Furthermore, we can use the positive region POS_α(X) and the fringe region BND_α(X) to describe the objective subset X:

POS_α(X) = apr_α(X),
BND_α(X) = apr̄_α(X) − apr_α(X).  (8)

Usually, the positive region POS_α(X) contains the samples that definitely belong to X and the fringe region contains the samples that possibly belong to X; we can use roughness to compute the uncertainty of X represented by the two regions.

Definition 3 (Roughness) [24] Given an information system IS = (U, A), for an objective subset X ⊆ U, RN_α(X) is the roughness of X based on the threshold α:

RN_α(X) = 1 − |apr_α(X)| / |apr̄_α(X)| = |BND_α(X)| / (|POS_α(X)| + |BND_α(X)|).  (9)

Given a threshold, the roughness degree is related to the distribution of samples in the positive region and the fringe region for each cluster, which also leads to a monotonicity property between the roughness and the value of the threshold.

Property 1 Given an information system IS = (U, A) and an objective subset of samples X ⊆ U, for any two thresholds α and β, if 0 ≤ β < α ≤ 1, then RN_β(X) ≥ RN_α(X).

Proof For any sample x ∈ U, if β < α, according to Eq. (6) and Eq. (7), we have [x]_α ⊆ [x]_β, apr_β(X) ⊆ apr_α(X) and apr̄_α(X) ⊆ apr̄_β(X), thus |apr_α(X)| / |apr̄_α(X)| ≥ |apr_β(X)| / |apr̄_β(X)|. Based on Eq. (9), we have RN_β(X) ≥ RN_α(X). ◻
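The following sketch makes Definitions 2 and 3 concrete; it is our illustration, reusing the sample_similarity function above and representing the objective subset X as a set of sample indices into U.

def equivalence_class(i, U, alpha):
    """[x]_alpha of Eq. (6): indices of samples whose similarity to U[i] exceeds alpha."""
    return {j for j in range(len(U)) if sample_similarity(U[i], U[j]) > alpha}

def approximations(X, U, alpha):
    """Lower and upper approximations of Eq. (7)."""
    lower = {i for i in range(len(U)) if equivalence_class(i, U, alpha) <= X}
    upper = {i for i in range(len(U)) if equivalence_class(i, U, alpha) & X}
    return lower, upper

def roughness(X, U, alpha):
    """RN_alpha(X) of Eq. (9); taken as 0 when the upper approximation is empty."""
    lower, upper = approximations(X, U, alpha)
    return 1.0 - len(lower) / len(upper) if upper else 0.0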

As mentioned above, the introduction of the fringe region in three-way clustering is a mixed blessing. On the one hand, it can reduce the classification error by classifying samples with less certainty into the fringe region. On the other hand, it also brings additional uncertainty for knowledge representation. By considering the trade-off relationship between the size of the fringe region and the roughness, we define a novel partitioning validity index as follows.

Definition 4 (Partitioning validity index) Given an information system IS = (U, A) and a subset of samples X ⊆ U, for a specific threshold α, the partitioning validity index of X is defined as:

PVI_α(X) = (1 − RN_α(X)) · |BND_α(X) ∩ X| / |X|.  (10)

In this definition, |BND_α(X) ∩ X| / |X| represents the ratio of the number of samples belonging to the objective subset X and within its fringe region to the number of samples in X, and 0 ≤ PVI_α(X) ≤ 1.

In the three-way clustering problem, we prefer to obtain a partition with low classification error and put the uncertain samples in the fringe region to delay decision making. This enlarges the fringe region while also posing the problem of increasing roughness, so we keep the fringe region from growing too large by minimizing the roughness.

Table 1 An information system

U    a1  a2  a3  a4  a5  a6
x1   1   1   1   1   1   1
x2   1   0   1   0   1   1
x3   0   1   1   1   0   0
x4   1   1   1   0   0   1
x5   0   0   1   1   0   1
x6   1   0   1   0   1   1
x7   0   0   0   1   1   0
x8   1   0   1   0   1   1
x9   0   0   1   1   0   1

Example 1 Table 1 is an information system, where U = {x1, x2, …, x9} and A = {a1, a2, …, a6}. For the objective subset X = {x1, x2}, if the threshold α = 0.8, we have RN_0.8(X) = 3/4 and PVI_0.8(X) = 3/8. Similarly, if α = 0.6, we have RN_0.6(X) = 5/5 and PVI_0.6(X) = 0. Apparently, α = 0.8 is a better choice for the objective subset {x1, x2}.

3.2 An automatic threshold selection algorithm

For the partitioning validity index, its value tends to 0 when either the fringe region is maximized or the roughness is minimized, so we choose the threshold with the largest PVI as the balance point between the roughness and the fringe region size, which shows the best comprehensive performance of the clustering result. Thus, we can construct an optimization problem for the threshold selection problem, in which the largest PVI_α(X) is determined by the optimal threshold α:

arg max_α PVI_α(X).  (11)

Since PVI_α(X) presents a kind of trade-off relationship between the size of the fringe region and the roughness, there does not exist a monotonicity property between PVI_α(X) and α. In order to obtain the optimal value of α, we traverse
all candidate partition thresholds by setting an appropriate step size. Algorithm 1 shows the description of the proposed algorithm.

In the algorithm, the sample similarities of all possible pairs are computed first. The minimal value S_min and the maximal value S_max, with step-size 0.01, constitute the candidate threshold space. For each candidate threshold, we compute the corresponding positive region and fringe region of the objective subset X, and obtain the current partitioning validity index PVI. Finally, the threshold with respect to the maximal PVI is output as the optimal threshold. Assume the objective subset X has n samples; the time complexities of computing all sample similarities and of iteratively solving for the optimal threshold are O(n²) and O(n), respectively, thus the time complexity of Algorithm 1 is O(n²).
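The pseudocode box of Algorithm 1 does not survive in this text version, so the sketch below reconstructs it from the description above, reusing the earlier helper functions; the PVI computation follows Eq. (10) as printed.

def pvi(X, U, alpha):
    """PVI_alpha(X) of Eq. (10): (1 - RN_alpha(X)) * |BND_alpha(X) ∩ X| / |X|."""
    lower, upper = approximations(X, U, alpha)
    bnd = upper - lower                                   # BND_alpha(X) of Eq. (8)
    rn = 1.0 - len(lower) / len(upper) if upper else 0.0  # RN_alpha(X) of Eq. (9)
    return (1.0 - rn) * len(bnd & X) / len(X)

def select_threshold(X, U, step=0.01):
    """Algorithm 1 (sketch): traverse candidate thresholds and return argmax PVI."""
    sims = [sample_similarity(U[i], U[j])
            for i in range(len(U)) for j in range(i + 1, len(U))]
    alpha, s_max = min(sims), max(sims)                   # candidate threshold space
    best_alpha, best_pvi = alpha, -1.0
    while alpha <= s_max + 1e-9:
        score = pvi(X, U, alpha)
        if score > best_pvi:
            best_alpha, best_pvi = alpha, score
        alpha += step
    return best_alpha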
4 An automatic cluster number selection method based on sample similarity

In the process of clustering, when the clustering method is determined, the number of clusters becomes the key factor influencing clustering performance. Different numbers of clusters will result in completely different clustering results. In general, most existing clustering methods set the number of clusters by expert opinion or experimental cross-validation. In this section, we will give an automatic cluster number selection method based on sample similarity.

In general, all samples can be easily divided into two clusters if there exists a large inter-class distance between the two clusters and a small intra-class distance within each cluster. Similarly, we can change the number of clusters by changing the inter-class distance between any two clusters and the intra-class distance within each cluster, thereby affecting the final clustering performance. To select an appropriate number of clusters, we first define the intra-class sample similarity and the inter-class sample similarity, two concepts that fully reflect the tightness of the sample distribution within each cluster and the difference of the sample distribution between various clusters, respectively. Then, by integrating the intra-class sample similarity and the inter-class similarity, we define a clustering validity index to measure the clustering performance related to the number of clusters.

4.1 Clustering validity index

We first give the definition of intra-class similarity as follows.

Definition 5 (Intra-class similarity for a cluster) Given an information system IS = (U, A), the intra-class similarity IntraS(U_l) represents the sample similarity of cluster U_l ⊂ U under the set of attributes A, which is computed as:

IntraS(U_l) = (2 · ∑_{i=1}^{|U_l|−1} ∑_{j>i}^{|U_l|} S_A(U_li, U_lj)) / (|U_l| · (|U_l| − 1)),  (12)

where |U_l| is the number of samples in cluster U_l, and U_li and U_lj are the i-th and j-th samples in the cluster U_l, respectively.

Similarly, we can define the intra-class similarity for a partition of U as follows.

Definition 6 (Intra-class similarity for all clusters) Given an information system IS = (U, A), the intra-class similarity IntraS(π_K) represents the sample similarity of all clusters π_K = {U_1, U_2, …, U_K}, which is computed as:

IntraS(π_K) = (∑_{l=1}^{K} IntraS(U_l)) / K,  (13)

where U_l ∈ π_K is the l-th cluster in the partition π_K.

Different from the classification problem, the samples in the clustering problem have no label information. Therefore, it is not enough to only consider the intra-class similarity of each cluster during the clustering process. The inter-class similarity is also a kind of measurement to evaluate the clustering performance, which reflects the degree of differentiation of different clusters.
Definition 7 (Inter-class similarity for two clusters) Given an information system IS = (U, A), π_K = {U_1, U_2, …, U_K} is the partition of U with K clusters. The inter-class similarity InterS(U_i, U_j) between any two clusters U_i and U_j under the set of attributes A is computed as:

InterS(U_i, U_j) = (∑_{m=1}^{|U_i|} ∑_{n=1}^{|U_j|} S_A(U_im, U_jn)) / (|U_i| · |U_j|),  (14)

where U_im and U_jn are the m-th sample in cluster U_i and the n-th sample in cluster U_j, respectively.

Moreover, we can define the inter-class similarity for the partition π_K as follows.

Definition 8 (Inter-class similarity for all clusters) Given an information system IS = (U, A), the inter-class similarity InterS(π_K) represents the sample similarity of the partition π_K = {U_1, U_2, …, U_K}, which is computed as:

InterS(π_K) = (2 · ∑_{i=1}^{K−1} ∑_{j>i}^{K} InterS(U_i, U_j)) / (K · (K − 1)).  (15)

As discussed above, both the intra-class similarity for all clusters and the inter-class similarity for all clusters are important factors that reflect the clustering performance. Thus, by integrating the two kinds of similarities, we define a novel clustering validity index as follows.

Definition 9 (Clustering validity index) Given an information system IS = (U, A) and K clusters π_K = {U_1, U_2, …, U_K}, the clustering validity index for π_K is computed as:

CVI(π_K) = IntraS(π_K) − InterS(π_K).  (16)

This definition tells us that a good clustering result with a large CVI(π_K) will have a large intra-class sample similarity and a small inter-class sample similarity.

Example 2 For the information system shown in Table 1, assume all samples are divided into the two clusters π_2 = {{x1, x2, x3, x4, x5}, {x6, x7, x8, x9}}; then we have IntraS(π_2) = 0.5305 and InterS(π_2) = 0.6111, thus the clustering validity index based on π_2 is CVI(π_2) = −0.0806. Considering another situation, assume all samples are divided into the three clusters π_3 = {{x1, x2}, {x3, x4, x5}, {x6, x7, x8, x9}}; then we have IntraS(π_3) = 0.5833 and InterS(π_3) = 0.5796, thus the clustering validity index based on π_3 is CVI(π_3) = 0.0037. Apparently, π_3 is a better choice based on a larger clustering validity index.
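The three indices of Definitions 5-9 can be sketched directly from Eqs. (12)-(16); this is our illustration (clusters are lists of samples, each assumed to have at least two members), again reusing sample_similarity.

from itertools import combinations

def intra_s(cluster):
    """IntraS(U_l) of Eq. (12): mean similarity over all sample pairs in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(sample_similarity(a, b) for a, b in pairs) / len(pairs)

def inter_s(ci, cj):
    """InterS(U_i, U_j) of Eq. (14): mean similarity over all cross-cluster pairs."""
    return sum(sample_similarity(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def cvi(partition):
    """CVI(pi_K) of Eq. (16): Eq. (13) average minus Eq. (15) average."""
    intra = sum(intra_s(c) for c in partition) / len(partition)
    cross = list(combinations(partition, 2))
    inter = sum(inter_s(ci, cj) for ci, cj in cross) / len(cross)
    return intra - inter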

4.2 An automatic cluster number selection method

For the clustering validity index, the larger the value of CVI(π_K), the better the performance of the clustering result determined by the current cluster number K. Therefore, to choose the appropriate number of clusters, we can construct an optimization problem based on the clustering validity index, which is represented as follows:

arg max_K CVI(π_K).  (17)

This optimization problem expects to obtain the optimal number of clusters that maximizes the clustering validity index after clustering. Since the distribution of samples (the number of samples) in each cluster is not monotonic in the cluster number K, the clustering validity index does not satisfy the monotonicity property. Thus, we traverse all candidate numbers of clusters to obtain the optimal K. Algorithm 2 shows the description of the proposed algorithm. In the algorithm, based on the suggestion in Ref. [37], the search range of K is set from 2 to √|U| with step-size 1. Then, we apply the K-means clustering algorithm to calculate the initial clustering result and compute the corresponding clustering validity index. Finally, the optimal cluster number with respect to the maximal clustering validity index CVI(π_K) is output. Assume |U| = n; since the time complexity of K-means is O(nK), the time complexity of Algorithm 2 is O(n^(3/2)·K). Usually, K is a constant, thus we can say that the final time complexity is O(n^(3/2)).
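As with Algorithm 1, the pseudocode box of Algorithm 2 is not reproduced here; the sketch below restates its loop under our own assumptions, using scikit-learn's KMeans as a stand-in for the K-means step (the library choice is ours) and scoring each candidate partition with the cvi function above.

import math
from sklearn.cluster import KMeans   # a stand-in for the paper's K-means step

def select_k(samples, encoded):
    """Algorithm 2 (sketch): try K = 2 .. sqrt(|U|) and return the K with maximal CVI.

    samples holds the categorical rows used by cvi(); encoded is the
    one-hot numeric array that K-means runs on.
    """
    n = len(samples)
    best_k, best_cvi = 2, -math.inf
    for k in range(2, int(math.sqrt(n)) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(encoded)
        partition = [[samples[i] for i in range(n) if labels[i] == c]
                     for c in range(k)]
        if any(len(c) < 2 for c in partition):   # IntraS needs at least two samples
            continue
        score = cvi(partition)
        if score > best_cvi:
            best_k, best_cvi = k, score
    return best_k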
5 An automatic three-way clustering method based on sample similarity

In the fuzzy or rough clustering methods, the influence of the positive region and the fringe region on the cluster centers is different. Similarly, in the three-way clustering method,
the positive region and the fringe region also reflect different influences on the cluster centers. Therefore, based on the threshold α and the sample similarity, the cluster center in the three-way clustering method can be defined as follows.

Definition 10 (Cluster center in three-way clustering method) Given an information system IS = (U, A), let π_K = {U_1, U_2, …, U_K} be the K clusters determined by the K-means method. For each cluster U_i, assume α_i is the optimal threshold. Let PU = POS_{α_i}(U_i) and BU = BND_{α_i}(U_i) represent the positive region and the fringe region of U_i based on α_i, respectively. Let x_m and x_n represent the m-th sample in PU and the n-th sample in BU, respectively; thus x_m = [x_m1, x_m2, …, x_m|A|] and x_n = [x_n1, x_n2, …, x_n|A|]. Then the center of cluster U_i is represented by v_i, which is also a vector with |A| attribute values, v_i = [v_i1, v_i2, …, v_i|A|], and the value of v_i on attribute j can be computed as:

v_ij = (ω_P · ∑_{m=1}^{|PU|} x_mj + ω_B · ∑_{n=1}^{|BU|} x_nj) / (ω_P · |PU| + ω_B · |BU|),  (18)

where ω_P and ω_B are two weight factors for the positive region and the fringe region. In general, ω_P > ω_B ≥ 0. In our experiments, ω_P and ω_B are set to 1 and 0.5, respectively.
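A minimal sketch of Eq. (18), ours, treating the two regions as numeric arrays (the fringe array may have zero rows, while property (1) guarantees a non-empty positive region):

import numpy as np

def cluster_center(pos, fringe, w_p=1.0, w_b=0.5):
    """Weighted cluster center of Eq. (18).

    pos and fringe are 2-D arrays holding the encoded samples of POS and BND;
    the weights default to the values used in the paper's experiments.
    """
    num = w_p * pos.sum(axis=0) + w_b * fringe.sum(axis=0)
    den = w_p * len(pos) + w_b * len(fringe)
    return num / den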
From the calculation of the cluster centers and the above definitions, it can be seen that a change of the threshold for three-way clustering will also affect the division of the positive and fringe regions of each cluster, thus affecting the update of the cluster centers and the final clustering result. Similarly, the selection of the number of clusters also affects the selection of the optimal thresholds and the division of the different regions. Therefore, it is not appropriate to separate the selection of the optimal number of clusters from the selection of the optimal thresholds in the three-way clustering process. In view of this, in this paper, through the integration of the two concepts of clustering validity index and partitioning validity index, an automatic three-way clustering method based on sample similarity is proposed, which considers the interaction between the change of the number of clusters and the variation of the threshold, and results in an optimal three-way clustering result based on the optimal number of clusters and the optimal thresholds. Algorithm 3 shows the description of the proposed algorithm.

In this algorithm, the Euclidean distance is used to calculate the distance of any two samples. The optimal threshold based on Algorithm 1 is first selected. When iteratively updating the cluster centers, if the distance between the old and the new centers is less than ε, the center is no longer changed and the iteration stops. The front part of the algorithm mainly combines Algorithm 1 and Algorithm 2 to obtain the optimal thresholds and the optimal number of clusters, and the latter part gives the three-way representation of the clustering result. The time complexity of Algorithm 3 is O(n^(5/2)).
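The Algorithm 3 box itself is likewise missing from this text version; the driver below is our reconstruction of its overall flow from the description above, reusing the earlier sketches (select_k, select_threshold, approximations, KMeans). The iterative ε-based center refinement with Eq. (18) is omitted for brevity.

def three_way_cluster(samples, encoded):
    """Algorithm 3 (sketch): returns one (core, fringe) pair of index sets per cluster."""
    k = select_k(samples, encoded)                        # Algorithm 2
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(encoded)
    result = []
    for c in range(k):
        X = {i for i in range(len(samples)) if labels[i] == c}
        alpha = select_threshold(X, samples)              # Algorithm 1, per cluster
        lower, upper = approximations(X, samples, alpha)  # Eq. (7)
        result.append((lower, upper - lower))             # (Co(U_c), Fr(U_c)), Eq. (8)
    return result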
6 Experiments

6.1 Experiments setting

To evaluate the efficiency of the proposed method, we implement several comparison experiments on 15 categorical data sets coming from UCI (https://archive.ics.uci.edu/ml/datasets.php). The brief descriptions of these data sets are listed in Table 2. In these data sets, some attributes representing ID information have been removed in the preprocessing procedure. Since our data sets are all categorical attributes, it is not easy to calculate
the distance of any two samples. Therefore, to facilitate the calculation of the sample distance, we applied one-hot encoding to convert the categorical data to integer data.

Table 2 The brief descriptions of the 15 data sets

ID  Data set        |U|  |A|  |D|
1   balance-scale   625    4    3
2   vote            435   16    2
3   zoo             101   16    7
4   colic           368   22    2
5   heart-statlog   270   13    2
6   primary-tumor   339   17   22
7   bank            600   10    2
8   promoters       106   57    2
9   spect-train      80   22    2
10  hepatitis       155   19    2
11  lung-cancer      32   56    2
12  lymph           148   18    4
13  heart-c         296   13    5
14  house-votes-84  232   16    2
15  breast-cancer   227    9    2

|U| represents the number of samples, |A| represents the number of attributes, and |D| represents the number of classes.

We compare our method with four kinds of clustering methods: K-means, FCM [2], RCM [21] and a classical three-way clustering method (TWC) [32]. The evaluation measures are the accuracy (ACC) and the normalized mutual information (NMI), which are commonly used in clustering-related works. The clustering accuracy is measured by counting the number of correctly assigned samples and dividing by the number of all samples, which can be computed as follows:

ACC = (∑_{i=1}^{|π_L|} max{|π_Li ∩ π_Cj| : π_Cj ∈ π_C}) / |U|,  (19)

where π_C and π_L are the real clusters in U and the predicted clusters, respectively; π_Li represents the i-th cluster of π_L and π_Cj represents the j-th cluster of π_C. The normalized mutual information is used to measure the similarity between two clustering results:

NMI(π_C, π_L) = 2 · I(π_C; π_L) / (H(π_C) + H(π_L)),  (20)

where I(π_C; π_L) represents the mutual information of π_C and π_L, and H(π_C) and H(π_L) are the related entropies of π_C and π_L, respectively.
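Both measures are standard. A sketch of Eqs. (19) and (20), ours; for NMI, scikit-learn's normalized_mutual_info_score is an equivalent off-the-shelf alternative.

import math
from collections import Counter

def acc(pred, true):
    """ACC of Eq. (19): each predicted cluster scores its best-matching real class."""
    total = 0
    for c in set(pred):
        overlap = [t for p, t in zip(pred, true) if p == c]
        total += Counter(overlap).most_common(1)[0][1]   # max |pi_Li ∩ pi_Cj|
    return total / len(true)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(pred, true):
    """NMI of Eq. (20): 2 * I(pi_C; pi_L) / (H(pi_C) + H(pi_L))."""
    n = len(true)
    joint = Counter(zip(true, pred))
    pc, pl = Counter(true), Counter(pred)
    mi = sum((v / n) * math.log((v / n) / ((pc[a] / n) * (pl[b] / n)))
             for (a, b), v in joint.items())
    return 2 * mi / (entropy(true) + entropy(pred))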
6.2 Experimental result of clustering performance

The comparison results on ACC and NMI are shown in Tables 3 and 4. In each table, 'Score' means the number of best results of the corresponding clustering method.

Table 3 The accuracy of all comparison methods on the 15 data sets

ID     K-means  FCM     RCM     TWC     Proposed
1      0.6352   0.5312  0.5312  0.6000  0.9216
2      0.8804   0.8804  0.8804  0.8804  0.9793
3      0.6039   0.6039  0.6039  0.6039  0.4059
4      0.6929   0.6847  0.7038  0.7119  0.7527
5      0.8370   0.8296  0.7481  0.8333  0.6666
6      0.2477   0.2536  0.2477  0.2477  0.3215
7      0.5433   0.5550  0.5433  0.5500  0.7016
8      0.6509   0.8301  0.5471  0.6792  0.7358
9      0.6875   0.7000  0.6125  0.6875  0.8125
10     0.7935   0.7935  0.7935  0.7935  0.8709
11     0.7187   0.7187  0.7187  0.7187  0.8437
12     0.7027   0.5472  0.6216  0.6283  0.7297
13     0.8141   0.8243  0.6351  0.8108  0.8817
14     0.8922   0.8879  0.8836  0.8922  1.0000
15     0.7075   0.7075  0.7075  0.7075  0.8122
Score  2        2       1       1       12

The best performance of each row is boldfaced.

Table 4 The normalized mutual information of all comparison methods on the 15 data sets

ID     K-means  FCM     RCM     TWC     Proposed
1      0.1016   0.1040  0.1023  0.1016  0.1215
2      0.4525   0.4525  0.4525  0.4525  0.4648
3      0.1089   0.1089  0.1089  0.1089  0.1335
4      0.4389   0.4392  0.4396  0.4396  0.4417
5      0.4423   0.4417  0.4368  0.4420  0.4384
6      0.0306   0.0312  0.0307  0.0307  0.0457
7      0.2811   0.2803  0.2818  0.2805  0.2826
8      0.4225   0.4279  0.4131  0.4194  0.4229
9      0.4206   0.4185  0.4300  0.4185  0.4306
10     0.2738   0.2911  0.2869  0.2755  0.2826
11     0.3918   0.3957  0.3846  0.3846  0.4174
12     0.2058   0.2028  0.2039  0.2040  0.2065
13     0.2794   0.2927  0.2952  0.2801  0.2890
14     0.4466   0.4457  0.4459  0.4466  0.4636
15     0.4344   0.4353  0.4342  0.4344  0.4482
Score  1        2       1       0       11

The best performance of each row is boldfaced.

From Table 3, we can see that our proposed method significantly improves the clustering accuracy on 12 data sets.
Fig. 1 The trend graph of the clustering validity index with the change of the number of clusters on each data set

Both our algorithm and the comparison algorithms have low accuracy on data sets with a large number of actual classes, such as the data set zoo with 7 classes and the data set primary-tumor with 22 classes, which also indicates that our algorithm performs better on data sets with a small number of actual classes.

Similarly, Table 4 shows that our proposed method gets the 11 best results on the 15 data sets. Especially, for the NMI measure, our proposed method beats TWC on all data sets except the heart-statlog data set.

6.3 Experimental result of number of clusters

We also examined the number of clusters selected by our proposed clustering method. Table 5 compares our proposed method and the ground-truth result. It can be seen that our proposed method can obtain the true number of clusters on 8 data sets and approximate results on 5 data sets.

Another interesting phenomenon is that our algorithm tends to cluster into very few clusters, such as on the data sets zoo and primary-tumor, which actually have 7 and 22
Fig. 2 The trend graph of the partitioning validity index with the change of the threshold on the data set zoo

Table 5 The comparison between the number of clusters calculated by the proposed algorithm and the actual number of clusters

ID  Proposed  Ground-truth
1   2         3
2   2         2
3   2         7
4   2         2
5   2         2
6   2         22
7   3         2
8   2         2
9   2         2
10  3         2
11  2         2
12  2         4
13  3         5
14  2         2
15  2         2

classes, but our algorithm divides them both into two clusters. On this point we suspect that the clusters themselves are not directly distinguishable, and that many of their samples are merged into the fringe regions of the two clusters. The tendency to use fewer clusters to represent the data and to put the uncertainty into the fringe region is itself a feature of the three-way clustering method.

Figure 1 is the trend graph of the clustering validity index with the change of the number of clusters on the 15 data sets. In the above comparison experiments, the number of clusters with the largest CVI value on each data set is selected.

6.4 Experimental analysis of partitioning threshold

In order to visually show the trend of the partitioning validity index with the partitioning threshold selected per cluster, we also analyze the PVI values of the two clusters of the data set zoo with the change of the selected partitioning threshold. In Fig. 2, for the two clusters in zoo, the selected thresholds vary in [0.19, 1.0] and [0.44, 1.0], respectively. In the above comparison experiments, we also choose the thresholds with the largest PVI values.

7 Conclusion

As a kind of extension of the traditional clustering method, the three-way clustering method can effectively deal with incomplete and inaccurate data. In view of the problem that existing three-way clustering methods often select the number of clusters and the partitioning threshold according to subjective tuning in the implementation process, in this paper, based on introducing the sample similarity, we defined the clustering validity index and the partitioning validity index to automatically compute the number of clusters and the partitioning threshold, respectively. Furthermore, we proposed an automatic three-way clustering approach by combining the proposed cluster number selection method and the threshold selection method. Comparison experiments also validate the effectiveness of our proposed method.
References

1. Afridi MK, Azam N, Yao J, Alanazi E (2018) A three-way clustering approach for handling missing data using GTRS. Int J Approx Reason 98:11–24
2. Dunn J (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3(3):32–57
3. Friedman HP, Rubin J (1967) On some invariant criteria for grouping data. J Am Stat Assoc 62(320):1159–1178
4. Gu Y, Jia X, Shang L (2015) Three-way decisions based Bayesian network. In: Proceedings of the IEEE international conference on progress in informatics and computing (PIC), pp 51–55
5. Hu B (2017) Three-way decisions based on semi-three-way decision spaces. Inf Sci 382–383:415–440
6. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323
7. Jia X, Shang L (2014) Three-way decisions versus two-way decisions on filtering spam email. In: Transactions on rough sets XVIII, pp 69–91
8. Jia X, Liao W, Tang Z, Shang L (2013) Minimum cost attribute reduction in decision-theoretic rough set models. Inf Sci 219:151–167
9. Jia X, Shang L, Zhou B, Yao Y (2016) Generalized attribute reduct in rough set theory. Knowl-Based Syst 91:204–218
10. Jia X, Li W, Shang L (2019) A multiphase cost-sensitive learning method based on the multiclass three-way decision-theoretic rough set model. Inf Sci 485:248–262
11. Jia X, Rao Y, Shang L, Li T (2020) Similarity-based attribute reduction in rough set theory: a clustering perspective. Int J Mach Learn Cybern 11:1047–1060
12. Li H, Zhang L, Zhou X, Huang B (2017) Cost-sensitive sequential three-way decision modeling using a deep neural network. Int J Approx Reason 85:68–78
13. Li J, Huang C, Qi J, Qian Y, Liu W (2017) Three-way cognitive concept learning via multi-granularity. Inf Sci 378:244–263
14. Li W, Huang Z, Jia X (2013) Two-phase classification based on three-way decisions. In: Proceedings of the international conference on rough sets and knowledge technology, pp 338–345
15. Li W, Huang Z, Jia X, Cai X (2016) Neighborhood based decision-theoretic rough set models. Int J Approx Reason 69:1–17
16. Li W, Huang Z, Li Q (2016) Three-way decisions based software defect prediction. Knowl-Based Syst 91:263–274
17. Li W, Jia X, Wang L, Zhou B (2019) Multi-objective attribute reduction in three-way decision-theoretic rough set model. Int J Approx Reason 105:327–341
18. Li X, Yi H, She Y, Sun B (2017) Generalized three-way decision models based on subset evaluation. Int J Approx Reason 83:142–159
19. Li Y, Zhang L, Xu Y, Yao Y, Lau RYK, Wu Y (2017) Enhancing binary classification by modeling uncertain boundary in three-way decisions. IEEE Trans Knowl Data Eng 29(7):1438–1451
20. Liang D, Xu Z, Liu D (2017) Three-way decisions with intuitionistic fuzzy decision-theoretic rough sets based on point operators. Inf Sci 375:183–201
21. Lingras P, Yan R, West C (2003) Comparison of conventional and rough k-means clustering. In: Proceedings of the international conference on rough sets, fuzzy sets, data mining, and granular computing, pp 130–137
22. Liu D, Liang D (2014) An overview of function based three-way decisions. In: Proceedings of the international conference on rough sets and knowledge technology, pp 812–823
23. Min F, Liu F, Wen L, Zhang Z (2019) Tri-partition cost-sensitive active learning through kNN. Soft Comput 23:1557–1572
24. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11(5):341–356
25. Peters G, Crespo F, Lingras P, Weber R (2013) Soft clustering - fuzzy and rough approaches and their extensions and derivatives. Int J Approx Reason 54(2):307–322
26. Qi J, Qian T, Wei L (2016) The connections between three-way and classical concept lattices. Knowl-Based Syst 91:143–151 (Three-way Decisions and Granular Computing)
27. Qian T, Wei L, Qi J (2017) Constructing three-way concept lattices based on apposition and subposition of formal contexts. Knowl-Based Syst 116:39–48
28. Yao J, Azam N (2015) Web-based medical decision support systems for three-way medical decision making with game-theoretic rough sets. IEEE Trans Fuzzy Syst 23(1):3–15
29. Yao Y (2010) Three-way decisions with probabilistic rough sets. Inf Sci 180:341–353
30. Yao Y (2018) Three-way decision and granular computing. Int J Approx Reason 103:107–123
31. Yu H (2018) Three-way decisions and three-way clustering. In: Proceedings of the international joint conference on rough sets, pp 13–28
32. Yu H, Wang Y (2012) Three-way decisions method for overlapping clustering. In: Proceedings of the international conference on rough sets and current trends in computing, pp 277–286
33. Yu H, Liu Z, Wang G (2014) An automatic method to determine the number of clusters using decision-theoretic rough set. Int J Approx Reason 55(1, Part 2):101–115
34. Yu H, Zhang C, Wang G (2016) A tree-based incremental overlapping clustering method using the three-way decision theory. Knowl-Based Syst 91(1):189–203
35. Yu H, Chen Y, Lingras P, Wang G (2019) A three-way cluster ensemble approach for large-scale data. Int J Approx Reason 115:32–49
36. Yu H, Wang X, Wang G, Zeng X (2020) An active three-way clustering method via low-rank matrices for multi-view data. Inf Sci 507:823–839
37. Yu J, Cheng Q (2002) Search range of optimal cluster number in fuzzy clustering methods. Sci Chin Ser E Technol Sci 32:274–280 (in Chinese)
38. Zhang Q, Lv G, Chen Y, Wang G (2018) A dynamic three-way decision model based on the updating of attribute values. Knowl-Based Syst 142:71–84
39. Zhang Y, Yao J (2017) Gini objective functions for three-way classifications. Int J Approx Reason 81:103–114
40. Zhang Y, Miao D, Zhang Z, Xu J, Luo S (2018) A three-way selective ensemble model for multi-label classification. Int J Approx Reason 103:394–413
41. Zhang Y, Miao D, Wang J, Zhang Z (2019) A cost-sensitive three-way combination technique for ensemble learning in sentiment classification. Int J Approx Reason 105:85–97
42. Zhang Y, Zhang Z, Miao D, Wang J (2019) Three-way enhanced convolutional neural networks for sentence-level sentiment classification. Inf Sci 477:55–64

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
