
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 33, NO. 9, SEPTEMBER 2022
Fast Self-Supervised Clustering With Anchor Graph


Jingyu Wang, Member, IEEE, Zhenyu Ma, Feiping Nie, Member, IEEE, and Xuelong Li, Fellow, IEEE

Abstract— Benefiting from avoiding the utilization of labeled samples, which are usually insufficient in the real world, unsupervised learning has been regarded as a speedy and powerful strategy for clustering tasks. However, clustering directly from primal data sets leads to high computational cost, which limits its application to large-scale and high-dimensional problems. Recently, anchor-based theories have been proposed to partly mitigate this problem and yield a naturally sparse affinity matrix, while it is still a challenge to obtain excellent performance along with high efficiency. To dispose of this issue, we first present a fast semisupervised framework (FSSF) combined with a balanced K-means-based hierarchical K-means (BKHK) method and bipartite graph theory. Thereafter, we propose a fast self-supervised clustering method built on this crucial semisupervised framework, in which all labels are inferred from a constructed bipartite graph with exactly k connected components. The proposed method remarkably accelerates general semisupervised learning through the anchors and consists of four significant parts: 1) obtaining the anchor set as an interim through the BKHK algorithm; 2) constructing the bipartite graph; 3) solving the self-supervised problem to construct a typical probability model with FSSF; and 4) selecting the most representative points regarding the anchors from BKHK as an interim and conducting label propagation. The experimental results on toy examples and benchmark data sets demonstrate that the proposed method outperforms other approaches.

Index Terms— Bipartite graph, label propagation, self-supervised learning, semisupervised framework, special selection.

Manuscript received 20 April 2020; revised 22 September 2020 and 10 January 2021; accepted 27 January 2021. Date of publication 15 February 2021; date of current version 2 September 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 61976179 and Grant 61502391; in part by the National Key Research and Development Program under Grant 2018XXX08241041 and Grant 2018YFB1305702; in part by the Fundamental Research Funds for the Central Universities under Grant 3102019HTXM005, Grant 3102017HQZZ003, and Grant 3102019HTXS001; and in part by the Key Industrial Innovation Chain Project in Industrial Domain of the Key Research and Development Program of Shaanxi Province under Grant 2018ZDCXLGY030203. (Corresponding author: Feiping Nie.)

Jingyu Wang is with the School of Astronautics and the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

Zhenyu Ma is with the School of Astronautics, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

Feiping Nie and Xuelong Li are with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3056080.

Digital Object Identifier 10.1109/TNNLS.2021.3056080

I. INTRODUCTION

LEARNING-BASED methods have been regarded as one of the most prominent research strategies in machine learning [1]–[3], pattern recognition [4]–[6], and data mining [7], [8]. Generally, learning techniques are divided into three categories: supervised learning [9], semisupervised learning (SSL) [10], and unsupervised learning [11]. The first group fully utilizes labeled samples, as in support vector machines [12], [13] and linear discriminant analysis [14], [15]. The second group merely exploits a few labeled samples, with the rest replaced by unlabeled sample points. The third group uses unlabeled samples exclusively, among which principal component analysis [16], [17] is a fundamental and typical dimensionality reduction (DR) [18] approach.

Since labeled data in the real world are extremely deficient, learning label information is treated as an indispensable task in classification [19] and regression missions [20]. However, as the acquisition of labeled samples is tedious and laborious, and sometimes even impossible, prior knowledge is usually not available when dealing with practical problems [21]. Furthermore, label annotations of low quality also have a serious impact on the performance of algorithms. Consequently, semisupervised and unsupervised learning are more generally favored than supervised learning.

SSL leverages both limited labeled data and abundant unlabeled data to learn a more efficient model. For instance, a general semisupervised model combined with novel class discovery [22] was proposed to predict outliers in unknown data. To tackle large graph problems, neighbor graph construction with SSL was introduced [23], and large-scale SSL [24] for classification was also proposed to enhance robustness with the l2,p-norm. However, SSL still needs to utilize labeled data after all. Hence, compared with SSL, unsupervised learning, which predicts the labels of unknown data without auxiliary information, is more broadly applicable.

Clustering plays an important role in unsupervised learning; it aims to partition samples into different groups on the basis of certain criteria. Many classical clustering methods have been introduced, including K-means [25], fuzzy K-means [26], spectral clustering (SC) [27], hierarchical clustering [28], and nonnegative matrix factorization [29]. The target of K-means is to update the centers of all clusters until the convergence condition is satisfied. Although K-means is commonly used to conduct fast clustering and to evaluate the performance of DR, it is applicable only to convex distributions. Comparatively, SC is a graph-based clustering technique seeking the optimal graph-based decomposition, which is capable of tackling more complex data structures than K-means. Recently, numerous researchers have proposed innovative and comprehensive analyses for optimizing SC, such as clustering combined with spectral embedding [30] and nonnegative SC [31].

However, the construction of the similarity matrix on the spectral graph in these approaches requires strictly chosen criteria and a decision on how to build the affinity matrix. The introduction of the Gaussian kernel function makes SC not parameter-free, and SC is time-consuming on large-scale problems because the overall computational complexity of general SC is O(n^2 d), where n and d are the numbers of samples and features, respectively. Thus, a landmark-based sparse representation strategy [32] was proposed to apply to SC and construct a sparser similarity graph.

As a special kind of unsupervised learning pattern, self-supervision-based learning [33], [34] has attracted much attention from researchers in recent years to further improve clustering performance. Generally, the self-supervised clustering process mainly adopts auxiliary tasks to explore and acquire its own supervision information from existing unsupervised data. Thereafter, the obtained supervised signals are utilized for subsequent training or supervised learning. In the past two years, self-supervised clustering modules have been widely used in various deep learning networks. For instance, Zhang et al. [35] proposed a self-supervised convolutional subspace clustering network to simultaneously achieve feature learning and subspace clustering. In order to cope with the outlier problem in multiview clustering, Sun et al. [36] proposed a self-supervised deep multiview subspace clustering algorithm, in which the clustering results and the affinity matrix are mutually trained by an integrated deep learning framework to obtain superior clustering performance under multiple views. Comparatively, in recent years there has been little research on self-supervision-based clustering in the field of machine learning. Ye et al. [37] proposed an affinity learning-based self-supervised diffusion (SSD) for SC to deal with the sensitivity of SC to a fixed affinity matrix, where the clustering results in each iteration provide supervisory signals for the diffusion process. Furthermore, a novel self-supervised clustering method was also proposed to effectively discover new user intentions [38]. However, the above-mentioned self-supervised clustering studies in the field of machine learning still have potential limitations. Concretely, owing to the properties of traditional SC, the iterative process in SSD might bring a serious computational burden when tackling large-scale problems. The study on new user intentions still leverages a small number of labeled samples in its clustering framework, which is essentially not in the realm of self-supervision.

Thence, in order to acquire clustering results with higher quality and faster processing speed, we focus on performing self-supervised clustering tasks from the perspective of semisupervised label propagation. More specifically, we ponder whether it is possible to use sparse theory to integrate semisupervised ideas with unsupervised methods in clustering tasks. Thus, we propose a fast clustering technique referred to as fast self-supervised clustering (FSSC) with anchor graph in this article, which integrates a fast semisupervised framework (FSSF) with unsupervised methods in clustering tasks. The procedures of our algorithm are listed as follows.

1) We utilize the efficient BKHK algorithm to find the anchor set from the original data set, which has lower computational complexity and better effectiveness in comparison with K-means. Therefore, this anchor-generation technique is relatively practical, especially when dealing with large-scale problems possessing complex data structures.
2) The bipartite graph learned from samples and anchors is used to build the sparse similarity matrix W, where there is only one parameter k, meaning the k nearest anchors for each sample.
3) Inspired by the SSL framework, we adopt the virtual labels of the representative anchors from BKHK to obtain the soft label matrix by solving a semisupervised clustering problem with the proposed FSSF. The soft label matrix expresses the probabilities of the data belonging to the classes defined by the virtual labels.
4) A novel special selection strategy is proposed to find the most representative points, whose number is equal to the number of real classes c. This procedure is significant for selecting the ultimate real labels. SSL is conducted to operate label propagation at the end of our method.

It is essential to emphasize some aspects of our method.

1) The major measures to accelerate the entire algorithm are threefold: the BKHK algorithm, the construction of a naturally sparse bipartite graph, and the matrix inversion lemma.
2) We take advantage of the fake labels of each anchor point to operate self-supervised learning, which is exactly an unsupervised learning strategy with the proposed FSSF.

The rest of this article is organized as follows. Related works and notations are discussed in Section II. The details of FSSF are demonstrated in Section III. The proposed FSSC method with FSSF is expounded in Section IV. Corresponding experiments on toy and benchmark data sets are elaborated in Section V. Finally, we give the conclusion and prospective works in Section VI.

II. NOTATIONS AND RELATED WORKS

In this section, some related works, including spectral graph theory and some pivotal graph-based clustering techniques, are briefly reviewed first. Meanwhile, the relevant notations are described for a vivid description of these algorithms and theories. The BKHK algorithm is roughly described at last in preparation for our main method.

A. Graph-Based Learning

In recent studies, graph-based clustering [39] techniques have been widely applied, mainly in an unsupervised manner. First, we denote the data matrix X = [x_1; x_2; ...; x_n] ∈ R^{n×d}, where n and d are the number of samples and the dimensionality, respectively. x_i ∈ R^d denotes the i-th data point, which is a row vector. Supposing that e_ij denotes the similarity weight connecting x_i and x_j, the Euclidean distance is usually adopted for simplicity. Second, these approaches employ a graph G = (V, E) to model the original data set, where V = X is the graph vertex set and E is the edge set. Associated with each edge e_ij ∈ E, W_ij is a nonnegative weight indicating the similarity between x_i and x_j. Generally, graphs can be divided into directed graphs (W_ij ≠ W_ji) [40] and undirected graphs (W_ij = W_ji) [41].


In this article, we focus on the undirected graph with local neighborhood information, and a brief review of commonly used similarity graphs is provided next.

1) The ε-Neighborhood Graph: For two random data points, if the distance between them is smaller than ε, we connect them with the same scale, which is regarded as an unweighted graph. However, the scale of ε is not easy to adjust based on the density distribution of primal data sets.

2) The k-Nearest Neighbor Graph: We connect vertex x_i with data point x_j if x_j belongs to the k-nearest neighbors (KNN) of x_i, which leads to the problem that the similarity graph will not be symmetric unless we operate W' = (W^T + W)/2 [42]. If we impose the stronger condition that two sample points must be nearest neighbors of each other, the corresponding graph is referred to as the mutual KNN graph.

3) The Fully Connected Graph: All points are simply connected with nonnegative similarity values with each other. For instance, the Gaussian kernel function [43]

    W_ij = W_ji = exp( -||x_i - x_j||^2 / (2σ^2) )    (1)

can be utilized, where the parameter σ is the width of the Gaussian function. This parameter plays a similar role as the parameter ε in the ε-neighborhood graph. We are capable of utilizing various criteria, including the Euclidean distance, Mahalanobis distance, Minkowski distance, and cosine similarity, to improve the graph.

4) The Adaptive Gaussian Graph: In order to pursue a parameter-free measurement, the self-tuning graph with Gaussian function [44] was proposed. The similarity between two vertices x_i and x_j can be written as

    W_ij = exp( -d^2(x_i, x_j) / (σ_i σ_j) )    (2)

where σ_i represents the local scaling for x_i, which is the distance between vertex x_i and its farthest selected neighbor. Therefore, this parameter can be formulated as

    σ_i = d(x_i, x_k)    (3)

where x_k is the k-th neighbor of x_i. This strategy will also be used in the following bipartite graph construction.

To consider the connection between each point and avoid the high computational complexity caused by too many iterations of K-means on large-scale data sets, classical SC methods are divided into two steps: 1) solving a relaxed continuous optimization problem [we denote by tr(·) the trace operator]

    H* = arg min_{H^T H = I} tr(H L H^T)    (4)

where L is the graph Laplacian matrix built from the affinity matrix W, to obtain a constrained matrix H; and 2) applying K-means or spectral rotation to build the indicator matrix of clusters.

However, on the one hand, the affinity matrix directly obtained from the original data incorporates noisy and redundant information, which severely affects the clustering performance. In order to dispose of this issue, Zhu et al. [45] adopted a low-dimensional subspace and a low-rank constraint dynamically to learn the similarity matrix effectively and efficiently, an idea that has also been utilized in a novel unsupervised feature selection method [46] to preserve local and global structure simultaneously. In the multiview clustering field, owing to the flaws of a fixed common affinity matrix, the one-step multiview SC (OMSC) method [47] was proposed with discriminative weights over diverse views, where the mapping matrix and the affinity matrix for different views are optimized dynamically to acquire better clustering performance. Besides, a fuzzy and robust multiview clustering model [48] was proposed to strengthen the stability of each view, where a sparser affinity matrix with more abundant and helpful information was obtained in this framework. Furthermore, in multitask SC [49], the affinity matrix and the mapping function are learned simultaneously with mutual improvement to effectively explore intertask correlation.

On the other hand, owing to the extra heat kernel parameter and the singular value decomposition of the affinity matrix built from primal data, classical SC is generally not efficient at tackling large-scale problems, its computational complexity being O(n^2 d + n^3). Hence, anchor-based theory [50], [51] has been introduced in recent years to generate a bipartite graph connecting the anchor layer and the sample layer, where anchors are a series of representative points that roughly cover the entire sample set. Many scholars have utilized bipartite graphs to build a naturally sparse similarity matrix by adjusting the number of anchors and their nearest neighbors. Zhu et al. [52] proposed a fast SC to construct a parameter-free large graph with effective neighbor assignment. Subsequently, double anchor layers and even hierarchical bipartite graphs were proposed and utilized to explore more explicit connection relationships among all samples, e.g., Representative Point-based Spectral Clustering (RPSC) [53] and SC based on Hierarchical Bipartite Graph (SCHBG) [54]. A random walk-based Laplacian matrix [55] was also proposed to balance the anchors and samples and improve clustering performance. Furthermore, anchor-based graphs have recently been combined with SSL to efficiently deal with large-scale problems [56]–[58], and such methods are also able to effectively dispose of the out-of-sample problem.

In this article, inspired by the integration between anchor-based theory and SSL, the proposed FSSC method with FSSF first and quickly obtains the crucial probability model between anchors and samples, which represents the belonging relationship of the fake labels for each sample. A novel special selection strategy is then conducted to choose the most representative points with the best quality by extracting the maximum score for each sample, a score that reflects how unlikely the sample point is to be an outlier. Terminally, the clustering results are gained by the label propagation process. As an essential part of generating anchor points, the BKHK algorithm is roughly described next.

B. Balanced K-Means-Based Hierarchical K-Means

In the past few years, anchor-based theory has been introduced to accelerate the construction of a connected graph. For a specific quantity of anchors, generally, the number of anchors is far less than that of the samples; however, too sparse a set of representative anchors cannot deliver satisfying performance. On the contrary, densely packed anchors might bring extremely high computational cost when we deal with large-scale problems. In addition, regarding concrete anchor-selection methods, on the one hand, though we are able to randomly select representative points to quickly view the local structure, it is difficult to reach satisfying performance with the graph constructed from the original samples and these randomly selected anchors. On the other hand, utilizing K-means to select anchors can obtain desirable representative anchors with ideal performance; however, its computational complexity is extremely high when solving large-scale problems. Thus, our overall method starts with the speedy and steady BKHK algorithm [52] to find representative anchors.

Fig. 1. Similar to cell division, BKHK performs dichotomy at each data group.

Fig. 1 vividly shows the entire course of the BKHK algorithm. Similar to the cell-division process, this algorithm adopts a balanced binary tree structure, splits an almost equal number of samples into two clusters at each node, and hierarchically processes each newly obtained cluster (if p hierarchies are generated, 2^p representative anchors are obtained to constitute the anchor set U). Since BKHK is really efficient on large data sets with high dimensions or many sample points, it is adopted to accelerate graph learning and obtain the anchor set U as an interim. To simplify the construction of the label matrix Y, the indices of all anchors learned from the primal data sets are recorded.
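The paper describes BKHK only in prose, so the following is a minimal illustrative sketch (our own reading, not the authors' implementation) of a balanced two-way K-means split applied hierarchically; the balancing rule used here, assigning each half of the samples after sorting by the difference of squared distances to the two current centers, is one common way to realize such a split.

```python
import numpy as np

def balanced_two_means(X, rng, n_iter=10):
    """Split X into two clusters of (almost) equal size with a balanced 2-means step."""
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances to both centers
        order = np.argsort(d[:, 0] - d[:, 1])                     # samples preferring center 0 come first
        labels = np.ones(len(X), dtype=int)
        labels[order[: len(X) // 2]] = 0                          # enforce the (almost) equal split
        centers = np.vstack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def bkhk_anchors(X, levels, seed=0):
    """Hierarchical balanced splits: 2**levels leaves, one anchor (leaf mean) per leaf."""
    rng = np.random.default_rng(seed)
    groups = [X]
    for _ in range(levels):
        new_groups = []
        for G in groups:
            lab = balanced_two_means(G, rng)
            new_groups += [G[lab == 0], G[lab == 1]]
        groups = new_groups
    return np.vstack([G.mean(axis=0) for G in groups])

# e.g., 4 levels -> 16 anchors, as in the two-moon toy example of Section V
anchors = bkhk_anchors(np.random.default_rng(0).normal(size=(400, 2)), levels=4)
print(anchors.shape)  # (16, 2)
```

With p levels this yields 2^p anchors, matching the binary-tree construction described above.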


III. FAST SEMISUPERVISED FRAMEWORK

In this section, the details of the FSSF are described. First, we introduce a general SSL framework for clustering tasks. Second, the acceleration strategy for this framework on the similarity matrix is demonstrated. Since the labels of the labeled points and of the remaining points should be treated differently, the parameters αl and αu, related to the regularization parameter μ, will be introduced; their meaning will be explained later.

A. General Semisupervised Framework

Consider a graph G = (V, E) with V nodes corresponding to n sample points, in which the first l nodes denote the l labeled data points and the remaining u nodes denote the abundant u unlabeled data points.

Assuming that the number of true classes is c, we denote the initial label matrix Y = [Y_1^T; Y_2^T; ...; Y_n^T] ∈ R^{n×(c+1)}, where the Y_i ∈ R^{c+1} (1 ≤ i ≤ n) are column vectors and the motivation of the (c+1)-th column is to evaluate the probability of each sample point being an outlier. For the l labeled data points, Y_ij = 1 if the i-th sample belongs to the j-th class and Y_ij = 0 otherwise. Comparatively, for the u unlabeled data points, we initially set Y_ij = 1 if j = c + 1 and Y_ij = 0 otherwise. The soft label matrix F = [F_1^T; F_2^T; ...; F_n^T] ∈ R^{n×(c+1)} is a true probability model and the optimized objective matrix, where the F_i ∈ R^{c+1} (1 ≤ i ≤ n) are column vectors and F_ij represents the probability of the i-th sample belonging to the j-th class. Generally, the meaning of the soft label matrix F is very useful for postprocessing.

Assume W, with entries W_ij, is the similarity matrix among the n data points built with the Gaussian kernel function, and its corresponding degree matrix D is a diagonal matrix whose i-th entry is d_i = Σ_j W_ij. Let ||·||_F denote the Frobenius norm of a matrix, i.e., ||M||_F^2 = tr(M^T M). According to the constructed graph, the general semisupervised framework can be formulated as the following cost function:

    Q(F) = Σ_{i,j=1}^{n} W_ij ||F_i - F_j||_F^2 + Σ_{i=1}^{n} μ_i ||F_i - Y_i||_F^2.    (5)

The first term in this cost function is a clustering term, which reaches the target that similar samples have similar labels. In detail, if the similarity between the i-th data point and the j-th data point is extremely high, then in order to minimize the cost function, F_i must be set almost equal to F_j. If the similarity between them is almost zero or equal to zero, the constraint on these two samples will be relatively weak. We denote by μ_i > 0 a regularization parameter for each data point. Accordingly, the second term is a regularization term on the labels, measuring the discrepancy between the obtained soft labels F_i and the primal labels Y_i. On the one hand, if μ_i is deployed to be zero, the label constraint becomes invalid and the problem is purely regarded as a label propagation process. On the other hand, if μ_i is employed as infinitely large, the initial label Y_i will be maintained. The concrete employment for different samples will be elaborated when two novel parameters are introduced subsequently.

To cope with this problem, we set the corresponding SSL model as

    min_F Σ_{i,j=1}^{n} W_ij ||F_i - F_j||_F^2 + Σ_{i=1}^{n} μ_i ||F_i - Y_i||_F^2.    (6)

Then, we deform the Frobenius norm and obtain the matrix form of the problem as

    min_F Σ_{i,j=1}^{n} W_ij (F_i - F_j)^T (F_i - F_j) + tr[(F - Y)^T U (F - Y)]    (7)

where U is an n × n diagonal matrix whose i-th entry is μ_i. The problem can be transformed through identity deformation into

    min_F tr(F^T L F) + tr[(F - Y)^T U (F - Y)]    (8)

where L = D - W is the unnormalized graph Laplacian matrix.
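The step from the pairwise form (7) to the trace form (8) relies on the standard graph-Laplacian identity Σ_{i,j} W_ij ||F_i - F_j||^2 = 2 tr(F^T L F); the factor of 2 only rescales the clustering term and is typically absorbed. A quick numerical check of this identity on random data (an illustrative sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3

W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)   # symmetric affinity
L = np.diag(W.sum(axis=1)) - W                                        # unnormalized Laplacian
F = rng.random((n, c + 1))

pairwise = sum(W[i, j] * np.sum((F[i] - F[j]) ** 2)
               for i in range(n) for j in range(n))
print(np.allclose(pairwise, 2.0 * np.trace(F.T @ L @ F)))             # True
```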


Let us denote the objective function in (8) by P(F). Since the optimal solution of problem (8) satisfies that the derivative of P(F) with respect to F is equal to zero, we obtain

    ∂P(F)/∂F |_{F=F*} = 2 L F* + 2 U (F* - Y) = 0.    (9)

Through simplification, the optimal solution F* can be obtained as

    F* = (L + U)^{-1} U Y    (10)

where the sum of each row of F* is equal to 1, which represents a classical probability model for easier follow-up data processing. [We will see the strict proof of the probability model following (12), when two novel parameters have been introduced.]

As for the evaluation of the probability of belonging to the outliers for each sample point, which will be demonstrated in the following parts, two parameters αl and αu should be introduced to simplify the parameter-setting process.

Owing to the introduction of the evaluation strategy, we need to set different constraints for labeled samples and unlabeled samples. Generally, for a data point x_i, whether x_i is labeled or unlabeled, the value of the corresponding α is calculated by

    α_i = d_i / (d_i + μ_i)    (11)

where d_i is the i-th element of the degree matrix D obtained from W, and α_i is determined by d_i and the regularization parameter μ_i. Under this condition, the value range of α_i is limited to [0, 1], which is really convenient for parameter selection. The corresponding models in the framework for particular values of α_i will be illustrated below.

Meanwhile, β_i = μ_i / (d_i + μ_i) is introduced, satisfying β_i = 1 - α_i. Defining the matrices I_α and I_β, we can easily see that I_α = I - I_β, where I is an n × n identity matrix and I_α is an n × n diagonal matrix with the i-th entry being α_i.

Defining P = D^{-1} W, (10) becomes

    F* = (D - W + U)^{-1} U Y
       = (I - D^{-1} W + D^{-1} U)^{-1} (D^{-1} U) Y
       = (I_α - I_α D^{-1} W + I_β)^{-1} I_β Y
       = (I - I_α P)^{-1} I_β Y    (12)

which is the ultimate solution of the soft label matrix F in the general semisupervised framework.

Based on the concrete construction details of P and Y, it can easily be found that P 1_n = 1_n and Y 1_{c+1} = 1_n, where 1_n ∈ R^{n×1} and 1_{c+1} ∈ R^{(c+1)×1} indicate column vectors whose elements are all one. Combined with the traits of I_α and I_β, we have

    I_α P 1_n + I_β Y 1_{c+1} = 1_n  ⇒  I_β Y 1_{c+1} = (I - I_α P) 1_n  ⇒  (I - I_α P)^{-1} I_β Y 1_{c+1} = 1_n    (13)

which embodies that the obtained solution F* in this semisupervised framework possesses the characteristic of a probability model.

Furthermore, the difference in concrete value and corresponding meaning between αl and αu is elaborated and discussed as follows.

1) Assume a labeled data point x_i first. When we absolutely guarantee the validity of the initial label, this label should remain unchanged, and α_i is set to zero. In this condition, μ_i should be deployed to be large enough in (11), which promotes the effect of the fitting term in problem (6) and contributes to the complete fixation of the label of x_i. Otherwise, we should make α_i a positive value so that the initial label becomes adjustable, which can effectively handle noise in the initial labels.

2) Second, for an unlabeled data point x_j, if we can ensure that no novel class exists in the data set, which means every sample belongs to one of the assumed c classes, α_j is set to 1 and the value of μ_j is low enough compared with d_j, so that the optimization only conducts the clustering term according to the primitive graph. Generally, we set α_j to a relatively large value but lower than 1, since the number of outliers is usually small in practice.


B. Acceleration on the Similarity Matrix W

However, the similarity matrix W in (12), constructed purely by the Gaussian kernel function, is obtained directly from the primal data set X, which means that the solution process of the general semisupervised clustering framework is not fast enough. Therefore, in order to accelerate the general framework, we consider utilizing the anchor set U from BKHK to construct a naturally sparse bipartite graph [59] that builds the connection between the anchor set U and the original data set X. We assume that the number of anchors is m and the objective bipartite graph matrix is B = [b_1^T; b_2^T; ...; b_n^T] ∈ R^{n×m}, which encodes the similarity between samples and representative anchors. The original problem can be formulated as

    min_{b_i^T 1 = 1, b_i ≥ 0} Σ_{j=1}^{m} h_ij b_ij + γ Σ_{j=1}^{m} b_ij^2    (14)

where h_ij = ||x_i - y_j||_2^2 represents the squared Euclidean distance between the i-th sample and the j-th anchor. The first term is the regularization term, which expresses the smoothness between anchors and original samples. The second term is the sparsity term, and the sparsity of B is governed by the value of γ. Thus, problem (14) becomes

    max_{γ, ||b_i||_0 = k} γ    (15)

where k denotes the number of connected anchors for each original sample in the bipartite graph matrix B; it is a nonzero integer similar to the parameter k in the KNN algorithm. The value of k varies across data sets owing to their distinct data distributions. In general, the parameter k is set from 3 to 30.

There are two constraints in this problem: the first is the equality constraint b_i^T 1 = 1, which avoids the all-zero vector; the second is the inequality constraint b_i ≥ 0, since B_ij carries the meaning of a similarity. Thus, through the Lagrangian solution we obtain the bipartite graph weights (with the h_ij of each sample sorted in ascending order)

    b̂_ij = (h_{i,k+1} - h_ij) / (k h_{i,k+1} - Σ_{j'=1}^{k} h_{ij'}),  if j ≤ k;    b̂_ij = 0,  if j > k.    (16)

Moreover, we refine the naturally sparse bipartite graph matrix B obtained by (16): if x_i itself belongs to the anchor set U, the nearest anchor for x_i will be itself. Therefore, the selection of neighbors is optimized by disposing of this zero-distance value, which improves performance slightly.

When the construction of the bipartite graph matrix B is accomplished, the symmetric similarity matrix W ∈ R^{n×n} can then be constructed as

    W = B Λ^{-1} B^T    (17)

where the matrix Λ ∈ R^{m×m} is a diagonal matrix whose i-th element θ_i is the sum of the i-th column of B, i.e., θ_i = Σ_j B_ji. We can easily see that the corresponding degree matrix is D = I for this bipartite graph, and the Laplacian matrix L equals I - W. Thus, we have

    L = I - B Λ^{-1} B^T.    (18)

In this condition, the number of labeled data points is m, which is equal to the number of representative anchors. The initial label matrix Y ∈ R^{n×(m+1)} and the soft label matrix F ∈ R^{n×(m+1)} are correspondingly changed in dimension. Since D = I from the bipartite graph theory above, α_i is set to 1/(1 + μ_i) and β_i correspondingly equals μ_i/(1 + μ_i).

According to the solution process of the general semisupervised framework, the optimal soft label matrix F* can be obtained as

    F* = (I - I_α W)^{-1} I_β Y.    (19)

Therefore, the FSSF has been proposed with the optimized, naturally sparse similarity matrix W; the specific selection of all parameters will be elaborated in Section IV to tackle clustering problems.

IV. FAST SELF-SUPERVISED CLUSTERING

In this section, the details of the proposed FSSC method are described. The setting of parameters based on the proposed FSSF in self-supervised clustering is demonstrated concretely, followed by the special selection strategy that discovers the most representative c points with the best quality for label propagation from the m anchors. We exploit FSSF to conduct the clustering process in a self-supervised manner, where the m representative anchors are simultaneously considered as the foundation of the bipartite graph and as the m labeled points. The labels are completely artificial and are marked by the selection order of each representative point in the last hierarchy of the BKHK structure. According to the assessment strategy on outliers, we are able to find proper features for each sample to extract the corresponding representative points, where the anchor set U from BKHK is regarded as an interim to get the final c points belonging to the c classes of the overall clustering. Label propagation is conducted at the end of our method. The computational complexity analysis for each step and for the overall algorithm is stated at the end of this section. Thus, our approach can be divided into the following parts.

A. Self-Supervised Clustering Algorithm With Fake Labels

The motivation of the self-supervised clustering based on FSSF is to construct a probability model for easier post-processing. The initial and artificial labels of the m representative points are virtual; they play an important transitive role in FSSF to obtain the matrix F. As for I_α and I_β, on the one hand, we generally set a positive α_i for labeled data x_i to tackle the appearance of noise. However, different from the known labels in traditional SSL, the m representative points are selected to roughly and evenly cover the primal data points, and their acquired fake labels are merely related to the original structure of the n samples. In order to maintain the original distribution of the entire set of representative points, the m different labels of the m representative points in matrix Y are theoretically considered to contain no noise. Furthermore, this assumption is validated by the experimental results on the benchmark data sets. On the other hand, for unlabeled data x_j, α_j = 1 means that we lose the capability of filtering outliers. Accordingly, α_j is deployed to be relatively large, to alter their labels as much as possible, but lower than 1, to retain the ability to detect and filter outliers in the semisupervised label propagation process.

Moreover, we utilize the matrix inversion lemma [60] to further accelerate the computation of (19) when solving large-scale problems (in this condition, the inversion of a general n × n matrix is usually time-consuming). This inversion is transformed into the computation of an m × m matrix. Assuming Q_1 = (I - I_α B Λ^{-1} B^T)^{-1} - I, through (I - I_α B Λ^{-1} B^T)(I + Q_1) = I, Q_1 can be transformed into

    Q_1 = (I - I_α B Λ^{-1} B^T)^{-1} I_α B Λ^{-1} B^T.    (20)

We denote Q_2 = I_α B Λ^{-1} B^T. Since F* = (Q_1 + I) I_β Y, F* can be deformed as follows:

    F* = [(I - I_α B Λ^{-1} B^T)^{-1} Q_2 + I] I_β Y
       = [I_α B (Λ - B^T I_α B)^{-1} B^T + I] I_β Y    (by the matrix inversion lemma)
       = I_α B [Λ + B^T (-I_α B)]^{-1} B^T I_β Y + I_β Y.    (21)
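As a sanity check on this acceleration, the sketch below builds a small anchor graph with the k-nearest-anchor weights of (16), solves (19) directly with an n × n system, and solves it again through the m × m system of (21); the two results coincide. The anchors, labels, and parameter values here are made up for the check and are not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, k = 1000, 32, 10, 5

X = rng.normal(size=(n, d))
U = X[rng.choice(n, size=m, replace=False)]        # stand-in anchors (the paper uses BKHK)

# bipartite graph B with the closed-form k-nearest-anchor weights of Eq. (16)
h = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
B = np.zeros((n, m))
for i in range(n):
    nearest = np.argsort(h[i])[: k + 1]            # k nearest anchors plus the (k+1)-th
    hi = h[i, nearest]
    B[i, nearest[:k]] = (hi[k] - hi[:k]) / (k * hi[k] - hi[:k].sum())

Lam = np.diag(B.sum(axis=0))                       # Lambda: column sums of B
W = B @ np.linalg.solve(Lam, B.T)                  # W = B Lambda^{-1} B^T, Eq. (17); row sums = 1

alpha = np.full(n, 0.99)
alpha[rng.choice(n, size=m, replace=False)] = 0.0  # pretend m rows carry the anchors' fake labels
I_alpha, I_beta = np.diag(alpha), np.diag(1.0 - alpha)
Y = rng.random((n, m + 1)); Y /= Y.sum(axis=1, keepdims=True)

F_direct = np.linalg.solve(np.eye(n) - I_alpha @ W, I_beta @ Y)             # Eq. (19), n x n solve
M = Lam - B.T @ (I_alpha @ B)                                               # Lambda + B^T(-I_alpha B)
F_fast = I_alpha @ B @ np.linalg.solve(M, B.T @ (I_beta @ Y)) + I_beta @ Y  # Eq. (21), m x m solve
print(np.allclose(F_direct, F_fast))                                        # True
```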


The contributions of the obtained soft label matrix F are as follows.

1) It is relatively convenient for follow-up processing, owing to the probability model of the compact F, and we can judge the rough class of each sample through F_ij over the m classes derived from the selected anchors. Although the class assignment to the m anchors does not represent the ultimate clustering result, we are able to extract significant information from it to continue the subsequent clustering process.

2) Benefitting from the matrix inversion lemma in (20), the inversion of an n × n matrix has been changed into the inversion of an m × m matrix, which dramatically mitigates the computational cost, since m ≪ n in most cases. Concretely, in (21), the computational complexity of [Λ + B^T(-I_α B)]^{-1} is O(m^2 n) + O(m^3) + O(mn), and the computational complexity of the remaining matrix calculations is O(m^2 n) + O(mn). Therefore, we need O(m^2 n) + O(m^3) to calculate the soft label matrix F, which is linearly related to the number of samples n.

Algorithm 1 FSSC
Input: The representative anchor set U and the order of all anchors Rank.
Self-supervised clustering with FSSF:
 1: Set Y of all samples according to U and Rank.
 2: Build B utilizing Eq. (16).
 3: while not converged do
 4:   Calculate F by Eq. (19).
 5:   Accelerate the computation of F by Eq. (21).
 6: end while
The selection of c representative points:
 7: i = 1.
 8: while i < c do
 9:   Extract the first m columns of F.
10:   Calculate the sum of each row, score_i, for x_i.
11:   Select the maximum of score_i and record its index.
12:   Amend the other scores.
13:   i = i + 1.
14: end while
Label propagation:
15: while not converged do
16:   Calculate T by computing Eq. (26) quickly.
17: end while
Output: The anticipated clustering label Z of the n data points from the converged T_final.

B. Selection of c Representative Points

F_ij only represents the probability of the i-th sample belonging to the j-th representative anchor or to the outliers, not the relationship between samples and true classes. In detail, since it is known that the number of true classes in the data set is c, there remains the question of how to express the probability of each sample belonging to the c classes. The focus of this problem is how to use the m obtained anchors to get c representative points from the n primal samples. Thus, we consider extracting a unique score for each sample from the matrix F to represent its significance of not being an outlier.

Since the (m+1)-th column of F ∈ R^{n×(m+1)} represents the underlying probability of being an outlier, we delete this column and keep the first m columns to get F̂ ∈ R^{n×m} for preprocessing. For a sample x_i, we can easily see that the greater the sum of the i-th row of F̂, the less likely the i-th sample is to be identified as an outlier when we are sure that the true labels contain c classes. We set α_u ≠ 1 to avoid the row sum of every unlabeled sample being exactly 1, which would make the row sum useless as a selection score.

Therefore, a special selection strategy to extract the c representative points with the best quality from the n samples is introduced, which aims to find one corresponding representative point for each of the c classes and to avoid the absence of a representative point for any class. It should be emphasized that there is actually no correlation between the c representative points and c cluster centers. On the one hand, a larger extracted score for the i-th sample merely indicates that the i-th sample is more likely to belong to one of the known c true classes. On the other hand, the selection of subsequent representative points is subject to the previously selected ones, which only aims to choose representative points from different classes, respectively. It is only an ideal situation in which the selected c representative points coincide with the cluster centers of the categories, and the probability of this is extremely small.

Since the number of selected representative points is exactly c, whenever we choose a representative point, we should suppress the sample points with high similarity to it. We adopt a correction of the row sums of all other samples, where the more similar a point is to the chosen representative point, the smaller its corrected row sum of similarities becomes. Thus, the probability of other highly similar points being selected becomes very small.

Inspired by feature selection, we denote the sum of the i-th row of F̂ as the score of the i-th sample point:

    score(x_i) = Σ_{j=1}^{m} F̂_ij    (22)

and we choose the maximum score among the n samples; a larger score means that the sample is less likely to be an outlier and is a better-quality representative of one of the c classes.

When the first representative point, say x_i, is selected, the scores of the other samples are amended according to the similarity between x_i and the other points in W. Denoting the selected point by z_1, this can be formulated as

    z_1 = arg max_i score(x_i).    (23)

For a point x_j different from x_i, with feature score score(x_j), the modification process is

    score(x_j)_new = (1 - W_ij) score(x_j)    (24)

and the scores of all the other samples are amended by (24) one by one. Then, we choose the point x_h with the second-largest score once all the amending is accomplished. The algorithm ends when c representative points have been selected completely. These c representative points are treated as the initial points with true labels to operate the label propagation processing.
Authorized licensed use limited to: UNIVERSIDAD DE VIGO. Downloaded on February 02,2023 at 10:30:42 UTC from IEEE Xplore. Restrictions apply.
4206 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 33, NO. 9, SEPTEMBER 2022

Fig. 2. Illustration of FSSC model from BKHK to final label propagation.

C. Label Propagation D. Computational Complexity Analysis


The general SSL framework in Section III is also adapted The computational complexity of SC is O(n 2 d +n 3 ), which
to conduct label propagation, which can be formulated as is really time-consuming to solve the large-scale problem,

n 
n while the proposed method possesses more efficient computa-
min Wi j Ti − T j 2F + μ̂i Ti − Ŷi 2F (25) tional performance. Concretely, the calculated complexity of
T
i, j =1 i=1 FSSC can be separated fourfold.
where T ∈ Rn×(c+1) is the terminal propagating result and 1) BKHK is conducted based on the dichotomous K -means
Ŷ ∈ Rn×(c+1) is the initial label matrix based on the most algorithm. The computational complexity of the first
representative points. We still utilize the affinity matrix W dichotomous K -means on the primal data set is O(nd).
obtained from Section III to improve the performance. And In subsequent hierarchies of a binary tree, each time the
the problem (25) can be processed by Lagrange solution. Thus, number of samples of binary K -means will be halved,
the label propagation procedure becomes while the number of executions will be doubled. On this
condition, the computational complexity is still O(nd).
T ∗ = (I − W + Û )−1 Û Ŷ (26)
In addition, since the number of decomposed layers is
where Û is a diagonal matrix with the i th entry being μ̂i . log(m), the calculated cost of BKHK is O(nd log(m))
Then, α̂i = di /(di + μ̂i ) and β̂i = μ̂i /(di + μ̂i ) are defined to obtain m anchors from n samples.
as the same as general semisupervised framework. In this 2) It takes us O(nmd) to build the major matrix H ∈ Rn×m
condition, for labeled data point x i , α̂i will be set to be in (14) between m anchors and n samples. Besides,
zero, since c representative points do not exist noises exactly. the occupied time of (16) is so little that its time
Comparatively, for unlabeled data point x j , α̂ j will be set to be complexity can be ignored to construct bipartite graph
one due to the absence about outliers. Meanwhile, we denote matrix B. Therefore, we spend O(nmd) in building
diagonal matrix Iˆα and Iˆβ with the i th entry α̂i and β̂i , anchor-based graph B and speeding up the semisuper-
respectively. Thus, the iteration format becomes vised framework.
3) The majority of the computational cost when acquiring
T ∗ = (I − Iˆα P)−1 Iˆβ Ŷ . (27)
soft label matrix F derives from the calculation of (21).
Furthermore, we are capable of accelerating the computation Benefit from the acceleration of bipartite graph and
of (27), which is the same as the meaning of (19). matrix inversion lemma, the corresponding computa-
Through certain iterations of label propagation, we could tional complexity is optimized from O(n 3 ) to O(m 2 n +
find out the largest Ti j from Ti , which means i th data m 3 ). Generally, due to m n, it is determined as
point belongs to j th class on ground truth (GT). Therefore, O(m 2 n) in FSSF.
the anticipated label Z ∈ Rn vector can be obtained. To make 4) It takes us O(nmc) to find out the most representa-
the general model more impressive and intuitive, we uti- tive c sample points by updated feature-like selection.
lize a portrayal to illustrate the FSSC algorithm, as shown Subsequently, propagating labels through the most rep-
in Fig. 2. Subsequently, the overall algorithm is summarized resentative c points with the best quality to all sample
in Algorithm I. points is spending O(m 2 n + nmd) by FSSF.
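To round off Section IV, the sketch below implements the final propagation step (26)-(27) under the anchor-graph convention D = I (so P = W); the interface and the defaults α̂_l = 0 and α̂_u = 1 follow the description above, while everything else is illustrative.

```python
import numpy as np

def propagate_labels(W, reps, rep_labels, n_classes, alpha_u=1.0):
    """Label propagation from the c selected representatives, Eqs. (25)-(27).

    W          : (n, n) anchor-based similarity matrix (row sums 1, so D = I and P = W).
    reps       : indices of the c representative points.
    rep_labels : their class indices in 0..n_classes-1 (the trusted labels for propagation).
    """
    n = W.shape[0]
    Y_hat = np.zeros((n, n_classes + 1))
    Y_hat[:, n_classes] = 1.0                      # unlabeled rows start in the extra column
    Y_hat[reps, :] = 0.0
    Y_hat[reps, rep_labels] = 1.0                  # representatives keep their assigned class

    alpha = np.full(n, alpha_u)                    # hat-alpha_u = 1: outliers assumed already filtered
    alpha[reps] = 0.0                              # hat-alpha_l = 0: representatives are trusted
    I_alpha, I_beta = np.diag(alpha), np.diag(1.0 - alpha)

    # assumes every sample is connected, through the graph, to at least one representative
    T = np.linalg.solve(np.eye(n) - I_alpha @ W, I_beta @ Y_hat)   # Eq. (27)
    return T[:, :n_classes].argmax(axis=1)         # predicted label vector Z
```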


V. EXPERIMENTAL RESULTS

In this section, we first validate our approach and exhibit the main stages of the proposed method graphically on two toy examples. Second, the parameter sensitivity of our method with respect to the number of anchors m, αl, and αu is analyzed concretely on five benchmark data sets with the clustering metrics ACCuracy (ACC), normalized mutual information (NMI), and clustering time. Finally, comparison results with K-means, SC [61], LSC-R [62], LSC-K [62], FSC [52], and FRWL-B [55] are assessed, where we run every method ten times on these five data sets, calculate the mean of all results, and report the metrics ACC, NMI, and running time. These experimental results demonstrate the effectiveness of our algorithm and also validate the computational complexity analysis in Section IV-D.

A. Validation on Toy Examples

We give two toy examples to analyze and validate our method. Fig. 3(a) shows the primal two moon toy data containing two classes with the same number of samples and 0.12 noise, and Fig. 3(d) shows the initial spheres toy data, which consists of four classes.

Fig. 3. Experiments on two toy examples. (a) Original data of Two Moon. (b) Anchors of Two Moon. (c) Final points of Two Moon. (d) Original data of Spheres. (e) Anchors of Spheres. (f) Final points of Spheres.

From Fig. 3(b), it can be seen that the number of labeled representative anchors is set to 16, which is reasonable for 400 samples and two classes. We can easily see that the selection of anchor points is roughly uniform within and between classes, where the red diamonds represent the anchor points of the first class and the blue squares represent the anchor points of the second class. Combined with the theory of the construction of the bipartite graph, a sample x_i will be most similar to the nearest anchor point of the same category. The anchor results on the spheres data set have similar characteristics to the two moon data set, as shown in Fig. 3(e).

Fig. 3(c) and (f) show the results of the selection of the c representative points on these two data sets, which show that there is exactly one representative point per category.

The clustering results on these two data sets are shown in Fig. 4, where the clustering accuracy on the two moon data can achieve 100% and that on the spheres data can reach 99.5%. Though the m anchors and the c representative points sometimes fluctuate slightly across experiments, the desirable results are always maintained on these two toy examples.

Fig. 4. Clustering results on two toy examples. (a) Two moon data. (b) Spheres data.

B. Parameter Sensitivity

We begin by describing our experimental benchmark data sets. The specific characteristics of these data sets (number of instances, dimensions, and classes of the GT) are all listed in Table I. Two of them, Abalone and Letter, are from the UCI machine learning repository [63], while PalmData25 with a 16 × 16 image scale, the US Postal handwritten digit data set (USPS) with a 16 × 16 image scale, and the Mixed National Institute of Standards and Technology handwritten digit data set (MNIST) with a 24 × 24 image scale belong to the image data sets.

TABLE I. Description of Data Sets.

The three essential parameters in our method are the number of representative anchors m and the regularization parameters αl and αu. The influence of these parameters on running time and performance is validated as follows.


Fig. 5. Trend of ACC, NMI, and clustering time via adjusting the number of anchors on five benchmark data sets. (a) PalmData25. (b) Abalone. (c) USPS. (d) Letter. (e) MNIST.

1) Number of Anchors m: The experimental results on the five benchmark data sets are exhibited in Fig. 5. The first panel of Fig. 5(b)-(e) shows that densely enough representative anchors are indispensable to build a reasonable connection between anchors and original samples, while an overly large number of anchors, which usually contains redundant information, is useless for clustering performance. Therefore, an appropriate number of anchors plays an essential role in improving the final performance. In addition, the second panel of Fig. 5(a)-(e) demonstrates that the running time ascends linearly as the number of representative anchors increases, which matches the complexity analysis in Section IV-D.

2) Framework Parameters αl and αu: The specific parameter sensitivity with respect to αl and αu under the clustering metric ACC is portrayed in Figs. 6-10, which correspond to PalmData25, Abalone, USPS, Letter, and MNIST, respectively. For the different combined values of αl and αu, we record the corresponding mean results of ten experiments on ACC only, which is sufficient to illustrate the limitations of these two parameters.

Fig. 6. Clustering accuracy of PalmData25 via adjusting parameters αl and αu when fixing k and m.

According to the description of these two parameters in Section III, αl controls the fake labels of the m representative anchors, while αu controls the detection capability for outliers.


Fig. 7. Clustering accuracy of Abalone via adjusting parameters αl and αu when fixing k and m.
Fig. 8. Clustering accuracy of USPS via adjusting parameters αl and αu when fixing k and m.
Fig. 9. Clustering accuracy of Letter via adjusting parameters αl and αu when fixing k and m.
Fig. 10. Clustering accuracy of MNIST via adjusting parameters αl and αu when fixing k and m.

On the one hand, Figs. 6-10 show that setting αl = 0 effectively improves clustering performance, which validates that the artificial labels of the representative anchors are fairly reliable (the virtual labels theoretically contain no noise, as mentioned in Section IV-A) for conducting label propagation in our FSSF. Thus, αl in the final label propagation from the c representative points to the n samples is accurately assigned to be zero. In addition, Fig. 7 shows that the influence of αl on ACC can be ignored on Abalone, which is attributed to the characteristics of the data itself.

On the other hand, for unlabeled samples, Figs. 6-10 also show that it is critical to select a suitable value of αu to detect outliers and prevent them from becoming one of the c representative points in the final label propagation stage. For instance, Fig. 10 shows that overly large and overly small values of αu, which correspond to overly weak and overly strong capabilities to avert the occurrence of outliers in the final representative point set, result in drawbacks for the final clustering performance on the MNIST data set. For Abalone, in contrast, the clustering ACC constantly and slowly increases as αu increases. Thus, the optimal value of αu should be tuned dynamically based on the characteristics of the various data sets.

As a whole, we set the value of αl to zero by default in the subsequent comparison experiments. Furthermore, combining the experimental results obtained by adjusting αu on all data sets, it is assigned to be 0.99 for convenience.

C. Comparison Results

To illustrate the effectiveness and efficiency of the proposed method, we compare it with K-means, SC, landmark-based SC using random sampling to select landmarks (LSC-R) [62], landmark-based SC using K-means to select landmarks (LSC-K) [62], fast SC (FSC) [52], and FSC based on the random walk Laplacian with BKHK (FRWL-B) [55]. Except for the two traditional clustering approaches, K-means and SC, the remaining compared methods all belong to anchor graph-based SC techniques. More concretely, LSC-R randomly selects anchors from the original samples, while LSC-K utilizes K-means for anchor selection. FSC and FRWL-B adopt BKHK to find anchors and conduct subsequent spectral analysis. In order to acquire a better comparison effect, for these five anchor graph-based methods, we choose 1024 anchors on PalmData25, 1024 anchors on Abalone, 512 anchors on USPS, 512 anchors on Letter, and 512 anchors on MNIST to construct homogeneous anchor graphs of the same scale. We still perform ten duplicated experiments for each method and record the mean results.
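The two quality metrics reported below are the standard clustering ACC (predicted clusters matched to ground-truth classes by the Hungarian algorithm) and NMI; a sketch of how they can be computed is given here for completeness and is not the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Clustering ACC: accuracy under the best one-to-one matching of cluster ids to classes."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    overlap = np.zeros((labels.size, labels.size), dtype=int)
    for i, ci in enumerate(labels):
        for j, cj in enumerate(labels):
            overlap[i, j] = np.sum((y_pred == ci) & (y_true == cj))
    rows, cols = linear_sum_assignment(-overlap)          # Hungarian matching, maximize overlap
    return overlap[rows, cols].sum() / len(y_true)

# acc = clustering_acc(y_true, z_pred)
# nmi = normalized_mutual_info_score(y_true, z_pred)
```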


TABLE II
C OMPARISON IN T ERMS OF ACC (%)

TABLE III
C OMPARISON IN T ERMS OF NMI (%)

TABLE IV As a whole, our method has obvious superiority on metrics


C OMPARISON IN T ERMS OF RUNNING T IME (s) ACC and NMI, and it performs better on complexity cost
than other related approaches especially with a large number
of samples. Therefore, we can believe that our method could
outperform other methods to reach effectiveness and efficiency
in real applications.
To indicate the efficiency of our method, we also compare the clustering time of the proposed method with that of the other five graph-based SC techniques. More specifically, Table II shows that our method reaches the highest performance on all data sets, especially on Abalone and USPS. However, the standard deviation of our method is fairly large, which reflects the instability introduced by the random initialization of K-means and by the label propagation in our FSSF. Table III shows that our method also obtains the best performance, again with relatively unstable results, in terms of the NMI metric. In Table IV, we bold the two results with the lowest computational cost among the six algorithms for each benchmark data set. From Tables II–IV, it can first be seen that the clustering performance of traditional K-means is poor. Second, the conventional clustering method SC is not capable of running on MNIST, and its computational burden becomes extremely high as the number of samples increases. Third, although the computational cost of LSC-R is generally favorable on all data sets, its ACC and NMI results are overall inferior to those of the other graph-based approaches and are sometimes even lower than those of K-means. Furthermore, owing to its random anchor-selection strategy, the standard deviation of LSC-R on ACC and NMI is extremely large. Compared with LSC-R, LSC-K achieves a significant improvement in accuracy and stability, but it still struggles to reach satisfactory clustering results. Finally, the stability of FSC and FRWL-B is greater than that of our method except on the USPS data set; however, their computational cost is distinctly larger than ours on Abalone, USPS, and MNIST.
As a whole, our method shows a clear advantage in terms of ACC and NMI, and it also incurs a lower computational cost than the other related approaches, especially when the number of samples is large. Therefore, we believe that our method can outperform the other methods in both effectiveness and efficiency in real applications.

VI. CONCLUSION

This study designed a novel anchor-based clustering method that operates in a self-supervised manner, referred to as FSSC. The approach performs unsupervised clustering with semisupervised ideas through the construction of fake labels. First, it utilizes BKHK to quickly generate targeted representative anchors. Second, a parameter-free bipartite graph connecting the anchors and the original samples is constructed. Third, we present a novel FSSF to perform FSSC and obtain a crucial probability model. Next, we introduce a feature-like selection strategy to find the c most representative points, where each point indicates one class. Finally, label propagation is conducted together with FSSF to predict the labels of all samples based on the c selected points. Several experimental results have been provided to show the effectiveness and efficiency of our method. In the future, we intend to improve the initialization of the anchor generation step in order to remove the undesirable instability of our algorithm. Meanwhile, we will also pay more attention to adaptively adjusting the feature-amendment capability so as to find optimal representative points for the subsequent label propagation.
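As a rough illustration of the first step summarized above, the following is a simplified, hypothetical sketch of balanced hierarchical binary splitting for anchor generation in the spirit of BKHK; the exact balanced assignment used by BKHK is solved differently, so this only conveys the idea of recursively halving the data and taking the leaf means as anchors.

import numpy as np
from sklearn.cluster import KMeans

def balanced_two_split(X):
    """Split X into two equally sized halves guided by a 2-means step.
    Simplified stand-in for the balanced split in BKHK, not the original rule."""
    centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_
    d0 = np.linalg.norm(X - centers[0], axis=1)
    d1 = np.linalg.norm(X - centers[1], axis=1)
    order = np.argsort(d0 - d1)   # samples preferring center 0 come first
    half = X.shape[0] // 2
    return X[order[:half]], X[order[half:]]

def hierarchical_anchors(X, depth):
    """Recursively split the data `depth` times and return 2**depth anchors
    (the mean of each leaf block). Assumes every block keeps at least two samples,
    i.e., 2**depth is well below the number of samples."""
    if depth == 0:
        return X.mean(axis=0, keepdims=True)
    left, right = balanced_two_split(X)
    return np.vstack([hierarchical_anchors(left, depth - 1),
                      hierarchical_anchors(right, depth - 1)])

# e.g., depth=9 yields 2**9 = 512 anchors, matching the setting used for USPS, Letter, and MNIST above:
# anchors = hierarchical_anchors(X, depth=9)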
REFERENCES

[1] M. Ozay, I. Esnaola, F. T. Yarman Vural, S. R. Kulkarni, and H. V. Poor, “Machine learning methods for attack detection in the smart grid,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1773–1786, Aug. 2016.
[2] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles, “Instance spaces for machine learning classification,” Mach. Learn., vol. 107, no. 1, pp. 109–147, Jan. 2018.
[3] X. Li, M. Chen, F. Nie, and Q. Wang, “Locality adaptive discriminant analysis,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2201–2207.


[4] M. Hausknecht, W.-K. Li, M. Mauk, and P. Stone, “Machine learning capabilities of a simulated cerebellum,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 510–522, Mar. 2017.
[5] W. Kim, M. S. Stankovic, K. H. Johansson, and H. J. Kim, “A distributed support vector machine learning over wireless sensor networks,” IEEE Trans. Cybern., vol. 45, no. 11, pp. 2599–2611, Nov. 2015.
[6] X. Li, M. Chen, F. Nie, and Q. Wang, “A multiview-based parameter free framework for group detection,” in Proc. AAAI, 2017, pp. 4147–4153.
[7] X.-D. Wang, R.-C. Chen, Z.-Q. Zeng, C.-Q. Hong, and F. Yan, “Robust dimension reduction for clustering with local adaptive learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 657–669, Mar. 2019.
[8] Z. Feng and Y. Zhu, “A survey on trajectory data mining: Techniques and applications,” IEEE Access, vol. 4, pp. 2056–2067, 2016.
[9] Z. Li, Z. Zhang, J. Qin, Z. Zhang, and L. Shao, “Discriminative Fisher embedding dictionary learning algorithm for object recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 3, pp. 786–800, Mar. 2020.
[10] X. Fang et al., “Flexible affinity matrix learning for unsupervised and semisupervised classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 4, pp. 1133–1149, Apr. 2019.
[11] H. Jia, Y.-M. Cheung, and J. Liu, “A new distance metric for unsupervised learning of categorical data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 5, pp. 1065–1079, May 2016.
[12] F. Cai and V. Cherkassky, “Generalized SMO algorithm for SVM-based multitask learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 997–1003, Jun. 2012.
[13] A. Dong, F.-L. Chung, Z. Deng, and S. Wang, “Semi-supervised SVM with extended hidden features,” IEEE Trans. Cybern., vol. 46, no. 12, pp. 2924–2937, Dec. 2016.
[14] B. Leng, J. Zeng, M. Yao, and Z. Xiong, “3D object retrieval with multitopic model combining relevance feedback and LDA model,” IEEE Trans. Image Process., vol. 24, no. 1, pp. 94–105, Jan. 2015.
[15] Y. Aliyari Ghassabeh, F. Rudzicz, and H. A. Moghaddam, “Fast incremental LDA feature extraction,” Pattern Recognit., vol. 48, no. 6, pp. 1999–2012, Jun. 2015.
[16] C. Alzate and J. A. K. Suykens, “Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 335–347, Feb. 2010.
[17] Z. Khan, F. Shafait, and A. Mian, “Joint group sparse PCA for compressed hyperspectral imaging,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 4934–4942, Dec. 2015.
[18] Z. Lai, Y. Xu, J. Yang, L. Shen, and D. Zhang, “Rotational invariant dimensionality reduction algorithms,” IEEE Trans. Cybern., vol. 47, no. 11, pp. 3733–3746, Nov. 2017.
[19] L. Gan, J. Xia, P. Du, and Z. Xu, “Dissimilarity-weighted sparse representation for hyperspectral image classification,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 11, pp. 1968–1972, Nov. 2017.
[20] J. Gan, G. Wen, H. Yu, W. Zheng, and C. Lei, “Supervised feature selection by self-paced learning regression,” Pattern Recognit. Lett., vol. 132, pp. 30–37, Apr. 2020.
[21] J. Wang, X. Wang, K. Zhang, K. Madani, and C. Sabourin, “Morphological band selection for hyperspectral imagery,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 8, pp. 1259–1263, Aug. 2018.
[22] F. Nie, S. Xiang, Y. Liu, and C. Zhang, “A general graph-based semi-supervised learning with novel class discovery,” Neural Comput. Appl., vol. 19, no. 4, pp. 549–555, Jun. 2010.
[23] L. Berton, “Graph construction based on neighborhood for semisupervised,” Ph.D. dissertation, Univ. São Paulo, São Paulo, Brazil, 2016.
[24] L. Zhang et al., “Large-scale robust semisupervised classification,” IEEE Trans. Cybern., vol. 49, no. 3, pp. 907–917, Mar. 2019.
[25] W.-L. Zhao, C.-H. Deng, and C.-W. Ngo, “K-means: A revisit,” Neurocomputing, vol. 291, pp. 195–206, May 2018.
[26] A. Karlekar, A. Seal, O. Krejcar, and C. Gonzalo-Martín, “Fuzzy k-means using non-linear s-distance,” IEEE Access, vol. 7, pp. 55121–55131, 2019.
[27] R. Langone and J. A. K. Suykens, “Fast kernel spectral clustering,” Neurocomputing, vol. 268, pp. 27–33, Dec. 2017.
[28] H. Zheng and J. Wu, “Which, when, and how: Hierarchical clustering with human–machine cooperation,” Algorithms, vol. 9, no. 4, p. 88, Dec. 2016.
[29] C. Fevotte and N. Dobigeon, “Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 4810–4819, Dec. 2015.
[30] Y. Pang, J. Xie, F. Nie, and X. Li, “Spectral clustering by joint spectral embedding and spectral rotation,” IEEE Trans. Cybern., vol. 50, no. 1, pp. 247–258, Jan. 2020.
[31] R. Zhang, F. Nie, M. Guo, X. Wei, and X. Li, “Joint learning of fuzzy K-means and nonnegative spectral clustering with side information,” IEEE Trans. Image Process., vol. 28, no. 5, pp. 2152–2162, May 2019.
[32] D. Cai and X. Chen, “Large scale spectral clustering via landmark-based sparse representation,” IEEE Trans. Cybern., vol. 45, no. 8, pp. 1669–1680, Aug. 2015.
[33] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” in Proc. NeurIPS, 2019, pp. 15637–15648.
[34] Q. Ma, S. Li, W. Zhuang, S. Li, J. Wang, and D. Zeng, “Self-supervised time series clustering with model-based dynamics,” IEEE Trans. Neural Netw. Learn. Syst., early access, Aug. 31, 2020, doi: 10.1109/TNNLS.2020.3016291.
[35] J. Zhang et al., “Self-supervised convolutional subspace clustering network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5473–5482.
[36] X. Sun, M. Cheng, C. Min, and L. Jing, “Self-supervised deep multi-view subspace clustering,” in Proc. ACML, vol. 101, 2019, pp. 1001–1016.
[37] J. Ye, Q. Li, J. Yu, X. Wang, and H. Wang, “Affinity learning via self-supervised diffusion for spectral clustering,” IEEE Access, vol. 9, pp. 7170–7182, 2021.
[38] T. Lin, H. Xu, and H. Zhang, “Constrained self-supervised clustering for discovering new intents (student abstract),” in Proc. AAAI, 2020, pp. 13863–13864.
[39] M. A. Lozano and F. Escolano, “Graph matching and clustering using kernel attributes,” Neurocomputing, vol. 113, pp. 177–194, Aug. 2013.
[40] N. Paudel, L. Georgiadis, and G. F. Italiano, “Computing critical nodes in directed graphs,” ACM J. Experim. Algorithmics, vol. 23, pp. 1–24, Nov. 2018.
[41] L. Gellert and R. Sanyal, “On degree sequences of undirected, directed, and bidirected graphs,” Eur. J. Combinatorics, vol. 64, pp. 113–124, Aug. 2017.
[42] F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained Laplacian rank algorithm for graph-based clustering,” in Proc. AAAI, 2016, pp. 1969–1976.
[43] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognit., vol. 41, no. 1, pp. 176–190, Jan. 2008.
[44] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Proc. NIPS, 2004, pp. 1601–1608.
[45] X. Zhu, S. Zhang, Y. Li, J. Zhang, L. Yang, and Y. Fang, “Low-rank sparse subspace for spectral clustering,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 8, pp. 1532–1543, Aug. 2019.
[46] X. Zhu, S. Zhang, R. Hu, Y. Zhu, and J. Song, “Local and global structure preservation for robust unsupervised spectral feature selection,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 3, pp. 517–529, Mar. 2018.
[47] X. Zhu, S. Zhang, W. He, R. Hu, C. Lei, and P. Zhu, “One-step multi-view spectral clustering,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 10, pp. 2022–2034, Oct. 2019.
[48] X. Zhu, S. Zhang, Y. Zhu, W. Zheng, and Y. Yang, “Self-weighted multi-view fuzzy clustering,” ACM Trans. Knowl. Discovery Data, vol. 14, no. 4, p. 48, 2020.
[49] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, “Multitask spectral clustering by exploring intertask correlation,” IEEE Trans. Cybern., vol. 45, no. 5, pp. 1069–1080, May 2015.
[50] X. Chen, R. Chen, Q. Wu, Y. Fang, F. Nie, and J. Z. Huang, “LABIN: Balanced min cut for large-scale data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 3, pp. 725–736, Mar. 2020.
[51] X. Chen, W. Hong, F. Nie, D. He, M. Yang, and J. Z. Huang, “Spectral clustering of large-scale data by directly solving normalized cut,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 1206–1215.
[52] W. Zhu, F. Nie, and X. Li, “Fast spectral clustering with efficient large graph construction,” in Proc. ICASSP, 2017, pp. 2492–2496.
[53] L. Yang, X. Liu, F. Nie, and M. Liu, “Large-scale spectral clustering based on representative points,” Math. Problems Eng., vol. 2019, Dec. 2019, Art. no. 5864020.
[54] X. Yang, W. Yu, R. Wang, G. Zhang, and F. Nie, “Fast spectral clustering learning with hierarchical bipartite graph for large-scale data,” Pattern Recognit. Lett., vol. 130, pp. 345–352, Feb. 2020.
[55] C. Wang, F. Nie, R. Wang, and X. Li, “Revisiting fast spectral clustering with anchor graph,” in Proc. ICASSP, 2020, pp. 3902–3906.
[56] F. He, F. Nie, R. Wang, X. Li, and W. Jia, “Fast semisupervised learning with bipartite graph for large-scale data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2, pp. 626–638, Feb. 2020.


[57] F. He, F. Nie, R. Wang, H. Hu, W. Jia, and X. Li, “Fast semi-supervised learning with optimal bipartite graph,” IEEE Trans. Knowl. Data Eng., early access, Jan. 21, 2020, doi: 10.1109/TKDE.2020.2968523.
[58] W. Liu, J. He, and S. Chang, “Large graph construction for scalable semi-supervised learning,” in Proc. ICML, 2010, pp. 679–686.
[59] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. KDD, 2014, pp. 977–986.
[60] M. Šorel and F. Šroubek, “Fast convolutional sparse coding using matrix inversion lemma,” Digit. Signal Process., vol. 55, pp. 44–51, Aug. 2016.
[61] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, and J. A. K. Suykens, “Multiclass semisupervised learning based upon kernel spectral clustering,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 720–733, Apr. 2015.
[62] X. Chen and D. Cai, “Large scale spectral clustering with landmark-based representation,” in Proc. AAAI, 2011, pp. 313–318.
[63] M. M. R. Khan, R. B. Arif, M. A. B. Siddique, and M. R. Oishe, “Study and observation of the variation of accuracies of KNN, SVM, LMNN, ENN algorithms on eleven different datasets from UCI machine learning repository,” CoRR, vol. abs/1809.06186, pp. 124–129, Sep. 2018.

Jingyu Wang (Member, IEEE) received the Ph.D. degree in signal, image, and automation from the Université Paris-Est, Paris, France, in 2015. He is currently an Associate Professor with the School of Astronautics, School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China. His research interests include information processing, computer vision, and intelligent perception.

Zhenyu Ma is currently pursuing the M.E. degree with the School of Astronautics, Northwestern Polytechnical University, Xi’an, China. His research interests include machine learning and computer vision.

Feiping Nie (Member, IEEE) received the Ph.D. degree in computer science from Tsinghua University, Beijing, China, in 2009. He has authored over 100 articles in top journals and conferences, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the International Journal of Computer Vision, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON NEURAL NETWORKS, the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the ACM Transactions on Knowledge Discovery from Data, Bioinformatics, the International Conference on Machine Learning, the Conference on Neural Information Processing Systems, the Knowledge Discovery and Data Mining Conference, the International Joint Conference on Artificial Intelligence, the Association for the Advancement of Artificial Intelligence, the International Conference on Computer Vision, the Conference on Computer Vision and Pattern Recognition, and ACM Multimedia. His current research interests include machine learning and its applications, such as pattern recognition, data mining, computer vision, image processing, and information retrieval. Dr. Nie is currently serving as an associate editor or a PC member for several prestigious journals and conferences in the related fields. His articles have been cited over 5000 times (Google Scholar).

Xuelong Li (Fellow, IEEE) is currently a Full Professor with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China.
