0% found this document useful (0 votes)
2 views

3. Explicit unsupervised feature selection based on structured graph and locally

This document presents a novel unsupervised feature selection (UFS) method that integrates an explicit feature selection matrix with structured graph learning and locally linear embedding to effectively filter out redundant features in high-dimensional unlabeled data. The proposed approach enhances clustering accuracy and Normalized Mutual Information by over 5% compared to existing methods, demonstrating its superiority. The methodology includes an iterative algorithm with convergence and computational complexity analysis, addressing limitations of previous UFS techniques.

Uploaded by

avion4vientos7
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

3. Explicit unsupervised feature selection based on structured graph and locally

This document presents a novel unsupervised feature selection (UFS) method that integrates an explicit feature selection matrix with structured graph learning and locally linear embedding to effectively filter out redundant features in high-dimensional unlabeled data. The proposed approach enhances clustering accuracy and Normalized Mutual Information by over 5% compared to existing methods, demonstrating its superiority. The methodology includes an iterative algorithm with convergence and computational complexity analysis, addressing limitations of previous UFS techniques.

Uploaded by

avion4vientos7
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Expert Systems With Applications 255 (2024) 124568

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Explicit unsupervised feature selection based on structured graph and locally


linear embedding
Jianyu Miao a , Jingjing Zhao b , Tiejun Yang a , Chao Fan a , Yingjie Tian c,d , Yong Shi c,d ,
Mingliang Xu e ,∗
a
School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou, 450001, China
b College of Information Science and Engineering, Henan University of Technology, Zhengzhou, 450001, China
c School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, China
d
Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, 100190, China
e
School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

ARTICLE INFO ABSTRACT

Keywords: Numerous redundant and irrelevant features are usually contained in high-dimensional data whose presence,
Unsupervised feature selection if ignored, can bring detrimental effects on the performance of data processing tasks. Unsupervised feature
Structured graph selection (UFS) as a trending topic in a great deal of domains has been greatly concerned, which can effectively
Locally linear embedding
filter out redundant and irrelevant features in the unlabeled data. The majority of existing UFS methods mainly
Alternative optimization strategy
focus on building the parameterized model in the original feature space or some certain low-dimensional
feature subspace, which cannot fully capture the information of the target space. In this work, we propose
a novel UFS method that unifies the explicit feature selective matrix and structured graph into a learning
framework. An explicit selection matrix, whose scale, bound of the elements and structure are considered,
is tailored for UFS. Furthermore, we take into account the local consistency so that the low dimensional
manifold structures can be preserved to a better degree. Instead of using the pre-defined similarity matrix to
construct graph, we make use of the way of learning to obtain the adaptive graph. Besides, for the purpose of
exploiting the cluster structure of data, we impose a rank constraint on the graph. Notably, both the explicit
selection matrix and intrinsic structure are learned in the target feature subspace, and thus the proposed
methodology is more straightforward and effective. We present an effective and efficient algorithm for the
resulting problem based on alternative optimization strategy, and establish its convergence and computational
complexity analysis. The experimental results show that our proposed approach has improved by over 5.28%
and 5.79% on average in terms of clustering accuracy (ACC) and Normalized Mutual Information (NMI)
compared with comparison methods, which demonstrates its superiority.

1. Introduction feature selection has become a hot research topic in the field of machine
learning and data mining, which aims at identifying the optimal feature
Recent years have witnessed a rapid growth of high-dimensional subset from the original high-dimensional feature set based on the
data in many real-world applications, for instance image processing, pre-defined criterion (Li, Cheng et al., 2018; Wright & Ma, 2022).
text analysis, and pattern recognition (Jain & Zongker, 1997). Massive In other words, feature selection is dedicated to investigating how to
high-dimensional data has brought us plenty of opportunities as well reserve valuable and useful features while discard noisy and redundant
as serious challenges. Often, high-dimensional data contains noise, out- features.
liers and redundant features that may increase storage costs and com- Substantial numbers of feature selection algorithms have been ex-
putational complexity, and reduce the performance and interpretability
tensively studied and developed by researchers from various commu-
of subsequent learning tasks (Li, Cheng et al., 2018).
nities (da Silva et al., 2021; Wang, Xue et al., 2023; Zhang et al.,
As one of the popular methods to alleviate the above issues, feature
2023; Zheng et al., 2023; Zhou et al., 2021). Because of the prevalence
selection has recently received a considerable attention (Cai et al.,
of unlabeled data, UFS approaches, which are specially designed to
2018; Gui et al., 2016; Li, Cheng et al., 2018). Over the past decades,

∗ Corresponding author.
E-mail addresses: [email protected] (J. Miao), [email protected] (J. Zhao), [email protected] (T. Yang), [email protected] (C. Fan),
[email protected] (Y. Tian), [email protected] (Y. Shi), [email protected] (M. Xu).

https://doi.org/10.1016/j.eswa.2024.124568
Received 25 November 2023; Received in revised form 11 June 2024; Accepted 20 June 2024
Available online 28 June 2024
0957-4174/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

handle high-dimensional unlabeled data, have become increasingly selection process once the original data points are given (Cai et al.,
important (Mitra et al., 2002). It is the lack of label information 2010; He et al., 2006; Li et al., 2014; Yang et al., 2011). Commonly used
that makes UFS extremely challenging (Dy & Brodley, 2004). Even similarity measures include Gaussian kernel, dot-product (Cai et al.,
though it is, recently the great advance has been made in performance, 2010). In fact, such strategy would inevitably suppress the quality of
robustness and efficiency of UFS methods (Gui et al., 2016; Li, Cheng the selected features. More recently, adaptive similarity graph has been
et al., 2018). proposed to replace the pre-defined one, which is intended to learn the
The existing studies on UFS can be mainly categorized into filter similarity matrix from the given data (Li, Zhang et al., 2018; Zhong,
methods (He et al., 2006; Yao et al., 2017; Zhao & Liu, 2007), wrapper 2020; Zhou et al., 2020). The performance of adaptive similarity graph
methods (Dy & Brodley, 2004; Law et al., 2002) and embedded meth- has been well demonstrated. However they still have the limitation:
ods (Du et al., 2017; Gong et al., 2022; Hou et al., 2014; Hu et al., 2017; the similarity matrix is learned from the original feature space or some
Karami et al., 2023; Li et al., 2014; Liu et al., 2020; Miao et al., 2022; certain low-dimensional feature subspace which leads to insufficient
Shi et al., 2018; Tang et al., 2018; Wahid et al., 2022; Wang et al., 2015; exploration of the target space. Therefore, it is important to construct
Wang, Wang et al., 2023; Zhu, Xu et al., 2018; Zhu, Zhang et al., 2018;
an adaptive graph of the target space. On the other hand, the sparse
Zhu et al., 2015). Filter methods typically assess features based on
selective matrix has been obtained via constructing and optimizing a
intrinsic properties of the data, which are usually effective and efficient.
pseudo label based model (Hou et al., 2014; Li et al., 2014; Shi et al.,
Laplacian score (He et al., 2006) evaluates each feature via the local
2018) or data reconstruction model (Hu et al., 2017; Wang et al., 2015;
preserving ability. Spectral Feature Selection (SPEC) (Zhao & Liu, 2007)
Zhu et al., 2015) with sparse regularization term or sparse constraint.
exploits the spectrum of the graph to select features, and LLEscore (Yao
As stated previously, pseudo label based methods first need to get
et al., 2017) quantifies the distinction between the local structure of
pseudo labels of data and then use them to guide the process of feature
each feature and that of the original data. Wrapper methods (Dy &
Brodley, 2004; Law et al., 2002) train the predetermined learning selection (Miao et al., 2022). Obviously the pseudo label plays a vital
model on each subset and then utilize the learning performance as the role in the subsequent feature selection, and directly determines the
score of the subset. The higher score, the preferred the subset. Although quality of the selected features (Nie et al., 2023). In contrast, data
wrapper methods perform well, they have high computational costs due reconstruction based methods can avoid this issue, which models the
to the exhaustive search in the power set of the original set. selection matrix with the way of data reconstruction (Zhu et al., 2015).
Embedded methods consider feature selection as a part of the model Notice that the current methods aim to reconstruct the data points
training process, whose computational cost falls in between filter and from the original space, which cannot reflect the correlation of the
wrapper methods. Such methods (Cai et al., 2010; Du et al., 2017; feature subspace. In addition, almost all the recent methods achieve the
Hou et al., 2014; Hu et al., 2017; Li et al., 2014, 2012; Liu et al., sparsity by adding the sparse regularization into the objective function
2020; Miao et al., 2022; Shi et al., 2018; Tang et al., 2018; Wang or enforcing the sparse constraint on the selection matrix (Du et al.,
et al., 2015; Zhu, Xu et al., 2018; Zhu, Zhang et al., 2018; Zhu et al., 2018; Li et al., 2022, 2023; Tang et al., 2019; Wang et al., 2022;
2015) usually aim to find a matrix, such as a transformed matrix, a Zhu et al., 2023). A variety of sparsity-inducing terms can be utilized,
representation coefficient matrix, a latent feature matrix and so on, including the convex function, such as the 𝓁2,1 norm, and the non-
which can be utilized to select the useful features. From the perspective convex ones, such as the 𝓁2,0 , the 𝓁2,𝑝 (0 < 𝑝 < 1), the 𝓁2,1−2 . In fact, how
of how to model the matrix, in this work, we further classify em- to choose a desirable one is a still problem since the convex ones usually
bedded approaches as: pseudo label based, self-representation based, possess the higher efficiency, while non-convex ones often possess the
non-negative matrix factorization based methods. Pseudo label based better sparsity. The previous works usually seek a balance between the
methods (Cai et al., 2010; Hou et al., 2014; Li et al., 2014, 2012; Shi sparsity and the efficiency leading to a suboptimal solution.
et al., 2018) often leverage clustering methods to get the pseudo labels To alleviate the limitations of the existing methodologies, we pro-
of data, then regard them as targets to perform linear or nonlinear pose a novel UFS approach that incorporates structured graph learning
fitting for learning the transformed matrix. Non-negative matrix fac- into locally linear embedding (SGLLE). Specifically, we first define
torization based algorithms (Du et al., 2017; Luo et al., 2022; Qian &
the characteristics of the sparse selective matrix, including the scale,
Zhai, 2013; Tang et al., 2018; Wang et al., 2015; Zhu, Xu et al., 2018)
the bound of the elements and the structure, and then use it to for-
factorize the given data matrix into the product of the latent feature and
mulate the target space. Each data in the target space can be well
cluster indicator matrix. Suppose each feature can be represented as
reconstructed. Meanwhile, both the adaptive similarity graph and its
the linear combination of its relevant features, self-representation based
cluster structure are learned in a unified framework. It should be
methods (Hu et al., 2017; Liu et al., 2020; Miao et al., 2022; Zhu, Zhang
emphasized that the data reconstruction and structured graph learning
et al., 2018; Zhu et al., 2015) consider the data itself as target and train
are implemented in the target space simultaneously, which are more
the linear model to obtain the representation coefficient matrix. To
straightforward and tailored for feature selection. To resolve the re-
better achieve the goal of selecting the optimal features, the above three
types of methods typically employ the sparse regularization to exploit sulting problem, we design an iterative algorithm based on alternative
the sparsity of the matrix. Commonly used sparse regularization terms optimization strategy. The convergence and computational complexity
include the 𝓁2,0 (Cai et al., 2013; Du et al., 2018; Wang et al., 2022; Zhu analysis of our optimization scheme are provided. In summary, the
et al., 2023), the 𝓁2,1 norm (Tang et al., 2019), the 𝓁2,𝑝 (0 < 𝑝 < 1) (Li main contributions of this paper are highlighted as follows:
et al., 2022, 2023), the 𝓁2,1−2 (Miao et al., 2021; Shang et al., 2022;
Shi et al., 2018), etc. (1) A novel UFS approach is proposed by unifying locally linear
Due to the excellent performance, in this paper, we concentrate on embedding and structured graph learning into a framework.
studying UFS from the perspective of embedding way. The embedded What is more, they are all constrained in the feature subspace.
approaches usually involve two significantly important components, (2) An explicit selection matrix is introduced to fulfill feature se-
i.e., the similarity graph and the sparse selective matrix. The similarity lection, making the sparse regularization term unnecessary and
graph aims to preserve the structure of the original feature space, avoiding the process of feature ranking.
while the sparse selective matrix is to select the informative and dis- (3) To tackle the challenging optimization problem, an iterative al-
criminative features. However, the majority of existing methods have gorithm along with its convergence analysis and computational
limitations in both two aspects. On one hand, the similarity graph complexity analysis are provided.
has been pre-defined and constructed according to the experience or (4) Extensive experimental results verify the effectiveness of SGLLE,
prior knowledge, and remained unchanged in the subsequent feature and its superiority over many other advanced approaches.

2
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Table 1 As its name indicates, statistics based methods (Davis & Sampson,
Notations and descriptions.
1986; Gini, 1912; Liu & Setiono, 1995) take advantage of a variety
Notations Descriptions of statistics to measure the feature relevance. Since the score for each
𝑑 Number of features feature is calculated individually according to the pre-defined statistic,
𝑛 Number of samples
thus like similarity based methods, they are incapable of resolving the
𝐾 Number of selected features
𝑘 Size of neighborhood
feature redundancy problem. The frequently used statistics are tailored
𝑐 Number of clusters for the data with discrete features, thus data discretization is necessary
𝐱 A column vector for numerical and continuous cases. Additionally, most of such methods
𝐗 Data matrix are supervised ones, including t-score (Davis & Sampson, 1986) and
𝐋 Graph Laplacian matrix
chi-square score (Liu & Setiono, 1995). Low variance is a typical
‖𝐱‖2 The euclidean norm of 𝐱
‖𝐗‖F The Frobenius norm of 𝐗 unsupervised one, which removes features whose variances are below
‖𝐗‖2,1 The 𝓁2,1 norm of 𝐗 the pre-defined threshold. In an extreme case, a feature has the same
𝜎𝑖 (𝐗) The 𝑖th smallest singular value of 𝐗 value for all instances and the variance is 0, it thus should be removed.
𝑟𝑎𝑛𝑘(𝐗) The rank of 𝐗 Conversely, the features with high variance may not necessarily be
⟨𝐗, 𝐘⟩ The inner product between 𝐗 and 𝐘
useful and discriminative. Owing to unreliable results, low variance
𝐈𝑐 The 𝑐 × 𝑐 identity matrix
based methods are rarely used in practice.
Information theoretical based methods (Battiti, 1994; Peng et al.,
2005) aim at developing various heuristic criteria for maximizing fea-
The remainder of the paper is organized as follows. Section 2 first ture relevance while minimizing feature redundancy. Because the rele-
gives some notations to be used in this work and then presents a brief vance of features can be usually evaluated by label information, most
review of UFS methods. Section 3 introduces the proposed method of such algorithms work for supervised scenarios. Moreover, most
based on structured graph and locally linear embedding. The optimiza- concepts in information theory are only applicable to discrete variables.
tion algorithm for solving the proposed model is discussed in Section 4. Thus, some data discretization techniques should be performed on con-
Section 5 provides the convergence and computational complexity tinuous feature values before using this kind of methods. In contrast to
analysis of the proposed algorithm. A wide range of experiments are similarity and statistical based methods, information theoretical based
conducted in Section 6. The conclusions are presented in Section 7. methods are able to take feature relevance and feature redundancy into
account via a unified probabilistic framework.
2. Background Sparse learning based methods (Li et al., 2012; Liu et al., 2014;
Shang et al., 2020; Wang et al., 2015; Yang et al., 2011) have at-
2.1. Notations and definitions tracted widespread attention in recent years because of the excellent
performance and interpretability. Different from the above three kinds
of methods that are independent of learning algorithm, sparse learning
Before introducing our proposed model, we give some notations
based ones heavily depend on a typical pre-defined learning model and
that will be used throughout this work. Scalars are written as normal
incorporate feature selection into the training of the model. Mathemati-
lowercase characters. We employ lowercase boldface characters and
cally, such methods usually build a joint parametered model of learning
uppercase boldface characters to represent vectors and matrices, re-
algorithm and some certain sparse regularization terms, which can be
spectively. For a matrix 𝐗, 𝐗𝑖𝑗 , 𝐱𝑖 and 𝐱𝑖 denote the entry at the 𝑖th
formulated as
row and the 𝑗-column of 𝐗, the 𝑖th row of 𝐗, and the 𝑖th column of 𝐗,
respectively. 𝐗T denotes the transpose of 𝐗. Table 1 lists more notations min (𝐗, 𝐖) + 𝛺(𝐖), (1)
𝐖∈
and the corresponding descriptions.
where (, ⋅, ) is the loss term and 𝛺(⋅) is the regularization term. 𝐖
is the model parameter, which can often reflect the importance and
2.2. Related works
discrimination of features. More specifically, a feature can be regarded
as an important one if its weight has a large value. By optimizing the
In this part, we review the prior UFS works. In recent decades, UFS
joint parametered model, we can get the optimal solution of parameters
has been extensively studied. A large number of criteria for defining the
that can be utilized to identify the most distinguished features. The
usefulness and relevance of features, and the corresponding algorithms sparse regularizer is used to guarantee that most of values of the
have been developed (Gui et al., 2016). Following Li, Cheng et al. optimal solution are very small or exactly zero, making it extremely
(2018), we broadly group the existing UFS approaches as similar- suitable for feature selection.
ity based, information theoretical based, statistical based and sparse Table 2 lists a variety of sparse learning based UFS methods.
learning based methods. From the table, we can observe that almost all the UFS methods
Similarity based methods (He et al., 2006; Nie et al., 2008; Zhao consist of three part, i.e., loss term, sparse regularization/constraint
& Liu, 2007) assess feature importance by the ability to preserve data and graph regularization. Loss term can be designed by linear re-
similarity. Different from supervised feature selection taking advantage gression (NDFS, JESLR, RUFSM, UGFS, AGUFS), dictionary learning
of label information of data to obtain similarity, unsupervised ones (JGSC), self-representation (GSR_SFS, SLSDR) and local linear em-
leverage a variety of distance measures to define similarity. Lapla- bedding (GLLE). The 𝓁2,0 and 𝓁2,1 are commonly used to promote
cian score (He et al., 2006) is a simple yet very effective approach, the sparsity. Compared with the 𝓁2,1 norm, the 𝓁2,0 can yield the
which selects features that can preserve the data manifold structure better sparsity but lead to the higher computation cost. In unsuper-
as much as possible. As an extension to Laplacian score, SPEC (Zhao vised setting, exploring the geometric structure of data is beneficial
& Liu, 2007) utilizes spectral properties of graph for assessing fea- to unsupervised learning. Suppose that two points in the original
ture relevance, which applies to both supervised and unsupervised space are similar, their low-dimensional embeddings should be also
settings. Such approaches usually perform computationally efficient similar. Under this assumption, UFS methods can utilize the pre-defined
due to only involving constructing a similarity matrix and calculating similarity matrix to construct graph leading to conventional graph
feature scores. However, since they select the optimal features in a (NDFS, JGSC, JESLR, GSR_SFS, RUFSM, UGFS, SLSDR) or learning the
greedy manner, they cannot guarantee that the selected features are similarity matrix from data leading to adaptive graph (AGUFS, FOG-R,
highly uncorrelated. Such methods may select features that are related, ADUFS, GLUFS). Since adaptive graph is learned from the data and thus
leading to feature redundancy problems. can be adaptive to the data, it often produce the better performance.

3
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Table 2
Some state-of-the-art sparse learning based UFS methods.
Methods Loss term  Regularization term Ω Constraints set 
NDFS (Li et al., 2012) ‖𝐗T 𝐖 − 𝐅‖2F 𝜆1 Tr(𝐅T 𝐋𝐅) + 𝜆2 ‖𝐖‖2,1 𝐅T 𝐅 = 𝐈, 𝐅 ≥ 0
∑𝑛 ∑𝑑
JGSC (Zhu et al., 2013) ‖𝐗 − 𝐁𝐖‖2F 𝜆1 Tr(𝐖𝐋𝐖T ) + 𝜆2 ‖𝐖‖2,1 𝑖=1
𝐁 ≤1
𝑗=1 𝑖𝑗
JESLR (Hou et al., 2014) ‖𝐖T 𝐗 − 𝐅‖2F 𝜆1 Tr(𝐅𝐋𝐅T ) + 𝜆2 ‖𝐖‖𝑝𝑟,𝑝 𝐅𝐅T = 𝐈
GSR_SFS (Hu et al., 2017) ‖𝐗 − 𝐗𝐖‖2F 𝜆1 Tr(𝐖T 𝐗T 𝐋𝐗𝐖) + 𝜆2 ‖𝐖‖2,1 /
∑𝑛
RUFSM (Du et al., 2017) ‖𝐖𝐗 − 𝐁𝐇‖2,1 𝜆1 ‖𝐖‖2,1 + 𝜆2 Tr(𝐇𝐋𝐇T ) + 𝜆3 𝑖=1 ‖𝐡𝑖 ‖1 𝐁T 𝐁 = 𝐈
UGFS (Du et al., 2018) ‖𝐗T 𝐖 − 𝐅‖2F 𝜆Tr(𝐅T 𝐋𝐅) ‖𝐖‖2,0 = 𝑘, 𝐅T 𝐅 = 𝐈, 𝐅 ≥ 0
SLSDR (Shang et al., 2020) ‖𝐗T − 𝐗T 𝐒𝐕‖2F 𝜆1 (Tr(𝐕𝐋𝐕 𝐕T ) + 𝜆2 Tr(𝐒T 𝐗𝐋𝐒 𝐗T 𝐒)) + 𝜆3 (‖𝐒𝐒T ‖1 − ‖𝐒‖2F ) 𝐒 ≥ 0, 𝐕 ≥ 0, 𝐒T 𝐒 = 𝐈
1 ∑𝑛 ∑𝑛 ∑
AGUFS (Huang et al., 2021) ‖𝐗T 𝐖 + 𝟏𝐛T − 𝐅‖2F 𝜆1 ‖𝐖‖2,1 + 𝜆2 𝑖,𝑗=1 ‖𝐖T 𝐱𝑖 − 𝐖T 𝐱𝑗 ‖22 𝐒𝑖𝑗 + 𝜆3 𝑖=1 ‖𝐬𝑖 ‖22 + 𝜆4 Tr(𝐅T 𝐋𝐒 𝐅) 𝐖T 𝐖 = 𝐈, 𝐅T 𝐅 = 𝐈, 𝑗 𝐒𝑖𝑗 = 1, 𝐒𝑖𝑖 = 0, 0 ≤ 𝐒𝑖𝑗 ≤ 1
2 ∑
FOR-R (Chen et al., 2022) ‖𝐗 − 𝐁𝐖‖2F 𝜆1 Tr(𝐅 𝐋𝐒 𝐅) + 𝜆2 ‖𝐒‖F + 𝜆3 ‖𝐖‖2,1
T 2 T
𝑗 𝐒𝑖𝑗 = 1, 0 ≤ 𝐒𝑖𝑗 ≤ 1, 𝐖 𝐖 = 𝐈
∑𝑚 ∑
ADUFS (Zhao et al., 2022) ‖𝐒 − 𝐀‖2F 𝜆1 𝑖,𝑗=1 𝐒𝑖𝑗 ‖𝐖T 𝐱𝑖 − 𝐖T 𝐱𝑗 ‖22 + 𝜆2 Tr(𝐅T 𝐋𝐒 𝐅) + 𝜆3 ‖𝐖‖2,1 𝐖T (𝐒𝑡 + 𝜀𝐈)𝐖 = 𝐈, 𝐅T 𝐅 = 𝐈, 𝑗 𝐒𝑖𝑗 = 1, 0 ≤ 𝐒𝑖𝑗 ≤ 1

GLUFS (Zhu et al., 2023) ‖𝐅 − 𝐗T 𝐖‖2F 𝜆1 Tr(𝐅T 𝐋𝐅) + 𝜆2 ‖𝐒 − 𝐀‖2F 𝐒
𝑗 𝑖𝑗 = 1, 𝐒𝑖𝑗 ≥ 0, ‖𝐖‖ 2,0 = 𝑘

3. Methodology via keeping the local linear relationship in the feature subspace. More
Specifically, we transform the learning process of Ψ∗ into solving the
3.1. Unsupervised feature selection via explicit selection matrix following optimization problem

𝑛
‖ T ∑ ‖2
Let 𝐗 = [𝐱1 , 𝐱2 , … , 𝐱𝑛 ] ∈ ℜ𝑑×𝑛 be the input data matrix and  = min ‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖
𝐖,Ψ ‖ ‖2
{𝑓1 , 𝑓2 , … , 𝑓𝑑 } be the original feature set, where 𝐱𝑖 and 𝑓𝑖 are the 𝑖th 𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
sample and the 𝑖th feature, respectively. According to the pre-defined (3)

𝑘
criterion, unsupervised feature selection aims at identifying the optimal 𝑠.𝑡. 𝐖𝑖𝑗 = 1, Ψ ∈ ,
feature subset 𝐼 ∗ = {𝑓𝐼 ∗ (1) , 𝑓𝐼 ∗ (2) , … , 𝑓𝐼 ∗ (𝐾) } ⊂  (𝐾 ≪ 𝑑) from  , 𝑗=1
or identifying the optimal index set 𝐼 ∗ = {𝐼 ∗ (1), 𝐼 ∗ (2), … , 𝐼 ∗ (𝐾)} ⊂ where 𝐖𝑖𝑗 is the reconstruction coefficient and can be regarded as the
{1, 2, … , 𝑑}. contribution of the sample 𝐱𝑗 to the reconstruction of the sample 𝐱𝑖 , 
In this work, we will focus on transforming identifying 𝐼 ∗ into is the set consisting of those matrices satisfying the above conditions
learning an explicit selection matrix Ψ ∈ ℜ𝑑×𝐾 from the given data. To C1-C3, 𝑘 (𝐱𝑖 ) denotes the index set of 𝑘 samples closest to 𝐱𝑖 . By
the end, we first define the matrix Ψ, which should satisfy the following comparing the formulation of locally linear embedding and (3), we can
conditions: observe that locally linear embedding implements the data reconstruc-
• C1. Ψ𝑖𝑗 ∈ {0, 1} ∈ ℜ𝑑×𝐾 , for 𝑖 = 1, 2, … , 𝑑 and 𝑗 = 1, 2, … , 𝐾; tion in the original space, while (3) performs the data reconstruction
∑𝐾 in the feature subspace.
• C2. 𝑗=1 Ψ𝑖𝑗 ≤ 1, for 𝑖 = 1, 2, … , 𝑑;
∑𝑑
• C3. 𝑖=1 Ψ𝑖𝑗 = 1, for 𝑗 = 1, 2, … , 𝐾.
3.2. Structured graph learning
From C1-C3, it can be easily verified that only one position has
a value of 1 and other positions have a value of 0 in each column
Graph learning supposes that two instances are close in the original
of Ψ, and the number of 1 in each row of Ψ is at most 1. For ease
space, then their low-dimensional representations will be also close.
of representation, we collect the positions of 1 in each column into
This assumption can be realized from the perspective of the similarity.
𝐼 = {𝐼(1), 𝐼(2), … , 𝐼(𝐾)}, where 𝐼(𝑖) is the index of 1 in the 𝑖th column.
Based on the explicit selection matrix Ψ defined in Section 3.1, we
In fact, we have realized the process of selecting 𝐾 features, i.e., 𝐼 .
obtain the corresponding sample ΨT 𝐱𝑖 (after feature selection) in the
𝐱1 𝐱2 ⋯ 𝐱𝑛
feature subspace for each sample 𝐱𝑖 . Let 𝐒 be the similarity matrix to
𝑓1 ⎛ 𝐗11 𝐗12 ⋯ 𝐗1𝑛 ⎞
⎜ ⎟ be learned, where 𝐒𝑖𝑗 represents the score of similarity between data
𝑓 𝐗 𝐗22 ⋯ 𝐗2𝑛 ⎟
A toy example: Let 𝐗 = 2 ⎜ 21 be the data point 𝐱𝑖 and 𝐱𝑗 . Mathematically, we can learn 𝐒 by
𝑓3 ⎜ 𝐗31 𝐗32 ⋯ 𝐗3𝑛 ⎟
𝑓4 ⎜⎝ 𝐗41 𝐗42 ⋯ 𝐗4𝑛 ⎟⎠ ∑ 𝑛 (
𝑛 ∑ )
4×𝑛
matrix and  = {𝑓1 , 𝑓2 , 𝑓3 , 𝑓4 } be the original feature set. Suppose min ‖ΨT 𝐱𝑖 − ΨT 𝐱𝑗 ‖22 𝐒𝑖𝑗 + 𝛼𝐒2𝑖𝑗
𝐒
⎛1 0 0⎞ 𝑖=1 𝑗=1
(4)
⎜ ⎟ ∑
𝑛
0 0 0⎟
we have learned the selective matrix as Ψ = ⎜ , which 𝑠.𝑡. 𝐒𝑖𝑗 = 1 (∀𝑖), 𝐒𝑖𝑗 ≥ 0 (∀𝑖, 𝑗),
⎜0 1 0⎟ 𝑗=1
⎜0 0 1⎟
⎝ ⎠4×3 where the first term is a manifold regularization term for maintaining
satisfies the above conditions C1-C3. Therefore, we obtain 𝐼 = {1, 3, 4}
and select the three features as 𝐼 = {𝑓1 , 𝑓3 , 𝑓4 }. We can also perform the local geometry structure, the second term is a Tikhonov regulariza-
feature selection as follows: tion term for avoiding the trivial solution, and 𝛼 > 0 is a parameter for
balancing these two terms.
𝐱1 𝐱2 ⋯ 𝐱𝑛
Problem (4) leads to the similarity or weight matrix 𝐒, which can be
𝑓1 ⎛ 𝐗11 𝐗12 ⋯ 𝐗1𝑛 ⎞ usually considered as a graph. In fact, such a graph cannot be directly
𝐗′ = ΨT 𝐗 = 𝑓3 ⎜𝐗 𝐗32 ⋯ 𝐗3𝑛 ⎟ , (2)
⎜ 31 ⎟ utilized to extract the clustering indicators of data, meaning a subop-
𝑓4 ⎝ 𝐗41 𝐗42 ⋯ 𝐗4𝑛 ⎠3×𝑛 timal solution. For the purpose of alleviating the issue, we try to learn
where 𝐗′ ∈ ℜ3×𝑛 is the data matrix after feature selection. the graph with the cluster structure. To be more specific, we enforce the
Note that most of the existing methods achieve the goal of selecting graph 𝐒 to be learned to have exact 𝑐 connected components, where 𝑐
the distinguished and discriminative features by introducing a sparse is the number of clusters. Let 𝐋𝑆 = 𝐃𝑆 − (𝐒T + 𝐒)∕2 be the Laplacian
regularizer into the model or enforcing a sparsity constraint on the matrix, where 𝐃𝑆 is the diagonal matrix and the 𝑖th diagonal element is
∑𝑛
selector. Thus from the perspective of selection, the proposed method 𝑗=1 (𝐒𝑖𝑗 +𝐒𝑗𝑖 )∕2 for 𝑖 = 1, 2, … , 𝑛. In fact, the above mentioned goal can
is more effective and straightforward. Now we are going to show here be achieved by imposing a constraint on 𝐋𝑆 . Due to the nonnegativity
how to obtain the optimal selective matrix Ψ∗ . Drawing inspiration of the similarity matrix 𝐒, the Laplacian matrix 𝐋𝑆 has the following
from locally linear embedding, we can get the explicit selection matrix important property, which is first introduced by Mohar et al. (1991).

4
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Fig. 1. The framework of the proposed method.

Theorem 1. The multiplicity c of the eigenvalue 0 of the Laplacian matrix where 𝐈𝑐 ∈ ℜ𝑐×𝑐 is an identical matrix. With Eq. (7), we further
𝐋𝑆 is equal to the number of connected components in the graph with the reformulate (31) and propose our final model as
similarity matrix 𝐒. ∑
𝑛 ∑
‖ T ‖2
min 𝛽 ‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖
Ψ,𝐖,𝐒,𝐐 ‖ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
According to Theorem 1, we can add the rank constraint 𝑟𝑎𝑛𝑘(𝐋𝑆 ) =
∑ 𝑛 (
𝑛 ∑ )
𝑛 − 𝑐 into (4) to learn the cluster structure of graph. Then, on the basis
+ ‖ΨT 𝐱𝑖 − ΨT 𝐱𝑗 ‖22 𝐒𝑖𝑗 + 𝛼𝐒2𝑖𝑗 + 𝛾Tr(𝐐T 𝐋𝑆 𝐐) (8)
of (4), structured graph learning can be formulated into optimizing the 𝑖=1 𝑗=1
following problem

𝑘 ∑
𝑛
𝑛 ( ) 𝑠.𝑡. Ψ ∈ , 𝐖𝑖𝑗 = 1, 𝐒𝑖𝑗 = 1, 𝐒𝑖𝑗 ≥ 0, 𝐐T 𝐐 = 𝐈𝑐 ,

𝑛 ∑
T T
min ‖Ψ 𝐱𝑖 − Ψ 𝐱𝑗 ‖22 𝐒𝑖𝑗 + 𝛼𝐒2𝑖𝑗 𝑗=1 𝑗=1
𝐒
𝑖=1 𝑗=1 where 𝛾 is a positive parameter, whose value is large enough to ensure
(5)
∑𝑛
𝑟𝑎𝑛𝑘(𝐋𝑆 ) = 𝑛−𝑐 is satisfied. The main framework of the proposed SGLLE
𝑠.𝑡. 𝐒𝑖𝑗 = 1 (∀𝑖), 𝐒𝑖𝑗 ≥ 0 (∀𝑖, 𝑗), 𝑟𝑎𝑛𝑘(𝐋𝑆 ) = 𝑛 − 𝑐
is shown in Fig. 1. The joint minimization of locally linear embedding
𝑗=1
and structured graph learning enables Ψ to evaluate the importance,
relevance and usefulness of features, making it particularly suitable for
3.3. Overall objective function feature selection. Instead of employing a pre-defined graph to charac-
terize the intrinsic manifold structure, the proposed methodology not
only attempts to learn the graph from the given data, but also learn its
The above analyses suggest that learning the explicit selective ma- structure. Notably, both locally linear embedding and structured graph
trix and the reliable structured similarity graph in a unified framework learning are implemented in the feature subspace. Such a strategy is
is more reasonable and beneficial to UFS tasks. Now, putting concern more suitable and straightforward for feature selection task, and can
(3) and (5) together results in our proposed model as follows: be more beneficial for identifying the optimal feature subset.


𝑛
‖ T ∑ ‖2
min 𝛽 ‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖ 4. Optimization algorithm
Ψ,𝐖,𝐒 ‖ ‖
𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) 2
∑ 𝑛 (
𝑛 ∑ ) The coupling of the four variables and the presence of the orthogo-
‖ T ‖2
+ ‖Ψ 𝐱𝑖 − ΨT 𝐱𝑗 ‖ 𝐒𝑖𝑗 + 𝛼𝐒2𝑖𝑗 (6) nality constraint make the proposed model (8) significantly difficult to
‖ ‖2
𝑖=1 𝑗=1 solve. Here, we utilize the alternative optimization strategy to optimize

𝑘 ∑
𝑛
(8), i.e., iteratively updating each variable while fixing the remaining
𝑠.𝑡. Ψ ∈ , 𝐖𝑖𝑗 = 1, 𝐒𝑖𝑗 = 1, 𝐒𝑖𝑗 ≥ 0, 𝑟𝑎𝑛𝑘(𝐋𝑆 ) = 𝑛 − 𝑐,
three variables until the algorithm converges. It should be emphasized
𝑗=1 𝑗=1
that all the resulting subproblems have closed-form solutions. The
where 𝛼 and 𝛽 are positive hyper-parameters. In fact, the rank con- specific solving details are given in the following.
straint on the Laplacian matrix 𝐋𝑆 is non-convex and it will be typically
NP-hard to tackle. According to Singular Value Decomposition (SVD) 4.1. Updating 𝐖
theorem, for a given real symmetric matrix, its rank is equal to the
number of its non-zero singular values. Let 𝜎𝑖 (𝐋𝑆 ) be the 𝑖th eigenvalue When fixing all the variables except 𝐖, problem (8) is simplified as
of 𝐋𝑆 . Since 𝐋𝑆 is positive semi-definite, 𝜎𝑖 (𝐋𝑆 ) ≥ 0 holds for 𝑖 =

𝑛
‖ T ∑ ‖2 ∑
𝑘
1, 2, … , 𝑛. Without loss of generality, assume 0 ≤ 𝜎1 (𝐋𝑆 ) ≤ 𝜎2 (𝐋𝑆 ) ≤ min ‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖ 𝑠.𝑡. 𝐖𝑖𝑗 = 1. (9)
𝐖 ‖ ‖2
⋯ 𝜎𝑐 (𝐋𝑆 ) ≤ ⋯ ≤ 𝜎𝑛 (𝐋𝑆 ). In fact, we can realize 𝑟𝑎𝑛𝑘(𝐋𝑆 ) = 𝑛 − 𝑐 𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) 𝑗=1

by enforcing 𝜎𝑖 (𝐋𝑆 ) = 0 for 𝑖 = 1, 2, … , 𝑐, i.e., 𝑐𝑖=1 𝜎𝑖 (𝐋𝑆 ) = 0. Using Lagrange multiplier method, we can easily get the optimal
Furthermore, based on Ky Fan’s Theorem in Fan (1949), we have solution as
∑ −1

𝑐
𝑞∈𝑘 (𝐱𝑖 ) 𝐂𝑗𝑞
𝜎𝑖 (𝐋𝑆 ) = min Tr(𝐐T 𝐋𝑆 𝐐), (7) 𝐖𝑖𝑗 = ∑ ∑ −1
, (10)
𝑖=1
𝐐∈ℜ𝑛×𝑐 ,𝐐T 𝐐=𝐈𝑐 𝑙∈𝑘 (𝐱𝑖 ) 𝑠∈𝑘 (𝐱𝑖 ) 𝐂𝑙𝑠

5
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

where 𝐂𝑗𝑞 = (ΨT 𝐱𝑖 − ΨT 𝐱𝑗 )T (ΨT 𝐱𝑖 − ΨT 𝐱𝑞 ) for 𝑖 = 1, 2, … , 𝑛 and 4.3. Updating 𝐒


𝑗 = 1, 2, … , 𝑛. Essentially, it performs locally linear embedding in the
feature subspace ΨT 𝐗. When updating the affinity matrix 𝐒, we are going to optimize the
following problem

𝑛 ∑
𝑛

4.2. Updating Ψ min (‖ΨT 𝐱𝑖 − ΨT 𝐱𝑗 ‖22 𝐒𝑖𝑗 + 𝛼𝐒2𝑖𝑗 ) + 𝛾Tr(𝐐T 𝐋𝑆 𝐐)


𝐒
𝑖=1 𝑗=1
(18)

𝑛
𝑠.𝑡. 𝐒𝑖𝑗 = 1 (∀𝑖), 𝐒𝑖𝑗 ≥ 0 (∀𝑖, 𝑗).
With 𝐖, 𝐒 and 𝐐 fixed, removing the terms that are independent of 𝑗=1
Ψ, problem (8) turns into ∑ ‖ 𝑖
As we know, 2Tr(𝐐T 𝐋𝑆 𝐐) = 𝑗 ‖2 𝑖 𝑗
𝑖,𝑗 ‖𝐪 − 𝐪 ‖2 𝐒𝑖𝑗 , where 𝐪 and 𝐪 are

𝑛
‖ T ∑ ‖2 ∑
𝑛 ∑
𝑛
‖ T ‖2 ‖ T ‖2
‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖ + ‖Ψ 𝐱𝑖 − ΨT 𝐱𝑗 ‖ 𝐒𝑖𝑗 . the 𝑖th and 𝑗th row of the matrix 𝐐. By denoting ‖Ψ 𝐱𝑖 − Ψ 𝐱𝑗 ‖ as
T
min 𝛽
‖ ‖2 ‖ ‖2 ‖ ‖2
𝐔𝑖𝑗 and ‖ 𝑗 ‖2
Ψ∈ 𝑖
𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) 𝑖=1 𝑗=1 ‖𝐪 − 𝐪 ‖2 as 𝐕𝑖𝑗 , we rewrite (18) as
(11) ∑
𝑛 ∑
𝑛 ∑
𝑛 ∑
𝑛
𝛾 ∑∑
𝑛 𝑛
min 𝐔𝑖𝑗 𝐒𝑖𝑗 + 𝛼 𝐒2𝑖𝑗 + 𝐕 𝐒
𝐒
𝑖=1 𝑗=1 𝑖=1 𝑗=1
2 𝑖=1 𝑗=1 𝑖𝑗 𝑖𝑗
Using simple linear algebra, the first term in (11) can be rewritten into (19)
the more compact form ∑𝑛
𝑠.𝑡. 𝐒𝑖𝑗 = 1 (∀𝑖), 𝐒𝑖𝑗 ≥ 0 (∀𝑖, 𝑗).

𝑛
‖ T ∑ ‖2
𝑗=1
‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖
‖ ‖2 Let 𝐬𝑖 , 𝐮𝑖 and 𝐯𝑖 be the 𝑖th row of 𝐒, 𝐔 and 𝐕, respectively, we then
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
obtain the equivalent problem of (19) in vector form

𝑛
‖ T ∑𝑛
‖2
= ‖Ψ 𝐱𝑖 − 𝐖𝑖𝑗 (ΨT 𝐱𝑗 )‖ (12) ∑𝑛 ( )
‖ ‖2 T T 𝛾 T
𝑖=1 𝑗=1 min 𝐬𝑖 𝐮𝑖 + 𝛼𝐬𝑖 𝐬𝑖 + 𝐬𝑖 𝐯𝑖
𝑛
{𝐬𝑖 }𝑖=1 2 (20)
‖ ‖2 𝑖=1
= ‖ΨT 𝐗 − (ΨT 𝐗)𝐖T ‖
‖ ‖F 𝑖 𝑖
𝑠.𝑡. 𝐬 𝟏 = 1, 𝐬 ≥ 0 (∀𝑖),
( )
=Tr ΨT 𝐗(𝐈 − 𝐖)T (𝐈 − 𝐖)𝐗T Ψ ,
which can be transformed into solving 𝑛 independent subproblems
where the first equality holds because we set 𝐖𝑖𝑗 = 0 if 𝑗 ∉ 𝑘 (𝐱𝑖 ) for simultaneously. Taking the 𝑖th subproblem as an example, we have
𝑗 = 1, 2, … , 𝑛. Similarly, we can also derive the second term as T T 𝛾 T
min 𝐬𝑖 𝐮𝑖 + 𝛼𝐬𝑖 𝐬𝑖 + 𝐬𝑖 𝐯𝑖
𝐬𝑖 2 (21)

𝑛 ∑
𝑛
𝑠.𝑡. 𝐬𝑖 𝟏 = 1, 𝐬𝑖 ≥ 0.
‖ΨT 𝐱𝑖 − ΨT 𝐱𝑗 ‖22 𝐒𝑖𝑗
𝑖=1 𝑗=1
Let 𝐡𝑖 = −(2𝐮𝑖 + 𝛾𝐯𝑖 )∕2𝛼, arranging the objective function in (21)
∑𝑛 ∑ 𝑛 𝑛 ∑
∑ 𝑛
and then removing those irrelevant terms to 𝐬𝑖 result in
=2 ‖ΨT 𝐱𝑖 ‖22 𝐒𝑖𝑗 − 2 ⟨ΨT 𝐱𝑖 , ΨT 𝐱𝑗 ⟩𝐒𝑖𝑗 (13)
𝑖=1 𝑗=1 𝑖=1 𝑗=1 1
min ‖𝐬𝑖 − 𝐡𝑖 ‖22 𝑠.𝑡. 𝐬𝑖 𝟏 = 1, 𝐬𝑖 ≥ 0. (22)
T
̃ T Ψ) − 2Tr(ΨT 𝐗𝐒𝐗T Ψ)
=2Tr(Ψ 𝐗𝐃𝐗 𝐬𝑖 2

The above problem is a quadratic programming that can be tackled


=2Tr(ΨT 𝐗(𝐃
̃ − 𝐒)𝐗T Ψ),
via various optimization approaches. In this work, we utilize Lagrange
where 𝐃 ̃ ∈ ℜ𝑛×𝑛 is the diagonal matrix with 𝐃̃ 𝑖𝑖 = ∑𝑛 𝐒𝑖𝑗 . multiplier method (Duchi et al., 2008) to solve problem (22), which is
𝑗=1
̃ − 𝐒)𝐗T , combing (12) and (13) very simple yet significantly efficient. We present the solving process
Let 𝐙 ≜ 𝛽𝐗(𝐈 − 𝐖)T (𝐈 − 𝐖)𝐗T + 2𝐗(𝐃
in Algorithm 1.
results in the equivalent form of (11) as

min Tr(ΨT 𝐙Ψ). (14) Algorithm 1 Solving problem (18)


Ψ∈
T T
Input: 𝐇 ≜ [𝐡1 , 𝐡2 , ⋯ , 𝐡𝑛T ] ∈ ℜ𝑛×𝑛
Let Ψ = [𝝍 1 , 𝝍 2 , … , 𝝍 𝐾 ] ∈ ℜ𝑑×𝐾 , based on the definition of trace
1: for 𝑖 = 1, 2, ⋯ , 𝑛 do
operator, problem (14) is equivalent to 2: Sort 𝐡𝑖 in descending order, and put the sorted results into 𝜅


𝐾 3: 𝜌 ∶= max{1 < 𝑗 < 𝑛 ∶ 𝜅𝑗 + 1𝑗 (1 − 𝑗𝑠=1 𝜅𝑠 ) > 0}
min 𝝍 T𝑖 𝐙𝝍 𝑖 . (15) 1 ∑𝜌
Ψ∈ 4: 𝜈 ∶= 𝜌 (1 − 𝑠=1 𝜅𝑠 )
𝑖=1
5: 𝐬𝑖 ∶= max(𝐡𝑖 + 𝜈, 0)
As stated previously, only one value in the 𝝍 𝑖 is 1 and other values 6: end for
T T
are 0. Thus we have 𝝍 T𝑖 𝐙𝝍 𝑖 = 𝐙𝐼(𝑖)𝐼(𝑖) , where 𝐼(𝑖) is the index of 1 in Output: The optimal 𝐒∗ = [𝐬1 , 𝐬2 , ⋯ , 𝐬𝑛T ]T
the 𝝍 𝑖 . Consequently, we can get 𝐼 ∗ by


𝐾 4.4. Updating 𝐐
𝐼 ∗ ∶= arg min 𝐙𝐼(𝑖)𝐼(𝑖) . (16)
𝐼⊂{1,2,…,𝑑},|𝐼|=𝐾 𝑖=1
When optimizing 𝐐, we aim at solving the following subproblem
As can be seen from (16), we can arrive at 𝐼 ∗ by identifying the 𝐾 min Tr(𝐐T 𝐋𝑆 𝐐) 𝑠.𝑡. 𝐐T 𝐐 = 𝐈 𝑐 . (23)
𝐐∈ℜ𝑛×𝑐
smallest diagonal elements of 𝐙, which means that the optimal feature
subset 𝐼 ∗ is also been obtained. For 𝑖 = 1, 2, … , 𝑑 and 𝑗 = 1, 2, … , 𝐾, we According to Nie et al. (2016), the optimal solution can be ob-
can immediately arrive at the optimal selection matrix Ψ∗ = [Ψ∗𝑖𝑗 ]𝑑×𝐾 tained by forming the 𝑐 eigenvectors corresponding to the 𝑐 smallest
as eigenvalues of the Laplacian matrix 𝐋𝑆 .
{ With the above updating rules, we alternatively update 𝐖, Ψ, 𝐒 as
1 if 𝑖 = 𝐼 ∗ (𝑗) well as 𝐐, and repeat the process until the objective function converges.
Ψ∗𝑖𝑗 = (17)
0 otherwise We summarize the updating rules in Algorithm 2.

6
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Algorithm 2 SGLLE: Algorithm to solve problem (18) With 𝐖𝑡+1 , Ψ𝑡+1 and 𝐒𝑡 fixed, we update 𝐐 by
Input: Data matrix 𝐗, the number of clusters 𝑐, hyper-parameters 𝛼, 𝛽 𝐐𝑡+1 ∶= arg min Tr(𝐐T 𝐋𝑆 𝑡 𝐐), (27)
and 𝛾, the neighborhood size 𝑘 and the number of selected features 𝐐T 𝐐=𝐈𝑐

𝐾 which implies the following inequality holds.


1: Set 𝑡 = 0
Tr((𝐐𝑡+1 )T 𝐋𝑆 𝑡 𝐐𝑡+1 ) ≤ Tr((𝐐𝑡 )T 𝐋𝑆 𝑡 𝐐𝑡 ). (28)
2: Initialize 𝐒0
3: repeat Similar to the above procedure, with and 𝐖𝑡+1 , Ψ𝑡+1
fixed, we 𝐐𝑡+1
4: Update 𝐖𝑡+1 via Eq. (10); update 𝐒 by solving problem (18). Then, the following inequality can
5: Update 𝚿𝑡+1 via Eq. (17); be gotten.
6: Update 𝐐𝑡+1 as the 𝑐 eigenvectors of 𝐋𝑆 𝑡 corresponding to the 𝑐 ∑ 𝑛 (
𝑛 ∑ )
smallest eigenvalues; ‖(Ψ𝑡+1 )T 𝐱𝑖 − (Ψ𝑡+1 )T 𝐱𝑗 ‖22 𝐒𝑡+1 𝑡+1 2
𝑖𝑗 + 𝛼(𝐒𝑖𝑗 )
7: Update 𝐒𝑡+1 via Algorithm 1; 𝑖=1 𝑗=1
T ( )
8: Update 𝐋𝑆 𝑡+1 = 𝐃𝑆 𝑡+1 − (𝐒𝑡+1 + 𝐒𝑡+1 )∕2, where 𝐃𝑆 𝑡+1 is a diagonal + 𝛾Tr (𝐐𝑡+1 )T 𝐋𝑆 𝑡+1 (𝐐𝑡+1 )
∑ (29)
matrix with the 𝑖-th diagonal element as 𝑛𝑗=1 (𝐒𝑡+1 𝑡+1
𝑖𝑗 + 𝐒𝑗𝑖 )∕2; ∑ 𝑛 (
𝑛 ∑ )
9: 𝑡 ∶= 𝑡 + 1; ≤ ‖(Ψ𝑡+1 )T 𝐱𝑖 − (Ψ𝑡+1 )T 𝐱𝑗 ‖22 𝐒𝑡𝑖𝑗 + 𝛼(𝐒𝑡𝑖𝑗 )2
10: until some stopping criterion is satisfied 𝑖=1 𝑗=1

Output: The optimal 𝚿∗ + 𝛾Tr((𝐐𝑡+1 )T 𝐋𝑆 𝑡 (𝐐𝑡+1 )).


According to inequities (25), (26), (28) and (29), we arrive at


𝑛
‖ 𝑡+1 T ∑ ‖2 ( )
𝑡+1 T
5. Theoretical analysis 𝛽 ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1
𝑖𝑗 ((Ψ ) 𝐱𝑗 )‖ + 𝛾Tr (𝐐𝑡+1 )T 𝐋𝑆 𝑡+1 𝐐𝑡+1
‖ ‖
𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) 2
In this section, we theoretically analyze the convergence and the 𝑛 (
𝑛 ∑
∑ )
2
computational time complexity of Algorithm 2. + ‖(Ψ𝑡+1 )T 𝐱𝑖 − (Ψ𝑡+1 )T 𝐱𝑗 ‖2 𝐒𝑡+1 𝑡+1 2
𝑖𝑗 + 𝛼(𝐒𝑖𝑗 )
𝑖=1 𝑗=1

5.1. Convergence analysis ∑𝑛


‖ 𝑡+1 T ∑ ‖2
𝑡+1 T
≤𝛽 ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1
𝑖𝑗 ((Ψ ) 𝐱𝑗 )‖ + 𝛾Tr((𝐐𝑡+1 )T 𝐋𝑆 𝑡 𝐐𝑡+1 )
‖ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
The proposed iterative procedure is able to ensure that the objective
∑ 𝑛 (
𝑛 ∑ )
‖ 𝑡+1 T ‖2
function is monotonically decreasing by the following theorem. + ‖(Ψ ) 𝐱𝑖 − (Ψ𝑡+1 )T 𝐱𝑗 ‖ 𝐒𝑡𝑖𝑗 + 𝛼(𝐒𝑡𝑖𝑗 )2
‖ ‖2
𝑖=1 𝑗=1
Theorem 2. The alternate updating rules in Algorithm 2 monotonically ∑𝑛
‖ 𝑡T ∑ ‖2 ( )
𝑡 T
decrease the objective function value of the resulting problem (8) in each ≤𝛽 ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1
𝑖𝑗 ((Ψ ) 𝐱𝑗 )‖ + 𝛾Tr (𝐐𝑡 )T 𝐋𝑆 𝑡 𝐐𝑡
‖ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
iteration.
∑ 𝑛 (
𝑛 ∑ )
‖ 𝑡T ‖2
+ ‖(Ψ ) 𝐱𝑖 − (Ψ𝑡 )T 𝐱𝑗 ‖ 𝐒𝑡𝑖𝑗 + 𝛼(𝐒𝑡𝑖𝑗 )2
Proof. According to Algorithm 2, we update 𝐖, Ψ, 𝐐 and 𝐒 in ‖ ‖2
𝑖=1 𝑗=1
sequence. Suppose that {𝐖𝑡 , Ψ𝑡 , 𝐐𝑡 , 𝐒𝑡 } is the iteration point at the 𝑡th
∑𝑛
‖ 𝑡T ∑ ‖2 ( )
iteration. ≤𝛽 ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡𝑖𝑗 ((Ψ𝑡 )T 𝐱𝑗 )‖ + 𝛾Tr (𝐐𝑡 )T 𝐋𝑆 𝑡 𝐐𝑡
‖ ‖2
To begin with, based on Ψ𝑡 , 𝐐𝑡 and 𝐒𝑡 , we get 𝐖𝑡+1 by solving the 𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
following problem ∑ 𝑛 (
𝑛 ∑ )
‖ 𝑡T ‖2
+ ‖(Ψ ) 𝐱𝑖 − (Ψ𝑡 )T 𝐱𝑗 ‖ 𝐒𝑡𝑖𝑗 + 𝛼(𝐒𝑡𝑖𝑗 )2 ,

𝑛
‖ 𝑡T ∑ ‖2 ‖ ‖2
min ‖(Ψ ) 𝐱𝑖 − 𝐖𝑖𝑗 ((Ψ𝑡 )T 𝐱𝑗 )‖ 𝑖=1 𝑗=1
𝐖 ‖ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) (30)
(24)

𝑘
𝑠.𝑡. 𝐖𝑖𝑗 = 1, for 𝑖 = 1, 2, … , 𝑛. where the first inequality follows from (29), the second inequality
𝑗=1 follows from (28) and (26), and the third inequality follows from
(25). □
According to the optimality condition, we can immediately arrive at
the following inequality
5.2. Computational complexity analysis

𝑛
‖ 𝑡T ∑ ‖2
𝑡 T
‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1
𝑖𝑗 ((Ψ ) 𝐱𝑗 )‖
‖ ‖2 Let us first revisit the notations used in this work. 𝑑, 𝐾 and 𝑛 are
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
(25) the number of original features, selected features, and instances, re-

𝑛
‖ 𝑡T ∑ ‖2 spectively. As observed from Algorithm 2, its computational complexity
≤ ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡𝑖𝑗 ((Ψ𝑡 )T 𝐱𝑗 )‖
‖ ‖2 mainly comes from four parts, including updating 𝐖, Ψ, 𝐐 and 𝐒. Next,
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
we elaborate on each part.
Next, using 𝐖𝑡+1 , 𝐐𝑡 and 𝐒𝑡 , we obtain Ψ𝑡+1 by solving problem (11).
Thus we have (1) Updating 𝐖: The cost of updating 𝐖 can be mainly decided by
the computation of ΨT 𝐗 and (10), whose complexity are of the

𝑛
‖ 𝑡+1 T ∑ ‖2
‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1 𝑡+1 T
) 𝐱𝑗 )‖ order of (𝑛𝑑𝐾) and (𝑛𝑑 + 𝑛2 𝑑 + 𝑛2 ), respectively. Then the cost
𝛽
‖ 𝑖𝑗 ((Ψ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 ) for 𝐖 is (𝑛𝑑𝐾 + 𝑛2 𝑑).

𝑛 ∑
𝑛 (2) Updating Ψ: To update Ψ, one must compute 𝐙 first whose
‖ 𝑡+1 T ‖2
+ ‖(Ψ ) 𝐱𝑖 − (Ψ𝑡+1 )T 𝐱𝑗 ‖ 𝐒𝑡𝑖𝑗 computational cost is (𝑛𝑑 2 + 𝑛2 𝑑). Besides, it costs (log 𝑑) to
‖ ‖2
𝑖=1 𝑗=1 solve (15). Hence the cost for updating Ψ is (𝑛𝑑 2 + 𝑛2 𝑑 + log 𝑑).
(26)

𝑛
‖ 𝑡T ∑ ‖2 (3) Updating 𝐐: The main computation cost for 𝐐 involves comput-
𝑡 T
≤𝛽 ‖(Ψ ) 𝐱𝑖 − 𝐖𝑡+1
𝑖𝑗 ((Ψ ) 𝐱𝑗 )‖ ing the 𝑐 eigenvectors of 𝐋𝑆 , which is (𝑛2 𝑐).
‖ ‖2
𝑖=1 𝑗∈𝑘 (𝐱𝑖 )
(4) Updating 𝐒: According to Algorithm 1, sorting 𝐡𝑖 (step 2) takes

𝑛 ∑
𝑛
‖ 𝑡T ‖2 the cost of (𝑛 log 𝑛) and the cost for step 3–5 is (1). Thus
+ ‖(Ψ ) 𝐱𝑖 − (Ψ𝑡 )T 𝐱𝑗 ‖ 𝐒𝑡𝑖𝑗 .
‖ ‖2 updating 𝐒 takes the cost of (𝑛2 log 𝑛).
𝑖=1 𝑗=1

7
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Table 3 Yang et al., 2011), the number of selected features is tuned by the
The statistics of datasets.
grid-search method from 50 to 300 with the incremental size 50.
Datasets # Features # Instances # Classes Besides, a variety of parameters exist in the comparison algorithms and
𝑑 𝑛 𝑐
our proposed SGLLE, such as regularization parameters, neighborhoods
JAFFE 676 213 10
size, and so on, whose value should be specified before implement-
att 644 400 40
Yale 1024 165 15 ing experiments. More specifically, we utilize 𝑘-nearest neighborhood
PIE10P 2420 210 10 method and set the neighborhood size as 5 for building the graph in
BA 320 1404 36 LS, SPEC, MCFS and UDFS. We employ the grid-search strategy to tune
TOX_171 5748 171 4 the regularization parameters in UDFS, RNE, VCSDFS, FSKD, GOSUP,
GLIOMA 4434 50 4
LRPFS and SGLLE from {10−6 , 10−4 , 10−2 , 1, 102 , 104 , 106 }. To construct
colon 2000 62 2
lymphoma 4026 96 9 the local reconstruction matrix of RNE and SGLLE, we need to set a
value to the size of neighborhoods. According to Liu et al. (2020), we
adjust it from 5 to 15 with the incremental size 2. For RNE, we fix
𝜇 = 106 to ensure the orthogonal constraint satisfied.
Overall, Algorithm 2 takes the computation cost of (𝑛𝑑 2 + 𝑛2 𝑑 + Similar to the existing works, evaluating UFS approaches mainly
𝑛2 𝑐 + 𝑛2 log 𝑛) for each iteration. consists of three steps: (1) implementing the comparison algorithms and
the proposed SGLLE to obtain the selected features; (2) performing K-
6. Experiments means clustering algorithm on the selected features; (3) analyzing the
clustering results and reporting the clustering performance (ACC and
In this section, we conduct experiments to demonstrate the effec- NMI). Since the clustering performances of K-means algorithm have
tiveness and superiority of our proposed SGLLE. a strong dependency on the initial values, we independently repeat
K-means algorithm 20 times to get the more reliable results. And,
6.1. Datasets and comparison approaches the final results are reported as the average result over the 20 times
together with the corresponding standard deviation. To perform K-
Nine publicly available datasets are considered in the experiments, means algorithm, we specify the number of clusters as the true number
where five datasets, including JAFFE, att, Yale, PIE10P and BA, are of classes.
from image domain, and others are from biological domain, including
TOX_171, GLIOMA, colon and lymphoma. Table 3 gives the detailed 6.3. Clustering results and analysis
information of datasets.
One Baseline and ten state-of-the-art UFS methods are used for In Tables 4 and 5, we present the clustering results of the proposed
comparison. The details of compared methods are presented as follows: SGLLE and the compared approaches. The best clustering results are
highlighted in boldface. The number in the bracket is the number of se-
(1) Baseline: All the original features are utilized.
lected features when the corresponding performance is achieved. More-
(2) LS (He et al., 2006) selects features that can best preserve the
over, the average performances of each method over all the datasets
local manifold structure of data.
are shown in Fig. 2. From these experimental results, the following
(3) SPEC (Zhao & Liu, 2007) employs spectral properties of graph
observations can be obtained.
to evaluate the feature relevance.
(4) MCFS (Cai et al., 2010) selects features with two steps: (1) (1) The clustering performances, namely, ACC and NMI, of UFS
performing spectral analysis; (2) solving the 𝓁1 -norm regularized approaches are better than that of Baseline in the most cases,
regression problem. which verifies the necessity and effectiveness of UFS.
(5) UDFS (Yang et al., 2011) integrates discriminative analysis with (2) From Tables 4 and 5, compared with Baseline and ten state-of-
the 𝓁2,1 norm regularization term as a unified framework for the-art UFS methods, SGLLE performs best in many cases. Note
feature selection. that it does not perform the best on BA, GLIOMA and colon.
(6) LLEscore (Yao et al., 2017) selects features according to the However, as for BA, there is only a gap of 1.43% and 1.08%
ability of preserving the linear structure. in terms of ACC and NMI in contrast with the best FSKD and
(7) RNE (Liu et al., 2020) incorporates a selection matrix into local LRPFS. As for GLIOMA, there is only a gap of 0.3% and 1.11%
linear embedding and leverages the 𝓁1 norm as the loss function in terms of ACC and NMI in contrast with the best GOSUP and
to measure the data reconstruction error. RNE. As for colon, there is a gap of 1.61% and 4.63% in terms
(8) VCSDFS (Karami et al., 2023) selects features by joint the of ACC and NMI in contrast with the best LRPFS.
variance–covariance distance based regression and low redun- (3) The proposed SGLLE gains significant performance on a major-
dancy regularization. ity of datasets. Notably, SGLLE possesses great advantages on
(9) FSDK (Nie et al., 2023) selects features based on the 𝓁2,𝑝 regu- some datasets, such as Yale and lymphoma. It achieve at least
larized linear regression model with discrete pseudo label con- 5.27% (5.21%) and 3.59% (3.61%) improvement on the Yale
straint. (lymphoma) dataset in the matter of ACC and NMI, respectively.
(10) GOSUP (Wang, Wang et al., 2023) integrates feature selection (4) By contrasting the clustering results of RNE with that of SGLLE,
and extraction, enforcing sparsity in the projection matrix to it can be observed that SGLLE outperforms RNE in almost all
select only relevant features. cases. This indicates that it is necessary and effective to intro-
(11) LRPFS (Ma et al., 2023) enhances clustering by assigning unique duce structured graph into feature selection model. Because it
attribute scores to samples and incorporating sparse modeling to can uncover the local geometry structure of data.
ensure solution uniqueness and data interconnection. (5) From Fig. 2, it seems that, embedded approaches can usually
gain the better performance than filter approaches. On the
6.2. Experiment setting whole, SGLLE performs best in terms of the both two evaluation
criteria. Specifically, SGLLE outperforms the second best method
Up to now, determining the number of the optimal feature subset 5.28% and 5.79% on average in the matter of ACC and NMI,
is still an open problem and a great number of researchers have been respectively.
dedicated to overcoming it. Following the prior works (Li et al., 2014;

8
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Table 4
Clustering results (ACC) of Baseline and different UFS algorithms on the different datasets. The number in each bracket is the number of selected features when the corresponding
performance is achieved. The best results are highlighted in boldface.
ACC ± std(%)
JAFFE att Yale PIE10P BA TOX_171 GLIOMA colon lymphoma
81.81 ± 4.37 62.62 ± 2.24 38.36 ± 3.69 25.69 ± 1.42 41.91 ± 2.22 40.96 ± 0.13 42.30 ± 1.87 54.76 ± 0.36 54.53 ± 5.55
Baseline
(676) (644) (1024) (2420) (320) (5748) (4434) (2000) (4026)
87.84 ± 3.35 67.26 ± 2.93 43.94 ± 2.68 30.17 ± 0.39 43.94 ± 1.23 43.27 ± 1.31 60.00 ± 0.00 58.96 ± 0.00 58.49 ± 4.90
LS
(250) (300) (250) (50) (200) (50) (200) (50) (200)
84.60 ± 5.52 58.36 ± 3.29 41.97 ± 1.57 27.05 ± 0.72 42.59 ± 1.37 40.35 ± 0.00 58.10 ± 2.20 58.06 ± 0.00 59.22 ± 4.98
SPEC
(250) (300) (100) (50) (300) (200) (50) (200) (300)
87.56 ± 5.27 65.74 ± 2.89 38.48 ± 2.97 33.21 ± 2.66 43.54 ± 1.46 42.54 ± 2.02 59.30 ± 3.93 56.45 ± 0.00 60.99 ± 6.27
MCFS
(50) (250) (300) (50) (300) (100) (300) (300) (250)
86.53 ± 4.35 67.21 ± 3.13 40.57 ± 2.26 42.67 ± 1.61 41.99 ± 1.49 45.67 ± 1.02 63.00 ± 9.57 58.06 ± 0.00 54.58 ± 5.65
UDFS
(50) (300) (300) (50) (300) (50) (150) (100) (300)
87.14 ± 3.57 68.05 ± 2.29 38.27 ± 3.20 38.90 ± 2.52 42.99 ± 2.05 43.27 ± 1.07 48.70 ± 1.75 54.84 ± 0.00 55.31 ± 4.63
LLEscore
(250) (150) (300) (150) (250) (200) (50) (50) (250)
88.19 ± 5.92 66.27 ± 2.41 42.97 ± 2.83 30.55 ± 2.20 43.11 ± 1.46 48.65 ± 1.16 61.90 ± 1.02 58.06 ± 0.00 60.78 ± 3.21
RNE
(250) (150) (150) (50) (250) (150) (50) (50) (300)
91.97 ± 5.18 63.30 ± 2.99 38.45 ± 3.95 38.60 ± 2.70 42.63 ± 2.32 49.68 ± 2.22 52.20 ± 2.97 61.29 ± 0.00 54.11 ± 5.83
VCSDFS
(300) (300) (300) (50) (300) (250) (150) (100) (300)
89.46 ± 5.89 66.83 ± 2.84 37.79 ± 1.82 35.14 ± 3.70 44.22 ± 1.65 46.46 ± 2.34 62.90 ± 6.03 59.68 ± 0.00 60.89 ± 7.31
FSDK
(150) (250) (300) (200) (150) (50) (150)) (250)) (250)
80.85 ± 4.62 65.31 ± 2.59 39.82 ± 2.15 48.64 ± 2.57 41.58 ± 1.69 43.65 ± 1.91 66.60 ± 2.52 56.45 ± 0.00 56.30 ± 5.44
GOSUP
(150) (250) (300) (250) (300) (300) (50)) (50)) (250)
83.15 ± 2.83 67.73 ± 2.63 37.21 ± 2.58 41.21 ± 2.13 43.74 ± 1.37 44.24 ± 1.16 42.00 ± 2.34 62.90 ± 0.00 61.87 ± 6.43
LRPFS
(150) (300) (300) (50) (300) (50) (300) (250) (200)
93.05 ± 4.42 69.43 ± 2.79 49.21 ± 3.09 49.67 ± 4.46 42.79 ± 1.68 51.40 ± 0.68 66.30 ± 2.27 61.29 ± 0.00 67.08 ± 5.40
SGLLE
(300) (200) (250) (50) (300) (100) (100) (200) (100)

Table 5
Clustering results (NMI) of Baseline and different UFS algorithms on the different datasets. The number in each bracket is the number of selected features when the corresponding
performance is achieved. The best results are highlighted in boldface.
NMI ± std(%)
JAFFE att Yale PIE10P BA TOX_171 GLIOMA colon lymphoma
85.35 ± 3.51 81.42 ± 1.10 45.33 ± 3.07 25.25 ± 3.01 57.39 ± 0.71 12.11 ± 0.07 16.74 ± 2.74 0.198 ± 0.00 58.70 ± 3.11
Baseline
(676) (644) (1024) (2420) (320) (5748) (4434) (2000) (4026)
88.08 ± 2.65 83.76 ± 1.14 50.32 ± 1.73 29.61 ± 2.94 58.92 ± 0.75 12.78 ± 1.76 53.23 ± 0.00 1.828 ± 0.00 62.05 ± 2.36
LS
(250) (300) (250) (200) (200) (200) (200) (50) (100)
88.00 ± 2.33 77.40 ± 1.69 47.75 ± 2.22 25.64 ± 1.21 57.70 ± 0.72 9.690 ± 0.00 49.92 ± 0.21 1.828 ± 0.00 63.17 ± 2.78
SPEC
(250) (300) (150) (150) (300) (250) (50) (200) (300)
88.96 ± 3.36 82.43 ± 1.57 44.88 ± 2.18 35.93 ± 3.44 58.45 ± 0.81 9.394 ± 0.72 48.99 ± 1.53 0.444 ± 0.00 61.73 ± 4.60
MCFS
(50) (250) (300) (50) (300) (300) (250) (300) (250)
88.31 ± 2.24 83.56 ± 1.54 47.65 ± 2.27 43.44 ± 2.33 57.94 ± 0.80 17.56 ± 2.38 46.95 ± 10.9 2.271 ± 0.71 54.78 ± 4.99
UDFS
(50) (300) (200) (50) (300) (50) (150) (50) (300)
88.32 ± 2.80 83.23 ± 1.14 44.87 ± 2.24 43.56 ± 2.75 57.84 ± 1.05 10.88 ± 0.91 33.28 ± 2.28 0.084 ± 0.00 58.73 ± 2.36
LLEscore
(300) (150) (300) (150) (250) (200) (50) (50) (300)
89.75 ± 3.93 83.02 ± 1.59 48.87 ± 3.30 31.39 ± 3.09 58.17 ± 0.69 22.67 ± 0.44 53.27 ± 1.49 1.828 ± 0.00 62.55 ± 3.86
RNE
(250) (300) (100) (100) (300) (200) (300) (50) (300)
92.74 ± 2.81 80.39 ± 1.47 43.72 ± 3.00 40.96 ± 2.39 58.36 ± 0.79 28.39 ± 0.00 32.67 ± 3.35 3.950 ± 0.00 54.17 ± 3.53
VCSDFS
(300) (300) (300) (50) (300) (100) (50) (50) (300)
90.10 ± 4.17 83.15 ± 1.44 43.88 ± 2.10 36.34 ± 3.56 58.73 ± 0.89 22.53 ± 3.67 42.68 ± 8.28 1.970 ± 0.00 62.93 ± 3.57
FSDK
(150) (250) (300) (200) (200) (50) (300) (250) (250)
83.22 ± 2.45 81.32 ± 1.17 46.22 ± 1.91 51.25 ± 2.76 56.83 ± 0.57 11.95 ± 2.12 52.86 ± 2.89 3.750 ± 0.27 57.96 ± 3.67
GOSUP
(50) (300) (300) (100) (300) (300) (50) (50) (250)
85.67 ± 2.13 83.29 ± 1.30 43.58 ± 2.08 46.95 ± 1.71 58.95 ± 0.78 13.95 ± 1.04 16.10 ± 1.95 11.94 ± 0.42 62.04 ± 2.74
LRPFS
(150) (300) (300) (50) (250) (100) (300)) (200)) (250)
94.76 ± 3.78 85.22 ± 1.49 53.91 ± 3.02 55.53 ± 2.66 57.87 ± 0.79 30.06 ± 0.72 52.16 ± 0.00 7.312 ± 0.00 66.78 ± 3.77
SGLLE
(300) (200) (250) (50) (300) (100) (50) (50) (200)

9
J. Miao et al. Expert Systems With Applications 255 (2024) 124568

Fig. 2. The averages of ACC and NMI over the nine datasets.

Table 6
Running time (seconds) comparison of different methods.
LS SPEC MCFS UDFS LLEscore RNE VCSDFS FSDK GOSUP LRPFS SGLLE
JAFFE 0.0116 0.0131 0.0640 0.9318 85.058 16.362 0.6714 0.0810 3.003 0.7325 0.2800
att 0.0095 0.0420 0.0616 0.8878 155.27 8.2961 0.8376 0.1199 4.2992 1.0337 0.7715
Yale 0.0076 0.0169 0.0381 2.7144 99.602 21.377 1.8767 0.6404 10.255 0.7606 1.8169
PIE10P 0.0198 0.0372 0.1652 52.606 307.83 68.398 16.860 11.841 120.40 3.8035 1.2970
BA 0.0115 1.0923 0.3080 0.1655 252.98 4.5291 1.4734 0.2536 14.739 8.1499 7.2363
TOX_171 0.0273 0.0475 0.5615 937.10 490.17 522.57 154.30 18.233 1705.3 18.7093 1.7346
GLIOMA 0.0082 0.0211 0.3714 409.30 130.28 222.17 76.776 51.456 876.06 9.9164 8.6970
colon 0.0036 0.0087 0.1281 27.094 77.122 45.395 11.076 0.7965 56.454 2.1092 2.7310
lymphoma 0.0116 0.0225 0.3578 298.57 204.15 483.08 58.940 0.6258 619.91 8.4980 0.9200
Avg. 0.0123 0.1445 0.2284 192.15 200.27 154.67 35.868 10.631 378.94 5.9680 2.8316

6.4. Running time comparison

In this subsection, we conduct experiments to test the efficiency of the proposed method and the competitors. The experiments are conducted on a computer with an Intel(R) Core(TM) i5-10400 CPU and 8 GB of RAM. The running times of all UFS methods on the nine benchmark datasets are summarized in Table 6. From this table, it can be found that filter methods are usually more efficient than embedded ones, which coincides with the previous statement. Among the embedded methods, the time costs of our proposed SGLLE and of LRPFS are lower than those of UDFS, RNE, VCSDFS, FSDK and GOSUP. Moreover, SGLLE and LRPFS are comparable in terms of running time; on average, SGLLE is more efficient than LRPFS.

6.5. Parameter analysis

The proposed SGLLE under different parameter combinations may produce different feature selections, resulting in different clustering performances. Here, we investigate how the parameters, namely the regularization parameters 𝛼 and 𝛽 and the neighborhood size 𝑘, influence the performance of SGLLE. Note that we empirically fix 𝛾 = 10⁶ and do not adjust its value; better clustering performance might be obtained by tuning 𝛾. In the following, we first analyze 𝛼 and 𝛽 simultaneously, and then analyze 𝑘.

We vary 𝛼 and 𝛽 over {10⁻⁶, 10⁻⁴, 10⁻², 1, 10², 10⁴, 10⁶}. The changes of ACC and NMI are displayed in Figs. 3 and 4, respectively. As seen from the figures, the proposed SGLLE is sensitive to 𝛼 and 𝛽, which implies the need to use a grid-search method to identify the optimal parameters. It can also be observed that the clustering performance is not very sensitive to 𝛽 when 𝛼 is large, and is sensitive to 𝛽 when 𝛼 is small.

To investigate how the clustering performance changes with respect to the neighborhood size 𝑘 in locally linear embedding, we vary the value of 𝑘 from 5 to 15 with an increment of 2. The trends of ACC and NMI are shown in Figs. 5 and 6, respectively. From the figures, we can observe that the clustering results are relatively stable under different neighborhood sizes 𝑘. Moreover, the best results are often achieved with smaller 𝑘. The reason is that the proposed SGLLE can effectively reconstruct each sample from its few nearest neighbors; if 𝑘 is large, the reconstruction is usually time-consuming because it involves more distance computations. In practice, a relatively small value of 𝑘 is therefore preferred.

6.6. Effectiveness of structured graph

In this part, we conduct experiments to show the significance of the structured graph in the proposed SGLLE. To this end, we use a pre-defined similarity graph to replace the structured adaptive one of SGLLE. Thus we have

\[
\min_{\Psi,\mathbf{W}} \; \beta \sum_{i=1}^{n} \Bigl\| \Psi^{\mathrm{T}}\mathbf{x}_i - \sum_{j \in \mathcal{N}_k(\mathbf{x}_i)} \mathbf{W}_{ij}\, \Psi^{\mathrm{T}}\mathbf{x}_j \Bigr\|_2^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} \bigl\| \Psi^{\mathrm{T}}\mathbf{x}_i - \Psi^{\mathrm{T}}\mathbf{x}_j \bigr\|_2^2 \, \mathbf{S}_{ij}
\quad \text{s.t.} \;\; \Psi \text{ is an explicit feature selection matrix}, \;\; \sum_{j=1}^{k} \mathbf{W}_{ij} = 1,
\tag{31}
\]

where 𝒩_k(𝐱_i) denotes the set of the 𝑘 nearest neighbors of 𝐱_i and 𝐒 is the pre-defined similarity matrix. This problem can be effectively optimized by an algorithm similar to Algorithm 2. For convenience, we name the resulting variant PGLLE. Table 7 shows the experimental results of PGLLE and SGLLE on all the used datasets. We can observe a significant decrease in the performance of PGLLE compared to SGLLE in terms of ACC and NMI, which indicates the effectiveness of the structured adaptive graph in UFS.
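A pre-defined similarity graph of the kind used in this comparison is typically a fixed 𝑘-nearest-neighbor graph with heat-kernel weights. The following is a minimal sketch of one such construction; the kernel-width heuristic and the symmetrization step are illustrative assumptions rather than the exact graph used for PGLLE:

```python
import numpy as np

def knn_heat_kernel_graph(X, k=5, sigma=None):
    """Build a fixed k-NN similarity matrix S for a PGLLE-style objective.

    X     : (n_samples, n_features) data matrix.
    k     : number of nearest neighbors kept per sample.
    sigma : heat-kernel width; defaults to the root-mean-square pairwise distance.
    """
    n = X.shape[0]
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)                  # exclude self-similarity
    if sigma is None:
        sigma = np.sqrt(np.mean(d2[np.isfinite(d2)]))
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]              # indices of the k nearest neighbors
        S[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return 0.5 * (S + S.T)                        # symmetrize so that S_ij = S_ji
```

In SGLLE, by contrast, 𝐒 is not fixed in advance but is updated inside the optimization, and this learned, structured graph is exactly the component that the PGLLE comparison isolates.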

Fig. 3. ACC with different regularization parameters 𝛼 and 𝛽.

Table 7
ACC and NMI of PGLLE and SGLLE on all the datasets.
Dataset PGLLE SGLLE
ACC NMI ACC NMI
JAFFE 88.22 ± 3.63 89.03 ± 2.41 93.05 ± 4.42 94.76 ± 3.78
att 69.08 ± 2.86 84.88 ± 1.00 69.43 ± 2.79 85.22 ± 1.49
Yale 47.55 ± 2.90 52.42 ± 1.52 49.21 ± 3.09 53.91 ± 3.02
PIE10P 51.05 ± 2.77 55.21 ± 1.47 49.67 ± 4.46 55.53 ± 2.66
BA 42.46 ± 1.24 57.55 ± 0.75 42.79 ± 1.68 57.87 ± 0.79
TOX_171 51.23 ± 0.29 27.96 ± 0.42 51.40 ± 0.68 30.06 ± 0.72
GLIOMA 60.00 ± 0.00 49.25 ± 0.00 66.30 ± 2.27 52.16 ± 0.00
colon 59.68 ± 0.00 2.43 ± 0.00 61.29 ± 0.00 7.312 ± 0.00
lymphoma 61.20 ± 2.53 64.41 ± 3.56 67.08 ± 5.40 66.78 ± 3.77
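For reference, ACC and NMI as reported above are standard external clustering measures; ACC additionally requires an optimal one-to-one matching between predicted cluster labels and ground-truth classes. Below is a minimal sketch of one common way to compute both, using SciPy's Hungarian solver and scikit-learn's NMI; the exact evaluation protocol behind the tables (e.g., the number of clustering repetitions underlying the ± values) is not reproduced here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly assigned under the best
    one-to-one matching between clusters and ground-truth classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes.
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == t))
    # Hungarian algorithm on the negated table maximizes matched samples.
    row_ind, col_ind = linear_sum_assignment(-counts)
    return counts[row_ind, col_ind].sum() / y_true.size

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```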

Fig. 4. NMI with different regularization parameters 𝛼 and 𝛽.
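The sensitivity analysis behind Figs. 3–6 amounts to an exhaustive sweep over 𝛼, 𝛽, and the neighborhood size 𝑘. A minimal sketch of such a sweep is given below; run_sglle and evaluate_clustering are hypothetical placeholders for the feature-selection routine and the clustering evaluation, not part of any released implementation:

```python
import itertools

# Grids used in the sensitivity analysis (Section 6.5).
alpha_grid = [1e-6, 1e-4, 1e-2, 1, 1e2, 1e4, 1e6]
beta_grid = [1e-6, 1e-4, 1e-2, 1, 1e2, 1e4, 1e6]
k_grid = range(5, 16, 2)                 # neighborhood sizes 5, 7, ..., 15

def grid_search(X, y, run_sglle, evaluate_clustering, gamma=1e6):
    """Exhaustive search over (alpha, beta, k), returning the best setting.

    run_sglle(X, alpha, beta, k, gamma)  -> indices of the selected features
    evaluate_clustering(X_sub, y)        -> (acc, nmi)
    Both callables are assumptions standing in for the experimental pipeline.
    """
    best_params, best_acc = None, -1.0
    for alpha, beta, k in itertools.product(alpha_grid, beta_grid, k_grid):
        idx = run_sglle(X, alpha, beta, k, gamma)
        acc, _ = evaluate_clustering(X[:, idx], y)
        if acc > best_acc:
            best_params, best_acc = (alpha, beta, k), acc
    return best_params, best_acc
```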

6.7. Convergence study

An iterative optimization approach based on an alternative optimization strategy has been developed to optimize the proposed SGLLE. In Section 5.1, we have theoretically proven the convergence of Algorithm 2. In this subsection, we experimentally study its convergence speed. For this purpose, we record the objective function value at each iteration point generated by Algorithm 2. The convergence curves are shown in Fig. 7, where the 𝑥-axis and 𝑦-axis denote the number of iterations and the value of the objective function, respectively. From the figure, we can observe that our algorithm usually converges within at most twelve iterations on these nine datasets, verifying the effectiveness and efficiency of the proposed optimization algorithm.
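The stopping behavior just described can be monitored in the usual way for alternating schemes, namely by tracking the relative change of the objective across iterations. The sketch below is a generic illustration of this idea; update_W, update_S, update_Psi, and objective are hypothetical stand-ins for the subproblem solvers and objective of Algorithm 2, not its actual implementation:

```python
def alternating_optimization(X, init, update_W, update_S, update_Psi,
                             objective, max_iter=50, tol=1e-6):
    """Generic alternating-optimization loop with convergence tracking.

    Each update_* callable solves one subproblem with the other blocks fixed;
    objective(X, W, S, Psi) returns the current objective value. The returned
    history is the quantity plotted in convergence curves such as Fig. 7.
    """
    W, S, Psi = init
    history, prev = [], None
    for _ in range(max_iter):
        W = update_W(X, S, Psi)          # fix S and Psi, solve for W
        S = update_S(X, W, Psi)          # fix W and Psi, solve for S
        Psi = update_Psi(X, W, S)        # fix W and S, solve for Psi
        val = objective(X, W, S, Psi)
        history.append(val)
        if prev is not None and abs(prev - val) <= tol * max(abs(prev), 1.0):
            break                        # relative decrease below tolerance
        prev = val
    return W, S, Psi, history
```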

7. Conclusion

In this study, we have proposed a novel UFS approach named SGLLE that performs structured graph learning and locally linear embedding simultaneously in the feature subspace. To select the most discriminative and distinctive features, we introduce an explicit selection matrix consisting of only 0s and 1s that can be incorporated into locally linear embedding. An adaptive and structured graph is learned to maintain the local geometrical structure of the original data. To optimize the proposed SGLLE, we carefully design an iterative algorithm based on an alternative optimization strategy. Moreover, we theoretically prove the convergence and analyze the computational complexity of the algorithm. The superiority of SGLLE over several advanced approaches has been demonstrated on real-world datasets.

CRediT authorship contribution statement

Jianyu Miao: Conceptualization, Methodology, Software, Validation, Writing – original draft. Jingjing Zhao: Conceptualization, Methodology, Software. Tiejun Yang: Writing – review & editing, Validation, Visualization. Chao Fan: Writing – review & editing, Software. Yingjie Tian: Supervision, Writing – review & editing. Yong Shi: Funding acquisition, Writing – review & editing. Mingliang Xu: Funding acquisition, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 62106067, Grant 62073123, Grant 62036010, Grant 72231010, Grant 71932008 and Grant 71731009, the Natural Science Project of Zhengzhou Science and Technology Bureau under Grant 21ZZXTCX21, the Cultivation Project for Young Backbone Teachers in Henan University of Technology, and the Innovative Funds Plan of Henan University of Technology (2022ZKCJ11).

Fig. 5. ACC with different neighborhood sizes 𝑘.

Fig. 6. NMI with different neighborhood sizes 𝑘.

Fig. 7. Convergence behavior of Algorithm 2.
