
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 3, MARCH 2023

Multilabel Feature Selection With Constrained Latent Structure Shared Term

Wanfu Gao, Yonghao Li, and Liang Hu

Abstract— High-dimensional multilabel data have increasingly emerged in many application areas, suffering from two noteworthy issues: instances with high-dimensional features and large-scale labels. Multilabel feature selection methods are widely studied to address these issues. Previous multilabel feature selection methods focus on exploring label correlations to guide the feature selection process, ignoring the impact of the latent feature structure on label correlations. In addition, one encouraging property regarding correlations between features and labels is that similar features tend to share similar labels. To this end, a latent structure shared (LSS) term is designed, which shares and preserves both the latent feature structure and the latent label structure. Furthermore, we employ the graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. Finally, we derive the shared latent feature and label structure feature selection (SSFS) method based on the constrained LSS term, and an effective optimization scheme with provable convergence is proposed to solve the SSFS method. Better experimental results on benchmark datasets are achieved in terms of multiple evaluation criteria.

Index Terms— Feature selection, graph regularization, latent structure, multilabel data.

Manuscript received 25 October 2020; revised 14 May 2021 and 25 July 2021; accepted 12 August 2021. Date of publication 26 August 2021; date of current version 1 March 2023. This work was supported in part by the Fundamental Research Funds for the Central Universities, Jilin University (JLU), under Grant 93K172020K36, in part by the Science Foundation of Jilin Province of China under Grant 2020122209JC, and in part by the Youth Science Foundation of Jilin Province of China under Grant 20160520011JH and Grant 20180520021JH. (Corresponding author: Liang Hu.)
The authors are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3105142.
Digital Object Identifier 10.1109/TNNLS.2021.3105142

I. INTRODUCTION

With the rapid development of computer science techniques, numerous multilabel data with high-dimensional features have emerged in many application areas, such as document classification [1]–[3] and gene recognition [4], [5]. There exist two salient issues in multilabel data: instances with high-dimensional features and large-scale labels. Obviously, the difficulty of addressing these two issues grows as the number of features and labels increases, which poses a challenge for fitting multilabel models. As a result, multilabel feature selection attracts much attention from researchers as an important preprocessing step in multilabel classification problems.

In multilabel learning, previous methods make great efforts to exploit complicated label correlations for fitting multilabel learning models. The well-known strategy that exploits label correlations of multilabel data is the problem transformation method, which can be divided into three groups based on label correlations: the first-order approach, the second-order approach, and the high-order approach [6]–[8]. First-order approaches consider the labels to be independent of each other; in other words, they ignore label correlations. Second-order approaches take pairwise label correlations into account, such as the ranking between a relevant label and an irrelevant label. High-order approaches tackle multilabel data by considering high-order relations among labels. Some representative problem transformation methods are binary relevance (BR) [9], calibrated label ranking (CLR) [10], and label-specific features-dependent labels (LLSF-DL) [11], corresponding to the first-order, second-order, and high-order approaches, respectively. These three classes of approaches first transform multilabel data into single-label data, i.e., they transform multilabel data to fit subsequent single-label algorithms. Another strategy is to transform algorithms to fit multilabel data, which is named the algorithm adaption approach. Algorithm adaption methods employ matrix transformation tricks or metrics of information theory to exploit label correlations.

The goal of multilabel feature selection methods is to choose critical features from a high-dimensional feature space with large-scale labels. As a result, feature correlations should indeed be considered in the design of multilabel feature selection methods, yet they are ignored in existing methods. Although many information-theoretic methods intend to capture feature correlations to construct feature selection models, they fail to obtain adequate label correlations to guide the feature selection process. The reason is that, although high-order label correlations exist among large-scale labels, existing information-theoretic methods mainly exploit label correlations with low-order approaches due to the high computational cost of high-order approaches.

As suggested by information-theoretic methods, some features are dependent on each other and others are independent of each other, which indicates that there exists a latent feature structure in the original feature space. The latent feature structure means that abundant features can be clustered into several features. For example, in text documents, the words “basketball,” “tennis,” and “football” can be represented as “sports”; we call “sports” the latent feature. In real-world applications, latent features cannot be identified explicitly; however, abundant features can be clustered into several clusters. Fig. 1 shows the process by which the original feature matrix is transformed into the latent feature matrix. The latent feature matrix is constructed from new features that are

different from the original features. Furthermore, to select the critical features from the original high-dimensional feature space with large-scale labels, the relationships between features and labels should be fully considered in the design of multilabel feature selection methods. One appealing property of multilabel data is that similar features tend to share similar labels: if two features are dependent on each other, they tend to share similar or identical labels with relatively high probability.

Fig. 1. The original feature matrix is transformed into the latent feature matrix.

In view of the above analysis, we design a latent structure shared (LSS) term to capture the latent feature structure in the original feature space. In addition, we employ the learning label-specific (LLS) term to fit the labels in multilabel data, with the LLS term encouraging similar features to share similar labels. Furthermore, to obtain an accurate latent feature structure, the graph regularization technique is adopted to guarantee the consistency between the original feature space and the latent feature structure space. Finally, a novel multilabel feature selection method named shared latent feature and label structure feature selection (SSFS) is proposed. We highlight the contributions of this article as follows:
1) designing an LSS term to capture the latent feature structure in the original feature space;
2) proposing a multilabel feature selection method named SSFS, which shares the LSS term between features and labels in multilabel data and adopts the graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space;
3) developing an effective optimization scheme with provable convergence to solve the SSFS method.

II. PRELIMINARIES

We denote matrices by italicized uppercase letters, for example, matrix A. For a matrix A ∈ R^{n×m}, A_{·j} and A_{i·} represent the jth column and the ith row of A, respectively. Bold italicized lowercase letters represent vectors, such as b, while scalars are denoted by lowercase letters, such as b. A^T and Tr(A) represent the transpose of A and, if A is a square matrix, the trace of A, respectively. \|A\|_F = (\sum_{i=1}^{n}\sum_{j=1}^{m} A_{ij}^2)^{1/2} and \|A\|_{2,1} = \sum_{i=1}^{n}(\sum_{j=1}^{m} A_{ij}^2)^{1/2} denote the Frobenius norm and the l_{2,1}-norm of A, respectively, in which A_{ij} is the (i, j)th entry of matrix A. We denote the feature matrix by X ∈ R^{n×d}, which has n instances in a d-dimensional feature space; each instance has c-dimensional labels, that is, Y ∈ R^{n×c}. Y_{ij} = 1 if the jth label is associated with the ith instance; otherwise, Y_{ij} = 0.

III. RELATED WORK

Recent years have witnessed an increasing number of feature selection methods handling multilabel data where each instance is related to multiple labels [12]–[16]. As mentioned above, there exist two groups of multilabel learning methods based on whether or not they address multilabel data directly: problem transformation methods and algorithm adaption methods. Pruned problem transformation (PPT) is a well-known transformation method [17], which is an enhanced version of the label powerset (LP) method [18]. The PPT method eliminates label patterns that occur too rarely according to a prespecified counting threshold. Based on the PPT method, a mutual-information-based method PPT + MI and a statistics-based method PPT + CHI have been proposed [17], [19]. Both methods focus on exploiting label correlations to guide feature selection. The highlight of the two PPT-based methods is that they are efficient and simple, while both ignore feature correlations in multilabel data. As a result, their classification performance may be limited.

Compared to problem transformation methods, many algorithm adaption methods have been studied over the past years. Some of them are based on information theory, such as max-dependence and min-redundancy (MDMR) [20] and the scalable criterion for large label set (SCLS) [21]. The MDMR method selects the superior feature subset by maximizing the feature dependence between features and labels while minimizing the feature redundancy. The specific criterion of MDMR is as follows:

J(f_k) = \sum_{l_i \in L} I(f_k; l_i) - \frac{1}{|S|} \sum_{f_j \in S} \Big[ I(f_k; f_j) - \sum_{l_i \in L} I(f_k; l_i \mid f_j) \Big]    (1)

where J(·) represents the objective function, and f_k, f_j, and S represent the candidate feature, an already-selected feature, and the already-selected feature subset, respectively. I(f_k; l_i) measures the feature dependence, whereas I(f_k; f_j) - \sum_{l_i \in L} I(f_k; l_i \mid f_j) measures the feature redundancy. The factor 1/|S| balances the magnitudes of the feature dependence term and the feature redundancy term.

Similarly, the feature selection criterion of SCLS includes two terms, a feature relevancy term and a scalable relevance evaluation, which is denoted by

J(f_k) = \sum_{l_i \in L} I(f_k; l_i) - \sum_{f_j \in S} \frac{I(f_k; f_j)}{H(f_k)} \sum_{l_i \in L} I(f_k; l_i).    (2)

Recently, Zhang et al. [22] proposed a novel multilabel feature selection method named multilabel feature selection based on label redundancy (LRFS), which groups labels into two categories: independent labels and dependent labels. The specific form of LRFS is presented as follows:

J(f_k) = \sum_{l_i \in L} \sum_{l_j \in L, l_i \neq l_j} I(f_k; l_i \mid l_j) - \frac{1}{|S|} \sum_{f_j \in S} I(f_k; f_j).    (3)


In addition, Gonzalez-Lopez et al. [23] designed a geometric mean maximization (GMM) method to discuss how to aggregate the mutual information of multiple labels. GMM chooses the optimal feature subset with the largest geometric mean; the geometric mean takes the product of the measures and then extracts the corresponding root.

Beyond the methods mentioned above, a large number of multilabel feature selection methods based on information theory have been proposed [24], [25]. The common characteristic of information-theoretic methods is that they pay much attention to feature correlations. Due to the computational limitations of high-order feature correlations, existing methods adopt low-order information-theoretic terms to exploit feature correlations. For example, the LRFS method employs the mutual information between the candidate feature and each already-selected feature to represent the feature redundancy. The MDMR method adopts the conditional mutual information between the candidate feature and the class given the already-selected features to represent interactive correlations among features and each label. Furthermore, previous information-theoretic methods adopt mutual information or conditional mutual information to calculate the correlation between each candidate feature and each class label, which ignores high-order interactive correlations between features and labels. As a result, previous methods depend heavily on the importance of individual features or labels.

On the contrary, embedded multilabel feature selection methods focus on exploiting label correlations, employing them to select a compact feature subset. For instance, Jian et al. [26] employed the matrix factorization technique to extract the latent semantics of the label information, with the latent semantics fitting the feature matrix of the multilabel data. As a result, a feature weight matrix reflecting the importance of features is obtained, and the critical features are selected according to this feature weight matrix. In addition, multilabel informed feature selection (MIFS) adopts the graph Laplacian matrix to ensure a local geometry structure between the feature matrix and the latent semantics of the label information. The objective function of the MIFS method is

\min_{W,V,B} \|XW - V\|_F^2 + \alpha\|Y - VB\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma \|W\|_{2,1}    (4)

where W ∈ R^{d×k}, V ∈ R^{n×k}, and B ∈ R^{k×c} represent the feature weight matrix, the latent semantics of the multilabel information, and the basis matrix, respectively, L ∈ R^{n×n} is the Laplacian matrix, and α, β, and γ are the three regularization parameters of the MIFS model.

Besides, multilabel learning using local correlation (ML-LOC) tries to exploit label correlations in the data locally [27]. Multilabel learning with global and local label correlation (GLOCAL) explores both global and local label correlations [28]. Multilabel manifold learning (ML) attempts to explore the manifold in the label space for multilabel learning [29]. Learning label-specific features (LLSF) assumes that each label is related to a feature subset and that two similar labels share more features than two dissimilar labels [30]. Label-enhanced multilabel learning (LEMLL) replaces logical labels with real-valued labels to enhance label information [31].

Imposing the l_{2,1}-norm on both the loss function and the regularization, the well-known embedded feature selection method RFS selects features across all data points with joint sparsity [32]. Furthermore, Cai et al. [33] imposed the l_{2,1}-norm on the loss function and the l_{2,0}-norm on the feature weight matrix, proposing a robust feature selection method named RALM-FS that is regarded as an improved version of RFS. It is denoted by

\min_{W,b} \|Y - X^T W - \mathbf{1}b^T\|_{2,1} \quad \text{s.t.} \quad \|W\|_{2,0} = a    (5)

where b and 1 represent the bias term and the all-one vector, respectively, and a is the number of selected features. Besides, Ma et al. [34] proposed a subfeature uncovering with sparsity (SFUS) method, which jointly selects features by a sparse regularization and uncovers the shared subspace of the original features. Chang et al. [35] proposed a convex semisupervised multilabel feature selection method to address large-scale datasets, which can be handled by a fast iterative algorithm. In addition, Chang et al. [36] proposed a semisupervised shared-subspace multilabel feature selection method, which integrates semisupervised learning and label correlation mining into one unified framework. Furthermore, Luo et al. [37] preserved the intrinsic structure of the feature matrix by locally linear embedding (LLE) [38], which differs from traditional methods based on a similarity matrix; this idea is applied to unsupervised feature selection. Feature selection methods based on structured sparsity are comprehensively studied in [39]. Inspired by the property that similar features share similar labels, this article assumes that there exists a latent feature structure that can be shared by both the original feature matrix and the label matrix. The proposed method not only characterizes the intrinsic structure of the feature matrix but also associates it with the label matrix.

Furthermore, the graph regularization technique is widely adopted in feature selection. For example, Hu et al. [40] proposed a dual-graph regularized multilabel feature selection method named robust multilabel feature selection based on dual-graph (DRMFS), which combines the label matrix and the feature matrix with the feature weight matrix to construct label graph regularization and feature graph regularization, respectively. Zhang et al. [13] employed a low-dimensional embedding matrix to construct graph regularization terms that exploit the local and global label correlations. These two feature selection methods employ the graph regularization technique to preserve the manifold structure of the feature data and the label data. In addition, Shang et al. [41] combined a graph regularization technique with subspace learning to devise a feature selection method named subspace learning-based graph regularized feature selection (SGFS), which obtains excellent performance. Furthermore, Nie et al. [42] proposed a new model that constructs the similarity matrix by assigning adaptive neighbors to each data point.
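A common thread in the embedded methods above (RFS, RALM-FS, MIFS, SGFS) is that features are ranked by the row norms of a weight matrix learned under an l_{2,1}-type penalty. The snippet below is a minimal NumPy sketch of that generic scoring step; it is an illustration under assumed names and is not tied to any one of the cited formulations.

```python
import numpy as np

def l21_norm(W):
    """l2,1-norm: sum of the l2-norms of the rows of W."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def rank_features(W, top_k=None):
    """Rank features by the l2-norm of the corresponding row of a d x k weight matrix W."""
    row_norms = np.linalg.norm(W, axis=1)
    order = np.argsort(-row_norms)          # descending importance
    return order if top_k is None else order[:top_k]

# toy usage with a random d x k weight matrix
rng = np.random.default_rng(1)
W = rng.normal(size=(10, 3))
print("||W||_{2,1} =", round(l21_norm(W), 3))
print("top-5 features:", rank_features(W, top_k=5))
```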


Afterward, the adaptive graph technique has been applied to many feature selection methods [43], [44]. In this article, we propose an LSS term; however, many candidate LSS terms can be extracted from the original feature matrix. As a result, we adopt the graph regularization technique to constrain the latent feature structure.

Inspired by previous methods, we conclude that three key factors matter in constructing multilabel feature selection methods: 1) exploring feature correlations and constructing the latent feature structure; 2) exploring label correlations; and 3) constructing relationships between the latent feature structure and the label structure, which is beneficial for multilabel feature selection.

IV. PROPOSED METHOD

Considering the latent feature structure, we decompose the feature matrix X ∈ R^{n×d} into two low-dimensional matrices V ∈ R^{n×k} and Q ∈ R^{d×k}. To minimize the reconstruction error, we obtain the following form:

\min_{V,Q} \|X - VQ^T\|_F^2    (6)

where V represents the latent feature structure of the feature matrix and Q denotes a coefficient matrix. Formula (6) indicates that the d-dimensional features are reduced to k-dimensional features, which can be interpreted as the original d features being clustered into k different clusters, where each cluster contains relevant features and different clusters are independent of each other. In this way, we obtain the latent feature structure of the feature matrix. In addition, each row of matrix Q represents the importance of the corresponding feature in these k latent feature variables.

Similarly, we employ a product of two low-dimensional matrices U ∈ R^{n×k} and M ∈ R^{k×c} to represent the label matrix in multilabel data, where U and M denote an arbitrary latent label structure and the corresponding coefficient matrix, respectively. Inspired by the property that similar features share similar labels, we replace matrix U with matrix V to share the correlations between labels and features; matrix V is called the LSS term. As a result, we obtain the following problem:

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2    (7)

where α is used to balance the contribution regarding the impact of the latent feature structure on label correlations and feature decomposition. However, a critical issue is that, although we intend to find a proper matrix V by minimizing Formula (7), multiple LSS terms exist. To obtain a proper LSS term, we employ a graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. The principle is that a closer correlation between two instances in the feature matrix X indicates a closer correlation between the two corresponding latent feature variables in the latent feature structure:

\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}\|V_{i\cdot} - V_{j\cdot}\|_2^2 = \mathrm{Tr}\big(V^T(A - S)V\big) = \mathrm{Tr}(V^T L V)    (8)

in which A is a diagonal matrix, S denotes a symmetric affinity matrix, and L = A - S denotes the graph Laplacian matrix. Following [45], we employ the heat kernel function to build the affinity matrix, which is defined as

S_{ij} = \begin{cases} e^{-\frac{\|X_{i\cdot} - X_{j\cdot}\|^2}{\sigma}}, & \text{if } X_{i\cdot} \in N_p(X_{j\cdot}) \text{ or } X_{j\cdot} \in N_p(X_{i\cdot}) \\ 0, & \text{otherwise} \end{cases}    (9)

where N_p(X_{i\cdot}) is the set of p-nearest neighbors of instance X_{i\cdot} and σ is the parameter of the heat kernel.
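To make the graph construction concrete, here is a small NumPy sketch of the p-nearest-neighbor heat-kernel affinity matrix in (9) together with the degree matrix A and Laplacian L = A - S used in (8). The function name and data are illustrative; p = 5 and σ = 1 follow the settings reported later in the experiments.

```python
import numpy as np

def build_graph(X, p=5, sigma=1.0):
    """Heat-kernel affinity S as in (9), degree matrix A, and Laplacian L = A - S as in (8)."""
    n = X.shape[0]
    # pairwise squared Euclidean distances between instances (rows of X)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # p-nearest neighbors of each instance (column 0 is the instance itself)
    neighbors = np.argsort(sq_dist, axis=1)[:, 1:p + 1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            w = np.exp(-sq_dist[i, j] / sigma)
            S[i, j] = w
            S[j, i] = w          # symmetric: i in N_p(j) or j in N_p(i)
    A = np.diag(S.sum(axis=1))   # degree (diagonal) matrix
    L = A - S                    # graph Laplacian
    return S, A, L

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
S, A, L = build_graph(X, p=3)
print(np.allclose(L, L.T), S.shape)
```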
Integrating the above terms into one multilabel feature selection model, we obtain the following objective function:

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1}    (10)

where β and γ are two parameters that measure the contribution of the local geometry structure and control the sparsity of the objective function, respectively. The term \|Q\|_{2,1} achieves feature selection via the l_{2,1}-norm. In objective function (10), matrix V reflects both the latent feature structure and the latent label structure; in addition, matrix V is defined as the LSS term to achieve the goal that similar features share similar labels. Finally, to further enhance the row-sparsity property, we impose a nonnegative constraint on the coefficient matrix Q. We also impose nonnegative constraints on V and M to facilitate the subsequent optimization algorithm. The final objective function, including constraints, is reformulated as

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1} \quad \text{s.t. } V, M, Q \geq 0.    (11)

V. OPTIMIZATION OF THE SSFS SCHEME

A. Proposed Scheme of SSFS

The objective function (11) is jointly nonconvex; as a result, we cannot obtain the global minimum. In addition, the objective function (11) is nonsmooth owing to the l_{2,1}-norm. To this end, we design three iterative rules to solve the objective function, achieving a local minimum. First, \|Q\|_{2,1} can be relaxed as 2\mathrm{Tr}(Q^T D Q), where D ∈ R^{d×d} is a diagonal matrix with the ith diagonal element D_{ii} = 1/(2\|Q_{i\cdot}\|_2 + \epsilon) and ε is a small positive constant. The objective function (11) can then be rewritten as

\Theta(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha \mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta \mathrm{Tr}(V^T L V) + 2\gamma \mathrm{Tr}(Q^T D Q).    (12)

Furthermore, we integrate the nonnegative constraints into Formula (12), obtaining the Lagrangian

L(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha \mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta \mathrm{Tr}(V^T L V) + 2\gamma \mathrm{Tr}(Q^T D Q) - \mathrm{Tr}(\Lambda V^T) - \mathrm{Tr}(\Psi M^T) - \mathrm{Tr}(\Phi Q^T)    (13)


where Λ ∈ R^{n×k}, Ψ ∈ R^{k×c}, and Φ ∈ R^{d×k} denote three Lagrange multipliers. The partial derivatives of L with respect to the variables V, M, and Q are

\frac{\partial L}{\partial V} = -2XQ + 2VQ^TQ - 2\alpha YM^T + 2\alpha VMM^T + 2\beta LV - \Lambda    (14)

\frac{\partial L}{\partial M} = -2\alpha V^TY + 2\alpha V^TVM - \Psi    (15)

\frac{\partial L}{\partial Q} = -2X^TV + 2QV^TV + 2\gamma DQ - \Phi.    (16)

According to the Karush–Kuhn–Tucker conditions, i.e., \Lambda_{ij}V_{ij} = 0, \Psi_{ij}M_{ij} = 0, and \Phi_{ij}Q_{ij} = 0, we obtain

(-XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta LV)_{ij} V_{ij} = 0    (17)

(-\alpha V^TY + \alpha V^TVM)_{ij} M_{ij} = 0    (18)

(-X^TV + QV^TV + \gamma DQ)_{ij} Q_{ij} = 0.    (19)

Since L = A - S, Formula (17) can be written as

(-XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta(A - S)V)_{ij} V_{ij} = 0.    (20)

As a result, the update rules for V, M, and Q are as follows:

V_{ij}^{t+1} \leftarrow V_{ij}^{t} \frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}    (21)

M_{ij}^{t+1} \leftarrow M_{ij}^{t} \frac{(V^TY)_{ij}}{(V^TVM)_{ij}}    (22)

Q_{ij}^{t+1} \leftarrow Q_{ij}^{t} \frac{(X^TV)_{ij}}{(QV^TV + \gamma DQ)_{ij}}    (23)

where t indicates the iteration number. In addition, to avoid zero elements in the denominators during the feature selection process, we add a very small constant to each denominator. We present the executive process in Algorithm 1.

Algorithm 1 SSFS
Input: Data matrix X ∈ R^{n×d} and label matrix Y ∈ R^{n×c}; regularization parameters α, β, and γ.
Output: The top-k selected features according to the SSFS method.
1: Initialize V ∈ R^{n×k}, M ∈ R^{k×c}, and Q ∈ R^{d×k} randomly; t = 0;
2: Compute the degree matrix A and the affinity matrix S of the data matrix X according to Formula (9);
3: Repeat:
4:   Update V, M, and Q by Formulas (21), (22), and (23);
5:   Update D;
6:   t = t + 1;
7: Until convergence;
8: Return Q;
9: Return features ranked according to ||Q_{i·}||_2.
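The following Python sketch mirrors Algorithm 1 with the multiplicative updates (21)–(23). It assumes the graph matrices S and A have already been built from X as in (9) (for example, with a helper such as the build_graph sketch above), replaces the convergence test with a fixed iteration budget, and is an illustration under these assumptions rather than the authors' reference implementation.

```python
import numpy as np

def ssfs(X, Y, S, A, k=10, alpha=0.5, beta=0.5, gamma=0.5,
         n_iter=100, eps=1e-12):
    """Sketch of Algorithm 1 (SSFS): returns Q and a feature ranking by ||Q_i.||_2."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    V = rng.random((n, k))
    M = rng.random((k, c))
    Q = rng.random((d, k))
    for _ in range(n_iter):
        # diagonal matrix D from the l2,1-norm relaxation of ||Q||_{2,1}
        D = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + eps))
        # update V by (21)
        V *= (X @ Q + alpha * Y @ M.T + beta * S @ V) / \
             (V @ Q.T @ Q + alpha * V @ M @ M.T + beta * A @ V + eps)
        # update M by (22)
        M *= (V.T @ Y) / (V.T @ V @ M + eps)
        # update Q by (23)
        Q *= (X.T @ V) / (Q @ V.T @ V + gamma * D @ Q + eps)
    ranking = np.argsort(-np.linalg.norm(Q, axis=1))  # features by ||Q_i.||_2
    return Q, ranking

# toy usage (assuming S, A were built from X as in (9)):
# Q, ranking = ssfs(X, Y, S, A, k=5)
# top_features = ranking[: int(0.2 * X.shape[1])]   # e.g., keep the top 20% of features
```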
B. Proof of Convergence

Gradient descent is a well-known optimization method; here, we take the variable V as an example:

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \eta \frac{\partial \Theta}{\partial V_{ij}^{t}}    (24)

where η is the learning rate of the gradient descent method, a small positive constant. Here, we set η as

\eta = \frac{V_{ij}^{t}}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (25)

Replacing η in Formula (24) with Formula (25), we obtain

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \frac{V_{ij}^{t}}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}} \frac{\partial \Theta}{\partial V_{ij}^{t}} \;\Leftrightarrow\; V_{ij}^{t+1} \leftarrow V_{ij}^{t} \frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (26)

Then, we present the proof of convergence of Formula (11) under Formulas (21)–(23). First, we introduce the concept of an auxiliary function, which is also used in [40], [46].

Definition 1: The function J is regarded as an auxiliary function of K when J(V_{ij}, V'_{ij}) \geq K(V_{ij}) and J(V_{ij}, V_{ij}) = K(V_{ij}) are both satisfied.

Lemma 1: The function K is nonincreasing under the following update:

V_{ij}^{t+1} = \arg\min_{V_{ij}} J(V_{ij}, V_{ij}^{t}).    (27)

Proof of Lemma 1: Based on Definition 1 and Lemma 1, we can easily obtain

K(V_{ij}^{t+1}) \leq J(V_{ij}^{t+1}, V_{ij}^{t}) \leq J(V_{ij}^{t}, V_{ij}^{t}) = K(V_{ij}^{t}).    (28)

Next, we show that the function (11) is nonincreasing under the update rule (21) with a proper auxiliary function. Since the update rules operate element by element, we employ K_{ij} to represent the part of Θ(V) that is related to V_{ij}. Taking the first-order and second-order partial derivatives of Θ(V) with respect to V, the following formulas are obtained:

K'_{ij} = (-2XQ + 2VQ^TQ - 2\alpha YM^T + 2\alpha VMM^T + 2\beta LV)_{ij}, \qquad K''_{ij} = 2(Q^TQ)_{jj} + 2\alpha(MM^T)_{jj} + 2\beta L_{ii}.    (29)

Following [40], [47], we define the following auxiliary function of K_{ij}(V_{ij}):

J(V_{ij}, V_{ij}^{t}) = K_{ij}(V_{ij}^{t}) + K'_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t}) + \frac{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}{V_{ij}^{t}}(V_{ij} - V_{ij}^{t})^2.    (30)


This is compared with the second-order Taylor expansion of K_{ij}(V_{ij}):

K_{ij}(V_{ij}) = K_{ij}(V_{ij}^{t}) + K'_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t}) + \frac{1}{2}K''_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t})^2.    (31)

Proof: In Formula (30), when V_{ij} = V_{ij}^{t}, then J(V_{ij}, V_{ij}^{t}) = K_{ij}(V_{ij}^{t}). We next prove the condition J(V_{ij}, V_{ij}^{t}) \geq K_{ij}(V_{ij}). Comparing Formulas (30) and (31), we need to prove

\frac{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}{V_{ij}^{t}} \geq (Q^TQ)_{jj} + \alpha(MM^T)_{jj} + \beta L_{ii}.    (32)

Obviously, we obtain

(VQ^TQ + \alpha VMM^T)_{ij} = \sum_{l=1}^{k} V_{il}^{t}(Q^TQ)_{lj} + \alpha\sum_{l=1}^{k} V_{il}^{t}(MM^T)_{lj} \geq V_{ij}^{t}(Q^TQ)_{jj} + \alpha V_{ij}^{t}(MM^T)_{jj}.    (33)

Similarly,

\beta(AV)_{ij} = \beta\sum_{l=1}^{n} A_{il}V_{lj}^{t} \geq \beta A_{ii}V_{ij}^{t} \geq \beta(A - S)_{ii}V_{ij}^{t} = \beta L_{ii}V_{ij}^{t}.    (34)

As a result, Formula (32) holds, and J(V_{ij}, V_{ij}^{t}) is an auxiliary function of K_{ij}(V_{ij}). Substituting Formula (30) into Formula (27) gives

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \frac{K'_{ij}(V_{ij}^{t})}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}} \;\Leftrightarrow\; V_{ij}^{t+1} \leftarrow V_{ij}^{t}\frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (35)

Based on the proof above, the objective function (11) is nonincreasing under the update rule (21). Similarly, the objective function (11) is nonincreasing under the update rules for the other two variables M and Q.

VI. EXPERIMENTAL STUDY

A. Experimental Settings

To ensure comprehensiveness and competitiveness, two problem transformation methods (PPT + MI and PPT + CHI) and six algorithm adaption methods (MDMR, SCLS, MIFS, RALM-FS, LRFS, and GMM) are compared with SSFS in terms of Macro-F1 and Micro-F1 on the SVM classifier and Hamming loss (HL) on the ML-kNN classifier (k = 10). All the datasets are already divided into training and test sets according to the recommendation of the Mulan Library [48]. We present the details of these datasets in Table I. All experiments are executed on an Intel Core CPU with a 3.40-GHz processing speed and 16-GB main memory.

TABLE I: DESCRIPTION OF EXPERIMENTAL DATASETS

We describe the datasets from multiple aspects, including the common characteristics, i.e., the number of features, the number of instances, and the number of labels. In addition, Card is the average number of labels per instance, while Den is the label cardinality divided by the total number of labels.

In the experiments, several parameters have to be set in advance. First, we obtain the best parameter C of the SVM in {10^{-4}, 10^{-3}, ..., 10^{3}, 10^{4}}; then, α, β, and γ in function (11), as well as the corresponding parameters of the compared methods, are tuned in the same grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. All the best parameters are obtained via fivefold cross-validation experiments. In addition, the two parameters p and σ in Formula (9) are set to 5 and 1, respectively.

In the experiments, we adopt three popular evaluation criteria (Macro-F1, Micro-F1, and HL) to assess the proposed method SSFS and the compared methods. Macro-F1 and Micro-F1 are defined as follows:

\text{Macro-F1} = \frac{1}{c}\sum_{i=1}^{c}\frac{2\,\mathrm{TP}_i}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i}    (36)

\text{Micro-F1} = \frac{\sum_{i=1}^{c} 2\,\mathrm{TP}_i}{\sum_{i=1}^{c}\big(2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i\big)}    (37)

where c is the number of labels, TP is the number of true positives, FP denotes the number of false positives, and FN represents the number of false negatives. Higher Macro-F1 and Micro-F1 indicate better classification performance. Furthermore, we also adopt another popular evaluation criterion, HL, which is denoted as

\mathrm{HL} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{c}\,|Y'_i \oplus Y_i|    (38)

where n represents the number of instances, Y' is the predicted label set, Y is the original label set, and ⊕ denotes the XOR operation between Y' and Y. The lower the HL, the better the classification performance.

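As a sanity check on these definitions, the snippet below computes Macro-F1, Micro-F1, and HL for binary label matrices with scikit-learn. It assumes Y_true and Y_pred are 0/1 arrays of shape (n, c) and is only meant to mirror (36)–(38), not to reproduce the experimental pipeline of the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

def evaluate(Y_true, Y_pred):
    """Macro-F1 (36), Micro-F1 (37), and Hamming loss HL (38) for 0/1 label matrices."""
    macro = f1_score(Y_true, Y_pred, average='macro', zero_division=0)
    micro = f1_score(Y_true, Y_pred, average='micro', zero_division=0)
    hl = hamming_loss(Y_true, Y_pred)   # average fraction of mispredicted labels (XOR)
    return macro, micro, hl

# toy usage
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
print(evaluate(Y_true, Y_pred))
```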
B. Experimental Results

We employ the proposed method and the compared methods to extract the top 20% of the total number of features on each dataset. The average classification results and the standard deviations of each method are recorded in Tables II–IV.

TABLE II: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE SVM CLASSIFIER IN TERMS OF MACRO-F1 (MEAN ± STD)

TABLE III: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE SVM CLASSIFIER IN TERMS OF MICRO-F1 (MEAN ± STD)

TABLE IV: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE ML-KNN CLASSIFIER IN TERMS OF HL (MEAN ± STD)

Fig. 2. Nine methods on four benchmark datasets on the SVM classifier in terms of Macro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.

Observing Tables II–IV, we find that the proposed method outperforms the compared methods in terms of the average results on all the evaluation criteria. Specifically, in Table II, SSFS outperforms the other compared methods on all the datasets on the SVM classifier in terms of Macro-F1. Moreover, the improvement rates in Macro-F1 relative to PPT + MI, PPT + CHI, MDMR, SCLS, MIFS, RALM-FS, LRFS, and
GMM are 57.1%, 66%, 49.2%, 114.6%, 66%, 72.5%, 39.7%, and 183.9%, respectively. Similarly, SSFS achieves the best average results in terms of Micro-F1. Regarding HL, we retain more digits to verify the classification superiority of the proposed method; SSFS still achieves the best average results among all nine multilabel feature selection methods. Overall, the proposed method SSFS outperforms the eight compared methods in terms of these popular evaluation criteria.

To show the classification performance clearly, we choose four datasets (Arts, Education, Science, and Social) to present the three evaluation metrics in Figs. 2–4. The horizontal axis represents the number of already-selected features, whereas the vertical axis denotes each evaluation metric, and different colors represent different feature selection methods. As can be seen in Figs. 2–4, the proposed method achieves the best performance in terms of the three metrics in comparison with the eight compared methods. Furthermore, different methods achieve the second-best classification performance, such as the MIFS method on Arts in terms of Macro-F1 and the RALM-FS method on Education in terms of Micro-F1, which indicates that the proposed method is reliable.

In general, the proposed method SSFS achieves the best classification performance among all the methods in terms of Macro-F1, Micro-F1, and HL.


Fig. 3. Nine methods on four benchmark datasets on the SVM classifier in terms of Micro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 4. Nine methods on four benchmark datasets on the ML-kNN classifier in terms of HL. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 5. Macro-F1 of the SSFS method on the Arts dataset w.r.t. α, β, γ, and k on the SVM classifier. (a) α. (b) β. (c) γ. (d) k.

Fig. 6. Convergence curves: (a) MIFS, (b) RALM-FS, and (c) SSFS.

C. Parameter Sensitivity Analysis

There are four parameters (α, β, γ, and k) in the objective function. The parameters α, β, and γ measure the contribution of each term in the objective function, and k is the dimension of the LSS term. To study the sensitivity of each parameter in the feature selection process, we adjust one parameter while fixing the other three, searching the grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} on the Arts dataset; the fixed parameters are set to 0.5, a setting also adopted in [26], [40]. We test the sensitivity of the four parameters on the SVM classifier in terms of Macro-F1 on the Arts dataset in Fig. 5. Observing Fig. 5, we find that the classification performance is not very sensitive to these four parameters.

D. Convergence and Complexity Analysis

Furthermore, we conduct experiments on four datasets (Arts, Education, Science, and Social) to present the convergence of SSFS and the compared methods in Fig. 6. Fig. 6(a)–(c) shows the convergence curves of MIFS, RALM-FS, and SSFS, respectively. The SSFS method converges before 30 iterations and, in fact, converges within a few iterations on all four datasets.

TABLE V: TIME COMPLEXITY OF METHODS

Afterward, we present the time complexity of SSFS and the other algorithm adaption methods (MIFS, MDMR, SCLS, LRFS, GMM, and RALM-FS) in Table V. Suppose that a is the number of already-selected features, d denotes the number of features, n indicates the number of instances, c represents the number of labels, and l is the cluster number of labels in the MIFS method.


TABLE VI: RUNNING TIME (SECONDS) OF DIFFERENT METHODS

As shown in Table V, the information-theoretic methods have lower computational complexity than the embedded methods because the former consider only low-order feature correlations or low-order label correlations, whereas the latter take high-order correlations into account in the design of the methods. However, computing the information-theoretic metrics themselves is time-consuming. As a result, the running time of the proposed method and the compared methods is reported in Table VI.

Table VI presents the running time (in seconds) of the proposed method and the compared methods. PPT + MI costs the least running time on the benchmark datasets because it belongs to the first-order problem transformation algorithms that do not consider label correlations. RALM-FS outperforms our method in terms of running time because RALM-FS converges quickly, as shown in Fig. 6; however, SSFS takes the same order of magnitude of running time as RALM-FS. Importantly, the proposed method SSFS achieves the best classification performance on the benchmark datasets in terms of multiple metrics.

E. Applications of Feature Selection

We are working on a project that searches for latent synthetic attributes in aluminophosphate zeolites (AlPOs). It is important to distinguish the category of AlPOs; however, the existing synthetic attributes cannot distinguish all AlPOs correctly. Chemists consider that there exist latent synthetic attributes that have not been found in their experiments. We employ the proposed method to discover latent synthetic attributes and to rank the importance of attributes. In addition, gene datasets are commonly high-dimensional, and numerous irrelevant and redundant features confuse researchers. Feature selection removes useless features while retaining relevant ones. Furthermore, feature selection does not change the original meanings of the features, which is critical for scientists.

VII. CONCLUSION AND FUTURE RESEARCH

Numerous features and large-scale labels not only increase the time cost of learning models but also decrease the classification performance of multilabel classification. Here, we propose an LSS term that shares and preserves both the latent feature structure and the latent label structure. In addition, the graph regularization technique is incorporated into the proposed objective function, named SSFS. To demonstrate the effectiveness and efficiency of SSFS, we conduct various experiments on benchmark datasets, and the experimental results verify the excellent performance of SSFS. In future work, we will continue to study feature selection techniques for multilabel classification and intend to apply our research in real projects.

REFERENCES

[1] S. Burkhardt and S. Kramer, “Online multi-label dependency topic models for text classification,” Mach. Learn., vol. 107, no. 5, pp. 859–886, May 2018.
[2] B. Al-Salemi, M. Ayob, and S.-A.-M. Noah, “Feature ranking for enhancing boosting-based multi-label text categorization,” Expert Syst. Appl., vol. 113, pp. 531–543, Dec. 2018.
[3] F. Gargiulo, S. Silvestri, M. Ciampi, and G. D. Pietro, “Deep neural network for hierarchical extreme multi-label text classification,” Appl. Soft Comput., vol. 79, pp. 125–138, Jun. 2019.
[4] J. Zhang, Z. Zhang, Z. Wang, Y. Liu, and L. Deng, “Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification,” Bioinformatics, vol. 34, no. 10, pp. 1750–1757, May 2018.
[5] Y. Ren, G. Zhang, G. Yu, and X. Li, “Local and global structure preserving based feature selection,” Neurocomputing, vol. 89, pp. 147–157, Jul. 2012.
[6] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014.
[7] E. Gibaja and S. Ventura, “A tutorial on multilabel learning,” ACM Comput. Surv., vol. 47, no. 3, p. 52, 2015.
[8] M.-L. Zhang and L. Wu, “Lift: Multi-label learning with label-specific features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 107–120, Jan. 2015.
[9] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, Sep. 2004.
[10] J. Fürnkranz, E. Hüllermeier, E. L. Mencía, and K. Brinker, “Multilabel classification via calibrated label ranking,” Mach. Learn., vol. 73, no. 2, pp. 133–153, Nov. 2008.
[11] J. Huang, G. Li, Q. Huang, and X. Wu, “Learning label-specific features and class-dependent labels for multi-label classification,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 12, pp. 3309–3323, Dec. 2016.
[12] Z. Sun et al., “Mutual information based multi-label feature selection via constrained convex optimization,” Neurocomputing, vol. 329, pp. 447–456, Feb. 2019.
[13] J. Zhang, Z. Luo, C. Li, C. Zhou, and S. Li, “Manifold regularized discriminative feature selection for multi-label learning,” Pattern Recognit., vol. 95, pp. 136–150, Nov. 2019.
[14] Y. Lin, Q. Hu, J. Zhang, and X. Wu, “Multi-label feature selection with streaming labels,” Inf. Sci., vol. 372, pp. 256–275, Dec. 2016.
[15] L. Hu, Y. Li, W. Gao, P. Zhang, and J. Hu, “Multi-label feature selection with shared common mode,” Pattern Recognit., vol. 104, Aug. 2020, Art. no. 107344.
[16] Y. Xu, J. Wang, S. An, J. Wei, and J. Ruan, “Semi-supervised multi-label feature selection by preserving feature-label space consistency,” in Proc. 27th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2018, pp. 783–792.
[17] J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. New Zealand Comput. Sci. Res. Student Conf., vol. 143150, Apr. 2008, p. 41.
[18] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in Proc. Eur. Conf. Mach. Learn., 2007, pp. 406–417.
[19] G. Doquire and M. Verleysen, “Feature selection for multi-label classification problems,” in Proc. Int. Work-Conf. Artif. Neural Netw., 2011, pp. 9–16.
[20] Y. Lin, Q. Hu, J. Liu, and J. Duan, “Multi-label feature selection based on max-dependency and min-redundancy,” Neurocomputing, vol. 168, pp. 92–103, Nov. 2015.


[21] J. Lee and D.-W. Kim, “SCLS: Multi-label feature selection based on scalable criterion for large label set,” Pattern Recognit., vol. 66, pp. 342–352, Jun. 2017.
[22] P. Zhang, G. X. Liu, and W. F. Gao, “Distinguishing two types of labels for multi-label feature selection,” Pattern Recognit., vol. 95, pp. 72–82, Nov. 2019.
[23] J. Gonzalez-Lopez, S. Ventura, and A. Cano, “Distributed multi-label feature selection using individual mutual information measures,” Knowl.-Based Syst., vol. 188, Jan. 2020, Art. no. 105052.
[24] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognit. Lett., vol. 34, no. 3, pp. 349–357, 2013.
[25] J. Lee and D.-W. Kim, “Mutual information-based multi-label feature selection using interaction information,” Expert Syst. Appl., vol. 42, no. 4, pp. 2013–2025, 2015.
[26] L. Jian, J. Li, K. Shu, and H. Liu, “Multi-label informed feature selection,” in Proc. IJCAI, 2016, pp. 1627–1633.
[27] S. Huang and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in Proc. AAAI Conf. Artif. Intell., 2012, pp. 949–955.
[28] Y. Zhu, J. T. Kwok, and Z.-H. Zhou, “Multi-label learning with global and local label correlation,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 6, pp. 1081–1094, Jun. 2018.
[29] P. Hou, X. Geng, and M. Zhang, “Multi-label manifold learning,” in Proc. AAAI Conf. Artif. Intell., 2016, pp. 1680–1686.
[30] J. Huang, G. Li, Q. Huang, and X. Wu, “Learning label specific features for multi-label classification,” in Proc. IEEE Int. Conf. Data Mining, Nov. 2015, pp. 181–190.
[31] R. Shao, N. Xu, and X. Geng, “Multi-label learning with label enhancement,” in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 437–446.
[32] F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint l2,1-norms minimization,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1813–1821.
[33] X. Cai, F. Nie, and H. Huang, “Exact top-k feature selection via l2,0-norm constraint,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 1240–1246.
[34] Z. Ma, F. Nie, Y. Yang, J. R. R. Uijlings, and N. Sebe, “Web image annotation via subspace-sparsity collaborated feature selection,” IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1021–1030, Aug. 2012.
[35] X. Chang, F. Nie, Y. Yang, and H. Huang, “A convex formulation for semi-supervised multi-label feature selection,” in Proc. AAAI Conf. Artif. Intell., vol. 2, 2014, pp. 1171–1177.
[36] X. Chang, H. Shen, S. Wang, J. Liu, and L. Xue, “Semi-supervised feature analysis for multimedia annotation by mining label correlation,” in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining, 2014, pp. 74–85.
[37] M. Luo, F. Nie, X. Chang, Y. Yang, A. G. Hauptmann, and Q. Zheng, “Adaptive unsupervised feature selection with structure regularization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 944–956, Apr. 2017.
[38] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[39] J. Gui, Z. Sun, S. Ji, D. Tao, and T. Tan, “Feature selection based on structured sparsity: A comprehensive study,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1490–1507, Jul. 2017.
[40] J. Hu, Y. Li, W. Gao, and P. Zhang, “Robust multi-label feature selection with dual-graph regularization,” Knowl.-Based Syst., vol. 203, Sep. 2020, Art. no. 106126.
[41] R. Shang, W. Wang, R. Stolkin, and L. Jiao, “Subspace learning-based graph regularized feature selection,” Knowl.-Based Syst., vol. 112, pp. 152–165, Nov. 2016.
[42] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2014, pp. 977–986.
[43] X. Li, H. Zhang, R. Zhang, Y. Liu, and F. Nie, “Generalized uncorrelated regression with adaptive graph for unsupervised feature selection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1587–1595, May 2019.
[44] F. Nie, W. Zhu, and X. Li, “Unsupervised feature selection with structured graph optimization,” in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1302–1308.
[45] D. Cai, C. Zhang, and X. He, “Unsupervised feature selection for multi-cluster data,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2010, pp. 333–342.
[46] R. Shang, W. Wang, R. Stolkin, and L. Jiao, “Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection,” IEEE Trans. Cybern., vol. 48, no. 2, pp. 793–806, Feb. 2018.
[47] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[48] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, “Mulan: A Java library for multi-label learning,” J. Mach. Learn. Res., vol. 12, pp. 2411–2414, Jun. 2011.

Wanfu Gao was born in Jilin, China, in 1990. He received the B.S. and Ph.D. degrees from the College of Computer Science, Jilin University, Changchun, China, in 2013 and 2019, respectively. He is currently a Lecturer with the College of Computer Science, Jilin University, and is doing post-doctoral research at the College of Chemistry, Jilin University. His research interests include feature selection, multilabel learning, and information theory. Dr. Gao received the 2019 ACM Changchun Doctoral Dissertation Award and the Postdoctoral Innovative Talents Support Program.

Yonghao Li was born in Henan, China, in 1992. He received the B.E. degree from Henan Agricultural University, Zhengzhou, China, in 2017, and the M.S. degree in computer science from Jilin University, Changchun, China, in 2020, where he is currently pursuing the Ph.D. degree with the College of Computer Science. His research focuses on feature selection.

Liang Hu was born in Jilin, China, in 1968. He received the M.S. and Ph.D. degrees in computer science from Jilin University, Changchun, China, in 1993 and 1999, respectively. He is currently a Professor and a Doctoral Supervisor at the College of Computer Science and Technology, Jilin University. He is supported by the Hundred-Thousand-Ten Thousand Project of China. His research interests include feature selection, multilabel learning, and information theory. Dr. Hu is a member of the China Computer Federation.
