Multilabel feature selection with constrained latent structure shared term
Abstract— High-dimensional multilabel data have increasingly emerged in many application areas, suffering from two noteworthy issues: instances with high-dimensional features and large-scale labels. Multilabel feature selection methods are widely studied to address these issues. Previous multilabel feature selection methods focus on exploring label correlations to guide the feature selection process, ignoring the impact of latent feature structure on label correlations. In addition, one encouraging property regarding correlations between features and labels is that similar features tend to share similar labels. To this end, a latent structure shared (LSS) term is designed, which shares and preserves both latent feature structure and latent label structure. Furthermore, we employ the graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. Finally, we derive the shared latent feature and label structure feature selection (SSFS) method based on the constrained LSS term, and an effective optimization scheme with provable convergence is proposed to solve the SSFS method. Better experimental results on benchmark datasets are achieved in terms of multiple evaluation criteria.

Index Terms— Feature selection, graph regularization, latent structure, multilabel data.

I. INTRODUCTION

The strategy that explores label correlations of multilabel data is named the problem transformation method, which can be divided into three groups based on label correlations: the first-order approach, the second-order approach, and the high-order approach [6]–[8]. The first-order approaches consider each label to be independent of the others; in other words, they ignore label correlations. The second-order approaches take pairwise label correlations into account, such as the ranking between a relevant label and an irrelevant label. The high-order approaches tackle multilabel data by considering high-order relations among labels. Some representative problem transformation methods are binary relevance (BR) [9], calibrated label ranking (CLR) [10], and label-specific features-dependent labels (LLSF-DL) [11], corresponding to the first-order, second-order, and high-order approaches, respectively. These three classes of approaches first transform multilabel data into single-label data, i.e., they transform multilabel data to fit subsequent single-label algorithms. Another strategy is to transform algorithms to fit multilabel data, which is named the algorithm adaptation method. Algorithm adaptation methods employ matrix transformation tricks or metrics of information theory to exploit label correlations.
In addition, Gonzalez-Lopez et al. [23] design a geometric mean maximization (GMM) method to discuss how to aggregate the mutual information of multiple labels. GMM chooses the optimal feature subset with the largest geometric mean. The geometric mean considers the product of the measures and then takes their root.

Beyond the methods mentioned above, a large number of multilabel feature selection methods based on information theory have been proposed [24], [25]. The common characteristic of the methods based on information theory is that they pay much attention to feature correlations. Due to the computational limitations of high-order feature correlations, existing methods adopt low-order information-theoretic terms to exploit feature correlations. For example, the LRFS method employs mutual information between the candidate feature and each already-selected feature to represent feature redundancy. The MDMR method adopts conditional mutual information between the candidate feature and the class given the already-selected features to represent interactive correlations among features and each label. Furthermore, previous information-theoretic methods adopt mutual information or conditional mutual information to calculate the correlation between each candidate feature and each class label, which ignores high-order interactive correlations between features and labels. As a result, previous methods depend heavily on the importance of individual features or labels.

On the contrary, embedded multilabel feature selection methods focus on exploiting label correlations, employing label correlations to select a compact feature subset. For instance, Jian et al. [26] employed the matrix factorization technique to extract latent semantics of label information, with the latent semantics of label information fitting the feature matrix of multilabel data. As a result, a feature weight matrix reflecting the importance of features is obtained. Finally, the critical features are selected according to the feature weight matrix. In addition, multilabel informed feature selection (MIFS) adopts the graph Laplacian matrix to ensure a local geometry structure between the feature matrix and the latent semantics of label information. The objective function of the MIFS method is presented as follows:

$$\min_{W,V,B} \|XW - V\|_F^2 + \alpha\|Y - VB\|_F^2 + \beta\,\mathrm{Tr}(V^T L V) + \gamma\|W\|_{2,1} \tag{4}$$

where W ∈ R^{d×k}, V ∈ R^{n×k}, and B ∈ R^{k×c} represent the feature weight matrix, the latent semantics of multilabel information, and the basis matrix, respectively, L ∈ R^{n×n} is the Laplacian matrix, and α, β, and γ are three regularization parameters of the MIFS model.
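As an illustration of how objective (4) is assembled from these terms, the following minimal NumPy sketch evaluates it for given matrices; the function name, the random toy data, and the identity placeholder for L are assumptions made here for illustration, not part of MIFS itself.

```python
import numpy as np

def mifs_objective(X, Y, W, V, B, L, alpha, beta, gamma):
    """Evaluate the MIFS objective (4): reconstruction, label fitting,
    graph regularization, and l2,1-norm sparsity on W."""
    recon = np.linalg.norm(X @ W - V, 'fro') ** 2        # ||XW - V||_F^2
    label_fit = np.linalg.norm(Y - V @ B, 'fro') ** 2    # ||Y - VB||_F^2
    graph_reg = np.trace(V.T @ L @ V)                    # Tr(V^T L V)
    l21 = np.sum(np.sqrt(np.sum(W ** 2, axis=1)))        # ||W||_{2,1}
    return recon + alpha * label_fit + beta * graph_reg + gamma * l21

# Toy example with the dimensions used in the text (n instances, d features,
# c labels, k latent factors); values are random and purely illustrative.
n, d, c, k = 50, 20, 5, 8
rng = np.random.default_rng(0)
X, Y = rng.random((n, d)), rng.integers(0, 2, (n, c))
W, V, B = rng.random((d, k)), rng.random((n, k)), rng.random((k, c))
L = np.eye(n)  # placeholder Laplacian; a real one comes from an affinity graph
print(mifs_objective(X, Y, W, V, B, L, alpha=0.5, beta=0.5, gamma=0.5))
```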
Besides, multilabel learning using local correlation (ML-LOC) tries to exploit label correlations in the data locally [27]. Multilabel learning with global and local correlation (GLOCAL) explores both global and local label correlations [28]. Multilabel manifold learning (ML) attempts to explore the manifold in the label space for multilabel learning [29]. Learning label-specific features (LLSF) assumes that each label is related to a feature subset and that two similar labels share more features than two dissimilar labels [30]. Label-enhanced multilabel learning (LEMLL) replaces logical labels with real-valued labels to enhance label information [31].

Imposing the l2,1-norm on both the loss function and the regularization term, the well-known embedded feature selection method RFS selects features across all data points with joint sparsity [32]. Furthermore, Cai et al. [33] imposed the l2,1-norm on the loss function and the l2,0-norm on the feature weight matrix, proposing a robust feature selection method named RALM-FS that is regarded as an improved version of RFS. It is denoted by

$$\min_{W,b} \|Y - X^T W - \mathbf{1}b^T\|_{2,1} \quad \text{s.t.} \quad \|W\|_{2,0} = a \tag{5}$$

where b and 1 represent the bias term and the all-one vector, respectively, and a is the number of selected features. Besides, Ma et al. [34] proposed the subfeature uncovering with sparsity (SFUS) method, which jointly selects features by a sparse regularization and uncovers the shared subspace of the original features. Chang et al. [35] proposed a convex semisupervised multilabel feature selection method to address large-scale datasets, which can be handled by a fast iterative algorithm. In addition, Chang et al. [36] proposed a semisupervised shared-subspace multilabel feature selection method, which integrates semisupervised learning and label correlation mining into one unified framework. Furthermore, Luo et al. [37] preserved the intrinsic structure of the feature matrix by locally linear embedding (LLE) [38], which is different from traditional methods that are based on a similarity matrix; this idea is applied to unsupervised feature selection. Feature selection methods based on structured sparsity are comprehensively studied in the literature [39]. Inspired by the property that similar features share similar labels, this article assumes that there exists a latent feature structure that can be shared by both the original feature matrix and the label matrix. The proposed method not only characterizes the intrinsic structure of the feature matrix but also associates it with the label matrix.

Furthermore, the graph regularization technique is widely adopted in the feature selection process. For example, Hu et al. [40] proposed a dual-graph regularized multilabel feature selection method named robust multilabel feature selection based on dual-graph (DRMFS), which combines the label matrix and the feature matrix with the feature weight matrix to construct label graph regularization and feature graph regularization, respectively. Zhang et al. [13] employed a low-dimensional embedding matrix to construct graph regularization terms that exploit local and global label correlations. These two feature selection methods employ the graph regularization technique to preserve the manifold structure of feature data and label data. In addition, Shang et al. [41] combined a graph regularization technique with subspace learning to devise a feature selection method named subspace learning-based graph regularized feature selection (SGFS), which obtains excellent performance. Furthermore, Nie et al. [42] proposed a new model to construct the similarity matrix by assigning adaptive neighbors to each data point.
Afterward, the adaptive graph technique has been applied to many feature selection methods [43], [44]. In this article, we propose an LSS term; however, many LSS terms can be extracted from the original feature matrix. As a result, we adopt the graph regularization technique to constrain the latent feature structure.

Inspired by previous methods, we identify three key factors for constructing multilabel feature selection methods: 1) exploring feature correlations and constructing a latent feature structure; 2) exploring label correlations; and 3) constructing relationships between the latent feature structure and the label structure, which is beneficial for multilabel feature selection.

IV. PROPOSED METHOD

Considering the latent feature structure, we decompose the feature matrix X ∈ R^{n×d} into two low-dimensional matrices V ∈ R^{n×k} and Q ∈ R^{d×k}. To minimize the reconstruction error, we obtain the following form:

$$\min_{V,Q} \|X - VQ^T\|_F^2 \tag{6}$$

where V represents the latent feature structure of the feature matrix, with Q denoting a coefficient matrix. Formula (6) indicates that the d-dimensional features are reduced to k dimensions, which can be interpreted as the original d features being clustered into k different clusters, where each cluster contains relevant features and different clusters are independent of each other. In this way, we obtain the latent feature structure of the feature matrix. In addition, each row of matrix Q represents the importance of the corresponding feature in the k latent feature variables.

Similarly, we employ a product of two low-dimensional matrices U ∈ R^{n×k} and M ∈ R^{k×c} to represent the label matrix in multilabel data. U and M denote an arbitrary latent label structure and the corresponding coefficient matrix, respectively. Inspired by the property that similar features share similar labels, we replace matrix U with matrix V to share the correlations between labels and features. Matrix V is called the LSS term. As a result, we obtain the following problem:

$$\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 \tag{7}$$

where α is used to balance the contribution of the latent feature structure to the label correlations against the feature decomposition. However, a critical issue is that, although we intend to find a proper matrix V by minimizing Formula (7), there exist multiple LSS terms. To obtain a proper LSS term, we employ a graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. The principle is that a closer correlation between two instances in the feature matrix X indicates a closer correlation between the two corresponding latent feature variables in the latent feature structure:

$$\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}\|V_{i\cdot} - V_{j\cdot}\|_2^2 = \mathrm{Tr}\big(V^T(A - S)V\big) = \mathrm{Tr}(V^T L V) \tag{8}$$

in which A is a diagonal matrix, S denotes a symmetric affinity matrix, and L = A − S denotes the graph Laplacian matrix. Following [45], we employ the heat kernel function to build the affinity matrix, which is defined as

$$S_{ij} = \begin{cases} e^{-\frac{\|X_{i\cdot} - X_{j\cdot}\|^2}{\sigma}}, & \text{if } X_{i\cdot} \in N_p(X_{j\cdot}) \text{ or } X_{j\cdot} \in N_p(X_{i\cdot}) \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

where N_p(X_i·) denotes the p-nearest neighbors of instance X_i· and σ is the parameter of the heat kernel.
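The construction of S, A, and L in (8) and (9) can be sketched as follows; this is a minimal NumPy illustration of a p-nearest-neighbor heat-kernel graph, assuming a simple symmetric neighborhood rule, and the function and variable names are chosen here for illustration rather than taken from the paper.

```python
import numpy as np

def heat_kernel_graph(X, p=5, sigma=1.0):
    """Build the affinity matrix S by (9), the degree matrix A,
    and the graph Laplacian L = A - S used in (8)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between instances (rows of X).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # p-nearest neighbors of each instance (excluding the instance itself).
    nn = np.argsort(sq_dist, axis=1)[:, 1:p + 1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            w = np.exp(-sq_dist[i, j] / sigma)
            S[i, j] = S[j, i] = w        # symmetric: i in N_p(j) or j in N_p(i)
    A = np.diag(S.sum(axis=1))           # diagonal degree matrix
    L = A - S                            # graph Laplacian
    return S, A, L

# Example: a random data matrix with n = 8 instances and d = 4 features.
S, A, L = heat_kernel_graph(np.random.default_rng(0).random((8, 4)))
```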
Integrating the above terms into one multilabel feature selection model, we obtain the following objective function:

$$\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta\,\mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1} \tag{10}$$

where β and γ are two parameters that measure the contribution of the local geometry structure and control the sparsity of the objective function, respectively. The term ‖Q‖_{2,1} achieves feature selection through the l2,1-norm. In objective function (10), matrix V reflects both the latent feature structure and the latent label structure. In addition, matrix V is defined as the LSS term so that similar features share similar labels. Finally, to further enhance the row-sparsity property, we impose a nonnegative constraint on the coefficient matrix Q. We also impose nonnegative constraints on V and M to facilitate the subsequent optimization algorithm that solves the objective function. The final objective function, including the constraints, is reformulated as

$$\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta\,\mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1} \quad \text{s.t.}\ \ V, M, Q \ge 0. \tag{11}$$

V. OPTIMIZATION OF SSFS SCHEME

A. Proposed Scheme of SSFS

The objective function (11) is jointly nonconvex; as a result, we cannot obtain the global minimum. In addition, the objective function (11) is nonsmooth owing to the l2,1-norm. To this end, we design three iterative rules that solve the objective function and reach a local minimum. First, ‖Q‖_{2,1} can be relaxed as 2Tr(Q^T D Q), where D ∈ R^{d×d} is a diagonal matrix with the ith diagonal element D_ii = 1/(2‖Q_i·‖_2 + ε) and ε is a small positive constant. The objective function (11) can then be rewritten as

$$\mathcal{O}(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha\,\mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta\,\mathrm{Tr}(V^T L V) + 2\gamma\,\mathrm{Tr}(Q^T D Q). \tag{12}$$

Furthermore, we integrate the nonnegative constraints into Formula (12) through Lagrange multiplier matrices Ψ, Φ, and Θ for V, M, and Q, respectively, obtaining

$$\mathcal{L}(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha\,\mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta\,\mathrm{Tr}(V^T L V) + 2\gamma\,\mathrm{Tr}(Q^T D Q) - \mathrm{Tr}(\Psi V^T) - \mathrm{Tr}(\Phi M^T) - \mathrm{Tr}(\Theta Q^T). \tag{13}$$
According to the Karush–Kuhn–Tucker conditions, i.e., Ψ_ij V_ij = 0, Φ_ij M_ij = 0, and Θ_ij Q_ij = 0, we obtain that

$$\big({-}XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta LV\big)_{ij} V_{ij} = 0 \tag{17}$$

$$\big({-}\alpha V^TY + \alpha V^TVM\big)_{ij} M_{ij} = 0 \tag{18}$$

$$\big({-}X^TV + QV^TV + \gamma DQ\big)_{ij} Q_{ij} = 0 \tag{19}$$

where L = A − S in Formula (17), and hence, Formula (17) can be written as

$$\big({-}XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta(A - S)V\big)_{ij} V_{ij} = 0. \tag{20}$$

As a result, the update rules regarding V, M, and Q are presented as follows:

$$V^{t+1}_{ij} \leftarrow V^{t}_{ij}\,\frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}} \tag{21}$$

$$M^{t+1}_{ij} \leftarrow M^{t}_{ij}\,\frac{(V^TY)_{ij}}{(V^TVM)_{ij}} \tag{22}$$

$$Q^{t+1}_{ij} \leftarrow Q^{t}_{ij}\,\frac{(X^TV)_{ij}}{(QV^TV + \gamma DQ)_{ij}} \tag{23}$$

where t indicates the iteration number. In addition, to avoid zero elements in the denominators during the feature selection process, …

The overall SSFS procedure constructs the affinity matrix S of the data matrix X according to Formula (9) and then repeats the following steps until convergence: update V, M, and Q by Formulas (21), (22), and (23); update D; and set t = t + 1. Finally, Q is returned, and features are ranked according to ‖Q_i·‖_2.
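To make this iterative scheme concrete, the following is a minimal NumPy sketch of the multiplicative updates (21)–(23) together with the refresh of D from the l2,1-norm relaxation; the random nonnegative initialization, the fixed iteration count, and the small constant eps added to the denominators are assumptions made here for illustration and are not specifications from the paper.

```python
import numpy as np

def ssfs(X, Y, S, k, alpha=0.5, beta=0.5, gamma=0.5, n_iter=100, eps=1e-8):
    """Sketch of the SSFS iterations: multiplicative updates (21)-(23) plus the
    refresh of the diagonal matrix D from the l2,1-norm relaxation."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    V = rng.random((n, k))          # shared latent structure (the LSS term)
    Q = rng.random((d, k))          # feature coefficient matrix
    M = rng.random((k, c))          # label coefficient matrix
    A = np.diag(S.sum(axis=1))      # degree matrix, so the Laplacian is L = A - S
    D = np.eye(d)                   # initial D; refreshed at every iteration
    for _ in range(n_iter):
        # (21): update V, with the graph term split into its S and A parts
        V *= (X @ Q + alpha * Y @ M.T + beta * S @ V) / (
             V @ Q.T @ Q + alpha * V @ M @ M.T + beta * A @ V + eps)
        # (22): update M
        M *= (V.T @ Y) / (V.T @ V @ M + eps)
        # (23): update Q
        Q *= (X.T @ V) / (Q @ V.T @ V + gamma * D @ Q + eps)
        # D_ii = 1 / (2 ||Q_i.||_2 + eps), from the relaxation of ||Q||_{2,1}
        D = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + eps))
    return Q

# Features are then ranked by the l2-norm of the rows of the returned Q, e.g.:
#   ranking = np.argsort(-np.linalg.norm(Q, axis=1))
```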
Then, we present the proof of convergence of Formula (11) under the updates (21)–(23). First of all, we introduce the concept of an auxiliary function, which is also used in the literature [40], [46].

Definition 1: The function J is regarded as an auxiliary function of K when J(V_ij, V′_ij) ≥ K(V_ij) and J(V_ij, V_ij) = K(V_ij) are both satisfied.

Lemma 1: The function K is nonincreasing under the following update:

$$V^{t+1}_{ij} = \arg\min_{V_{ij}} J\big(V_{ij}, V^{t}_{ij}\big). \tag{27}$$

Proof of Lemma 1: Based on Definition 1 and the update (27), we can easily obtain that

$$K\big(V^{t+1}_{ij}\big) \le J\big(V^{t+1}_{ij}, V^{t}_{ij}\big) \le J\big(V^{t}_{ij}, V^{t}_{ij}\big) = K\big(V^{t}_{ij}\big). \tag{28}$$
TABLE II: Experimental results of nine methods on seven datasets on the SVM classifier in terms of Macro-F1 (mean ± std).

TABLE III: Experimental results of nine methods on seven datasets on the SVM classifier in terms of Micro-F1 (mean ± std).

TABLE IV: Experimental results of nine methods on seven datasets on the ML-kNN classifier in terms of HL (mean ± std).

Fig. 2. Nine methods on four benchmark datasets on the SVM classifier in terms of Macro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.
… GMM are 57.1%, 66%, 49.2%, 114.6%, 66%, 72.5%, 39.7%, and 183.9%, respectively. Similarly, SSFS achieves the best average results in terms of Micro-F1. Regarding HL, we retain more digits to verify the classification superiority of the proposed method SSFS. SSFS still achieves the best average results among all nine multilabel feature selection methods. Overall, the proposed method SSFS outperforms the eight compared methods in terms of these popular evaluation criteria.

To show the classification performance clearly, we choose four datasets (Arts, Education, Science, and Social) to present the three evaluation metrics in Figs. 2–4. The horizontal axis represents the number of already-selected features, whereas the vertical axis denotes each evaluation metric. Different colors represent different feature selection methods. As can be seen in Figs. 2–4, the proposed method achieves the best performance in terms of the three metrics in comparison with the eight compared methods. Furthermore, different methods achieve the second-best classification performance, such as the MIFS method on Arts in terms of Macro-F1 and the RALM-FS method on Education in terms of Micro-F1, which indicates that the proposed method is reliable.

In general, the proposed method SSFS achieves the best classification performance among all the methods in terms of Macro-F1, Micro-F1, and HL.
Fig. 3. Nine methods on four benchmark datasets on the SVM classifier in terms of Micro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 4. Nine methods on four benchmark datasets on the ML-kNN classifier in terms of HL. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 5. Macro-F1 of the SSFS method on the Arts dataset w.r.t. α, β, γ, and k on the SVM classifier. (a) α. (b) β. (c) γ. (d) k.

Fig. 6. Convergence curves: (a) MIFS, (b) RALM-FS, and (c) SSFS.
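For readers reproducing the evaluation protocol summarized in Tables II–IV and Figs. 2–4, the sketch below shows how Macro-F1, Micro-F1, and Hamming loss (HL) can be computed from binary label predictions with scikit-learn; the random placeholder matrices stand in for classifier outputs and are not the paper's experimental pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Y_true and Y_pred are n x c binary label matrices; here they are random
# placeholders standing in for classifier outputs on a selected feature subset.
rng = np.random.default_rng(0)
Y_true = rng.integers(0, 2, (100, 5))
Y_pred = rng.integers(0, 2, (100, 5))

macro_f1 = f1_score(Y_true, Y_pred, average='macro', zero_division=0)
micro_f1 = f1_score(Y_true, Y_pred, average='micro', zero_division=0)
hl = hamming_loss(Y_true, Y_pred)   # lower is better
print(f"Macro-F1={macro_f1:.3f}  Micro-F1={micro_f1:.3f}  HL={hl:.3f}")
```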
C. Parameter Sensitivity Analysis

There exist four parameters (α, β, γ, and k) in the objective function. Obviously, α, β, and γ are used to measure the contribution of each term in the objective function, and k is the dimension of the LSS term. To study the sensitivity of each parameter in the feature selection process, we adjust one parameter at a time while fixing the other three, searching the grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} on the Arts dataset. We fix the other parameters at 0.5, which is also adopted in the literature [26], [40]. We test the sensitivity of the four parameters on the SVM classifier in terms of Macro-F1 on the Arts dataset in Fig. 5.

Observing Fig. 5, we can find that the classification performance is not very sensitive to these four parameters.
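A minimal sketch of this one-at-a-time sensitivity procedure is given below; evaluate_ssfs is a hypothetical helper standing in for running SSFS with the given parameters and scoring the selected features (e.g., Macro-F1 with an SVM), and is not a function from the paper.

```python
# One-at-a-time parameter sensitivity: vary a single parameter over the grid
# while the remaining parameters stay fixed at 0.5.
grid = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
defaults = {'alpha': 0.5, 'beta': 0.5, 'gamma': 0.5}
# k, the dimension of the LSS term, can be varied analogously over its own range.

def evaluate_ssfs(**params):
    """Hypothetical stand-in: run SSFS with `params`, train a classifier on the
    selected features, and return a score such as Macro-F1. Returns a placeholder
    value here so that the loop below is runnable."""
    return 0.0

results = {}
for name in defaults:
    scores = []
    for value in grid:
        params = dict(defaults, **{name: value})
        scores.append(evaluate_ssfs(**params))
    results[name] = scores
```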
D. Convergence and Complexity Analysis

Furthermore, we conduct experiments on four datasets (Arts, Education, Science, and Social) to present the convergence of SSFS and the compared methods in Fig. 6. Fig. 6(a)–(c) shows the convergence curves of MIFS, RALM-FS, and SSFS, respectively. These methods converge within 30 iterations. The proposed method SSFS converges quickly, in several iterations, on the four datasets.

Afterward, we present the time complexity of SSFS and the other algorithm adaptation methods (MIFS, MDMR, SCLS, LRFS, GMM, and RALM-FS) in Table V.

TABLE V: Time complexity of methods.

Suppose that a is the number of already-selected features, d denotes the number of features, n indicates the number of instances, c represents the number of labels, and l is the cluster number of labels in the MIFS method. As shown in Table VI, information-theoretic methods outperform embedded methods in terms of computational complexity because the former …
TABLE VI: Running time (seconds) of different methods.
[21] J. Lee and D.-W. Kim, "SCLS: Multi-label feature selection based on scalable criterion for large label set," Pattern Recognit., vol. 66, pp. 342–352, Jun. 2017.
[22] P. Zhang, G. X. Liu, and W. F. Gao, "Distinguishing two types of labels for multi-label feature selection," Pattern Recognit., vol. 95, pp. 72–82, Nov. 2019.
[23] J. Gonzalez-Lopez, S. Ventura, and A. Cano, "Distributed multi-label feature selection using individual mutual information measures," Knowl.-Based Syst., vol. 188, Jan. 2020, Art. no. 105052.
[24] J. Lee and D.-W. Kim, "Feature selection for multi-label classification using multivariate mutual information," Pattern Recognit. Lett., vol. 34, no. 3, pp. 349–357, 2013.
[25] J. Lee and D.-W. Kim, "Mutual information-based multi-label feature selection using interaction information," Expert Syst. Appl., vol. 42, no. 4, pp. 2013–2025, 2015.
[26] L. Jian, J. Li, K. Shu, and H. Liu, "Multi-label informed feature selection," in Proc. IJCAI, 2016, pp. 1627–1633.
[27] S. Huang and Z. Zhou, "Multi-label learning by exploiting label correlations locally," in Proc. AAAI Conf. Artif. Intell., 2012, pp. 949–955.
[28] Y. Zhu, J. T. Kwok, and Z.-H. Zhou, "Multi-label learning with global and local label correlation," IEEE Trans. Knowl. Data Eng., vol. 30, no. 6, pp. 1081–1094, Jun. 2018.
[29] P. Hou, X. Geng, and M. Zhang, "Multi-label manifold learning," in Proc. AAAI Conf. Artif. Intell., 2016, pp. 1680–1686.
[30] J. Huang, G. Li, Q. Huang, and X. Wu, "Learning label specific features for multi-label classification," in Proc. IEEE Int. Conf. Data Mining, Nov. 2015, pp. 181–190.
[31] R. Shao, N. Xu, and X. Geng, "Multi-label learning with label enhancement," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 437–446.
[32] F. Nie, H. Huang, X. Cai, and C. Ding, "Efficient and robust feature selection via joint l2,1-norms minimization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1813–1821.
[33] X. Cai, F. Nie, and H. Huang, "Exact top-k feature selection via l2,0-norm constraint," in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 1240–1246.
[34] Z. Ma, F. Nie, Y. Yang, J. R. R. Uijlings, and N. Sebe, "Web image annotation via subspace-sparsity collaborated feature selection," IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1021–1030, Aug. 2012.
[35] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in Proc. AAAI Conf. Artif. Intell., vol. 2, 2014, pp. 1171–1177.
[36] X. Chang, H. Shen, S. Wang, J. Liu, and L. Xue, "Semi-supervised feature analysis for multimedia annotation by mining label correlation," in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining, 2014, pp. 74–85.
[37] M. Luo, F. Nie, X. Chang, Y. Yang, A. G. Hauptmann, and Q. Zheng, "Adaptive unsupervised feature selection with structure regularization," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 944–956, Apr. 2017.
[38] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[39] J. Gui, Z. Sun, S. Ji, D. Tao, and T. Tan, "Feature selection based on structured sparsity: A comprehensive study," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1490–1507, Jul. 2017.
[40] J. Hu, Y. Li, W. Gao, and P. Zhang, "Robust multi-label feature selection with dual-graph regularization," Knowl.-Based Syst., vol. 203, Sep. 2020, Art. no. 106126.
[41] R. Shang, W. Wang, R. Stolkin, and L. Jiao, "Subspace learning-based graph regularized feature selection," Knowl.-Based Syst., vol. 112, pp. 152–165, Nov. 2016.
[42] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2014, pp. 977–986.
[43] X. Li, H. Zhang, R. Zhang, Y. Liu, and F. Nie, "Generalized uncorrelated regression with adaptive graph for unsupervised feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1587–1595, May 2019.
[44] F. Nie, W. Zhu, and X. Li, "Unsupervised feature selection with structured graph optimization," in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1302–1308.
[45] D. Cai, C. Zhang, and X. He, "Unsupervised feature selection for multi-cluster data," in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2010, pp. 333–342.
[46] R. Shang, W. Wang, R. Stolkin, and L. Jiao, "Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection," IEEE Trans. Cybern., vol. 48, no. 2, pp. 793–806, Feb. 2018.
[47] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[48] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, "Mulan: A Java library for multi-label learning," J. Mach. Learn. Res., vol. 12, pp. 2411–2414, Jun. 2011.

Wanfu Gao was born in Jilin, China, in 1990. He received the B.S. and Ph.D. degrees from the College of Computer Science, Jilin University, Changchun, China, in 2013 and 2019, respectively.
He is currently a Lecturer with the College of Computer Science, Jilin University. He is doing post-doctoral research at the College of Chemistry, Jilin University. His research interests include feature selection, multilabel learning, and information theory.
Dr. Gao received the 2019 ACM Changchun Doctoral Dissertation Award and the Postdoctoral Innovative Talents Support Program.

Yonghao Li was born in Henan, China, in 1992. He received the B.E. degree from Henan Agricultural University, Zhengzhou, China, in 2017, and the M.S. degree in computer science from Jilin University, Changchun, China, in 2020, where he is currently pursuing the Ph.D. degree with the College of Computer Science.
His research focuses on feature selection.

Liang Hu was born in Jilin, China, in 1968. He received the M.S. and Ph.D. degrees in computer science from Jilin University, Changchun, China, in 1993 and 1999, respectively.
He is currently a Professor and a Doctoral Supervisor at the College of Computer Science and Technology, Jilin University. He is supported by the Hundred-Thousand-Ten Thousand Project of China. His research interests include feature selection, multilabel learning, and information theory.
Dr. Hu is a member of the China Computer Federation.