
Information Processing and Management 61 (2024) 103633


Multi-label feature selection with high-sparse personalized and low-redundancy shared common features

Yonghao Li, Liang Hu, Wanfu Gao ∗
College of Computer Science and Technology, Jilin University, Changchun 130012, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

ARTICLE INFO

Keywords:
Multi-label learning
Feature selection
Sparse learning
Classification

ABSTRACT

Prevalent multi-label feature selection (MLFS) approaches obtain the most suitable feature subset by dealing with two issues, namely sparsity and redundancy. In this paper, we design an efficient Elastic net based high Sparse personalized and low Redundancy Feature Selection approach for multi-label data, named ESRFS, to address two obstacles, 𝑖.𝑒., the low-sparse LASSO norm yields personalized features for each label, while the high-redundancy 𝑙2,1-norm explores shared common features for all labels in multi-label learning. These two problems impede the selection of high-quality features for classification. In comparison with previous MLFS approaches, ESRFS has two main advantages. First, ESRFS achieves sparser personalized features than the LASSO norm. Second, ESRFS can identify low-redundancy shared common features with strong discrimination by introducing a novel regularization term. To effectively and efficiently identify the optimal feature subset, an alternating-multiplier-based rule is introduced to optimize ESRFS. Experimental results on fifteen multi-label data sets show that ESRFS achieves clearly superior performance compared to eight state-of-the-art MLFS approaches in 80%, 80%, 73.3%, 80%, 86.7% and 80% of cases based on Hamming Loss and Zero-One Loss using MLkNN, Micro-𝐹1 and Macro-𝐹1 using SVM, as well as Micro-𝐹1 and Macro-𝐹1 using 3NN, respectively.

1. Introduction

In recent years, the dimensionality of features has grown rapidly and includes a large number of unnecessary and redundant features, which degrades prediction accuracy and increases the computation time of existing learning models. Furthermore, high-dimensional features with multiple labels are ubiquitous in real-world application scenarios, such as biometric identification, text mining and image processing (Jin, Zhang, & Zhao, 2023; Karimi, Dowlatshahi, & Hashemi, 2023; Liu, Lin, Ding, Zhang, & Du, 2022; Liu, Qi, Xu, Gao, & Liu, 2019; Ma, Chiu, & Chow, 2020). Technologies for learning and selecting features in multi-label scenarios are two core tools for dealing with high-dimensional multi-label data.
The objective of multi-label learning is to identify a reliable mapping function predicting multiple labels for each new or unseen
instance within the testing data. Nonetheless, the exponential growth of potential label permutations for predicting unseen labels
occurs as the number of labels increases (Liu, Wang, Shen, & Tsang, 2021). To tackle this challenge, numerous feature selection
techniques based on multi-label learning have been developed (Li, Hu, & Gao, 2023; Mishra & Singh, 2020). The essence of feature
selection lies in the selection of an ideal subset from the initial feature set, aiming to enhance both the prediction accuracy and
interpretability of multi-label learning algorithms.

∗ Corresponding author at: College of Computer Science and Technology, Jilin University, Changchun 130012, China.
E-mail addresses: [email protected] (Y. Li), [email protected] (L. Hu), [email protected] (W. Gao).

https://doi.org/10.1016/j.ipm.2023.103633
Received 16 June 2023; Received in revised form 25 October 2023; Accepted 26 December 2023
Available online 4 January 2024
0306-4573/© 2023 Elsevier Ltd. All rights reserved.

Table 1
The differences between representative approaches and ESRFS.
Approaches                  Personalized features   Shared common features   High sparsity   Low redundancy
LSML (Huang et al., 2019)   ✓                       ×                        ×               ×
DSTL1 (Jin et al., 2020)    ✓                       ×                        ×               ×
DSMFS (Hu et al., 2022)     ✓                       ×                        ×               ×
SCMFS (Hu et al., 2020)     ×                       ✓                        ×               ×
SSFS (Gao et al., 2023)     ×                       ✓                        ×               ×
MIFS (Jian et al., 2016)    ×                       ✓                        ×               ×
ESRFS                       ✓                       ✓                        ✓               ✓
where ✓ denotes yes and × denotes no.

Generally, existing MLFS approaches can be partitioned into filter schemes, wrapper schemes and embedded schemes by taking
into account the interplay between the subsequent learning algorithms and the process of feature selection (Dai, Huang, Zhang,
& Liu, 2024; Fan et al., 2024; Spolaôr, Monard, Tsoumakas, & Lee, 2016). Filter-based approaches operate independently from
the subsequent multi-label learning algorithm. Wrapper-based approaches first select a feature subset, and then use the subsequent
multi-label learning algorithm to evaluate this subset. By repeating the aforementioned operation, a high-quality feature subset is
acquired. Embedded-based approaches perform feature subset selection and multi-label learning processes simultaneously. Typically,
embedded approaches use sparse learning based models to regularize structural information, with the goal of enhancing the ability of
multi-label learning algorithms to generalize (Guo, Sun, & Hao, 2022). Therefore, the embedded-based multi-label feature selection
approach is our focus. The proposed method falls into the embedded MLFS category.
The embedded MLFS approaches usually use regularization-related sparse learning technologies to deal with feature sparsity
and redundancy problems in feature space (Elad, 2010; Li et al., 2017). However, a majority of approaches employ the LASSO
with low sparsity properties to obtain personalized features for each label, where personalized features have different degrees of
discrimination for every label of multi-label data (Weng et al., 2023). It is well known that the 𝑙0-norm has the best sparsity, but optimizing this norm is NP-hard, presenting significant computational challenges (Zhang, Xu, Yang, Li, & Zhang, 2015). Therefore, many MLFS studies use the LASSO (𝑎.𝑘.𝑎. 𝑙1-norm) instead of the 𝑙0-norm to perform feature selection. For instance, Huang et al. (2019) propose an innovative multi-label approach based on the LASSO (LSML) to learn personalized features for an incomplete label space. Jin et al. (2020) use the LASSO norm and Dempster–Shafer theory to address the defect of nonstationarity and achieve internal feature selection (DSTL1). Hu, Li, Xu, and Gao (2022) propose a novel approach (DSMFS) that introduces the LASSO to conduct personalized feature selection based on dynamic dual-graph terms. Although LASSO-based approaches have achieved notable success, the LASSO cannot overcome its inherent defects, 𝑖.𝑒., its sparsity is still inferior to that of the 𝑙0-norm. Moreover, LASSO-based MLFS approaches may choose numerous features that lack discrimination, owing to the high feature dimensionality relative to the limited number of instances. Therefore, these problems need to be solved.
Alternatively, some MLFS methods focus on identifying shared common features. Many scholars prioritize the exploration of
MLFS techniques based on the 𝑙2,1 -norm (Nie, Huang, Cai, & Ding, 2010), owing to its inherent ability to address shared common
features. To name a few, Hu, Li, Gao, Zhang, and Hu (2020) employ coupled matrix factorization (CMF) to derive a shared common
mode (SCMFS) that captures the relationship between the feature-level space and label-level space, and then extract shared common
features for shared common mode by 𝑙2,1 -norm, 𝑖.𝑒. , SCMFS belongs to a dual shared common modes from the actual efficacy
perspective. Jian, Li, Shu, and Liu (2016) design a 𝑙2,1 -norm based approach (MIFS) to capture a higher level latent label space
representation. Gao, Li, and Hu (2023) utilize the constrained latent low dimensional space structure shared regularizer to address
multi-label data (SSFS). Although 𝑙2,1 -norm can excavate shared common features from original feature set for all labels, 𝑙2,1 -norm
ignores or inadequate considers the redundant correlation among features so that selected shared common features contain an
extensive amount of redundant features. Despite the above shortcomings, personalized features and shared common features have
been shown to offer supplementary insights during the process of multi-label feature selection, thereby enhancing the overall
effectiveness (Li, Li, Hu, & Yu, 2022; Li, Wu, Dani, & Liu, 2018).
A comparison of the dissimilarities between prior methodologies and our proposed ESRFS is presented in Table 1. To tackle the
aforementioned issues, we incorporate the elastic net regularization technique into our proposed approach. The elastic net can serve
to mitigate challenges such as the imbalance between the number of features and instances according to Zou and Hastie (2005).
Subsequently, we introduce a pioneering approach called ESRFS (Elastic net based high Sparse personalized and low Redundancy
MLFS), which aims to efficiently identify a feature subset with high sparsity and minimal redundancy. ESRFS achieves sparser personalized features for each label than the LASSO norm, so that the sparsity of the proposed approach is closer to that of the 𝑙0-norm. At the same time, ESRFS can identify low-redundancy shared common features with stronger discrimination. To effectively and efficiently identify the optimal feature subset, an alternating-multiplier-based scheme is developed to optimize ESRFS. The main advantages of ESRFS are as follows:

• ESRFS not only ensures highly sparse personalized features for each label but also selects more discriminative personalized features than LASSO-based methods.
• ESRFS can identify low redundancy shared common features with strong discrimination by incorporating a regularization term
based on inner product.


• ESRFS, equipped with a globally optimal solution, employs an alternating multiplier-based scheme to successfully discern the
optimal subset of features.
• ESRFS achieves better classification performance than other state-of-the-art MLFS methods on diverse multi-label data under
different metrics.

The remaining sections are structured as follows: Section 2 provides a concise overview of relevant prior studies of importance.
In Section 3, we elaborate on the intricacies of the ESRFS approach. The optimization scheme and convergence analysis of ESRFS
are presented in Section 4. Section 5 comprehensively analyzes the experimental outcomes using diverse evaluation metrics. Finally,
in Section 6, we present the concluding remarks, along with suggestions for future research directions in this domain.

2. Related work

Within this section, we provide a comprehensive overview of fundamental principles associated with various sparse learning
models. In addition, we review existing MLFS approaches that focus on personalized features with 𝑙1 -norm and shared common
features for all labels as well as other types of approaches. These related works are most relevant to the ESRFS approach.
Generally, the regularization terms applied in sparse learning mainly include the LASSO norm, the 𝑙2-norm and their variants. The LASSO norm is an important sparse technology that has been widely used owing to its effectiveness, its efficiency and the equivalence of the LASSO norm and the 𝑙0-norm under certain conditions. However, the sparsity of the LASSO is still inferior to that of the 𝑙0-norm (Zhang et al., 2015). Moreover, LASSO-based approaches select numerous non-discriminative features. Therefore, Zou and Hastie designed the elastic net to deal with the inherent defects of the LASSO (Zou & Hastie, 2005). However, the elastic net cannot alleviate high redundancy among features. The 𝑙2-norm is used to prevent over-fitting; however, it cannot yield rigorously sparse solutions (Zhang et al., 2015). Alternatively, the 𝑙2,1-norm is seen as a representative variant of the LASSO and the 𝑙2-norm due to its excellent performance, 𝑖.𝑒., it extracts shared common features from the original feature space for all labels. However, the 𝑙2,1-norm ignores two important issues. First, it neglects personalized features due to the nature of the 𝑙2-norm. Second, it ignores the redundant correlations among features, so that the selected shared common features contain a large quantity of redundant information (Han, Sun, & Hao, 2015; Shang, Xu, Shang, & Jiao, 2020).
The majority of current multi-label learning methodologies leverage the principles and theories of sparse learning and
information-theoretical knowledge to conduct the feature selection process. For sparse-learning-based approaches, the above-mentioned norms and their variants are used. For instance, Hu et al. (2022) propose a novel approach (DSMFS) that designs a set of dual dynamic
subspace graph constraint terms by using the low-dimensional mapping of original space. Meanwhile, this approach introduces the
𝑙1 -norm to conduct personalized feature selection based on the aforementioned dual dynamic graphs. DSMFS has the following form:

\min_{\{W,U,V\} \ge 0} \|XW - U\|_F^2 + \alpha \|Y - UV\|_F^2 + \beta \left( \mathrm{Tr}(U^T L U) + \mathrm{Tr}(W L_U W^T) \right) + \gamma \|W\|_1 \qquad (1)

where 𝑋 ∈ R𝑛×𝑑 and 𝑌 ∈ R𝑛×𝑙 denote original data space and label space respectively. 𝑊 ∈ R𝑑×𝑘 , 𝑈 ∈ R𝑛×𝑘 and 𝑉 ∈ R𝑘×𝑙 denote the
weight matrix, the low-dimensional mapping subspace and the basis matrix, respectively. 𝐿𝑈 represents a dynamic Laplacian term
constructed by the low-dimensional subspace. 𝐿 is a global instance relevance Laplacian matrix by learning second-order instance
correlation. 𝛼, 𝛽 and 𝛾 are three tuning parameters. However, DSMFS does not effectively address the issue of limited sparsity and
elevated redundancy within the chosen features.
Besides, Gao et al. (2023) propose a MLFS approach that utilizes a constrained latent structure regularization term to address
instances with high-dimensional features and labels in multi-label data (SSFS). This latent structure shared term is able to encapsulate
core information between feature set and label set simultaneously.
\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha \|Y - VM\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma \|Q\|_{2,1}, \quad s.t.\ \{V, M, Q\} \ge 0 \qquad (2)

where 𝑉 ∈ R𝑛×𝑘 denotes a constrained latent structure shared low-dimensional space. 𝑄 ∈ R𝑑×𝑘 and 𝑀 ∈ R𝑘×𝑙 denote the weighted
matrix and the basis matrix, respectively. However, SSFS is also unable to solve the low sparsity and high redundancy problem.
Moreover, personalized features are ignored. Similarly, Jian et al. (2016) design an 𝑙2,1-norm based feature selection approach (MIFS)
that uses a low-dimensional label subspace to capture a higher level and more compact latent label representation. Xu, Wang, An,
Wei, and Ruan (2018) propose a 𝑙2,1 -based feature selection approach (SCFS) that considers space consistency between the set
of features and the set of labels from semi-supervised scenarios to address the challenge posed by the multi-label scenario with
incomplete prior information. In addition, Cai, Nie, and Huang (2013) propose an exact and robust methodology for feature selection,
known as RALM-FS, that uses joint 𝑙2,1-norm and 𝑙2,0-norm constraint terms to obtain highly sparse feature selection results via the nature of the 𝑙0-norm. However, handling the 𝑙2,0-norm directly is a thorny problem. Furthermore, RALM-FS ignores the importance of personalized features because it adopts the 𝑙2-norm, whose nature eliminates personalization. Yu, Cai, Wu, Liu, and Li (2023) formulate a causal-structure-based MLFS approach (M2LC) that considers common features shared by all labels and personalized features for an individual label. However, sparsity and redundancy are not considered in this approach.


Table 2
Explanation of symbols and variables.
Symbols Explanation
𝐴 Matrix
𝑎 Vector
a Scalar
𝐴𝑖⋅ The 𝑖th row of 𝐴
𝐴⋅𝑗 The 𝑗th column of 𝐴
𝐴𝑖𝑗 The (𝑖, 𝑗)-th entry of 𝐴
𝑛 Row cardinality
𝑚 Column cardinality
𝐴𝑇 𝐴’s transpose
𝑇 𝑟(𝐴) The trace of square matrix 𝐴
||𝐴||𝐹 \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij}^2}
||𝐴||1 \sum_{i=1}^{n} \sum_{j=1}^{m} |A_{ij}|
||𝐴||2,1 \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} A_{ij}^2}
𝑋 𝑋 ∈ R𝑛×𝑑
𝑌 𝑌 ∈ R𝑛×𝑙

On the other hand, many information-theoretical-based MLFS approaches are designed, such as max-dependence between the
feature set and the label set while minimizing redundancy among the features (MDMR) (Lin, Hu, Liu, & Duan, 2015) and feature
selection approach using label redundancy (LRFS) (Zhang, Liu, & Gao, 2019). MDMR identifies the optimal subset of features by
using max-dependence, and min-redundancy between features simultaneously. LRFS designs a novel label redundancy term to deal
with MLFS, which takes into account both correlated labels and uncorrelated labels simultaneously. Its form is as follows:

J(f_k) = \sum_{y_i \in Y} \sum_{y_j \ne y_i,\, y_j \in Y} \Big\{ I(f_k; y_j \mid y_i) - \frac{1}{|S|} \sum_{f_j \in S} I(f_k; f_j) \Big\} \qquad (3)

where 𝐽(⋅) denotes the objective function. Formula (3) includes a term capturing feature–label correlations and a term quantifying feature redundancy. 𝑓𝑘 and 𝑓𝑗 denote candidate features and selected features, respectively. 𝐼(⋅;⋅) denotes mutual information. 𝑦𝑖 and 𝑦𝑗 denote two labels from the label space. |𝑆| is the cardinality of the selected feature set 𝑆. Besides, Gonzalez-Lopez, Ventura, and Cano (2020) introduce a feature selection methodology based on mutual information, aiming to identify the optimal subset of features with a geometric mean maximization criterion. As mentioned above, these approaches belong to the algorithm adaptation strategy, 𝑖.𝑒., multi-label data can be addressed directly. Another strategy is the problem transformation approach, which converts a multi-label problem into several sub-problems. One example is the pruned problem transformation (PPT) method (Read, 2008), a well-known problem transformation approach; Read uses the 𝜒2 statistic (PPT+CHI) to select the best subset from all features. Meanwhile, Doquire and Verleysen (Doquire & Verleysen, 2011) propose a feature selection approach (PPT+MI) to address the challenges posed by multi-label data sets. Despite the effectiveness of
the aforementioned approaches, they suffer from significant information loss and label fragmentation, which are difficult to tolerate.

3. Proposed approach

Notations: the main notations are described in Table 2.


In this section, we describe the novel elastic net based high-sparsity and low-redundancy multi-label feature selection approach (ESRFS). ESRFS achieves sparser personalized features than the 𝑙1-norm. It also identifies low-redundancy shared common features with strong discrimination by introducing an inner product regularization term. Technical details are described as follows.
For sparse-based feature selection approaches, the ordinary least square model is usually seen as an error loss function. This
error loss function with sparsity-inducing regularization terms can automatically select feature subset during the model shrinkage
process. Consequently, many regularization terms have been designed, for instance, the 𝑙1-norm, the 𝑙2-norm, and elastic net regularization (Zou & Hastie, 2005). We first introduce the elastic net into the designed approach to address the inherent limitation of the 𝑙1-norm according to Zou and Hastie (2005). Therefore, we give a general elastic net based feature selection optimization framework:
\min_{W} \|XW - Y\|_F^2 + \beta \|W\|_1 + \gamma \|W\|_F^2 \qquad (4)

where 𝑊 ∈ R𝑑×𝑙 represents the weighted matrix (or feature selection indicator matrix), and 𝑌 ∈ R𝑛×𝑙 represents the label space (If
𝑌𝑖𝑗 is equal to 1, it signifies the presence of a relationship between the 𝑖th instance and the 𝑗th label. Otherwise, if 𝑌𝑖𝑗 is equal to 0,
it indicates the absence of such a relationship.). 𝛽 and 𝛾 are two regularization parameters. Although the elastic net based feature selection optimization framework is efficient, it suffers from some problems, 𝑖.𝑒., it not only ignores high-sparsity personalized features for each individual label but also neglects shared common features for all labels. To remedy this, we incorporate a regularization term based on the inner product. It is formulated as follows:

\sum_{i=1}^{d} \sum_{j=1, i \ne j}^{d} |\langle W_{i\cdot}, W_{j\cdot} \rangle| = \sum_{i=1}^{d} \sum_{j=1}^{d} |\langle W_{i\cdot}, W_{j\cdot} \rangle| - \sum_{i=1}^{d} |\langle W_{i\cdot}, W_{i\cdot} \rangle| = \|WW^T\|_1 - \|W\|_F^2 \qquad (5)

During model shrinkage, the higher the redundancy between any two features, the larger the dot product of their weight vectors, 𝑖.𝑒., 𝑊𝑖⋅ and 𝑊𝑗⋅. As a result, the magnitude of this regularization term increases during shrinkage, which penalizes redundancy between features so that highly sparse personalized features for each individual label are captured, and vice versa. In addition, existing approaches ignore the high-redundancy problem among shared common features. Fortunately, inspired by Han et al. (2015), the designed term can also yield low-redundancy shared common features for all labels. Thus, we derive the following expression:
\min_{W} \|XW - Y\|_F^2 + \beta \|W\|_1 + \gamma \|W\|_F^2 + \lambda (\|WW^T\|_1 - \|W\|_F^2) \qquad (6)
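As a quick numerical sanity check on the identity in formula (5), the following NumPy sketch (illustrative only, not part of the original derivation) compares the pairwise inner-product form with the matrix form ‖𝑊𝑊^𝑇‖1 − ‖𝑊‖²𝐹 for a random nonnegative 𝑊:

import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 4))  # small random nonnegative weight matrix: d = 6 features, l = 4 labels

# Left-hand side of (5): sum of |<W_i., W_j.>| over all pairs of distinct rows.
lhs = sum(abs(W[i] @ W[j])
          for i in range(W.shape[0])
          for j in range(W.shape[0]) if i != j)

# Right-hand side of (5): ||W W^T||_1 - ||W||_F^2.
rhs = np.abs(W @ W.T).sum() - np.linalg.norm(W, "fro") ** 2

print(np.isclose(lhs, rhs))  # True: both expressions measure the inter-row redundancy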
In addition, it is well known that the effective utilization of label correlations is vital for learning from multi-label data (Zhang & Zhou, 2013). According to the loss function in formula (6), if two labels are correlated with each other, it can be deduced that 𝑊⋅𝑖 and 𝑊⋅𝑗 are expected to exhibit similarity. Consequently, the following global second-order label correlation regularization term is introduced.
\sum_{i=1}^{l} \sum_{j=1}^{l} W_{\cdot i}^T W_{\cdot j} L_{ij} \qquad (7)

where 𝐿𝑖𝑗 = 1 − 𝑆𝑖𝑗, and 𝑆𝑖𝑗 indicates the similarity between 𝑌⋅𝑖 and 𝑌⋅𝑗 (we compute 𝑆𝑖𝑗 by employing the cosine similarity measure). Next, formula (7) is integrated into formula (6). We derive the subsequent optimization objective function:

\min_{W} \|XW - Y\|_F^2 + \alpha \mathrm{Tr}(W L W^T) + \beta \|W\|_1 + \gamma \|W\|_F^2 + \lambda (\|WW^T\|_1 - \|W\|_F^2) \qquad (8)
In the presented formulation, the parameters 𝛼 and 𝜆 serve as regularization coefficients, which govern the incorporation of
structural information and control the degree of sparsity, respectively. Based on the findings of Yan, Yang, and Yang (2016), it is
observed that imposing a nonnegative constraint on the feature selection indicator matrix 𝑊 not only preserves the interpretability
of weights of features but also further guarantees sparsity. Consequently, we introduce this nonnegative constraint to the objective
function, resulting in the following final formulation:
\min_{W} \|XW - Y\|_F^2 + \alpha \mathrm{Tr}(W L W^T) + \beta \|W\|_1 + \gamma \|W\|_F^2 + \lambda (\|WW^T\|_1 - \|W\|_F^2), \quad s.t.\ W \ge 0 \qquad (9)
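For concreteness, the following Python sketch (an illustrative reconstruction with variable names of our own choosing, not the authors' code) builds the cosine-similarity matrix 𝑆 and the matrix 𝐿 with 𝐿𝑖𝑗 = 1 − 𝑆𝑖𝑗 from the label matrix 𝑌 and evaluates the objective in formula (9) for a given nonnegative 𝑊:

import numpy as np

def label_laplacian(Y, eps=1e-12):
    # S: cosine similarity between label columns Y[:, i] and Y[:, j]; L_ij = 1 - S_ij.
    norms = np.linalg.norm(Y, axis=0) + eps
    S = (Y.T @ Y) / np.outer(norms, norms)
    return S, 1.0 - S

def esrfs_objective(X, Y, W, alpha, beta, gamma, lam):
    # Value of formula (9) for a nonnegative weight matrix W of shape (d, l).
    _, L = label_laplacian(Y)
    loss = np.linalg.norm(X @ W - Y, "fro") ** 2
    label_term = alpha * np.trace(W @ L @ W.T)
    l1_term = beta * np.abs(W).sum()
    l2_term = gamma * np.linalg.norm(W, "fro") ** 2
    redundancy = lam * (np.abs(W @ W.T).sum() - np.linalg.norm(W, "fro") ** 2)
    return loss + label_term + l1_term + l2_term + redundancy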

4. Solution strategy

This section presents the devised optimization scheme for solving the proposed objective function (9) and provides a convergence
analysis of the objective function under this specific scheme.

4.1. Optimization scheme

Before optimizing the objective function (9), we need to deal with several problems including smoothness and convexity of the
objective function (9). According to the related knowledge in convex optimization (Boyd, Boyd, & Vandenberghe, 2004), formula
(9) exhibits nonsmoothness as a result of the presence of the 𝑙1 -norm, whereas this formula demonstrates convexity with respect to
the exclusive variable 𝑊 owing to its positive semi-definite nature. Therefore, we design an alternating-multiplier-based relaxation
update scheme to obtain a globally optimal solution.
Let 𝛩(𝑊) denote the objective function. Therefore,

\Theta(W) = \|XW - Y\|_F^2 + \alpha \mathrm{Tr}(W L W^T) + \beta \|W\|_1 + \gamma \|W\|_F^2 + \lambda (\|WW^T\|_1 - \|W\|_F^2), \quad s.t.\ W \ge 0 \qquad (10)

To integrate the nonnegative constraint condition into 𝛩(𝑊), a Lagrange multiplier 𝛷 ∈ R_+^{𝑑×𝑙} is introduced into formula (10). Therefore, the following Lagrangian expansion equation is obtained:

\mathcal{L}(W) = \|XW - Y\|_F^2 + \alpha \mathrm{Tr}(W L W^T) + \beta \|W\|_1 + \gamma \|W\|_F^2 + \lambda (\|WW^T\|_1 - \|W\|_F^2) - \mathrm{Tr}(\Phi W^T) \qquad (11)


Subsequently, we calculate the derivative of Eq. (11) with respect to the variable 𝑊:

\frac{\partial \mathcal{L}}{\partial W} = 2X^T X W - 2X^T Y + 2\alpha W L + 2\beta Q \circ W + 2\gamma W + 2\lambda(\mathbf{1}_{d \times d} W - W) - \Phi \qquad (12)

where ◦ denotes the element-wise product, 1𝑑×𝑑 denotes a 𝑑 × 𝑑 matrix in which every entry is set to 1, and 𝑄 ∈ R𝑑×𝑙 is used to relax the 𝑙1-norm on the feature selection matrix. It has the following form:

Q_{ij} = \frac{1}{2|W_{ij}| + \epsilon} \qquad (13)
where 𝜖 > 0 denotes a small constant. Using the KKT condition, 𝑖.𝑒., 𝛷𝑖𝑗𝑊𝑖𝑗 = 0, we obtain:

(X^T X W - X^T Y + \alpha W L + \beta Q \circ W + \gamma W + \lambda(\mathbf{1}_{d \times d} W - W))_{ij} W_{ij} = 0 \qquad (14)
Hence, we derive the subsequent update rule for the variable 𝑊 :
W_{ij}^{t+1} \leftarrow W_{ij}^{t} \, \frac{(X^T Y + \alpha W S + \lambda W)_{ij}}{(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij} + \epsilon} \qquad (15)
where 𝑡 represents the current update step. Let 𝐿 be defined as 𝐴 − 𝑆, where 𝐴 is a matrix whose entries all equal one. Building upon the aforementioned optimization methodology, the optimization procedure is elaborated
in Algorithm 1.

Algorithm 1 The designed approach ESRFS

Input:
The matrix 𝑋 ∈ R𝑛×𝑑 and the matrix 𝑌 ∈ R𝑛×𝑙. Parameters 𝛼, 𝛽, 𝛾 and 𝜆.
Output:
The index of the selected features.
1: Compute the similarity matrix 𝑆 and the Laplacian matrix 𝐿;
2: Initialize the weighted matrix 𝑊 randomly;
3: 𝑡 = 0;
4: While the convergence criterion is not satisfied:
5:     Update the matrix 𝑄 by Eq. (13);
6:     Update the weighted matrix 𝑊 by Eq. (15);
7:     Update 𝑡 by 𝑡 = 𝑡 + 1;
8: End while;
9: Return matrix 𝑊;
10: Compute the ranked index sequence of features by evaluating the 𝑙2-norm of each row 𝑊𝑖⋅ of 𝑊, where 1 ≤ 𝑖 ≤ 𝑑.
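A minimal NumPy sketch of Algorithm 1 is given below; it follows the multiplicative update of Eq. (15) with the relaxation matrix 𝑄 of Eq. (13). The convergence test (relative change of 𝑊), the random seed and the function name are illustrative choices of ours, not taken from the paper:

import numpy as np

def esrfs(X, Y, alpha, beta, gamma, lam, max_iter=200, tol=1e-5, eps=1e-8):
    # Multiplicative-update sketch of ESRFS (Algorithm 1, Eqs. (13) and (15)).
    n, d = X.shape
    l = Y.shape[1]
    # Step 1: cosine similarity S between label columns, all-ones matrix A, and L = A - S.
    norms = np.linalg.norm(Y, axis=0) + eps
    S = (Y.T @ Y) / np.outer(norms, norms)
    A = np.ones((l, l))
    # Step 2: random nonnegative initialization of W.
    rng = np.random.default_rng(0)
    W = rng.random((d, l))
    XtX, XtY = X.T @ X, X.T @ Y
    ones_dd = np.ones((d, d))
    for _ in range(max_iter):
        Q = 1.0 / (2.0 * np.abs(W) + eps)                        # Eq. (13)
        numer = XtY + alpha * W @ S + lam * W                    # numerator of Eq. (15)
        denom = (XtX @ W + alpha * W @ A + beta * Q * W
                 + gamma * W + lam * ones_dd @ W + eps)          # denominator of Eq. (15)
        W_new = W * numer / denom
        converged = np.linalg.norm(W_new - W, "fro") <= tol * (np.linalg.norm(W, "fro") + eps)
        W = W_new
        if converged:
            break
    # Step 10: rank features by the l2-norm of the rows of W (largest first).
    ranking = np.argsort(-np.linalg.norm(W, axis=1))
    return ranking, W

A feature subset is then obtained by keeping the top-ranked indices, e.g. ranking[:k] for a chosen subset size k.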

4.2. The behavior of the convergence

Within this subsection, we provide a thorough analysis of the convergence properties exhibited by the proposed objective function
within the prescribed optimization framework.
Initially, we present the subsequent widely recognized gradient descent method owing to its association with the subsequent
update rule.
W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \left( \frac{\partial \Theta}{\partial W} \right)_{ij} \qquad (16)
where 𝜂 denotes a fixed learning rate for 𝑊 . However, a fixed learning rate 𝜂 may make variable 𝑊 unable to achieve the optimal
solution. To this end, we set 𝜂 as:

\eta = \frac{W_{ij}}{2\,(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}} \qquad (17)
Hence, we can derive the subsequent equation.
W_{ij}^{t+1} \leftarrow W_{ij}^{t} \, \frac{(X^T Y + \alpha W S + \lambda W)_{ij}}{(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}} \qquad (18)

Next, we need to prove convergence of the update rule with adaptive learning rate in formula (18). Before that, we introduce
several concepts according to Cai, He, Han, and Huang (2010), Dempster, Laird, and Rubin (1977).

Definition 1. Given that 𝐺(𝑤, 𝑤′) ≥ 𝐹(𝑤) and 𝐺(𝑤, 𝑤) = 𝐹(𝑤), it follows that 𝐺(𝑤, 𝑤′) serves as an auxiliary function for 𝐹(𝑤), where 𝑤′ represents the value of 𝑤 at the current iteration.


Lemma 1. Assuming 𝐺(𝑤, 𝑤′) represents an auxiliary function with respect to 𝐹(𝑤), it follows that 𝐹(𝑤) exhibits non-increasing behavior under the following update:

w^{t+1} = \arg\min_{w} G(w, w^{t}) \qquad (19)

Proof. 𝐹(𝑤^{𝑡+1}) ≤ 𝐺(𝑤^{𝑡+1}, 𝑤^{𝑡}) ≤ 𝐺(𝑤^{𝑡}, 𝑤^{𝑡}) = 𝐹(𝑤^{𝑡}); thus Lemma 1 holds.

To find an appropriate auxiliary function 𝐺(𝑤, 𝑤^𝑡), we differentiate 𝐹𝑖𝑗(𝑊𝑖𝑗) with respect to 𝑊𝑖𝑗 and compute both the first-order and second-order derivatives, where 𝐹𝑖𝑗(𝑊𝑖𝑗) denotes the part of the objective function 𝑤.𝑟.𝑡. 𝑊𝑖𝑗:

F'_{ij}(W_{ij}) = (2X^T X W - 2X^T Y + 2\alpha W L + 2\beta Q \circ W + 2\gamma W + 2\lambda(\mathbf{1}_{d \times d} W - W))_{ij}
F''_{ij}(W_{ij}) = 2(X^T X)_{ii} + 2\alpha L_{jj} + 2\gamma I_{ii} + 2\lambda(\mathbf{1}_{d \times d} - I)_{ii} \qquad (20)

According to Meng et al. (2019), we construct the following function 𝑤.𝑟.𝑡. 𝐹𝑖𝑗(𝑊𝑖𝑗):

G(W_{ij}, W_{ij}^{t}) = F_{ij}(W_{ij}^{t}) + F'_{ij}(W_{ij}^{t})(W_{ij} - W_{ij}^{t}) + \frac{(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}}{W_{ij}^{t}} (W_{ij} - W_{ij}^{t})^2 \qquad (21)

To prove that formula (21) is an auxiliary function 𝑤.𝑟.𝑡. 𝐹𝑖𝑗(𝑊𝑖𝑗), we obtain the Taylor series expansion of 𝐹𝑖𝑗(𝑊𝑖𝑗):

F_{ij}(W_{ij}) = F_{ij}(W_{ij}^{t}) + F'_{ij}(W_{ij}^{t})(W_{ij} - W_{ij}^{t}) + \frac{1}{2} F''_{ij}(W_{ij}^{t})(W_{ij} - W_{ij}^{t})^2 \qquad (22)
By comparing formula (21) and formula (22), we find that 𝐺(𝑊𝑖𝑗, 𝑊𝑖𝑗^𝑡) = 𝐹𝑖𝑗(𝑊𝑖𝑗) when 𝑊𝑖𝑗 = 𝑊𝑖𝑗^𝑡. This satisfies one of the conditions in Definition 1. For the other condition, 𝐺(𝑤, 𝑤′) ≥ 𝐹(𝑤), it is necessary to demonstrate the validity of the following inequality:

\frac{(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}}{W_{ij}^{t}} \ge (X^T X)_{ii} + \alpha L_{jj} + \gamma I_{ii} + \lambda(\mathbf{1}_{d \times d} - I)_{ii} \qquad (23)

Obviously,

(X^T X W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij} = \sum_{r=1}^{d} (X^T X)_{ir} W_{rj} + \sum_{r=1}^{d} (\gamma I + \lambda \mathbf{1}_{d \times d})_{ir} W_{rj} \ge (X^T X)_{ii} W_{ij} + (\gamma I + \lambda \mathbf{1}_{d \times d})_{ii} W_{ij} \ge (X^T X)_{ii} W_{ij} + (\gamma I + \lambda(\mathbf{1}_{d \times d} - I))_{ii} W_{ij} \qquad (24)

\alpha (W A)_{ij} = \alpha \sum_{r=1}^{l} W_{ir} A_{rj} \ge \alpha W_{ij} A_{jj} \ge \alpha W_{ij} (A_{jj} - S_{jj}) = \alpha W_{ij} L_{jj} \qquad (25)

Therefore, inequality (23) holds; that is to say, formula (21) is an appropriate auxiliary function of 𝐹𝑖𝑗(𝑊𝑖𝑗). As a result, we can obtain the following update rule according to formula (19) in Lemma 1:

W_{ij}^{t+1} \leftarrow W_{ij}^{t} - W_{ij}^{t} \, \frac{F'_{ij}(W_{ij}^{t})}{2(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}} = W_{ij}^{t} \, \frac{(X^T Y + \alpha W S + \lambda W)_{ij}}{(X^T X W + \alpha W A + \beta Q \circ W + \gamma W + \lambda \mathbf{1}_{d \times d} W)_{ij}} \qquad (26)

Ultimately, we establish the convergence of the objective function within the specified optimization framework.

5. Experiments

To evaluate ESRFS effectively and efficiently, numerous experimental studies demonstrate its performance on diverse benchmark data sets for multi-label classification from various viewpoints.


Table 3
Elaborated information regarding the experimental data sets.
Data set Samples Attributes Labels Training set Test set Domain
Arts 5000 462 26 2000 3000 Web text
Business 5000 438 30 2000 3000 Web text
Education 5000 550 33 2000 3000 Web text
Emotions 593 72 6 391 202 Music
Enron 1702 1001 53 1123 579 Text
Entertain 5000 640 21 2000 3000 Web text
Flags 194 19 7 129 65 Images
Health 5000 612 32 2000 3000 Web text
Medical 978 1449 45 333 645 Text
Recreation 5000 606 22 2000 3000 Web text
Reference 5000 793 33 2000 3000 Web text
Science 5000 743 40 2000 3000 Web text
Social 5000 1047 39 2000 3000 Web text
Society 5000 636 27 2000 3000 Web text
Yeast 2417 103 14 1500 917 Biology

5.1. Data sets

For the purpose of evaluating the efficacy of the proposed ESRFS approach, our experimental assessment involved the utilization
of a comprehensive set of fifteen data sets sourced exclusively from the MULAN Library (Tsoumakas, Spyromitros-Xioufis, Vilcek,
& Vlahavas, 2011). In addition, we adopt the segmentation scheme in the literature (Tsoumakas et al., 2011) and Zhang and Zhou (2010) to divide the data sets. As an example, the Yeast data set comprises 2417 phylogenetic profiles of yeast genes, where each yeast gene is associated with a distinct subset of 14 annotations. The data set Emotions (also called Music) has 593 songs and 6 labels, such as relaxing-calm, sad-lonely and amazed. A comprehensive overview of all employed multi-label data sets is presented in
Table 3, providing detailed descriptions for each of them.

5.2. All experimental settings

All experiments were conducted using a computer system equipped with an Intel Core I7-6700 processor and 16 GB of RAM. To
evaluate the ESRFS approach, four popular criteria in multi-label learning evaluation are adopted: two instance-level metrics, namely Hamming Loss (𝑎.𝑘.𝑎. HL) and Zero-One Loss (𝑎.𝑘.𝑎. ZOL), as well as two label-level metrics, namely Micro-𝐹1 and Macro-𝐹1. They are defined as follows:

(1). HL assesses the extent of misclassification of predicted labels for individual instances:

HL = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_{i\cdot} \oplus Y'_{i\cdot}|}{l} \qquad (27)

(2). ZOL measures the scenario where the highest-ranked labels assigned to instances are not present in the specified label set:

ZOL = \frac{1}{n} \sum_{i=1}^{n} \delta\big(\arg\max_{Y_{\cdot j} \in Y} f(X_{i\cdot}; Y_{\cdot j})\big) \qquad (28)

(3). The weighted average of all labels is computed using Micro-𝐹1 :


Micro\text{-}F_1 = \frac{\sum_{i=1}^{l} 2TP^{i}}{\sum_{i=1}^{l} (2TP^{i} + FP^{i} + FN^{i})} \qquad (29)

(4). Macro-𝐹1 calculates the arithmetic mean result of all labels:

Macro\text{-}F_1 = \frac{1}{l} \sum_{i=1}^{l} \frac{2(TP)^{i}}{2(TP)^{i} + (FP)^{i} + (FN)^{i}} \qquad (30)

Let 𝛺 = (𝑋𝑖⋅; 𝑌𝑖⋅ | 1 ≤ 𝑖 ≤ 𝑛) represent a collection of test data samples. Here, 𝑌 and 𝑌′ refer to the label set containing 𝑙 labels and the corresponding predicted label set, respectively. 𝑓(𝑋𝑖⋅; 𝑌⋅𝑗) represents the predicted probability of association between 𝑌⋅𝑗 and 𝑋𝑖⋅. Furthermore, ⊕ denotes the XOR operation. 𝛿 serves as a control signal: 𝛿 = 1 if arg max_{𝑌⋅𝑗∈𝑌} 𝑓(𝑋𝑖⋅; 𝑌⋅𝑗) ∉ 𝑌𝑖⋅, and 𝛿 = 0 otherwise. A decrease in the values of HL (lower is better) and ZOL (lower is better) signifies improved classification performance, whereas an increase in the values of Micro-𝐹1 (higher is better) and Macro-𝐹1 (higher is better) indicates enhanced classification performance.
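The four criteria can be computed from 0/1 prediction matrices with scikit-learn; the snippet below is a minimal illustration under the assumption that Y_true and Y_pred are n × l indicator matrices and scores holds real-valued label scores (it is not the authors' evaluation code):

import numpy as np
from sklearn.metrics import hamming_loss, f1_score

def evaluate(Y_true, Y_pred, scores):
    # Y_true, Y_pred: 0/1 arrays of shape (n, l); scores: real-valued label scores of shape (n, l).
    hl = hamming_loss(Y_true, Y_pred)                                   # Eq. (27), lower is better
    top = np.argmax(scores, axis=1)                                     # highest-ranked label per instance
    zol = np.mean(Y_true[np.arange(len(top)), top] == 0)                # Eq. (28), lower is better
    micro = f1_score(Y_true, Y_pred, average="micro", zero_division=0)  # Eq. (29), higher is better
    macro = f1_score(Y_true, Y_pred, average="macro", zero_division=0)  # Eq. (30), higher is better
    return hl, zol, micro, macro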
Three commonly employed classifiers (MLkNN with k=10, Linear Support Vector Machine (SVM) classifier, and k-Nearest
Neighbors (kNN) with k=3) were selected based on an extensive review of relevant literature (Kou, Lin, Qian, & Liao, 2023; Li et al.,
2023; Wen, Li, Zhang, & Zhai, 2022), where HL and ZOL use the output results of the MLkNN classifier, while Micro-𝐹1 and Macro-𝐹1 use the output predictions from the Linear SVM classifier and the 3NN classifier. In addition,
all regularization parameters (for all compared approaches) are tuned and selected from the grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} to ensure fairness. Analogous to the approach of Hu et al. (2020), we choose the top-m (where m equals seventeen for the Medical dataset and one hundred for the Flags dataset, while the other data sets are set to twenty) percent of ranked features from
each dataset, with a step size of one percent. ESRFS is compared with eight feature selection approaches, including sparse-learning based approaches (DSMFS (Hu et al., 2022) with the conventional 𝑙1-norm, RALM-FS (Cai et al., 2013) with the 𝑙2,0-norm, and MIFS (Jian et al., 2016) and SSFS (Gao et al., 2023) with the variant 𝑙2,1-norm), information-theoretical based approaches (GMM (Gonzalez-Lopez et al., 2020) and LRFS (Zhang et al., 2019)) and traditional approaches (PPT+MI (Doquire & Verleysen, 2011) and PPT+CHI (Read, 2008)). These state-of-the-art feature selection approaches are summarized as follows:

• SSFS (Gao et al., 2023): it is a state-of-the-art multi-label feature selection approach based on a constrained latent structure
shared space.
• DSMFS (Hu et al., 2022): it uses dynamic subspace and dual-graph regularizer to achieve joint feature selection and multi-label
learning.
• MIFS (Jian et al., 2016): it exploits label subspace by non-negative matrix factorization to obtain the optimal feature subset
from multi-label data.
• RALM-FS (Cai et al., 2013): it utilizes an 𝑙2,0-norm constraint condition to achieve a sparse solution for feature selection.
• GMM (Gonzalez-Lopez et al., 2020): it employs the largest geometric mean to calculate the mutual information among labels.
• LRFS (Zhang et al., 2019): it employs conditional mutual information to capture the relationship between independent labels
and dependent labels.
• PPT+MI (Doquire & Verleysen, 2011): it is a traditional approach for MLFS by using mutual information.
• PPT+CHI (Read, 2008): it is a traditional approach for MLFS by using 𝜒 2 -statistics.

5.3. Experiment results and analysis

In this section, we analyze the experimental results, which are recorded in Tables 4–9. These results include the mean value and
standard deviation (mean±std) for each approach, as well as their rankings across the corresponding data sets. When two or more approaches have the same value, we calculate the average rank for these approaches and use it to determine their ranking. The final row (Average (rank)) displays the average results and their corresponding ranks in each table. Besides, bold indicates that the corresponding approach achieves the best result on a given data set.
It is evident that ESRFS exhibits superior performance compared to the other approaches across the majority of the employed data sets. Specifically, ESRFS outperforms the compared approaches on twelve data sets in Tables 4, 5, 7 and 9. In Table 6, ESRFS demonstrates superior performance compared to the other evaluated approaches across eleven data sets. In Table 8, ESRFS exhibits superior performance compared to the other approaches across thirteen data sets. Overall, ESRFS performs well compared to the competing approaches.
In addition, ESRFS exhibits limited marginal benefits in terms of classification performance in Emotions, Yeast, Enron and
Business. Concretely, we can observe in Table 4 that ESRFS is ranked fifth in the Emotions data set. This is attributed to the
significantly larger number of training samples in the observational data set compared to the number of features, which results in
certain functions of ESRFS being less effective. ESRFS obtains the second ranking in the Enron data set, which shows no significant
difference compared to the first-ranked classification results, as the difference is only 0.0005.
According to Table 5, RALM-FS and MIFS achieve the highest performance in Emotions, Enron, and Business, respectively. In the Yeast data set, MIFS exhibits higher classification performance, with significant differences from the second-ranked approach. However, the margin between MIFS and the second-ranked approach is quite narrow in the Business data set. This indicates that MIFS exhibits a notable advantage on the biological data set. Similar to Table 4, RALM-FS achieves optimal results in Emotions. A preliminary assessment indicates that it demonstrates a distinct advantage in classifying emotions in the music domain. In addition, we can see that RALM-FS achieves the best results on the Emotions data set in Tables 6 and 7.
In Tables 6 and 7, DSMFS obtains the optimal result in Flags, Emotions and Yeast. In addition, DSMFS obtains the best result
in the Enron data set in Table 6. At the same time, DSMFS obtains sub-optimal results in most other data sets. We can see similar
situations in other tables. However, through the tabular results, we can only determine that ESRFS outperforms DSMFS in terms of
classification performance. To further analyze the strengths and weaknesses of ESRFS and DSMFS, we conduct statistical analysis
in subsequent steps.
In Table 8, RALM-FS and MIFS achieve superior classification performance in the Emotions and Yeast data sets, respectively.
In Table 9, RALM-FS attains the best results in both the Emotions and Yeast data sets, while the DSMFS demonstrates optimal
performance in the Enron data set. There is a similar situation in the previous table analysis.
Furthermore, ESRFS achieves the best results on almost all text-class data sets, which indicates that ESRFS is well suited to processing text data sets. It is also noteworthy that multi-label feature selection approaches based on sparse theory outperform other types of approaches on most evaluation metrics. This advantage is attributed to the specific nature of textual documents, wherein text data typically encompasses a multitude of features and labels, with intricate interrelationships among these features. Sparse theory offers effective technologies for selecting the most relevant and discriminative features, thus enhancing the performance of classification models. Meanwhile, the sparse-based ESRFS approach overcomes the shortcomings of the 𝑙2,1-norm and the 𝑙1-norm when

Table 4
The HL classification performance of all methodologies, measured using MLkNN, is presented in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.2885±0.0204(1) 0.3171 ± 0.0082(5) 0.3184 ± 0.0083(7) 0.292 ± 0.0205(3) 0.3169 ± 0.0087(4) 0.3239 ± 0.0087(9) 0.3181 ± 0.0082(6) 0.3191 ± 0.0065(8) 0.2915 ± 0.0201(2)
Emotions 0.2783 ± 0.0146(5) 0.2647 ± 0.0197(2) 0.2671 ± 0.0191(3) 0.2948 ± 0.0346(6.5) 0.2966 ± 0.0151(8) 0.2618±0.0385(1) 0.2948 ± 0.0107(6.5) 0.399 ± 0.0175(9) 0.2756 ± 0.0112(4)
Medical 0.013±0.0022(1) 0.0178 ± 0.0013(5) 0.0168 ± 0.0015(3) 0.0165 ± 0.002(2) 0.0175 ± 0.001(4) 0.027 ± 0.0007(8) 0.0276 ± 0.0000(9) 0.0188 ± 0.0009(7) 0.0181 ± 0.0009(6)
Yeast 0.2261±0.0044(1) 0.233 ± 0.0109(7) 0.2301 ± 0.0079(5) 0.2289 ± 0.0082(4) 0.2335 ± 0.0081(8) 0.228 ± 0.0093(3) 0.232 ± 0.0085(6) 0.2681 ± 0.0338(9) 0.2278 ± 0.0039(2)
Enron 0.0504 ± 0.0023(2) 0.0528 ± 0.0015(4) 0.0589 ± 0.0005(9) 0.0574 ± 0.0012(7) 0.055 ± 0.003(5) 0.057 ± 0.0023(6) 0.0577 ± 0.0007(8) 0.0515 ± 0.0015(3) 0.0499±0.0013(1)
Arts 0.0586±0.001(1) 0.0647 ± 0.0006(7.5) 0.0646 ± 0.0009(6) 0.0632 ± 0.0012(4) 0.0647 ± 0.0009(7.5) 0.0633 ± 0.0008(5) 0.0648 ± 0.0007(9) 0.0629 ± 0.0015(3) 0.0618 ± 0.0008(2)
Business 0.0289 ± 0.0005(4) 0.0296 ± 0.0004(7.5) 0.0296 ± 0.0004(7.5) 0.0288±0.0006(2) 0.0288±0.0005(2) 0.0296 ± 0.0005(7.5) 0.0296 ± 0.0005(7.5) 0.0292 ± 0.0006(5) 0.0288±0.0006(2)
Education 0.0401±0.0008(1) 0.0448 ± 0.0013(6) 0.0446 ± 0.0014(4) 0.0449 ± 0.0007(7) 0.0447 ± 0.0011(5) 0.0433 ± 0.0005(3) 0.045 ± 0.0008(8) 0.0632 ± 0.0019(9) 0.0423 ± 0.0008(2)
Entertain 0.0581±0.0023(1) 0.0654 ± 0.0014(6) 0.0659 ± 0.0013(8) 0.0653 ± 0.0018(5) 0.0657 ± 0.0015(7) 0.0638 ± 0.0021(4) 0.0661 ± 0.0018(9) 0.0631 ± 0.0018(3) 0.0616 ± 0.0011(2)
Health 0.0408±0.0013(1) 0.0476 ± 0.0021(8) 0.0476 ± 0.002(8) 0.0448 ± 0.0026(5) 0.0454 ± 0.0013(6) 0.0423 ± 0.0022(2) 0.0476 ± 0.0021(8) 0.044 ± 0.0024(4) 0.0432 ± 0.0013(3)
Recreation 0.0591±0.0017(1) 0.0667 ± 0.0006(6.5) 0.067 ± 0.0008(8.5) 0.0601 ± 0.0017(2) 0.0667 ± 0.0007(6.5) 0.0605 ± 0.0013(3) 0.067 ± 0.001(8.5) 0.063 ± 0.0011(5) 0.0615 ± 0.001(4)
Reference 0.0292±0.0008(1) 0.0325 ± 0.0015(5.5) 0.0329 ± 0.0015(9) 0.0313 ± 0.0015(4) 0.0326 ± 0.0013(7) 0.0325 ± 0.0013(5.5) 0.0327 ± 0.0017(8) 0.0301 ± 0.0008(3) 0.0298 ± 0.0008(2)
Science 0.0342±0.0004(1) 0.0362 ± 0.0004(6.5) 0.0364 ± 0.0003(9) 0.036 ± 0.0006(4) 0.0362 ± 0.0005(6.5) 0.0361 ± 0.0004(5) 0.0363 ± 0.0004(8) 0.0354 ± 0.0004(3) 0.0352 ± 0.0004(2)
Society 0.0562±0.0009(1) 0.0589 ± 0.0009(8) 0.0588 ± 0.0009(7) 0.0582 ± 0.0012(4) 0.0584 ± 0.001(5) 0.0593 ± 0.0006(9) 0.0586 ± 0.0011(6) 0.0577 ± 0.0012(3) 0.0567 ± 0.001(2)
Social 0.0229±0.0015(1) 0.0273 ± 0.0015(4) 0.0283 ± 0.0018(6) 0.0311 ± 0.0016(8) 0.0275 ± 0.002(5) 0.0324 ± 0.001(9) 0.0284 ± 0.0019(7) 0.0249 ± 0.0012(3) 0.0247 ± 0.0006(2)

Average (rank) 0.0856(1.5333) 0.0906(5.9) 0.0911(6.6667) 0.0902(4.5) 0.0927(5.7667) 0.0907(5.3333) 0.0938(7.6333) 0.102(5.1333) 0.0872(2.5333)
Table 5
The ZOL classification performance of all methodologies, evaluated using MLkNN, is provided in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.8518±0.0657(1) 0.9223 ± 0.0293(6.5) 0.9255 ± 0.0275(8) 0.8761 ± 0.0321(3) 0.9206 ± 0.0182(5) 0.9312 ± 0.0284(9) 0.9223 ± 0.028(6.5) 0.919 ± 0.0263(4) 0.8599 ± 0.0438(2)
Emotions 0.8437 ± 0.0331(4) 0.8494 ± 0.0328(6) 0.8335 ± 0.0222(3) 0.8487 ± 0.0398(5) 0.8819 ± 0.0274(8) 0.8147±0.0694(1) 0.8656 ± 0.031(7) 0.9501 ± 0.0301(9) 0.8225 ± 0.0334(2)
Medical 0.4072±0.09(1) 0.5917 ± 0.0451(5) 0.5474 ± 0.0588(3) 0.5468 ± 0.0825(2) 0.5774 ± 0.0382(4) 0.866 ± 0.073(8) 1.0000 ± 0.0000(9) 0.6267 ± 0.0449(7) 0.6031 ± 0.0383(6)
Yeast 0.9084 ± 0.0386(8) 0.8899 ± 0.0336(4) 0.8835 ± 0.0322(2) 0.8826±0.0272(1) 0.8976 ± 0.0383(7) 0.8872 ± 0.0377(3) 0.8918 ± 0.0356(5.5) 0.974 ± 0.0141(9) 0.8918 ± 0.0294(5.5)
Enron 0.8919±0.0286(1) 0.9011 ± 0.0316(3) 0.9847 ± 0.0027(9) 0.9817 ± 0.0064(8) 0.9264 ± 0.0368(5) 0.9566 ± 0.0399(6) 0.9765 ± 0.0078(7) 0.9178 ± 0.0306(4) 0.8936 ± 0.0203(2)
Arts 0.8325±0.0349(1) 0.94 ± 0.0217(9) 0.9348 ± 0.0229(7) 0.9193 ± 0.0354(4) 0.9309 ± 0.0217(6) 0.9256 ± 0.0393(5) 0.9382 ± 0.0207(8) 0.8959 ± 0.0412(3) 0.883 ± 0.0264(2)
Business 0.4887 ± 0.013(5) 0.4916 ± 0.0112(8) 0.4951 ± 0.0101(9) 0.4821±0.0119(1) 0.4847 ± 0.0153(3.5) 0.4839 ± 0.0132(2) 0.4896 ± 0.013(7) 0.4894 ± 0.0111(6) 0.4847 ± 0.0143(3.5)
Education 0.8079±0.0371(1) 0.9162 ± 0.0287(6) 0.9164 ± 0.0351(7) 0.9421 ± 0.0373(9) 0.9105 ± 0.0304(5) 0.8753 ± 0.0353(4) 0.9283 ± 0.0279(8) 0.8635 ± 0.0317(3) 0.8632 ± 0.0246(2)
Entertain 0.7317±0.0537(1) 0.8541 ± 0.033(7) 0.8537 ± 0.0361(6) 0.8516 ± 0.0547(5) 0.8588 ± 0.0387(8) 0.8511 ± 0.0626(4) 0.8599 ± 0.0349(9) 0.8078 ± 0.0554(3) 0.7865 ± 0.0366(2)
Health 0.6467±0.0227(1) 0.7572 ± 0.0345(7) 0.7586 ± 0.0389(8) 0.7108 ± 0.0329(5) 0.7278 ± 0.0158(6) 0.6546 ± 0.024(2) 0.7611 ± 0.0385(9) 0.7103 ± 0.0511(4) 0.7009 ± 0.0292(3)
Recreation 0.7954±0.0308(1) 0.9566 ± 0.0192(8) 0.9552 ± 0.0202(6) 0.8223 ± 0.0459(2) 0.9555 ± 0.0188(7) 0.8349 ± 0.0411(3) 0.9612 ± 0.0146(9) 0.8656 ± 0.0405(5) 0.8443 ± 0.0207(4)
Reference 0.6854±0.0445(1) 0.7567 ± 0.0402(7) 0.7676 ± 0.0345(9) 0.7377 ± 0.0593(5) 0.7486 ± 0.0396(6) 0.7084 ± 0.0711(3) 0.7593 ± 0.0315(8) 0.7094 ± 0.0394(4) 0.6999 ± 0.0397(2)
Science 0.8318±0.0337(1) 0.9316 ± 0.0215(7) 0.9289 ± 0.0242(6) 0.9263 ± 0.0271(4) 0.9288 ± 0.026(5) 0.9411 ± 0.0281(9) 0.9361 ± 0.0234(8) 0.8961 ± 0.0286(3) 0.8946 ± 0.0226(2)
Society 0.7608±0.0253(1) 0.7988 ± 0.0181(4) 0.8002 ± 0.0181(5) 0.8105 ± 0.0346(8) 0.7978 ± 0.0173(3) 0.8391 ± 0.0518(9) 0.8027 ± 0.0257(6) 0.806 ± 0.0333(7) 0.7742 ± 0.029(2)
Social 0.5519±0.059(1) 0.6472 ± 0.0505(4) 0.6789 ± 0.0596(7) 0.7667 ± 0.0923(8) 0.6613 ± 0.0623(5) 0.8412 ± 0.0799(9) 0.6784 ± 0.062(6) 0.596 ± 0.0324(2) 0.6032 ± 0.0299(3)

Average (rank) 0.7357(1.9333) 0.8136(6.1) 0.8176(6.3333) 0.807(4.6667) 0.8139(5.5667) 0.8274(5.1333) 0.8514(7.5333) 0.8018(4.8667) 0.7737(2.8667)
Table 6
The Micro-𝐹1 classification performance of all methodologies, assessed using the SVM classifier, is presented in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.7153 ± 0.045(3) 0.6646 ± 0.0403(5) 0.6632 ± 0.0379(6) 0.7225 ± 0.0499(2) 0.6659 ± 0.0405(4) 0.6315 ± 0.0481(9) 0.6603 ± 0.0321(7) 0.652 ± 0.0339(8) 0.7226±0.0323(1)
Emotions 0.4442 ± 0.0961(2) 0.3442 ± 0.1752(4) 0.273 ± 0.1378(5) 0.0684 ± 0.059(8) 0.2257 ± 0.1846(6) 0.0082 ± 0.0296(9) 0.0897 ± 0.0608(7) 0.4215 ± 0.1169(3) 0.446±0.0558(1)
Medical 0.7704±0.0986(1) 0.7282 ± 0.0483(5) 0.736 ± 0.0668(3) 0.7149 ± 0.1058(6) 0.7567 ± 0.0548(2) 0.3632 ± 0.1474(8) 0.0000 ± 0.0000(9) 0.6936 ± 0.0777(7) 0.7301 ± 0.0729(4)
Yeast 0.5748 ± 0.0373(2) 0.5568 ± 0.0278(6) 0.5622 ± 0.0317(4) 0.5659 ± 0.0275(3) 0.5523 ± 0.0299(7) 0.559 ± 0.0288(5) 0.4933 ± 0.0136(9) 0.5395 ± 0.0333(8) 0.5788±0.0345(1)
Enron 0.5262 ± 0.0505(2) 0.4695 ± 0.0426(4) 0.3531 ± 0.0189(9) 0.3723 ± 0.0274(8) 0.4457 ± 0.0555(5) 0.3891 ± 0.0588(6) 0.385 ± 0.0357(7) 0.5055 ± 0.0417(3) 0.5365±0.0324(1)
Arts 0.2684±0.0577(1) 0.0904 ± 0.0527(8) 0.0981 ± 0.0553(7) 0.1391 ± 0.0783(4) 0.1041 ± 0.0448(5) 0.1018 ± 0.0607(6) 0.0482 ± 0.0259(9) 0.1878 ± 0.0915(3) 0.2355 ± 0.066(2)
Business 0.6883±0.0085(1) 0.6729 ± 0.0042(6) 0.6725 ± 0.0036(7) 0.6835 ± 0.0078(3) 0.6796 ± 0.0047(5) 0.6696 ± 0.001(8) 0.6676 ± 0.0000(9) 0.6817 ± 0.0091(4) 0.6878 ± 0.0085(2)
Education 0.3061±0.0589(1) 0.1237 ± 0.0834(6) 0.1199 ± 0.0848(7) 0.0733 ± 0.0587(8) 0.1495 ± 0.0806(5) 0.1934 ± 0.0558(4) 0.0569 ± 0.0415(9) 0.25 ± 0.0844(3) 0.2579 ± 0.0665(2)
Entertain 0.3697±0.0784(1) 0.1896 ± 0.0884(6) 0.1832 ± 0.1027(8) 0.2278 ± 0.1121(4) 0.1843 ± 0.0821(7) 0.2143 ± 0.1004(5) 0.1163 ± 0.0769(9) 0.2806 ± 0.1063(3) 0.327 ± 0.0724(2)
Health 0.5619±0.0319(1) 0.4017 ± 0.0743(7) 0.3993 ± 0.0748(8) 0.4681 ± 0.0795(6) 0.4754 ± 0.037(5) 0.5157 ± 0.0439(3) 0.3618 ± 0.0552(9) 0.4999 ± 0.1022(4) 0.5214 ± 0.0584(2)
Recreation 0.2874±0.0495(1) 0.0343 ± 0.0364(8) 0.0402 ± 0.0384(6) 0.2524 ± 0.0698(3) 0.0392 ± 0.0374(7) 0.2279 ± 0.0711(4) 0.007 ± 0.0108(9) 0.2245 ± 0.0801(5) 0.2646 ± 0.0505(2)
Reference 0.4553±0.0614(1) 0.348 ± 0.0768(8) 0.3546 ± 0.0742(7) 0.3593 ± 0.1046(6) 0.3768 ± 0.0664(5) 0.4042 ± 0.1322(4) 0.305 ± 0.0649(9) 0.4227 ± 0.068(3) 0.4315 ± 0.0664(2)
Science 0.2502±0.052(1) 0.1146 ± 0.059(6) 0.1104 ± 0.0546(7) 0.1286 ± 0.0569(4.5) 0.1286 ± 0.0776(4.5) 0.0969 ± 0.0539(8) 0.0368 ± 0.032(9) 0.1985 ± 0.0694(3) 0.2114 ± 0.0581(2)
Society 0.3489±0.0322(1) 0.3155 ± 0.0326(5) 0.3171 ± 0.0325(4) 0.3001 ± 0.0424(7) 0.321 ± 0.016(3) 0.2231 ± 0.0594(9) 0.3041 ± 0.0403(6) 0.2965 ± 0.0601(8) 0.3303 ± 0.0281(2)
Social 0.5624±0.0822(1) 0.4699 ± 0.0971(4) 0.4452 ± 0.1082(6) 0.2756 ± 0.1361(8) 0.4648 ± 0.1194(5) 0.149 ± 0.1124(9) 0.3843 ± 0.1113(7) 0.5376 ± 0.0755(3) 0.5459 ± 0.0552(2)

Average (rank) 0.4753(1.3333) 0.3683(5.8667) 0.3552(6.2667) 0.3568(5.3667) 0.3713(5.0333) 0.3165(6.4667) 0.2611(8.2667) 0.4261(4.5333) 0.4552(1.8667)
Table 7
The Macro-𝐹1 classification performance of all methodologies, evaluated using the SVM classifier, is provided in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.5651 ± 0.089(3) 0.5135 ± 0.0372(6) 0.5174 ± 0.0457(5) 0.5697 ± 0.0998(2) 0.5106 ± 0.0442(7) 0.4872 ± 0.0436(9) 0.5233 ± 0.0329(4) 0.4987 ± 0.048(8) 0.5911±0.0319(1)
Emotions 0.3638 ± 0.1092(2) 0.2825 ± 0.145(3) 0.1925 ± 0.1176(5) 0.0409 ± 0.0365(8) 0.166 ± 0.1549(6) 0.0051 ± 0.0184(9) 0.0483 ± 0.0329(7) 0.2521 ± 0.0699(4) 0.3705±0.0833(1)
Medical 0.3451±0.0647(1) 0.2483 ± 0.0459(5) 0.2609 ± 0.0385(4) 0.2216 ± 0.0483(7) 0.3172 ± 0.0735(2) 0.1288 ± 0.0629(8) 0.0000 ± 0.0000(9) 0.2306 ± 0.0648(6) 0.2802 ± 0.0823(3)
Yeast 0.257 ± 0.0478(2) 0.2337 ± 0.0376(6) 0.2435 ± 0.0464(4) 0.2511 ± 0.0385(3) 0.228 ± 0.0448(7) 0.2397 ± 0.0416(5) 0.1437 ± 0.0194(9) 0.2084 ± 0.0438(8) 0.2647±0.0444(1)
Enron 0.1367±0.036(1) 0.1019 ± 0.0346(4) 0.067 ± 0.0164(9) 0.0743 ± 0.0169(7) 0.0988 ± 0.0325(5) 0.0741 ± 0.0268(8) 0.0798 ± 0.0237(6) 0.1224 ± 0.0447(3) 0.1301 ± 0.0314(2)
Arts 0.1138±0.0296(1) 0.0373 ± 0.0216(8) 0.0397 ± 0.0228(6) 0.055 ± 0.034(4) 0.0433 ± 0.0187(5) 0.0385 ± 0.0256(7) 0.0194 ± 0.0107(9) 0.079 ± 0.0384(3) 0.0931 ± 0.0293(2)
Business 0.064±0.0108(1) 0.0385 ± 0.0082(7) 0.0395 ± 0.0079(6) 0.0564 ± 0.0107(3) 0.0547 ± 0.0072(4) 0.032 ± 0.0006(8) 0.0309 ± 0.0000(9) 0.0546 ± 0.0136(5) 0.0633 ± 0.011(2)
Education 0.081±0.0173(1) 0.0342 ± 0.0249(7) 0.0345 ± 0.0263(6) 0.0195 ± 0.017(8) 0.0485 ± 0.0279(5) 0.0525 ± 0.0144(4) 0.0132 ± 0.0106(9) 0.0647 ± 0.0202(3) 0.0683 ± 0.0165(2)
Entertain 0.1627±0.0355(1) 0.0723 ± 0.033(8) 0.0781 ± 0.0436(7) 0.0971 ± 0.0474(4) 0.0788 ± 0.0307(6) 0.0815 ± 0.0384(5) 0.043 ± 0.0294(9) 0.1163 ± 0.047(3) 0.141 ± 0.033(2)
Health 0.1945±0.0474(1) 0.1023 ± 0.0435(8) 0.1039 ± 0.0428(7) 0.1181 ± 0.0461(6) 0.1454 ± 0.0378(5) 0.1592 ± 0.0415(4) 0.0441 ± 0.0164(9) 0.1685 ± 0.0628(2) 0.1638 ± 0.0478(3)
Recreation 0.1554±0.031(1) 0.0197 ± 0.0213(8) 0.0245 ± 0.0239(6) 0.1336 ± 0.0334(3) 0.0231 ± 0.0225(7) 0.1216 ± 0.0421(4) 0.0043 ± 0.0065(9) 0.1119 ± 0.0422(5) 0.1352 ± 0.0249(2)
Reference 0.1029±0.022(1) 0.0382 ± 0.0192(7) 0.037 ± 0.018(8) 0.0628 ± 0.0236(4.5) 0.051 ± 0.0207(6) 0.0628 ± 0.0278(4.5) 0.0186 ± 0.0044(9) 0.0677 ± 0.0258(3) 0.081 ± 0.0294(2)
Science 0.0954±0.0247(1) 0.0335 ± 0.0221(7) 0.0318 ± 0.0205(8) 0.0341 ± 0.0164(6) 0.0388 ± 0.0287(4) 0.0357 ± 0.0199(5) 0.0087 ± 0.0081(9) 0.0695 ± 0.028(3) 0.0747 ± 0.0227(2)
Society 0.0977±0.0183(1) 0.0479 ± 0.0149(6) 0.0467 ± 0.0144(7) 0.0554 ± 0.0197(4) 0.0493 ± 0.0155(5) 0.0318 ± 0.0116(9) 0.0346 ± 0.008(8) 0.0555 ± 0.0186(3) 0.0786 ± 0.0177(2)
Social 0.1328±0.0274(1) 0.0732 ± 0.0296(6) 0.0813 ± 0.0314(5) 0.0308 ± 0.0164(8) 0.0832 ± 0.0312(4) 0.0144 ± 0.012(9) 0.0349 ± 0.011(7) 0.0988 ± 0.0316(2) 0.0983 ± 0.03(3)

Average (rank) 0.1912(1.2667) 0.1251(6.4) 0.1199(6.2) 0.1214(5.1667) 0.1291(5.2) 0.1043(6.5667) 0.0698(8.1333) 0.1466(4.0667) 0.1756(2)
Table 8
The Micro-𝐹1 classification performance of all methodologies, analyzed using the 3NN classifier, is reported in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.6856±0.0391(1) 0.615 ± 0.0109(6) 0.6154 ± 0.0105(5) 0.6615 ± 0.0302(2) 0.6157 ± 0.018(4) 0.6057 ± 0.0122(9) 0.6129 ± 0.0136(8) 0.6133 ± 0.0141(7) 0.6611 ± 0.04(3)
Emotions 0.5082 ± 0.0869(4) 0.5324 ± 0.0383(2) 0.5276 ± 0.0487(3) 0.4877 ± 0.0736(7) 0.4771 ± 0.0315(8) 0.5492±0.0753(1) 0.4928 ± 0.0289(6) 0.3701 ± 0.0483(9) 0.5075 ± 0.0295(5)
Medical 0.7364±0.0914(1) 0.6179 ± 0.0397(4) 0.6374 ± 0.0599(3) 0.6104 ± 0.0953(6) 0.6427 ± 0.0349(2) 0.2939 ± 0.1077(8) 0.0000 ± 0.0000(9) 0.5838 ± 0.0443(7) 0.6114 ± 0.0443(5)
Yeast 0.5439 ± 0.049(6) 0.5494 ± 0.0252(3) 0.5477 ± 0.0228(4.5) 0.5547±0.021(1) 0.5406 ± 0.0273(8) 0.5477 ± 0.0237(4.5) 0.5438 ± 0.0212(7) 0.3131 ± 0.1321(9) 0.5501 ± 0.0297(2)
Enron 0.4927±0.0484(1) 0.4539 ± 0.0133(4) 0.3446 ± 0.0262(9) 0.4102 ± 0.0243(7) 0.4194 ± 0.048(5) 0.365 ± 0.0729(8) 0.4145 ± 0.0502(6) 0.4639 ± 0.0147(3) 0.4842 ± 0.0253(2)
Arts 0.2932±0.0357(1) 0.1812 ± 0.0365(9) 0.187 ± 0.0335(6) 0.2016 ± 0.0521(4) 0.1963 ± 0.0273(5) 0.1824 ± 0.0435(8) 0.1838 ± 0.0336(7) 0.23 ± 0.0489(3) 0.2449 ± 0.0292(2)
Business 0.6763±0.0126(1) 0.6515 ± 0.0229(8) 0.6556 ± 0.0234(6) 0.6605 ± 0.066(5) 0.67 ± 0.0093(2) 0.63 ± 0.0774(9) 0.6523 ± 0.0233(7) 0.6626 ± 0.0197(4) 0.6678 ± 0.0063(3)
Education 0.3389±0.0517(1) 0.2301 ± 0.0433(6) 0.2274 ± 0.0426(7) 0.1827 ± 0.055(9) 0.2431 ± 0.0455(5) 0.2605 ± 0.0505(4) 0.2223 ± 0.0361(8) 0.2856 ± 0.0432(3) 0.289 ± 0.0361(2)
Entertain 0.3896±0.0613(1) 0.2837 ± 0.0297(5) 0.2845 ± 0.0325(4) 0.2764 ± 0.0652(8) 0.2807 ± 0.0339(6) 0.2735 ± 0.0651(9) 0.2802 ± 0.0323(7) 0.317 ± 0.0557(3) 0.343 ± 0.0491(2)
Health 0.5064±0.0187(1) 0.3905 ± 0.0443(7) 0.387 ± 0.0445(9) 0.435 ± 0.0706(6) 0.4365 ± 0.0204(5) 0.4739 ± 0.0587(2) 0.3901 ± 0.0446(8) 0.4537 ± 0.0521(4) 0.4625 ± 0.0298(3)
Recreation 0.3103±0.0442(1) 0.1399 ± 0.0262(8) 0.1418 ± 0.0234(6) 0.2824 ± 0.0535(2) 0.1406 ± 0.0264(7) 0.2717 ± 0.0635(3) 0.1343 ± 0.0238(9) 0.2293 ± 0.0435(5) 0.2553 ± 0.0229(4)
Reference 0.4461±0.0233(1) 0.3733 ± 0.0436(7) 0.3657 ± 0.0395(9) 0.3816 ± 0.0546(6) 0.3859 ± 0.0456(4) 0.3857 ± 0.089(5) 0.371 ± 0.041(8) 0.4041 ± 0.0341(3) 0.4129 ± 0.0338(2)
Science 0.2765±0.038(1) 0.1621 ± 0.0284(7) 0.1623 ± 0.0262(6) 0.1711 ± 0.0365(5) 0.1765 ± 0.0346(4) 0.1597 ± 0.0476(9) 0.161 ± 0.0301(8) 0.2241 ± 0.0423(2) 0.2229 ± 0.0239(3)
Society 0.3382±0.0417(1) 0.3167 ± 0.0228(4) 0.317 ± 0.0258(3) 0.3055 ± 0.043(8) 0.3153 ± 0.0282(5) 0.2546 ± 0.0499(9) 0.3117 ± 0.0294(6) 0.3113 ± 0.0322(7) 0.3281 ± 0.0214(2)
Social 0.546±0.0515(1) 0.451 ± 0.0522(5) 0.4422 ± 0.0552(6) 0.3378 ± 0.0811(8) 0.4549 ± 0.0543(4) 0.3154 ± 0.054(9) 0.4295 ± 0.0578(7) 0.4947 ± 0.0277(2) 0.486 ± 0.0128(3)

Average (rank) 0.4726(1.5333) 0.3966(5.6667) 0.3895(5.7667) 0.3973(5.6) 0.3997(4.9333) 0.3713(6.5) 0.3467(7.4) 0.3971(4.7333) 0.4351(2.8667)
Table 9
The Macro-𝐹1 classification performance of all methodologies, assessed using the 3NN classifier, is presented in terms of mean±std.
Data sets ESRFS PPT+MI PPT+CHI MIFS LRFS RALM-FS GMM SSFS DSMFS
Flags 0.5535±0.0765(1) 0.448 ± 0.0112(7) 0.4493 ± 0.0108(6) 0.5265 ± 0.0776(2) 0.4531 ± 0.022(4) 0.4386 ± 0.0109(9) 0.4452 ± 0.0156(8) 0.4504 ± 0.0202(5) 0.5155 ± 0.0581(3)
Emotions 0.4773 ± 0.1073(4) 0.5143 ± 0.0402(2) 0.4997 ± 0.0506(3) 0.4746 ± 0.0749(6) 0.4551 ± 0.03(8) 0.5355±0.0758(1) 0.4739 ± 0.0282(7) 0.1751 ± 0.0277(9) 0.4757 ± 0.0519(5)
Medical 0.2762±0.047(1) 0.164 ± 0.0304(5) 0.187 ± 0.0234(3) 0.161 ± 0.0209(6) 0.1872 ± 0.0265(2) 0.0694 ± 0.0287(8) 0.0000 ± 0.0000(9) 0.1472 ± 0.0265(7) 0.1669 ± 0.0308(4)
Yeast 0.3258 ± 0.0517(8) 0.3404 ± 0.0273(3) 0.3376 ± 0.0252(4) 0.3407 ± 0.0215(2) 0.3287 ± 0.0275(7) 0.3421±0.0254(1) 0.3317 ± 0.0217(6) 0.138 ± 0.0489(9) 0.3359 ± 0.0438(5)
Enron 0.1321 ± 0.0216(2) 0.1164 ± 0.0193(4) 0.0738 ± 0.0125(9) 0.0873 ± 0.014(7) 0.1093 ± 0.0219(5) 0.0809 ± 0.0256(8) 0.1038 ± 0.0199(6) 0.122 ± 0.0196(3) 0.1363±0.0198(1)
Arts 0.1446±0.0377(1) 0.0879 ± 0.0219(8) 0.0912 ± 0.0219(6) 0.0951 ± 0.033(5) 0.1006 ± 0.025(4) 0.0714 ± 0.0253(9) 0.0887 ± 0.0229(7) 0.1186 ± 0.0349(2) 0.1167 ± 0.0246(3)
Business 0.1152±0.0193(1) 0.081 ± 0.0187(6) 0.0785 ± 0.0154(8) 0.1075 ± 0.023(2) 0.0958 ± 0.0172(5) 0.0654 ± 0.0178(9) 0.0795 ± 0.0169(7) 0.098 ± 0.0252(4) 0.1011 ± 0.0199(3)
Education 0.1144±0.0273(1) 0.0816 ± 0.0193(7) 0.0829 ± 0.0222(6) 0.0426 ± 0.0176(9) 0.087 ± 0.0202(4) 0.0838 ± 0.0234(5) 0.0796 ± 0.0185(8) 0.0959 ± 0.0195(3) 0.0984 ± 0.0163(2)
Entertain 0.2022±0.0392(1) 0.1468 ± 0.0187(5) 0.1511 ± 0.0223(4) 0.1379 ± 0.0418(8) 0.1448 ± 0.0198(7) 0.1305 ± 0.0402(9) 0.1455 ± 0.0175(6) 0.1627 ± 0.0353(3) 0.1823 ± 0.032(2)
Health 0.1971±0.033(1) 0.1329 ± 0.0293(7) 0.1327 ± 0.0268(8) 0.157 ± 0.0407(6) 0.1611 ± 0.0225(5) 0.1859 ± 0.0317(2) 0.1313 ± 0.0277(9) 0.1738 ± 0.0367(3) 0.1736 ± 0.0316(4)
Recreation 0.1869±0.0332(1) 0.095 ± 0.0233(7) 0.0973 ± 0.0214(6) 0.1703 ± 0.0381(2) 0.0932 ± 0.0213(8) 0.1597 ± 0.0447(4) 0.0909 ± 0.0205(9) 0.1472 ± 0.035(5) 0.1632 ± 0.0234(3)
Reference 0.1161±0.0184(1) 0.0785 ± 0.0165(6) 0.0761 ± 0.0155(9) 0.088 ± 0.0238(4) 0.0852 ± 0.017(5) 0.0766 ± 0.0257(7) 0.0763 ± 0.0172(8) 0.0956 ± 0.022(3) 0.1051 ± 0.0255(2)
Science 0.1234±0.0305(1) 0.0652 ± 0.0228(6) 0.0657 ± 0.0217(5) 0.0618 ± 0.0147(7) 0.0673 ± 0.0231(4) 0.0567 ± 0.0211(9) 0.0611 ± 0.021(8) 0.0931 ± 0.0282(3) 0.0957 ± 0.0227(2)
Society 0.1183±0.0204(1) 0.083 ± 0.0147(8) 0.0833 ± 0.016(7) 0.0888 ± 0.0211(3) 0.0847 ± 0.0129(5) 0.0534 ± 0.0156(9) 0.0836 ± 0.0152(6) 0.0855 ± 0.018(4) 0.1028 ± 0.018(2)
Social 0.1598±0.0278(1) 0.0986 ± 0.0341(6) 0.1073 ± 0.0296(5) 0.0506 ± 0.0167(8) 0.1114 ± 0.0269(4) 0.0382 ± 0.0123(9) 0.0957 ± 0.0293(7) 0.1203 ± 0.0353(2) 0.1168 ± 0.0368(3)

Average (rank) 0.2162(1.7333) 0.1689(5.8) 0.1676(5.9333) 0.1726(5.1333) 0.171(5.1333) 0.1592(6.6) 0.1525(7.4) 0.1482(4.3333) 0.1924(2.9333)

Table 10
The Friedman 𝐹𝐹 statistics and corresponding critical values.
Evaluation Metric Friedman 𝐹𝐹 Critical value (𝛼=0.05)
HL (based on MLkNN) 13.6235 2.022
ZOL (based on MLkNN) 9.2230 2.022
Micro-𝐹1 (based on SVM) 25.6102 2.022
Macro-𝐹1 (based on SVM) 27.0736 2.022
Micro-𝐹1 (based on 3NN) 10.7398 2.022
Macro-𝐹1 (based on 3NN) 10.1750 2.022

Fig. 1. The Bonferroni–Dunn post hoc test was conducted with a significance level (𝛼) of 0.05.

dealing with multi-label data, thus achieving better classification results than other sparse-based approaches. These findings hold
significant implications for research in the field of text mining.
To further analyze the ESRFS approach, the Friedman and Bonferroni–Dunn tests are applied to its classification performance. Table 10 first lists the Friedman 𝐹𝐹 statistics together with the critical value (𝛼=0.05). Since every 𝐹𝐹 statistic in Table 10 exceeds the critical value, the null hypothesis that all approaches perform equivalently is rejected. We therefore conduct the post hoc Bonferroni–Dunn test, with ESRFS serving as the control approach. The difference between ESRFS and each competing methodology is judged by comparing their average ranks against the critical distance (CD), computed as CD = 𝑞𝛼 √(𝑘(𝑘 + 1)∕(6𝑁)), where 𝑞𝛼 = 2.724, 𝛼 = 0.05, 𝑘 = 9, and 𝑁 = 15; this yields CD = 2.724. Fig. 1 then illustrates the critical distance diagrams for all employed evaluation metrics. In these diagrams, a horizontal red line connects ESRFS with every methodology whose average rank lies within one critical distance of it, indicating that those approaches are statistically comparable to ESRFS. As can be seen, ESRFS and DSMFS show no significant difference, whereas ESRFS maintains a competitive advantage in classification performance over the other compared approaches.
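For readers who wish to reproduce this statistical comparison, the sketch below (ours, not part of the released ESRFS code) evaluates the Bonferroni–Dunn critical distance from the formula above and checks two of the average-rank differences reported in Table 9; the function names are illustrative only.

import math

def critical_distance(q_alpha: float, k: int, n_datasets: int) -> float:
    # CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

def differs_significantly(rank_control: float, rank_other: float, cd: float) -> bool:
    # Two approaches differ significantly if their average ranks differ by more than CD.
    return abs(rank_control - rank_other) > cd

cd = critical_distance(q_alpha=2.724, k=9, n_datasets=15)
print(round(cd, 3))                                # 2.724, since sqrt(9*10 / (6*15)) = 1
# Average Macro-F1 (3NN) ranks from Table 9: ESRFS 1.7333, DSMFS 2.9333, GMM 7.4
print(differs_significantly(1.7333, 2.9333, cd))   # False: ESRFS vs. DSMFS
print(differs_significantly(1.7333, 7.4, cd))      # True: ESRFS vs. GMM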
Next, Fig. 2 is employed to examine the classification performance of the compared methodologies. It plots the relationship between the number of selected features (referred to as the top-m% features in this study) and the classification performance of all compared methodologies. The horizontal 𝑋-axis represents the number of selected features, while the vertical 𝑌-axis represents the classification performance measured by Hamming Loss and Zero-One Loss (MLkNN), Micro-𝐹1 and Macro-𝐹1 (SVM), and Micro-𝐹1 and Macro-𝐹1 (3NN), respectively. Fig. 2 reports the results on several representative data sets, namely Arts, Education, Flags, and Health. For each data set, the first two sub-graphs in a row show HL (MLkNN) and ZOL (MLkNN), while the last four sub-graphs show Micro-𝐹1 and Macro-𝐹1 (SVM) and Micro-𝐹1 and Macro-𝐹1 (3NN).
Examining these line charts, we observe noticeable fluctuations in the performance of certain approaches on some data sets. This is because the interval between consecutive evaluation nodes is set to 1% of the total number of features for every data set, so only a small number of features separates consecutive nodes; this setting therefore amplifies the fluctuation of the classification results. In addition, each feature selection approach uses a classical descending weight-based ranking, so a subset composed of the top-k ranked features generally yields better classification performance than one composed of the bottom-k ranked features. Furthermore, ESRFS exhibits stronger stability than the other compared approaches, and the sparse-based MLFS approaches as a group are more stable than the information-theoretical-based and problem-transformation-based algorithms. The results also show that ESRFS achieves excellent classification outcomes on the majority of data sets. Moreover, all sparse-based approaches (ESRFS, DSMFS, SSFS, MIFS, and RALM-FS) outperform the problem-transformation-based approaches PPT+MI and PPT+CHI, as well as the two information-theoretical-based methods LRFS and GMM, in terms of classification performance. These experimental findings provide compelling evidence for the superior performance of the proposed ESRFS approach.
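To make the top-m% evaluation protocol concrete, the following sketch (a simplified stand-in, not the authors' pipeline) builds a feature subset from a descending weight-based ranking and scores it with a per-label 3NN classifier; the weight vector, the data matrices, and the binary-relevance wrapper around 3NN are illustrative assumptions.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_top_m_percent(weights, X_train, Y_train, X_test, Y_test, m_percent):
    # Keep the top m% of features by descending weight and report Micro-F1 with 3NN.
    d = X_train.shape[1]
    k = max(1, int(d * m_percent / 100))
    top_idx = np.argsort(-weights)[:k]            # descending weight-based ranking
    preds = np.zeros_like(Y_test)
    for j in range(Y_train.shape[1]):             # one binary 3NN model per label
        clf = KNeighborsClassifier(n_neighbors=3)
        clf.fit(X_train[:, top_idx], Y_train[:, j])
        preds[:, j] = clf.predict(X_test[:, top_idx])
    return f1_score(Y_test, preds, average="micro")

# Example: sweep the top-1% to top-20% subsets, as in Fig. 2.
# scores = [evaluate_top_m_percent(w, Xtr, Ytr, Xte, Yte, m) for m in range(1, 21)]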

5.4. Analysis of parameter sensitivity

The proposed ESRFS methodology involves four regularization parameters (𝛼, 𝛽, 𝛾, and 𝜆). To assess their sensitivity, we adopt a strategy similar to that described in Jian, Li, and Liu (2018). Specifically, we quantitatively analyze the impact of these regularization parameters by tuning them over a common grid of values, namely {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. We perform


Fig. 2. The experimental outcomes for HL and ZOL using MLkNN, as well as Micro-𝐹1 and Macro-𝐹1 utilizing SVM and 3NN classifiers, are reported.

Table 11
The execution duration (in seconds) for each approach.
Data sets ESRFS DSMFS SSFS MIFS RALM-FS LRFS GMM PPT+MI PPT+CHI
Flags 0.0070 0.0210 0.1855 37.7202 0.0878 0.5316 0.0309 0.0020 0.2075
Emotions 0.0189 0.1570 0.2483 5.4764 0.2543 4.4760 0.2124 0.0319 1.0023
Medical 30.7642 86.5260 1.0991 45.8594 0.4229 7092.9612 24.7748 1.1439 14.8503
Yeast 0.0419 1.9250 3.9873 93.2647 1.0263 106.8972 2.5702 0.1676 2.1563
Enron 40.6992 188.7480 6.5934 155.9359 1.3050 18092.3621 70.1554 0.7879 13.1508
Arts 0.2354 4.7570 3.5235 32.4866 1.8629 4180.9622 32.3296 1.0381 91.4707
Business 0.2453 2.6700 2.8484 27.9942 1.8231 4473.9806 32.1042 1.0949 65.1740
Education 0.3341 5.2550 6.7599 35.6676 1.8311 7306.1785 43.2494 1.2427 93.0931
Entertain 0.3920 5.1340 11.1263 72.6079 1.8870 5440.0593 32.3096 1.5170 113.5574
Health 0.4857 4.4700 5.8503 110.7948 1.8341 8433.4837 47.0652 1.3803 93.0911
Recreation 0.3521 3.3650 5.6529 29.0872 1.7952 5213.5564 32.5141 1.3674 73.9912
Reference 0.5859 9.1330 6.5246 41.6147 1.9428 13960.6576 64.4178 1.9258 100.1712
Science 0.7670 9.2610 5.9112 20.2100 1.9518 16294.1888 73.5574 1.9797 196.2386
Society 0.4528 3.1720 5.9670 55.1017 1.9897 7874.2184 43.4937 1.3683 140.5333
Social 1.4790 11.4720 7.1180 48.0066 2.1822 27130.5610 97.0306 2.7656 151.7017

the analysis for each parameter individually while keeping the other parameters fixed at 0.5 for convenience. For this study, we focus on the Entertain data set. The sensitivity analysis results for the regularization parameters are presented in Fig. 3: sub-figures (a)–(h) and (i)–(p) report the outcomes in terms of Micro-𝐹1 and Macro-𝐹1 (SVM), as well as HL and ZOL (MLkNN), respectively. ESRFS exhibits limited sensitivity to the four regularization parameters with respect to Micro-𝐹1 and Macro-𝐹1 (SVM); for the MLkNN classifier, however, it shows slight sensitivity to 𝛽 and 𝜆. Hence, the settings of 𝛽 and 𝜆 deserve particular attention in real-world applications.
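A schematic of this one-parameter-at-a-time protocol is sketched below; train_and_score is a hypothetical callback that fits ESRFS with the supplied regularization values and returns the chosen evaluation metric, so only the search logic is shown here, under those assumptions.

GRID = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
DEFAULTS = {"alpha": 0.5, "beta": 0.5, "gamma": 0.5, "lam": 0.5}

def sensitivity_analysis(train_and_score):
    # Vary one regularization parameter over GRID while the other three stay at 0.5.
    results = {}
    for name in DEFAULTS:
        curve = []
        for value in GRID:
            params = dict(DEFAULTS)
            params[name] = value
            curve.append((value, train_and_score(**params)))
        results[name] = curve            # one (value, score) curve per parameter
    return results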

5.5. Convergence and time complexity analysis

In this section, we conduct experiments to validate the convergence of the objective function. The experimental results for several utilized data sets are presented in Fig. 4. These experiments provide further evidence supporting the convergence analysis discussed in Section 4.2. Additionally, we assess the computational complexity of all approaches. In each update iteration, the computational complexities of the compared approaches are: 𝑂(𝑑𝑛𝑙 + 𝑑²𝑛 + 𝑑²𝑙) (ESRFS), 𝑂(𝑛𝑑² + 𝑑²𝑘 + 𝑛²𝑘) (DSMFS), 𝑂(𝑛𝑑𝑚 + 𝑛²𝑚) (SSFS), 𝑂(𝑘𝑛𝑑 + 𝑛²) (MIFS), 𝑂(𝑑³) (RALM-FS), 𝑂(𝑑𝑙² + 𝑚𝑑) (LRFS), 𝑂(𝑑𝑛𝑙) (GMM), 𝑂(𝑛𝑑) (PPT+MI) and 𝑂(𝑛𝑑) (PPT+CHI), where 𝑚 denotes the number of already-selected features. LRFS, GMM, and the traditional approaches enjoy lower computational complexity than all sparse-based approaches, because the former evaluate attributes individually, whereas the latter consider attributes collectively. Among the sparse-based approaches, however, ESRFS costs less time and obtains better classification performance.
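The convergence behavior shown in Fig. 4 can be monitored with a loop of the kind sketched below; objective and update_W are hypothetical stand-ins for the ESRFS objective value and its alternating multiplicative update, which are not reproduced here, so the sketch only fixes the stopping logic.

def run_until_converged(W, objective, update_W, tol=1e-6, max_iter=200):
    # Record the objective after every update and stop once its relative
    # decrease falls below `tol`; `history` gives the curve plotted in Fig. 4.
    history = [objective(W)]
    for _ in range(max_iter):
        W = update_W(W)
        history.append(objective(W))
        if abs(history[-2] - history[-1]) <= tol * max(abs(history[-2]), 1.0):
            break
    return W, history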


Fig. 3. A sensitivity analysis of the parameters for the SVM and MLkNN classifiers is conducted, considering various evaluation criteria, on the Entertain dataset.

Fig. 4. Convergence analysis.


Subsequently, we report the runtime of all utilized methods across all data sets in Table 11. ESRFS exhibits the shortest runtime on most data sets owing to its high sparsity, followed closely by RALM-FS and PPT+MI, which show similar runtimes. The runtime of DSMFS is comparable to that of SSFS, since both are embedded (sparse-based) methods that do not enumerate all feature-label combinations. MIFS, GMM, and PPT+CHI display similar runtimes. LRFS requires more runtime than the other approaches because it considers interactive correlations among features and labels. Overall, the ESRFS approach demonstrates acceptable efficiency for real-world applications.

6. Conclusion

Existing methods for multi-label feature selection often suffer from issues such as low sparsity and high redundancy, which
can have a negative impact on the classification performance of multi-label learning models. In this study, we have addressed
these challenges by introducing a novel approach called Elastic Net-based Sparse and Redundancy Feature Selection (ESRFS).
ESRFS utilizes the elastic net regularization framework to extract personalized and shared common features from multi-label data.
Compared to other approaches like LASSO-norm, our method achieves higher sparsity in personalized features while identifying low
redundancy shared common features that exhibit stronger discrimination. By leveraging the advantages of both sparsity and low
redundancy, ESRFS significantly improves the performance of multi-label learning models. To validate the effectiveness of ESRFS, we
have conducted comprehensive experiments comparing it with eight state-of-the-art feature selection approaches. These include four
sparse-based methods (DSMFS with 𝑙1-norm, RALM-FS with 𝑙2,0-norm, SSFS, and MIFS with 𝑙2,1-norm), two information-theoretical-based methods (GMM and LRFS), and two problem-transformation-based methods (PPT+MI and PPT+CHI). Through rigorous
experimentation, our results demonstrate that ESRFS outperforms these existing approaches in terms of classification accuracy and
feature selection performance.
In addition to addressing the current limitations in multi-label feature selection, our research opens up avenues for future investigation. We plan to explore the combination of different norms with causality analysis on partial multi-label data, since existing partial multi-label feature selection methods face an interpretability challenge. In addition, most of these methods rely on traditional kNN procedures to build the affinity matrix, which can produce inaccurate results because of the fixed k value used to capture the manifold structure. To this end, we will design a series of methods to address these issues. By integrating causality analysis into the partial multi-label feature selection process, we aim to uncover causal relationships between features and labels, providing deeper insights and facilitating more accurate predictions in multi-label learning tasks.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work is funded by the Science Foundation of Jilin Province of China under Grant No. 20230508179RC, the China Postdoctoral Science Foundation funded project under Grant No. 2023M731281, and the Changchun Science and Technology Bureau Project 23YQ05.

References

Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Cai, D., He, X., Han, J., & Huang, T. S. (2010). Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 33(8), 1548–1560.
Cai, X., Nie, F., & Huang, H. (2013). Exact top-k feature selection via l2,0-norm constraint. In Twenty-third international joint conference on artificial intelligence.
Dai, J., Huang, W., Zhang, C., & Liu, J. (2024). Multi-label feature selection by strongly relevant label gain and label mutual aid. Pattern Recognition, 145, Article
109945.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
Series B. Statistical Methodology, 39(1), 1–22.
Doquire, G., & Verleysen, M. (2011). Feature selection for multi-label classification problems. In International work-conference on artificial neural networks (pp.
9–16). Springer.
Elad, M. (2010). Sparse and redundant representations: From theory to applications in signal and image processing, vol. 2, no. 1. Springer.
Fan, Y., Liu, J., Tang, J., Liu, P., Lin, Y., & Du, Y. (2024). Learning correlation information for multi-label feature selection. Pattern Recognition, 145, Article
109899.
Gao, W., Li, Y., & Hu, L. (2023). Multilabel feature selection with constrained latent structure shared term. IEEE Transactions on Neural Networks and Learning
Systems, 34(3), 1253–1262.
Gonzalez-Lopez, J., Ventura, S., & Cano, A. (2020). Distributed multi-label feature selection using individual mutual information measures. Knowledge-Based
Systems, 188, Article 105052.
Guo, Y., Sun, H., & Hao, S. (2022). Adaptive dictionary and structure learning for unsupervised feature selection. Information Processing & Management, 59(3),
Article 102931.


Han, J., Sun, Z., & Hao, H. (2015). Selecting feature subset with sparsity and low redundancy for unsupervised learning. Knowledge-Based Systems, 86, 210–223.
Hu, L., Li, Y., Gao, W., Zhang, P., & Hu, J. (2020). Multi-label feature selection with shared common mode. Pattern Recognition, 104, Article 107344.
Hu, J., Li, Y., Xu, G., & Gao, W. (2022). Dynamic subspace dual-graph regularized multi-label feature selection. Neurocomputing, 467, 184–196.
Huang, J., Qin, F., Zheng, X., Cheng, Z., Yuan, Z., Zhang, W., et al. (2019). Improving multi-label classification with missing labels by learning label-specific
features. Information Sciences, 492, 124–146.
Jian, L., Li, J., & Liu, H. (2018). Exploiting multilabel information for noise-resilient feature selection. ACM Transactions on Intelligent Systems and Technology,
9(5), 1–23.
Jian, L., Li, J., Shu, K., & Liu, H. (2016). Multi-label informed feature selection. In IJCAI, vol. 16 (pp. 1627–1633).
Jin, J., Xiao, R., Daly, I., Miao, Y., Wang, X., & Cichocki, A. (2020). Internal feature selection method of CSP based on L1-norm and Dempster–Shafer theory.
IEEE Transactions on Neural Networks and Learning Systems, 32(11), 4814–4825.
Jin, L., Zhang, L., & Zhao, L. (2023). Feature selection based on absolute deviation factor for text classification. Information Processing & Management, 60(3),
Article 103251.
Karimi, F., Dowlatshahi, M. B., & Hashemi, A. (2023). SemiACO: A semi-supervised feature selection based on ant colony optimization. Expert Systems with
Applications, 214, Article 119130.
Kou, Y., Lin, G., Qian, Y., & Liao, S. (2023). A novel multi-label feature selection method with association rules and rough set. Information Sciences, 624, 299–323.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6),
1–45.
Li, Y., Hu, L., & Gao, W. (2023). Multi-label feature selection via robust flexible sparse regularization. Pattern Recognition, 134, Article 109074.
Li, J., Li, P., Hu, X., & Yu, K. (2022). Learning common and label-specific features for multi-label classification with correlation information. Pattern Recognition,
121, Article 108259.
Li, J., Wu, L., Dani, H., & Liu, H. (2018). Unsupervised personalized feature selection. In Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1.
Lin, Y., Hu, Q., Liu, J., & Duan, J. (2015). Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing, 168, 92–103.
Liu, J., Lin, Y., Ding, W., Zhang, H., & Du, J. (2022). Fuzzy mutual information-based multilabel feature selection with label dependency and streaming labels.
IEEE Transactions on Fuzzy Systems, 31(1), 77–91.
Liu, N., Qi, E.-S., Xu, M., Gao, B., & Liu, G.-Q. (2019). A novel intelligent classification model for breast cancer diagnosis. Information Processing & Management,
56(3), 609–623.
Liu, W., Wang, H., Shen, X., & Tsang, I. W. (2021). The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence,
44(11), 7955–7974.
Ma, J., Chiu, B. C. Y., & Chow, T. W. (2020). Multilabel classification with group-based mapping: a framework with local feature selection and local label
correlation. IEEE Transactions on Cybernetics, 52(6), 4596–4610.
Meng, Y., Shang, R., Shang, F., Jiao, L., Yang, S., & Stolkin, R. (2019). Semi-supervised graph regularized deep NMF with bi-orthogonal constraints for data
representation. IEEE Transactions on Neural Networks and Learning Systems, 31(9), 3245–3258.
Mishra, N. K., & Singh, P. K. (2020). FS-MLC: Feature selection for multi-label classification using clustering in feature space. Information Processing & Management,
57(4), Article 102240.
Nie, F., Huang, H., Cai, X., & Ding, C. (2010). Efficient and robust feature selection via joint l2,1-norms minimization. Advances in Neural Information Processing Systems, 23, 1813–1821.
Read, J. (2008). A pruned problem transformation method for multi-label classification. In Proc. 2008 New Zealand computer science research student conference, NZCSRS 2008, vol. 143150 (p. 41).
Shang, R., Xu, K., Shang, F., & Jiao, L. (2020). Sparse and low-redundant subspace learning-based dual-graph regularized robust feature selection. Knowledge-Based
Systems, 187, Article 104830.
Spolaôr, N., Monard, M. C., Tsoumakas, G., & Lee, H. D. (2016). A systematic review of multi-label feature selection and a new method based on label construction.
Neurocomputing, 180, 3–15.
Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., & Vlahavas, I. (2011). Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12,
2411–2414.
Wen, X., Li, D., Zhang, C., & Zhai, Y. (2022). A weighted ML-KNN based on discernibility of attributes to heterogeneous sample pairs. Information Processing &
Management, 59(5), Article 103053.
Weng, W., Wei, B., Ke, W., Fan, Y., Wang, J., & Li, Y. (2023). Learning label-specific features with global and local label correlation for multi-label classification.
Applied Intelligence, 53(3), 3017–3033.
Xu, Y., Wang, J., An, S., Wei, J., & Ruan, J. (2018). Semi-supervised multi-label feature selection by preserving feature-label space consistency. In Proceedings of
the 27th ACM international conference on information and knowledge management (pp. 783–792).
Yan, H., Yang, J., & Yang, J. (2016). Robust joint feature weights learning framework. IEEE Transactions on Knowledge and Data Engineering, 28(5), 1327–1339.
Yu, K., Cai, M., Wu, X., Liu, L., & Li, J. (2023). Multilabel feature selection: a local causal structure learning approach. IEEE Transactions on Neural Networks
and Learning Systems, 34(3), 3044–3057.
Zhang, P., Liu, G., & Gao, W. (2019). Distinguishing two types of labels for multi-label feature selection. Pattern Recognition, 95, 72–82.
Zhang, Z., Xu, Y., Yang, J., Li, X., & Zhang, D. (2015). A survey of sparse representation: algorithms and applications. IEEE Access, 3, 490–530.
Zhang, Y., & Zhou, Z.-H. (2010). Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data (TKDD),
4(3), 1–21.
Zhang, M.-L., & Zhou, Z.-H. (2013). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
67(2), 301–320.
