
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 3, MARCH 2023

Multilabel Feature Selection With Constrained Latent Structure Shared Term

Wanfu Gao, Yonghao Li, and Liang Hu

Abstract— High-dimensional multilabel data have increasingly emerged in many application areas, suffering from two noteworthy issues: instances with high-dimensional features and large-scale labels. Multilabel feature selection methods are widely studied to address these issues. Previous multilabel feature selection methods focus on exploring label correlations to guide the feature selection process, ignoring the impact of the latent feature structure on label correlations. In addition, one encouraging property regarding correlations between features and labels is that similar features tend to share similar labels. To this end, a latent structure shared (LSS) term is designed, which shares and preserves both the latent feature structure and the latent label structure. Furthermore, we employ the graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. Finally, we derive the shared latent feature and label structure feature selection (SSFS) method based on the constrained LSS term, and an effective optimization scheme with provable convergence is proposed to solve the SSFS method. Better experimental results on benchmark datasets are achieved in terms of multiple evaluation criteria.

Index Terms— Feature selection, graph regularization, latent structure, multilabel data.

Manuscript received 25 October 2020; revised 14 May 2021 and 25 July 2021; accepted 12 August 2021. Date of publication 26 August 2021; date of current version 1 March 2023. This work was supported in part by the Fundamental Research Funds for the Central Universities, Jilin University (JLU), under Grant 93K172020K36, in part by the Science Foundation of Jilin Province of China under Grant 2020122209JC, and in part by the Youth Science Foundation of Jilin Province of China under Grant 20160520011JH and Grant 20180520021JH. (Corresponding author: Liang Hu.)
The authors are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3105142.
Digital Object Identifier 10.1109/TNNLS.2021.3105142

I. INTRODUCTION

With the rapid development of computer science techniques, numerous multilabel data with high-dimensional features have emerged in many application areas, such as document classification [1]–[3] and gene recognition [4], [5]. There exist two salient issues in multilabel data: instances with high-dimensional features and large-scale labels. Obviously, the difficulty of addressing these two issues grows as the number of features and labels increases, which poses a challenge for fitting multilabel models. As a result, multilabel feature selection attracts much attention from researchers as an important preprocessing step in multilabel classification problems.

In multilabel learning, previous methods make great efforts to exploit complicated label correlations for fitting multilabel learning models. The well-known strategy that exploits label correlations of multilabel data is the problem transformation method, which can be divided into three groups based on label correlations: the first-order approach, the second-order approach, and the high-order approach [6]–[8]. First-order approaches consider the labels to be independent of each other; in other words, they ignore label correlations. Second-order approaches take pairwise label correlations into account, such as the ranking between a relevant label and an irrelevant label. High-order approaches tackle multilabel data by considering high-order relations among labels. Some representative problem transformation methods are binary relevance (BR) [9], calibrated label ranking (CLR) [10], and label-specific features-dependent labels (LLSF-DL) [11], corresponding to the first-order, second-order, and high-order approaches, respectively. These three classes of approaches first transform multilabel data into single-label data, i.e., they transform multilabel data to fit subsequent single-label algorithms. Another strategy is to transform algorithms to fit multilabel data, which is named the algorithm adaption approach. Algorithm adaption methods employ matrix transformation tricks or metrics of information theory to exploit label correlations.

The goal of multilabel feature selection methods is to choose critical features from a high-dimensional feature space with large-scale labels. As a result, feature correlations should indeed be considered in the design of multilabel feature selection methods, yet they are ignored in existing methods. Although many information-theoretic methods intend to capture feature correlations to construct feature selection models, they fail to obtain adequate label correlations to guide the feature selection process. The reason is that, although high-order label correlations exist among large-scale labels, existing information-theoretic methods mainly exploit label correlations with low-order approaches due to the high computational cost of high-order approaches.

As suggested by information-theoretic methods, some features are dependent on each other and others are independent of each other, which indicates that there exists a latent feature structure in the original feature space. The latent feature structure means that abundant features can be clustered into several features. For example, in text documents, the words “basketball,” “tennis,” and “football” can be represented as “sports”; we call “sports” the latent feature. In real-world applications, latent features cannot be identified explicitly; however, abundant features can be clustered into several clusters. Fig. 1 shows the process by which the original feature matrix is transformed into the latent feature matrix. The latent feature matrix is constructed from new features that are

different from the original features. Furthermore, to select the critical features from the original high-dimensional feature space with large-scale labels, the relationships between features and labels should be fully considered in the design of multilabel feature selection methods. One appealing property of multilabel data is that similar features tend to share similar labels: if two features are dependent on each other, they tend to share similar or identical labels with relatively high probability.

Fig. 1. The original feature matrix is transformed into the latent feature matrix.

In view of the above analysis, we design a latent structure shared (LSS) term to capture the latent feature structure in the original feature space. In addition, we employ the learning label-specific (LLS) term to fit the labels in multilabel data, with the LLS term encouraging similar features to share similar labels. Furthermore, to obtain an accurate latent feature structure, the graph regularization technique is adopted to guarantee the consistency between the original feature space and the latent feature structure space. Finally, a novel multilabel feature selection method named shared latent feature and label structure feature selection (SSFS) is proposed. We highlight the contributions of this article as follows:
1) designing an LSS term to capture the latent feature structure in the original feature space;
2) proposing a multilabel feature selection method named SSFS, which shares the LSS term between features and labels in multilabel data and adopts the graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space;
3) developing an effective optimization scheme with provable convergence to solve the SSFS method.

II. PRELIMINARIES

We denote matrices by italicized uppercase letters, for example, matrix A. For a matrix A ∈ R^{n×m}, A_{·j} and A_{i·} represent the jth column and the ith row of A, respectively. Bold italicized lowercase letters represent vectors, such as b, while scalars are denoted by lowercase letters, such as b. A^T and Tr(A) represent the transpose of A and, if A is a square matrix, the trace of A, respectively. \|A\|_F = (\sum_{i=1}^{n}\sum_{j=1}^{m} A_{ij}^2)^{1/2} and \|A\|_{2,1} = \sum_{i=1}^{n}(\sum_{j=1}^{m} A_{ij}^2)^{1/2} denote the Frobenius norm and the l_{2,1}-norm of A, respectively, in which A_{ij} is the (i, j)th entry of matrix A. We denote the feature matrix by X ∈ R^{n×d}, which has n instances in a d-dimensional feature space; each instance has c-dimensional labels, that is, Y ∈ R^{n×c}. Y_{ij} = 1 if the jth label is associated with the ith instance; otherwise, Y_{ij} = 0.

III. RELATED WORK

Recent years have witnessed an increasing number of feature selection methods handling multilabel data where each instance is related to multiple labels [12]–[16]. As mentioned above, there exist two groups of multilabel learning methods based on whether or not they address multilabel data directly: problem transformation methods and algorithm adaption methods. Pruned problem transformation (PPT) is a well-known transformation method [17], which is an enhanced version of the label powerset (LP) method [18]. The PPT method eliminates label patterns that occur too rarely according to a prespecified counting threshold. Based on the PPT method, a mutual-information-based method PPT + MI and a statistics-based method PPT + CHI have been proposed [17], [19]. Both methods focus on exploiting label correlations to guide feature selection. The highlight of the two PPT-based methods is that they are efficient and simple, while both ignore feature correlations in multilabel data. As a result, their classification performance may be limited.

Compared to problem transformation methods, many algorithm adaption methods have been studied over the past years. Some of them are based on information theory, such as max-dependence and min-redundancy (MDMR) [20] and the scalable criterion for large label set (SCLS) [21]. The MDMR method selects the superior feature subset by maximizing the feature dependence between features and labels while minimizing the feature redundancy. The specific criterion of MDMR is as follows:

J(f_k) = \sum_{l_i \in L} I(f_k; l_i) - \frac{1}{|S|} \sum_{f_j \in S} \Big[ I(f_k; f_j) - \sum_{l_i \in L} I(f_k; l_i \mid f_j) \Big]    (1)

where J(·) represents the objective function, and f_k, f_j, and S represent the candidate feature, an already-selected feature, and the already-selected feature subset, respectively. I(f_k; l_i) measures the feature dependence, whereas I(f_k; f_j) - \sum_{l_i \in L} I(f_k; l_i \mid f_j) measures the feature redundancy. The factor 1/|S| balances the magnitudes of the feature dependence term and the feature redundancy term.

Similarly, the feature selection criterion of SCLS includes two terms, a feature relevancy term and a scalable relevance evaluation, which is denoted by

J(f_k) = \sum_{l_i \in L} I(f_k; l_i) - \sum_{f_j \in S} \frac{I(f_k; f_j)}{H(f_k)} \sum_{l_i \in L} I(f_k; l_i).    (2)

Recently, Zhang et al. [22] proposed a novel multilabel feature selection method named multilabel feature selection based on label redundancy (LRFS), which groups labels into two categories: independent labels and dependent labels. The specific form of LRFS is presented as follows:

J(f_k) = \sum_{l_i \in L} \sum_{l_j \in L, l_i \neq l_j} I(f_k; l_i \mid l_j) - \frac{1}{|S|} \sum_{f_j \in S} I(f_k; f_j).    (3)


In addition, Gonzalez-Lopez et al. [23] designed a geometric mean maximization (GMM) method to discuss how to aggregate the mutual information of multiple labels. GMM chooses the optimal feature subset with the largest geometric mean; the geometric mean takes the product of the measures and then extracts the corresponding root.

Beyond the methods mentioned above, a large number of multilabel feature selection methods based on information theory have been proposed [24], [25]. The common characteristic of information-theoretic methods is that they pay much attention to feature correlations. Due to the computational limitations of high-order feature correlations, existing methods adopt low-order information-theoretic terms to exploit feature correlations. For example, the LRFS method employs the mutual information between the candidate feature and each already-selected feature to represent the feature redundancy. The MDMR method adopts the conditional mutual information between the candidate feature and the class given the already-selected features to represent interactive correlations among features and each label. Furthermore, previous information-theoretic methods adopt mutual information or conditional mutual information to calculate the correlation between each candidate feature and each class label, which ignores high-order interactive correlations between features and labels. As a result, previous methods depend heavily on the importance of individual features or labels.

On the contrary, embedded multilabel feature selection methods focus on exploiting label correlations, employing them to select a compact feature subset. For instance, Jian et al. [26] employed the matrix factorization technique to extract the latent semantics of the label information, with the latent semantics fitting the feature matrix of the multilabel data. As a result, a feature weight matrix reflecting the importance of features is obtained, and the critical features are selected according to this feature weight matrix. In addition, multilabel informed feature selection (MIFS) adopts the graph Laplacian matrix to ensure a local geometry structure between the feature matrix and the latent semantics of the label information. The objective function of the MIFS method is

\min_{W,V,B} \|XW - V\|_F^2 + \alpha\|Y - VB\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma \|W\|_{2,1}    (4)

where W ∈ R^{d×k}, V ∈ R^{n×k}, and B ∈ R^{k×c} represent the feature weight matrix, the latent semantics of the multilabel information, and the basis matrix, respectively, L ∈ R^{n×n} is the Laplacian matrix, and α, β, and γ are the three regularization parameters of the MIFS model.

Besides, multilabel learning using local correlation (ML-LOC) tries to exploit label correlations in the data locally [27]. Multilabel learning with global and local label correlation (GLOCAL) explores both global and local label correlations [28]. Multilabel manifold learning (ML) attempts to explore the manifold in the label space for multilabel learning [29]. Learning label-specific features (LLSF) assumes that each label is related to a feature subset and that two similar labels share more features than two dissimilar labels [30]. Label-enhanced multilabel learning (LEMLL) replaces logical labels with real-valued labels to enhance label information [31].

Imposing the l_{2,1}-norm on both the loss function and the regularization, the well-known embedded feature selection method RFS selects features across all data points with joint sparsity [32]. Furthermore, Cai et al. [33] imposed the l_{2,1}-norm on the loss function and the l_{2,0}-norm on the feature weight matrix, proposing a robust feature selection method named RALM-FS that is regarded as an improved version of RFS. It is denoted by

\min_{W,b} \|Y - X^T W - \mathbf{1}b^T\|_{2,1} \quad \text{s.t.} \quad \|W\|_{2,0} = a    (5)

where b and 1 represent the bias term and the all-one vector, respectively, and a is the number of selected features. Besides, Ma et al. [34] proposed a subfeature uncovering with sparsity (SFUS) method, which jointly selects features by a sparse regularization and uncovers the shared subspace of the original features. Chang et al. [35] proposed a convex semisupervised multilabel feature selection method to address large-scale datasets, which can be handled by a fast iterative algorithm. In addition, Chang et al. [36] proposed a semisupervised shared-subspace multilabel feature selection method, which integrates semisupervised learning and label correlation mining into one unified framework. Furthermore, Luo et al. [37] preserved the intrinsic structure of the feature matrix by locally linear embedding (LLE) [38], which differs from traditional methods based on a similarity matrix; this idea is applied to unsupervised feature selection. Feature selection methods based on structured sparsity are comprehensively studied in [39]. Inspired by the property that similar features share similar labels, this article assumes that there exists a latent feature structure that can be shared by both the original feature matrix and the label matrix. The proposed method not only characterizes the intrinsic structure of the feature matrix but also associates it with the label matrix.

Furthermore, the graph regularization technique is widely adopted in feature selection. For example, Hu et al. [40] proposed a dual-graph regularized multilabel feature selection method named robust multilabel feature selection based on dual-graph (DRMFS), which combines the label matrix and the feature matrix with the feature weight matrix to construct label graph regularization and feature graph regularization, respectively. Zhang et al. [13] employed a low-dimensional embedding matrix to construct graph regularization terms that exploit the local and global label correlations. These two feature selection methods employ the graph regularization technique to preserve the manifold structure of the feature data and the label data. In addition, Shang et al. [41] combined a graph regularization technique with subspace learning to devise a feature selection method named subspace learning-based graph regularized feature selection (SGFS), which obtains excellent performance. Furthermore, Nie et al. [42] proposed a new model that constructs the similarity matrix by assigning adaptive neighbors to each data point.
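A common thread in the embedded methods above (RFS, RALM-FS, MIFS, SGFS) is that features are ranked by the row norms of a weight matrix learned under an l_{2,1}-type penalty. The snippet below is a minimal NumPy sketch of that generic scoring step; it is an illustration under assumed names and is not tied to any one of the cited formulations.

```python
import numpy as np

def l21_norm(W):
    """l2,1-norm: sum of the l2-norms of the rows of W."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def rank_features(W, top_k=None):
    """Rank features by the l2-norm of the corresponding row of a d x k weight matrix W."""
    row_norms = np.linalg.norm(W, axis=1)
    order = np.argsort(-row_norms)          # descending importance
    return order if top_k is None else order[:top_k]

# toy usage with a random d x k weight matrix
rng = np.random.default_rng(1)
W = rng.normal(size=(10, 3))
print("||W||_{2,1} =", round(l21_norm(W), 3))
print("top-5 features:", rank_features(W, top_k=5))
```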


Afterward, the adaptive graph technique has been applied to many feature selection methods [43], [44]. In this article, we propose an LSS term; however, many candidate LSS terms can be extracted from the original feature matrix. As a result, we adopt the graph regularization technique to constrain the latent feature structure.

Inspired by previous methods, we conclude that three key factors matter in constructing multilabel feature selection methods: 1) exploring feature correlations and constructing the latent feature structure; 2) exploring label correlations; and 3) constructing relationships between the latent feature structure and the label structure, which is beneficial for multilabel feature selection.

IV. PROPOSED METHOD

Considering the latent feature structure, we decompose the feature matrix X ∈ R^{n×d} into two low-dimensional matrices V ∈ R^{n×k} and Q ∈ R^{d×k}. To minimize the reconstruction error, we obtain the following form:

\min_{V,Q} \|X - VQ^T\|_F^2    (6)

where V represents the latent feature structure of the feature matrix and Q denotes a coefficient matrix. Formula (6) indicates that the d-dimensional features are reduced to k-dimensional features, which can be interpreted as the original d features being clustered into k different clusters, where each cluster contains relevant features and different clusters are independent of each other. In this way, we obtain the latent feature structure of the feature matrix. In addition, each row of matrix Q represents the importance of the corresponding feature in these k latent feature variables.

Similarly, we employ a product of two low-dimensional matrices U ∈ R^{n×k} and M ∈ R^{k×c} to represent the label matrix in multilabel data, where U and M denote an arbitrary latent label structure and the corresponding coefficient matrix, respectively. Inspired by the property that similar features share similar labels, we replace matrix U with matrix V to share the correlations between labels and features; matrix V is called the LSS term. As a result, we obtain the following problem:

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2    (7)

where α is used to balance the contribution regarding the impact of the latent feature structure on label correlations and feature decomposition. However, a critical issue is that, although we intend to find a proper matrix V by minimizing Formula (7), multiple LSS terms exist. To obtain a proper LSS term, we employ a graph regularization technique to guarantee the consistency between the original feature space and the latent feature structure space. The principle is that a closer correlation between two instances in the feature matrix X indicates a closer correlation between the two corresponding latent feature variables in the latent feature structure:

\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} S_{ij}\|V_{i\cdot} - V_{j\cdot}\|_2^2 = \mathrm{Tr}\big(V^T(A - S)V\big) = \mathrm{Tr}(V^T L V)    (8)

in which A is a diagonal matrix, S denotes a symmetric affinity matrix, and L = A - S denotes the graph Laplacian matrix. Following [45], we employ the heat kernel function to build the affinity matrix, which is defined as

S_{ij} = \begin{cases} e^{-\frac{\|X_{i\cdot} - X_{j\cdot}\|^2}{\sigma}}, & \text{if } X_{i\cdot} \in N_p(X_{j\cdot}) \text{ or } X_{j\cdot} \in N_p(X_{i\cdot}) \\ 0, & \text{otherwise} \end{cases}    (9)

where N_p(X_{i\cdot}) is the set of p-nearest neighbors of instance X_{i\cdot} and σ is the parameter of the heat kernel.
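To make the graph construction concrete, here is a small NumPy sketch of the p-nearest-neighbor heat-kernel affinity matrix in (9) together with the degree matrix A and Laplacian L = A - S used in (8). The function name and data are illustrative; p = 5 and σ = 1 follow the settings reported later in the experiments.

```python
import numpy as np

def build_graph(X, p=5, sigma=1.0):
    """Heat-kernel affinity S as in (9), degree matrix A, and Laplacian L = A - S as in (8)."""
    n = X.shape[0]
    # pairwise squared Euclidean distances between instances (rows of X)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # p-nearest neighbors of each instance (column 0 is the instance itself)
    neighbors = np.argsort(sq_dist, axis=1)[:, 1:p + 1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            w = np.exp(-sq_dist[i, j] / sigma)
            S[i, j] = w
            S[j, i] = w          # symmetric: i in N_p(j) or j in N_p(i)
    A = np.diag(S.sum(axis=1))   # degree (diagonal) matrix
    L = A - S                    # graph Laplacian
    return S, A, L

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
S, A, L = build_graph(X, p=3)
print(np.allclose(L, L.T), S.shape)
```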
Integrating the above terms into one multilabel feature selection model, we obtain the following objective function:

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1}    (10)

where β and γ are two parameters that measure the contribution of the local geometry structure and control the sparsity of the objective function, respectively. The term \|Q\|_{2,1} achieves feature selection via the l_{2,1}-norm. In objective function (10), matrix V reflects both the latent feature structure and the latent label structure; in addition, matrix V is defined as the LSS term to achieve the goal that similar features share similar labels. Finally, to further enhance the row-sparsity property, we impose a nonnegative constraint on the coefficient matrix Q. We also impose nonnegative constraints on V and M to facilitate the subsequent optimization algorithm. The final objective function, including constraints, is reformulated as

\min_{V,Q,M} \|X - VQ^T\|_F^2 + \alpha\|Y - VM\|_F^2 + \beta \mathrm{Tr}(V^T L V) + \gamma\|Q\|_{2,1} \quad \text{s.t. } V, M, Q \geq 0.    (11)

V. OPTIMIZATION OF THE SSFS SCHEME

A. Proposed Scheme of SSFS

The objective function (11) is jointly nonconvex; as a result, we cannot obtain the global minimum. In addition, the objective function (11) is nonsmooth owing to the l_{2,1}-norm. To this end, we design three iterative rules to solve the objective function, achieving a local minimum. First, \|Q\|_{2,1} can be relaxed as 2\mathrm{Tr}(Q^T D Q), where D ∈ R^{d×d} is a diagonal matrix with the ith diagonal element D_{ii} = 1/(2\|Q_{i\cdot}\|_2 + \epsilon) and ε is a small positive constant. The objective function (11) can then be rewritten as

\Theta(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha \mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta \mathrm{Tr}(V^T L V) + 2\gamma \mathrm{Tr}(Q^T D Q).    (12)

Furthermore, we integrate the nonnegative constraints into Formula (12), obtaining the Lagrangian

L(V, M, Q) = \mathrm{Tr}\big((X^T - QV^T)(X - VQ^T)\big) + \alpha \mathrm{Tr}\big((Y^T - M^T V^T)(Y - VM)\big) + \beta \mathrm{Tr}(V^T L V) + 2\gamma \mathrm{Tr}(Q^T D Q) - \mathrm{Tr}(\Lambda V^T) - \mathrm{Tr}(\Psi M^T) - \mathrm{Tr}(\Phi Q^T)    (13)


where Λ ∈ R^{n×k}, Ψ ∈ R^{k×c}, and Φ ∈ R^{d×k} denote three Lagrange multipliers. The partial derivatives of L with respect to the variables V, M, and Q are

\frac{\partial L}{\partial V} = -2XQ + 2VQ^TQ - 2\alpha YM^T + 2\alpha VMM^T + 2\beta LV - \Lambda    (14)

\frac{\partial L}{\partial M} = -2\alpha V^TY + 2\alpha V^TVM - \Psi    (15)

\frac{\partial L}{\partial Q} = -2X^TV + 2QV^TV + 2\gamma DQ - \Phi.    (16)

According to the Karush–Kuhn–Tucker conditions, i.e., \Lambda_{ij}V_{ij} = 0, \Psi_{ij}M_{ij} = 0, and \Phi_{ij}Q_{ij} = 0, we obtain

(-XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta LV)_{ij} V_{ij} = 0    (17)

(-\alpha V^TY + \alpha V^TVM)_{ij} M_{ij} = 0    (18)

(-X^TV + QV^TV + \gamma DQ)_{ij} Q_{ij} = 0.    (19)

Since L = A - S, Formula (17) can be written as

(-XQ + VQ^TQ - \alpha YM^T + \alpha VMM^T + \beta(A - S)V)_{ij} V_{ij} = 0.    (20)

As a result, the update rules for V, M, and Q are as follows:

V_{ij}^{t+1} \leftarrow V_{ij}^{t} \frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}    (21)

M_{ij}^{t+1} \leftarrow M_{ij}^{t} \frac{(V^TY)_{ij}}{(V^TVM)_{ij}}    (22)

Q_{ij}^{t+1} \leftarrow Q_{ij}^{t} \frac{(X^TV)_{ij}}{(QV^TV + \gamma DQ)_{ij}}    (23)

where t indicates the iteration number. In addition, to avoid zero elements in the denominators during the feature selection process, we add a very small constant to each denominator. We present the executive process in Algorithm 1.

Algorithm 1 SSFS
Input: Data matrix X ∈ R^{n×d} and label matrix Y ∈ R^{n×c}; regularization parameters α, β, and γ.
Output: The top-k selected features according to the SSFS method.
1: Initialize V ∈ R^{n×k}, M ∈ R^{k×c}, and Q ∈ R^{d×k} randomly; t = 0;
2: Compute the degree matrix A and the affinity matrix S of the data matrix X according to Formula (9);
3: Repeat:
4:   Update V, M, and Q by Formulas (21), (22), and (23);
5:   Update D;
6:   t = t + 1;
7: Until convergence;
8: Return Q;
9: Return features ranked according to ||Q_{i·}||_2.
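The following Python sketch mirrors Algorithm 1 with the multiplicative updates (21)–(23). It assumes the graph matrices S and A have already been built from X as in (9) (for example, with a helper such as the build_graph sketch above), replaces the convergence test with a fixed iteration budget, and is an illustration under these assumptions rather than the authors' reference implementation.

```python
import numpy as np

def ssfs(X, Y, S, A, k=10, alpha=0.5, beta=0.5, gamma=0.5,
         n_iter=100, eps=1e-12):
    """Sketch of Algorithm 1 (SSFS): returns Q and a feature ranking by ||Q_i.||_2."""
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    V = rng.random((n, k))
    M = rng.random((k, c))
    Q = rng.random((d, k))
    for _ in range(n_iter):
        # diagonal matrix D from the l2,1-norm relaxation of ||Q||_{2,1}
        D = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + eps))
        # update V by (21)
        V *= (X @ Q + alpha * Y @ M.T + beta * S @ V) / \
             (V @ Q.T @ Q + alpha * V @ M @ M.T + beta * A @ V + eps)
        # update M by (22)
        M *= (V.T @ Y) / (V.T @ V @ M + eps)
        # update Q by (23)
        Q *= (X.T @ V) / (Q @ V.T @ V + gamma * D @ Q + eps)
    ranking = np.argsort(-np.linalg.norm(Q, axis=1))  # features by ||Q_i.||_2
    return Q, ranking

# toy usage (assuming S, A were built from X as in (9)):
# Q, ranking = ssfs(X, Y, S, A, k=5)
# top_features = ranking[: int(0.2 * X.shape[1])]   # e.g., keep the top 20% of features
```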
B. Proof of Convergence

Gradient descent is a well-known optimization method; here, we take the variable V as an example:

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \eta \frac{\partial \Theta}{\partial V_{ij}^{t}}    (24)

where η is the learning rate of the gradient descent method, a small positive constant. Here, we set η as

\eta = \frac{V_{ij}^{t}}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (25)

Replacing η in Formula (24) with Formula (25), we obtain

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \frac{V_{ij}^{t}}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}} \frac{\partial \Theta}{\partial V_{ij}^{t}} \;\Leftrightarrow\; V_{ij}^{t+1} \leftarrow V_{ij}^{t} \frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (26)

Then, we present the proof of convergence of Formula (11) under Formulas (21)–(23). First, we introduce the concept of an auxiliary function, which is also used in [40], [46].

Definition 1: The function J is regarded as an auxiliary function of K when J(V_{ij}, V'_{ij}) \geq K(V_{ij}) and J(V_{ij}, V_{ij}) = K(V_{ij}) are both satisfied.

Lemma 1: The function K is nonincreasing under the following update:

V_{ij}^{t+1} = \arg\min_{V_{ij}} J(V_{ij}, V_{ij}^{t}).    (27)

Proof of Lemma 1: Based on Definition 1 and Lemma 1, we can easily obtain

K(V_{ij}^{t+1}) \leq J(V_{ij}^{t+1}, V_{ij}^{t}) \leq J(V_{ij}^{t}, V_{ij}^{t}) = K(V_{ij}^{t}).    (28)

Next, we show that the function (11) is nonincreasing under the update rule (21) with a proper auxiliary function. Since the update rules operate element by element, we employ K_{ij} to represent the part of Θ(V) that is related to V_{ij}. Taking the first-order and second-order partial derivatives of Θ(V) with respect to V, the following formulas are obtained:

K'_{ij} = (-2XQ + 2VQ^TQ - 2\alpha YM^T + 2\alpha VMM^T + 2\beta LV)_{ij}, \qquad K''_{ij} = 2(Q^TQ)_{jj} + 2\alpha(MM^T)_{jj} + 2\beta L_{ii}.    (29)

Following [40], [47], we define the following auxiliary function of K_{ij}(V_{ij}):

J(V_{ij}, V_{ij}^{t}) = K_{ij}(V_{ij}^{t}) + K'_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t}) + \frac{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}{V_{ij}^{t}}(V_{ij} - V_{ij}^{t})^2.    (30)


This is compared with the second-order Taylor expansion of K_{ij}(V_{ij}):

K_{ij}(V_{ij}) = K_{ij}(V_{ij}^{t}) + K'_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t}) + \frac{1}{2}K''_{ij}(V_{ij}^{t})(V_{ij} - V_{ij}^{t})^2.    (31)

Proof: In Formula (30), when V_{ij} = V_{ij}^{t}, then J(V_{ij}, V_{ij}^{t}) = K_{ij}(V_{ij}^{t}). We next prove the condition J(V_{ij}, V_{ij}^{t}) \geq K_{ij}(V_{ij}). Comparing Formulas (30) and (31), we need to prove

\frac{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}{V_{ij}^{t}} \geq (Q^TQ)_{jj} + \alpha(MM^T)_{jj} + \beta L_{ii}.    (32)

Obviously, we obtain

(VQ^TQ + \alpha VMM^T)_{ij} = \sum_{l=1}^{k} V_{il}^{t}(Q^TQ)_{lj} + \alpha\sum_{l=1}^{k} V_{il}^{t}(MM^T)_{lj} \geq V_{ij}^{t}(Q^TQ)_{jj} + \alpha V_{ij}^{t}(MM^T)_{jj}.    (33)

Similarly,

\beta(AV)_{ij} = \beta\sum_{l=1}^{n} A_{il}V_{lj}^{t} \geq \beta A_{ii}V_{ij}^{t} \geq \beta(A - S)_{ii}V_{ij}^{t} = \beta L_{ii}V_{ij}^{t}.    (34)

As a result, Formula (32) holds, and J(V_{ij}, V_{ij}^{t}) is an auxiliary function of K_{ij}(V_{ij}). Substituting Formula (30) into Formula (27) gives

V_{ij}^{t+1} \leftarrow V_{ij}^{t} - \frac{K'_{ij}(V_{ij}^{t})}{2(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}} \;\Leftrightarrow\; V_{ij}^{t+1} \leftarrow V_{ij}^{t}\frac{(XQ + \alpha YM^T + \beta SV)_{ij}}{(VQ^TQ + \alpha VMM^T + \beta AV)_{ij}}.    (35)

Based on the proof above, the objective function (11) is nonincreasing under the update rule (21). Similarly, the objective function (11) is nonincreasing under the update rules for the other two variables M and Q.

VI. EXPERIMENTAL STUDY

A. Experimental Settings

To ensure comprehensiveness and competitiveness, two problem transformation methods (PPT + MI and PPT + CHI) and six algorithm adaption methods (MDMR, SCLS, MIFS, RALM-FS, LRFS, and GMM) are compared with SSFS in terms of Macro-F1 and Micro-F1 on the SVM classifier and Hamming loss (HL) on the ML-kNN classifier (k = 10). All the datasets are already divided into training and test sets according to the recommendation of the Mulan Library [48]. We present the details of these datasets in Table I. All experiments are executed on an Intel Core CPU with a 3.40-GHz processing speed and 16-GB main memory.

TABLE I: DESCRIPTION OF EXPERIMENTAL DATASETS

We describe the datasets from multiple aspects, including the common characteristics, i.e., the number of features, the number of instances, and the number of labels. In addition, Card is the average number of labels per instance, while Den is the label cardinality divided by the total number of labels.

In the experiments, several parameters have to be set in advance. First, we obtain the best parameter C of the SVM in {10^{-4}, 10^{-3}, ..., 10^{3}, 10^{4}}; then, α, β, and γ in function (11), as well as the corresponding parameters of the compared methods, are tuned in the same grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. All the best parameters are obtained via fivefold cross-validation experiments. In addition, the two parameters p and σ in Formula (9) are set to 5 and 1, respectively.

In the experiments, we adopt three popular evaluation criteria (Macro-F1, Micro-F1, and HL) to assess the proposed method SSFS and the compared methods. Macro-F1 and Micro-F1 are defined as follows:

\text{Macro-F1} = \frac{1}{c}\sum_{i=1}^{c}\frac{2\,\mathrm{TP}_i}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i}    (36)

\text{Micro-F1} = \frac{\sum_{i=1}^{c} 2\,\mathrm{TP}_i}{\sum_{i=1}^{c}\big(2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i\big)}    (37)

where c is the number of labels, TP is the number of true positives, FP denotes the number of false positives, and FN represents the number of false negatives. Higher Macro-F1 and Micro-F1 indicate better classification performance. Furthermore, we also adopt another popular evaluation criterion, HL, which is denoted as

\mathrm{HL} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{c}\,|Y'_i \oplus Y_i|    (38)

where n represents the number of instances, Y' is the predicted label set, Y is the original label set, and ⊕ denotes the XOR operation between Y' and Y. The lower the HL, the better the classification performance.

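As a sanity check on these definitions, the snippet below computes Macro-F1, Micro-F1, and HL for binary label matrices with scikit-learn. It assumes Y_true and Y_pred are 0/1 arrays of shape (n, c) and is only meant to mirror (36)–(38), not to reproduce the experimental pipeline of the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

def evaluate(Y_true, Y_pred):
    """Macro-F1 (36), Micro-F1 (37), and Hamming loss HL (38) for 0/1 label matrices."""
    macro = f1_score(Y_true, Y_pred, average='macro', zero_division=0)
    micro = f1_score(Y_true, Y_pred, average='micro', zero_division=0)
    hl = hamming_loss(Y_true, Y_pred)   # average fraction of mispredicted labels (XOR)
    return macro, micro, hl

# toy usage
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
print(evaluate(Y_true, Y_pred))
```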
B. Experimental Results

We employ the proposed method and the compared methods to extract the top 20% of the total number of features on each dataset. The average classification results and the standard deviations of each method are recorded in Tables II–IV.

TABLE II: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE SVM CLASSIFIER IN TERMS OF MACRO-F1 (MEAN ± STD)

TABLE III: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE SVM CLASSIFIER IN TERMS OF MICRO-F1 (MEAN ± STD)

TABLE IV: EXPERIMENTAL RESULTS OF NINE METHODS ON SEVEN DATASETS ON THE ML-KNN CLASSIFIER IN TERMS OF HL (MEAN ± STD)

Fig. 2. Nine methods on four benchmark datasets on the SVM classifier in terms of Macro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.

Observing Tables II–IV, we find that the proposed method outperforms the compared methods in terms of the average results on all the evaluation criteria. Specifically, in Table II, SSFS outperforms the other compared methods on all the datasets on the SVM classifier in terms of Macro-F1. Moreover, the improvement rates in Macro-F1 relative to PPT + MI, PPT + CHI, MDMR, SCLS, MIFS, RALM-FS, LRFS, and
GMM are 57.1%, 66%, 49.2%, 114.6%, 66%, 72.5%, 39.7%, and 183.9%, respectively. Similarly, SSFS achieves the best average results in terms of Micro-F1. Regarding HL, we retain more digits to verify the classification superiority of the proposed method; SSFS still achieves the best average results among all nine multilabel feature selection methods. Overall, the proposed method SSFS outperforms the eight compared methods in terms of these popular evaluation criteria.

To show the classification performance clearly, we choose four datasets (Arts, Education, Science, and Social) to present the three evaluation metrics in Figs. 2–4. The horizontal axis represents the number of already-selected features, whereas the vertical axis denotes each evaluation metric, and different colors represent different feature selection methods. As can be seen in Figs. 2–4, the proposed method achieves the best performance in terms of the three metrics in comparison with the eight compared methods. Furthermore, different methods achieve the second-best classification performance, such as the MIFS method on Arts in terms of Macro-F1 and the RALM-FS method on Education in terms of Micro-F1, which indicates that the proposed method is reliable.

In general, the proposed method SSFS achieves the best classification performance among all the methods in terms of Macro-F1, Micro-F1, and HL.


Fig. 3. Nine methods on four benchmark datasets on the SVM classifier in terms of Micro-F1. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 4. Nine methods on four benchmark datasets on the ML-kNN classifier in terms of HL. (a) Arts. (b) Education. (c) Science. (d) Social.

Fig. 5. Macro-F1 of the SSFS method on the Arts dataset w.r.t. α, β, γ, and k on the SVM classifier. (a) α. (b) β. (c) γ. (d) k.

Fig. 6. Convergence curves: (a) MIFS, (b) RALM-FS, and (c) SSFS.

C. Parameter Sensitivity Analysis

There are four parameters (α, β, γ, and k) in the objective function. The parameters α, β, and γ measure the contribution of each term in the objective function, and k is the dimension of the LSS term. To study the sensitivity of each parameter in the feature selection process, we adjust one parameter while fixing the other three, searching the grid {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0} on the Arts dataset; the fixed parameters are set to 0.5, a setting also adopted in [26], [40]. We test the sensitivity of the four parameters on the SVM classifier in terms of Macro-F1 on the Arts dataset in Fig. 5. Observing Fig. 5, we find that the classification performance is not very sensitive to these four parameters.

D. Convergence and Complexity Analysis

Furthermore, we conduct experiments on four datasets (Arts, Education, Science, and Social) to present the convergence of SSFS and the compared methods in Fig. 6. Fig. 6(a)–(c) shows the convergence curves of MIFS, RALM-FS, and SSFS, respectively. The SSFS method converges before 30 iterations and, in fact, converges within a few iterations on all four datasets.

TABLE V: TIME COMPLEXITY OF METHODS

Afterward, we present the time complexity of SSFS and the other algorithm adaption methods (MIFS, MDMR, SCLS, LRFS, GMM, and RALM-FS) in Table V. Suppose that a is the number of already-selected features, d denotes the number of features, n indicates the number of instances, c represents the number of labels, and l is the cluster number of labels in the MIFS method.


TABLE VI: RUNNING TIME (SECONDS) OF DIFFERENT METHODS

As shown in Table V, the information-theoretic methods have lower computational complexity than the embedded methods because the former consider only low-order feature correlations or low-order label correlations, whereas the latter take high-order correlations into account in the design of the methods. However, computing the information-theoretic metrics themselves is time-consuming. As a result, the running time of the proposed method and the compared methods is reported in Table VI.

Table VI presents the running time (in seconds) of the proposed method and the compared methods. PPT + MI costs the least running time on the benchmark datasets because it belongs to the first-order problem transformation algorithms that do not consider label correlations. RALM-FS outperforms our method in terms of running time because RALM-FS converges quickly, as shown in Fig. 6; however, SSFS takes the same order of magnitude of running time as RALM-FS. Importantly, the proposed method SSFS achieves the best classification performance on the benchmark datasets in terms of multiple metrics.

E. Applications of Feature Selection

We are working on a project that searches for latent synthetic attributes in aluminophosphate zeolites (AlPOs). It is important to distinguish the category of AlPOs; however, the existing synthetic attributes cannot distinguish all AlPOs correctly. Chemists consider that there exist latent synthetic attributes that have not been found in their experiments. We employ the proposed method to discover latent synthetic attributes and to rank the importance of attributes. In addition, gene datasets are commonly high-dimensional, and numerous irrelevant and redundant features confuse researchers. Feature selection removes useless features while retaining relevant ones. Furthermore, feature selection does not change the original meanings of the features, which is critical for scientists.

VII. CONCLUSION AND FUTURE RESEARCH

Numerous features and large-scale labels not only increase the time cost of learning models but also decrease the classification performance of multilabel classification. Here, we propose an LSS term that shares and preserves both the latent feature structure and the latent label structure. In addition, the graph regularization technique is incorporated into the proposed objective function, named SSFS. To demonstrate the effectiveness and efficiency of SSFS, we conduct various experiments on benchmark datasets, and the experimental results verify the excellent performance of SSFS. In future work, we will continue to study feature selection techniques for multilabel classification and intend to apply our research in real projects.

REFERENCES

[1] S. Burkhardt and S. Kramer, “Online multi-label dependency topic models for text classification,” Mach. Learn., vol. 107, no. 5, pp. 859–886, May 2018.
[2] B. Al-Salemi, M. Ayob, and S.-A.-M. Noah, “Feature ranking for enhancing boosting-based multi-label text categorization,” Expert Syst. Appl., vol. 113, pp. 531–543, Dec. 2018.
[3] F. Gargiulo, S. Silvestri, M. Ciampi, and G. D. Pietro, “Deep neural network for hierarchical extreme multi-label text classification,” Appl. Soft Comput., vol. 79, pp. 125–138, Jun. 2019.
[4] J. Zhang, Z. Zhang, Z. Wang, Y. Liu, and L. Deng, “Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification,” Bioinformatics, vol. 34, no. 10, pp. 1750–1757, May 2018.
[5] Y. Ren, G. Zhang, G. Yu, and X. Li, “Local and global structure preserving based feature selection,” Neurocomputing, vol. 89, pp. 147–157, Jul. 2012.
[6] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014.
[7] E. Gibaja and S. Ventura, “A tutorial on multilabel learning,” ACM Comput. Surv., vol. 47, no. 3, p. 52, 2015.
[8] M.-L. Zhang and L. Wu, “Lift: Multi-label learning with label-specific features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 107–120, Jan. 2015.
[9] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, Sep. 2004.
[10] J. Fürnkranz, E. Hüllermeier, E. L. Mencía, and K. Brinker, “Multilabel classification via calibrated label ranking,” Mach. Learn., vol. 73, no. 2, pp. 133–153, Nov. 2008.
[11] J. Huang, G. Li, Q. Huang, and X. Wu, “Learning label-specific features and class-dependent labels for multi-label classification,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 12, pp. 3309–3323, Dec. 2016.
[12] Z. Sun et al., “Mutual information based multi-label feature selection via constrained convex optimization,” Neurocomputing, vol. 329, pp. 447–456, Feb. 2019.
[13] J. Zhang, Z. Luo, C. Li, C. Zhou, and S. Li, “Manifold regularized discriminative feature selection for multi-label learning,” Pattern Recognit., vol. 95, pp. 136–150, Nov. 2019.
[14] Y. Lin, Q. Hu, J. Zhang, and X. Wu, “Multi-label feature selection with streaming labels,” Inf. Sci., vol. 372, pp. 256–275, Dec. 2016.
[15] L. Hu, Y. Li, W. Gao, P. Zhang, and J. Hu, “Multi-label feature selection with shared common mode,” Pattern Recognit., vol. 104, Aug. 2020, Art. no. 107344.
[16] Y. Xu, J. Wang, S. An, J. Wei, and J. Ruan, “Semi-supervised multi-label feature selection by preserving feature-label space consistency,” in Proc. 27th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2018, pp. 783–792.
[17] J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. New Zealand Comput. Sci. Res. Student Conf., vol. 143150, Apr. 2008, p. 41.
[18] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in Proc. Eur. Conf. Mach. Learn., 2007, pp. 406–417.
[19] G. Doquire and M. Verleysen, “Feature selection for multi-label classification problems,” in Proc. Int. Work-Conf. Artif. Neural Netw., 2011, pp. 9–16.
[20] Y. Lin, Q. Hu, J. Liu, and J. Duan, “Multi-label feature selection based on max-dependency and min-redundancy,” Neurocomputing, vol. 168, pp. 92–103, Nov. 2015.


[21] J. Lee and D.-W. Kim, “SCLS: Multi-label feature selection based on scalable criterion for large label set,” Pattern Recognit., vol. 66, pp. 342–352, Jun. 2017.
[22] P. Zhang, G. X. Liu, and W. F. Gao, “Distinguishing two types of labels for multi-label feature selection,” Pattern Recognit., vol. 95, pp. 72–82, Nov. 2019.
[23] J. Gonzalez-Lopez, S. Ventura, and A. Cano, “Distributed multi-label feature selection using individual mutual information measures,” Knowl.-Based Syst., vol. 188, Jan. 2020, Art. no. 105052.
[24] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognit. Lett., vol. 34, no. 3, pp. 349–357, 2013.
[25] J. Lee and D.-W. Kim, “Mutual information-based multi-label feature selection using interaction information,” Expert Syst. Appl., vol. 42, no. 4, pp. 2013–2025, 2015.
[26] L. Jian, J. Li, K. Shu, and H. Liu, “Multi-label informed feature selection,” in Proc. IJCAI, 2016, pp. 1627–1633.
[27] S. Huang and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in Proc. AAAI Conf. Artif. Intell., 2012, pp. 949–955.
[28] Y. Zhu, J. T. Kwok, and Z.-H. Zhou, “Multi-label learning with global and local label correlation,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 6, pp. 1081–1094, Jun. 2018.
[29] P. Hou, X. Geng, and M. Zhang, “Multi-label manifold learning,” in Proc. AAAI Conf. Artif. Intell., 2016, pp. 1680–1686.
[30] J. Huang, G. Li, Q. Huang, and X. Wu, “Learning label specific features for multi-label classification,” in Proc. IEEE Int. Conf. Data Mining, Nov. 2015, pp. 181–190.
[31] R. Shao, N. Xu, and X. Geng, “Multi-label learning with label enhancement,” in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 437–446.
[32] F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint l2,1-norms minimization,” in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 1813–1821.
[33] X. Cai, F. Nie, and H. Huang, “Exact top-k feature selection via l2,0-norm constraint,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 1240–1246.
[34] Z. Ma, F. Nie, Y. Yang, J. R. R. Uijlings, and N. Sebe, “Web image annotation via subspace-sparsity collaborated feature selection,” IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1021–1030, Aug. 2012.
[35] X. Chang, F. Nie, Y. Yang, and H. Huang, “A convex formulation for semi-supervised multi-label feature selection,” in Proc. AAAI Conf. Artif. Intell., vol. 2, 2014, pp. 1171–1177.
[36] X. Chang, H. Shen, S. Wang, J. Liu, and L. Xue, “Semi-supervised feature analysis for multimedia annotation by mining label correlation,” in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining, 2014, pp. 74–85.
[37] M. Luo, F. Nie, X. Chang, Y. Yang, A. G. Hauptmann, and Q. Zheng, “Adaptive unsupervised feature selection with structure regularization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 944–956, Apr. 2017.
[38] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[39] J. Gui, Z. Sun, S. Ji, D. Tao, and T. Tan, “Feature selection based on structured sparsity: A comprehensive study,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1490–1507, Jul. 2017.
[40] J. Hu, Y. Li, W. Gao, and P. Zhang, “Robust multi-label feature selection with dual-graph regularization,” Knowl.-Based Syst., vol. 203, Sep. 2020, Art. no. 106126.
[41] R. Shang, W. Wang, R. Stolkin, and L. Jiao, “Subspace learning-based graph regularized feature selection,” Knowl.-Based Syst., vol. 112, pp. 152–165, Nov. 2016.
[42] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2014, pp. 977–986.
[43] X. Li, H. Zhang, R. Zhang, Y. Liu, and F. Nie, “Generalized uncorrelated regression with adaptive graph for unsupervised feature selection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1587–1595, May 2019.
[44] F. Nie, W. Zhu, and X. Li, “Unsupervised feature selection with structured graph optimization,” in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1302–1308.
[45] D. Cai, C. Zhang, and X. He, “Unsupervised feature selection for multi-cluster data,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2010, pp. 333–342.
[46] R. Shang, W. Wang, R. Stolkin, and L. Jiao, “Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection,” IEEE Trans. Cybern., vol. 48, no. 2, pp. 793–806, Feb. 2018.
[47] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[48] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, “Mulan: A Java library for multi-label learning,” J. Mach. Learn. Res., vol. 12, pp. 2411–2414, Jun. 2011.

Wanfu Gao was born in Jilin, China, in 1990. He received the B.S. and Ph.D. degrees from the College of Computer Science, Jilin University, Changchun, China, in 2013 and 2019, respectively. He is currently a Lecturer with the College of Computer Science, Jilin University, and is doing post-doctoral research at the College of Chemistry, Jilin University. His research interests include feature selection, multilabel learning, and information theory. Dr. Gao received the 2019 ACM Changchun Doctoral Dissertation Award and the Postdoctoral Innovative Talents Support Program.

Yonghao Li was born in Henan, China, in 1992. He received the B.E. degree from Henan Agricultural University, Zhengzhou, China, in 2017, and the M.S. degree in computer science from Jilin University, Changchun, China, in 2020, where he is currently pursuing the Ph.D. degree with the College of Computer Science. His research focuses on feature selection.

Liang Hu was born in Jilin, China, in 1968. He received the M.S. and Ph.D. degrees in computer science from Jilin University, Changchun, China, in 1993 and 1999, respectively. He is currently a Professor and a Doctoral Supervisor at the College of Computer Science and Technology, Jilin University. He is supported by the Hundred-Thousand-Ten Thousand Project of China. His research interests include feature selection, multilabel learning, and information theory. Dr. Hu is a member of the China Computer Federation.
