


Scalable extensions of the ReliefF algorithm for weighting and selecting features
on the multi-label learning context

Oscar Reyesa , Carlos Morellb , Sebastián Venturac,d,∗


a Department of Computer Science, University of Holguín, Cuba
b Department of Computer Science, Universidad Central de Las Villas, Cuba
c Department of Computer Science and Numerical Analysis, University of Córdoba, Spain
d Department of Computer Science, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract
Multi-label learning has become an important area of research due to the increasing number of modern applications
that contain multi-label data. Multi-label data are structured in a more complex way than single-label data;
consequently, techniques that improve the performance of machine learning algorithms over multi-label data are
desired. Feature weighting and feature selection are important feature engineering techniques with a beneficial
impact on machine learning. The ReliefF algorithm is one of the most popular algorithms for feature estimation and
it has proved its usefulness in several domains. This paper presents three extensions of the ReliefF algorithm for
working in the multi-label learning context, namely ReliefF-ML, PPT-ReliefF and RReliefF-ML. PPT-ReliefF uses a
problem transformation method to convert the multi-label problem into a single-label problem. ReliefF-ML and
RReliefF-ML adapt the classic ReliefF algorithm in order to handle multi-label data directly. The proposed ReliefF
extensions are evaluated and compared with previous ReliefF extensions on 34 multi-label datasets. The results show
that the proposed ReliefF extensions improve on preceding extensions and overcome some of their drawbacks. The
experimental results are validated using several nonparametric statistical tests and confirm the effectiveness of the
proposal for better multi-label learning.
Keywords: Multi-label learning, ReliefF algorithm, feature weighting, feature selection, multi-label classification,
label ranking

∗ Corresponding author. Tel: +34957212218; fax: +34957218630.
Email addresses: [email protected] (Oscar Reyes), [email protected] (Carlos Morell), [email protected] (Sebastián Ventura)

Preprint submitted to Neurocomputing, February 12, 2015

1. Introduction

Traditional machine learning applications have been derived from the analysis of data where the examples are associated with a single label [1]. However, studies of data that are structured in a more complex way than single-label data have recently received special attention. Multi-label problems are those in which the examples belong to a set of labels at the same time [2, 3]. The goal of the Multi-Label Learning (MLL) paradigm is to learn a model that correctly generalizes to unseen multi-label data [2, 3]. In the MLL context two tasks are studied, multi-label classification (MLC) and label ranking (LR). MLC divides the set of labels into relevant and irrelevant sets, whereas LR provides an ordering of the labels for a given query instance [3, 4].

In the last few years, an increasing number of modern applications that contain multi-label data have appeared, such as text categorization [5], emotions evoked by music [6], semantic annotation of images [7] and videos [8], and classification of protein [9] and gene function [10, 11].

Generally speaking, multi-label datasets contain a large number of features that describe the instances, e.g. descriptions of texts, images, proteins and genes [5, 7–16]. Irrelevant, interacting, redundant and noisy features have a highly negative impact on the performance of the learning algorithms [17]. Moreover, the number of features is much larger than the number of instances in several multi-label applications [13]. On the other hand, in some domains the number of possible labels can be in the region of hundreds (even thousands) and the distribution of instances per label can be highly non-uniform [8, 12, 14–16].
Consequently, some multi-label learning algorithms present poor performance with regard to time complexity and efficiency [4]. As a result, the design of MLL algorithms currently faces several challenges and is an open field of research.

Preprocessing techniques have been demonstrated to be an important step of knowledge discovery in databases [1, 18]. Feature engineering techniques such as feature weighting (FW) and feature selection (FS) improve the performance of machine learning algorithms [19, 20]. FW assigns a weight to each feature representing the usefulness of the feature to distinguish pattern classes. The weight vector can be used to improve the performance of lazy algorithms, parameterizing the distance function used to retrieve the k-nearest neighbours of a given query instance [19]. Furthermore, the weight vector can be used as a ranking of features for guiding the search for the best subset of features [21–26]. FS can be seen as a specific case of the FW process [19], where the feature weights are binary values representing whether a feature is removed or kept. FS tries to reduce the dimensionality, which has a positive effect on the efficiency, effectiveness and comprehensibility of machine learning [20, 27–29].

FS algorithms can be divided into three main categories: filter, wrapper and embedded methods. Filter methods are independent of any classification algorithm, so the biases of the learning algorithms do not influence the feature selection method. Filter methods evaluate the usefulness of a feature, or set of features, through measures of distance, dependency, information or correlation on the data. The major disadvantage of the filter approach is that it ignores the effects of the selected features on the performance of a learning algorithm; consequently, the selected features may be suboptimal for a certain classification algorithm [30].

The wrapper approach uses a specific classifier to evaluate the quality of the selected features. Wrapper methods depend on the classification algorithm selected, so the bias of the learning algorithm influences the feature selection method, and better performance is obtained for a predefined classifier. The major disadvantage of the wrapper approach is that it is computationally expensive [30].

Embedded methods include feature selection in the classifier construction process, taking into account the interaction with the learning model, but they are less computationally expensive than wrapper methods. Nowadays, structured sparsity learning is a promising paradigm for learning in high-dimensional settings. In the embedded approach by regularization methods, classifier induction and feature selection are achieved simultaneously by estimating a weight vector. Theoretical and empirical studies have suggested the usefulness of structured sparsity for feature learning, e.g. LASSO and its extensions [31–35]. Several adaptations of these types of methods have been proposed for the Multi-task Learning paradigm [36–38], which can be considered a generalization of the MLL paradigm.

In this work we focus on filter-based feature selection for MLL, specifically on the Relief family of algorithms. ReliefF [22, 23] is one of the most popular feature estimation algorithms for single-label data. The ReliefF algorithm follows the filter approach, since it does not use feedback from the classifier to assign the weights to the features. ReliefF learns a single set of weights that is employed globally over the entire instance space, and it does not employ domain-specific knowledge to set feature weights [19, 23]. Traditionally, ReliefF applications have focused on the FS process [21, 24]. However, ReliefF has proved its usefulness in other domains such as feature ranking [39], building tree-based models [23], association rule-based classifiers [40], constructive induction [41], and improving the efficiency of genetic algorithms [42] and lazy classifiers [19].

Problem transformation methods (PTM) decompose a multi-label problem into one or more single-label problems [3]. Generally, the estimation of the utility of features on multi-label data has been done using a PTM [3, 43–47]: a PTM transforms the multi-label problem, the score of each feature is computed with a single-label feature estimation algorithm, and the results are combined by an aggregation strategy.

Owing to the simplicity and effectiveness of the ReliefF algorithm, several extensions of ReliefF to the MLL context have been proposed in the literature [26, 46–48]. Traditionally, the multi-label ReliefF extensions work together with a PTM, and the weights for each feature are finally computed by some aggregation strategy, e.g. the average, minimum or maximum.

These previous works are an important contribution to MLL for dealing with high-dimensional data. However, these approaches have several limitations. First, the performance of a PTM generally depends on the number of labels of the dataset; consequently, they are very expensive for domains that contain a moderate number of labels. Second, a drawback of some PTMs is that they do not consider label correlations. Third, the previous ReliefF extensions have been restricted to the FS area and few multi-label datasets are used in their experimental studies. Fourth, they are not usually compared with other ReliefF extensions in order to determine which ReliefF extension achieves superior performance.
The usefulness of ReliefF as a feature estimation algorithm and the drawbacks presented by the previous studies have led to the design of new scalable ReliefF extensions for working in the MLL context. For this, three new ReliefF extensions, namely ReliefF-ML, PPT-ReliefF and RReliefF-ML, that outperform previous ReliefF extensions have been proposed. ReliefF-ML can be considered as a generalization of the classic ReliefF, where the weight update equation is modified. PPT-ReliefF uses the Pruned Problem Transformation method (PPT) [49] to convert the original multi-label dataset into a new multi-class dataset. On the other hand, RReliefF-ML is based on the principles of the well-known adaptation of ReliefF to regression problems [50]. The three proposed ReliefF extensions include interaction among features and take label dependencies into account. PPT-ReliefF depends on a PTM, whereas ReliefF-ML and RReliefF-ML handle the multi-label data directly.

In this paper, an analysis of the ReliefF extensions on the FW process to improve the performance of multi-label lazy algorithms is carried out. Lazy algorithms depend on the definition of a distance function that determines the k-nearest neighbours of a query instance. The distance function is defined on the feature space, so using a weight vector can reduce the negative impact of irrelevant, redundant or noisy features on the distance computation [19]. The proposal is tested on four multi-label k-NN algorithms. On the other hand, the ReliefF extensions are also tested on the FS process. The weight vector is transformed into a ranking of features, which is used to guide the search for the best subset of features. This proposal is tested on one multi-label algorithm that is highly sensitive to the presence of irrelevant, redundant and noisy features.

The experiments are carried out on 34 multi-label datasets, considering different problem domains and numbers of instances, features and labels. Several multi-label evaluation measures are used to analyse different viewpoints. The experimental stage shows the effectiveness of the proposal, obtaining significantly better results than previous ReliefF extensions on the MLC and LR tasks. The experimental study includes a statistical analysis based on several nonparametric tests, as proposed in [51–54].

The main contribution of the present work is the enhancement of the understanding of ReliefF's applicability in the MLL context. This work aims to study the benefits of the ReliefF algorithm for better MLL.

To the best of our knowledge, this paper presents the first attempt to compare the most significant multi-label ReliefF extensions over a large number of multi-label datasets, two main areas of study (the FS and FW processes) and two MLL tasks (the MLC and LR tasks).

This paper is arranged as follows: Section 2 describes the MLL paradigm and the ReliefF extensions to multi-label learning that have appeared in the literature. Section 3 presents the ReliefF-ML, PPT-ReliefF and RReliefF-ML extensions. Section 4 describes the experimental set-up and analyses the experimental results. Finally, Section 5 provides some concluding remarks.

2. Preliminaries

In this section the general definition of the MLL paradigm and the multi-label evaluation measures used in this work are presented. A general background of the Relief family of algorithms is given, highlighting the simplicity, effectiveness and applicability of the ReliefF algorithm. Finally, the previous extensions of ReliefF to MLL that have appeared in the literature are briefly described.

2.1. Multi-label learning paradigm

A multi-label dataset can be defined as follows:

• A feature space F with a cardinality equal to d (number of features) and a label space L with a cardinality equal to q (number of labels).

• A set E of n instances, where each instance i is a tuple < x_i, y_i >. x_i is the feature vector of the instance i; a feature vector is a tuple of feature values, where x_{if} represents the value of the f-th feature for the instance i. y_i is the set of labels of the instance i; it is a subset of the label space L.

The goal of Multi-Label Learning (MLL) is to perform a machine learning process from instances that are associated with multiple labels at the same time [3]. The Multi-label Classification (MLC) task is concerned with learning a predictive model that divides the label space into relevant and irrelevant label sets; binary and multi-class classification are specific cases of the MLC task [3]. On the other hand, the Label Ranking (LR) task is concerned with learning a predictive model that, for a given query instance, provides an ordering of the labels. The generalization of the MLC and LR tasks has been called Multi-label Ranking (MLR) [55].
2.1.1. Multi-label evaluation measures

Given a test set T = {< x_i, y_i >, i = 1 . . . m}, an MLC method predicts a set of labels z_i for a given query instance i. On the other hand, an LR method provides a ranking of labels r_i for a given query instance i, where r_i(λ) is the rank predicted for the label λ [3].

The Hamming Loss measure (HL) [56] averages the symmetric differences between the predicted and actual label sets:

H_L = \frac{1}{m} \sum_{i=1}^{m} \frac{| y_i \triangle z_i |}{q}    (1)

where \triangle denotes the symmetric difference between two sets.

The Example-Based F1 measure (F1Ex) [57] is the harmonic mean of precision and recall:

F1_{Ex} = \frac{1}{m} \sum_{i=1}^{m} \frac{2 | y_i \cap z_i |}{| y_i | + | z_i |}    (2)

The Ranking Loss measure (RL) [3] averages the proportion of label pairs that are incorrectly ordered:

R_L = \frac{1}{m} \sum_{i=1}^{m} \frac{| \{ (\lambda_a, \lambda_b) : r_i(\lambda_a) > r_i(\lambda_b), (\lambda_a, \lambda_b) \in y_i \times \bar{y}_i \} |}{| y_i | \, | \bar{y}_i |}    (3)

where \bar{y}_i denotes the complementary set of y_i in L.

The Average Precision measure (AP) [3] averages how many times a particular label is ranked above the other labels that actually are in the true label set:

A_P = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{| y_i |} \sum_{\lambda \in y_i} \frac{| \{ \lambda' \in y_i : r_i(\lambda') \leq r_i(\lambda) \} |}{r_i(\lambda)}    (4)

The HL and F1Ex evaluation measures are associated with the MLC task, whereas RL and AP are associated with the LR task. The higher the values of F1Ex and AP, and the lower the values of HL and RL, the better the performance of an MLL algorithm.
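As an illustration of the four measures, the following Python sketch (not part of the original experiments; the data structures are our own choice, with label sets as Python sets and rankings as dictionaries mapping each label to its rank) computes equations (1)–(4) on a toy test set:

# Illustrative sketch: the four evaluation measures of Section 2.1.1 on toy predictions.
def hamming_loss(Y, Z, q):
    # Equation (1): averaged size of the symmetric difference, normalised by q.
    return sum(len(y ^ z) for y, z in zip(Y, Z)) / (len(Y) * q)

def f1_example_based(Y, Z):
    # Equation (2): harmonic mean of precision and recall per example.
    return sum(2 * len(y & z) / (len(y) + len(z)) for y, z in zip(Y, Z)) / len(Y)

def ranking_loss(Y, R, labels):
    # Equation (3): fraction of (relevant, irrelevant) label pairs ranked in the wrong order.
    total = 0.0
    for y, r in zip(Y, R):
        y_bar = labels - y
        bad = sum(1 for a in y for b in y_bar if r[a] > r[b])
        total += bad / (len(y) * len(y_bar))
    return total / len(Y)

def average_precision(Y, R):
    # Equation (4): for each relevant label, the fraction of relevant labels ranked at or above it.
    total = 0.0
    for y, r in zip(Y, R):
        total += sum(sum(1 for lp in y if r[lp] <= r[l]) / r[l] for l in y) / len(y)
    return total / len(Y)

labels = {"a", "b", "c", "d"}                     # label space, q = 4
Y = [{"a", "b"}, {"c"}]                           # true label sets
Z = [{"a"}, {"c", "d"}]                           # predicted label sets
R = [{"a": 1, "b": 3, "c": 2, "d": 4},            # predicted rankings (1 = best)
     {"c": 1, "a": 2, "b": 3, "d": 4}]
print(hamming_loss(Y, Z, 4), f1_example_based(Y, Z))
print(ranking_loss(Y, R, labels), average_precision(Y, R))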
2.2. ReliefF algorithm

The Relief family of algorithms was initially presented in [21]. Relief is a classical method for feature estimation in single-label data. The Relief method was designed for binary class problems without missing values. The method randomly selects m instances from the training set. For each selected instance i, Relief computes the nearest neighbour from the same class of i and the nearest neighbour from the opposite class. The quality of each feature is estimated with respect to whether the feature differentiates two instances from the same class and from different classes: a feature has an undesired property if it differentiates two near instances that belong to the same class, and a desired property if it differentiates two near instances that belong to different classes [24]. The Relief family of algorithms can be implemented efficiently and avoids an expensive search over the feature space.

In [22] an extension of the Relief method called ReliefF was proposed. ReliefF deals with incomplete and noisy data, and with multi-class problems. For each sampling instance i, the k-nearest neighbours from the same class of i (Hits neighbours) and from each different class (Misses neighbours) are determined. ReliefF takes into account the information of the k-nearest neighbours to increase the precision of the feature estimation. ReliefF accounts for the effect of interacting features, although it is sensitive to redundant and irrelevant features [24]. ReliefF returns a weight vector W, where W_f represents the weight of the f-th feature. The weights of the vector W are in the [−1, 1] range; the higher the weight assigned to a feature, the more useful the feature.

The time complexity of the ReliefF algorithm is O(m · n · d). However, several search structures can be used to reduce the time complexity of ReliefF to O(d · n · log n), such as kD-Trees, Ball-Trees or Cover-Trees. Also, ReliefF can be split into independent tasks, which is a requisite for the parallelization of an algorithm [1, 24, 58].
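For reference, the following is a minimal sketch of the classic single-label ReliefF update just described (it is not the MULAN implementation; we assume numeric features scaled to [0, 1] so that the per-feature difference is simply the absolute value difference, and we use a Manhattan distance and a linear neighbour search):

# Illustrative sketch of the classic single-label ReliefF weight update.
import numpy as np

def relieff(X, y, m=None, k=5, rng=np.random.default_rng(0)):
    n, d = X.shape
    m = n if m is None else m                      # number of sampled instances
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))         # class prior probabilities
    W = np.zeros(d)
    for i in rng.choice(n, size=m, replace=False):
        dist = np.abs(X - X[i]).sum(axis=1)        # Manhattan distance to all instances
        dist[i] = np.inf
        order = np.argsort(dist)
        hits = [j for j in order if y[j] == y[i]][:k]
        if hits:                                   # hits pull the weight down
            W -= np.abs(X[hits] - X[i]).mean(axis=0) / m
        for c in classes:                          # k nearest misses per opposite class
            if c == y[i]:
                continue
            misses = [j for j in order if y[j] == c][:k]
            if misses:                             # misses, weighted by class priors, push it up
                factor = prior[c] / (1.0 - prior[y[i]])
                W += factor * np.abs(X[misses] - X[i]).mean(axis=0) / m
    return W

Under these assumptions the returned weights stay in the [−1, 1] range discussed above, and the running time is O(m · n · d), matching the complexity analysis.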
In [59] it is empirically observed that the Relief family of algorithms approximates a gradient ascent based on a margin-based criterion. In [60, 61] a mathematical interpretation of the heuristic used by Relief is presented, as an online method that solves a convex optimization problem with a margin-based objective function.

Similar to structured sparsity learning methods, ReliefF implicitly solves an optimization problem to find the best weight vector. However, the objective function that ReliefF optimizes does not depend on the feedback of a classifier. Structured sparsity learning methods (e.g. regularization methods) optimize an objective function that is a composite of a loss function (e.g. quadratic loss, hinge loss or logistic loss) and a regularization term (e.g. LASSO or Bridge regularization). The computational cost of evaluating the loss function depends on the performance of the classification algorithm used; consequently, the efficiency of the algorithm used to optimize the objective function is a primordial point in this type of method.

The interpretability of the feature estimation with ReliefF makes it more attractive than other feature estimation algorithms [58]. The mathematical interpretation of the Relief scheme explains the success of this family of algorithms in several real applications [19, 21, 23, 24, 39–42].
The ease of application of ReliefF has been proven on other problems, such as the evaluation of literals in inductive logic programming [62], cost-sensitive classification [63], the evaluation of features with ordered values [64] and learning similarity metrics [65].

Owing to the simplicity, efficiency, effectiveness and applicability of the ReliefF algorithm, it has also been adapted to other machine learning paradigms, such as Multi-Instance Learning [66], Multi-Target Learning in the context of structured targets [67] and MLL [26, 46–48, 68].

2.2.1. Previous extensions of ReliefF to multi-label learning

Several ReliefF extensions to MLL have appeared in the literature. The previous ReliefF extensions use a PTM to transform the multi-label dataset into one or more single-label datasets. Afterwards, the classic ReliefF algorithm is performed on every single-label dataset generated and, finally, the feature weights are computed by an aggregation strategy.

The BR-ReliefF extension is proposed in [48] and evaluated on ten multi-label datasets in [26]. Later, in [47], BR-ReliefF and a multi-label feature estimation algorithm based on the Information Gain measure are compared. BR-ReliefF uses the Binary Relevance (BR) [3] approach to measure the contribution of each feature according to each label. BR is a PTM that decomposes the multi-label problem into q binary classification problems. The classic ReliefF is performed for each label and the average of the score of each feature across all labels is computed (a sketch of this decomposition-plus-averaging scheme is given below). The BR-ReliefF extension performs O(q · m · n · d) operations, since it executes the classic ReliefF q times; the efficiency of BR-ReliefF is therefore affected by datasets with a high number of labels. Moreover, the BR approach does not consider label correlations.
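The following short Python sketch illustrates the BR-style decomposition with average aggregation described above. It is our own illustration, not the authors' code: single_label_scorer stands for any single-label feature estimator (for instance the classic ReliefF sketch shown earlier), and the stand-in scorer used in the example is only a toy.

# Illustrative sketch of BR-style decomposition with averaging over the q labels.
import numpy as np

def br_feature_scores(X, label_sets, labels, single_label_scorer):
    per_label = []
    for l in labels:
        y_l = np.array([1 if l in s else 0 for s in label_sets])   # binary problem for label l
        per_label.append(single_label_scorer(X, y_l))
    return np.mean(per_label, axis=0)                              # average score per feature

# Example with a trivial stand-in scorer (difference of class-conditional feature means):
toy_scorer = lambda X, y: np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])
S = [{"a"}, {"a", "b"}, {"b"}, set()]
print(br_feature_scores(X, S, ["a", "b"], toy_scorer))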
The MReliefF extension is presented in [46]. The multi-label problem is decomposed into a set of pairwise single-label problems. This decomposition is equivalent to breaking down the multi-label problem into several binary classification problems using the Ranking by Pairwise Comparison (RPC) [3] approach. RPC is a PTM that decomposes the multi-label problem into q(q − 1)/2 binary classification problems. MReliefF excludes those examples that fall into the Hits and the Misses neighbours at the same time. In [46] the authors state that the occurrence of these cases is very rare and therefore the exclusion of these instances does not significantly affect the results; however, this holds because the two specific datasets used possess this property. MReliefF carries out O(q² · m · n · d) operations, since it executes the classic ReliefF q(q − 1)/2 times. This ReliefF extension has a high computational cost on datasets with a moderate number of labels.

The LP-ReliefF extension is proposed in [47]. LP-ReliefF uses the Label Power Set approach (LP) [2] to measure the contribution of each feature directly from the multi-class problem generated. The LP approach is a PTM that constructs one multi-class dataset from a multi-label dataset: LP considers each unique set of labels as one of the classes of the new multi-class dataset. The time complexity of the LP-ReliefF extension is O(m · n · d); the LP method needs O(n · q) steps to construct a multi-class dataset from a multi-label dataset and the classic ReliefF needs O(m · n · d) steps, but m · n > n and generally d > q in real multi-label problems.

LP implicitly takes label correlations into account. However, the LP method has a high complexity on multi-label datasets that present a large number of distinct label sets, limiting its scalability and leading to a tendency to over-fit the data [49, 69]. Several sets of labels can appear only sporadically in multi-label problems, resulting in a scarcity of instances in the generated multi-class dataset. Consequently, the LP method can generate highly imbalanced multi-class datasets.

Previous extensions of ReliefF have only dealt with the FS process to improve the efficiency and effectiveness of MLL. In addition, the present work also focuses on the FW process to improve the performance of several multi-label lazy algorithms.

3. Scalable extensions of ReliefF to multi-label learning

In this section the basis of the three new ReliefF extensions to MLL is explained. PPT-ReliefF uses a PTM approach, whereas the two other extensions handle the multi-label data directly.

3.1. ReliefF-ML

We initially proposed the ReliefF-ML extension in [68]. ReliefF-ML was evaluated on the FW process to improve the performance of three multi-label lazy algorithms on eleven multi-label datasets, showing the effectiveness of the proposal.

An instance in the MLL context is assigned to multiple labels at the same time. Consequently, the nearest Hits and Misses concepts of the ReliefF algorithm cannot be used in a strict sense.
Given a multi-label dataset, the prior probability that an instance belongs to a label l can be computed as follows [70]:

P_l = \frac{C_l + b}{n + 2b}    (5)

where C_l is the number of instances in the dataset that belong to label l, n is the number of instances of the training set, and b is the smoothing parameter controlling the strength of the uniform prior (b = 1 yields the Laplace smoothing).

Given two instances i and j, the distance between the sets of labels of i and j is calculated by the Hamming distance (see equation 6). The distance d_L measures how much the label sets of the two instances differ; a smaller value of d_L represents a greater similarity in the classification of these instances.

d_L(i, j) = \frac{| y_i \triangle y_j |}{q}    (6)

For each relevant and irrelevant label of a sampling instance i, a group of k-nearest neighbours is defined. The following groups of Hits (H_i^l) and Misses (M_i^l) with respect to an instance i are computed:

• H_i^l: k-nearest neighbours that have the relevant label l of i as a relevant label.

• M_i^l: k-nearest neighbours that have the irrelevant label l of i as a relevant label.

Based on the defined groups H_i^l and M_i^l, the following probabilities are defined:

P_{H_i^l} = \frac{\sum_{j \in H_i^l} d_L(i, j)}{k}    (7)

P_{M_i^l} = \frac{\sum_{j \in M_i^l} d_L(i, j)}{k}    (8)

P_{H_i^l} represents the probability that two nearest instances that share the label l as relevant belong to different sets of labels. P_{M_i^l} represents the probability that two nearest instances belong to different sets of labels, where i has the label l as irrelevant and the k-nearest neighbours have the label l as relevant.

ReliefF-ML takes the dependence among labels into account through the calculus of the probabilities P_{H_i^l} and P_{M_i^l}. A feature weight reflects the ability of the feature to distinguish class labels: a high feature weight indicates that the feature has different values on instances with dissimilar sets of labels and similar values on instances with similar sets of labels. ReliefF-ML iteratively updates the feature weights as follows:

W_f = W_f - \sum_{l \in y_i} \left( \frac{P_l}{\sum_{q \in y_i} P_q} \cdot \frac{1 - P_{H_i^l}}{1 + P_{H_i^l}} \cdot \sum_{j \in H_i^l} \frac{\delta(x_{if}, x_{jf})}{m \cdot k} \right) + \sum_{l \in \bar{y}_i} \left( \frac{P_l}{\sum_{q \in \bar{y}_i} P_q} \cdot P_{M_i^l} \cdot \sum_{j \in M_i^l} \frac{\delta(x_{if}, x_{jf})}{m \cdot k} \right)    (9)

where the function δ(x_{if}, x_{jf}) calculates the difference between the values of the f-th feature on the instances i and j. The parameter m represents the number of sampling instances selected from the training set to estimate the feature weights.

The contributions of each relevant and irrelevant label are weighted by the factors P_l / \sum_{q \in y_i} P_q and (1 - P_{H_i^l}) / (1 + P_{H_i^l}), and P_l / \sum_{q \in \bar{y}_i} P_q and P_{M_i^l}, respectively. ReliefF-ML uses the given representation of the original datasets, i.e. it does not use a PTM.

ReliefF-ML requires the retrieval of the k-nearest neighbours for each relevant and irrelevant label of an instance i. However, through a linear search over the training set the groups of k-nearest neighbours of an instance i can be found efficiently. Consequently, the time complexity of ReliefF-ML is equal to that of the classic ReliefF algorithm, O(m · n · d). Algorithm 1 describes the ReliefF-ML extension.

Algorithm 1: ReliefF-ML algorithm.
Input: E → training set of multi-label instances; m → number of sampling instances; k → number of nearest neighbours
Output: W → weight vector
begin
    foreach f ∈ F do
        W_f ← 0;
    end
    foreach l ∈ L do
        P_l ← LabelProbability(l);  (equation 5)
    end
    for t ← 1 to m do
        i ← randomly pick an instance from E;
        foreach label l ∈ y_i do
            H_i^l ← kNearestHits(i, l);
            P_{H_i^l} ← Probability(i, H_i^l);  (equation 7)
        end
        foreach label l ∈ \bar{y}_i do
            M_i^l ← kNearestMisses(i, l);
            P_{M_i^l} ← Probability(i, M_i^l);  (equation 8)
        end
        foreach f ∈ F do
            W_f ← UpdateWeight(f);  (equation 9)
        end
    end
    return W;
end
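A compact Python sketch of Algorithm 1 is given below. It is our own illustration, not the authors' MULAN implementation: label sets are Python sets, labels is the label space passed as a set, features are numeric values scaled to [0, 1] so that δ(x_{if}, x_{jf}) = |x_{if} − x_{jf}|, and neighbours are found by a linear search.

# Illustrative sketch of ReliefF-ML (Algorithm 1, equations 5-9).
import numpy as np

def relieff_ml(X, Y, labels, m=None, k=5, b=1.0, rng=np.random.default_rng(0)):
    n, d = X.shape
    m = n if m is None else m
    q = len(labels)
    # Equation (5): smoothed prior probability of each label.
    P = {l: (sum(l in y for y in Y) + b) / (n + 2 * b) for l in labels}
    d_L = lambda yi, yj: len(yi ^ yj) / q                     # equation (6)
    W = np.zeros(d)
    for i in rng.choice(n, size=m, replace=False):
        dist = np.abs(X - X[i]).sum(axis=1); dist[i] = np.inf
        order = np.argsort(dist)
        for l in labels:
            group = [j for j in order if l in Y[j]][:k]       # H_i^l if l is relevant to i, M_i^l otherwise
            if not group:
                continue
            P_group = sum(d_L(Y[i], Y[j]) for j in group) / k # equations (7) and (8)
            diffs = np.abs(X[group] - X[i]).sum(axis=0) / (m * k)
            if l in Y[i]:                                     # relevant label: hit term of equation (9)
                norm = sum(P[u] for u in Y[i])
                W -= (P[l] / norm) * ((1 - P_group) / (1 + P_group)) * diffs
            else:                                             # irrelevant label: miss term of equation (9)
                norm = sum(P[u] for u in labels - Y[i])
                W += (P[l] / norm) * P_group * diffs
    return W

Under these assumptions the weights remain bounded in [−1, 1], and the linear neighbour search keeps the cost at O(m · n · d), as stated in the text.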
3.2. PPT-ReliefF

The principal drawback of the LP method is the generation of multi-class datasets with a large number of classes. Generally, the LP method produces an extreme class imbalance problem for learning. The Pruned Problem Transformation (PPT) method is proposed in [49] to address the drawbacks of the LP method.

The simplest PPT method selects those label sets that occur more times than a defined threshold and discards the rest of the label sets. PPT has the power of the LP approach, where the correlation among labels is implicitly taken into account, but unlike the LP method, PPT only considers the most important label relationships. The PPT approach reduces the scarcity of labels and the over-fitting of the data [49].

In [49] a more elaborate PPT approach that introduces disjoint subsets is presented. The disjoint subsets are extracted from those label sets that occur fewer times than a threshold. In general, the PPT approach contains two steps, a pruning step and a subsampling step of label sets. The pruning step removes the label sets that occur fewer times than a defined threshold. This step reduces the complexity of the generated multi-class dataset; however, if a high number of label sets is discarded then considerable information can be lost. The subsampling step reintroduces into the training data those instances that have label subsets which occur more frequently, ensuring a minimal information loss [71].

The PPT method is influenced by the parameters p and s. The parameter p represents how much pruning needs to be carried out: a value of p = 0 signifies that all label sets are accepted (the PPT method then being equivalent to the LP method), a value of p = 1 indicates that all instances whose label set appears at most once are pruned, and so on.

The parameter s is related to the subsampling step and represents how many subsets are recovered from the label sets previously discarded. Given a discarded label set, the PPT method finds all its label subsets that occur more than p times. These label subsets are sorted according to their frequency of apparition and, afterwards, the top s label subsets are used to create new instances that are copies of the original multi-label instance; the new instances have the top s label subsets previously selected as their label sets.

The PPT method not only outperforms the LP method in predictive performance, it also improves the efficiency and scalability on larger datasets. PPT has a worst-case time complexity equal to that of the LP method, O(n · q) [71].

In this work we propose a new ReliefF extension to MLL called PPT-ReliefF. PPT-ReliefF uses the PPT approach to measure the contribution of each feature directly from the multi-class problem generated. The time complexity of the PPT-ReliefF extension is O(m · n · d). Algorithm 2 describes the proposed PPT-ReliefF extension.

Algorithm 2: PPT-ReliefF algorithm.
Input: E → training set of multi-label instances; m → number of sampling instances; k → number of nearest neighbours; p → pruning parameter; s → number of label subsets to select
Output: W → weight vector
begin
    multiClassDataset ← PPTTransformation(E, p, s);
    W ← ReliefF(multiClassDataset, m, k);
    return W;
end

In Algorithm 2 the function PPTTransformation(E, p, s) represents the PPT method, which takes the multi-label dataset to convert, the pruning parameter (p) and the sampling parameter (s) as parameters; given a multi-label dataset, it returns a multi-class dataset. The function ReliefF(multiClassDataset, m, k) refers to the classic ReliefF algorithm, which only needs the multi-class dataset, the number of sampling instances (m) and the number of neighbours (k) as parameters.
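The following Python sketch illustrates the pruning and subsampling steps of the PPT transformation used in Algorithm 2. The function and variable names are our own, not MULAN's, and the resulting multi-class dataset (one class per distinct label set) can then be fed to any classic ReliefF implementation, such as the single-label sketch given earlier.

# Illustrative sketch of the PPT transformation (pruning plus subsampling of label sets).
from collections import Counter

def ppt_transform(instances, p, s):
    # instances: list of (x, labelset) pairs; p: pruning threshold; s: subsets to recover.
    counts = Counter(frozenset(y) for _, y in instances)
    frequent = {ls for ls, c in counts.items() if c > p}        # pruning step
    out = []
    for x, y in instances:
        ls = frozenset(y)
        if ls in frequent:
            out.append((x, ls))                                 # kept as one multi-class example
            continue
        # Subsampling step: reintroduce copies labelled with the top-s frequent subsets of y.
        subsets = sorted((f for f in frequent if f and f < ls),
                         key=lambda f: counts[f], reverse=True)[:s]
        out.extend((x, f) for f in subsets)
    return out                                                  # each distinct label set is one class

data = [([0.1, 0.9], {"a"}), ([0.2, 0.8], {"a"}), ([0.9, 0.1], {"b"}),
        ([0.8, 0.2], {"b"}), ([0.5, 0.5], {"a", "b", "c"})]
print(ppt_transform(data, p=1, s=1))   # the rare set {"a","b","c"} is replaced by a copy with a frequent subset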
3.3. RReliefF-ML

In this work we propose a new ReliefF extension to MLL called RReliefF-ML. RReliefF-ML is based on the well-known ReliefF adaptation to regression problems (RReliefF) [50]. From the point of view of probability, the classic ReliefF algorithm estimates the feature weight as follows [22]:

W_f = P^f_{eqval|eqcl} - P^f_{eqval|difcl}    (10)

where P^f_{eqval|eqcl} is the probability that nearest instances have the same value in the f-th feature and the same prediction, and P^f_{eqval|difcl} is the probability that nearest instances have the same value in the f-th feature and different classes. Using the Bayes rule, equation (10) is transformed into [24]:

W_f = \frac{P^f_{eqcl|eqval} \, P^f_{eqval}}{P_{eqcl}} - \frac{(1 - P^f_{eqcl|eqval}) \, P^f_{eqval}}{1 - P_{eqcl}}    (11)
and equation (11) is equivalent to [24]:

W_f = \frac{P^f_{diffcl|diffval} \, P^f_{diffval}}{P_{diffcl}} - \frac{(1 - P^f_{diffcl|diffval}) \, P^f_{diffval}}{1 - P_{diffcl}}    (12)

where P^f_{diffcl|diffval} is the prior probability that nearest instances have different values in the f-th feature and different predictions, P^f_{diffval} is the prior probability that nearest instances have different values in the f-th feature, and P_{diffcl} is the prior probability that nearest instances belong to different classes. In the Relief family of algorithms, equation (12) represents a unified view of feature quality estimation in classification and regression problems [24].

Analysing this unified view (equation 12) in the MLL context, the estimation of a feature weight is computed as follows:

W_f = \frac{P^f_{diffY|diffval} \cdot P^f_{diffval}}{P_{diffY}} - \frac{(1 - P^f_{diffY|diffval}) \cdot P^f_{diffval}}{1 - P_{diffY}}    (13)

where P^f_{diffY|diffval} is the prior probability that nearest instances have different values in the f-th feature and belong to different sets of labels, and P_{diffY} is the prior probability that nearest instances belong to different sets of labels.

The prior probability P_{diffY} is computed from the distance between the sets of labels of two learning instances. Given two instances i and j, the distance between the sets of labels of i and j is calculated by the Hamming distance (see equation 6). The prior probability that nearest instances belong to different sets of labels is computed as follows:

P_{diffY} = \frac{\sum_{i \in S} \sum_{j \in N_i} d_L(i, j)}{k}    (14)

where S is a set of m instances randomly selected from a training set E of multi-label instances and N_i represents the set of nearest neighbours of the instance i.

The prior probability that nearest instances have different values in the f-th feature (P^f_{diffval}) is computed as follows:

P^f_{diffval} = \frac{\sum_{i \in S} \sum_{j \in N_i} \delta(x_{if}, x_{jf})}{k}    (15)

The prior probability that nearest instances have different values in the f-th feature and belong to different sets of labels (P^f_{diffY|diffval}) is computed as follows:

P^f_{diffY|diffval} = \frac{\sum_{i \in S} \sum_{j \in N_i} d_L(i, j) \cdot \delta(x_{if}, x_{jf})}{k}    (16)

Algorithm 3 describes the main steps followed by the RReliefF-ML extension.

Algorithm 3: RReliefF-ML algorithm.
Input: E → training set of multi-label instances; m → number of sampling instances; k → number of nearest neighbours
Output: W → weight vector
begin
    P_{diffY} ← 0;
    foreach f ∈ F do
        P^f_{diffval} ← 0; P^f_{diffY|diffval} ← 0; W_f ← 0;
    end
    for t ← 1 to m do
        i ← randomly pick an instance from E;
        N_i ← kNearestNeighbors(i);
        foreach j ∈ N_i do
            P_{diffY} ← P_{diffY} + d_L(i, j)/k;
            foreach f ∈ F do
                P^f_{diffval} ← P^f_{diffval} + δ(x_{if}, x_{jf})/k;
                P^f_{diffY|diffval} ← P^f_{diffY|diffval} + d_L(i, j) · δ(x_{if}, x_{jf})/k;
            end
        end
    end
    foreach f ∈ F do
        W_f ← P^f_{diffY|diffval} · P^f_{diffval} / P_{diffY} − (1 − P^f_{diffY|diffval}) · P^f_{diffval} / (1 − P_{diffY});
    end
    return W;
end

Independently, a method similar to RReliefF-ML was proposed in [72]. The authors test their proposal on synthetic datasets and on the FS process. The synthetic multi-label datasets are generated following two different strategies (HyperCubes and HyperSpheres) proposed in [73]. The equations used to update the label and feature dissimilarities differ from our proposal: the dissimilarity values are multiplied by a weighted distance and, in addition, the partial weights are combined in a different manner.

RReliefF-ML takes label correlations into account through the calculus of the probabilities P^f_{diffY|diffval} and P_{diffY}. A high feature weight indicates that the feature has different values on instances with dissimilar sets of labels and similar values on instances with similar sets of labels.

The RReliefF-ML extension does not use a PTM for the estimation of the feature weights. RReliefF-ML retrieves only the k-nearest neighbours of each sampling instance. The computational complexity of RReliefF-ML is O(m · n · d).
8
4. Empirical study

In this section the multi-label datasets, algorithms, statistical tests and other settings used in the experimental study are described, and the experimental results on the different classifiers and datasets, together with the statistical analysis, are presented. The empirical study was divided into two parts: a comparative study of the ReliefF extensions on the FW process and a comparison of the ReliefF extensions on the FS process.

4.1. Multi-label datasets

In the experiments 34 real multi-label datasets were used¹, where the number of relevant, irrelevant and redundant features is unknown. Multi-label datasets of different scales and from different application domains were included to analyse the behaviour of the ReliefF extensions on datasets with diverse properties.

The datasets come from six domains. The Birds dataset [74] contains examples of multiple bird species for acoustic classification. Cal500 [16] contains pieces of music for semantic annotation. The Emotions dataset [43] stores examples of songs according to the emotions that they evoke. Flags [75] stores examples about nations and their national flags. Scene [8] contains a series of patterns about kinds of landscapes. The Corel5k [12] and Corel16k [14] datasets contain Corel images. Mediamill [76] contains examples for the automatic detection of semantic concepts in videos. The Yeast [77] and Genbase [9] datasets come from the biological domain, including information about the function of genes and proteins. Medical [78] was used in the Medical Natural Language Processing Challenge in 2007. Enron [79] contains emails from 151 users. TMC2007-500 [80, 81] stores examples of aviation safety reports. Bibtex [82] contains BibTeX examples for automatic tag suggestion. The other eleven datasets come from the Yahoo text collection [13].

Table 1 shows some statistics of the multi-label datasets. The values of the properties of the Corel16k dataset are averaged over the ten samples used. The label cardinality is the average number of labels per example. The label density is the label cardinality divided by the total number of labels. The label cardinality, label density and number of different subsets of labels are measures that represent the complexity of a multi-label dataset [3]. The datasets vary in size: from 194 up to 43907 instances, from 19 up to 52350 features, from 6 to 374 labels, from 15 to 6555 different subsets of labels, from 1.014 to 26.044 label cardinality, and from 0.009 to 0.485 label density.

¹ All these datasets are available at http://mulan.sourceforge.net/datasets.html

4.2. Experimental setting

The multi-label ReliefF extensions and algorithms were implemented on MULAN [83]. MULAN is a Java library which contains several methods for MLL. For each possible combination of algorithm and dataset a stratified 10-fold cross validation strategy was used; the methods proposed in [84] were used to stratify the multi-label data. For each fold, in the training phase a ReliefF extension learned the weight vector on the training set. The whole training set was used to retrieve the k-nearest neighbours of a sampling instance. The best number of neighbours (k) for each classifier and ReliefF extension on each dataset was determined.

Several studies have shown that a larger number of sampling instances (m) results in a more reliable approximation in the feature estimation process [22, 24]. For the sake of fairness, in the experiments the whole training set was used in the feature estimation process. Also, the values of the p and s parameters for PPT-ReliefF were tuned by cross validation on the training set, as recommended in [69].

Feature weighting setting

The performance of lazy algorithms can be significantly improved with the use of an appropriate weight vector. The aim is to find a weight vector W that allows the distance function to recover the k-nearest neighbours in the feature space, given that not all features have the same relevance in a dataset [19].

In the experiments four multi-label lazy algorithms were used. In [70] the Multi-Label k-Nearest Neighbour algorithm (MLkNN) is proposed; MLkNN determines the label set of a query instance by the maximum a posteriori (MAP) principle. The BRkNN algorithm appears in [85]; BRkNN calculates the confidence of each label based on the label sets of the neighbours of the query, and it is conceptually equivalent to using the BR method in conjunction with the kNN algorithm. In [86] the Dependent Multi-Label k-Nearest Neighbour algorithm (DMLkNN) is proposed; DMLkNN defines a MAP rule that takes into account the number of all labels in the neighbourhood. The Multi-Label Weighted k-Nearest Neighbour Classifier (MLCWkNN) appears in [87]; MLCWkNN is based on the Bayes theorem and proposes an instance-weighted kNN version.

Each lazy classifier was compared with six weighted lazy classifier versions. A weighted lazy classifier parameterizes its distance function with a weight vector learned by a ReliefF extension. All the feature weights were scaled to [0, 1] using the min-max normalization [1].
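The following Python sketch (our own illustration, with made-up names and toy values) shows how a learned weight vector, min-max normalised to [0, 1], parameterises the distance function used by a lazy classifier to retrieve the k-nearest neighbours: features with weights close to zero barely contribute to the distance.

# Illustrative sketch of a weighted distance for a lazy (kNN-based) classifier.
import numpy as np

def normalise(w):
    w = np.asarray(w, dtype=float)
    span = w.max() - w.min()
    return (w - w.min()) / span if span > 0 else np.ones_like(w)   # min-max normalization

def weighted_knn_indices(X, query, w, k):
    dists = (np.abs(X - query) * w).sum(axis=1)                    # weighted Manhattan distance
    return np.argsort(dists)[:k]

X = np.array([[0.1, 0.9, 0.5], [0.2, 0.1, 0.5], [0.9, 0.8, 0.5]])
w = normalise([0.40, 0.05, -0.30])                                 # e.g. a ReliefF-ML weight vector
print(weighted_knn_indices(X, np.array([0.15, 0.85, 0.5]), w, k=2))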
Dataset Domain n d q ds lc ld
Flags Image 194 19 7 54 3.392 0.485
Cal500 Music 502 68 174 502 26.044 0.150
Emotions Music 593 72 6 27 1.869 0.311
Birds Audio 645 260 19 133 1.014 0.053
Yeast Biology 2417 103 14 198 4.237 0.303
Scene Image 2407 294 6 15 1.074 0.179
Genbase Biology 662 1186 27 32 1.252 0.046
Medical Text 978 1449 45 94 1.245 0.028
Enron Text 1702 1001 53 753 3.378 0.064
Corel5k Image 5000 499 374 3175 3.522 0.009
Mediamill Video 43907 120 101 6555 4.376 0.043
Corel16k (10 samples) Image 13811 500 161 4937 2.867 0.018
Bibtex Text 7395 1836 159 2856 2.402 0.015
TMC2007-500 Text 28596 500 22 1341 2.16 0.098
Arts Text 7484 23146 26 599 1.654 0.064
Science Text 6428 37187 40 457 1.450 0.036
Business Text 11214 21924 30 233 1.599 0.053
Health Text 9250 30605 32 335 1.644 0.051
Reference Text 8027 39679 33 275 1.174 0.035
Education Text 12030 27534 33 511 1.463 0.044
Recreation Text 12828 30324 22 530 1.429 0.065
Entertainment Text 12730 32001 21 337 1.414 0.067
Computers Text 12444 34096 33 428 1.507 0.046
Society Text 14512 31802 27 1054 1.670 0.062
Social Text 12111 52350 39 361 1.279 0.033

Table 1: Statistics of the benchmark datasets, number of instances (n), number of features (d), number of labels (q), different subsets of labels (d s ),
label cardinality (lc ) and label density (ld ). The datasets are ordered by their complexity calculated as n · d.
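As a quick check of these definitions against Table 1: the label density is l_d = l_c / q, so for the Emotions dataset l_d = 1.869 / 6 ≈ 0.311 and for Flags l_d = 3.392 / 7 ≈ 0.485, matching the reported values.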

Feature selection setting

The ReliefF extensions were also compared on the FS process to improve the effectiveness of multi-label learning algorithms. The BRkNN classifier was used as the base-line algorithm, owing to its simplicity and its high sensitivity to the presence of irrelevant and redundant features.

A weight vector learned by the ReliefF algorithm can be viewed as a feature ranking. A feature ranking is important to guide the search in the FS process, especially when the search space to find the best feature subset is large [1]. Several methods have been proposed to evaluate a feature ranking in the FS process [1, 25, 88]. The procedure proposed in [25] was used, owing to its simplicity and computational complexity; it provides a heuristic for comparing several feature rankings.

In the experiments the 100 top features of a ranking R were selected to construct a ranking R′. The process starts from the first feature of R′ and continues with the next ranked attribute. Algorithm 4 shows the procedure used to evaluate a feature ranking.

Algorithm 4: Iterative procedure used to evaluate a feature ranking. The expression X ≻ Y means that the X value is better than the Y value.
Input: W → weight vector; Φ → multi-label learning algorithm
Output: best evaluation measure value
begin
    R ← Rank(W);
    R′ ← fTopFeatures(R);
    BestEvaluation ← Empty;
    F_s ← ∅;
    foreach f ∈ R′ do
        TempEvaluation ← Evaluate(Φ, F_s ∪ f);
        if TempEvaluation ≻ BestEvaluation then
            BestEvaluation ← TempEvaluation;
            F_s ← F_s ∪ f;
        end
    end
    return BestEvaluation;
end

An R_x ranking is considered better than an R_y ranking if the base-line algorithm reaches a better performance with the feature subset determined from the R_x ranking. The ideal ranking would have all the relevant features placed at the top of the ranking [1, 25, 88].
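The following Python sketch reproduces the forward procedure of Algorithm 4 under our own naming. Here evaluate stands for training and testing the base-line classifier (e.g. BRkNN) on the candidate feature subset and returning a single evaluation measure, and higher_is_better tells the procedure how to compare two values of that measure (e.g. False for Hamming loss).

# Illustrative sketch of the ranking-evaluation procedure of Algorithm 4.
import numpy as np

def evaluate_ranking(weights, evaluate, top=100, higher_is_better=True):
    ranking = np.argsort(weights)[::-1][:top]       # indices of the top-ranked features
    best, selected = None, []
    for f in ranking:
        candidate = selected + [f]
        score = evaluate(candidate)
        if best is None or (score > best if higher_is_better else score < best):
            best, selected = score, candidate       # keep the feature only if it helps
    return best, selected

The returned pair contains the best measure value found and the corresponding feature subset; the subset has at most top features, mirroring the 100-top setting used in the experiments.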
4.3. Statistical tests

To analyse and validate the results, several nonparametric statistical tests were used, as proposed in [51–54, 89]. The Friedman test [90] was performed to evaluate whether there are significant differences in the results of the algorithms.
If the Friedman test indicated that the results were significantly different, the Bergmann-Hommel post-hoc test [91] was used to perform multiple comparisons among all methods. The Bergmann-Hommel test is more powerful than other post-hoc tests, such as the Nemenyi test [92]; the Nemenyi test is conservative and many of the obvious differences would not be detected [52, 54].

In the statistical analysis the Adjusted p-values (APVs) [93] were considered. APVs provide more information in a statistical analysis: they take into account the fact that multiple tests are conducted and can be compared directly with any significance level α [52, 54]. In this work a significance level α = 0.05 was considered.

The statistical information obtained from the Bergmann-Hommel test was illustrated graphically as a graph. An edge Φ1 → Φ2 represents that the method Φ1 outperforms the method Φ2. Each edge was labelled with the evaluation measures on which Φ1 outperformed Φ2, with the APVs of the Bergmann-Hommel test indicated between parentheses.
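A minimal Python sketch of the first stage of this analysis is shown below. It is not the authors' statistical scripts: it applies the Friedman test to a results matrix (one row per dataset, one column per method) and computes the average ranks using SciPy; the Bergmann-Hommel post-hoc procedure is not available in SciPy and would have to be implemented separately or taken from a dedicated statistics package.

# Illustrative sketch: Friedman test and average ranks over a toy results matrix.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

results = np.array([[0.268, 0.274, 0.265],          # e.g. HL of three methods on three datasets
                    [0.139, 0.140, 0.138],
                    [0.209, 0.205, 0.191]])
stat, p_value = friedmanchisquare(*results.T)        # one argument per method (column)
ranks = rankdata(results, axis=1).mean(axis=0)       # average rank per method (lower HL is better)
print(p_value, ranks)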
4.4. Results and discussion

The algorithms, as standalone runnable files, and all the results of the empirical study are available in order to facilitate the replicability of the experiments². In this manuscript only a summary of the results appears. In all cases, the best results are highlighted in bold typeface in the tables.

² http://www.uco.es/grupos/kdis/kdiswiki/MLL/ReliefF

4.4.1. Feature weighting

Due to lack of space, the tables of results on the FW process for the BRkNN, DMLkNN and MLCWkNN classifiers are not included in this paper. Tables 2 and 3 show the results of the MLkNN classifier using the ReliefF extensions for the HL and F1Ex measures. The results for the AP and RL measures can be consulted on the web page mentioned above. In the tables the last two rows show the average rank (Rank) and the position in the ranking (Pos.) of each method.

Generally speaking, the results showed that the weighted lazy classifiers using PPT-ReliefF, ReliefF-ML and RReliefF-ML performed better than the original (non-weighted) lazy methods, and that the lazy classifiers using PPT-ReliefF, ReliefF-ML and RReliefF-ML obtained the lowest average ranks. There were significant results over the four evaluation measures considered on complex datasets, such as Corel5k, the Corel16k collection, Mediamill, TMC2007-500, Bibtex and the datasets that belong to the Yahoo collection. In some cases the original lazy classifiers performed better than the weighted lazy classifiers using the BR-ReliefF and MReliefF extensions.

A statistical analysis to detect significant differences in performance between the non-weighted and weighted lazy classifiers was carried out. The Friedman test rejected the null hypothesis in all the cases analysed, considering a significance level α = 0.05. The p-values returned by the Friedman test can be consulted in the tables.

Afterwards, a Bergmann-Hommel post-hoc test for all pairwise comparisons was carried out. The results of the Bergmann-Hommel test for the MLkNN classifier are shown in Figure 1.

From a statistical point of view, the MLkNN classifier using the PPT-ReliefF, RReliefF-ML and ReliefF-ML extensions outperformed the MLkNN classifier using BR-ReliefF, MReliefF and LP-ReliefF on the four multi-label evaluation measures considered. Similar results were obtained for the BRkNN, DMLkNN and MLCWkNN classifiers.

The Bergmann-Hommel test did not detect significant differences among the PPT-ReliefF, ReliefF-ML and RReliefF-ML extensions (the three extensions proposed in this work). However, PPT-ReliefF obtained the first position in 14 of the 16 average rankings (16 rankings = four lazy methods × four evaluation measures) returned by the Friedman test. RReliefF-ML obtained the first position in two average rankings, the second position in eleven and the third position in three rankings. ReliefF-ML obtained the second position in three and the third position in 13 rankings. The weighted lazy classifiers that used the MReliefF and BR-ReliefF extensions performed worse than the non-weighted original lazy algorithms.

For the HL, F1Ex, RL and AP measures, the weighted lazy classifiers that used the weight vector of PPT-ReliefF obtained the best results, followed by those classifiers that used the weight vectors from RReliefF-ML and ReliefF-ML.

The results showed that PPT-ReliefF outperformed the LP-ReliefF extension. This result indicates that the PPT technique reduces the scarcity of labels and the over-fitting of the data: PPT produced more balanced multi-class datasets than the LP approach and reduced the complexity of the multi-label problems.

From the statistical analysis, we concluded that the three proposed ReliefF extensions learn the usefulness of each feature correctly. It also confirms the effectiveness of the ReliefF algorithm as a feature weighting method for better multi-label lazy learning.
Dataset    MLkNN (no FW)    BR-ReliefF    LP-ReliefF    MReliefF    PPT-ReliefF    ReliefF-ML    RReliefF-ML
Flags 0.268 0.274 0.273 0.292 0.276 0.268 0.265
Cal500 0.139 0.140 0.140 0.138 0.140 0.138 0.138
Emotions 0.209 0.205 0.193 0.198 0.190 0.188 0.191
Birds 0.051 0.055 0.051 0.052 0.049 0.049 0.045
Yeast 0.198 0.197 0.198 0.199 0.197 0.196 0.197
Scene 0.095 0.097 0.098 0.096 0.097 0.095 0.093
Genbase 0.005 0.006 0.004 0.006 0.003 0.004 0.003
Medical 0.019 0.023 0.017 0.019 0.017 0.017 0.015
Enron 0.051 0.062 0.051 0.062 0.050 0.053 0.051
Corel5k 0.009 0.010 0.009 0.010 0.009 0.009 0.009
Mediamill 0.031 0.031 0.031 0.031 0.031 0.031 0.031
Corel16k01 0.020 0.020 0.020 0.020 0.020 0.020 0.020
Corel16k02 0.020 0.019 0.019 0.019 0.019 0.019 0.019
Corel16k03 0.021 0.020 0.020 0.020 0.020 0.020 0.020
Corel16k04 0.019 0.019 0.019 0.018 0.019 0.019 0.019
Corel16k05 0.020 0.019 0.019 0.019 0.019 0.019 0.019
Corel16k06 0.019 0.019 0.019 0.019 0.019 0.019 0.019
Corel16k07 0.018 0.017 0.018 0.018 0.017 0.018 0.017
Corel16k08 0.019 0.018 0.018 0.018 0.018 0.018 0.018
Corel16k09 0.018 0.018 0.018 0.018 0.018 0.018 0.018
Corel16k10 0.020 0.020 0.020 0.20 0.019 0.017 0.019
Bibtex 0.014 0.014 0.014 0.014 0.013 0.014 0.013
TMC2007-500 0.058 0.065 0.064 0.064 0.063 0.056 0.056
Arts 0.063 0.063 0.063 0.063 0.062 0.063 0.063
Science 0.036 0.036 0.036 0.036 0.035 0.035 0.036
Business 0.029 0.029 0.029 0.029 0.029 0.028 0.029
Health 0.049 0.051 0.048 0.052 0.048 0.049 0.048
Reference 0.035 0.036 0.035 0.034 0.034 0.034 0.034
Education 0.044 0.043 0.043 0.043 0.043 0.043 0.043
Recreation 0.063 0.063 0.062 0.062 0.062 0.062 0.063
Entertaiment 0.063 0.062 0.062 0.062 0.061 0.062 0.062
Computers 0.040 0.043 0.040 0.045 0.039 0.039 0.040
Society 0.060 0.062 0.058 0.058 0.055 0.055 0.054
Social 0.030 0.030 0.030 0.030 0.029 0.029 0.030
Rank 4.956 5.029 4.235 4.706 3.000 3.029 3.044
Pos. 6 7 4 5 1 2 3

Table 2: HL (↓) results for MLkNN on FW process. The Friedman’s test rejects the null hypothesis with a p-value equal to 6.488E-7.

4.4.2. Feature selection

Tables 4 and 5 show the results of the BRkNN classifier using the whole feature space and using the feature subsets determined from the ReliefF extensions, for the HL and F1Ex evaluation measures. The results for the AP and RL measures can be consulted on the web page mentioned above. In the tables the last two rows show the average rank (Rank) and the position in the ranking (Pos.) of each method.

Generally speaking, the results showed that the BRkNN classifier using the feature subsets determined from the feature rankings of PPT-ReliefF, ReliefF-ML and RReliefF-ML performed better than the BRkNN that used the whole feature space. Likewise, the BRkNN classifier obtained a lower average rank when it used the feature subsets determined from PPT-ReliefF, ReliefF-ML and RReliefF-ML.

The results showed that on simple multi-label datasets, such as Flags, Birds, Emotions, Cal500, Yeast, Genbase and Medical, the FS process yielded a considerable improvement in the effectiveness of the BRkNN classifier.

It is important to highlight that BRkNN using the feature subsets of PPT-ReliefF, ReliefF-ML and RReliefF-ML obtained the best results on the eleven complex multi-label datasets that come from the Yahoo collection. Several of the multi-label datasets have a large number of features, e.g. Bibtex (1836 features), Medical (1449 features), Genbase (1186 features), Enron (1001 features) and the Yahoo collection (from 21924 to 52350 features). The experimental results can be considered remarkable, given that only the 100 top features of the feature rankings were considered in the FS process.
Dataset    MLkNN (no FW)    BR-ReliefF    LP-ReliefF    MReliefF    PPT-ReliefF    ReliefF-ML    RReliefF-ML
Flags 0.696 0.713 0.718 0.684 0.694 0.708 0.720
Cal500 0.326 0.328 0.331 0.308 0.331 0.332 0.339
Emotions 0.587 0.628 0.626 0.633 0.664 0.676 0.643
Birds 0.527 0.490 0.524 0.518 0.523 0.528 0.569
Yeast 0.611 0.609 0.617 0.611 0.609 0.611 0.617
Scene 0.682 0.669 0.658 0.656 0.685 0.656 0.663
Genbase 0.950 0.953 0.977 0.952 0.980 0.967 0.981
Medical 0.439 0.412 0.533 0.421 0.515 0.515 0.543
Enron 0.417 0.417 0.461 0.357 0.475 0.419 0.482
Corel5k 0.019 0.017 0.005 0.010 0.023 0.027 0.035
Mediamill 0.533 0.533 0.533 0.528 0.534 0.533 0.537
Corel16k01 0.013 0.010 0.007 0.010 0.013 0.011 0.015
Corel16k02 0.016 0.013 0.009 0.010 0.033 0.020 0.027
Corel16k03 0.012 0.012 0.009 0.015 0.020 0.015 0.012
Corel16k04 0.015 0.012 0.026 0.027 0.020 0.028 0.022
Corel16k05 0.015 0.018 0.006 0.013 0.019 0.024 0.018
Corel16k06 0.018 0.018 0.016 0.017 0.034 0.020 0.025
Corel16k07 0.017 0.020 0.011 0.018 0.027 0.020 0.025
Corel16k08 0.018 0.017 0.009 0.005 0.024 0.023 0.020
Corel16k09 0.005 0.006 0.009 0.010 0.018 0.013 0.016
Corel16k10 0.006 0.008 0.009 0.006 0.016 0.018 0.012
Bibtex 0.161 0.174 0.189 0.201 0.226 0.201 0.205
TMC2007-500 0.660 0.604 0.614 0.609 0.617 0.680 0.671
Arts 0.034 0.029 0.033 0.024 0.050 0.055 0.039
Science 0.015 0.010 0.017 0.016 0.018 0.022 0.023
Business 0.737 0.728 0.736 0.729 0.735 0.737 0.735
Health 0.363 0.423 0.347 0.421 0.387 0.365 0.376
Reference 0.236 0.221 0.131 0.219 0.241 0.264 0.244
Education 0.026 0.026 0.026 0.032 0.032 0.028 0.027
Recreation 0.058 0.060 0.059 0.065 0.064 0.060 0.056
Entertaiment 0.108 0.104 0.113 0.107 0.128 0.116 0.124
Computers 0.369 0.409 0.432 0.378 0.433 0.427 0.429
Society 0.155 0.150 0.173 0.135 0.165 0.178 0.169
Social 0.263 0.223 0.310 0.286 0.285 0.336 0.321
Rank 4.941 5.323 4.691 5.265 2.618 2.706 2.456
Pos. 5 7 4 6 2 3 1

Table 3: F1Ex (↑) results for MLkNN on FW process. The Friedman’s test rejects the null hypothesis with a p-value equal to 5.295E-11.

The BRkNN classifier using a small number of features (100 or fewer) performed better than the BRkNN that used the whole feature space.
A statistical analysis to detect significant differences in performance among the ReliefF extensions was carried out. The Friedman's test rejected the null hypothesis in all cases analyzed, considering a significance level α = 0.05. The p-values returned by the Friedman's test are reported in the table captions; a minimal sketch of this ranking-based comparison is given below.
Afterwards, a Bergmann-Hommel's post-hoc test for all pairwise comparisons was carried out. The results of the Bergmann-Hommel's test are displayed in Figure 2.
From a statistical point of view, the BRkNN classifier using the feature subsets determined from the PPT-ReliefF, RReliefF-ML and ReliefF-ML extensions outperformed the BRkNN classifier that used the whole feature space for the four multi-label evaluation measures considered. The BRkNN classifier using the three proposed ReliefF extensions also performed better than the BRkNN classifier using the BR-ReliefF, LP-ReliefF and MReliefF extensions for the four multi-label evaluation measures.
The Bergmann-Hommel's test did not detect significant differences among the PPT-ReliefF, ReliefF-ML and RReliefF-ML extensions. However, PPT-ReliefF obtained the first position in two of the four average rankings (four rankings = one classifier × four evaluation measures) returned by the Friedman's test. RReliefF-ML obtained the first position in one average ranking and the second position in two rankings. ReliefF-ML obtained the third position in three rankings.
The RReliefF-ML extension obtained the best results for the HL measure, followed by PPT-ReliefF. ReliefF-ML performed better for the F1Ex measure, followed by PPT-ReliefF.
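The sketch below shows, under assumptions, how such a comparison can be reproduced: methods are ranked within each dataset, average ranks are computed, and the Friedman test is applied to the score matrix. The Bergmann-Hommel post-hoc procedure used above is not available in SciPy and is not shown; scores is a hypothetical (n_datasets × n_methods) array holding, e.g., F1Ex values.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_with_ranks(scores, higher_is_better=True):
    # Rank the methods within each dataset (rank 1 = best), then average over datasets.
    data = scores if higher_is_better else -scores
    ranks = np.vstack([rankdata(-row) for row in data])
    avg_ranks = ranks.mean(axis=0)
    # Friedman test over the methods' score columns (blocks = datasets).
    _, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
    return avg_ranks, p_value
```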
Figure 1: Significant differences in performance among the ReliefF extensions on the MLkNN classifier, according to the Bergmann-Hommel's test.

Figure 2: Significant differences in performance among the ReliefF extensions on the BRkNN classifier, according to the Bergmann-Hommel's test.

PPT-ReliefF obtained the best results for the RL and AP measures, followed by RReliefF-ML.
From the statistical analysis, we concluded that the three proposed ReliefF extensions correctly determine the usefulness of each feature. It also confirms the effectiveness of the ReliefF algorithm as a feature selection method for better multi-label learning.
Dataset  -  BR-ReliefF  LP-ReliefF  MReliefF  PPT-ReliefF  ReliefF-ML  RReliefF-ML   (BRkNN; the "-" column is BRkNN using the whole feature space)
Flags 0.271 0.250 0.226 0.237 0.228 0.228 0.232
Cal500 0.145 0.140 0.140 0.140 0.140 0.140 0.139
Emotions 0.197 0.182 0.178 0.175 0.176 0.179 0.178
Birds 0.049 0.042 0.044 0.045 0.043 0.046 0.042
Yeast 0.203 0.208 0.205 0.200 0.201 0.203 0.200
Scene 0.108 0.115 0.115 0.115 0.114 0.146 0.118
Genbase 0.003 0.004 0.003 0.004 0.003 0.002 0.001
Medical 0.021 0.027 0.015 0.024 0.015 0.026 0.015
Enron 0.058 0.053 0.050 0.052 0.049 0.058 0.049
Corel5k 0.010 0.009 0.010 0.010 0.009 0.009 0.009
Mediamill 0.032 0.033 0.032 0.032 0.032 0.031 0.031
Corel16k01 0.020 0.020 0.020 0.020 0.020 0.019 0.019
Corel16k02 0.020 0.019 0.019 0.019 0.019 0.018 0.019
Corel16k03 0.020 0.020 0.020 0.020 0.020 0.020 0.020
Corel16k04 0.019 0.019 0.019 0.020 0.019 0.019 0.019
Corel16k05 0.019 0.019 0.019 0.019 0.018 0.017 0.018
Corel16k06 0.020 0.019 0.019 0.019 0.018 0.017 0.018
Corel16k07 0.018 0.018 0.018 0.018 0.017 0.017 0.017
Corel16k08 0.019 0.018 0.018 0.018 0.018 0.018 0.018
Corel16k09 0.019 0.018 0.018 0.018 0.017 0.017 0.017
Corel16k10 0.021 0.020 0.020 0.020 0.019 0.017 0.019
Bibtex 0.015 0.014 0.014 0.014 0.013 0.013 0.014
TMC2007-500 0.064 0.066 0.065 0.064 0.064 0.066 0.064
Arts 0.063 0.062 0.063 0.062 0.062 0.062 0.061
Science 0.035 0.033 0.034 0.036 0.033 0.032 0.032
Business 0.029 0.029 0.030 0.032 0.027 0.027 0.029
Health 0.050 0.051 0.051 0.049 0.049 0.047 0.047
Reference 0.037 0.040 0.037 0.036 0.032 0.032 0.031
Education 0.044 0.045 0.046 0.042 0.037 0.035 0.036
Recreation 0.063 0.062 0.064 0.058 0.058 0.058 0.058
Entertaiment 0.063 0.061 0.060 0.064 0.057 0.057 0.056
Computers 0.040 0.042 0.040 0.039 0.039 0.036 0.040
Society 0.058 0.050 0.054 0.051 0.049 0.050 0.051
Social 0.030 0.032 0.032 0.031 0.026 0.025 0.026
Rank 5.485 5.000 4.779 4.515 2.794 2.853 2.573
Pos. 7 6 5 4 2 3 1

Table 4: HL (↓) results for BRkNN on FS process. The Friedman’s test rejects the null hypothesis with a p-value equal to 5.122E-11.

4.4.3. Discussion
Table 6 summarizes the main characteristics of the state-of-the-art ReliefF extensions and of the three extensions proposed in this paper. The first column shows the time complexity of each ReliefF extension. The column named "Label correlations" specifies whether the corresponding ReliefF extension handles the dependencies among labels or not. The column "Transformation" states the type of PTM used by each ReliefF extension. The last column describes some advantages and disadvantages of the ReliefF extensions.
The ReliefF extensions that use BR and RPC as PTM are very expensive on multi-label datasets that have a large number of labels. PPT-ReliefF, RReliefF-ML and ReliefF-ML have the same computational complexity and run faster than the MReliefF and BR-ReliefF extensions. However, the RReliefF-ML method is faster than ReliefF-ML and PPT-ReliefF, since RReliefF-ML does not use any PTM and retrieves only the k nearest neighbours of each sampled instance. It is also worth highlighting that RReliefF-ML is the simplest of the proposed extensions, and it obtains very good results in the FW and FS processes.
The evidence suggests that the ReliefF extensions which use the Label Powerset family of methods (i.e. LP and PPT) as PTM perform better than the ReliefF extensions whose PTM converts the multi-label problem into several single-label problems (e.g. BR and RPC), not only in computing time but also in the efficacy with which the feature weights are determined; a simplified sketch of this kind of pruned transformation is given below. Likewise, the ReliefF extensions that consider label dependencies perform better than those that do not.
In the case of the FW process, the PPT-ReliefF, ReliefF-ML and RReliefF-ML extensions improved the performance of the four lazy classifiers on the four evaluation measures considered.
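The following is a simplified, hypothetical sketch of a pruned Label Powerset transformation in the spirit of PPT. It is not the exact PPT algorithm (which can also re-introduce pruned instances via frequent subsets of their label sets), and the pruning threshold p is an assumed parameter.

```python
from collections import Counter

def ppt_like_transform(label_sets, p=3):
    # Map each instance's label set (a frozenset) to a single multi-class label,
    # pruning label sets that occur fewer than p times in the training data.
    counts = Counter(label_sets)
    frequent = [ls for ls, c in counts.items() if c >= p]
    class_of = {ls: idx for idx, ls in enumerate(frequent)}
    transformed = [(i, class_of[ls]) for i, ls in enumerate(label_sets)
                   if ls in class_of]  # instances with rare label sets are simply dropped here
    return transformed, class_of
```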
Dataset  -  BR-ReliefF  LP-ReliefF  MReliefF  PPT-ReliefF  ReliefF-ML  RReliefF-ML   (BRkNN; the "-" column is BRkNN using the whole feature space)
Flags 0.675 0.728 0.746 0.724 0.730 0.730 0.729
Cal500 0.302 0.321 0.318 0.318 0.318 0.325 0.320
Emotions 0.584 0.624 0.631 0.600 0.621 0.639 0.633
Birds 0.523 0.584 0.582 0.572 0.589 0.542 0.598
Yeast 0.583 0.590 0.585 0.583 0.580 0.593 0.590
Scene 0.551 0.523 0.550 0.545 0.541 0.500 0.521
Genbase 0.976 0.800 0.982 0.705 0.983 0.983 0.900
Medical 0.335 0.142 0.506 0.237 0.595 0.400 0.560
Enron 0.267 0.388 0.400 0.333 0.486 0.347 0.410
Corel5k 0.004 0.018 0.003 0.040 0.025 0.016 0.021
Mediamill 0.527 0.528 0.518 0.523 0.527 0.529 0.534
Corel16k01 0.009 0.019 0.019 0.019 0.020 0.020 0.020
Corel16k02 0.034 0.038 0.040 0.035 0.042 0.044 0.041
Corel16k03 0.013 0.015 0.014 0.016 0.024 0.023 0.019
Corel16k04 0.030 0.038 0.036 0.035 0.039 0.040 0.038
Corel16k05 0.014 0.008 0.009 0.012 0.020 0.027 0.018
Corel16k06 0.048 0.050 0.055 0.061 0.090 0.099 0.079
Corel16k07 0.008 0.009 0.010 0.006 0.009 0.024 0.009
Corel16k08 0.028 0.020 0.020 0.017 0.031 0.030 0.022
Corel16k09 0.060 0.050 0.066 0.060 0.085 0.099 0.090
Corel16k10 0.038 0.040 0.040 0.055 0.066 0.060 0.065
Bibtex 0.070 0.104 0.127 0.160 0.224 0.129 0.149
TMC2007-500 0.596 0.578 0.567 0.585 0.600 0.593 0.590
Arts 0.028 0.044 0.038 0.055 0.066 0.067 0.060
Science 0.014 0.016 0.015 0.017 0.020 0.020 0.018
Business 0.726 0.716 0.718 0.687 0.798 0.766 0.792
Health 0.230 0.221 0.225 0.229 0.245 0.247 0.256
Reference 0.403 0.398 0.365 0.415 0.428 0.421 0.419
Education 0.039 0.032 0.039 0.041 0.085 0.089 0.078
Recreation 0.037 0.032 0.030 0.021 0.065 0.098 0.077
Entertaiment 0.096 0.092 0.095 0.085 0.122 0.123 0.125
Computers 0.366 0.354 0.325 0.365 0.421 0.410 0.410
Society 0.152 0.150 0.150 0.148 0.169 0.174 0.145
Social 0.217 0.223 0.215 0.245 0.248 0.248 0.261
Rank 5.412 5.103 4.912 5.118 2.338 2.279 2.838
Pos. 7 5 4 6 2 1 3

Table 5: F1Ex (↑) results for BRkNN on FS process. The Friedman’s test rejects the null hypothesis with a p-value equal to 6.091E-11.

Name Time complexity Label correlations Transformation Observations


BR-ReliefF O(q · m · n · d) No BR method Poor performance on datasets with a high number of labels.
MReliefF O(q² · m · n · d) Partially RPC method Poor performance on datasets with a high number of labels.
LP-ReliefF O(m · n · d) Yes LP method Poor performance on datasets with a high number of distinct label sets.
ReliefF-ML O(m · n · d) Yes None Acceptable computing time.
PPT-ReliefF O(m · n · d) Yes PPT method Reduces the scarcity of labels and the over-fitting of data.
RReliefF-ML O(m · n · d) Yes None The simplest ReliefF extension.

Table 6: Summary of the characteristics of the ReliefF extensions.
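To make the O(m · n · d) entries in Table 6 concrete, the sketch below shows a generic, single-label Relief-style weight update: m sampled instances, an O(n · d) pass over the training set to find neighbours, and per-feature updates over the d features. It is only an assumed illustration of where the cost comes from, not the update rule of ReliefF-ML, PPT-ReliefF or RReliefF-ML.

```python
import numpy as np

def relief_style_weights(X, y, m=100, k=10, seed=0):
    # X: (n, d) float array with features scaled to [0, 1]; y: (n,) class labels.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(m):                                      # m sampled instances
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1).astype(float)   # O(n * d) neighbour search
        dist[i] = np.inf
        order = np.argsort(dist)
        hits = [j for j in order if y[j] == y[i]][:k]
        misses = [j for j in order if y[j] != y[i]][:k]
        for j in hits:    # a relevant feature should differ little between nearest hits...
            w -= np.abs(X[i] - X[j]) / (m * k)
        for j in misses:  # ...and differ a lot between nearest misses.
            w += np.abs(X[i] - X[j]) / (m * k)
    return w
```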

However, the results of the weighted lazy classifiers using BR-ReliefF, LP-ReliefF and MReliefF vary according to the measure and dataset employed.
The evidence suggests that the weight vector learned in the training phase allows the distance function to recover those nearest examples in the feature space that are associated with the most reliable set of labels for classifying a query instance; a minimal sketch of such a weighted neighbour search is given below. The proposed ReliefF extensions performed well on simple and complex multi-label datasets for the MLC and LR tasks. However, the results showed a smaller increase in performance on those multi-label datasets that have a small label density and a large number of distinct label sets at the same time, e.g. the Corel5k and Corel16k collections.
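The sketch below illustrates, under assumptions, how a learned weight vector can bias the neighbour search of a lazy learner. How the neighbours' label sets are aggregated into a prediction is classifier-specific (MLkNN and BRkNN do it differently) and is not shown.

```python
import numpy as np

def weighted_nearest_neighbours(X_train, Y_train, x_query, w, k=10):
    # Weighted Euclidean distance: features with larger (positive) ReliefF-style
    # weights dominate the search; negative weights are clipped to zero.
    w = np.clip(w, 0.0, None)
    dist = np.sqrt((((X_train - x_query) ** 2) * w).sum(axis=1))
    nn = np.argsort(dist)[:k]
    return nn, Y_train[nn]  # indices of the k nearest neighbours and their label sets
```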
In the case of the FS process, the results showed that PPT-ReliefF outperformed the LP-ReliefF extension. This result confirmed that the PPT technique is significantly superior to the LP method: PPT reduced the complexity of the multi-label datasets without loss of effectiveness in machine learning.
The evidence suggested that the distributions of the relevant features among the f-top features of the rankings determined by PPT-ReliefF, ReliefF-ML and RReliefF-ML were better than the corresponding distributions for the three other ReliefF extensions. Moreover, the three proposed ReliefF extensions performed well on simple and complex multi-label datasets for the MLC and LR tasks in the FS process. According to the results, the proposed ReliefF extensions performed better on datasets that have a small label density.
5. Conclusions
In this work, three scalable ReliefF extensions to multi-label learning, called PPT-ReliefF, ReliefF-ML and RReliefF-ML, have been presented. The PPT-ReliefF extension uses the PPT method to convert the original multi-label dataset into a new multi-class dataset. The ReliefF-ML extension can be considered a generalization of the classic ReliefF. The RReliefF-ML extension, in turn, is based on the principles of the well-known adaptation of ReliefF to regression problems. The three proposed extensions take into account the label dependencies and the issue of interacting features.
The proposed ReliefF extensions were extensively compared with previous ReliefF extensions. The experimental study was divided into two parts. In the first part, the ReliefF extensions were analysed on the FW process to improve the performance of the multi-label lazy algorithms. The statistical analysis showed that the three proposed ReliefF extensions outperformed the previous ReliefF extensions, improving the performance of the multi-label lazy algorithms. The statistical tests showed that the weighted lazy algorithms that use the weight vector learned by PPT-ReliefF perform well, followed by those that use RReliefF-ML and ReliefF-ML.
In the second part of the experiment, the ReliefF extensions were evaluated on the FS process. The baseline classifier using the feature subsets determined from the three proposed ReliefF extensions outperformed the classifier that uses the whole feature space and the feature subsets determined from the previous ReliefF extensions. The study shows that, with a small number of features, the baseline classifier obtains good results on complex multi-label datasets.
The PPT-ReliefF, RReliefF-ML and ReliefF-ML extensions performed well for the MLC and LR tasks on the FW and FS processes. These extensions are scalable on simple and complex multi-label datasets with different properties. The experimental study confirms the benefits of the ReliefF algorithm as a feature engineering technique for better MLL, the main motivation for the present work. We recommend that, when new ReliefF extensions to MLL are proposed, they be compared with the PPT-ReliefF, RReliefF-ML and ReliefF-ML extensions using several evaluation measures.
Future work will carry out a comparative study of how the proposed ReliefF extensions scale in comparison with other state-of-the-art multi-label feature estimation and feature selection algorithms. Furthermore, it would be important to examine the effectiveness of the proposal on synthetic multi-label datasets, where the numbers of relevant, irrelevant and redundant features are known.
