
Neurocomputing 341 (2019) 168–182


Monotonic classification: An overview on algorithms, performance measures and data sets

José-Ramón Cano (a), Pedro Antonio Gutiérrez (b), Bartosz Krawczyk (c), Michał Woźniak (d), Salvador García (e,*)

(a) Department of Computer Science, University of Jaén, EPS of Linares, Avenida de la Universidad S/N, Linares 23700, Jaén, Spain
(b) Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain
(c) Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
(d) Department of Computer Science, Wrocław University of Technology, Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland
(e) Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain

* Corresponding author.

Article info

Article history: Received 17 November 2018; Revised 4 February 2019; Accepted 11 February 2019; Available online 14 March 2019.
Communicated by Dr. Nianyin Zeng

Keywords: Monotonic classification, Ordinal classification, Taxonomy, Software, Performance metrics, Monotonic data sets

Abstract

Currently, knowledge discovery in databases is an essential first step when identifying valid, novel and useful patterns for decision making. There are many real-world scenarios, such as bankruptcy prediction, option pricing or medical diagnosis, where the classification models to be learned need to fulfill restrictions of monotonicity (i.e. the target class label should not decrease when input attribute values increase). For instance, it is rational to assume that a higher debt ratio of a company should never result in a lower level of bankruptcy risk. Consequently, there is a growing interest from the data mining research community concerning monotonic predictive models. This paper aims to present an overview of the literature in the field, analyzing existing techniques and proposing a taxonomy of the algorithms based on the type of model generated. For each method, we review the quality metrics considered in the evaluation and the different data sets and monotonic problems used in the analysis. In this way, this paper serves as an overview of monotonic classification research in specialized literature and can be used as a functional guide for the field.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Data mining, as a key stage in the discovery of knowledge, is aimed at extracting models that represent data in ways we may not have previously taken into consideration [1]. Among all the data mining alternatives, we focus our attention on classification as a predictive task [2,3]. There is a particular case of predictive classification where the target class takes values in a set of ordered categories. In the case at hand, we are referring to ordinal classification or regression [4]. In addition, the classification task is defined as monotonic classification in those cases in which domains of attributes have been ordered and a monotonic relationship exists between an evaluation of an object in the attributes and its class assignment [5].

Monotonicity is a type of background knowledge of vital importance for many real problems, which is needed to obtain more accurate, robust and fairer models of the data considered. In this way, monotonicity can be found in different environments such as economics, natural language or game theory [5], as well as the evaluation of courses at teaching institutions [6].

Some important examples of real problems where this kind of background knowledge has to be considered are being analyzed today. For bankruptcy prediction in companies in time [7], appropriate action should be taken considering the information based on financial indicators taken from their annual reports. Monotonicity is present in the comparison of two companies where one dominates the other in all financial indicators. Because of this dominance, the overall evaluation of the second one should not be higher than that of the first. In this way, monotonic classification has been applied to predict the credit rating score used by banks [8]. Another example is the house pricing problem [9], in which we should assure that the price of a house increases with an increase of the number of rooms or with the availability of air conditioning, and that it decreases with, for example, the pollution concentration in the area.

Considering monotonicity constraints in a learning task is motivated by two main facts [10]: (1) the size of the hypothesis space,
which facilitates the learning process, is reduced; (2) other metrics besides accuracy, such as the consistency with respect to these constraints, can be used by experts to accept or reject certain models.

In this way, the need of handling background knowledge about ordinal evaluations and monotonicity constraints in the learning process has led to the development of new algorithms. The interest in the field of monotonic classification has significantly increased [11,12], leading to a growing number of techniques and methods. Apart from these algorithmic developments, different quality measures have been presented to measure the consistency with respect to monotonicity constraints.

To the knowledge of the authors, there are no functional guides for this domain of study, and it can be difficult to obtain a general overview of the state of the art. For this reason, this paper presents an overview of the monotonic classification field, including:

• A systematic review of the techniques proposed in the literature.
• A taxonomy to categorize all the existing algorithms, including whether or not there is publicly available software related to them.
• The quality measures applied to evaluate the performance of monotonic classifiers in the literature. These metrics analyze the performance both in terms of accuracy and degree of fulfillment of the monotonicity constraints.
• Finally, the data sets considered in every proposal and a summary of which are the most used and where they can be found.

The remainder of this paper is structured as follows. Section 2 presents a definition of the monotonic classification problem. Section 3 shows an overview of the monotonic methods and the taxonomy proposed to categorize them. Section 4 offers an analysis of the quality metrics considered in monotonic classification. Section 5 presents the data sets evaluated in the literature, highlighting the most popular ones and where they can be found. In Section 6 we offer some guidelines regarding existing methods to researchers interested in this topic and we enumerate some recommendations for future research. Finally, Section 7 is devoted to the conclusions reached.

2. Definition of monotonic classification

The process of knowledge discovery in databases is a key objective for organizations to make accurate and timely decisions and recognize the value in data sources. One of the main stages within the process is data mining [1], where models are extracted from the input data collected. These models are used to support people in making decisions about problems that may be rapidly changing and not easily specified in advance (i.e. unstructured and semi-structured decision problems). Among all kinds of models, we focus our attention on classification algorithms, where the goal is to predict the value of a target variable. When the target variable exhibits a natural ordering, we are talking about ordinal classification (also known as ordinal regression) [4,11,13,14]. The order of the categories can be exploited to construct more accurate models in those application domains involving preferences, like social choice, multiple criteria decision making, or decision under risk and uncertainty. For example, in a factory a worker can be evaluated as "excellent", "good" or "bad", or a credit risk can be rated as "AAA", "AA", "A" or "A-". A particular case of ordinal classification is monotonic classification [11]. The interest of the scientific community in monotonic classification has increased over the last few years. This fact can be corroborated in Fig. 1, where the number of proposals in the specialized literature is represented over time.

Classification problems where there is background knowledge in the form of ordinal evaluations and monotonicity constraints are very common. In this kind of problem, the order properties of the input space are exploited, by using the available knowledge in terms of a dominance relation (one sample dominates another when each coordinate of the former is not smaller than the respective coordinate of the latter). Monotonicity constraints require that the class label assigned to a pattern should be greater than or equal to the class labels assigned to the patterns it dominates. As an example, consider a monotonicity constraint relating one input attribute and the target class. In this case, a sample in the data set with a higher value of the input attribute should not be associated to a lower class value, as long as the other attributes of the sample are fixed. A monotonicity constraint always involves one input attribute and the class attribute, and there should be, at least, one monotonicity constraint (to distinguish monotonic classification from ordinal regression). Monotonicity constraints can be either direct (as the example presented before) or inverse (if the value of the attribute decreases, the class value should not increase). Usually, in real monotonic classification problems, the monotonicity constraints are assumed only for a subset of the input features.

As a descriptive example, we can consider student evaluation in a college, the students being evaluated with a rating between 0 and 10. We consider three students (Student A, B and C) with 22 evaluations each and a final mark. We consider that all the input attributes (the 22 evaluations) have a direct monotonic assumption with respect to the output value (the final qualification, shown as the last value of each list):

• Student A: 5,5,5,5,7,6,5,5,5,5,5,5,6,5,5,6,6,6,5,5,5,5,4.
• Student B: 3,5,3,4,7,3,3,5,3,3,3,3,6,3,3,4,3,6,4,3,5,3,5.
• Student C: 2,2,1,2,1,2,2,3,2,2,1,2,3,2,2,3,3,2,2,1,2,3,2.

As can be observed, there is a monotonic violation involving two samples (Students A and B): Student B, who has worse or equal evaluation marks than Student A, presents a higher final qualification. On the other hand, there are no monotonic violations when considering Student C with respect to both Students A and B.

Now, we formally define a classification data set with ordinal labels and monotonicity constraints. Let us assume that patterns are described using a total of f input variables with ordered domains, $x_i \subseteq R^f$, and a class label, $y_i$, from a finite set of C ordered labels, $y_i \in Y = \{1, \ldots, C\}$. In this way, the data set D consists of n samples or instances $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. As previously discussed, a dominance relation, $\succeq$, is defined as follows:

$x \succeq x' \Leftrightarrow x_s \geq x'_s \;\; \forall s$ with a monotonicity constraint, (1)

where $x_s$ and $x'_s$ are the sth coordinates of patterns $x$ and $x'$, respectively. In other words, $x$ dominates $x'$ if each coordinate of $x$ is not smaller than the respective coordinate of $x'$.

Samples $x$ and $x'$ in space D are comparable if either $x \succeq x'$ or $x' \succeq x$. Both $x$ and $x'$ are incomparable otherwise. Two examples $x$ and $x'$ are identical if $x_j = x'_j$, $\forall j \in \{1, \ldots, f\}$, and they are non-identical if $\exists j$ for which $x_j \neq x'_j$.

A pair of comparable examples $(x, y)$ and $(x', y')$ is said to be monotone if:¹

$x \succeq x' \wedge x \neq x' \wedge y \geq y'$, (2)

or

$x = x' \wedge y = y'$. (3)

¹ Recall that $y, y' \in Y = \{1, \ldots, C\}$, so that every two labels can be compared using the ordinal scale.
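As a minimal illustration of the dominance relation (1) and the monotone-pair conditions (2)-(3), the following sketch (not taken from the paper; the function names are ours, and a direct monotonicity constraint on every input attribute is assumed) checks the student example above:

```python
# Illustrative sketch: dominance (Eq. 1) and monotone-pair check (Eqs. 2-3)
# on the student example, assuming direct constraints on all 22 evaluations.

def dominates(x, x_prime):
    """x >= x' in every coordinate with a (direct) monotonicity constraint."""
    return all(a >= b for a, b in zip(x, x_prime))

def is_monotone_pair(x, y, x_prime, y_prime):
    """A comparable pair is monotone if the dominating example never has a lower label."""
    if x == x_prime:
        return y == y_prime
    if dominates(x, x_prime):
        return y >= y_prime
    if dominates(x_prime, x):
        return y_prime >= y
    return True  # incomparable pairs impose no constraint

student_a = ([5,5,5,5,7,6,5,5,5,5,5,5,6,5,5,6,6,6,5,5,5,5], 4)
student_b = ([3,5,3,4,7,3,3,5,3,3,3,3,6,3,3,4,3,6,4,3,5,3], 5)
student_c = ([2,2,1,2,1,2,2,3,2,2,1,2,3,2,2,3,3,2,2,1,2,3], 2)

print(is_monotone_pair(*student_a, *student_b))  # False: A dominates B but has a lower mark
print(is_monotone_pair(*student_a, *student_c))  # True
print(is_monotone_pair(*student_b, *student_c))  # True
```

Running it flags only the (A, B) pair as a violation, matching the discussion above.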

Fig. 1. Number of monotonic classification proposals over time.

A data set D with n examples is monotone if all possible pairs of examples are either monotone or incomparable. It is worth mentioning that the previous notation was expressed for direct monotonicity constraints, but it could be changed to consider inverse ones. This definition considers all f characteristics to be monotonic, forming a fully monotone data set.

However, in real life there may be data sets with monotonic (m) and non-monotonic (p) characteristics, forming a partially monotone data set, whose definition would be as follows. For $D = \{(x^m_1, x^p_1, y_1), \ldots, (x^m_n, x^p_n, y_n)\}$, the patterns are described using f input variables ($f = m + p$): $x^m_i \subseteq R^{f_m}$ with ordered domains, $x^p_i \subseteq R^{f_p}$ with unordered domains, and a class label, $y_i$, from a finite set of C ordered labels, $y_i \in Y = \{1, \ldots, C\}$. The monotone partial order $\succeq_\eta$ is defined in expression (4) and the partially monotonic data set in expression (5), for $((x^m, x^p), y)$ and $((x'^m, x'^p), y')$:

$(x^m, x^p) \succeq_\eta (x'^m, x'^p) \Leftrightarrow x^m \geq x'^m \; \forall m, \;\; x^p = x'^p \; \forall p$ (4)

$(x^m, x^p) \succeq_\eta (x'^m, x'^p) \wedge y \geq y', \;\; \forall (x^m, x^p), (x'^m, x'^p) \in D$ (5)

3. A taxonomy for monotonic classification algorithms

This section presents and describes the proposals in the specialized literature for monotonic classification, deriving a taxonomy from them.

The Knowledge Data Discovery process is composed of several stages. Two of them are usually known as data preprocessing and data mining [15]. In monotonic classification, the algorithms present in the literature belong to one of these two stages: data preprocessing [16] for monotonic classification problems (here, we denote it as Monotonic Data Preprocessing) or knowledge extraction through monotonic classification [11], respectively. The remaining categorizations are based on the goal of the different methods, the heuristics followed and the models generated by each algorithm. In this sense, the algorithms proposed can be divided into:

1. Monotonic Classifiers, aiming at the generation of predictive models satisfying the monotonicity constraints either partially or totally. There are several families of classifiers depending on the type of model they build:
   • Instance based classifiers. These algorithms do not build a model but directly use the instances of the data set to make classification decisions.
   • Decision trees or classification rules. In this case, the models built involve readable production rules in the form of decision trees or a set of rules.
   • Ensembles [17] or multiclassifiers. This group is composed of methods which use several classifiers to obtain different responses, which are aggregated into a global classification decision. Two classical approaches are considered:
     – Boosting: a number of weak learners are combined to create a strong classifier able to achieve accurate predictions. These algorithms use all data to train each learner, but the instances are associated with different weights representing their relevance in the learning process. If an instance is misclassified by a weak learner, its weight is increased so that subsequent learners focus on it. This process is applied iteratively.
     – Bagging: it chooses random subsets of samples with replacement from the data set, and a (potentially) weak learner is trained from each subset.
   • Neural Networks. These are biologically inspired models, where the function relating inputs and the target attribute consists of a set of building blocks (neurons), which are organized in layers and interconnected. An iterative training process is performed to obtain the values of the connection weights. They are the precursors of Deep Learning, which is currently the most promising area in Machine Learning [18].
   • Support Vector Machines. This family considers support vector machine based learning and derivatives.
   • Hybrid. This last set of algorithms considers the combination of different classification algorithms into a hybrid one (for example, rule and instance-based learning).
   • Fuzzy Integral. These algorithms are based on the use of the Choquet integral, which can be seen as a
generalization of the standard (Lebesgue) integral to the case of non-additive measures [19].

2. Monotonic Data Preprocessing refines the data sets in order to improve the performance of monotonic classification algorithms:
   • Relabeling. These methods change the label of the instances to minimize the number of monotonicity violations present in the data set.
   • Feature selection. Their objective is to obtain the most relevant features to improve monotonic classification performance.
   • Instance selection. In this case, a subset of samples is selected from the data set with the objective of deriving better monotonic classifiers.
   • Training set selection. The heuristic followed by this set of algorithms must be generic in such a way that the selected set is the one that reports the highest performance regardless of the classifier subsequently used.

Fig. 2. Monotonic algorithms taxonomy.

Fig. 2 shows the proposed taxonomy, and Tables 1 and 2 the summary of all the monotonic classifiers found in the specialized literature. The first column of the tables contains the year of the proposal, the second is the reference and the third is the proposal name. We also show, in the fourth and fifth columns, whether or not the algorithm requires a totally monotonic input data set and whether or not it produces completely monotonic output models, respectively. The sixth column indicates whether the algorithm accepts partially monotonic data sets [20]. The seventh and eighth columns present the non-monotonic classification algorithms used as a baseline to compare the method and the monotonic classifiers used for comparison in the experimental analysis conducted in each paper. The last column shows whether or not the algorithm's source code is publicly available and, if it is, the name of the framework in which we can find it. All algorithms are capable of dealing with multiclass problems, except for one of them, which will be indicated in its description.

Next, we provide a description of the methods in each family.

3.1. Monotonic classifiers

3.1.1. Instance based classifiers

• Ordered Learning Model (OLM [10,69]). New objects are classified by the following function:

$f_{OLM}(x) = \max\{y_i : (x_i, y_i) \in D, \; x_i \preceq x\}.$ (6)

If there is no object from D which is dominated by x, then a class label is assigned by a nearest neighbor rule. D is chosen to be consistent and not to contain redundant examples. An object $(x_i, y_i)$ is redundant in D if there is another object $(x_j, y_j)$ such that $x_i \succeq x_j$ and $y_i = y_j$.

• Isotonic discrimination [26]. This method applies isotonic regression based relabeling. After that, the limiting cumulative probability distribution for a prediction is evaluated considering the changes produced in the previous stage.

• Isotonic separation [32]. As a continuation of a relabeling process based on linear programming, instances that do not belong to boundaries are eliminated. The resulting boundaries are used to make the predictions.

• Ordered Stochastic Dominance Learner (OSDL [35,70]). For each sample $x_i$, OSDL computes two mapping functions: one that is based on the examples that are stochastically dominated by $x_i$ with the maximum label (of that subset), and the second is based on the examples that cover (i.e., dominate) $x_i$, with the smallest label. Later, an interpolation between the two class values (based on their position) is returned as a class.

• Monotonic k-Nearest Neighbor (MkNN [36]). This classifier is an adaptation of the well-known nearest neighbor classifier, considering a full monotone data set. Starting from the original nearest neighbor rule, the class label assigned to a new data point $x_0$ must lie in the interval $[y_{min}, y_{max}]$, where:

$y_{min} = \max\{y \mid (x, y) \in D \wedge x \preceq x_0\},$ (7)

and:

$y_{max} = \min\{y \mid (x, y) \in D \wedge x_0 \preceq x\}.$ (8)

Table 1
Monotonic classification methods reviewed. Part I.

Year  Reference  Abbr. name  Requires monot. input  Completely monot. output  Partial monot.  Comparison vs. classical methods  Comparison vs. monotonic methods  Code available in

1992 [10] OLM Yes No No C4, ID3 None [21] in WEKA


1995 [22] MID No No Yes ID3 OLM Not available
1995 [23] HLMS No Yes No None None Not available
1997 [24] Monotonic networks Yes Yes No None None Not available
1999 [25] P-DT, QP-DT Yes, No Yes, No No, No ID3 MID Not available
1999 [26] Isotonic discrimination No Yes No None None Not available
2000 [27] MT Yes Yes No C4.5 OLM Not available
2000 [28] VC-DRSA No No No None None Not available
2000 [29] DomLEM No No No None None Not available
2002 [30] Bioch&Popova MDT No Yes No None None Not available
2002 [9] Modified MID No No Yes None None Not available
2003 [31] MDT Yes Yes No CART None Not available
2005 [32] Isotonic Separation No No No None None Not available
2005 [33] MonMLP Yes Yes No None None In CRAN
2007 [34] VC-DRSA with Ambig. Resol. No No No None None Not available
2008 [35] OSDL No Yes No None None [21] in WEKA
2008 [36] MkNN No Yes No kNN None Not available
2008 [37] MOCA No Yes No OSDL None Not available
2008 [38] Stochastic DRSA No No No None None Not available
2009 [39] ICT No Yes Yes None None Not available
2009 [40] LPRules No Yes No J48, SVM OLM, ICT Not available
2009 [41] VP-DRSA No No No None None Not available
2009 [42] MORE No Yes No SVM, J48, kNN None Not available
2010 [20] MPNN MIN–MAX No No Yes None None Not available
2010 [43] VC-bagging No No No None OLM, OSDL Not available
2011 [44] VC-DomLEM No No No Naive Bayes, SVM, Ripper, C4.5 OLM, OSDL Not available

Table 2
Monotonic classification methods reviewed. Part II.

Year  Reference  Abbr. name  Requires monot. input  Completely monot. output  Partial monot.  Comparison vs. classical methods  Comparison vs. monotonic methods  Code available in

2012 [45] REMT No No No CART, Rank Tree OLM, OSDL Not Available
2012 [19] Choquistic Regression Yes Yes No MORE LMT, Logistic Regression Not Available
2012 [46] VC-DRSA with Non-Monot. Features No No Yes Naive Bayes, SVM, Ripper, C4.5, MODLEM None Not Available
2014 [8] MC-SVM Yes Yes No SVM None Not Available
2015 [47] MGain No No No C4.5 None Not Available
2015 [48] FREMT No No No None REMT Not Available
2015 [49] MonRF No No No None OLM, OSDL, MID Not Available
2015 [50] VC-DRSA ORF No No No None None [51] in jMAF
2015 [52] RDMT(H) No No No None MID, ICT Not Available
2015 [53] RMC-FSVM No No No FSVM, SVM None Not Available
2015 [54] VC-RF No No No None VC-DRSA with Non-Monot. Feat., VC-DomLEM Not Available
2016 [55] MoNGEL No No Yes None MkNN, OLM, OSDL [56] in Java
2016 [57] Monot. AdaBoost No No Yes None MID Not Available
2016 [58] AntMiner+, cAnt-MinerPB+MC No, No Yes, Yes Yes, Yes ZeroR OLM Not Available
2016 [59] EHSMC-CHC No No No None MkNN, OLM, OSDL, MID Not Available
2016 [60] XGBoost No Yes Yes pGBRT, Spark MLLib, H2O None [60] in GitHub
2016 [61] PM-SVM No No Yes SVM MC-SVM [61] in GitHub
2016 [62] PM-RF No No Yes Random Forest MC-SVM [62] in GitHub
2016 [63] MMT No No Yes ID3, J48, CART, RandomTree REMT, OLM, OSDL, RDMT(H) Not Available
2017 [64] FCMT No No No REMT, FREMT None Not Available
2017 [12] MCELM No Yes No CART, Rank Tree, ELM OLM, OSDL, REMT Not Available
2017 [65] RULEM No Yes Yes Ripper, C4.5 AntMiner+ Not Available
2017 [66] MFARC-HD, FSMOGFSe+TUNe No, No No, No No, No WM OSDL, MkNN, C4.5-MID, OLM, EHSMC-CHC, RF-MID
2018 [67] MonoBoost No No Yes kNN None [67] in GitHub
2018 [68] PMDT No No Yes None REMT, OLM, OSDL, RDMT(H) Not Available
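Several of the entries in Tables 1 and 2 point to publicly available implementations. As a hedged usage sketch (not taken from the paper), the gradient boosting library XGBoost listed above exposes per-feature monotonicity constraints through its monotone_constraints parameter; the synthetic data and hyperparameters below are placeholders:

```python
# Hedged usage sketch: monotonicity constraints in XGBoost (see Table 2).
# 1 = direct constraint, -1 = inverse constraint, 0 = unconstrained, per feature.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = xgb.XGBClassifier(
    n_estimators=50,
    max_depth=3,
    monotone_constraints="(1,-1,0)",  # increasing in feature 0, decreasing in feature 1
)
model.fit(X, y)
print(model.predict_proba(X[:5]))
```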

• MOCA [37]. MOCA is a nonparametric monotone classification algorithm that attempts to minimize the mean absolute prediction error for classification problems with ordered class labels. Firstly, the algorithm obtains a monotone classifier considering only training data. In the test phase, a simple interpolation scheme is applied.

3.1.2. Decision trees and classification rules

• Monotonic Induction of Decision trees (MID [22]). Ben-David introduces a measure of non-monotonicity in the classical classification decision tree ID3 algorithm [71]. This measure was denoted as total-ambiguity-score. To calculate it, a non-monotonicity b × b matrix M must be constructed, related
to a tree containing b branches. Each value $m_{ij}$ is 1 if the branches i and j are non-monotone, and 0 if they are not.

• Positive Decision Tree, Quasi-Positive Decision Tree (P-DT, QP-DT [25]). In these algorithms the splitting rule separates the points that have the right child-node larger than the left child-node (in the sense of the target variable). The algorithm adds samples to the nodes in such a way that the resulting tree is monotone. This algorithm requires as a precondition to be applied on strictly monotone binary data sets, containing only two classes. This is the only method which is not able to deal with multiclass data sets.

• Variable Consistency model of Dominance-based Rough Sets Approach (VC-DRSA [28]). The method introduces a relaxation to the DRSA model, which admits some inconsistent objects to the lower approximations; the relaxation is controlled by an index called consistency level. VC-DRSA is insensitive to marginal inconsistencies which appear in data sets.

• Monotonic Tree (MT [27]). Potharst and Bioch present a tree generation algorithm for monotonic classification problems with discrete domains for multiclass data sets. In addition, the proposal can be used to repair non-monotonic decision trees that have been generated by other methods.

• DomLEM [29]. This algorithm generates a complete and non-redundant set of decision rules, heuristically tending to minimize the number of rules generated. It is able to produce decision rules accepting a limited number of negative examples within the variable consistency model of the dominance rough sets approach.

• Modified MID [9]. In this case, an improvement of the order ambiguity in the MID algorithm is proposed by the authors. The new order ambiguity weighs nonmonotone leaf pairs by the probability of leaf appearance.

• Bioch&Popova Monotone Decision Tree (Bioch&Popova MDT [30]). This algorithm generates monotonic decision trees from noisy data modifying the update rule. It controls the size of the trees by means of pre- and post-pruning while the tree is guaranteed to remain monotone.

• Monotonic Decision Tree (MDT [31]). The authors proposed an induction approach to generate monotonic decision trees from sets of examples which may not be monotonic or consistent. The algorithm constructs the tree using a set of ordinal labels which are not the same as the original ones. A mapping process can be used to relabel them into the originals.

• VC-DRSA with Ambiguity Resolution [34]. This method induces the rules from rough approximations of preference-ordered decision classes, according to the Variable Consistency Dominance-based Rough Set Approach. When ambiguity appears in the prediction of the class of a new instance to evaluate, the method assigns a given instance to a class characterized by a maximum positive difference between the strength of rule premises suggesting assignment to this class and those discouraging such an assignment.

• Stochastic DRSA [38]. The proposal presents a new stochastic approach to dominance-based rough sets, whose application results in estimating the class interval for each instance. The class interval generated has the form of a confidence interval and follows from the empirical risk minimization of the specific loss function.

• Variable Precision Dominance-based Rough Set Approach (VP-DRSA [41]). The authors offer a proposal to treat errors in the framework of DRSA. They introduce the concept of the variable precision rough set approach.

• Isotonic Classification Tree (ICT [39]). This approach adjusts the probability estimated in the leaf nodes in case of a monotonicity violation. The idea is that, considering the monotonicity constraint, the sum of the absolute prediction errors on the training sample should be minimized. In addition, this algorithm can also handle problems where some, but not all, attributes have a monotonic relation with respect to the response.

• Variable Consistency DomLEM (VC-DomLEM [44]). An improvement of the DomLEM method to transfer the probabilistic characteristic of variable consistency approaches through to rule induction.

• VC-DRSA with Non-Monotone Features [46]. The relationships in the data are represented by monotonic decision rules. To discover the monotonic rules, the authors propose a non-invasive transformation of the input data, and a way of structuring them into consistent and inconsistent parts using VC-DRSA.

• Rank Entropy based Monotonic decision Trees (REMT [45]). This algorithm introduces a metric called rank entropy as a robust measure of feature quality. It is used to compute the uncertainty, reflecting the ordinal structures in monotonic classification. The construction of the decision tree is based on this measure.

• RDMT(H) [52]. Marsala and Petturiti presented a tree classifier parameterized by a discrimination measure H, which is considered for splitting, together with three other pre-pruning parameters. RDMT(H) guarantees a weak form of monotonicity for the resulting tree when the data set is monotone consistent and H refers to any rank discrimination measure. The authors adapted different measures to monotonic classification.

• MGain [47]. MGain introduces the index of the monotonic consistency of a cut point with respect to a data set. When non-monotonic data appear in the training set, the index of monotonic consistency selects the best cut point. If the initial data set is totally monotonic, the results obtained are similar to those using C4.5 [72].

• AntMiner+, cAnt-MinerPB+MC [58]. These algorithms are an extension of an existing ant colony optimization based classification rule learner, able to create lists of monotonic classification rules. They consider an improved sequential covering strategy to search for the best list of classification rules.

• Monotonic Multivariate Trees (MMT [63]). The proposed method discovers partitions via oblique hyperplanes in the input space. MMT generates the projections of the objects which are used to split the data by improved splitting criteria with rank mutual information or rank Gini impurity.

• Rule Learning of ordinal classification with Monotonicity constraints (RULEM [65]). The authors present a technique to induce monotonic ordinal rule based classification models, which can be applied in combination with any rule or tree induction technique in a post-processing step. They also introduce two metrics to evaluate the plausibility of the ordinal classification models obtained.

• MFARC-HD [66]. In this case, different mechanisms based on monotonicity indexes are coupled with a popular and competitive classification evolutionary fuzzy system: FARC-HD. In addition, the proposal is able to handle any kind of classification data set without a preprocessing step.

• FSMOGFSe+TUNe [66]. The proposed method consists of two separate stages for learning and subsequent tuning. The first stage is based on an improved multi-objective evolutionary algorithm designed to select the relevant features while learning the appropriate granularities of the membership functions. In the second stage, an evolutionary post-process is applied to the knowledge base obtained.

• Partially Monotonic Decision Trees (PMDT [68]). The authors propose a rank-inconsistent rate that distinguishes attributes from criteria. That rate represents the directions of the monotonic relationships between criteria and decisions. Finally, a partially monotonic decision tree algorithm is designed to extract decision rules for partially monotonic classification tasks.

3.1.3. Ensembles

1. Boosting
   • LPRules [40]. This algorithm is based on a statistical analysis of the problem, trying to relate monotonicity constraints to the constraints imposed on the probability distribution. First, LPRules decomposes the problem into a sequence of binary subproblems. Then, the data for each subproblem is monotonized using a non-parametric approach by means of the class of all monotone functions. In the last step, a rule ensemble is generated using the LPBoost method to avoid errors in the monotonized data.
   • MOnotone Rule Ensembles (MORE [42]). MORE uses a forward stage-wise additive modeling scheme for generating an ensemble of decision rules for binary problems. An advantage of this method, as the authors indicate, is its comprehensibility and consistency.
   • Monotonic Random Forest (MonRF [49]). The method is an adaptation of Random Forest [73] for classification with monotonicity constraints, including the rate of monotonicity as a parameter to be randomized during the growth of the trees. An ensemble pruning mechanism based on the monotonicity index of each tree is used to select the subset of the most monotonic decision trees which constitute the forest.
   • Variable Consistency Dominance-based Rough Set Approach Ordinal Random Forest (VC-DRSA ORF [50]). The authors propose an Ordinal Random Forest based on the variable consistency dominance rough set approach. The ordinal random forest algorithm is implemented using Hadoop [74].
   • Variable Consistency Random Forest (VC-RF [54]). Wang et al. propose the dominance and fuzzy preference inconsistency rates, which have the capacity of discovering global monotonicity relationships directly from data rather than induced rules. The method includes a refined transformation, in which an additional step is introduced to determine whether an ordinal condition attribute should be cloned or not according to its inconsistency rates.
   • Monotonic Adaboost [57]. In this case, decision trees are combined in an Adaboost scheme [75], considering a simple ensemble pruning method based on the degree of monotonicity. The objective in this algorithm is to offer a good trade-off between accurate predictive performance and the construction of monotonic models.
   • XGBoost [60]. An open source library that provides the gradient boosting framework, which supports monotonic constraints as of version 0.71.
   • Partially Monotone Random Forest (PM-RF [62]). By creating a novel re-weighting scheme, PM-RF is an effective partially monotone approach that was particularly good at retaining accuracy while correcting highly non-monotone data sets with many classes, albeit only achieving monotonicity locally.
   • MonoBoost [67]. Inspired by instance based classifiers, MonoBoost is a framework for monotone additive rule ensembles where partial monotonicity appears. The algorithm ensures perfect partial monotonicity with reasonable performance.

   There exist two publicly available and open source libraries that are absent from the literature: Arborist [76] and GBM [77]. Both are R packages that allow for monotone features by naïvely constraining each branch split (in each tree) to prohibit non-monotone splits.

2. Bagging
   • Variable Consistency Bagging (VC-bagging [43]). For this proposal, the data set is structured using the Variable Consistency Dominance-based Rough Set Approach (VC-DRSA). A variable consistency bagging scheme is used to produce bootstrap samples that promote classification examples with relatively high consistency measure values.
   • Fusing Rank Entropy based Monotonic decision Trees (FREMT [48]). This method fuses decision trees taking into account attribute reduction and a fusing principle. The authors propose an attribute reduction approach with rank-preservation for learning base classifiers, which can effectively avoid overfitting and improve classification performance. In a second step, the authors establish a fusing principle considering the maximal probability by combining the base classifiers.
   • Fusing Complete Monotonic decision Trees (FCMT [64]). Xu et al. propose an improvement of the FREMT algorithm using a discriminativeness matrix approach that guarantees finding all satisfactory subsets.

3.1.4. Neural networks

• Monotonic networks [24]. Monotonic networks implement a piecewise-linear surface by taking maximum and minimum operations on groups of hyperplanes. Monotonicity constraints are enforced by constraining the sign of the hyperplane weights.

• Monotonic Multi-Layer Perceptron (MonMLP [33]). This algorithm satisfies the requirements of monotonicity for one or more inputs by constraining the sign of the weights of the multi-layer perceptron network. The performance of MonMLP does not depend on the quality of the training data because it is imposed in its structure.

• Monotonic Partial Neural Network MIN–MAX (MPNN MIN–MAX [20]). In this paper, the authors clarify some of the theoretical results on monotone neural networks with positive weights, which sometimes cause misunderstanding in the neural network literature. In addition, in the case of partially monotone problems they generalize the so-called MIN–MAX networks.

• Monotonic Classification Extreme Learning Machine (MCELM [12]). MCELM is a generalization of the extreme learning machine for monotonic classification data sets. The proposal involves a quadratic programming problem in which the monotonicity relationships are considered to be constraints and the training errors as the objective to be minimized.

3.1.5. Support Vector Machines

• Monotonicity Constrained Support Vector Machine (MC-SVM [8,78]). MC-SVM is a rating model based on a support vector machine including monotonicity constraints in the optimization problem. The model is applied to credit rating, and the constraints are derived from the prior knowledge of financial experts.

• Regularized Monotonic Fuzzy Support Vector Machine (RMC-FSVM [53]). This method applies the Tikhonov regulariza-
tion to SVMs with monotonicity constraints in order to ensure that the solution is unique and bounded. In this way, the prior domain knowledge of monotonicity can be represented in the form of inequalities based on the partial order of the training data.

• Partially Monotone Support Vector Machine (PM-SVM [61]). PM-SVM differs from MC-SVM by proposing a new constraint generation technique designed to more efficiently achieve monotonicity.

3.1.6. Hybrid

• Monotonic Nested Generalized Exemplar Learning (MoNGEL [55]). MoNGEL combines instance-based and rule learning. The instances are converted to zero-dimensional rules, formed by a single point, obtaining an initial set of rules. As a second step, the method searches for the comparable rule of the same class with the minimum distance with respect to each rule, in order to iteratively generalize it. In the last step, the minimum number of non-monotonic rules existing between them will be removed.

• Evolutionary Hyperrectangle Selection for Monotonic Classification (EHSMC-CHC [59]). After building a set of hyperrectangles from the training data set, a selection chosen by evolutionary algorithms is applied. In a preliminary stage, an initial set of hyperrectangles is generated by using a heuristic based on the training data, and then a selection process is carried out, focused on maximizing the performance considering several objectives, such as accuracy, coverage of examples and reduction of the monotonicity violations of the model with the lowest possible number of hyperrectangles.

3.1.7. Fuzzy Integrals

• Heuristic Least Mean Square (HLMS [23,80]). HLMS aims at identifying the fuzzy measure taking advantage of the lattice structure of the coefficients. Thanks to this identification, knowledge concerning the criteria can be obtained.

• Choquistic Regression [19,81,82]. The basic idea of choquistic regression is to replace the linear function of predictor variables, which is commonly used in logistic regression to model the log odds of the positive class, by the Choquet integral [83].

3.2. Monotonic Data Preprocessing

Another group of methods in the monotonic classification area is focused on applying data preprocessing techniques to improve the performance of monotonic classification algorithms [16]. So far the literature proposals follow four paths:

1. Relabeling. These methods aim at changing the class label of the instances which produce monotonicity violations to generate fully monotone data sets, which are required for many monotonic classifiers.
   • Dykstra Relabel [26]. These authors propose a monotone relabeling based on isotonic regression, able to minimize absolute error or squared error. The algorithm is optimal, optimizing those loss functions (absolute or squared error), but it does not guarantee the minimum number of label changes, as this is not its key objective.
   • Daniels–Velikova Greedy Relabel [84,85]. This is a greedy algorithm used to relabel the non-monotone examples one at a time. At each step, it searches for the instance and the new label that maximize the increase in monotonicity of the data set. Although, at each step, it is able to maximize the jump towards complete monotonicity, the algorithm relabels more examples than is needed. This relabel method does not guarantee an optimal solution (a simplified sketch of this greedy idea is given at the end of this subsection).
   • Optimal Flow Network Relabel [36,85,86]. This method is based on finding a maximum weight independent set in the monotonicity violation graph. Relabeling the complement of the maximum weight independent set results in a monotone data set with as few label changes as possible. This method is optimal, producing the minimal number of label changes.
   • Feelders Relabel [87–89]. This algorithm faces the problem of relabeling with minimal empirical loss as a convex cost closure problem. Feelders Relabel results in an optimal solution.
   • Single-pass Optimal Ordinal Relabel [90]. In this case, the idea is to exploit the properties of a minimum flow network and identify pleasing properties of some maximum cuts. As the name suggests, this is an optimal relabeling algorithm.
   • Naive Relabel [91]. This algorithm is a building block of the two algorithms detailed next, and uses a greedy scheme. The method does not guarantee an optimal solution.
   • Border Relabel [91]. This is a fast alternative to the greedy algorithm mentioned above, and it is more specific as it minimizes the deviations between the new and the original labels. This case is similar to the previous one, and thus is not optimal.
   • Antichain Relabel [91]. Based on the previous algorithm, this algorithm minimizes the total number of relabelings and leads to optimal solutions.

2. Feature Selection [92]. The objective of these methods is to improve the predictive capacity of the monotonic classifiers by selecting the most relevant characteristics.
   • O-ReliefF, O-Simba [93]. The authors introduce margin-based feature selection algorithms for monotonic classification by incorporating the monotonicity constraints into the ordinal task. The Relief and Simba methods are extended to the context of ordinal classification.
   • min-Redundancy Max-Relevance (mRMR [94–96]). The algorithm mRMR integrates the rank mutual information metric with the search strategy of min-redundancy and max-relevance, creating an effective algorithm for monotonic feature selection.
   • Non-Monotonic feature selection via Multiple Kernel Learning (NMMKL [97]). Yang et al. propose a non-monotonic feature selection method that alleviates monotonic violations by computing the scores for individual features that depend on the number of selected features.

3. Instance Selection [98,99]. The idea behind these algorithms is to improve the performance of monotonic classifiers by selecting the most useful instances to be used as the training set, using instance-based heuristics.
   • Monotonic Iterative Prototype Selection (MONIPS [6]). MONIPS follows an iterative scheme in which it determines the most representative instances which maintain or improve the prediction capabilities of the MkNN algorithm. It follows an instance removal process based on the improvement of the MkNN performance.

4. Training Set Selection [100]. This set of algorithms has the same objective as those mentioned previously, except that the heuristic followed must be generic in such a way that the selected set is the one that reports the highest performance regardless of the classifier that is used on it later.

Table 3
Metrics considered in the reviewed monotonic classification methods.

Abbr. name  Predictive assessment metrics  Monotonicity fulfillment metrics

OLM MSE None


MID MSE, MAE NMI
HLMS Accuracy None
Monotonic networks Error rate None
P-DT, QP-DT Error rate None
Isotonic discrimination None None
MT Accuracy None
VC-DRSA None None
DomLEM None None
Bioch&Popova MDT None None
Modified MID Error rate NMI
MDT Accuracy γ 1, γ 2
Isotonic separation None None
MonMLP None None
VC-DRSA with amb. resol. None None
OSDL None None
MkNN Error rate None
MOCA MAE None
Stochastic DRSA None None
ICT MAE None
LPRules MAE None
VP-DRSA None None
MORE MAE None
MPNN MIN–MAX MSE, error rate None
VC-bagging MAE None
VC-DomLEM MAE, accuracy None
REMT MAE None
Choquistic regression Accuracy, AUC None
VC-DRSA with non-monot. features Accuracy None
MC-SVM Accuracy, recall, PPV, NPV, F-measure, κ coefficient FOM
MGain Accuracy None
FREMT Accuracy, MAE None
MonRF Accuracy, MAE NMI
VC-DRSA ORF None None
RDMT(H) Accuracy, κ coefficient, MAE NMI
RMC-FSVM Accuracy, recall, PPV, F-measure None
VC-RF Accuracy, MAE None
MoNGEL Accuracy, MAE NMI
Monot. AdaBoost Accuracy, MAE NMI
AntMiner+, cAnt-MinerPB+MC Accuracy None
EHSMC-CHC Accuracy, MAE, MAcc, MMAE NMI
XGBoost AUC None
PM-SVM Accuracy, κ coefficient MCC
PM-RF Accuracy MCC
MMT Accuracy, MAE None
FCMT Accuracy, MAE None
MCELM MAE None
RULEM Accuracy, MAE, MSE None
MFARC-HD, FSMOGF Se +TUNe MAE, MMAE NMI
MonoBoost F-measure, κ coefficient, recall, accuracy None
PMDT Accuracy, MAE None

• Monotonic Training Set Selection (MonTSS [101]). MonTSS incorporates proper measurements to identify and select the most suitable instances in the training set to enhance both the accuracy and the monotonic nature of the models produced by different classifiers.

4. Quality metrics used in monotonic classification

This section analyzes and summarizes the evaluation measures used in all the experimental studies present in the specialized literature. They evaluate two different aspects: precision and monotonicity. In Table 3, we present, for each monotonic classification method, the measures used both for predictive assessment and for monotonicity fulfillment. The description of each metric is included below.

4.1. Predictive assessment metrics

In order to define the metrics considered to evaluate the predictive performance of a classifier, we introduce the following notation:

• True Positives (TP): number of instances with positive outcomes that are correctly classified.
• False Positives (FP): number of instances with positive outcomes that are incorrectly classified.
• True Negatives (TN): number of instances with negative outcomes that are correctly classified.
• False Negatives (FN): number of instances with negative outcomes that are incorrectly classified.
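As an illustrative aid (not part of the original paper), the binary metrics defined next in Eqs. (9)-(17) can all be derived from these four counts; the example values passed below are placeholders:

```python
# Illustrative sketch: binary predictive metrics of Section 4.1 (Eqs. (9)-(17)).
def binary_metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total              # Eq. (9)
    error_rate = (fp + fn) / total            # Eq. (10)
    recall = tp / (tp + fn)                   # Eq. (11)
    ppv = tp / (tp + fp)                      # Eq. (12), precision
    npv = tn / (tn + fn)                      # Eq. (13)
    f_measure = 2 * ppv * recall / (ppv + recall)  # Eq. (14)
    pe = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / total ** 2  # Eq. (16)
    kappa = (accuracy - pe) / (1 - pe)        # Eq. (15); Pa is the accuracy (Eq. (17))
    return dict(accuracy=accuracy, error_rate=error_rate, recall=recall,
                ppv=ppv, npv=npv, f_measure=f_measure, kappa=kappa)

print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```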

The first set of predictive measures included are applied in binary classification, and they are listed below:

• Accuracy [8]:

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN},$ (9)

representing the predictive ability according to the proportion of the tested data correctly classified.

• Error rate [8]:

$Error\ rate = \frac{FP + FN}{TP + FP + TN + FN}.$ (10)

This is the opposite case to the previous one, evaluating the proportion of the tested data incorrectly classified.

• Recall [8]:

$Recall = \frac{TP}{TP + FN}.$ (11)

Recall (also called sensitivity) is a measure of the proportion of actual positives that are correctly classified.

• Positive predictive value (PPV [8]):

$PPV = \frac{TP}{TP + FP},$ (12)

which is the proportion of test instances with positive predictive outcomes that are correctly predicted. PPV (also known as precision) represents the probability that a positive test reflects the underlying condition being tested for.

• Negative predictive value (NPV [8]):

$NPV = \frac{TN}{TN + FN},$ (13)

which is the proportion of test instances with negative predictive outcomes that are correctly predicted.

• F-measure [8]:

$F\text{-}measure = \frac{2 \cdot PPV \cdot Recall}{PPV + Recall}.$ (14)

This metric is the harmonic mean of precision and recall.

• The κ coefficient [8] represents the agreement between the classifier and the data labels, and it is computed as follows:

$\kappa = \frac{P_a - P_e}{1 - P_e},$ (15)

where $P_e$ is the hypothetical probability of chance agreement and $P_a$ is the relative observed agreement between the classifier and the data. They are computed as follows:

$P_e = \frac{(TP + FP) \cdot (TP + FN) + (TN + FP) \cdot (TN + FN)}{(TP + TN + FP + FN)^2},$ (16)

$P_a = \frac{TP + TN}{TP + TN + FP + FN}.$ (17)

• Area Under Curve (AUC): To combine the Recall and the false positive rate ($\frac{FP}{FP + TN}$) into one single metric, we first compute the two former metrics with many different thresholds (for example 0.00, 0.01, 0.02, ..., 1.00) for the logistic regression, then plot them on a single graph, with the false positive rate values on the abscissa and the Recall values on the ordinate. The resulting curve is called the ROC curve, and the metric we consider is the AUC of this curve.

The second set of predictive measures have been applied to multiclass classification problems, and are listed below:

• Mean Squared Error (MSE [65]) is calculated as:

$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$ (18)

where n is the number of observations in the evaluated data set, $\hat{y}_i$ the estimated class label for observation i and $y_i$ the true class label (both represented as integer values based on their position in the ordinal scale). It measures the average of the squares of the errors.

• Mean Absolute Error (MAE [65]) is defined as:

$MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|.$ (19)

MAE is a measure of how close predictions are to the outcomes.

• Monotonic Accuracy (MAcc [59]), computed as standard Accuracy, but only considering those examples that completely fulfill the monotonicity constraints in the test set. In other words, non-monotonic examples do not take part in the calculation of MAcc.

• Monotonic Mean Absolute Error (MMAE [59]), calculated as standard MAE, but only considering those examples that completely fulfill the monotonicity constraints in the test set.

4.2. Monotonicity fulfillment metrics

In this case, the interest is to evaluate the rate of monotonicity provided by either the predictions obtained or the model built. Let x be an example from the data set D. NClash(x) is the number of examples from D that do not meet the monotonicity restrictions with respect to x, and n is the number of instances in D. NMonot(x) is the number of examples from D that meet the monotonicity restrictions with respect to x.

• The Non-Monotonicity Index (NMI [22,102]) is defined as the number of clash-pairs divided by the total number of pairs of examples in the data set:

$NMI = \frac{1}{n(n-1)} \sum_{x \in D} NClash(x)$ (20)

• γ1 [31], assessed as:

$\gamma_1 = \frac{S^+ - S^-}{S^+ + S^-},$ (21)

$S^- = \sum_{x \in D} NClash(x),$ (22)

$S^+ = \sum_{x \in D} NMonot(x),$ (23)

where $S^-$ is the number of discordant pairs, and $S^+$ is the number of concordant pairs. γ1 is the Goodman–Kruskal's γ statistic [103].

• γ2 [31]:

$\gamma_2 = \frac{S^+ - S^-}{\#P},$ (24)

where #P is the total number of pairs, i.e. $\#P = S^+ + S^- + \#NCP$, #NCP standing for the number of non-comparable pairs.

• Frequency of Monotonicity (FOM [8]):

$FOM = \frac{S^+}{\#P}.$ (25)

Table 4
Number of times each metric is used in monotonic classification literature.

Metric  # of times used
Accuracy  24
MAE  21
Error rate  5
κ coefficient  4
MSE  4
Recall  3
F-measure  3
PPV  2
MMAE  2
AUC  2
NPV  1
MAcc  1
NMI  8
MCC  2
γ1  1
γ2  1
FOM  1
NMI2  0

• The Non-Monotonicity Index 2 (NMI2 [104]) is defined as the number of non-monotone examples divided by the total number of examples:

$NMI2 = \frac{1}{n} \sum_{x \in D} Clash(x)$ (26)

where Clash(x) = 1 if x clashes with at least one example in D, and 0 otherwise. If Clash(x) = 1, x is called a non-monotone example. This metric was proposed in [104] but it has not been used in any study yet.

• Monotonicity Compliance (MCC [61]), defined as the proportion of the input space where the requested monotonicity constraints are not violated, weighted by the joint probability distribution of the input space. This metric has been proposed to be applied when partial monotonicity is present.
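For illustration (this sketch is not the authors' code, and the function names are ours), the pairwise monotonicity fulfillment metrics NMI, γ1, γ2 and FOM of Eqs. (20)-(25) can be computed by exhaustive comparison of all pairs, assuming direct monotonicity constraints on every attribute:

```python
# Illustrative sketch: monotonicity fulfillment metrics of Section 4.2.
# Each unordered pair is counted once here; this matches the ratios of Eqs. (20)-(25).

def dominates(x, x_prime):
    return all(a >= b for a, b in zip(x, x_prime))

def monotonicity_metrics(X, y):
    n = len(X)
    s_plus = s_minus = 0                      # concordant / discordant comparable pairs
    for i in range(n):
        for j in range(i + 1, n):
            if dominates(X[i], X[j]) and y[i] >= y[j]:
                s_plus += 1
            elif dominates(X[j], X[i]) and y[j] >= y[i]:
                s_plus += 1
            elif dominates(X[i], X[j]) or dominates(X[j], X[i]):
                s_minus += 1                  # comparable pair violating monotonicity
            # non-comparable pairs only enter the total pair count #P
    total_pairs = n * (n - 1) // 2
    nmi = s_minus / total_pairs               # Eq. (20): clash pairs over all pairs
    gamma1 = (s_plus - s_minus) / (s_plus + s_minus)   # Eq. (21)
    gamma2 = (s_plus - s_minus) / total_pairs          # Eq. (24)
    fom = s_plus / total_pairs                          # Eq. (25)
    return nmi, gamma1, gamma2, fom

X = [(5, 5), (3, 3), (2, 2)]
print(monotonicity_metrics(X, y=[4, 5, 2]))   # (0.33, 0.33, 0.33, 0.67) approximately
```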
creases), and 1 represents a completely direct relationship (if the
Table 4 includes the number of times each metric was used in feature increases, the class increases). If the relationship is direct
the different experimental studies. As can be observed, the most (for instance, a value in the range [0.1,1]), we include a ‘+’ in the
commonly used metrics for predictive purposes are Accuracy and cell. In the case of an inverse relationship (a value in the range
MAE, whereas NMI is the most popular one for estimating the [−1, −0.1]), the symbol used is ‘−’, and, when the RMI value is
monotonicity fulfillment. in the range [−0.1, 0.1], we consider that the feature and the class
are not related (represented by a ‘=’). The RMI value is given below
5. Data sets used in monotonic classification each corresponding symbol. As can be checked in Table 6, most of
the characteristics present a relationship with the corresponding
Next, we review monotonic classification papers to summarize class, so that they are good candidate data sets to be used in fu-
which are the data sets considered in their experimental analysis. ture experimental studies.
The information about the most commonly used data sets (with
at least 15 appearances in the literature) has been included in
Table 5, which summarizes their properties. For each data set, we 6. Guidelines and future work in monotonic classification
can observe the number of examples (Ex.), attributes (Atts.), nu-
merical attributes (Num.) and nominal attributes (Nom.), the num- This section offers suggestions to researchers interested in de-
ber of classes (Cl.), the source where the data set can be found, the veloping new ideas within this field. We will emphasize some rele-
NMI metric associated with it and finally, the number of times it vant algorithms proposed in the literature to be considered as con-
has been included in experimental analysis in the literature. testant methods in experimental comparisons. In this regard, our
A brief description is now given for each of these data sets: considerations on their analysis will focus on:

• AutoMPG: the data set concerns city-cycle fuel consumption given in miles per gallon (Mpg).
• BostonHousing: the data set concerns the housing values in the suburbs of Boston.
• Car: this data set (Car Evaluation Database) was derived from a simple hierarchical decision model. The model evaluates cars according to six input attributes: buying, maint, doors, persons, lug_boot, safety.
• ERA: this data set was originally gathered during an academic decision-making experiment aiming at determining which are the most important qualities of candidates for a certain type of job.
• ESL: in this case, we find profiles of applicants for certain industrial jobs. Expert psychologists from a recruiting company, based on psychometric test results and interviews with the candidates, determined the values of the input attributes. The output is an overall score corresponding to the degree to which the candidate fits this type of job.
• LEV: this data set contains examples of anonymous lecturer evaluations, taken at the end of MBA courses. Before receiving their final grades, students were asked to score their lecturers according to four attributes such as oral skills and contribution to their professional/general knowledge. The single output was a total evaluation of the lecturer's performance.
• Pima: this data set comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of samples from a larger database. In particular, all patients here are females of Pima Indian heritage and are at least 21 years old. The class label indicates whether or not the person has diabetes.
• MachineCPU: this problem focuses on relative CPU performance data. The task is to approximate the published relative performance of the CPU.
• SWD: this data set contains real-world assessments by qualified social workers regarding the risk faced by a group of children if they stay with their families at home. This risk assessment is often presented to judicial courts to help decide what is in the best interest of an allegedly abused or neglected child.

Table 5
Summary of the most used data sets in the monotonic classification literature.

Data set        Ex.    Atts.  Num.  Nom.  Cl.  Source  NMI    # of times used
AutoMPG         392    7      7     0     10   [105]   0.023  17
BostonHousing   506    12     10    2     4    [106]   0.001  15
Car             1728   6      0     6     4    [105]   0.000  22
ERA             1000   4      4     0     9    [69]    0.016  15
ESL             488    4      4     0     9    [69]    0.004  18
LEV             1000   4      4     0     5    [69]    0.006  15
MachineCPU      209    6      6     0     4    [105]   0.001  19
Pima            768    8      8     0     2    [105]   0.015  16
SWD             1000   10     10    0     4    [69]    0.009  16

Considering these data sets, Table 6 includes the estimation of the possible monotonic relationship between each input feature and the class feature, by using the RMI measure [45]. This metric takes values in the range [−1, 1], where −1 means that the relationship is totally inverse (if the feature increases, the class decreases), and 1 represents a completely direct relationship (if the feature increases, the class increases). If the relationship is direct (for instance, a value in the range [0.1, 1]), we include a '+' in the cell. In the case of an inverse relationship (a value in the range [−1, −0.1]), the symbol used is '−', and, when the RMI value is in the range [−0.1, 0.1], we consider that the feature and the class are not related (represented by a '='). The RMI value is given next to each corresponding symbol. As can be checked in Table 6, most of the characteristics present a relationship with the corresponding class, so that they are good candidate data sets to be used in future experimental studies. A small sketch reproducing this symbol assignment is given after Table 6.

Table 6
RMI measure [45] for all input features when considering the most popular monotonic classification data sets. Each cell shows the relationship symbol with the RMI value in parentheses.

Data set        A1         A2         A3         A4        A5         A6        A7         A8        A9         A10        A11        A12
AutoMPG         − (−0.5)   − (−0.8)   − (−0.8)   − (−0.7)  + (0.3)    + (0.6)   + (0.4)
BostonHousing   − (−0.5)   + (0.3)    − (−0.4)   = (0.0)   − (−0.4)   + (0.6)   − (−0.4)   + (0.2)   − (−0.2)   − (−0.4)   − (−0.5)   = (0.0)
Car             + (1.0)    + (1.0)    + (1.0)    + (1.0)   + (1.0)    + (1.0)
ERA             + (0.3)    + (0.4)    + (0.2)    + (0.2)
ESL             + (0.6)    + (0.6)    + (0.6)    + (0.6)
LEV             + (0.2)    + (0.4)    + (0.2)    + (0.2)
MachineCPU      − (−0.6)   + (0.6)    + (0.7)    + (0.7)   + (0.5)    + (0.5)
Pima            + (0.2)    + (0.3)    = (0.0)    = (0.0)   + (0.2)    + (0.2)   + (0.2)    + (0.2)
SWD             + (0.2)    + (0.2)    + (0.3)    = (0.0)   + (0.2)    = (0.0)   + (0.2)    = (0.0)   + (0.2)    + (0.2)
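The symbol assignment of Table 6 can be reproduced with a few lines of code. The sketch below is only illustrative: it uses Spearman's rank correlation as a simple stand-in for the RMI measure of [45] (which it does not implement), and the function names and thresholds shown follow the rule described above.

import numpy as np
from scipy.stats import spearmanr

def direction_symbol(score):
    # Map an association value in [-1, 1] to the symbols used in Table 6.
    if score >= 0.1:
        return '+'
    if score <= -0.1:
        return '-'
    return '='

def feature_class_directions(X, y):
    # Return one (symbol, rounded score) pair per input feature.
    out = []
    for j in range(X.shape[1]):
        score = spearmanr(X[:, j], y).correlation
        out.append((direction_symbol(score), round(float(score), 2)))
    return out

# Toy example: first feature direct, second inverse, third unrelated to the class.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)
print(feature_class_directions(X, y))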

6. Guidelines and future work in monotonic classification

This section offers suggestions to researchers interested in developing new ideas within this field. We will emphasize some relevant algorithms proposed in the literature to be considered as contestant methods in experimental comparisons. In this regard, our considerations on their analysis will focus on:

• Algorithms to consider for future study: We will choose a subset of methods depending on the specific family they belong to. We will suggest a list of algorithms motivated by their properties, reputation and performance.
  – Instance-based techniques: The OSDL method is one to keep in mind due to its interpretation of the monotonicity constraints in terms of stochastic dominance, which is very useful when trying to achieve total monotonicity in the predictive decisions. Furthermore, we should consider MkNN on the basis of its simplicity, performance and the ease of its integration when hybridized with other algorithms.
  – Statistical based methods: MonMLP should be considered as the classic multi-layer perceptron network that has been a source of inspiration for the rest of the algorithms in the same family. Another choice is choquistic regression, which replaces the linear function of predictor variables with the Choquet integral. The Choquet integral is very attractive for machine learning, as it offers measures that quantify the importance of individual predictor variables and the interaction between groups of variables. We also recommend considering the PM-SVM technique because of its capability to treat partial monotonicity using an alternative metric to NMI, called MCC, and its ability to measure the degree of monotonicity. Finally, we should take into consideration the MCELM algorithm. Its advantages are that it does not need to tune parameters iteratively, it has extremely fast training times, it does not require monotonic relationships to exist between features, its outputs are consistent and it experimentally shows good generalization capability.
  – Rules and Decision Trees family: MID was the first proposal in this family. The idea is simple and intuitive, since it consists of the inclusion of a criterion to achieve a trade-off between accuracy and the monotonicity constraints present in the data. In fact, this criterion can be easily attached to any decision tree and rule learning model. Also noteworthy is the fact that its performance in prediction is outstanding. There is another decision tree algorithm called REMT which introduces the rank mutual information (RMI) as a feature quality measure, combining the robustness of Shannon's entropy with the ability of dominance rough sets to extract ordinal structures from monotonic data sets. The models generated by REMT are monotonically consistent and have high predictive capabilities. Finally, we highlight the choice of MMT based on its ability to handle the performance limitations produced by incomparable object pairs. MMT addresses this problem by constructing multivariate decision trees with monotonicity constraints.
  – Ensemble based methods: Five techniques have been selected as the most noteworthy. LPRules and MonoBoost are representative boosting algorithms for multiclass problems, although the first is based on binary decomposition. FREMT has been chosen thanks to the interesting attribute reduction and fusing principle introduced in its definition. MonRF is considered due to its good performance and the extensive experimental analysis conducted by the authors. Lastly, the scalability of XGBoost and its capability of obtaining monotonically consistent decisions make it an obvious choice (a minimal usage sketch is given after this list).
• Quality metrics: the quality of the learned models can be evaluated based on precision or monotonicity fulfillment. If we take precision into account, MAE is the measure that should be considered, as it is widely used in the area. As far as monotonicity fulfillment is concerned, NMI is the most commonly used metric and adequately reflects compliance with model monotonicity (see Section 4.1).
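To illustrate how such constraints can be declared in practice, the following sketch uses the monotone_constraints parameter of the XGBoost library and evaluates the rounded predictions with MAE, the predictive measure recommended above. The synthetic data, parameter values and the regression-plus-rounding strategy are our own assumptions for the example; this is not the setup of [60] or of any reviewed study.

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic ordinal data: 3 features, labels in {0,...,4}, increasing in the
# first two features and unrelated to the third.
X = rng.uniform(size=(500, 3))
y = np.clip((2 * X[:, 0] + 2 * X[:, 1] + 0.2 * rng.normal(size=500)).round(), 0, 4)

# +1 = increasing constraint, -1 = decreasing, 0 = unconstrained (one entry per feature).
model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=3,
    monotone_constraints="(1,1,0)",
)
model.fit(X, y)

# Round the monotone regression output to the nearest class label and evaluate with MAE.
pred = np.clip(model.predict(X).round(), 0, 4)
print("MAE:", mean_absolute_error(y, pred))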

Directions for future research in monotonic classification are indicated as follows:

• It is necessary to propose performance measures that combine the evaluation of accurate and monotonic predictions. Currently, the MAE and NMI measures are mainly used simultaneously, but the latter requires a complete set to calculate the comparability of the examples. We have seen that some of the reviewed algorithms propose useful measures for this purpose, but they are hardly being used in subsequent work. On the other hand, we are missing the use of complex measures that have been used in ordinal regression [107], such as ROC curves or performance curves.
• The distinction between partial and total monotonic classification is crucial and this should be clearly indicated in future proposals. Depending on the application, it will make more sense to use one type of technique or another, depending on where the importance of the learned model lies; either in the interpretation of the model or in its accuracy. We recommend that this type of property be clearly highlighted in future proposals.
• It is possible to devise extensions of the classic monotonic classification problem based on different graduations of the restrictions between input and output attributes. There may be attributes that are more relevant than others in the monotonicity constraint, and their violation may result in a greater penalty. This implies a reformulation of the partial order and a generalization of the problem to introduce bias in the predictions.
• Although adaptations of all types of classifiers to this problem, including ensembles, have been proposed, other types of proposals are still lacking, such as the decomposition of One-Versus-One (OVO) and more advanced One-Versus-All (OVA) classes [5,108], and more data preprocessing techniques, such as noise filtering [109].
• We have also observed that many of the algorithms reviewed in this paper are not publicly available in software repositories. More software development is needed in this area.
• Currently, monotonic classification is understood as a natural extension of classical or ordinal regression. Other predictive learning paradigms that require some interpretation of the results may benefit from monotone models or monotone predictions in certain real-life applications. We refer to those singular or non-standard predictive problems [110], including weak supervision [111]. To date, there are proposals to deal with monotonicity constraints in imbalanced classification [112,113].

7. Conclusions

This paper is a systematic review of the monotonic classification literature that can be used as a functional guide to the field. Monotonic classification is an emerging area in the field of data mining. In recent years, the number of proposals in this area of knowledge has significantly increased, as shown in Fig. 1. This fact justifies the necessity of proposing a taxonomy that classifies and discriminates all the methods proposed so far. The taxonomy designed can be used as a guide to:

• Decide which kind of algorithm and model is best suited for a new monotonic problem.
• Compare any new proposal with the current proposals that come from the same family, so that it can be decided if the new proposal should be considered and if any improvements in performance can be observed.

Together with this taxonomy, we also analyze which methods are publicly available and whose source codes can be found online. In those cases, we also include where their implementation can be found.

Additionally, an analysis of the proposed and used quality metrics is carried out, considering predictive assessment and monotonicity fulfillment. We also highlight some measures which are more frequently considered in this field, such as Accuracy, MAE and NMI.

Finally, a summary and description of all the data sets used is provided. We emphasize nine of them, which have been used in at least 15 of the experimental evaluations reviewed in the literature. Their characteristics, availability and the monotonic relationships between input features and the class label are also detailed.

The overview is completed with a set of guidelines regarding the most representative methods found in the literature to be considered in novel ideas and proposals, and with an enumeration of possible directions for future research in this field.

Acknowledgment

This work has been supported by TIN2017-89517-P, TIN2015-70308-REDT, TIN2014-54583-C2-1-R and the Spanish "Ministerio de Economía y Competitividad" and by "Fondo Europeo de Desarrollo Regional" (FEDER) under Project TEC2015-69496-R.

References

[1] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, fourth ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2016.
[2] A.I. Saleh, F.M. Talaat, L.M. Labib, A hybrid intrusion detection system (HIDS) based on prioritized k-nearest neighbors and optimized SVM classifiers, Artif. Intell. Rev. (2017) 1–41.
[3] B.A. Tama, K.-H. Rhee, Tree-based classifier ensembles for early detection method of diabetes: an exploratory study, Artif. Intell. Rev. (2017) 1–16.
[4] P.A. Gutiérrez, M. Pérez-Ortiz, J. Sánchez-Monedero, F. Fernandez-Navarro, C. Hervás-Martínez, Ordinal regression methods: survey and experimental study, IEEE Trans. Knowl. Data Eng. 28 (1) (2016) 127–146, doi:10.1109/TKDE.2015.2457911.
[5] W. Kotłowski, R. Słowiński, On nonparametric ordinal classification with monotonicity constraints, IEEE Trans. Knowl. Data Eng. 25 (11) (2013) 2576–2589.
[6] J.-R. Cano, N.R. Aljohani, R.A. Abbasi, J.S. Alowidbi, S. García, Prototype selection to improve monotonic nearest neighbor, Eng. Appl. Artif. Intell. 60 (2017) 128–135.
[7] M.-J. Kim, I. Han, The discovery of experts' decision rules from qualitative bankruptcy data using genetic algorithms, Exp. Syst. Appl. 25 (4) (2003) 637–646.
[8] C.-C. Chen, S.-T. Li, Credit rating with a monotonicity-constrained support vector machine model, Exp. Syst. Appl. 41 (16) (2014) 7235–7247.
[9] R. Potharst, A.J. Feelders, Classification trees for problems with monotonicity constraints, SIGKDD Explor. 4 (1) (2002) 1–10.
[10] A. Ben-David, Automatic generation of symbolic multiattribute ordinal knowledge-based DSSs: methodology and applications, Decis. Sci. 23 (1992) 1357–1372.
[11] P.A. Gutiérrez, S. García, Current prospects on ordinal and monotonic classification, Progr. Artif. Intell. 5 (3) (2016) 171–179.
[12] H. Zhu, E.C. Tsang, X.-Z. Wang, R.A.R. Ashfaq, Monotonic classification extreme learning machine, Neurocomputing 225 (2017) 205–213.
[13] J.S. Cardoso, R. Sousa, Measuring the performance of ordinal classification, Int. J. Pattern Recognit. Artif. Intell. 25 (8) (2011) 1173–1195.
[14] W. Kotłowski, The Paradox of Overfitting, Master's thesis, Poznan University of Technology, 2008.
[15] C.C. Aggarwal, Data Mining: The Textbook, Springer, 2015.
[16] S. García, J. Luengo, F. Herrera, Data Preprocessing in Data Mining, Springer, 2015.
[17] L. Rokach, Ensemble-based classifiers, Artif. Intell. Rev. 33 (1) (2010) 1–39.
[18] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[19] A.F. Tehrani, W. Cheng, K. Dembczyński, E. Hüllermeier, Learning monotone nonlinear models using the Choquet integral, Mach. Learn. 89 (1–2) (2012) 183–211.
[20] H. Daniels, M. Velikova, Monotone and partially monotone neural networks, IEEE Trans. Neural Netw. 21 (6) (2010) 906–917.
[21] E. Frank, M. Hall, I. Witten, The Weka workbench, Data Min. Pract. Mach. Learn. Tools Tech. 4 (2016). https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf.
[22] A. Ben-David, Monotonicity maintenance in information-theoretic machine learning algorithms, Mach. Learn. 19 (1) (1995) 29–43.
[23] M. Grabisch, A new algorithm for identifying fuzzy measures and its application to pattern recognition, in: Proceedings of the 1995 IEEE International Joint Conference of the Fourth IEEE International Conference on Fuzzy Systems and The Second International Fuzzy Engineering Symposium, 1, IEEE, 1995, pp. 145–150.
[24] J. Sill, Monotonic networks, in: Proceedings of the 1997 Advances in Neural Information Processing Systems, 1997, pp. 661–667.

[25] M. Kazuhisa, S. Takashi, O. Hirotaka, I. Toshihide, Data analysis by positive decision trees, IEICE Trans. Inf. Syst. E82-D (1) (1999) 76–88.
[26] R. Dykstra, J. Hewett, T. Robertson, Nonparametric, isotonic discriminant procedures, Biometrika 86 (2) (1999) 429–438.
[27] R. Potharst, J.C. Bioch, Decision trees for ordinal classification, Intell. Data Anal. 4 (2) (2000) 97–111.
[28] S. Greco, B. Matarazzo, R. Slowinski, J. Stefanowski, Variable consistency model of dominance-based rough sets approach, in: Proceedings of the International Conference on Rough Sets and Current Trends in Computing, Springer, 2000a, pp. 170–181.
[29] S. Greco, B. Matarazzo, R. Slowinski, J. Stefanowski, An algorithm for induction of decision rules consistent with the dominance principle, in: Proceedings of the International Conference on Rough Sets and Current Trends in Computing, Springer, 2000b, pp. 304–313.
[30] J. Bioch, V. Popova, Monotone decision trees and noisy data, in: Proceedings of the Fourteenth Belgium–Dutch Conference on Artificial Intelligence, 2002, pp. 19–26.
[31] J.W. Lee, D.S. Yeung, X. Wang, Monotonic decision tree for ordinal classification, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 3, IEEE, 2003, pp. 2623–2628.
[32] R. Chandrasekaran, Y.U. Ryu, V.S. Jacob, S. Hong, Isotonic separation, INFORMS J. Comput. 17 (4) (2005) 462–474.
[33] B. Lang, Monotonic multi-layer perceptron networks as universal approximators, in: Proceedings of the International Conference on Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005, 2005, pp. 750–750.
[34] J. Błaszczyński, S. Greco, R. Słowiński, Multi-criteria classification – a new scheme for application of dominance-based decision rules, Eur. J. Oper. Res. 181 (3) (2007) 1030–1044.
[35] S. Lievens, B.D. Baets, K. Cao-Van, A probabilistic framework for the design of instance-based supervised ranking algorithms in an ordinal setting, Ann. Oper. Res. 163 (1) (2008) 115–142.
[36] W. Duivesteijn, A. Feelders, Nearest neighbour classification with monotonicity constraints, in: Proceedings of the 2008 ECML/PKDD, in: Lecture Notes in Computer Science, 5211, Springer, 2008, pp. 301–316.
[37] N. Barile, A. Feelders, Nonparametric monotone classification with MOCA, in: Proceedings of the Eighth IEEE International Conference on Data Mining, ICDM'08, IEEE, 2008, pp. 731–736.
[38] W. Kotłowski, K. Dembczyński, S. Greco, R. Słowiński, Stochastic dominance-based rough set model for ordinal classification, Inf. Sci. 178 (21) (2008) 4019–4037.
[39] R. Van De Kamp, A. Feelders, N. Barile, N. Adams, Isotonic classification trees, in: Proceedings of the International Symposium on Intelligent Data Analysis, IDA 2009, Springer, 2009, pp. 405–416.
[40] W. Kotłowski, R. Słowiński, Rule learning with monotonicity constraints, in: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, ACM, 2009, pp. 537–544.
[41] M. Inuiguchi, Y. Yoshioka, Y. Kusunoki, Variable-precision dominance-based rough set approach and attribute reduction, Int. J. Approx. Reason. 50 (8) (2009) 1199–1214.
[42] K. Dembczyński, W. Kotłowski, R. Słowiński, Learning rule ensembles for ordinal classification with monotonicity constraints, Fundam. Inf. 94 (2) (2009) 163–178.
[43] J. Błaszczyński, R. Słowiński, J. Stefanowski, Rough sets and current trends in computing, in: Proceedings of the Seventh International Conference on Rough Sets and Current Trends in Computing, RSCTC 2010, Warsaw, Poland, June 28–30, 2010, Springer, Berlin, Heidelberg, pp. 392–401.
[44] J. Blaszczynski, R. Slowinski, M. Szeląg, Sequential covering rule induction algorithm for variable consistency rough set approaches, Inf. Sci. 181 (5) (2011) 987–1002.
[45] Q. Hu, X. Che, L. Zhang, D. Zhang, M. Guo, D. Yu, Rank entropy-based decision trees for monotonic classification, IEEE Trans. Knowl. Data Eng. 24 (11) (2012) 2052–2064.
[46] J. Blaszczynski, S. Greco, R. Slowinski, Inductive discovery of laws using monotonic rules, Eng. Appl. Artif. Intell. 25 (2) (2012) 284–294.
[47] J. Zhang, J. Zhai, H. Zhu, X. Wang, Induction of monotonic decision trees, in: Proceedings of the 2015 International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR), IEEE, 2015, pp. 203–207.
[48] Y. Qian, H. Xu, J. Liang, B. Liu, J. Wang, Fusing monotonic decision trees, IEEE Trans. Knowl. Data Eng. 27 (10) (2015) 2717–2728.
[49] S. González, F. Herrera, S. García, Monotonic random forest with an ensemble pruning mechanism based on the degree of monotonicity, New Gen. Comput. 33 (4) (2015) 367–388.
[50] S. Wang, J. Zhai, S. Zhang, H. Zhu, An ordinal random forest and its parallel implementation with MapReduce, in: Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2015, pp. 2170–2173.
[51] J. Błaszczyński, S. Greco, B. Matarazzo, R. Słowiński, M. Szeląg, jMAF - dominance-based rough set data analysis framework, in: Rough Sets and Intelligent Systems – Professor Zdzisław Pawlak in Memoriam, Springer, 2013, pp. 185–209. http://www.cs.put.poznan.pl/jblaszczynski/Site/jRS.html.
[52] C. Marsala, D. Petturiti, Rank discrimination measures for enforcing monotonicity in decision tree induction, Inf. Sci. 291 (2015) 143–171.
[53] S.-T. Li, C.-C. Chen, A regularized monotonic fuzzy support vector machine model for data mining with prior knowledge, IEEE Trans. Fuzzy Syst. 23 (5) (2015) 1713–1727.
[54] H. Wang, M. Zhou, K. She, Induction of ordinal classification rules from decision tables with unknown monotonicity, Eur. J. Oper. Res. 242 (1) (2015) 172–181.
[55] J. García, H.M. Fardoun, D.M. Alghazzawi, J.-R. Cano, S. García, MoNGEL: monotonic nested generalized exemplar learning, Pattern Anal. Appl. 20 (2) (2017) 441–452.
[56] J. García, H.M. Fardoun, D.M. Alghazzawi, J.-R. Cano, S. García, MoNGEL Java Code, 2015. http://www4.ujaen.es/~jrcano/Research/MoNGEL/sourcecode.html.
[57] S. González, F. Herrera, S. García, Managing monotonicity in classification by a pruned AdaBoost, in: Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Springer, 2016, pp. 512–523.
[58] J. Brookhouse, F.E.B. Otero, Monotonicity in ant colony classification algorithms, in: Proceedings of the Tenth International Conference on Swarm Intelligence, ANTS 2016, Brussels, Belgium, September 7–9, 2016, Springer International Publishing, 2016, pp. 137–148.
[59] J. García, A.M. AlBar, N.R. Aljohani, J.-R. Cano, S. García, Hyperrectangles selection for monotonic classification by using evolutionary algorithms, Int. J. Comput. Intell. Syst. 9 (1) (2016) 184–201.
[60] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in: Proceedings of the Twenty-Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794. https://github.com/dmlc/xgboost/.
[61] C. Bartley, W. Liu, M. Reynolds, Effective monotone knowledge integration in kernel support vector machines, in: Proceedings of the Twelfth International Conference on Advanced Data Mining and Applications, Springer, 2016a, pp. 3–18. https://github.com/chriswbartley/PMSVM.
[62] C. Bartley, W. Liu, M. Reynolds, A novel technique for integrating monotone domain knowledge into the random forest classifier, in: Proceedings of the Fourteenth Australasian Data Mining Conference, 170, 2016b, pp. 3–18. https://github.com/chriswbartley/PMRF.
[63] S. Pei, Q. Hu, C. Chen, Multivariate decision trees with monotonicity constraints, Knowl.-Based Syst. 112 (2016) 14–25.
[64] H. Xu, W. Wang, Y. Qian, Fusing complete monotonic decision trees, IEEE Trans. Knowl. Data Eng. 29 (10) (2017) 2223–2235.
[65] W. Verbeke, D. Martens, B. Baesens, RULEM: a novel heuristic rule learning approach for ordinal classification with monotonicity constraints, Appl. Soft Comput. (2017), doi:10.1016/j.asoc.2017.01.042.
[66] J. Alcalá-Fdez, R. Alcala, S. González, Y. Nojima, S. Garcia, Evolutionary fuzzy rule-based methods for monotonic classification, IEEE Trans. Fuzzy Syst. 25 (6) (2017) 1376–1390.
[67] C. Bartley, W. Liu, M. Reynolds, A novel framework for constructing partially monotone rule ensembles, in: Proceedings of the Thirty-Fourth International Conference on Data Engineering, 2018, pp. 1320–1323. https://github.com/chriswbartley/monoboost.
[68] S. Pei, Q. Hu, Partially monotonic decision trees, Inf. Sci. 424 (2018) 104–117.
[69] A. Ben-David, L. Sterling, Y.H. Pao, Learning and classification of monotonic ordinal concepts, Comput. Intell. 5 (1989) 45–49.
[70] S. Lievens, B. De Baets, Supervised ranking in the Weka environment, Inf. Sci. 180 (24) (2010) 4763–4771.
[71] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[72] J.R. Quinlan, C4.5: Programs for Machine Learning, Elsevier, 2014.
[73] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[74] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: a MapReduce solution for prototype reduction in big data classification, Neurocomputing 150 (2015) 331–345.
[75] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: Proceedings of the European Conference on Computational Learning Theory, Springer, 1995, pp. 23–37.
[76] M. Seligman, Rborist: Extensible, Parallelizable Implementation of the Random Forest Algorithm, 2017. https://cran.r-project.org/web/packages/Rborist/index.html.
[77] B. Greenwell, B. Boehmke, J. Cunningham, GBM Developers, GBM: Generalized Boosted Regression Models, 2018. https://cran.r-project.org/web/packages/gbm/index.html.
[78] K. Pelckmans, M. Espinoza, J. De Brabanter, J.A. Suykens, B. De Moor, Primal–dual monotone kernel regression, Neural Process. Lett. 22 (2) (2005) 171–182.
[79] A.N. Tikhonov, V. Arsenin, Solutions of Ill-Posed Problems, 14, Winston, Washington, DC, 1977.
[80] M. Grabisch, J.-M. Nicolas, Classification by fuzzy integral: performance and tests, Fuzzy Sets Syst. 65 (2–3) (1994) 255–271.
[81] M. Grabisch, Modelling data by the Choquet integral, in: Information Fusion in Data Mining, Springer, 2003, pp. 135–148.
[82] A.F. Tehrani, E. Hüllermeier, Ordinal choquistic regression, in: Proceedings of the 2013 Conference on EUSFLAT, 2013, pp. 1–8.
[83] M. Grabisch, Fuzzy integral in multicriteria decision making, Fuzzy Sets Syst. 69 (3) (1995) 279–298.
[84] H. Daniels, M. Velikova, et al., Derivation of Monotone Decision Models from Non-Monotone Data, Tilburg University, 2003.
[85] A. Feelders, M. Velikova, H. Daniels, Two polynomial algorithms for relabeling non-monotone data, Technical Report, 2006.
[86] M. Rademaker, B. De Baets, H. De Meyer, Loss optimal monotone relabeling of noisy multi-criteria data sets, Inf. Sci. 179 (24) (2009) 4089–4096.
[87] A. Feelders, Monotone relabeling in ordinal classification, in: Proceedings of the Tenth IEEE International Conference on Data Mining (ICDM), IEEE, 2010, pp. 803–808.

[88] L. Stegeman, A. Feelders, On generating all optimal monotone classifications, in: Proceedings of the Eleventh IEEE International Conference on Data Mining (ICDM), IEEE, 2011, pp. 685–694.
[89] A. Feelders, T. Kolkman, Exploiting monotonicity constraints to reduce label noise: an experimental evaluation, in: Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp. 2148–2155.
[90] M. Rademaker, B. De Baets, H. De Meyer, Optimal monotone relabelling of partially non-monotone ordinal data, Optim. Methods Softw. 27 (1) (2012) 17–31.
[91] W. Pijls, R. Potharst, Repairing Non-Monotone Ordinal Data Sets by Changing Class Labels, Technical Report, Econometric Institute, Erasmus University Rotterdam, 2014.
[92] S. Kotsiantis, Feature selection for machine learning classification problems: a recent overview, Artif. Intell. Rev. (2011) 1–20.
[93] Q. Hu, W. Pan, Y. Song, D. Yu, Large-margin feature selection for monotonic classification, Knowl.-Based Syst. 31 (2012a) 8–18.
[94] Q. Hu, W. Pan, L. Zhang, D. Zhang, Y. Song, M. Guo, D. Yu, Feature selection for monotonic classification, IEEE Trans. Fuzzy Syst. 20 (1) (2012b) 69–81.
[95] W. Pan, Q. Hu, Y. Song, D. Yu, Feature selection for monotonic classification via maximizing monotonic dependency, Int. J. Comput. Intell. Syst. 7 (3) (2014) 543–555.
[96] W. Pan, Q. Hu, An improved feature selection algorithm for ordinal classification, IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 99 (12) (2016) 2266–2274.
[97] H. Yang, Z. Xu, M.R. Lyu, I. King, Budget constrained non-monotonic feature selection, Neural Netw. 71 (2015) 214–224.
[98] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Trans. Evol. Comput. 7 (6) (2003) 561–575.
[99] S. Garcia, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor classification: taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 417–435.
[100] J.R. Cano, F. Herrera, M. Lozano, Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability, Data Knowl. Eng. 60 (1) (2007) 90–108.
[101] J.-R. Cano, S. García, Training set selection for monotonic ordinal classification, Data Knowl. Eng. 112 (2017) 94–105.
[102] H. Daniels, M. Velikova, Derivation of monotone decision models from noisy data, IEEE Trans. Syst. Man Cybern. Part C 36 (2006) 705–710.
[103] L. Goodman, W. Kruskal, Measures of Association for Cross Classifications, Springer-Verlag, 1977.
[104] I. Milstein, A. Ben-David, R. Potharst, Generating noisy monotone ordinal datasets, Artif. Intell. Res. 3 (1) (2014) 30–37.
[105] J. Alcala-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[106] K. Bache, M. Lichman, UCI Machine Learning Repository, 2013. https://archive.ics.uci.edu/ml/index.php.
[107] W. Waegeman, B. De Baets, L. Boullart, ROC analysis in ordinal regression learning, Pattern Recognit. Lett. 29 (1) (2008) 1–9.
[108] Z. Zhang, X. Luo, S. González, S. García, F. Herrera, DRCW-ASEG: one-versus-one distance-based relative competence weighting with adaptive synthetic example generation for multi-class imbalanced datasets, Neurocomputing 285 (2018) 176–187.
[109] J.R. Cano, J. Luengo, S. García, Label noise filtering techniques to improve monotonic classification, Neurocomputing (2019), in press, doi:10.1016/j.neucom.2018.05.131.
[110] D. Charte, F. Charte, S. García, F. Herrera, A snapshot on nonstandard supervised learning problems: taxonomy, relationships, problem transformations and algorithm adaptations, Progr. Artif. Intell. (2019), in press, doi:10.1007/s13748-018-00167-7.
[111] J. Hernández-González, I. Inza, J.A. Lozano, Weak supervision and other non-standard classification problems: a taxonomy, Pattern Recognit. Lett. 69 (2016) 49–55.
[112] A. Fernández, S. García, M. Galar, R.C. Prati, B. Krawczyk, F. Herrera, Learning from Imbalanced Data Sets, Springer, 2018.
[113] S. González, S. García, S. Li, F. Herrera, Chain based sampling for monotonic imbalanced classification, Inf. Sci. 474 (2019) 187–204.

José Ramón Cano received the M.Sc. and Ph.D. degrees in computer science from the University of Granada, Granada, Spain, in 1999 and 2004, respectively. He is currently a Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, data reduction, data complexity, interpretability-accuracy trade-off, monotonic classification and evolutionary algorithms.

Pedro Antonio Gutiérrez received the B.S. degree in computer science from the University of Sevilla (Spain) in 2006, and the Ph.D. degree in computer science and artificial intelligence from the University of Granada (Spain) in 2009. He is currently an Assistant Professor with the Department of Computer Science and Numerical Analysis, University of Córdoba (Spain). His research interests are in the areas of supervised learning, evolutionary artificial neural networks, ordinal classification and the application of these techniques to different real world problems, including precision agriculture, renewable energy, climatology and biomedicine, among others.

Bartosz Krawczyk is an assistant professor in the Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA, where he heads the Machine Learning and Stream Mining Lab. He obtained his M.Sc. and Ph.D. degrees from Wroclaw University of Science and Technology, Wroclaw, Poland, in 2012 and 2015, respectively. His research is focused on machine learning, data streams, ensemble learning, class imbalance, one-class classifiers, and interdisciplinary applications of these methods. He has authored 45+ international journal papers and 100+ contributions to conferences. He has received numerous prestigious awards for his scientific achievements, such as the IEEE Richard Merwin Scholarship and the IEEE Outstanding Leadership Award, among others. He has served as a Guest Editor of four journal special issues and as a chair of ten special sessions and workshops. He is a member of the program committees of over 40 international conferences and a reviewer for 30 journals.

Michał Woźniak is a professor of computer science at the Department of Systems and Computer Networks, Wrocław University of Science and Technology, Poland. He received the M.Sc. degree in biomedical engineering from the Wrocław University of Technology in 1992, and the Ph.D. and D.Sc. (habilitation) degrees in computer science in 1996 and 2007, respectively, from the same university. In 2015 he was nominated as professor by the President of Poland. His research focuses on machine learning, compound classification methods, classifier ensembles, data stream mining, and imbalanced data processing. He has been involved in research projects related to the above-mentioned topics and has been a consultant on several commercial projects for well-known Polish companies and public administration. He has published over 260 papers and three books. His most recent one, Hybrid Classifiers: Method of Data, Knowledge, and Data Hybridization, was published by Springer in 2014. He has received numerous prestigious awards for his scientific achievements, such as the IBM Smarter Planet Faculty Innovation Award (twice) and the IEEE Outstanding Leadership Award, as well as several best paper awards at prestigious conferences. He serves as program committee chair and member for numerous scientific events and has prepared several special issues as a guest editor. He is a member of the editorial boards of high-ranked journals such as Information Fusion (Elsevier), Applied Soft Computing (Elsevier) and Engineering Applications of Artificial Intelligence (Elsevier). He is a senior member of the IEEE.

Salvador García is currently an Associate Professor in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. He has published more than 80 papers in international journals (more than 60 in Q1), with more than 6000 citations, h-index 41, and over 60 papers in international conference proceedings (data from Web of Science). He has been associated with the international program committees and organizing committees of several regular international conferences, including IEEE CEC, ICPR, ICDM, IJCAI, etc. As editorial activities, he has co-edited two special issues in international journals and he is an associate editor of the "Information Fusion" (Elsevier), "Swarm and Evolutionary Computation" (Elsevier) and "AI Communications" (IOS Press) journals, and he is co-Editor in Chief of the international journal "Progress in Artificial Intelligence" (Springer). He is a co-author of the books entitled "Data Preprocessing in Data Mining" and "Learning from Imbalanced Data Sets", published by Springer. His research interests include data science, data preprocessing, Big Data, evolutionary learning, Deep Learning, metaheuristics and biometrics. He has been given some awards and honors for his personal work and for his publications in journals and conferences, such as the IFSA-EUSFLAT 2015 Best Application Paper Award and the IDEAL 2015 Best Paper Award. He belongs to the list of Highly Cited Researchers in the area of Computer Sciences (2014–2018): http://highlycited.com/ (Clarivate Analytics). His h-index is 41 in Google Scholar, with more than 13,000 citations to date.
