A Unifying View of Class Overlap and Imbalance
MSC: 00-01; 99-00

Keywords: Class imbalance; Imbalanced data; Class overlap; Data complexity; Data intrinsic characteristics; Complexity measures

Abstract

The combination of class imbalance and overlap is currently one of the most challenging issues in machine learning. While seminal work focused on establishing class overlap as a complicating factor for classification tasks in imbalanced domains, ongoing research mostly concerns the study of their synergy over real-world applications. However, given the lack of a well-formulated definition and measurement of class overlap in real-world domains, especially in the presence of class imbalance, the research community has not yet reached a consensus on the characterisation of both problems. This naturally complicates the evaluation of existing approaches to address these issues simultaneously and prevents future research from moving towards the design of specialised solutions. In this work, we advocate for a unified view of the problem of class overlap in imbalanced domains. Acknowledging class overlap as the overarching problem – since it has proven to be more harmful for classification tasks than class imbalance – we start by discussing the key concepts associated with its definition, identification, and measurement in real-world domains, while advocating for a characterisation of the problem that attends to multiple sources of complexity. We then provide an overview of existing data complexity measures and establish the link to the specific types of class overlap problems these measures cover, proposing a novel taxonomy of class overlap complexity measures. Additionally, we characterise the relationship between measures and the insights they provide, and discuss to what extent they account for class imbalance. Finally, we systematise the current body of knowledge on the topic across several branches of Machine Learning (Data Analysis, Data Preprocessing, Algorithm Design, and Meta-learning), identifying existing limitations and discussing possible lines for future research.
1. Introduction

In Data Science classification problems, researchers often find that they compile data with uneven class representations, which generally degrades the performance of many standard machine learning models, independently of their learning paradigms [1]. However, it is currently well known that the observed class imbalance is not solely responsible for this undesired behaviour [2–5]. What truly hinders classification is its combination with other factors, defined in the literature as data intrinsic characteristics [3,5], data difficulty factors [4,6], or data irregularities [1]. These refer to several different data issues, such as class imbalance, small disjuncts, class overlap, lack of data, noisy data, dataset shift, and missing data (please refer to [5] for a comprehensive review), where class overlap has been characterised as the most harmful among them [7–9].

Note how class imbalance may not be a problem per se. It refers to the disproportion between class examples in the domain, which does not implicitly align with classification complexity [10]. As an example, consider a linearly separable problem, where a standard classifier will be able to obtain good performance, even if the domain is highly imbalanced. On the contrary, class overlap is undeniably problematic, even in balanced domains. It depicts a situation where examples from both classes (in binary-classification problems) are located in the same region of the data space, thus compromising the definition of clear decision boundaries [3,11].
In imbalanced domains, this issue is, however, exacerbated, since it may be in those overlapped regions that the few minority examples that exist are located. Hence, their recognition comprises a much more difficult scenario for classifiers [12,13].

Accordingly, our focus on both class imbalance and overlap is not a coincidence, since they do not have independent effects on classification performance. Nevertheless, class overlap stands as a more complex and overarching problem in classification tasks, and will therefore be given a deeper discussion throughout this work. In turn, class imbalance acts as an exacerbating factor, and its relationship with class overlap will be depicted throughout the definition, measurement, and characterisation of the latter, notwithstanding the analysis of the synergy between both issues across several fields of Machine Learning.

The joint effect of class imbalance and overlap has been one of the major hot topics in research for the past two decades [6,7,11,14] and is still a trending question nowadays, with applications across several fields [15–19]. Within the field of information fusion, data imperfection is one of the most challenging factors affecting fusion quality, given the complexity of application environments and the associated variety and heterogeneity of data [20]. A reasonable concern regards the possibility of inadvertently creating subgroups of overlapping instances between classes during data fusion, an issue that either needs to be avoided or dealt with a posteriori [20]. In line with the interest in explainability and transparency of algorithms, the use of a posteriori explanations, especially with respect to the generation of counterfactuals, is also deeply linked to the study of class overlap (analysing boundary or overlapping zones) [21–23]. Several applications that face these issues may be found within financial [24], medical [25–30], software [31], and network systems [32] domains, among others.

While seminal work on the topic focused on establishing class overlap as a difficulty factor for imbalanced domains, ongoing research mostly concerns the study of several forms of learning where the combination of both issues may be problematic. Accordingly, while previous work focused on artificial domains where class imbalance and overlap were synthetically generated, current research aims to characterise both problems in real-world domains.

The identification and characterisation of class overlap in imbalanced domains is, however, a subject that still troubles researchers in the field since, to this point, there is no clear, standard, well-formulated definition and measurement of class overlap for real-world domains [18]. For the most part, current research heavily relies on the data complexity measures originally proposed by Ho and Basu [33]. Despite the fact that many other measures have been proposed throughout the years [8,34–38], the original measures of Ho and Basu remain the most popular, promoted by open source libraries such as DCoL and ECoL [39,40].

Nevertheless, data complexity measures have the limitation of focusing on certain individual properties of data, although some data characteristics may simultaneously comprise several sources of complexity. More and more, researchers are gravitating around the idea that class overlap, especially in combination with class imbalance, is such a case [18,41,42]. It follows that class overlap arises as a heterogeneous concept, encompassing distinct representations of the problem. Accordingly, certain complexity measures excel at characterising some specific types of class overlap while failing to adequately capture others.

The main idea and contribution of this paper therefore consists of putting forward a unified view of the problem of class overlap in imbalanced domains, from the definition of the class overlap problem and its characterisation in all dimensions (i.e., sources of complexity), to the analysis of the most emergent topics in the field to address in the years to come. We start by introducing the idea that class overlap is currently regarded as an umbrella term that stands for a multitude of related, although distinct, problems, and discussing the key concepts associated with its definition, identification, measurement, and characterisation. Then, we map the relationship between existing data complexity measures and the specific class overlap problems they cover, proposing a new taxonomy of class overlap complexity measures. The taxonomy aggregates a comprehensive set of measures proposed over the past years, beyond the well-established data complexity measures of Ho and Basu [33]. Furthermore, this taxonomy is especially devised for the class overlap problem, while also identifying important adaptations of complexity measures that simultaneously consider the class imbalance problem. Finally, we provide a multi-view panorama on the joint problem of class imbalance and overlap, discussing the current state of knowledge and emerging challenges across four vital areas of research in the field (Data Analysis, Data Preprocessing, Algorithm Design, and Meta-learning), and present our view on promising future directions for research in each of them.

In recent years, several outstanding survey papers have been published on the topic of learning from imbalanced datasets in the presence of data difficulty factors. A book by Fernández et al. [43] provides a comprehensive summary of the established data intrinsic characteristics and their added difficulty for classification tasks. Das et al. [1] give an impressive bird's eye view on data irregularities and their interrelation. Finally, Pattaramon et al. [18] provide an in-depth review of approaches that handle simultaneously overlapped and imbalanced domains. Similarly, the field of data complexity measures has also been a focus of intense research in the last couple of years. Most recent surveys include the research of Rivolli et al. [44], discussing existing data characterisation measures for classification datasets (including data complexity measures), and Lorena et al. [40], providing a detailed overview on data complexity measures and their use in the literature. Contrary to previous works, this paper does not focus on presenting an exhaustive review of related work and existing approaches in the field, but rather on providing a global and unique view of the synergy between class imbalance and overlap. To the authors' knowledge, this is the first work to put forward such a thorough discussion of the class overlap problem and its characterisation according to distinct representations, systematising data complexity measures towards that characterisation with the development of a new taxonomy. It also provides the most recent and comprehensive evaluation of important issues raised by the combination of class overlap and imbalance in the analysis of real-world domains.

The remainder of this paper is essentially divided into two main parts and is structured as follows. Sections 2 and 3 comprise the first half of this work and consist of a conceptual discussion of the class overlap problem. Section 2 moves towards a unifying view of the problem of class overlap, establishing the key concepts for its definition and characterisation, whereas Section 3 elaborates on our novel taxonomy of complexity measures for class overlap and illustrates the distinct representations of the problem. Then, Sections 4, 5, and 6 constitute the second half of this work, focusing on the current state of knowledge about the dual problem of class imbalance and overlap. Additionally, they are structured in a rather modular format, so that the reader may navigate them easily. Section 4 provides a panorama of the main developments across important tasks in machine learning (Data Analysis, Data Preprocessing, Algorithm Design, and Meta-learning) and the limitations they currently face. Section 5 highlights the open challenges identified within each field of Section 4 and discusses promising lines for future research. In turn, Section 6 focuses particularly on data benchmarking and open source contributions. Finally, Section 7 concludes the paper, providing an overview of the ideas discussed throughout this work and summarising important directions that the research community should debate for a renewed perspective on the problem of class overlap in imbalanced domains.

2. A unifying view on class overlap

The definition and characterisation of class imbalance is well described in the literature, where the Imbalance Ratio (IR) and the percentage of minority examples (%Min) constitute the standard, formal measures established in the field [43].
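For illustration, a minimal sketch (not code from the paper; function names are illustrative) showing how both measures follow directly from the class counts:

```python
import numpy as np

def imbalance_ratio(y):
    """IR: number of majority-class examples divided by number of minority-class examples."""
    counts = np.bincount(y)          # assumes integer-encoded labels (e.g., 0/1)
    return counts.max() / counts.min()

def pct_minority(y):
    """%Min: percentage of examples belonging to the minority class."""
    counts = np.bincount(y)
    return 100.0 * counts.min() / counts.sum()

y = np.array([0] * 90 + [1] * 10)    # hypothetical domain with 90 majority and 10 minority examples
print(imbalance_ratio(y))            # 9.0
print(pct_minority(y))               # 10.0
```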
Fig. 1. An overview of the main tasks encompassed in the characterisation and analysis of the class overlap problem: (1) decomposition, (2) identification, and (3) quantification
and insight. The characterisation of class overlap first requires the decomposition of the data domain into regions of interest and the identification of the problematic (overlapped)
regions. Then, the chosen measure to quantify class overlap (and the insight that measure unveils) will ultimately define the representation of the problem in the domain, i.e., the
specific type of class overlap that is being measured and analysed.
However, whereas class imbalance corresponds to a distribution-based irregularity, class overlap may comprise multiple sources of complexity and is therefore more complicated to assess [1,41]. Herein we provide an overview of the characterisation of class overlap, elaborating on the key concepts frequently discussed in related work, which constitutes the main contribution of this section.

The characterisation of the class overlap problem can be subdivided into three main sequential tasks, as shown in Fig. 1. First, it is important to decompose the data domain into regions of interest. Then, the problematic regions (overlapped regions) need to be identified. Finally, it is possible to proceed to the quantification/measurement of class overlap, and establish its associated insight. Depending on the approaches applied to each of these tasks, class overlap may be characterised from different perspectives, leading to distinct representations of the problem (i.e., specific types of class overlap). Ultimately, each representation is associated with different measures and perceptions regarding the data domain. This measurement and characterisation of class overlap falls within the scope of data complexity measures and will be addressed in Section 3. First, let us discuss the importance of establishing the key concepts and insights regarding the problem of class overlap.

Note that once the overlapping regions are identified, it is possible to handle class overlap directly (Fig. 1). This can be performed through modifications of the data domain (e.g., cleaning approaches), algorithm modification, or by handling simple and problematic regions separately, among others, depending on the end goal. However, the difference between applying ad hoc solutions that globally ease the problem and performing informed, specialised decisions based on the characteristics of the domain relies on a thoughtful understanding and characterisation of the class overlap problem. If such meta-knowledge is available, then it is possible to guide the recommendation of suitable classifiers or preprocessing techniques, the choice of suitable hyperparameters, or the design of specialised approaches. Fundamentally, determining the specific type of class overlap present in the data domain is establishing what is truly harming the machine learning tasks and, in the end, it is that insight (meta-knowledge) that guides the choice and the development of optimal solutions.

In the remainder of this section we give an overarching view of the key concepts associated with the definition of class overlap in related work, which ultimately results in the definition of distinct representations of the problem. Fig. 2 summarises the main tasks, concepts, and insights encompassed in the characterisation of class overlap. Starting from the core of the schema, we will now move along the sequential steps required to characterise class overlap, discussing important concepts found in the literature.

Essentially, Fig. 2 corresponds to a more detailed view of the Data Complexity Measures block of Fig. 1. Accordingly, (1) the decomposition of the data domain and (2) the identification of problematic regions represent the first two tasks necessary to understand the problem of class overlap. On that note, it is important to define the concepts of Class Overlap, Overlap Regions, and Overlap Areas:

Class Overlap, Overlap Regions, and Overlap Areas: These definitions are rather intertwined, since class overlap is a phenomenon that implies the existence of ambiguous regions or areas of the data space. Class overlap is often defined as (i) regions of the data space where the representation of the classes is similar [11], (ii) regions that contain a similar number of training examples from each class [3], (iii) regions with similar class priors [9] or (iv) regions containing examples from more than one class, where class boundaries overlap [5]. These definitions seek to illustrate the same idea: that there may be regions of the data space that are shared by different classes. Intuitively, this complicates their discrimination, leading to poor classification performance. Note, however, how definitions (i) to (iii) refer to the concept of class overlap in a balanced scenario, equally populated by the existing classes. In imbalanced domains, these definitions may not hold, as the representation of the classes in overlapping regions is not necessarily similar (nor are priors established equally for each class). A global definition of class overlap is therefore based on the existence of regions populated by examples from different classes. However, this does not prevent these regions, as well as the examples that populate them, from assuming distinct properties, leading to different representations of the problem. Accordingly, the decomposition of the data space, the identification of class overlap, and its quantification can be performed in several ways, each focusing on different properties of the overlap regions and consequently producing different insights on the problem of class overlap. For the most part, the concept of "overlap region" is therefore a generic term, not subjected to a formal characterisation. Most often, this is also the case for "overlap area", taken as a synonym for "region", although in some related research the overlap area is in fact defined by computing the mathematical area of overlapped regions (2-dimensional datasets) [7,18].

Once the overlap regions are identified, it is possible to move towards (3) the quantification/measurement of the class overlap problem over the domain. In that regard, related research often refers to the concept of "Overlap Degree" or "Overlap Ratio".
Fig. 2. An overarching view of the characterisation of the class overlap problem. Moving from the core to the peripheral parts of the schema, we may follow along the sequential
steps encompassed in class overlap characterisation. First, it is necessary to (1) decompose the domains and (2) identify problematic regions (overlap regions or areas). Then it
is possible to move to (3) the quantification of the class overlap problem in the domain (overlap degree or ratio). Depending on the approaches used in the previous steps, the
obtained estimates will reflect distinct insights on the problem and be associated with different representations of class overlap. The established representations (Feature Overlap,
Instance Overlap, Structural Overlap and Multiresolution Overlap) and associated concepts shown in the peripheral parts of the schema will be further discussed in Section 3.2.
Overlap Degree or Overlap Ratio: "Overlap Degree" is perhaps the broadest term used to describe the extent to which some domains are affected by class overlap, even when the "extent" of the problem is not mathematically defined. This occurs frequently in seminal work with synthetic data, where the overlap degree has been defined as the distance between cluster centroids of different classes [14], captured by the "extent to which adjacent regions intertwine" [11], or even not characterised numerically ([7] for atypical domains). Other seminal work estimates the overlap degree as the proportion of the domain area that is overlapped [7,45–47] (2-dimensional domains), or the proportion of examples near the decision borders [2,6,48]. In real-world domains, the quantification of class overlap is more frequent (i.e., rather than a qualitative characterisation of the problem) and is intrinsically associated with the computation of data complexity measures. In that regard, the overlap degree, sometimes referred to as "Overlap Ratio" [13,49,50], reflects a quantitative estimate of the problem of class overlap in the domain.
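For 2-dimensional synthetic domains, this notion of "proportion of the domain area that is overlapped" can be made concrete with a simple grid estimate; the sketch below is illustrative only, assuming two Gaussian classes and an arbitrary density threshold:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: estimate the overlapped proportion of a 2-D domain on a regular grid,
# counting cells where both class densities are non-negligible.
p_neg = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))   # hypothetical majority class
p_pos = multivariate_normal(mean=[1.5, 0.0], cov=np.eye(2))   # hypothetical minority class

xx, yy = np.meshgrid(np.linspace(-4, 5, 300), np.linspace(-4, 4, 300))
grid = np.dstack([xx, yy])
d_neg, d_pos = p_neg.pdf(grid), p_pos.pdf(grid)

threshold = 0.01                                               # arbitrary "non-negligible density" cut-off
overlap_cells = (d_neg > threshold) & (d_pos > threshold)
domain_cells = (d_neg > threshold) | (d_pos > threshold)
overlap_degree = overlap_cells.sum() / domain_cells.sum()      # share of the occupied area that both classes cover
print(f"overlap degree = {overlap_degree:.2f}")
```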
All in all, the concepts of overlap regions/areas and associated overlap degrees/ratios are rather generic and encompass a broad spectrum of overlap representations, depending on the strategies used to tackle the decomposition, identification and quantification of the problem. This is shown in the peripheral parts of Fig. 2 and will be clearly explained throughout the following section, where we propose a new taxonomy of class overlap measures that encompasses all three components.

3. A taxonomy of complexity measures for class overlap

Current research largely resorts to data complexity measures in order to characterise certain data characteristics. These measures are frequently organised into groups or categories, depending on the common factors each author considers in the division. By far, the most well-known grouping of complexity measures is the one defined by Ho and Basu [33], which considers three main categories: (i) measures of overlap of individual feature values, (ii) measures of separability of classes, and (iii) measures of geometry, topology, and density of manifolds. Over the years, other authors sought to complement this grouping, presenting their own division or proposing additional categories in order to characterise the prevalence of a given domain characteristic. Sotoca et al. [51] also consider three main groups of complexity measures: (i) measures of overlap, (ii) measures of class separability and (iii) measures of geometry and density. Lorena et al. [40] divide complexity measures into (i) feature-based measures, (ii) linearity measures, (iii) neighbourhood measures, (iv) network measures, (v) dimensionality measures and (vi) class imbalance measures.

For the most part, the groups discussed above do not derive from a taxonomical classification, i.e., they are defined according to each author's evaluation of common characteristics or insights among measures. The principles underlying the categorisation of measures are therefore neither explicit nor characterised themselves. A natural consequence is that authors may include the same measure in different groups. A representative example is the grouping of the F1, F2, and F3 measures, identified as measures of overlap in [52], as measures of overlap of individual feature values in [33], and as feature-based measures in [40]. Another example is the categorisation of the T1 measure, encompassed in the geometry, topology and density of manifolds group in [33,53], in the geometry and density group in [52] and in the neighbourhood measures group in [40,41,54]. Throughout the years, other data complexity measures have been proposed, although they are often overlooked and included in additional categories of measures (e.g., "Other Measures" [40]).

With respect to class overlap, due to its heterogeneous nature, it is expected that several data complexity measures appear scattered across different groupings (T1 is such an example), which has several drawbacks. One is that they may not be identified as class overlap complexity measures: this is observed when measures are grouped based on the object of analysis (e.g., feature-based measures, neighbourhood measures), rather than according to the insight they provide over the domain (e.g., feature overlap, instance overlap). Another is that some recent measures that characterise class overlap are either described as general complexity measures, included in a separate category (e.g., "Other Measures"), or do not figure among well-established groupings. Finally, some of the existing groups may be misleading by defining categories of measures of overlap that comprise only measures capturing one specific type of class overlap (e.g., feature overlap).
Fig. 3. Taxonomy of class overlap complexity measures. Different groups can be established depending on the level of the analysis. In the tree structure, class overlap measures
are divided according to their approach to decomposing the data domain, identifying regions of interest, and quantifying class overlap. Measures marked with an asterisk are those
for which adaptations to imbalanced domains have been explored in the literature.
We advocate that data complexity measures should be grouped according to the insight they provide over the domain and, in the particular case of class overlap, that a taxonomy of complexity measures should attend to its heterogeneous nature. It would therefore be instrumental to define a taxonomy of class overlap measures that attends to its different representations and sources of complexity. However, no such characterisation currently exists. To put forward such a taxonomy is the main contribution of this section.

As previously discussed, the characterisation of class overlap is intrinsically tied to the definition and quantification of problematic regions in data. Accordingly, along this section, we devise a taxonomy of complexity measures for class overlap based on the strategies used to address the three main identified components of overlap characterisation: (1) decomposition of the data space, (2) identification of problematic regions, and (3) quantification and insight of the overlap problem in the domain.

The proposed taxonomy is presented as a tree structure (Fig. 3), based on the sequential tasks of Figs. 1 and 2. Class overlap measures are first divided depending on their decomposition of the data space. As we move down each path, further groups arise, depending on the identification of problematic areas and, ultimately, on the class overlap representations they are able to capture.

Rather than focusing solely on the well-known measures of Ho and Basu [33], we consider a larger set of measures proposed throughout the years. The relationship between measures is also characterised, since some measures based on different paradigms may provide similar insights, whereas others are complementary. Complexity measures that have been previously studied in imbalanced contexts are also identified. The reader may find additional information regarding the mentioned complexity measures in [19]. Additionally, to support the reading of this section, the characteristics of each complexity measure are summarised in Table 1. In the remainder of this section we will elaborate on further aspects of the proposed taxonomy. First, we start by defining and describing the essential components of class overlap characterisation (Section 3.1). We mainly focus on components (1) and (2), whereas (3), comprising the final proposed representations of class overlap and respective insights, is further discussed in Section 3.2, alongside their associated complexity measures. Finally, we end this section with an evaluation of the proposed taxonomy, as well as its implications regarding future research (Section 3.3).

3.1. Components for defining a taxonomy of class overlap measures

Essentially, all overlap measures require three components:

1. A component to decompose the data domain into regions of interest: We consider three main approaches to divide the feature space into regions of interest. Although all are distance-based, they rely on different types of distances:
   • Statistical Distance: Based on the distance between class distributions (e.g., Fisher Linear Discriminant);
   • Geometrical Distance: Based on the distance between pairs of data examples (e.g., Euclidean Distance);
   • Graph-Based Distance: Based on the geodesic distance (e.g., Minimum Spanning Trees).

2. A component to identify problematic regions of interest. We consider the following strategies for the identification of problematic regions:
   • Discriminant Analysis: The properties of class distributions are analysed in order to determine the discriminative power of features. Problematic regions are those where classes remain overlapped in the projections with maximum separability;
   • Feature Space Partitioning: The feature space is partitioned into certain ranges or into a specified number of intervals where the properties of data are then analysed. Problematic regions are delimited in specific ranges of the feature space;
   • Neighbourhood Analysis: The data domain is analysed at a local level, based on the neighbourhood characteristics of examples. Problematic regions are those associated with larger errors of the k-nearest neighbour classifier;
   • Hypersphere Coverage: The necessary number of subsets (hyperspheres) to cover the entire domain is found. Problematic regions are those encompassed in hyperspheres with smaller radii;
   • Minimum Spanning Trees: The data domain is represented by a graph (often a minimum spanning tree). Problematic regions are identified by directly connected vertices with disagreeing class memberships.

3. A component for quantifying the overlap problem in the problematic areas of interest. This component returns the final groups of the tree structure, consisting in the ultimate division between overlap measures. For that reason, we will discuss each group in detail throughout the following sections, along with the measures they include and the insights they provide.
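To make the role of the three components explicit, the skeleton below (illustrative only; it is not an interface defined in the paper or in any library) expresses a generic overlap measure as a composition of a decomposition, an identification, and a quantification step:

```python
from typing import Callable, List
import numpy as np

# Type aliases for the three components described above (names are illustrative).
Region = np.ndarray                                                            # indices of the examples forming a region of interest
Decomposer = Callable[[np.ndarray, np.ndarray], List[Region]]                  # (1) data domain -> regions of interest
Identifier = Callable[[List[Region], np.ndarray, np.ndarray], List[Region]]    # (2) regions -> problematic (overlapped) regions
Quantifier = Callable[[List[Region], np.ndarray, np.ndarray], float]           # (3) problematic regions -> overlap degree

def overlap_measure(X: np.ndarray, y: np.ndarray,
                    decompose: Decomposer,
                    identify: Identifier,
                    quantify: Quantifier) -> float:
    """Generic three-step pipeline: every measure in the taxonomy instantiates these steps differently."""
    regions = decompose(X, y)
    problematic = identify(regions, X, y)
    return quantify(problematic, X, y)
```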
Table 1
Main characteristics of class overlap complexity measures. For each complexity measure, the table identifies the class overlap representation it is able to capture ("Representation"); its abbreviation and name ("Abbr." and "Measure"); its complexity interpretation ("Complexity": "++" denotes that higher values of the measure indicate more overlapped domains, whereas "--" denotes that lower values indicate more overlapped domains, according to the formulation established in Santos et al. [19]); a summary of what it computes ("Characteristics"); its taxonomical classification ("Taxonomy"); and whether it has been previously investigated in imbalanced domains ("Imbalanced Data"; C.D.: Class Decomposition, IR: Imbalance Ratio).

| Representation | Abbr. | Measure | Complexity | Characteristics | Taxonomy | Imbalanced data |
|---|---|---|---|---|---|---|
| Feature Overlap | F1v | Directional Vector Maximum Fisher's Discriminant Ratio | ++ | Determines the data projection with maximum separability. | Statistical: Discriminant Analysis ➡ Feature Overlap | No |
| Feature Overlap | F4 | Collective Feature Efficiency | ++ | Returns the ratio of examples that could not be separated considering the efficiency of all features. | Statistical: Feature Space Partitioning ➡ Feature Overlap | Yes (C.D.) |
| Instance Overlap | R_aug | Augmented R-value | ++ | Extends R-value taking the Imbalance Ratio into account. | Geometrical: Neighbourhood ➡ Instance Overlap | Yes (IR) |
| Instance Overlap | N3 | Error Rate of the Nearest Neighbour Classifier | ++ | Measures the error rate of the Nearest Neighbour classifier (1NN), estimated using a Leave-One-Out cross-validation. | Geometrical: Neighbourhood ➡ Instance Overlap | Yes (C.D.) |
| Instance Overlap (Instance Hardness) | kDN | k-Disagreeing Neighbours | ++ | For each data example, kDN measures the percentage of its k nearest neighbours that do not share its class. | Geometrical: Neighbourhood ➡ Instance Hardness | No |
| Instance Overlap | D3 | Class Density in the Overlap Region | ++ | Determines, for each class, the number of examples that lie in regions populated by a different class. | Geometrical: Neighbourhood ➡ Instance Overlap | No |
| Structural Overlap | N1 | Fraction of Borderline Points | ++ | Measures the proportion of examples that are connected to the opposite class by an edge in a Minimum Spanning Tree. | Graph-Based: MST-based ➡ Structural Overlap; Geometrical: Neighbourhood ➡ Structural Overlap | Yes (C.D.) |
| Structural Overlap (Density of Manifolds) | LSC_Avg | Local Set Average Cardinality | -- | Determines the average local set cardinality considering all points in data. | Geometrical: Hypersphere Coverage ➡ Density of Manifolds | No |
| Structural Overlap | DBC | Decision Boundary Complexity | ++ | Determines the interleaving of hyperspheres of different classes, by determining the number of connected centres of different classes in a MST built with the final hyperspheres determined with the T1 measure. | Geometrical: Hypersphere Coverage ➡ Structural Overlap; Graph-Based: MST-based ➡ Structural Overlap; Geometrical: Neighbourhood ➡ Structural Overlap | No |
| Structural Overlap (Density of Manifolds) | ICSV | Inter-class Scale Variation | ++ | Measures the standard deviation of the densities of the hyperspheres found with the T1 measure. | Geometrical: Hypersphere Coverage ➡ Density of Manifolds | No |
| Multiresolution Overlap | Purity | Purity | -- | Estimates the purity of the domain by focusing on the class distribution of recursive partitions of the data space (cells) defined at several resolutions. | Geometrical: Feature Space Partitioning ➡ Multiresolution Overlap | No |
| Multiresolution Overlap | Neighbourhood Separability | Neighbourhood Separability | -- | Estimates the separability of the domain by focusing on the neighbourhood characteristics of each example comprised inside cells defined at several resolutions. | Geometrical: Feature Space Partitioning ➡ Multiresolution Overlap | No |
By addressing the definition and quantification of problematic regions differently, complexity measures characterise class overlap from different perspectives. Indeed, in real-world domains, problematic regions often present certain properties that have an impact on the definition and measurement of class overlap (e.g., class imbalance, local imbalance, class decomposition, non-linear boundaries, different types of examples in data) [2,6,7,15]. These characteristics of data may therefore give rise to different representations of class overlap, and certain measures may successfully characterise some, while failing to uncover others. The final groups of the proposed taxonomy associate the complexity measures with the representations of class overlap they intend to characterise, and are thoroughly described in what follows.

3.2. Representations of class overlap

Formally, we recognise four main representations (i.e., specific types) of class overlap: Feature Overlap, Instance Overlap, Structural Overlap, and Multiresolution Overlap. There are, however, some subgroups that somewhat complement the characterisation of certain representations (Instance Hardness and Density of Manifolds). They will be discussed within the respective groups (Instance Overlap and Structural Overlap, respectively).

3.2.1. Feature overlap

Class overlap is often referred to as "class separability" [5,9,55]. This term refers to the degree to which classes may be separated by discriminative rules, i.e., the degree to which good decision boundaries may be found. Hence, it provides an interpretation of class overlap via its contrary, i.e., an overlapped domain is one where the class separability is low.

Feature Overlap measures are intrinsically associated with the concept of class separability, i.e., they aim to characterise the discriminative power of features in data or, accordingly, the class overlap of individual features in data. Some measures estimate class overlap by looking for the most discriminative projections of data (F1, F1v) [33,40], whereas others resort to feature space partitioning to delimit overlap regions, based on the properties of class distributions (F2, F3, F4, IN) [33,39,56].

By focusing on the individual properties of features, these measures may fail to capture other idiosyncrasies of class overlap. Take for instance the scenario depicted in Fig. 4. F1 measures the highest discriminative power among all features in data, i.e., it returns the minimum overlap of individual features found in the domain. Accordingly, the scenarios in Fig. 4 reveal the same discriminative power: feature 𝑓1 has the same (and highest) F1 value in both cases. However, the individual overlap in feature 𝑓2 is different, which makes these scenarios different in terms of classification difficulty (as emphasised by the superimposed optimal linear discriminant). In turn, the marked points illustrate the facet of the problem measured by Instance Overlap. Rather than analysing feature separability, instance overlap – described in the following section – captures the amount of conflicting examples in data through the analysis of their neighbourhood, thus obtaining different estimates for the presented scenarios.
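Since F1 reduces feature overlap to the single most discriminative feature, the behaviour illustrated in Fig. 4 can be reproduced with a minimal sketch of the per-feature Fisher ratios behind F1 (illustrative code for binary problems, not the reference implementation from DCoL or ECoL):

```python
import numpy as np

def fisher_ratios(X, y):
    """Per-feature Fisher discriminant ratio for a binary problem:
    (mu_1 - mu_2)^2 / (var_1 + var_2). Higher values mean a more separable feature."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / den

def f1_measure(X, y):
    """Classical F1: the ratio of the most discriminative feature (i.e., the minimum feature overlap).
    Some formulations instead report 1 / (1 + max ratio), so that higher values mean more overlap."""
    return fisher_ratios(X, y).max()
```

In both domains of Fig. 4, 𝑓1 attains the same maximum ratio, so f1_measure returns the same value even though the ratios obtained for 𝑓2 differ.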
Other limitations of feature overlap measures have already been described in the literature [35,40]. First, these measures presuppose their application over continuous features. Then, with the exception of F1v, they assume that the decision boundary between classes is perpendicular to one of the feature axes. Measures based on feature space partitioning (F2, F3, F4, IN) are additionally susceptible to disjunct concepts (a situation where features present more than one valid interval) and to noisy data.
3.2.2. Instance overlap

Instance Overlap measures are deeply linked to the exploration of "local data characteristics" [57] and comprise a local, rather than a global, characterisation of domains. These characteristics are often approximated by analysing the neighbourhood of data examples and determining their complexity accordingly. This "complexity" is often associated with the error of the k-Nearest Neighbour (kNN) classifier and is used to characterise class overlap by focusing on the amount of overlapped examples in data, i.e., those that are misclassified by kNN. Instance Overlap measures include R-value [58], R_aug [38], degOver [15], N3 [33], SI [59,60], D3 [52], N4 [33], CM, wCM, and dwCM [17,34], which provide an overall insight on the amount of overlapped examples in the entire domain, and kDN [8], Borderline Examples [2], IPoints and LSC [36], which, despite providing similar insights, are more aligned with the idea of estimating the complexity of individual examples in data, associated with the concepts of "instance hardness" [8] and "data typology" [2].
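Two of these measures, kDN and N3, can be sketched with a plain nearest-neighbour search (illustrative code assuming numeric features and scikit-learn's NearestNeighbors), instantiating the geometrical-distance and neighbourhood-analysis components of the taxonomy:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kdn(X, y, k=5):
    """k-Disagreeing Neighbours: for each example, the fraction of its k nearest
    neighbours that do not share its class (a per-example instance hardness estimate)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k+1 because each point is its own nearest neighbour
    _, idx = nn.kneighbors(X)
    neigh_labels = y[idx[:, 1:]]                      # drop the point itself
    return (neigh_labels != y[:, None]).mean(axis=1)

def n3(X, y):
    """N3: leave-one-out error rate of the 1NN classifier, i.e., the fraction of
    examples whose nearest neighbour belongs to a different class."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    return (y[idx[:, 1]] != y).mean()
```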
"Instance Hardness" and "Data Typology" reflect the idea that not all examples in data are equal for classification tasks. On the contrary, depending on the local characterisation of class distributions, some examples may be harder to learn than others. "Instance Hardness" corresponds to the likelihood of an example being misclassified, for which class overlap is the principal contributor [8]. In turn, "Data Typology" comprehends the division of data examples into four types: safe, borderline, rare, and outlier examples [61]. Note that, ultimately, the typology of examples depends on the endgame and desired treatment of different types of examples, and therefore it is not uncommon to find other notions of redundant, noisy, danger, or unsafe examples [10,62,63]. Overall, since borderline examples are those located in the borderline between classes, where their discrimination becomes complicated, they are highly associated with the definition of class overlap [2,6,48,61]. Nevertheless, it may also be important to consider overlapped examples scattered across the entire domain, i.e., those that, although farther from the border, also contribute to class overlap [64]. In that sense, borderline examples are considered a subset of overlapped examples, and class overlap measures may either consider solely the borderline regions between classes or the entire domain. This ultimately relies on each measure's setting regarding the size of local neighbourhoods (𝑘 value) and/or the tolerance threshold which distinguishes an overlapped from a non-overlapped example.
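A common way to operationalise this typology is to count how many of the k = 5 nearest neighbours share the example's class; the sketch below uses the thresholds usually associated with [61] (safe: 4–5, borderline: 2–3, rare: 1, outlier: 0), which are a convention rather than a fixed rule:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def data_typology(X, y, k=5):
    """Assign each example to a type based on how many of its k nearest neighbours share its class.
    The thresholds below follow the common convention for k = 5."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    same = (y[idx[:, 1:]] == y[:, None]).sum(axis=1)   # number of same-class neighbours per example
    types = np.empty(len(y), dtype=object)
    types[same >= 4] = "safe"
    types[(same >= 2) & (same <= 3)] = "borderline"
    types[same == 1] = "rare"
    types[same == 0] = "outlier"
    return types
```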
The concept of "Class Distribution Skew" is also worthy of discussion within the problem of class overlap [1,7]. In addition to situations where classes are intertwined, class overlap may possess other structural biases, where one class is dominant in the overlap region. Such a phenomenon may arise due to the presence of local imbalance in the overlap region, or irrespective of class imbalance, e.g., due to differences in class densities (one class is sparse in the overlap region whereas the other is dense). Some authors refer to this phenomenon as "local densities" [7], while others describe it as a distribution skew or "class skew" [1]. In such scenarios, instance overlap measures, due to their flexibility (variable neighbourhood definition), may be helpful in capturing the degradation caused by class overlap.

Nevertheless, instance overlap measures, focusing on the properties of individual examples in data, disregard the characterisation of overlap regions themselves. In general, instance overlap measures are concerned with the class membership of examples within a k-neighbourhood, regardless of the actual distance between them.
Fig. 4. Example of F1 computation for two domains, where data examples are projected onto the axis. The F1 measure outputs the same value of class overlap for both domains,
despite the fact that the problem affects domains differently, as indicated by the superimposed optimal linear discriminant. Note how 𝑓1 has the same and maximum discriminative
power in both domains, whereas the individual overlap in 𝑓2 is different between domains. F1 therefore captures one facet of class overlap (feature overlap) but it may not provide
a full characterisation of the class overlap problem. As an example, marked points illustrate a representation of instance overlap, identifying data points which are misclassified
by their nearest neighbour (𝑘 = 1). Different estimates of class overlap are obtained for each domain, namely 19/35 = 54.3% and 11/35 = 31.4% for the left-side and right-side,
respectively.
It follows that, given two examples that are each other's nearest neighbours, instance overlap measures cannot distinguish a situation where they share similar values in the feature space from a situation where they have rather different feature values. Ultimately, despite being each other's closest neighbours, the examples may belong to distinct regions of the data space where there is no class overlap. Similarly, in the borderline between classes, instance overlap measures may also produce erroneous estimates of class overlap in some scenarios.

Consider Fig. 5a, where the distance between examples on class boundaries is smaller than the distance between examples of the same class. Instance overlap measures, focusing on local properties of data, will produce biased class overlap estimates even though the domain illustrates a linearly separable problem. Additionally, domains where the properties of examples are the same at a local level may be indistinguishable. Consider Figs. 5a and c, which comprise examples with similar local neighbourhoods. Oblivious to the global properties of problematic regions, instance overlap measures will output similar values of class overlap for both domains. In turn, note how analysing the global properties (e.g., structure) of problematic regions (Figs. 5b and d) provides a different insight on the characterisation of the class overlap problem of Figs. 5a and c.

Increasing the value of 𝑘 is one way to move towards a more global view of the domain [7,34]. Note how the scenario depicted in Fig. 5a would be distinguishable from (c) if instead of 𝑘 = 1, we were to consider 𝑘 = 3 or 5: in (c), we would find a larger number of examples with conflicting class neighbourhoods. However, optimal values of 𝑘 are hard to determine, especially in the presence of domain peculiarities such as class skews: 𝑘 values that correctly characterise one region may produce biased estimates in another.

Similarly, categorising examples into several types is a way of approximating the global properties of data, which provides additional insight on the domain; yet it is still based on a local analysis paradigm (dependent on the 𝑘 hyperparameter configuration). These are intrinsic limitations of instance or neighbourhood-based identification and may be attenuated by a characterisation of the problematic regions themselves, focusing on a global analysis of the domain.

As an example, consider Fig. 6, which characterises two data domains (a and d) from a local to a global perspective. Note how (a) and (d) return the same overlap value (𝑘 = 1), despite depicting different representations of class overlap. The identification of different types of examples (𝑘 = 5, b and e) reveals that the domains are indeed conceptually different: a/b observe a more classical class overlap (complicated borderline regions), whereas d/e depict a situation where complicated examples from one class (blue crosses appearing as rare and outlier examples) are scattered throughout regions of the other. The characterisation of the class overlap problem in each domain may be complemented by focusing on global, structural properties of data: (c) characterises the domain as having two well-defined concepts and a confounding boundary (balls of both classes with smaller radii, containing only one example and close to each other), whereas (f) identifies a well-defined region of one class (blue crosses comprised in a lower number of balls with large radii and local sets) and another region with higher class decomposition (red points comprised in a larger number of balls with variable local sets) contaminated with scattered examples of the opposite class (blue crosses in balls of smaller radii, containing only one example, close to larger balls of the other class, with higher local sets).

3.2.3. Structural overlap

Recognised as the most impactful issue for prediction tasks [7,9], class overlap is also often used interchangeably with the term "class complexity" [55]. We have seen this for instance overlap measures, where class overlap is associated with the complexity of individual examples in data and often evaluated on the basis of disagreeing neighbourhoods of examples (overlapped or "complex" examples) [17,34]. Beyond this, recall that class overlap aggregates a multitude of complexity sources, as we have been discussing so far. In particular, data morphology (data topology, shape or structure) may have hidden dependencies on the problem. On the one hand, the global characteristics of the domain (e.g., class decomposition, complexity of the decision boundaries, data sparsity) influence the identification of problematic regions and consequently the quantification and characterisation of class overlap. On the other hand, class overlap directly affects the shape of the decision boundaries between classes and may create additional complications such as class skews, changing the structural properties of the domains. In fact, recent research is gravitating towards the idea that complexity measures related to data morphology may prove good predictors of class overlap, especially in the context of imbalanced domains [41,42].

Structural Overlap measures are more attentive to the internal structure of classes (data morphology) when evaluating problematic regions. Some measures analyse the properties of a minimum spanning tree (MST) built over the data domain to identify complicated regions where classes intertwine (N1 [33]). Others approach the identification of class overlap using the notion of hypersphere coverage, where the domain is entirely divided into subsets comprising only examples of the same class (T1 [33], Clst [36], ONB [41]). Some consider both MST and hypersphere coverage (DBC [65]).
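As an illustration of the MST-based strategy, N1 can be sketched as follows (illustrative code using SciPy, assuming numeric features and Euclidean distance):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def n1(X, y):
    """N1 (Fraction of Borderline Points): build a Minimum Spanning Tree over the data
    and return the proportion of examples connected by an edge to the opposite class."""
    dist = squareform(pdist(X))                   # pairwise Euclidean distances
    mst = minimum_spanning_tree(dist).toarray()   # sparse MST, densified to read its edges
    borderline = set()
    for i, j in zip(*np.nonzero(mst)):
        if y[i] != y[j]:                          # edge connecting examples of different classes
            borderline.update((i, j))
    return len(borderline) / len(y)
```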
Fig. 5. Comparing local (a and c) versus global (b and d) information. Focusing on local information, instance overlap measures may not be able to capture certain properties of
the domains that affect class overlap. Note how (a) and (c) result in similar class overlap characterisations (same percentage of conflicting examples), despite the fact that (a) is
linearly separable, as indicated by the superimposed linear discriminant. Analysing the structure of problematic regions (b and d) provides different insights on the characterisation
of the class overlap problem, where (d) requires a higher number of hyperspheres to cover the entire domain, thus illustrating a more intertwined scenario.
Fig. 6. Characterisation of two domains affected by class overlap, moving from a local (a and d) to a global analysis (c and f): in scenarios (a) and (d), class overlap is estimated
through the number of conflicting examples (nearest enemies); in (b) and (e) the data typology of the domains is used to characterise class overlap via borderline or non-safe
examples; in (c) and (f) the number of hyperspheres needed to cover the domains is computed to characterise how intertwined the domains are. Instance overlap measures define
class overlap by analysing the properties of individual examples, thus neglecting certain structural characteristics of the domain: note how (a) and (d) return the same percentage
of complicated instances, despite depicting different representations of class overlap. Studying the data typology is a way of approximating the global properties of the domain,
combining both local and global information (although still dependent on 𝑘 hyperparameter configuration): the data typology reveals that a/b illustrate complicated borderline
regions, whereas d/e depict a scenario where examples of one class are scattered throughout regions of the other. The characterisation of the class overlap problem may be
complemented by structural overlap measures, focusing on global, rather than local, characteristics of the domains: note how (c) illustrates two well-defined concepts with a
complicated decision boundary, while (f) shows a well-defined region of one class with some instances contaminating the region occupied by the other class.
Additionally, we refer to a subset of structural overlap measures ("Density of Manifolds" group) that complements the characterisation of class overlap by adding local information to data morphology, i.e., focusing on data density/sparsity. These measures characterise the average number and dispersion of examples comprised within the hyperspheres that cover the domain (NSG and ICSV [56]), describe the within- and between-class spread (N2 [33]), or the average local set cardinality of examples in the domain (LSC_Avg [36]).
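Local sets, their average cardinality (LSC_Avg), and the invasive points discussed below can be sketched from the usual definition, in which the local set of an example gathers the points lying closer to it than its nearest enemy (illustrative code; the counting convention, which includes the example itself, is an assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_set_cardinality(X, y):
    """For each example x, |LS(x)|: the number of examples lying closer to x than its
    nearest enemy (the closest example of a different class). The example itself is counted,
    so an 'invasive point' (a local set containing only the core) has cardinality 1."""
    dist = squareform(pdist(X))
    lsc = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        enemy_dist = dist[i, y != y[i]].min()      # distance to the nearest enemy
        lsc[i] = int((dist[i] < enemy_dist).sum()) # includes i itself, since dist[i, i] = 0
    return lsc

def lsc_avg(X, y):
    """LSC_Avg: average local set cardinality over the domain (lower values suggest more overlap)."""
    return local_set_cardinality(X, y).mean()

def invasive_points(X, y):
    """Indices of examples whose local set contains only themselves."""
    return np.where(local_set_cardinality(X, y) == 1)[0]
```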
Recall the domains of Fig. 6, where the analysis of global, structural information (Figs. 6c and f) supports the distinction between a domain with complicated borderline regions (Fig. 6a) and a domain with a large amount of intrusive points (Fig. 6d). Figs. 6c and 6f are in fact representative of structural overlap and illustrate the computation of Clst [36], which divides the data domain into clusters of the same class.
Fig. 7. Exploring the structural properties of the domain may be fundamental to derive a more accurate characterisation of class overlap. Starting with the domains of Figs. 6c
and f, the scenarios in (a) and (d) assess the interleaving of classes along the decision boundary of each domain, by building a MST considering the hyperspheres’ centres and
determining the number of connected nodes, i.e., computing DBC. Nevertheless, note how complexity measures that focus on individual characteristics of data, such as DBC in (a)
and (d), may not extract perceptive insights. In this regard, exploring additional information on the domain, such as considering the local set of each node (i.e., each hypersphere
centre) as represented in (b) and (e), may lead to a better understanding of what is truly harming the domains, identifying invasive points. By combining information regarding
invasive points with the structure of the MST solution, it is possible to distinguish between domains comprising mostly borderline examples, such as in (c), and intrusive examples,
such as in (f), enabling the development of specialised solutions for each scenario.
However, despite the fact that the domains are easily distinguished when visualised, their Clst values are rather similar, since Clst is only concerned with the number of total clusters in data, regardless of their radius, their local sets (how many examples they cover), or the distance between them.

A way to enhance this characterisation would be to analyse additional structural information, such as assessing the interleaving of classes along the decision boundary of each domain. Accordingly, Figs. 7a and 7d illustrate a representation of DBC [65], which creates a MST using the cluster centres defined by Clst and determines the number of connected centres of different classes. As in the previous case, although the problem of class overlap is conceptually different when assessed visually, DBC also returns similar values, since the number of connected nodes of opposite classes is similar for both domains. The analysis of NSG [56], which returns the average size of clusters, would yield identical conclusions to those of the previous measures.

Note how the difficulty in distinguishing the domains via existing complexity measures is due to their focus on individual properties of data: Clst and NSG disregard the characterisation of clusters, whereas DBC neglects other properties of the MST (e.g., edge weights, local sets of connected nodes). Alternatively, Figs. 7b and e characterise the domains by combining several structural overlap measures. Accordingly, they incorporate information regarding class decomposition (starting with the solution defined by Clst), complexity of decision boundaries (considering the solution achieved by DBC), and density of manifolds (considering the local set cardinality of each node in the MST).¹ Contrarily to Figs. 7a and d, the marked points represent clusters that include only one example (the core) and whose local set contains only the core itself, defined as "invasive points" (IPoints) [36]. Now, although the number of invasive points is similar in both domains, it is possible to differentiate (i) situations where these points are "strongly connected"² to others of the same type of the opposite class, identifying examples located in overlapping regions of the data space, from (ii) situations where these points are connected to nodes of the opposite class with larger local sets, identifying examples that somewhat infiltrate the other class. Hence, Fig. 7c illustrates a domain where all of its invasive points strongly connect to others of the same type (and of the opposite class), suggesting that class overlap is the main complexity factor affecting the domain (9 out of 15 nodes represent complicated borderline regions, which amounts to a class overlap of 60%), caused by overlapping class borders. In turn, Fig. 7f reveals that only 4 out of 16 nodes (25%) are responsible for class overlap (4 invasive points strongly connected), whereas the remaining 4 identified points are intruding the opposite class, and may indicate different issues: either representing noisy data [5], or suggesting the existence of valid, though underrepresented, sub-concepts in data (a situation likely to arise in the case of imbalanced data [61]).

¹ Note how, despite the fact that LSC_Avg is comprised in the Structural Overlap group (as it estimates the density of manifolds in the domain), and that LSC and IPoints derive from structural information (i.e., hypersphere coverage), they can be used to add local information regarding the internal class structures found in data. In fact, LSC may be an indicator of instance hardness and instance overlap, identifying examples whose local set cardinality is low.

² Note that our purpose is not to derive a new complexity measure for class overlap. With this example, we explore the investigation of additional properties of the MST (namely edge weights) as well as density and local information (local set cardinality) to complement the characterisation of class overlap. Combining distinct sources of information makes it possible to distinguish shorter, stronger connections between nodes from weaker connections, where edges between nodes are longer. To determine whether an invasive example is responsible for class overlap or is infiltrating the opposite class – in the case that an invasive point is connected to both an invasive point and other nodes of higher cardinality (all of the opposite class) – it is possible to adjust the edge weights by the local set cardinality of the connected nodes (e.g., $w_i = \frac{1}{d_i} \times LSC_{node_i}$). Nevertheless, the main purpose of the example remains to highlight the advantage of considering multiple sources of complexity when characterising class overlap.
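The idea discussed in footnote 2 – combining the structural view of an MST with local set information – can be sketched in a few lines of code. The snippet below is only an illustration under our own assumptions: the helper names are ours, and the rescaling used (dividing each edge length by the local set cardinalities of its endpoints, with a +1 to avoid division by zero) is one possible variant of the adjustment suggested above, not an implementation of any measure cited in this paper.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def local_set_cardinality(X, y):
    # |LS(x)|: number of examples closer to x than x's nearest enemy
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)
    nearest_enemy = np.array([d[i, y != y[i]].min() for i in range(len(X))])
    return np.array([(d[i] < nearest_enemy[i]).sum() for i in range(len(X))])

def weighted_mst_edges(X, y):
    # MST over the examples, with each edge length rescaled by the local set
    # cardinality of its endpoints, so that short edges connecting
    # low-cardinality nodes of opposite classes (candidate overlapped regions)
    # stand out from longer, weaker connections
    lsc = local_set_cardinality(X, y)
    d = cdist(X, X)
    w = d / (1 + lsc[:, None] + lsc[None, :])
    mst = minimum_spanning_tree(w).tocoo()
    return [(int(i), int(j), float(v), bool(y[i] != y[j]))
            for i, j, v in zip(mst.row, mst.col, mst.data)]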
Fig. 8. Impact of considering structural information in the characterisation of class overlap. Scenarios (a) and (d) illustrate the solution achieved by removing all conflicting
examples according to Figs. 6a and d (examples misclassified by their nearest neighbour (𝑘 = 1) are eliminated). In (b) and (e), all examples that do not belong to the ‘‘safe’’
category are removed (i.e., all the borderline, rare, and outlier examples), following the data typology of Figs. 6b and e. Finally, (c) and (f) illustrate the removal of the invasive
points shown in Figs. 7b and e.
Let us end this discussion by analysing the impact of considering structural information in the characterisation of class overlap. Fig. 8 shows different cleaning solutions for the original domains of Figs. 6a and d (top and bottom rows of Fig. 8, respectively).

Despite the fact that all characterisations of class overlap lead to solutions with simplified, clear decision boundaries, i.e., eliminating the problem of class overlap, they differ in what concerns both the amount of cleaning performed and the ability to retain the original structure of data. Approaches relying solely on instance overlap (Figs. 8a, b, d, and e) tend to be more conservative when compared to those that incorporate structural information (Figs. 8c and f). Also, note that since Figs. 6b and e consider more global information on the data domain than 6a and d (via data typology), the former solutions are more conservative than the latter. This is due to (i) the larger neighbourhood considered: k = 5 versus k = 1, which identifies only nearest-enemies (please refer to Fig. 6b, where more examples are considered conflicting), and (ii) the borderline category often assigned to examples in the neighbourhood of rare and outlier examples, which may not represent valid class concepts, but rather intrusive/noisy points, affecting mainly domain 6e.3

In turn, solutions 8c and f are the least invasive, i.e., the class overlap problem is solved while removing a smaller amount of examples and retaining most of the original internal structure of data. Finally, note how for domains with a less complex data structure/morphology, instance overlap measures are able to accurately characterise the problem of class overlap, whereas structural information needs to be considered when dealing with domains presenting additional sources of complexity. On that note, although we may argue that structural overlap measures focus on data characteristics unrelated to class overlap, in the sense that they describe other general properties of the domains (e.g., geometry, topology, density), we advocate that class overlap cannot be fully understood irrespective of structural information, since the global properties of the domains affect its identification and characterisation.

3 Note that, in imbalanced domains, there is a difference between rare and outlier examples, and noisy data (please refer to [61]), given that distant, isolated minority examples may result from an insufficient representation of the minority class in certain regions of the data space. Accordingly, rare and outlier examples may represent valid sub-concepts rather than noise. Nevertheless, the given example (Fig. 6e) represents a balanced domain where rare and outlier points are not distant or isolated examples, but rather infiltrate the opposite class and do not constitute interesting class concepts.

3.2.4. Multiresolution overlap measures
Multiresolution Overlap measures characterise class overlap by providing a trade-off between global and local data characteristics (Fig. 9). Some are more closely related to the previous ideas of using hyperspheres (MRCA [37]) or k-neighbourhoods (C1 and C2 [35,66]) to define regions of the space where class overlap can be analysed. Others are associated with feature space partitioning, where features are divided into several intervals to assess the properties of class overlap (Purity and Neighbourhood Separability [67,68]). Nevertheless, the main idea that binds these measures together is that they operate by moving iteratively from a global to a local analysis of the domains (fine-grain search criteria). They recursively define hyperspheres, neighbourhoods, or feature partitions at different resolutions, all of which are analysed to characterise the problem of class overlap, combining both structural and local information.

3.3. Evaluation of the proposed taxonomy

Along the previous sections, we have been discussing the idea that, in real-world domains, class overlap often aggregates information on different data characteristics, and therefore it is important to establish the insight that different complexity measures provide to fully characterise the problem. To standardise existing types of class overlap, we established a novel taxonomy that defines four main groups of class overlap representations and associated complexity measures, while describing their perception of the class overlap problem as well as their intrinsic limitations. In this section, we discuss some further details of the proposed taxonomy, and elaborate on its implications for future research in the field.
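For concreteness, the instance-level information behind the cleaning solutions compared in Fig. 8 can be sketched as follows. This is a simplified, illustrative version (assuming numpy arrays and integer-encoded labels), not the exact procedures of the cited works: it computes the proportion of same-class neighbours among the k nearest neighbours, derives a simplified safe/borderline/rare/outlier typology from it, and applies the two instance-overlap-based cleaning rules of Figs. 8a/b.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_agreement(X, y, k):
    # proportion of the k nearest neighbours that share each example's class
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbours = idx[:, 1:]          # drop the example itself
    return (y[neighbours] == y[:, None]).mean(axis=1)

def data_typology(X, y, k=5):
    # simplified labelling from the k = 5 neighbourhood, in the spirit of
    # the typology used in Figs. 6b and e
    agree = knn_agreement(X, y, k)
    labels = np.full(len(y), "borderline", dtype=object)
    labels[agree >= 0.8] = "safe"                       # 4-5 same-class neighbours
    labels[(agree >= 0.2) & (agree < 0.4)] = "rare"     # 1 same-class neighbour
    labels[agree < 0.2] = "outlier"                     # 0 same-class neighbours
    return labels

def remove_conflicting(X, y):
    # Figs. 8a/d: drop examples misclassified by their nearest neighbour (k = 1)
    keep = knn_agreement(X, y, k=1) == 1.0
    return X[keep], y[keep]

def keep_only_safe(X, y):
    # Figs. 8b/e: remove every borderline, rare and outlier example
    keep = data_typology(X, y, k=5) == "safe"
    return X[keep], y[keep]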
Fig. 9. Example of multiresolution overlap measures, which aggregate global and local information on the domains. In (a) and (b), a strategy of recursive feature space partitioning
is used to analyse the domains at increasingly lower resolutions. At each resolution, problematic regions (grey cells) are individually analysed. In (c), example 𝐱 exhibits distinct
complexity values depending on the resolution of its neighbourhood (defined using hyperspheres with different radii). The final characterisation of domains consists of averaging
the individual results obtained at several resolutions.
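The recursive feature-space partitioning of Figs. 9a and b can be sketched as follows. The snippet illustrates only the general principle (cell purity averaged over increasingly fine grids, for the low-dimensional setting of the figure); it is not a faithful implementation of Purity or Neighbourhood Separability, and labels are assumed to be integer-encoded.

import numpy as np

def grid_purity(X, y, n_bins):
    # average purity of the cells obtained by splitting every feature into n_bins intervals
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cell_idx = np.clip(((X - mins) / (maxs - mins + 1e-12) * n_bins).astype(int), 0, n_bins - 1)
    cells = {}
    for cell, label in zip(map(tuple, cell_idx), y):
        cells.setdefault(cell, []).append(label)
    purities = [np.bincount(labels).max() / len(labels) for labels in cells.values()]
    return float(np.mean(purities))

def multiresolution_purity(X, y, resolutions=(2, 4, 8, 16)):
    # move from a global (coarse grid) to a local (fine grid) analysis and
    # average the purities obtained at each resolution
    return float(np.mean([grid_purity(X, y, r) for r in resolutions]))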
3.3.1. Properties of the proposed taxonomy
Beyond mapping the relationship between complexity measures and their associated class overlap representations, the proposed taxonomy evidences certain properties of the measures and illustrates other existing relationships between the categories that constitute the taxonomy. In particular, three main characteristics may be highlighted:

1. Measures belonging to different decomposition or identification categories may be associated to the same class overlap representations: As shown in Fig. 3, there are situations where measures based on distinct decomposition and/or identification strategies aim to provide similar insights. An example is the case of the Purity and Neighbourhood Separability measures, C1 and C2, and MRCA, which are comprised in the ''Multiresolution Overlap'' group (since their insights are derived from the same underlying principle), despite the fact that their identification of problematic regions is performed differently (through ''Feature Space Partitioning'', ''Neighbourhood'', and ''Hypersphere Coverage'', respectively). The same rationale applies to other examples depicted in Fig. 3.

This evidences that the strategy through which overlapped regions are decomposed and identified may not correspond directly to the knowledge the measures incorporate. In other words, although the analysis of the process of decomposition and identification of problematic regions is essential to the characterisation of class overlap, investigating its quantification and the insights provided by each complexity measure – through a careful analysis of their design and purpose – is fundamental to fully understand the problem. To some extent, existing research has often grouped complexity measures according to the process inherent to the identification of certain properties (e.g., feature-based, neighbourhood-based) [40,41], rather than the insight they produce on the data domain. In this regard, one of the advantages of the presented taxonomy is that the decomposition and identification processes of each measure can be dissociated from the perception obtained from data, i.e., measures are grouped based on the knowledge they provide on the domain, rather than on their underlying processes. Nevertheless, such information is not lost, since it remains established in the upper levels of the tree structure that composes the taxonomy.

2. Measures may incorporate two or more decomposition or identification methods: Although the established groups are subsets of complexity measures with shared similarities, their boundaries are not strictly delimited. Accordingly, some measures may comprise two or more decomposition or identification methods. To some extent, they may be considered ''hybrid'' measures, which is the case of N1 and DBC. N1 is based on graph decomposition, although it also incorporates neighbourhood information to identify connected vertices with disagreeing class memberships. In turn, DBC first divides the domain into hyperspheres, then builds an MST considering their centres and analyses the neighbourhood of the MST vertices. Both their insights are however more related to boundary complexity and the internal structure of classes (structural overlap) rather than to local data characteristics (neighbourhood analysis), and they are therefore included in the Structural Overlap group.

3. Measures that complement certain representations of class overlap: Some groups of measures are also intrinsically related to (or complemented by) others, as previously discussed. This is the case of Instance Overlap measures, which cannot be dissociated from the concept of ''Instance Hardness'', and of Structural Overlap measures, which encompass the characterisation of the ''Density of Manifolds''. We have chosen to highlight these two subgroups in the taxonomy since, notwithstanding their representations, they are often crucial to devise optimal solutions for certain domains. When analysing the current panorama on class imbalance and overlap problems (Section 4), we will see how instance hardness information is useful for preprocessing approaches, and often embedded in the internal operations of some resampling algorithms for imbalanced learning. In turn, instance overlap measures provide a better insight into the overall difficulty of the domain for classification. Similarly, some class overlap-based methods, more than analysing certain global properties of the domains (e.g., structural properties), may further incorporate density information for improved results.

3.3.2. Sensitivity of complexity measures to class imbalance
Another topic of discussion is whether the identified class overlap measures are sensitive to class imbalance. In Fig. 3, class overlap measures that have been designed or adapted to be attentive to class imbalance are marked with an asterisk.

Some measures take the problem of class imbalance into account by defining the data typology only for the minority class (Borderline Examples [61]). Others were originally proposed within the scope of imbalanced domains (R_aug [38], ONB [41], CM [34], wCM and dwCM [17]), although only R_aug incorporates the imbalance ratio in the computation of class overlap (the remaining use a strategy of class decomposition, i.e., complexity measures are computed for each individual class). The same applies to recent adaptations of well-established measures (F2, F3, F4, N1, N2, N3, N4, T1), also based on class-wise computation [42].4 The basis for the development of these adaptations is that, in imbalanced domains, the majority class tends to dominate the computation of some complexity measures, providing biased estimates of classification difficulty [34,54,69]. This is mostly observed for measures that depend on the total number of examples in data, rather than class sizes.

4 In this regard, F1 was also studied in [42,54], although, since it relates two means and variances, it was not possible to adapt it in order to obtain individual information by class. The same is expected for F1v.
Ongoing research therefore considers the decomposition of complexity measures into their minority and majority counterparts, and has shown promising results for binary-classification tasks [42,54] (this will be further discussed in Section 4).

Other than the highlighted measures, the remaining are yet to be investigated in the context of class imbalance.5 A final remark should be given to MRCA [37], which, although it was not especially devised or thoroughly investigated in imbalanced domains, considers an imbalance estimation function that attends to the distribution of examples comprised within the hyperspheres, at each step, before obtaining a complexity profile of a given example.

3.3.3. Implications for future research
Let us now delve into the implications associated with the inception of our proposed taxonomy for future research in the field.

As an alternative to discussing general measures of classification complexity, our taxonomy focuses specifically on class overlap. Among well-known data issues, this is the most harmful for imbalanced learning tasks [5,9] and the one which generates most debate regarding its definition, measurement, and understanding [18]. In this regard, the proposed taxonomy clarifies the concepts associated with the definition, identification, quantification and characterisation of class overlap, and illustrates its distinct representations, as well as the sources of complexity to which they are associated.

Additionally, rather than aggregating complexity measures solely according to the category of data descriptors (e.g., separability, topology, sparseness, decision boundary) or their object of study (e.g., feature-based, neighbourhood-based, network-based), the proposed taxonomy focuses on associating class overlap measures to the insight they provide regarding the domain. In other words, each measure is associated to the class overlap representation it is able to perceive. Consequently, several practical implications for future research may be drawn:

• The proposed taxonomy advocates for the establishment of standard measures of the overlap degree, contrarily to what is still currently portrayed in related research, where class overlap is measured in rather distinct ways.6 In this regard, the proposed taxonomy evidences which measures are better suited to capture specific types of class overlap, should researchers be interested in a particular facet of the problem;
• Notwithstanding the effort to associate each measure with the class overlap representation it captures, the proposed taxonomy simultaneously reflects the three basal components of class overlap characterisation (decomposition, identification and quantification/insight). As such, it allows different groupings to be established depending on the intended level of the analysis;
• Acknowledging class overlap as a heterogeneous concept, our taxonomy further advocates for the need of a complete characterisation of class overlap, through the combination or simultaneous analysis of distinct representations of the problem. In this regard, the properties and relationships between measures identified in the taxonomy may serve as a stepping stone for the development of more perceptive, flexible and robust sets of complexity measures;
• Beyond well-established measures, this taxonomy includes more recent (although lesser-known) measures, often encompassed in uncharacterised groups (e.g., ''Other Measures'' [40]). The new taxonomy actively characterises their properties, relationships and insights, which contributes to a broader and deeper knowledge on the topic;
• The taxonomy also identifies class overlap measures that have been developed in the scope of imbalanced domains, or for which adaptations to imbalanced data have been explored in the literature. Accordingly, it illustrates to which extent the joint effect of both issues has been discussed in the scope of classification complexity, and highlights opportunities for novel contributions in the field.

To summarise, the proposed taxonomy systematises the current state of knowledge of the problem of class overlap in what concerns its definition, identification, quantification and characterisation. Furthermore, it highlights core properties of the measures and provides an overview of the relationships between them. Finally, it evidences that future research should keep moving towards the development of measures with broader points of view, i.e., measures that are able to combine different representations of class overlap and consider other factors, namely class imbalance.

Along the next sections, we offer a multi-view panorama of the state-of-the-art solutions for class imbalance and overlap across several branches of machine learning. The main goal is to analyse the current body of knowledge in different but related areas of research, identify their limitations and suggest possible future directions. Whenever possible, insightful class overlap measures are identified and discussed within each area, based on related research on the respective topics.

4. Class imbalance and overlap: A multi-view panorama

In this section, we summarise how state-of-the-art research tries to handle class imbalance and overlap jointly across different fields, taking into consideration the ideas discussed throughout Sections 2 and 3. To provide the reader with a global understanding of the current state of knowledge, Fig. 10 illustrates the main topics discussed throughout this section. Four main areas (and respective sub-areas) of research are identified and will be presented following the schema of Fig. 10, moving from the top-left corner to the lower-right corner: Data Analysis, Data Preprocessing, Algorithm Design, and Meta-learning. Herein, we focus mostly on the topics that are currently being explored more thoroughly within each field, summarising their most significant insights. In light of the class overlap representations and taxonomy previously presented, we provide a discussion on insightful complexity measures for each topic, whenever possible: naturally, some topics will be more deeply supported by the use of complexity measures than others. Finally, although we provide a general view on all topics in Fig. 10, those that are investigated less often are marked as open challenges and will be further discussed in Section 5, where we present promising lines for future research, explaining how the considerations of Sections 2 and 3 could lead to improved solutions.

4.1. Data Analysis

One of the most prominent uses of complexity measures is their application to establish the baseline classification difficulty of a given dataset. Insightful complexity measures produce estimates that are aligned with the performance of classifiers, i.e., by determining complexity measures over different datasets, we may infer which will yield better classification results. Overall, class overlap measures have proven to be good indicators of classification difficulty, although imbalanced domains require a more thoughtful characterisation given their observed bias towards the majority class [42].

5 Note, however, that according to previous results, a biased behaviour is expected for complexity measures that provide average values over the total number of observations. Nevertheless, they are simple to adapt through class decomposition.

6 For instance, some works refer to specific measures (F1 [70], N1 [71], or data typology [72]), while others refer to a generic Overlapping Ratio [13,49,50], which is based on different variations of instance overlap measures. Besides not using a standard measurement of class overlap (and hence preventing a fair comparison between approaches), related work is in fact focusing on distinct facets of class overlap, by resorting to measures that capture different dimensions of the problem.
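In practice, such a baseline characterisation can be obtained with off-the-shelf software. The snippet below is a small illustration using the pymfe package (one of the tools referenced later in Section 4.1.2) on a synthetic imbalanced, overlapped dataset; the package interface and the parameter values shown are indicative only.

from pymfe.mfe import MFE
from sklearn.datasets import make_classification

# synthetic imbalanced domain; class_sep controls the degree of overlap
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], class_sep=0.5, random_state=0)

mfe = MFE(groups=["complexity"])     # complexity meta-features (F1, N1, N3, T1, ...)
mfe.fit(X, y)
names, values = mfe.extract()
for name, value in zip(names, values):
    print(name, value)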
Fig. 10. Overview of current research in imbalanced and overlapped domains. Underinvestigated topics are identified as open challenges, whereas for the remaining, the major
insights for research are summarised. Whenever relevant, insightful class overlap complexity measures are also highlighted, based on the findings of related research on the topic.
Data analysis is perhaps the most frequently studied topic on the problem of class imbalance and overlap, where different lines of thought are currently under investigation, depending on the classification paradigm. For binary-classification problems, the currently established approach relies on the decomposition of complexity measures by class, whereas multi-classification and singular problems present additional challenges for research. In what follows, we detail the state-of-the-art recommendations when handling these scenarios.

4.1.1. Binary classification
In binary imbalanced domains, the majority class tends to dominate the computation of some complexity measures [34,69]. The focus is therefore shifting towards the proposal of adapted measures that incorporate class imbalance or the evaluation of the individual class complexities, i.e., decomposing complexity measures into their minority and majority counterparts [34,38,41,42,54].

Related research has demonstrated how several of the complexity measures by Ho and Basu are insensitive to class imbalance in overlapped domains and proposed new complexity measures that correlate better with the classification performance of the minority class (e.g., R_aug) [38]. Another line of research is the adaptation of the original measures by Ho and Basu [42,54], where complexity estimates are provided for the majority and minority class individually, rather than taking a single measure for the entire domain.

In particular, instance overlap measures (please refer to Section 3.2.2) have demonstrated an exceptionally good alignment with classification difficulty, with adaptations of N3, CM, wCM and dwCM for the minority class obtaining the highest correlations with performance results [17,34,42]. Instance hardness measures have also proven to be good estimators of classification complexity [2,8,61]. As they look for examples that are hard to classify, it is intuitive that they are closely aligned with classification performance. In particular, measures that relate to class overlap (kDN, borderline and rare/outlier points) have been identified as major contributors to classification difficulty. Note how the most useful complexity indicators are highly correlated: it becomes clear that analysing the local properties of the domains is a suitable approach to determine classification difficulty in the case of imbalanced binary-classification domains.

4.1.2. Multi-classification
Contrary to binary-classification problems, a decomposition by class may not suffice to accurately estimate the difficulty of the classification tasks in multi-class domains: previous research has shown some inconsistencies between the complexity obtained for a given class and the performance achieved on that class [58]. Nevertheless, the co-decomposition of complexity measures considering the combination of existing classes may be used to characterise multi-class domains more deeply. In particular for class overlap, this may be helpful to establish which classes have broad overlapping areas with the remaining or which classes are responsible for the most problematic areas.
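A minimal sketch of this co-decomposition is given below: a binary overlap measure (here the leave-one-out 1-NN error used by N3, chosen for illustration) is evaluated on every pair of classes, preserving the examples – and hence the imbalance – of each pair. The helper names are ours and the sketch does not correspond to any specific measure from the literature.

import numpy as np
from itertools import combinations
from sklearn.neighbors import NearestNeighbors

def n3(X, y):
    # leave-one-out 1-NN error, used here as the base binary measure
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    return float(np.mean(y[idx[:, 1]] != y))

def pairwise_overlap(X, y, measure=n3):
    # evaluate the measure on every pair of classes, keeping the original
    # examples (and hence the imbalance) of each pair
    scores = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        scores[(a, b)] = measure(X[mask], y[mask])
    return scores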
Another advantage of co-decomposition is the ability to integrate the individual properties of classes in the computation of a final measure. For instance, R_aug (Table 1) could be used to measure the overlap of every two classes, where the imbalance between those classes will also be captured. Alternatively, previous class-wise adaptations of complexity measures may be further examined in multi-class imbalance domains, i.e., determining the complexity between every two classes.

The major question here is how to determine an overall measure for the entire domain, which constitutes an open issue for research. Most frequently, strategies to compute complexity measures over multi-class datasets rely on One-Versus-One (OVO) or One-Versus-All (OVA) approaches. OVO considers all possible combinations of every two classes in the domain, i.e., C(C − 1)/2 binary sub-problems (C representing the total number of classes in the domain). In turn, OVA tests every class against the remaining, composing C binary sub-problems. In both cases, a final measure may be defined as the average across all sub-problems. This is in fact the default behaviour of existing software for complexity measures: DCoL7 uses OVA, whereas ECoL,8 ImbCoL9 and pymfe10 use OVO. However, this type of decomposition somewhat perverts the decision boundaries of the original domain, since the individual properties and relations between classes are disregarded.

Naturally, more thoughtful measures such as R_aug or the adaptations of complexity measures allow more information to be incorporated into the final measure, namely the imbalance between classes, thus avoiding treating all pairs of classes equally. Similarly, it is possible to define several approaches for the aggregation of individual values (rather than the average). One possibility is to weight the contribution of each class to the overall overlap according to the representation of the class concept in the domain. Other possible aggregations have recently been derived [16]. Despite that, new approaches need to be investigated, especially taking into account the mutual relationships between classes. Possible directions are to consider cluster-based solutions [49] or to incorporate the similarity between classes while computing data typology [73]. We acknowledge this topic as one of the major issues for future research and discuss some approaches for multi-classification domains in Section 5.1.

4.1.3. Singular problems
The great majority of studies in the field of imbalanced learning is focused on standard supervised learning tasks (often classification tasks, either binary or multi-class). With respect to non-standard supervised learning problems, i.e., singular problems, little research has been developed and therefore their study constitutes another challenge for future research. Singular problems comprehend a set of variations of non-classical supervised learning problems, where the traditional structure (e.g., one-vector input and one-dimensional output) does not apply. This is the case of multi-label, multi-instance, and multi-view problems, to name a few [74]. Complexity measures have the potential to be as useful for singular problems as they have been for standard classification problems. Nevertheless, similarly to what has been recently uncovered for imbalanced domains (i.e., that several complexity measures are biased towards the most represented concepts), they require further adaptations to properly handle problems with a different composition. A recent review on singular problems may be found in [74]. A discussion of recent research related to class imbalance in the singular problems framework is presented in Fernández et al. [43]. With respect to imbalanced and overlapped domains, Pascual-Triana et al. [41] describe some strategies to adapt the ONB measure (representative of structural overlap) to several types of singular problems. Possible future directions within the scope of singular problems are discussed in Section 5.2.

7 https://github.com/nmacia/dcol
8 https://github.com/lpfgarcia/ECoL
9 https://github.com/victorhb/ImbCoL
10 https://github.com/ealcobaca/pymfe

4.2. Data Preprocessing

Data Preprocessing encompasses a series of operations that may be applied before the data is passed to the learning stage, where the classification models are built. In the context of imbalanced and overlapped domains, common preprocessing tasks include:

• Data Resampling: To compensate for class imbalance by removing majority examples and/or synthesising new minority examples, and to identify and clean overlapped regions or examples;
• Dimensionality Reduction: To alleviate the dimensionality ratio problem (i.e., the curse of dimensionality [75]), by characterising the data domain through a reduced representation, rather than the entire input data. This process is commonly performed using feature selection (selecting a subset of the original features by discarding redundant and/or overlapped features), or using feature extraction (replacing the original features with new transformed/extracted features that retain the relevant information in data).

4.2.1. Data Resampling
In this section, we focus on the current trends on handling imbalanced and overlapped domains. To that regard, Fig. 11 summarises the most popular approaches in the field, along with the class overlap representations (introduced in Section 3.2) they are associated to. The reader may find additional information on class overlap-based approaches in [19]. Among class overlap-based approaches, data resampling approaches (undersampling, cleaning and oversampling) are the most frequently explored when handling class imbalance and overlap simultaneously. Nevertheless, when relevant, we also provide some comments on the remaining approaches.

In light of the class overlap representations previously discussed, it is possible to identify some trends regarding the development of approaches sensitive to class imbalance and overlap. Undersampling approaches are more prone to consider structural information, via clustering and graph-based approaches [63,64,76,77]. They focus on defining the regions of interest (core concepts) of the data domains and discard redundant or overlapped examples found within those regions. In turn, cleaning and oversampling approaches mostly prioritise local information, often via kDN rules [78]. In cleaning approaches, the value of k determines the depth of the cleaning procedure (either addressing borderline regions or the entire domain). In this regard, multi-resolution (fine-grain search) information has been explored to recursively remove harmful examples from data [79]. Oversampling is increasingly moving towards parametrised approaches that adapt the generation of new examples to the characteristics of each dataset [80–83]. There is also some concern with the generation of examples that are both informative and diverse [84,85]. This allows the generation process to cover more regions of the data space and alleviate the structural complexity of datasets to some extent. Oversampling approaches therefore seem more flexible, but may require a large number of user-defined hyperparameters, for which there is not yet an established relationship with complexity measures. This constitutes yet another open challenge for hyperparameter tuning (more details in Section 5.4). Finally, it is not uncommon for approaches to share some paradigms or consider several sources of information (e.g., local, structural, density, fuzzy logic, cost-sensitive). This goes towards the idea that class overlap has different sources of complexity and that addressing them altogether could potentially improve results. Additionally, there are considerably fewer approaches developed within the scope of ensembles, evolutionary, region splitting and hybrid approaches. This may be due to the lack of current knowledge on the joint effect of class imbalance and overlap on different learning paradigms [15], ensemble learning, and hyperparameter tuning.
Fig. 11. Common approaches to address imbalanced and overlapped domains. The schema associates each group of approaches to the class overlap representation it is most
attentive to.
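As a concrete example of the local, kDN-style information that the cleaning and oversampling approaches of Fig. 11 rely on, the snippet below applies Borderline-SMOTE, as implemented in the imbalanced-learn package, to a synthetic imbalanced, overlapped dataset, so that new minority examples are only synthesised around borderline instances. Parameter values are illustrative and this is not a recommendation drawn from the cited works.

from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           class_sep=0.5, random_state=0)

# new minority examples are only synthesised around borderline minority
# instances, identified from their k = 5 neighbourhood
sampler = BorderlineSMOTE(k_neighbors=5, random_state=0)
X_resampled, y_resampled = sampler.fit_resample(X, y)
print(Counter(y), "->", Counter(y_resampled))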
Despite the fact that some research has focused on handling domains affected simultaneously by class imbalance and overlap in the last couple of years, there is currently not enough knowledge to support the application of one approach (or category of approaches) over the others. On the one hand, despite the extraordinary flexibility of oversampling methods, the generation of synthetic examples becomes a more complicated task in overlapped domains due to the risk of further exacerbating class overlap, i.e., generating examples in problematic regions [10]. This has been somewhat attenuated by the development of more polished approaches [81–85], but at the cost of increasing computational complexity and decreasing interpretability (too many user-defined hyperparameters to tune). On the other hand, the apparent superiority of oversampling techniques due to their ability to consider the inner structure of data [86] may not hold for imbalanced and overlapped domains. Indeed, most recent undersampling and cleaning approaches also consider information regarding the structural and local complexity of the domains and have proven to surpass well-established oversampling algorithms [63,64,77]. Additionally, there are obvious advantages to using other types of approaches, such as the incorporation of data complexity and classification performance in multi-objective evolutionary approaches [50,71], or the combination of multiple reasoning paradigms when using ensembles [31,49].

Beyond a theoretical point of view, there are further empirical limitations preventing recommendations on the best approaches to handle imbalanced and overlapped domains from being devised. These relate to the experimental design of related work, the lack of a standard definition, characterisation, and measurement of class overlap, as discussed along Section 2, and the lack of dataset benchmarking and open software. We will discuss these limitations and directions for further research in Sections 5.3, 6.1, and 6.2.

4.2.2. Feature selection
Feature Selection is an important preprocessing step when handling high-dimensional data in every standard classification domain, given that a large number of features can be problematic for some classifiers [87]. In imbalanced and overlapped domains, it becomes a more strenuous task, since it is more difficult to discriminate certain concepts in data and consequently determine the features that increase class separability.

Past work has already discerned the challenges of feature selection in imbalanced domains [43], whereas the use of complexity measures for the recommendation of feature selection methods has become a hot topic in the last couple of years. Okimoto et al. [88] show the suitability of using data complexity measures for univariate feature selection, where F1, F3 and N1 were successful in selecting the most relevant features. F1, associated with class separability, was the most effective. In a later work, F1 is coupled with N2 to produce a univariate–multivariate feature selection approach [89], combining both feature-based and neighbourhood-based information. Parmezan et al. [87] proposed a new framework for the recommendation of feature selection algorithms based on meta-learning, considering both the characteristics of the feature selection methods and the intrinsic characteristics of the datasets. Information-theoretic and complexity meta-features have shown promising results in the characterisation of datasets [44]. In particular, the signal-to-noise ratio, the dispersion of the dataset and the average mutual information between classes and attributes were frequently selected as decision nodes in the meta-models. Similarly, F2 was also present in all the constructed meta-models. Seijo-Pardo et al. [90] use a combination of feature overlap measures (F1, F2, F3) to guide the definition of thresholds regarding a suitable number of features to be kept by feature selection methods. Dong and Khosla [91] show that the performance of feature selection methods is correlated with N3.

A few emergent approaches have attempted to handle class imbalance and overlap in synergy. Fernández et al. propose a multi-objective evolutionary algorithm to handle class imbalance and overlap [92]. Both feature and instance selection are considered while evolving solutions, to simultaneously compensate for the class distributions, remove complicated examples, and remove features with high overlap degrees. Lin et al. [93] propose a feature selection algorithm based on feature overlapping and group overlapping (FS-FOGO). Feature overlapping is computed by the ratio of the overlapping region on the effective range of each class (similarly to F3), while group overlapping is determined by the number of examples that fall onto overlap regions between classes (using R-value [58]). In such a way, group overlapping is related to the instance overlap category defined in Section 3.2, and FS-FOGO combines it with feature overlap to better decide on the discriminative power of features. Fu et al. [16] propose two feature selection methods to define a subset of features under SVM and Logistic Regression classifiers: MOSNS (Minimising Overlapping Selection under No-Sampling) and MOSS (Minimising Overlapping Selection under SMOTE). Both methods are built via sparse regularisation with the main objective of minimising the overlap degree between the majority and the minority classes (defined using R_aug, therefore incorporating instance overlap information). However, MOSS first applies SMOTE to rebalance the training data. MOSS outperforms all other approaches (MOSNS, ACC and ROC-based feature selection) regarding classification performance, whereas MOSNS produces the lowest number of retained features while providing better or comparable results to ACC and ROC-based methods in most datasets. Recently, MOSS has also been shown to improve the performance of imbalanced approaches in multi-class domains [94].
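The feature-overlap rationale shared by these works can be illustrated with a simple per-feature discriminant ratio in the spirit of F1. The two-class sketch below is provided for illustration only; it is not the implementation used in any of the cited studies.

import numpy as np

def fisher_ratio(X, y):
    # per-feature discriminant ratio for a two-class problem: squared distance
    # between the class means over the sum of the class variances (higher
    # values indicate less overlapped, more separable features)
    c0, c1 = np.unique(y)
    X0, X1 = X[y == c0], X[y == c1]
    return (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 / (X0.var(axis=0) + X1.var(axis=0) + 1e-12)

def least_overlapped_features(X, y, n_features):
    # rank features by decreasing discriminant ratio and keep the top ones
    return np.argsort(fisher_ratio(X, y))[::-1][:n_features]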
Based on the same strategy of considering sparse feature selection to minimise class overlap (i.e., instance overlap, via R_aug), Fatima et al. [95] refer to RONS (Reduce Overlapping with No-sampling), ROS (Reduce Overlapping with SMOTE), and ROA (Reduce Overlapping with ADASYN). RONS and ROS are the same as MOSNS and MOSS, respectively, while ROA follows the same principle as MOSS although using ADASYN instead. Considering ADASYN instead of SMOTE seems favourable, since ADASYN focuses on more complicated minority examples, whereas SMOTE considers all minority examples equally.

4.2.3. Feature extraction and visualisation
Rather than selecting a subset of features, feature extraction methods perform certain transformations on the original set of features in order to produce a reduced set of artificial features. These new features are somewhat a combination or mixture of the original features that aims to retain most of the information comprised in the original feature space. In imbalanced and overlapped domains, a common application of feature extraction is data visualisation. Graphic inspection is often applied to get a feel of the structure of data, the overlapping between classes and the overall data complexity. To that end, datasets are often transformed using feature extraction techniques to allow data visualisation in two or three dimensions.

Anwar et al. [34] used Multidimensional Scaling (MDS) to represent each data example in two dimensions in order to visually assess data complexity. The visualisation is used in conjunction with the proposed CM metric (Table 1) to analyse the degree of overlap between classes. Whereas the majority class is shown in some colour, each minority class example is identified by the number of same-class neighbours in its 3-neighbourhood. Napierala et al. [61] used MDS and t-Distributed Stochastic Neighbour Embedding (t-SNE) to assess the dominant typology of datasets (safe, borderline, rare/outlier datasets) and identify class overlap. Despite certain differences in the projections of both methods, the observations regarding the complexity of the studied domains are similar.

Recent research is also exploring feature extraction and visualisation strategies to characterise the footprint of algorithms. This methodology is known as Instance Space Analysis and may be applied to a collection of datasets or to individual observations within a dataset. The rationale of the analysis is similar. Essentially, it involves summarising each dataset or each instance within a dataset as an n-dimensional feature vector representing its complexity. Regarding the taxonomy presented in Section 3, dataset complexity may be captured by the class overlap representations proposed, whereas the complexity of singular observations is most often associated with the instance hardness measures [8]. Then, using a feature extraction technique, e.g., Principal Component Analysis (PCA), a two- or three-dimensional embedding (an instance space) is produced that can be visually investigated. The classification performance associated to each dataset or instance can be superimposed in the visualisation to identify regions of good or poor behaviour of classifiers, and to identify pockets of hard and easy datasets or instances. Smith-Miles et al. [96,97] used PCA to project dataset instances onto a 2-dimensional space and analyse algorithm performance. Muñoz et al. [98,99] propose a new dimension reduction methodology that improves the interpretability of the visualisations. The new projection approach is optimised so that the created instance space represents as much as possible a linear trend between data complexity and classification performance.

4.3. Algorithm Design

The idea behind algorithm design is to adjust a given approach, i.e., the parameters of a classifier or preprocessing method, to the characteristics of data. In the context of imbalanced and overlapped datasets, a common strategy is to incorporate information regarding both these problems in the development of approaches. Such information might appear in the form of a heuristic based on complexity measures and/or other observed characteristics of datasets, leading to the development of specialised approaches. Alternatively, it can also be based on the tuning of hyperparameters. In this case, the main objective is to maximise the classification performance by choosing optimal hyperparameters for classifying or preprocessing each dataset.

Whereas some strategies for specialised approaches have been applied in the literature, hyperparameter tuning remains an understudied topic in what concerns the design of approaches sensitive to the peculiarities of data suffering simultaneously from class imbalance and overlap.11 In what follows, we discuss some existing approaches in this regard.

4.3.1. Specialised approaches
Depending on the category of class overlap-based approaches (please refer to Section 4.2.1), different strategies may arise for the development of specialised approaches. Recent approaches are based on defining heuristics for undersampling or cleaning (adaptive thresholding or local neighbour adjustment), analysing local information for selective oversampling (via data typology) and incorporating costs associated with data complexity directly into the learning systems.

Pattaramon and Elyan [64,79] propose two heuristics for cleaning overlapped majority class examples. With AdaOBU [64], they introduce an automatic elimination threshold adaptable to the degree of class overlap. The threshold is proportional to the fuzziness of the dataset and consequently to the existing class overlap. In [79], the authors discuss another heuristic to determine a reasonable value of k for neighbourhood-based cleaning methods that promotes the discovery of overlapped majority examples. The heuristic considers information regarding both the number of examples in data and the imbalance ratio. A similar approach is taken in [101], where k is defined by the imbalance ratio of the dataset.

Data typology has also been considered in the design of specialised approaches, where selective oversampling has proven to improve classification results. Skryjomski et al. [102] show how SMOTE can be empowered by incorporating information regarding the typology of minority class examples. Similarly, Sáez et al. [103] guide the oversampling procedure based on the data typology of examples in multi-class datasets. The best oversampling configurations often involved the oversampling of only borderline and outlier examples, with a higher frequency of the preprocessing of borderline examples.

Another strategy is to integrate the information regarding data complexity directly into the learning stage of classifiers. Lango et al. [72] suggest considering the information produced by ImWeights regarding the number of clusters and associated difficulty (incorporating both structural and local information). Lee et al. [13] introduce the concept of overlap-sensitive costs, which combines both the imbalance ratio and the degree of overlap of training observations (based on kDN).

4.3.2. Hyperparameter tuning
Hyperparameter tuning allows determining specific model parameters tailored to the characteristics of each dataset in order to obtain optimal performance. Thus, more than embedding ''rule of thumb'', theoretical settings into the approaches, it is possible to empirically fine-tune parameter values for individual datasets, improving classification results.

11 Note that hyperparameter tuning, per se, constitutes a topic of interest across several fields beyond traditional Supervised Learning, such as Deep Learning and Meta-learning [100]. Accordingly, some intersections between terms, trends, and solutions are likely to arise. Notwithstanding, in this paper, we detach from that intersection and from overall considerations on hyperparameter tuning regarding the Deep Learning and Meta-learning fields specifically. Instead, we focus particularly on hyperparameter tuning with respect to imbalanced and overlapped domains, highlighting existing limitations which are yet to be addressed by all communities.
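The tuning protocol described in Section 4.3.2 typically amounts to searching a grid of resampler and classifier hyperparameters with an imbalance-aware score. The sketch below uses an imbalanced-learn pipeline with SMOTE and a random forest; the pipeline, parameter names and ranges are illustrative assumptions on our part, not recommendations drawn from the cited works.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           class_sep=0.5, random_state=0)

pipeline = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", RandomForestClassifier(random_state=0))])
param_grid = {"smote__k_neighbors": [3, 5, 7],
              "clf__max_depth": [None, 5, 10]}

# imbalance-aware scoring: balanced accuracy instead of plain accuracy
search = GridSearchCV(pipeline, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X, y)
print(search.best_params_)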
With respect to imbalanced and overlapped domains, the tuning process is most often performed directly by analysing the effect of hyperparameters on classification performance [49,62,70,80,84,104]. That involves testing a range of hyperparameters (or combinations of hyperparameters) over a benchmark of datasets and choosing the one that performs the best overall.

Some studies further discuss the effect of the hyperparameters of the proposed approach and suggest appropriate values that provide overall good results. This is especially the case of approaches that require several user-defined hyperparameters (e.g., A-SUWO, NI-MWMOTE, IA-SUWO) [81–83]. Still, the discussion is given as a high-level view of the approach, rather than providing recommendations based on data characteristics. An exception can be highlighted for Douzas et al. [85], where some hyperparameter recommendations for G-SMOTE are given based on the imbalance ratio and the ratio of the number of samples to the number of features of the datasets. Another important exception are evolutionary-based approaches that, by resorting to multi-objective algorithms, are able to consider both the classification performance and data characteristics in the refinement of the approach [71,105].

Nevertheless, there are still several approaches where hyperparameters are defined according to the default values of existing software packages or set to common values for consistency with other works in the literature that used the same approaches or datasets [76,78,106]. All in all, in what concerns imbalanced and overlapped data, hyperparameter tuning remains a neglected subject and it constitutes a challenge for further research. Accordingly, future directions will be highlighted in Section 5.4.

Finally, as previously discussed,11 we may argue that this topic also falls onto the scope of Meta-learning and Deep Learning.

In the Meta-learning community, hyperparameters themselves may be seen as meta-data that describes the learning tasks [100], and some categories of meta-features (e.g., model-based, landmarking) further require the definition of hyperparameters as well, for which tuning is yet to be explored [44]. The idea of defining appropriate parameters depending on the data characteristics has also been the subject of previous work in the field, where meta-models are designed to recommend specific configurations or hyperparameters, based on some meta-features. The reader is referred to [44,100], which constitute two comprehensive surveys on the topic. Nevertheless, existing work mostly focuses on traditional meta-features (e.g., simple, statistical, information-theoretic) rather than complexity measures, and there is not, to our knowledge, any study that focuses specifically on hyperparameter tuning for imbalanced and overlapped datasets. We will further discuss this matter in Section 4.4.

With respect to the Deep Learning field, some recent research is starting to study the behaviour of deep learning systems in imbalanced domains which are further affected by additional complexity factors, such as class overlap. The reader is referred to [107] for the first novel thoughts on the subject, although some core issues persist in deep learning systems as for their classical counterparts: class overlap remains a challenging factor even for deeper architectures, and, to this point, model parametrisation follows the same principle of experimenting with several hyperparameters to report optimal classification results.

4.4. Meta-learning

In Meta-learning (MtL), the characteristics of a dataset (named meta-features or meta-characteristics) are extracted and associated to the classification performance obtained over it. By compiling meta-information on a collection of datasets with associated performance results (thus creating a meta-dataset), it is possible to build a recommendation system that infers on the behaviour of a technique (or suggests the application of an appropriate one) based on the characteristics of a new dataset.

Traditionally, there are five categories of meta-features discussed within MtL frameworks: simple, statistical, information-theoretic, landmarking and model-based meta-features [108]. However, although they were not originally proposed for meta-learning, complexity measures have been used extensively in the MtL and AutoML literature [109–112]. For that reason, authors have started to refer to them as an extra category of meta-features [44], and recent research has been showing that they may prove equally or more informative than traditional meta-features [112]. In particular, class overlap measures have stood out as highly accurate indicators of classification performance [38,42]. Indeed, some class overlap measures are related to the landmarking category of meta-features. Landmarking meta-features characterise datasets based on the classification performance of simple and fast learners, such as kNN and linear discriminants, and are therefore highly associated with the instance overlap measures (N3) and the feature overlap measures associated to class separability (F1, F1v).

In the context of imbalanced and overlapped domains, common applications of MtL systems are related to the recommendation of classifiers and preprocessing techniques or to the study of their domains of competence. Most often, related research focuses on obtaining a high-level view of MtL frameworks rather than discussing informative measures [109–111]. Nevertheless, some works have attempted to connect the insights derived from complexity measures to the recommendation provided by the systems, which we discuss in what follows.

4.4.1. Classifier recommendation
In the scope of classifier recommendation, García et al. [113] use regression techniques to recommend the best classifier (ANN, DT, SVM, kNN) for a given dataset, based on its data complexity. The most informative measures were N3 and N1, followed by N2, Density and T1. Luengo and Herrera [114] discuss an automatic extraction method to determine the domains of competence of classifiers (DT, SVM and kNN). The complexity measures regarded as most informative for the automatic extraction method were N1, N3, L1 and L2. Apart from the top informative measures, additional information may be useful depending on the nature of classifiers. That, however, remains an under-investigated topic. Open avenues regarding classifier recommendation will be discussed in Section 5.5, along with ensemble learning, as they are related topics that suffer from similar limitations.

4.4.2. Recommendation of resampling approaches
Regarding data preprocessing approaches, complexity measures are often used to guide the choice of appropriate resampling techniques. Depending on the complexity of a domain, a suitable resampling strategy can be chosen by taking into account its intrinsic behaviour, i.e., how it works internally and to what extent it can alleviate certain data problems. Luengo et al. [53] analyse the usefulness of complexity measures to evaluate the behaviour of resampling approaches. F1, N4 and L3 proved informative to establish significant intervals of good and bad behaviour for different preprocessing approaches. Santos et al. [10] perform a thorough comparison of oversampling approaches for imbalanced datasets, supported by a data complexity analysis. The best oversampling techniques seemed to include structural information (cluster-based synthetisation), instance overlap information (use of cleaning procedures) and instance hardness information (adaptive weighting of examples). Costa et al. [112] use Exceptional Preferences Mining to extract interpretable rules to guide the recommendation of oversampling strategies for imbalanced datasets. Similarly to the previous work, class overlap measures were the most informative, namely measures related to structural and instance overlap (N1, N4) and instance hardness (proportion of borderline examples). Zhang et al. [111] propose an instance-based learning recommendation algorithm to determine the most suitable strategy to handle imbalanced datasets. They use complexity measures, landmarking measures, model-based measures and structural meta-features, although they only present a high-level view, with no specific measures discussed.
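The recommendation systems discussed in Section 4.4 can be prototyped along the following lines: complexity meta-features are extracted for a collection of datasets, each labelled with the strategy that performed best on it in a prior benchmark, and a meta-model learns the mapping. The sketch below uses illustrative helper names and assumes pymfe for the meta-feature extraction (footnote 10); it is not the pipeline of any of the cited studies.

import numpy as np
from pymfe.mfe import MFE
from sklearn.ensemble import RandomForestClassifier

def complexity_meta_features(X, y):
    # describe one dataset by its complexity meta-features
    mfe = MFE(groups=["complexity"])
    mfe.fit(X, y)
    _, values = mfe.extract()
    return np.nan_to_num(np.asarray(values, dtype=float))

def build_meta_model(datasets, best_strategies):
    # datasets: list of (X, y) pairs; best_strategies: the resampling approach
    # that performed best on each dataset in a prior benchmark
    meta_X = np.vstack([complexity_meta_features(X, y) for X, y in datasets])
    return RandomForestClassifier(random_state=0).fit(meta_X, best_strategies)

# recommendation for an unseen dataset:
# meta_model = build_meta_model(datasets, best_strategies)
# meta_model.predict([complexity_meta_features(X_new, y_new)])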
4.4.3. Ensemble learning

Although some ensemble-based techniques have been discussed within the scope of imbalanced and overlapped domains, ensemble learning is still an open avenue for research.

Current ensemble frameworks often incorporate one of two solutions. One is the coupling of ensembles with resampling and cleaning methods: recent approaches include CluAD-EdiDO [49], SPDM [31], and SPE [115]. The other is the simultaneous use of evolutionary approaches to handle the peculiarities of the domains. Most often, this involves the incorporation of some data complexity information in the objective criteria of evolutionary algorithms, in order to optimise the final performance of the ensemble. For instance, Fernandes et al. [71] discuss EVINCI, an evolutionary ensemble-based method that incorporates the N1 measure in the workflow to optimise instance selection. Fernández et al. [105] propose EFIS-MOEA, which incorporates both feature and instance selection.

The first strategy requires an understanding of which resampling/cleaning approaches are most suited to different domains, and may be supported by previous meta-learning studies on resampling approaches. The second strategy is more closely related to algorithm design, focusing on the development of specialised approaches and hyperparameter tuning to improve classification performance.

Indeed, note how both strategies do not specifically focus on ensemble learning from a meta-learning perspective, i.e., using complexity measures to define an appropriate set of base classifiers for the ensemble framework. That requires the choice of a pool of adequate classifiers to form the ensemble, which comprises both the analysis of how classifiers with different learning biases respond to the joint-effect of class imbalance and overlap, and the assessment of their combination (creating ensembles) for optimal solutions. However, as previously discussed, the link between data characteristics (i.e., complexity measures) and classifier recommendation is not yet well-established; consequently, ensemble learning, to this extent, also remains an open challenge for research, and will be discussed in Section 5.5.

5. Open challenges and future directions for research

In what follows, we revisit the topics identified as open challenges in the previous section (Section 4) and elaborate on possible future research directions, based on the considerations of the first part of the paper (Sections 2 and 3). Such discussion constitutes the main contribution of this section.

5.1. Multi-class problems

As discussed in Section 4.1.2, the standard approach for multi-class problems consists of formulating several binary sub-problems, using OVA or OVO decomposition. On the one hand, these strategies allow the application of binary classifiers without additional modifications. Also, and especially when handling data overlap, they may simplify the original domain by focusing on sub-problems individually, thus easing the separation between classes [116]. On the other hand, this simplification is achieved at the cost of distorting the inner structure of individual classes (and original decision boundaries) and neglecting mutual relations between classes. For instance, a given class can either be considered the minority or majority class, depending on the size of the class it is being compared to. Some classes can also be more closely related (more similar) than others. With respect to class overlap, there can be a class or a subset of classes that is mainly responsible for overlapping regions, whereas other classes may have clear decision boundaries among each other. Classes may also have distinct overlapping regions with respect to each other. Regarding data typology, examples will be categorised in different types, depending on the classes considered to define their neighbourhood.

By manipulating the data internally, via OVA or OVO, the information on the intrinsic characteristics of each class is lost, which may lead to the application of methods that are not appropriate for the domain as a whole, i.e., they may hurt one class while trying to improve the representation of another. OVA can additionally introduce artificial class imbalance [49,116], whereas OVO suffers from the non-competence problem [117], i.e., when classifying new data, the predictions of all constructed OVO classifiers are considered, even those of classifiers that have not been trained with examples belonging to the real class of that data. The following directions could be analysed to fully understand and explore multi-class domains:

• An interesting future direction is the exploration of cluster-based techniques. The domain is divided into several regions, where data complexity can then be assessed. For instance, clusters containing examples of only one class will not contribute to class overlap. In turn, clusters containing examples of multiple classes will be evaluated maintaining the original relationship between classes (a minimal illustrative sketch of this idea is provided after this list). A starting point for the investigation of this line of research is [49], where multi-class imbalanced and overlapped datasets are first clustered, before any cleaning and oversampling procedures.
• Another alternative to take into account the relationships between classes is to incorporate additional information on the data typology of different classes. Rather than considering each class in isolation and producing its typology (OVA approach) [103], recent research suggests incorporating a similarity factor when determining the safety level of each example in the data [73]. A major drawback in [73] is that it considers that similarity should be provided by the user (via domain knowledge or consulting a domain expert). As this is most often not available, a possibility to overcome this issue could be to estimate a similarity coefficient via similarity/distance functions. Another heuristic, based on the imbalance ratio between class concepts, has also been recently proposed [118]. It suggests that concepts with lower class imbalance are more similar to each other. We argue that associating class similarity to the imbalance ratio between classes might be too simplistic and suggest that the overlap degree between classes could be used instead, to produce a more realistic measure of class similarity.
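As a minimal illustration of the cluster-based direction described in the first bullet above, the following sketch partitions a toy multi-class dataset with k-means and reports the class composition and purity of each cluster: pure clusters do not contribute to class overlap, whereas mixed clusters flag regions deserving a local, class-aware analysis. The clustering algorithm, the number of clusters and the purity indicator are illustrative choices and do not correspond to the procedure of [49].

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Toy multi-class dataset with uneven class sizes.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=4,
                           weights=[0.6, 0.3, 0.1], class_sep=0.8,
                           random_state=0)

# Partition the domain into regions; the number of clusters is an
# illustrative choice and would normally require tuning.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

for c in np.unique(clusters):
    labels = y[clusters == c]
    counts = np.bincount(labels, minlength=3)
    purity = counts.max() / counts.sum()
    # Pure clusters (purity == 1) do not contribute to class overlap;
    # mixed clusters are candidates for a local, class-aware analysis.
    print(f"cluster {c}: class counts {counts.tolist()}, purity {purity:.2f}")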
5.2. Singular problems

As pointed out in Section 4.1.3, current real problems and applications are showing a more complex structure with respect to the classical supervised and unsupervised tasks, essentially in what concerns their input and output variables [74]. In what follows, we discuss how problematic regions and/or instances may be identified in non-standard scenarios such as multi-label, multi-instance, and multi-view problems.

• To address multi-label or multi-domain learning, two different approaches are likely to be applied [119,120]: on the one hand, to transform the dataset into standard ‘‘single-output’’ problems; on the other hand, to adapt or design the classifier to cope with this type of data. In both cases, the occurrence of the imbalance and overlap issues is especially relevant, as there is a significant increase in the number of labels and combinations among them. To address the former situation, binarisation and probabilistic classification algorithms may be explored to ease the discrimination among groups of labels by simplifying the original problem. With respect to binarisation techniques, similar considerations can be taken as for multi-class problems (see Section 5.1).
• In the multi-instance paradigm, input examples are represented in groups, the so-called bags [121]. Every instance shares the input space, but the number of elements in a bag can be different. The final objective is to identify the class of the bag by labelling all instances associated to it, i.e., the bag is ‘‘positive’’ if there is at least one positive instance. In this scenario, considering the bags as ‘‘instances’’ by aggregation mechanisms, that is, considering a single representative element for each one of them, eases the definition of overlap to follow the standard case. Otherwise, feature vectors should be used separately, instance by instance, possibly inducing a higher degree of overlap and complexity to the problem. This oversimplification of the problem may have an influence on the quality of the model to be obtained. In addition, few research studies consider the event of imbalance in this context [122,123]. As such, there is a need to establish proper preprocessing approaches to cope with both overlap and imbalance, taking into account the properties of the positive and negative bags.
• Finally, multi-view problems are defined as those in which each instance has a fixed number of feature vectors that can vary in type and format [124]. As there are different ‘‘input-spaces’’, the degree of overlap may vary for each of them. This implies that the characterisation of a given instance must be considered under the perspective that better establishes the separation among other labels. Multi-view problems are mainly addressed via auto-encoders and feature transfer [125], so that creating non-linear combinations of the original features for a higher-level representation of the data may lead to simpler decision functions.

5.3. Data resampling

In Section 4.2.1, we have provided an extensive discussion of the limitations and opportunities for future research regarding class overlap-based methods. Besides the ones previously highlighted, the following open directions are crucial for the development of new approaches dedicated to handling imbalanced and overlapped datasets:

• For the most part, the comparison of class overlap-based methods remains limited to well-established approaches (e.g., ROS, RUS, SMOTE, Safe-Level-SMOTE, Borderline-SMOTE) which have been frequently outperformed. It is also not uncommon to find that some class overlap-based approaches are compared only to their analogous distribution-based approaches. It would be crucial to compare new methods with emergent, state-of-the-art approaches developed for the same purpose, to provide a more accurate evaluation of results.
• Despite the fact that many methods are being proposed to overcome class overlap, there is a clear lack of information on how datasets are affected by this problem (there is no quantification of class overlap). The question of whether the applied methods provide true improvements with respect to class overlap therefore remains. Most often, approaches are evaluated in terms of classification performance, which may not be sufficient to validate the approach. It is important that future research considers a deeper characterisation of domains, especially if the purpose of an approach is to alleviate some data-related issue. New studies in the field should provide a more insightful characterisation of datasets beyond the number of samples, features and imbalance ratio. It is important to guarantee that a testbed is representative of the desired data issue to sustain the improvements introduced by a proposed approach.
• A large amount of class overlap-based methods is based on handling conflicting examples (e.g., borderline, noisy examples), whose identification relies almost exclusively on instance hardness measures (kDN rules). Future research could simultaneously explore other vortices of class overlap while performing this assessment. In this regard, exploring the taxonomy presented in Section 3 is a good starting direction.
• Class overlap measures can also be used to provide specialised data preprocessing so that the representation of minority examples is increased in overlapping regions. For instance, the generation of new synthetic examples can be guided in order to optimise a given complexity measure (a minimal sketch of this idea is provided after this list).
• Also, class imbalance should be explored beyond the characterisation of the disproportion between classes, which is currently used mostly to define the undersampling/oversampling amount required by preprocessing techniques. Instead, it could be considered together with class overlap to produce new measures of complexity, further embedded in the operations of methods. Some recent work is already searching for solutions along this line at the level of algorithm design (Section 4.3), which we believe to be the direction with the highest potential for future developments in the following years.
• Improved weighting schemes are also worth studying to adjust the complexity profile of training examples; e.g., closer neighbours, or minority class neighbours, may have a higher impact on complexity computation. This rationale can also be applied to data preprocessing approaches to provide a specialised resampling, depending on the difficulty of a given example.
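As a minimal sketch of the complexity-guided generation mentioned in the list above, the following example produces SMOTE-style candidates by interpolating between minority neighbours and keeps only those whose local kDN-style score (fraction of nearest neighbours from the majority class) stays below a threshold, so that oversampling does not aggravate local overlap. The interpolation scheme, the kDN proxy, the neighbourhood sizes and the 0.6 acceptance threshold are illustrative assumptions, not a published method.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           class_sep=0.8, random_state=0)
minority = X[y == 1]

# One neighbour model over the full dataset to score candidates (kDN-style),
# and one over the minority class to pick interpolation partners.
nn_all = NearestNeighbors(n_neighbors=5).fit(X)
nn_min = NearestNeighbors(n_neighbors=6).fit(minority)
_, neigh = nn_min.kneighbors(minority)   # column 0 is the seed itself

def local_kdn(point):
    """Fraction of the candidate's nearest neighbours from the majority class."""
    _, idx = nn_all.kneighbors(point.reshape(1, -1))
    return float(np.mean(y[idx[0]] != 1))

accepted = []
for i, seed in enumerate(minority):
    partner = minority[rng.choice(neigh[i, 1:])]        # random minority neighbour
    candidate = seed + rng.random() * (partner - seed)  # SMOTE-style interpolation
    if local_kdn(candidate) <= 0.6:                     # illustrative threshold
        accepted.append(candidate)

X_new = np.array(accepted).reshape(-1, X.shape[1])
X_aug = np.vstack([X, X_new])
y_aug = np.concatenate([y, np.ones(len(X_new), dtype=int)])
print(f"kept {len(accepted)} of {len(minority)} candidates; "
      f"augmented set has {len(y_aug)} examples")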
5.4. Hyperparameter tuning

As discussed in Section 4.3.2, the configuration of hyperparameters (of classifiers or resampling approaches) is most often guided by the results obtained from the classification stage. Besides being time-consuming, this type of approach does not take advantage of information on data complexity, which is available, often at a lower cost than running entire experiments. The following directions may be explored in order to devise more insightful ways to guide hyperparameter tuning:

• Regarding resampling approaches – undersampling, oversampling and cleaning – a possibility is to guide the tuning of hyperparameters based on complexity measures. For imbalanced and overlapped domains, the hyperparameters of resampling procedures can be adjusted in a way that they alleviate class imbalance and minimise class overlap, by assessing the effects of given hyperparameters on suitable complexity measures. This can be thought out by addressing data complexity as a whole, for instance, focusing on minimising feature, instance and structural overlap. Alternatively, it is possible to address data complexity selectively, depending on the classification paradigm to be used after the preprocessing stage, i.e., focusing only on the most complicated factors for the classifier at hand. As an example, since SVMs can handle rather complex structures [6], one can focus solely on addressing instance overlap, removing harmful examples.
• Regarding classifier hyperparameterisation, it is possible to achieve a reduced range of hyperparameters to test by exploring data complexity at an intermediate stage. For instance, for SVMs, more appropriate combinations of C and γ can be explored depending on the characteristics of the data. An obvious advantage of considering hyperparameter tuning based on data complexity is that complexity measures are often faster and simpler to compute than performing full classification experiments. Also, choosing more insightful ranges of hyperparameters allows the algorithm to converge faster, avoiding the need to test an extended set of possible combinations. In this regard, some interesting approaches have studied meta-models to determine whether or not to tune SVMs [126], or how to define appropriate sets of default hyperparameters [127]. Both research works consider general real-world domains and rely on the study of several data characteristics (meta-features), including some complexity measures (the former exploring imbalanced datasets in more detail). Although they do not focus particularly on the joint-effect of class imbalance and overlap, they may serve as a starting point to further explore hyperparameter tuning in these domains across several learning paradigms and methods, including preprocessing approaches.
• At the level of class overlap complexity measures themselves, a large number of measures relies on finding a k-neighbourhood, where the value of k is routinely set to a pre-defined value (k = 5 is a common default). The same is true for data typology and several class overlap-based methods. This strategy obviously neglects the characteristics of the domains, although estimating k for each domain may be computationally expensive. Therefore, defining more insightful heuristics for setting k is an interesting direction for future work. Regarding complexity measures, some approaches suggest incrementally increasing k until the complexity estimate stabilises [34] (a minimal sketch of this heuristic is provided after this list). On data typology, recent work discusses the possibility of tuning k and the distance metric used, based on the classification results of a kNN classifier [128]. On data resampling, some recent heuristics for defining suitable k-neighbourhoods are based on the degree of class overlap or the class imbalance of datasets [64,79,101].
• Similarly, adaptive methods for finding k should also be explored, where k could be adjusted to the local minority class densities. Traditionally, smaller values of k are more successful to recognise the less represented concepts in the overlap region; in turn, larger values of k benefit the more represented concepts in that region [7]. Future research could pursue the proposal of a framework able to select an optimal k value based on the local characteristics of data. In that regard, hypersphere coverage metrics could be informative to define optimal k values; for instance, examples with lower LSC require smaller values of k for correct classification.
• Future research may also focus on the investigation and optimisation of distance functions (both for specialised approaches and complexity measures). Although previous studies have shed some light on the different behaviour of complexity measures and data typology depending on the distance function used [34,40,41,128], this remains a poorly studied topic.
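The following sketch illustrates one simple reading of the ‘‘increase k until the complexity stabilises’’ heuristic discussed above: a neighbourhood-based complexity estimate is recomputed for growing k and the search stops when the change falls below a tolerance. The disagreement score and the tolerance value are illustrative assumptions and do not reproduce the exact procedure of [34].

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=500, weights=[0.85, 0.15],
                           class_sep=0.7, random_state=0)

def mean_disagreement(X, y, k):
    """Average fraction of the k nearest neighbours with a different label
    (a simple neighbourhood-based complexity estimate)."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return float(np.mean(y[idx[:, 1:]] != y[:, None]))

def select_k(X, y, k_max=30, tol=0.005):
    """Increase k until the complexity estimate stabilises (change < tol)."""
    previous = mean_disagreement(X, y, 1)
    for k in range(2, k_max + 1):
        current = mean_disagreement(X, y, k)
        if abs(current - previous) < tol:
            return k, current
        previous = current
    return k_max, previous

k, complexity = select_k(X, y)
print(f"selected k = {k}, complexity estimate = {complexity:.3f}")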
5.5. Classifier recommendation and ensemble learning

As discussed throughout Sections 4.4.1 and 4.4.3, although previous studies have shown that the combination of class imbalance and overlap creates a challenging scenario for classifiers independently of their learning paradigms (i.e., the nature of the learned decision boundaries) [1], there is no study that thoroughly discusses this topic, focusing specifically on establishing its effects on distinct learning biases with respect to real-world domains. Related research has established some insights regarding the behaviour of local versus global classifiers [7], symbolic and non-symbolic classifiers [6] and classifiers with different learning paradigms [15]. However, these comprise artificially generated data domains, where class overlap, class imbalance and other factors (data typology, data structure and class decomposition, local data densities, and data dimensionality) are defined a priori. Transposing these studies to real-world scenarios is now possible due to the increasing number of complexity measures proposed and revisited in the last few years, and it would be of major interest to the research community. This would lay the foundation for the choice of baseline approaches for imbalanced and overlapped domains, as well as guide the selection of ensemble approaches. SVM and KNN have perhaps been the most studied classifiers under varying degrees of complexity [6,7,11,13], whereas establishing the behaviour of other learning paradigms remains an open challenge.

6. Open source contributions

In this section, we highlight further directions for future research that are complementary to those identified in the previous section and may contribute to their more rapid and effective advancement. The main contribution of this section consists of the identification of benchmarks and open-source software to boost new developments in the field.

6.1. Benchmark datasets

Popular public repositories (e.g., UCI,12 Kaggle,13 KEEL,14 OpenML15) offer a diverse collection of datasets in what concerns their extrinsic complexity (number of instances, dimensionality, missing values, number of classes), though not focusing on their intrinsic complexity (class imbalance, class overlap, small disjuncts, noisy data and other data-related issues). Therefore, they lack diversity, i.e., they are not representative of a great span of complexity problems [98,129]. Regarding specific applications or data characteristics, KEEL is perhaps the most popular repository. It provides a collection of standard datasets as well as datasets targeted to imbalance learning, detection of noisy and borderline examples, and singular problems (multi-instance and multi-label datasets). Nevertheless, other data complexity factors remain overlooked. An important contribution to research would be the creation of an open repository representative of data complexity problems. This would establish a benchmark for studies regarding the domains of competence of classifiers, as well as the development of specialised approaches and AutoML pipelines. The following directions could be taken in order to develop data benchmarks targeted to complexity analysis:

• Providing a complete characterisation of datasets comprised in well-known repositories and grouping datasets according to their complexity. Varying degrees of data complexity could be determined and, in particular for class overlap, the taxonomy provided in Section 3 could be helpful to divide datasets depending on their dominant overlap representation. For instance, some datasets can be structurally intertwined (structural overlap), whereas others may include a great amount of difficult examples (identified with instance overlap measures). Combinations of these factors could also be considered.
• On this note, it is important to refer to the computational complexity associated to the computation of some complexity measures. Despite the fact that they have been used extensively in MtL applications, their widespread usage may be compromised by the fact that some are computationally expensive. In this regard, an open challenge lies in the optimisation of complexity measures. As an alternative, recent research has shown that it is possible to predict the data complexity measures of a given dataset using simpler, low-cost meta-features as input [130], which could also be an interesting direction to explore.
• Complementary to the characterisation of datasets, a possible strategy to guide researchers on the choice of appropriate datasets to evaluate their proposed approaches could be the creation of a meta-dataset, which could then be explored via clustering analysis to define groups of datasets with similar complexity. Another interesting approach is the one taken in [98], where datasets are projected onto a 2-dimensional instance space where their complexity and diversity can be visualised.
• Enhancing existing repositories with artificial data is also a possibility. Previous work suggests enhancing data repositories with the thoughtful design of artificial datasets, via evolutionary multi-objective algorithms [129]. This approach samples a real-world dataset so that the resulting set of examples optimises a set of data complexity measures. A similar approach based on class label modification is introduced in [131]. Another strategy is presented in [98], where datasets are evolved to fall onto target regions of the complexity space. Similarly, a recent and interesting line for future development is the exploration of data morphing, where a real-world dataset can be gradually manipulated to display certain meta-characteristics [132]. In this case, it would be possible to select a high complexity dataset with respect to certain properties (e.g., both structural and instance overlap) and iteratively transform a less complex dataset to exhibit gradual variations of those properties. Although manipulating the datasets artificially, these strategies aim to enrich their data characteristics while attempting to maintain the essence of real-world domains. With respect to class overlap, Sáez et al. [116] discuss a scheme to generate overlapping regions in real-world datasets.
• Alternatively, artificial datasets can be used as a benchmark to improve the behaviour of approaches with respect to a particular aspect (e.g., presence of borderline examples, class-skews). The main advantage is that artificial datasets can be tailored to the needs of the experimental setup, i.e., covering specific sources and ranges of data complexity or gradually increasing data complexity. A recent line of research in this direction is [133], where a many-objective optimisation algorithm is used for complexity-based data generation.

6.2. Software and open source implementations

• Code availability is a crucial aspect for the reproducibility of results. Long-established methods are implemented in several open-source software packages. Some of the most popular are the KEEL Software Tool16 [134–136] and the WEKA workbench17 [137], among other R18,19,20,21 [138–141] and Python22,23 [142,143] packages. However, most recent research work does not frequently provide open-source implementations of novel approaches on imbalanced and overlapped data. We have identified all existing resources (data and code) regarding class overlap-based approaches in imbalanced domains, so that researchers may consider them in future experiments.24 We further encourage future researchers to make their code and obtained results publicly available.
• Existing open-source implementations of complexity measures include the DCoL (C++)25 [39], ECoL26 [40] and the recent ImbCoL27 [54], SCoL28 [130], and mfe29 [144] packages (R code). There is also pymfe30 in Python. Regarding the class overlap measures identified in Section 3, these packages consider the implementation of the following: F1, F1v, F2, F3, F4, N1, N2, N3, N4, T1 and LSC. ImbCoL provides a decomposition by class of the original measures and SCoL focuses on simulated complexity measures, as discussed in the previous section. In order to foster the study of a more comprehensive set of measures of class overlap, we provide an extended Python library – Python Class Overlap Library (pycol)31 – comprising all the class overlap measures included in the previous packages, plus the remaining measures described in Section 3: F1, F1v, F2, F3, F4, IN, Purity, Neighbourhood Separability, MRCA, C1, C2, N2, NSG, ICSV, T1, DBC, ONB, Clst, N1, IPoints, LSC, kDN, Borderline Examples, degOver, SI, R-value, Raug, N3, N4, D3, CM, wCM, and dwCM (a minimal from-scratch example of one such measure is given at the end of this section). We are currently conducting a large experimental study over imbalanced and overlapped datasets, focusing on distinct representations of class overlap and the ability of the identified groups of class overlap complexity measures to effectively characterise them.
• Within the scope of artificially generated data, we also recommend the use of the data generator described in [6], for which we provide the documentation in English so that more researchers are able to understand and configure it. Additionally, we include our example collection of generated artificial datasets, as well as visualisation modules for data typology.32 We welcome other researchers to contribute with their own research data in order to move towards the creation of a representative repository regarding data complexity factors, beyond imbalanced and overlapped datasets.
• With respect to the Instance Space Analysis discussed in Section 4.2.3, exploring MATILDA (Melbourne Algorithm Test Instance Library with Data Analytics)33 is an interesting direction. It allows the visualisation of the distribution, diversity and complexity of existing benchmark and real-world instances, the generation of new synthetic test instances at specific locations of the instance space, and the analysis of algorithm footprints [98]. Another recent tool is PyHard,34 which allows assessing the complexity of individual examples within a dataset [145].

12 https://archive.ics.uci.edu
13 https://www.kaggle.com
14 http://keel.es
15 https://www.openml.org
16 https://github.com/SCI2SUGR/KEEL
17 https://www.cs.waikato.ac.nz/ml/weka/
18 https://cran.r-project.org/web/packages/unbalanced
19 https://cran.r-project.org/web/packages/smotefamily
20 https://cran.r-project.org/web/packages/ROSE
21 https://cran.r-project.org/web/packages/imbalance
22 https://pypi.org/project/imbalanced-learn/
23 https://github.com/analyticalmindsltd/smote_variants
24 https://github.com/miriamspsantos/open-source-imbalance-overlap
25 https://github.com/nmacia/dcol
26 https://github.com/lpfgarcia/ECoL
27 https://github.com/victorhb/ImbCoL
28 https://github.com/lpfgarcia/SCoL
29 https://github.com/rivolli/mfe
30 https://github.com/ealcobaca/pymfe
31 https://github.com/DiogoApostolo/pycol
32 https://github.com/miriamspsantos/datagenerator
33 https://matilda.unimelb.edu.au/matilda/our-methodology
34 https://pypi.org/project/pyhard/
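To make the discussion of complexity measure implementations concrete, the sketch below computes one of the simplest measures listed above, N3 (the leave-one-out error rate of a 1-NN classifier), together with a per-class decomposition in the spirit of the class-wise reporting provided by ImbCoL. This is an illustrative from-scratch implementation of the measure's standard definition, not the code of pycol, ECoL or any of the packages mentioned above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def n3(X, y):
    """N3: leave-one-out error rate of a 1-NN classifier.
    Higher values indicate more overlapped (harder) class boundaries."""
    # Two neighbours per point: assuming no duplicated points, the first
    # neighbour is the point itself and the second acts as the
    # leave-one-out 1-NN prediction.
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    return float(np.mean(y[idx[:, 1]] != y))

def n3_per_class(X, y):
    """Class-wise decomposition of N3 (errors among examples of each class),
    informative when the classes are imbalanced."""
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    errors = y[idx[:, 1]] != y
    return {c: float(np.mean(errors[y == c])) for c in np.unique(y)}

X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           class_sep=0.5, random_state=0)
print("N3 (overall):", round(n3(X, y), 3))
print("N3 (per class):", n3_per_class(X, y))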
7. Concluding remarks

As thoroughly discussed throughout this work, real-world applications need to account for both class imbalance and overlap when devising suitable solutions for domains affected by both problems. However, whereas class imbalance is simpler to characterise and measure, referring to the disproportion of examples between classes, class overlap stands as a confounding concept, due to the multitude of representations, i.e., specific types of overlap problems, it comprises. For instance, some authors may characterise overlap as the overlap between individual feature values, associating class overlap to the discriminative power of features. Others may characterise the problem by searching for complicated examples located in borderline regions between classes, in which case class overlap refers to instance complexity. The lack of a standard and well-formulated characterisation of class overlap in real-world domains is currently preventing the research community from moving towards improved approaches since, due to the lack of consensus and standardisation, the evaluation (and consequently, the comparison) of existing solutions and associated results (and insights) becomes extremely difficult.

In this work, we advocate for a unified view of the problem of class overlap in imbalanced domains, essentially dividing the paper into two parts: a conceptual discussion of the problems (Sections 2 and 3) and a multi-view panorama of the current state of knowledge and open avenues across several fields of Machine Learning (Sections 4 to 6).

In the first part of the paper, acknowledging class overlap as the overarching problem (as per se it is more harmful than class imbalance), we start by discussing the concepts associated with its definition across related work. We reason towards the idea that class overlap comprises multiple sources of complexity and that it needs to be characterised accordingly. Indeed, we argue that the class overlap measures currently used in the literature are not representative of the class overlap problem as a whole, but that they rather provide an estimate of a specific type (representation) of class overlap.

In this regard, in order to systematise the understanding of the problem of class overlap, we identify three main components underlying its characterisation: (1) the decomposition of the domains into regions of interest, (2) the identification of problematic regions (overlapped regions), and (3) the quantification/measurement of the class overlap problem. Depending on the approaches followed within each component, the obtained characterisation may refer to distinct class overlap representations, reflecting different insights on the problem.

Accordingly, we devise a novel taxonomy of class overlap complexity measures, establishing four main class overlap representations: (i) Feature Overlap, (ii) Instance Overlap, (iii) Structural Overlap, and (iv) Multiresolution Overlap. Each group is characterised in what concerns the insight its measures provide regarding the class overlap problem, as well as existing limitations. In other words, we explain how each group is able to capture a given representation of class overlap, while failing to perceive others. Besides establishing the association between complexity measures and their class overlap representations, our taxonomy evidences the core properties of the measures and provides an overview of the relationships between them. Additionally, it includes a comprehensive set of complexity measures, beyond the well-known measures initially proposed by Ho and Basu, and discusses whether they account for class imbalance, or how they can be extended to do so.

All in all, the concepts and ideas explored within the first part of this paper, culminating in the proposal of a new taxonomy of class overlap complexity measures, lay the foundation for a unified view of the problem of class overlap and may serve as a stepping stone for the design of improved measures and a characterisation of the problem as a whole in real-world domains.

Having laid out our conceptualisation of the problem of class overlap and its challenging aspects for imbalanced domains, we move towards the second part of the paper, offering a multi-view panorama regarding the synergy of both issues across four important areas of Machine Learning: Data Analysis, Data Preprocessing, Algorithm Design and Meta-learning. Regarding ongoing research directions, a few recent trends can be identified:

• A great amount of related work is currently focused on analysing the complexity of imbalanced classification tasks, either to establish the baseline difficulty of the learning process (data analysis) or to develop recommendation systems that compile this information and produce new inferences with various applications (meta-learning). Among existing data complexity measures, those associated to class overlap have provided the most perceptive insights. Nevertheless, due to the known biases introduced by the class imbalance problem, recent research is currently investigating adaptations of complexity measures to imbalanced domains, or focusing on the development of new measures that can take both issues simultaneously into account;
• Addressing multiple vortices of class overlap, i.e., considering distinct sources of complexity where class overlap has synergetic effects (e.g., local, structural, density information), has proven to be a successful approach, both in the field of data preprocessing and regarding the development of specialised approaches. Simultaneously incorporating several sources of information into the solutions seems to be key to producing improved results, which endorses our understanding of class overlap as a heterogeneous concept with distinct representations, and shows that there is an advantage in considering their combination;
• Another emergent line of research is the creation of instance spaces where the class overlap problem can be assessed in a lower-dimensional feature space, through data visualisation. This strategy resorts to dimensionality reduction techniques, where projections can be optimised in order to reveal linear trends between data complexity and classification performance.

Finally, we complemented the revision of the current state of the art by incorporating our thoughts regarding several lines of research across the four identified areas of research. We consider the following to be the most pressing to consider in future work:

• The development of approaches to address other learning tasks beyond binary classification problems. Most existing work on class imbalance and overlap is devised for binary classification domains, whereas the issues identified for other contexts (multi-class and singular problems) are yet to be faced;
• More extensive comparison of approaches to handle imbalanced and overlapped domains. In experimental studies, proposed methods are often evaluated against well-established approaches. New experiments should include emergent methods developed during the most recent years. Additionally, a deeper characterisation of datasets and standardisation of performance metrics is necessary to guarantee representative testbeds and a fair comparison of approaches;
• Optimisation of hyperparameters for preprocessing and specialised approaches, based on the evaluation of data complexity measures. In imbalanced and overlapped contexts, hyperparameters are often defined according to heuristic solutions or tuned based on classification results. Although previous research in related fields (Meta-learning) has produced an interesting body of work on the topic of hyperparameter recommendation (though most often using traditional meta-features), further research on imbalanced and overlapped domains is required, and should explore the possibility of incorporating complexity measures into the tuning process;
• In addition to the previous point, despite the fact that the Deep Learning community has invested in addressing the class imbalance problem in the latest years, deep learning systems are rarely discussed in more challenging scenarios, namely those comprising additional difficult characteristics, such as class overlap. It would be important to strengthen the understanding we currently have of the behaviour of deep learning models, given that, despite the growing interest they have attracted in the machine learning community, they seem to suffer from the same handicaps as their classical counterparts, namely in what concerns the combination of class imbalance and overlap;
• More thorough studies on the effect of class imbalance and overlap on distinct learning biases. Existing studies comprise artificially generated data, with controlled parameters to create distinct complexity factors. New insights are needed for real-world domains;
• The creation of a comprehensive benchmark of datasets and their characterisation should also be prioritised in future research. The same applies to the development of open-source implementations of state-of-the-art approaches for imbalanced and overlapped domains, as well as data complexity measures beyond those established by Ho and Basu, which are mainly the focus of existing libraries.

In sum, the purpose and contribution of this manuscript is two-fold. First, it establishes the theoretical foundations of the problem of class overlap and its implications for imbalanced domains. It is our belief that, despite the increasing amount of proposals for new methods and approaches to address imbalanced and overlapped domains, the lack of understanding regarding the class overlap problem (i.e., the lack of a precise definition, measurement, and characterisation of the problem) is preventing the development of optimal solutions. In this regard, we hope that the concepts and resulting taxonomy discussed throughout this work, acknowledging the heterogeneity of the class overlap problem, may encourage the dialogue among researchers towards a consensus on the matter. Secondly, beyond providing a comprehensive identification of open avenues for research, this paper incorporates our thoughts and suggestions on how to address them in future work. We sincerely hope that these lines of investigation may guide machine learning researchers on their journey to pursue future research in this field.
CRediT authorship contribution statement

Miriam Seoane Santos: Conceptualisation, Methodology, Literature Search, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Pedro Henriques Abreu: Conceptualisation, Validation, Writing – review & editing, Supervision. Nathalie Japkowicz: Conceptualisation, Methodology, Formal analysis, Validation, Writing – review & editing. Alberto Fernández: Conceptualisation, Validation, Writing – original draft, Writing – review & editing, Supervision. João Santos: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgements

This work is funded by the FCT - Foundation for Science and Technology, Portugal, I.P./MCTES through national funds (PIDDAC), within the scope of CISUC R&D Unit - UIDB/00326/2020 or project code UIDP/00326/2020. This work is also partially supported by the Andalusian frontier regional project A-TIC-434-UGR20 and by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00, including European Regional Development Funds. The work is further supported by the FCT Research Grant, Portugal SFRH/BD/138749/2018.

References

[1] S. Das, S. Datta, B. Chaudhuri, Handling data irregularities in classification: Foundations, trends, and future challenges, Pattern Recognit. 81 (2018) 674–693.
[2] K. Napierała, J. Stefanowski, S. Wilk, Learning from imbalanced data in presence of noisy and borderline examples, in: International Conference on Rough Sets and Current Trends in Computing, Springer, 2010, pp. 158–167.
[3] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Inform. Sci. 250 (2013) 113–141.
[4] J. Stefanowski, Dealing with data difficulty factors while learning from imbalanced data, in: Challenges in Computational Statistics and Data Mining, Springer, 2016, pp. 333–363.
[5] A. Fernández, S. García, M. Galar, R.C. Prati, B. Krawczyk, F. Herrera, Data intrinsic characteristics, Learn. Imbalanced Data Sets (2018) 253–277.
[6] S. Wojciechowski, S. Wilk, Difficulty factors and preprocessing in imbalanced data sets: An experimental study on artificial data, Found. Comput. Decis. Sci. 42 (2) (2017) 149–176.
[7] V. García, R. Mollineda, J. Sánchez, On the k-NN performance in a challenging scenario of imbalance and overlapping, Pattern Anal. Appl. 11 (3–4) (2008) 269–280.
[8] M.R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Mach. Learn. 95 (2) (2014) 225–256.
[9] A. Fernández, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary, J. Artificial Intelligence Res. 61 (2018) 863–905.
[10] M.S. Santos, J.P. Soares, P.H. Abreu, H. Araújo, J. Santos, Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches, IEEE Comput. Intell. Mag. 13 (3) (2018) 59–76.
[11] M. Denil, T. Trappenberg, Overlap versus imbalance, in: Canadian Conference on Artificial Intelligence, Springer, 2010, pp. 220–231.
[12] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Syst. Appl. 73 (2017) 220–239.
[13] H.K. Lee, S.B. Kim, An overlap-sensitive margin classifier for imbalanced and overlapping data, Expert Syst. Appl. 98 (2018) 72–83.
[14] R. Prati, G. Batista, M. Monard, Class imbalances versus class overlapping: An analysis of a learning system behavior, in: Mexican International Conference on Artificial Intelligence, Springer, 2004, pp. 312–321.
[15] M. Mercier, M.S. Santos, P.H. Abreu, C. Soares, J.P. Soares, J. Santos, Analysing the footprint of classifiers in overlapped and imbalanced contexts, in: International Symposium on Intelligent Data Analysis, Springer, 2018, pp. 200–212.
[16] G.-H. Fu, Y.-J. Wu, M.-J. Zong, L.-Z. Yi, Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics, Chemometr. Intell. Lab. Syst. 196 (2020) 103906.
[17] D. Singh, A. Gosain, A. Saha, Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets, Stat. Anal. Data Min.: ASA Data Sci. J. 13 (4) (2020) 394–404.
[18] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, On the class overlap problem in imbalanced data classification, Knowl.-Based Syst. (2020) 106631.
[19] M.S. Santos, P.H. Abreu, N. Japkowicz, A. Fernández, C. Soares, S. Wilk, J. Santos, On the joint-effect of class imbalance and overlap: A critical review, Artif. Intell. Rev. (2022) 1–69.
[20] T. Meng, X. Jing, Z. Yan, W. Pedrycz, A survey on machine learning for data fusion, Inf. Fusion 57 (2020) 115–129.
[21] A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115.
[22] Y.-L. Chou, C. Moreira, P. Bruza, C. Ouyang, J. Jorge, Counterfactuals and causability in explainable artificial intelligence: Theory, algorithms, and applications, Inf. Fusion 81 (2022) 59–83.
[23] Y. Zhu, J. Ma, C. Yuan, X. Zhu, Interpretable learning based dynamic graph convolutional networks for Alzheimer's disease analysis, Inf. Fusion 77 (2022) 53–61.
[24] J. Sun, H. Li, H. Fujita, B. Fu, W. Ai, Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting, Inf. Fusion 54 (2020) 128–144.
[25] F. Ali, S. El-Sappagh, S.R. Islam, D. Kwak, A. Ali, M. Imran, K.-S. Kwak, A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion, Inf. Fusion 63 (2020) 208–222.
[26] Y. Zhang, S. Wang, K. Xia, Y. Jiang, P. Qian, A.D.N. Initiative, et al., Alzheimer's disease multiclass diagnosis via multimodal neuroimaging embedding feature selection and fusion, Inf. Fusion 66 (2021) 170–183.
[27] S.-H. Wang, D.R. Nayak, D.S. Guttery, X. Zhang, Y.-D. Zhang, COVID-19 classification by CCSHNet with deep fusion using transfer learning and discriminant correlation analysis, Inf. Fusion 68 (2021) 131–148.
[28] H. Yang, Y. Luo, X. Ren, M. Wu, X. He, B. Peng, K. Deng, D. Yan, H. Tang, H. Lin, Risk prediction of diabetes: big data mining with fusion of multifarious physical examination indicators, Inf. Fusion 75 (2021) 140–149.
[29] S.-H. Wang, V.V. Govindaraj, J.M. Górriz, X. Zhang, Y.-D. Zhang, Covid-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network, Inf. Fusion 67 (2021) 208–229.
[30] G. Muhammad, M.S. Hossain, COVID-19 and non-COVID-19 classification using multi-layers fusion from lung ultrasound images, Inf. Fusion 72 (2021) 80–88.
[31] L. Chen, B. Fang, Z. Shang, Y. Tang, Tackling class overlap and imbalance problems in software defect prediction, Softw. Qual. J. 26 (1) (2018) 97–125.
[32] M. Lopez-Martin, A. Sanchez-Esguevillas, J.I. Arribas, B. Carro, Supervised contrastive learning over prototype-label embeddings for network intrusion detection, Inf. Fusion 79 (2022) 200–228.
[33] T. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 289–300.
[34] N. Anwar, G. Jones, S. Ganesh, Measurement of data complexity for classification problems with unbalanced data, Stat. Anal. Data Min.: ASA Data Sci. J. 7 (3) (2014) 194–211.
[35] L. Cummins, Combining and choosing case base maintenance algorithms (Ph.D. thesis), University College Cork, 2013.
[36] E. Leyva, A. González, R. Perez, A set of complexity measures designed for applying meta-learning to instance selection, IEEE Trans. Knowl. Data Eng. 27 (2) (2014) 354–367.
[37] G. Armano, E. Tamponi, Experimenting multiresolution analysis for identifying regions of different classification complexity, Pattern Anal. Appl. 19 (1) (2016) 129–137.
[38] Z. Borsos, C. Lemnaru, R. Potolea, Dealing with overlap and imbalance: A new metric and approach, Pattern Anal. Appl. 21 (2) (2018) 381–395.
[39] A. Orriols-Puig, N. Macia, T.K. Ho, Documentation for the data complexity library in C++, Universitat Ramon Llull, La Salle 196 (2010) 1–40.
[40] A.C. Lorena, L.P. Garcia, J. Lehmann, M.C. Souto, T.K. Ho, How complex is your classification problem? A survey on measuring classification complexity, ACM Comput. Surv. 52 (5) (2019) 1–34.
[41] J.D. Pascual-Triana, D. Charte, M.A. Arroyo, A. Fernández, F. Herrera, Revisiting data complexity metrics based on morphology for overlap and imbalance: Snapshot, new overlap number of balls metrics and singular problems prospect, Knowl. Inf. Syst. (2021) 1–29.
[42] V.H. Barella, L.P. Garcia, M.C. de Souto, A.C. Lorena, A.C. de Carvalho, Assessing the data complexity of imbalanced datasets, Inform. Sci. 553 (2021) 83–109.
[43] A. Fernández, S. García, M. Galar, R.C. Prati, B. Krawczyk, F. Herrera, Learning from imbalanced data sets, Vol. 11, Springer, 2018.
[44] A. Rivolli, L.P. Garcia, C. Soares, J. Vanschoren, A.C. de Carvalho, Characterizing classification datasets: A study of meta-features for meta-learning, 2018, arXiv preprint arXiv:1808.10406.
[45] V. García, R. Alejo, J. Sánchez, J. Sotoca, R. Mollineda, Combined effects of class imbalance and class overlap on instance-based classification, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2006, pp. 371–378.
[46] V. García, R. Mollineda, J. Sánchez, R. Alejo, J. Sotoca, When overlapping unexpectedly alters the class imbalance effects, in: Iberian Conference on Pattern Recognition and Image Analysis, Springer, 2007, pp. 499–506.
[47] V. García, J. Sánchez, R. Mollineda, An empirical study of the behavior of classifiers on imbalanced and overlapped data sets, in: Iberoamerican Congress on Pattern Recognition, Springer, 2007, pp. 397–406.
[48] J. Stefanowski, Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data, in: Emerging Paradigms in Machine Learning, Springer, 2013, pp. 277–306.
[49] X. Chen, L. Zhang, X. Wei, X. Lu, An effective method using clustering-based adaptive decomposition and editing-based diversified oversampling for multi-class imbalanced datasets, Appl. Intell. (2020) 1–16.
[50] Y. Zhu, Y. Yan, Y. Zhang, Y. Zhang, EHSO: Evolutionary hybrid sampling in overlapping scenarios for imbalanced learning, Neurocomputing 417 (2020) 333–346.
[51] J.M. Sotoca, J. Sanchez, R.A. Mollineda, A review of data complexity measures and their applicability to pattern classification problems, Actas Del III Taller Nacional de Mineria de Datos Y Aprendizaje, TAMIDA (2005) 77–83.
[52] J.M. Sotoca, R.A. Mollineda, J.S. Sánchez, A meta-learning framework for pattern classification by means of data complexity measures, Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial 10 (29) (2006) 31–38.
[53] J. Luengo, A. Fernández, S. García, F. Herrera, Addressing data complexity for imbalanced data sets: Analysis of SMOTE-based oversampling and evolutionary undersampling, Soft Comput. 15 (10) (2011) 1909–1936.
[54] V.H. Barella, L.P. Garcia, M.P. de Souto, A.C. Lorena, A. de Carvalho, Data complexity measures for imbalanced classification tasks, in: 2018 International Joint Conference on Neural Networks, IJCNN, IEEE, 2018, pp. 1–8.
[55] A. Ali, S.M. Shamsuddin, A.L. Ralescu, et al., Classification with class imbalance problem: A review, Int. J. Adv. Soft Comput. Appl. 7 (3) (2015) 176–204.
[56] C.M. Van der Walt, E. Barnard, Measures for the characterisation of pattern-recognition data sets, in: Annual Symposium of the Pattern Recognition Association of South Africa, 2007, pp. 1–6.
[57] J. Błaszczyński, J. Stefanowski, Local data characteristics in learning classifiers from imbalanced data, in: Advances in Data Analysis with Computational Intelligence Methods, Springer, 2018, pp. 51–85.
[58] S. Oh, A new dataset evaluation method based on category overlap, Comput. Biol. Med. 41 (2) (2011) 115–122.
[59] C. Thornton, Separability is a learner's best friend, in: 4th Neural Computation and Psychology Workshop, London, 9–11 April 1997, Springer, 1998, pp. 40–46.
[60] J. Greene, Feature subset selection using Thornton's separability index and its applicability to a number of sparse proximity-based classifiers, in: Annual Symposium of the Pattern Recognition Association of South Africa, 2001, pp. 1–5.
[61] K. Napierala, J. Stefanowski, Types of minority class examples and their influence on learning classifiers from imbalanced data, J. Intell. Inf. Syst. 46 (3) (2016) 563–597.
[62] R.A. Sowah, M.A. Agebure, G.A. Mills, K.M. Koumadi, S.Y. Fiawoo, New cluster undersampling technique for class imbalance learning, Int. J. Mach. Learn. Comput. 6 (3) (2016) 205.
[63] A. Guzmán-Ponce, R.M. Valdovinos, J.S. Sánchez, J.R. Marcial-Romero, A new under-sampling method to face class overlap and imbalance, Appl. Sci. 10 (15) (2020) 5164.
[64] P. Vuttipittayamongkol, E. Elyan, Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson's disease, Int. J. Neural Syst. 30 (08) (2020) 2050043.
[65] C.M. Van der Walt, et al., Data measures that characterise classification problems (Ph.D. thesis), University of Pretoria, 2008.
[66] S. Massie, S. Craw, N. Wiratunga, Complexity-guided case discovery for case based reasoning, in: AAAI, Vol. 5, 2005, pp. 216–221.
[67] S. Singh, PRISM–A novel framework for pattern recognition, Pattern Anal. Appl. 6 (2) (2003) 134–149.
[68] S. Singh, Multiresolution estimates of classification complexity, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1534–1539.
[69] C.G. Weng, J. Poon, A data complexity analysis on imbalanced datasets and an alternative imbalance recovering strategy, in: 2006 IEEE WIC ACM International Conference on Web Intelligence, IEEE, 2006, pp. 270–276.
[70] P. Vorraboot, S. Rasmequan, K. Chinnasarn, C. Lursinsap, Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms, Neurocomputing 152 (2015) 429–443.
[71] E.R. Fernandes, A.C. de Carvalho, Evolutionary inversion of class distribution in overlapping areas for multi-class imbalanced learning, Inform. Sci. 494 (2019) 141–154.
[72] M. Lango, D. Brzezinski, J. Stefanowski, Imweights: Classifying imbalanced data using local and neighborhood information, in: Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR, 2018, pp. 95–109.
[73] M. Lango, K. Napierala, J. Stefanowski, Evaluating difficulty of multi-class imbalanced data, in: International Symposium on Methodologies for Intelligent Systems, Springer, 2017, pp. 312–322.
[74] D. Charte, F. Charte, S. García, F. Herrera, A snapshot on nonstandard supervised learning problems: taxonomy, relationships, problem transformations and algorithm adaptations, Prog. Artif. Intell. 8 (1) (2019) 1–14.
[75] J.P.M. De Sá, Pattern Recognition: Concepts, Methods, and Applications, Springer Science & Business Media, 2001.
[76] C. Bunkhumpornpat, K. Sinapiromsaran, DBMUTE: density-based majority under-sampling technique, Knowl. Inf. Syst. 50 (3) (2017) 827–850.
[77] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, C. Jayne, Overlap-based undersampling for improving imbalanced data classification, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2018, pp. 689–697.
[78] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, MUTE: Majority under-sampling technique, in: 2011 8th International Conference on Information, Communications & Signal Processing, IEEE, 2011, pp. 1–4.
[79] P. Vuttipittayamongkol, E. Elyan, Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Inform. Sci. 509 (2020) 47–70.
[80] J. Sáez, J. Luengo, J. Stefanowski, F. Herrera, SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inform. Sci. 291 (2015) 184–203.
[81] I. Nekooeimehr, S.K. Lai-Yuen, Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Syst. Appl. 46 (2016) 405–416.
[82] J. Wei, H. Huang, L. Yao, Y. Hu, Q. Fan, D. Huang, IA-SUWO: An improving adaptive semi-unsupervised weighted oversampling for imbalanced classification problems, Knowl.-Based Syst. 203 (2020) 106116.
[83] J. Wei, H. Huang, L. Yao, Y. Hu, Q. Fan, D. Huang, NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems, Expert Syst. Appl. 158 (2020) 113504.
[84] T. Zhu, Y. Lin, Y. Liu, Improving interpolation-based oversampling for imbalanced data learning, Knowl.-Based Syst. 187 (2020) 104826.
[85] G. Douzas, F. Bacao, Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE, Inform. Sci. 501 (2019) 118–135.
[86] V. García, J. Sánchez, A. Marqués, R. Florencia, G. Rivera, Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data, Expert Syst. Appl. 158 (2020) 113026.
[87] A.R.S. Parmezan, H.D. Lee, F.C. Wu, Metalearning for choosing feature selection algorithms in data mining: Proposal of a new framework, Expert Syst. Appl. 75 (2017) 1–24.
[88] L.C. Okimoto, R.M. Savii, A.C. Lorena, Complexity measures effectiveness in feature selection, in: 2017 Brazilian Conference on Intelligent Systems, BRACIS, IEEE, 2017, pp. 91–96.
[89] L.C. Okimoto, A.C. Lorena, Data complexity measures in feature selection, in: 2019 International Joint Conference on Neural Networks, IJCNN, IEEE, 2019, pp. 1–8.
[90] B. Seijo-Pardo, V. Bolón-Canedo, A. Alonso-Betanzos, On developing an automatic threshold applied to feature selection ensembles, Inf. Fusion 45 (2019) 227–245.
[91] N.T. Dong, M. Khosla, Revisiting feature selection with data complexity, in: 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering, BIBE, IEEE, 2020, pp. 211–216.
[92] A. Fernández, M.J. del Jesus, F. Herrera, Addressing overlapping in classification with imbalanced datasets: A first multi-objective approach for feature and instance selection, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2015, pp. 36–44.
[93] X. Lin, H. Song, M. Fan, W. Ren, L. Li, W. Yao, The feature selection algorithm based on feature overlapping and group overlapping, in: 2016 IEEE International Conference on Bioinformatics and Biomedicine, BIBM, IEEE, 2016, pp. 619–624.
[94] H. Hartono, E. Ongko, Y. Risyani, Combining feature selection and hybrid approach redefinition in handling class imbalance and overlapping for multi-class imbalanced, Indonesian J. Electr. Eng. Comput. Sci. 21 (3) (2021) 1513–1522.
[95] B. Omar, F. Rustam, A. Mehmood, G.S. Choi, et al., Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: Application to fraud detection, IEEE Access 9 (2021) 28101–28110.
[96] K. Smith-Miles, D. Baatar, B. Wreford, R. Lewis, Towards objective measures of algorithm performance across instance space, Comput. Oper. Res. 45 (2014) 12–24.
[97] K. Smith-Miles, T.T. Tan, Measuring algorithm footprints in instance space, in: 2012 IEEE Congress on Evolutionary Computation, IEEE, 2012, pp. 1–8.
[98] M.A. Muñoz, L. Villanova, D. Baatar, K. Smith-Miles, Instance spaces for machine learning classification, Mach. Learn. 107 (1) (2018) 109–147.
[99] M.A. Muñoz, T. Yan, M.R. Leal, K. Smith-Miles, A.C. Lorena, G.L. Pappa, R.M. Rodrigues, An instance space analysis of regression problems, ACM Trans. Knowl. Discov. Data (TKDD) 15 (2) (2021) 1–25.
[100] J. Vanschoren, Meta-learning: A survey, 2018, arXiv preprint arXiv:1810.03548.
[101] M.M. Nwe, K.T. Lynn, KNN-based overlapping samples filter approach for classification of imbalanced data, in: International Conference on Software Engineering Research, Management and Applications, Springer, 2019, pp. 55–73.
[102] P. Skryjomski, B. Krawczyk, Influence of minority class instance types on SMOTE imbalanced data oversampling, in: First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR, 2017, pp. 7–21.
[103] J. Sáez, B. Krawczyk, M. Woźniak, Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets, Pattern Recognit. 57 (2016) 164–178.
[104] M. Koziarski, M. Woźniak, CCR: A combined cleaning and resampling algorithm for imbalanced data classification, Int. J. Appl. Math. Comput. Sci. 27 (4) (2017) 727–736.
[105] A. Fernández, C.J. Carmona, M.J. del Jesus, F. Herrera, A Pareto-based ensemble with feature and instance selection for learning from multi-class imbalanced datasets, Int. J. Neural Syst. 27 (06) (2017) 1750028.
[106] V.H. Barella, E.P. Costa, A.C.P.L.F. de Carvalho, ClusterOSS: A new undersampling method for imbalanced learning, in: Brazilian Conference on Intelligent Systems, Academic Press, 2014, pp. 1–6.
[107] K. Ghosh, C. Bellinger, R. Corizzo, B. Krawczyk, N. Japkowicz, On the combined effect of class imbalance and concept complexity in deep learning, 2021, arXiv preprint arXiv:2107.14194.
[108] A. Rivolli, L.P. Garcia, C. Soares, J. Vanschoren, A.C. de Carvalho, Towards reproducible empirical research in meta-learning, 2018, pp. 32–52, arXiv preprint arXiv:1808.10406.
[109] S.N. das Dôres, L. Alves, D.D. Ruiz, R.C. Barros, A meta-learning framework for algorithm recommendation in software fault prediction, in: Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016, pp. 1486–1491.
[110] R. Shah, V. Khemani, M. Azarian, M. Pecht, Y. Su, Analyzing data complexity using metafeatures for classification algorithm selection, in: 2018 Prognostics and System Health Management Conference (PHM-Chongqing), IEEE, 2018, pp. 1280–1284.
[111] X. Zhang, R. Li, B. Zhang, Y. Yang, J. Guo, X. Ji, An instance-based learning recommendation algorithm of imbalance handling methods, Appl. Math. Comput. 351 (2019) 204–218.
[112] A.J. Costa, M.S. Santos, C. Soares, P.H. Abreu, Analysis of imbalance strategies recommendation using a meta-learning approach, in: 7th ICML Workshop on Automated Machine Learning (AutoML-ICML2020), 2020, pp. 1–10.
[113] L.P. Garcia, A.C. Lorena, M.C. de Souto, T.K. Ho, Classifier recommendation using data complexity measures, in: 2018 24th International Conference on Pattern Recognition, ICPR, IEEE, 2018, pp. 874–879.
[114] J. Luengo, F. Herrera, An automatic extraction method of the domains of competence for learning classifiers using data complexity measures, Knowl. Inf. Syst. 42 (1) (2015) 147–180.
[115] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T.-Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: 2020 IEEE 36th International Conference on Data Engineering, ICDE, IEEE, 2020, pp. 841–852.
[116] J.A. Sáez, M. Galar, B. Krawczyk, Addressing the overlapping data problem in classification using the one-vs-one decomposition strategy, IEEE Access 7 (2019) 83396–83411.
[117] M. Galar, A. Fernández, E. Barrenechea, F. Herrera, DRCW-OVO: Distance-based relative competence weighting combination for one-vs-one strategy in multi-class problems, Pattern Recognit. 48 (1) (2015) 28–42.
[118] M. Janicka, M. Lango, J. Stefanowski, Using information on class interrelations to improve classification of multiclass imbalanced data: A new resampling algorithm, Int. J. Appl. Math. Comput. Sci. 29 (4) (2019).
[119] F. Herrera, F. Charte, A.J. Rivera, M.J. Del Jesus, Multilabel classification, in: Multilabel Classification, Springer, 2016, pp. 17–31.
[120] I. Bendjoudi, F. Vanderhaegen, D. Hamad, F. Dornaika, Multi-label, multi-task CNN approach for context-based emotion recognition, Inf. Fusion 76 (2021) 422–428.
[121] F. Herrera, S. Ventura, R. Bello, C. Cornelis, A. Zafra, D. Sánchez-Tarragó, S. Vluymans, Multiple instance learning, in: Multiple Instance Learning, Springer, 2016, pp. 17–33.
[122] S. Vluymans, D.S. Tarragó, Y. Saeys, C. Cornelis, F. Herrera, Fuzzy rough classifiers for class imbalanced multi-instance data, Pattern Recognit. 53 (2016) 36–45.
[123] G. Melki, A. Cano, S. Ventura, MIRSVM: Multi-instance support vector machine with bag representatives, Pattern Recognit. 79 (2018) 228–241.
[124] S. Sun, L. Mao, Z. Dong, L. Wu, Multiview Machine Learning, Springer, 2019.
[125] D. Jiang, R. Xu, X. Xu, Y. Xie, Multi-view feature transfer for click-through rate prediction, Inform. Sci. 546 (2021) 961–976.
[126] R.G. Mantovani, A.L. Rossi, J. Vanschoren, B. Bischl, A.C. Carvalho, To tune or not to tune: Recommending when to adjust SVM hyper-parameters via meta-learning, in: 2015 International Joint Conference on Neural Networks, IJCNN, IEEE, 2015, pp. 1–8.
[127] R.G. Mantovani, A.L. Rossi, J. Vanschoren, A.C. de Carvalho, Meta-learning recommendation of default hyper-parameter values for SVMs in classification tasks, in: MetaSel PKDD/ECML, 2015, pp. 80–92.
[128] M. Mahin, M.J. Islam, B.C. Debnath, A. Khatun, Tuning distance metrics and k to find sub-categories of minority class from imbalance data using k nearest neighbours, in: 2019 International Conference on Electrical, Computer and Communication Engineering, ECCE, IEEE, 2019, pp. 1–6.
[129] N. Macià, E. Bernadó-Mansilla, Towards UCI+: A mindful repository design, Inform. Sci. 261 (2014) 237–262.
[130] L.P. Garcia, A. Rivolli, E. Alcobaça, A.C. Lorena, A.C. de Carvalho, Boosting meta-learning with simulated data complexity measures, Intell. Data Anal. 24 (5) (2020) 1011–1028.
[131] V.V. de Melo, A.C. Lorena, Using complexity measures to evolve synthetic classification datasets, in: 2018 International Joint Conference on Neural Networks, IJCNN, IEEE, 2018, pp. 1–8.
[132] A. Correia, C. Soares, A. Jorge, Dataset morphing to analyze the performance of collaborative filtering, in: International Conference on Discovery Science, Springer, 2019, pp. 29–39.
[133] T.R. França, P.B. Miranda, R.B. Prudêncio, A.C. Lorena, A.C. Nascimento, A many-objective optimization approach for complexity-based data set generation, in: 2020 IEEE Congress on Evolutionary Computation, CEC, IEEE, 2020, pp. 1–8.
[134] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, et al., KEEL: A software tool to assess evolutionary algorithms for data mining problems, Soft Comput. 13 (3) (2009) 307–318.
[135] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2011).
[136] I. Triguero, S. González, J.M. Moyano, S. García, J. Alcalá-Fdez, J. Luengo, A. Fernández, M.J. del Jesus, L. Sánchez, F. Herrera, KEEL 3.0: An open source software for multi-stage analysis in data mining, Int. J. Comput. Intell. Syst. 10 (1) (2017) 1238–1249.
[137] E. Frank, M. Hall, G. Holmes, R. Kirkby, B. Pfahringer, I.H. Witten, L. Trigg, Weka: A machine learning workbench for data mining, in: Data Mining and Knowledge Discovery Handbook, Springer, 2009, pp. 1269–1277.
[138] A. Dal Pozzolo, O. Caelen, S. Waterschoot, G. Bontempi, Racing for unbalanced methods selection, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2013, pp. 24–31.
[139] N. Lunardon, G. Menardi, N. Torelli, ROSE: A package for binary imbalanced learning, R Journal 6 (1) (2014).
[140] W. Siriseriwan, Smotefamily: A collection of oversampling techniques for class imbalance problem based on SMOTE, 2019.
[141] I. Cordón, S. García, A. Fernández, F. Herrera, Imbalance: Oversampling algorithms for imbalanced classification in R, Knowl.-Based Syst. 161 (2018) 329–341.
[142] G. Lemaître, F. Nogueira, C.K. Aridas, Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (1) (2017) 559–563.
[143] G. Kovács, An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets, Appl. Soft Comput. 83 (2019) 105662.
[144] E. Alcobaça, F. Siqueira, A. Rivolli, L.P.F. Garcia, J.T. Oliva, A.C.P.L.F. de Carvalho, MFE: Towards reproducible meta-feature extraction, J. Mach. Learn. Res. 21 (111) (2020) 1–5.
[145] P.Y.A. Paiva, K. Smith-Miles, M.G. Valeriano, A.C. Lorena, PyHard: A novel tool for generating hardness embeddings to support data-centric analysis, 2021, arXiv preprint arXiv:2109.14430.