
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 33, NO. 8, AUGUST 2022

Imbalanced Data Classification via Cooperative


Interaction Between Classifier and Generator
Hyun-Soo Choi , Dahuin Jung , Siwon Kim , and Sungroh Yoon , Senior Member, IEEE

Abstract— Learning classifiers with imbalanced data can be strongly biased toward the majority class. To address this issue, several methods have been proposed using generative adversarial networks (GANs). Existing GAN-based methods, however, do not effectively utilize the relationship between a classifier and a generator. This article proposes a novel three-player structure consisting of a discriminator, a generator, and a classifier, along with decision boundary regularization. Our method is distinctive in that the generator is trained in cooperation with the classifier to provide minority samples that gradually expand the minority decision region, improving performance for imbalanced data classification. The proposed method outperforms the existing methods on real data sets as well as synthetic imbalanced data sets.

Index Terms— Classification, decision boundary, deep learning, generative adversarial networks (GANs), imbalanced data, supervised learning.

Manuscript received 6 March 2020; revised 2 October 2020; accepted 12 December 2020. Date of publication 2 February 2021; date of current version 4 August 2022. This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (Ministry of Science and ICT) under Grant 2018R1A2B3001628 and in part by the Brain Korea 21 Plus Project in 2021. (Corresponding author: Sungroh Yoon.)
Hyun-Soo Choi is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea, also with the Vision AI Labs, SK Telecom, Seoul 04539, South Korea, and also with the Department of Computer Science and Engineering, Kangwon National University, Gangwon-Do 24341, South Korea.
Dahuin Jung and Siwon Kim are with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea.
Sungroh Yoon is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea, also with ASRI, INMC, ISRC, Seoul National University, Seoul 08826, South Korea, and also with the Institute of Engineering Research, Seoul National University, Seoul 08826, South Korea (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3052243.
Digital Object Identifier 10.1109/TNNLS.2021.3052243
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

THE imbalanced data problem is a phenomenon in which the number of samples in minority and majority classes has a large gap in the training data. The medical domain is a representative field of the imbalanced data problem [1]. Domains including biology, network intrusion, and fraud detection also suffer from the same phenomenon [2]–[5]. The imbalance ratio (IR) between minority and majority classes varies depending on the application, and in severe cases, the IR may be as high as 100 000 [6], [7]. In many applications, it is more costly and important to classify the minority than the majority class [8], [9]. Since imbalanced data causes severe performance degradation in machine learning, it is an important research topic in both academia and industry [5]. The decision boundary learned by standard machine learning with imbalanced data can be strongly biased by the majority class, causing low precision of the minority class. Ultimately, the goal of addressing the imbalanced data problem is to increase the classification performance on the minority class.

Various methods have been proposed to overcome the imbalanced data problem [10]. Among existing methods, the data-level balancing approach has been widely used to balance training samples [6], [7], [11]–[19]. The loss-based (cost-sensitive) balancing approach, which gives larger weights on minority samples than on the majority samples, has also been widely used [20]–[22]. The classifier-design approach for balancing is to design algorithmic techniques embedded in a classifier to overcome the class-imbalance problem inherently [23]–[26]. These conventional methods have been effectively applied to shallow learning classifiers using handcrafted features, such as SIFT [27] and SURF [28]. In recent years, deep learning classifiers outperform the shallow learning classifiers by a large margin [29]. Hence, even when the shallow learning classifiers adopt an imbalanced learning scheme, it is difficult for them to exceed the baseline performance of a deep learning classifier without an imbalanced learning scheme. In this article, we aim to develop a method that can be applied to a deep learning classifier that uses images as inputs directly, without handcrafted features.

Recently, to tackle the imbalanced problem in deep learning classifiers, generative adversarial networks (GANs) [30] have been used to generate high-dimensional synthetic samples in the minority class [20]–[22], [31], [32]. Most of the existing GAN-based methods do not consider the effect on a classifier when training a generator and a discriminator of the GAN, thus limiting improvement opportunities for the generated samples. To handle this issue, the concept of TripleGAN [33] has been adopted to address imbalanced data classification [34]. However, since TripleGAN was proposed for semisupervised learning, it has an adversarial relationship between a classifier and a discriminator, which limits the performance improvement.

In this article, instead of the adversarial relationship between classifier and GAN (generator/discriminator), we propose a novel cooperative relationship between classifier and GAN (generator/discriminator). In deep classifiers, implicit feature embedding techniques via multiple layers are used to obtain discriminative features that are easily separable between classes. Nevertheless, when the training samples are not sufficient or are imbalanced, the deep features may not be embedded well enough to form a trained borderline similar to the true borderline between classes. In this perspective, to mitigate the problem of training the deep network with imbalanced
data, we aim to balance the samples in the overlapping region around the trained borderline by the proposed cooperative structure of classifier, generator, and discriminator, along with a borderline regularization.

For imbalanced data including between- and within-class imbalance, we assume that the samples in the overlapping region are also imbalanced and yield a biased training of the classifier. In particular, when training with within-class imbalanced data, the imbalance would become more severe around the borderline adjacent to the within-class minority region. When training the deep classifier without any consideration of the imbalance, the trained borderline in the overlapping region can be biased toward the minority region. Hence, as shown in Fig. 1, our approach tries to move the trained borderline toward the majority region by training the classifier with the generated minority samples via the proposed cooperative training of classifier, generator, and discriminator, along with the borderline regularization.

Fig. 1. Cooperative interaction between a generator and a classifier to expand the skewed decision region of the minority class.

Our key concepts and contributions are as follows.
1) A three-player structure (a classifier, a discriminator, and a generator) is proposed in a cooperative relationship between the generator and the classifier to address imbalanced data learning.
2) A novel regularization term is embedded to expand the decision boundary of the minority class in a cooperative interaction of the generator and the classifier.
3) We develop an alternating optimization strategy, along with a regularization decaying scheme to prevent overgeneralization, in which the generator and the classifier are trained alternately to learn a desirable distribution.
4) The proposed method is validated experimentally using in-depth self-analysis as well as comparison with the existing methods.

II. RELATED WORK

A. Imbalanced Data Classification

1) Data-Level Balancing Approach: The data-level balancing approach is divided into three categories: undersampling, oversampling, and hybrid methods. The undersampling methods, such as clustering centroids (C-Centroids) and condensed nearest neighbor (CN-Neighbor), balance the training data by removing majority samples [11]. The oversampling methods generate synthetic minority samples to balance the training data. A representative method is the synthetic minority oversampling technique (SMOTE) [6]. Several variants have been proposed to overcome the limitations of SMOTE. Borderline-SMOTE (B-SMOTE) [12], neighborhood rough set boundary-SMOTE (NRSB-SMOTE) [35], and the adaptive synthetic sampling approach (ADASYN) [13] adaptively generate samples considering the proportions of adjacent majority data. To ensure that the generated samples belong to a minority class, the majority weighted minority oversampling technique (MWMOTE) [7] effectively identifies minority-sample-dominated clusters that become sources of oversampling. Gaussian SMOTE (G-SMOTE) [15] achieves sample diversity by replacing the uniform distribution of SMOTE with a Gaussian distribution. Real-value negative selection oversampling (RSNO) [16] synthesizes a minority sample without accessing minority samples. The hybrid methods, such as SMOTE with edited nearest neighbor (SMOTE-ENN) [17] and SMOTE with an iterative partitioning filter (SMOTE-IPF) [18], filter out unsafe samples after SMOTE-based oversampling. However, most of these data-level methods consider only local information; therefore, they cannot reflect the entire data distribution [21]. Furthermore, these methods are based on interpolation with simple distance metrics (e.g., Euclidean), and therefore, they only consider numerically featurized data, which does not successfully address other types of data such as images [21].

2) Loss-Based (Cost-Sensitive) Balancing Approach: The cost-sensitive approach modifies the existing classification loss (cost) function (i.e., the cross-entropy loss) to give additional consideration to minority class samples. Representative methods include the class rectification loss (CRL) [36], the max-pooling loss (MPL) [37], and the focal loss (Focal) [38]. In detail, CRL rectifies the incremental class bias in the model by making use of batchwise-selected hard positive and negative samples of the minority classes. MPL indirectly addresses both interclass and intraclass imbalance by performing a generalized max pooling of pixel-specific losses. The focal loss reshapes the loss function to downweight samples of the majority classes. The advantage of the cost-sensitive approach is that it can be simply applied to the training procedure of a deep learning network. However, as illustrated in our experiments, the performance improvements of cost-sensitive methods are bounded because the amount of minority class samples that can be used for learning the data distributions is still limited.

3) Classifier-Design Approach for Balancing: This approach is to design algorithmic techniques embedded in a classifier to overcome the class-imbalance problem inherently. The representative methods in this approach are the regression-based linear classifier minimizing a one-pass area under the receiver operating characteristic curve (AUC) convex loss [23], the kernelized online imbalanced learning (KOIL) of the support vector machine (SVM) [24], the ensemble strategy of SVMs [25], and random forests combining multiple decision trees to learn highly imbalanced medical data [26]. However, since the above methods belong to shallow learning, they cannot handle extremely high-dimensional data, such as raw images without handcrafted features. Recently, deep-network classifiers have shown more

remarkable performance by directly learning raw images than the shallow learning methods [39]. Thus, our work intends to propose an imbalanced data learning approach applicable to a deep-network classifier.

4) GAN-Based Balancing Approach: In recent years, to reflect the actual distribution of a minority class in learning a deep-network classifier, GAN has been used as an oversampling method. GAN-based works exploit a GAN model, such as the deep convolutional GAN (DCGAN) [20], the conditional GAN (cGAN) [21], or cycleGAN [22], to restore the actual distribution by synthesizing data. Balancing GAN (BAGAN) [31] is a slightly modified version of an auxiliary classifier GAN [40] that specializes in the generation of minority class samples. In all of these studies, the process of generating samples through the GAN and the process of learning a classifier with the generated samples are independent. To handle this issue, the concept of TripleGAN [33] has been adopted to address imbalanced data classification [34]. However, TripleGAN and its variant, Enhanced TripleGAN (E-TripleGAN) [41], were originally proposed for semisupervised learning and have an adversarial relationship between a classifier and a discriminator for pseudolabeling. This adversarial relation is of no use for imbalanced data learning, as unlabeled samples are not present in an imbalanced data problem. Hence, we remove the adversarial relationship and propose a novel cooperative relationship between the generator and the classifier.

B. Effectiveness of Samples Near the Class Boundary

The samples near the decision boundary play an important role in training classifiers. For this reason, various research works have attempted to utilize the concept of the decision boundary, such as knowledge distillation via decision boundary transfer [42], classifier training robust to adversarial attacks [43], and out-of-distribution detection problems [44]. To the best of our knowledge, however, our work is the first attempt to address an imbalanced classification problem by generating samples with a GAN to expand the decision boundary of the minority region. The novelty of our study is the decision boundary regularization with its decay, which promotes the convergence of the alternating optimization in training our three-player structure for mitigating the imbalance issue.

III. PROPOSED METHOD

To formulate our concept for expanding the minority decision region to have a desirable distribution, we design a three-player structure for imbalanced data learning (see Section III-A) and develop an alternating training scheme with a cooperative training loop between the generator and the classifier (see Section III-B).

A. Three-Player Structure for Imbalanced Data Learning

1) Motivation: As mentioned in Section II-A4, TripleGAN [33] and E-TripleGAN [41] are designed to generate pseudolabels for unlabeled samples to facilitate semisupervised learning. The discriminator (D) in TripleGAN discriminates a given true label and a false label generated by the classifier (C). TripleGAN has an adversarial relationship U(C, D) between D and C, as shown in Fig. 2(a). In the proposed model, a cooperative relationship is developed between the generator (G) and C to ensure that both G and C are benefited by joint training. For developing the cooperative relationship, additional utility terms U(G, C) and R(G, C) have been proposed, as shown in Fig. 2(b). The proposed three-player structure is designed to expand the minority region by generating minority samples toward the borderline between the majority and the minority in the early stage of training and finally to provide densely distributed samples within the expanded minority region. In Section III-A2, we describe the details of the proposed utility function and discuss its impact on imbalanced data learning.

Fig. 2. Difference between (a) TripleGAN and (b) the proposed method.
Fig. 3. Proposed GAN architecture with notations.

2) Utility Function: To describe the utility function for our three-player structure shown in Fig. 3, we define notations. x denotes the input data, and y denotes the output label. Then, x̄ = G(z, y) denotes a generated sample from the randomly generated z and y values. It is assumed that the observed training samples are sampled from an unknown p(x, y) and that samples from both p(z) and p(y) can be easily obtained by using simple known distributions (normal or uniform, and so on) during training. The classified label is denoted by ȳ = C(x), and the output of D is denoted by D(x, y) for given x and y. In addition, the joint distributions pg(x, y) and pc(x, y) are defined as

pg(x, y) := p(y) pg(x|y) = p(y) p(G(z, y)|y)   (1)
pc(x, y) := p(x) pc(y|x) = p(x) p(C(x)|x)   (2)

where pg(x|y) = p(G(z, y)|y) in (1) indicates the distribution of synthetic samples generated by G for a given label y, and pc(y|x) = p(C(x)|x) in (2) indicates the distribution of labels, determined by C, for the given samples (generated or observed).
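To connect this notation to an implementation, the following is a minimal PyTorch-style sketch of the three players. The tiny fully connected architectures, the latent size, and the label embedding are illustrative placeholders (the experiments in this article use DCGAN-style generators/discriminators and ResNet/CNN classifiers), so treat it as a mapping of symbols to modules rather than the actual networks.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """x_bar = G(z, y): conditional generator taking noise z and a label y."""
    def __init__(self, z_dim=100, n_classes=2, x_dim=32 * 32 * 3):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)
        self.net = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Tanh())

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class Discriminator(nn.Module):
    """D(x, y): scores whether the (sample, label) pair comes from the real joint p(x, y)."""
    def __init__(self, n_classes=2, x_dim=32 * 32 * 3):
        super().__init__()
        self.embed = nn.Embedding(n_classes, x_dim)
        self.net = nn.Sequential(nn.Linear(2 * x_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))

class Classifier(nn.Module):
    """C(x): classifier whose softmax scores C_mi and C_ma are used by R(G, C)."""
    def __init__(self, n_classes=2, x_dim=32 * 32 * 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_classes))

    def forward(self, x):
        return self.net(x)  # logits; apply softmax to obtain p_c(y|x)
```

With such modules, sampling y ~ p(y) and z ~ p(z) and calling G(z, y) yields generated pairs whose joint distribution is pg(x, y) in (1), while the softmax of C(x) plays the role of pc(y|x) in (2).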

Our goal is to design a utility function U(C, D, G) for imbalanced data learning in the three-player game given by

min_{C,G} max_D U(C, D, G).   (3)

In this article, the utility function for imbalanced data learning is proposed as

U(C, D, G) = Ug(D, G) + Uc1(C) + (1 − λ)Uc2(G, C) + λR(G, C)   (4)

where the last two terms are distinctive aspects against TripleGAN, and they take key roles in the cooperative training of G and C in our method. The third term is for jointly training G and C, whereas the fourth term R(G, C) is for minority region expansion. These two terms are linked by a hyperparameter λ for tradeoff scheduling between the two terms (for details, see Sections III-A3 and III-B3). Each term is defined formally in the following.

The term Ug(D, G) is the well-known utility function of cGAN [21], which is defined as

Ug(D, G) = E_p(x,y)[log D(x, y)] + E_pg(G(z,y),y)[log(1 − D(G(z, y), y))].   (5)

The term Uc1(C) is for training C with only the observed (real) data, whereas Uc2(G, C) is for joint training of G and C; they are defined as

Uc1(C) = E_p(x,y)[− log pc(y|x)]   (6)
Uc2(G, C) = E_pg(G(z,y),y)[− log pc(y|G(z, y))].   (7)

In particular, Uc2(G, C) makes C be trained to classify well the samples generated by G, whereas G is trained to generate extra samples helpful for C.

Lastly, R(G, C) is introduced for expansion of the minority region. To define R(G, C), the classification scores for the minority and majority classes are denoted by Cmi(G(x)) and Cma(G(x)), respectively, and a generated sample is denoted by xg. Using these terms, R(G, C) is defined as

R(G, C) = E_pg(x,y)[sg]   (8)

where

sg = [Cmi(xg) − Cma(xg)]^2  if Cmi(xg) > Cma(xg),  and  sg = 0  otherwise,

and where Cmi and Cma have values between 0 and 1 since the classifier uses the softmax activation in the output layer. The role of R(G, C) is presented in Section III-A3.

Fig. 4. Through cooperative training, G is trained to generate minority samples (yellow) crossing the decision boundary between the minority class and the majority class. As indicated by the dashed line, the samples generated by the tuned G contribute to the expansion of the minority class region for the next training of C.

3) Effect on Imbalanced Data Learning: In R(G, C) of (8), if the generated sample xg is placed in the minority region of the current classifier C, the minority score Cmi(xg) is greater than the majority score Cma(xg). This is the case of the upper condition in (8); in this case, the minimization of R(G, C) forces G to generate samples near the boundary satisfying Cmi(xg) = Cma(xg), as shown in Fig. 4. These generated samples take a role in the next training of the classifier to expand the decision boundary toward the majority region, as shown in Fig. 1. If the generated sample xg is already placed in the majority region [the lower condition in (8)], this sample does not move, to prevent a harmful effect on the majority class. Hence, as shown in Fig. 4, minimizing R(G, C) plays a role in generating samples that expand the minority region in the direction of the majority region when training C.

However, the endless expansion of the minority region might cause overgeneralization of the minority class, i.e., potential overlapping issues, which degrades the classification performance. To mitigate the potential overlapping problem, we introduce a hyperparameter λ for a tradeoff scheduling between Uc2(G, C) and R(G, C). By reducing λ gradually to zero during the alternate training of C and G, the role of R(G, C) vanishes, and thus the overgeneralization of the minority class stops. This implies that G is trained for the expansion of the minority class decision region in the early training stage only.

As λ decays, the cooperative term Uc2(G, C) contributes to the generation of minority samples achieving a sufficiently balanced distribution within the expanded minority region by utilizing the shareable low-level features from the majority samples. When training the generator via the term Uc2(G, C), the majority samples can provide shareable local feature information (edge, blob, texture, and so on) helpful to minority sample generation, which can mitigate the overfitting problem encountered when only a small number of minority samples is used for generator training. This claim is supported by the study [45], which reports that the lower (input-side) layers of a convolution network learn local features shareable among various classes. When generating images, a generator such as DCGAN [20] stores the local feature information in the higher (output-side) layers, contrary to the convolution network used for a classifier. More details about the decaying scheme are described in Section III-B.
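To make the regularization concrete, the following is a minimal PyTorch-style sketch of how sg in (8) and the λ-weighted, G-dependent part of (4) could be computed for a batch of generated samples. The tensor names, the two-class layout (index 1 as the minority class), and the small epsilon constants are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def boundary_regularizer(class_logits, minority_idx=1):
    """R(G, C) of (8): the squared margin [C_mi(x_g) - C_ma(x_g)]^2 for generated
    samples that C still places in the minority region, and 0 otherwise."""
    probs = F.softmax(class_logits, dim=1)                       # softmax scores in [0, 1]
    c_mi = probs[:, minority_idx]
    others = torch.cat([probs[:, :minority_idx], probs[:, minority_idx + 1:]], dim=1)
    c_ma = others.max(dim=1).values                              # most influential non-minority class
    gap = c_mi - c_ma
    s_g = torch.where(gap > 0, gap ** 2, torch.zeros_like(gap))  # upper / lower condition of (8)
    return s_g.mean()                                            # empirical expectation over the batch

def generator_objective(d_fake, class_logits_fake, y_fake, lam):
    """The G-dependent part of (4): the second term of Ug in (5),
    plus (1 - lam) * Uc2 of (7) and lam * R of (8); G descends this value."""
    adv = torch.log(1.0 - d_fake + 1e-8).mean()
    uc2 = F.cross_entropy(class_logits_fake, y_fake)
    reg = boundary_regularizer(class_logits_fake)
    return adv + (1.0 - lam) * uc2 + lam * reg
```

During the alternating training described next, this generator-side objective is minimized against the discriminator's max step, and lam is shrunk by the decaying rule of Section III-B3 so that the boundary-expansion pressure acts only in the early iterations.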

Theorem 1 shows that the proposed utility function has an equilibrium when λ decays to zero.

Theorem 1: The equilibrium of U(C, D, G) with λ = 0 is achieved if and only if

p(x, y) = pg(x, y) = pc(x, y) = pc(G(z|y), y).   (9)

Note that pc(x, y) = pc(G(z|y), y) means that the training of C relies on the distribution of samples generated by G at the equilibrium. Hence, how well G learns the true distribution dominates the performance of C. The proof is shown in Appendix A.

B. Training Scheme

1) Overall Scheme: To promote cooperation between G and C, along with Uc2(G, C) and R(G, C), in the optimization process, we adopt an alternating optimization between the training of G/D and the training of C. The overall scheme of the proposed method is outlined in Fig. 5. To stop the training of G/D, of C, or of the alternating loop, we adopt the validation-based early stopping rule [46]. Before starting the alternating optimization, we pretrain G/D with the observed imbalanced data to obtain the initial generator. As the first step of the alternating loop, C is trained with a balanced batch generated by the fixed G/D. Thereafter, G/D is trained in cooperation with C, along with the decision boundary regularization R(G, C). These two optimizations are repeated iteratively in an alternating loop. Each optimization is described in the following. The alternating loop induces G to generate minority samples that help C expand the minority region during the initial training phase. As λ decays with increasing alternating iterations, the joint term Uc(G, C) plays a major role in achieving a desirable distribution within each decision region determined by the trained C.

Fig. 5. Overall scheme of the alternating training of C and G/D. For details on each block, see the pseudocode in Algorithm 1.

The training parameters of G, D, and C are denoted by θg, θd, and θc, respectively. Then, letting Û(·) be an empirical utility function that is parameterized from U(·) in (4), the total empirical utility function Û(θd, θg, θc) of U(D, G, C) is denoted by

Û(θd, θg, θc) = Ûg(θd, θg) + Ûc1(θc) + (1 − λ)Ûc2(θc, θg) + λR̂(θc, θg)   (10)

where a decaying rule of λ is designed as λi for the ith iteration in Section III-B3. The details of each term are given in Table I. The details of the training of θc with a balanced batch generated by G and of the training of θg/θd in cooperation with C are described in Sections III-B2 and III-B3.

TABLE I
EMPIRICAL UTILITY SUBFUNCTIONS
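As a rough illustration of how the empirical subfunctions in (10) fit together for one batch, the sketch below composes them in PyTorch; it assumes the hypothetical classifier, generator, and discriminator modules and the boundary_regularizer helper from the earlier sketches, and it is a simplified single-pass view rather than the released training code.

```python
import torch
import torch.nn.functional as F

def empirical_utility(classifier, generator, discriminator,
                      x_real, y_real, z, y_fake, lam):
    """One evaluation of (10): Ug_hat + Uc1_hat + (1 - lam) * Uc2_hat + lam * R_hat.
    C and G descend this value, while D ascends the Ug_hat part."""
    x_fake = generator(z, y_fake)

    # Ug_hat(theta_d, theta_g): the conditional GAN utility of (5).
    ug = (torch.log(discriminator(x_real, y_real) + 1e-8).mean()
          + torch.log(1.0 - discriminator(x_fake, y_fake) + 1e-8).mean())

    # Uc1_hat(theta_c): cross-entropy of C on the observed data, (6).
    uc1 = F.cross_entropy(classifier(x_real), y_real)

    # Uc2_hat(theta_c, theta_g) and R_hat(theta_c, theta_g): terms on generated data, (7) and (8).
    logits_fake = classifier(x_fake)
    uc2 = F.cross_entropy(logits_fake, y_fake)
    r = boundary_regularizer(logits_fake)

    return ug + uc1 + (1.0 - lam) * uc2 + lam * r
```

In practice the three parameter sets are not updated from a single backward pass: Sections III-B2 and III-B3 update θc and θg/θd alternately on this quantity while the other players are held fixed.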

2) Training of C With a Balanced Batch by G: In this stage, only C is trained using the empirical utility function Û(θd, θg, θc) in (10); that is, only the parameter vector θc of C is updated after fixing θg and θd. As θg is fixed, θc is updated by descending the empirical utility function in (10) along its stochastic gradient with respect to θc. The minority class samples for balancing are generated by the trained G in a batchwise manner, whereas the existing GAN-based balancing methods adopt a one-shot balancing policy. In a one-shot balancing policy, a fixed number of minority samples is generated before training C as a preprocessing step. In batchwise balancing, however, new samples are generated for each batch. Batchwise balancing is advantageous because it can fully utilize G by generating an unlimited number of samples, as new samples are generated repeatedly in a batchwise manner until C converges. Another advantage is memory efficiency. Unlike one-shot balancing, batchwise balancing requires only a small amount of memory, corresponding to a single batch.

3) Training of G/D in Cooperation With C Along With R: This training stage is designed to train G/D to pursue a balanced distribution by expanding the minority decision region and generating sufficient samples within the decision region. To prevent overgeneralization of the minority region, we designed a decaying rule of λ in the utility function (10). Specifically, λi for the ith iteration is exponentially reduced by multiplying the hyperparameter γ ∈ (0, 1] every iteration loop (i.e., λi = γλi−1 for the ith iteration). The value of γ is empirically selected in experiments. In each alternating loop, by fixing θc, θg/θd are updated by descending/ascending Û(θd, θc, θg) in (10) along their stochastic gradients with respect to θg/θd. Note that θg/θd can also be trained for several epochs in each loop, but one epoch was empirically sufficient. The pseudocode of the proposed alternating training scheme is given in Algorithm 1.

Algorithm 1 Alternating Training Scheme of C and G/D
Notation:
  λ: the trade-off control parameter between Uc2(θg, θc) and R(θg, θc)
  γ ∈ (0, 1]: hyperparameter for λ decaying by λi = γλi−1
Procedure:
 1: Initialize λ0 = 1 and i = 1
 2: [G/D Pre-Training]
 3: while not converged by early stop during training θg/θd do
 4:   Sample a batch from the given data
 5:   Train θg and θd by solving the min-max problem with Ûg(θd, θg)
 6: end while
 7: [Alternating Loop]
 8: while not converged by early stop during alternating loops do
 9:   Set λi = γλi−1
10:   [C Training with Balanced Data]
11:   while not converged by early stop during training θc do
12:     Sample a batch from the given data
13:     Balance the batch with minority samples generated by G
14:     Train θc by minimizing the utility in (10)
15:   end while
16:   [G/D Training in cooperation with C via R]
17:   while not converged by early stop during training θg/θd do
18:     Sample a batch from the given data
19:     Train θg and θd by solving the min-max problem in (10)
20:   end while
21:   i ← i + 1
22: end while

C. Extension to Multiclass and Multilabel

To apply our method to multiclass problems, we expand a decision boundary between a minority class and its neighboring majority class, which is the most influential class to the minority class. Hence, the majority class is determined as the class i*, where i* = argmax_{i, i≠mi} Ci(xg^mi).

For multilabel classification, let the multilabel vector for the ith sample be denoted by yi = [y1^i, ..., yj^i, ...], where j is an attribute index. For CelebA, either 0 or 1 is assigned to yj^i. For balancing with mini-batch generation, we use the multilabel vectors as inputs for G. Let yj = {yj^i | i = 1, ..., N} be the set of the jth elements of the label vectors in a mini-batch, as shown in Fig. 6. Note that y and ȳ indicate a given training batch and a generated batch, respectively. The minority labels in ȳj are randomly chosen with probability 1 − pj, where pj is the ratio of minority samples in yj of the training data set. The remaining elements are assigned to the majority labels. During G/D training with C, the utility function is obtained by the summation of all utilities, Σj (Uj + Rj), over the attributes.

Fig. 6. How the multilabels are chosen for generating the samples.

IV. EXPERIMENTAL RESULTS

A. Data Sets

In our evaluation, we utilized the CIFAR-10 [47], ImageNet [29], Dementia diagnosis [48], and CelebA (multilabel) [49] data sets. CIFAR10 is a low-resolution image data set, and ImageNet is a large data set including high-resolution images. Dementia is a diagnosis data set for binary classification (control versus patient) of neuropsychological assessment profiles, where the number of control subjects is six times more than that of dementia patients (IR = 6). CelebA is a data set including portraits with multilabels, and some attributes (labels) are extremely imbalanced, such as baldness or hat wearing. The aspects of the data sets used in our experiments are given in Table II.

Dementia and CelebA are inherently imbalanced data sets, but CIFAR10 and ImageNet are not imbalanced. Hence, we artificially constructed imbalanced data sets by subsampling the original CIFAR10 data set for a minority class. For ImageNet, we constructed an imbalanced data set where the majority class is chosen as a class including many subclasses and the minority class as a class including a few subclasses.
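As an illustration of this kind of artificial imbalance, the following sketch subsamples one class of a balanced image data set down to a target imbalance ratio. The torchvision CIFAR-10 loader, the chosen class indices (1 = car as majority, 9 = truck as minority), and the 5% keep ratio mirror the setup described later in this section, but the helper itself is an assumed, illustrative construction rather than the paper's preprocessing script.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def make_imbalanced_pair(root, majority=1, minority=9, minority_keep=0.05, seed=0):
    """Build a binary imbalanced training set (IR = 1 / minority_keep, i.e., 20 here)
    by keeping every majority-class image and a random 5% of the minority class."""
    data = CIFAR10(root, train=True, download=True)
    labels = np.array(data.targets)
    rng = np.random.default_rng(seed)

    maj_idx = np.where(labels == majority)[0]            # keep all majority samples
    min_idx = np.where(labels == minority)[0]
    keep = rng.choice(min_idx, size=int(len(min_idx) * minority_keep), replace=False)

    idx = np.concatenate([maj_idx, keep])
    x = data.data[idx]                                    # HWC uint8 images
    y = (labels[idx] == minority).astype(np.int64)        # 0 = majority, 1 = minority
    return x, y
```

A validation split (e.g., 20% of the selected training samples, as described below) can then be carved out of (x, y) before training.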

To construct the minority class, two factors should be considered so that it cannot be easily classified from the majority classes. The first factor is the degree of similarity between classes. Learning will be easy if both classes are distinct from each other, even if a considerably small number of samples is provided for the minority class. Therefore, constructing classes with high similarity is desirable for evaluating the performance of imbalanced data learning. The second factor is the IR between classes. Previous studies constructed an imbalanced data set with a low IR, not higher than 2.5 (100:40) [31]; however, such low-IR data are insufficient to verify the methods for an extremely imbalanced case. It is therefore desirable to set a sufficiently large value for the IR.

TABLE II
SUMMARY OF EVALUATION DATA SETS

Considering these two factors, we constructed imbalanced data sets from CIFAR10 [47]. Based on the first factor, we selected two highly similar classes from the original data set. For example, car (majority class) and truck (minority class) are highly similar to each other. Based on the second factor, we set the IR to 20 (100:5) for CIFAR10. We used all samples in the majority class data set and randomly selected 5% of the samples from the minority class data set. A validation data set was constructed with 20% of the samples in the selected training data set. For the test data set, we used the original test data set of CIFAR10. For evaluation on real imbalanced images with high resolution, we conducted experiments on ImageNet [29]. We set up an imbalanced binary-class data set where the majority class is "dog," containing 117 species, and the minority class is "cat," containing four species; thus, the IR becomes 29.25. To verify the effectiveness of the proposed method on the multiclass classification problem, according to [50], we constructed an imbalanced multiclass data set by extracting 5% of the samples for five (half) classes (0, 2, 4, 6, and 8) in CIFAR10.

B. Evaluation Metrics

Several previous studies [20], [22], [31] have used accuracy as an evaluation metric. However, for extremely imbalanced data, the accuracy metric cannot precisely evaluate the minority class classification because high accuracy can be achieved with a simple zero-rule classifier, which assigns all samples to the majority class. To avoid this problem, we adopted the following metrics: G-score, B-ACC, AUROC, and AUPR.

G-score (geometric mean score) is a metric that measures the balance between the classification performances on both the majority and minority classes. B-ACC (balanced accuracy) is a metric for evaluating learning processes in two-class imbalanced domains. While G-score and B-ACC are specific to particular decision thresholds, AUROC and AUPR are not specific to decision thresholds. Hence, AUROC and AUPR are more valid than the accuracy metric for performance evaluations of the trained model. AUROC is more common than AUPR, but AUPR is more sensitive than AUROC for highly imbalanced data sets [52]. For imbalanced data, the AUPR value is very low. Furthermore, as the analytical power of a classifier for the minority class increases, the AUPR value remarkably increases. Thus, AUPR is the most appropriate metric for imbalanced data classification. Furthermore, using Delong et al.'s method [55], we test the statistically significant difference in AUROC values against the other methods.

C. Self-Analysis

Using the imbalanced data set constructed from the two similar classes of car (majority class) and truck (minority class) in CIFAR-10, we deeply self-analyze our method in the various aspects detailed in this section.

Fig. 7. Effect of λ along with R(G, C). The proposed method (with λ decay) shows more stable and higher performance than the other two cases.

1) Effect of λ Along With R(G, C): To analyze the influence of λ and its decay scheme along with R(G, C) in (4), we evaluated the convergence of the optimization process for each of the three settings: without R(G, C), without the λ decay scheme, and with the λ decay scheme. Fig. 7 shows the results of the three cases using CIFAR10. In the case without

R(G, C) (green line), using the utility function in (3), the performance was not much improved, due to the premature convergence explained in Section III-A3. In the case without a λ decay scheme (blue line), performance degraded after approximately 100 iterations, due to overexpansion of the minority region. In contrast, in the case with λ decay (proposed, brown line), using the utility function in (4), high and stable performance was achieved as expected. The degree of decay for λ = γ^i is determined by the value of γ, which is observed to be dependent on the data set. We determined γ empirically as 0.9, 0.1, and 0.5 for CIFAR10, Dementia, and CelebA, respectively.

Fig. 8. Distribution of each class and generated minority samples in feature space. Without cooperative training, generated samples are located within the training data distribution. However, with cooperative training along with R(G, C), generated samples tend to be located on the borderline. As λ decays, generated samples return to the distribution with broader coverage.

2) Validity of Samples Generated Throughout Cooperative Training: Fig. 8 shows a map of the samples generated by the proposed GAN in the feature space. The blue and red contours represent the majority and minority class distributions, respectively, for the given training data. The dark red dots represent the 64 samples generated by G. Features in the intermediate layer of C were extracted for all samples and were visualized in 2-D space using the parametric t-distributed stochastic neighbor embedding scheme [56]. For fair visualization, we used a fixed z to generate samples at each iteration.

In Fig. 8, the leftmost panel shows the samples generated by cGAN learning, which was only trained in the initial phase, without cooperative training. Most of the samples are mapped in a small region within the training data distribution. The remaining panels show a map of the samples generated through repeated cooperative training. Although the samples were generated using the same z values, they are mapped to different positions of the feature space in every iteration. Especially, in the first cooperative training, as the value of λ is 1, most of the generated minority samples cross the decision boundary between the two classes. We can see that as the λ value decays, the tendency of the generated samples to cross the decision boundary decreases.

Fig. 9. Feature space mappings and images of generated minority samples (truck) against majority samples (car) in (a) early-stage iteration and (b) late-stage iteration.

Fig. 9 shows the locations of the generated minority samples in feature space. The top-right images in Fig. 9(a) and (b) are the generated sample images. The numbers left of the generated images are the indexes that correspond to the numbers written in the feature space. Fig. 9(a) shows the generated sample locations after the first cooperative interaction training. As discussed in Section IV-E, due to λ = 1, the generated minority samples are located around the borderline of the two classes. Fig. 9(b) shows the generated sample locations after the 80th cooperative interaction training. As λ converges to 0, the generated samples are located within the original distribution rather than at the borderline. Even though the images with the same index in Fig. 9(a) and (b) are generated with the

same value of z, the appearances of the two images with the same index are different from each other. Many of the generated images in Fig. 9(a) appear to be a car (low and round). This figure illustrates that G trained in the initial cooperative interaction phase can generate ambiguous minority samples that look like majority samples. These ambiguous minority samples are beneficial to the expansion of the minority region. However, as λ converges to zero, the generated images become similar to a truck image (high and box-style), as shown in Fig. 9(b).

As most data-level sampling methods provide samples only in the inner region of the training data distribution, they risk overfitting [57]. In contrast, we can observe that several samples generated by our method are positioned over the decision boundary between the two classes. This result implies that the proposed method can expand the minority region to improve the generalization performance of C on the minority class. After the regularization term vanishes by reducing λ to almost zero, the generated samples cover a wide region of the minority class, as shown in the fourth map of Fig. 8.

3) Ablation Study: The ablation study was conducted with CIFAR10 by sequentially adding each ablation component because each component could not be implemented without the previous components. The role of the components is validated through an ablation study on CIFAR10 with one baseline (classifier only) and three variants that cumulatively add cGAN-based oversampling (Variant 1), the joint training of G and C with alternating optimization (Variant 2), and the regularization R(G, C) (Variant 3).

Fig. 10. Radar chart for ablation comparison of classifier performance on CIFAR10. Scores are from the validation* and test† sets. For better visualization, each score is normalized with the mean and variance of the four variants because AUPR and AUROC have different ranges from each other.

Fig. 10 shows the results of the ablation study. First, on Variant 1 (orange line), performance improves slightly compared to learning using only C (blue line). As this variant corresponds to the existing cGAN, the amount of improvement is not significant. On Variant 2 (green line), a significant improvement of performance is achieved in addition to the first ablation. This implies that the terms for joint training of G and C, along with alternating training, contribute to both G and C so that C helps G generate samples beneficial to C, consequently improving C's performance. Finally, when the R(G, C) term was added as Variant 3, performance significantly improved since G generated samples to interactively expand the minority region (red line).

D. Comparative Analysis

To verify the validity of the proposed method, we compared the classification performance based on the four metrics to existing techniques using five configurations from four data sets.

1) Compared Methods: For the conventional data-level methods, we adopted the 11 methods described in Section II-A. For implementing SMOTE [6], B-SMOTE [12], ADASYN [13], C-Centroids [11], CN-Neighbor [11], and SMOTE-ENN [17], we used the imbalanced-learn library [58]. For MWMOTE [7], NRSB-SMOTE [35], SMOTE-IPF [18], and G-SMOTE [15], we used the smote-variants library [59]. For RSNO [16], we acquired MATLAB code from the authors. However, because all the conventional data-level methods support CPU computation only, we could not conduct some of the experiments on high-dimensional data and large numbers of samples (marked by "−" in Table III). The compared loss-based methods are CRL [36], MPL [37], and the focal loss [38]. GAN-based techniques were compared to three other methods. The first method is based on cGAN, which is used in most GAN-based approaches. The structure of cGAN is the same as that used in our work. The second method is BAGAN [31]. The authors of BAGAN released the source code, and the structure and hyperparameters specified in their paper were used. The third GAN-based comparison is with TripleGAN [33], E-TripleGAN [41], and HexaGAN [34], which use the concept of TripleGAN for the imbalanced data problem.

2) Hyperparameters and Experimental Settings: For a fair comparison, the hyperparameters of the classifier for each data set are searched for the classifier-only (baseline) case. Then, the same set of classifier hyperparameters was used for the others. Besides, the unique hyperparameters of each method, such as the γ of the focal loss [38], were searched within a specific range following their guidelines and selected with the values that showed the best validation performance. In the case of GAN-based techniques, the same structure of G and D was used, except for BAGAN, which has its own structure. Further training details about network structures and hyperparameter values are provided in Appendix B.

3) Comparison Results: The comparative results are listed in Tables III–V. Our method outperformed all the compared methods on all the data sets consistently. Most GAN-based methods tend to give consistent improvements against the baseline "classifier-only" on all data sets. Some loss-based

(cost-sensitive) methods give improvements on most data sets except the multiclass data set (CIFAR10). However, most of the data-level methods do not give improvements on high-dimensional image data and show improvements only on low-dimensional table data (Dementia). As shown by the underlined scores of Table III, in the Delong et al. AUROC test, the proposed method shows a statistically significant improvement against all the methods except the underlined cases on the Dementia data set.

TABLE III
TEST SET PERFORMANCE OF BINARY-CLASS DATA SETS
TABLE IV
TEST SET PERFORMANCE OF MULTICLASS CIFAR10
TABLE V
TEST SET PERFORMANCE OF MULTILABEL CELEBA

In addition, we investigated the trend of performance change with IR variation for the representative methods of the data-level, loss-based, and GAN-based approaches. Every data set except CIFAR10 has a fixed IR. Thus, we used CIFAR10 to see the change in performance (G-score, AUPRmi, and AUPRmj) according to the IR change, which is shown in Fig. 11.

Fig. 11. Performance changes of representative methods depending on IR changes.

In G-score, the proposed method outperforms the other methods consistently in most of the IR range. When IR = 20, the AUPR of the majority class of our method is slightly degraded in exchange for remarkably improving the AUPR of the minority class, as shown in the right two graphs in Fig. 11. In the range of IR less than 10, our method shows superior AUPR for both the majority and minority classes.
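For reference, the G-score and per-class AUPR values used in this comparison can be computed from the classifier's predicted probabilities; the minimal sketch below uses scikit-learn and assumes a binary problem with class 1 as the minority, mirroring our CIFAR10 setup but otherwise illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score

def imbalance_metrics(y_true, p_minority, threshold=0.5):
    """y_true: 0 = majority, 1 = minority; p_minority: predicted probability of the minority class."""
    y_true = np.asarray(y_true)
    p_minority = np.asarray(p_minority)
    y_pred = (p_minority >= threshold).astype(int)

    # G-score: geometric mean of the per-class recalls; B-ACC: their arithmetic mean.
    recall_mi = recall_score(y_true, y_pred, pos_label=1)
    recall_mj = recall_score(y_true, y_pred, pos_label=0)
    g_score = np.sqrt(recall_mi * recall_mj)
    b_acc = 0.5 * (recall_mi + recall_mj)

    # Threshold-free metrics: AUROC and the per-class AUPR (average precision).
    auroc = roc_auc_score(y_true, p_minority)
    aupr_mi = average_precision_score(y_true, p_minority)
    aupr_mj = average_precision_score(1 - y_true, 1.0 - p_minority)

    return {"G-score": g_score, "B-ACC": b_acc, "AUROC": auroc,
            "AUPR_mi": aupr_mi, "AUPR_mj": aupr_mj}
```

Note that average_precision_score is the usual stepwise approximation of the area under the precision-recall curve; treat the sketch as indicative of the metric definitions rather than a reproduction of the reported numbers.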

4) Discussion on Performance Improvements: Although our method consistently outperforms the comparative methods, the amount of improvement depends on the data set. Here, we discuss possible causes of the improvement differences.
1) High Baseline: The Dementia data have a small IR and low dimensions. A small IR mitigates the difficulty of the imbalanced data problem, and the low dimension alleviates the curse of dimensionality. Thus, the baseline accuracy is already high, and so the room for improvement is limited.
2) Complex Decision Boundary in the Multiclass Problem: As the number of classes increases in the multiclass problem, the decision boundary generally becomes more complex. As shown in Fig. 12(a), it is relatively easy to decide the direction of expansion. However, for the case shown in Fig. 12(b), because the expansion directions are opposite to each other, a complex decision boundary in the multiclass problem might provoke ineffective expansion.
3) Unrealistic Sample Generation in the Multilabel Problem: Since the multilabel vectors of generated samples are sampled randomly, it is possible to sample an unrealistic combination of labels, such as woman-with-mustache. Because such an unrealistic image does not exist in the test data, it may not contribute to, or may even be harmful to, the performance. However, this phenomenon inevitably occurs when balancing for the multilabel case.

Fig. 12. Two possible cases, (a) easy and (b) hard, for a multiclass problem.

V. CONCLUSION

To overcome the difficulty of imbalanced data learning, we proposed a novel methodology based on a three-player game and decision boundary regularization. First, we designed a three-player structure to improve imbalanced data learning performance and analyzed the equilibrium point of the proposed utility function. Second, we introduced a decision boundary regularization to expand the minority region determined by the trained classifier with samples generated by the generator in our three-player structure. Third, we proposed an alternating training scheme to effectively train the three-player structure in cooperation with the decision boundary regularization. The experiments illustrated that the proposed method outperforms the existing methods by yielding abundant samples to expand the minority decision region, which is beneficial in addressing imbalanced data learning problems.

Although the proposed method showed promising results, certain issues remain. In this study, with a relatively simple form of cGAN (DCGAN), the proposed method achieved a considerable performance improvement in imbalanced data classification. As further work, the use of more precise generators and discriminators such as Wasserstein GAN [60] is expected to yield higher and more stable performance for imbalanced data learning. In addition, for the λ decay schedule, an exponential decay rule was used where the decaying degree was empirically determined depending on the data sets. For further improvement, an elaborate design or adaptive scheme for the λ decay should be adopted, which would consider the IR and complexity of the target data set. As open problems, interpretability and authenticity are critical topics to be pursued in the machine learning field. Recently, several types of research have been proposed to interpret GANs [61]–[64]. By applying them to our method, we can take one step closer to solving the open problems.

APPENDIX A
PROOF OF THEOREM 1

The proof of the following Lemma 1 is equivalent to the proof of the original GAN [30],1 and thereby we briefly summarize the original proof by rewriting it. For the details, refer to the reference in the footnote. Here, we add Theorem 1 for the proof of the three-player game proposed in this article.

1 https://srome.github.io/An-Annotated-Proof-of-Generative-Adversarial-Networks-with-Implementation-Notes/

Lemma 1: For any fixed G in Ug(D, G), the optimal discriminator D is given by

D*(x, y) = p(x, y) / [p(x, y) + pg(x, y)].   (11)

Proof: Given G, Ug(D, G) can be rewritten as

Ug(D, G) = ∫∫ p(x, y) log D(x, y) dy dx + ∫∫ pg(x, y) log(1 − D(x, y)) dy dx.   (12)

This function achieves its maximum at p(x, y)/(p(x, y) + pg(x, y)). □

Theorem 1: For the given D*, the equilibrium of U(C, D, G) is achieved if and only if

p(x, y) = pg(x, y) = pc(x, y) = pc(G(z|y), y).   (13)

Proof: Given D*, we can reformulate the minimax game with value function Ug(D, G) as

Ug(D, G) = ∫∫ p(x, y) log [p(x, y)/(p(x, y) + pg(x, y))] dy dx + ∫∫ pg(x, y) log [pg(x, y)/(p(x, y) + pg(x, y))] dy dx.   (14)

Following the proof in GAN, Ug(D, G) can be rewritten as

Ug(D, G) = −log 4 + 2 JSD(p(x, y) || pg(x, y))   (15)

where JSD is the Jensen–Shannon divergence.

where JSD is the Jensen–Shannon divergence. In addition, R EFERENCES


according to the definition of Kullback–Leibler (KL) diver-
[1] R. B. Rao, S. Krishnan, and R. S. Niculescu, “Data mining for improved
gence, Uc (C, G) can be rewritten as cardiac care,” ACM SIGKDD Explor. Newslett., vol. 8, no. 1, pp. 3–10,
Jun. 2006.
Uc (C, G) = E(x,y)∼ p(x,y) [− log pc (y|x)] [2] X.-M. Zhao, X. Li, L. Chen, and K. Aihara, “Protein classification with
imbalanced data,” Proteins, Struct., Function, Bioinf., vol. 70, no. 4,
+ E(x,y)∼ pg (x,y) [− log pc (y|G(z|y))] pp. 1125–1132, Dec. 2007.
= DKL ( p(x, y)|| pc(x, y)) + H p (y|x) [3] P. K. Chan, W. Fan, A. L. Prodromidis, and S. J. Stolfo, “Distributed
data mining in credit card fraud detection,” IEEE Intell. Syst., vol. 14,
+ DKL ( pg (x, y)|| pc (G(z|y), y)) no. 6, pp. 67–74, Nov. 1999.
[4] C. Phua, D. Alahakoon, and V. Lee, “Minority report in fraud detection:
+ H pg (y|G(z|y)). (16) Classification of skewed data,” ACM SIGKDD Explor. Newslett., vol. 6,
no. 1, pp. 50–59, Jun. 2004.
From (15) and (16), U(C, D, G) becomes [5] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans.
Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
U(C, D, G) = Ug (D, G) + Uc (C, G) [6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,
“SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell.
= 2JSD( p(x, y)|| pg (x, y)) Res., vol. 16, pp. 321–357, Jun. 2002.
[7] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE-majority
+ DKL ( p(x, y)|| pc(x, y)) weighted minority oversampling technique for imbalanced data set
+ DKL ( pg (x, y)|| pc (G(z|y), y)) learning,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 405–425,
Feb. 2014.
+, (H p (y|x) + H pg (y|G(z|y)) − log 4). (17) [8] P. Domingos, “Metacost: A general method for making classifiers cost-
sensitive,” in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data
Since JSD(·) and DKL (·) are nonnegative, their minimum Mining, Aug. 1999, pp. 155–164.
[9] K. M. Ting, “An instance-weighting method to induce cost-sensitive
values become zero if and only if p(x, y) = pg (x, y), trees,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 3, pp. 659–665,
p(x, y) = pc (x, y), and pg (x, y) = pc (G(z|y), y). Hence, the May 2002.
equilibrium of U(C, D, G) becomes p(x, y) = pg (x, y) = [10] A. Fernández, V. López, M. Galar, M. J. del Jesus, and F. Herrera,
“Analysing the classification of imbalanced data-sets with multiple
pc (x, y) = pc (G(z|y), y).  classes: Binarization techniques and ad-hoc approaches,” Knowl.-Based
Syst., vol. 42, pp. 97–110, Apr. 2013.
[11] S.-J. Yen and Y.-S. Lee, “Cluster-based under-sampling approaches
A PPENDIX B for imbalanced data distributions,” Expert Syst. Appl., vol. 36, no. 3,
I MPLEMENTATIONS AND H YPERPARAMETERS pp. 5718–5727, Apr. 2009.
[12] H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-smote: A new over-
For the classifier on CIFAR10 and CelebA data, we used the sampling method in imbalanced data sets learning,” in Advances in
architecture of Resnet18 [39]. For Dementia data set, we used Intelligent Computing, D.-S. Huang, X.-P. Zhang, and G.-B. Huang,
Eds. Berlin, Germany: Springer, 2005, pp. 878–887.
the same classifier as in the original paper that first used [13] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic
the Dementia data set [48], where the classifier is composed sampling approach for imbalanced learning,” in Proc. IEEE Int. Joint
Conf. Neural Netw. (IEEE World Congr. Comput. Intell.), Jun. 2008,
of 2-D convolutional neural networks with skip connection pp. 1322–1328.
and Hilbert curve transform. Both the classifiers used the [14] G. Douzas, F. Bacao, and F. Last, “Improving imbalanced learning
softmax activation in the output layer. In the case of GAN- through a heuristic oversampling method based on K-means
and SMOTE,” Inf. Sci., vol. 465, pp. 1–20, Oct. 2018,
based techniques, the same structure of DCGAN [65] was used doi: 10.1016/j.ins.2018.06.056.
for every comparison. However, as BAGAN adopts their own [15] H. Lee, J. Kim, and S. Kim, “Gaussian-based SMOTE algorithm for
sophisticated architecture, we used their own architecture for solving skewed class distributions,” Int. J. Fuzzy Log. Intell. Syst.,
vol. 17, no. 4, pp. 229–234, Dec. 2017.
BAGAN. [16] X. Tao et al., “Real-value negative selection over-sampling for imbal-
Besides, the unique hyperparameters of each method were anced data set learning,” Expert Syst. Appl., vol. 129, pp. 118–134,
searched within a specific range following their guidelines Sep. 2019.
[17] G. E. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior
and selected with the values that showed the best validation of several methods for balancing machine learning training data,” ACM
performance. The list of hyperparameters and their ranges is SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 20–29, Jun. 2004.
as follows. [18] J. A. Sáez, J. Luengo, J. Stefanowski, and F. Herrera, “SMOTE–IPF:
Addressing the noisy and borderline examples problem in imbalanced
1) The number of nearest neighbor, k, of conventional data- classification by a re-sampling method with filtering,” Inf. Sci., vol. 291,
level methods: [5, 10]. pp. 184–203, Jan. 2015.
[19] J. Hu, X. He, D.-J. Yu, X.-B. Yang, J.-Y. Yang, and H.-B. Shen,
2) The ratio of CRL loss: [0.1, 0.8]. “A new supervised over-sampling algorithm with application to
3) The γ of Focal loss: [2, 4]. protein-nucleotide binding residue prediction,” PLoS One, vol. 9,
4) The number of triplet, k, of CRL loss: [10, 40]. no. 9, pp. 1–10, Sep. 2014. [Online]. Available: https://doi.org/10.1371/
journal.pone.0107676
5) The learning rate of BAGAN: [0.00005, 0.0005]. [20] C. Wang, Z. Yu, H. Zheng, N. Wang, and B. Zheng, “CGAN-plankton:
6) The α of tripleGAN: [0.5, 1]. Towards large-scale imbalanced class generation and fine-grained clas-
sification,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017,
The common training details for every case are as follows; a configuration sketch follows the list.
1) Batch size: 128.
2) Optimizer: ADAM (with β1 = 0.5, β2 = 0.999, and learning rate 0.0002).
3) Weight initialization: Xavier normalization with mean = 0 and std = 0.02.
4) Patience for early stopping: 30.
…sampling method in imbalanced data sets learning,” in Advances in Intelligent Computing, D.-S. Huang, X.-P. Zhang, and G.-B. Huang, Eds. Berlin, Germany: Springer, 2005, pp. 878–887.
[13] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IEEE World Congr. Comput. Intell.), Jun. 2008, pp. 1322–1328.
[14] G. Douzas, F. Bacao, and F. Last, “Improving imbalanced learning through a heuristic oversampling method based on K-means and SMOTE,” Inf. Sci., vol. 465, pp. 1–20, Oct. 2018, doi: 10.1016/j.ins.2018.06.056.
[15] H. Lee, J. Kim, and S. Kim, “Gaussian-based SMOTE algorithm for solving skewed class distributions,” Int. J. Fuzzy Log. Intell. Syst., vol. 17, no. 4, pp. 229–234, Dec. 2017.
[16] X. Tao et al., “Real-value negative selection over-sampling for imbalanced data set learning,” Expert Syst. Appl., vol. 129, pp. 118–134, Sep. 2019.
[17] G. E. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 20–29, Jun. 2004.
[18] J. A. Sáez, J. Luengo, J. Stefanowski, and F. Herrera, “SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering,” Inf. Sci., vol. 291, pp. 184–203, Jan. 2015.
[19] J. Hu, X. He, D.-J. Yu, X.-B. Yang, J.-Y. Yang, and H.-B. Shen, “A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction,” PLoS One, vol. 9, no. 9, pp. 1–10, Sep. 2014. [Online]. Available: https://doi.org/10.1371/journal.pone.0107676
[20] C. Wang, Z. Yu, H. Zheng, N. Wang, and B. Zheng, “CGAN-plankton: Towards large-scale imbalanced class generation and fine-grained classification,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 855–859.
[21] G. Douzas and F. Bacao, “Effective data generation for imbalanced learning using conditional generative adversarial networks,” Expert Syst. Appl., vol. 91, pp. 464–471, Jan. 2018.
[22] X. Zhu, Y. Liu, J. Li, T. Wan, and Z. Qin, “Emotion classification with data augmentation using generative adversarial networks,” in Proc. Adv. Knowl. Discovery Data Mining, D. Phung, V. S. Tseng, G. I. Webb, B. Ho, M. Ganji, and L. Rashidi, Eds. Cham, Switzerland: Springer, 2018, pp. 349–360.
[23] W. Gao, L. Wang, R. Jin, S. Zhu, and Z.-H. Zhou, “One-pass AUC optimization,” Artif. Intell., vol. 236, pp. 1–29, Jul. 2016.
[24] J. Hu, H. Yang, M. R. Lyu, I. King, and A. M.-C. So, “Online nonlinear AUC maximization for imbalanced data sets,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 882–895, Apr. 2018.
[25] Q. Wang, Z. Luo, J. Huang, Y. Feng, and Z. Liu, “A novel ensemble method for imbalanced data learning: Bagging of extrapolation-SMOTE SVM,” Comput. Intell. Neurosci., vol. 2017, pp. 1–11, Jan. 2017.
[26] M. Khalilia, S. Chakraborty, and M. Popescu, “Predicting disease risks from highly imbalanced data using random forest,” BMC Med. Informat. Decis. Making, vol. 11, no. 1, Jul. 2011.
[27] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. 7th IEEE Int. Conf. Comput. Vis., Sep. 1999, pp. 1150–1157.
[28] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in Proc. Comput. Vis. (ECCV), A. Leonardis, H. Bischof, and A. Pinz, Eds. Berlin, Germany: Springer, 2006, pp. 404–417.
[29] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[30] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2014, pp. 2672–2680.
[31] I. Goodfellow et al., “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, Eds. New York, NY, USA: Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
[32] Y. Zhang, “Deep generative model for multi-class imbalanced learning,” M.S. thesis, Dept. Elect. Eng., Univ. Rhode Island, Kingston, RI, USA, 2018.
[33] L. I. Chongxuan, T. Xu, J. Zhu, and B. Zhang, “Triple generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., I. Guyon et al., Eds. Red Hook, NY, USA: Curran Associates, 2017, pp. 4088–4098.
[34] U. Hwang, D. Jung, and S. Yoon, “HexaGAN: Generative adversarial nets for real world classification,” in Proc. 36th Int. Conf. Mach. Learn., in Proceedings of Machine Learning Research, vol. 97, K. Chaudhuri and R. Salakhutdinov, Eds. Long Beach, CA, USA, Jun. 2019, pp. 2921–2930.
[35] F. Hu and H. Li, “A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE,” Math. Problems Eng., vol. 11, pp. 1–10, Jan. 2013.
[36] Q. Dong, S. Gong, and X. Zhu, “Class rectification hard mining for imbalanced deep learning,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1851–1860.
[37] S. R. Bulo, G. Neuhold, and P. Kontschieder, “Loss max-pooling for semantic image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7082–7091.
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[40] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), vol. 70, 2017, pp. 2642–2651. [Online]. Available: https://arxiv.org/abs/1610.09585
[41] S. Wu, G. Deng, J. Li, R. Li, Z. Yu, and H.-S. Wong, “Enhancing TripleGAN for semi-supervised conditional instance synthesis and classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10091–10100.
[42] B. Heo, M. Lee, S. Yun, and J. Y. Choi, “Knowledge distillation with adversarial samples supporting decision boundary,” in Proc. AAAI Conf. Artif. Intell. (AAAI), Jul. 2019, pp. 3771–3778.
[43] K. Sun, Z. Zhu, and Z. Lin, “Enhancing the robustness of deep neural networks by boundary conditional GAN,” 2019, arXiv:1902.11029. [Online]. Available: http://arxiv.org/abs/1902.11029
[44] K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated classifiers for detecting out-of-distribution samples,” 2017, arXiv:1711.09325. [Online]. Available: http://arxiv.org/abs/1711.09325
[45] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[46] G. Raskutti, M. J. Wainwright, and B. Yu, “Early stopping for non-parametric regression: An optimal data-dependent stopping rule,” in Proc. 49th Annu. Allerton Conf. Commun., Control, Comput., Berlin, Germany: Springer-Verlag, Sep. 2011, pp. 1318–1325.
[47] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2009.
[48] H.-S. Choi et al., “Deep learning based low-cost high-accuracy diagnostic framework for dementia using comprehensive neuropsychological assessment profiles,” BMC Geriatrics, vol. 18, no. 1, p. 234, Oct. 2018, doi: 10.1186/s12877-018-0915-z.
[49] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[50] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri, “Cost-sensitive learning of deep feature representations from imbalanced data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 8, pp. 3573–3587, Aug. 2018.
[51] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[52] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 233–240.
[53] M. Kubat, R. Holte, and S. Matwin, “Machine learning for the detection of oil spills in satellite radar images,” Mach. Learn., vol. 30, nos. 2–3, pp. 195–215, Dec. 1998.
[54] V. García, R. A. Mollineda, and J. S. Sánchez, “Index of balanced accuracy: A performance measure for skewed class distributions,” in Proc. 4th Iberian Conf. Pattern Recognit. Image Anal., Berlin, Germany: Springer-Verlag, Jun. 2009, pp. 441–448.
[55] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.
[56] L. van der Maaten, “Learning a parametric embedding by preserving local structure,” in Proc. 12th Int. Conf. Artif. Intell. Statist., in Proceedings of Machine Learning Research, vol. 5, D. van Dyk and M. Welling, Eds. Clearwater Beach, FL, USA: Hilton Clearwater Beach Resort, Apr. 2009, pp. 384–391.
[57] N. V. Chawla, Data Mining for Imbalanced Datasets: An Overview. Boston, MA, USA: Springer, 2005, pp. 853–867. [Online]. Available: https://doi.org/10.1007/0-387-25465-X_40
[58] G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning,” J. Mach. Learn. Res., vol. 18, no. 17, pp. 1–5, 2017.
[59] G. Kovács, “Smote-variants: A Python implementation of 85 minority oversampling techniques,” Neurocomputing, vol. 366, pp. 352–354, Nov. 2019.
[60] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. Int. Conf. Mach. Learn., vol. 70, Aug. 2017, pp. 214–223. [Online]. Available: http://proceedings.mlr.press/v70/arjovsky17a.html
[61] D. Bau et al., “GAN dissection: Visualizing and understanding generative adversarial networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–5.
[62] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of GANs for semantic face editing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9243–9252.
[63] A. Jahanian, L. Chai, and P. Isola, “On the ‘steerability’ of generative adversarial networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2020, pp. 214–223.
[64] D. Jung, J. Lee, J. Yi, and S. Yoon, “iCaps: An interpretable classifier via disentangled capsule networks,” in Proc. Comput. Vis. (ECCV), A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham, Switzerland: Springer, 2020, pp. 314–330.
[65] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” 2015, arXiv:1511.06434. [Online]. Available: https://arxiv.org/abs/1511.06434
Hyun-Soo Choi received the B.S. degree in computer and communication engineering (first major) and in brain and cognitive science (second major) from Korea University, Seoul, South Korea, in 2013, and the integrated M.S./Ph.D. degree in electrical and computer engineering from Seoul National University, Seoul, in 2019. Since February 2020, he has been a Senior Researcher with Vision AI Labs, SK Telecom. Since March 2021, he has been working at the Department of Computer Science and Engineering, Kangwon National University, South Korea.

Dahuin Jung received the B.S. degree in electrical engineering from the University of Washington, Seattle, WA, USA, in 2016. She is currently pursuing the integrated M.S./Ph.D. degree in electrical and computer engineering with Seoul National University, Seoul, South Korea. Her research interests include deep learning, representation learning, and explainable artificial intelligence (AI).

Siwon Kim received the B.S. degree in electrical and computer engineering from Seoul National University, Seoul, South Korea, in 2018, where she is currently pursuing the integrated M.S./Ph.D. degree in electrical and computer engineering. Her research interests include artificial intelligence, deep learning, and biomedical applications.

Sungroh Yoon (Senior Member, IEEE) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 2002 and 2006, respectively. He was a Visiting Scholar with the Department of Neurology and Neurological Sciences, Stanford University, from 2016 to 2017. He held research positions at Stanford University and Synopsys, Inc., Mountain View, CA, USA. From 2006 to 2007, he was with Intel Corporation, Santa Clara, CA, USA. He was an Assistant Professor with the School of Electrical Engineering, Korea University, Seoul, from 2007 to 2012. He is currently a Professor with the Department of Electrical and Computer Engineering, Seoul National University. His current research interests include machine learning and artificial intelligence. Dr. Yoon was a recipient of the SNU Education Award in 2018, the IBM Faculty Award in 2018, the Korean Government Researcher of the Month Award in 2018, the BRIC Best Research of the Year in 2018, the IMIA Best Paper Award in 2017, the Microsoft Collaborative Research Grant in 2017 and 2020, the SBS Foundation Award in 2016, the IEEE Young IT Engineer Award in 2013, and many other prestigious awards. Since February 2020, he has been serving as the Chairperson for the Presidential Committee on the Fourth Industrial Revolution established by the Korean Government.