DATA AUGMENTATION WITH VARIATIONAL AUTOENCODER FOR IMBALANCED DATASET

Samuel Stocksieker
Aix Marseille Université
I2M, CNRS, Centrale Marseille
Marseille, France

Denys Pommeret
Aix Marseille Université
I2M, CNRS, Centrale Marseille
Marseille, France

Arthur Charpentier
Université du Québec à Montréal
Département de Mathématique
Montréal, Canada

arXiv:2412.07039v1 [cs.LG] 9 Dec 2024

ABSTRACT
Learning from an imbalanced distribution presents a major challenge in predictive modeling, as
it generally leads to a reduction in the performance of standard algorithms. Various approaches
exist to address this issue, but many of them concern classification problems, with a limited focus
on regression. In this paper, we introduce a novel method aimed at enhancing learning on tabular
data in the Imbalanced Regression (IR) framework, which remains a significant problem. We
propose to use variational autoencoders (VAE), which are known to be a powerful tool for synthetic data generation and offer an interesting approach to modeling and capturing latent representations of complex distributions. However, VAEs can be inefficient when dealing with IR. Therefore, we
develop a novel approach for generating data, combining VAE with a smoothed bootstrap, specifically
designed to address the challenges of IR. We numerically investigate the scope of this method by
comparing it against its competitors on simulations and datasets known for IR.
Keywords

1 Introduction
Regression techniques are widely used in various fields such as finance, economics, medicine, and engineering, to model
relationships between features and a continuous target variable. However, datasets in real-world applications often
exhibit a significant imbalance in the distribution of the target variable, presenting specific challenges for traditional
modeling methods ((Krawczyk, 2016), (Fernández et al., 2018a)). Indeed, standard approaches typically treat all values
as equally important and are trained to minimize the average error. Since rare and/or extreme values are few in the
training set, they have little influence on error minimization during the learning process. But such values are often
highly relevant.
This issue has been extensively studied in the context of classification (Haixiang et al., 2017; Fernández et al., 2018b). However, very few works have addressed the problem of Imbalanced Regression (IR), despite its importance in many real-world applications (Krawczyk, 2016). Indeed, regression poses additional challenges compared to classification: i)
rare/minority values are not identifiable because the distribution is continuous; ii) the degree of imbalance is not easily
identifiable; iii) the extent of rebalancing is also difficult to measure; and iv) generating synthetic data also requires
generating a new value for the target variable.
For tabular data, preprocessing solutions are often preferred (Branco et al., 2016b; He and Ma, 2013) for their
universality: they allow the use of any standard regression technique. Preprocessing methods rely notably on resampling
through synthetic data generation. Several methods have been proposed to generate synthetic observations from the
initial data space, such as the well-known SMOTE algorithm (Chawla et al., 2002). It is also possible to use a method
that embeds observations into a latent space and generates data from it, such as Variational Autoencoders (VAE). A VAE is a data generation method that utilizes a neural network to encode observations into a continuous latent space, where new
data can be probabilistically generated. This approach has proven to be relatively effective in generating unstructured
data, particularly due to its ability to capture nonlinear relationships.
Our goal here is to generate artificial rare values while taking into account the observed correlations. Indeed, if we
use a synthetic data generator, regression requires generating a new value for the target variable, unlike in the case of
Data Augmentation with Variational Autoencoder for Imbalanced Dataset

classification. We propose to use a VAE approach to construct a new latent representation of the data, enabling a more
relevant generation than directly generating from their initial space. The use of a VAE is also justified because, as we
will present below, we aim to retain a representation of the observations in the latent space for specific guided generation
in IR. However, it is well known that neural networks, including VAEs, are not efficient on tabular data (Shwartz-Ziv
and Armon, 2022). Therefore, we favor a "preprocessing" approach over "in-processing" to enable the application of
any learning algorithm subsequently. In addition, a "preprocessing" solution enables the use of a "white box" model
which offers transparency and explainability. This is recommended in many fields such as insurance and finance, but also in
the social sciences or economics.
In this paper, we propose a simple and effective modification of VAEs to adapt their training and generation to the
problem of IR on structured data. Our main contributions can be summarized as follows:
i) Providing a new loss function for training VAEs in the case of IR
ii) Proposal of a new weighting scheme to control the sampling of rare values in the case where the target variable
is continuous (regression)
iii) Introduction of a new synthetic data generator by combining the learning power of VAEs and a non-parametric
generator: the smoothed bootstrap. This technique has the advantage of generating data from a "seed"
observation with the help of its neighborhood.
The paper is organized as follows. Section 2 presents various state-of-the-art studies along with their shortcomings and
differences compared to our proposal. In Section 3, we present our DAVID approach combining a VAE adapted to
IR with an alternative method of generating data from a VAE. Numerical results are presented in Section 4 with an
illustration and in Section 5 with real applications. Finally, we discuss the proposed method in Section 6.

2 Related works
We will distinguish two cases: 1) The imbalanced regression, which was initially addressed in the context of modeling
the distributions of the target variable in structured tabular data; 2) The deep imbalanced regression, which has been
introduced more recently, extending the imbalanced regression concept to unstructured data (such as images or NLP),
where deep learning approaches are primarily used and quite effective.

2.1 Imbalanced Regression

The initial works on IR were established based on a utility function that allowed binarizing the issue. More precisely,
(Ribeiro, 2011) first proposed using a relevance function to assign a value to each value of the target variable Y . Then,
the majority and minority values are identified using a user-defined threshold. This initial approach allows adapting
existing solutions within the framework of imbalanced classification, which are much more numerous. This is the case
with the adaptation of the famous SMOTE algorithm to regression (Torgo et al., 2013). This binary partitioning method
has enabled the conversion of other synthetic data generation methods such as Gaussian Noise ((Branco et al., 2016a),
(Branco et al., 2017), (Song et al., 2022)) or SMOTE extensions ((Moniz et al., 2018), (Camacho et al., 2022)). Other
methods have been proposed based on the same approach (e.g., Branco et al., 2019; Wu et al., 2022; Branco et al., 2018). This primary approach has allowed the first solutions to be proposed for IR. However, generating data in the initial data space can present challenges such as identifying and generating nonlinear relationships or handling qualitative features. Recently, a new approach has also been proposed by (Stocksieker et al., 2023), which focuses more on feature imbalance than on the target variable.

2.2 Deep Imbalanced Regression

More recently, the issue of IR has been extended to the context of unstructured data (for example, estimating the age of
individuals from images), in what is known as Deep Imbalanced Regression. Given that deep learning approaches are very effective on this type of data, it is natural that new approaches have emerged. (Yang et al., 2021) proposed the use of kernel density estimations to smooth the distributions of variables, improving learning with a neural network when the target variable is continuous and imbalanced. The authors also adapted Focal-R, a regression variant of the focal loss proposed for classification tasks (Lin et al., 2017), by replacing the scaling factor with a continuous function. (Ding et al., 2022) built upon this idea of smoothing via kernel density estimation and added techniques of "CORrelation
ALignment" and "class-balanced re-weighting" to improve the results. (Gong et al., 2022) proposed a method to enhance
the performance of deep regression models in the case of imbalanced regression by incorporating a ranking similarity
regularization into the loss function, thereby balancing the importance of different classes and improving overall model
accuracy. (Ren et al., 2022) demonstrated that using the standard MSE for imbalanced regression is inefficient and

2
Data Augmentation with Variational Autoencoder for Imbalanced Dataset

introduced a balanced MSE to improve learning with neural net techniques trained on images. (Keramati et al., 2023)
proposed a contrastive regularizer technique, from Contrastive Learning. (Sen et al., 2023) handled the issue using
logarithmic transformation and an artificial neural network. Finally, as proposed for classification ((Ai et al., 2023), (Zhang
et al., 2018), (Utyamishev and Partin-Vaisband, 2019)), the VAE approach has been proposed as a solution to the
problem of imbalanced regression with (Wang and Wang, 2024). The authors adapted the VAE for dealing with the
deep imbalanced regression. Unlike the original VAE, which performs inference for each observation (assuming i.i.d
latent representation), the authors suggested using the similarity between observations for the inference part of the VAE.
However, as in the previous works in Deep Imbalanced Regression, they used an arbitrary discretization of the continuous
target variable (with equal-interval or equal-size). This technique introduces parameterization complexity with a new
hyperparameter (number of bins), which significantly affects the modeling process. Furthermore, discretizing the target
variable’s support is not recommended as it may lead to information loss. We believe it is preferable to preserve the
continuous nature of the distribution to distinguish the values of the variable of interest. Finally, the loss function used
in their work is not suitable for imbalanced regression, which does not guarantee that the inference for these points is
relevant, and aggregating the parameters to improve it may not solve the problem.
Moreover, as demonstrated by (Ren et al., 2022), the standard MSE is not relevant for imbalanced regression and, like all
standard algorithms, the VAE focuses on an average error and thus neglects the reconstruction (and therefore generation)
of rare values. However, beyond their generative capacity, VAEs offer a latent space with interesting and exploitable
properties: each point has a latent representation in a regular space; that is, with continuity and completeness. Finally,
neural networks have the advantage of being able to capture nonlinear correlations, which is crucial for generating
synthetic data.

3 DAVID: A Novel Synthetic Data Generator


Let $x = (x_{ij})_{i=1,\dots,n;\,j=1,\dots,p} \in \mathcal{X} \subset \mathbb{R}^{n \times p}$ be a dataset composed of $n$ observations and $p$ variables, where $x_{ij}$ is the variable $j$ for the observation $i$. Let $y = (y_i)_{i=1,\dots,n} \in \mathcal{Y} \subset \mathbb{R}^n$ be the associated target variable, which is imbalanced and continuous.

3.1 β-VAE for Regression

Autoencoders enable the construction of a latent representation space for data, capturing non-linear relationships
between variables. By introducing a stochastic component into the latent space, Variational Autoencoders (VAE) train
latent random variables representing the data and naturally generate new synthetic data very close to the initial data.
VAEs are trained to reconstruct inputs from latent variables. However, in a regression framework, the target variable Y
should not be treated as a feature X. Therefore, we propose to construct a mixed VAE: the VAE takes both the input
features X and the target variable Y and aims to reconstruct them, but with a weighting specific to Y .
The global loss function is therefore classically defined as follows:
$$\mathcal{L}(\theta, \phi, x, y) = \beta_x \, \mathbb{E}_q[\log p_\theta(x|z)] - \beta_{KL} \, D_{KL}\big(q_\theta(z|x)\,\|\,p_\theta(z)\big) + \beta_y \, \mathbb{E}_q[\log p_\theta(y|z)]$$
where $\{\beta_x, \beta_y, \beta_{KL}\}$ are the weights, $\mathbb{E}_q[\log p_\theta(x|z)]$ (resp. $\mathbb{E}_q[\log p_\theta(y|z)]$) represents the reconstruction loss for $x$ (resp. $y$), and $D_{KL}(q_\theta(z|x)\,\|\,p_\theta(z))$ represents the Kullback-Leibler divergence used for regularization ($q_\theta(z|x)$ represents the distribution of the latent variables $z$ given the input data $x$, and $p_\theta(z)$ represents the prior distribution, often chosen as Gaussian). The advantage of using a VAE here is to ensure the regularity (continuity of the prior) of the latent space, through penalization, to enable coherent data generation.
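For concreteness, a minimal sketch of this mixed loss in PyTorch (assuming Gaussian decoders, so that the reconstruction log-likelihoods reduce to mean squared errors up to constants; the default weights follow Appendix A, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, y, x_hat, y_hat, mu, logvar,
                  beta_x=1.0, beta_y=10.0, beta_kl=1e-6):
    """Mixed (X, Y) beta-VAE objective, written as a loss to minimize.

    With Gaussian decoders, the reconstruction log-likelihoods reduce
    (up to constants) to mean squared errors.
    """
    rec_x = F.mse_loss(x_hat, x)        # reconstruction of the features X
    rec_y = F.mse_loss(y_hat, y)        # reconstruction of the target Y
    # KL divergence between q(z|x) = N(mu, diag(sigma^2)) and the prior N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return beta_x * rec_x + beta_y * rec_y + beta_kl * kl
```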

3.2 A Balanced loss function for Imbalanced Regression

It is demonstrated that the Mean Squared Error (MSE) loss function is not effective in addressing imbalanced regression
(Ren et al., 2022). Indeed, it mechanically favors frequent values, as they significantly reduce the loss function,
neglecting rare values as a consequence. Here, we propose the use of a balanced loss function taking into account the
frequency of the target variable Y . Indeed, the goal is to train a model capable of generating new synthetic data for the
rare values of Y . For classification tasks, rare values are directly identifiable as they belong to one or more minority
classes. However, in regression, where the support of the target variable is continuous, defining rare values already
poses an initial challenge. Here, we propose to weigh the observations during the training phase by the inverse of the
empirical density of Y . In other words, the rarer a value, the higher its weight, and vice versa. Finally, we suggest
mitigating the weights using a parameter α. The higher this parameter, the greater the difference between the weights
will be. The global loss function thus becomes:
$$\mathcal{L}(\theta, \phi, x, y) = \beta_x \, \mathbb{E}_q[\log p_\theta(x|z)] - \beta_{KL} \, D_{KL}\big(q_\theta(z|x)\,\|\,p_\theta(z)\big) + \frac{\beta_y}{\hat{f}(Y)^\alpha} \, \mathbb{E}_q[\log p_\theta(y|z)] \quad (1)$$


Since the density of Y is unknown, we propose to estimate it using a kernel estimator (with a smoothing parameter ensuring its convergence, e.g., Silverman's bandwidth matrix, Silverman (1986), or that of Scott (2015)). We can find a comparable approach in (Steininger et al., 2021), but the authors employed the opposite rather than the inverse. Another approach, called "inverse re-weighting", is quite similar in (Yang et al., 2021), where the authors proposed to re-weight the loss function by multiplying it by the inverse of the Label Distribution Smoothing estimated density for each target bin, i.e., by partitioning the domain of Y. However, partitioning the support of Y poses a risk of information loss.
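A minimal sketch of this weighting, using a univariate Gaussian kernel density estimate of Y with Silverman's bandwidth (via SciPy; the function and variable names are ours):

```python
import numpy as np
from scipy.stats import gaussian_kde

def inverse_density_weights(y, alpha=1.0):
    """Weight each observation by the inverse of the estimated density of Y.

    Rare target values receive large weights; alpha tempers the contrast.
    """
    kde = gaussian_kde(y, bw_method="silverman")   # kernel estimate of f_Y
    density = kde(y)                               # \hat{f}_Y(y_i) at each observation
    weights = 1.0 / density ** alpha
    return weights / weights.sum()                 # normalize so the weights sum to 1
```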

3.3 A Smoothed Bootstrap for Data Generation

As previously stated, the goal and benefit of VAEs reside in their capacity to produce data closely resembling that
on which the model was trained. Typically, a classical VAE (i.e. using Gaussian distributions) aims to calibrate the
parameters µi and σi of a latent normal distribution to the observation xi . The challenge in IR is that algorithms face
difficulty in learning to model rare values. Classical autoencoders also encounter this issue. In the section above, we
proposed an initial level of adjustment, a balanced MSE, to enhance the VAE’s learning process. However, the variance
σi assigned to rare values could still be significant, and the generated data may not faithfully represent the original
data. Indeed, the VAE constructs a latent representation for each observation, i.e., calibrates the parameters of each
latent variable independently: the latent representations are i.i.d. It is important to emphasize that the data generation
concerns rare values. The objective is indeed to construct a training sample consisting of "majority" values, observed,
and rare values, observed or generated.
Here, we propose a second level of processing to improve the generation of rare data. First, we define the importance
weights used for the seed drawing, which rebalance the learning process and enable better modeling of rare values, as
follows:
$$\omega_i := \frac{1}{\hat{f}_Y(y_i)^\alpha} \quad (2)$$

We then suggest not generating conventionally with the VAE but rather through a smoothed bootstrap (see Silverman and Young, 1987; Hall et al., 1989; De Angelis and Young, 1992) on the $n \times q$ matrix of values $\mu$ representing the mean value of the new representation of $X$ in the latent space of dimensionality $q$: $\mu := \{\mu_{ij},\, i = 1, \dots, n;\, j = 1, \dots, q\}$.
Smoothed bootstrap (SB) consists of drawing samples from kernel density estimators of the distribution. It can be
decomposed into two steps: first, a seed is randomly drawn and second, a random noise from the kernel density
estimator is added to obtain a new sample. Here, the first step is represented by the drawing weight ωi and the second
by a kernel generation on µ. Convergence properties of smoothed bootstrap are studied in (De Martini and Rapallo,
2008) and (Falk and Reiss, 1989). They proved the consistency of the smoothed bootstrap with classical multivariate
kernel estimator and more specifically the convergence in Mallows metric. We propose to use a mixture of multivariate
kernels (e.g., Gaussian) to generate synthetic data:
$$g_{Z^*}(z^*|\mu) = \sum_{i=1,\dots,n} \omega_i K_i(z^*, \mu),$$
where $(K_i)_{i=1,\dots,n}$ is a collection of kernels and $(\omega_i)_{i=1,\dots,n}$ is a sequence of positive weights with $\sum_{i=1,\dots,n} \omega_i = 1$.
Here the index ∗ stands for the synthetic data. The smoothed bootstrap has the advantage of not requiring any additional
parameters. This method is based on the kernel density estimate technique, for which the estimation of the smoothing
parameter H is already optimally proposed. Indeed, (Scott, 2015) or (Silverman, 1986) suggest a parametrization of
the bandwidth matrix to obtain consistency. Moreover, it is important to note that the VAE provides a regular space, offering an ideal framework for the smoothed bootstrap. This may not be the case for the initial space, which can exhibit discontinuities and bounded distributions. Finally, as in the Gaussian Noise (Lee and Sauchi, 2000) and ROSE (Menardi and Torelli, 2014) algorithms, we use a parameter tuning the noise level and managing the generation. This multiplicative parameter (<1) is applied directly to the smoothing matrix.
We could have applied this method to a classical autoencoder, but VAEs ensure the necessary continuity of the support
for the application of smoothed bootstrap. Indeed, a kernel density estimator can be seen as a Gaussian mixture, just like
the latent distribution of a VAE. However, a VAE generates from a Gaussian distribution specific to each observation,
with possible biased parameters for rare values. Here, generation is based on the non-parametric estimation of the
distribution of mean latent values µi , and the variance used for generation is derived from the neighborhood and the
distribution of means, rather than attempting to estimate a parameter specific to each observation (σi ).
More precisely, for the classical VAE each latent variable can be expressed as $z_i = \hat{\mu}_i + \hat{\sigma}_i \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$. The generator associated with $z_i$ has the form $g_{z^*}(z^*|z_i) \sim \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i)$. The weakness of this generator in the presence of a rare $z_i$ value is that it takes into account the associated variance $\sigma_i$, which is likely to be very large, and the associated
generated value of $z^*$ will often be far from $\mu_i$. We are trying to remedy this problem with DAVID, taking into account all the $\mu_i$ and proposing a joint generative function as follows:
$$g_{z^*}(z^*|\mu) = \sum_{i=1}^{n} \omega_i K_{H_n}(z^* - \mu_i),$$
where $K$ is a Gaussian kernel and $H_n := \eta \times Var(\mu)$ is defined by Silverman's rule of thumb, $\eta = \left(\frac{4}{p+2}\right)^{\frac{1}{d+4}} n^{\frac{-1}{d+4}}$, or Scott's rule of thumb, $\eta = n^{\frac{-1}{d+4}}$, with $Var(\mu)$ being the variance matrix of $\mu$. We notice that the generation of $z^*$ from
a seed $\mu_i$ is performed based on the kernel centered on it, i.e., depending on its neighborhood and the characteristics of the distribution of $\mu$. The VAE proposes to generate from the inference made on the seed $z_i$, i.e., from a Gaussian distribution with parameters $(\mu_i, \sigma_i)$. As mentioned earlier, these parameters of the latent Gaussians can be difficult to estimate for rare values, so it is wise to rely on the neighborhood of the latent representation of rare values to generate new values. Furthermore, the assumption of latent Gaussians can be too strong for tabular data (Ma et al., 2020), so generating from a classical VAE may be ineffective in certain situations.
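A minimal sketch of this generator in NumPy, assuming `mu` is the n × q matrix of latent means, `weights` are the normalized drawing weights of (2), and `noise_scale` is the multiplicative parameter (<1) of this section; we use the latent dimension q in both rule-of-thumb exponents, which is our reading of the formulas above:

```python
import numpy as np

def smoothed_bootstrap(mu, weights, n_samples, noise_scale=0.1, rule="silverman", seed=0):
    """Generate latent points z* from a weighted kernel estimate of the latent means mu."""
    n, q = mu.shape
    # Rule-of-thumb bandwidth factor eta (latent dimension q used in both exponents)
    if rule == "silverman":
        eta = (4.0 / (q + 2)) ** (1.0 / (q + 4)) * n ** (-1.0 / (q + 4))
    else:  # Scott's rule of thumb
        eta = n ** (-1.0 / (q + 4))
    # Smoothing matrix H_n = eta * Var(mu), shrunk by the multiplicative noise parameter
    H = noise_scale * eta * np.cov(mu, rowvar=False)
    rng = np.random.default_rng(seed)
    # Step 1: draw seeds among the latent means with the importance weights
    seeds = rng.choice(n, size=n_samples, p=weights)
    # Step 2: add kernel noise centered on each seed
    noise = rng.multivariate_normal(np.zeros(q), H, size=n_samples)
    return mu[seeds] + noise
```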

3.4 Algorithm

Our approach is based on two main steps:

1. The first one is the β−VAE for regression training with the balanced loss function on the target variable.
2. The second step involves generating synthetic data:
i) by drawing a seed, the mean representation in the latent space of a rare initial observation $\mu_i$, obtained with the VAE's encoder;
ii) by generating a new observation in the latent space $z^*$ using a smoothed bootstrap applied on $\mu$;
iii) by generating a new observation $(x^*, y^*)$ from $z^*$ using the decoder.

The data generation is illustrated in Figure 1.

Figure 1: DAVID Algorithm: Data Generation
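As a rough end-to-end sketch of these two steps (reusing the `inverse_density_weights` and `smoothed_bootstrap` helpers sketched earlier, and assuming a trained encoder/decoder pair exposed as plain callables on NumPy arrays; the names are ours, not those of the released code):

```python
import numpy as np

def david_generate(encoder, decoder, X, y, n_synth, alpha=1.0, noise_scale=0.1):
    """DAVID generation: encode, draw rare seeds, smooth in latent space, decode."""
    # Latent means of the training observations, from the trained beta-VAE encoder
    mu = encoder(np.column_stack([X, y]))                      # shape (n, q)
    # Importance weights: inverse estimated density of Y, eq. (2)
    w = inverse_density_weights(y, alpha=alpha)
    # Seed drawing + kernel noise in the latent space (smoothed bootstrap)
    z_star = smoothed_bootstrap(mu, w, n_samples=n_synth, noise_scale=noise_scale)
    # Decode to obtain the synthetic pairs (x*, y*)
    synth = decoder(z_star)                                    # shape (n_synth, p + 1)
    return synth[:, :-1], synth[:, -1]
```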

4 Numerical Illustration
4.1 Dataset

We first test our approach on a simple simulated dataset of size 3,000. To do this, we simulate 6 numerical features
$X = (X_j), j = 1, \dots, 6$, and 1 target variable $Y$ as follows:

• Features $X$: $X_1 \sim \mathcal{N}(0, 2)$; $X_2 \sim \mathcal{N}(10, 2)$; $X_3 \sim \mathcal{N}(0, 5)$; $X_4 \sim \mathcal{N}(X_1^3, 1)$; $X_5 \sim \mathcal{N}((X_2 - 10)^2, 1)$; $X_6 \sim \mathcal{N}(X_3^2, 2)$
• Target variable $Y$: $Y \sim \mathcal{N}(U^2, 10)$ with $U := 11\,mm(X_4) + 9\,mm(X_5) + 14\,mm(X_6) + 10$, where $mm$ is the Min-Max Scaler: $mm(X) := \frac{X - \min(X)}{\max(X) - \min(X)}$

As observed here, we simulated a dataset of 7 variables with nonlinear relationships. The dataset is defined based on 3
Gaussian latent variables to which a slight Gaussian noise is added.
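A sketch of this simulation in Python (we read N(a, b) as mean a and scale b; if the second argument denotes a variance in the original, the scales below should be replaced by their square roots):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3_000

# Independent base features
X1 = rng.normal(0, 2, n)
X2 = rng.normal(10, 2, n)
X3 = rng.normal(0, 5, n)
# Features built as noisy nonlinear transformations
X4 = rng.normal(X1 ** 3, 1)
X5 = rng.normal((X2 - 10) ** 2, 1)
X6 = rng.normal(X3 ** 2, 2)

def mm(x):
    """Min-Max scaler."""
    return (x - x.min()) / (x.max() - x.min())

# Imbalanced continuous target
U = 11 * mm(X4) + 9 * mm(X5) + 14 * mm(X6) + 10
Y = rng.normal(U ** 2, 10)
```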


4.2 Protocol

We construct a training set and a test set through uniform random sampling, aiming to allocate 70% of the data to the training set. Next, we construct four models: a vanilla autoencoder, a 0-VAE (by canceling the penalty of the Kullback-Leibler divergence), a vanilla β-VAE, and a rebalanced β-VAE with weights on Y given by (1). It is important to note that a vanilla VAE, i.e., with a weight of 1 for the KL divergence, does not yield good results. Indeed, enforcing normality in the latent space penalizes the reconstruction of observations (Ma et al., 2020). Therefore, the parameter β in the β-VAE is smaller than 1. The architectures of the models and the hyperparameters are detailed in Appendix A (code and data are available at: https://github.com/sstocksieker/DAVID/).
Once the models are constructed, we generate several synthetic samples from the same seed drawing with the weights defined previously in (2). To measure the impact of our proposals, we propose a step-by-step study in the evaluation of our results. We then obtain several sets of training data:

• Initial training set: Baseline


• Oversampling, applied on the training set, with the drawing weights (2): OS
• Classical Smoothed Bootstrap, applied on the training set, i.e., in the data space: CSB
• Natural generation with the 0-VAE: 0VAE
• Natural generation with the vanilla β-VAE: BVAE
• Smoothed bootstrap from the µ distribution in the latent space Z obtained with the vanilla β-VAE: kBVAE
• Natural generation with the rebalanced β-VAE, i.e., with the loss function defined in (1): BVAEw
• Smoothed bootstrap from the µ distribution in the latent space Z obtained with the rebalanced β-VAE, i.e., with the loss function defined in (1): kBVAEw. This approach will be referred to as the DAVID generator, which combines all the steps of our algorithm.
• Smoothed bootstrap on the latent space obtained with PCA: kPCA
• Smoothed bootstrap on the latent space obtained with kernel-PCA (polynomial kernel): kKPCA
• We finally combine the smoothed bootstrap with an autoencoder: kAE. This last method is to be avoided because the autoencoder does not guarantee the regularity of the latent space, which theoretically prevents the application of the SB.

To compare our approach with the state of the art, we also simulate synthetic data using the following methods:

• A Tabular Variational Autoencoder (Xu et al., 2019), trained with the same number of epochs as ours, from the python-package Synthetic Data Vault (SDV) (Patki et al., 2016): TVAE
• A Conditional Tabular Generative Adversarial Net (Xu et al., 2019), trained with the same number of epochs as ours, from the python-package Synthetic Data Vault (SDV) (Patki et al., 2016): CTGAN
• A Copula Generative Adversarial Net, trained with the same number of epochs as ours, from the python-package Synthetic Data Vault (SDV) (Patki et al., 2016): CopGAN
• Oversampling, applied on the training set, with the UBL approach (Branco et al., 2016a) from the python-package imbalancedlearningregression (Wu et al., 2022): ILRro
• The SMOTE for Regression (Torgo et al., 2013), applied on the training set, with the UBL approach (Branco et al., 2016a) from the python-package imbalancedlearningregression (Wu et al., 2022): ILRsmote
• The Gaussian Noise for Regression (Branco et al., 2017), applied on the training set, with the UBL approach (Branco et al., 2016a) from the python-package imbalancedlearningregression (Wu et al., 2022): ILRgn
• The ADASYN method (He et al., 2008), with the UBL approach (Branco et al., 2016a) from the python-package imbalancedlearningregression (Wu et al., 2022), was also applied but removed because the results were very poor and, more importantly, the computational time was too high.
• We would like to note that we were unable to compare our results with the VIR (Wang and Wang, 2024) approach as their code is no longer accessible.

The next step involves predicting the target variable Y of the test set using the different training datasets, including, as
the baseline, the initial training dataset. To obtain robust results, we apply the comparison on 10 train-test samples
(K-fold approach). It is important to note that the train sets are mixed, meaning they are constructed by blending
original data (especially for frequent observations) and synthetic data (especially for rare values). In the same way, to
avoid getting results dependent on specific learning algorithms, we use 10 models from the autoML of the H2O package (LeDell and Poirier, 2020) among the following algorithms: Distributed Random Forest, Extremely Randomized
Trees, Generalized Linear Model with regularization, Gradient Boosting Model, Extreme Gradient Boosting and a
Fully-connected multi-layer artificial neural network.
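A sketch of this evaluation step with the H2O AutoML Python API (the data-frame arguments and target column name are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

def automl_predict(train_df, test_df, target="Y", max_models=10):
    """Fit H2O AutoML on an augmented training set and predict the test target."""
    h2o.init()
    train_hf = h2o.H2OFrame(train_df)   # original + synthetic rows (pandas DataFrame)
    test_hf = h2o.H2OFrame(test_df)
    features = [c for c in train_hf.columns if c != target]
    aml = H2OAutoML(max_models=max_models, seed=1)
    aml.train(x=features, y=target, training_frame=train_hf)
    preds = aml.leader.predict(test_hf).as_data_frame()["predict"]
    return preds.to_numpy()
```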

4.3 Results

We compare the prediction results with the following metrics:


• The standard Mean Squared Error: $MSE(Y, \hat{Y}) := \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
• Our weighted MSE: $wMSE(Y, \hat{Y}) := \frac{1}{n}\sum_{i=1}^{n}\omega_i (y_i - \hat{y}_i)^2$, with $\omega_i$ defined as in (2)
• The standard Mean Absolute Error: $MAE(Y, \hat{Y}) := \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
• The Mean Absolute Percentage Error: $MAPE(Y, \hat{Y}) := \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{\max(\epsilon, |y_i|)}$
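For completeness, a sketch of these four metrics (with `w` the weights of (2) and `eps` playing the role of ϵ in the MAPE denominator):

```python
import numpy as np

def regression_metrics(y_true, y_pred, w, eps=1e-8):
    """MSE, weighted MSE, MAE and MAPE as defined above."""
    err = y_true - y_pred
    return {
        "MSE": np.mean(err ** 2),
        "wMSE": np.mean(w * err ** 2),                                # weights from (2)
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err) / np.maximum(eps, np.abs(y_true))),
    }
```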

Metric        MSE           wMSE          MAE           MAPE
Train         mean (std)    mean (std)    mean (std)    mean (std)
Baseline 222 (29) 0,11 (0,01) 7,78 (0,21) 0,06 (0,00)
OS 245 (36) 0,12 (0,02) 7,95 (0,18) 0,06 (0,00)
CSB 247 (24) 0,12 (0,01) 8,11 (0,16) 0,06 (0,00)
kAE 251 (31) 0,12 (0,01) 8,10 (0,22) 0,06 (0,00)
0VAE 220 (24) 0,10 (0,01) 7,77 (0,19) 0,06 (0,00)
BVAE 218 (23) 0,10 (0,01) 7,79 (0,18) 0,06 (0,00)
kBVAE (ours) 218 (23) 0,10 (0,01) 7,79 (0,18) 0,06 (0,00)
BVAEw (ours) 214 (25) 0,10 (0,01) 7,81 (0,14) 0,06 (0,00)
kBVAEw (ours) 202 (22) 0,10 (0,01) 7,70 (0,18) 0,05 (0,00)
kPCA (ours) 247 (28) 0,12 (0,01) 8,10 (0,19) 0,06 (0,00)
kKPCA (ours) 494 (258) 0,24 (0,12) 9,42 (2,22) 0,07 (0,02)
TVAE 214 (23) 0,10 (0,01) 7,72 (0,15) 0,06 (0,00)
CTGAN 297 (47) 0,14 (0,02) 8,24 (0,31) 0,06 (0,00)
CopGAN 344 (82) 0,16 (0,04) 8,83 (1,39) 0,07 (0,01)
ILRro 269 (39) 0,13 (0,02) 8,13 (0,27) 0,06 (0,00)
ILRsmote 245 (32) 0,12 (0,02) 7,97 (0,21) 0,06 (0,00)
ILRgn 238 (24) 0,11 (0,01) 8,20 (0,23) 0,06 (0,00)
Table 1: Illustration results: metrics for test prediction

The results given in Table 1 show that:


• The results of the BVAE are slightly better than those of the 0VAE, i.e., this shows the importance of having a
Kullback-Leibler penalty, which notably helps obtain a regular latent space.
• The results of the kAE do not appear to be better than those of the kTrain: this shows that embedding the
data into a latent space to apply the smoothed bootstrap is not sufficient: the latent space must have regularity
properties
• The results of kBVAE and BVAE are similar: generation with a smoothed bootstrap gives the same results as
the natural generation of the BVAE
• The results of the BVAEw are better than those obtained from the BVAE: this demonstrates the relevance of the
proposed loss function.
• The results of DAVID, that is, the kBVAEw, are better than those of the BVAEw: generating data with a
smoothed bootstrap is more suitable than natural generation of the VAE.
• The results of kPCA and kKPCA are quite poor: applying a smoothed bootstrap to the latent space of factorial
approaches is not relevant.
• Synthetic data generators that are not specific to IR (TVAE, CTGAN, CopGAN) are not suitable: the results
obtained are worse than those obtained with the initial sample.


• Finally, DAVID outperforms the state-of-the-art methods (ILRro,ILRsmote,ILRgn).

A study on performance ranking showed that DAVID is often superior and that the results are quite robust.

5 Experiments
To test our approach on real datasets, we compare the results on benchmark datasets for IR taken from (Branco et al.,
2019). The dedicated repository "Data Sets for Imbalanced Regression Learning" is available at https://paobranco.github.io/DataSets-IR/.

Dataset       bank8FM       abalone       boston         NO2
Train         mean (std)    mean (std)    mean (std)     mean (std)
Baseline 1,87 (0,19) 6,84 (0,29) 44,44 (9,57) 0,48 (0,04)
OS 2,61 (0,4) 6,19 (0,17) 41,01 (11,07) 0,47 (0,06)
kTrain 3,3 (0,52) 5,81 (0,18) 37,49 (12,47) 0,46 (0,07)
kAE 1,77 (0,16) 5,79 (0,19) 37,53 (11,62) 0,44 (0,05)
0VAE 1,78 (0,15) 5,78 (0,25) 40 (9,9) 0,42 (0,04)
BVAE 1,76 (0,16) 5,74 (0,19) 38,99 (11,12) 0,42 (0,04)
kBVAE (ours) 1,74 (0,17) 5,6 (0,19) 37,37 (12,03) 0,43 (0,04)
BVAEw (ours) 1,71 (0,16) 5,67 (0,22) 35,67 (11,82) 0,42 (0,04)
kBVAEw (ours) 1,68 (0,13) 5,52 (0,22) 34,86 (12,12) 0,4 (0,04)
kPCA (ours) 3,35 (0,52) 5,83 (0,21) 39,22 (11,64) 0,46 (0,07)
kKPCA (ours) 2,57 (0,51) 5,81 (0,18) 38,99 (11,06) 0,44 (0,05)
TVAE 2,56 (0,65) 7,17 (0,33) 45,2 (12,59) 0,5 (0,06)
CTGAN 8,96 (4,55) 7,43 (0,32) 51,77 (15,11) 0,54 (0,05)
CopGAN 8 (3,08) 7,5 (0,3) 55,79 (12,37) 0,57 (0,08)
ILRro 3,37 (0,36) 6,42 (0,2) 46,11 (10,85) 0,48 (0,04)
ILRsmote 2,6 (0,33) 5,78 (0,2) 44,3 (9,19) 0,47 (0,04)
ILRgn 2,51 (0,38) 6,41 (0,18) 42,78 (10,73) 0,46 (0,04)
Table 2: Experiments results: wMSE for test prediction

The results, in Table 2, confirm those of the illustration: DAVID ("kBVAEw") gives better results than the initial training sample and the state-of-the-art approaches. The results show that it is preferable to generate synthetic data from the latent space rather than from the initial space ("kTrain"). Finally, these experiments show that using standard synthetic data generators ("TVAE", "CTGAN", "CopGAN") is not recommended in the case of imbalanced regression.

6 Discussion and Perspectives


This paper proposes an enhancement for learning in imbalanced regression (IR), which remains a relatively unexplored
problem compared to classification, especially for structured data. We suggest embedding observations into a latent
space to enable more relevant data generation than in the initial space. We empirically demonstrated that intuitive
techniques like PCA and kernel-PCA do not offer a satisfactory framework for data augmentation. On the other hand,
based on neural networks, deep learning approaches used effectively on images (deep imbalanced regression) are
limited for tabular data. However, VAEs offer a variational inference capable of capturing nonlinear correlations, which
is crucial in data generation.
We propose here leveraging the power of VAE inference while reconsidering how to generate data. Moreover, VAEs provide a suitable framework for the application of KDEs because they offer a regular latent space, which may not be the case in the original space or with a vanilla autoencoder. Our results empirically show that modifying only the loss function of a VAE for data generation in IR (in-processing) is satisfactory. However, we can improve this data generation
by replacing the native data generation of the VAE with a smoothed bootstrap. The underlying idea is to use the
neighborhood of observations in the latent space to more effectively generate rare values, which are inherently rare
i.e. difficult to model. It is interesting to note that generation from VAEs is semi-parametric (as it stems from latent
Gaussians). The DAVID algorithm offers non-parametric generation based on a smoothed bootstrap, i.e., a KDE-like
approach. Our simple and effective DAVID algorithm offers better results than conventional approaches in IR and
than the traditional VAE on multiple datasets with various learning algorithms. Our study also demonstrates that using "vanilla" synthetic data generators (TVAE or CTGAN) not specifically dedicated to the problem of IR is not effective.
The relevance of our approach can be explained by its ability to consider nonlinear correlations between variables.
This methodology is effective when the latent representation space accurately reflects the observations, meaning the
VAE (Variational Autoencoder) must effectively reproduce the data and provide the continuity properties needed for
applying smoothed bootstrap. Thus, the method would not work if the VAE is not functioning properly. Changing the
representation space of the data to generate synthetic data is promising. It allows for the consideration of nonlinear
correlations as well as mixed data, which remains a significant challenge (Ma et al., 2020).
The DAVID algorithm could potentially be applied in the context of imbalanced classification, but this issue has already been extensively addressed with a plethora of solutions. Applying this approach to mixed tabular data in IR would
be intriguing, as it remains a relatively unexplored and challenging application type, especially due to correlations
associated with qualitative data. Finally, it would be interesting to test this new approach on images within the context
of deep imbalanced regression.

Acknowledgment
The authors thank the reviewers for their helpful comments which helped to improve the manuscript. S. Stocksieker
would like to acknowledge the support of the Research Chair DIALog under the aegis of the Risk Foundation, a joint
initiative by CNP Assurance. D. Pommeret would like to acknowledge the support received from the Research Chair
ACTIONS under the aegis of the Risk Foundation, an initiative by BNP Paribas Cardif and the Institute of Actuaries of
France.

References
Ai, Q., Wang, P., He, L., Wen, L., Pan, L., and Xu, Z. (2023). Generative oversampling for imbalanced data via
majority-guided vae. In International Conference on Artificial Intelligence and Statistics, pages 3315–3330. PMLR.
Branco, P., Ribeiro, R. P., and Torgo, L. (2016a). Ubl: an r package for utility-based learning. arXiv preprint
arXiv:1604.08079.
Branco, P., Torgo, L., and Ribeiro, R. P. (2016b). A survey of predictive modeling on imbalanced domains. ACM
computing surveys (CSUR), 49(2):1–50.
Branco, P., Torgo, L., and Ribeiro, R. P. (2017). Smogn: a pre-processing approach for imbalanced regression. In First
international workshop on learning with imbalanced domains: Theory and applications, pages 36–50. PMLR.
Branco, P., Torgo, L., and Ribeiro, R. P. (2018). Rebagg: Resampled bagging for imbalanced regression. In Second
International Workshop on Learning with Imbalanced Domains: Theory and Applications, pages 67–81. PMLR.
Branco, P., Torgo, L., and Ribeiro, R. P. (2019). Pre-processing approaches for imbalanced distributions in regression.
Neurocomputing, 343:76–99.
Camacho, L., Douzas, G., and Bacao, F. (2022). Geometric smote for regression. Expert Systems with Applications,
page 116387.
Chawla, Bowyer, Hall, and Kegelmeyer (2002). Smote: Synthetic minority over-sampling technique. Journal of
Artificial Intelligence Research, 16:321–357.
De Angelis, D. and Young, G. A. (1992). Smoothing the bootstrap. International Statistical Review/Revue Internationale
de Statistique, pages 45–56.
De Martini, D. and Rapallo, F. (2008). On multivariate smoothed bootstrap consistency. Journal of statistical planning
and inference, 138(6):1828–1835.
Ding, Y., Jia, M., Zhuang, J., and Ding, P. (2022). Deep imbalanced regression using cost-sensitive learning and deep
feature transfer for bearing remaining useful life estimation. Applied Soft Computing, 127:109271.
Falk, M. and Reiss, R.-D. (1989). Weak convergence of smoothed and nonsmoothed bootstrap quantile estimates. The
Annals of Probability, pages 362–371.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., and Herrera, F. (2018a). Learning from imbalanced
data sets, volume 10. Springer.


Fernández, A., Garcia, S., Herrera, F., and Chawla, N. V. (2018b). Smote for learning from imbalanced data: progress
and challenges, marking the 15-year anniversary. Journal of artificial intelligence research, 61:863–905.
Gong, Y., Mori, G., and Tung, F. (2022). Ranksim: Ranking similarity regularization for deep imbalanced regression.
arXiv preprint arXiv:2205.15236.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. (2017). Learning from class-imbalanced
data: Review of methods and applications. Expert systems with applications, 73:220–239.
Hall, P., DiCiccio, T. J., and Romano, J. P. (1989). On smoothing and the bootstrap. The Annals of Statistics, pages
692–704.
He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). Adasyn: Adaptive synthetic sampling approach for imbalanced
learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational
intelligence), pages 1322–1328. IEEE.
He, H. and Ma, Y. (2013). Imbalanced learning: foundations, algorithms, and applications. John Wiley & Sons.
Keramati, M., Meng, L., and Evans, R. D. (2023). Conr: Contrastive regularizer for deep imbalanced regression. arXiv
preprint arXiv:2309.06651.
Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial
Intelligence, 5(4):221–232.
LeDell, E. and Poirier, S. (2020). H2O AutoML: Scalable automatic machine learning. 7th ICML Workshop on
Automated Machine Learning (AutoML).
Lee and Sauchi (2000). Noisy replication in skewed binary classification. Computational Statistics and Data Analysis,
34(2):165–191.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings
of the IEEE international conference on computer vision, pages 2980–2988.
Ma, C., Tschiatschek, S., Turner, R., Hernández-Lobato, J. M., and Zhang, C. (2020). Vaem: a deep generative model
for heterogeneous mixed type data. Advances in Neural Information Processing Systems, 33:11237–11247.
Menardi and Torelli (2014). Training and assessing classification rules with imbalanced data. Data Mining and
Knowledge Discovery, 28(1):92–122.
Moniz, N., Ribeiro, R., Cerqueira, V., and Chawla, N. (2018). Smoteboost for regression: Improving the prediction of
extreme values. In 2018 IEEE 5th international conference on data science and advanced analytics (DSAA), pages
150–159. IEEE.
Patki, N., Wedge, R., and Veeramachaneni, K. (2016). The synthetic data vault. In IEEE International Conference on
Data Science and Advanced Analytics (DSAA), pages 399–410.
Ren, J., Zhang, M., Yu, C., and Liu, Z. (2022). Balanced mse for imbalanced visual regression. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7926–7935.
Ribeiro, R. P. (2011). Utility-based regression. Ph. D. dissertation.
Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
Sen, S., Singh, K. P., and Chakraborty, P. (2023). Dealing with imbalanced regression problem for large dataset using
scalable artificial neural network. New Astronomy, 99:101959.
Shwartz-Ziv, R. and Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90.
Silverman, B. and Young, G. (1987). The bootstrap: to smooth or not to smooth? Biometrika, 74(3):469–479.
Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. CRC press.
Song, X. Y., Dao, N., and Branco, P. (2022). Distsmogn: Distributed smogn for imbalanced regression problems. In
Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, pages 38–52.
PMLR.
Steininger, M., Kobs, K., Davidson, P., Krause, A., and Hotho, A. (2021). Density-based weighting for imbalanced
regression. Machine Learning, 110:2187–2211.
Stocksieker, S., Pommeret, D., and Charpentier, A. (2023). Data augmentation for imbalanced regression. In
International Conference on Artificial Intelligence and Statistics, pages 7774–7799. PMLR.
Torgo, L., Ribeiro, R. P., Pfahringer, B., and Branco, P. (2013). Smote for regression. In Portuguese conference on
artificial intelligence, pages 378–389. Springer.


Utyamishev, D. and Partin-Vaisband, I. (2019). Progressive vae training on highly sparse and imbalanced data. arXiv
preprint arXiv:1912.08283.
Wang, Z. and Wang, H. (2024). Variational imbalanced regression: Fair uncertainty quantification via probabilistic
smoothing. Advances in Neural Information Processing Systems, 36.
Wu, W., Kunz, N., and Branco, P. (2022). Imbalancedlearningregression-a python package to tackle the imbalanced
regression problem. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases,
pages 645–648. Springer.
Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veeramachaneni, K. (2019). Modeling tabular data using conditional
gan. In Advances in Neural Information Processing Systems.
Yang, Y., Zha, K., Chen, Y., Wang, H., and Katabi, D. (2021). Delving into deep imbalanced regression. In International
Conference on Machine Learning, pages 11842–11851. PMLR.
Zhang, C., Zhou, Y., Chen, Y., Deng, Y., Wang, X., Dong, L., and Wei, H. (2018). Over-sampling algorithm based
on vae in imbalanced classification. In Cloud Computing–CLOUD 2018: 11th International Conference, Held as
Part of the Services Conference Federation, SCF 2018, Seattle, WA, USA, June 25–30, 2018, Proceedings 11, pages
334–344. Springer.

A Model Architecture and Parameters


For the illustration and experiments, we have chosen the following parameters:
• train-test size: 60-40%
• The α parameter for weighting: 1. This parameterization allows convergence to a continuous uniform target
distribution, i.e. all observations are given equal weight.
• The parameter tuning the noise level for perturbation approaches: 0.1. This parameter ensures that we do not stray too far from the seed observation in the latent space. It is a classical default value in equivalent approaches (Branco et al., 2016a).
• Epochs: 2000, batch size: 128, learning rate: 10e−3
• The weight of the X reconstruction in the loss function: βx = 1. This parameter is always set to 1; setting the weight of Y is sufficient.
• The weight of the Y reconstruction in the loss function: βy = 10. The weight of the Y reconstruction is approximately equal to the number of covariates.
• The weight of the Kullback-Leibler term in the loss function: βKL = 1e−6. It is important to highlight that a standard VAE, which assigns a weight of 1 to the KL divergence, does not produce satisfactory results. This is because enforcing normality in the latent space hinders the reconstruction of observations (Ma et al., 2020). Consequently, the β parameter in the β-VAE is set to less than 1.
• The Kernel Density Estimate for the Smoothed Bootstrap and the weighting in the loss function and the seed drawing is obtained with a Gaussian kernel with Silverman's rule (using the python-package SciPy, as described in Section 3.3).
The architecture of our β-VAE is as follows:
• an encoder ϕ consisting of 5 hidden layers of dimensions (p + 1, 2p + 1, p − q, p − 2q, p − 3q / p − 3q)
• a decoder ψ consisting of 5 hidden layers of dimensions (p − 3q, p − 2q, p − q, 2p + 1, p / 1)
• the hidden-layer dimension-reduction parameter q = int(p/10) + 1
• activation functions are all Tanh()
• a standard reparametrization based on a random standard Gaussian
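A sketch of this architecture in PyTorch, reading the "/"-separated final sizes as two parallel output heads (mean/log-variance for the encoder, X/Y for the decoder) and p − 3q as the latent dimension; this reading and the class name are our assumptions:

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Mixed (X, Y) beta-VAE following the layer sizes listed above."""

    def __init__(self, p):
        super().__init__()
        q = int(p / 10) + 1                      # hidden-layer dimension-reduction step
        d = p - 3 * q                            # latent dimension (our reading)

        def mlp(sizes):
            layers = []
            for a, b in zip(sizes[:-1], sizes[1:]):
                layers += [nn.Linear(a, b), nn.Tanh()]
            return nn.Sequential(*layers)

        self.encoder = mlp([p + 1, 2 * p + 1, p - q, p - 2 * q])
        self.mu = nn.Linear(p - 2 * q, d)        # parallel heads: mean ...
        self.logvar = nn.Linear(p - 2 * q, d)    # ... and log-variance
        self.decoder = mlp([d, p - 2 * q, p - q, 2 * p + 1])
        self.x_head = nn.Linear(2 * p + 1, p)    # reconstruct X
        self.y_head = nn.Linear(2 * p + 1, 1)    # reconstruct Y

    def forward(self, xy):
        h = self.encoder(xy)
        mu, logvar = self.mu(h), self.logvar(h)
        # Standard reparametrization with a random standard Gaussian
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        h_dec = self.decoder(z)
        return self.x_head(h_dec), self.y_head(h_dec), mu, logvar
```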

B Dataset Details
To test our approach on real datasets, we compare the results on benchmark datasets for IR taken from (Branco et al., 2019). The dedicated repository "Data Sets for Imbalanced Regression Learning" is available at https://paobranco.github.io/DataSets-IR/. The datasets are composed as follows:


• The bank8fm dataset: 4498 observations and 9 variables (7 float and 1 integer).
• The abalone dataset: 4177 observations and 9 variables (8 float and 1 integer).
• The boston dataset: 506 observations and 14 variables (11 float and 3 integer).
• The NO2 dataset: 4498 observations and 9 variables (8 float and 1 integer).

