1 Introduction
One potential solution is the use of synthetic data. Such synthetic data are not only statistically comparable to the original data but also retain the same utility in subsequent data analysis, and their artificial nature makes them compliant with GDPR. The generative adversarial network (GAN)
(Goodfellow et al., 2014), which is composed of a generator and a
discriminator, is an innovative generative model that has been proven
effective in synthesizing images, and has recently been utilized to synthesize
tabular data (Mottini et al., 2018; Park et al., 2018; Xu et al., 2019; Zhao et
al., 2021). However, recent studies have shown that GANs may fall prey to
membership inference attacks which greatly endanger the personal
information present in the real training data (Chen et al., 2020b; Stadler et
al., 2020). Therefore, it is imperative to safeguard the training of tabular
GANs such that synthetic data can be generated without causing harm. To
address these issues, prior studies (Jordon et al., 2018; Long et al.,
2019; Torkzadehmahani et al., 2019; Torfi et al., 2020) rely on differential
privacy (DP) (Dwork, 2008). DP is a mathematical framework that provides
theoretical guarantees bounding the statistical difference between any
resulting machine learning (ML) model trained with or without a particular
individual's information in the original training dataset. Typically, this can be
achieved by injecting calibrated statistical noise while updating the
parameters of a network during back-propagation, i.e., DP stochastic
gradient descent (DP-SGD) (Abadi et al., 2016; Xie et al., 2018; Chen et al.,
2020a), or by injecting noise while aggregating teacher ensembles using the
PATE framework (Papernot et al., 2016; Jordon et al., 2018).
The main contributions of this study can be summarized as follows: (1) A novel conditional adversarial network that introduces a classifier/regressor providing additional supervision to improve utility for ML applications. (2) Efficient modeling of continuous, categorical, and mixed variables via novel data encodings. (3) Improved GAN training using well-designed information, downstream, and generator losses along with the Wasserstein loss with gradient penalty (Was+GP) to enhance stability and effectiveness. (4) A simpler and more stable DP GAN algorithm for tabular data that controls performance under different privacy budgets. Our code is openly available on GitHub.1
2 Motivation
Through empirical analysis, we show how previous SOTA methods fall short
in addressing challenges in industrial datasets. The specifics of our
experimental setup are detailed in Section 5.1.
Single-mode Gaussian distributions are very common. Figure 1A shows the histogram of the variable bmi (i.e., body mass index) in the Insurance dataset, together with the synthetic data generated for this variable by six SOTA algorithms. The distribution of the real data is close to a single-mode Gaussian. Apart from TableGAN and TVAE, however, none of the SOTA algorithms correctly recovers this distribution in its synthetic data. IT-GAN reproduces the Gaussian shape, but its mean and standard deviation are shifted. CTGAN uses a variational Gaussian mixture (VGM) to model all continuous variables. However, VGM is an overly complicated method for single-mode Gaussian distributions, as it initially approximates the distribution with multiple Gaussian components by default. TVAE also uses VGM to encode continuous columns, but the variational autoencoder (VAE) framework handles this case better than GANs. CWGAN and MedGAN use min-max normalization to scale the original data to [0, 1]. TableGAN also uses min-max normalization but scales the original data to [−1, 1] to better match the output of a generator that uses tanh as its activation function. Min-max normalization works for TableGAN but not for MedGAN and CWGAN because the training of the latter two converges less stably. However, since TableGAN applies min-max normalization to all variables, it struggles to model columns with complex multi-modal distributions.
Figure 1. Challenges of modeling industrial datasets using existing GAN-based table generators: (A) single Gaussian, (B) mixed type, (C) long tail distribution, and (D) skewed data.
Real-world data often exhibits long tail distributions, where the majority of
occurrences are concentrated near the initial value of the distribution, with
rare cases appearing toward the end. This can be seen in the cumulative
frequency plots of data generated by six SOTA algorithms for
the Amount variable in the Credit dataset, as shown in Figure 1C. This
variable represents transaction amounts when using credit cards, and it is
likely that the majority of transactions involve relatively small amounts, ranging from a few dollars to a few thousand, while a very small number of transactions involve large amounts. Note that, for ease of comparison, both plots use the same x-axis, although the real data contain no negative values. The real data show that 99% of occurrences fall at the start of the range, with the distribution extending up to 25,000. In contrast, none of the synthetic data generators is able to effectively learn and replicate this behavior.
3 Related work
The related work consists of two parts: (i) generative models for tabular data and (ii) differentially private tabular GANs.
Table 1
4 CTAB-GAN+
GANs have proven their utility for synthesizing tabular data in previous studies (Yahi et al., 2017; Park et al., 2018; Xu et al., 2019), and these studies offer many excellent techniques to build on.
Modeling imbalanced datasets is a challenge for GANs, as class imbalance causes models to disproportionately favor the majority class. To address this issue,
we adopted the training-by-sampling method from CTGAN. This approach
involves the use of a conditional vector, which represents the classes of
categorical columns. The vector is used to both feed the generator and
discriminator and to sample subsets of the real training data that satisfy the
given condition. By leveraging this condition, we can resample all classes
and give minority classes a higher chance of being included in the training
data.
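As an illustration, the following is a minimal sketch of how such a conditional vector could be sampled; the helper name sample_condition and the log-frequency class weighting (following CTGAN's training-by-sampling) are our assumptions, not the authors' exact code.

```python
import numpy as np

def sample_condition(class_counts):
    """Sketch of training-by-sampling: pick a categorical column
    uniformly at random, then pick one of its classes with probability
    proportional to the log of its frequency, so minority classes are
    sampled more often than their raw share; return the flat one-hot
    conditional vector spanning all categorical columns."""
    col = np.random.randint(len(class_counts))
    logf = np.log(np.asarray(class_counts[col], dtype=float) + 1.0)
    probs = logf / logf.sum()
    cls = np.random.choice(len(probs), p=probs)
    vec = [np.zeros(len(c)) for c in class_counts]
    vec[col][cls] = 1.0
    return np.concatenate(vec), col, cls

# usage: two categorical columns with class counts [900, 100] and [50, 25, 25]
cond, col, cls = sample_condition([[900, 100], [50, 25, 25]])
```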
To more effectively represent tabular data, we use one-hot encoding for all
the categorical variables. To handle the complex distributions of continuous
columns, we adopted the Mode-Specific Normalization (MSN) method (Xu et
al., 2019). This involves encoding each value as a value-mode pair based on
a variational Gaussian mixture model.
DP is becoming the standard solution for privacy protection and has even been adopted by the U.S. Census Bureau to bolster the privacy of citizens (Hawes, 2020). DP protects against privacy attacks by bounding the influence of any individual data point within a given privacy budget. In this study, we leverage Rényi Differential Privacy (RDP) (Mironov, 2017), as it provides stricter bounds on the privacy budget. A randomized mechanism M is (λ, ϵ)-RDP with order λ and privacy budget ϵ if

$$D_\lambda\big(\mathcal{M}(S)\,\|\,\mathcal{M}(S')\big) = \frac{1}{\lambda-1}\,\log\,\mathbb{E}_{x\sim\mathcal{M}(S)}\!\left[\left(\frac{\Pr[\mathcal{M}(S)=x]}{\Pr[\mathcal{M}(S')=x]}\right)^{\lambda-1}\right] \le \epsilon \tag{1}$$

holds for any adjacent datasets S and S′, where Pr denotes the probability density at the given point and $D_\lambda(P\,\|\,Q) = \frac{1}{\lambda-1}\log\mathbb{E}_{x\sim Q}\big[(P(x)/Q(x))^{\lambda}\big]$ represents the Rényi divergence between two probability distributions P and Q (Chen et al., 2020a). In addition, a (λ, ϵ)-RDP mechanism M can be expressed as

$$\left(\epsilon + \frac{\log(1/\delta)}{\lambda-1},\ \delta\right)\text{-DP}. \tag{2}$$
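As a worked example of (2) under an illustrative choice of order, a mechanism that is (32, 0.5)-RDP satisfies, for δ = 10⁻⁵, (0.5 + ln(10⁵)/31, 10⁻⁵)-DP ≈ (0.87, 10⁻⁵)-DP.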
Multiple RDP mechanisms compose via the composition theorem (Mironov, 2017). Let ∘ denote the composition operator. For M1, …, Mk such that Mi is (λ, ϵi)-RDP ∀i, the composition M1 ∘ … ∘ Mk is

$$\left(\lambda,\ \sum_{i=1}^{k}\epsilon_i\right)\text{-RDP}. \tag{3}$$
Lastly, two more theorems are key to this study. The post-processing theorem (Dwork and Roth, 2014) states that if M satisfies (ϵ, δ)-DP, then F ∘ M also satisfies (ϵ, δ)-DP, where F can be any arbitrary randomized function. Hence, it suffices to train one of the two networks in the GAN architecture with DP guarantees to ensure that the overall GAN is differentially private. RDP for subsampled mechanisms (Wang et al., 2019) quantifies the reduction in privacy budget obtained by sub-sampling the private data. Formally, let X be a dataset with n data points and let subsample return m ≤ n subsamples without replacement from X (subsampling rate γ = m/n). For all integers λ ≥ 2, if a randomized mechanism M is (λ, ϵ(λ))-RDP, then M ∘ subsample is (λ, ϵ′(λ))-RDP, where

$$\epsilon'(\lambda) \le \frac{1}{\lambda-1}\,\log\!\Bigg(1 + \gamma^2\binom{\lambda}{2}\,\min\Big\{4\big(e^{\epsilon(2)}-1\big),\; e^{\epsilon(2)}\min\big\{2,\,\big(e^{\epsilon(\infty)}-1\big)^2\big\}\Big\} + \sum_{j=3}^{\lambda}\gamma^{j}\binom{\lambda}{j}\,e^{(j-1)\epsilon(j)}\,\min\Big\{2,\,\big(e^{\epsilon(\infty)}-1\big)^{j}\Big\}\Bigg) \tag{4}$$
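To make the bound concrete, the sketch below evaluates (4) numerically. The helper name subsampled_rdp_bound and the example budget ϵ(j) = j/(2σ²) of a Gaussian mechanism with sensitivity 1 (for which ϵ(∞) = ∞) are illustrative assumptions.

```python
from math import comb, exp, log

def subsampled_rdp_bound(eps, gamma, lam):
    """Upper bound on eps'(lam) for a subsampled (lam, eps(lam))-RDP
    mechanism, following Eq. (4) (Wang et al., 2019).
    eps:   callable eps(j); eps(float('inf')) is the pure-DP bound.
    gamma: subsampling rate m/n.
    lam:   integer RDP order >= 2."""
    e2, einf = eps(2), eps(float("inf"))
    # j = 2 term
    total = gamma**2 * comb(lam, 2) * min(
        4 * (exp(e2) - 1),
        exp(e2) * min(2.0, (exp(einf) - 1) ** 2),
    )
    # j = 3 .. lam terms
    for j in range(3, lam + 1):
        total += (gamma**j * comb(lam, j) * exp((j - 1) * eps(j))
                  * min(2.0, (exp(einf) - 1) ** j))
    return log(1 + total) / (lam - 1)

# example: Gaussian mechanism, noise multiplier sigma = 4, sensitivity 1
sigma = 4.0
eps = lambda j: float("inf") if j == float("inf") else j / (2 * sigma**2)
print(subsampled_rdp_bound(eps, gamma=0.01, lam=8))
```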
Although, according to (1), each fixed λ yields a privacy measure, Wang et al. (2019) emphasize the function view, in which ϵ is a function of λ that is fully determined by M; this function is denoted ϵ(λ). When λ = ∞, it indicates that M is (ϵ, 0)-DP, i.e., pure DP.
Figure 2
GANs are trained via a zero-sum min-max game in which the generator aims to produce synthetic data that are indistinguishable from real data, while the discriminator aims to accurately distinguish between real and synthetic data. In CTAB-GAN+, G is trained with additional feedback based on three loss terms: the information loss, the downstream loss, and the generator loss. The
information loss measures the difference between the first- and second-order
statistics (mean and standard deviation) of the synthetic and real data,
encouraging the synthetic data to have the same statistical properties as the
real data. The downstream loss measures the correlation between the target
column and other columns in the data, ensuring that the combination of
values in the synthetic data are semantically correct. The generator loss is
the cross-entropy between the given conditional vector and the generated
output classes, encouraging the generator to produce exactly the same
output classes as the given conditional vector. These three loss terms are added to the default loss term (i.e., Was+GP) of G during training. We adopted the CNN structure from Park et al. (2018) for G and D. CNNs are good at capturing the relations between pixels within an image, which in our case helps to increase the semantic integrity of the synthetic data. The input data, i.e., row records stored as vectors, are wrapped into the closest square matrix dimensions, with the missing entries padded with zeros. The auxiliary model C, implemented as a multi-layer perceptron (MLP) with four 256-neuron hidden layers, is trained on the real data to better judge the semantic integrity of the synthetic data. The synthetic data are reverse transformed from their matrix encoding to a vector (details in Section 4.3), while the real data are encoded (details in Sections 4.3, 4.6) before being used as input for C to create the class label predictions.
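A minimal sketch of the square-wrapping step follows; to_square is a hypothetical helper name, not the authors' code.

```python
import math
import torch

def to_square(rows: torch.Tensor) -> torch.Tensor:
    """Wrap encoded row vectors (batch, d) into the closest square
    matrices (batch, 1, s, s), zero-padding the missing tail entries,
    as described for the CNN-based G and D."""
    batch, d = rows.shape
    s = math.ceil(math.sqrt(d))
    pad = torch.zeros(batch, s * s - d, device=rows.device)
    return torch.cat([rows, pad], dim=1).view(batch, 1, s, s)
```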
Let the last layer of D be a softmax, and let f_x and f_G(z) denote the logits fed into this softmax layer for a real sample x and a sample generated from latent value z, respectively. The information loss for G is calculated as

$$\mathcal{L}^{G}_{\mathrm{info}} = \left\| \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}[f_x] - \mathbb{E}_{z\sim p(z)}[f_{G(z)}] \right\|_2 + \left\| \mathrm{SD}_{x\sim p_{\mathrm{data}}(x)}[f_x] - \mathrm{SD}_{z\sim p(z)}[f_{G(z)}] \right\|_2$$

and the downstream loss as

$$\mathcal{L}^{G}_{\mathrm{dstream}} = \mathbb{E}_{z\sim p(z)}\left[\, \big| l(G(z)) - C(f_e(G(z))) \big| \,\right]$$
where l(.) returns the target variable and fe(.) returns the input features of a
given data record. Finally, the generator loss is presented as
$$\mathcal{L}^{G}_{\mathrm{generator}} = H(m_i, \hat{m}_i)$$
where m_i and m̂_i are the given and generated conditional vector bits corresponding to column i, and H(·) is the cross-entropy loss. The condition on column i is selected using the training-by-sampling procedure (see Section 4.4 for details).
The discriminator D is trained with the Wasserstein loss with gradient penalty (Was+GP) (Gulrajani et al., 2017):

$$\mathcal{L}_D = \underbrace{\mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x\sim\mathbb{P}_r}[D(x)]}_{\text{original discriminator loss}} + \underbrace{\lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\big[\big(\|\nabla_{\hat{x}}D(\hat{x})\|_2 - 1\big)^2\big]}_{\text{gradient penalty}}$$

where $\mathbb{P}_{\hat{x}}$ is defined by sampling uniformly along straight lines between pairs of points drawn from the real data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$.
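A minimal PyTorch sketch of the gradient penalty term, assuming batched tensor inputs and the penalty weight λ = 10 from Gulrajani et al. (2017); gradient_penalty is a hypothetical helper name.

```python
import torch

def gradient_penalty(D, real, fake, gp_lambda=10.0):
    """WGAN-GP penalty: push the gradient norm of D toward 1 at points
    sampled uniformly on lines between real and fake samples."""
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)),
                       device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return gp_lambda * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```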
The overall loss used to train G is

$$\mathcal{L}_G = \mathcal{L}^{G}_{\mathrm{default}} + \mathcal{L}^{G}_{\mathrm{info}} + \mathcal{L}^{G}_{\mathrm{dstream}} + \mathcal{L}^{G}_{\mathrm{generator}}$$
The training objective for D is otherwise unchanged. Finally, the loss to train the auxiliary model C is similar to the downstream loss of the generator:

$$\mathcal{L}^{C}_{\mathrm{dstream}} = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\left[\, \big| l(x) - C(f_e(x)) \big| \,\right]$$
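For illustration, the three extra loss terms could be sketched in PyTorch as below; the helper names and tensor shapes (logit batches f_real/f_fake, per-sample targets and predictions, one-hot condition bits for the selected column) are our assumptions.

```python
import torch
import torch.nn.functional as F

def information_loss(f_real, f_fake):
    """L2 distance between the mean and standard deviation of the
    logits D feeds into its final layer, real vs. generated batch."""
    return ((f_real.mean(0) - f_fake.mean(0)).norm(2)
            + (f_real.std(0) - f_fake.std(0)).norm(2))

def downstream_loss(target_fake, pred_fake):
    """Absolute error between the generated target column l(G(z)) and
    the auxiliary model's prediction C(fe(G(z)))."""
    return (target_fake - pred_fake).abs().mean()

def generator_loss(cond_bits, gen_bits_logits):
    """Cross-entropy between the given conditional-vector bits for the
    chosen column and the generator's output classes for that column."""
    return F.cross_entropy(gen_bits_logits, cond_bits.argmax(dim=1))
```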
The tabular data are organized in rows and columns, and each column is encoded before it is used as input for training. We distinguish three types of variables: categorical, continuous, and mixed. Mixed columns contain both categorical and continuous values; any column with missing values is also treated as mixed. To handle mixed columns, we propose a new mixed-type encoder that treats them as concatenated value-mode pairs. As an example, the encoding of a mixed variable is shown in red in Figure 3A. The values in this column can either be exactly μ0 or μ3 (the categorical part) or continuously distributed around the two peaks μ1 and μ2. To handle the continuous part, we adopt the Mode-Specific Normalization (MSN) idea from Xu et al. (2019), using a variational Gaussian mixture (VGM) model (Bishop, 2006) to estimate the number of modes k, e.g., k = 2 in our example, and fit a Gaussian mixture. The learned Gaussian mixture is $\mathbb{P} = \sum_{k=1}^{2}\omega_k\,\mathcal{N}(\mu_k,\sigma_k)$, where $\mathcal{N}$ is the normal distribution and ω_k, μ_k, and σ_k are the weight, mean, and standard deviation of each mode, respectively.
Figure 3. Encoding for mixed data-type variable. (A) Mixed type variable distribution with VGM. (B) Mode selection of single value in continuous variable.
For a categorical value (such as μ0 or μ3 in Figure 3A), the normalized value α is simply set to 0, as the category is determined solely by the one-hot encoding component. As an illustration, for a given value within μ0, the final encoding can be expressed as 0 ⊕ [1, 0, 0, 0].
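A minimal sketch of this encoding using scikit-learn's variational Gaussian mixture follows. The helper names, the cap of 10 modes, and the 4σ normalization (borrowed from CTGAN, Xu et al., 2019) are assumptions; purely categorical values such as μ0 or μ3 would bypass encode_value with α = 0 and their own one-hot slots.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_msn(values, max_modes=10):
    """Fit a variational Gaussian mixture (Bishop, 2006) to the
    continuous part of a column, as in mode-specific normalization."""
    vgm = BayesianGaussianMixture(n_components=max_modes,
                                  weight_concentration_prior=1e-3)
    vgm.fit(values.reshape(-1, 1))
    return vgm

def encode_value(x, vgm):
    """Encode one continuous value as a value-mode pair: a normalized
    offset alpha within the sampled mode plus a one-hot mode vector."""
    means = vgm.means_.ravel()
    stds = np.sqrt(vgm.covariances_).ravel()
    probs = vgm.predict_proba(np.array([[x]])).ravel()
    k = np.random.choice(len(means), p=probs)   # sample the mode
    alpha = np.clip((x - means[k]) / (4 * stds[k]), -1, 1)
    beta = np.eye(len(means))[k]                # one-hot mode indicator
    return alpha, beta
```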
Figure 4
The main idea of GT is to encode columns in the range of (−1, 1). This makes
the encoding directly compatible with the output range of the generator
using tanh activation function. This is achieved via a shifted and scaled min-
max normalization. Mathematically, given a data point $x_i$ of a continuous variable x, the transformed value is

$$x_i^t = 2\cdot\frac{x_i - \min(x)}{\max(x) - \min(x)} - 1,$$

where min(x) and max(x) represent the minimum and maximum values of the continuous variable. Inversely, an encoded or generated value $x_i^t$ may be reverse transformed as

$$x_i = \big(\max(x) - \min(x)\big)\cdot\frac{x_i^t + 1}{2} + \min(x).$$
Continuous variables can be directly treated with the above formulas for normalization and denormalization. Categorical variables are first encoded as integers before normalization and are rounded back to integers after denormalization.
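These two transforms amount to a few lines of NumPy; gt_encode and gt_decode are hypothetical helper names.

```python
import numpy as np

def gt_encode(x):
    """General transform: shifted and scaled min-max normalization into
    (-1, 1), matching a tanh generator output."""
    lo, hi = x.min(), x.max()
    return 2 * (x - lo) / (hi - lo) - 1, lo, hi

def gt_decode(xt, lo, hi):
    """Inverse transform back to the original value range."""
    return (hi - lo) * (xt + 1) / 2 + lo
```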
When there is real data involved, CTAB-GAN+ trains all its components (i.e.,
discriminator, generator, and auxiliary model) using DP-SGD where the
number of training iterations is determined based on the total privacy budget
ϵ. Thus, to compute the number of iterations, the privacy budget spent for
every iteration must be bounded and accumulated. For this purpose, we use
the subsampled RDP analytical moments accountant technique. To illustrate how CTAB-GAN+ adds noise to the gradients and calculates the privacy cost, we analyze the discriminator below as an example; the generator and the auxiliary model follow the same process. Each update step clips and perturbs the per-sample gradients before applying them:

$$\tilde{g}_D = \frac{1}{B}\sum_{i=1}^{B}\Big(\mathrm{clip}\big(\nabla_{\theta_D}\mathcal{L}_D(x_i),\,C\big) + \mathcal{N}\big(0,\ \sigma^{2}C^{2}\mathbf{I}\big)\Big), \qquad \theta_D \leftarrow \theta_D - \eta\,\tilde{g}_D \tag{9}$$

where g̃_D and θ_D represent the perturbed gradients and the weights of the discriminator network, respectively, and L_D is the loss function of the discriminator. Equation (9) may be regarded as a composition of B Gaussian mechanisms and treated via (3). The privacy cost of a single gradient update step for the discriminator can thus be expressed as $(\lambda, \sum_{i=1}^{B} 2\lambda/\sigma^2)$ or, equivalently, (λ, 2Bλ/σ²).
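A sketch of this perturbation step, assuming per-sample gradients are available (e.g., via microbatching); dp_perturb is a hypothetical helper.

```python
import torch

def dp_perturb(per_sample_grads, clip_c, sigma):
    """Clip every per-sample gradient to L2 norm C, add Gaussian noise
    N(0, sigma^2 C^2 I) to each, and average. Each of the B samples is
    one Gaussian mechanism, so by composition (3) one update costs
    (lambda, 2*B*lambda/sigma^2)-RDP."""
    noisy = []
    for g in per_sample_grads:
        scale = torch.clamp(clip_c / (g.norm(2) + 1e-12), max=1.0)
        noisy.append(g * scale + sigma * clip_c * torch.randn_like(g))
    return torch.stack(noisy).mean(0)
```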
Note that the Gaussian mechanism M_σ is only applied to those gradients that are computed with respect to the real training dataset (Abadi et al., 2016; Zhang et al., 2018). Hence, the gradients computed with respect to the synthetic data or the gradient penalty term are left undisturbed for the discriminator. For the generator, DP-SGD is only applied to the gradients that are calculated from the information loss, since no real data are involved in the generator loss. The default GAN loss for the generator, L_G^default, and the downstream loss are post-processing of the already DP-protected discriminator and auxiliary model; therefore, there is no need to apply DP-SGD to the generator for these losses.
5.1 Experimental setup
5.1.1 Datasets
Table 2
5.1.2 Baselines
Our CTAB-GAN+ is compared with CTAB-GAN and six other SOTA tabular data
generators: IT-GAN (Lee et al., 2021), CTGAN (Xu et al., 2019), TVAE (Xu et
al., 2019), TableGAN (Park et al., 2018), CWGAN (Engelmann and Lessmann,
2020), and MedGAN (Choi et al., 2017). We implemented all algorithms in
Pytorch and kept the hyperparameters, generator, and discriminator
structures consistent with the descriptions provided in their studies. All algorithms use the same hyperparameters on all datasets. For Gaussian mixture estimation of continuous variables, we set the default number of modes to 10, the same as in CTGAN. We trained all algorithms for
150 epochs on Adult, Covertype, Credit, and Intrusion datasets. However, for
Loan, Insurance, and King datasets, we trained all algorithms for 300 epochs
as these datasets are smaller and require more training to converge. All
experiments are repeated three times, and average results are reported.
5.1.3 Environment
5.2 Evaluation metrics
The synthetic data are evaluated along two dimensions: (1) machine learning (ML) utility and (2) statistical similarity. Together, these measure whether the synthetic data can be used as a good proxy for the original data.
Figure 5. Evaluation flows for ML utility of classification.
Three metrics are used to quantify the statistical similarity between real and
synthetic data.
5.3.1 ML utility
Table 3 shows the results for the classification datasets. A better synthetic
dataset is expected to have small differences in ML utility for classification
tasks trained on real and synthetic data. It can be seen that CTAB-GAN+
outperforms all other SOTA methods and CTAB-GAN in all the metrics. CTAB-
GAN+ decreases the AUC difference from 0.094 (i.e., best baseline CTAB-
GAN) to 0.041 (56.4% reduction), and the difference in accuracy from 7.86%
(i.e., best baseline IT-GAN) to 5.23% (33.5% reduction). The improvement
over CTAB-GAN shows that general transform encoder and Was+GP loss
indeed help enhance the feature representation and GAN training. Table
4 shows the results for the regression datasets. The results of CTAB-GAN and CTAB-GAN+ are far better than those of all other baselines. This shows the
effectiveness of the feature engineering. Additionally, as CTAB-GAN+ adds
the auxiliary regressor which explicitly enhances the regression analysis, the
overall downstream performance of CTAB-GAN+ is better than CTAB-GAN.
We note that CTAB-GAN uses auxiliary classification loss for the classification
analysis and disables it for the regression analysis.
Table 3
Table 3. Difference (± standard deviation) of ML utility and statistical similarity for classification between original and synthetic data, averaged over five datasets.
Table 4
Due to the page limit, the ablation analysis is only conducted for the classification datasets. We conduct an ablation study to analyze the impact of the different components of CTAB-GAN and CTAB-GAN+.
The results are compared with the default CTAB-GAN. Table 5 shows the
results in terms of F1-score difference between ablation and CTAB-GAN on all
five classification datasets. Each component of CTAB-GAN affects different datasets differently. For instance, w/o C has a negative impact on all datasets except Credit, where the small number of categorical variables limits the effectiveness of the semantic check. w/o information loss has a positive impact on Loan but leads to worse results on all other datasets; it can even make the model unusable for Intrusion. w/o MSN performs poorly on Covertype but has little impact on Intrusion. Credit without MSN performs better than the original CTAB-GAN because 28 of its 30 continuous variables are nearly single-mode Gaussian distributed, and the high initial number of modes, i.e., 10, for each continuous variable (the same setting as in CTGAN) degrades the estimation quality. w/o LT has the biggest impact on Intrusion, since it contains two long tail columns that are important predictors for the target column. For Credit, the influence is limited: even though the long tail treatment fits the Amount column well (see Section 5.6), that column is not a strong predictor for the target column.
Table 5
To show the efficacy of the General Transform and Was+GP loss in CTAB-GAN+, we conduct two ablation studies. (1) w/o GT, which disables the
general transform in CTAB-GAN+. All continuous variables use MSN and all
the categorical variables use one-hot encoding. (2) w/o Was+GP which
switches the default GAN training loss from Was+GP to the original GAN loss
defined in Goodfellow et al. (2014). It is worth noting that the information,
downstream and generator losses are still present in this experiment. The
other experimental settings are the same as in Section 5.4.1. Table 6 shows
the results in terms of F1-score difference among different versions of CTAB-
GAN+. For Covertype, Credit, and Intrusion datasets, the effects of GT and
Was+GP are all positive. GT significantly boosts the performance on
Covertype and Credit datasets. For Adult, it worsens the result. The reason is
that the Adult dataset contains only one GT column: age. Since this column is
strongly correlated with other columns, the original MSN encoding can better
capture this interdependence. The positive impact of Was+GP, on the other hand, is limited but consistent across all datasets. The only exception is the Loan dataset, where GT and Was+GP have minor impacts. This is because Loan has fewer variables compared to the other datasets, which makes it easier to capture the correlation between columns. CTAB-GAN already performs well on Loan; therefore, GT and Was+GP cannot further improve performance on this dataset.
Table 6
Table 6. Ablation analysis for CTAB-GAN+ (F1. diff.).
Table 7
Table 7. Training time (s/epoch).
After reviewing all the metrics, let us recall the four motivation cases from
Section 2.
Figure 6A shows the real and CTAB-GAN+ generated bmi variable. CTAB-GAN+ reproduces the distribution with only minor differences. This shows the effectiveness of the general transform in modeling variables with a single Gaussian distribution.
Figure 6
6.1.1 Datasets
Due to the page limit, we only use the classification datasets: Adult, Covertype, Intrusion, Credit, and Loan.
6.1.2 Metrics
We use the same ML utility metrics from Section 5.2 under two privacy
budgets, i.e., ϵ = 1 and ϵ = 100.
6.1.3 Baselines
To compute the privacy cost in a fair manner, we used the RDP accountant
for all approaches that employ DP-SGD: CTAB-GAN+, DP-WGAN, and GS-
WGAN. PATE-GAN uses the moments accountant (Wang et al., 2019) by default. We set δ = 10⁻⁵ for all experiments. We follow the example of DP-WGAN and set the exploration span of λ to [2, 4096]. We use (2) to convert the overall
cumulative privacy cost computed in terms of RDP back to (ϵ, δ)-DP.
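This conversion can be sketched as follows; rdp_to_dp is a hypothetical helper that applies (2) over the λ grid and keeps the tightest guarantee.

```python
import numpy as np

def rdp_to_dp(rdp_eps, lambdas, delta=1e-5):
    """Convert cumulative RDP budgets eps(lambda) over a grid of orders
    into a single (eps, delta)-DP guarantee via Eq. (2), choosing the
    order that yields the smallest eps."""
    lambdas = np.asarray(lambdas, dtype=float)
    dp_eps = np.asarray(rdp_eps) + np.log(1 / delta) / (lambdas - 1)
    best = int(np.argmin(dp_eps))
    return dp_eps[best], lambdas[best]

# e.g., accumulate per-iteration costs over T steps for each order:
# rdp_eps = T * per_step_eps(lambdas); then rdp_to_dp(rdp_eps, lambdas)
```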
6.2.1 ML utility
Table 8 presents the differences in ML utility between models trained on the original and synthetic data: lower is better. CTAB-GAN+ outperforms all other SOTA algorithms under both privacy budgets. With a looser privacy budget, i.e., higher ϵ, all metrics improve for all algorithms. These results are in line with our expectations, because a higher privacy budget means training the model with less injected noise and more training epochs before the privacy budget is exhausted. CTAB-GAN+ outperforms the second best method by 7.8% in F1-score under ϵ = 1, and this advantage increases to 21.9% when ϵ = 100. The superior performance of CTAB-GAN+ compared to the other baselines can be explained by its well-designed neural network architecture, which improves the training objective and the capacity to deal with the challenges of the tabular domain, such as column dependencies. This also explains the poor results of GS-WGAN, which is not designed to handle these specific issues and achieves the worst overall performance.
Table 8
Table 8. Difference of accuracy (%), F1-score, AUC, and AP between original and synthetic data, averaged over five ML models and five datasets with privacy budgets ϵ = 1 and ϵ = 100.
Table 9
7 Conclusion
Author's note
This manuscript delves into the pivotal realm of data science, emphasizing
the generation of synthetic data using advanced machine learning models,
specifically Generative Adversarial Networks (GANs) for tabular data. In the
contemporary big data ecosystem, there's an increasing need to generate
high-quality synthetic data, which not only resembles the original but also
ensures stringent privacy safeguards. Our research, centered around the
introduction of CTAB-GAN+, contributes to this need by blending robust data
synthesis, data utility preservation, and the incorporation of differential
privacy. Given the journal's commitment to advancing the frontiers of data
science and exploring the challenges and opportunities posed by big data,
our work is distinctly aligned with its vision.
Data availability statement
Publicly available datasets were analyzed in this study. The data can be
found at: http://archive.ics.uci.edu/ml/datasets; https://www.kaggle.com/
datasets/mlg-ulb/creditcardfraud; https://www.kaggle.com/datasets/
itsmesunil/bank-loan-modelling; https://www.kaggle.com/datasets/
mirichoi0218/insurance; https://www.kaggle.com/datasets/harlfoxem/
housesalesprediction.
Author contributions
Funding
The author(s) declare that no financial support was received for the research,
authorship, and/or publication of this article.
Conflict of interest
The authors declare that the research was conducted in the absence of any
commercial or financial relationships that could be construed as a potential
conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not
necessarily represent those of their affiliated organizations, or those of the
publisher, the editors and the reviewers. Any product that may be evaluated
in this article, or claim that may be made by its manufacturer, is not
guaranteed or endorsed by the publisher.
Footnotes
1. ^https://github.com/Team-TUD/CTAB-GAN-Plus-DP
2. ^http://archive.ics.uci.edu/ml/datasets
3. ^https://www.kaggle.com/mlg-ulb/creditcardfraud,itsmesunil/bank-loan-
modelling
4. ^https://www.kaggle.com/mirichoi0218/insurance,harlfoxem/
housesalesprediction
5. ^http://shakedzy.xyz/dython/modules/nominal/#compute_associations
6. ^https://github.com/BorealisAI/private-data-generation
References
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., et
al. (2016). “Deep learning with differential privacy,” in ACM SIGSAC
Conference on Computer and Communications Security (CCS). doi:
10.1145/2976749.2978318
Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., et al. (2017). The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743.
Chen, D., Yu, N., Zhang, Y., and Fritz, M. (2020b). “GAN-leaks: a taxonomy of
membership inference attacks against generative models,” in ACM SIGSAC
Conference on Computer and Communications Security (CCS). doi:
10.1145/3372297.3417238
Chen, X., Wu, S. Z., and Hong, M. (2020c). “Understanding gradient clipping
in private SGD: a geometric perspective,” in Advances in Neural Information
Processing Systems 33, 13773–13782.
Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W. F., and Sun, J. (2017).
Generating multi-label discrete patient records using generative adversarial
networks. arXiv preprint arXiv:1703.06490.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., et al. (2014). “Generative adversarial nets,” in Proceedings of the 27th
NIPS - Volume 2 (Cambridge, MA), 2672–2680.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017).
“Improved training of Wasserstein GANs,” in Advances in Neural Information
Processing Systems, 5769–5779.
Jordon, J., Yoon, J., and van der Schaar, M. (2018). “Pate-GAN: generating
synthetic data with differential privacy guarantees,” in International
Conference on Learning Representations (ICLR).
Lee, J., Hyeong, J., Jeon, J., Park, N., and Cho, J. (2021). “Invertible tabular
GANs: killing two birds with one stone for tabular data synthesis,” in NeurIPS
Conference, 4263–4273.
Long, Y., Lin, S., Yang, Z., Gunter, C. A., and Li, B. (2019). Scalable
differentially private generative student model via pate. arXiv preprint
arXiv:1906.09338.
Mironov, I. (2017). “Rényi differential privacy,” in Computer Security
Foundations Symposium (CSF) (IEEE). doi: 10.1109/CSF.2017.11
Mottini, A., Lheritier, A., and Acuna-Agost, R. (2018). “Airline passenger name
record generation using generative adversarial networks,” in Workshop on
Theoretical Foundations and Applications of Deep Generative Models (ICML).
Odena, A., Olah, C., and Shlens, J. (2017). “Conditional image synthesis with
auxiliary classifier GANs,” in The 34th ICML - Volume 70, 2642–2651.
Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., and Talwar, K. (2016).
Semi-supervised knowledge transfer for deep learning from private training
data. arXiv preprint arXiv:1610.05755.
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., and Kim, Y. (2018).
Data synthesis based on generative adversarial networks. Proc. VLDB Endow.
11, 1071–1083. doi: 10.14778/3231751.3231757
Stadler, T., Oprisanu, B., and Troncoso, C. (2020). Synthetic data-a privacy
mirage. arXiv preprint arXiv:2011.07018.
Torfi, A., Fox, E. A., and Reddy, C. K. (2020). Differentially private synthetic
medical data generation using convolutional GANs. arXiv preprint
arXiv:2012.11774.
Wang, R., Fu, B., Fu, G., and Wang, M. (2017). “Deep & cross network for ad
click predictions,” in Proceedings of the ADKDD'17 (New York, NY). doi:
10.1145/3124749.3124754
Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. (2018). Differentially private
generative adversarial network. arXiv preprint arXiv:1802.06739.
Zhang, X., Ji, S., and Wang, T. (2018). Differentially private releasing via deep
generative model (technical report). arXiv preprint arXiv:1801.01594.
Zhao, Z., Kunar, A., Birke, R., and Chen, L. Y. (2021). “CTAB-GAN: effective
table data synthesizing,” in Proceedings of The 13th Asian Conference on
Machine Learning, 97–112.
Citation: Zhao Z, Kunar A, Birke R, Van der Scheer H and Chen LY (2024)
CTAB-GAN+: enhancing tabular data synthesis. Front. Big Data 6:1296508.
doi: 10.3389/fdata.2023.1296508
Edited by: Feng Chen
Reviewed by: Xujiang Zhao and Chen Zhao
Copyright © 2024 Zhao, Kunar, Birke, Van der Scheer and Chen. This is an
open-access article distributed under the terms of the Creative Commons
Attribution License (CC BY). The use, distribution or reproduction in other
forums is permitted, provided the original author(s) and the copyright
owner(s) are credited and that the original publication in this journal is cited,
in accordance with accepted academic practice. No use, distribution or
reproduction is permitted which does not comply with these terms.