Synthetic Data in AI: Challenges, Applications, and Ethical Implications
Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu,
Chunlin Zhong, Zhangjun Zhou, He Tang*
Deep learning models learn representations of data without the need for manually designing feature extractors. Their inherent nonlinearity makes them well-suited for adapting to the complex nonlinear relationships within data, enabling a more effective capture of the essential characteristics in the data. The following section outlines several typical approaches to data generation based on deep learning.
Variational Auto-Encoder (VAE). VAE [3] is a kind of probabilistic generative model, which employs the encoder-decoder architecture. The encoder maps the input data to the underlying latent space corresponding to the parameters of the variational distribution, and the decoder projects features from the latent space back into the input space. By capturing the distribution of latent space features, the VAE can generate multiple distinct samples that follow the same distribution. The inherent randomness of the VAE introduces a degree of diversity in the generated data, making them more representative of the complexity in real datasets.
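To make the encoder-decoder structure concrete, a minimal sketch in PyTorch is given below. The layer widths, the latent dimensionality, and the Bernoulli (binary cross-entropy) reconstruction term are illustrative assumptions, not details taken from [3].

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: the encoder maps inputs to latent Gaussian parameters,
    the decoder maps latent samples back to the input space."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon_term = F.binary_cross_entropy(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term

model = VAE()
x = torch.rand(8, 784)                       # placeholder training batch
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
# Generating synthetic samples: decode random draws from the prior.
with torch.no_grad():
    synthetic = model.dec(torch.randn(64, 16))

Sampling new data therefore reduces to decoding draws from the standard normal prior, which is precisely what gives the VAE its diversity.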
Generative Adversarial Networks (GAN). GAN was first proposed by Goodfellow et al. [4] in 2014 and consists of a generator and a discriminator. The generator randomly samples from the latent space in order to produce samples resembling the training set. Meanwhile, the discriminator determines whether a sample belongs to the real or the synthetic data. These two parts engage in a continual adversarial learning process and update their parameters alternately until the generator is able to synthesise high-quality samples that fool the discriminator.
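The alternating update scheme can be summarized in a short training step. The sketch below is schematic: the network sizes, optimizer settings, and the source of real_batch are chosen purely for illustration and are not taken from [4].

import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # 1) Update the discriminator: real samples -> 1, generated samples -> 0.
    fake = G(torch.randn(batch, latent_dim)).detach()
    loss_d = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # 2) Update the generator: try to make D label generated samples as real.
    fake = G(torch.randn(batch, latent_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Example call with a random placeholder batch standing in for real data.
train_step(torch.rand(16, data_dim) * 2 - 1)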
Diffusion models. The diffusion model [5] stands as a robust method for crafting synthetic datasets. It is rooted in the principles of diffusion processes: a forward process gradually corrupts training samples with noise over a sequence of steps, and a neural network is trained to reverse this corruption step by step, so that sampling from pure noise yields new data that follow the training distribution. By accurately capturing this dynamic evolution, diffusion models not only reproduce statistical characteristics but also adeptly mirror the nuanced patterns and relationships observed in authentic data. This approach proves invaluable for generating synthetic datasets, offering a sophisticated tool for simulating realistic data scenarios essential for the training and evaluation of machine learning models across diverse domains. To date, synthetic data generated from diffusion models has found extensive application across various vision tasks, notably enhancing performance in areas such as image classification [6], object detection [7], and semantic segmentation [8].
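For concreteness, the sketch below follows the denoising-diffusion recipe of [5] at a high level: a forward step that corrupts clean samples to a chosen noise level, and a training loss that asks a network to predict the added noise. The linear noise schedule, the toy noise-prediction network, and the tensor shapes are simplifying assumptions rather than the original implementation.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factors

def q_sample(x0, t, noise):
    """Forward process: corrupt clean data x0 to the noise level of step t."""
    a_bar = alphas_bar[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def diffusion_loss(model, x0):
    """Train the network to predict the noise that was added (DDPM objective)."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return torch.mean((model(x_t, t) - noise) ** 2)

class TinyDenoiser(torch.nn.Module):
    """Toy noise predictor; real systems use U-Nets conditioned on t."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, dim))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(1)
        return self.net(torch.cat([x_t, t_feat], dim=1))

loss = diffusion_loss(TinyDenoiser(8), torch.randn(16, 8))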
Large language models. In the past years, large language models (LLMs) have emerged as a revolutionary approach for generating synthetic datasets. For instance, models such as GPT-3.5, with their exceptional in-context learning capabilities and extensive pre-trained linguistic knowledge, exemplify the capacity of LLMs to produce synthetic datasets. This capability facilitates the training of models on smaller domains, effectively addressing the challenge of data scarcity in specific areas. The Generative Pre-trained Transformer (GPT) family [9-11] comprises a series of Natural Language Processing (NLP) models developed by OpenAI, employing the Transformer architecture to capture long-distance dependencies in input sequences. GPT is pre-trained without supervision on large-scale Internet text corpora, for example to predict the next word in a given context. This pre-training strategy enables the model to acquire a profound understanding of linguistic statistical structure and contextual relationships, and ultimately to perform a variety of natural language processing tasks in a pre-training-fine-tuning format. For example, Josifoski et al. [12] synthetically generate a dataset of 1.8M data points in a reverse manner and demonstrate the effectiveness of this approach on closed information extraction. Whitehouse et al. [13] utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets and assess the naturalness and logical coherence of the generated examples.
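A common pattern for LLM-based dataset synthesis is to prompt a model with a task description and a few seed examples, then parse and filter its completions into labelled records. The sketch below illustrates this pattern only; the prompt wording is invented for illustration, and call_llm is a placeholder for whichever chat-completion API is available rather than a function from any cited work.

import json

PROMPT_TEMPLATE = """You are generating training data for sentiment classification.
Produce {n} short product reviews as a JSON list of objects with
fields "text" and "label" ("positive" or "negative"). Seed examples:
{seed_examples}"""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual chat-completion API call (e.g., to GPT-3.5).
    Swap in the client of your choice; it only needs to return the raw text."""
    raise NotImplementedError

def generate_synthetic_reviews(seed_examples, n=50):
    prompt = PROMPT_TEMPLATE.format(n=n, seed_examples=json.dumps(seed_examples))
    raw = call_llm(prompt)
    records = json.loads(raw)  # real pipelines need more robust parsing
    # Light validation: keep only well-formed, non-duplicate records.
    seen, cleaned = set(), []
    for r in records:
        text, label = r.get("text", "").strip(), r.get("label")
        if text and label in {"positive", "negative"} and text not in seen:
            seen.add(text)
            cleaned.append({"text": text, "label": label})
    return cleaned

In practice the generated records are deduplicated, filtered for quality, and often mixed with a small amount of real data before fine-tuning a downstream model.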
3. The usage of synthetic data.

3.1. Existing synthetic datasets.

Synthetic data possesses several advantages that natural data lacks, making it an attractive choice in many fields. Compared to natural data, synthetic datasets are relatively easy to acquire and can provide data in rare or challenging scenarios, thereby addressing diversity issues in certain datasets. Additionally, this technology can effectively avoid privacy concerns, safeguarding user information security. As this technology gradually becomes more prominent, its practical applications are becoming increasingly widespread. This section will discuss several domains where generated data has had a significant impact.
Vision. Synthetic datasets can greatly address the challenge of acquiring natural data in certain domains. In the early days of computer vision, the generation of corresponding datasets relied primarily on computer graphics engines [14-17]. For example, in the re-identification (Re-ID) domain, Sun et al. [14] created the PersonX dataset using an engine based on Unity. This dataset includes three pure-color backgrounds and three scene backgrounds. It consists of 1266 hand-crafted identities (547 females and 719 males), with each identity having 36 viewpoints. At that time, this dataset effectively addressed the lack of multi-viewpoint data. Furthermore, there are also generation methods based on Generative Adversarial Networks (GANs) [18-21]. In [21], GIRAFFE introduces synthetic neural feature maps, enabling control over camera poses, object placements and orientations, as well as object shapes and appearances during generation. Moreover, GIRAFFE allows for the free addition of multiple objects in a scene, expanding the generated scenes from single-object to multi-object scenarios.
Audio. Synthetic data is widely employed in the field of audio, and its rapid development is truly remarkable. Take, for instance, WaveGAN, a generative adversarial network proposed by Donahue et al. [22] and demonstrated on the Speech Commands dataset. WaveGAN excels in synthesizing one-second audio waveform slices with global coherence, making it particularly well-suited for sound-effect generation. Even without labels, when trained on small-vocabulary speech datasets, WaveGAN adeptly learns to generate intelligible words and can extend its synthesis capabilities to audio from diverse domains, including drums, bird sounds, and piano. Furthermore, Zhang et al. introduced Stutter-TTS [23], tailored to tackle the performance challenges faced by existing Automatic Speech Recognition (ASR) interfaces when dealing with stuttered speech. Stutter-TTS is an end-to-end neural text-to-speech model with the proficiency to synthesize various forms of stuttered speech. It employs a simple yet effective prosody control strategy and incorporates additional markers during training to represent specific stuttering features. By strategically selecting the positions of these markers, Stutter-TTS provides word-level control over the occurrence of stuttering in the synthesized speech.
Natural Language Processing (NLP). The growing interest in synthetic data has propelled the thriving development of numerous deep generative models in the field of Natural Language Processing (NLP) [24]. In recent years, machine learning has demonstrated its formidable capabilities in tasks such as classification, routing, filtering, and information retrieval across various domains [25]. Addressing the challenge of synonym variations arising from contextual changes in NLP, researchers have introduced the BLEURT model. Built upon BERT, this model simulates human judgment by utilizing a limited set of training examples that may exhibit biases. To enhance the model's generalization, innovative pre-training approaches have been developed using millions of synthetic examples [26, 27]. Furthermore, RelGAN, developed at Rice University, represents a significant breakthrough in text generation using Generative Adversarial Networks (GANs). Comprising three key components (a relational memory-based generator, the Gumbel-Softmax relaxation algorithm, and multiple embedded representations in the discriminator), RelGAN outperforms several state-of-the-art models in benchmark tests, showcasing its remarkable performance in sample quality and diversity. This underscores its potential for further research and application across a wide range of NLP tasks and challenges [28, 29].
Health. In the realm of healthcare, the generation of synthetic data plays a pivotal role in comprehending diseases while upholding patient confidentiality and privacy [30]. Synthetic data possesses the capability to mirror the original data distribution without disclosing actual patient information [30-32]. A notable example in healthcare is MedGAN, a model introduced by Edward Choi et al., leveraging adversarial networks to generate realistic synthetic medical records. Through the integration of autoencoders and generative adversarial networks, MedGAN proficiently produces high-dimensional discrete variables (e.g., binary features and counting features) based on genuine medical records [16]. Synthetic patient records generated by MedGAN have demonstrated comparable performance to real data across various experiments, encompassing distribution statistics, predictive modeling tasks, and assessments by medical experts. Furthermore, synthetic data finds widespread application in the realm of drug discovery. The predominant approach involves learning the distribution of drug molecules from existing databases and subsequently deriving new samples (i.e., drug molecules) from the acquired knowledge of drug molecule distributions. Numerous implementations of this process exist, such as variational autoencoders (VAEs) [33-35], generative adversarial networks (GANs) [36], energy-based models (EBMs) [37, 38], diffusion models [39], reinforcement learning (RL) [40-42], genetic algorithms [43], sampling-based methods [44, 45], and others.
3.2. The data distribution of synthetic datasets.

In the rapid advancement of artificial intelligence and machine learning, synthetic datasets have become an essential resource. These datasets, typically algorithmically generated, are used for training and testing a variety of models. However, the generation process of synthetic datasets often harbors implicit issues, especially regarding the fairness and representativeness of data distribution. These issues can affect the performance of models and potentially lead to biases and discriminatory practices in real-world applications.
Distribution Issues in Synthetic Datasets. The generation of synthetic datasets often lacks sufficient consideration for demographic diversity, which can lead to unbalanced data distributions in terms of gender, age, race, etc. For instance, in the creation of datasets, if the data is primarily based on individuals from specific racial or age groups, the trained models might perform poorly when dealing with other groups. This situation can lead to severe discriminatory issues in real-world applications, such as certain racial groups being incorrectly identified or entirely overlooked. If a dataset used to train a facial recognition system disproportionately represents certain demographics over others, the resulting AI models may exhibit biased performance, leading to unfair or discriminatory outcomes. These biases in the data can manifest in various forms.
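One lightweight safeguard is to audit the demographic composition of a synthetic dataset against a reference distribution before it is used for training. The sketch below shows such a check; the attribute name, the reference proportions, and the disparity threshold are illustrative assumptions, not a prescribed standard.

from collections import Counter

def group_proportions(records, attribute):
    """Share of each group value (e.g., age band, gender) in a list of dict records."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def max_disparity(synthetic_props, reference_props):
    """Largest absolute gap between synthetic and reference group shares."""
    groups = set(synthetic_props) | set(reference_props)
    return max(abs(synthetic_props.get(g, 0.0) - reference_props.get(g, 0.0))
               for g in groups)

# Illustrative usage with made-up records and a made-up reference distribution.
synthetic = ([{"age_band": "18-30"}] * 70 + [{"age_band": "31-60"}] * 25
             + [{"age_band": "60+"}] * 5)
reference = {"18-30": 0.35, "31-60": 0.45, "60+": 0.20}
props = group_proportions(synthetic, "age_band")
if max_disparity(props, reference) > 0.10:   # threshold is an arbitrary example
    print("Warning: synthetic data is demographically unbalanced:", props)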
Table 1. Summarization of Representative Works in Synthetic Data Generation.
5. Conclusions

In summary, the use of synthetic datasets in AI introduces a range of potential risks that require careful assessment and management.

5.1. New Approaches to Synthetic Data.

To address the current issues associated with synthetic data generation, there is a need for continuous development and adoption of new methods.
Adopting more advanced generative models. One potential approach involves the use of advanced generative models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). These models possess stronger learning capabilities, allowing for more accurate modeling of the complex distribution of real-world data. By employing these advanced models, it becomes possible to better avoid distribution-shift issues, enhance the diversity of generated data, and more effectively simulate the noise and uncertainty present in the real world.
Integrating domain-specific expertise to enhance the realism of synthetic data. Integrating domain-specific knowledge, such as computer graphics, physics, and cognitive science, can contribute to improving the realism of synthetic data. A deeper understanding of the physical laws behind scenes and of cognitive processes can lead to more precise generation of various scenarios, making synthetic data closer to real-world situations.

5.2. Regulating the Use of Synthetic Data.

To regulate the use of synthetic data, establishing a set of clear guidelines is crucial.
Establishing industry standards. Industry standards should be developed to outline best practices for synthetic data generation and use.
References

[1] F. Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, "Data imbalance in classification: Experimental evaluation," Information Sciences, vol. 513, pp. 429-441, 2020.
[2] M. Favaretto, E. De Clercq, and B. S. Elger, "Big data and discrimination: perils, promises and solutions. A systematic review," Journal of Big Data, vol. 6, no. 1, pp. 1-27, 2019.
[3] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139-144, 2020.
[5] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840-6851, 2020.
[6] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, "Synthetic data from diffusion models improves imagenet classification," arXiv preprint arXiv:2304.08466, 2023.
[7] H. Fang, B. Han, S. Zhang, S. Zhou, C. Hu, and W.-M. Ye, "Data augmentation for object detection via controllable diffusion models," in WACV, 2024.
[8] Q. H. Nguyen, T. T. Vu, A. T. Tran, and K. Nguyen, "Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[9] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., "Improving language understanding by generative pre-training," 2018.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[12] M. Josifoski, M. Sakota, M. Peyrard, and R. West, "Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 1555-1574, Association for Computational Linguistics, Dec. 2023.
[13] C. Whitehouse, M. Choudhury, and A. Aji, "LLM-powered data augmentation for enhanced cross-lingual performance," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 671-686, Association for Computational Linguistics, Dec. 2023.
[14] X. Sun and L. Zheng, "Dissecting person re-identification from the viewpoint of viewpoint," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 608-617, 2019.
[15] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon, "Simulating content consistent vehicle datasets with attribute descent," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, pp. 775-791, Springer, 2020.
[16] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang, "PAMTRI: Pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 211-220, 2019.
[17] Q. Wang, J. Gao, W. Lin, and Y. Yuan, "Learning from synthetic data for crowd counting in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8198-8207, 2019.
[18] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.
[19] R. Torkzadehmahani, P. Kairouz, and B. Paten, "DP-CGAN: Differentially private synthetic data and label generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[20] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," in International Conference on Learning Representations, 2018.
[21] M. Niemeyer and A. Geiger, "GIRAFFE: Representing scenes as compositional generative neural feature fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453-11464, 2021.
[22] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.
[23] X. Zhang, I. Vallés-Pérez, A. Stolcke, C. Yu, J. Droppo, O. Shonibare, R. Barra-Chicote, and V. Ravichandran, "Stutter-TTS: Controlled synthesis and improved recognition of stuttered speech," arXiv preprint arXiv:2211.09731, 2022.
[24] C. Dewi, R.-C. Chen, Y.-T. Liu, and S.-K. Tai, "Synthetic data generation using DCGAN for improved traffic sign recognition," Neural Computing and Applications, vol. 34, no. 24, pp. 21465-21480, 2022.
[25] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[26] X. Yue, H. A. Inan, X. Li, G. Kumar, J. McAnallen, H. Sun, D. Levitan, and R. Sim, "Synthetic text generation with differential privacy: A simple and practical recipe," arXiv preprint arXiv:2210.14348, 2022.
[27] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, "Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[28] W. Nie, N. Narodytska, and A. Patel, "RelGAN: Relational generative adversarial networks for text generation," in International Conference on Learning Representations, 2018.
[29] Z. Zhao, A. Zhu, Z. Zeng, B. Veeravalli, and C. Guan, "ACT-Net: Asymmetric co-teacher network for semi-supervised memory-efficient medical image segmentation," in 2022 IEEE International Conference on Image Processing (ICIP), pp. 1426-1430, IEEE, 2022.
[30] J. Dahmen and D. Cook, "SynSys: A synthetic data generation system for healthcare applications," Sensors, vol. 19, no. 5, 2019.
[31] Y. Lu, Y. T. Chang, E. P. Hoffman, G. Yu, and Y. Wang, "Integrated identification of disease specific pathways using multi-omics data," Cold Spring Harbor Laboratory, 2019.
[32] Z. Wang, P. Myles, and A. Tucker, "Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy," Computational Intelligence, no. 3, 2021.
[33] W. Jin, R. Barzilay, and T. Jaakkola, "Junction tree variational autoencoder for molecular graph generation," 2018.
[34] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, "Automatic chemical design using a data-driven continuous representation of molecules," ACS Central Science, vol. 4, no. 2, pp. 268-276, 2018.
[35] B. Zhang, Y. Fu, Y. Lu, Z. Zhang, R. Clarke, J. E. Van Eyk, D. M. Herrington, and Y. Wang, "DDN2.0: R and python packages for differential dependency network analysis of biological systems," bioRxiv, 2021.
[36] N. De Cao and T. Kipf, "MolGAN: An implicit generative model for small molecular graphs," arXiv preprint arXiv:1805.11973, 2018.
[37] T. Fu and J. Sun, "Antibody complementarity determining regions (CDRs) design using constrained energy model," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 389-399, 2022.
[38] T. Fu, W. Gao, C. Xiao, J. Yasonik, C. W. Coley, and J. Sun, "Differentiable scaffolding tree for molecular optimization," arXiv preprint arXiv:2109.10469, 2021.
[39] M. Xu, L. Yu, Y. Song, C. Shi, S. Ermon, and J. Tang, "GeoDiff: A geometric diffusion model for molecular conformation generation," arXiv preprint arXiv:2203.02923, 2022.
[40] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, "Optimization of molecules via deep reinforcement learning," Scientific Reports, vol. 9, no. 1, p. 10752, 2019.
[41] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, "Molecular de-novo design through deep reinforcement learning," Journal of Cheminformatics, vol. 9, no. 1, pp. 1-14, 2017.
[42] T. Fu, W. Gao, C. Coley, and J. Sun, "Reinforced genetic algorithm for structure-based drug design," Advances in Neural Information Processing Systems, vol. 35, pp. 12325-12338, 2022.
[43] J. H. Jensen, "A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space," Chemical Science, vol. 10, no. 12, pp. 3567-3572, 2019.
[44] T. Fu, C. Xiao, X. Li, L. M. Glass, and J. Sun, "MIMOSA: Multi-constraint molecule sampling for molecule optimization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 125-133, 2021.
[45] T. Fu and J. Sun, "SIPF: Sampling method for inverse protein folding," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 378-388, 2022.
[46] A. Kundu, A. Tagliasacchi, A. Y. Mak, A. Stone, C. Doersch, C. Oztireli, C. Herrmann, D. Gnanapragasam, D. Duckworth, D. Rebain, et al., "Kubric: A scalable dataset generator," 2022.
[47] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, "Structured3D: A large photo-realistic dataset for structured 3D modeling," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX, pp. 519-535, Springer, 2020.
[48] S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann, "Improving robustness using generated data," Advances in Neural Information Processing Systems, vol. 34, pp. 4218-4233, 2021.
[49] R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. Qi, "Is synthetic data from generative models ready for image recognition?," arXiv preprint arXiv:2210.07574, 2022.
[50] R. Srinivasan and K. Uchino, "Biases in generative art: A causal look from the lens of art history," CoRR, vol. abs/2010.13266, 2020.
[51] V. U. Prabhu and A. Birhane, "Large image datasets: A pyrrhic win for computer vision?," CoRR, vol. abs/2006.16923, 2020.