
Synthetic Data in AI: Challenges, Applications, and Ethical Implications

Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu,
Chunlin Zhong, Zhangjun Zhou, He Tang*

School of Software Engineering, Huazhong University of Science and Technology


arXiv:2401.01629v1 [cs.LG] 3 Jan 2024

[email protected]

* Corresponding author: [email protected]

1. Introduction

In the rapidly evolving field of artificial intelligence, the creation and utilization of synthetic datasets have become increasingly significant. This report delves into the multifaceted aspects of synthetic data, particularly emphasizing the challenges and potential biases these datasets may harbor. It explores the methodologies behind synthetic data generation, spanning traditional statistical models to advanced deep learning techniques, and examines their applications across diverse domains. The report also critically addresses the ethical considerations and legal implications associated with synthetic datasets, highlighting the urgent need for mechanisms to ensure fairness, mitigate biases, and uphold ethical standards in AI development.

2. The generation of synthetic data

Real data typically refers to data collected directly from the real world, covering text, images, video, audio, and so on. However, due to its inherent limitations and incompleteness, issues such as data imbalance [1] and data discrimination [2] arise in practical applications. Since it is difficult to satisfy the demand relying solely on real data, researchers have started to employ diverse methods for generating synthetic data from existing real data. These methods range from traditional statistical models to contemporary advanced techniques based on deep learning models.

2.1. Statistical Models

Statistical models for generating synthetic datasets often rely on analyzing the distribution, relationships, and characteristics of real data, and attempt to generate synthetic data with similar properties by simulating these statistical properties. The main methods are summarized as follows:

Distribution-based methods. This method aims to simulate the distribution characteristics of the original data. For continuous data, probability density functions (PDFs) are used to describe the distribution, while probability mass functions (PMFs) are used for discrete data. When synthesizing data, new data points can be generated by sampling from the distribution of existing data.
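As a minimal, hedged illustration of this idea (our own toy example, not one taken from the report), the sketch below fits a single Gaussian to an observed continuous feature and draws new synthetic values from the fitted distribution; the feature and its parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is a real-valued feature observed in a real dataset.
real_feature = rng.normal(loc=170.0, scale=8.0, size=1_000)

# Fit a simple parametric model of its distribution (here: a Gaussian).
mu, sigma = real_feature.mean(), real_feature.std(ddof=1)

# Sample new synthetic points from the fitted distribution.
synthetic_feature = rng.normal(loc=mu, scale=sigma, size=500)

print(f"real mean/std:      {real_feature.mean():.2f} / {real_feature.std():.2f}")
print(f"synthetic mean/std: {synthetic_feature.mean():.2f} / {synthetic_feature.std():.2f}")
```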
Interpolation and Extrapolation. Interpolation and extrapolation involve generating new data points between or beyond existing data points. This is particularly useful for time series, geographical data, etc. One common interpolation method is linear interpolation, where the value of a new point is determined by a linear relationship between two adjacent known points.
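A small sketch of linear interpolation on an irregularly sampled series; the timestamps and values below are assumptions made purely for illustration.

```python
import numpy as np

# Known points of a (hypothetical) time series with gaps.
known_t = np.array([0.0, 1.0, 2.0, 5.0, 6.0])
known_y = np.array([10.0, 12.0, 11.5, 15.0, 14.0])

# Generate new points between existing ones by linear interpolation:
# each new value lies on the straight line between its two neighbours.
new_t = np.array([0.5, 3.0, 4.0, 5.5])
new_y = np.interp(new_t, known_t, known_y)

for t, y in zip(new_t, new_y):
    print(f"t={t:.1f} -> interpolated y={y:.2f}")
```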
Monte Carlo Simulation. Monte Carlo simulation employs random sampling to simulate the uncertainty in real systems. In data synthesis, this method is utilized to generate new samples by randomly sampling from known distributions. It finds common applications in finance, engineering, and physics modeling.
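One hedged way to picture this (a toy setup of our own, not an example from the report) is to propagate assumed input distributions through a simple system model by repeated random sampling:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100_000

# Assumed input distributions for a toy engineering quantity:
# applied load (kN) is normal, material strength (kN) is lognormal.
load = rng.normal(loc=50.0, scale=5.0, size=n_samples)
strength = rng.lognormal(mean=np.log(70.0), sigma=0.10, size=n_samples)

# Each Monte Carlo draw yields one synthetic scenario; summarising the
# draws estimates the uncertainty of the derived quantity.
safety_margin = strength - load
print(f"mean margin: {safety_margin.mean():.2f} kN")
print(f"estimated failure probability: {(safety_margin < 0).mean():.4f}")
```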
Model-based Sampling. This method involves utilizing statistical models of existing data to predict new data points. For example, a linear regression model can be fitted to existing data, and new data points can be generated by randomly sampling the model parameters. This approach is particularly effective for data exhibiting linear relationships.
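A brief sketch under our own assumptions: a linear model is fitted to observed pairs and new pairs are synthesized from it. For simplicity the sketch perturbs model predictions with residual noise rather than resampling the fitted parameters themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Existing data with an (assumed) roughly linear relationship.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.5, size=200)

# Fit a degree-1 polynomial (ordinary linear regression) to the real data.
slope, intercept = np.polyfit(x, y, deg=1)
residual_std = np.std(y - (slope * x + intercept), ddof=2)

# Synthesize new points: sample inputs, predict, and add residual-scale noise.
x_new = rng.uniform(0, 10, size=100)
y_new = slope * x_new + intercept + rng.normal(scale=residual_std, size=100)
print(f"fitted model: y = {slope:.2f} * x + {intercept:.2f} (residual std {residual_std:.2f})")
```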
Kernel Density Estimation. Kernel density estimation involves placing kernels (typically Gaussian kernels) around each data point and calculating the contribution at each point to estimate the probability density function. When generating new samples, random sampling can be performed from the estimated probability density function. This is useful for capturing the complexity and multimodality of data distributions.
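A short sketch using SciPy's Gaussian KDE on an invented bimodal feature; new samples are drawn directly from the estimated density.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)

# A bimodal real-valued feature that a single Gaussian would model poorly.
real = np.concatenate([
    rng.normal(-2.0, 0.5, size=500),
    rng.normal(3.0, 1.0, size=500),
])

# Place a Gaussian kernel on each data point to estimate the PDF...
kde = gaussian_kde(real)

# ...and draw new synthetic samples from the estimated density.
synthetic = kde.resample(300).ravel()
print("synthetic sample range:", synthetic.min(), synthetic.max())
```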
2.2. Deep learning based generation

In light of the rapid advancements in deep learning over recent years, scholars have increasingly directed their attention toward harnessing deep learning methodologies for the generation of synthetic datasets. In contrast to traditional statistics-based methods, deep learning approaches have the capability to acquire more intricate feature representations of data without the need for manually designed feature extractors. Their inherent nonlinearity makes them well suited to the complex nonlinear relationships within data, enabling a more effective capture of the essential characteristics of the data. The following section outlines several typical approaches to data generation based on deep learning.
Variational Auto-Encoder (VAE). The VAE [3] is a probabilistic generative model that employs an encoder-decoder architecture. The encoder maps the input data to an underlying latent space corresponding to the parameters of the variational distribution, and the decoder projects features from the latent space back into the input space. By capturing the distribution of latent-space features, the VAE can generate multiple distinct samples that follow the same distribution. The inherent randomness of the VAE introduces a degree of diversity into the generated data, making it more representative of the complexity of real datasets.
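To make the encoder-decoder structure and the latent sampling step concrete, here is a minimal PyTorch sketch of a VAE for small tabular-style vectors. The layer sizes, loss weighting, and toy data are illustrative assumptions, not the configuration of [3] or of any system discussed in this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=20, z_dim=4, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence between q(z|x) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Toy training loop on random "real" vectors (illustrative only).
torch.manual_seed(0)
real = torch.randn(256, 20)
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    x_hat, mu, logvar = model(real)
    loss = vae_loss(x_hat, real, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generation: sample latent vectors from the prior and decode them.
with torch.no_grad():
    synthetic = model.dec(torch.randn(10, 4))
print(synthetic.shape)  # torch.Size([10, 20])
```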
Generative Adversarial Networks (GAN). The GAN was first proposed by Goodfellow et al. [4] in 2014 and consists of a generator and a discriminator. The generator samples randomly from the latent space in order to produce samples resembling the training set, while the discriminator judges whether a sample belongs to the real or the synthetic data. The two parts engage in a continual adversarial learning process and update their parameters alternately until the generator is able to synthesise high-quality samples that fool the discriminator.
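The adversarial loop described above can be sketched in a few lines of PyTorch on a toy 2-D distribution; the network sizes, learning rates, and data are illustrative assumptions rather than a reproduction of [4].

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Toy "real" data: a shifted Gaussian blob in 2-D.
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(2000):
    # 1) Update the discriminator: real samples -> 1, generated samples -> 0.
    x_real = real_batch()
    x_fake = G(torch.randn(64, 8)).detach()
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Update the generator: try to make the discriminator output 1 on fakes.
    x_fake = G(torch.randn(64, 8))
    g_loss = bce(D(x_fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, the generator maps latent noise to synthetic samples.
print(G(torch.randn(5, 8)))
```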
Diffusion models. The diffusion model [5] stands as a robust method for crafting synthetic datasets, relying on a systematic approach to emulate intricate temporal dependencies within data. Rooted in the principles of diffusion processes, where information spreads through a network over time, this technique integrates these models into the data generation process. By doing so, it enables the replication of nuanced patterns and relationships observed in authentic data. The crux lies in accurately capturing the dynamic evolution of information over time. This technique, grounded in diffusion models, not only reproduces statistical characteristics but also adeptly mirrors the complex temporal dynamics present in real-world datasets, making it a sophisticated tool for simulating realistic data scenarios essential for the training and evaluation of machine learning models across diverse domains. To date, synthetic data generated from diffusion models has found extensive application across various vision tasks, notably enhancing performance in areas such as image classification [6], object detection [7], and semantic segmentation [8].
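As a hedged, partial sketch of the mechanics behind such models, the snippet below implements only the closed-form forward (noising) process of a DDPM-style diffusion model [5]; the denoising network and the reverse sampling loop that actually generate data are omitted. The schedule values and tensor shapes are illustrative assumptions.

```python
import torch

# Linear noise schedule over T steps, in the style of DDPM [5].
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def noisy_sample(x0, t):
    """Closed-form forward process q(x_t | x_0): blend data with Gaussian noise."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps  # a denoiser would be trained to predict eps from (x_t, t)

x0 = torch.randn(4, 3, 8, 8)          # a toy batch standing in for images
x_t, eps = noisy_sample(x0, t=500)
print(x_t.shape)
```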
Large language models. In the past few years, large language models (LLMs) have emerged as a revolutionary approach for generating synthetic datasets. For instance, models such as GPT-3.5, with their exceptional in-context learning capabilities and extensive pre-trained linguistic knowledge, exemplify the capacity of LLMs to produce synthetic datasets. This capability facilitates the training of models in smaller domains, effectively addressing the challenge of data scarcity in specific areas. The Generative Pre-trained Transformer (GPT) family [9-11] comprises a series of natural language processing (NLP) models developed by OpenAI that employ the Transformer architecture to capture long-distance dependencies in input sequences. GPT is pre-trained without supervision on large-scale Internet text corpora, for example to predict the next word in a given context. This pre-training strategy enables the model to acquire a profound understanding of linguistic statistical structure and contextual relationships, and ultimately to perform a variety of natural language processing tasks in a pre-training/fine-tuning format. For example, Josifoski et al. [12] synthetically generate a dataset of 1.8M data points in a reverse manner and demonstrate the effectiveness of this approach on closed information extraction. Whitehouse et al. [13] utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets and assess the naturalness and logical coherence of the generated examples.
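The common recipe behind these works can be sketched, in hedged form, as prompting an LLM for labelled examples and filtering its output. The prompt, label set, and `call_llm` placeholder below are our own assumptions and must be wired to whatever model or API is actually used; this is not the exact pipeline of [12] or [13].

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM of your choice and return its text reply."""
    raise NotImplementedError("wire this up to your preferred LLM API")

PROMPT_TEMPLATE = (
    "Generate {n} short customer-support messages about the topic '{topic}'. "
    "Return a JSON list of objects with fields 'text' and 'label', "
    "where 'label' is one of: billing, shipping, technical."
)

def synthesize_examples(topic: str, n: int = 20) -> list[dict]:
    reply = call_llm(PROMPT_TEMPLATE.format(n=n, topic=topic))
    examples = json.loads(reply)          # expect a JSON list from the model
    # Keep only well-formed, non-empty examples with an allowed label.
    allowed = {"billing", "shipping", "technical"}
    return [e for e in examples
            if isinstance(e, dict) and e.get("text") and e.get("label") in allowed]

# Usage (once call_llm is implemented):
# data = synthesize_examples("delayed parcel", n=50)
```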
3. The usage of synthetic data.

3.1. Existing synthetic datasets.

Synthetic data possesses several advantages that natural data lacks, making it an attractive choice in many fields. Compared to natural data, synthetic datasets are relatively easy to acquire and can provide data for rare or challenging scenarios, thereby addressing diversity issues in certain datasets. Additionally, this technology can effectively avoid privacy concerns, safeguarding user information security. As the technology gradually becomes more prominent, its practical applications are becoming increasingly widespread. This section discusses several domains where generated data has had a significant impact.

Vision. Synthetic datasets can greatly address the challenge of acquiring natural data in certain domains. In the early days of computer vision, the generation of corresponding datasets relied primarily on computer graphics engines [14-17]. For example, in the re-identification (Re-ID) domain, Sun et al. [14] created the PersonX dataset using an engine based on Unity. This dataset includes three pure-color backgrounds and three scene backgrounds. It consists of 1266 hand-crafted identities (547 females and 719 males), with each identity having 36 viewpoints. At that time, this dataset effectively addressed the lack of multi-viewpoint data. Furthermore, there are also generation methods based on Generative Adversarial Networks (GANs) [18-21]. In [21], GIRAFFE introduces synthetic neural feature maps, enabling control over camera poses, object placements and orientations, as well as object shapes and appearances during generation. Moreover, GIRAFFE allows for the free addition of multiple objects in a scene, expanding the generated scenes from single-object to multi-object scenarios.

Audio. Synthetic data is widely employed in the field of audio, and its development has been remarkably rapid. Take, for instance, the work of Chris Donahue et al. on the Speech Commands dataset, which leverages WaveGAN, a generative adversarial network [22]. WaveGAN excels at synthesizing one-second audio waveform slices with global coherence, making it particularly well suited for sound-effect generation. Even without labels, when trained on small-vocabulary speech datasets, WaveGAN adeptly learns to generate intelligible words and can extend its synthesis capabilities to audio from diverse domains, including drums, bird sounds, and piano. Furthermore, Zhang et al. introduced Stutter-TTS [23], tailored to tackle the performance challenges faced by existing Automatic Speech Recognition (ASR) interfaces when dealing with stuttered speech. Stutter-TTS is an end-to-end neural text-to-speech model with the proficiency to synthesize various forms of stuttered speech. It employs a simple yet effective prosody control strategy and incorporates additional markers during training to represent specific stuttering features. By strategically selecting the positions of these markers, Stutter-TTS provides word-level control over the occurrence of stuttering in the synthesized speech.

Natural Language Processing (NLP). The growing interest in synthetic data has propelled the thriving development of numerous deep generative models in the field of natural language processing (NLP) [24]. In recent years, machine learning has demonstrated its formidable capabilities in tasks such as classification, routing, filtering, and information retrieval across various domains [25]. Addressing the challenge of synonym variations arising from contextual changes in NLP, researchers have introduced the BLEURT model. Built upon BERT, this model simulates human judgment by utilizing a limited set of training examples that may exhibit biases. To enhance the model's generalization, innovative pre-training approaches have been developed using millions of synthetic examples [26, 27]. Furthermore, RelGAN, developed at Rice University, represents a significant breakthrough in text generation using Generative Adversarial Networks (GANs). Comprising three key components, namely a relational memory-based generator, the Gumbel-Softmax relaxation algorithm, and multiple embedded representations in the discriminator, RelGAN outperforms several state-of-the-art models in benchmark tests, showcasing its remarkable performance in sample quality and diversity. This underscores its potential for further research and application across a wide range of NLP tasks and challenges [28, 29].

Health. In the realm of healthcare, the generation of synthetic data plays a pivotal role in comprehending diseases while upholding patient confidentiality and privacy [30]. Synthetic data possesses the capability to mirror the original data distribution without disclosing actual patient information [30-32]. A notable example in healthcare is MedGAN, a model introduced by Edward Choi et al. that leverages adversarial networks to generate realistic synthetic medical records. Through the integration of autoencoders and generative adversarial networks, MedGAN proficiently produces high-dimensional discrete variables (e.g., binary features and count features) based on genuine medical records [16]. Synthetic patient records generated by MedGAN have demonstrated comparable performance to real data across various experiments, encompassing distribution statistics, predictive modeling tasks, and assessments by medical experts. Furthermore, synthetic data finds widespread application in the realm of drug discovery. The predominant approach involves learning the distribution of drug molecules from existing databases and subsequently deriving new samples (i.e., drug molecules) from the acquired knowledge of drug-molecule distributions. Numerous implementations of this process exist, such as variational autoencoders (VAEs) [33-35], generative adversarial networks (GANs) [36], energy-based models (EBMs) [37, 38], diffusion models [39], reinforcement learning (RL) [40-42], genetic algorithms [43], sampling-based methods [44, 45], and others.
Table 1. Summarization of Representative Works in Synthetic Data Generation.

Dataset              Generation Method             Application   Size               Content                      DL
Kubric [46]          3D-Rendered                   Vision        —                  3D Image/Video               Yes
Structured3D [47]    3D-Rendered                   Vision        196,515 frames     3D Image/Video               Yes
PersonX [14]         Physically Realistic Engine   Vision        1266 images        Person image                 No
GCC [17]             Physically Realistic Engine   Vision        15,212 images      Crowd image                  No
BigGANs [48]         GAN                           Vision        —                  Image Annotation             Yes
GIRAFFE [21]         GAN                           Vision        —                  Multi-View image             Yes
SyntheticData [49]   Diffusion Model               Vision        —                  Generate Rare Species Data   Yes
[49]                 Diffusion Model               Vision        —                  Improved ImageNet            Yes
SynthIE [12]         LLM                           NLP           1.8M data points   Text                         Yes
Gen-X [13]           LLM                           NLP           —                  Data Augmentation            Yes

3.2. The data distribution of synthetic datasets.

With the rapid advancement of artificial intelligence and machine learning, synthetic datasets have become an essential resource. These datasets, typically algorithmically generated, are used for training and testing a variety of models. However, the generation process of synthetic datasets often harbors implicit issues, especially regarding the fairness and representativeness of the data distribution. These issues can affect the performance of models and potentially lead to biases and discriminatory practices in real-world applications.

Distribution Issues in Synthetic Datasets. The generation of synthetic datasets often lacks sufficient consideration of demographic diversity, which can lead to unbalanced data distributions in terms of gender, age, race, etc. For instance, if the data used to create a dataset is primarily drawn from individuals of specific racial or age groups, the trained models might perform poorly when dealing with other groups. This situation can lead to severe discriminatory issues in real-world applications, such as certain racial groups being incorrectly identified or entirely overlooked. For instance, if a dataset used to train a facial recognition system disproportionately represents certain demographics over others, the resulting AI models may exhibit biased performance, leading to unfair or discriminatory outcomes. These biases in the data can manifest in various forms, such as overrepresentation or underrepresentation of specific groups, leading to skewed perceptions and decisions made by AI systems.
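One hedged way to surface such imbalance before training is a simple audit of group proportions in the synthetic set against a reference population. The attribute names, counts, and target shares below are illustrative assumptions.

```python
from collections import Counter

# Hypothetical demographic labels attached to a synthetic dataset.
synthetic_groups = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50

# Reference shares we would like the dataset to approximate (assumed).
target_shares = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}

counts = Counter(synthetic_groups)
total = sum(counts.values())
for group, target in target_shares.items():
    actual = counts.get(group, 0) / total
    flag = "  <-- underrepresented" if actual < 0.5 * target else ""
    print(f"{group}: actual {actual:.2%} vs target {target:.2%}{flag}")
```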
Biases in AI Artistic Creation. Biases in AI artistic creation exemplify how dataset distribution issues can profoundly affect the application of artificial intelligence, often reflecting and even amplifying underlying societal biases. Artistic creation, as an arena where AI has been increasingly applied, offers a stark view into the ramifications of these biases.

When AI systems are employed to process, interpret, or generate artworks from diverse cultural backgrounds or artistic styles, they often manifest clear biases towards certain genres, styles, or races. This phenomenon primarily stems from the composition of the training datasets. If a dataset is heavily skewed towards artworks from a particular cultural or stylistic background, the AI is more likely to develop a bias towards that specific style or culture. This bias can significantly influence the AI's artistic outputs, subtly shaping the characteristics of the generated artwork in ways that reflect the predominant features of the training data.

The case of GoArt applying an expressionist filter to Clementine Hunter's painting "Black Matriarch" is a telling example [50]. In this scenario, the AI altered the black skin tones to red, a choice that seems inexplicable without considering the training data's influence. In contrast, when processing a sculpture like Desiderio's "Giovinetto," which features a figure with white skin, the AI preserved the artwork's original color palette. These differences in treatment can be indicative of the AI's learned preferences, potentially influenced by the nature of the training data, which might have contained more frequent representations of certain skin tones within specific artistic contexts.

Such biases in AI artistic creation are not merely academic concerns; they have real-world implications. They can inadvertently perpetuate stereotypes, reinforce cultural hegemonies, and marginalize underrepresented groups. This issue underscores the importance of curating diverse and inclusive datasets, especially in fields like art, where representation and expression are crucial. Furthermore, it highlights the necessity for continuous scrutiny and evaluation of AI models and their outputs to identify and mitigate biases.

In this context, it is crucial to view AI not just as a technical tool but as an entity shaped by human decisions and societal structures. The choices made in dataset compilation and algorithm design profoundly impact the AI's behavior, echoing the creators' conscious and unconscious biases. Addressing these issues requires a concerted effort to integrate ethical considerations, cultural sensitivity, and diversity into every stage of AI development, from dataset creation to model training and application.

Lack of Ethical and Legal Constraints. Beyond the challenges posed by statistical distribution, synthetic datasets may also suffer from a lack of necessary ethical and legal constraints during their creation process. This oversight can lead to significant issues, particularly in the context of how data generation algorithms process and interpret the input data. Often, these algorithms may inadvertently learn and replicate biases that are inherent in real-world data sources. This issue is especially pronounced in scenarios involving gender or racial biases.

Moreover, the problem is exacerbated when synthetic datasets rely on publicly available internet data. The internet, as a vast repository of human-generated content, inherently contains a myriad of biases and prejudices that exist within society. This data is often unfiltered and includes implicit societal biases, stereotypes, and even offensive or harmful representations. When such data is used without critical filtering or ethical consideration, the resulting synthetic datasets can inadvertently become a medium through which these societal biases are perpetuated and amplified. The roots of such biases are multifaceted, yet they all converge on a common issue: societal prejudices can infiltrate the process of AI's artistic creation.

Research [51] indicates that this problem is not limited to synthetic datasets; it also plagues datasets collected from the internet. A prime example is the "Tiny Images" dataset from the Massachusetts Institute of Technology. Compiled using extensive image and label data aggregated from search engines, this dataset was found to contain tendencies of racial and gender discrimination, and even instances of pedophilia, leading to its eventual permanent removal. The emergence of these issues is partly due to the influence of societal biases and partly reflects negligence in the construction of datasets.

In essence, the generation of synthetic datasets requires not only a sophisticated understanding of statistical methodologies but also a deep consideration of the ethical and legal implications. Ensuring fairness and representativeness in these datasets necessitates a comprehensive approach that includes ethical oversight, legal compliance, and an active effort to identify and mitigate any potential biases. This approach should be an integral part of the dataset creation process, ensuring that AI systems trained on these datasets do not perpetuate existing societal biases but rather contribute to fair and equitable outcomes.
4. Risks and Challenges in Utilizing Synthetic Datasets for AI.

While synthetic data plays a crucial role in AI applications, the current methods of generating synthetic datasets bring forth notable challenges and potential risks, necessitating careful consideration of their applications.

4.1. Shortcomings of Synthetic Data.

Data Distribution Bias. A discernible incongruity exists between synthetic datasets and their authentic counterparts, encompassing notable disparities in feature distribution, class distribution, and other pertinent statistical attributes. This bias imparts a proclivity for models to engender misleading prognostications within practical applications, thereby compromising their fidelity to faithfully encapsulate real-world phenomena.
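A hedged sketch of one elementary check for this kind of bias: compare each synthetic feature with its real counterpart using a two-sample Kolmogorov-Smirnov test. The features, data, and significance threshold below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Toy stand-ins: one well-matched feature and one shifted feature.
real = {"age": rng.normal(40, 12, 2000), "income": rng.lognormal(10, 0.5, 2000)}
synthetic = {"age": rng.normal(40, 12, 2000), "income": rng.lognormal(10.4, 0.5, 2000)}

for name in real:
    stat, p_value = ks_2samp(real[name], synthetic[name])
    verdict = "OK" if p_value > 0.01 else "distribution mismatch"
    print(f"{name}: KS statistic={stat:.3f}, p={p_value:.3g} -> {verdict}")
```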
Incomplete Data. Synthetic datasets may contain lacunae or only partial information, stemming from imperfections, errors, or an inadequacy in encapsulating the manifold changes inherent in authentic datasets during the synthetic generation process. This dearth of comprehensive information may impede the model's capacity to accurately prognosticate or manage scenarios characterized by data incompleteness, thereby influencing the model's resilience and pragmatic utility.

Inaccurate Data. Errors, noise, or inaccuracies may manifest within synthetic datasets, diverging significantly from the veracity of real-world datasets. This discrepancy may arise from algorithmic imperfections, noise injection, or other contributory factors that give rise to inaccuracies. Consequently, the model may internalize erroneous patterns, thereby inducing biased predictions and undermining the overall performance and reliability of the model when confronted with genuine data.

Insufficient Noise Level. Synthetic datasets may evince an undue sterility, lacking the multifarious noise and intricacies inherent in real-world data. In authentic scenarios, data invariably incorporates diverse interferences, errors, and uncertainties. The paucity of such features within synthetic datasets may hamper the model's efficacy within realistic environments.

Over-Smoothing. In the process of synthetic data generation, certain models may overly attenuate or oversimplify the data, resulting in an attenuated representation devoid of the nuanced details and diversity inherent in authentic datasets. This propensity may precipitate challenges for the model in assimilating complex variations within genuine data.

Neglecting Temporal and Dynamic Aspects. Certain methodologies for synthetic data generation may inadequately capture temporal and dynamic nuances, facets that are inherently pivotal within authentic datasets. The consequential failure to faithfully simulate these temporal intricacies may culminate in the ineffectuality of models in real-world applications.

Inconsistency. Synthetic datasets often lack the inconsistency found in the rich tapestry of authentic datasets, which frequently embody variations stemming from diverse sources, temporal epochs, and environmental conditions. This shortfall may engender challenges for models in adapting to the multifaceted vicissitudes originating from disparate sources, temporal epochs, and environmental conditions, thereby precipitating a decrement in generalization performance across diverse datasets.

4.2. Risks in Synthetic Data Application.

General Model Performance. Widespread use of synthetic data may constrain the generalization performance of AI models. For instance, datasets like PersonX, generated within a gaming data engine, may deviate significantly from real-world data. In natural language processing, relying on fine-tuning with large language models may restrict downstream tasks to the performance and biases of the selected model. In healthcare, an abundance of non-real cases during model training may undermine confidence in diagnostic results among healthcare professionals and patients.

Ethical and Social Implications. The use of synthetic data may give rise to ethical and social concerns. Creating fictional characters or scenarios through synthetic data raises questions about AI responsibility in generating fictional content, potentially leading to misinformation, misunderstandings, or the dissemination of false information with detrimental societal impacts.

Security and Adversarial Risks. The introduction of synthetic data brings forth security and adversarial attack risks. Malicious use of synthetic data may render AI models unstable under adversarial attacks, as models may not adequately learn the complexity and diversity of the real world during training. This susceptibility increases the likelihood of deception or manipulation, posing threats to the credibility and security of the system.

Legal Compliance Challenges. In certain domains, the use of synthetic data may present challenges regarding legal compliance. For instance, employing synthetic data for risk assessment in the financial sector may encounter regulatory hurdles. Regulatory authorities often require transparent and interpretable models, and the synthetic data generation process may face difficulties in meeting these standards.

5. Conclusions

In summary, the use of synthetic datasets in AI introduces a range of potential risks that require careful assessment and management.

5.1. New Approaches to Synthetic Data.

To address the current issues associated with synthetic data generation, there is a need for continuous development and adoption of new methods.

Adopting more advanced generative models. One potential approach involves the use of advanced generative models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). These models possess stronger learning capabilities, allowing for more accurate modeling of the complex distribution of real-world data. By employing these advanced models, it becomes possible to better avoid distribution-shift issues, enhance the diversity of generated data, and more effectively simulate the noise and uncertainty present in the real world.

Integrating domain-specific expertise to enhance the realism of synthetic data. Integrating domain-specific knowledge, such as computer graphics, physics, and cognitive science, can contribute to improving the realism of synthetic data. A deeper understanding of the physical laws behind scenes and of cognitive processes can lead to more precise generation of various scenarios, making synthetic data closer to real-world situations.

5.2. Regulating the Use of Synthetic Data.

To regulate the use of synthetic data, establishing a set of clear guidelines is crucial.

Establishing industry standards. Industry standards should be developed to outline best practices for synthetic data generation, covering the selection of data generation models, parameter settings, and the correlation between synthetic and real data.

Transparency and documentation. Transparency and documentation are equally essential. Researchers and practitioners should clearly document the methods and parameter settings used for generating synthetic data, providing detailed documentation about the synthetic dataset. This aids other researchers in understanding the source and characteristics of the data, facilitating a better assessment of the model's performance.

Emphasizing model validation and evaluation. Emphasizing model validation and evaluation is a crucial step in regulating the use of synthetic data. In addition to training on synthetic data, models should be validated on real data to ensure their generalization performance and robustness. Regularly updating synthetic datasets to adapt to new scenarios and changes in data distribution is also a vital means of maintaining model performance.
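A hedged sketch of the train-on-synthetic, validate-on-real protocol suggested here, using scikit-learn on toy data; the classifier, the artificial distribution gap, and all names below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_set(n, shift=0.0):
    # Two Gaussian classes; `shift` mimics a synthetic-vs-real distribution gap.
    x0 = rng.normal(0.0 + shift, 1.0, size=(n, 2))
    x1 = rng.normal(2.0 + shift, 1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_synth, y_synth = make_set(500, shift=0.3)   # training data: synthetic
X_real, y_real = make_set(500, shift=0.0)     # held-out data: real

model = LogisticRegression().fit(X_synth, y_synth)
print("accuracy on synthetic:", accuracy_score(y_synth, model.predict(X_synth)))
print("accuracy on real:     ", accuracy_score(y_real, model.predict(X_real)))
```

The gap between the two reported accuracies is exactly the signal that validation on real data is meant to expose.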
References

[1] F. Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, "Data imbalance in classification: Experimental evaluation," Information Sciences, vol. 513, pp. 429-441, 2020.

[2] M. Favaretto, E. De Clercq, and B. S. Elger, "Big data and discrimination: perils, promises and solutions. A systematic review," Journal of Big Data, vol. 6, no. 1, pp. 1-27, 2019.

[3] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.

[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139-144, 2020.

[5] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840-6851, 2020.

[6] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, "Synthetic data from diffusion models improves ImageNet classification," arXiv preprint arXiv:2304.08466, 2023.

[7] H. Fang, B. Han, S. Zhang, S. Zhou, C. Hu, and W.-M. Ye, "Data augmentation for object detection via controllable diffusion models," in WACV 2024, 2024.

[8] Q. H. Nguyen, T. T. Vu, A. T. Tran, and K. Nguyen, "Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[9] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., "Improving language understanding by generative pre-training," 2018.

[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.

[12] M. Josifoski, M. Sakota, M. Peyrard, and R. West, "Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 1555-1574, Association for Computational Linguistics, Dec. 2023.

[13] C. Whitehouse, M. Choudhury, and A. Aji, "LLM-powered data augmentation for enhanced cross-lingual performance," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 671-686, Association for Computational Linguistics, Dec. 2023.

[14] X. Sun and L. Zheng, "Dissecting person re-identification from the viewpoint of viewpoint," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 608-617, 2019.

[15] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon, "Simulating content consistent vehicle datasets with attribute descent," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, pp. 775-791, Springer, 2020.

[16] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang, "PAMTRI: Pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 211-220, 2019.

[17] Q. Wang, J. Gao, W. Lin, and Y. Yuan, "Learning from synthetic data for crowd counting in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8198-8207, 2019.

[18] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.

[19] R. Torkzadehmahani, P. Kairouz, and B. Paten, "DP-CGAN: Differentially private synthetic data and label generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[20] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," in International Conference on Learning Representations, 2018.

[21] M. Niemeyer and A. Geiger, "GIRAFFE: Representing scenes as compositional generative neural feature fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453-11464, 2021.

[22] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.

[23] X. Zhang, I. Vallés-Pérez, A. Stolcke, C. Yu, J. Droppo, O. Shonibare, R. Barra-Chicote, and V. Ravichandran, "Stutter-TTS: Controlled synthesis and improved recognition of stuttered speech," arXiv preprint arXiv:2211.09731, 2022.

[24] C. Dewi, R.-C. Chen, Y.-T. Liu, and S.-K. Tai, "Synthetic data generation using DCGAN for improved traffic sign recognition," Neural Computing and Applications, vol. 34, no. 24, pp. 21465-21480, 2022.

[25] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.

[26] X. Yue, H. A. Inan, X. Li, G. Kumar, J. McAnallen, H. Sun, D. Levitan, and R. Sim, "Synthetic text generation with differential privacy: A simple and practical recipe," arXiv preprint arXiv:2210.14348, 2022.

[27] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, "Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[28] W. Nie, N. Narodytska, and A. Patel, "RelGAN: Relational generative adversarial networks for text generation," in International Conference on Learning Representations, 2018.

[29] Z. Zhao, A. Zhu, Z. Zeng, B. Veeravalli, and C. Guan, "ACT-Net: Asymmetric co-teacher network for semi-supervised memory-efficient medical image segmentation," in 2022 IEEE International Conference on Image Processing (ICIP), pp. 1426-1430, IEEE, 2022.

[30] J. Dahmen and D. Cook, "SynSys: A synthetic data generation system for healthcare applications," Sensors, vol. 19, no. 5, 2019.

[31] Y. Lu, Y. T. Chang, E. P. Hoffman, G. Yu, and Y. Wang, "Integrated identification of disease specific pathways using multi-omics data," Cold Spring Harbor Laboratory, 2019.

[32] Z. Wang, P. Myles, and A. Tucker, "Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy," Computational Intelligence, no. 3, 2021.

[33] W. Jin, R. Barzilay, and T. Jaakkola, "Junction tree variational autoencoder for molecular graph generation," 2018.

[34] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, "Automatic chemical design using a data-driven continuous representation of molecules," ACS Central Science, vol. 4, no. 2, pp. 268-276, 2018.

[35] B. Zhang, Y. Fu, Y. Lu, Z. Zhang, R. Clarke, J. E. Van Eyk, D. M. Herrington, and Y. Wang, "DDN2.0: R and Python packages for differential dependency network analysis of biological systems," bioRxiv, 2021.

[36] N. De Cao and T. Kipf, "MolGAN: An implicit generative model for small molecular graphs," arXiv preprint arXiv:1805.11973, 2018.

[37] T. Fu and J. Sun, "Antibody complementarity determining regions (CDRs) design using constrained energy model," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 389-399, 2022.

[38] T. Fu, W. Gao, C. Xiao, J. Yasonik, C. W. Coley, and J. Sun, "Differentiable scaffolding tree for molecular optimization," arXiv preprint arXiv:2109.10469, 2021.

[39] M. Xu, L. Yu, Y. Song, C. Shi, S. Ermon, and J. Tang, "GeoDiff: A geometric diffusion model for molecular conformation generation," arXiv preprint arXiv:2203.02923, 2022.

[40] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, "Optimization of molecules via deep reinforcement learning," Scientific Reports, vol. 9, no. 1, p. 10752, 2019.

[41] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, "Molecular de-novo design through deep reinforcement learning," Journal of Cheminformatics, vol. 9, no. 1, pp. 1-14, 2017.

[42] T. Fu, W. Gao, C. Coley, and J. Sun, "Reinforced genetic algorithm for structure-based drug design," Advances in Neural Information Processing Systems, vol. 35, pp. 12325-12338, 2022.

[43] J. H. Jensen, "A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space," Chemical Science, vol. 10, no. 12, pp. 3567-3572, 2019.

[44] T. Fu, C. Xiao, X. Li, L. M. Glass, and J. Sun, "MIMOSA: Multi-constraint molecule sampling for molecule optimization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 125-133, 2021.

[45] T. Fu and J. Sun, "SIPF: Sampling method for inverse protein folding," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 378-388, 2022.

[46] A. Kundu, A. Tagliasacchi, A. Y. Mak, A. Stone, C. Doersch, C. Oztireli, C. Herrmann, D. Gnanapragasam, D. Duckworth, D. Rebain, et al., "Kubric: A scalable dataset generator," 2022.

[47] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, "Structured3D: A large photo-realistic dataset for structured 3D modeling," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX, pp. 519-535, Springer, 2020.

[48] S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann, "Improving robustness using generated data," Advances in Neural Information Processing Systems, vol. 34, pp. 4218-4233, 2021.

[49] R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. Qi, "Is synthetic data from generative models ready for image recognition?," arXiv preprint arXiv:2210.07574, 2022.

[50] R. Srinivasan and K. Uchino, "Biases in generative art: A causal look from the lens of art history," CoRR, vol. abs/2010.13266, 2020.

[51] V. U. Prabhu and A. Birhane, "Large image datasets: A pyrrhic win for computer vision?," CoRR, vol. abs/2006.16923, 2020.
