Synthetic Data in AI: Challenges, Applications, and Ethical Implications
Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu,
Chunlin Zhong, Zhangjun Zhou, He Tang*
Deep learning models learn representations of data without the need for manually designing feature extractors. Their inherent nonlinearity makes them well-suited for adapting to the complex nonlinear relationships within data, enabling a more effective capture of the essential characteristics in the data. The following section outlines several typical approaches to data generation based on deep learning.
Variational Auto-Encoder (VAE). VAE [3] is a kind of probabilistic generative model, which employs the encoder-decoder architecture. The encoder maps the input data to the underlying latent space corresponding to the parameters of the variational distribution, and the decoder projects features from the latent space back into the input space. By capturing the distribution of latent space features, the VAE can generate multiple distinct samples that follow the same distribution. The inherent randomness of the VAE introduces a degree of diversity in the generated data, making them more representative of the complexity in real datasets.
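To make the encoder-decoder structure concrete, a minimal sketch in PyTorch is given below. The layer widths, the latent dimensionality, and the Bernoulli (binary cross-entropy) reconstruction term are illustrative assumptions, not details taken from [3].

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: the encoder maps inputs to latent Gaussian parameters,
    the decoder maps latent samples back to the input space."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon_term = F.binary_cross_entropy(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term

model = VAE()
x = torch.rand(8, 784)                       # placeholder training batch
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
# Generating synthetic samples: decode random draws from the prior.
with torch.no_grad():
    synthetic = model.dec(torch.randn(64, 16))

Sampling new data therefore reduces to decoding draws from the standard normal prior, which is precisely what gives the VAE its diversity.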
Generative Adversarial Networks (GAN). GAN was first proposed by Goodfellow et al. [4] in 2014 and consists of a generator and a discriminator. The generator randomly samples from the latent space in order to produce samples resembling the training set. Meanwhile, the discriminator determines whether a sample belongs to the real or the synthetic data. These two parts engage in a continual adversarial learning process and update their parameters alternately until the generator is able to synthesise high-quality samples that fool the discriminator.
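The alternating update scheme can be summarized in a short training step. The sketch below is schematic: the network sizes, optimizer settings, and the source of real_batch are chosen purely for illustration and are not taken from [4].

import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # 1) Update the discriminator: real samples -> 1, generated samples -> 0.
    fake = G(torch.randn(batch, latent_dim)).detach()
    loss_d = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # 2) Update the generator: try to make D label generated samples as real.
    fake = G(torch.randn(batch, latent_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Example call with a random placeholder batch standing in for real data.
train_step(torch.rand(16, data_dim) * 2 - 1)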
Diffusion models. The diffusion model [5] stands as a robust method for crafting synthetic datasets. It is rooted in the principles of diffusion processes: a forward process gradually corrupts training samples with noise over a sequence of steps, and a neural network is trained to reverse this corruption step by step, so that sampling from pure noise yields new data that follow the training distribution. By accurately capturing this dynamic evolution, diffusion models not only reproduce statistical characteristics but also adeptly mirror the nuanced patterns and relationships observed in authentic data. This approach proves invaluable for generating synthetic datasets, offering a sophisticated tool for simulating realistic data scenarios essential for the training and evaluation of machine learning models across diverse domains. To date, synthetic data generated from diffusion models has found extensive application across various vision tasks, notably enhancing performance in areas such as image classification [6], object detection [7], and semantic segmentation [8].
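For concreteness, the sketch below follows the denoising-diffusion recipe of [5] at a high level: a forward step that corrupts clean samples to a chosen noise level, and a training loss that asks a network to predict the added noise. The linear noise schedule, the toy noise-prediction network, and the tensor shapes are simplifying assumptions rather than the original implementation.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal-retention factors

def q_sample(x0, t, noise):
    """Forward process: corrupt clean data x0 to the noise level of step t."""
    a_bar = alphas_bar[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def diffusion_loss(model, x0):
    """Train the network to predict the noise that was added (DDPM objective)."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return torch.mean((model(x_t, t) - noise) ** 2)

class TinyDenoiser(torch.nn.Module):
    """Toy noise predictor; real systems use U-Nets conditioned on t."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, dim))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(1)
        return self.net(torch.cat([x_t, t_feat], dim=1))

loss = diffusion_loss(TinyDenoiser(8), torch.randn(16, 8))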
Large language models. In the past years, large language models (LLMs) have emerged as a revolutionary approach for generating synthetic datasets. For instance, models such as GPT-3.5, with their exceptional in-context learning capabilities and extensive pre-trained linguistic knowledge, exemplify the capacity of LLMs to produce synthetic datasets. This capability facilitates the training of models on smaller domains, effectively addressing the challenge of data scarcity in specific areas. The Generative Pre-trained Transformer (GPT) family [9-11] comprises a series of Natural Language Processing (NLP) models developed by OpenAI, employing the Transformer architecture to capture long-distance dependencies in input sequences. GPT is pre-trained without supervision on large-scale Internet text corpora, for example to predict the next word in a given context. This pre-training strategy enables the model to acquire a profound understanding of linguistic statistical structure and contextual relationships, and ultimately to perform a variety of natural language processing tasks in a pre-training-fine-tuning format. For example, Josifoski et al. [12] synthetically generate a dataset of 1.8M data points in a reverse manner and demonstrate the effectiveness of this approach on closed information extraction. Whitehouse et al. [13] utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets and assess the naturalness and logical coherence of the generated examples.
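A common pattern for LLM-based dataset synthesis is to prompt a model with a task description and a few seed examples, then parse and filter its completions into labelled records. The sketch below illustrates this pattern only; the prompt wording is invented for illustration, and call_llm is a placeholder for whichever chat-completion API is available rather than a function from any cited work.

import json

PROMPT_TEMPLATE = """You are generating training data for sentiment classification.
Produce {n} short product reviews as a JSON list of objects with
fields "text" and "label" ("positive" or "negative"). Seed examples:
{seed_examples}"""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual chat-completion API call (e.g., to GPT-3.5).
    Swap in the client of your choice; it only needs to return the raw text."""
    raise NotImplementedError

def generate_synthetic_reviews(seed_examples, n=50):
    prompt = PROMPT_TEMPLATE.format(n=n, seed_examples=json.dumps(seed_examples))
    raw = call_llm(prompt)
    records = json.loads(raw)  # real pipelines need more robust parsing
    # Light validation: keep only well-formed, non-duplicate records.
    seen, cleaned = set(), []
    for r in records:
        text, label = r.get("text", "").strip(), r.get("label")
        if text and label in {"positive", "negative"} and text not in seen:
            seen.add(text)
            cleaned.append({"text": text, "label": label})
    return cleaned

In practice the generated records are deduplicated, filtered for quality, and often mixed with a small amount of real data before fine-tuning a downstream model.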
3. The usage of synthetic data.

3.1. Existing synthetic datasets.

Synthetic data possesses several advantages that natural data lacks, making it an attractive choice in many fields. Compared to natural data, synthetic datasets are relatively easy to acquire and can provide data in rare or challenging scenarios, thereby addressing diversity issues in certain datasets. Additionally, this technology can effectively avoid privacy concerns, safeguarding user information security. As this technology gradually becomes more prominent, its practical applications are becoming increasingly widespread. This section will discuss several domains where generated data has had a significant impact.
Vision. Synthetic datasets can greatly address the challenge of acquiring natural data in certain domains. In the early days of computer vision, the generation of corresponding datasets relied primarily on computer graphics engines [14-17]. For example, in the re-identification (Re-ID) domain, Sun et al. [14] created the PersonX dataset using an engine based on Unity. This dataset includes three pure-color backgrounds and three scene backgrounds. It consists of 1266 hand-crafted identities (547 females and 719 males), with each identity having 36 viewpoints. At that time, this dataset effectively addressed the lack of multi-viewpoint data. Furthermore, there are also generation methods based on Generative Adversarial Networks (GANs) [18-21]. In [21], GIRAFFE introduces synthetic neural feature maps, enabling control over camera poses, object placements and orientations, as well as object shapes and appearances during generation. Moreover, GIRAFFE allows for the free addition of multiple objects in a scene, expanding the generated scenes from single-object to multi-object scenarios.
Audio. Synthetic data is widely employed in the field of audio, and its rapid development is truly remarkable. Take, for instance, WaveGAN, a generative adversarial network proposed by Donahue et al. [22] and demonstrated on the Speech Commands dataset. WaveGAN excels in synthesizing one-second audio waveform slices with global coherence, making it particularly well-suited for sound-effect generation. Even without labels, when trained on small-vocabulary speech datasets, WaveGAN adeptly learns to generate intelligible words and can extend its synthesis capabilities to audio from diverse domains, including drums, bird sounds, and piano. Furthermore, Zhang et al. introduced Stutter-TTS [23], tailored to tackle the performance challenges faced by existing Automatic Speech Recognition (ASR) interfaces when dealing with stuttered speech. Stutter-TTS is an end-to-end neural text-to-speech model with the proficiency to synthesize various forms of stuttered speech. It employs a simple yet effective prosody control strategy and incorporates additional markers during training to represent specific stuttering features. By strategically selecting the positions of these markers, Stutter-TTS provides word-level control over the occurrence of stuttering in the synthesized speech.
Natural Language Processing (NLP). The growing interest in synthetic data has propelled the thriving development of numerous deep generative models in the field of Natural Language Processing (NLP) [24]. In recent years, machine learning has demonstrated its formidable capabilities in tasks such as classification, routing, filtering, and information retrieval across various domains [25]. Addressing the challenge of synonym variations arising from contextual changes in NLP, researchers have introduced the BLEURT model. Built upon BERT, this model simulates human judgment by utilizing a limited set of training examples that may exhibit biases. To enhance the model's generalization, innovative pre-training approaches have been developed using millions of synthetic examples [26, 27]. Furthermore, RelGAN, developed at Rice University, represents a significant breakthrough in text generation using Generative Adversarial Networks (GANs). Comprising three key components (a relational memory-based generator, the Gumbel-Softmax relaxation algorithm, and multiple embedded representations in the discriminator), RelGAN outperforms several state-of-the-art models in benchmark tests, showcasing its remarkable performance in sample quality and diversity. This underscores its potential for further research and application across a wide range of NLP tasks and challenges [28, 29].
Health. In the realm of healthcare, the generation of synthetic data plays a pivotal role in comprehending diseases while upholding patient confidentiality and privacy [30]. Synthetic data possesses the capability to mirror the original data distribution without disclosing actual patient information [30-32]. A notable example in healthcare is MedGAN, a model introduced by Edward Choi et al., leveraging adversarial networks to generate realistic synthetic medical records. Through the integration of autoencoders and generative adversarial networks, MedGAN proficiently produces high-dimensional discrete variables (e.g., binary features and counting features) based on genuine medical records [16]. Synthetic patient records generated by MedGAN have demonstrated comparable performance to real data across various experiments, encompassing distribution statistics, predictive modeling tasks, and assessments by medical experts. Furthermore, synthetic data finds widespread application in the realm of drug discovery. The predominant approach involves learning the distribution of drug molecules from existing databases and subsequently deriving new samples (i.e., drug molecules) from the acquired knowledge of drug molecule distributions. Numerous implementations of this process exist, such as variational autoencoders (VAEs) [33-35], generative adversarial networks (GANs) [36], energy-based models (EBMs) [37, 38], diffusion models [39], reinforcement learning (RL) [40-42], genetic algorithms [43], sampling-based methods [44, 45], and others.
3.2. The data distribution of synthetic datasets.

In the rapid advancement of artificial intelligence and machine learning, synthetic datasets have become an essential resource. These datasets, typically algorithmically generated, are used for training and testing a variety of models. However, the generation process of synthetic datasets often harbors implicit issues, especially regarding the fairness and representativeness of data distribution. These issues can affect the performance of models and potentially lead to biases and discriminatory practices in real-world applications.
Distribution Issues in Synthetic Datasets. The generation of synthetic datasets often lacks sufficient consideration for demographic diversity, which can lead to unbalanced data distributions in terms of gender, age, race, etc. For instance, in the creation of datasets, if the data is primarily based on individuals from specific racial or age groups, the trained models might perform poorly when dealing with other groups. This situation can lead to severe discriminatory issues in real-world applications, such as certain racial groups being incorrectly identified or entirely overlooked. If a dataset used to train a facial recognition system disproportionately represents certain demographics over others, the resulting AI models may exhibit biased performance, leading to unfair or discriminatory outcomes. These biases in the data can manifest in various forms.
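One lightweight safeguard is to audit the demographic composition of a synthetic dataset against a reference distribution before it is used for training. The sketch below shows such a check; the attribute name, the reference proportions, and the disparity threshold are illustrative assumptions, not a prescribed standard.

from collections import Counter

def group_proportions(records, attribute):
    """Share of each group value (e.g., age band, gender) in a list of dict records."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def max_disparity(synthetic_props, reference_props):
    """Largest absolute gap between synthetic and reference group shares."""
    groups = set(synthetic_props) | set(reference_props)
    return max(abs(synthetic_props.get(g, 0.0) - reference_props.get(g, 0.0))
               for g in groups)

# Illustrative usage with made-up records and a made-up reference distribution.
synthetic = ([{"age_band": "18-30"}] * 70 + [{"age_band": "31-60"}] * 25
             + [{"age_band": "60+"}] * 5)
reference = {"18-30": 0.35, "31-60": 0.45, "60+": 0.20}
props = group_proportions(synthetic, "age_band")
if max_disparity(props, reference) > 0.10:   # threshold is an arbitrary example
    print("Warning: synthetic data is demographically unbalanced:", props)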
Table 1. Summarization of Representative Works in Synthetic Data Generation.
5. Conclusions

In summary, the use of synthetic datasets in AI introduces a range of potential risks that require careful assessment and management.

5.1. New Approaches to Synthetic Data.

To address the current issues associated with synthetic data generation, there is a need for continuous development and adoption of new methods.
Adopting more advanced generative models. One potential approach involves the use of advanced generative models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). These models possess stronger learning capabilities, allowing for more accurate modeling of the complex distribution of real-world data. By employing these advanced models, it becomes possible to better avoid distribution-shift issues, enhance the diversity of generated data, and more effectively simulate the noise and uncertainty present in the real world.
Integrating domain-specific expertise to enhance the realism of synthetic data. Integrating domain-specific knowledge, such as computer graphics, physics, and cognitive science, can contribute to improving the realism of synthetic data. A deeper understanding of the physical laws behind scenes and of cognitive processes can lead to more precise generation of various scenarios, making synthetic data closer to real-world situations.

5.2. Regulating the Use of Synthetic Data.

To regulate the use of synthetic data, establishing a set of clear guidelines is crucial.
Establishing industry standards. Industry standards should be developed to outline best practices for synthetic data generation and use.
References

[1] F. Thabtah, S. Hammoud, F. Kamalov, and A. Gonsalves, "Data imbalance in classification: Experimental evaluation," Information Sciences, vol. 513, pp. 429-441, 2020.
[2] M. Favaretto, E. De Clercq, and B. S. Elger, "Big data and discrimination: perils, promises and solutions. A systematic review," Journal of Big Data, vol. 6, no. 1, pp. 1-27, 2019.
[3] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139-144, 2020.
[5] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840-6851, 2020.
[6] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, "Synthetic data from diffusion models improves imagenet classification," arXiv preprint arXiv:2304.08466, 2023.
[7] H. Fang, B. Han, S. Zhang, S. Zhou, C. Hu, and W.-M. Ye, "Data augmentation for object detection via controllable diffusion models," in WACV, 2024.
[8] Q. H. Nguyen, T. T. Vu, A. T. Tran, and K. Nguyen, "Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[9] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., "Improving language understanding by generative pre-training," 2018.
[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[12] M. Josifoski, M. Sakota, M. Peyrard, and R. West, "Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 1555-1574, Association for Computational Linguistics, Dec. 2023.
[13] C. Whitehouse, M. Choudhury, and A. Aji, "LLM-powered data augmentation for enhanced cross-lingual performance," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino, and K. Bali, eds.), Singapore, pp. 671-686, Association for Computational Linguistics, Dec. 2023.
[14] X. Sun and L. Zheng, "Dissecting person re-identification from the viewpoint of viewpoint," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 608-617, 2019.
[15] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon, "Simulating content consistent vehicle datasets with attribute descent," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, pp. 775-791, Springer, 2020.
[16] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang, "PAMTRI: Pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 211-220, 2019.
[17] Q. Wang, J. Gao, W. Lin, and Y. Yuan, "Learning from synthetic data for crowd counting in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8198-8207, 2019.
[18] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.
[19] R. Torkzadehmahani, P. Kairouz, and B. Paten, "DP-CGAN: Differentially private synthetic data and label generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[20] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," in International Conference on Learning Representations, 2018.
[21] M. Niemeyer and A. Geiger, "GIRAFFE: Representing scenes as compositional generative neural feature fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11453-11464, 2021.
[22] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.
[23] X. Zhang, I. Vallés-Pérez, A. Stolcke, C. Yu, J. Droppo, O. Shonibare, R. Barra-Chicote, and V. Ravichandran, "Stutter-TTS: Controlled synthesis and improved recognition of stuttered speech," arXiv preprint arXiv:2211.09731, 2022.
[24] C. Dewi, R.-C. Chen, Y.-T. Liu, and S.-K. Tai, "Synthetic data generation using DCGAN for improved traffic sign recognition," Neural Computing and Applications, vol. 34, no. 24, pp. 21465-21480, 2022.
[25] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.
[26] X. Yue, H. A. Inan, X. Li, G. Kumar, J. McAnallen, H. Sun, D. Levitan, and R. Sim, "Synthetic text generation with differential privacy: A simple and practical recipe," arXiv preprint arXiv:2210.14348, 2022.
[27] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, "Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[28] W. Nie, N. Narodytska, and A. Patel, "RelGAN: Relational generative adversarial networks for text generation," in International Conference on Learning Representations, 2018.
[29] Z. Zhao, A. Zhu, Z. Zeng, B. Veeravalli, and C. Guan, "ACT-Net: Asymmetric co-teacher network for semi-supervised memory-efficient medical image segmentation," in 2022 IEEE International Conference on Image Processing (ICIP), pp. 1426-1430, IEEE, 2022.
[30] J. Dahmen and D. Cook, "SynSys: A synthetic data generation system for healthcare applications," Sensors, vol. 19, no. 5, 2019.
[31] Y. Lu, Y. T. Chang, E. P. Hoffman, G. Yu, and Y. Wang, "Integrated identification of disease specific pathways using multi-omics data," Cold Spring Harbor Laboratory, 2019.
[32] Z. Wang, P. Myles, and A. Tucker, "Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy," Computational Intelligence, no. 3, 2021.
[33] W. Jin, R. Barzilay, and T. Jaakkola, "Junction tree variational autoencoder for molecular graph generation," 2018.
[34] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, "Automatic chemical design using a data-driven continuous representation of molecules," ACS Central Science, vol. 4, no. 2, pp. 268-276, 2018.
[35] B. Zhang, Y. Fu, Y. Lu, Z. Zhang, R. Clarke, J. E. Van Eyk, D. M. Herrington, and Y. Wang, "DDN2.0: R and python packages for differential dependency network analysis of biological systems," bioRxiv, 2021.
[36] N. De Cao and T. Kipf, "MolGAN: An implicit generative model for small molecular graphs," arXiv preprint arXiv:1805.11973, 2018.
[37] T. Fu and J. Sun, "Antibody complementarity determining regions (CDRs) design using constrained energy model," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 389-399, 2022.
[38] T. Fu, W. Gao, C. Xiao, J. Yasonik, C. W. Coley, and J. Sun, "Differentiable scaffolding tree for molecular optimization," arXiv preprint arXiv:2109.10469, 2021.
[39] M. Xu, L. Yu, Y. Song, C. Shi, S. Ermon, and J. Tang, "GeoDiff: A geometric diffusion model for molecular conformation generation," arXiv preprint arXiv:2203.02923, 2022.
[40] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, "Optimization of molecules via deep reinforcement learning," Scientific Reports, vol. 9, no. 1, p. 10752, 2019.
[41] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, "Molecular de-novo design through deep reinforcement learning," Journal of Cheminformatics, vol. 9, no. 1, pp. 1-14, 2017.
[42] T. Fu, W. Gao, C. Coley, and J. Sun, "Reinforced genetic algorithm for structure-based drug design," Advances in Neural Information Processing Systems, vol. 35, pp. 12325-12338, 2022.
[43] J. H. Jensen, "A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space," Chemical Science, vol. 10, no. 12, pp. 3567-3572, 2019.
[44] T. Fu, C. Xiao, X. Li, L. M. Glass, and J. Sun, "MIMOSA: Multi-constraint molecule sampling for molecule optimization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 125-133, 2021.
[45] T. Fu and J. Sun, "SIPF: Sampling method for inverse protein folding," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 378-388, 2022.
[46] A. Kundu, A. Tagliasacchi, A. Y. Mak, A. Stone, C. Doersch, C. Oztireli, C. Herrmann, D. Gnanapragasam, D. Duckworth, D. Rebain, et al., "Kubric: A scalable dataset generator," 2022.
[47] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou, "Structured3D: A large photo-realistic dataset for structured 3D modeling," in Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX, pp. 519-535, Springer, 2020.
[48] S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann, "Improving robustness using generated data," Advances in Neural Information Processing Systems, vol. 34, pp. 4218-4233, 2021.
[49] R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. Qi, "Is synthetic data from generative models ready for image recognition?," arXiv preprint arXiv:2210.07574, 2022.
[50] R. Srinivasan and K. Uchino, "Biases in generative art: A causal look from the lens of art history," CoRR, vol. abs/2010.13266, 2020.
[51] V. U. Prabhu and A. Birhane, "Large image datasets: A pyrrhic win for computer vision?," CoRR, vol. abs/2006.16923, 2020.