Data For AI
About UNIDIR
The United Nations Institute for Disarmament Research (UNIDIR) is a voluntarily funded, autonomous
institute within the United Nations. One of the few policy institutes worldwide focusing on disarmament,
UNIDIR generates knowledge and promotes dialogue and action on disarmament and security. Based
in Geneva, UNIDIR assists the international community to develop the practical, innovative ideas
needed to find solutions to critical security problems.
Citation
H. Deng, Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer,
Geneva, Switzerland: UNIDIR, 2023.
Note
The designations employed and the presentation of the material in this publication do not imply the
expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning
the legal status of any country, territory, city or area, or of its authorities, or concerning the delimitation
of its frontiers or boundaries. The views expressed in the publication are the sole responsibility of the
individual authors. They do not necessarily reflect the views or opinions of the United Nations, UNIDIR,
its staff members or sponsors.
Author
Harry Deng (@hwrdeng) is a Consultant for the Security and Technology Programme at
UNIDIR, where his work focuses on the international security implications of emerging
technologies. He holds a master’s degree in global governance from the University of
Waterloo, where he is currently a PhD candidate.
E X P L O R I N G SY N T H E T I C D ATA F O R A R T I F I C I A L I N T E L L I G E N C E A N D A U T O N O M O U S SYS T E M S 2
Acronyms & Abbreviations
3D Dull, Dirty and Dangerous
AI Artificial Intelligence
Contents
Executive Summary
1. Introduction
4. Conclusion
Bibliography
Executive Summary
Advances in the field of artificial intelligence (AI) and machine learning in recent years have created
unprecedented opportunities to augment human capabilities and to improve the functionality of
various autonomous systems, including in the field of international security. Yet, there is a scarcity of
the high-quality, highly diverse and relevant real-world data sets that are needed to train increasingly
complex AI systems in the defence sector. As a consequence, synthetic data is gradually becoming
an essential tool in the data toolbox to develop and train AI systems. The characteristics and potential
benefits of synthetic data, along with proven application of the technology in various sectors, make it a
relevant topic for debates surrounding the use of AI within the context of international security.
This primer provides a brief overview of synthetic data, including its characteristics, how it is generated,
the value that it adds, its risks, and its potential use cases for defence organizations and military
operations. In addition, the primer provides an outline of existing data challenges and limitations that have
facilitated the emergence of synthetic data as an important tool for the development of increasingly
complex AI systems.
The use of synthetic data within the context of international security has so far mostly remained
experimental and exploratory. However, the features of synthetic data could have a beneficial effect on
training AI systems. In particular, synthetic data allows for the generation of highly diverse or even novel
data sets, fine-grained control of data attributes, automatic annotation or data labelling where necessary,
and cost-effectiveness. This primer examines how the main characteristics of synthetic data could
benefit militaries and defence organizations by allowing them to integrate more capable and reliable AI
systems in both defensive and offensive autonomous systems.
While synthetic data can be beneficial for training AI systems and could help alleviate some of the data
issues faced by militaries and defence organizations, it is not a silver bullet, and it comes with risks and
challenges. The benefits accrued from using synthetic data will depend on the ability of organizations
to navigate these risks in order for AI systems trained on synthetic data to be used in a responsible and
safe manner and in accordance with legal requirements and ethical values.
1. Introduction
Advances in artificial intelligence (AI), along with the machine learning models that support its use, have made it ubiquitous when optimizing performance for increasingly complex tasks and complicated working environments. Yet the integration of AI introduces unprecedented legal, ethical, safety and security challenges – and this is especially relevant in the international security context.1 Within this context, AI is being explored as a tool for decision support, operational planning and intelligence analysis. It could also be integrated into both offensive and defensive autonomous systems, such as target-identification systems, swarm robotics and cyber operations. Moreover, it has been asserted that AI could perform certain tasks better than traditional methods (e.g., defensive cyber infrastructures or intelligence analysis),2 meaning that states could be better equipped to uphold their international legal obligations, specifically international humanitarian law, in addition to enhancing operational effectiveness.
At the same time, the downstream effect of the tasks envisioned for AI means that machine learning models require ever increasing diversity, volume and velocity of high-quality data, often high-quality labelled data. Without the necessary diversity, volume, and velocity of high-quality data to train complex AI systems, such systems could see increases in failures, including unintended harms. Labelled data explicitly informs the machine learning model what the data means rather than leaving the model to figure it out by itself and possibly get it wrong. However, the scarcity of high-quality real-world data along with the privacy, legal, regulatory and cost challenges associated with sensitive data make real-world data generally unsuitable for training increasingly complex AI systems,3 particularly in the international security context. Because of this scarcity, synthetic data is gradually becoming an essential tool to develop, improve and condition increasingly complex AI systems. In particular, it can provide data where there is none, can counterbalance various forms of bias in real-world data and can automatically label data where necessary, among other things.4
1 The First Committee of the United Nations General Assembly defines “international security” as “global challenges and threats to peace that affect the international community”. See United Nations General Assembly, “Disarmament and International Security (First Committee)”, https://www.un.org/en/ga/first/.
2 A. Wilner, “AI and the Future of Deterrence: Promises and Pitfalls”, Centre for International Governance Innovation, 28 November 2022, https://www.cigionline.org/articles/ai-and-the-future-of-deterrence-promises-and-pitfalls/; Defence Innovation Board, “AI Principles”, 2019, https://media.defense.gov/2019/Oct/31/2002204458/-1/-1/0/DIB_AI_PRINCIPLES_PRIMARY_DOCUMENT.PDF.
3 J. Yan et al., “Synthetic Dataset Generation and Adaptation for Human Detection”, DEVCOM Army Research Laboratory, November 2020, https://apps.dtic.mil/sti/pdfs/AD1115446.pdf; A. Holland, “Known Unknowns: Data Issues and Military Autonomous Systems”, UNIDIR, 17 May 2021, https://unidir.org/known-unknowns; Government Business Council, “Advancing ISR at the Edge: A Survey on Networks and Processing Technologies in the Digital Battlespace”, July 2020, http://cdn.govexec.com/media/advancing-isr-at-the-edge-isr.pdf.
However, the implications of using synthetic data in autonomous systems remain unexplored in relevant United Nations security processes, such as the Group of Governmental Experts (GGE) on emerging technologies in the area of lethal autonomous weapons systems (LAWS) or the Open-ended Working Group (OEWG) on security of and in the use of information and communications technologies (ICTs). Yet the value-added and the risks associated with synthetic data are relevant to these discussions as well as other debates on the use of AI within the context of international security. For example, some parties involved in debate in the GGE on LAWS are concerned about the possibility that increased autonomy in weapon systems could lead to increases in unintended harms due to a lack of training data to appropriately train such systems.5 Additionally, participants in the OEWG on ICTs have discussed the possibility that AI-powered cyberattacks could autonomously adapt to defensive cyber measures, making them more difficult to detect and mitigate.6 AI-powered cyberattacks can indeed be enabled and augmented by the generation and use of synthetic data. It is therefore essential that the use of synthetic data in autonomous systems does not derogate from any commitments to international law (e.g., international humanitarian law) or Responsible AI – that is, the broad approach to the development and use of AI to ensure that AI systems are lawful, ethical, safe, secure and responsible.7
As such, this primer aims to provide policymakers and diplomats engaged in international security discussions with an introductory overview of synthetic data. It describes the main characteristics, value-added, risks and relevance of synthetic data within the context of international security, particularly as an enabler of autonomy. The primer further attempts to demonstrate the growing importance of synthetic data as well as the evolving paradigm of data usage and governance in the international security context. It does this by illuminating and mapping out the peculiarities of synthetic data, then re-connecting it to existing data challenges.
5 Group of Governmental Experts on Emerging Technologies in the Area of Lethal Autonomous Weapons Systems, “Proposal for an International Instrument on Lethal Autonomous Weapons (LAWS)”, Submitted by Pakistan, CCW/GGE.1/2023/WP.3/Rev.1, 8 March 2023, http://undocs.org/en/CCW/GGE.1/2023/WP.3/Rev.1; “State of Palestine’s Proposal for the Normative and Operational Framework on Autonomous Weapons Systems”, Submitted by Palestine, CCW/GGE.1/2023/WP.2/Rev.1, 3 March 2023, http://undocs.org/en/CCW/GGE.1/2023/WP.2/Rev.1.
6 H. Alkhzaimi, “Contribution to the Fifth Substantive Session by Emerging Research and Security Center, NYU/NYUAD”, NGO Working Papers, 28 July 2023, https://docs-library.unoda.org/Open-Ended_Working_Group_on_Information_and_Communication_Technologies_-_(2021)/Stakeholder_Recommendation_for_Open-ended_working_group_on_security_APR.pdf.
7 A. Anand and H. Deng, “Towards Responsible AI in Defence: A Mapping and Comparative Analysis of AI Principles Adopted by States”, UNIDIR, 13 February 2023, https://unidir.org/publication/towards-responsible-ai-defence-mapping-and-comparative-analysis-ai-principles-adopted.
2. Understanding Synthetic Data
Highlights
• Unlike real-world data, which refers to data and inputs derived from the real world, synthetic data is
artificially created in the digital world. It often seeks to reproduce the characteristics and properties
of an existing set of data or to produce data based on existing knowledge.
• The purpose of synthetic data is to improve the quality and utility of training data sets. It is critical
that the data on which autonomous systems are trained is of sufficient quality and diversity in order
to avoid unintended harms, especially within the context of international security.
• Defence organizations currently face a myriad of data management challenges, thereby limiting the
quality and utility of real-world data to train increasingly complex AI and autonomous systems.
• While synthetic data may not be a silver bullet that resolves all existing data challenges within
defence organizations, it may provide a means to improve the quality and utility of training data sets.
8 S. Kannan, “Synthetic Time Series Data Generation for Edge Analytics”, F1000 Research, 20 January 2022, https://doi.org/10.12688/f1000research.72984.1.
if not identical, results when undergoing the same statistical analysis.9
In short, synthetic data is often information that is artificially generated to represent the original data it either seeks to replace (thereby providing an equivalent function) or complement (thus improving the value of the training data set). It is also possible to enhance a training data set by generating synthetic data that does not reproduce the characteristics of the original data set, but instead exaggerates certain characteristics (see section 3.1).
2.2 Existing Data Challenges
Defence organizations and militaries face challenges in obtaining sufficient data of adequate quality. The data challenges are not only technical, but also organizational.11 This means that defence organizations cannot simply “engineer their way out” of their shortcomings using technical solutions. Instead, efforts to address data challenges within defence organizations should also consider the impact of organizational culture, policies and procedures. Ultimately, any use of autonomous systems, particularly autonomous systems in combat environments or autonomous systems intended to engage human targets, hinges on the responsibility to anticipate and respond to data issues in order to avoid unintended harms.
While synthetic data may not be a silver bullet that alleviates all existing data challenges, it could provide a means to improve the quality and utility of training data sets, especially in increasingly complicated and opaque machine learning models where data issues may not be easily revealed. This section looks in turn at two strands of data challenges faced by defence organizations that synthetic data can address: first data management, and then data quality.
12 Ibid, 4.
13 Ibid.
data.14 This indicates shortcomings in establishing the appropriate processes to correctly label data, store it in the appropriate databases, and ensure appropriate access and availability. It may also indicate a struggle to balance the need to protect sensitive or classified data with the need to share that data with those who may benefit from exploiting it.
In fact, siloed data, the compartmentalization of mutually exclusive departments, limited bandwidth, and limited tagging of data were noted by DoD personnel to be some of the most prevalent challenges for their organization’s ability to effectively collect, disseminate and analyse data.15 Regarding data labelling, only 32 per cent of defence civilians and 13 per cent of active-duty personnel said that their defence agency had the systems in place to effectively label data.16 The number of personnel required along with the necessary processes and infrastructure to monitor and manage incoming data has failed to keep pace with the ever-increasing volume of data. The paucity of data quality control – either ex ante or ex post – means that analysts are drowning in data and, by the time the correct set of data has been obtained, it is often obsolete and unreliable.
Similar data management challenges are also mentioned in Australia’s 2021 Defence Data Strategy,17 the United Kingdom’s 2021 Data Strategy for Defence18 and Canada’s 2021 Department of National Defence Data Strategy,19 as well as in research on information networks conducted by the Indonesian Defence University.20 Common issues found across these documents include, for example, challenges with data visibility, siloed data, lack of common data standards within and across organizations, and cultural issues of not considering data requirements in the initial phase of capabilities development.
14 Ibid, 8.
15 Ibid, 15.
16 Ibid.
18 British Ministry of Defence, “Data Strategy for Defence: Delivering the Defence Data Framework and Exploiting the Power of Data”, 27 September 2021, https://www.gov.uk/government/publications/data-strategy-for-defence/data-strategy-for-defence.
19 Canadian Department of National Defence, “The Department of National Defence and Canadian Armed Forces Data Strategy”, 18 May 2021, https://www.canada.ca/en/department-national-defence/corporate/reports-publications/data-strategy/data-strategy.html.
20 P.A. Udayana et al., “Strategy for Integrated Land Information System Network Arrangements for the Indonesian National Army”, 2022, https://doi.org/10.33172/jspd.v8i1.1054.
2.2.2 Strand 2 – Data Quality
Poor data management contributes to the second strand of issues – poor data quality. Common data quality issues include incomplete data, unlabelled data, poisoned or spoofed data, inaccurate data, data bias, and discrepant data. Poor data quality can be the result of external factors, such as harsh conditions (e.g., dust, smoke, vibrations, contaminants, camouflage, wear and tear of sensors, etc.) and adversarial actions (e.g., signal jamming, data poisoning, attacks on sensors, unanticipated tactics, etc.). However, proper data management practices can help filter out data that has been corrupted in order to avoid creating distortions or biases in the training data set and to ensure that the right data reaches the appropriate entities.
Different autonomous systems will face different types of data quality issue. For example, an autonomous system used in a defensive cyber operation is less likely to face issues arising from harsh conditions (e.g., dust, smoke, contaminants, etc.), but is likely to face adversarial actions such as spoofing or data poisoning. In contrast, an uncrewed vehicle placed in an “uncontrolled” multivariate combat environment may face both harsh conditions and adversarial actions.
If autonomous systems rely on the data they are trained on in order to navigate, respond to and manipulate their environment, it is critical that the data is of sufficient quality and diversity.21 It is, however, important to note that not all AI systems rely on being trained by data; reinforcement learning models – which use a reward function to learn the consequences of actions taken – can also be used.22
Yet, a certain amount of poor quality data should be expected in any large real-world data set, especially in the international security context where adversarial environments, whether in the digital space or the physical space, pose a wide range of challenges to the collection of complete high-quality data.23 As such, it has been suggested that synthetic data could play an important role in alleviating some of the pressures of collecting quality real-world data, for example by filling in where data is missing due to sensor failures.24
2.3 Methods of Generating Synthetic Data
Synthetic data can be generated by leveraging various techniques, such as decision trees or deep-learning algorithms. As a proxy, synthetic data can be classified according to the type of the original data:
• Real-world data
properties of the original data set are extracted and replicated depend on the type and structure of the original data. There are three broad methods of synthetic data generation:
• Rules-based methods, which have pre-defined data structures
25 In programming, an array is a data structure that consists of a collection of values or variables (e.g., numbers, words, objects,
etc.) formatted and sorted according to their type. The purpose of an array is to store multiple pieces of data of the same type
together.
data set in training a machine learning model to predict the four AQI parameters specified in the data sets.26 Kannan concludes that the improved performance of the synthetic data set could be the result of the synthetic data set “filling in” the incomplete data contained in the original data set.27
There are, however, limitations to using a rules-based method to generate synthetic data sets. Most notable are challenges with scalability, drift and bias.28 Regarding scalability, the more complex a synthetic data set is (e.g., if the synthetic data set requires thousands of interdependent and intertwined rules), the more complicated and abstruse it is to generate. This limits the practicality of using a rules-based method for more complex or intricate relational networks. Relatedly, data drift – that is, the shift in data distribution over time – may limit the practicality of rules-based methods, particularly if there is no well-established change-management system to govern how the rules are changed to fit their application. Lastly, since the rules are defined by humans, the bias of the developer is reflected in the generated data, whether it is conscious (e.g., business logic) or unconscious (e.g., gender bias29).
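A rules-based generator and the drift limitation described above can be illustrated with a minimal sketch. The field names, value ranges and the labelling threshold below are hypothetical and chosen only for illustration; they are not taken from the data set discussed in this section.

```python
import random

# Hypothetical rules for illustration only: each field maps to a (low, high) range.
RULES = {
    "pm25": (5.0, 150.0),
    "no2": (2.0, 80.0),
}

def label_reading(pm25):
    """Rule-derived label, attached automatically when the record is generated."""
    return "poor" if pm25 > 100.0 else "acceptable"

def generate(n, rules=RULES, seed=0):
    """Draw n synthetic records by sampling each field uniformly within its rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {name: round(rng.uniform(lo, hi), 1) for name, (lo, hi) in rules.items()}
        row["label"] = label_reading(row["pm25"])
        rows.append(row)
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

synthetic = generate(1000)

# Drift: if the phenomenon being modelled shifts, data generated from the old
# rules no longer matches it. Here the shift is imitated with a second rule set.
drifted_rules = dict(RULES, pm25=(60.0, 250.0))
drifted = generate(1000, rules=drifted_rules, seed=1)
drift_gap = mean(r["pm25"] for r in drifted) - mean(r["pm25"] for r in synthetic)
print(len(synthetic), synthetic[0]["label"], round(drift_gap, 1))
```

Every record carries its label from the moment it is created, which is the automatic annotation benefit. The gap between the two generated sets shows why rules need a change-management process: the generator keeps producing data from the old distribution until a person updates the rules.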
27 Kannan notes that the incomplete data in the original AQI data set is due to sensor failure at one of the stations that caused
partial recording of the data. See Ibid.
28 M. Pasieka, “A Comparison of Synthetic Data Generation Methods and Synthetic Data Types”, Mostly AI, 1 September 2022, https://mostly.ai/blog/comparison-of-synthetic-data-types/.
29 K. Chandler, “Does Military AI Have Gender? Understanding Bias and Promoting Ethical Approaches in Military Applications of AI”, UNIDIR, 7 December 2021, https://doi.org/10.37559/GEN/2021/04.
30 E. Bonabeau, “Agent-Based Modeling: Methods and Techniques for Simulating Human Systems”, PNAS, 14 May 2002, https://doi.org/10.1073/pnas.082080899.
31 Ibid.
Example32
An autonomous system that could identify people who are escaping a flood by standing on their roofs
would be useful for humanitarian aid and disaster recovery (HADR) operations. However, since such
a situation may occur only infrequently in real life, there is very little data that can be used to train an
autonomous system to identify this scenario. An agent-based model may simulate such a scenario.
By generating high-fidelity and highly diverse synthetic data to augment the training data set with rare
data points, the synthetic data generated from an agent-based model could potentially be beneficial in
training autonomous systems for such scenarios.
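The flood example can be sketched as a toy agent-based model. This is an illustrative simplification, not a high-fidelity simulation: the Household class, the floor-height range and the water-rise rate are all hypothetical assumptions introduced here.

```python
import random

class Household:
    """One agent: a household that moves to the roof once water reaches its floor."""

    def __init__(self, rng):
        self.floor_height = rng.uniform(0.2, 2.0)  # metres; hypothetical range
        self.on_roof = False

    def step(self, water_level):
        # Simple behavioural rule: once on the roof, the household stays there.
        if water_level >= self.floor_height:
            self.on_roof = True

def simulate(num_households=200, steps=10, rise_per_step=0.25, seed=0):
    """Run the flood and record one labelled observation per agent per step."""
    rng = random.Random(seed)
    agents = [Household(rng) for _ in range(num_households)]
    records = []
    for t in range(1, steps + 1):
        water = t * rise_per_step
        for idx, agent in enumerate(agents):
            agent.step(water)
            records.append({
                "step": t,
                "household": idx,
                "water_level": water,
                "person_on_roof": agent.on_roof,  # label comes from the simulation state
            })
    return records

records = simulate()
positives = sum(r["person_on_roof"] for r in records)
print(len(records), positives)
```

Because the simulation knows the state of every agent, each observation is labelled without manual annotation, and rare situations can be produced in whatever quantity the training data set needs by changing the parameters.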
Figure 2. Generative Adversarial Network34
Generative adversarial networks (GANs) are commonly used for image recognition and image generation.35 They are usually composed of two neural networks,36 one a generator network and the other a discriminator network, that train each other on an iterative basis (see Figure 2). The generator network produces a synthetic data point (e.g., an image) with the same characteristics as the training data, which serves as an input to the discriminator. The discriminator network, which receives batches of both training data and synthetic data, then tries to classify the observations as real or generated. The generator network improves its performance over time based on the feedback it receives from the discriminator network. The two networks converge when the discriminator is no longer able to differentiate between the “real” data and the synthetically generated data.37
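The interplay between the two networks is commonly written as a two-player minimax objective. The formulation below is the standard one from the GAN literature rather than something stated in this primer, and the symbols are assumptions introduced here: G is the generator, D the discriminator, x a real training sample, and z a random noise input to the generator.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator tries to maximize V by assigning high probability to real samples and low probability to generated ones, while the generator tries to minimize it. At the idealized point of convergence, D outputs 1/2 for every input, which corresponds to the discriminator no longer being able to tell real data from generated data.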
34 T. Silva, “A Short Introduction to Generative Adversarial Networks”, Thalles’ Blog, 7 June 2017, https://sthalles.github.io/intro-to-gans/.
36 A neural network, also known as artificial neural network, is a network of interconnected layers of nodes that transmit information from one layer to another and each layer performs a different function on its inputs. Neural networks rely on training data to learn, and their performance improves over time. See IBM, “What are Neural Networks?”, https://www.ibm.com/topics/neural-networks.
37 J. Hradec et al., “Multipurpose Synthetic Population or Policy Application”, European Commission Joint Research Centre, 13 April 2022, 14, https://doi.org/10.2760/50072.
Figure 3. Variational Autoencoder38
Variational autoencoders (VAEs) are a type of likelihood-based generative model. VAEs are composed of an encoder and a decoder (see Figure 3). The encoder ingests data and simplifies it into a compact form (known as the “latent representation”) that represents the key features of the data. The decoder takes in the latent representation and returns a reconstruction of the original data. Like GANs, VAEs function on an iterative basis. At each iteration, the VAE ingests data, which is then compared with the encoder–decoder output. The essential function of a VAE, then, is to learn the optimal encoding–decoding scheme through iterative optimization. As such, more complex VAE architectures can support higher dimensionality reduction (i.e., learning the key features) while keeping reconstruction errors low.39
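The comparison between input and encoder–decoder output is usually formalized as the evidence lower bound (ELBO), which a VAE maximizes during training. This is the standard formulation from the VAE literature; the notation is an assumption introduced here: x is an input, z its latent representation, q_phi the encoder, p_theta the decoder, and p(z) a fixed prior over latent representations.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_{\phi}(z \mid x)} \bigl[ \log p_{\theta}(x \mid z) \bigr]
  - D_{\mathrm{KL}} \bigl( q_{\phi}(z \mid x) \,\|\, p(z) \bigr)
```

The first term rewards accurate reconstruction of the input from its latent representation. The second term keeps the encoder's latent representations close to the prior, which is what makes it possible to generate new synthetic samples by drawing z from p(z) and passing it through the decoder.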
39 D.P. Kingma and M. Welling, “An Introduction to Variational Autoencoders”, Foundations and Trends in Machine Learning, 2019, https://arxiv.org/pdf/1906.02691.pdf.
Figure 4. Diffusion Model40 (figure labels: noising, denoising)
Diffusion models are an emerging class of deep-learning models that produce data, such as images, from a training distribution via an iterative denoising41 process. In other words, diffusion models work by corrupting an image (e.g., by adding noise), which the model then learns how to remove (or denoise) in order to generate a coherent image (see Figure 4). The model can then generate variations of that image by introducing different noise to the otherwise coherent image. While GANs were a breakthrough technology that enabled the production of high-fidelity images at scale, diffusion models have largely displaced GANs in recent years.42
It has been posited that, due to their ease of use and their ability to synthesize novel high-fidelity images that are ostensibly unlike their training data, diffusion models are the de facto method for generating large-scale images.43 Popular diffusion models include DALL-E 2 and Stable Diffusion.
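The noising half of this process can be sketched in a few lines. The example below assumes a linear noise schedule of the kind used in common denoising diffusion formulations; the schedule values, the function names and the four-value stand-in for an image are assumptions for illustration, and the learned denoising network is deliberately omitted.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Closed-form noising: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * e
          for v, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for the pixel values of a clean image
alpha_bars = alpha_bar_schedule(1000)

x_early, _ = add_noise(x0, alpha_bars[0], rng)          # barely corrupted
x_late, eps_late = add_noise(x0, alpha_bars[-1], rng)   # almost pure noise
```

During training, a model is shown the noised sample together with the step index and learns to predict the noise that was added. Generation then runs the process in reverse, starting from pure noise and removing a little of the predicted noise at each step.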
40 A. Vahdat and K. Kreis, “Improving Diffusion Models as an Alternative to GANs, Part 1”, NVIDIA, 26 April 2022, https://developer.nvidia.com/blog/improving-diffusion-models-as-an-alternative-to-gans-part-1/.
41 The term “denoising” refers to the process of removing imperfections and defects from audio-visual data in order to restore the actual features and characteristics. See L. Fan et al., “Brief Review of Image Denoising Techniques”, Visual Computing for Industry, Biomedicine, and Art, vol. 2 (2019), https://doi.org/10.1186/s42492-019-0016-7.
42 N. Carlini et al., “Extracting Training Data from Diffusion Models”, arXiv:2301.13188, 30 January 2023, 1, https://arxiv.org/abs/2301.13188.
43 Ibid.
3. Synthetic Data and International Security
Highlights
• Militaries and defence organizations could benefit from continued advances in AI and autonomous
systems. Ensuring that AI and autonomous systems are properly trained prior to deployment and
use is of critical importance in the international security context.
• Benefits of synthetic data include highly diverse data sets, shortened training cycles, fine-grained
control and flexibility, the ability to generate hypothetical data, and the identification and correction
of skewed data sets, among others.
• Synthetic data may also eliminate legal challenges related to collecting, storing, disseminating and
disposing of sensitive data, thus potentially allowing for more sharing of sensitive data among allies.
• The use of synthetic data comes with its own set of risks. These include difficulties in fully replicating
the complex physics of the real world, data poisoning, unintended biases, and lower levels of
privacy associated with some synthetic data-generation techniques.
• While these risks may also be applicable to real-world data sets, synthetic data may expand the
potential for some of these risks. Thus, it is critical to establish processes to ensure the reliability
and quality of synthetic data sets.
44 P. Scharre, “Robotics on the Battlefield Part II: The Coming Swarm”, Center for a New American Security, 15 October 2014, https://www.cnas.org/publications/reports/robotics-on-the-battlefield-part-ii-the-coming-swarm.
by the autonomous system in order to reduce the latency between analysis and any actions taken. In other words, the autonomous system that collects the data is the same system that performs the analysis and delivers an output – a process called “edge analytics”.45 The ability of autonomous systems to be placed at the edge has become an increasingly important component of technical solutions for various military applications. Yet, the development of autonomous systems intended to be placed at the edge (e.g., uncrewed vehicles used in military operations) is unlike that of autonomous systems in other contexts (e.g., those used in cyber operations) due to hardware limitations.46 To be sure, the issue is not necessarily a lack of data, but rather a lack of high-quality labelled data as a result of the lack of data-collection hardware. There is also the issue of a lack of diversity of data collected by autonomous systems, which are fielded for specific operational functions, not data collection. For example, an uncrewed aerial vehicle (UAV) that operates at high altitudes will only be able to collect images from high angles, thereby generating a data set that may be largely irrelevant for another UAV operating at lower altitudes and at lower angles of vision.
It has been postulated that autonomous systems possess tremendous value for military operations. The advantages range from executing tasks more quickly than any human or human-operated system in time-critical missions (e.g., air-defence or defensive cyber operations) to carrying out so-called 3D (dull, dirty and dangerous) missions where human performance is prone to deterioration over time.47 However, data issues for autonomous systems continue to plague defence organizations. The task of designing autonomous systems for either on-board or off-board data processing represents a trade-off, as diverse stakeholders each have unique requirements.48 Defence organizations are grappling with this trade-off and face challenges in obtaining real-world data along with the associated annotations (i.e., data labels) that can be used to train algorithms for on-board data processing.49 The current data management architecture only permits autonomous systems to operate in controlled environments and with limited degrees of autonomy.50 For example, the Israeli Guardium uncrewed ground vehicle is only used autonomously at the Israel–Gaza border, a location that is well-mapped and relatively static.51 The use of synthetic data for training autonomous systems
45 M. Hagström, “Military Applications of Machine Learning and Autonomous Systems”, The Impact of Artificial Intelligence on Strategic Stability and Nuclear Risk, SIPRI, May 2019, https://www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strategic-stability-nuclear-risk.pdf.
47 V. Boulanin, “Artificial Intelligence: A Primer”, The Impact of Artificial Intelligence on Strategic Stability and Nuclear Risk, SIPRI, May 2019, https://www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strategic-stability-nuclear-risk.pdf.
48 Defense Science Board, “Task Force Report: The Role of Autonomy in DoD Systems”, US Department of Defense, July 2012, 20, https://irp.fas.org/agency/dod/dsb/autonomy.pdf.
50 J. Kwik and T. Van Engers, “Algorithmic Fog of War: When Lack of Transparency Violates the Law of Armed Conflict”, Journal of Future Robot Life, 2021, 7, https://doi.org/10.3233/FRL-200019.
51 R. Crootof, “The Killer Robots are Here: Legal and Policy Implications”, Cardozo Law Review, 2015, 1869, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534567.
may therefore represent a means to alleviate the data challenges associated with the current data-collection and data-processing architecture. It may thereby present defence organizations with the opportunity to further exploit the use of autonomous systems by placing them in highly dynamic and multivariate environments while reducing the associated risks.

In the cyber realm, however, the incorporation of AI could be an essential element of the ability to conduct defensive cyber operations at scale and to identify threats before they arise.52 Put another way, AI could make the defence of cyber infrastructures more reliable, especially against AI-enabled offensive cyber operations.53 AI could be critical in dealing with challenges arising from both the increasing scale and the increasing sophistication of the cyber realm.
SCALE
As societies and urban environments grow more digitally interconnected and heterogeneous, they create more pressure points and vulnerabilities for defensive cyber operations to supervise. Vulnerabilities in digital systems are not only the result of increased sophistication in the vector of attacks, but are also created by the size of the attack surface. In other words, while the types of attack may not necessarily be changing, the scale of the risks is. As such, AI could act as a force multiplier to provide enough “eyeballs” on enough segments within a digital space in order to be effective.54

SOPHISTICATION
Offensive cyber operations augmented and amplified by AI (e.g., synthetic images, adversarial data manipulation and other deceptive techniques) could pose threats to the regular operations of a government, private enterprise or individual. Using AI-enhanced systems and practices defensively may then be necessary to detect and respond to anomalies by using fine grain control. As such, unlike the issue of scale, leveraging AI against AI-enhanced attacks is not just a matter of correcting social or organizational deficits, but of ameliorating technological shortcomings.
Evidently, there is a wide range of actual and potential use cases for AI within the international security context. Yet, cultural and social overhangs as well as technological barriers associated with the deployment of AI technologies continue to raise concerns regarding the safety, predictability and reliability of AI, especially in the context of international security. These concerns are creating a gap between “experimental tools and fielded systems”.55 Synthetic data is one proposed solution that could contribute to ameliorating the trust deficit associated with the integration of AI technologies in high-risk situations by enhancing the quality and usability of training data.
3.1 Value Added by Synthetic Data
The value added by synthetic data depends on where, how and for which AI systems it is being applied. In general, synthetic data allows for the generation of highly diverse data sets, fine grain control of data attributes, automatic annotation or data labelling, and cost-effectiveness. The aim is that the application of synthetic data would have a beneficial effect on training AI systems.

The extent to which synthetic data is an appropriate proxy for the original data is a measure of the utility of the method used to generate the synthetic data as well as the machine learning model and AI system using the synthetic data.56 In some circumstances, machine learning algorithms may even be trained on synthetic data sets that have no real-world equivalent, particularly in cases where real-world data cannot be properly collected (e.g., objects placed in uncommon or rare environments). In these circumstances, the use of synthetic data may be essential. This is especially relevant in the military context, where autonomous systems placed in complex combat environments are designed and built for operational efficiency and efficacy, rather than high-detail data collection. As such, real-world data collected by a UAV, for example, may not be able to capture all possible combinations of relevant attributes with high levels of detail, such as images of the relevant object in different environments, captured at different distances, viewing angles and orientations, and under different illuminations.57 The ability to synthetically generate highly varied scenarios with all possible combinations of relevant attributes as well as the ability to properly identify rare occurrences will thus be essential for any autonomous system to be safe, predictable and reliable, especially in uncontrolled environments.

Using synthetic data to train autonomous systems may also shorten training cycles. Because AI systems are dependent on the experiences held within the data, rather than the data in and of itself, training AI systems on real-world data can be impractical. Collecting a sufficient amount of real-world data and ensuring that there is ample diversity within that data set is a resource-intensive and time-consuming process. Even then, it is difficult to ensure that all possible variations and diversity in the training data set have been exhausted. In addition, real-world data may not provide the fine grain control and flexibility that synthetic data provides in training an AI system to fit different requirements. On the other hand, sometimes the characteristics of a real-world data set can be muddled or the data set may simply be too cumbersome to be used effectively. In some cases, mere seconds could be worth gigabytes of data (e.g., packet capture). As such, simple synthetic data sets that retain the characteristics and statistical distributions of the underlying original data set could be sufficient in certain cases.

Moreover, in instances where the collection of parsed and properly indexed data may not be an
issue (e.g., for defensive cyber operations), synthetic data may be used to produce and learn hypothetical situations. For example, developers may leverage agent-based modelling – a technique to simulate interactions between multiple variables (e.g., people, internet of things (IoT) systems, time, etc.) – to create synthetic data sets that reflect people working on certain IoT or enterprise systems for a specified amount of time.58 The value added here is that, even though organizations working on an IoT system may be able to capture a vast amount of complete data, organizations may not possess the fine grain control of the data they collect to detect or predict all anomalies or distinguish anomalies from regular patterns. By using agent-based modelling to generate a synthetic reality, organizations may be able to train IoT systems to simulate, identify and classify malicious and non-malicious activity at various levels of sophistication.

The application of such techniques is not exclusive to the international security context, nor is it simply theoretical. Indeed, techniques such as agent-based modelling have been applied in other contexts. For example, agent-based modelling has been used to simulate and anticipate the impact of public policies (e.g., urban planning and public transportation), for policy evaluation and for simulating disease outbreaks and interventions.59 In fact, open-source synthetic populations that reflect the characteristics of a local population have already been developed for the United Kingdom60 and the United States61 as well as for smaller geographic regions (e.g., the Île-de-France region, France,62 and the Island of Montreal, Canada63).

It can, therefore, be inferred that agent-based modelling can help militaries to prepare for unexpected situations or to plan operations. By generating synthetic data via agent-based modelling simulations, militaries can prepare for a range of potential situations and develop strategies to address them. This may help to improve the readiness and effectiveness of military operations, making them better prepared for unforeseen events and creating data points for rare events or uncommon environments.

The fine grain control of synthetic data sets grants developers the ability to make minor adjustments to the traits and characteristics of the synthetic data set and to test the performance
59 M. Prédhumeau and E. Manley, “A Synthetic Population for Agent-Based Modelling in Canada”, Scientific Data, vol. 10, 21 March 2023, https://doi.org/10.1038/s41597-023-02030-4.
60 A. Smith et al., “An Open-Source Model for Projecting Small Area Demographic and Land-Use Change”, Geographical Analysis, 7 February 2022, https://doi.org/10.1111/gean.12320.
61 W. Wheaton et al., “Synthesized Population Database: A US Geospatial Database for Agent-Based Models”, Methods Report, RTI Press, May 2009, https://doi.org/10.3768/rtipress.2009.mr.0010.0905.
62 S. Hörl and M. Balać, “Synthetic Population and Travel Demand for Paris and Île-de-France Based on Open and Publicly Available Data”, Transportation Research Part C: Emerging Technologies, vol. 130, December 2021, https://doi.org/10.1016/j.trc.2021.103291.
63 L. Perez et al., “A Geospatial Agent-Based Model of the Spatial Urban Dynamics of Immigrant Populations: A Study of the Island of Montreal, Canada”, PLOS ONE, vol. 14, July 2019, https://doi.org/10.1371/journal.pone.0219188.
and limitations of machine learning algorithms.64 Indeed, it is possible to create several synthetic data sets with the same underlying data in order to serve different functions.65

The digital world can also be used to test how variations of a synthetic data set derived from the same underlying data influence how an AI system ultimately responds to its environment. This can also be particularly useful for identifying and addressing skewed data sets, where one trait or a class of traits within a data set is overrepresented (i.e., data or algorithmic bias). In a data set where one class of traits is swamped by a larger class, techniques such as the Synthetic Minority Oversampling Technique (SMOTE)66 can be used to balance their frequencies.67 Conditional GANs (CGANs) can also reduce skew in a data set by using adversarial training to improve the ability of the discriminator network to predict underrepresented classes accurately, with the aim of eliminating class-wide bias.68

These “benchmarking” characteristics imply applicability in the military context. For example, in an experiment conducted by the US Army Research Laboratory, the researchers found that the performance of computer vision systems (e.g., those used in uncrewed vehicles) is correlated with the angle of the images on which the system was trained.69 The researchers noted that the classifier model demonstrated bias against images collected directly above a subject (e.g., human, building, tank, etc.), and the performance improved as the camera moved further away – decreasing the angle of vision. The researchers concluded that one possible reason is that, because the classifier model was trained on ground imagery, the performance would improve as the experiment inputs looked more like ground imagery. The system thus needs to be retrained using more aerial imagery with higher angles. As such, the researchers noted that synthetic data can be used to compare the different classifier models with different model complexities and architecture, which would then allow the optimal classifier to be chosen for a specific task.

Lastly, it has been argued that the creation of synthetic data that represents real-world data may also eliminate legal challenges associated with collecting, storing, disseminating and disposing of sensitive data.70 Currently, organizations may be unwilling to share data related to their digital infrastructure if sensitive details of their environment (e.g., IP address, network types, etc.) are exposed, which could pose a risk to the safety of their enterprise digital infrastructure.71 This may be even more pertinent in the
65 A. Alfons et al., “Synthetic Data Generation of SILC Data”, European Commission, 2011, 6, https://www.uni-trier.de/fileadmin/fb4/projekte/SurveyStatisticsNet/Ameli_Delivrables/AMELI-WP6-D6.2-240611.pdf.
67 Focus Group on Artificial Intelligence for Health, “Data and Artificial Intelligence Assessment Methods (DAISAM) Reference”, International Telecommunication Union and World Health Organization, May 2020, 13.
68 Ibid, 12.
70 A. Tucker et al., “Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software”, NPJ Digital Medicine, vol. 3, 9 November 2020, https://doi.org/10.1038/s41746-020-00353-9.
international security context, where sensitive real-world data is not easily shared even among allies.72 For example, the Australian Department of Defence has noted the challenge of not being aligned with the data standards of the other members of the “Five Eyes” intelligence-sharing partnership.73 Privacy preservation also suggests the ability of synthetic data to act as a hedge against changes in data privacy regulations that could heighten risks by disrupting organizational and inter-organizational routines of sharing sensitive data.
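
The class-balancing idea behind SMOTE, mentioned earlier in this section, can be illustrated with a minimal sketch. It is a simplified version of the interpolation step only, written with the Python standard library, and is not the reference implementation or the method used in any of the cited studies. The function name smote, its parameters and the two-feature data points are all hypothetical and chosen purely for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the chosen point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature data set: eight majority points, three minority points.
majority = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.4, 0.2),
            (0.3, 0.5), (0.5, 0.4), (0.6, 0.1), (0.2, 0.6)]
minority = [(2.0, 2.0), (2.4, 2.2), (2.1, 2.6)]

new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))  # both classes now have 8 samples
```

Because each synthetic point is placed on a line segment between two existing minority samples, the new points stay within the region already occupied by the minority class. This is also why such oversampling cannot, on its own, supply traits that the original data never contained.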
3.2 Risks
While synthetic data can help alleviate some of the data challenges faced by defence organizations, it is not a silver bullet. Synthetic data comes with its own set of risks and challenges. The ability to manage those risks and challenges is particularly important in order for AI systems to be used in a responsible and safe manner and in accordance with legal requirements and ethical values.

One of the most prominent risks with using synthetic data is called the “reality gap”. This refers to the subtle differences between the synthetic data and the real world. Sophisticated machine learning models often learn to exploit small discrepancies, making simulated environments difficult to learn from.74 In other words, if synthetic data is not simulated properly, it can run into issues of not being able to fully replicate the complex and chaotic physics of the real world and may fail to properly capture the unexpected shifts or one-off cases that emerge in real-world data.

While synthetic data can be used to benchmark data quality, data bias and algorithmic bias, synthetic data itself can also create (or even amplify) unintended biases. Intended biases can be useful in certain applications, for example, overrepresenting specific classes of rare malicious network traffic patterns so that AI systems used for either surveillance or incident response have a higher chance of detecting those malicious patterns. Nonetheless, it is critical that these intended biases do not have unintended consequences. In virtually all AI systems, there is an optimal number of synthetic data points, which is dependent on the composition of both the synthetic data and real-world data on which the AI system is trained. Too much synthetic data could “overfit”75 the AI system, thereby degrading the performance of the system. Therefore, ensuring that the specified scope is correct is critical to avoiding unintended harms or other unintended consequences. Not only could poor scoping lead to unintended consequences once an autonomous system is fielded, but it could also lead to low-quality data, sampling errors, gender or racial bias, labelling or aggregation bias, or incomplete synthetic data sets when the data is being generated.76

The issue of data bias and algorithmic bias is also a social and cultural challenge in addition to a technical challenge. For example, if a synthetic data set is generated based on the traits and characteristics of an original real-world data set that contains certain assumptions of gender or racial norms, that synthetic data set could further amplify those biases. Even if gender or race is not “explicit in the machine learning model, patterns drawn from neutral characteristics, such as uniforms or evidence of weapons, could still implicitly incorporate
75 Overfitting is a term used in machine learning that denotes that the algorithm or network does not generalize well enough to
the unseen test data, although it performs well on training data. See, Ibid, 5.
gender [or racial] norms”.77 As such, gender- and racial-based approaches underline the importance of diversifying the people and range of expertise involved in each step of the AI system, including data generation.78

Moreover, synthetic data can still be prone to data poisoning by sophisticated malicious actors. It is possible for adversaries to bury unwanted changes in the synthetic data or data set in order to disrupt the learning procedure, such as by injecting a small fraction of malicious samples into a training data set or making small adjustments to a synthetically generated image.79 However, it is important to note that, while data poisoning is a risk with synthetic data, synthetic data is less prone to poisoning than real-world data, which is often created in distant or uncontrolled environments.

Lastly, while some synthetic data-generation techniques are privacy-preserving, other techniques may not provide adequate levels of privacy protection. Specifically, diffusion models are the least private form of image generation when compared to other techniques, such as GANs. This is directly related to the utility of diffusion models to generate higher-quality images compared to GANs and VAEs.80 In other words, in some contexts synthetic data may present a privacy–utility trade-off, as increasingly powerful generative models raise questions about how diffusion models work and how, and under what circumstances, they should be responsibly deployed.81

While these risks may also be applicable to real-world data sets, synthetic data may expand the potential surface area for most of these risks. Synthetic data itself does not provide new discrete risks, but these risks could be more pervasive. Simply put, the types of risk may be similar, but the risk vectors are shifting and the scale of risk is increasing. However, it has been posited that the use of synthetic data may raise more questions than real-world data because people generally trust it less, which may provide more opportunities to establish processes to verify synthetic data – more so than real-world data.82
81 Ibid.
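
The data-poisoning risk described in this section, in which a small fraction of malicious samples is injected into a training data set, can be illustrated with a deliberately simple sketch. It assumes a one-feature, nearest-centroid classifier and hypothetical values throughout. The function names train_centroids, predict and accuracy are introduced here only for illustration and do not come from the cited sources. Real attacks and real models are far more complex.

```python
from statistics import mean


def train_centroids(samples):
    """Return the mean feature value of each class (a nearest-centroid model)."""
    return {
        label: mean(x for x, y in samples if y == label)
        for label in {y for _, y in samples}
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, test_set):
    """Share of test samples whose predicted class matches the true class."""
    return sum(predict(centroids, x) == y for x, y in test_set) / len(test_set)


# Hypothetical one-feature data: class 0 clusters near 2.0, class 1 near 8.0.
class_0 = [1.0, 1.2, 1.6, 1.8, 2.0, 2.0, 2.2, 2.4, 2.8, 3.0]
clean = [(x, 0) for x in class_0] + [(x + 6.0, 1) for x in class_0]
test_set = [(1.5, 0), (2.5, 0), (7.5, 1), (8.5, 1)]

# The adversary injects two extreme samples mislabelled as class 0
# (2 of 22 training samples, roughly 9 per cent).
poisoned = clean + [(50.0, 0), (50.0, 0)]

clean_accuracy = accuracy(train_centroids(clean), test_set)
poisoned_accuracy = accuracy(train_centroids(poisoned), test_set)
print(clean_accuracy, poisoned_accuracy)  # prints 1.0 0.5
```

In this toy case, two injected samples drag the class 0 centroid from about 2 to 10, so every genuine class 0 test point is misclassified. A basic range or outlier check on generated data before training would catch samples this extreme, which is one example of the verification processes that the section argues for.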
4. Conclusion
Synthetic data has proven to be a useful technology in a variety of sectors, from healthcare to fraud detection and public policy planning. While synthetic data can still be considered an “emerging technology”,83 it has reached an adequate degree of maturity and there is a sufficient amount of expertise for adoption across industries and public services, including those related to international security.

Indeed, the value added by synthetic data and the risks associated with it are relevant to United Nations multilateral processes and other discussions surrounding the use of AI within the context of international security. The potential benefits offered by synthetic data, particularly the fine grain control, data diversity and cost-effectiveness, should not be ignored. Synthetic data could present a solution to ameliorate some of the data challenges that continue to plague defence organizations, such as poor data quality and low-diversity data sets. By ameliorating some of these challenges, militaries and defence organizations alike could improve operational capabilities while ensuring compliance with their international humanitarian law obligations, particularly during 3D operations where human performance is prone to deteriorate over time.

At the same time, the risks associated with synthetic data should not be understated. While synthetic data does not necessarily create new risks that are distinct from the risks associated with real-world data, synthetic data may expand the risk surface. In other words, the risks may be similar, but there may be more ways to generate those risks. For example, a lack of diversity in a real-world training data set may create unintended biases in the same way as overfitting an AI system with one synthetic data point may create unintended biases.

While the features of synthetic data make it a promising technology for the development of autonomous systems in international security, it should not be viewed as a silver bullet or a cure-all to existing data challenges. It should, instead, be understood as a tool in the data-governance toolbox. There is ample research that demonstrates the utility of synthetic data, and generation models have advanced significantly in the past few years. As such, next steps should include, but not be limited to, identification of specific use cases; more targeted research on how to apply existing methods and knowledge of synthetic data in the field of international security while considering the gender and racial aspects of data; and how synthetic data can be integrated into existing data governance strategies.
83 “Emerging technologies” refers to technologies that create new opportunities to address global challenges while also creating new regulatory challenges. See Organisation for Economic Co-operation and Development (OECD), “OECD Science, Technology, and Industry Outlook 2012”, 13 September 2012, 222, https://doi.org/10.1787/sti_outlook-2012-en.
Bibliography
Alfons, Andreas, Peter Filzmoser, Beat Hullinger, Jan-Philipp Kolb, Stefan Kraft, Ralf Münnich, and Matthias Templ. “Synthetic Data Generation of SILC Data”, European Commission, 2011, 6, https://www.uni-trier.de/fileadmin/fb4/projekte/SurveyStatisticsNet/Ameli_Delivrables/AMELI-WP6-D6.2-240611.pdf.
Alkhzaimi, Hoda. “Contribution to the Fifth Substantive Session by Emerging Research and Security Center, NYU/NYUAD”, NGO Working Papers, 28 July 2023, https://docs-library.unoda.org/Open-Ended_Working_Group_on_Information_and_Communication_Technologies_-_(2021)/Stakeholder_Recommendation_for_Open-ended_working_group_on_security_APR.pdf.
Anand, Alisha and Harry Deng. “Towards Responsible AI in Defence: A Mapping and Comparative Analysis of AI Principles Adopted by States”, UNIDIR, 13 February 2023, https://unidir.org/publication/towards-responsible-ai-defence-mapping-and-comparative-analysis-ai-principles-adopted.
Aryawan Udayana, Putu, Tri Legionosukumo, and Sri Sundari. “Strategy for Integrated Land Information System Network Arrangements for the Indonesian National Army”, 2022, https://doi.org/10.33172/jspd.v8i1.1054.
Bonabeau, Eric. “Agent-Based Modeling: Methods and Techniques for Simulating Human Systems”, PNAS, 14 May 2002, https://doi.org/10.1073/pnas.082080899.
Canada, “The Department of National Defence and Canadian Armed Forces Data Strategy”, Department of National Defence, 18 May 2021, https://www.canada.ca/en/department-national-defence/corporate/reports-publications/data-strategy/data-strategy.html.
Carlini, Nicholas, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. “Extracting Training Data from Diffusion Models”, 30 January 2023, 1, https://arxiv.org/abs/2301.13188.
Chandler, Katherine. “Does Military AI Have Gender? Understanding Bias and Promoting Ethical Approaches in Military Applications of AI”, UNIDIR, 7 December 2021, https://doi.org/10.37559/GEN/2021/04.
Crootof, Rebecca. “The Killer Robots are Here: Legal and Policy Implications”, Cardozo Law Review, 2015, 1869, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534567.
Defense Innovation Board, “AI Principles”, Defense Innovation Board, 2019, https://media.defense.gov/2019/Oct/31/2002204458/-1/-1/0/DIB_AI_PRINCIPLES_PRIMARY_DOCUMENT.PDF.
Defense Science Board, “Task Force Report: The Role of Autonomy in DoD Systems”, Department of Defense, July 2012, 20, https://irp.fas.org/agency/dod/dsb/autonomy.pdf.
Fan, Linwei, Fan Zhang, Hui Fan, and Caiming Zhang. “Brief Review of Image Denoising Techniques”, Visual Computing for Industry, Biomedicine, and Art, https://doi.org/10.1186/s42492-019-0016-7.
Government Business Council, “Advancing ISR at the Edge: A Survey on Networks and Processing Technologies in the Digital Battlespace”, July 2020, 4, http://cdn.govexec.com/media/advancing-isr-at-the-edge-isr.pdf.
Hagström, Martin. “Military Applications of Machine Learning and Autonomous Systems”, SIPRI, May 2019, https://www.sipri.org/sites/default/files/2019-05/sipri1905-ai-strategic-stability-nuclear-risk.pdf.
Hörl, Sebastian and Milos Balać. “Synthetic Population and Travel Demand for Paris and Île-de-France Based on Open and Publicly Available Data”, Transportation Research Part C: Emerging Technologies, December 2021, https://doi.org/10.1016/j.trc.2021.103291.
Hradec, Jiri, Massimo Craglia, Margherita Di Leo, Sarah De Nigris, Nicole Ostlaender, and Nicholas Nicholson. “Multipurpose Synthetic Population for Policy Application”, European Commission Joint Research Centre, 13 April 2022, 14, https://dx.doi.org/10.2760/50072.
Holland, Arthur. “Known Unknowns: Data Issues and Military Autonomous Systems”, UNIDIR, 17 May 2021, https://unidir.org/
known-unknowns.
International Telecommunications Union. “Data and Artificial Intelligence Assessment Methods (DAISAM) Reference”, ITU-T
Focus Group on Artificial Intelligence for Health, May 2020, 13.
Kannan, Subarmaniam. “Synthetic Time Series Data Generation for Edge Analytics”, F1000 Research, 20 January 2022, https://
doi.org/10.12688/f1000research.72984.1.
Kingma Diederick P., and Max Welling. “An Introduction to Variational Autoencoders”, Foundations and Trends in Machine
Learning, 2019, https://arxiv.org/pdf/1906.02691.pdf.
Kwik, Jonathan and Tom Van Engers. “Algorithmic Fog of War: When Lack of Transparency Violates the Law of Armed Conflict”,
Journal of Future Robot Life, 2021, 7, https://doi.org/10.3233/FRL-200019.
Longford, Frank. “Experiments in Synthetic Data”, Forensic Architecture, 6 November 2018, https://forensic-architecture.org/
investigation/experiments-in-synthetic-data.
Manon Prédhumeau and Ed Manley, “A Synthetic Population for Agent-Based Modelling in Canada”, Scientific Data, 21 March
2023, https://doi.org/10.1038/s41597-023-02030-4.
Öhman, Wilhelm. “Data Augmentation Using Military Simulators in Deep Learning Object Detection Applications”, KTH Royal
Institute of Technology, 10 September 2019, 2, https://www.diva-portal.org/smash/get/diva2:1375838/FULLTEXT01.pdf.
Organization for Economic Cooperation and Development. “OECD Science, Technology, and Industry Outlook 2012”, 13
September 2012, 222, https://doi.org/10.1787/sti_outlook-2012-en.
Pakistan, “Proposal for an International Instrument on Lethal Autonomous Weapons (LAWS)”, CCW/GGE.1/2023/WP.2/Rev.1,
8 March 2023, https://docs-library.unoda.org/Convention_on_Certain_Conventional_Weapons_-Group_of_Governmen-
tal_Experts_on_Lethal_Autonomous_Weapons_Systems_(2023)/CCW_GGE1_2023_WP.3_REv.1_0.pdf
Pasieka, Manuel. “A Comparison of Synthetic Data Generation Methods and Synthetic Data Types”, 1 September 2022, https://
mostly.ai/blog/comparison-of-synthetic-data-types/.
Perez, Liliana, Suzana Dragicevic, and Jonathan Gaudreau. “A Geospatial Agent-Based Model of the Spatial Urban Dynamics of
Immigrant Populations: A Study of the Island of Montreal, Canada”, PLOS ONE, 24 July 2019, https://doi.org/10.1371/journal.
pone.0219188.
Scharre, Paul. “Robotics on the Battlefield Part II: The Coming Swarm”, Center for a New American Security, 15 October 2014,
https://www.cnas.org/publications/reports/robotics-on-the-battlefield-part-ii-the-coming-swarm.
Silva, Thalles. “A Short Introduction to Generative Adversarial Networks”, Thalles’ Blog, 7 June 2017,
https://sthalles.github.io/intro-to-gans/.
Smith, Andrew, Luke Archer, Alistair Ford, and James Virgo. “An Open-Source Model for Projecting Small Area Demographic and
Land-Use Change”, Geographical Analysis, 7 February 2022, https://doi.org/10.1111/gean.12320.
State of Palestine, “State of Palestine’s Proposal for the Normative and Operational Framework on Autonomous Weapons
Systems”, CCW/GGE.1/2023/WP.2/Rev.1, 3 March 2023,
https://docs-library.unoda.org/Convention_on_Certain_Conventional_Weapons_-Group_of_Governmental_Experts_on_Lethal_Autonomous_Weapons_Systems_(2023)/CCW_GGE1_2023_WP.2_Rev.1.pdf.
Tucker, Allan, Zhenchen Wang, Ylenia Rotalinti, and Puja Myles. “Generating High-Fidelity Synthetic Patient Data for Assessing
Machine Learning Healthcare Software”, NPJ Digital Medicine, 9 November 2020,
https://www.nature.com/articles/s41746-020-00353-9#citeas.
United Nations General Assembly, “Disarmament and International Security (First Committee)”, https://www.un.org/en/ga/first/.
Wheaton, William, James Cajka, Bernadette Chasteen, Diane Wagener, Phillip Cooley, Laxminarayana Ganapathi, Douglas
Roberts, and Justine Allpress. “Synthesized Population Database: A US Geospatial Database for Agent-Based Models”,
Methods Report RTI Press, May 2009, https://doi.org/10.3768/rtipress.2009.mr.0010.0905.
Wilner, Alex. “AI and the Future of Deterrence: Promises and Pitfalls”, Centre for International Governance Innovation, 28
November 2022, https://www.cigionline.org/articles/ai-and-the-future-of-deterrence-promises-and-pitfalls/.
United Kingdom, “Data Strategy for Defence: Delivering the Defence Data Framework and Exploiting the Power of Data”, 27
September 2021, Ministry of Defence,
https://www.gov.uk/government/publications/data-strategy-for-defence/data-strategy-for-defence.
Vahdat, Arash and Karsten Kreis. “Improving Diffusion Models as an Alternative to GANs, Part 1”, NVIDIA, 26 April 2022,
https://developer.nvidia.com/blog/improving-diffusion-models-as-an-alternative-to-gans-part-1/.
Yan, Jie, Eung Joo Lee, Damon Conover, and Heesung Kwon. “Synthetic Dataset Generation and Adaptation for Human
Detection”, DEVCOM Army Research Laboratory, November 2020, 1, https://apps.dtic.mil/sti/pdfs/AD1115446.pdf.
Palais des Nations
1211 Geneva, Switzerland
© UNIDIR, 2023
www.unidir.org