
A survey of air combat behavior modeling using machine learning

Patrick Ribu Gorton, Andreas Strand, Karsten Brathen, Senior Member, IEEE

Patrick Ribu Gorton (e-mail: [email protected]), Andreas Strand (e-mail: [email protected]), and Karsten Brathen (e-mail: [email protected]) are with FFI (Norwegian Defence Research Establishment), Instituttveien 20, 2007 Kjeller, Norway.

Abstract—With the recent advances in machine learning, creating agents that behave realistically in simulated air combat has become a growing field of interest. This survey explores the application of machine learning techniques for modeling air combat behavior, motivated by the potential to enhance simulation-based pilot training. Current simulated entities tend to lack realistic behavior, and traditional behavior modeling is labor-intensive and prone to loss of essential domain knowledge between development steps. Advancements in reinforcement learning and imitation learning algorithms have demonstrated that agents may learn complex behavior from data, which could be faster and more scalable than manual methods. Yet, making adaptive agents capable of performing tactical maneuvers and operating weapons and sensors still poses a significant challenge.

The survey examines applications, behavior model types, prevalent machine learning methods, and the technical and human challenges in developing adaptive and realistically behaving agents. Another challenge is the transfer of agents from learning environments to military simulation systems and the consequent demand for standardization.

Four primary recommendations are presented regarding increased emphasis on beyond-visual-range scenarios, multi-agent machine learning and cooperation, utilization of hierarchical behavior models, and initiatives for standardization and research collaboration. These recommendations aim to address current issues and guide the development of more comprehensive, adaptable, and realistic machine learning-based behavior models for air combat applications.

Index Terms—Machine learning, intelligent agents, behavioral sciences, modeling, simulation, military systems.

I. INTRODUCTION

BEHAVIOR models are fundamental components of simulation-based fighter pilot training and other air combat applications. Computer-generated forces (CGFs) are autonomous simulated entities used in military simulations to represent friendly and opposing forces [1]. Modeling their behavior to make them act realistically and human-like is demanding. Consequently, in today's simulation-based training, pilots engage in combat with CGFs that act in predictable ways and generally lack realistic behavior. Therefore, the instructors must manually control many aspects of the CGFs to ensure the pilots get the training needed [2, 3]. Freeman et al. [4] advocate four functional requirements to achieve realistic behavior, which are tactical inference, tactical action, modal behavior, and instructional capability.

Besides training, realistic behavior models benefit applications like mission planning and tactics development. Modeling and simulation tools can help mission planners predict and evaluate the outcome of different scenarios, allowing refinement of strategies and tactics before the actual mission takes place [5]. Tactics development may leverage the creativity of ML agents that autonomously explore new strategies with few restrictions.

Traditionally, air combat behavior is created manually by first eliciting domain knowledge from subject matter experts (SMEs) and then creating a conceptual model reflecting this knowledge. Finally, the model is implemented in a computer program as a rule-based script, such as a decision tree or a finite state machine (FSM) [6]. This process is laborious and risks losing essential domain knowledge when transitioning from one step to another.

In recent years, there has been a surge in applying machine learning (ML) for the efficient development of air combat behavior models. Learning the behavior model with data-driven methods imposes other requirements on the model representation. The model structure should be able to capture general patterns from complex data, and at the same time not be too rigid, as that would restrict learning potential. A neural network is a compelling model representation because it offers scalability and parallel processing and allows iterative improvements.

The shift to ML is perhaps also inspired by advances in applying intelligent agents to complex systems in general, aided by novel approaches in reinforcement learning (RL), imitation learning (IL), and evolutionary algorithms (EAs). Notable examples in RL are DeepMind's AlphaGo and its successors, reaching superhuman levels in complex games [7, 8]. Robotic control showcases the strength of IL and how demonstrating a task is sometimes the simplest solution [9]. Furthermore, EAs excel in global optimization in high-dimensional problems [10].

The above-mentioned air combat applications employ behavior models that reflect human behavior, but this is not required in applications like unmanned combat aerial vehicles (UCAVs). Dong et al. [11] and Wang et al. [12] have surveyed maneuvering aspects of autonomous air combat, where Dong et al. focus particularly on lower-level control and guidance. Both surveys outlined the applications of analytical, knowledge-based, and ML methods. The examined behavior models were aimed at winning one-on-one engagements, which is a key problem in unmanned air combat and some aspects of fighter pilot training.
Dong et al. argue that ML methods model human perceptions and decisions better, while traditional methods produce well-defined maneuvers. They also emphasize the benefit of using simulation systems to generate learning data. Wang et al. argue that realistic scenarios and simulations are more critical for advancing autonomous air combat than designing elaborate models, such as deep neural networks.

Løvlid et al. [13] have explored works related to the data-driven behavior modeling (DDBM) approach for CGFs as an alternative to the traditional approach that relies on SMEs. They highlight how the DDBM approach offers potential benefits to military end-users, such as an easier way of adding new CGF behaviors, including more human-like behaviors compared to the traditional modeling approach. However, they also point to challenges related to lack of data, noisy and incomplete data, difficulty interpreting model decisions, and striking a balance between simulation fidelity and processing performance. While their work concentrates on RL and IL in military simulation-based training and decision support systems, it encompasses a broad scope, devoting only a single reference to air combat [14].

Thus, recent contributions with air combat applications call for a revised perspective on the field. This paper presents a survey of the current state of research on behavior modeling in the air combat domain, specifically pilot behavior modeling based on ML. The purpose is to provide insight into prevalent models, methods, scenarios, and applications. Key themes emerging from the survey include the transition to multi-agent learning, alignment of simulation systems, development of benchmark scenarios, and their significance for practical applications such as pilot training. These findings shape directions for future research endeavors.

The outline of the paper is as follows. Section II describes the scope, databases, and keywords used to collect pertinent literature. Section III outlines the range of behavior model types, and Sections IV-VII introduce the prevalent ML methods used to train these. Section VIII presents the surveyed literature and key characteristics. The identified gaps and trends are discussed in Section IX followed by a conclusion in Section X with findings and recommendations.

II. LITERATURE SELECTION

This survey scopes research related to the development of realistic behavior models for air combat. The collected literature draws from several sources, including IEEE Xplore, Google Scholar, and suggestions from the Mendeley reference manager. The search terms entered on these services were air combat, behavior, simulation, artificial intelligence, and machine learning. Publications, reports, and theses were then subjectively filtered based on pertinence. Only literature written in English has been considered. An overview of the literature is presented in Table I.

III. BEHAVIOR MODELS

A behavior model is a broad term for a model that chooses actions based on percepts and is in ML often expressed as a mathematical function. The choice of behavior representation is based on the mission task, application, previous research, and related technical requirements.

A. Behavior Transition Network (BTN)

Stottler Henke Associates developed the BTN in the early 2000s [11, 12, 13], which is essentially a behavior script represented as a graph. Actions are executed sequentially, and which action to perform next is determined by transitions. Syntactically, BTNs are directed bipartite multigraphs. They are bipartite since all edges connect an action and a transition. A transition node expresses a condition that, when fulfilled, transitions between action nodes. Moreover, BTNs are multigraphs since action nodes may connect to multiple transition nodes expressing different conditions. In this event, the transition nodes are assigned priorities defining the order to evaluate them in case more than one condition is fulfilled. All BTNs must have a start action node and an end action node. An example of a BTN is shown in Fig. 1.

Fig. 1. Example of a BTN that shows how actions connected to more than one transition have assigned priorities.

Behavior transition networks are like FSMs but include several augmentations. A BTN may be hierarchical, wherein an action node may refer to another BTN. This hierarchical property is essential to model complex behavior and avoid the state explosion of FSMs. The execution of a BTN is relatively simple, beginning at the start node and proceeding to the following node. If the following node is an action, the action gets executed. If the following node is a transition, the transition occurs only when the condition is met. If the following node branches to more than one transition, their conditions are evaluated in the order given by their priority scheme as described above. Note that in hierarchical BTNs, if an action node refers to another BTN, any subsequent transition node in the superior BTN still gets evaluated and can interrupt this sub-BTN execution. This behavior is an important aspect of BTNs that helps reduce their complexity. Finally, the execution terminates in the end action node. When controlling simulated entities, each entity will have its own BTN. The BTNs can read and write messages to and from blackboards enabling information sharing and cooperation between the entities.
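To make the execution scheme above concrete, the following is a minimal sketch of a BTN interpreter with prioritized transitions. The node classes, the condition callables, and the re-execution behavior when no transition is fulfilled are illustrative assumptions and do not reproduce an actual Stottler Henke implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Transition:
    condition: Callable[[dict], bool]   # evaluated against the current percepts
    target: "Action"
    priority: int = 1                   # lower number = evaluated first

@dataclass
class Action:
    name: str
    execute: Callable[[dict], None] = lambda state: None
    transitions: List[Transition] = field(default_factory=list)

def run_btn(start: Action, end: Action, state: dict, max_steps: int = 100) -> None:
    """Execute actions and follow the first fulfilled transition, in priority order."""
    node: Optional[Action] = start
    for _ in range(max_steps):
        node.execute(state)
        if node is end:
            return                       # execution terminates in the end action node
        fulfilled = [t for t in sorted(node.transitions, key=lambda t: t.priority)
                     if t.condition(state)]
        if fulfilled:                    # otherwise keep executing the current action
            node = fulfilled[0].target
```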
B. Behavior Tree (BT)

Behavior trees are another paradigm for task switching that became mainstream in the 2000s for modeling non-player characters following games such as Grand Theft Auto, Halo, and Bioshock [15, 16]. A modern description of BTs [17] includes the five node types sequence (→), fallback (?), parallel (⇉), action (shaded box), and condition (oval), as demonstrated in Fig. 2. The example shows a maneuver where one aircraft intercepts another [18]. The BT gets evaluated at each tick starting at the root, which in the example is a parallel node. Consequently, the aircraft will execute three subtrees simultaneously, matching the speed and altitude of the other aircraft whilst executing the left subtree representing steering. Fallbacks execute their children from left to right until one child succeeds, while sequences execute their children until one child fails. By this logic, the aircraft will have to reach a series of sub-goals before it can finally convert.

Fig. 2. Stern conversion intercept represented as a BT. Adapted from Ramirez et al. [17].

Even though BTs span the same range of behavior models as FSMs, they are in some respects easier to manage and modify as they become complex. Because all nodes return either success, failure, or running, the interface is fixed and thus subtrees can be inserted anywhere in the existing model. Moreover, it makes graphical editors suitable for BTs. The flow in a BT travels down to children and then back up to parents. This is a two-way transfer of control, as opposed to the one-way transfer of control exhibited by FSMs [15].

C. Fuzzy Tree (FT)

An FT is a fuzzy system (FS) [19] organized as a hierarchical tree structure. A defining trait of FTs is that percepts and actions are represented by linguistic variables such as close, threatening, defensive, or evade. This conforms to human reasoning, which builds on qualitative descriptions rather than numbers. The most basic components of FSs are membership functions, defining the linguistic variables in terms of agent percepts. In air combat with two aircraft, close may be represented as a sigmoid membership function that maps aircraft distance to a membership value μ ∈ [0, 1] representing the extent to which the aircraft are close. The linguistic variables are used to make rules that constitute the behavior model, such as "IF enemy in pursuit AND close, perform jink". Because fuzzy logic uses numbers instead of true and false, AND returns the minimum of μ(enemy in pursuit) and μ(close). Whether to perform a jink depends on the membership values and a chosen threshold. If there are multiple rules for maneuvering, they must be aggregated in some way.

D. Neural Network (NN)

Neural networks work as versatile behavior models where the rationale for the actions is implicit. The flexible structure allows a wide range of behaviors to be learned by data-driven methods. An example of an NN is shown in Fig. 3, composed of percepts X1, X2, and X3, two hidden layers shown in gray, and actions Y1 and Y2. The edges in the network represent weights, and those are the trainable parameters of the behavior model. In the most basic networks, the value of each node is the weighted sum of its inputs, but modern NNs often employ some additional nonlinear transformation, which greatly increases the function space. The number and size of the hidden layers is also adjustable.

Fig. 3. Neural network with three percept inputs, two hidden layers of four nodes (gray), and two action outputs.
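As a concrete illustration of the network in Fig. 3, the minimal sketch below computes a forward pass for a small fully connected policy with three percept inputs, two hidden layers of four nodes, and two action outputs. The weight initialization and the tanh nonlinearity are illustrative choices, not taken from any of the surveyed studies.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases; these are the trainable parameters.
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(percepts, layers):
    # Each node value is a weighted sum of its inputs, followed by a nonlinearity
    # on the hidden layers.
    x = percepts
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x                             # action outputs Y1, Y2

rng = np.random.default_rng(0)
layers = [init_layer(3, 4, rng), init_layer(4, 4, rng), init_layer(4, 2, rng)]
actions = forward(np.array([0.2, -1.0, 0.5]), layers)   # percepts X1, X2, X3
```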
IV. REINFORCEMENT LEARNING

An agent learning by reinforcement works out what to do without being told explicitly, and this is perhaps the most common approach to learning behavior for intelligent agents. The agent interacts with an environment and receives rewards according to its performance. The aim is to maximize rewards over time by choosing the right actions [20], which amounts to finding the optimal policy π* that results in the greatest cumulative reward across all possible states.

Problems solved using RL are commonly formulated as Markov decision processes (MDPs), where actions influence the rewards, the following environment states, and, thus, future rewards. An MDP is defined by a tuple (S, A, T, R, γ) where

• S is a set of states
• A is a set of actions
• T is a transition function specifying, for each state, action, and next state, the probability of that next state occurring
• R is a reward function specifying the immediate reward given a state and an action
• γ is a discount factor specifying the relative importance of immediate rewards

For all time steps in an MDP, the agent observes a state s ∈ S, makes an action a ∈ A based on the state, transitions to the next state s′, and receives a reward r = R(s, a) based on this transition. The transition function may be deterministic or stochastic, in which case T(s, a, s′) ∼ P(s′|s, a) is the probability of the new state being s′ when applying action a in state s. The goal of MDPs is to find the policy π that maximizes the expected sum of discounted rewards over time

arg max_π E[Σ_t R(s_t, a_t) | π].

Rewards are sometimes sparse, making it difficult for agents to learn the appropriate behavior. Reward shaping is a technique where supplemental rewards are provided to make a problem easier to learn, especially during the early stages of learning [21]. Agents may have to consider multiple, possibly conflicting objectives when making decisions. For example, depending on the mission, a fighter pilot must consider aspects like resource consumption, careful use of sensors to avoid being detected by opponents, and risk and safety in general. Multi-objective RL (MORL) is a paradigm where agents learn to balance between such priorities. In MORL, the problem usually takes the form of a multi-objective MDP, in which the reward function R describes a vector of rewards, one for each objective [22].

While the above description has focused on learning the behavior of a single agent, multi-agent RL (MARL) studies the behavior of multiple learning agents that coexist in a shared environment. Centralized training and decentralized execution (CTDE) is a popular MARL framework, where agents are trained offline with centralized information and execute online in a decentralized manner [23].

There are two distinct methodological approaches to RL: searching in policy space and searching in value function space. Policy-based methods maintain explicit representations of policies and modify them through search operators. In contrast, value-based methods do not maintain policy representations but attempt to learn a value function V^π*(s), defined as the expected discounted cumulative reward for π* [24]. In value-based methods, the policies are implicit and can be derived directly from the value function by picking the actions yielding the best values. A third approach, actor-critic methods, approximates both policy and value functions. The actor refers to the policy, while the critic refers to the value function [20, p. 321]. Central to all RL methods is finding a balance between exploration of new strategies and exploitation, meaning making the best decision given current information. The RL methods relevant to the survey are briefly presented here, with references to further reading.

A. Value-based Methods

Dynamic programming (DP), created by Richard Bellman in the 1950s, is a class of methods for solving optimal control problems of dynamical systems and can be used to find the optimal value functions in RL problems by solving the Bellman optimality equation [20, p. 14]

V*(s) = max_a Σ_{s′} T(s, a, s′)[R(s, a) + γV*(s′)].   (1)

However, classical DP algorithms are limited to trivial MDPs because of great computational expense as the number of state variables grows. As seen in (1), they assume perfect knowledge of the reward and transition functions. To circumvent these shortcomings, approximate dynamic programming (ADP) operates with a statistical approximation of the value function rather than computing it exactly.

The Bellman equation, with some rewriting, may also be used to find a Q function, which is the value function conditioned on taking a particular action next. Q-learning [25] is a popular algorithm for doing this. In Q-learning, the Q function is a table holding the values of all actions at all possible states and is updated by

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],

where α is the update step size and γ the factor discounting future rewards. The Q-learning algorithm operates on a finite set of discrete states and actions, and the resulting policy amounts to selecting the action that yields the greatest value given a state.

Extensions to Q-learning have been made, like deep Q-learning [26], achieving eminent results in human-level control applications. Deep Q-learning combines Q-learning with neural networks, forming deep Q-networks (DQNs) for function approximation with continuous state spaces. Deep Q-learning uses a replay buffer to learn from previously collected data multiple times to improve sample efficiency and convergence. The replay buffer holds a finite but continuously updated set of transitions (s, a, r, s′).

Variants of Q-learning are also applicable to MARL problems. For example, under the CTDE framework, QMIX [27] is a Q-learning algorithm for learning behaviors of cooperative agents. The method employs a neural network for estimating the joint Q-values of multiple agents.

An alternative value-based RL algorithm is Monte-Carlo tree search (MCTS), a heuristic search algorithm for estimating the optimal value function by constructing a search tree using Monte-Carlo simulations [28]. This method has been used to find the best move in board games like Go [29]. Like with ADP and Q-learning, MCTS may be combined with neural networks, which in the case of Go proved able to beat professional human players [30].
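To make the tabular update rule above concrete, the sketch below runs ε-greedy Q-learning against a generic environment object. The environment interface (reset() returning a discrete state index and step(a) returning the next state, reward, and a done flag) and all hyperparameter values are assumptions for illustration, not taken from any specific study.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))        # Q table over all states and actions
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, otherwise exploit the current table.
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a Q(s', a).
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    # The resulting policy selects the action with the greatest value in each state.
    return Q, Q.argmax(axis=1)
```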
B. Actor-Critic Methods

The deep deterministic policy gradient (DDPG) algorithm also adapts the ideas behind the success of deep Q-learning to the continuous action domain [31]. The actor π_θ, which is deterministic, and critic Q_φ are neural networks parameterized by θ and φ. Like in deep Q-learning, DDPG makes use of a replay buffer with N transitions (s, a, r, s′) collected by iterations of the policy π_θ. With these transitions, Q_φ is updated to satisfy the Bellman equation more closely with each iteration, as in deep Q-learning. The new Q_φ is used to update the policy π_θ by ascending with the gradient

(1/N) Σ_{t=1..N} ∇_θ Q_φ(s_t, π_θ(s_t)),

which is computed by backpropagating through Q_φ. However, the DDPG algorithm is reported to have some stabilization challenges and to be brittle to hyperparameter settings [32]. Beyond the use of replay buffers, sample efficiency can be improved by running multiple local copies of the actor in parallel and asynchronously or synchronously updating the global model, a technique employed in the Advantage Actor-Critic (A2C) algorithm [33].

Extending DDPG, the multi-agent deep deterministic policy gradient (MADDPG) algorithm [34] is an actor-critic method for MARL, which adopts the CTDE framework. During training, the centralized critic has access to the observations and actions of all agents and guides their policy updates. After training, the agents select actions based solely on their local observations.

Another method that builds on the concepts of DDPG is the soft actor-critic (SAC) algorithm [35]. Unlike DDPG, SAC learns a stochastic policy, i.e., a distribution π_θ(·|s_t) of actions given states. To motivate action exploration, SAC aims to maximize not only the expected return but also the entropy of the policy, making the objective

(1/N) Σ_{t=1..N} [R(s_t, a_t) + cH(π_θ(·|s_t))],

where H(π_θ(·|s_t)) is the entropy weighted by the coefficient c. Though not described here, SAC also learns both a value function and a Q function, which has proven to stabilize the learning process.

Inconveniently, updates to a policy may sometimes be detrimental. Proximal policy optimization (PPO) algorithms [36] employ mechanisms to prevent large diversions from the current policy configuration when optimizing. Like SAC, PPO promotes action exploration by maximizing the entropy of the stochastic policy, but more central to the method is the clipped objective L^clip(θ) estimating the quality of actions based on rewards. To make conservative quality estimates and keep the policy updates small, L^clip(θ) is clipped to the interval [1 − ε, 1 + ε]. Putting together these terms with the objective of the critic L^VF(φ) gives the complete PPO objective

(1/N) Σ_{t=1..N} [L^clip(θ) + cH(π_θ(·|s_t)) − L^VF(φ)].

All the actor-critic methods discussed above are types of deep RL. Dynamic scripting (DS) [37] is a different but well-studied RL method in the surveyed literature. Initially designed for behavior generation for non-player characters in video games, DS aims to meet computational and functional requirements related to speed, effectiveness, clarity, and variety in learning behavior. Dynamic scripting produces policies in the form of behavior scripts that contain a set of behavior rules. These behavior rules are if-then statements mapping states to actions and are contained in a rulebase. Each rule in the rulebase is assigned a weight value comparable to a Q value. Creating a behavior script involves selecting n rules from the rulebase according to their probability of being selected, equal to each weight value divided by the sum of all weights. When an agent interacts with its environment, a weight adjustment function updates the weights based on the rewards the agent receives. This way, favorable rules will likely be included in upcoming scripts. The weight adjustment function keeps the total weight in the rulebase constant, meaning that when the weights of specific rules increase, the weights of the other rules decrease. Due to the probabilistic nature of dynamic scripting, all scripts generated are likely to contain different sets of behavior rules.

V. IMITATION LEARNING

While RL attempts to solve tasks by maximizing expected rewards, IL aims to reproduce desired behavior from demonstrations, normally given by human operators. Thus, IL can be considered a class of methods that allow transferring skills from humans to robotic or computer systems [38, ch. 1]. For some tasks, it is easier to demonstrate how they are performed than to define reward functions. Besides, expert demonstrators may have their opinions on exactly how the agent should perform the task, like in the air combat domain, where realistic behavior may mean acting in line with doctrines, tactics, techniques, and procedures. Another significant challenge in IL is the dependency of subsequent states in a demonstration, which violates the assumptions of independent and identically distributed data when minimizing the loss. This violation may lead to poor performance in theory and practice [39].

Imitation learning can be categorized into passive collection of demonstrations and active collection [40]. In the passive collection setting, demonstrations are collected beforehand, and IL aims to find a policy that mimics them. In contrast, the active collection setting assumes an interactive expert that provides demonstrations in response to actions taken by the current policy.

Behavior cloning (BC) is an example of IL using a passive collection of demonstrations D = {(s_t, π*(s_t)) | t ∈ 1, …, N}. It approximates the expert policy π* with supervised learning, minimizing the difference between the learned policy π and expert demonstrations to create an approximate policy

π̂* = arg min_π Σ_{s∈D} L(π(s), π*(s)),

where L is a distance loss function. Applying π̂* in states substantially different from D may lead to unexpected and unwanted actions due to the lack of training data. This is a problem because slight dissonances in π̂* may cause new trajectories leading far away from D. This may happen even if D is not collected in sequence. However, an extensive set of demonstrations will mitigate the risk.

An active collection of demonstrations may be leveraged using methods such as DAgger (dataset aggregation) [39]. DAgger poses a way to query the expert for more information when π̂* leads to states not covered by the demonstrations, followed by retraining the policy using the aggregated demonstration set.
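The behavior cloning objective above can be minimized with ordinary supervised learning. The sketch below assumes continuous maneuver commands and a mean squared error loss; the network shape, the optimizer settings, and the tensors states and expert_actions are illustrative placeholders rather than details from any surveyed study.

```python
import torch
import torch.nn as nn

def behavior_cloning(states, expert_actions, epochs=100, lr=1e-3):
    """Fit a policy network to expert demonstrations D = {(s, pi*(s))}."""
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, expert_actions.shape[1]),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                 # distance loss L between pi(s) and pi*(s)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(policy(states), expert_actions)
        loss.backward()                    # gradient of the imitation loss
        optimizer.step()
    return policy
```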
Rather than imitating demonstrations, inverse reinforcement learning aims to learn the expert's intent, implicit in these demonstrations, by recovering the underlying and unknown reward function [38]. Consider a reward function parameterized by a linear combination of features

R(s, a) = wᵀφ(s, a),

where w ∈ ℝⁿ is a weight vector and φ(s, a): S × A → ℝⁿ is a feature map. For a given φ(s, a), the goal of inverse RL is to determine w and thus R(s, a). Having found R(s, a), π̂* can be approximated by regular RL. Inverse RL can be suitable because the "expert" is not always optimal. Besides, a policy optimal for the expert may not be optimal for the agent if they have different dynamics or capabilities.

VI. EVOLUTIONARY ALGORITHMS

Evolutionary algorithms (EAs) are population-based optimization algorithms suitable for solving a range of problems, like generating control policies in robotics [41]. Evolutionary algorithms are not classified as methods of RL or IL but rather belong to a paradigm of their own. One reason is that EAs manage a population of policies rather than just one. Genetic algorithms (GAs) are a common type of EA inspired by the Darwinian principles of natural selection, allowing population members to interact and reproduce [42]. Genetic algorithms evaluate all policies in a population using a fitness function, which in the MDP formulation is equivalent to a reward function. Then, a selection mechanism selects some policies for recombination, creating new policies that inherit traits of their parents. Usually, the new policies undergo some degree of mutation to explore the search space further. This process is called evolution, which typically terminates when the population converges or some termination criterion is satisfied. Neuroevolution is the process of evolving neural networks using GAs. A popular method for this is the neuroevolution of augmenting topologies (NEAT) [43], which evolves both network structures and their parameter values.

VII. COMPOSITE LEARNING

The term composite learning is used here to refer to strategies that partition the learning into multiple stages or partition the learning task into smaller tasks.

A. Transfer Learning

Transfer learning refers to the idea that learning can be sped up by leveraging previous knowledge from related tasks [44]. Rather than attempting to learn a difficult task from scratch, better-performing policies can be obtained with less data and training by generalizing across different tasks. Knowledge may also be transferred between different environments or from a simulated environment to the real world [45]. The benefits of transfer learning are measured by metrics such as jumpstart (the initial performance of an agent on a target task), asymptotic performance (the improvements made to the final learned performance via transfer), and time to threshold (the reduction in time needed to achieve a specified threshold).

B. Curriculum Learning

Curriculum learning can be seen as a special form of transfer learning where initial tasks are used to guide the learner to perform better on the final task [46]. As knowledge is transferred between tasks, the task sequence induces a curriculum, shown to improve policy convergence and performance on difficult tasks [47]. The curriculum can be designed by altering the parameters of the environment, changing the reward function, adding constraints to the task, and more.

C. Hierarchical Learning

Hierarchical policies attempt to tackle complex tasks by breaking them down into smaller tasks. Although such hierarchies may have multiple levels, consider a two-level policy hierarchy in which the high-level policy μ: s → g corresponds to selecting a low-level policy π_g: s → a executing subtask g. The hierarchical learning problem is to simultaneously learn the high-level policy μ and sub-policies π_g [40]. If formulated as an RL problem, the goal is to maximize the rewards gained when μ and the π_g's work together.
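A minimal sketch of the two-level hierarchy described in Section VII-C is given below: a high-level policy selects a subtask, and the corresponding low-level policy produces the action. The subtask names, the state and action dictionaries, and the policy call signatures are hypothetical and only meant to show the structure.

```python
from typing import Callable, Dict

# Low-level policies pi_g: state -> action, one per subtask g.
SubPolicy = Callable[[dict], dict]

class HierarchicalPolicy:
    def __init__(self, high_level: Callable[[dict], str], sub_policies: Dict[str, SubPolicy]):
        self.high_level = high_level        # mu: state -> subtask label g
        self.sub_policies = sub_policies    # {g: pi_g}

    def act(self, state: dict) -> dict:
        g = self.high_level(state)          # e.g. "defensive" or "aggressive"
        return self.sub_policies[g](state)  # delegate action selection to pi_g

# Hypothetical usage with placeholder policies.
policy = HierarchicalPolicy(
    high_level=lambda s: "defensive" if s.get("opponent_advantage", 0.0) > 0 else "aggressive",
    sub_policies={
        "defensive": lambda s: {"maneuver": "break_turn"},
        "aggressive": lambda s: {"maneuver": "pure_pursuit"},
    },
)
action = policy.act({"opponent_advantage": 0.3})
```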
VIII. LITERATURE OVERVIEW

There are several ways to organize the surveyed literature to help identify gaps and trends and provide a map of air combat behavior modeling using ML. We have grouped publications that share authors and address similar problems. These 24 groups (studies) are listed in Table I and classified by 11 properties (columns). Year refers to the most recent publication of the study. The applications are described in Section I and express the intended use of the generated behavior models. Some studies examine multi-agent learning, where multiple agents operate and learn in the same environment. All the surveyed studies that concern multi-agent learning employ RL.

Beyond-visual-range (BVR) engagements have become standard in real air combat. If enemy aircraft are too far away, they are no longer visible to the unaided eye, and the pilot must rely on long-range detection systems. The visual range on a clear day is roughly 12 km depending on the target airframe size [48]. In contrast, within visual range (WVR) are engagements where the enemies are visible to the pilot's naked eye.

The behavior models are learned in the context of a mission task assigned to one or more agents. The mission task corresponds to the mission or basic task a pilot could encounter in an operation. The behavior model expresses maneuvers or other functions of a pilot as output values. Naturally, only a subset of the pilot functions is included, and these are referred to as agent functions.

A key choice in modeling is the type of behavior model discussed in Section III, which shapes the behavior of the agent and how it will learn. The subsequent columns list the learning paradigm and learning method applied, for which summaries are provided in Sections IV-VII.

The simulation system is the environment in which a CGF resides. There is no universally favored simulation system, but rather a wide range of bespoke, commercial, and government-owned systems.

The last column states the degrees of freedom of a simulated aircraft. While many simulation systems technically provide 3D environments, some studies restrict the aircraft dynamics to 2D, reducing the number of maneuver variables in the agent function.

Based on the classification presented in Table I and related literature, Sections A-D present clear trends and challenges in the research field.

TABLE I
OVERVIEW OF THE SURVEYED LITERATURE IN TERMS OF ELEVEN KEY CHARACTERISTICS

Study | Year | Application | Single/multi | BVR/WVR | Mission task | Agent function | Behavior model | Learning paradigm | Learning method | Simulation system | Agent DOF
Abbott et al. [49], Abbott [50], Abbott et al. [51] | 2015 | Training | Single | BVR, WVR | A/A, A/S, formation | M, SA | DT, BTN, NN | IL | BC, classifiers | DVTE, NGTS | 3D
Bae et al. [52] | 2023 | UCAV | Single | WVR | Dogfight | M | NN | RL, curriculum | SAC | JSBSim | 3D
Chai et al. [53] | 2023 | UCAV | Single | WVR | Dogfight | M | NN | RL | PPO, self-play | Bespoke | 3D
Ernest et al. [54], Ernest et al. [55], Ernest et al. [56] | 2016 | UCAV | Single | BVR, WVR | SEAD | W | Fuzzy tree/system | RL | GA | Bespoke, AFSIM | 2D, 3D
Gorton et al. [57] | 2023 | Training | Single | WVR | CAP, flee | M | BTN | EA | NEAT (GA) | NGTS | 2D
Han et al. [58], Piao et al. [59], Sun et al. [60] | 2022 | Tactics, UCAV | Multi | BVR, WVR | A/A | M, W | NN | RL, hierarchy | A2C, PPO | Bespoke | 3D
Hu et al. [61] | 2021 | Planning | Single | BVR | Pursue, flee | M, W | NN | RL | DQL | Bespoke | 3D
Johansson [62] | 2018 | Training | Single | BVR | Dogfight | M, W | BT | EA | GA | TACSI | 3D
Källström [63], Källström et al. [64], Källström et al. [2] | 2022 | Training | Multi | BVR, WVR | A/S, CAP, recon, jam, police, coord | M, W, comms, jammer | NN | RL, curriculum | DDPG | Bespoke | 2D
Kong et al. [65], Kong et al. [66] | 2022 | UCAV | Multi | WVR | Dogfight | M | NN | RL, hierarchy, curriculum | DDPG, QMIX, RSAC, self-play | JSBSim, bespoke | 2D
Li et al. [67] | 2022 | Tactics | Single | WVR | Dogfight | M | NN | RL | PPO | Bespoke | 3D
Ludwig and Presnell [68] | 2019 | Training | Multi | WVR | Dogfight | M | BTN | RL | DS | NGTS | 2D
McGrew et al. [69] | 2010 | UCAV | Single | WVR | Dogfight | M | NN | RL | ADP | Bespoke | 2D
Pope et al. [70], Pope et al. [71] | 2022 | F-16 autopilot | Single | WVR | Dogfight | M | NN | RL, hierarchy | SAC | JSBSim | 3D
Reinisch et al. [72] | 2022 | Training | Multi | BVR | A/A | M, W, CM, R, jammer | BT, NN | RL, hierarchy | n/a | Bespoke | 3D
Sandström [73], Sandström et al. [74] | 2022 | Training | Single | n/a | Maneuver | M | NN | IL, transfer | BC | VBS3 | 3D
Selmonaj et al. [75] | 2023 | Training | Multi | WVR | A/A | M, W | NN | RL, hierarchy, curriculum | PPO, self-play | Bespoke | 2D
Sommer et al. [76] | 2021 | Training, CD&E | Single | BVR | Avoid SAM | M | NN | RL, transfer | Neural MCTS | CMO | 3D
Strand et al. [77] | 2023 | Training | Single | WVR | Formation | M | NN | RL | PPO | Bespoke | 2D
Teng et al. [78] | 2012 | Training | Single | BVR | A/A | M, W, CM | NN | RL | Q-learning | STRIVE | 3D
Toubman [3] | 2020 | Training | Multi | BVR | A/A, CAP | M, W, R, comms | Rules | RL | DS | Bespoke | 2D
Yao et al. [79] | 2015 | Training | Single | BVR | A/A | M, W, CM, R | BT | EA | GA | Bespoke | 3D
Zhang et al. [80] | 2020 | Planning | Single | BVR | SEAD | M | NN | RL | PPO | Bespoke, AFSIM | 2D
Zhang et al. [81] | 2022 | UCAV | Multi | WVR | A/A | M, W | NN | RL | PPO, self-play | Bespoke | 3D
The table contains the following abbreviations not previously defined:

A/A air-to-air
A/S air-to-surface
CAP combat air patrol
comms communications
coord coordination
CM countermeasure
M maneuver
n/a not applicable
R radar
recon reconnaissance
SAM surface-to-air missile
SEAD suppression of enemy air defenses
SA situational awareness
W weapons
CD&E concept development and experimentation

A. Air-to-Air Combat

The most frequent learning task is air-to-air combat and in particular dogfighting. Dogfighting is close-range WVR air combat where basic fighter maneuvers (BFM) are used to arrive behind enemy aircraft for a favorable engagement position. It is an art that emerged naturally in World War I [82] and follows principles such as balancing airspeed and altitude, minimizing turn rates, attacking from the direction of the sun, and avoiding overshoots. In broader terms, dogfighting is a three-dimensional geometrical problem governed by the physical limitations of the aircraft and pilots.

The Defense Advanced Research Project Agency (DARPA) Air Combat Evolution (ACE) program "seeks to increase trust in combat autonomy by using human-machine collaborative dogfighting as its challenge problem" [83]. A feasibility study for ACE called the AlphaDogfight Trials invited eight companies to make dogfighting agents that would compete in a series of knockout tournaments. The AlphaDogfight Trials culminated in 2020 when the top agent was matched with an expert human pilot and won [83]. Later, developers at ACE uploaded an agent to a modified F-16 known as the Variable In-flight Simulator Test Aircraft (VISTA) and demonstrated that the agent could control the aircraft in multiple sorties with various simulated adversaries and weapons systems [84].

BVR air combat was first seen on a large scale in the Vietnam War [48], and has gradually become the main type of air combat engagement due to more advanced weapons, sensors, and sensor fusion [72, 85]. The BVR aspect adds maneuver elements such as breaking radar locks and exhausting the energy of incoming missiles. When to use radar and fire missiles becomes critical [86].

B. Neural Networks

Three-quarters of the studies use neural networks to represent behavior models. Deep neural networks provide a hierarchical representation of the environment. The first hidden layer constitutes new abstract concepts based on the percepts. Each consecutive layer represents more abstract concepts [87]. Thus, there is potential in deep neural networks to express more advanced concepts such as threat assessment, engagement envelopes, energy management, and cooperation. The hidden layers are often powers of two to optimize memory allocation and access. Two hidden layers of 256 nodes are used in multiple studies [53, 66, 81], but generally, a large range of shapes and sizes are employed, which is necessary to accommodate different state spaces, action spaces, and behaviors.

In most of the simulations we are concerned with, the agent acts based only on the observed state of the entities at the current time. However, some studies apply gated memory to represent the states at earlier time steps. Hu et al. [61] employ long short-term memory (LSTM) in their policy network to represent the percepts of 30 state variables from the five previous time steps along with the current one. Each LSTM unit decides what information to keep and what information to discard from the previous state. Bae et al. [52] measured a substantial increase in performance when adding an LSTM layer, especially in partially observable environments where the agent is less informed. Kong et al. [66] and Selmonaj et al. [75] use gated recurrent units, a simpler version of LSTM, in both actor and critic networks. Zhang et al. [81] and Källström et al. [2] simply make one concatenated input layer with the four most recent percepts of the state variables.

Simulations with many entities, and potentially a variable number of entities, pose a question of how to design the input layer of the policy network. Too many percepts will be a noisy representation of the environment, yet there must be sufficient percepts to enable situational awareness. Han et al. [58] use graph attention networks to recognize the most important ally and enemy in the air combat scenario for all aircraft. Based on that, they construct an input layer representing the ownship, primary enemy, primary ally, and the primary enemy of the primary ally. Additionally, they include information on all aircraft that have the agent aircraft as the most important friend, but to fix the input length, they add these together by variable. The solution of Kong et al. [66] is to embed information about all friendly and enemy aircraft in multi-head attention structures, while Selmonaj et al. [75] opt to include only the closest friends and enemies. Zhang et al. [80] also discuss the challenge of a variable number of entities and suggest either representing irrelevant entities by zeros or dividing the play area into regions with associated counts of friends and enemies.

C. Actor-Critic Methods

Many of the studies train the neural network with actor-critic methods, represented by A2C [33], DDPG [31], PPO [36], and SAC [35], which combine advantages from policy-based and value-based methods, such as allowing high-dimensional state spaces, continuous action spaces, and reduced variance in policy gradient estimates.

Sharing layers between the actor and critic networks allows learning of common abstract concepts from raw percepts and reduces parameters and computation. Li et al. [67] make two hidden layers shared, while other studies employ one shared layer [58, 59, 75]. Sharing layers between agents is also possible. If the agents have similar tasks, they may share the same neural network and receive a common reward [66]. Multiple studies [60, 2, 75] apply CTDE, which relaxes the assumption of homogeneous agents but still produces coordination between them [88].
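As an illustration of sharing layers between the actor and critic, the sketch below defines a network with one shared hidden layer feeding both an action head and a value head. The layer sizes and the continuous action output are illustrative assumptions and do not reproduce the architecture of any particular surveyed study.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic with a shared first hidden layer over raw percepts."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())   # common abstraction
        self.actor_head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                        nn.Linear(hidden, act_dim))           # policy output
        self.critic_head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                         nn.Linear(hidden, 1))                # state-value output

    def forward(self, obs: torch.Tensor):
        features = self.shared(obs)
        return self.actor_head(features), self.critic_head(features)

# Hypothetical usage: 30 percepts in, 4 continuous maneuver commands out.
model = SharedActorCritic(obs_dim=30, act_dim=4)
action, value = model(torch.zeros(1, 30))
```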
D. Bespoke Simulation Systems

No simulation system is applied in more than three studies, and many are custom-built. Bespoke simulation systems provide precise control over entity dynamics and allow free manipulation of scenarios and integration of agents. They also speak to the need for meticulous data collection and proper management of potentially classified data. Building a new simulation system requires resources and knowledge of the domain, simulations, and software development. It is essential to conduct sufficient validation and maintenance to ensure the system's accuracy and reliability.

To our knowledge, only JSBSim [89] is open-source. Moreover, commercial off-the-shelf choices are TACSI [90], STRIVE [91], VBS [92], and CMO [93]. Government off-the-shelf choices are NGTS [94], DVTE [95], and AFSIM [96].

IX. DISCUSSION

Most of the reviewed studies are motivated by improving the simulation-based training of fighter pilots. They represent a concerted effort to produce behavior models for air combat part-tasks and missions. Regardless, there is a long way to go from demonstrating adequate behavior in such tasks to successfully integrating the agents into pilot training in a way that enhances the experience. There are both technical and human challenges to be addressed.

Instructors would need easy access to agents capable of playing a certain part in scenarios they design, either from a small selection of adaptive agents or from a larger selection of specialized agents. An agent learns adaptive behavior by exposure to different situations and tasks, which requires time and a behavior model of a size and structure capable of representing the complexity. Most of the reviewed works apply random starting positions or velocities for each new episode, and some adjust other scenario parameters such as the number of friendly and hostile aircraft. Still, it is not feasible to implement or even predict all possible conditions that may occur in pilot training.

A high level of adaptivity presumes a full selection of agent functions including maneuvering, weapons, and radar controls. Half of the studies include weapons, but only a few studies regard the choice of when and how to use radar and countermeasures. While functions such as radar, countermeasures, and afterburner can be modeled implicitly in many cases, some training scenarios call for a pilot's explicit use of these functions.

Instructors should have the option to adjust the level of aggression or other traits of the CGF agents according to the training objective and experience level of the training audience. The behavior representation dictates how trait parameters can be included. Källström et al. [2] asked 25 pilots what they see as important characteristics of agents for use in training, and distinguish between basic training, procedures, and missions. The pilots saw agents with deterministic behaviors as essential in basic training. Moreover, procedures also call for advanced tactical agents, while missions demand doctrinal behavior on top of that. Challenging opponents were only found desirable in procedures and missions.

Källström et al. [2] also asked the pilots whether communication with the agents would enhance training. While communication would be convenient, teammates often understand how to react without it. Speech as a form of communication may be challenging yet feasible because it follows predefined protocols.

The technical challenges also include the transfer of agents from their learning environments to the simulation systems used by military personnel, especially if these systems are substantially different from one another. It must be possible to extract the percepts the agents rely on, and the action formats must align. Moreover, the flight dynamics model of the ML environment must be sophisticated enough to capture the dynamics relevant to the pilot training simulation [97]. But, considering the large demand for data in deep learning, lightweight simulation systems are preferred. To facilitate the transfer of agents, Strand et al. [77] suggest using a distributed simulation protocol [98] to enable the interaction between simulation systems of different fidelities. They also highlight the use of standard ML interfaces like Gymnasium [99] to allow rapid changes to an agent's state and action spaces and switches between ML methods.

Vectorized RL environments and experience replay can reduce the learning time manyfold, and most of the applied methods permit either or both techniques. Vectorized environments are easy to set up and utilize multiple processing units to collect experience concurrently. Experience replay is a key component for providing stability in off-policy algorithms, such as DDPG and SAC, and can also be used to emphasize particularly important experiences such as the use of weapons [71].

Despite similar requirements, few studies employ the same simulation system. This leads to a lack of standardization and benchmarking. One large effort in employing a standard environment is the Not So Grand Challenge, comprising nine companies funded by the Air Force Research Laboratory [100, 101, 102, 103]. The companies develop their agents and test these in a common testbed built as a distributed simulation with government off-the-shelf software [4]. They have made agents for a series of 1v1 and 2v2 scenarios and are gradually building a library of adversary agents for pilot training. The idea is to have a digital librarian suggest agents for a certain scenario in a way that reflects and improves pilot performance [104, 105].

Air-to-air scenarios are predominant in the reviewed studies, but less than half are BVR, even though this is gradually becoming the norm as the reach of A/A missiles and sensors increases. Possibly, dogfights are still predominant because they involve the BFM a pilot learns early in training. In some sense, it is natural that agents follow the same curriculum as pilots. The studies that utilized curriculum learning exposed agents to increasingly difficult scenarios that involved gradually reducing tactical advantage [52], adding more opponents [58, 66], and making the opponents more competitive [75]. Källström et al. [2] highlight that curriculum learning can combat problems that arise in reinforcement learning with sparse rewards. These studies found that gradually increasing the scenario complexity resulted in increased learning efficiency.
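A curriculum of the kind described above can be expressed as an ordered list of scenario configurations that the agent is promoted through once it performs well enough. The stage parameters, the success-rate threshold, and the train_stage function below are hypothetical placeholders meant only to show the structure of such a schedule, not the setup of any surveyed study.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    n_opponents: int        # more opponents in later stages
    opponent_skill: float   # 0.0 (passive) to 1.0 (competitive)
    own_advantage: float    # initial tactical advantage, reduced over time

CURRICULUM = [
    Stage("warm-up", n_opponents=1, opponent_skill=0.2, own_advantage=0.8),
    Stage("contested", n_opponents=1, opponent_skill=0.6, own_advantage=0.4),
    Stage("outnumbered", n_opponents=2, opponent_skill=0.8, own_advantage=0.0),
]

def run_curriculum(train_stage, promote_at=0.7):
    """Advance to the next stage once the agent's success rate reaches the threshold."""
    for stage in CURRICULUM:
        success_rate = 0.0
        while success_rate < promote_at:
            success_rate = train_stage(stage)   # trains for a while, returns an evaluation score
```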
However, it is needless to make agents learn certain elements, such as dogfighting, if this is not their intended final employment. In fact, BFM is potentially a large detour to proficiency in BVR scenarios, which are generally less acrobatic and more strategic.

The fact that dogfights are less strategic and more tangible may also be a reason they are preferred as use cases. Simple goals such as avoiding being shot or shooting down enemy aircraft are easier to reflect in reward functions that guide RL properly. Nonetheless, the goals in BVR combat are not so different. Some studies actively use reward shaping to improve learning convergence and include doctrine or domain knowledge [59, 60, 3].

Multi-agent RL has emerged as a powerful learning paradigm due to its ability to capture interactions and tactical dependencies and dynamics between agents. Pilots do not operate alone, and it is essential they learn how to cooperate with their flight and squadron. In MARL, each agent represents a non-stationarity for the other friendly and adversarial agents, which makes the learning fundamentally more difficult, but also more realistic [106]. Key aspects of cooperation include formations, target coordination, and defensive support, which all become second nature to pilots eventually. In contrast, explicitly formulating collaborative behavior rules is hard [58].

Complex behavior models may exploit hierarchical structures to break down a task into smaller parts, as described in Section VII-C. The high-level policies in the surveyed literature all include the choice between at least one defensive and one aggressive sub-policy. A defensive policy is used when the opponent has the advantage, and vice versa. Kong et al. [66] and Pope et al. [71] include a dedicated sub-policy for attaining the control zone position behind the opponent aircraft. Selmonaj et al. [75] and Kong et al. [66] also include target selection in the high-level policy. The architecture of Sun et al. [60] is distinct because the high-level policy has as many as 14 outputs that encode BFM macro actions. They use a low-level policy to decide the normal load factor and velocity command to apply to the selected macro action.

X. CONCLUSION

The most prominent applications for air combat behavior models based on machine learning are enhancing simulation-based pilot training, mission planning, developing new tactics and strategies, and optimizing unmanned aerial combat vehicles. The reviewed studies exhibit a concerted effort to model behavior for specific air combat tasks, particularly in the simulation-based training of fighter pilots. However, despite notable progress, challenges persist in seamlessly integrating these models into comprehensive pilot training programs, presenting both technical and human obstacles that demand attention.

The desire for adaptable agents with a wide range of functions, including maneuvering, weapons, radar controls, and countermeasures, emphasizes the need for a comprehensive approach to machine learning-based behavior modeling. Certain studies also point to the importance of making more of the agent functions explicit to attain sufficient realism. The technical challenges of transferring agents from their learning environments to pilot training simulation systems underscore the importance of aligning percepts and action formats, as well as maintaining dynamics models that balance sophistication with execution speed. Standardization remains a challenge, with few studies employing the same simulation system, highlighting the need for initiatives like the Not So Grand Challenge to establish common testbeds.

After surveying the current state of the research field, we have reached four recommendations intended to aid advancements toward more comprehensive, adaptable, and realistic machine learning-based behavior models for air combat.

A. Emphasis on Beyond Visual Range Scenarios

Although dogfighting machine learning agents are impressive, they are not highly relevant in the current state of air combat. Based on our review, a reasonable shift of focus would be from WVR missions to prioritizing the development of behavior models that incorporate the complexities and strategic aspects required in BVR missions. This applies particularly to the applications of simulation-based air combat training, mission planning, and the development of new tactics and strategies.

B. Enhanced Focus on Multi-Agent Machine Learning and Cooperation

Fighter pilots do not operate alone, yet the research field represents a preponderance of studies focusing on the behavior of a single agent. The effectiveness of multi-agent methods in capturing tactical dependencies and interactions among agents suggests a need for increased research in this area. Future studies are urged to delve deeper into cooperative behaviors among agents, emphasizing formations, target coordination, and defensive support. Understanding and simulating the complexities of teamwork and collaboration in air combat scenarios will contribute to more realistic air combat experiences.

C. Utilization of Hierarchical Behavior Models

Adopting hierarchical structures to break down complex air combat scenarios into smaller, more manageable sub-problems is a promising direction for future research. Furthermore, a hierarchical decision-making process allows for a coherent representation of an otherwise convoluted policy. High-level policies guiding defensive and aggressive sub-policies, as observed in current literature, can be expanded and refined to address a broader range of mission tasks, ultimately enhancing the versatility and adaptability of behavior models applied to all mentioned applications.

D. Standardization and Collaboration Initiatives

Considering the current lack of standardization and benchmarking in simulation systems, researchers would benefit from active participation in collaborative initiatives like the Not So Grand Challenge. Establishing common testbeds and standardized environments and scenarios facilitates cross-comparison of different behavior models and ensures that advancements in one research project can be applied and tested in others. This collaborative approach accelerates progress and contributes to developing more robust and universally applicable machine learning-based behavior models for air combat applications.
11

REFERENCES

[1] U. Dompke, “Computer generated forces,” NC3A, The Hague, The Netherlands, Rep. 200826, 2003.
[2] J. Källström, R. Granlund, and F. Heintz, “Design of simulation-based pilot training systems using machine learning agents,” Aeronaut. J., vol. 126, no. 1300, pp. 907–931, June 2022, doi: 10.1017/aer.2022.8.
[3] A. Toubman, “Calculated moves,” Ph.D. dissertation, LEI, Leiden, Netherlands, 2020.
[4] J. Freeman, E. Watz, and W. Bennett, “Adaptive agents for adaptive tactical training,” in AIS, vol. 11597, Orlando, FL, 2019, pp. 493–504, doi: 10.1007/978-3-030-22341-0_39.
[5] M. Sommer, M. Rüegsegger, O. Szehr, and G. Del Rio, “Deep self-optimizing artificial intelligence for tactical analysis, training and optimization,” in M&S support to oper. tasks incl. war gaming, logistics, cyber defence, ser. STO-MP-MSG, vol. 133. Munich, Germany: NATO STO, 2015, paper 18.
[6] R. M. Jones, J. E. Laird, P. E. Nielsen, K. J. Coulter, P. Kenny, and F. V. Koss, “Automated intelligent pilots for combat flight simulation,” AI Mag., vol. 20, no. 1, pp. 27–41, Mar. 1999, doi: 10.1609/aimag.v20i1.1438.
[7] D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, Dec. 2018.
[8] K. Arulkumaran, A. Cully, and J. Togelius, “Alphastar,” in GECCO, Prague, Czech Republic, 2019, pp. 314–315.
[9] H. Ravichandar, A. S. Polydoros, S. Chernova, and A. Billard, “Recent advances in robot learning from demonstration,” Annu. Rev. Control Robot. Auton. Syst., vol. 3, pp. 297–330, May 2020, doi: 10.1146/annurev-control-100819-063206.
[10] R. Rădulescu, P. Mannion, D. M. Roijers, and A. Nowé, “Multi-objective multi-agent decision making,” AAMAS, vol. 34, Art. no. 10, Dec. 2020, doi: 10.1007/s10458-019-09433-x.
[11] Y. Dong, J. Ai, and J. Liu, “Guidance and control for own aircraft in the autonomous air combat,” J. Aerosp. Eng., vol. 233, no. 16, pp. 5943–5991, Dec. 2019, doi: 10.1177/0954410019889447.
[12] X. Wang et al., “Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction,” Artif. Intell. Rev., vol. 57, no. 1, Art. no. 1, Jan. 2024.
[13] R. A. Løvlid, L. J. Luotsinen, F. Kamrani, and B. Toghiani-Rizi, “Data-driven behavior modeling for computer generated forces,” FFI, Kjeller, Norway, Rep. 17/01510, 2017.
[14] T.-H. Teng, A.-H. Tan, and L.-N. Teow, “Adaptive computer-generated forces for simulator-based training,” Expert Syst. Appl., vol. 40, no. 18, pp. 7341–7353, Dec. 2013, doi: 10.1016/j.eswa.2013.07.004.
[15] Y. Sekhavat, “Behavior trees for computer games,” IJAIT, vol. 26, no. 2, Art. no. 1730001, Apr. 2017, doi: 10.1142/S0218213017300010.
[16] R. A. Agis, S. Gottifredi, and A. J. García, “An event-driven behavior trees extension to facilitate non-player multi-agent coordination in video games,” Expert Syst. Appl., vol. 155, Art. no. 113457, Oct. 2020.
[17] M. Iovino, E. Scukins, J. Styrud, P. Ögren, and C. Smith, “A survey of behavior trees in robotics and AI,” Rob. Auton. Syst., vol. 154, Art. no. 104096, Aug. 2022, doi: 10.1016/j.robot.2022.104096.
[18] M. Ramirez et al., “Integrated hybrid planning and programmed control for real time UAV maneuvering,” in AAMAS, Stockholm, Sweden, 2018, pp. 1318–1326.
[19] R. Czabanski, M. Jezewski, and J. Leski, Introduction to Fuzzy Systems. Cham, Switzerland: Springer, 2017, pp. 23–43, doi: 10.1007/978-3-319-59614-3.
[20] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. Cambridge, MA: MIT press, 2018.
[21] E. Wiewiora, Reward Shaping. Boston, MA: Springer, 2010, pp. 863–865, doi: 10.1007/978-0-387-30164-8_731.
[22] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,” JAIR, vol. 48, pp. 67–113, Oct. 2013, doi: 10.1613/jair.3987.
[23] S. Gronauer and K. Diepold, “Multi-agent deep reinforcement learning,” Artif. Intell. Rev., vol. 55, no. 2, pp. 895–943, Feb. 2022.
[24] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, “Evolutionary algorithms for reinforcement learning,” JAIR, vol. 11, pp. 241–276, Sept. 1999, doi: 10.1613/jair.613.
[25] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3, pp. 279–292, May 1992, doi: 10.1007/BF00992698.
[26] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[27] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” JMLR, vol. 21, no. 1, Art. no. 178, Jan. 2020.
[28] K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, Handbook of Reinforcement Learning and Control, ser. SSDC. Cham, Switzerland: Springer, 2021, vol. 325, doi: 10.1007/978-3-030-60990-0.
[29] B. Brügmann, “Monte carlo go,” SU, Syracuse, NY, Rep., 1993.
[30] D. Silver et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[31] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in ICLR, San Juan, Puerto Rico, 2016.
[32] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in ICML, vol. 48. New York, NY: ACM, 2016, pp. 1329–1338.
[33] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in ICML, vol. 48. New York, NY: PMLR, June 2016, pp. 1928–1937.
[34] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in NIPS, vol. 30, Long Beach, CA, 2017, pp. 6379–6390.
[35] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic,” in ICML, vol. 80. Stockholm, Sweden: PMLR, 2018, pp. 1861–1870.
[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv, vol. 1707, Art. no. 06347, July 2017, doi: 10.48550/arXiv.1707.06347.
[37] P. Spronck, M. Ponsen, I. Sprinkhuizen-Kuyper, and E. Postma, “Adaptive game AI with dynamic scripting,” Mach. Learn., vol. 63, no. 3, pp. 217–248, June 2006, doi: 10.1007/s10994-006-6205-6.
[38] T. Osa, J. Pajarinen, and G. Neumann, An Algorithmic Perspective on Imitation Learning. Hanover, MA: Now Publishers Inc., 2018.
[39] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, vol. 15. Ft. Lauderdale, FL: PMLR, 2011, pp. 627–635.
[40] H. M. Le, N. Jiang, A. Agarwal, M. Dudík, Y. Yue, and H. Daumé, III, “Hierarchical imitation and reinforcement learning,” in ICML, vol. 80. Stockholm, Sweden: PMLR, 2018, pp. 2923–2932.
[41] R. J. Alattas, S. Patel, and T. M. Sobh, “Evolutionary modular robotics,” J. Intell. Robot. Syst., vol. 95, no. 3, pp. 815–828, Sept. 2019.
[42] A. E. Eiben and J. E. Smith, Introduction to evolutionary computing, 2nd ed. Berlin, Germany: Springer, 2015, p. 301, doi: 10.1007/978-3-662-44874-8.
[43] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evol. Comput., vol. 10, no. 2, pp. 99–127, June 2002, doi: 10.1162/106365602320169811.
[44] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains,” JMLR, vol. 10, Art. no. 56, pp. 1633–1685, July 2009.
[45] M. Ranaweera and Q. H. Mahmoud, “Virtual to real-world transfer learning,” Electronics, vol. 10, no. 12, Art. no. 1491, June 2021.
[46] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, Montreal, Canada, 2009, pp. 41–48.
[47] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, “Curriculum learning for reinforcement learning domains,” JMLR, vol. 21, no. 1, Art. no. 181, Jan. 2020.
[48] P. Higby, “Promise and reality: Beyond visual range (BVR) air-to-air combat,” Maxwell AFB, AL, 2005.
[49] R. G. Abbott, J. D. Basilico, M. R. Glickman, and J. Whetzel, “Trainable automated forces,” in I/ITSEC, Orlando, FL, 2010, paper 10441.
[50] R. G. Abbott, “The relational blackboard,” in BRiMS, Ottawa, Canada, 2013, pp. 139–146.
[51] R. G. Abbott, C. Warrender, and K. Lakkaraju, “Transitioning from human to agent-based role-players for simulation-based training,” in AC, ser. LNCS, vol. 9183, Los Angeles, CA, 2015, pp. 551–561.
[52] J. H. Bae, H. Jung, S. Kim, S. Kim, and Y.-D. Kim, “Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment,” IEEE Access, vol. 11, pp. 26427–26440, Mar. 2023.
[53] J. Chai, W. Chen, Y. Zhu, Z.-X. Yao, and D. Zhao, “A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat,” IEEE TSMC, vol. 53, no. 9, pp. 5417–5429, Sept. 2023.
[54] N. Ernest, K. Cohen, C. Schumacher, and D. Casbeer, “Learning of intelligent controllers for autonomous unmanned combat aerial vehicles by genetic cascading fuzzy methods,” in Aerosp. Syst. Technol., Cincinnati, OH, 2014, paper 01-2174.
[55] N. Ernest, K. Cohen, E. Kivelevitch, C. Schumacher, and D. Casbeer, “Genetic fuzzy trees and their application towards autonomous training and control of a squadron of unmanned combat aerial vehicles,” Unmanned Syst., vol. 3, no. 3, pp. 185–204, July 2015.
[56] N. Ernest, D. Carroll, C. Schumacher, M. Clark, K. Cohen, and G. Lee, “Genetic fuzzy based artificial intelligence for unmanned combat aerial vehicle control in simulated air combat missions,” J. Def. Manag., vol. 6, no. 1, Mar. 2016.
[57] P. Gorton, M. Asprusten, and K. Brathen, “Imitation learning for modelling air combat behaviour,” FFI, Kjeller, Norway, Rep. 22/02423, 2023.
[58] Y. Han et al., “Deep relationship graph reinforcement learning for multi-aircraft air combat,” in IJCNN, Padua, Italy, 2022, pp. 1–8, doi: 10.1109/IJCNN55064.2022.9892208.
[59] H. Piao et al., “Beyond-visual-range air combat tactics auto-generation by reinforcement learning,” in IJCNN, Glasgow, UK, 2020, pp. 1–8, doi: 10.1109/IJCNN48605.2020.9207088.
[60] Z. Sun et al., “Multi-agent hierarchical policy gradient for air combat tactics emergence via self-play,” Eng. Appl. Artif. Intell., vol. 98, Art. no. 104112, Feb. 2021, doi: 10.1016/j.engappai.2020.104112.
[61] D. Hu, R. Yang, J. Zuo, Z. Zhang, J. Wu, and Y. Wang, “Application of deep reinforcement learning in maneuver planning of beyond-visual-range air combat,” IEEE Access, vol. 9, pp. 32282–32297, Feb. 2021, doi: 10.1109/ACCESS.2021.3060426.
[62] T. Johansson, “Tactical simulation in air-to-air combat,” Master’s thesis, LTU, Luleå, Sweden, 2018.
[63] J. Källström and F. Heintz, “Multi-agent multi-objective deep reinforcement learning for efficient and effective pilot training,” in Aerosp. Technol. Congr., vol. 162, Stockholm, Sweden, 2019, paper 11, pp. 101–111.
[64] ——, “Agent coordination in air combat simulation using multi-agent deep reinforcement learning,” in IEEE SMC, Toronto, Canada, 2020, pp. 2157–2164, doi: 10.1109/SMC42975.2020.9283492.
[65] W. Kong, D. Zhou, and Z. Yang, “Air combat strategies generation of CGF based on MADDPG and reward shaping,” in CVIDL, Chongqing, China, 2020, pp. 651–655, doi: 10.1109/CVIDL51233.2020.000-7.
[66] W. Kong, D. Zhou, Y. Du, Y. Zhou, and Y. Zhao, “Hierarchical multi-agent reinforcement learning for multi-aircraft close-range air combat,” IET CTA, vol. 17, no. 13, pp. 1840–1862, Sept. 2022.
[67] L. Li, Z. Zhou, J. Chai, Z. Liu, Y. Zhu, and J. Yi, “Learning continuous 3-DoF air-to-air close-in combat strategy using proximal policy optimization,” in IEEE CoG, Beijing, China, 2022, pp. 616–619, doi: 10.1109/CoG51982.2022.9893690.
[68] J. Ludwig and B. Presnell, “Developing an adaptive opponent for tactical training,” in AIS, ser. LNISA, vol. 11597, Orlando, FL, 2019, pp. 379–388.
[69] J. S. McGrew, J. P. How, B. Williams, and N. Roy, “Air-combat strategy using approximate dynamic programming,” JGCD, vol. 33, no. 5, pp. 1641–1654, Sept. 2010, doi: 10.2514/1.46815.
[70] A. P. Pope et al., “Hierarchical reinforcement learning for air-to-air combat,” in ICUAS, Athens, Greece, 2021, pp. 275–284, doi: 10.1109/ICUAS51884.2021.9476700.
[71] ——, “Hierarchical reinforcement learning for air combat at DARPA’s AlphaDogfight Trials,” TAI, vol. 4, no. 6, pp. 1–15, Dec. 2022, doi: 10.1109/TAI.2022.3222143.
[72] F. Reinisch, M. Strohal, and P. Stütz, “Behaviour modelling of computer-generated-forces in beyond-visual-range air combat,” in SIMULTECH, vol. 1, Lisbon, Portugal, 2022, pp. 327–335.
[73] V. Sandström, “On the efficiency of transfer learning in a fighter pilot behavior modelling context,” Master’s thesis, KTH, Stockholm, Sweden, 2021.
[74] V. Sandström, L. Luotsinen, and D. Oskarsson, “Fighter pilot behavior cloning,” in ICUAS, Dubrovnik, Croatia, 2022, pp. 686–695, doi: 10.1109/ICUAS54217.2022.9836131.
[75] A. Selmonaj, O. Szehr, G. D. Rio, A. Antonucci, A. Schneider, and M. Rüegsegger, “Hierarchical multi-agent reinforcement learning for air combat maneuvering,” arXiv, vol. 2309, Art. no. 11247, pp. 1–8, Sept. 2023, doi: 10.48550/arXiv.2309.11247.
[76] M. Sommer, M. Rüegsegger, O. Szehr, and G. Del Rio, “Deep self-optimizing artificial intelligence for tactical analysis, training and optimization,” in AI4HMO, ser. STO-MP-IST, vol. 190. Koblenz, Germany: NATO STO, 2021, paper 19.
[77] A. Strand, P. Gorton, M. Asprusten, and K. Brathen, “Learning environment for the air domain (LEAD),” in WSC, San Antonio, TX, 2023, pp. 3035–3046.
[78] T.-H. Teng, A.-H. Tan, Y.-S. Tan, and A. Yeo, “Self-organizing neural networks for learning air combat maneuvers,” in IJCNN, Brisbane, Australia, 2012, pp. 1–8, doi: 10.1109/IJCNN.2012.6252763.
[79] J. Yao, Q. Huang, and W. Wang, “Adaptive CGFs based on grammatical evolution,” Math. Probl. Eng., vol. 2015, Art. no. 197306, Dec. 2015.
[80] L. A. Zhang et al., “Air dominance through machine learning,” RAND Corp., Santa Monica, CA, Rep. AD1100919, 2020.
[81] H. Zhang, Y. Wei, H. Zhou, and C. Huang, “Maneuver decision-making for autonomous air combat based on FRE-PPO,” Appl. Sci., vol. 12, no. 20, Art. no. 10230, Oct. 2022.
[82] R. L. Shaw, Fighter combat. Annapolis, MD: NIP, 1985, p. xii.
[83] C. R. DeMay, E. L. White, W. D. Dunham, and J. A. Pino, “AlphaDogfight Trials,” Johns Hopkins APL Tech. Dig., vol. 36, no. 2, pp. 154–163, July 2022.
[84] DARPA, “ACE program’s AI agents transition from simulation to live flight,” 2023. [Online]. Available: https://www.darpa.mil/news-events/2023-02-13
[85] J. P. A. Dantas, A. N. Costa, D. Geraldo, M. R. O. A. Maximo, and T. Yoneyama, “Engagement decision support for beyond visual range air combat,” in LARS/SBR/WRE, Natal, Brazil, 2021, pp. 96–101, doi: 10.1109/LARS/SBR/WRE54079.2021.9605380.
[86] S. Aronsson et al., “Supporting after action review in simulator mission training,” JDMS, vol. 16, no. 3, pp. 219–231, July 2019.
[87] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. Cambridge, MA: MIT press, 2016.
[88] J. K. Terry, N. Grammel, S. Son, B. Black, and A. Agrawal, “Revisiting parameter sharing in multi-agent deep reinforcement learning,” arXiv, vol. 2005, Art. no. 13625, May 2020, doi: 10.48550/arXiv.2005.13625.
[89] J. Berndt and A. De Marco, “Progress on and usage of the open source flight dynamics model software library, JSBSim,” in AIAA MST, 2009, paper 5699, doi: 10.2514/6.2009-5699.
[90] T. Johansson, “Metodiker vid regelskrivning,” Linköping, Sweden, 2004.
[91] D. Siksik, “STRIVE: An open and distributed architecture for CGF representations,” in 9th CGF & BR Conf., Orlando, FL, 2000, pp. 16–18.
[92] B. I. Simulations, “VBS4 product brochure,” 2024, accessed on Feb. 28, 2024. [Online]. Available: https://bisimulations.com/sites/default/files/data_sheets/bisim_product_flyers_2024_vbs4.pdf
[93] Matrix Games, Command Modern Operations Game Manual, Staten Island, NY, 2023.
[94] B. Johnson et al., “Game theory and prescriptive analytics for naval wargaming battle management aids,” NPS, Monterey, CA, Rep. AD1184544, 2018.
[95] M. P. Bailey and R. Armstrong, “The deployable virtual training environment,” in I/ITSEC, Orlando, FL, 2002, pp. 843–849.
[96] P. D. Clive, J. A. Johnson, M. J. Moss, J. M. Zeh, B. M. Birkmire, and D. D. Hodson, “Advanced framework for simulation, integration and modeling (AFSIM),” in CSC, 2015, pp. 73–77.
[97] J. D. Souza, P. J. L. Silva, A. H. M. Pinto, F. F. Monteiro, and J. M. X. N. Teixeira, “Assessing the reality gap of robotic simulations with educational purposes,” in LARS/SBR/WRE, Natal, Brazil, 2020, pp. 1–6, doi: 10.1109/LARS/SBR/WRE51543.2020.9306947.
[98] IEEE, “Standard for modeling and simulation (M&S) high level architecture (HLA),” Piscataway, NJ, 2010.
[99] G. Brockman et al., “OpenAI gym,” arXiv, vol. 1606, Art. no. 01540, June 2016, doi: 10.48550/arXiv.1606.01540.
[100] E. Watz and M. J. Doyle, “Fighter combat-tactical awareness capability (FC-TAC) for use in live virtual constructive training,” in Fall SIW, Orlando, FL, 2014, paper 44.
[101] M. J. Doyle and A. M. Portrey, “Rapid adaptive realistic behavior modeling is viable for use in training,” in BRiMS, Washington, DC, 2014, pp. 73–80, doi: 10.13140/2.1.4964.4802.
[102] M. J. Doyle, “A foundation for adaptive agent-based ‘on the fly’ learning of TTPs,” J. Comput. Eng. Inf. Technol., vol. 6, no. 3, Art. no. 1000173, June 2017, doi: 10.4172/2324-9307.1000173.
[103] W. Warwick and S. Rodgers, “Wrong in the right way: Balancing realism against other constraints in simulation-based training,” in AIS, ser. LNCS, vol. 11597, Orlando, FL, 2019, pp. 379–388.
[104] J. Freeman, E. Watz, and W. Bennett, “Assessing and selecting AI pilots for tactical and training skill,” in Towards On-Demand Personalized Training and Decision Support, ser. STO-MP-MSG, vol. 177. Virtual: NATO STO, 2020, paper 14.
[105] W. Bennett, “Readiness product line,” AFRL Fight’s On!, no. 67, pp. 6–8, Nov. 2022.
[106] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” AAMAS, vol. 33, no. 6, pp. 750–797, Nov. 2019, doi: 10.1007/s10458-019-09421-1.
Patrick R. Gorton received a bachelor’s
degree in electronics engineering from Oslo Metropolitan University in 2018, and a master's degree in informatics, specializing in robotics and intelligent systems, from the University of Oslo, Oslo, Norway, in 2020.
He is a scientist at FFI (Norwegian Defence
Research Establishment), Kjeller, Norway.
His current research interests include
artificial intelligence, machine learning, digital twins, and
modeling and simulation, with a primary focus on behavior
modeling of intelligent agents for military training and decision
support.
Andreas Strand holds a master’s degree in
applied physics and mathematics from the
Norwegian University of Science and
Technology, Trondheim, Norway, in 2017, including studies in statistics at the University of California, Berkeley. He received a PhD in statistics from the Norwegian University of Science and Technology in 2021 on the topic of uncertainty quantification in simulations. Dr. Strand has since worked as a scientist at FFI, Kjeller, Norway, focusing on behavior models for combat simulations and decision support for military operations.
Karsten Brathen (Senior Member, IEEE)
holds a “sivilingeniør” degree in
Engineering Cybernetics from the
Norwegian University of Science and
Technology, Trondheim, Norway, in 1979. He has more than 40 years of experience in defense research and technology at FFI, Kjeller, Norway, and has been principal investigator and project manager for projects on submarine combat systems, high-speed marine craft cockpits, naval command and control systems, and defense simulation technologies. He was a visiting scientist at Ascent Logic Corporation, San Jose, CA, USA, and has been an adjunct lecturer in human-machine systems engineering at the University Graduate Center, Kjeller, Norway. For many years he was the Norwegian principal member of the Modeling and Simulation Group in the NATO Science and Technology Organization and the Norwegian national coordinator for the Simulation Technology capability area in the European Defence Agency. His research interests include defense modeling and simulation, behavior modeling, command and control systems, and human-machine systems. Mr. Brathen is a member of ACM.