A Survey of Air Combat Behavior Modeling Using Machine Learning
Abstract—With the recent advances in machine learning, creating agents that behave realistically in simulated air combat has become a growing field of interest. This survey explores the application of machine learning techniques for modeling air combat behavior, motivated by the potential to enhance simulation-based pilot training. Current simulated entities tend to lack realistic behavior, and traditional behavior modeling is labor-intensive and prone to loss of essential domain knowledge between development steps. Advancements in reinforcement learning and imitation learning algorithms have demonstrated that agents may learn complex behavior from data, which could be faster and more scalable than manual methods. Yet, making adaptive agents capable of performing tactical maneuvers and operating weapons and sensors still poses a significant challenge.
The survey examines applications, behavior model types, prevalent machine learning methods, and the technical and human challenges in developing adaptive and realistically behaving agents. Another challenge is the transfer of agents from learning environments to military simulation systems and the consequent demand for standardization.
Four primary recommendations are presented regarding increased emphasis on beyond-visual-range scenarios, multi-agent machine learning and cooperation, utilization of hierarchical behavior models, and initiatives for standardization and research collaboration. These recommendations aim to address current issues and guide the development of more comprehensive, adaptable, and realistic machine learning-based behavior models for air combat applications.

Index Terms—Machine learning, intelligent agents, behavioral sciences, modeling, simulation, military systems.

I. INTRODUCTION

Besides training, realistic behavior models benefit applications like mission planning and tactics development. Modeling and simulation tools can help mission planners predict and evaluate the outcome of different scenarios, allowing refinement of strategies and tactics before the actual mission takes place [5]. Tactics development may leverage the creativity of ML agents that autonomously explore new strategies with few restrictions.
Traditionally, air combat behavior is created manually by first eliciting domain knowledge from subject matter experts (SMEs) and then creating a conceptual model reflecting this knowledge. Finally, the model is implemented in a computer program as a rule-based script, such as a decision tree or a finite state machine (FSM) [6]. This process is laborious and risks losing essential domain knowledge when transitioning from one step to another.
In recent years, there has been a surge in applying machine learning (ML) for the efficient development of air combat behavior models. Learning the behavior model with data-driven methods imposes other requirements on the model representation. The model structure should be able to capture general patterns from complex data, and at the same time not be too rigid, as that would restrict learning potential. A neural network is a compelling model representation because it offers scalability and parallel processing and allows iterative improvements.
The shift to ML is perhaps also inspired by advances in applying intelligent agents to complex systems in general, aided by novel approaches in reinforcement learning (RL), imitation learning (IL), and evolutionary algorithms (EAs). Notable
Fig. 2. Stern conversion intercept represented as a BT. Adapted from Ramirez et al. [17].

B. Behavior Tree (BT)
Behavior trees are another paradigm for task switching that became mainstream in the 2000s for modeling non-player characters following games such as Grand Theft Auto, Halo, and Bioshock [15, 16]. A modern description of BTs [17] includes the five node types sequence (→), fallback (?), parallel (⇉), action (shaded box), and condition (oval), as demonstrated in Fig. 2. The example shows a maneuver where one aircraft intercepts another [18]. The BT gets evaluated at each tick starting at the root, which in the example is a parallel node. Consequently, the aircraft will execute three subtrees simultaneously, matching the speed and altitude of the other aircraft whilst executing the left subtree representing steering. Fallbacks execute their children from left to right until one child succeeds, while sequences execute their children until one child fails. By this logic, the aircraft will have to reach a series of sub-goals before it can finally convert.
Even though BTs span the same range of behavior models as FSMs, they are in some respects easier to manage and modify as they become complex. Because all nodes return either success, failure, or running, the interface is fixed and thus subtrees can be inserted anywhere in the existing model. Moreover, this fixed interface makes BTs well suited to graphical editors. The flow in a BT travels down to children and then back up to parents. This is a two-way transfer of control, as opposed to the one-way transfer of control exhibited by FSMs [15].
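To make the node semantics concrete, the following minimal sketch (our illustration, not code from the surveyed studies) implements fallback, sequence, condition, and action nodes and encodes a simplified fragment of the steering logic in Fig. 2; the blackboard keys and command names are hypothetical.

# Minimal behavior tree sketch (illustrative only; structure and keys are hypothetical).
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Fallback:
    """Ticks children left to right until one does not fail."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Sequence:
    """Ticks children left to right until one does not succeed."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Condition:
    def __init__(self, predicate):
        self.predicate = predicate
    def tick(self, blackboard):
        return Status.SUCCESS if self.predicate(blackboard) else Status.FAILURE

class Action:
    def __init__(self, effect):
        self.effect = effect
    def tick(self, blackboard):
        self.effect(blackboard)
        return Status.RUNNING  # maneuver commands keep running until conditions change

# Part of the steering logic: convert if in range, otherwise fly offset, otherwise pursue.
steering = Fallback([
    Sequence([Condition(lambda bb: bb["within_conversion_range"]),
              Action(lambda bb: bb.update(command="convert"))]),
    Sequence([Condition(lambda bb: bb["sufficient_displacement"]),
              Action(lambda bb: bb.update(command="fly_offset"))]),
    Action(lambda bb: bb.update(command="pure_pursuit")),
])

blackboard = {"within_conversion_range": False, "sufficient_displacement": True}
steering.tick(blackboard)  # sets blackboard["command"] = "fly_offset"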
C. Fuzzy Tree (FT)
An FT is a fuzzy system (FS) [19] organized as a hierarchical tree structure. A defining trait of FTs is that percepts and actions are represented by linguistic variables such as close, threatening, defensive, or evade. This conforms to human reasoning, which builds on qualitative descriptions rather than numbers. The most basic components of FSs are membership functions, defining the linguistic variables in terms of agent percepts. In air combat with two aircraft, close may be represented as a sigmoid membership function that maps aircraft distance to a membership value μ ∈ [0, 1] representing the extent to which the aircraft are close. The linguistic variables are used to make rules that constitute the behavior model, such as "IF enemy in pursuit AND close, perform jink".

Fig. 3. Neural network with three percept inputs, two hidden layers of four nodes (gray), and two action outputs.

Because fuzzy logic uses numbers instead of true and false, AND returns the minimum of μ(enemy in pursuit) and μ(close). Whether to perform a jink depends on the membership values and a chosen threshold. If there are multiple rules for maneuvering, they must be aggregated in some way.
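As an illustration of these mechanics, the sketch below evaluates the jink rule with a sigmoid membership function for close and the minimum operator for AND; the membership parameters, the pursuit membership, and the threshold are arbitrary choices, not values from the surveyed studies.

# Illustrative fuzzy-rule evaluation; membership parameters and threshold are arbitrary.
import math

def mu_close(distance_m, midpoint=3000.0, steepness=0.002):
    """Sigmoid membership: approaches 1 as the aircraft get closer."""
    return 1.0 / (1.0 + math.exp(steepness * (distance_m - midpoint)))

def mu_enemy_in_pursuit(aspect_angle_deg):
    """Placeholder membership: high when the enemy sits near the agent's six o'clock."""
    return max(0.0, 1.0 - abs(aspect_angle_deg) / 60.0)

def should_jink(distance_m, aspect_angle_deg, threshold=0.5):
    # Fuzzy AND = minimum of the memberships, as in the rule
    # "IF enemy in pursuit AND close, perform jink".
    firing_strength = min(mu_enemy_in_pursuit(aspect_angle_deg), mu_close(distance_m))
    return firing_strength > threshold

print(should_jink(distance_m=1500.0, aspect_angle_deg=10.0))  # True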
D. Neural Network (NN)
Neural networks work as versatile behavior models where the rationale for the actions is implicit. The flexible structure allows a wide range of behaviors to be learned by data-driven methods. An example of an NN is shown in Fig. 3, composed of percepts X1, X2, and X3, two hidden layers shown in gray, and actions Y1 and Y2. The edges in the network represent weights, and those are the trainable parameters of the behavior model. In the most basic networks, the value of each node is the weighted sum of its inputs, but modern NNs often employ some additional nonlinear transformation, which greatly increases the function space. The number and size of the hidden layers are also adjustable.
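A minimal forward pass with the structure of Fig. 3 could look as follows; the weights are random placeholders that a learning method would adjust.

# Forward pass of a small fully connected network like Fig. 3 (random placeholder weights).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 4, 2]          # three percepts, two hidden layers of four nodes, two actions
weights = [rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def policy(percepts):
    """Map percepts X1..X3 to actions Y1, Y2 with tanh nonlinearities."""
    h = np.asarray(percepts, dtype=float)
    for W, b in zip(weights, biases):
        h = np.tanh(h @ W + b)      # weighted sum plus nonlinear transformation
    return h

print(policy([0.2, -1.0, 0.5]))     # two action outputs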
IV. REINFORCEMENT LEARNING

An agent learning by reinforcement works out what to do without being told explicitly, and this is perhaps the most common approach to learning behavior for intelligent agents. The agent interacts with an environment and receives rewards according to its performance. The aim is to maximize rewards over time by choosing the right actions [20], which amounts to finding the optimal policy π* that results in the greatest cumulative reward across all possible states.
Problems solved using RL are commonly formulated as Markov decision processes (MDPs), where actions influence the rewards, the following environment states, and, thus, future rewards. An MDP is defined by a tuple (S, A, T, R, γ) where
• S is a set of states
• A is a set of actions
• T is a transition function specifying, for each state, action, and next state, the probability of that next state occurring
• R is a reward function specifying the immediate reward given a state and an action
• γ is a discount factor specifying the relative importance of immediate rewards
For all time steps in an MDP, the agent observes a state s ∈ S, makes an action a ∈ A based on the state, transitions to the next state s′, and receives a reward r = R(s, a) based on this transition. The transition function may be deterministic or stochastic, in which case T(s, a, s′) ∼ P(s′|s, a) is the probability of the new state being s′ when applying action a in state s. The goal of MDPs is to find the policy π that maximizes the expected sum of discounted rewards over time

\arg\max_{\pi} \mathbb{E}\left[\sum_{t} R(s_t, a_t) \mid \pi\right].

Rewards are sometimes sparse, making it difficult for agents to learn the appropriate behavior. Reward shaping is a technique where supplemental rewards are provided to make a problem easier to learn, especially during the early stages of learning [21]. Agents may have to consider multiple, possibly conflicting objectives when making decisions. For example, depending on the mission, a fighter pilot must consider aspects like resource consumption, careful use of sensors to avoid being detected by opponents, and risk and safety in general. Multi-objective RL (MORL) is a paradigm where agents learn to balance between such priorities. In MORL, the problem usually takes the form of a multi-objective MDP, in which the reward function R describes a vector of rewards, one for each objective [22].
While the above description has focused on learning the behavior of a single agent, multi-agent RL (MARL) studies the behavior of multiple learning agents that coexist in a shared environment. Centralized training and decentralized execution (CTDE) is a popular MARL framework, where agents are trained offline with centralized information and execute online in a decentralized manner [23].
There are two distinct methodological approaches to RL: searching in policy space and searching in value function space. Policy-based methods maintain explicit representations of policies and modify them through search operators. In contrast, value-based methods do not maintain policy representations but attempt to learn a value function V^{π*}(s), defined as the expected discounted cumulative reward for π* [24]. In value-based methods, the policies are implicit and can be derived directly from the value function by picking the actions yielding the best values. A third approach, actor-critic methods, approximates both policy and value functions. The actor refers to the policy, while the critic refers to the value function [20, p. 321]. Central to all RL methods is finding a balance between exploration of new strategies, and exploitation, meaning making the best decision given current information. The RL methods relevant to the survey are briefly presented here, with references to further reading.

A. Value-based Methods
Dynamic programming (DP), created by Richard Bellman in the 1950s, is a class of methods for solving optimal control problems of dynamical systems and can be used to find the optimal value functions in RL problems by solving the Bellman optimality equation [20, p. 14]

V^*(s) = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a) + \gamma V^*(s')\right].   (1)

However, classical DP algorithms are limited to trivial MDPs because of great computational expense as the number of state variables grows. As seen in (1), they assume perfect knowledge of the reward and transition functions. To circumvent these shortcomings, approximate dynamic programming (ADP) operates with a statistical approximation of the value function rather than computing it exactly.
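For a small MDP with known T and R, (1) can be solved by value iteration, sketched below on a toy two-state problem; the transition and reward tables are arbitrary placeholders, not values from any surveyed study.

# Value iteration for a toy two-state, two-action MDP (placeholder T and R).
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],      # T[s, a, s'] = probability of s' given s, a
              [[0.9, 0.1], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],                    # R[s, a] = immediate reward
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(200):                         # repeatedly apply the Bellman optimality update (1)
    Q = R + gamma * T @ V                    # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                    # greedy policy derived from the value function
print(V, policy)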
The Bellman equation, with some rewriting, may also be used to find a Q function, which is the value function conditioned on taking a particular action next. Q-learning [25] is a popular algorithm for doing this. In Q-learning, the Q function is a table holding the values of all actions at all possible states and is updated by

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right],

where α is the update step size and γ the factor discounting future rewards. The Q-learning algorithm operates on a finite set of discrete states and actions, and the resulting policy amounts to selecting the action that yields the greatest value given a state.
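Written as code, the tabular update takes the following form; the environment interface (reset and step) and the ε-greedy exploration scheme are assumptions of the sketch, not part of any surveyed implementation.

# Tabular Q-learning sketch; env is assumed to expose reset() -> s and step(a) -> (s', r, done).
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, rng=np.random.default_rng(0)):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy balance between exploration and exploitation
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q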
Extensions to Q-learning have been made, like deep Q-learning [26], achieving eminent results in human-level control applications. Deep Q-learning combines Q-learning with neural networks, forming deep Q-networks (DQNs) for function approximation with continuous state spaces. Deep Q-learning uses a replay buffer to learn from previously collected data multiple times to improve sample efficiency and convergence. The replay buffer holds a finite but continuously updated set of transitions (s, a, r, s′).
Variants of Q-learning are also applicable to MARL problems. For example, under the CTDE framework, QMIX [27] is a Q-learning algorithm for learning behaviors of cooperative agents. The method employs a neural network for estimating the joint Q-values of multiple agents.
An alternative value-based RL algorithm is Monte-Carlo tree search (MCTS), a heuristic search algorithm for estimating the optimal value function by constructing a search tree using Monte-Carlo simulations [28]. This method has been used to find the best move in board games like Go [29]. Like with ADP and Q-learning, MCTS may combine with neural networks, which in the case of Go proved able to beat professional human players [30].

B. Actor-Critic Methods
The deep deterministic policy gradient (DDPG) algorithm also adapts the ideas behind the success of deep Q-learning to the continuous action domain [31]. The actor π_θ, which is deterministic, and critic Q_φ are neural networks parameterized by θ and φ. Like in deep Q-learning, DDPG makes use of a replay buffer with N transitions (s, a, r, s′) collected by iterations of the policy π_θ. With these transitions, Q_φ is updated to satisfy the Bellman equation more closely with each iteration, as in deep Q-learning. The new Q_φ is used to update policy π_θ by ascending with the gradient

\frac{1}{N} \sum_{t=1}^{N} \nabla_{\theta} Q_{\phi}\left(s_t, \pi_{\theta}(s_t)\right),
which is computed by backpropagating through Q_φ. However, the DDPG algorithm is reported to have some stabilization challenges and to be brittle to hyperparameter settings [32]. Beyond the use of replay buffers, sample efficiency can be improved by running multiple local copies of the actor in parallel and asynchronously or synchronously updating the global model, a technique employed in the Advantage Actor-Critic (A2C) algorithm [33].
Extending DDPG, the multi-agent deep deterministic policy gradient (MADDPG) algorithm [34] is an actor-critic method for MARL, which adopts the CTDE framework. During training, the centralized critic has access to the observations and actions of all agents and guides their policy updates. After training, the agents select actions based solely on their local observations.
Another method that builds on the concepts of DDPG is the soft actor-critic (SAC) algorithm [35]. Unlike DDPG, SAC learns a stochastic policy, i.e. a distribution π_θ(·|s_t) of actions given states. To motivate action exploration, SAC aims to maximize not only the expected return but also the entropy of the policy, making the objective

\frac{1}{N} \sum_{t=1}^{N} \left[R(s_t, a_t) + c\,H\!\left(\pi_{\theta}(\cdot \mid s_t)\right)\right],

where H(π_θ(·|s_t)) is the entropy weighted by the coefficient c. Though not described here, SAC also learns both a value function and a Q function, which has proven to stabilize the learning process.
Inconveniently, updates to a policy may sometimes be detrimental. Proximal policy optimization (PPO) algorithms [36] employ mechanisms to prevent large diversions from the current policy configuration when optimizing. Like SAC, PPO promotes action exploration by maximizing the entropy of the stochastic policy, but more central to the method is the clipped objective L^clip(θ), which estimates the quality of actions based on rewards. To make conservative quality estimates and keep the policy updates small, the probability ratio in L^clip(θ) is clipped to the interval [1 − ε, 1 + ε]. Putting together these terms with the objective of the critic L^VF(φ) gives the complete PPO objective

\frac{1}{N} \sum_{t=1}^{N} \left[L^{\mathrm{clip}}(\theta) + c\,H\!\left(\pi_{\theta}(\cdot \mid s_t)\right) - L^{\mathrm{VF}}(\phi)\right].
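The clipped term can be sketched as follows for one batch of transitions, where the probability ratio and the advantage estimates are assumed to be computed elsewhere; the entropy and value terms are then added as in the combined objective above.

# Sketch of the PPO clipped policy loss for one batch (simplified illustration).
import numpy as np

def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    """ratio = pi_theta(a|s) / pi_theta_old(a|s); advantage estimates action quality."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Take the conservative (smaller) estimate, then negate because optimizers minimize.
    return -np.mean(np.minimum(unclipped, clipped))

print(ppo_clip_loss(np.array([0.9, 1.5]), np.array([1.0, 2.0])))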
All the actor-critic methods discussed above are types of deep RL. Dynamic scripting (DS) [37] is a different but well-studied RL method in the surveyed literature. Initially designed for behavior generation for non-player characters in video games, DS aims to meet computational and functional requirements related to speed, effectiveness, clarity, and variety in learning behavior. Dynamic scripting produces policies in the form of behavior scripts that contain a set of behavior rules. These behavior rules are if-then statements mapping states to actions and are contained in a rulebase. Each rule in the rulebase is assigned a weight value comparable to a Q value. Creating a behavior script involves selecting n rules from the rulebase according to their probability of being selected, equal to each weight value divided by the sum of all weights. When an agent interacts with its environment, a weight adjustment function updates the weights based on the rewards the agent receives. This way, favorable rules will likely be included in upcoming scripts. The weight adjustment function keeps the total weight in the rulebase constant, meaning that when the weights of specific rules increase, the weights of the other rules decrease. Due to the probabilistic nature of dynamic scripting, all scripts generated are likely to contain different sets of behavior rules.
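A sketch of the script-generation and weight-adjustment steps is given below; the rulebase contents, learning rate, and clipping are placeholders, while the weight-proportional selection and the constant total weight follow the description above.

# Dynamic scripting sketch: weight-proportional rule selection and weight adjustment
# that keeps the total weight constant (placeholder rulebase and learning rate).
import numpy as np

rng = np.random.default_rng(0)
rulebase = ["if locked_on then break_turn", "if enemy_close then jink",
            "if enemy_far then climb", "if advantage then pursue",
            "if low_fuel then disengage"]
weights = np.full(len(rulebase), 10.0)

def generate_script(weights, n=3):
    p = weights / weights.sum()                      # selection probability = weight / total weight
    return rng.choice(len(rulebase), size=n, replace=False, p=p)

def adjust_weights(weights, selected, reward, lr=1.0):
    total = weights.sum()
    weights[selected] += lr * reward                 # credit (or penalize) the rules in the script
    weights[:] = np.clip(weights, 0.1, None)
    weights[:] *= total / weights.sum()              # redistribute so the total stays constant

script = generate_script(weights)                    # indices of the rules forming one script
adjust_weights(weights, script, reward=2.0)
print([rulebase[i] for i in script], weights.round(2))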
V. IMITATION LEARNING

While RL attempts to solve tasks by maximizing expected rewards, IL aims to reproduce desired behavior from demonstrations, normally given by human operators. Thus, IL can be considered a class of methods that allow transferring skills from humans to robotic or computer systems [38, ch. 1]. For some tasks, it is easier to demonstrate how they are performed than to define reward functions. Besides, expert demonstrators may have their opinions on exactly how the agent should perform the task, like in the air combat domain, where realistic behavior may mean acting in line with doctrines, tactics, techniques, and procedures. Another significant challenge in IL is the dependency of subsequent states in a demonstration, which violates the assumption of independent and identically distributed data when minimizing the loss. This violation may lead to poor performance in theory and practice [39].
Imitation learning can be categorized into passive collection of demonstrations and active collection [40]. In the passive collection setting, demonstrations are collected beforehand, and IL aims to find a policy that mimics them. In contrast, the active collection setting assumes an interactive expert that provides demonstrations in response to actions taken by the current policy.
Behavior cloning (BC) is an example of IL using a passive collection of demonstrations D = {(s_t, π*(s_t)) | t ∈ 1, …, N}. It approximates the expert policy π* with supervised learning, minimizing the difference between the learned policy π and the expert demonstrations to create an approximate policy

\hat{\pi}^* = \arg\min_{\pi} \sum_{s \in D} L\!\left(\pi(s), \pi^*(s)\right),

where L is a distance loss function. Applying π̂* in states substantially different from D may lead to unexpected and unwanted actions due to the lack of training data. This is a problem because slight dissonances in π̂* may cause new trajectories leading far away from D. This may happen even if D is not collected in sequence. However, an extensive set of demonstrations will mitigate the risk.
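Viewed as supervised learning, BC reduces to fitting a function approximator to the demonstration pairs; the linear policy and squared loss in the sketch below are simplifications for illustration, not the model used in any surveyed study.

# Behavior cloning sketch: fit a policy to demonstrated (state, action) pairs by least squares.
import numpy as np

def behavior_cloning(states, expert_actions):
    """states: (N, d) array of percepts; expert_actions: (N, k) array of demonstrated actions."""
    X = np.hstack([states, np.ones((states.shape[0], 1))])    # add a bias term
    W, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)    # minimize the squared loss L(pi(s), pi*(s))
    return lambda s: np.append(s, 1.0) @ W                    # learned policy pi_hat

rng = np.random.default_rng(0)
demo_states = rng.standard_normal((100, 3))
demo_actions = demo_states @ np.array([[1.0], [-0.5], [0.2]])  # pretend expert mapping
pi_hat = behavior_cloning(demo_states, demo_actions)
print(pi_hat(np.array([0.3, 0.1, -0.2])))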
An active collection of demonstrations may be leveraged using methods such as DAgger (dataset aggregation) [39]. DAgger poses a way to query the expert for more information when π̂* leads to states not covered by the demonstrations, followed by retraining the policy using the aggregated demonstration set.
Rather than imitating demonstrations, inverse reinforcement learning aims to learn the expert's intent, implicit in these demonstrations, by recovering the underlying and unknown reward function [38]. Consider a reward function parameterized by a linear combination of features

R(s, a) = w^{T} \phi(s, a),

where w ∈ ℝ^n is a weight vector and φ(s, a): S × A → ℝ^n is a feature map. For a given φ(s, a), the goal of inverse RL is to
determine w and thus R(s, a). Having found R(s, a), π̂* can be approximated by regular RL. Inverse RL can be suitable because the "expert" is not always optimal. Besides, a policy optimal for the expert may not be optimal for the agent if they have different dynamics or capabilities.

VI. EVOLUTIONARY ALGORITHMS

Evolutionary algorithms (EAs) are population-based optimization algorithms suitable for solving a range of problems, like generating control policies in robotics [41]. Evolutionary algorithms are not classified as methods of RL or IL but rather belong to a paradigm of their own. One reason is that EAs manage a population of policies rather than just one. Genetic algorithms (GAs) are a common type of EA inspired by the Darwinian principles of natural selection, allowing population members to interact and reproduce [42]. Genetic algorithms evaluate all policies in a population using a fitness function, which in the MDP formulation is equivalent to a reward function. Then, a selection mechanism selects some policies for recombination, creating new policies that inherit traits of their parents. Usually, the new policies undergo some degree of mutation to explore the search space further. This process is called evolution, and it typically terminates when the population converges or some termination criterion is satisfied.
Neuroevolution is the process of evolving neural networks using GAs. A popular method for this is neuroevolution of augmenting topologies (NEAT) [43], which evolves both network structures and their parameter values.
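A generic GA loop under these definitions might look as follows; the real-vector encoding, tournament selection, uniform crossover, and Gaussian mutation are common choices rather than the specific operators used in any surveyed study, and fitness_fn is assumed to evaluate a policy in the environment.

# Generic genetic algorithm sketch: policies encoded as real-valued parameter vectors
# (e.g., neural network weights); fitness_fn runs the policy and returns a scalar fitness.
import numpy as np

def evolve(fitness_fn, n_params, pop_size=50, generations=100,
           mutation_std=0.1, rng=np.random.default_rng(0)):
    population = rng.standard_normal((pop_size, n_params))
    for _ in range(generations):
        fitness = np.array([fitness_fn(ind) for ind in population])
        def pick_parent():
            # Tournament selection: the fitter of two random members becomes a parent.
            i, j = rng.integers(pop_size, size=2)
            return population[i] if fitness[i] >= fitness[j] else population[j]
        children = []
        for _ in range(pop_size):
            mum, dad = pick_parent(), pick_parent()
            mask = rng.random(n_params) < 0.5                 # uniform crossover (recombination)
            child = np.where(mask, mum, dad)
            child += rng.normal(0.0, mutation_std, n_params)  # mutation explores the search space
            children.append(child)
        population = np.array(children)
    return max(population, key=fitness_fn)                    # best policy after evolution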
VII. COMPOSITE LEARNING

The term composite learning is used here to refer to strategies that partition the learning into multiple stages or partition the learning task into smaller tasks.

A. Transfer Learning
Transfer learning refers to the idea that learning can be sped up by leveraging previous knowledge from related tasks [44]. Rather than attempting to learn a difficult task from scratch, better-performing policies can be obtained with less data and training by generalizing across different tasks. Knowledge may also be transferred between different environments or from a simulated environment to the real world [45]. The benefits of transfer learning are measured by metrics such as jumpstart (the initial performance of an agent on a target task), asymptotic performance (the improvements made to the final learned performance via transfer), and time to threshold (the reduction in time needed to achieve a specified threshold).

B. Curriculum Learning
Curriculum learning can be seen as a special form of transfer learning where initial tasks are used to guide the learner to perform better on the final task [46]. As knowledge is transferred between tasks, the task sequence induces a curriculum, shown to improve policy convergence and performance on difficult tasks [47]. The curriculum can be designed by altering the parameters of the environment, changing the reward function, adding constraints to the task, and more.

C. Hierarchical Learning
Hierarchical policies attempt to tackle complex tasks by breaking them down into smaller tasks. Although such hierarchies may have multiple levels, consider a two-level policy hierarchy in which the high-level policy μ: s → g corresponds to selecting a low-level policy π_g: s → a executing subtask g. The hierarchical learning problem is to simultaneously learn the high-level policy μ and the sub-policies π_g [40]. If formulated as an RL problem, the goal is to maximize the rewards gained when μ and the sub-policies π_g work together.
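The two-level structure can be sketched as a high-level selector over sub-policies; the subtask names and the selection rule below are illustrative placeholders, and in practice both levels would be learned.

# Two-level policy hierarchy sketch: a high-level policy mu picks a subtask g,
# and the corresponding low-level policy pi_g maps the state to an action.
def pi_defensive(state):
    return {"throttle": 1.0, "turn": "break_away"}

def pi_aggressive(state):
    return {"throttle": 0.8, "turn": "pursue"}

sub_policies = {"defensive": pi_defensive, "aggressive": pi_aggressive}

def mu(state):
    """High-level policy: choose a subtask g based on the tactical situation."""
    return "defensive" if state["opponent_has_advantage"] else "aggressive"

def hierarchical_policy(state):
    g = mu(state)                      # mu: s -> g
    return sub_policies[g](state)      # pi_g: s -> a

print(hierarchical_policy({"opponent_has_advantage": True}))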
VIII. LITERATURE OVERVIEW

There are several ways to organize the surveyed literature to help identify gaps and trends and provide a map of air combat behavior modeling using ML. We have grouped publications that share authors and address similar problems. These 24 groups (studies) are listed in Table I and classified by 11 properties (columns). Year refers to the most recent publication of the study. The applications are described in Section I and express the intended use of the generated behavior models.
Some studies examine multi-agent learning, where multiple agents operate and learn in the same environment. All the surveyed studies that concern multi-agent learning employ RL.
Beyond-visual-range (BVR) engagements have become standard in real air combat. If enemy aircraft are too far away, they are no longer visible to the unaided eye, and the pilot must rely on long-range detection systems. The visual range on a clear day is roughly 12 km, depending on the target airframe size [48]. In contrast, within-visual-range (WVR) engagements are those where the enemies are visible to the pilot's naked eye.
The behavior models are learned in the context of a mission task assigned to one or more agents. The mission task corresponds to the mission or basic task a pilot could encounter in an operation. The behavior model expresses maneuvers or other functions of a pilot as output values. Naturally, only a subset of the pilot functions is included, and these are referred to as agent functions.
A key choice in modeling is the type of behavior model discussed in Section III, which shapes the behavior of the agent and how it will learn. The subsequent columns list the learning paradigm and learning method applied, for which summaries are provided in Sections IV-VII.
The simulation system is the environment in which a CGF resides. There is no universally favored simulation system, but rather a wide range of bespoke, commercial, and government-owned systems.
The last column states the degrees of freedom of a simulated aircraft. While many simulation systems technically provide 3D environments, some studies restrict the aircraft dynamics to 2D, reducing the number of maneuver variables in the agent function.
Based on the classification presented in Table I and related literature, Sections A-D present clear trends and challenges in the research field.
TABLE I
OVERVIEW OF THE SURVEYED LITERATURE IN TERMS OF ELEVEN KEY CHARACTERISTICS
Study  Year  Application  Single/multi  BVR/WVR  Mission task  Agent function  Behavior model  Learning paradigm  Learning method  Simulation system  Agent DOF
Abbott et al. [49], 2015 Training Single BVR, A/A, A/S, M, SA DT, BTN, IL BC, classifiers DVTE, 3D
Abbott [50], WVR formation NN NGTS
Abbott et al. [51]
Bae et al. [52] 2023 UCAV Single WVR Dogfight M NN RL, curriculum SAC JSBSim 3D
Chai et al. [53] 2023 UCAV Single WVR Dogfight M NN RL PPO, self-play Bespoke 3D
Ernest et al. [54], 2016 UCAV Single BVR, SEAD W Fuzzy tree/ RL GA Bespoke, 2D,
Ernest et al. [55], WVR system AFSIM 3D
Ernest et al. [56]
Gorton et al. [57] 2023 Training Single WVR CAP, flee M BTN EA NEAT (GA) NGTS 2D
Han et al. [58], 2022 Tactics, Multi BVR, A/A M, W NN RL, hierarchy A2C, PPO Bespoke 3D
Piao et al. [59], UCAV WVR
Sun et al. [60]
Hu et al. [61] 2021 Planning Single BVR Pursue, flee M, W NN RL DQL Bespoke 3D
Johansson [62] 2018 Training Single BVR Dogfight M, W BT EA GA TACSI 3D
Källström [63], 2022 Training Multi BVR, A/S, CAP, M, W, NN RL, DDPG Bespoke 2D
Källström et al. [64], WVR recon, jam, comms, curriculum
Källström et al. [2] police, coord jammer
Kong et al. [65], 2022 UCAV Multi WVR Dogfight M NN RL, hierarchy, DDPG, QMIX, JSBSim, 2D
Kong et al. [66] curriculum RSAC, self-play bespoke
Li et al. [67] 2022 Tactics Single WVR Dogfight M NN RL PPO Bespoke 3D
Ludwig and Presnell [68] 2019 Training Multi WVR Dogfight M BTN RL DS NGTS 2D
McGrew et al. [69] 2010 UCAV Single WVR Dogfight M NN RL ADP Bespoke 2D
Pope et al. [70], 2022 F-16 autopilot Single WVR Dogfight M NN RL, hierarchy SAC JSBSim 3D
Pope et al. [71]
Reinisch et al. [72] 2022 Training Multi BVR A/A M, W, CM, BT, NN RL, hierarchy n/a Bespoke 3D
R, jammer
Sandström [73], 2022 Training Single n/a Maneuver M NN IL, transfer BC VBS3 3D
Sandström et al. [74]
Selmonaj et al. [75] 2023 Training Multi WVR A/A M, W NN RL, hierarchy, PPO, self-play Bespoke 2D
curriculum
Sommer et al. [76] 2021 Training, CD&E Single BVR Avoid SAM M NN RL, transfer Neural MCTS CMO 3D
Strand et al. [77] 2023 Training Single WVR Formation M NN RL PPO Bespoke 2D
Teng et al. [78] 2012 Training Single BVR A/A M, W, CM NN RL Q-learning STRIVE 3D
Toubman [3] 2020 Training Multi BVR A/A, CAP M, W, R, Rules RL DS Bespoke 2D
comms
Yao et al. [79] 2015 Training Single BVR A/A M, W, CM, R BT EA GA Bespoke 3D
Zhang et al. [80] 2020 Planning Single BVR SEAD M NN RL PPO Bespoke, 2D
AFSIM
Zhang et al. [81] 2022 UCAV Multi WVR A/A M, W NN RL PPO, self-play Bespoke 3D
The table contains the following abbreviations not previously defined:
A/A  air-to-air
A/S  air-to-surface
CAP  combat air patrol
comms  communications
coord  coordination
CM  countermeasure
M  maneuver
n/a  not applicable
R  radar
recon  reconnaissance
SAM  surface-to-air missile
SEAD  suppression of enemy air defenses
SA  situational awareness
W  weapons
CD&E  concept development and experimentation

A. Air-to-Air Combat
The most frequent learning task is air-to-air combat, and in particular dogfighting. Dogfighting is close-range WVR air combat where basic fighter maneuvers (BFM) are used to arrive behind enemy aircraft for a favorable engagement position. It is an art that emerged naturally in World War I [82] and follows principles such as balancing airspeed and altitude, minimizing turn rates, attacking from the direction of the sun, and avoiding overshoots. In broader terms, dogfighting is a three-dimensional geometrical problem governed by the physical limitations of the aircraft and pilots.
The Defense Advanced Research Projects Agency (DARPA) Air Combat Evolution (ACE) program "seeks to increase trust in combat autonomy by using human-machine collaborative dogfighting as its challenge problem" [83]. A feasibility study for ACE called the AlphaDogfight Trials invited eight companies to make dogfighting agents that would compete in a series of knockout tournaments. The AlphaDogfight Trials culminated in 2020 when the top agent was matched with an expert human pilot and won [83]. Later, developers at ACE uploaded an agent to a modified F-16 known as the Variable In-flight Simulator Test Aircraft (VISTA) and demonstrated that the agent could control the aircraft in multiple sorties with various simulated adversaries and weapons systems [84].
BVR air combat was first seen on a large scale in the Vietnam War [48] and has gradually become the main type of air combat engagement due to more advanced weapons, sensors, and sensor fusion [72, 85]. The BVR aspect adds maneuver elements such as breaking radar locks and exhausting the energy of incoming missiles. When to use radar and fire missiles becomes critical [86].

B. Neural Networks
Three-quarters of the studies use neural networks to represent behavior models. Deep neural networks provide a hierarchical representation of the environment. The first hidden layer constitutes new abstract concepts based on the percepts. Each consecutive layer represents more abstract concepts [87]. Thus, there is potential in deep neural networks to express more advanced concepts such as threat assessment, engagement envelopes, energy management, and cooperation. The hidden layer sizes are often powers of two to optimize memory allocation and access. Two hidden layers of 256 nodes are used in multiple studies [53, 66, 81], but generally, a large range of shapes and sizes are employed, which is necessary to accommodate different state spaces, action spaces, and behaviors.
In most of the simulations we are concerned with, the agent acts based only on the observed state of the entities at the current time. However, some studies apply gated memory to represent the states at earlier time steps. Hu et al. [61] employ long short-term memory (LSTM) in their policy network to represent the percepts of 30 state variables from the five previous time steps along with the current one. Each LSTM unit decides what information to keep and what information to discard from the previous state. Bae et al. [52] measured a substantial increase in performance when adding an LSTM layer, especially in partially observable environments where the agent is less informed. Kong et al. [66] and Selmonaj et al. [75] use gated recurrent units, a simpler version of LSTM, in both actor and critic networks. Zhang et al. [81] and Källström et al. [2] simply use one concatenated input layer with the four most recent percepts of the state variables.
[2] simply make one concatenate input layer with the four most
The most frequent learning task is air-to-air combat and in recent percepts of the state variables.
particular dogfighting. Dogfighting is close-range WVR air Simulations with many entities, and potentially a variable
combat where basic fighter maneuvers (BFM) are used to arrive number of entities, pose a question of how to design the input
behind enemy aircraft for a favorable engagement position. It is layer of the policy network. Too many percepts will be a noisy
an art that emerged naturally in World War I [82] and follows representation of the environment, yet there must be sufficient
principles such as balancing airspeed and altitude, minimizing percepts to enable situational awareness. Han et al. [58] use
turn rates, attacking from the direction of the sun, and avoiding graph attention networks to recognize the most important ally
overshoots. In broader terms, dogfighting is a three- and enemy in the air combat scenario for all aircraft. Based on
dimensional geometrical problem governed by the physical that they construct an input layer representing the ownship,
limitations of the aircraft and pilots. primary enemy, primary ally, and the primary enemy of the
The Defense Advanced Research Project Agency (DARPA) primary ally. Additionally, they include information on all
Air Combat Evolution (ACE) program “seeks to increase trust aircraft that have the agent aircraft as the most important friend,
in combat autonomy by using human-machine collaborative but to fix the input length, they add these together by variable.
dogfighting as its challenge problem” [83]. A feasibility study The solution of Kong et al. [66] is to embed information about
for ACE called the AlphaDogfight Trials invited eight all friendly and enemy aircraft in multi-head attention
companies to make dogfighting agents that would compete in a structures, while Selmonaj et al. [75] opt to include only the
series of knockout tournaments. The AlphaDogfight Trials closest friends and enemies. Zhang et al. [80] also discuss the
culminated in 2020 when the top agent was matched with an challenge of a variable number of entities and suggest either
expert human pilot and won [83]. Later, developers at ACE representing irrelevant entities by zeros or dividing the play
uploaded an agent to a modified F-16 known as the Variable In- area into regions with associated counts of friends and enemies.
flight Simulator Test Aircraft (VISTA) and demonstrated that
the agent could control the aircraft in multiple sorties with C. Actor-Critic Methods
various simulated adversaries and weapons systems [84]. Many of the studies train the neural network with actor-critic
Air combat BVR was first seen on large scale in the Vietnam methods, represented by A2C [33], DDPG [31], PPO [36], and
War [48], and has gradually become the main type of air combat SAC [35], which combine advantages from policy-based and
engagement due to more advanced weapons, sensors, and value-based methods, such as allowing high-dimensional state
sensor fusion [72, 85]. The BVR aspect adds maneuver spaces, continuous action spaces and reduced variance in policy
elements such as breaking radar locks and exhausting the gradient estimates.
energy of incoming missiles. When to use radar and fire Sharing layers between the actor and critic networks allows
missiles becomes critical [86]. learning of common abstract concepts from raw percepts and
reduces parameters and computation. Li et al. [67] make two
B. Neural Networks
hidden layers shared while other studies employ one shared
Three-quarters of the studies use neural networks to represent layer [58, 59, 75]. Sharing layers between agents is also
behavior models. Deep neural networks provide a hierarchical possible. If the agents have similar tasks, they may share the
representation of the environment. The first hidden layer same neural network and receive a common reward [66].
constitutes new abstract concepts based on the percepts. Each Multiple studies [60, 2, 75] apply CTDE which relaxes the
consecutive layer represents more abstract concepts [87]. Thus, assumption of homogeneous agents but still produces
there is potential in deep neural networks to express more coordination between them [88].
advanced concepts such as threat assessment, engagement
envelopes, energy management, and cooperation. The hidden
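Layer sharing of this kind can be sketched as a common torso feeding a policy head and a value head; the layer sizes are arbitrary and the weights are random placeholders rather than a trained model.

# Sketch of an actor-critic network with one shared hidden layer feeding both the policy
# head and the value head (sizes arbitrary; weights are random placeholders).
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 12, 64, 5
W_shared = rng.standard_normal((obs_dim, hidden)) * 0.1
W_actor = rng.standard_normal((hidden, n_actions)) * 0.1
W_critic = rng.standard_normal((hidden, 1)) * 0.1

def actor_critic(percepts):
    h = np.tanh(percepts @ W_shared)             # shared abstract features from raw percepts
    logits = h @ W_actor                         # actor head: action preferences
    action_probs = np.exp(logits) / np.exp(logits).sum()
    value = float(h @ W_critic)                  # critic head: state-value estimate
    return action_probs, value

probs, v = actor_critic(rng.standard_normal(obs_dim))
print(probs.round(3), round(v, 3))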
proficiency in BVR scenarios, which are generally less acrobatic and more strategic.
The fact that dogfights are less strategic and more tangible may also be a reason they are preferred as use cases. Simple goals such as avoiding being shot or shooting down enemy aircraft are easier to reflect in reward functions that guide RL properly. Nonetheless, the goals in BVR combat are not so different. Some studies actively use reward shaping to improve learning convergence and include doctrine or domain knowledge [59, 60, 3].
Multi-agent RL has emerged as a powerful learning paradigm due to its ability to capture interactions and tactical dependencies and dynamics between agents. Pilots do not operate alone, and it is essential they learn how to cooperate with their flight and squadron. In MARL, each agent represents a non-stationarity for the other friendly and adversarial agents, which makes the learning fundamentally more difficult, but also more realistic [106]. Key aspects of cooperation include formations, target coordination, and defensive support, which all become second nature to pilots eventually. In contrast, explicitly formulating collaborative behavior rules is hard [58].
Complex behavior models may exploit hierarchical structures to break down a task into smaller parts, as described in Section VII-C. The high-level policies in the surveyed literature all include the choice between at least one defensive and one aggressive sub-policy. A defensive policy is used when the opponent has the advantage, and vice versa. Kong et al. [66] and Pope et al. [71] include a dedicated sub-policy for attaining the control zone position behind the opponent aircraft. Selmonaj et al. [75] and Kong et al. [66] also include target selection in the high-level policy. The architecture of Sun et al. [60] is distinct because the high-level policy has as many as 14 outputs that encode BFM macro actions. They use a low-level policy to decide the normal load factor and velocity command to apply to the selected macro action.

X. CONCLUSION

The most prominent applications for air combat behavior models based on machine learning are enhancing simulation-based pilot training, mission planning, developing new tactics and strategies, and optimizing unmanned aerial combat vehicles. The reviewed studies exhibit a concerted effort to model behavior for specific air combat tasks, particularly in the simulation-based training of fighter pilots. However, despite notable progress, challenges persist in seamlessly integrating these models into comprehensive pilot training programs, presenting both technical and human obstacles that demand attention.
The desire for adaptable agents with a wide range of functions, including maneuvering, weapons, radar controls, and countermeasures, emphasizes the need for a comprehensive approach to machine learning-based behavior modeling. Certain studies also point to the importance of making more of the agent functions explicit to attain sufficient realism. The technical challenges of transferring agents from their learning environments to pilot training simulation systems underscore the importance of aligning percepts and action formats, as well as maintaining dynamics models that balance sophistication with execution speed. Standardization remains a challenge, with few studies employing the same simulation system, highlighting the need for initiatives like the Not So Grand Challenge to establish common testbeds.
After surveying the current state of the research field, we have reached four recommendations intended to aid advancements toward more comprehensive, adaptable, and realistic machine learning-based behavior models for air combat.

A. Emphasis on Beyond Visual Range Scenarios
Although dogfighting machine learning agents are impressive, they are not highly relevant in the current state of air combat. Based on our review, a reasonable shift of focus would be from WVR missions to prioritizing the development of behavior models that incorporate the complexities and strategic aspects required in BVR missions. This applies particularly to the applications of simulation-based air combat training, mission planning, and the development of new tactics and strategies.

B. Enhanced Focus on Multi-Agent Machine Learning and Cooperation
Fighter pilots do not operate alone, yet the research field contains a preponderance of studies focusing on the behavior of a single agent. The effectiveness of multi-agent methods in capturing tactical dependencies and interactions among agents suggests a need for increased research in this area. Future studies are urged to delve deeper into cooperative behaviors among agents, emphasizing formations, target coordination, and defensive support. Understanding and simulating the complexities of teamwork and collaboration in air combat scenarios will contribute to more realistic air combat experiences.

C. Utilization of Hierarchical Behavior Models
Adopting hierarchical structures to break down complex air combat scenarios into smaller, more manageable sub-problems is a promising direction for future research. Furthermore, a hierarchical decision-making process allows for a coherent representation of an otherwise convoluted policy. High-level policies guiding defensive and aggressive sub-policies, as observed in current literature, can be expanded and refined to address a broader range of mission tasks, ultimately enhancing the versatility and adaptability of behavior models applied to all mentioned applications.

D. Standardization and Collaboration Initiatives
Considering the current lack of standardization and benchmarking in simulation systems, researchers would benefit from active participation in collaborative initiatives like the Not So Grand Challenge. Establishing common testbeds and standardized environments and scenarios facilitates cross-comparison of different behavior models and ensures that advancements in one research project can be applied and tested in others. This collaborative approach accelerates progress and contributes to developing more robust and universally applicable machine learning-based behavior models for air combat applications.
[56] N. Ernest, D. Carroll, C. Schumacher, M. Clark, K. Cohen, and G. Lee, "Genetic fuzzy based artificial intelligence for unmanned combat aerial vehicle control in simulated air combat missions," J. Def. Manag., vol. 6, no. 1, Mar. 2016.
[57] P. Gorton, M. Asprusten, and K. Brathen, "Imitation learning for modelling air combat behaviour," FFI, Kjeller, Norway, Rep. 22/02423, 2023.
[58] Y. Han et al., "Deep relationship graph reinforcement learning for multi-aircraft air combat," in IJCNN, Padua, Italy, 2022, pp. 1–8, doi: 10.1109/IJCNN55064.2022.9892208.
[59] H. Piao et al., "Beyond-visual-range air combat tactics auto-generation by reinforcement learning," in IJCNN, Glasgow, UK, 2020, pp. 1–8, doi: 10.1109/IJCNN48605.2020.9207088.
[60] Z. Sun et al., "Multi-agent hierarchical policy gradient for air combat tactics emergence via self-play," Eng. Appl. Artif. Intell., vol. 98, Art. no. 104112, Feb. 2021, doi: 10.1016/j.engappai.2020.104112.
[61] D. Hu, R. Yang, J. Zuo, Z. Zhang, J. Wu, and Y. Wang, "Application of deep reinforcement learning in maneuver planning of beyond-visual-range air combat," IEEE Access, vol. 9, pp. 32282–32297, Feb. 2021, doi: 10.1109/ACCESS.2021.3060426.
[62] T. Johansson, "Tactical simulation in air-to-air combat," Master's thesis, LTU, Luleå, Sweden, 2018.
[63] J. Källström and F. Heintz, "Multi-agent multi-objective deep reinforcement learning for efficient and effective pilot training," in Aerosp. Technol. Congr., vol. 162, Stockholm, Sweden, 2019, paper 11, pp. 101–111.
[64] ——, "Agent coordination in air combat simulation using multi-agent deep reinforcement learning," in IEEE SMC, Toronto, Canada, 2020, pp. 2157–2164, doi: 10.1109/SMC42975.2020.9283492.
[65] W. Kong, D. Zhou, and Z. Yang, "Air combat strategies generation of CGF based on MADDPG and reward shaping," in CVIDL, Chongqing, China, 2020, pp. 651–655, doi: 10.1109/CVIDL51233.2020.000-7.
[66] W. Kong, D. Zhou, Y. Du, Y. Zhou, and Y. Zhao, "Hierarchical multi-agent reinforcement learning for multi-aircraft close-range air combat," IET CTA, vol. 17, no. 13, pp. 1840–1862, Sept. 2022.
[67] L. Li, Z. Zhou, J. Chai, Z. Liu, Y. Zhu, and J. Yi, "Learning continuous 3-DoF air-to-air close-in combat strategy using proximal policy optimization," in IEEE CoG, Beijing, China, 2022, pp. 616–619, doi: 10.1109/CoG51982.2022.9893690.
[68] J. Ludwig and B. Presnell, "Developing an adaptive opponent for tactical training," in AIS, ser. LNISA, vol. 11597, Orlando, FL, 2019, pp. 379–388.
[69] J. S. McGrew, J. P. How, B. Williams, and N. Roy, "Air-combat strategy using approximate dynamic programming," JGCD, vol. 33, no. 5, pp. 1641–1654, Sept. 2010, doi: 10.2514/1.46815.
[70] A. P. Pope et al., "Hierarchical reinforcement learning for air-to-air combat," in ICUAS, Athens, Greece, 2021, pp. 275–284, doi: 10.1109/ICUAS51884.2021.9476700.
[71] ——, "Hierarchical reinforcement learning for air combat at DARPA's AlphaDogfight Trials," TAI, vol. 4, no. 6, pp. 1–15, Dec. 2022, doi: 10.1109/TAI.2022.3222143.
[72] F. Reinisch, M. Strohal, and P. Stütz, "Behaviour modelling of computer-generated-forces in beyond-visual-range air combat," in SIMULTECH, vol. 1, Lisbon, Portugal, 2022, pp. 327–335.
[73] V. Sandström, "On the efficiency of transfer learning in a fighter pilot behavior modelling context," Master's thesis, KTH, Stockholm, Sweden, 2021.
[74] V. Sandström, L. Luotsinen, and D. Oskarsson, "Fighter pilot behavior cloning," in ICUAS, Dubrovnik, Croatia, 2022, pp. 686–695, doi: 10.1109/ICUAS54217.2022.9836131.
[75] A. Selmonaj, O. Szehr, G. D. Rio, A. Antonucci, A. Schneider, and M. Rüegsegger, "Hierarchical multi-agent reinforcement learning for air combat maneuvering," arXiv, vol. 2309, Art. no. 11247, pp. 1–8, Sept. 2023, doi: 10.48550/arXiv.2309.11247.
[76] M. Sommer, M. Rüegsegger, O. Szehr, and G. Del Rio, "Deep self-optimizing artificial intelligence for tactical analysis, training and optimization," in AI4HMO, ser. STO-MP-IST, vol. 190. Koblenz, Germany: NATO STO, 2021, paper 19.
[77] A. Strand, P. Gorton, M. Asprusten, and K. Brathen, "Learning environment for the air domain (LEAD)," in WSC, San Antonio, TX, 2023, pp. 3035–3046.
[78] T.-H. Teng, A.-H. Tan, Y.-S. Tan, and A. Yeo, "Self-organizing neural networks for learning air combat maneuvers," in IJCNN, Brisbane, Australia, 2012, pp. 1–8, doi: 10.1109/IJCNN.2012.6252763.
[79] J. Yao, Q. Huang, and W. Wang, "Adaptive CGFs based on grammatical evolution," Math. Probl. Eng., vol. 2015, Art. no. 197306, Dec. 2015.
[80] L. A. Zhang et al., "Air dominance through machine learning," RAND Corp., Santa Monica, CA, Rep. AD1100919, 2020.
[81] H. Zhang, Y. Wei, H. Zhou, and C. Huang, "Maneuver decision-making for autonomous air combat based on FRE-PPO," Appl. Sci., vol. 12, no. 20, Art. no. 10230, Oct. 2022.
[82] R. L. Shaw, Fighter combat. Annapolis, MD: NIP, 1985, p. xii.
[83] C. R. DeMay, E. L. White, W. D. Dunham, and J. A. Pino, "AlphaDogfight Trials," Johns Hopkins APL Tech. Dig., vol. 36, no. 2, pp. 154–163, July 2022.
[84] DARPA, "ACE program's AI agents transition from simulation to live flight," 2023. [Online]. Available: https://www.darpa.mil/news-events/-2023-02-13
[85] J. P. A. Dantas, A. N. Costa, D. Geraldo, M. R. O. A. Maximo, and T. Yoneyama, "Engagement decision support for beyond visual range air combat," in LARS/SBR/WRE, Natal, Brazil, 2021, pp. 96–101, doi: 10.1109/LARS/SBR/WRE54079.2021.9605380.
[86] S. Aronsson et al., "Supporting after action review in simulator mission training," JDMS, vol. 16, no. 3, pp. 219–231, July 2019.
[87] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. Cambridge, MA: MIT press, 2016.
[88] J. K. Terry, N. Grammel, S. Son, B. Black, and A. Agrawal, "Revisiting parameter sharing in multi-agent deep reinforcement learning," arXiv, vol. 2005, Art. no. 13625, May 2020, doi: 10.48550/arXiv.2005.13625.
[89] J. Berndt and A. De Marco, "Progress on and usage of the open source flight dynamics model software library, JSBSim," in AIAA MST, 2009, paper 5699, doi: 10.2514/6.2009-5699.
[90] T. Johansson, "Metodiker vid regelskrivning," Linköping, Sweden, 2004.
[91] D. Siksik, "STRIVE: An open and distributed architecture for CGF representations," in 9th CGF & BR Conf., Orlando, FL, 2000, pp. 16–18.
[92] B. I. Simulations, "VBS4 product brochure," 2024, accessed on Feb. 28, 2024. [Online]. Available: https://bisimulations.com/sites/default/files/-data_sheets/bisim_product_flyers_2024_vbs4.pdf
[93] Matrix Games, Command Modern Operations Game Manual, Staten Island, NY, 2023.
[94] B. Johnson et al., "Game theory and prescriptive analytics for naval wargaming battle management aids," NPS, Monterey, CA, Rep. AD1184544, 2018.
[95] M. P. Bailey and R. Armstrong, "The deployable virtual training environment," in I/ITSEC, Orlando, FL, 2002, pp. 843–849.
[96] P. D. Clive, J. A. Johnson, M. J. Moss, J. M. Zeh, B. M. Birkmire, and D. D. Hodson, "Advanced framework for simulation, integration and modeling (AFSIM)," in CSC, 2015, pp. 73–77.
[97] J. D. Souza, P. J. L. Silva, A. H. M. Pinto, F. F. Monteiro, and J. M. X. N. Teixeira, "Assessing the reality gap of robotic simulations with educational purposes," in LARS/SBR/WRE, Natal, Brazil, 2020, pp. 1–6, doi: 10.1109/LARS/SBR/WRE51543.2020.9306947.
[98] IEEE, "Standard for modeling and simulation (M&S) high level architecture (HLA)," Piscataway, NJ, 2010.
[99] G. Brockman et al., "OpenAI gym," arXiv, vol. 1606, Art. no. 01540, June 2016, doi: 10.48550/arXiv.1606.01540.
[100] E. Watz and M. J. Doyle, "Fighter combat-tactical awareness capability (FC-TAC) for use in live virtual constructive training," in Fall SIW, Orlando, FL, 2014, paper 44.
[101] M. J. Doyle and A. M. Portrey, "Rapid adaptive realistic behavior modeling is viable for use in training," in BRiMS, Washington, DC, 2014, pp. 73–80, doi: 10.13140/2.1.4964.4802.
[102] M. J. Doyle, "A foundation for adaptive agent-based 'on the fly' learning of TTPs," J. Comput. Eng. Inf. Technol., vol. 6, no. 3, Art. no. 1000173, June 2017, doi: 10.4172/2324-9307.1000173.
[103] W. Warwick and S. Rodgers, "Wrong in the right way: Balancing realism against other constraints in simulation-based training," in AIS, ser. LNCS, vol. 11597, Orlando, FL, 2019, pp. 379–388.
[104] J. Freeman, E. Watz, and W. Bennett, "Assessing and selecting AI pilots for tactical and training skill," in Towards On-Demand Personalized Training and Decision Support, ser. STO-MP-MSG, vol. 177. Virtual: NATO STO, 2020, paper 14.
[105] W. Bennett, "Readiness product line," AFRL Fight's On!, no. 67, pp. 6–8, Nov. 2022.
[106] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," AAMAS, vol. 33, no. 6, pp. 750–797, Nov. 2019, doi: 10.1007/s10458-019-09421-1.