Article
HMCTS-OP: Hierarchical MCTS Based Online
Planning in the Asymmetric Adversarial Environment
Lina Lu, Wanpeng Zhang *, Xueqiang Gu, Xiang Ji and Jing Chen
College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073,
Hunan, China; [email protected] (L.L.); [email protected] (X.G.); [email protected] (X.J.);
[email protected] (J.C.)
* Correspondence: [email protected]
Received: 7 February 2020; Accepted: 31 March 2020; Published: 2 May 2020
Abstract: The Monte Carlo Tree Search (MCTS) has demonstrated excellent performance in solving
many planning problems. However, in many practical applications, especially in adversarial environments, the state space and the branching factors are huge and the planning horizon is long. In flat, non-hierarchical MCTS, it is computationally expensive to cover a sufficient number of rewarded states that are far away from the root. Therefore, flat MCTS is inefficient for planning problems with a long planning horizon, a huge state space, and large branching factors. In this work, we propose a novel hierarchical MCTS-based online planning method named
the HMCTS-OP to tackle this issue. The HMCTS-OP integrates the MAXQ-based task hierarchies and
the hierarchical MCTS algorithms into the online planning framework. Specifically, the MAXQ-based
task hierarchies reduce the search space and guide the search process, which significantly reduces the computational complexity. This reduction enables the MCTS to perform a deeper search and find better actions in a limited time. We evaluate the
performance of the HMCTS-OP in the domain of online planning in the asymmetric adversarial
environment. The experiment results show that the HMCTS-OP outperforms other online planning
methods in this domain.
1. Introduction
It is challenging to solve large-scale planning problems. These problems suffer from the “curse of
dimensionality”. Online planning algorithms (e.g., the Monte Carlo Tree Search (MCTS)) overcome
this problem by avoiding calculation of the complete policy for the whole state space. Additionally,
the MCTS has demonstrated excellent performance in many planning problems [1].
However, existing MCTS algorithms, including UCT (Upper Confidence Bounds (UCB) applied to Trees), cannot efficiently cope with long-horizon planning problems that have a huge state space and large branching factors. The performance of the MCTS is determined by the effective search depth [2].
For long-horizon problems with a huge state space and branching factors, the MCTS needs to search
deeply to cover a sufficient number of rewarded states that are far away from the root, which leads to a
large computational overhead and poor performance.
In this work, we focus on the online planning problem in the asymmetric adversarial environment.
This domain has asymmetric military forces between two opposing players. Due to the long planning
horizon, huge state space, and significant disparity among forces of this domain, the planning problem
in this domain is very challenging for the MCTS.
In order to improve the performance of the MCTS, task hierarchies or macro actions are introduced into planning [1] to reduce the computational cost. We define the set of primitive actions as A, the set of hierarchical tasks (subtasks) as Ã, the planning horizon as T, and the upper bound on the length of each subtask as L. Once the primitive actions in the MCTS are replaced by the high-level subtasks, the computational cost is reduced from O(|A|^T) to O(|Ã|^(T/L)) [3]. Therefore, integrating the task hierarchies into the MCTS algorithm makes it more effective for solving large-scale problems with a long planning horizon.
In this work, we use MAXQ [4] to formalize the task hierarchies. MAXQ is a method for decomposing the Markov Decision Process (MDP) value function; it makes the value function representation more compact and makes the subtasks portable.
In this paper, we propose a novel hierarchical MCTS-based online planning method named the
HMCTS-OP. The HMCTS-OP integrates the MAXQ-based task hierarchies and the hierarchical MCTS
algorithms into the online planning framework. Specifically, the MAXQ-based task hierarchies reduce
the search space and guide the search process. Therefore, the computational cost is significantly reduced, which enables the MCTS to perform a deeper search in a limited time frame. As a result, it can provide a more accurate value function estimate for the root node and allow the root node to implement better action selection policies within the limited time frame. Moreover, combined with the advantages
of hierarchical planning and online planning, the HMCTS-OP facilitates the design and utilization of
autonomous agents in the asymmetric adversarial environment.
The main contributions of this paper are as follows:
1. We model the online planning problem in the asymmetric adversarial environment as an MDP and
extend the MDP to the semi-Markov decision process (SMDP) by introducing the task hierarchies.
This provides the theoretical foundation for MAXQ hierarchical decomposition.
2. We derive the MAXQ value hierarchical decomposition for the defined hierarchical tasks.
The MAXQ value hierarchical decomposition provides a scalable way to calculate the rewards of
hierarchical tasks in HMCTS-OP.
3. We use the MAXQ-based task hierarchies to reduce the search space and guide the search process.
Therefore, the computational cost is significantly reduced, which enables the MCTS to search
deeper to find better actions within a limited time frame. As a result, the HMCTS-OP can perform
better in online planning in the asymmetric adversarial environment.
The rest of this paper is structured as follows. In Section 2, we start by introducing the necessary
background. Section 3 summarizes the related work. Section 4 discusses, in detail, how to model the
problem, extend the MDP to SMDP, and integrate the MAXQ-based task hierarchies in the MCTS for
online planning in the asymmetric adversarial environment. In Section 5, we evaluate the performance
of HMCTS-OP. In Section 6, we draw conclusions and discuss several future research areas.
2. Background
The framework of the SMDP is the foundation of hierarchical planning and learning.
Figure 1. A graphic example of MAXQ-based task hierarchies.
Mi is defined as a three-tuple ⟨Ti, Ai, R̃i⟩:
• Ti represents the termination condition of subtask Mi, which is used to judge whether Mi is terminated. Specifically, Si and Gi are the active states and termination states of Mi, respectively. If the current state s ∈ Gi or the predefined maximum calculation time or number of iterations is reached, Ti is set to 1, indicating that Mi is terminated.
• Ai is a set of actions; it contains both primitive actions and high-level subtasks.
• R̃i is the optional pseudo-reward function of Mi.
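For concreteness, this three-tuple can be written down directly as a data structure. The following Python sketch is illustrative only; the state type and the callable signatures are assumptions, not part of the MAXQ definition in [4].

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A minimal sketch of the three-tuple <Ti, Ai, R~i> that defines a subtask Mi.
@dataclass
class Subtask:
    name: str
    termination: Callable[[object], bool]                      # Ti: True when Mi terminates in a state
    actions: List["Subtask"] = field(default_factory=list)     # Ai: primitive actions and child subtasks
    pseudo_reward: Optional[Callable[[object], float]] = None  # R~i: optional pseudo-reward

    @property
    def is_primitive(self) -> bool:
        # A primitive action has no child actions of its own.
        return len(self.actions) == 0
```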
The core component of MAXQ is the hierarchical decomposition of the value function. The value function for the MAXQ hierarchical subtask t is defined as

Q^π(i, s, t) = V^π(t, s) + C^π(i, s, t).   (1)
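To make the recursive structure of this decomposition concrete, the sketch below evaluates Equation (1) for a fixed hierarchical policy, following the standard MAXQ definitions of V and C [4]. The `primitive_reward`, `completion`, and `policy` callables are assumed inputs (e.g., tables estimated elsewhere); this is an illustration, not the planning algorithm itself.

```python
# Illustrative sketch of Equation (1): Q^pi(i, s, t) = V^pi(t, s) + C^pi(i, s, t).

def v_value(task, s, primitive_reward, completion, policy):
    """V^pi(task, s): expected one-step reward for a primitive task,
    otherwise the Q-value of the child chosen by the hierarchical policy."""
    if task.is_primitive:
        return primitive_reward(task, s)
    return q_value(task, s, policy(task, s), primitive_reward, completion, policy)

def q_value(parent, s, child, primitive_reward, completion, policy):
    """Equation (1): value of executing `child` in s and then completing `parent`."""
    return (v_value(child, s, primitive_reward, completion, policy)
            + completion(parent, s, child))
```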
2.3. MCTS
The Monte Carlo Tree Search (MCTS) is a best-first search. It combines the accuracy of tree search with the universality of stochastic sampling. The MCTS finds the optimal action in the current state by constructing an asymmetric optimal prior search tree based on a Monte Carlo simulation [6]. Specifically, the MCTS first selects an action according to a tree policy. Next, it executes the action through a Monte Carlo simulation and observes the new state. If the state is already on the tree, it evaluates the state recursively; otherwise, it adds a new node to the tree. Then, it runs the default rollout policy from this node until termination. Finally, it updates the information of each node on the path via backpropagation. Figure 2 shows the basic process of the MCTS approach.
Figure 2. The basic process of the Monte Carlo Tree Search (MCTS) approach.
UCT (UCB applied to Trees) is one of the most famous MCTS implementations [7]. It is a standard MCTS algorithm for planning in MDPs which uses UCB1 [8] as the tree policy. The tree policy attempts to balance exploration and exploitation [6,9,10]. If all actions of a node have been selected, then the subsequent actions of this node are selected according to the UCB1 policy. Specifically,

UCB(s, a) = Q(s, a) + c √( log N(s) / N(s, a) )   (4)

where Q(s, a) is the average action value of all simulations in the context of (s, a), N(s, a) is the cumulative number of times that action a has been selected at state s, N(s) = Σ_{a∈A} N(s, a) is the total number of visits to state s, and c is a factor that can be used to balance exploration and exploitation.
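As a concrete illustration of Equation (4), the snippet below selects an action with the UCB1 rule. The per-state-action statistics dictionary is an assumed bookkeeping structure, and the rule is applied only once every action of the node has been tried at least once, so N(s, a) ≥ 1.

```python
import math

def ucb1_select(stats, s, actions, c=1.0):
    """Pick the action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    stats[(s, a)] is assumed to hold {"n": visit count, "q": mean return};
    every action must have been visited at least once before this is called.
    """
    n_s = sum(stats[(s, a)]["n"] for a in actions)   # N(s) = sum over a of N(s, a)

    def ucb(a):
        n_sa = stats[(s, a)]["n"]                    # N(s, a)
        return stats[(s, a)]["q"] + c * math.sqrt(math.log(n_s) / n_sa)

    return max(actions, key=ucb)
```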
3. Related Work
In this work, we integrate the underlying MAXQ-based task hierarchies of the problem with the MCTS to tackle online planning in an asymmetric adversarial environment. Substantial work has been done on MAXQ. Li et al. [11] improved the MAXQ method and proposed the context-sensitive reinforcement learning (CSRL) framework, but the focus of their work was to use common knowledge in the subtasks to learn transition dynamics effectively. Ponce et al. [12] proposed a MAXQ-based semi-automatic control structure for robots to explore unknown and messy urban search and rescue (USAR) scenarios. Devin et al. [13] introduced the MAXQ-based task hierarchies to offline learning. Hoang et al. [14] used hierarchical decomposition to guide high-level imitation learning. Bai et al. [15] extended the MAXQ method to online hierarchical planning. However, the transition model for each subtask must be estimated to calculate the completion function.
In the domain of online planning, the MCTS combines the best-first search and Monte Carlo
simulation to find the approximate optimal policies. However, in long-horizon problems with a large
state space and branching factors, MCTS algorithms must search deeply to cover a sufficient number of
rewarded states that are far away from the root, which leads to a large computational overhead and poor
performance. Many related works have been done to tackle this problem. Hostetler et al. [2] introduced
state abstraction into the MCTS to reduce random branching factors and improve the performance
of UCT algorithms with limited samples. Bai et al. [16] converted the state-abstracted model into a
POMDP (Partially Observable MDP) problem to solve the non-Markovian problem caused by state
abstraction and proposed a hierarchical MCTS to deal with state and action abstraction in the POMDP
setting. Vien et al. [1] integrated a task hierarchy into the UCT and POMCP (Partially Observable
Monte Carlo Planning) framework and evaluated the methods in various settings. Menashe et al. [17]
proposed the transition-based upper confidence bounds for trees (T-UCT) approach to learn and exploit
the dynamics of structured hierarchical environments. Although these approaches work well in some domains, they do not address online planning in the adversarial environment. Online planning in asymmetric adversarial environments with a long planning horizon, a huge state space, and large branching factors still poses a significant challenge.
There are many other related works. Sironi et al. [18] tuned different MCTS parameters to improve
an agent's performance in general video game playing. Neufeld et al. [19] proposed a hybrid planning approach that
combines the Hierarchical Task Network (HTN) and the MCTS. Santiago et al. [20] proposed the
informed MCTS, which exploits prior information to improve the performance of MCTS in large games.
However, the MCTS only works at the level of primitive actions in these works.
In this work, we propose HMCTS-OP, a hierarchical MCTS-based online planning algorithm.
The HMCTS-OP integrates the MAXQ-based task hierarchies and hierarchical MCTS algorithms into
the online planning framework. The evaluation shows that the HMCTS-OP is capable of excellent
performance in online planning in the asymmetric adversarial environment.
4. Method
1. Four navigation actions: NavUp (move upward), NavDown (move downward), NavLeft (move leftward), and NavRight (move rightward).
2. Four attack actions: FirUp (fire upward), FirDown (fire downward), FirLeft (fire leftward), and FirRight (fire rightward).
3. Wait.
Transition function. Each move action has a 0.9 probability of moving to the target position successfully and a 0.1 probability of staying in the current position. Each fire action has a 0.9 probability of damaging the opponent successfully and a 0.1 probability of failing.
Reward function. Each primitive action has a reward of −1. If the agent attacks the opponent successfully and the opponent loses 1 health point, the agent will get a reward of 20. Conversely, if the agent is attacked by the opponent and loses 1 health point, it will get a reward of −20. Moreover, the reward is 100 for winning the game and −100 for losing the game.
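These transition and reward definitions can be wired into a small generative model for simulation. The sketch below is only an illustration under assumptions: the grid-clamped position update, the state encoding, and the function names are ours, while the 0.9/0.1 probabilities and the reward values come from the definitions above (the −20 penalty and the ±100 terminal rewards would be applied by the surrounding game loop).

```python
import random

MOVE_SUCCESS = 0.9     # probability that a move action reaches the target cell
FIRE_SUCCESS = 0.9     # probability that a fire action damages the opponent
STEP_REWARD = -1       # reward of every primitive action
HIT_REWARD = 20        # bonus when the opponent loses 1 health point

def step_move(pos, delta, grid_size):
    """Apply one navigation action; with probability 0.1 the agent stays put."""
    if random.random() < MOVE_SUCCESS:
        x = min(max(pos[0] + delta[0], 0), grid_size - 1)
        y = min(max(pos[1] + delta[1], 0), grid_size - 1)
        return (x, y), STEP_REWARD
    return pos, STEP_REWARD

def step_fire(opponent_health):
    """Apply one attack action; with probability 0.1 the attack fails."""
    if random.random() < FIRE_SUCCESS:
        return opponent_health - 1, STEP_REWARD + HIT_REWARD
    return opponent_health, STEP_REWARD
```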
Figure 3. One of the scenarios in the asymmetric adversarial environment.

4.2. MAXQ Hierarchical Decomposition

In this section, we will describe the MAXQ hierarchical decomposition method over the overall task hierarchies in detail. First, we define a set of subtasks at different levels according to the related domain knowledge, which are used as the building blocks to construct the task hierarchies. Then, we extend the MDP to the SMDP by introducing hierarchical tasks and derive the MAXQ hierarchical decomposition over the defined hierarchical tasks.

The hierarchical subtasks are defined as follows:

- NavUp, NavDown, NavLeft, NavRight, FirUp, FirDown, FirLeft, FirRight, and Wait: These actions are defined by the RTS; they are primitive actions. When a primitive action is performed, a local reward of −1 is assigned to it. This ensures that the online policy of the high-level subtask can reach the corresponding goal state as soon as possible.
- NavToNeaOpp, NavToCloBaseOpp, and FireTo: The NavToNeaOpp subtask will move the light military unit to the closest enemy unit as soon as possible by performing NavUp, NavDown, NavLeft, and NavRight actions and taking into account the action uncertainties. Similarly, the NavToCloBaseOpp subtask will move the light military unit to the enemy unit closest to the base as fast as possible. The goal of the FireTo subtask is to attack enemy units within a range.
- Attack and Defense: The purpose of Attack is to destroy the enemy's units to win by planning the attacking behaviors, and the purpose of Defense is to defend against the enemy's units to protect bases by carrying out defensive behaviors.
- Root: This is a root task. The goal of Root is to destroy the enemy's units and protect bases. In the Root task, the Attack subtask and the Defense subtask are evaluated by the hierarchical UCB1 policy according to the HMCTS-OP, which is described in the next section in detail.

The overall task hierarchies of the online planning problem in the asymmetric adversarial environment are shown in Figure 4. The parentheses after the name of a subtask indicate that the subtask has parameters.
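One plausible way to encode these task hierarchies in code is as a mapping from each composite task to its child subtasks. The structure below is inferred from the subtask definitions above; the exact assignment of the navigation and fire subtasks to Attack and Defense is our assumption for illustration, not a verbatim transcription of Figure 4.

```python
NAV_PRIMITIVES = ["NavUp", "NavDown", "NavLeft", "NavRight"]
FIR_PRIMITIVES = ["FirUp", "FirDown", "FirLeft", "FirRight"]

# Children of each composite task; primitive actions have no entry of their own.
TASK_HIERARCHY = {
    "Root":            ["Attack", "Defense", "Wait"],
    "Attack":          ["NavToNeaOpp", "FireTo(p)"],
    "Defense":         ["NavToCloBaseOpp", "FireTo(p)"],
    "NavToNeaOpp":     NAV_PRIMITIVES,
    "NavToCloBaseOpp": NAV_PRIMITIVES,
    "FireTo(p)":       FIR_PRIMITIVES,
}
```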
According to the theory proposed by He et al. [3], the computational cost of action branching shrinks from O(|A|^T) to O(|Ã|^(T/L)) by introducing high-level subtasks (macro actions), where A is a set of primitive actions, Ã is a set of high-level subtasks, T is the planning horizon, and L ≥ 1 is an upper bound on the length of each high-level subtask. In this work, the stochastic branching factor of the root node is reduced from 9 to 3 by introducing the MAXQ-based hierarchical subtasks. Therefore, |Ã| = |A|/3 and T/L ≤ T when L ≥ 1. As a result, the computational complexity is significantly reduced in the planning problem with a long planning horizon, which enables the algorithm to search deeper to find a better plan.
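A back-of-the-envelope calculation illustrates the scale of this reduction for the branching factors used here (9 primitive actions versus 3 high-level subtasks at the root); the concrete values of T and L below are assumptions chosen only for illustration.

```python
A, A_TILDE = 9, 3        # primitive actions vs. high-level subtasks at the root
T, L = 20, 5             # assumed planning horizon and subtask length bound

flat_cost = A ** T                        # O(|A|^T): 9**20 is about 1.2e19
hierarchical_cost = A_TILDE ** (T // L)   # O(|A~|^(T/L)): 3**4 = 81
print(flat_cost, hierarchical_cost)
```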
Figure 4. The overall task hierarchies of the online planning problem in the asymmetric adversarial environment.
Based on the task hierarchies defined above, we extend the MDP to the SMDP by including an explicit reference to the subtasks. We introduce the variable k to represent the execution time steps of the subtask.

The transition function P : S × A × S × k → [0, 1] represents the probability of performing subtask a in state s and transitioning to state s′ after k time steps:

P(s′, k | s, a) = Pr{s_{t+k} = s′ | s_t = s, a_t = a}.   (5)

The reward function of the hierarchical task accumulates the single-step rewards when it is executed. The accumulation manner is related to the performance measure of HMCTS. In this work, we use the discount factor γ to maximize the accumulated discounted rewards.

We derive the MAXQ hierarchical decomposition as follows, according to Equations (1)–(3):

Q(Root, s, Attack) = V(Attack, s) + Σ_{s′,k} P(s′, k | s, Attack) γ^k V(Root, s′)   (6)

Q(Root, s, Defense) = V(Defense, s) + Σ_{s′,k} P(s′, k | s, Defense) γ^k V(Root, s′)   (7)

Q(Root, s, Wait) = R(Wait, s) + Σ_{s′,k} P(s′, k | s, Wait) γ^k V(Root, s′)   (8)

Q(Attack, s, FireTo(p)) = V(FireTo(p), s) + Σ_{s′,k} P(s′, k | s, FireTo(p)) γ^k V(Attack, s′)   (9)

Q(Attack, s, NavToCloOpp) = V(NavToCloOpp, s) + Σ_{s′,k} P(s′, k | s, NavToCloOpp) γ^k V(Attack, s′)   (10)
Algorithm 1 PLAY.
1: function PLAY (task t, state s, MAXQ hierarchy M, rollout policy πrollout )
2: repeat
3: a ← HMCTS-OP(t, s, M, πrollout )
4: s0 ← Execute(s, a)
5: s ← s0
6: until termination conditions
7: end function
Algorithm 2 HMCTS-OP.
1: function HMCTS-OP (task t, state s, MAXQ hierarchy M, rollout policy πrollout )
2: if t is primitive then Return ⟨R(s, t), t⟩
3: end if
4: if s ∉ St and s ∉ Gt then Return ⟨−∞, nil⟩
5: end if
6: if s ∈ Gt then Return ⟨0, nil⟩
7: end if
8: Initialize search tree T for task t
9: while within computational budget do
10: HMCTSSimulate (t, s, M, 0, πrollout )
11: end while
12: Return GetGreedyPrimitive(t, s)
13: end function
Algorithm 3 HMCTSSimulate.
1: function HMCTSSimulate (task t, state s, MAXQ hierarchy M, depth d, rollout policy πrollout )
2: steps = 0 // the number of steps executed by task t
3: if t is primitive then
4: (s0 , r, 1) ← Generative-Model(s, t) // domain generative model-simulator
5: steps = 1
6: Return (s0 , r, steps)
7: end if
8: if t terminates or d > dmax then
9: Return (s, 0, 0)
10: end if
11: if node (t, s) is not in tree T then
12: Insert node (t, s) to T
13: Return Rollout(t, s, d, πrollout )
14: end if
15: if node (t, s) is not fully expanded then
16: choose a ∈ untried subtask from subtasks(t)
17: else
18: a = arg max_{a′ ∈ subtasks(t)} [ Q(t, s, a′) + c √( log N(t, s) / N(t, s, a′) ) ]
19: end if
20: (s0 , r, k) = HMCTSSimulate(a, s, M, d, πrollout )
21: (s00 , r0 , k0 ) = Rollout (t, s0 , d + k, πrollout )
22: r = r + γk r0
23: steps = steps + k + k0
24: N (t, s) = N (t, s) + 1
25: N (t, s, a) = N (t, s, a) + 1
26: Q(t, s, a) = Q(t, s, a) + ( r − Q(t, s, a) ) / N(t, s, a)
27: Return (s00 , r, steps)
28: end function
Algorithm 4 Rollout.
1: function Rollout (task t, state s, depth d, rollout policy πrollout )
2: steps = 0
3: if t is primitive then
4: (s0 , r, 1) ← Generative-Model(s, t) // domain generative model-simulator
5: Return (s0 , r, 1)
6: end if
7: if t terminates or d > dmax then
8: Return (s, 0, 0)
9: end if
10: a ∼ πrollout (s, t)
11: (s0 , r, k) = Rollout(a, s, d, πrollout )
12: (s00 , r0 , k0 ) = Rollout (t, s0 , d + k, πrollout )
13: r = r + γk r0
14: steps = steps + k + k0
15: Return (s00 , r, steps)
16: end function
Algorithm 5 GetGreedyPrimitive.
1: function GetGreedyPrimitive (task t, state s)
2: if t is primitive then
3: Return t
4: else
5: a = arg max_{a′} Q(t, s, a′)
6: Return GetGreedyPrimitive (a, s)
7: end if
8: end function
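For readers who prefer executable code, the sketch below condenses Algorithms 3 and 4 into Python. It is an illustrative transcription, not the authors' implementation: the discount factor, the maximum depth, the exploration constant, and the task interface (is_primitive, terminates, subtasks, generative_model) are all assumptions.

```python
import math
import random

GAMMA = 0.95   # assumed discount factor
D_MAX = 100    # assumed maximum search depth (dmax)
C_UCB = 1.0    # assumed exploration constant c

def rollout(task, s, d, rollout_policy):
    # Algorithm 4: recursively execute the rollout policy until termination.
    if task.is_primitive:
        s_next, r = task.generative_model(s)
        return s_next, r, 1
    if task.terminates(s) or d > D_MAX:
        return s, 0.0, 0
    a = rollout_policy(s, task)                              # sample a child subtask
    s1, r, k = rollout(a, s, d, rollout_policy)              # execute the child
    s2, r2, k2 = rollout(task, s1, d + k, rollout_policy)    # continue the parent task
    return s2, r + (GAMMA ** k) * r2, k + k2

def hmcts_simulate(task, s, tree, d, rollout_policy):
    # Algorithm 3: one simulation of `task` from state `s`.
    if task.is_primitive:
        s_next, r = task.generative_model(s)
        return s_next, r, 1
    if task.terminates(s) or d > D_MAX:
        return s, 0.0, 0
    node = tree.get((task, s))
    if node is None:                                         # new node: insert, then roll out
        tree[(task, s)] = {"N": 0, "Na": {}, "Q": {}}
        return rollout(task, s, d, rollout_policy)
    untried = [a for a in task.subtasks(s) if a not in node["Na"]]
    if untried:                                              # not fully expanded
        a = random.choice(untried)
        node["Na"][a], node["Q"][a] = 0, 0.0
    else:                                                    # hierarchical UCB1 selection
        a = max(node["Q"], key=lambda b: node["Q"][b]
                + C_UCB * math.sqrt(math.log(node["N"]) / node["Na"][b]))
    s1, r, k = hmcts_simulate(a, s, tree, d, rollout_policy)
    s2, r2, k2 = rollout(task, s1, d + k, rollout_policy)
    r = r + (GAMMA ** k) * r2
    node["N"] += 1
    node["Na"][a] += 1
    node["Q"][a] += (r - node["Q"][a]) / node["Na"][a]       # incremental mean update
    return s2, r, k + k2
```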
We tested the performance of all methods against two fixed opponents (UCT with random rollout policy and Random). The accumulated match results for each algorithm were counted for each scenario (win/tie/loss), and the score was calculated as the number of wins plus 0.5 times the number of ties (wins + 0.5 ties). The performance was evaluated using the average game time, the score, and the average cumulative reward. All of the results were obtained over 50 games.

Table 1. Details of the three scenarios used in our experiment.

                 Scenario 1   Scenario 2   Scenario 3
Size             8 × 8        10 × 10      12 × 12
Units            23           23           23
Maximal states   10^36        10^41        10^44
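The scoring rule above (wins plus half the ties, accumulated over the 50 games of a scenario) amounts to the following small helper; the list-of-strings encoding of match outcomes is an assumption for illustration.

```python
def score(outcomes):
    """outcomes: one of 'win', 'tie', 'loss' per game; score = wins + 0.5 * ties."""
    return outcomes.count("win") + 0.5 * outcomes.count("tie")

print(score(["win", "tie", "loss", "win"]))   # 2.5
```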
Figure 6. The average game time and scores of 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT): (a,b) the results in scenario 1; (c,d) the results in scenario 2; (e,f) the results in scenario 3.

Furthermore, Table 2 shows the average cumulative reward over 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT). As shown in Table 2, the HMCTS-OP and Smart-HMCTS-OP outperformed the other algorithms. Next, we tested whether there is a significant performance difference between our algorithms and the other three algorithms from a statistical perspective.
Table 2. The average cumulative reward over 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT).

Opponent   Scenario   UCT       NaiveMCTS   Informed NaiveMCTS   HMCTS-OP   Smart-HMCTS-OP
Random     8 × 8      −38.94    −57.7       −153.56              109.94     276.34
Random     10 × 10    105.02    47.98       30.12                242.24     312.66
Random     12 × 12    −6.64     55.72       69.88                247.52     261.38
UCT        8 × 8      −929.98   −987.68     −995.06              −354.64    −288.38
UCT        10 × 10    −694.42   −610.48     −840.8               −236.34    −95.5
UCT        12 × 12    −653.12   −660.16     −659.54              −173.58    −89.34

We recorded all of the cumulative rewards of 50 games obtained from the evaluation of each algorithm in each scenario. These rewards were divided into 30 groups, as shown in Table 3. We performed statistical significance tests on these data. In this work, we used the Kruskal–Wallis one-way analysis of variance [22] to perform six tests among multiple groups of data (GroupR11~GroupR15, GroupU11~GroupU15, GroupR21~GroupR25, GroupU21~GroupU25, GroupR31~GroupR35, GroupU31~GroupU35) for each scenario and each opponent. Figure 7 and Table 4 show that there were significant differences in cumulative rewards between different algorithms for all scenarios and opponents.
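These tests can be reproduced with standard statistical tooling. The sketch below applies the Kruskal–Wallis one-way analysis of variance to the five groups of one scenario/opponent pair using SciPy; the reward arrays are arbitrary placeholders, not the paper's data, and the post hoc Tukey–Kramer adjustment of the pairwise p-values is omitted.

```python
from scipy import stats

# Placeholder cumulative-reward samples for the five algorithms in one
# scenario/opponent pair (e.g., GroupU11..GroupU15); not the paper's data.
group_uct      = [-900.0, -950.0, -920.0, -880.0, -940.0]
group_naive    = [-970.0, -990.0, -1000.0, -960.0, -985.0]
group_informed = [-995.0, -1005.0, -990.0, -1000.0, -992.0]
group_hmcts    = [-350.0, -360.0, -340.0, -355.0, -348.0]
group_smart    = [-290.0, -280.0, -300.0, -285.0, -295.0]

h_stat, p_value = stats.kruskal(group_uct, group_naive, group_informed,
                                group_hmcts, group_smart)
print(h_stat, p_value)   # reject H0 at alpha0 = 0.05 when p_value <= 0.05
```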
Opponent   Scenario   UCT        NaiveMCTS   Informed NaiveMCTS   HMCTS-OP   Smart-HMCTS-OP
Random     8 × 8      GroupR11   GroupR12    GroupR13             GroupR14   GroupR15
Random     10 × 10    GroupR21   GroupR22    GroupR23             GroupR24   GroupR25
Random     12 × 12    GroupR31   GroupR32    GroupR33             GroupR34   GroupR35
UCT        8 × 8      GroupU11   GroupU12    GroupU13             GroupU14   GroupU15
UCT        10 × 10    GroupU21   GroupU22    GroupU23             GroupU24   GroupU25
UCT        12 × 12    GroupU31   GroupU32    GroupU33             GroupU34   GroupU35
Moreover, we performed multiple tests for each pair to test the following hypotheses at level
α0 = 0.05:
H0 : There were no significant differences between the HMCTS-OP/Smart-HMCTS-OP and the
UCT/NaiveMCTS/Informed NaiveMCTS in scenarios 1, 2, and 3 against the Random/UCT.
Our test rejected a null hypothesis if and only if the p-value was, at most, α0 [23].
Table 5 shows the p-values of multiple tests for each pair. Moreover, we used the Tukey–Kramer test
to adjust the p-values to correct for the occurrence of false positives. As shown in Table 5, the p-values
were smaller than α0 for all tests when the opponent was UCT, and the differences were significant
between the Smart-HMCTS-OP and the other three algorithms for all scenarios and opponents.
Opponent   Scenario   Group            UCT            NaiveMCTS      Informed NaiveMCTS
Random     8 × 8      HMCTS-OP         0.4145         0.2623         0.0115
                      Smart-HMCTS-OP   1.7033 × 10−7  4.0795 × 10−8  9.9303 × 10−9
           10 × 10    HMCTS-OP         0.1634         0.0331         0.0010
                      Smart-HMCTS-OP   6.7051 × 10−7  2.6123 × 10−8  9.9410 × 10−9
           12 × 12    HMCTS-OP         6.2448 × 10−6  4.2737 × 10−4  6.7339 × 10−4
                      Smart-HMCTS-OP   1.1508 × 10−7  1.4085 × 10−5  2.4075 × 10−5
UCT        8 × 8      HMCTS-OP         9.9611 × 10−9  9.9217 × 10−9  9.9217 × 10−9
                      Smart-HMCTS-OP   9.9230 × 10−9  9.9217 × 10−9  9.9217 × 10−9
           10 × 10    HMCTS-OP         1.1821 × 10−6  1.9423 × 10−4  9.9353 × 10−9
                      Smart-HMCTS-OP   9.9718 × 10−9  5.6171 × 10−8  9.9217 × 10−9
           12 × 12    HMCTS-OP         7.1822 × 10−8  3.0630 × 10−8  1.5334 × 10−8
                      Smart-HMCTS-OP   9.9384 × 10−9  9.9260 × 10−9  9.9225 × 10−9
Based on the experimental results, we can see that HMCTS-OP and Smart-HMCTS-OP perform
significantly better than other methods. Moreover, the Smart-HMCTS-OP improves the HMCTS-OP
significantly with the help of the smart rollout policy. The introduction of attack-biased rollout policies
can be seen as an advantage of HMCTS-OP.
However, the drawback of HMCTS-OP is that we need to use a significant amount of domain
knowledge to construct the task hierarchies, which is challenging for very complex problems.
On the other hand, due to the introduction of domain knowledge, compared with other methods, the performance of the HMCTS-OP and the Smart-HMCTS-OP is greatly improved by using MAXQ-based task hierarchies to reduce the search space and guide the search process. Therefore, one of our areas of future work is to discover the underlying MAXQ-based task hierarchies automatically.
Figure 7. Cumulative rewards (y-axis: Cumulative Reward) of UCT, NaiveMCTS, Informed NaiveMCTS, HMCTS-OP, and Smart-HMCTS-OP for each scenario against each opponent.
6. Conclusions

In this work, we proposed the HMCTS-OP, a novel and scalable hierarchical MCTS-based online planning method. The HMCTS-OP integrates the MAXQ-based task hierarchies and hierarchical MCTS algorithms into the online planning framework. The aim of the method is to tackle online planning with a long horizon and a huge state space in the asymmetric adversarial environment. In the experiment, we empirically showed that the HMCTS-OP outperforms other methods in terms of the average cumulative reward, average game time, and accumulated score in three scenarios against Random and UCT. In the future, we plan to explore the performance of the HMCTS-OP in more complex problems. Moreover, we would like to solve the problem of automatically discovering underlying MAXQ-based task hierarchies. This is especially important for improving the performance of hierarchical learning and planning in larger, complex environments.
Author Contributions: Conceptualization, X.G.; Data curation, X.J.; Formal analysis, L.L.; Funding acquisition, X.G.; Supervision, J.C.; Writing—original draft, L.L.; Writing—review & editing, W.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research is supported by the National Natural Science Foundation of China (61603406, 61806212, 61603403 and 61502516).

Conflicts of Interest: The authors declare no conflict of interest.
References
1. Vien, N.A.; Toussaint, M. Hierarchical Monte-Carlo Planning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 3613–3619.
2. Hostetler, J.; Fern, A.; Dietterich, T. State Aggregation in Monte Carlo Tree Search. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014.
3. He, R.; Brunskill, E.; Roy, N. PUMA: Planning under Uncertainty with Macro-Actions. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 40, pp. 523–570.
4. Dietterich, T.G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. J. Artif. Intell. Res. 2000, 13, 227–303. [CrossRef]
5. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: New York, NY, USA, 1994; ISBN 0471619779.
6. Browne, C.; Powley, E. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 2012, 4, 1–49. [CrossRef]
7. Kocsis, L.; Szepesvári, C. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning;
Springer: Berlin/Heidelberg, Germany, 2006; pp. 282–293. ISBN 978-3-540-45375-8.
8. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn.
2002, 47, 235–256. [CrossRef]
9. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: London, UK, 2018.
10. Črepinšek, M.; Liu, S.H.; Mernik, M. Exploration and exploitation in evolutionary algorithms: A survey.
ACM Comput. Surv. 2013, 45, 1–33. [CrossRef]
11. Li, Z.; Narayan, A.; Leong, T. An Efficient Approach to Model-Based Hierarchical Reinforcement Learning.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA,
4–10 February 2017; pp. 3583–3589.
12. Doroodgar, B.; Liu, Y.; Nejat, G. A learning-based semi-autonomous controller for robotic exploration of
unknown disaster scenes while searching for victims. IEEE Trans. Cybern. 2014, 44, 2719–2732. [CrossRef]
[PubMed]
13. Schwab, D.; Ray, S. Offline reinforcement learning with task hierarchies. Mach. Learn. 2017, 106, 1569–1598.
[CrossRef]
14. Le, H.M.; Jiang, N.; Agarwal, A.; Dudík, M.; Yue, Y.; Daumé, H. Hierarchical imitation and reinforcement
learning. In Proceedings of the ICML 2018: The 35th International Conference on Machine Learning,
Stockholm, Sweden, 10–15 July 2018; Volume 7, pp. 4560–4573.
15. Bai, A.; Wu, F.; Chen, X. Online Planning for Large Markov Decision Processes with Hierarchical
Decomposition. ACM Trans. Intell. Syst. Technol. 2015, 6, 1–28. [CrossRef]
16. Bai, A.; Srivastava, S.; Russell, S. Markovian state and action abstractions for MDPs via hierarchical MCTS.
In Proceedings of the IJCAI: International Joint Conference on Artificial Intelligence, New York, NY, USA,
9–15 July 2016; pp. 3029–3037.
17. Menashe, J.; Stone, P. Monte Carlo hierarchical model learning. In Proceedings of the fourteenth International
Conference on Autonomous Agents and Multiagent Systems, Istanbul, Turkey, 4–8 May 2015; Volume 2,
pp. 771–779.
18. Sironi, C.F.; Liu, J.; Perez-Liebana, D.; Gaina, R.D. Self-Adaptive MCTS for General Video Game Playing. In Proceedings of the International Conference on the Applications of Evolutionary Computation; Springer: Cham, Switzerland, 2018.
19. Neufeld, X.; Mostaghim, S.; Perez-Liebana, D. A Hybrid Planning and Execution Approach Through HTN
and MCTS. In Proceedings of the IntEx Workshop at ICAPS-2019, London, UK, 23–24 October 2019; Volume 1,
pp. 37–49.
20. Ontañón, S. Informed Monte Carlo Tree Search for Real-Time Strategy games. In Proceedings of the IEEE
Conference on Computational Intelligence in Games CIG, New York, NY, USA, 22–25 August 2017.
21. Ontañón, S. The combinatorial Multi-armed Bandit problem and its application to real-time strategy games.
In Proceedings of the Ninth Artificial Intelligence and Interactive Digital Entertainment Conference AIIDE,
Boston, MA, USA, 14–15 October 2013; pp. 58–64.
22. Theodorsson-Norheim, E. Kruskal-Wallis test: BASIC computer program to perform nonparametric one-way
analysis of variance and multiple comparisons on ranks of several independent samples. Comput. Methods
Programs Biomed. 1986, 23, 57–62. [CrossRef]
23. DeGroot, M.H.; Schervish, M.J. Probability and Statistics, 4th ed.; Addison-Wesley: Boston, MA, USA, 2011; ISBN 0201524880.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).