Article
HMCTS-OP: Hierarchical MCTS Based Online
Planning in the Asymmetric Adversarial Environment
Lina Lu, Wanpeng Zhang *, Xueqiang Gu, Xiang Ji and Jing Chen
College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073,
Hunan, China; [email protected] (L.L.); [email protected] (X.G.); [email protected] (X.J.);
[email protected] (J.C.)
* Correspondence: [email protected]
Received: 7 February 2020; Accepted: 31 March 2020; Published: 2 May 2020
Abstract: The Monte Carlo Tree Search (MCTS) has demonstrated excellent performance in solving
many planning problems. However, in many practical applications, especially in adversarial environments, the state space and the branching factors are huge and the planning horizon is long. In flat, non-hierarchical MCTS, it is computationally expensive to cover a sufficient number of rewarded states that are far away from the root. Therefore, flat MCTS is inefficient for planning problems with a long planning horizon, a huge state space, and large branching factors. In this work, we propose a novel hierarchical MCTS-based online planning method named
the HMCTS-OP to tackle this issue. The HMCTS-OP integrates the MAXQ-based task hierarchies and
the hierarchical MCTS algorithms into the online planning framework. Specifically, the MAXQ-based
task hierarchies reduce the search space and guide the search process, which significantly reduces the computational complexity. This reduction enables the MCTS to perform a deeper search and find better actions in a limited time. We evaluate the
performance of the HMCTS-OP in the domain of online planning in the asymmetric adversarial
environment. The experiment results show that the HMCTS-OP outperforms other online planning
methods in this domain.
1. Introduction
It is challenging to solve large-scale planning problems. These problems suffer from the “curse of
dimensionality”. Online planning algorithms (e.g., the Monte Carlo Tree Search (MCTS)) overcome
this problem by avoiding calculation of the complete policy for the whole state space. Additionally,
the MCTS has demonstrated excellent performance in many planning problems [1].
However, existing MCTS algorithms, including UCT (Upper Confidence Bounds (UCB) applied to Trees), cannot efficiently cope with long-horizon planning problems that have a huge state space and large branching factors. The performance of the MCTS is determined by the effective search depth [2].
For long-horizon problems with a huge state space and branching factors, the MCTS needs to search
deeply to cover a sufficient number of rewarded states that are far away from the root, which leads to a
large computational overhead and poor performance.
In this work, we focus on the online planning problem in the asymmetric adversarial environment.
This domain has asymmetric military forces between two opposing players. Due to the long planning
horizon, huge state space, and significant disparity among forces of this domain, the planning problem
in this domain is very challenging for the MCTS.
In order to improve the performance of the MCTS, task hierarchies or macro actions are introduced into planning [1] to reduce the computational cost. We define the set of primitive actions as A, the set of hierarchical tasks (subtasks) as Ã, the planning horizon as T, and the upper bound on the length of each subtask as L. Once the primitive actions in the MCTS are replaced by the high-level subtasks, the computational cost is reduced from O(|A|^T) to O(|Ã|^(T/L)) [3]. Therefore, integrating the task hierarchies into the MCTS algorithm makes it more effective for solving large-scale problems with a long planning horizon.
In this work, we use MAXQ [4] to formalize the task hierarchies. MAXQ is a method for decomposing the Markov Decision Process (MDP) value function; it makes the value function representation more compact and makes the subtasks portable.
In this paper, we propose a novel hierarchical MCTS-based online planning method named the
HMCTS-OP. The HMCTS-OP integrates the MAXQ-based task hierarchies and the hierarchical MCTS
algorithms into the online planning framework. Specifically, the MAXQ-based task hierarchies reduce
the search space and guide the search process. Therefore, the computational cost is significantly reduced, which enables the MCTS to perform a deeper search in a limited time frame. As a result, it can provide a more accurate value function estimate for the root node and allow the root node to implement better action selection policies within the limited time frame. Moreover, combined with the advantages
of hierarchical planning and online planning, the HMCTS-OP facilitates the design and utilization of
autonomous agents in the asymmetric adversarial environment.
The main contributions of this paper are as follows:
1. We model the online planning problem in the asymmetric adversarial environment as an MDP and
extend the MDP to the semi-Markov decision process (SMDP) by introducing the task hierarchies.
This provides the theoretical foundation for MAXQ hierarchical decomposition.
2. We derive the MAXQ value hierarchical decomposition for the defined hierarchical tasks.
The MAXQ value hierarchical decomposition provides a scalable way to calculate the rewards of
hierarchical tasks in HMCTS-OP.
3. We use the MAXQ-based task hierarchies to reduce the search space and guide the search process.
Therefore, the computational cost is significantly reduced, which enables the MCTS to search
deeper to find better actions within a limited time frame. As a result, the HMCTS-OP can perform
better in online planning in the asymmetric adversarial environment.
The rest of this paper is structured as follows. In Section 2, we start by introducing the necessary
background. Section 3 summarizes the related work. Section 4 discusses, in detail, how to model the
problem, extend the MDP to SMDP, and integrate the MAXQ-based task hierarchies in the MCTS for
online planning in the asymmetric adversarial environment. In Section 5, we evaluate the performance
of HMCTS-OP. In Section 6, we draw conclusions and discuss several future research areas.
2. Background
The framework of the SMDP is the foundation of hierarchical planning and learning.
Figure 1. A graphic example of MAXQ-based task hierarchies.
Mi is defined as a three-tuple ⟨Ti, Ai, R̃i⟩:
• Ti represents the termination condition of subtask Mi, which is used to judge whether Mi is terminated. Specifically, Si and Gi are the active states and termination states of Mi, respectively. If the current state s ∈ Gi or the predefined maximum calculation time or number of iterations is reached, Ti is set to 1, indicating that Mi is terminated.
• Ai is a set of actions; it contains both primitive actions and high-level subtasks.
• R̃i is the optional pseudo-reward function of Mi.
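For concreteness, this three-tuple can be written down directly as a data structure. The following Python sketch is illustrative only; the state type and the callable signatures are assumptions, not part of the MAXQ definition in [4].

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A minimal sketch of the three-tuple <Ti, Ai, R~i> that defines a subtask Mi.
@dataclass
class Subtask:
    name: str
    termination: Callable[[object], bool]                      # Ti: True when Mi terminates in a state
    actions: List["Subtask"] = field(default_factory=list)     # Ai: primitive actions and child subtasks
    pseudo_reward: Optional[Callable[[object], float]] = None  # R~i: optional pseudo-reward

    @property
    def is_primitive(self) -> bool:
        # A primitive action has no child actions of its own.
        return len(self.actions) == 0
```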
The core component of MAXQ is the hierarchical decomposition of the value function. The value function for the MAXQ hierarchical subtask t is defined as

Q^π(i, s, t) = V^π(t, s) + C^π(i, s, t).   (1)
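To make the recursive structure of this decomposition concrete, the sketch below evaluates Equation (1) for a fixed hierarchical policy, following the standard MAXQ definitions of V and C [4]. The `primitive_reward`, `completion`, and `policy` callables are assumed inputs (e.g., tables estimated elsewhere); this is an illustration, not the planning algorithm itself.

```python
# Illustrative sketch of Equation (1): Q^pi(i, s, t) = V^pi(t, s) + C^pi(i, s, t).

def v_value(task, s, primitive_reward, completion, policy):
    """V^pi(task, s): expected one-step reward for a primitive task,
    otherwise the Q-value of the child chosen by the hierarchical policy."""
    if task.is_primitive:
        return primitive_reward(task, s)
    return q_value(task, s, policy(task, s), primitive_reward, completion, policy)

def q_value(parent, s, child, primitive_reward, completion, policy):
    """Equation (1): value of executing `child` in s and then completing `parent`."""
    return (v_value(child, s, primitive_reward, completion, policy)
            + completion(parent, s, child))
```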
2.3. MCTS
The Monte Carlo Tree Search (MCTS) is a best-first search. It combines the accuracy of tree search with the universality of stochastic sampling. The MCTS finds the optimal action in the current state by constructing an asymmetric optimal prior search tree based on a Monte Carlo simulation [6]. Specifically, the MCTS first selects an action according to a tree policy. Next, it executes the action through a Monte Carlo simulation and observes the new state. If the state is already on the tree, it evaluates the state recursively; otherwise, it adds a new node to the tree. Then, it runs the default rollout policy from this node until termination. Finally, it updates the information of each node on the path via backpropagation. Figure 2 shows the basic process of the MCTS approach.
Figure 2. The basic process of the Monte Carlo Tree Search (MCTS) approach.
UCT (UCB applied to Trees) is one of the most famous MCTS implementations [7]. It is a standard MCTS algorithm for planning in MDPs which uses UCB1 [8] as the tree policy. The tree policy attempts to balance exploration and exploitation [6,9,10]. If all actions of a node have been selected, then the subsequent actions of this node are selected according to the UCB1 policy. Specifically,

UCB(s, a) = Q(s, a) + c √( log N(s) / N(s, a) )   (4)

where Q(s, a) is the average action value of all simulations in the context of (s, a), N(s, a) is the cumulative number of times that action a has been selected at state s, N(s) = Σ_{a∈A} N(s, a) is the total number of visits to state s, and c is a factor that can be used to balance exploration and exploitation.
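As a concrete illustration of Equation (4), the snippet below selects an action with the UCB1 rule. The per-state-action statistics dictionary is an assumed bookkeeping structure, and the rule is applied only once every action of the node has been tried at least once, so N(s, a) ≥ 1.

```python
import math

def ucb1_select(stats, s, actions, c=1.0):
    """Pick the action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    stats[(s, a)] is assumed to hold {"n": visit count, "q": mean return};
    every action must have been visited at least once before this is called.
    """
    n_s = sum(stats[(s, a)]["n"] for a in actions)   # N(s) = sum over a of N(s, a)

    def ucb(a):
        n_sa = stats[(s, a)]["n"]                    # N(s, a)
        return stats[(s, a)]["q"] + c * math.sqrt(math.log(n_s) / n_sa)

    return max(actions, key=ucb)
```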
3. Related Work
In this work, we integrate the underlying MAXQ-based task hierarchies of the problem with the MCTS to tackle online planning in an asymmetric adversarial environment. Substantial work has been done on MAXQ. Li et al. [11] improved the MAXQ method and proposed the context-sensitive reinforcement learning (CSRL) framework, but the focus of their work was to use common knowledge in the subtasks to learn transition dynamics effectively. Ponce et al. [12] proposed a MAXQ-based semi-automatic control structure for robots to explore unknown and messy urban search and rescue (USAR) scenarios. Devin et al. [13] introduced the MAXQ-based task hierarchies to offline learning. Hoang et al. [14] used hierarchical decomposition to guide high-level imitation learning. Bai et al. [15] extended the MAXQ method to online hierarchical planning. However, the transition model for each subtask must be estimated to calculate the completion function.
In the domain of online planning, the MCTS combines the best-first search and Monte Carlo
simulation to find the approximate optimal policies. However, in long-horizon problems with a large
state space and branching factors, MCTS algorithms must search deeply to cover a sufficient number of
rewarded states that are far away from the root, which leads to a large computational overhead and poor
performance. Many related works have been done to tackle this problem. Hostetler et al. [2] introduced
state abstraction into the MCTS to reduce random branching factors and improve the performance
of UCT algorithms with limited samples. Bai et al. [16] converted the state-abstracted model into a
POMDP (Partially Observable MDP) problem to solve the non-Markovian problem caused by state
abstraction and proposed a hierarchical MCTS to deal with state and action abstraction in the POMDP
setting. Vien et al. [1] integrated a task hierarchy into the UCT and POMCP (Partially Observable
Monte Carlo Planning) framework and evaluated the methods in various settings. Menashe et al. [17]
proposed the transition-based upper confidence bounds for trees (T-UCT) approach to learn and exploit
the dynamics of structured hierarchical environments. Although these approaches work well in some domains, they do not address online planning in the adversarial environment. Online planning in asymmetric adversarial environments with a long planning horizon, a huge state space, and large branching factors still poses a significant challenge.
There are many other related works. Sironi et al. [18] tuned different MCTS parameters to improve
an agent's performance in general video game playing. Neufeld et al. [19] proposed a hybrid planning approach that
combines the Hierarchical Task Network (HTN) and the MCTS. Santiago et al. [20] proposed the
informed MCTS, which exploits prior information to improve the performance of MCTS in large games.
However, the MCTS only works at the level of primitive actions in these works.
In this work, we propose HMCTS-OP, a hierarchical MCTS-based online planning algorithm.
The HMCTS-OP integrates the MAXQ-based task hierarchies and hierarchical MCTS algorithms into
the online planning framework. The evaluation shows that the HMCTS-OP is capable of excellent
performance in online planning in the asymmetric adversarial environment.
4. Method
1. Four navigation actions: NavUp (move upward), NavDown (move downward), NavLeft (move leftward), and NavRight (move rightward).
2. Four attack actions: FirUp (fire upward), FirDown (fire downward), FirLeft (fire leftward), and FirRight (fire rightward).
3. Wait.
Transition function. Each move action has a 0.9 probability of moving to the target position successfully and a 0.1 probability of staying in the current position. Each fire action has a 0.9 probability of damaging the opponent successfully and a 0.1 probability of failing.
Reward function. Each primitive action has a reward of −1. If the agent attacks the opponent successfully and the opponent loses 1 health point, the agent will get a reward of 20. Conversely, if the agent is attacked by the opponent and loses 1 health point, it will get a reward of −20. Moreover, the reward is 100 for winning the game and −100 for losing the game.
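These transition and reward definitions can be wired into a small generative model for simulation. The sketch below is only an illustration under assumptions: the grid-clamped position update, the state encoding, and the function names are ours, while the 0.9/0.1 probabilities and the reward values come from the definitions above (the −20 penalty and the ±100 terminal rewards would be applied by the surrounding game loop).

```python
import random

MOVE_SUCCESS = 0.9     # probability that a move action reaches the target cell
FIRE_SUCCESS = 0.9     # probability that a fire action damages the opponent
STEP_REWARD = -1       # reward of every primitive action
HIT_REWARD = 20        # bonus when the opponent loses 1 health point

def step_move(pos, delta, grid_size):
    """Apply one navigation action; with probability 0.1 the agent stays put."""
    if random.random() < MOVE_SUCCESS:
        x = min(max(pos[0] + delta[0], 0), grid_size - 1)
        y = min(max(pos[1] + delta[1], 0), grid_size - 1)
        return (x, y), STEP_REWARD
    return pos, STEP_REWARD

def step_fire(opponent_health):
    """Apply one attack action; with probability 0.1 the attack fails."""
    if random.random() < FIRE_SUCCESS:
        return opponent_health - 1, STEP_REWARD + HIT_REWARD
    return opponent_health, STEP_REWARD
```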
Figure 3. One of the scenarios in the asymmetric adversarial environment.

4.2. MAXQ Hierarchical Decomposition

In this section, we will describe the MAXQ hierarchical decomposition method over the overall task hierarchies in detail. First, we define a set of subtasks at different levels according to the related domain knowledge, which are used as the building blocks to construct the task hierarchies. Then, we extend the MDP to the SMDP by introducing hierarchical tasks and derive the MAXQ hierarchical decomposition over the defined hierarchical tasks.

The hierarchical subtasks are defined as follows:

- NavUp, NavDown, NavLeft, NavRight, FirUp, FirDown, FirLeft, FirRight, and Wait: These actions are defined by the RTS; they are primitive actions. When a primitive action is performed, a local reward of −1 is assigned to it. This ensures that the online policy of the high-level subtask can reach the corresponding goal state as soon as possible.
- NavToNeaOpp, NavToCloBaseOpp, and FireTo: The NavToNeaOpp subtask will move the light military unit to the closest enemy unit as soon as possible by performing NavUp, NavDown, NavLeft, and NavRight actions and taking into account the action uncertainties. Similarly, the NavToCloBaseOpp subtask will move the light military unit to the enemy unit closest to the base as fast as possible. The goal of the FireTo subtask is to attack enemy units within a range.
- Attack and Defense: The purpose of Attack is to destroy the enemy's units to win by planning the attacking behaviors, and the purpose of Defense is to defend against the enemy's units to protect bases by carrying out defensive behaviors.
- Root: This is a root task. The goal of Root is to destroy the enemy's units and protect bases. In the Root task, the Attack subtask and the Defense subtask are evaluated by the hierarchical UCB1 policy according to the HMCTS-OP, which is described in the next section in detail.

The overall task hierarchies of the online planning problem in the asymmetric adversarial environment are shown in Figure 4. The parentheses after the name of a subtask indicate that the subtask has parameters.
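One plausible way to encode these task hierarchies in code is as a mapping from each composite task to its child subtasks. The structure below is inferred from the subtask definitions above; the exact assignment of the navigation and fire subtasks to Attack and Defense is our assumption for illustration, not a verbatim transcription of Figure 4.

```python
NAV_PRIMITIVES = ["NavUp", "NavDown", "NavLeft", "NavRight"]
FIR_PRIMITIVES = ["FirUp", "FirDown", "FirLeft", "FirRight"]

# Children of each composite task; primitive actions have no entry of their own.
TASK_HIERARCHY = {
    "Root":            ["Attack", "Defense", "Wait"],
    "Attack":          ["NavToNeaOpp", "FireTo(p)"],
    "Defense":         ["NavToCloBaseOpp", "FireTo(p)"],
    "NavToNeaOpp":     NAV_PRIMITIVES,
    "NavToCloBaseOpp": NAV_PRIMITIVES,
    "FireTo(p)":       FIR_PRIMITIVES,
}
```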
According to the theory proposed by He et al. [3], the computational cost of action branching shrinks from O(|A|^T) to O(|Ã|^(T/L)) by introducing high-level subtasks (macro actions), where A is a set of primitive actions, Ã is a set of high-level subtasks, T is the planning horizon, and L ≥ 1 is an upper bound on the length of each high-level subtask. In this work, the stochastic branching factor of the root node is reduced from 9 to 3 by introducing the MAXQ-based hierarchical subtasks. Therefore, |Ã| = |A|/3 and T/L ≤ T when L ≥ 1. As a result, the computational complexity is significantly reduced in the planning problem with a long planning horizon, which enables the algorithm to search deeper to find a better plan.
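A back-of-the-envelope calculation illustrates the scale of this reduction for the branching factors used here (9 primitive actions versus 3 high-level subtasks at the root); the concrete values of T and L below are assumptions chosen only for illustration.

```python
A, A_TILDE = 9, 3        # primitive actions vs. high-level subtasks at the root
T, L = 20, 5             # assumed planning horizon and subtask length bound

flat_cost = A ** T                        # O(|A|^T): 9**20 is about 1.2e19
hierarchical_cost = A_TILDE ** (T // L)   # O(|A~|^(T/L)): 3**4 = 81
print(flat_cost, hierarchical_cost)
```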
Figure 4. The overall task hierarchies of the online planning problem in the asymmetric adversarial environment.
Based on the task hierarchies defined above, we extend the MDP to the SMDP by including an explicit reference to the subtasks. We introduce the variable k to represent the execution time steps of the subtask.

The transition function P : S × A × S × k → [0, 1] represents the probability of performing subtask a in state s and transitioning to state s′ after k time steps:

P(s′, k | s, a) = Pr{s_{t+k} = s′ | s_t = s, a_t = a}.   (5)

The reward function of the hierarchical task accumulates the single-step rewards when it is executed. The accumulation manner is related to the performance measure of HMCTS. In this work, we use the discount factor γ to maximize the accumulated discounted rewards.

We derive the MAXQ hierarchical decomposition as follows, according to Equations (1)–(3):

Q(Root, s, Attack) = V(Attack, s) + Σ_{s′,k} P(s′, k | s, Attack) γ^k V(Root, s′)   (6)

Q(Root, s, Defense) = V(Defense, s) + Σ_{s′,k} P(s′, k | s, Defense) γ^k V(Root, s′)   (7)

Q(Root, s, Wait) = R(Wait, s) + Σ_{s′,k} P(s′, k | s, Wait) γ^k V(Root, s′)   (8)

Q(Attack, s, FireTo(p)) = V(FireTo(p), s) + Σ_{s′,k} P(s′, k | s, FireTo(p)) γ^k V(Attack, s′)   (9)

Q(Attack, s, NavToCloOpp) = V(NavToCloOpp, s) + Σ_{s′,k} P(s′, k | s, NavToCloOpp) γ^k V(Attack, s′)   (10)
Algorithm 1 PLAY.
1: function PLAY (task t, state s, MAXQ hierarchy M, rollout policy πrollout )
2: repeat
3: a ← HMCTS-OP(t, s, M, πrollout )
4: s0 ← Execute(s, a)
5: s ← s0
6: until termination conditions
7: end function
Algorithm 2 HMCTS-OP.
1: function HMCTS-OP (task t, state s, MAXQ hierarchy M, rollout policy πrollout )
2: if t is primitive then Return ⟨R(s, t), t⟩
3: end if
4: if s ∉ St and s ∉ Gt then Return ⟨−∞, nil⟩
5: end if
6: if s ∈ Gt then Return ⟨0, nil⟩
7: end if
8: Initialize search tree T for task t
9: while within computational budget do
10: HMCTSSimulate (t, s, M, 0, πrollout )
11: end while
12: Return GetGreedyPrimitive(t, s)
13: end function
Algorithm 3 HMCTSSimulate.
1: function HMCTSSimulate (task t, state s, MAXQ hierarchy M, depth d, rollout policy πrollout )
2: steps = 0 // the number of steps executed by task t
3: if t is primitive then
4: (s0 , r, 1) ← Generative-Model(s, t) // domain generative model-simulator
5: steps = 1
6: Return (s0 , r, steps)
7: end if
8: if t terminates or d > dmax then
9: Return (s, 0, 0)
10: end if
11: if node (t, s) is not in tree T then
12: Insert node (t, s) to T
13: Return Rollout(t, s, d, πrollout )
14: end if
15: if node (t, s) is not fully expanded then
16: choose a ∈ untried subtask from subtasks(t)
17: else
18: a = arg max_{a′ ∈ subtasks(t)} [ Q(t, s, a′) + c √( log N(t, s) / N(t, s, a′) ) ]
19: end if
20: (s0 , r, k) = HMCTSSimulate(a, s, M, d, πrollout )
21: (s00 , r0 , k0 ) = Rollout (t, s0 , d + k, πrollout )
22: r = r + γk r0
23: steps = steps + k + k0
24: N (t, s) = N (t, s) + 1
25: N (t, s, a) = N (t, s, a) + 1
26: Q(t, s, a) = Q(t, s, a) + ( r − Q(t, s, a) ) / N(t, s, a)
27: Return (s00 , r, steps)
28: end function
Algorithm 4 Rollout.
1: function Rollout (task t, state s, depth d, rollout policy πrollout )
2: steps = 0
3: if t is primitive then
4: (s0 , r, 1) ← Generative-Model(s, t) // domain generative model-simulator
5: Return (s0 , r, 1)
6: end if
7: if t terminates or d > dmax then
8: Return (s, 0, 0)
9: end if
10: a ∼ πrollout (s, t)
11: (s0 , r, k) = Rollout(a, s, d, πrollout )
12: (s00 , r0 , k0 ) = Rollout (t, s0 , d + k, πrollout )
13: r = r + γk r0
14: steps = steps + k + k0
15: Return (s00 , r, steps)
16: end function
Algorithm 5 GetGreedyPrimitive.
1: function GetGreedyPrimitive (task t, state s)
2: if t is primitive then
3: Return t
4: else
5: a = arg max_{a′} Q(t, s, a′)
6: Return GetGreedyPrimitive (a, s)
7: end if
8: end function
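For readers who prefer executable code, the sketch below condenses Algorithms 3 and 4 into Python. It is an illustrative transcription, not the authors' implementation: the discount factor, the maximum depth, the exploration constant, and the task interface (is_primitive, terminates, subtasks, generative_model) are all assumptions.

```python
import math
import random

GAMMA = 0.95   # assumed discount factor
D_MAX = 100    # assumed maximum search depth (dmax)
C_UCB = 1.0    # assumed exploration constant c

def rollout(task, s, d, rollout_policy):
    # Algorithm 4: recursively execute the rollout policy until termination.
    if task.is_primitive:
        s_next, r = task.generative_model(s)
        return s_next, r, 1
    if task.terminates(s) or d > D_MAX:
        return s, 0.0, 0
    a = rollout_policy(s, task)                              # sample a child subtask
    s1, r, k = rollout(a, s, d, rollout_policy)              # execute the child
    s2, r2, k2 = rollout(task, s1, d + k, rollout_policy)    # continue the parent task
    return s2, r + (GAMMA ** k) * r2, k + k2

def hmcts_simulate(task, s, tree, d, rollout_policy):
    # Algorithm 3: one simulation of `task` from state `s`.
    if task.is_primitive:
        s_next, r = task.generative_model(s)
        return s_next, r, 1
    if task.terminates(s) or d > D_MAX:
        return s, 0.0, 0
    node = tree.get((task, s))
    if node is None:                                         # new node: insert, then roll out
        tree[(task, s)] = {"N": 0, "Na": {}, "Q": {}}
        return rollout(task, s, d, rollout_policy)
    untried = [a for a in task.subtasks(s) if a not in node["Na"]]
    if untried:                                              # not fully expanded
        a = random.choice(untried)
        node["Na"][a], node["Q"][a] = 0, 0.0
    else:                                                    # hierarchical UCB1 selection
        a = max(node["Q"], key=lambda b: node["Q"][b]
                + C_UCB * math.sqrt(math.log(node["N"]) / node["Na"][b]))
    s1, r, k = hmcts_simulate(a, s, tree, d, rollout_policy)
    s2, r2, k2 = rollout(task, s1, d + k, rollout_policy)
    r = r + (GAMMA ** k) * r2
    node["N"] += 1
    node["Na"][a] += 1
    node["Q"][a] += (r - node["Q"][a]) / node["Na"][a]       # incremental mean update
    return s2, r, k + k2
```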
We tested the performance of all methods against two fixed opponents (UCT with random rollout policy and Random). The accumulated match results for each algorithm were counted for each scenario (win/tie/loss), and the score was calculated as the number of wins plus 0.5 times the number of ties (wins + 0.5 ties). The performance was evaluated using the average game time, the score, and the average cumulative reward. All of the results were obtained over 50 games.

Table 1. Details of the three scenarios used in our experiment.

                 Scenario 1   Scenario 2   Scenario 3
Size             8 × 8        10 × 10      12 × 12
Units            23           23           23
Maximal states   10^36        10^41        10^44
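The scoring rule above (wins plus half the ties, accumulated over the 50 games of a scenario) amounts to the following small helper; the list-of-strings encoding of match outcomes is an assumption for illustration.

```python
def score(outcomes):
    """outcomes: one of 'win', 'tie', 'loss' per game; score = wins + 0.5 * ties."""
    return outcomes.count("win") + 0.5 * outcomes.count("tie")

print(score(["win", "tie", "loss", "win"]))   # 2.5
```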
Figure 6. The average game time and scores of 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT): (a,b) the results in scenario 1; (c,d) the results in scenario 2; (e,f) the results in scenario 3.

Furthermore, Table 2 shows the average cumulative reward over 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT). As shown in Table 2, the HMCTS-OP and Smart-HMCTS-OP outperformed the other algorithms. Next, we tested whether there is a significant performance difference between our algorithms and the other three algorithms from a statistical perspective.
Table 2. The average cumulative reward over 50 games for each scenario and each algorithm against two fixed opponents (Random and UCT).

Opponent   Scenario   UCT       NaiveMCTS   Informed NaiveMCTS   HMCTS-OP   Smart-HMCTS-OP
Random     8 × 8      −38.94    −57.7       −153.56              109.94     276.34
Random     10 × 10    105.02    47.98       30.12                242.24     312.66
Random     12 × 12    −6.64     55.72       69.88                247.52     261.38
UCT        8 × 8      −929.98   −987.68     −995.06              −354.64    −288.38
UCT        10 × 10    −694.42   −610.48     −840.8               −236.34    −95.5
UCT        12 × 12    −653.12   −660.16     −659.54              −173.58    −89.34

We recorded all of the cumulative rewards of 50 games obtained from the evaluation of each algorithm in each scenario. These rewards were divided into 30 groups, as shown in Table 3. We performed statistical significance tests on these data. In this work, we used the Kruskal–Wallis one-way analysis of variance [22] to perform six tests among multiple groups of data (GroupR11~GroupR15, GroupU11~GroupU15, GroupR21~GroupR25, GroupU21~GroupU25, GroupR31~GroupR35, GroupU31~GroupU35) for each scenario and each opponent. Figure 7 and Table 4 show that there were significant differences in cumulative rewards between different algorithms for all scenarios and opponents.
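These tests can be reproduced with standard statistical tooling. The sketch below applies the Kruskal–Wallis one-way analysis of variance to the five groups of one scenario/opponent pair using SciPy; the reward arrays are arbitrary placeholders, not the paper's data, and the post hoc Tukey–Kramer adjustment of the pairwise p-values is omitted.

```python
from scipy import stats

# Placeholder cumulative-reward samples for the five algorithms in one
# scenario/opponent pair (e.g., GroupU11..GroupU15); not the paper's data.
group_uct      = [-900.0, -950.0, -920.0, -880.0, -940.0]
group_naive    = [-970.0, -990.0, -1000.0, -960.0, -985.0]
group_informed = [-995.0, -1005.0, -990.0, -1000.0, -992.0]
group_hmcts    = [-350.0, -360.0, -340.0, -355.0, -348.0]
group_smart    = [-290.0, -280.0, -300.0, -285.0, -295.0]

h_stat, p_value = stats.kruskal(group_uct, group_naive, group_informed,
                                group_hmcts, group_smart)
print(h_stat, p_value)   # reject H0 at alpha0 = 0.05 when p_value <= 0.05
```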
Opponent   Scenario   UCT        NaiveMCTS   Informed NaiveMCTS   HMCTS-OP   Smart-HMCTS-OP
Random     8 × 8      GroupR11   GroupR12    GroupR13             GroupR14   GroupR15
Random     10 × 10    GroupR21   GroupR22    GroupR23             GroupR24   GroupR25
Random     12 × 12    GroupR31   GroupR32    GroupR33             GroupR34   GroupR35
UCT        8 × 8      GroupU11   GroupU12    GroupU13             GroupU14   GroupU15
UCT        10 × 10    GroupU21   GroupU22    GroupU23             GroupU24   GroupU25
UCT        12 × 12    GroupU31   GroupU32    GroupU33             GroupU34   GroupU35
Moreover, we performed multiple tests for each pair to test the following hypotheses at level
α0 = 0.05:
H0 : There were no significant differences between the HMCTS-OP/Smart-HMCTS-OP and the
UCT/NaiveMCTS/Informed NaiveMCTS in scenarios 1, 2, and 3 against the Random/UCT.
Our test rejected a null hypothesis if and only if the p-value was, at most, α0 [23].
Table 5 shows the p-values of multiple tests for each pair. Moreover, we used the Tukey–Kramer test
to adjust the p-values to correct for the occurrence of false positives. As shown in Table 5, the p-values
were smaller than α0 for all tests when the opponent was UCT, and the differences were significant
between the Smart-HMCTS-OP and the other three algorithms for all scenarios and opponents.
Opponent   Scenario   Group            UCT            NaiveMCTS      Informed NaiveMCTS
Random     8 × 8      HMCTS-OP         0.4145         0.2623         0.0115
                      Smart-HMCTS-OP   1.7033 × 10−7  4.0795 × 10−8  9.9303 × 10−9
           10 × 10    HMCTS-OP         0.1634         0.0331         0.0010
                      Smart-HMCTS-OP   6.7051 × 10−7  2.6123 × 10−8  9.9410 × 10−9
           12 × 12    HMCTS-OP         6.2448 × 10−6  4.2737 × 10−4  6.7339 × 10−4
                      Smart-HMCTS-OP   1.1508 × 10−7  1.4085 × 10−5  2.4075 × 10−5
UCT        8 × 8      HMCTS-OP         9.9611 × 10−9  9.9217 × 10−9  9.9217 × 10−9
                      Smart-HMCTS-OP   9.9230 × 10−9  9.9217 × 10−9  9.9217 × 10−9
           10 × 10    HMCTS-OP         1.1821 × 10−6  1.9423 × 10−4  9.9353 × 10−9
                      Smart-HMCTS-OP   9.9718 × 10−9  5.6171 × 10−8  9.9217 × 10−9
           12 × 12    HMCTS-OP         7.1822 × 10−8  3.0630 × 10−8  1.5334 × 10−8
                      Smart-HMCTS-OP   9.9384 × 10−9  9.9260 × 10−9  9.9225 × 10−9
Based on the experimental results, we can see that HMCTS-OP and Smart-HMCTS-OP perform
significantly better than other methods. Moreover, the Smart-HMCTS-OP improves the HMCTS-OP
significantly with the help of the smart rollout policy. The introduction of attack-biased rollout policies
can be seen as an advantage of HMCTS-OP.
However, the drawback of HMCTS-OP is that we need to use a significant amount of domain
knowledge to construct the task hierarchies, which is challenging for very complex problems.
On the other hand, due to the introduction of domain knowledge, compared with other methods, the performance of the HMCTS-OP and the Smart-HMCTS-OP is greatly improved by using MAXQ-based task hierarchies to reduce the search space and guide the search process. Therefore, one of our areas of future work is to discover the underlying MAXQ-based task hierarchies automatically.
Figure 7. Cumulative rewards (y-axis: Cumulative Reward) of UCT, NaiveMCTS, Informed NaiveMCTS, HMCTS-OP, and Smart-HMCTS-OP for each scenario against each opponent.
6. Conclusions

In this work, we proposed the HMCTS-OP, a novel and scalable hierarchical MCTS-based online planning method. The HMCTS-OP integrates the MAXQ-based task hierarchies and hierarchical MCTS algorithms into the online planning framework. The aim of the method is to tackle online planning with a long horizon and a huge state space in the asymmetric adversarial environment. In the experiment, we empirically showed that the HMCTS-OP outperforms other methods in terms of the average cumulative reward, average game time, and accumulated score in three scenarios against Random and UCT. In the future, we plan to explore the performance of the HMCTS-OP in more complex problems. Moreover, we would like to solve the problem of automatically discovering underlying MAXQ-based task hierarchies. This is especially important for improving the performance of hierarchical learning and planning in larger, complex environments.
Author Contributions: Conceptualization, X.G.; Data curation, X.J.; Formal analysis, L.L.; Funding acquisition, X.G.; Supervision, J.C.; Writing—original draft, L.L.; Writing—review & editing, W.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research is supported by the National Natural Science Foundation of China (61603406, 61806212, 61603403 and 61502516).

Conflicts of Interest: The authors declare no conflict of interest.
References
1. Vien, N.A.; Toussaint, M. Hierarchical Monte-Carlo Planning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 3613–3619.
2. Hostetler, J.; Fern, A.; Dietterich, T. State Aggregation in Monte Carlo Tree Search. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, QC, Canada, 27–31 July 2014.
3. He, R.; Brunskill, E.; Roy, N. PUMA: Planning under Uncertainty with Macro-Actions. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 40, pp. 523–570.
4. Dietterich, T.G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. J. Artif. Intell. Res. 2000, 13, 227–303. [CrossRef]
5. Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: New York, NY, USA, 1994; ISBN 0471619779.
6. Browne, C.; Powley, E. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 2012, 4, 1–49. [CrossRef]
7. Kocsis, L.; Szepesvári, C. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning;
Springer: Berlin/Heidelberg, Germany, 2006; pp. 282–293. ISBN 978-3-540-45375-8.
8. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn.
2002, 47, 235–256. [CrossRef]
9. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: London, UK, 2018.
10. Črepinšek, M.; Liu, S.H.; Mernik, M. Exploration and exploitation in evolutionary algorithms: A survey.
ACM Comput. Surv. 2013, 45, 1–33. [CrossRef]
11. Li, Z.; Narayan, A.; Leong, T. An Efficient Approach to Model-Based Hierarchical Reinforcement Learning.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA,
4–10 February 2017; pp. 3583–3589.
12. Doroodgar, B.; Liu, Y.; Nejat, G. A learning-based semi-autonomous controller for robotic exploration of
unknown disaster scenes while searching for victims. IEEE Trans. Cybern. 2014, 44, 2719–2732. [CrossRef]
[PubMed]
13. Schwab, D.; Ray, S. Offline reinforcement learning with task hierarchies. Mach. Learn. 2017, 106, 1569–1598.
[CrossRef]
14. Le, H.M.; Jiang, N.; Agarwal, A.; Dudík, M.; Yue, Y.; Daumé, H. Hierarchical imitation and reinforcement
learning. In Proceedings of the ICML 2018: The 35th International Conference on Machine Learning,
Stockholm, Sweden, 10–15 July 2018; Volume 7, pp. 4560–4573.
15. Bai, A.; Wu, F.; Chen, X. Online Planning for Large Markov Decision Processes with Hierarchical
Decomposition. ACM Trans. Intell. Syst. Technol. 2015, 6, 1–28. [CrossRef]
16. Bai, A.; Srivastava, S.; Russell, S. Markovian state and action abstractions for MDPs via hierarchical MCTS.
In Proceedings of the IJCAI: International Joint Conference on Artificial Intelligence, New York, NY, USA,
9–15 July 2016; pp. 3029–3037.
17. Menashe, J.; Stone, P. Monte Carlo hierarchical model learning. In Proceedings of the fourteenth International
Conference on Autonomous Agents and Multiagent Systems, Istanbul, Turkey, 4–8 May 2015; Volume 2,
pp. 771–779.
18. Sironi, C.F.; Liu, J.; Perez-Liebana, D.; Gaina, R.D. Self-Adaptive MCTS for General Video Game Playing. In Proceedings of the International Conference on the Applications of Evolutionary Computation; Springer: Cham, Switzerland, 2018.
19. Neufeld, X.; Mostaghim, S.; Perez-Liebana, D. A Hybrid Planning and Execution Approach Through HTN
and MCTS. In Proceedings of the IntEx Workshop at ICAPS-2019, London, UK, 23–24 October 2019; Volume 1,
pp. 37–49.
20. Ontañón, S. Informed Monte Carlo Tree Search for Real-Time Strategy games. In Proceedings of the IEEE
Conference on Computational Intelligence in Games CIG, New York, NY, USA, 22–25 August 2017.
21. Ontañón, S. The combinatorial Multi-armed Bandit problem and its application to real-time strategy games.
In Proceedings of the Ninth Artificial Intelligence and Interactive Digital Entertainment Conference AIIDE,
Boston, MA, USA, 14–15 October 2013; pp. 58–64.
22. Theodorsson-Norheim, E. Kruskal-Wallis test: BASIC computer program to perform nonparametric one-way
analysis of variance and multiple comparisons on ranks of several independent samples. Comput. Methods
Programs Biomed. 1986, 23, 57–62. [CrossRef]
23. DeGroot, M.H.; Schervish, M.J. Probability and Statistics, 4th ed.; Addison-Wesley: Boston, MA, USA, 2011; ISBN 0201524880.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).