Intelligent Trainer for Model-Based Deep


Reinforcement Learning
Yuanlong Li, Member, IEEE, Linsen Dong, Student Member, IEEE, Xin Zhou, Member, IEEE,
Yonggang Wen, Senior Member, IEEE, and Kyle Guan, Member, IEEE

Abstract—Model-based reinforcement learning (MBRL) has been proposed as a promising alternative to tackle the high sampling cost challenge in canonical reinforcement learning (RL), by leveraging a learned model to generate synthesized data for policy training. The MBRL framework, nevertheless, is inherently limited by the convoluted process of jointly learning the control policy and configuring the hyper-parameters (e.g., global/local models, real and synthesized data, etc.). The training process can be tedious and prohibitively costly. In this research, we propose a "reinforcement on reinforcement" (RoR) architecture to decompose the convoluted tasks into two layers of reinforcement learning. The inner layer is the canonical model-based RL training process, encapsulated as a training process environment (TPE), which learns the control policy for the underlying system and exposes interfaces to access states, actions and rewards. The outer layer is an RL agent, called the AI trainer, which learns an optimal hyper-parameter configuration for the inner TPE. This decomposition provides a desirable flexibility to implement different trainer designs, which we call "train the trainer". In our research, we propose and optimize two alternative trainer designs: 1) a uni-head trainer and 2) a multi-head trainer. Our proposed RoR framework is evaluated on five tasks in the OpenAI gym (i.e., Pendulum, Mountain Car, Reacher, Half Cheetah and Swimmer). Compared to three baseline algorithms, our proposed Train-the-Trainer algorithm shows competitive auto-tuning capability, with up to 56% expected sampling cost saving without knowing the best parameter setting in advance. The proposed trainer framework can be easily extended to other cases in which hyper-parameter tuning is costly.

Index Terms—Reinforcement learning, AutoML, Intelligent trainer, Ensemble algorithm.

Manuscript received... This work was supported in part by EIRP02 Grant from Singapore EMA and GDCR01 Grant from Singapore IMDA. Yuanlong Li, Xin Zhou, Yonggang Wen and Linsen Dong are with the School of Computer Science and Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798. Email: {liyuanl, ygwen}@ntu.edu.sg, [email protected]. Kyle Guan is with Bell Labs, Nokia. Email: [email protected].

I. INTRODUCTION

Reinforcement learning (RL) [1], owing to the flexibility derived from its data-driven nature, has recently regained tremendous momentum in research and industry applications. RL, in comparison to supervised and unsupervised learning, addresses how intelligent agents should take actions in an environment, aiming to maximize a chosen cumulative reward function. For example, an RL agent controlling a robot arm to grab an object will observe the current state of the arm, issue an action to the arm and, after the action has been taken, collect the reward, which signifies whether the object has been grabbed, together with new state information to train its policy. This interaction between the agent and the environment allows for progressively learning the best control policy in complex systems. Previously, RL has been adopted to solve problems like robot arm control, maze solving and game playing, reducing the human intervention in system modeling. Recently, RL in combination with emerging deep learning techniques (so-called deep reinforcement learning, DRL) [2] has become a popular choice for controlling large, complex systems. This trend started with the huge success of AlphaGo. At the same time, researchers have also made breakthroughs in continuous-domain control via DRL algorithms, for example, Deep Deterministic Policy Gradients (DDPG) [3] and Trust Region Policy Optimization (TRPO) [4]. As a result, the transformative nature of RL and DRL has been driving their regained popularity in both academia and industry.

However, many practical industry applications face great challenges in adopting RL solutions, especially when it is costly to acquire data for policy training. On the demand side, the performance of RL algorithms in general hinges upon a huge amount of operational data to train the control policy. On the supply side, acquiring such a huge amount of training data from operational systems might be prohibitively costly, in resource usage and/or time consumption. For instance, in a robotic control problem [5], the DRL agent can learn to score a goal with high probability only after observing about three million samples. As a result, training a real robot to do the task may take millions of seconds, rendering the system unacceptable in most application scenarios.

To tackle this training data challenge, researchers have previously proposed model-based reinforcement learning (MBRL) [6]. In the MBRL framework, the data collected from the real-world system are used to train a system dynamics model, which is in turn used to generate synthesized data for policy training. The generated data, together with the real-world data, are used to train the target controller and search for sensible actions that maximize the accumulative reward. Generally, producing synthesized data in a cyber environment is relatively inexpensive, so MBRL has the advantage of low data sampling cost. This comparative advantage has made MBRL a popular approach in robot arm training [7] and online tree-search-based planning [8]–[10].

In real applications, the adoption of the MBRL framework is limited by the manual configuration of some crucial parameters. As illustrated in Fig. 1, the data acquired from the physical environment are used for two purposes, namely:
Initialization TPE Actions


ti on Real aa00 a0, a1, a2
a2 Ac Environment
ward
Re
te,
Do local or global Sta
Real Sampling in Sampling in Modelling
sampling in State
Environment real/cyber ? -
Real Cyber , Re I ntelligent
Environment Environment ward
State hyparameter a0, a1 Tr ainer
, Re Tar get A c Cyber
ward ti on a1
Controller Environment
A cti Modelling
on Mixing potion of Update Update
n the real/cyber data? Target Cyber Tr aining Process Environment (TPE)
A cti o - hyparameter a2 Controller Model TPE Reward
Tar get
ard Cyber
Controller Rew Environment
te,
Sta
Is no. of
total real data samples
Fig. 2. Illustration of “Reinforcement on Reinforcement” (RoR) framework.
>N
Yes
No The inner box encapsulates a standard MBRL into a training processing
environment (TPE) as the inner layer. In the outer box we introduce an
Stop intelligent trainer as the outer layer, controlling the optimization of the MBRL
training in the TPE.
(a) (b)

Fig. 1. Illustration of MBRL algorithm using the model as a data source: (a)
The data flow of MBRL, where the cyber environment is used to generate we first encapsulate the canonical model-based RL training
synthetic training data for the target controller. (b) Typical training flow of process into a standard RL environment (Training Process
MBRL, in which we indicate the settings that are usually manually set. Environment, TPE in Fig. 2) with exposed state, action and
reward interfaces, as the inner RL layer. In the outer layer, we
• System model generation. The model is trained to mimic introduce an intelligent trainer, as an RL agent, to interact with
the real system in that, given the current state and action the inner layer, and control the sampling and training process
to take, it predicts the next system state. The learned of the target controller in TPE. Such layered architecture
model can be trained/used in a global or local [9] manner. is embedded with an inherent decomposition between the
The global manner means that the model is trained or training process and the controlling of the training process
utilized to generate data samples from the whole state in the inner layer, greatly liberating its applicability in more
space and can favor global exploration. The local manner generalized RL scenarios. In comparison with the existing
is to train or utilize the model to generate data samples approaches that directly modify the training algorithm of the
in certain constrained subspace, and thus can reinforce target controller [10], our design can work with different
local exploitation. In Fig. 1, parameters a0 and a1 control MBRL controllers and with different trainer designs. We call
whether to go global or local in the training and sampling the latter as “train the trainer” design.
procedure of the system model. Our research intends to optimize the “train the trainer”
• Control policy training. The collected data from physical
design for better learning efficiency in the outer layer of
environment are also used in the training of the target the RoR architecture, and validate our design over widely-
controller, together with the cyber data generated from the accepted benchmark cases in the openAI gym. First, we
learned model. In this case, the portion of the cyber data propose two alternative trainer designs:
to use requires proper configuration to achieve the desired • Uni-head trainer. This approach is to implement a single

outcome. As to be shown in the experimental results of trainer, cast into a DQN controller, to learn in an online
this paper, the proper setting can vary from case to case: manner to optimize the sampling and training in the inner
for certain cases using more cyber data can be helpful; layer.
while for other cases it may lead to serious performance • Multi-head trainer. This approach is to implement an

degeneration. In Fig. 1 parameter a2 controls this setting. ensemble trainer, comprising of multiple trainers that
We refer these configuration parameters introduced by the take independent actions in their training processes, and
model as model-related hyper-parameter setting. In previous are ranked across themselves to provide a quantitative
research, these parameters are manually tried in training stage, comparison of their respective actions.
often resulting in additional time and/or resource cost1 . It fol- We implement both trainer designs in Tensorflow for five
lows that an autoML solution for MBRL is highly demanded. benchmark cases (i.e., Pendulum, Mountain Car, Reacher, Half
In this research, we propose an autoML solution for the Cheetah, and Swimmer) and evaluate their performance in
MBRL framework, aiming to tackle the hyper-parameter set- learning the best control policy under external constraints.
ting challenge. Our proposed solution adopts a “reinforcement Our evaluation is compared against three baseline algorithms,
on reinforcement” (RoR) design architecture to learn the opti- including a model-free RL algorithm, a MBRL algorithm with
mal model-related parameters and training/sampling settings randomly hyper-parameter settings and a MBRL algorithm
in an online manner. Specifically, as illustrated in Fig. 2, with fixed hyper-parameter settings. Our numerical investi-
gations show that our proposed framework outperforms the
1 A naive approach to potentially solve this problem is to re-train the
aforementioned baseline algorithms in overall performance
controller with different parameter settings with the collected data samples in
the first trial. Such solution will not incur additional sampling cost. However,
across different test cases supposing the best parameter set-
the ”supervised” learning approach may not work well for the RL case, as the tings are unknown. Specifically, our proposed framework can
training performance of such a policy is largely determined by the data used achieve the following results:
in training. If the data used in training are sampled by an under-performed
policy, they may lack of important samples that can lead to better performance, • For the same learned policy quality, our proposed RoR
making the re-training useless. framework can achieve an expected sampling cost saving
3

• For the same learned policy quality, our proposed RoR framework can achieve an expected sampling cost saving of up to 56% over the average cost of the three baseline algorithms.
• Given the same sampling budget, our proposed RoR framework can achieve a policy quality on par with the best policy available, without the prior requirement of knowing the best parameter setting, across all the benchmark cases.

These evaluations suggest that our proposed RoR framework can be readily applied to emerging industrial applications with cost concerns. For example, in a data center room cooling control application, we can use the trainer framework to properly utilize a computational fluid dynamics (CFD) model together with the real monitoring data to train an air cooling unit controller [11]. At the same time, it can shed new light on model-based RL research by leveraging the RoR framework for autoML empowerment. Specifically, it can serve as a general framework that works with different RL training algorithms and could also be a potential solution for other learning tasks in which online adaptive parameter setting is demanded. We have released the open-source code of our proposed RoR framework at [12], for the research community to further develop new applications and algorithms.

The remainder of this paper is organized as follows. Section II provides a detailed description of the proposed trainer framework, including its key components, the uni-head trainer design, and the ensemble trainer design. Section IV presents the numerical evaluation results of the proposed framework. Section V briefly reviews the related works. Section VI concludes the paper.

II. RoR: REINFORCEMENT ON REINFORCEMENT ARCHITECTURE

The overall architecture of the proposed intelligent trainer framework is shown in Fig. 2. The inner layer, i.e., the Training Process Environment (TPE), is a standard model-based DRL system utilizing the model as a data source to train the target controller. The training data are provided by the physical environment, which represents the real-world system, and the cyber environment, which is an emulator of the physical system. The emulator can be either knowledge-based or learning-based (e.g., a neural network prediction model). The outer layer, i.e., the intelligent trainer, is also an RL agent that controls and optimizes the sampling and training process of the target controller in the real and cyber environments via feedbacks and action outputs. Thus, the proposed framework can be considered a "reinforcement on reinforcement" architecture. Such a modularized design can easily work with different kinds of target controller training algorithms (such as DDPG and TRPO), and the extra layer of intelligent trainer can be any optimizer that can output the control action when given a TPE observation.

In the following we first present the inner layer of the proposed trainer framework, to introduce how we encapsulate the standard training process of MBRL as an RL environment. We design the TPE with two goals. First, it should be formulated as a standard RL environment, such that any agent can interact with it. Second, the TPE shall expose the action interface to control the sampling and training settings of the MBRL, and the reward information which can encourage an agent to minimize the sampling cost in the real environment while training the target controller to a target performance. To achieve these goals, the TPE is designed with two major functions and three RL interfaces, as follows.

A. Basic Functions of TPE

The TPE has two major functions that can be executed to complete the entire training process of a general MBRL:
• Initialization: execute initialization tasks for the MBRL training process. These tasks include initializing the real training environment, the cyber emulator, and the target controller.
• Step(state, action): execute one step of training of the MBRL algorithm. This process includes sampling from the real and cyber environments, training the target controller, and training the dynamic model of the cyber emulator. Note that in each step, we keep the number of real data samples to collect fixed (Kr), while we optimize the amount Kc of cyber data used in the training. We found such a design to be more stable in implementation, as it provides a tractable evaluation of the policy by measuring the reward received from the real environment. With such a setting, the TPE exposes action interfaces to determine how much cyber data to use and how to do the sampling, and reward information to encourage a trainer agent to train the target controller to a better performance.

With the TPE, the MBRL training process can be executed by repeatedly calling the Step function after the Initialization. The detailed training algorithm used to train the target controller is embedded in the Step function.
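To make the encapsulation concrete, the following is a minimal sketch of what such a TPE could look like as a gym-style Python class. The class name, the attribute names, and the real_env / cyber_model / controller interfaces are illustrative assumptions, not the released implementation.

# Minimal sketch of the Training Process Environment (TPE) described above.
# The real_env / cyber_model / controller interfaces are hypothetical.

class TrainingProcessEnvironment:
    """Wraps one MBRL training run as an RL environment (the inner layer)."""

    def __init__(self, real_env, cyber_model, controller, k_r=10):
        self.real_env = real_env        # physical environment (costly to sample)
        self.cyber_model = cyber_model  # learned dynamics model (cheap to sample)
        self.controller = controller    # target controller, e.g. a DDPG/TRPO agent
        self.k_r = k_r                  # fixed number of real samples per TPE step
        self.real_buffer = []           # real data memory
        self.cyber_buffer = []          # cyber data memory

    def initialize(self):
        """Initialization: set up real env, cyber emulator and target controller."""
        self.real_env.reset()
        self.cyber_model.reset()
        self.controller.reset()

    def step(self, action):
        """One MBRL training step, driven by the trainer action (a0, a1, a2)."""
        a0, a1, a2 = action
        # 1) sample a fixed number K_r of real transitions (episode reset guided by a0)
        real_batch = self.controller.sample(self.real_env, self.k_r, reset_quality=a0)
        self.real_buffer.extend(real_batch)
        # 2) sample K_c cyber transitions (episode reset guided by a1; K_c set by a2)
        k_c = int(self.k_r * (1.0 - a2) / max(a2, 1e-6))
        cyber_batch = self.controller.sample(self.cyber_model, k_c, reset_from_real=a1)
        self.cyber_buffer.extend(cyber_batch)
        # 3) train the dynamics model and the target controller (real/cyber mix set by a2)
        self.cyber_model.fit(self.real_buffer)
        self.controller.train(self.real_buffer, self.cyber_buffer, real_ratio=a2)
        # 4) expose a TPE state and reward to the outer trainer
        state = 0.0                      # trivial constant state; richer designs are discussed next
        reward = self._trainer_reward(real_batch)
        return state, reward, False, {}

    def _trainer_reward(self, real_batch):
        # Placeholder; the actual reward designs are defined in Section II-B.
        return 0.0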
B. RL Elements of TPE

For the interaction between the TPE and the intelligent trainer, we define the three interfaces State, Action, and Reward of the TPE as follows. To distinguish the RL components in the different layers, in the following the superscript ξ indicates variables in the target controller layer, while Ξ indicates variables in the intelligent trainer layer.
• State: The state is a vector exposed to an outside agent, which can use the state to assess the training progress. Ideally one can put as much information as possible into the state design to measure the training progress. However, we found that using a constant (zero) to represent the TPE state can still work, as such a simple setting allows the trainer to learn a good action quickly. We also test other, more informative state representation designs, such as using the last average sampling reward or the normalized sampling count; they can achieve better performance in certain cases. A comparative study of these different designs is provided in Section IV.
• Action: the action interface comprises three controllable parameters that are exposed to an outside agent, which can utilize these actions to control the training progress, as mentioned in Fig. 1. We represent these parameters as
probability values, all defined in the range [0, 1]. Such a normalized action range simplifies the design of the trainer agent. Details of the three control parameters are given below (a minimal sketch of how they are used follows this list).
  – Action a0 decides whether one should train the model into a local or global model, which is achieved by controlling the starting state of a new episode when the target controller samples from the real environment. A quality function Φ, controlled by a0, is defined to select the starting points:

    Φ(s) = a0 · Q^ξ(s, π(s)) + (1 − a0) · u_[0,1],        (1)

  where Q^ξ is the value produced by the critic network of the target controller, π is the current policy, and u_[0,1] is a random number drawn from [0, 1]. With this quality function, we keep sampling random starting points in the physical environment until a high-quality starting point is found, as shown in Algorithm 1. On the one hand, when a0 approaches one, initial states with a higher Q value are more likely to be selected, which generates more data with high Q value. When these data are used to train the model, the model becomes more accurate in the high-Q-value subspace, benefiting local exploitation in it. On the other hand, when a0 approaches zero, the quality becomes a random number and the starting point becomes a random state, favoring global exploration.
  – Action a1 decides whether one should utilize the model in a local or global manner, which is achieved by controlling the starting state of a new episode when the target agent samples from the cyber environment. The starting state of an episode also matters in the cyber environment. For example, we can select a starting state s from the real data buffer B. In this case, the subsequent sampling process is a local search process, similar to the imagination process used in [9], and is more likely to generate samples of high prediction accuracy, as the model has explored nearby samples in the real environment. Alternatively, we can use a data point s_rand randomly selected from the state space to favor exploration. This action thus controls the trade-off between exploitation and exploration during the sampling process. In our design, a1, with 0 ≤ a1 ≤ 1, represents the probability of choosing the starting state s0 from the real data buffer:

    s0 = { s ∈ B,      if u_[0,1] ≤ a1
         { s_rand,     otherwise,                          (2)

  where u_[0,1] is a uniformly distributed random number drawn from [0, 1].
  – Action a2 decides how many cyber data are sampled and used in training. For the sampling part, a2 is set to the ratio of the number of real data sampled to the total data sampled (real and cyber) in this training step. Recall that in each step we sample a fixed number Kr of real data samples. Thus, a2 controls the number Kc of cyber data to sample in each step by

    Kc = Kr · (1 − a2) / a2.                               (3)

  The rationale of this design is to bound the action in the range [0, 1], which eases the design of an agent that interplays with the TPE. The data sampled from the real environment are stored in a real memory buffer, while the cyber data sampled from the cyber environment are stored in the cyber memory buffer.
  For the training part, a2 is also used to set the probability of taking a mini-batch from the real data memory buffer when training the target controller. Naturally, 1 − a2 represents the probability of taking a mini-batch from the cyber data buffer. With a fixed batch size, if we train with Tr batches of real data, then Tc batches of cyber data are used in this step:

    Tc = Tr · (1 − a2) / a2.                               (4)

  Note that we use only one action to control both the sampling and the training process, to accommodate DRL algorithms, such as TRPO, in which the sampling and training process cannot be decoupled.
• Reward: The reward interface is used to measure the performance of the target controller; a simple instance is sketched after this list. Note that the only reliable information we can get from the training process is the reward data collected when sampling from the real environment. These reward data can be manipulated into various reward definitions for the trainer; one design of the reward r^Ξ is

    r^Ξ = sign(r̄^ξ_{t+1} − r̄^ξ_t),                        (5)

  where r̄^ξ_{t+1} and r̄^ξ_t are the respective average sampling rewards of the target controller at steps t + 1 and t in the real environment. This means that as long as the reward is increasing, the current training action is considered acceptable. Although such a simple design allows the trainer to learn the settings quickly, it may not be effective in all practical cases, especially when the cyber data do not degrade the performance but prolong the convergence. A more effective order-based reward design is used in the ensemble trainer in Section III-B. Note that we can only utilize the reward information received when sampling from the real environment to measure the performance of the target controller. To avoid additional sampling cost, we have to rely on the original sampling process from the real environment, which is why we fix the number of real data samples collected in each step; otherwise we may not receive a stable evaluation of the target controller.
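The arithmetic behind Eqs. (3)–(5) is small enough to show directly. The helper functions below are a sketch under the assumption that rewards are plain floats and that a2 is kept away from zero; they are illustrative, not the released implementation.

def cyber_sample_count(k_r, a2, eps=1e-6):
    """Eq. (3): number of cyber samples K_c for a fixed number K_r of real samples."""
    a2 = max(a2, eps)               # a2 -> 0 would mean an unbounded cyber share
    return int(round(k_r * (1.0 - a2) / a2))

def cyber_batch_count(t_r, a2, eps=1e-6):
    """Eq. (4): number of cyber mini-batches T_c given T_r real mini-batches."""
    a2 = max(a2, eps)
    return int(round(t_r * (1.0 - a2) / a2))

def sign_reward(avg_reward_next, avg_reward_prev):
    """Eq. (5): trainer reward is +1, 0 or -1 depending on the change of the
    average sampling reward measured in the real environment."""
    diff = avg_reward_next - avg_reward_prev
    return (diff > 0) - (diff < 0)  # sign(diff) without importing numpy

# Example: with K_r = 10 real samples per step and a2 = 0.5,
# cyber_sample_count(10, 0.5) == 10, i.e. equal amounts of real and cyber data.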
C. Problem Formulation

Based on the defined TPE environment, the problem to solve in this paper is formulated as follows. Given a target controller to train with an MBRL algorithm, with a given maximum number
of samples to collect from the physical environment, we encapsulate it into the TPE defined above, and aim to train a trainer in an online manner to maximize the accumulated reward received from this TPE:

    max_{π^Ξ}  Σ_{t=1}^{t_max} r^Ξ(t),                     (6)

where π^Ξ is the control policy of the trainer and t_max is the number of trainer steps taken when the real data sample budget is consumed.

Note that the problem strictly demands online learning, as re-training from the beginning would incur additional sampling cost. In the following, we propose different control policy designs and trainer learning methods to accomplish this online learning task.

III. TTT: TRAINING THE TRAINER

In this section we present the outer layer of the RoR architecture, the trainer designs, to tackle the problem formulated above. We first propose the basic intelligent trainer, which uses a single DQN controller to do the online learning. We then propose an enhanced trainer design with multiple trainers to better evaluate the trainer actions, which can work even in some tough situations.

A. Intelligent Trainer

We design an RL intelligent trainer to optimize the control actions a0, a1, and a2 of the TPE in an online and on-policy manner. The interaction workflow of the trainer with the TPE is shown in Fig. 3. At each trainer step the trainer generates an action setting, with which the TPE advances for one time step. Such a process is equivalent to the MBRL training process with online parameter adaptation. Note that with such a design, only one target controller is involved in training and all trainer actions are tested in a single streamline of training. This "uni-head" trainer needs to learn quickly with limited training time steps and samples. Several trainer learning algorithms, such as DQN and REINFORCE, can be used to tackle this problem. In the following, we use a DQN controller to demonstrate the trainer design. A comparison of different trainer designs is given in Section IV-C.

Fig. 3. Work flow of the uni-head intelligent trainer. In the initialization, we create the corresponding TPE. After that, the training iterates until the total number of real data samples reaches the budget limit N.

We implement a specialized DQN trainer that carries out discretized control actions with a relatively small-scale Q network. At each time step, the trainer evaluates all the actions with the Q network and selects the action with the highest Q value.

The training of the DQN controller follows the standard epsilon-greedy exploration strategy [13]. To enhance training stability, the DQN controller is equipped with a memory, like the replay buffer in DDPG [3]. As such, the trainer can extract good actions from the noisy data received from the TPE. During the experiments, we noticed that samples from one single action could flood the buffer. The homogeneity in actions could prolong or even halt the training of DQN. To solve this problem, for a given action we limit the total number of its samples to M/|A|, where M and |A| are the size of the buffer and the size of the action set, respectively. If the number of samples for a given action exceeds this limit, a new sample replaces a randomly selected old one.

The pseudo code of the uni-head intelligent trainer is shown in Algorithm 2, with the detailed implementation of the sampling reset procedure in the real/cyber environment shown in Algorithm 1.

Algorithm 1 Sampling Reset Procedure
1: if the current sampling environment is the real environment then
2:   Initialize data set D = ∅, quality set G = ∅.
3:   for i = 1 : M1 do
4:     Generate one initial state s0 and compute its quality Φ(s0).
5:     Append s0 to D and append Φ(s0) to G.
6:     if i > M2 and Φ(s0) ≥ max(G) then
7:       Break.
8:     end if
9:   end for
10:  Return the last state of D.
11: else
12:   if u_[0,1] < a1 then
13:     Randomly select a state s from the real data memory.
14:     Set the cyber environment to state s.
15:     Return s.
16:   else
17:     Randomly initialize the cyber environment.
18:     Return the current state of the cyber environment.
19:   end if
20: end if
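For concreteness, here is a Python sketch of the reset procedure in Algorithm 1, under the assumption that the environments expose simple reset/set-state methods and that the target controller exposes its critic Q^ξ and policy π. The helper names are hypothetical, not from the released code.

import random

def make_quality_fn(critic_q, policy, a0):
    """Eq. (1): Phi(s) = a0 * Q(s, pi(s)) + (1 - a0) * u[0,1]."""
    return lambda s: a0 * critic_q(s, policy(s)) + (1.0 - a0) * random.random()

def reset_real_env(real_env, quality_fn, m1=50, m2=5):
    """Real-environment branch of Algorithm 1: keep drawing random starting
    states until the latest state has the best quality seen so far, trying at
    least m2 and at most m1 candidates."""
    states, qualities = [], []
    for i in range(1, m1 + 1):
        s0 = real_env.reset()          # draw a random initial state
        phi = quality_fn(s0)           # quality per Eq. (1)
        states.append(s0)
        qualities.append(phi)
        if i > m2 and phi >= max(qualities):
            break
    return states[-1]

def reset_cyber_env(cyber_env, real_buffer, a1):
    """Cyber-environment branch of Algorithm 1: with probability a1 start from
    a previously observed real state (local use of the model), otherwise start
    from a random state (global use)."""
    if random.random() < a1 and real_buffer:
        s = random.choice(real_buffer)
        cyber_env.set_state(s)
        return s
    return cyber_env.reset()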
Algorithm 2 Intelligent Trainer Enhanced Model-Based DRL Training Algorithm
1: Initialization: initialize the trainer agent (with a DQN network), the training process environment, and the target controller. Initialize the real data memory and the cyber data memory as empty sets. Sample a small data set of size o to initialize the cyber emulator, and initialize the real environment.
2: Set the number of total samples generated from the real environment n = 0. Set the maximum number of samples allowed to use as N.
3: //Training Process:
4: while n < N do
5:   Generate action a from the trainer agent.
6:   //One step in TPE:
7:   Train the target controller if there is enough data in its memory buffer.
8:   Sample Kr data points from the real environment according to the sampling reset Algorithm 1, and append the data to the real data memory.
9:   Sample Kc data points from the cyber environment according to the sampling reset Algorithm 1, and append the data to the cyber data memory.
10:  Train the dynamic model.
11:  Update n.
12:  Collect the state, action and reward data of the TPE.
13:  Update the trainer agent.
14: end while
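A condensed Python rendering of Algorithm 2 is given below, assuming the TPE sketch from Section II-A and a small discrete-action DQN trainer with epsilon-greedy exploration and a per-action capped replay memory. The DQNTrainer interface is a hypothetical skeleton, not the released implementation.

import random

def train_with_intelligent_trainer(tpe, trainer, k_r, budget_n):
    """Uni-head training loop (Algorithm 2): the trainer picks (a0, a1, a2),
    the TPE advances one MBRL step, and the trainer is updated online."""
    tpe.initialize()
    n, state = 0, 0.0                  # n counts real samples used so far
    while n < budget_n:
        action = trainer.act(state)    # epsilon-greedy over a discrete action set
        next_state, reward, _, _ = tpe.step(action)
        trainer.remember(state, action, reward, next_state)
        trainer.update()               # small DQN update from the replay memory
        state = next_state
        n += k_r                       # each TPE step consumes K_r real samples
    return tpe.controller

class DQNTrainer:
    """Skeleton of the outer-layer DQN trainer with a per-action cap of M/|A|."""
    def __init__(self, actions, memory_size=32, epsilon=0.1):
        self.actions = actions                       # e.g. all (a0, a1, a2) in {0.2, 1.0}^3
        self.cap = max(1, memory_size // len(actions))
        self.memory = {a: [] for a in actions}
        self.epsilon = epsilon

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_value(state, a))

    def remember(self, s, a, r, s_next):
        bucket = self.memory[a]
        if len(bucket) >= self.cap:                  # keep each action's share bounded
            bucket[random.randrange(len(bucket))] = (s, a, r, s_next)
        else:
            bucket.append((s, a, r, s_next))

    def q_value(self, state, action):
        return 0.0   # stands in for the small Q network used in the paper

    def update(self):
        pass         # stands in for the mini-batch Q-learning update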
B. Ensemble Trainer

In this subsection, we present a more robust trainer design that learns by comparison, to overcome the learning deficiency of the uni-head trainer in certain cases. The uni-head trainer described previously cannot, in some cases, adequately assess the quality of an action, as all actions are tested in a single streamline. In other words, the actions could be correlated and their quality could become indistinguishable. Also, for actions that generate non-negative reward but lead to slow convergence or a locally optimal policy, the reward design in (5) is unable to accurately assess their quality. To address these issues, we propose an ensemble trainer which uses a multi-head training process, similar to the bootstrapped DQN [14]. The design rationale is to diversify actions across different trainers without incurring additional sampling cost, and then to evaluate the actions by ranking their performance.

The proposed ensemble trainer consists of three different trainers, with the work flow shown in Fig. 4. For trainer 0, the actions are provided by the intelligent trainer; for trainer 1, the actions are provided by a random trainer; trainer 2 uses only real data, which means setting the three actions to 1, 0, and 0, respectively. The settings in trainers 0 and 1 enable the exploitation and exploration of the action space. Trainer 2 is a normal DRL training process that does not use the cyber data generated by the dynamic model. The reason we choose to ensemble these three distinct trainers is that they provide sufficient coverage of different trainer actions and each of them can work well in different cases. Note that it is not a trivial task to build an effective ensemble trainer without incurring additional real data cost, as the samples from different trainers can have different quality, which may degenerate the ensemble's overall performance. In the following, we propose solutions to this issue.

Fig. 4. Work flow of the ensemble trainer, with the major changes made to the uni-head trainer highlighted in bold font. In the initialization, we create three different trainers and their corresponding TPEs. After that, the training iterates until the total number of real data samples reaches the budget limit N. In each training step of the ensemble trainer, the original TPE step is revised to include the reference sampling mechanism, and after the TPE step we execute the designed ensemble process, including memory sharing, order-based reward calculation and weight transfer.

1) Real-Data Memory Sharing: We introduce a memory-sharing scheme to solve the issue of insufficient real data samples for each trainer, as we split the whole real data sample budget evenly across the three trainers in the ensemble. The even splitting is necessary for evaluating the target controllers of each trainer. It then follows that each trainer has only one-third of the real data samples in training compared with the original uni-head trainer. This would cause significant performance degeneration, as will be shown in Section IV. To address this issue, we devise a memory sharing process before the training of the target controller, as shown in Fig. 4. The memory sharing scheme is a pseudo sampling process executed after each trainer has done its sampling from the real environment and saved these data into its own real memory buffer. Then each trainer collects the new real data samples from the other trainers. As a result, at each step, each trainer receives Kr new data samples – the same amount of data as in the uni-head training. Note that with memory sharing, the real data from an underperforming target agent could degrade, or even fail, the ensemble performance. To solve this problem, we next introduce a reference sampling scheme.

2) Reference Sampling: A reference sampling scheme is proposed to maintain the quality of the real data samples introduced by the memory-sharing mechanism. The idea behind
the reference sampling is to select the best trainer, and then to use its target controller for the other trainers to sample real data with a probability p_ref. In our algorithm, at the first of every three steps, p_ref is forced to 0. As such, this first step, without reference sampling taking place, serves as an evaluation step for the trainer. In the next two steps, p_ref is determined by the min function in the following equation:

    p_ref = { 0,                                              if mod(t^Ξ, 3) == 0
            { min{ (φ − φ_min) / (φ_max − φ_min), 1 },        otherwise,            (7)

where t^Ξ is the current trainer step number and φ is the skewness ratio, which measures the degree of the outperformance of the best trainer; φ_max and φ_min are estimated upper and lower bounds, respectively. The details of φ are given in the weight transfer procedure below. With such a design, the better the performance of the best trainer, the higher p_ref will be.

3) Order-based Trainer Reward Calculation: The rewards of the trainers in the ensemble are designed by ordering the performance of the different trainers. After the training process of the target controllers of all trainers, for each trainer we calculate the average sampling reward of its corresponding target controller, r̄^ξ_i, as the raw reward of this trainer. Note that r̄^ξ_i is different from the sign reward used in (5). Next, we sort the tuple (r̄^ξ_0, r̄^ξ_1, r̄^ξ_2) in ascending order. We then define the index of r̄^ξ_i in the sorted tuple as the reward r̂^Ξ_i of trainer i. The rationale is that if the action of a trainer is good for training, it should help the trainer achieve better performance (measured by the average sampling reward).
Note that with the above reward design, the trainers generate three data samples at the trainer level in each step, and all these data are used to update the intelligent trainer. Due to the reference sampling mechanism, the order information may not correctly measure the performance of the trainers; to solve this issue, we throw away these samples whenever p_ref is not zero.

4) Weight Transfer: After collection of the trainer reward data, we add a weight transfer mechanism to solve the issue that some target agents may fail due to unfavorable trainer actions. The rationale is that after collecting the reward information for a sufficiently large number of steps, we can judge which trainer is currently the best one with high confidence. In this case, we can transfer the best target agent to the other trainers, such that those trainers that fall behind can restart from a good position. In particular, after the trainer reward data are collected, we examine the number of steps n_c that have been taken since the last weight transfer. If n_c is larger than a threshold C, we compute an accumulative reward for each trainer over the last n_c steps as:

    R_i(t^Ξ) = Σ_{j ∈ {n_c−1, ..., 0}} r̂^Ξ_i(t^Ξ − j),        (8)

where t^Ξ is the index of the current trainer step. The trainer with the maximum R_i is set as the best trainer. We then examine whether the DQN trainer is the best; if not, we transfer the weight parameters of the target controller trained by the best trainer to the target controller trained by the DQN trainer.
We also utilize the accumulated trainer reward to detect whether the best trainer is significantly better than the other trainers. We calculate a performance skewness ratio to measure the degree of the outperformance of the best trainer:

    φ = (R_b − R_m) / (R_b − R_w),                            (9)

where R_b, R_m and R_w are the best, median and worst R_i of the three trainers, respectively. The skewness ratio is used to determine p_ref as shown above.

Algorithm 3 shows the operational flow of the ensemble trainer. In summary, the ensemble trainer evaluates the quality of the actions by sorting the rewards received by the target controllers. It maintains the training quality through the memory sharing scheme, without incurring additional sampling cost. It maintains the sample quality through reference sampling. It can recover an underperforming trainer from poor actions. Though saving on the sampling cost, the ensemble trainer requires three times the training time. The increased training time can be partially reduced by early stopping of some underperforming trainers when necessary.

Algorithm 3 Ensemble Trainer Algorithm
1: Initialization: initialize the three trainer agents and the corresponding training process environments, along with the target controllers. Run the initialization process for each trainer. Initialize the best player to be the NoCyber trainer and the probability to use the best player to sample as p_ref.
2: Set the number of total samples generated from the real environment n = 0. Set the maximum number of samples allowed to use as N.
3: //Training Process:
4: while n < N do
5:   for trainer i ∈ {0, 1, 2} do
6:     Generate action a from the trainer agent.
7:     //One step in TPE:
8:     Execute the memory sharing procedure.
9:     Train the target controller if there is enough data in its memory buffer.
10:    Sample Kr/3 data points from the real environment with reference sampling probability p_ref, and append the data to the real data memory.
11:    Sample data from the cyber environment according to the trainer action, and append the data to the cyber data memory.
12:    Share the real data memory across all trainers.
13:    Train the dynamic model of the current trainer.
14:    Update n.
15:    Collect the state, action and raw reward data of the TPE.
16:  end for
17:  Compute the reward for each trainer from the raw reward data and calculate the accumulative reward R_i for trainers i = 0, 1, 2.
18:  Store the TPE data of all three trainers into the DQN memory to train the intelligent trainer.
19:  Update the trainer agents.
20:  Execute Algorithm 4 to do the performance skewness analysis and weight transfer, and update p_ref.
21: end while

Algorithm 4 Performance Skewness Analysis Procedure
1: if n_c > C then
2:   Compute the accumulative reward of trainer i as R_i for i = 0, 1, 2.
3:   Update the best trainer index as arg max_i(R_i).
4:   Compute the skewness ratio φ for the best player.
5:   Update the best player reference probability p_ref according to (7).
6:   if the DQN trainer is not the best trainer then
7:     Do weight transfer from the best trainer to the DQN trainer.
8:   end if
9:   Reset n_c = 0.
10: end if
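The ensemble bookkeeping in Eqs. (7)–(9) and Algorithm 4 reduces to a few lines of arithmetic. The sketch below assumes the per-trainer average sampling rewards and recent trainer rewards are available as plain Python lists, and the copy_weights callback is a hypothetical stand-in for the actual weight transfer; it is illustrative, not the released implementation.

def order_based_rewards(avg_sampling_rewards):
    """Order-based trainer reward: the rank of each trainer's average sampling
    reward in ascending order (worst -> 0, best -> 2 for three trainers)."""
    order = sorted(range(len(avg_sampling_rewards)),
                   key=lambda i: avg_sampling_rewards[i])
    ranks = [0] * len(avg_sampling_rewards)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def skewness_ratio(accumulated_rewards):
    """Eq. (9): phi = (R_best - R_median) / (R_best - R_worst)."""
    r = sorted(accumulated_rewards)
    r_w, r_m, r_b = r[0], r[len(r) // 2], r[-1]
    if r_b == r_w:
        return 0.0
    return (r_b - r_m) / (r_b - r_w)

def reference_probability(step, phi, phi_min=0.5, phi_max=0.7):
    """Eq. (7): every third step is an evaluation step (p_ref = 0); otherwise
    p_ref grows with how strongly the best trainer dominates."""
    if step % 3 == 0:
        return 0.0
    p = (phi - phi_min) / (phi_max - phi_min)
    return min(max(p, 0.0), 1.0)   # the lower clamp is an added safeguard; Eq. (7) only caps at 1

def maybe_weight_transfer(accumulated_rewards, dqn_index, copy_weights):
    """Algorithm 4 core: if the DQN trainer is not currently the best, copy the
    best trainer's target-controller weights into the DQN trainer."""
    best = max(range(len(accumulated_rewards)), key=lambda i: accumulated_rewards[i])
    if best != dqn_index:
        copy_weights(src=best, dst=dqn_index)
    return best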
IV. NUMERICAL EVALUATIONS

In this section, we evaluate the proposed intelligent trainer and ensemble trainer on five different tasks (or cases) of the OpenAI gym: Pendulum (V0), Mountain Car (Continuous V0), Reacher (V1), Half Cheetah ([15]), and Swimmer (V1).

A. Experiment Configuration

For the five test cases, different target controllers with promising published results are used: DDPG for Pendulum and Mountain Car; TRPO for Reacher, Half Cheetah, and Swimmer. The well-tuned parameters of the open-sourced codes [16], [17] are used for the hyper-parameter settings of the target controller (including Kr and Tr, as defined in Section II). Simple neural networks built with the guidelines provided in [25] are used for the cyber models. As our experiments have shown, it is very useful to normalize both the input and the output of the dynamics model. In this paper, we use the normalization method provided by [16], in which the mean and standard deviation of the data are updated during the training process. For the hyper-parameters M1 and M2 used in the reset procedure in Algorithm 1, we set M1 = 50 and M2 = 5, i.e., the maximum and minimum numbers of reset trials are 50 and 5, respectively.
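Since the input/output normalization of the dynamics model is called out as important here, the following is a minimal sketch of a running mean/standard-deviation normalizer of the kind described, updated as training data arrive. It follows the general idea rather than the exact code of [16].

import numpy as np

class RunningNormalizer:
    """Keeps running estimates of mean and std and normalizes incoming data.
    Intended for the dynamics-model inputs/outputs; a sketch, not code from [16]."""

    def __init__(self, dim, eps=1e-8):
        self.count = eps
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)        # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, batch):
        batch = np.atleast_2d(batch)
        for x in batch:
            self.count += 1.0
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / self.count + self.eps)

    def normalize(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, x):
        return x * self.std + self.mean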
B. Comparison of Uni-Head Intelligent Trainer with Baseline Algorithms

Multiple variants of the uni-head intelligent trainer are compared with baseline algorithms. There are three baseline algorithms and four intelligent trainers; their designs are summarized in Table I. The three baseline algorithms are:
• The NoCyber trainer is a standard DRL training process without using cyber data.
• The Fixed trainer follows the standard MBRL, with all actions set to 0.6 throughout the training process.
• The Random trainer outputs action 0.2 or 1.0 with equal probability. The same action values are used by the DQN trainer. These values are picked such that an extensive amount of cyber data can be used in the training; for example, when a2 is set to 0.2, the amount of cyber data sampled is five times the real data sampled. The value 0.2 is chosen without any tuning, i.e., it is not tuned to make the DQN trainer work better. Our focus is not to find the best settings of these parameters (as they will vary in practice), but to figure out whether the proposed trainer can select the better action among the predefined action value set.

We notice that, for some tasks, the total number of steps of the TPE is only 200, as shown in Table II. To simplify the learning process, we discretize each dimension of the trainer action.

TABLE II
NUMBER OF TOTAL TPE STEPS FOR DIFFERENT TASKS.
            Pendulum   Mountain Car   Reacher   Half Cheetah   Swimmer
TPE Steps   1000       30000          1000      400            200

The four intelligent trainers are:
• DQN trainer. The trainer action chooses from the two values 0.2 and 1.0, like the Random trainer; that is, ai ∈ {0.2, 1} for i = 0, 1, 2. The DQN controller is trained with a memory buffer of size 32. At each time step, four randomly selected batches of batch size eight are used to update the controller. For exploration, the epsilon-greedy method is used, with the first 10% of the trainer steps used for epsilon-greedy exploration and the final epsilon set to 0.1. Note that the setting 0.6 used in the Fixed trainer is the expected mean of the actions from the intelligent trainer if the trainer predicts uniformly random actions.
• DQN-5 actions. To test the effect of more action values in the action discretization, we introduce a second trainer by selecting five values from {0.2, 0.4, 0.6, 0.8, 1}.
• DQN-larger memory. To test the impact of a larger trainer memory, we introduce a third intelligent trainer with a memory size of 2000. In this case more trainer samples are stored and relatively older historical data are used in training the DQN controller.
• REINFORCE. The fourth intelligent trainer is the same as the DQN trainer except that the DQN controller is replaced by a REINFORCE controller. Since the REINFORCE algorithm requires data of multiple episodes to train, we manually set five steps (manually tuned) of the TPE as an episode.

The configurations for these algorithms are summarized in Table I.

TABLE I
CONFIGURATIONS OF DIFFERENT ALGORITHMS.
Baseline algorithms:
  NoCyber           – trainer type: none; action: (1, 0, 0); data source: real only.
  Fixed             – trainer type: none; action: (0.6, 0.6, 0.6); data source: real & cyber.
  Random            – trainer type: none; action: ai ∈ {0.2, 1.0}; data source: real & cyber.
Intelligent trainers:
  DQN               – trainer type: DQN; action: ai ∈ {0.2, 1.0}; data source: real & cyber; memory size: 32; TPE state: constant.
  DQN-5 actions     – trainer type: DQN; action: ai ∈ {0.2, 0.4, 0.6, 0.8, 1.0}; data source: real & cyber; memory size: 32; TPE state: constant.
  DQN-larger memory – trainer type: DQN; action: ai ∈ {0.2, 1.0}; data source: real & cyber; memory size: 2000; TPE state: constant.
  REINFORCE         – trainer type: REINFORCE; action: ai ∈ {0.2, 1.0}; data source: real & cyber; TPE state: constant.
  DQN-TPE V1        – trainer type: DQN; action: ai ∈ {0.2, 1.0}; data source: real & cyber; memory size: 32; TPE state: last sampling reward.
  DQN-TPE V2        – trainer type: DQN; action: ai ∈ {0.2, 1.0}; data source: real & cyber; memory size: 32; TPE state: real sample count.

The test results of the three baseline trainers and the four intelligent trainers are shown in Fig. 5. We obtain the test results by periodically evaluating the target controller in an isolated test environment. This ensures that data collection from the test environment does not interfere with the training process; in other words, none of the data collected from the test environment is used in the training. We observe that:
• The tasks of Pendulum, Mountain Car, and Reacher can benefit from cyber data used in training. For the tasks of Half Cheetah and Swimmer, the NoCyber trainer performs significantly better than the trainers using cyber data. This indicates that using cyber data may not always be beneficial; thus, the use of the cyber model should be considered carefully.
• In most tasks, the intelligent trainer performs better than the Fixed trainer. For example, DQN-5 actions performs better than the Fixed trainer for the tasks of Mountain Car, Reacher, and Half Cheetah, and performs similarly for the tasks of Pendulum and Swimmer. This indicates the viability of the intelligent trainer.
• For the tasks of Pendulum and Mountain Car, the Random trainer performs the best. This can be attributed to the fact that adding more noise encourages exploration. For example, to achieve better performance, Mountain Car requires more exploration to avoid a local optimum that could lead the target agent in unfavorable searching directions. We also observe that the performance of DQN-5 actions is more stable than that of DQN, due to the increased dimension of the action space, which improves the training diversity. We argue that even though the DQN trainer is no better than the Random trainer in these tasks, the DQN trainer is still learning something. The reason is that we are trying to learn a fixed good action through the DQN trainer, which means that the DQN trainer cannot provide the randomness that proves to be beneficial in these tasks. We can also observe that for the Half Cheetah task, the DQN trainer is much better than the Random trainer. This suggests that the DQN trainer can indeed learn in an online manner.
• We further examine the effect of using cyber data when it seems not to work. For Half Cheetah, we examine the results of multiple independent runs: cyber data causes instability in performance, resulting in higher variance and lower mean reward in ten independent tests. For Swimmer, the poor performance with cyber data is due to a special feature that the first two dimensions of its state definition are linearly correlated. The trained cyber model in this case is unable to correctly identify this feature and predict the state transition. Our results show that even when incorporating only 10% cyber data in training, severe performance degradation can occur. When cyber data are used, the target controller can be trapped in a local optimum that is difficult to recover from. We resolve this issue with the ensemble trainer.

To analyze the behavior of the trainer, we show in Fig. 5(f) the actions taken by the DQN trainer for the tasks of Mountain Car, Reacher, and Swimmer during the training process. We observe that for Mountain Car, the mean value of a0 fluctuates around 0.5. This agrees with our observation that for Mountain Car, the random baseline algorithm performs the best. For Reacher and Swimmer, the trainer quickly learns to use more of the real data, with the mean value of action a2 eventually reaching more than 0.6. This again indicates the viability of the trainer. Note that for Swimmer, even though the mean value of action a2 is larger than 0.6, the performance of the target controller is still very poor (Fig. 5), due to the training process' sensitivity to cyber data. This again verifies the necessity of an ensemble trainer that can quickly recover from degraded performance during training.

C. Sensitivity Analysis on Various Trainer and TPE Designs

We compare the performance of different trainer and TPE designs to study the performance sensitivity against implementation variations. In addition to the previously mentioned DQN, DQN-5 actions, and DQN-larger memory, we also test DQN trainers with two different TPE state designs, as also listed in Table I. DQN-TPE V1 adopts the last average sampling reward of the target controller as the state of the TPE; DQN-TPE V2 adopts the ratio (a value in the range [0, 1]) of the real samples used to the predefined maximum number of real samples as the state of the TPE. Table III presents the accumulative rewards for the five test cases: Pendulum, Mountain Car, Reacher, Half Cheetah, and Swimmer.

TABLE III
ACCUMULATIVE REWARDS OF DIFFERENT TRAINER VARIANTS WHEN USING DIFFERENT TRAINER AND TPE DESIGNS.
Variants            Pendulum   Mountain Car   Reacher   Half Cheetah   Swimmer
DQN                 -43323     1434.59        -7846     696492         4918
DQN-5 actions       -43204     1493.88        -7724     985847         2473
DQN-larger memory   -41329     1615.98        -7831     1354488        2142
DQN-TPE V1          -41869     1849.96        -7456     868597         1522
DQN-TPE V2          -46533     1826.19        -7478     1172288        2233

• For Mountain Car, Reacher and Half Cheetah, DQN-5 actions, DQN-larger memory, DQN-TPE V1 and DQN-TPE V2 consistently outperform DQN. This indicates that for some applications, an intelligent trainer that uses more action selections, a larger memory, or a more informative state representation can achieve better performance. The results hint that a smart design of the trainer or the TPE can compensate for the lack of training data.
• For Swimmer, we observe that none of the tested variants of DQN or TPE achieves satisfying performance. This is due to the fact that even a very small amount of cyber data can cause the target controller to be trapped in a local minimum from which it cannot recover.
Fig. 5. Accumulative rewards of different uni-head trainer designs for different tasks: (a) Pendulum V0, (b) Mountain Car Continuous V0, (c) Reacher V1, (d) Half Cheetah, (e) Swimmer V1. The curves show the average accumulative reward while the shaded region shows the standard deviation of the reward in ten independent runs. The proposed uni-head trainer shows its adaptability (better than the Fixed trainer) but may fail in certain cases like Swimmer. (f) shows the mean action a2 taken by the DQN trainer on the tasks of Mountain Car, Reacher, and Swimmer.

Fig. 6. Accumulative rewards of the ensemble trainer for different tasks: (a) Pendulum V0, (b) Mountain Car Continuous V0, (c) Reacher V1, (d) Half Cheetah, (e) Swimmer V1. The proposed ensemble design shows close-to-optimal or even better performance on all cases. (f) shows the mean action a2 taken by the DQN trainer in the ensemble trainer for Mountain Car, Reacher, and Swimmer.
125 DQN in ensemble


250
RANDOM in ensemble
100
−10 NoCyber in ensemble
200
75

50 −20 150

Reward

Reward
Reward

25 100

0 −30
50
−25
DQN in ensemble DQN in ensemble
−40 0
−50 RANDOM in ensemble RANDOM in ensemble
NoCyber in ensemble NoCyber in ensemble
−50
−75
0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.2 0.4 0.6 0.8 1.0
Number of real data samples used 1e4 Number of real data samples used 1e6 Number of real data samples used 1e6

(a) Mountain Car Continuous V0 (b) Reacher V1 (c) Swimmer V1

Fig. 7. Accumulative reward of different individual trainers of the ensemble trainer on (a) Mountain Car, (b) Reacher, and (c) Swimmer. Trainers’ performance
are tending to fuse except certain extremely under-performed trainers.

TABLE IV
Ensemble
250
Ensemble without memory sharing
S AMPLING SAVING TO ACHIEVE CERTAIN PREDEFINED PERFORMANCE OF
Ensemble without reference sampling THE ENSEMBLE TRAINER . T HE BASELINE COST IS THE EXPECTED COST OF
200
THE THREE ALGORITHMS N O C YBER , R ANDOM TRAINER AND DQN
TRAINER .
150
Reward

Pendulum Mountain Car Reacher Half Cheetah Swimmer


100
Target reward -500 75 -10 2500 100
Samples saving 26% 36% 2% 38% 56%
50

0.0 0.2 0.4 0.6 0.8 1.0


as the DQN or Random trainer. For the tasks of Swimmer
Number of real data samples used 1e6
and Half Cheetah, the ensemble trainer performs as well as
the NoCyber trainer, even though the learning process makes
Fig. 8. Accumulative rewards of ensemble trainer and its two variants: without it learn slower in the Half Cheetah task. With the proposed
memory sharing and without reference sampling for Swimmer. The proposed
ensemble design shows significant better performance. ensemble trainer, we are more likely to achieve sampling
cost saving in practice as we it is hard to predict which
kind of algorithm variant will deliver the best performance in
D. Mitigating Action Correlation with Multi-head Ensemble advance. We compute the expected saving in Table IV with the
Trainer ensemble trainer when assuming the baseline sampling cost is
the average cost of the three uni-head trainers NoCyber, DQN
As discussed in Section III-B, the purpose of constructing trainer and Random trainer. Note that for tasks Mountain Car,
an ensemble trainer is to overcome the action correlation Half Cheetah and Swimmer, the uni-head trainer may fail to
problem in uni-head trainer. In this subsection, we provide achieve the predefined performance target, in this case we set
evidence of the virtue of the ensemble design by comparing the cost as the maximum number of samples we tried in the
its performance with uni-head trainers. The ensemble trainer experiment. That means the expected saving is actually larger
comprises a DQN trainer (with TPE state design V2), a than the number shown in Table IV.
Random Trainer, and a NoCyber trainer. Following the design In Fig. 6 (f), we observe that the action a2 taken by the DQN
in Section III-B, these three trainers jointly sample and train varies significantly from the uni-head case. For Swimmer case,
three independent target controllers. The target controller of the action a2 gradually converges to one which allows better
the best trainer will be used in the test. For the step threshold performance. For Reacher case, we observe a phase transition
C in weight transfer, it should be set to a TPE step count that in the middle, during which it changes from preferring fewer
a just sufficient number of trajectories (at least one episode) cyber data to more cyber data. This proves that when and
has been sampled. For such reason we set C = 3 for all how many cyber data should be utilized may be related to the
The results, as presented in Fig. 6, show that the ensemble trainer achieves overall good performance even in the cases where the uni-head trainer fails. For the tasks of Pendulum, Mountain Car and Reacher, the ensemble trainer performs almost as well as the DQN or Random trainer. For the tasks of Swimmer and Half Cheetah, the ensemble trainer performs as well as the NoCyber trainer, even though the learning process makes it learn more slowly in the Half Cheetah task. With the proposed ensemble trainer, we are more likely to achieve sampling cost savings in practice, because it is hard to predict in advance which algorithm variant will deliver the best performance. We compute the expected saving in Table IV with the ensemble trainer, assuming the baseline sampling cost is the average cost of the three uni-head trainers NoCyber, DQN trainer and Random trainer. Note that for the tasks Mountain Car, Half Cheetah and Swimmer, a uni-head trainer may fail to achieve the predefined performance target; in this case we set its cost to the maximum number of samples we tried in the experiment, which means the expected saving is actually larger than the number shown in Table IV.
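The expected-saving figures in Table IV follow from a short piece of bookkeeping. The Python sketch below is only our own illustration of that calculation with made-up costs, assuming the saving is the relative reduction of the ensemble's sampling cost with respect to the baseline; a run that never reaches the target reward is charged the maximum number of real samples tried. The names charged_cost and expected_saving, the budget MAX_SAMPLES, and the cost numbers are all hypothetical.

    MAX_SAMPLES = 1_000_000      # assumed maximum number of real samples tried per run

    def charged_cost(cost):
        """A run that never reaches the target reward is charged MAX_SAMPLES."""
        return MAX_SAMPLES if cost is None else cost

    def expected_saving(uni_head_costs, ensemble_cost):
        # Baseline: expected (average) cost of the uni-head trainers
        # (NoCyber, DQN trainer, Random trainer).
        baseline = sum(charged_cost(c) for c in uni_head_costs) / len(uni_head_costs)
        return 1.0 - charged_cost(ensemble_cost) / baseline

    # Hypothetical per-trainer costs (real samples needed to reach the target);
    # None marks a trainer that never reached the target within the budget.
    uni_head = {"NoCyber": 400_000, "DQN": None, "Random": 650_000}
    print(f"expected saving: {expected_saving(uni_head.values(), 360_000):.0%}")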
In Fig. 6 (f), we observe that the action a2 taken by the DQN varies significantly from the uni-head case. For the Swimmer case, the action a2 gradually converges to a value that allows better performance. For the Reacher case, we observe a phase transition in the middle, during which the preference changes from fewer cyber data to more cyber data. This suggests that when and how much cyber data should be utilized may be related to the training progress. For the Mountain Car task, we observe that the action quickly converges to favor more cyber data, which is helpful in this task. These observations indicate that the proposed ensemble trainer can assess the control actions better than the uni-head trainer.

In Fig. 7, we show the interactions of the trainers in the ensemble by presenting the individual results of the constituent trainers, DQN in ensemble, Random in ensemble, and NoCyber in ensemble, for the tasks of Mountain Car, Reacher, and Swimmer (in the rest of this paragraph, we omit the term "in ensemble" for brevity). In all three cases, we can observe that within the ensemble, the originally good trainer (uni-head) still performs very well. For example, for the Mountain Car task, the Random trainer performs almost as well as the uni-head Random trainer. For the Swimmer task, the DQN trainer can now perform as well as the NoCyber trainer, which shows that the weight transfer process works as expected.

To further examine the effect of memory sharing and reference sampling, in Fig. 8 we compare the performance of three different ensemble designs for the task of Swimmer. All of them comprise the same three trainers, DQN, Random, and NoCyber, but differ in the incorporated schemes: the ensemble trainer (with memory sharing and reference sampling); the ensemble trainer without memory sharing (but with reference sampling); and the ensemble trainer without reference sampling (but with memory sharing). All these variants use weight transfer. The results show that, without memory sharing, the ensemble performance degrades. This is because each of the three intelligent trainers uses only one-third of the original data samples (which is why its curve stops at one third of the others on the x-axis). Without reference sampling, the ensemble performs very similarly to the DQN trainer (Fig. 5). This is because, without reference sampling, most of the real data samples come from the underperforming target controllers of the DQN and Random trainers, and data from underperforming target controllers deteriorate the learning process of the NoCyber trainer. The results indicate that memory sharing and reference sampling are essential for the ensemble trainer.
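The two schemes can be made more concrete with a small sketch. The Python code below is only our own illustration, not the authors' released trainer [12]; in particular, we read reference sampling as letting the currently best-scoring target controller (the reference) collect the real-environment samples on behalf of all trainers, while memory sharing means every trainer draws training batches from one common pool of real samples. ReplayMemory, TrainerStub and collect_real_data are hypothetical names.

    import random

    class ReplayMemory:
        """One pool of real transitions shared by all trainers (memory sharing)."""
        def __init__(self):
            self.samples = []
        def add(self, transition):
            self.samples.append(transition)
        def batch(self, size):
            return random.sample(self.samples, min(size, len(self.samples)))

    class TrainerStub:
        """Minimal stand-in for a constituent trainer and its target controller."""
        def __init__(self, name):
            self.name = name
            self.score = random.random()      # latest evaluation of its controller
        def act_in_real_env(self):
            # Placeholder for one real-environment transition collected by this
            # trainer's target controller.
            return (self.name, random.random())

    def collect_real_data(trainers, shared, use_reference_sampling=True):
        if use_reference_sampling:
            # The best-scoring controller acts in the real environment for all
            # trainers, so weak controllers do not fill the shared pool with
            # poor trajectories.
            reference = max(trainers, key=lambda t: t.score)
            actors = [reference] * len(trainers)
        else:
            actors = list(trainers)           # each trainer rolls out its own controller
        for actor in actors:
            shared.add(actor.act_in_real_env())

    memory = ReplayMemory()
    trainers = [TrainerStub(n) for n in ("DQN", "Random", "NoCyber")]
    collect_real_data(trainers, memory)
    print(memory.batch(2))                    # every trainer trains from this shared pool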
V. RELATED WORKS

To build intelligent agents that can learn to accomplish various control tasks, researchers have been actively studying reinforcement learning for decades, such as [18]–[22]. With the recent advancement of deep learning, DRL [2] has demonstrated its strength in various applications. For example, in [23] a DRL agent is proposed to solve financial trading tasks; in [24] a neural RL agent is trained to mimic human motor skill learning; in [25] an off-policy RL method is proposed to solve nonlinear and nonzero-sum games. Our research is particularly focused on model-based RL, which can be utilized to reduce the sampling cost of RL, and we propose an AutoML method for it. In the following, we briefly review the recent developments in model-based RL and in AutoML.

A. Model-based RL

Despite the significant performance improvements, the high sampling cost necessitated by RL has become a significant issue in practice. To address this issue, MBRL has been introduced to learn a model of the system dynamics, so as to reduce the data collection and sampling cost. In [7] the authors provided an MBRL method for a robot controller that samples from both the real physical environment and a learned cyber emulator. In [26] the authors adapted a model, trained previously for other tasks, to train the controller for a new but similar task; this approach combines prior knowledge with online adaptation of the dynamics model and thus achieves better performance. In these approaches, the number of samples taken from the cyber environment to train the target controller is either predetermined or can only be adjusted manually, resulting in both sampling inefficiency and additional algorithm tuning cost. In [27] the authors proposed a model-assisted bootstrapped DDPG algorithm, which uses a variance ratio computed from the multiple heads of the critic network to decide whether the cyber data can be used or not. The method relies on the bootstrapped DQN design, which is not suitable for other cases.
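As a rough illustration of that gating principle (a schematic of the idea only, not the algorithm of [27]): the disagreement among bootstrapped critic heads on a cyber sample serves as an uncertainty estimate, and the sample is used only if that disagreement, relative to the disagreement observed on real data, stays below a threshold. A minimal Python sketch follows, with hypothetical function names and made-up Q-value estimates.

    import statistics

    def head_variance(q_heads):
        """Disagreement among bootstrapped critic heads for one transition."""
        return statistics.pvariance(q_heads)

    def accept_cyber_sample(q_heads_cyber, q_heads_real, max_ratio=1.5):
        # Gate cyber data by the ratio of critic-head variance on the cyber
        # sample to the variance observed on comparable real samples.
        real_var = head_variance(q_heads_real) or 1e-8
        return head_variance(q_heads_cyber) / real_var <= max_ratio

    # Hypothetical Q-value estimates from K = 5 critic heads:
    real = [2.0, 2.2, 1.9, 2.1, 2.0]
    print(accept_cyber_sample([1.0, 1.1, 0.9, 1.0, 1.05], real))   # heads agree  -> True
    print(accept_cyber_sample([0.2, 3.0, -1.5, 2.4, 0.9], real))   # heads differ -> False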
Instead of treating the cyber model as a data source for training, some approaches use the cyber model to conduct pre-trial tree searches in applications for which selecting the right action is highly critical [8], [9]. The cyber model can prevent selecting unfavorable actions and thus accelerates the learning of the optimal policy. In [10], the authors introduced a planning agent and a manager that decides whether to sample from the cyber engine or to take actions, in order to minimize the training cost. Both approaches focus on tree search in action selection, which differs from our design, where we aim to select the proper data source in sampling. Some recent works investigate integrating model-based and model-free approaches in RL. In [28] the authors combined model-based and model-free approaches for Building Optimization and Control (BOC), where a simulator is used to train the agent, while a real-world test-bed is used to evaluate the agent's performance. In [29] model-based DRL is used to train a controller agent, which is then used to provide weight initialization for a model-free DRL approach, so as to reduce the training cost. Different from this approach, we focus on directly sampling from the model to reduce the sampling cost in the real environment.

B. AutoML

The method proposed in this paper is a typical AutoML [30] solution. AutoML aims to develop algorithms that can automatically train a high-performance machine learning model without human intervention in tasks such as hyper-parameter tuning and model selection. AutoML has been proposed to solve various specific training tasks such as model compression for mobile devices [31], transfer learning [32], and general neural network training [33].

Note that most AutoML solutions are proposed for supervised learning cases, in which the dataset is usually pre-acquired. In our case, as the data are collected by the very target controller being trained, an AutoML solution is demanded even more than in a general supervised learning case.

VI. CONCLUSION

In this paper we propose an intelligent trainer for online model training and sampling-settings learning in MBRL algorithms. The proposed approach treats the training process of MBRL as the target system to optimize, and uses a trainer that monitors and optimizes the sampling and training process in MBRL. The proposed trainer solution can be used in practical applications to reduce the sampling cost while achieving close-to-optimal performance.

For future work, the proposed trainer framework can be further improved by adding more control actions to ease the algorithm tuning cost. An even more advanced design is to use one trainer to train different DRL controllers for multiple tasks, which can learn the common knowledge shared by different DRL algorithms for these tasks.
REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[4] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[5] M. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," in Proceedings of the International Conference on Learning Representations (ICLR), May 2016.
[6] R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
[7] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, "Learning to control a low-cost manipulator using data-efficient reinforcement learning," 2011.
[8] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, "Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning," in Advances in Neural Information Processing Systems, 2014, pp. 3338–3346.
[9] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li et al., "Imagination-augmented agents for deep reinforcement learning," arXiv preprint arXiv:1707.06203, 2017.
[10] R. Pascanu, Y. Li, O. Vinyals, N. Heess, L. Buesing, S. Racanière, D. Reichert, T. Weber, D. Wierstra, and P. Battaglia, "Learning model-based planning from scratch," arXiv preprint arXiv:1707.06170, 2017.
[11] Y. Li, Y. Wen, K. Guan, and D. Tao, "Transforming cooling optimization for green data center via deep reinforcement learning," arXiv preprint arXiv:1709.05077, 2017.
[12] https://bitbucket.org/RLinRL/intelligenttrainerpublic, accessed: 2018-05-06.
[13] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
[14] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," in Advances in Neural Information Processing Systems, 2016, pp. 4026–4034.
[15] https://github.com/berkeleydeeprlcourse/homework/tree/master/hw4, accessed: 2018-05-06.
[16] https://github.com/pat-coady/trpo, accessed: 2018-05-06.
[17] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, "OpenAI baselines," https://github.com/openai/baselines, 2017.
[18] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, no. 5, pp. 834–846, 1983.
[19] F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. John Wiley & Sons, 2013, vol. 17.
[20] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 3, pp. 621–634, 2014.
[21] B. Luo, H.-N. Wu, and T. Huang, "Off-policy reinforcement learning for H∞ control design," IEEE Transactions on Cybernetics, vol. 45, no. 1, pp. 65–76, 2015.
[22] D. Liu, X. Yang, D. Wang, and Q. Wei, "Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints," IEEE Transactions on Cybernetics, vol. 45, no. 7, pp. 1372–1385, 2015.
[23] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2017.
[24] Y. Pan and H. Yu, "Biomimetic hybrid feedback feedforward neural-network learning control," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 6, pp. 1481–1487, 2017.
[25] R. Song, F. L. Lewis, and Q. Wei, "Off-policy integral reinforcement learning method to solve nonlinear continuous-time multiplayer nonzero-sum games," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 704–713, 2017.
[26] S. Levine, N. Wagener, and P. Abbeel, "Learning contact-rich manipulation skills with guided policy search," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 156–163.
[27] G. Kalweit and J. Boedecker, "Uncertainty-driven imagination for continuous deep reinforcement learning," in Conference on Robot Learning, 2017, pp. 195–206.
[28] S. Baldi, I. Michailidis, C. Ravanis, and E. B. Kosmatopoulos, "Model-based and model-free plug-and-play building energy efficient control," Applied Energy, vol. 154, pp. 829–841, 2015.
[29] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," arXiv preprint arXiv:1708.02596, 2017.
[30] I. Guyon, I. Chaabane, H. J. Escalante, S. Escalera, D. Jajetic, J. R. Lloyd, N. Macià, B. Ray, L. Romaszko, M. Sebag et al., "A brief review of the ChaLearn AutoML challenge: any-time any-dataset learning without human intervention," in Workshop on Automatic Machine Learning, 2016, pp. 21–30.
[31] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[32] C. Wong, N. Houlsby, Y. Lu, and A. Gesmundo, "Transfer learning with neural AutoML," in Advances in Neural Information Processing Systems, 2018, pp. 8356–8365.
[33] Y.-H. Kim, B. Reddy, S. Yun, and C. Seo, "NEMO: Neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy," in ICML 2017 AutoML Workshop, 2017.
