Intelligent Trainer For Model-Based Deep Reinforcement Learning
Abstract—Model-based reinforcement learning (MBRL) has been proposed as a promising alternative solution to tackle the high sampling cost challenge in the canonical reinforcement learning (RL) setting.

Reinforcement learning allows for progressively learning the best control policy in complex systems. Previously, RL has been adopted to solve problems like robot arm control, maze solving, and game playing.
Fig. 1. Illustration of the MBRL algorithm using the model as a data source: (a) The data flow of MBRL, where the cyber environment is used to generate synthetic training data for the target controller. (b) Typical training flow of MBRL, in which we indicate the settings that are usually set manually.

• System model generation. The model is trained to mimic the real system in that, given the current state and the action to take, it predicts the next system state. The learned model can be trained/used in a global or local [9] manner. In the global manner, the model is trained on, or used to generate, data samples from the whole state space, which favors global exploration. In the local manner, the model is trained on, or used to generate, data samples in a constrained subspace, which reinforces local exploitation. In Fig. 1, parameters a0 and a1 control whether to go global or local in the training and sampling procedure of the system model.
• Control policy training. The data collected from the physical environment are also used in the training of the target controller, together with the cyber data generated from the learned model. In this case, the portion of cyber data to use requires proper configuration to achieve the desired outcome. As shown in the experimental results of this paper, the proper setting can vary from case to case: for certain cases, using more cyber data is helpful, while for other cases it may lead to serious performance degeneration. In Fig. 1, parameter a2 controls this setting.

We refer to these configuration parameters introduced by the model as the model-related hyper-parameter settings. In previous research, these parameters are tried manually in the training stage, often resulting in additional time and/or resource cost¹. It follows that an autoML solution for MBRL is highly demanded.

¹ A naive approach to potentially solve this problem is to re-train the controller with different parameter settings using the data samples collected in the first trial. Such a solution will not incur additional sampling cost. However, this "supervised" learning approach may not work well for the RL case, as the training performance of a policy is largely determined by the data used in training. If the data used in training were sampled by an under-performing policy, they may lack the important samples that lead to better performance, making the re-training useless.

In this research, we propose an autoML solution for the MBRL framework, aiming to tackle the hyper-parameter setting challenge. Our proposed solution adopts a "reinforcement on reinforcement" (RoR) design architecture to learn the optimal model-related parameters and training/sampling settings in an online manner. Specifically, as illustrated in Fig. 2, we first encapsulate the canonical model-based RL training process into a standard RL environment (Training Process Environment, TPE in Fig. 2) with exposed state, action, and reward interfaces, as the inner RL layer. In the outer layer, we introduce an intelligent trainer, as an RL agent, to interact with the inner layer and control the sampling and training process of the target controller in the TPE. Such a layered architecture embodies an inherent decomposition between the training process and the control of the training process in the inner layer, greatly broadening its applicability to more general RL scenarios. In comparison with existing approaches that directly modify the training algorithm of the target controller [10], our design can work with different MBRL controllers and with different trainer designs. We call the latter the "train the trainer" design.

Our research intends to optimize the "train the trainer" design for better learning efficiency in the outer layer of the RoR architecture, and to validate our design over widely accepted benchmark cases in the OpenAI gym. First, we propose two alternative trainer designs:
• Uni-head trainer. This approach implements a single trainer, cast as a DQN controller, that learns in an online manner to optimize the sampling and training in the inner layer.
• Multi-head trainer. This approach implements an ensemble trainer, comprising multiple trainers that take independent actions in their training processes and are ranked against one another to provide a quantitative comparison of their respective actions.

We implement both trainer designs in TensorFlow for five benchmark cases (i.e., Pendulum, Mountain Car, Reacher, Half Cheetah, and Swimmer) and evaluate their performance in learning the best control policy under external constraints. Our evaluation compares against three baseline algorithms: a model-free RL algorithm, an MBRL algorithm with random hyper-parameter settings, and an MBRL algorithm with fixed hyper-parameter settings. Our numerical investigations show that our proposed framework outperforms the aforementioned baseline algorithms in overall performance across different test cases when the best parameter settings are unknown. Specifically, our proposed framework can achieve the following results:
• For the same learned policy quality, our proposed RoR framework can achieve an expected sampling cost saving of up to 56% over the average cost of the three baseline algorithms.
• Given the same sampling budget, our proposed RoR framework can achieve a policy quality on par with the best policy available, without the prior requirement of knowing the best parameter setting across all the RL benchmark cases.

These evaluations suggest that our proposed RoR framework can be readily applied to emerging industrial applications with cost concerns. For example, in a data center room cooling control application, we can use the trainer framework to properly utilize a computational fluid dynamics (CFD) model together with the real monitoring data to train an air cooling unit controller [11]. At the same time, it can shed new light on model-based RL research by leveraging the RoR framework for autoML empowerment. Specifically, it can serve as a general framework that works with different RL training algorithms and could also be a potential solution for other learning tasks in which online adaptive parameter setting is demanded. We have released the open-source code of our proposed RoR framework at [12], for the research community to further develop new applications and algorithms.

The remainder of this paper is organized as follows. Sections II and III provide a detailed description of the proposed trainer framework, including its key components, the uni-head trainer design, and the ensemble trainer design. Section IV presents the numerical evaluation results of the proposed framework. Section V briefly reviews the related works. Section VI concludes the whole paper.

II. RoR: REINFORCEMENT ON REINFORCEMENT ARCHITECTURE

The overall architecture of the proposed intelligent trainer framework is shown in Fig. 2. The inner layer, i.e., the Training Process Environment (TPE), is a standard model-based DRL system that utilizes the model as a data source to train the target controller. The training data are provided by the physical environment, which represents the real-world system, and the cyber environment, which is an emulator of the physical system. The emulator can be either knowledge-based or learning-based (e.g., a neural network prediction model). The outer layer, i.e., the intelligent trainer, is also an RL agent that controls and optimizes the sampling and training process of the target controller in the real and cyber environments via feedback and action outputs. Thus, the proposed framework can be considered a "reinforcement on reinforcement" architecture. Such a modularized design can easily work with different kinds of target-controller training algorithms (such as DDPG and TRPO), and the extra layer of intelligent trainer can be any optimizer that can output a control action when given a TPE observation.

In the following we first present the inner layer of the proposed trainer framework to introduce how we encapsulate the standard training process of MBRL as an RL environment. We design the TPE with two goals. First, it should be formulated as a standard RL environment, such that any agent can interact with it. Second, the TPE shall expose the action interfaces to control the sampling and training settings of the MBRL, and the reward information that can encourage an agent to minimize the sampling cost in the real environment while training the target controller to a target performance. To achieve these goals, the TPE is designed with two major functions and three RL interfaces, as follows.

A. Basic Functions of TPE

The TPE has two major functions that can be executed to complete the entire training process of a general MBRL:
• Initialization: execute the initialization tasks for the MBRL training process. These tasks include initializing the real training environment, the cyber emulator, and the target controller.
• Step(state, action): execute one step of training of the MBRL algorithm. This process includes sampling from the real and cyber environments, training the target controller, and training the dynamics model of the cyber emulator. Note that in each step, we keep the number of real data samples fixed (Kr) while optimizing the amount Kc of cyber data used in the training. We found that such a design is more stable in implementation, as it provides a tractable evaluation of the policy by measuring the received reward from the real environment. With such a setting, the TPE exposes action interfaces that determine how much cyber data to use and how to do the sampling, and reward information that encourages a trainer agent to train the target controller to a better performance.

With the TPE, the MBRL training process can be executed by repeatedly calling the Step function after the Initialization. The detailed training algorithm used to train the target controller is embedded in the Step function.
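To make the encapsulation concrete, the sketch below shows a minimal gym-style TPE wrapper in Python. It is an illustrative sketch of the Initialization/Step interface described above, not the released implementation; the helper objects real_env, cyber_env, and target_controller and their method names (sample, train, fit, reset) are assumptions made for illustration.

import numpy as np

class TrainingProcessEnvironment:
    """Minimal sketch of the TPE: wraps one MBRL training step as an RL env.
    The helper objects and their method names are assumptions for illustration."""

    def __init__(self, real_env, cyber_env, target_controller, K_r=1000):
        self.real_env = real_env            # physical environment
        self.cyber_env = cyber_env          # learned or knowledge-based emulator
        self.controller = target_controller # target RL controller (e.g. DDPG/TRPO)
        self.K_r = K_r                      # fixed number of real samples per step
        self.real_buffer, self.cyber_buffer = [], []
        self.last_avg_reward = None

    def reset(self):
        # "Initialization": set up real env, cyber emulator and target controller.
        self.real_env.reset()
        self.cyber_env.reset()
        self.controller.reset()
        return np.zeros(1)                  # a constant state also works (see Sec. II-B)

    def step(self, action):
        a0, a1, a2 = action                 # trainer action, each component in [0, 1]
        # 1) Sample a fixed number K_r of real transitions.
        real_batch, avg_reward = self.controller.sample(self.real_env, self.K_r)
        self.real_buffer += real_batch
        # 2) Sample K_c = K_r * (1 - a2) / a2 cyber transitions.
        K_c = int(self.K_r * (1.0 - a2) / max(a2, 1e-6))
        self.cyber_buffer += self.controller.sample(self.cyber_env, K_c)[0]
        # 3) Train the dynamics model and the target controller on mixed mini-batches.
        self.cyber_env.fit(self.real_buffer)
        self.controller.train(self.real_buffer, self.cyber_buffer, real_ratio=a2)
        # 4) Sign reward: +1 if the average real sampling reward improved.
        reward = 0.0 if self.last_avg_reward is None else float(np.sign(avg_reward - self.last_avg_reward))
        self.last_avg_reward = avg_reward
        return np.zeros(1), reward, False, {}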
B. RL Elements of TPE

For the interaction between the TPE and the intelligent trainer, we define the three interfaces State, Action, and Reward of the TPE as follows. To distinguish the RL components in different layers, in the following, superscript ξ is used to indicate variables in the target-controller layer, while Ξ is used to indicate variables in the intelligent-trainer layer.
• State: The state is a vector that is exposed to an outside agent, which can use the state to assess the training progress. Ideally, one can put as much information as possible into the state design to measure the training progress. However, we found that using a constant (zero) to represent the TPE state can still work, as such a simple setting allows the trainer to learn a good action quickly. We also test other, more informative state representation designs, such as using the last average sampling reward or the normalized sampling count. They can achieve better performance in certain cases. A comparative study of these different designs is provided in Section IV.
• Action: The action interface comprises three controllable parameters that are exposed to an outside agent, which can utilize these actions to control the training progress, as mentioned in Fig. 1. We represent these parameters as probability values, all defined in the range [0, 1]. Such a normalized action range can simplify the design of the trainer agent. Details of the three control parameters are given below; a code sketch of their effects follows this list.
– Action a0 decides whether the model should be trained as a local or a global model, which is achieved by controlling the starting state of a new episode when the target controller samples from the real environment. A quality function Φ, controlled by a0, is defined to select the starting points:

\Phi(s) = a_0 \cdot Q^{\xi}(s, \pi(s)) + (1 - a_0) \cdot u_{[0,1]},   (1)

where Q^ξ is the value produced by the critic network of the target controller, π is the current policy, and u_[0,1] is a random number drawn from [0, 1]. With this quality function, we keep sampling random starting points in the physical environment until a high-quality starting point is found, as shown in Algorithm 1. On the one hand, when a0 approaches one, initial states with a higher Q value are more likely to be selected, which will generate more data with high Q values. When these data are used to train the model, the model will be more accurate in the high-Q-value subspace, benefiting local exploitation in it. On the other hand, when a0 approaches zero, the quality becomes a random number and the starting point will be a random state, which favors global exploration.
– Action a1 decides whether the model should be utilized in a local or global manner, which is achieved by controlling the starting state of a new episode when the target agent samples from the cyber environment. The starting state of an episode also matters in the cyber environment. For example, we can select a starting state s from the real data buffer B. In this case, the subsequent sampling process will be a local search process similar to the imagination process used in [9], and it is more likely to generate samples of high prediction accuracy, as the model has explored nearby samples in the real environment. Alternatively, we can use a data point s_rand randomly selected from the state space to favor exploration. It thus controls the trade-off between exploitation and exploration during the sampling process. In our design, a1, with 0 ≤ a1 ≤ 1, represents the probability of choosing the starting state s0 from the real data buffer,

s_0 = \begin{cases} s \in B, & \text{if } u_{[0,1]} \le a_1 \\ s_{rand}, & \text{otherwise}, \end{cases}   (2)

where u_[0,1] is a uniformly distributed random number drawn from [0, 1].
– Action a2 decides how many cyber data are sampled and used in training. For the sampling part, a2 is set to the ratio of the number of real data samples to the total data sampled (real and cyber) in this training step. Recall that in each step we sample a fixed number Kr of real data samples. a2 thus controls the number Kc of cyber data samples to collect in each step by

K_c = \frac{K_r (1 - a_2)}{a_2}.   (3)

The rationale of such a design is to bound the action in the range [0, 1], which can ease the design of an agent that interplays with the TPE. The sampled data from the real environment are stored in a real memory buffer, while the cyber data sampled from the cyber environment are stored in the cyber memory buffer.
For the training part, a2 is also used to set the probability of taking a mini-batch from the real data memory buffer when training the target controller. Naturally, 1 − a2 represents the probability of taking a mini-batch from the cyber data buffer. With a fixed batch size, if we train with Tr batches of real data, then Tc batches of cyber data are used in this step:

T_c = \frac{T_r (1 - a_2)}{a_2}.   (4)

Note that we use only one action to control both the sampling and training process to accommodate some DRL algorithms, such as TRPO, where the sampling and training process cannot be decoupled.
• Reward: The reward interface is used to measure the performance of the target controller. Note that the only reliable information we can get from the training process is the reward data collected when sampling from the real environment. These reward data can be manipulated into various reward definitions for the trainer; one design of the reward r^Ξ is

r^{\Xi} = \mathrm{sign}(\bar{r}^{\xi}_{t+1} - \bar{r}^{\xi}_{t}),   (5)

where r̄^ξ_{t+1} and r̄^ξ_t are the respective average sampling rewards of the target controller at steps t + 1 and t from the real environment. This means that, as long as the reward is increasing, the current training action is considered acceptable. Although such a simple design allows the trainer to learn the settings quickly, it may not be effective in all practical cases, especially when the cyber data do not degrade the performance but prolong the convergence. A more effective order-based reward design is used in the ensemble trainer in Section III-B. Note that we can only utilize the reward information received when sampling from the real environment to measure the performance of the target controller. To avoid additional sampling cost, we have to rely on the original sampling process from the real environment, which is why we set the number of real data samples in each step to be fixed, as otherwise we may not receive a stable evaluation of the target controller.
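The helpers below sketch how the three actions and the sign reward of Eqs. (1)-(5) might be realized in Python. They are illustrative only: the function names, the callables q_value and policy (standing in for the target controller's critic and policy), real_env.sample_initial_state, and state_space_sampler are assumptions, and the a0 helper is a simplified argmax variant of the early-stopping reset loop in Algorithm 1.

import random
import numpy as np

def select_real_start_state(real_env, q_value, policy, a0, n_candidates=50):
    # Action a0 (Eq. 1): score random candidate start states with the quality
    # function Phi and pick the highest-quality one; a0 near 1 prefers high-Q
    # states (local exploitation), a0 near 0 gives random starts (exploration).
    candidates = [real_env.sample_initial_state() for _ in range(n_candidates)]
    phi = [a0 * q_value(s, policy(s)) + (1 - a0) * random.random() for s in candidates]
    return candidates[int(np.argmax(phi))]

def select_cyber_start_state(real_buffer, state_space_sampler, a1):
    # Action a1 (Eq. 2): with probability a1 start a cyber episode from a real
    # state (local "imagination"), otherwise from a random state (global).
    if random.random() <= a1:
        return random.choice(real_buffer)
    return state_space_sampler()

def cyber_quota(K_r, T_r, a2, eps=1e-6):
    # Action a2 (Eqs. 3-4): number of cyber samples and cyber training batches
    # implied by the real/total ratio a2 (K_r and T_r stay fixed).
    K_c = int(K_r * (1.0 - a2) / max(a2, eps))
    T_c = int(T_r * (1.0 - a2) / max(a2, eps))
    return K_c, T_c

def trainer_sign_reward(avg_reward_next, avg_reward_prev):
    # Eq. (5): +1 if the average real sampling reward improved, -1 otherwise.
    return float(np.sign(avg_reward_next - avg_reward_prev))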
C. Problem Formulation

Based on the defined TPE environment, the problem to solve in this paper is formulated as follows. Given a target controller to train by an MBRL algorithm with a given maximum number
of samples to collect from the physical environment, we encapsulate it into the TPE defined above, and aim to train a trainer in an online manner to maximize the accumulated reward received from this TPE:

\max_{\pi^{\Xi}} \sum_{t=1}^{t_{\max}} r^{\Xi}(t),   (6)

where π^Ξ is the control policy of the trainer and t_max is the maximum number of trainer steps when the real data sample budget is consumed.

Note that the problem strictly demands online learning, as re-training from the beginning would incur additional sampling cost. In the following, we propose different control policy designs and trainer learning methods to accomplish this online learning task.

III. TTT: TRAINING THE TRAINER

In this section we present the outer layer of the RoR architecture, the trainer designs, to tackle the above formulated problem. We first propose the basic intelligent trainer, which utilizes a single DQN controller to do the online learning. Then we propose an enhanced trainer design with multiple trainers to better evaluate the trainer actions, which can even work in some tough situations.

Fig. 3. Work flow of the uni-head intelligent trainer. In the initialization, we create the corresponding TPE. After that, the training iterates until the total number of real data samples reaches the budget limit N.

In the uni-head trainer design, only one target controller is involved in training and all trainer actions are tested in a single streamline of training. This "uni-head" trainer needs to learn quickly with limited training time steps and samples. Several trainer learning algorithms, like DQN and REINFORCE, can be used to tackle this problem. In the following, we use a DQN controller to demonstrate the trainer design. A comparison of different trainer designs is given in Section IV-C.

We implement a specialized DQN trainer that carries out discretized control actions with a relatively small-scale Q network. At each time step, the trainer evaluates all the actions with the Q network and selects the action with the highest Q value.

The training of the DQN controller follows the standard epsilon-greedy exploration strategy [13]. To enhance the training stability, the DQN controller is equipped with a memory, like the replay buffer in DDPG [3]. As such, the trainer can extract good actions from the noisy data received from the TPE. During the experiments, we noticed that samples from merely one single action could flood the buffer. The homogeneity in actions could prolong or even halt the training of DQN. To solve this problem, for a given action we limit the total number of its samples to M/|A|, where M and |A| are the size of the buffer and the size of the action set, respectively. If the number of samples for a given action exceeds this limit, a new sample will replace a randomly selected old one.
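A minimal sketch of such a trainer-level replay memory with the per-action cap M/|A| is given below. The class and method names are hypothetical and the Q-network update is omitted; the point is only the capped, action-balanced buffer described above.

import random
from collections import defaultdict

class ActionBalancedReplay:
    """Trainer replay memory that caps each discrete action at M/|A| samples,
    so one dominant action cannot flood the buffer (illustrative sketch)."""

    def __init__(self, total_size_M, action_set):
        self.per_action_cap = max(1, total_size_M // len(action_set))
        self.samples = defaultdict(list)   # action -> list of (state, reward, next_state)

    def add(self, action, transition):
        bucket = self.samples[action]
        if len(bucket) < self.per_action_cap:
            bucket.append(transition)
        else:
            # Replace a randomly selected old sample of the same action.
            bucket[random.randrange(self.per_action_cap)] = transition

    def sample_batch(self, batch_size):
        pool = [(a, tr) for a, bucket in self.samples.items() for tr in bucket]
        return random.sample(pool, min(batch_size, len(pool)))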
The pseudo code of the uni-head intelligent trainer is shown in Algorithm 2, with the detailed implementation of the sampling reset procedure in the real/cyber environment shown in Algorithm 1.

Algorithm 1 Sampling Reset Procedure
1: if the current sampling environment is the real environment then
2:   Initialize data set D = ∅ and quality set G = ∅.
3:   for i = 1 : M1 do
4:     Generate one initial state s0 and compute its quality Φ(s0).
5:     Append s0 to D and append Φ(s0) to G.
6:     if i > M2 and Φ(s0) ≥ max(G) then
7:       Break.
8:     end if
9:   end for
10:  Return the last state of D.
11: else
12:   if u_[0,1] < a1 then
13:     Randomly select a state s from the real data memory.
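For reference, a Python rendering of the reset procedure is sketched below. It follows Algorithm 1 for the real-environment branch and Eq. (2) for the cyber-environment branch; the helper names sample_random_initial_state and quality_phi, and the real_data_memory sequence, are assumptions for illustration.

import random

def sampling_reset(env_is_real, a1, sample_random_initial_state, quality_phi,
                   real_data_memory, M1=50, M2=5):
    """Choose the starting state for a new episode (sketch of Algorithm 1)."""
    if env_is_real:
        D, G = [], []
        for i in range(1, M1 + 1):
            s0 = sample_random_initial_state()
            phi = quality_phi(s0)                      # Eq. (1)
            D.append(s0)
            G.append(phi)
            if i > M2 and phi >= max(G):               # early stop on a best-so-far state
                break
        return D[-1]                                   # last generated state
    else:
        # Cyber environment, Eq. (2): real-buffer start with probability a1,
        # random start otherwise.
        if random.random() < a1:
            return random.choice(real_data_memory)
        return sample_random_initial_state()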
The idea of reference sampling is to select the best trainer and then use its target controller for the other trainers to sample real data with a probability p_ref. In our algorithm, at the first of every three steps, p_ref is forced to 0. As such, this first step, without reference sampling taking place, serves as an evaluation step for the trainer. In the next two steps, p_ref is determined by the min function in the following equation:

p_{ref} = \begin{cases} 0, & \text{if } \mathrm{mod}(t^{\Xi}, 3) == 0 \\ \min\left\{\frac{\phi - \phi_{\min}}{\phi_{\max} - \phi_{\min}}, 1\right\}, & \text{otherwise,} \end{cases}   (7)

where t^Ξ is the current step number of the trainers, and φ is the skewness ratio, which measures the degree of outperformance of the best trainer; φ_max and φ_min are its estimated upper and lower bounds, respectively. The details of φ are given in the weight transfer procedure below. With such a design, the better the performance of the best trainer, the higher the p_ref that will be used.

3) Order-based Trainer Reward Calculation: The rewards of the trainers in the ensemble trainer are designed by ordering the performance of the different trainers. After the training process of the target controllers of all trainers, for each trainer we calculate the average sampling reward of its corresponding target controller r̄_i^ξ as the raw reward of this trainer. Note that r̄_i^ξ is different from the sign reward used in (5). Next, we sort the tuple (r̄_0^ξ, r̄_1^ξ, r̄_2^ξ) in ascending order. We then define the index of r̄_i^ξ in the sorted tuple as the reward r̂_i^Ξ of trainer i. The rationale is that if the action of a trainer is good for training, it should help the trainer achieve better performance (measured by the average sampling reward).

Note that with the above reward design, the trainers will generate three data samples at the trainer level in each step, and all these data will be used to update the intelligent trainer. Due to the reference sampling mechanism, the order information may not correctly measure the performance of the trainers. To solve this issue, we throw away these samples when p_ref is not zero.

4) Weight Transfer: After collecting the trainer reward data, we add a weight transfer mechanism to solve the issue that some target agents may fail due to unfavorable trainer actions. The rationale is that, after collecting the reward information for a sufficiently large number of steps, we can judge which trainer is currently the best one with high confidence. In this case, we can transfer the best target agent to the other trainers, so that those trainers that fall behind can restart from a good position. In particular, after the trainer reward data are collected, we examine the number of steps n_c that have been taken since the last weight transfer. If n_c is larger than a threshold C, we compute an accumulative reward for each trainer over the last n_c steps as

R_i(t^{\Xi}) = \sum_{j \in \{n_c - 1, \ldots, 0\}} \hat{r}^{\Xi}_i(t^{\Xi} - j),   (8)

where t^Ξ is the index of the current trainer step. The trainer with the maximum R_i is set as the best trainer. We then examine whether the DQN trainer is the best; if not, we transfer the weight parameters of the target controller trained by the best trainer to the target controller trained by the DQN trainer.

We also utilize the accumulated trainer reward to detect whether the best trainer is significantly better than the other trainers. We calculate a performance skewness ratio to measure the degree of outperformance of the best trainer:

\phi = \frac{R_b - R_m}{R_b - R_w},   (9)

where R_b, R_m and R_w are the best, median, and worst R_i of the three trainers, respectively. The skewness ratio is used to determine p_ref as shown above.

Algorithm 3 shows the operational flow of the ensemble trainer. In summary, the ensemble trainer evaluates the quality of the actions by sorting the rewards received by the target controllers. It maintains the training quality by the memory sharing scheme, without incurring additional sampling cost. It maintains the sample quality by reference sampling. It can recover an underperforming trainer from poor actions. Though saving on the sampling cost, the ensemble trainer requires three times the training time. The increased training time can be partially reduced by the early stopping of some underperforming trainers when necessary.
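The helpers below sketch the order-based trainer reward, the accumulated reward and skewness ratio of Eqs. (8)-(9), and the p_ref schedule of Eq. (7). They are an illustrative reading of the ensemble mechanics described above, with assumed data structures (plain lists of per-trainer rewards) and assumed default bounds φ_min = 0 and φ_max = 1, not the released implementation.

import numpy as np

def order_based_rewards(raw_rewards):
    # Order-based trainer reward: each trainer's reward is the index of its
    # average sampling reward in the ascending sort (0 = worst, 2 = best).
    order = np.argsort(raw_rewards)            # trainer indices, worst to best
    ranks = np.empty(len(raw_rewards), dtype=int)
    ranks[order] = np.arange(len(raw_rewards))
    return ranks.tolist()

def accumulated_rewards(reward_history, n_c):
    # Eq. (8): sum each trainer's order-based rewards over the last n_c steps.
    # reward_history is a list over steps of per-trainer reward lists.
    recent = np.array(reward_history[-n_c:])
    return recent.sum(axis=0)

def skewness_ratio(R):
    # Eq. (9): how far the best trainer is from the median, relative to the
    # best-worst spread (assumes three trainers).
    R_sorted = np.sort(R)
    R_w, R_m, R_b = R_sorted[0], R_sorted[len(R) // 2], R_sorted[-1]
    return (R_b - R_m) / max(R_b - R_w, 1e-8)

def reference_sampling_prob(step, phi, phi_min=0.0, phi_max=1.0):
    # Eq. (7): every third step is a pure evaluation step (p_ref = 0);
    # otherwise p_ref grows with the normalized skewness ratio.
    if step % 3 == 0:
        return 0.0
    return min((phi - phi_min) / max(phi_max - phi_min, 1e-8), 1.0)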
IV. NUMERICAL EVALUATIONS

In this section, we evaluate the proposed intelligent trainer and ensemble trainer on five different tasks (or cases) of OpenAI gym: Pendulum (V0), Mountain Car (Continuous V0), Reacher (V1), Half Cheetah ([15]), and Swimmer (V1).

A. Experiment Configuration

For the five test cases, different target controllers with promising published results are used: DDPG for Pendulum and Mountain Car; TRPO for Reacher, Half Cheetah, and Swimmer. The well-tuned parameters of the open-sourced codes [16], [17] are used for the hyper-parameter settings of the target controller (including Kr and Tr, as defined in Section II). Simple neural networks built with the guidelines provided in [25] are used for the cyber models. As our experiments have shown, it is very useful to normalize both the input and output of the dynamics model. In this paper, we use the normalization method provided by [16], in which the mean and standard deviation of the data are updated during the training process. For the hyper-parameters M1 and M2 used in the reset procedure in Algorithm 1, we set M1 = 50 and M2 = 5, i.e., the maximum and minimum numbers of trials are 50 and 5, respectively.
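A minimal running-normalization helper in the spirit of that scheme is sketched below; the streaming mean/std update shown (Welford's method) is an assumption made for illustration rather than the exact routine of [16].

import numpy as np

class RunningNormalizer:
    """Keeps a running mean/std of observed data and normalizes the inputs and
    outputs of the dynamics model with them (illustrative sketch)."""

    def __init__(self, dim, eps=1e-8):
        self.count = eps
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)      # running sum of squared deviations
        self.eps = eps

    def update(self, batch):
        # Streaming update of mean and variance with a batch of samples.
        for x in np.atleast_2d(batch):
            self.count += 1.0
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / self.count) + self.eps
        return (x - self.mean) / std

    def denormalize(self, x):
        std = np.sqrt(self.m2 / self.count) + self.eps
        return x * std + self.mean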
B. Comparison of Uni-Head Intelligent Trainer with Baseline Algorithms

Multiple variants of the uni-head intelligent trainer are compared with baseline algorithms. There are three baseline algorithms and four intelligent trainers. Their designs are summarized in Table I. The three baseline algorithms are:
• The NoCyber trainer is a standard DRL training process without using cyber data.
TABLE I
CONFIGURATIONS OF DIFFERENT ALGORITHMS.

Baseline algorithms:
  NoCyber:           Trainer type None,      Action (1, 0, 0),                      Data source Real,         Memory size -,    TPE state -
  Fixed:             Trainer type None,      Action (0.6, 0.6, 0.6),                Data source Real & Cyber, Memory size -,    TPE state -
  Random:            Trainer type None,      Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size -,    TPE state -
Intelligent trainers:
  DQN:               Trainer type DQN,       Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size 32,   TPE state Constant
  DQN-5 actions:     Trainer type DQN,       Action ai ∈ {0.2, 0.4, 0.6, 0.8, 1.0}, Data source Real & Cyber, Memory size 32,   TPE state Constant
  DQN-larger memory: Trainer type DQN,       Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size 2000, TPE state Constant
  REINFORCE:         Trainer type REINFORCE, Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size -,    TPE state Constant
  DQN-TPE V1:        Trainer type DQN,       Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size 32,   TPE state Last sampling reward
  DQN-TPE V2:        Trainer type DQN,       Action ai ∈ {0.2, 1.0},                Data source Real & Cyber, Memory size 32,   TPE state Real sample count
Fig. 5. Accumulative rewards of different uni-head trainer designs for different tasks in (a)-(e). The curves show the average accumulative reward while the shaded region shows the standard deviation of the reward in ten independent runs. The proposed uni-head trainer shows its adaptability (better than the Fixed trainer) but may fail in certain cases like Swimmer. (f) shows the mean action a2 taken by the DQN trainer on the Mountain Car, Reacher, and Swimmer tasks.
Fig. 6. Accumulative rewards of the ensemble trainer for different tasks in (a)-(e). The proposed ensemble design shows close-to-optimal or even better performance in all cases. (f) shows the mean action a2 taken by the DQN trainer in the ensemble trainer for Mountain Car, Reacher, and Swimmer.
Fig. 7. Accumulative reward of the different individual trainers of the ensemble trainer on (a) Mountain Car, (b) Reacher, and (c) Swimmer. The trainers' performance tends to converge, except for certain extremely under-performing trainers.
TABLE IV
SAMPLING SAVING TO ACHIEVE CERTAIN PREDEFINED PERFORMANCE OF THE ENSEMBLE TRAINER. THE BASELINE COST IS THE EXPECTED COST OF THE THREE ALGORITHMS NOCYBER, RANDOM TRAINER AND DQN TRAINER.

[Fig. 8: accumulative reward on Swimmer for the Ensemble, Ensemble without memory sharing, and Ensemble without reference sampling designs.]
In these cases, we can observe that within the ensemble, the originally good trainer (uni-head) still performs very well. For example, for the Mountain Car task, the Random trainer performs almost as well as the uni-head Random trainer. For the Swimmer task, the DQN trainer can now perform as well as the NoCyber trainer, which shows that the weight transfer process is working as expected.

To further examine the effect of memory sharing and reference sampling, in Fig. 8 we compare the performance of three different ensemble designs for the task of Swimmer. All of them comprise the same three trainers, DQN, Random, and NoCyber, but differ in the incorporated schemes: the ensemble trainer (with memory sharing and reference sampling); the ensemble trainer without memory sharing (with reference sampling); and the ensemble trainer without reference sampling (with memory sharing). All these variants use weight transfer. The results show that, without memory sharing, the ensemble performance degrades. This is because each of the three intelligent trainers uses only one-third of the original data samples (which is why its curve stops at 1/3 of the others on the x-axis). Without reference sampling, the ensemble performs very similarly to the DQN trainer (Fig. 5). This is because, without reference sampling, most of the real data samples come from the underperforming target controllers of the DQN and Random trainers. The data from underperforming target controllers deteriorate the learning process of the NoCyber trainer. These results indicate that memory sharing and reference sampling are essential for the ensemble trainer.

V. RELATED WORKS

To build intelligent agents that can learn to accomplish various control tasks, researchers have been actively studying reinforcement learning for decades, e.g., [18]-[22]. With the recent advancement of deep learning, DRL [2] has demonstrated its strength in various applications. For example, in [23] a DRL agent is proposed to solve financial trading tasks; in [24] a neural RL agent is trained to mimic human motor skill learning; in [25] an off-policy RL method is proposed to solve nonlinear and nonzero-sum games. Our research is particularly focused on model-based RL, which can be utilized to reduce the sampling cost of RL, and we propose an AutoML method for it. In the following, we briefly review the recent development of model-based RL and the AutoML studies.

A. Model-based RL

Despite the significant performance improvements, the high sampling cost necessitated by RL has become a significant issue in practice. To address this issue, MBRL is introduced to learn the system dynamics model, so as to reduce the data collection and sampling cost. In [7] the authors provided an MBRL method for a robot controller that samples from both the real physical environment and a learned cyber emulator. In [26] the authors adapted a model, previously trained for other tasks, to train the controller for a new but similar task. This approach combines prior knowledge with online adaptation of the dynamics model, and thus achieves better performance. In these approaches, the number of samples taken from the cyber environment to train the target controller is either predetermined or can only be adjusted manually, resulting in both sampling inefficiency and additional algorithm tuning cost. In [27] the authors proposed a model-assisted bootstrapped DDPG algorithm, which uses a variance ratio computed from the multiple heads of the critic network to decide whether the cyber data can be used or not. The method relies on the bootstrapped DQN design, and is thus not suitable for other cases.

Instead of treating the cyber model as a data source for training, some approaches use the cyber model to conduct pre-trial tree searches in applications for which selecting the right action is highly critical [8] [9]. The cyber model can prevent the selection of unfavorable actions and thus accelerates the learning of the optimal policy. In [10], the authors introduced a planning agent and a manager that decides whether to sample from the cyber engine or to take actions, so as to minimize the training cost. Both approaches focus on tree search in action selection, which differs from our design, in which we aim to select the proper data source in sampling. Some recent works investigate integrating model-based and model-free approaches in RL. In [28] the authors combined model-based and model-free approaches for Building Optimization and Control (BOC), where a simulator is used to train the agent, while a real-world test-bed is used to evaluate the agent's performance. In [29] model-based DRL is used to train a controller agent, which is then used to provide weight initialization for a model-free DRL approach, so as to reduce the training cost. Different from this approach, we focus on directly sampling from the model to reduce the sampling cost in the real environment.

B. AutoML

The method proposed in this paper is a typical AutoML [30] solution. AutoML aims to develop algorithms that can automatically train a high-performance machine learning model without human intervention, covering tasks such as hyper-parameter tuning and model selection. AutoML has been proposed to solve various specific training tasks such as model compression for mobile devices [31], transfer learning [32], and general neural network training [33].

Note that most AutoML solutions are proposed for supervised learning cases, in which the dataset is usually pre-acquired. In our case, as the data are collected by the target controller being trained, the problem demands an AutoML solution even more than a general supervised learning case does.

VI. CONCLUSION

In this paper we propose an intelligent trainer for online model training and sampling-setting learning for MBRL algorithms. The proposed approach treats the training process of MBRL as the target system to optimize, and uses a trainer that monitors and optimizes the sampling and training process in MBRL. The proposed trainer solution can be used in practical applications to reduce the sampling cost while achieving close-to-optimal performance.

For future work, the proposed trainer framework can be further improved by adding more control actions to ease the algorithm tuning cost. An even more advanced design is to use