DRL Using Genetic Algorithms
Abstract—Reinforcement learning (RL) enables agents to make decisions based on a reward function. However, in the process of learning, the choice of values for learning algorithm parameters can significantly impact the overall learning process. In this paper, we use a genetic algorithm (GA) to find the values of parameters used in Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), to help speed up the learning agent. We used this method on fetch-reach, slide, push, pick and place, and door-opening robotic manipulation tasks. Our experimental evaluation shows that our method leads to better performance, faster than the original algorithm.

Adarsh Sehgal, Hai Nguyen, and Dr. Hung La are with the Advanced Robotics and Automation (ARA) Laboratory. Dr. Sushil Louis is a professor in the Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA. Corresponding author: Hung La, email: [email protected]

This material is based upon work supported by the National Aeronautics and Space Administration (NASA) Grant No. NNX15AI02H issued through the NVSGC-RI program under sub-award No. 19-21, the RID program under sub-award No. 19-29, and the NVSGC-CD program under sub-award No. 18-54. This work is also partially supported by the Office of Naval Research under Grant N00014-17-1-2558.

I. INTRODUCTION

Q-learning methods have been applied to a variety of tasks by autonomous robots [1], and much research has been done in this field starting many years ago [2], with some work specific to continuous action spaces [3]–[6] and other work on discrete action spaces [7]. Reinforcement Learning (RL) has been applied to locomotion [8], [9] and also to manipulation [10], [11].

Much work specific to robotic manipulators also exists [12], [13]. Some of this work used fuzzy wavelet networks [14], while other work used neural networks to accomplish the tasks [15], [16]. Off-policy algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [17] and the Normalized Advantage Function algorithm (NAF) [18] are helpful for real robot systems. A complete review of recent deep reinforcement learning methods for robot manipulation is given in [19]. We specifically use DDPG combined with Hindsight Experience Replay (HER) [20] for our experiments. Recent work on using experience ranking to improve the learning speed of DDPG + HER was reported in [21].

The main contribution of this paper is a demonstration of better final performance at several manipulation tasks using a Genetic Algorithm (GA) to find DDPG and HER parameter values that lead more quickly to better performance at these tasks. Our experiments revealed that learning algorithm parameters are non-linearly related to task performance and learning speed; the success rate can vary significantly based on the values of the parameters used in RL. In the following sections, we describe the manipulation tasks, the DDPG + HER algorithms, and the parameters that affect performance for these algorithms. Initial experimental results showing performance and speed gains when using a GA to search for good parameter values then provide evidence that GAs find good parameter values leading to better task performance, faster.

The paper is organized as follows: Section 2 presents related work. Section 3 describes the DDPG + HER algorithms. Section 4 describes the GA used to find the parameter values. Section 5 then describes our learning tasks, our experiments, and our experimental results. The last section provides conclusions and possible future research.

II. RELATED WORK

RL has been widely used in training/teaching both single robots [22], [23] and multi-robot systems [24]–[28]. Previous work has also been done on both model-based and model-free learning algorithms. Applying model-based learning algorithms to real-world scenarios relies significantly on a model-based teacher to train deep network policies.

Similarly, there is also much work on GAs [29], [30] and on the GA operators of crossover and mutation [31], applied to a variety of problems. GAs have been specifically applied to a variety of RL problems [31]–[34].

In this paper, we use model-free RL with continuous action spaces and deep neural networks. Our work builds on existing work that applies the same techniques to robotic manipulators [17], [20]. Specifically, we use a GA to search for good DDPG + HER algorithm parameters and compare the resulting success rates against those obtained with the original parameter values [35]. DDPG + HER, an RL algorithm that uses deep neural networks in continuous action spaces, has been successfully applied to robotic manipulation tasks, and our GA improves on this work by finding learning algorithm parameters that need fewer epochs (one epoch is a single pass through the full training set) to reach better task performance.

III. BACKGROUND

A. Reinforcement Learning

Consider a standard RL setup consisting of a learning agent that interacts with an environment. The environment can be described by a set of variables where S is the set of states, A is the set of actions, p(s0) is a distribution over initial states, r : S × A → R is the reward function, p(st+1|st, at) are the transition probabilities, and γ ∈ [0, 1] is a discount factor.
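As a concrete illustration of this setup, the short sketch below runs the standard agent-environment interaction loop and accumulates the discounted return defined in the next paragraph. The toy environment, its dynamics and rewards, the fixed horizon, and the random placeholder policy are our own assumptions for illustration only; they are not part of the DDPG + HER implementation used in this paper.

import random

class ToyEnv:
    """Placeholder environment; states, rewards, and horizon are made up for illustration."""
    def reset(self):
        self.t = 0
        return 0.0                                  # initial state s_0 ~ p(s_0)
    def step(self, action):
        self.t += 1
        next_state = random.random()                # s_t+1 ~ p(.|s_t, a_t)
        reward = -abs(action - next_state)          # r_t = r(s_t, a_t)
        return next_state, reward, self.t >= 50     # episode ends after a fixed horizon

env, gamma, ret = ToyEnv(), 0.98, 0.0
state, done, t = env.reset(), False, 0
while not done:
    action = random.random()                        # placeholder policy pi(s_t)
    state, reward, done = env.step(action)
    ret += (gamma ** t) * reward                    # discounted return R_0 = sum_t gamma^t r_t
    t += 1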
A deterministic policy maps from states to actions: π : S → A. The beginning of every episode is marked by sampling an initial state s0. For each timestep t, the agent performs an action based on the current state: at = π(st). The performed action earns a reward rt = r(st, at), and the distribution p(·|st, at) is used to sample the environment's new state. The total return is Rt = Σ_{i=t}^{∞} γ^{i−t} ri. The agent's goal is to maximize its expected return E[Rt | st, at]. An optimal policy, denoted π*, can be defined as any policy π* such that Qπ*(s, a) ≥ Qπ(s, a) for every s ∈ S, a ∈ A, and any policy π. All optimal policies share the same Q-function, called the optimal Q-function Q*, which satisfies the Bellman equation:

Q*(s, a) = E_{s′∼p(·|s,a)}[r(s, a) + γ max_{a′∈A} Q*(s′, a′)].    (1)
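To make Equation (1) concrete, the toy sketch below performs repeated Bellman-optimality backups on a tabular Q-function for a tiny, fully known MDP. The transition table, rewards, and variable names are our own illustrative assumptions and are not part of the DDPG + HER implementation.

import numpy as np

# Toy MDP: 3 states, 2 actions, deterministic transitions and rewards (illustrative only).
P = np.array([[1, 2], [2, 0], [2, 2]])              # P[s, a] = next state s'
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # R[s, a] = reward r(s, a)
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(100):                                # value iteration via repeated backups
    for s in range(3):
        for a in range(2):
            # Deterministic case of Eq. (1): Q*(s, a) = r(s, a) + gamma * max_a' Q*(s', a')
            Q[s, a] = R[s, a] + gamma * np.max(Q[P[s, a]])

pi = Q.argmax(axis=1)                               # greedy (optimal) policy w.r.t. Q*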
B. Deep Q-Networks (DQN)

A Deep Q-Network (DQN) [36] is a model-free reinforcement learner designed for discrete action spaces. In a DQN, a neural network Q is maintained that approximates Q*. πQ(s) = argmax_{a∈A} Q(s, a) denotes the greedy policy with respect to Q. An ε-greedy policy takes a random action with probability ε and the action πQ(s) with probability 1 − ε.

Episodes are generated during training using an ε-greedy policy. A replay buffer stores the transition tuples (st, at, rt, st+1) experienced during training. Neural network training is interlaced with the generation of new episodes. Training minimizes the loss L = E[(Q(st, at) − yt)²], where yt = rt + γ max_{a′∈A} Q(st+1, a′) and the tuples (st, at, rt, st+1) are sampled from the replay buffer.

The targets yt are computed using a target network, which changes at a slower pace than the main network. The weights of the target network can periodically be set to the current weights of the main network [36], or Polyak-averaged parameters [37] can be used.
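The sketch below illustrates the three DQN ingredients just described: ε-greedy action selection, a replay buffer of transition tuples, and the target computation yt = rt + γ max_{a′} Q(st+1, a′). The stand-in Q-function, buffer size, and all names are our own assumptions, not those of a particular DQN implementation.

import random
from collections import deque
import numpy as np

replay_buffer = deque(maxlen=100_000)     # stores (s_t, a_t, r_t, s_t+1) tuples

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, greedy action pi_Q(s) otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def td_targets(batch, target_network, gamma):
    """y_t = r_t + gamma * max_a' Q_target(s_t+1, a'); the loss is E[(Q(s_t, a_t) - y_t)^2]."""
    return np.array([r + gamma * np.max(target_network(s_next))
                     for (_, _, r, s_next) in batch])

# Example usage with a stand-in Q-function over 4 discrete actions.
fake_q = lambda s: np.zeros(4)
replay_buffer.append((0.0, 1, 0.5, 1.0))
batch = random.sample(list(replay_buffer), k=1)
print(epsilon_greedy(fake_q(0.0), epsilon=0.1), td_targets(batch, fake_q, gamma=0.98))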
C. Deep Deterministic Policy Gradients (DDPG)

In Deep Deterministic Policy Gradients (DDPG), there are two neural networks: an actor and a critic. The actor network is a target policy π : S → A, and the critic network is an action-value function approximator Q : S × A → R. The critic network Q(s, a|θQ) and the actor network µ(s|θµ) are randomly initialized with weights θQ and θµ.

A behavioral policy, a noisy variant of the target policy, πb(s) = π(s) + N(0, 1), is used to generate episodes. The critic network is trained like the Q-function in DQN, but the target yt is computed as yt = rt + γQ(st+1, π(st+1)), where γ is the discounting factor. The loss La = −Es[Q(s, π(s))] is used to train the actor network.
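As a rough sketch of these two updates, assuming PyTorch and small placeholder actor/critic networks (the architectures, dimensions, and learning rates below are our own assumptions, not the paper's settings), the critic is regressed toward yt = rt + γQ(st+1, π(st+1)) and the actor is updated to minimize La = −E[Q(s, π(s))]:

import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 10, 4, 0.98
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One DDPG gradient step on a mini-batch of transitions (s, a, r, s_next)."""
    # Critic target y = r + gamma * Q(s', pi(s')); the full algorithm computes this
    # with slowly updated target networks (see Eq. (2) in Section IV).
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # L_a = -E[Q(s, pi(s))]
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call with a random mini-batch of 32 transitions.
B = 32
ddpg_update(torch.randn(B, obs_dim), torch.randn(B, act_dim),
            torch.randn(B, 1), torch.randn(B, obs_dim))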
D. Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) tries to mimic the human ability to learn from failure. The agent learns from all episodes, even when it does not reach the original goal: whatever state the agent does reach, HER treats as a modified goal. Standard experience replay only stores the transition (st||g, at, rt, st+1||g) with the original goal g; HER also stores the transition (st||g′, at, r′t, st+1||g′) with the modified goal g′. HER does well with extremely sparse rewards and performs significantly better with sparse rewards than with shaped ones.
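A minimal sketch of this goal-relabeling idea is shown below. It assumes goal-conditioned transitions stored as (state, goal, action, reward, next_state) tuples, a sparse reward of 0 on success and -1 otherwise, and relabeling with the final achieved state as the hindsight goal; these conventions and names are our assumptions for illustration, not the exact scheme used by the HER implementation.

def sparse_reward(achieved_goal, goal, eps=0.05):
    """Illustrative sparse reward: 0 if the achieved goal is close enough to the goal, else -1."""
    return 0.0 if abs(achieved_goal - goal) < eps else -1.0

def her_store(buffer, episode, original_goal):
    """Store each transition with the original goal g, then again with a hindsight goal g'."""
    hindsight_goal = episode[-1]["achieved_goal"]      # state actually reached at episode end
    for tr in episode:
        # Standard replay: transition (s_t || g, a_t, r_t, s_t+1 || g)
        buffer.append((tr["s"], original_goal, tr["a"],
                       sparse_reward(tr["achieved_goal"], original_goal), tr["s_next"]))
        # HER relabeling: pretend the achieved state was the goal all along (g')
        buffer.append((tr["s"], hindsight_goal, tr["a"],
                       sparse_reward(tr["achieved_goal"], hindsight_goal), tr["s_next"]))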
E. Genetic Algorithm (GA)

Genetic Algorithms (GAs) [29], [38], [39] were designed to search poorly understood spaces, where exhaustive search may not be feasible and where other search approaches perform poorly. When used as function optimizers, GAs try to maximize a fitness tied to the optimization objective. Evolutionary computing algorithms in general, and GAs specifically, have had much empirical success on a variety of difficult design and optimization problems. They start with a randomly initialized population of candidate solutions, typically encoded as strings (chromosomes). A selection operator focuses search on promising areas of the search space, while crossover and mutation operators generate new candidate solutions. We explain our specific GA in the next section.
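The sketch below shows the generic GA loop just described: selection, crossover, and mutation over a population of real-valued chromosomes. The fitness function, population size, chromosome length, and operator rates are placeholder assumptions and are not the settings used in this paper.

import random

GENES, POP, GENERATIONS = 6, 20, 10   # placeholder sizes, not the paper's settings

def fitness(chromosome):
    """Placeholder fitness; in this paper's setting it would reflect task success rate."""
    return -sum((g - 0.5) ** 2 for g in chromosome)

def select(pop):                      # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def crossover(p1, p2):                # single-point crossover
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(c, rate=0.1):              # per-gene mutation, keeping genes in [0, 1]
    return [random.random() if random.random() < rate else g for g in c]

population = [[random.random() for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]
best = max(population, key=fitness)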
IV. DDPG + HER AND GA

In this section, we present the primary contribution of our paper: a genetic algorithm that searches through the space of parameter values used in DDPG + HER for values that maximize task performance and minimize the number of training epochs. We target the following parameters: the discounting factor γ; the Polyak-averaging coefficient τ [37]; the learning rate for the critic network αcritic; the learning rate for the actor network αactor; the percentage of times a random action is taken ε; and the standard deviation of the Gaussian noise added to actions that are not completely random, expressed as a percentage of the maximum absolute value of the actions on the different coordinates, η. The range of all the parameters is 0–1, which can be justified using the equations that follow in this section.

Our experiments show that adjusting the values of these parameters did not increase or decrease the agent's learning in a linear or easily discernible pattern, so a simple hill climber will probably not do well in finding optimized parameters. Since GAs were designed for such poorly understood problems, we use our GA to optimize these parameter values.

Specifically, we use τ, the Polyak-averaging coefficient, to show the performance non-linearity across values of τ. τ is used in the algorithm as shown in Equation (2):

θQ′ ← τ θQ + (1 − τ) θQ′,
θµ′ ← τ θµ + (1 − τ) θµ′.    (2)

Equation (3) shows how γ is used in the DDPG + HER algorithm, while Equation (4) describes the Q-Learning update. α denotes the learning rate. Networks are trained based on this update equation.

yi = ri + γ Q′(si+1, µ′(si+1 | θµ′) | θQ′),    (3)

Q(st, at) ← Q(st, at) + α[rt+1 + γQ(st+1, at+1) − Q(st, at)].    (4)
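To connect Equation (2) with the parameter search, the sketch below shows a soft target-network update with coefficient τ and a chromosome holding the six searched parameters, each in [0, 1], together with a fitness stub. The train_fn routine and its keyword names are placeholder assumptions standing in for a full DDPG + HER training run; they are not the paper's actual interface.

def polyak_update(target_weights, main_weights, tau):
    """Soft update from Eq. (2): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(main_weights, target_weights)]

# Chromosome layout: [gamma, tau, alpha_critic, alpha_actor, epsilon, eta], each in [0, 1].
def fitness(chromosome, train_fn):
    """Placeholder fitness: train_fn would run DDPG + HER with these parameter values
    for a fixed budget of epochs and return the resulting task success rate."""
    gamma, tau, alpha_critic, alpha_actor, epsilon, eta = chromosome
    return train_fn(gamma=gamma, tau=tau, lr_critic=alpha_critic,
                    lr_actor=alpha_actor, random_eps=epsilon, noise_eps=eta)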
[Figure: (a) Optimal parameters over 10 runs vs. original parameters]