
Deep Reinforcement Learning using Genetic Algorithm for Parameter Optimization


Adarsh Sehgal, Hung Manh La, Sushil J. Louis, Hai Nguyen

Abstract—Reinforcement learning (RL) enables agents to take decisions based on a reward function. However, in the process of learning, the choice of values for learning algorithm parameters can significantly impact the overall learning process. In this paper, we use a genetic algorithm (GA) to find the values of parameters used in Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), to help speed up the learning agent. We used this method on fetch-reach, slide, push, pick and place, and door opening robotic manipulation tasks. Our experimental evaluation shows that our method leads to better performance, faster than the original algorithm.

Adarsh Sehgal, Hai Nguyen and Dr. Hung La are with the Advanced Robotics and Automation (ARA) Laboratory. Dr. Sushil Louis is a professor in the Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA. Corresponding author: Hung La, email: [email protected]. This material is based upon work supported by the National Aeronautics and Space Administration (NASA) Grant No. NNX15AI02H issued through the NVSGC-RI program under sub-award No. 19-21, the RID program under sub-award No. 19-29, and the NVSGC-CD program under sub-award No. 18-54. This work is also partially supported by the Office of Naval Research under Grant N00014-17-1-2558.

I. INTRODUCTION

Q-learning methods have been applied to a variety of tasks by autonomous robots [1], and much research has been done in this field starting many years ago [2], with some work specific to continuous action spaces [3]–[6] and others on discrete action spaces [7]. Reinforcement learning (RL) has been applied to locomotion [8], [9] and also to manipulation [10], [11].

Much work specific to robotic manipulators also exists [12], [13]. Some of this work used fuzzy wavelet networks [14], while other work used neural networks to accomplish the tasks [15], [16]. Off-policy algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [17] and the Normalized Advantage Function algorithm (NAF) [18] are helpful for real robot systems. A complete review of recent deep reinforcement learning methods for robot manipulation is given in [19]. We are specifically using DDPG combined with Hindsight Experience Replay (HER) [20] for our experiments. Recent work on using experience ranking to improve the learning speed of DDPG + HER was reported in [21].

The main contribution of this paper is a demonstration of better final performance at several manipulation tasks using a genetic algorithm (GA) to find DDPG and HER parameter values that lead more quickly to better performance at these tasks. Our experiments revealed that learning algorithm parameters are non-linearly related to task performance and learning speed: the success rate can vary significantly based on the values of the parameters used in RL. In the following sections, we describe the manipulation tasks, the DDPG + HER algorithms, and the parameters that affect performance for these algorithms. Initial experimental results showing performance and speed gains when using a GA to search for good parameter values then provide evidence that GAs find good parameter values leading to better task performance, faster.

The paper is organized as follows. In Section 2, we present related work. Section 3 describes the DDPG + HER algorithms. In Section 4, we describe the GA being used to find the values of the parameters. Section 5 then describes our learning tasks, experiments, and experimental results. The last section provides conclusions and possible future research.

II. RELATED WORK

RL has been widely used in training/teaching both single robots [22], [23] and multi-robot systems [24]–[28]. Previous work has also been done on both model-based and model-free learning algorithms. Applying model-based learning algorithms to real-world scenarios relies significantly on a model-based teacher to train deep network policies.

Similarly, there is also much work on GAs [29], [30] and the GA operators of crossover and mutation [31], applied to a variety of problems. GAs have been specifically applied to a variety of RL problems [31]–[34].

In this paper, we use model-free RL with continuous action spaces and deep neural networks. Our work is built on existing work using the same techniques applied to robotic manipulators [17], [20]. Specifically, we use a GA to search for good DDPG + HER algorithm parameters and compare the resulting success rates against those obtained with the original parameter values [35]. DDPG + HER, an RL algorithm using deep neural networks in continuous action spaces, has been successfully used for robotic manipulation tasks, and our GA improves on this work by finding learning algorithm parameters that need fewer epochs (one epoch is a single pass through the full training set) to learn better task performance.

III. BACKGROUND

A. Reinforcement Learning

Consider a standard RL setup consisting of a learning agent which interacts with an environment. The environment can be described by a set of variables: S is the set of states, A is the set of actions, p(s_0) is a distribution of initial states, r : S × A → R is the reward function, p(s_{t+1} | s_t, a_t) are the transition probabilities, and γ ∈ [0, 1] is a discount factor.
A deterministic policy maps from states to actions, π : S → A. The beginning of every episode is marked by sampling an initial state s_0. For each timestep t, the agent performs an action based on the current state, a_t = π(s_t). The performed action receives a reward r_t = r(s_t, a_t), and the distribution p(·|s_t, a_t) is used to sample the environment's new state. The total return is R_t = Σ_{i=t}^{∞} γ^{i−t} r_i. The agent's goal is to maximize its expected return E[R_t | s_t, a_t]. An optimal policy π* can be defined as any policy such that Q^{π*}(s, a) ≥ Q^{π}(s, a) for every s ∈ S, a ∈ A and any policy π. All optimal policies have the same Q-function, called the optimal Q-function Q*, which satisfies the Bellman equation:

Q*(s, a) = E_{s′∼p(·|s,a)} [ r(s, a) + γ max_{a′∈A} Q*(s′, a′) ].    (1)
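As a concrete illustration of the Bellman optimality backup in Equation (1), the following is a minimal sketch (not from the paper) that iterates the backup to convergence on a small, randomly generated finite MDP; the state and action counts, the random dynamics, and the iteration budget are arbitrary choices made for this example.

import numpy as np

# Minimal illustration of the Bellman optimality backup in Equation (1)
# on a tiny, randomly generated finite MDP (hypothetical numbers).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.98

# p[s, a, s'] = transition probabilities, r[s, a] = expected reward.
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(2000):
    # Q*(s, a) = E_{s'} [ r(s, a) + gamma * max_{a'} Q*(s', a') ]
    Q_new = r + gamma * p @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print("Q* estimate:", np.round(Q, 3))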
B. Deep Q-Networks (DQN)

A Deep Q-Network (DQN) [36] is a model-free reinforcement learner designed for discrete action spaces. In a DQN, a neural network Q is maintained which approximates Q*. π_Q(s) = argmax_{a∈A} Q(s, a) denotes the greedy policy with respect to Q. An ε-greedy policy takes a random action with probability ε and the action π_Q(s) with probability 1 − ε. Episodes are generated during training using an ε-greedy policy. A replay buffer stores the transition tuples (s_t, a_t, r_t, s_{t+1}) experienced during training. Neural network training is interleaved with the generation of new episodes. The loss L = E[(Q(s_t, a_t) − y_t)^2] is minimized, where y_t = r_t + γ max_{a′∈A} Q(s_{t+1}, a′) and the tuples (s_t, a_t, r_t, s_{t+1}) are sampled from the replay buffer.

The target network, which is used to compute the targets y_t, changes at a slower pace than the main network. The weights of the target network can periodically be set to the current weights of the main network [36], or Polyak-averaged parameters [37] can be used.
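To make the data path concrete, here is a small sketch (not the paper's implementation) of ε-greedy action selection and of forming the targets y_t from a sampled minibatch; a random Q-table stands in for the Q-network, and the buffer holds synthetic transitions.

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, eps = 4, 2, 0.98, 0.3

Q = rng.random((n_states, n_actions))        # stand-in for the Q-network

def epsilon_greedy(state):
    # Random action with probability eps, greedy action otherwise.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

print("epsilon-greedy action from state 0:", epsilon_greedy(0))

# Replay buffer of (s_t, a_t, r_t, s_{t+1}) tuples (synthetic data here).
buffer = [(rng.integers(n_states), rng.integers(n_actions),
           rng.random(), rng.integers(n_states)) for _ in range(100)]

# Sample a minibatch and form the targets y_t = r_t + gamma * max_a' Q(s_{t+1}, a').
batch = [buffer[i] for i in rng.choice(len(buffer), size=8, replace=False)]
targets = [r + gamma * Q[s_next].max() for (_, _, r, s_next) in batch]
loss = np.mean([(Q[s, a] - y) ** 2 for (s, a, _, _), y in zip(batch, targets)])
print("minibatch TD loss:", round(float(loss), 4))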
C. Deep Deterministic Policy Gradients (DDPG)

In Deep Deterministic Policy Gradients (DDPG), there are two neural networks: an actor and a critic. The actor network is a target policy π : S → A, and the critic network is an action-value function approximator Q : S × A → R. The critic network Q(s, a | θ^Q) and the actor network µ(s | θ^µ) are randomly initialized with weights θ^Q and θ^µ.

A behavioral policy, a noisy variant of the target policy π_b(s) = π(s) + N(0, 1), is used to generate episodes. The critic network is trained like the Q-function in DQN, but the target y_t is computed as y_t = r_t + γQ(s_{t+1}, π(s_{t+1})), where γ is the discounting factor. The loss L_a = −E[Q(s, π(s))] is used to train the actor network.
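The sketch below (an illustration using assumed, tiny linear function approximators, not the networks used in the paper) highlights the difference from DQN: because the action space is continuous, the actor's output π(s_{t+1}) replaces the max over actions when forming the critic target.

import numpy as np

rng = np.random.default_rng(2)
state_dim, action_dim, gamma = 6, 3, 0.98

# Tiny linear stand-ins for the critic Q(s, a | theta_Q) and actor mu(s | theta_mu).
theta_Q = rng.normal(size=state_dim + action_dim)
theta_mu = rng.normal(size=(action_dim, state_dim))

critic = lambda s, a: float(np.concatenate([s, a]) @ theta_Q)
actor = lambda s: np.tanh(theta_mu @ s)            # actions bounded in [-1, 1]

s, r, s_next = rng.normal(size=state_dim), 1.0, rng.normal(size=state_dim)
a_behavioral = actor(s) + rng.normal(size=action_dim)   # noisy behavioral action

# Critic target: y_t = r_t + gamma * Q(s_{t+1}, pi(s_{t+1})) -- the actor replaces
# the max over actions used in DQN, since the action space is continuous.
y = r + gamma * critic(s_next, actor(s_next))

# Actor objective (to be maximized) is Q(s, pi(s)); its negation is the actor loss.
actor_loss = -critic(s, actor(s))
print("critic target:", round(y, 3), " actor loss:", round(actor_loss, 3))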
D. Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) tries to mimic the human ability to learn from failures. The agent learns from all episodes, even when it does not reach the original goal: whatever state the agent reaches, HER treats as a modified goal. Standard experience replay only stores the transition (s_t || g, a_t, r_t, s_{t+1} || g) with the original goal g; HER stores the transition (s_t || g′, a_t, r′_t, s_{t+1} || g′) with a modified goal g′ as well. HER does very well with extremely sparse rewards and is also significantly better with sparse rewards than with shaped ones.
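The following is a minimal sketch of the relabeling idea described above; the goal-sampling strategy, the placeholder actions, and the sparse reward with an arbitrary 0.05 tolerance are assumptions made for this example, not the paper's exact settings.

import numpy as np

rng = np.random.default_rng(3)

def reward(achieved, goal, tol=0.05):
    # Sparse reward: 0 when the achieved state is close to the goal, -1 otherwise
    # (a simplifying assumption for this sketch).
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

# A synthetic episode: achieved states and the original (unreached) goal g.
episode = [rng.random(3) for _ in range(10)]
g = rng.random(3)

replay = []
for t in range(len(episode) - 1):
    s, s_next = episode[t], episode[t + 1]
    a = rng.random(2)                                   # placeholder action
    # Standard replay: transition relative to the original goal g.
    replay.append((s, g, a, reward(s_next, g), s_next))
    # HER: additionally relabel with a goal g' actually achieved later in the episode.
    future = rng.integers(t + 1, len(episode))
    g_prime = episode[future]
    replay.append((s, g_prime, a, reward(s_next, g_prime), s_next))

print(len(replay), "stored transitions (half of them relabeled)")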
E. Genetic Algorithm (GA)

Genetic Algorithms (GAs) [29], [38], [39] were designed to search poorly understood spaces, where exhaustive search may not be feasible and where other search approaches perform poorly. When used as function optimizers, GAs try to maximize a fitness tied to the optimization objective. Evolutionary computing algorithms in general, and GAs specifically, have had much empirical success on a variety of difficult design and optimization problems. They start with a randomly initialized population of candidate solutions, typically encoded as strings (chromosomes). A selection operator focuses search on promising areas of the search space, while crossover and mutation operators generate new candidate solutions. We explain our specific GA in the next section.

IV. DDPG + HER AND GA

In this section, we present the primary contribution of our paper: the genetic algorithm searches through the space of parameter values used in DDPG + HER for values that maximize task performance and minimize the number of training epochs. We target the following parameters: the discounting factor γ; the polyak-averaging coefficient τ [37]; the learning rate for the critic network α_critic; the learning rate for the actor network α_actor; the percent of times a random action is taken ε; and the standard deviation of the Gaussian noise added to not-completely-random actions, as a percentage of the maximum absolute value of the actions on different coordinates, η. The range of all the parameters is 0-1, which can be justified using the equations in this section.

Our experiments show that adjusting the values of the parameters did not increase or decrease the agent's learning in a linear or easily discernible pattern, so a simple hill climber will probably not do well in finding optimized parameters. Since GAs were designed for such poorly understood problems, we use our GA to optimize these parameter values.

Specifically, we use τ, the polyak-averaging coefficient, to show the performance non-linearity for values of τ. τ is used in the algorithm as shown in Equation (2):

θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},
θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′}.    (2)

Equation (3) shows how γ is used in the DDPG + HER algorithm, while Equation (4) describes the Q-learning update; α denotes the learning rate. The networks are trained based on this update equation.

y_i = r_i + γ Q′(s_{i+1}, µ′(s_{i+1} | θ^{µ′}) | θ^{Q′}),    (3)

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ].    (4)
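Below is a minimal sketch of the soft target-network update of Equation (2), written exactly in the form given there (note that some implementations apply τ to the target weights and 1 − τ to the main weights, i.e., the complementary convention); the vector size and the pretend gradient steps are illustrative only.

import numpy as np

def polyak_update(main, target, tau):
    # Soft update following Equation (2): target <- tau * main + (1 - tau) * target.
    return tau * main + (1.0 - tau) * target

rng = np.random.default_rng(4)
theta_Q, theta_Q_target = rng.normal(size=8), np.zeros(8)

for _ in range(3):
    theta_Q = theta_Q + 0.01 * rng.normal(size=8)       # pretend gradient step
    theta_Q_target = polyak_update(theta_Q, theta_Q_target, tau=0.95)

print(np.round(theta_Q_target, 3))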
Fig. 2: Success rate vs. epochs for the FetchPush-v1 task when τ and γ are found using the GA. (a) Optimal parameters over 10 runs, vs. original. (b) Optimal parameters averaged over 10 runs, vs. original.

Since we have two kinds of networks, we need two learning rates: one for the actor network (α_actor) and another for the critic network (α_critic). Equation (5) explains the use of ε, the percent of times that a random action is taken:

a_t = a*_t               with probability 1 − ε,
      a random action    with probability ε.    (5)

Figure 1 shows that when the value of τ is modified, there is a change in the agent's learning, further emphasizing the need to use a GA. The original (untuned) value of τ in DDPG was set to 0.95, and we are using 4 CPUs. All values of τ are considered up to two decimal places, in order to see the change in success rate with the change in the value of the parameter. From the plots, we can clearly see that there is great scope for improvement over the original success rate.

Fig. 1: Success rate vs. epochs for various τ for the FetchPick&Place-v1 task.

Algorithm 1 explains the integration of DDPG + HER with a GA, which uses a population size of 30 over 30 generations. We use ranking selection [40] to select parents: parents are chosen probabilistically based on rank, which is in turn decided by their relative fitness (performance). Children are then generated using uniform crossover [41], and we apply flip mutation [39] with a mutation probability of 0.1. We use a binary chromosome to encode each parameter and concatenate the bits to form a chromosome for the GA. The six parameters are arranged in the order: polyak-averaging coefficient; discounting factor; learning rate for critic network; learning rate for actor network; percent of times a random action is taken; and standard deviation of Gaussian noise added to not-completely-random actions as a percentage of the maximum absolute value of actions on different coordinates.
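As an illustration of the exploration scheme, here is a small sketch that combines Equation (5) with the Gaussian noise scaled by η; the action dimension, the action bound, and the uniform random action are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(5)
action_dim, max_u = 4, 1.0          # assumed action dimension and absolute bound

def explore(proposed_action, eps=0.3, eta=0.2):
    # Gaussian noise with std = eta * max|action|, added to the proposed action.
    noisy = proposed_action + eta * max_u * rng.normal(size=action_dim)
    noisy = np.clip(noisy, -max_u, max_u)
    # Equation (5): with probability eps take a completely random action instead.
    if rng.random() < eps:
        return rng.uniform(-max_u, max_u, size=action_dim)
    return noisy

print(np.round(explore(np.zeros(action_dim)), 3))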
Fig. 3: Success rate vs. epochs for the FetchSlide-v1 task when τ and γ are found using the GA. (a) Optimal parameters over 2 runs, vs. original. (b) Optimal parameters averaged over 2 runs, vs. original.

Algorithm 1 DDPG + HER and GA
 1: Choose a population of n chromosomes
 2: Set the values of the parameters into the chromosome
 3: Run DDPG + HER to get the number of epochs at which the algorithm first reaches success rate ≥ 0.85
 4: for all chromosome values do
 5:   Initialize DDPG
 6:   Initialize replay buffer R ← φ
 7:   for episode = 1, M do
 8:     Sample a goal g and an initial state s_0
 9:     for t = 0, T−1 do
10:       Sample an action a_t using the DDPG behavioral policy
11:       Execute the action a_t and observe the new state s_{t+1}
12:     end for
13:     for t = 0, T−1 do
14:       r_t := r(s_t, a_t, g)
15:       Store the transition (s_t || g, a_t, r_t, s_{t+1} || g) in R
16:       Sample a set of additional goals for replay G := S(current episode)
17:       for g′ ∈ G do
18:         r′ := r(s_t, a_t, g′)
19:         Store the transition (s_t || g′, a_t, r′, s_{t+1} || g′) in R
20:       end for
21:     end for
22:     for t = 1, N do
23:       Sample a minibatch B from the replay buffer R
24:       Perform one step of optimization using A and minibatch B
25:     end for
26:   end for
27:   return 1/epochs
28: end for
29: Perform uniform crossover
30: Perform flip mutation at rate 0.1
31: Repeat for the required number of generations to find the optimal solution
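For the GA side of Algorithm 1, the following is a compact sketch of one possible implementation of the outer loop (population size 30, 30 generations, ranking selection, uniform crossover, and flip mutation at rate 0.1); the fitness call is a placeholder for the full DDPG + HER training run used in the paper, so the helper names and the dummy score are assumptions.

import numpy as np

rng = np.random.default_rng(6)
POP, GENS, BITS, P_MUT = 30, 30, 66, 0.1

def fitness(chromosome):
    # Placeholder: in the paper this is 1 / (epochs for DDPG + HER to first reach
    # success rate >= 0.85 with the decoded parameters). A dummy score is used here.
    return chromosome.mean()

def rank_select(pop, fits):
    # Ranking selection: selection probability proportional to rank (worst=1 ... best=POP).
    order = np.argsort(fits)
    probs = np.arange(1, len(pop) + 1) / (len(pop) * (len(pop) + 1) / 2)
    idx = rng.choice(order, size=2, p=probs)
    return pop[idx[0]], pop[idx[1]]

def uniform_crossover(a, b):
    mask = rng.random(BITS) < 0.5
    return np.where(mask, a, b)

def flip_mutation(c):
    flips = rng.random(BITS) < P_MUT
    return np.where(flips, 1 - c, c)

pop = rng.integers(0, 2, size=(POP, BITS))
for _ in range(GENS):
    fits = np.array([fitness(c) for c in pop])
    children = []
    for _ in range(POP):
        p1, p2 = rank_select(pop, fits)
        children.append(flip_mutation(uniform_crossover(p1, p2)))
    pop = np.array(children)

best = pop[np.argmax([fitness(c) for c in pop])]
print("best chromosome:", "".join(map(str, best)))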
Since each parameter requires 11 bits to be represented to three decimal places, we need 66 bits for the 6 parameters. These string chromosomes then enable domain-independent crossover and mutation string operators to generate new parameter values. We consider parameter values up to three decimal places because small changes in the values of the parameters cause considerable changes in success rate; a step size of 0.001 is therefore considered the best fit for our problem.

The fitness for each chromosome (set of parameter values) is defined as the inverse of the number of epochs it takes for the learning agent to reach close to the maximum success rate (≥ 0.85) for the very first time. Fitness is the inverse of the number of epochs because a GA always maximizes its objective function, and this converts our minimization of the number of epochs into a maximization problem. Since each fitness evaluation takes significant time, an exhaustive search of the 2^66-size search space is not possible, and we thus use GA search.
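To illustrate the encoding and the fitness just described, here is a small sketch that decodes a 66-bit chromosome into six values in [0, 1] and computes the 1/epochs fitness; the exact bit-to-value mapping (treating each 11-bit slice as a fraction of 2^11 − 1) and the parameter labels are assumptions made for this example.

import numpy as np

BITS_PER_PARAM, N_PARAMS = 11, 6
NAMES = ["polyak", "gamma", "Q_lr", "pi_lr", "random_eps", "noise_eps"]  # order used in the paper

def decode(chromosome):
    # Map each 11-bit slice to a value in [0, 1]; 2**11 - 1 = 2047 levels gives
    # roughly three-decimal resolution (an assumed decoding convention).
    params = {}
    for i, name in enumerate(NAMES):
        bits = chromosome[i * BITS_PER_PARAM:(i + 1) * BITS_PER_PARAM]
        value = int("".join(map(str, bits)), 2) / (2 ** BITS_PER_PARAM - 1)
        params[name] = round(value, 3)
    return params

def fitness(epochs_to_success):
    # Inverse of the number of epochs needed to first reach success rate >= 0.85,
    # so that minimizing epochs becomes a maximization problem for the GA.
    return 1.0 / epochs_to_success

rng = np.random.default_rng(7)
chromosome = rng.integers(0, 2, size=BITS_PER_PARAM * N_PARAMS)
print(decode(chromosome), fitness(epochs_to_success=12))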
V. EXPERIMENT AND RESULTS

Figure 4 shows the environments used to test robot learning on five different tasks: FetchPick&Place-v1, FetchPush-v1, FetchReach-v1, FetchSlide-v1, and DoorOpening. We ran the GA separately on these environments to check the effectiveness of our algorithm and compared performance with the original values of the parameters. Figure 2 (a) shows the result of our experiment with FetchPush-v1, while Figure 3 (a) shows the results with FetchSlide-v1. We let the system run with the GA to find the optimal parameters τ and γ. Since the GA is probabilistic, we show results from 10 runs of the GA, and the results show that the optimized parameters found by the GA can lead to better performance: the learning agent can run faster and can reach the maximum success rate faster. In Figure 2 (b), we show one learning run for the original parameter set and the average learning over these 10 different runs of the GA.

TABLE I: Original vs. optimal values of the parameters

Parameter   Original   Optimal
γ           0.98       0.88
τ           0.95       0.184
α_actor     0.001      0.001
α_critic    0.001      0.001
ε           0.3        0.055
η           0.2        0.774

Figure 3 (b) compares one run with the original parameters against the average of 2 runs with the optimized parameters τ and γ. For this task we ran only 2 runs, because these tasks can take a few hours per run. The results shown in Figures 2 and 3 (b) show changes when only two parameters are being optimized, as we tested and debugged the genetic algorithm, but we can already see the possibility for performance improvement. Our results from optimizing all six parameters justify this optimism and are described next.

The GA was then run to optimize all parameters, and these results are plotted in Figure 4 for all the tasks. Table I compares the GA-found parameters with the original parameters used in the RL algorithm. Though the learning rates α_actor and α_critic are the same as their original values, the other four parameters have values different from the original. The plots in Figure 4 show that the GA-found parameters outperformed the original parameters, indicating that the learning agent was able to learn faster. All the plots in this figure are averaged over 10 runs.

Fig. 4: Environments and the corresponding original vs. optimal plots, when all 6 parameters are found by the GA: (a) FetchPick&Place environment, (b) FetchPush environment, (c) FetchReach environment, (d) FetchSlide environment, (e) Door Opening environment; (f)–(j) the corresponding FetchPick&Place, FetchPush, FetchReach, FetchSlide, and DoorOpening plots.

VI. DISCUSSION AND FUTURE WORK

In this paper, we showed initial results demonstrating that a genetic algorithm can tune reinforcement learning algorithm parameters to achieve better performance, faster, at five manipulation tasks. We discussed existing work on reinforcement learning in robotics, presented an algorithm which integrates DDPG + HER with a GA to optimize the number of epochs required to achieve maximal performance, and explained why a GA might be suitable for such optimization. Initial results bore out the assumption that GAs are a good fit for such parameter optimization, and our results on the five manipulation tasks show that the GA can find parameter values that lead to faster learning and better (or equal) performance at our chosen tasks. We thus provide further evidence that heuristic search, as performed by genetic and other similar evolutionary computing algorithms, is a viable computational tool for optimizing reinforcement learning performance in multiple domains.
APPENDIX

The code for this paper is available on GitHub: https://github.com/aralab-unr/ReinforcementLearningWithGA. The parameters used in this paper can be found in the baselines.her.experiment.config module. The parameters are: discounting factor; polyak-averaging coefficient; learning rate for critic network; learning rate for actor network; percent of times a random action is taken; and standard deviation of Gaussian noise added to not-completely-random actions as a percentage of the maximum absolute value of actions on different coordinates. These correspond to gamma, polyak, Q_lr, pi_lr, random_eps, and noise_eps, respectively, in the code.
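As a usage illustration, the sketch below maps the GA-found values from Table I onto the parameter names listed above; the plain dictionary is illustrative only, since the exact way such overrides reach baselines.her.experiment.config may differ in the actual repository.

# Illustrative mapping of the GA-found values from Table I onto the parameter
# names listed above (assumed wiring; not the exact Baselines override mechanism).
ga_params = {
    "gamma": 0.88,        # discounting factor
    "polyak": 0.184,      # polyak-averaging coefficient
    "Q_lr": 0.001,        # critic learning rate
    "pi_lr": 0.001,       # actor learning rate
    "random_eps": 0.055,  # probability of a random action
    "noise_eps": 0.774,   # Gaussian noise std as a fraction of the max action
}

original_params = {"gamma": 0.98, "polyak": 0.95, "Q_lr": 0.001,
                   "pi_lr": 0.001, "random_eps": 0.3, "noise_eps": 0.2}

for key in ga_params:
    print(f"{key}: original={original_params[key]}  GA={ga_params[key]}")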
REFERENCES

[1] H. M. La, R. Lim, and W. Sheng, "Multirobot cooperative learning for predator avoidance," IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, Jan 2015.
[2] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[3] C. Gaskett, D. Wettergreen, and A. Zelinsky, "Q-learning in continuous state and action spaces," in Australasian Joint Conference on Artificial Intelligence. Springer, 1999, pp. 417–428.
[4] K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
[5] H. V. Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," 2007.
[6] L. C. Baird, "Reinforcement learning in continuous time: Advantage updating," in Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, vol. 4. IEEE, 1994, pp. 2448–2453.
[7] Q. Wei, F. L. Lewis, Q. Sun, P. Yan, and R. Song, "Discrete-time deterministic Q-learning: A novel convergence analysis," IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1224–1237, 2017.
[8] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on, vol. 3. IEEE, 2004, pp. 2619–2624.
[9] G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, "Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot," The International Journal of Robotics Research, vol. 27, no. 2, pp. 213–228, 2008.
[10] J. Peters, K. Mülling, and Y. Altun, "Relative entropy policy search," in AAAI. Atlanta, 2010, pp. 1607–1612.
[11] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, "Learning force control policies for compliant manipulation," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 4639–4644.
[12] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, "Learning to control a low-cost manipulator using data-efficient reinforcement learning," 2011.
[13] L. Jin, S. Li, H. M. La, and X. Luo, "Manipulability optimization of redundant manipulators using dynamic neural networks," IEEE Transactions on Industrial Electronics, vol. 64, no. 6, pp. 4710–4720, June 2017.
[14] C.-K. Lin, "H-infinity reinforcement learning control of robot manipulators using fuzzy wavelet networks," Fuzzy Sets and Systems, vol. 160, no. 12, pp. 1765–1786, 2009.
[15] Z. Miljković, M. Mitić, M. Lazarević, and B. Babić, "Neural network reinforcement learning for visual control of robot manipulators," Expert Systems with Applications, vol. 40, no. 5, pp. 1721–1736, 2013.
[16] M. Duguleana, F. G. Barbuceanu, A. Teirelbar, and G. Mogan, "Obstacle avoidance of redundant manipulators using neural networks based reinforcement learning," Robotics and Computer-Integrated Manufacturing, vol. 28, no. 2, pp. 132–146, 2012.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[18] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in International Conference on Machine Learning, 2016, pp. 2829–2838.
[19] H. Nguyen and H. M. La, "Review of deep reinforcement learning for robot manipulation," in The Third IEEE International Conference on Robotic Computing (IRC 2019), 2019, pp. 1–6.
[20] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
[21] H. Nguyen, H. M. La, and M. Deans, "Deep learning with experience ranking convolutional neural network for robot manipulator," arXiv:1809.05819, cs.RO, 2018.
[22] H. X. Pham, H. M. La, D. Feil-Seifer, and L. V. Nguyen, "Autonomous UAV navigation using reinforcement learning," arXiv:1801.05086, cs.RO, 2018.
[23] ——, "Reinforcement learning for autonomous UAV navigation using function approximation," in 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Aug 2018, pp. 1–6.
[24] H. M. La, R. S. Lim, W. Sheng, and J. Chen, "Cooperative flocking and learning in multi-robot systems for predator avoidance," in 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, May 2013, pp. 337–342.
[25] H. M. La, W. Sheng, and J. Chen, "Cooperative and active sensing in mobile sensor networks for scalar field mapping," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 1, pp. 1–12, Jan 2015.
[26] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, "Cooperative and distributed reinforcement learning of drones for field coverage," arXiv:1803.07250, cs.RO, 2018.
[27] A. D. Dang, H. M. La, and J. Horn, "Distributed formation control for autonomous robots following desired shapes in noisy environment," in 2016 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Sep. 2016, pp. 285–290.
[28] M. Rahimi, S. Gibb, Y. Shen, and H. M. La, "A comparison of various approaches to reinforcement learning algorithms for multi-robot box pushing," in International Conference on Engineering Research and Applications. Springer, 2018, pp. 16–30.
[29] L. Davis, "Handbook of genetic algorithms," 1991.
[30] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[31] P. W. Poon and J. N. Carter, "Genetic algorithm crossover operators for ordering applications," Computers & Operations Research, vol. 22, no. 1, pp. 135–147, 1995.
[32] F. Liu and G. Zeng, "Study of genetic algorithm with reinforcement learning to solve the TSP," Expert Systems with Applications, vol. 36, no. 3, pp. 6995–7001, 2009.
[33] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, "Evolutionary algorithms for reinforcement learning," Journal of Artificial Intelligence Research, vol. 11, pp. 241–276, 1999.
[34] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on. IEEE, 1994, pp. 223–228.
[35] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, "OpenAI Baselines," https://github.com/openai/baselines, 2017.
[36] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[37] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[38] J. H. Holland, "Genetic algorithms," Scientific American, vol. 267, no. 1, pp. 66–73, 1992.
[39] D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," Machine Learning, vol. 3, no. 2, pp. 95–99, 1988.
[40] D. E. Goldberg and K. Deb, "A comparative analysis of selection schemes used in genetic algorithms," in Foundations of Genetic Algorithms. Elsevier, 1991, vol. 1, pp. 69–93.
[41] G. Syswerda, "Uniform crossover in genetic algorithms," in Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann Publishers, 1989, pp. 2–9.
