DRL Using Genetic Algorithms
Abstract—Reinforcement learning (RL) enables agents to make decisions based on a reward function. However, in the process of learning, the choice of values for learning algorithm parameters can significantly impact the overall learning process. In this paper, we use a genetic algorithm (GA) to find the values of parameters used in Deep Deterministic Policy Gradient (DDPG) combined with Hindsight Experience Replay (HER), to help speed up the learning agent. We used this method on fetch-reach, slide, push, pick and place, and door-opening robotic manipulation tasks. Our experimental evaluation shows that our method leads to better performance, faster than the original algorithm.

Adarsh Sehgal, Hai Nguyen, and Dr. Hung La are with the Advanced Robotics and Automation (ARA) Laboratory. Dr. Sushil Louis is a professor in the Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA. Corresponding author: Hung La, email: [email protected]

This material is based upon work supported by the National Aeronautics and Space Administration (NASA) Grant No. NNX15AI02H issued through the NVSGC-RI program under sub-award No. 19-21, the RID program under sub-award No. 19-29, and the NVSGC-CD program under sub-award No. 18-54. This work is also partially supported by the Office of Naval Research under Grant N00014-17-1-2558.

I. INTRODUCTION

Q-learning methods have been applied to a variety of tasks by autonomous robots [1], and much research has been done in this field starting many years ago [2], with some work specific to continuous action spaces [3]–[6] and other work on discrete action spaces [7]. Reinforcement Learning (RL) has been applied to locomotion [8], [9] and also to manipulation [10], [11].

Much work specific to robotic manipulators also exists [12], [13]. Some of this work used fuzzy wavelet networks [14], while other work used neural networks to accomplish the tasks [15], [16]. Off-policy algorithms such as the Deep Deterministic Policy Gradient algorithm (DDPG) [17] and the Normalized Advantage Function algorithm (NAF) [18] are helpful for real robot systems. A complete review of recent deep reinforcement learning methods for robot manipulation is given in [19]. We specifically use DDPG combined with Hindsight Experience Replay (HER) [20] for our experiments. Recent work on using experience ranking to improve the learning speed of DDPG + HER was reported in [21].

The main contribution of this paper is a demonstration of better final performance at several manipulation tasks using a Genetic Algorithm (GA) to find DDPG and HER parameter values that lead more quickly to better performance at these tasks. Our experiments revealed that learning algorithm parameters are non-linearly related to task performance and learning speed; the success rate can vary significantly based on the values of the parameters used in RL. In the following sections, we describe the manipulation tasks, the DDPG + HER algorithms, and the parameters that affect performance for these algorithms. Initial experimental results showing performance and speed gains when using a GA to search for good parameter values then provide evidence that GAs find good parameter values leading to better task performance, faster.

The paper is organized as follows: Section 2 presents related work. Section 3 describes the DDPG + HER algorithms. Section 4 describes the GA used to find the parameter values. Section 5 then describes our learning tasks, our experiments, and our experimental results. The last section provides conclusions and possible future research.

II. RELATED WORK

RL has been widely used in training/teaching both single robots [22], [23] and multi-robot systems [24]–[28]. Previous work has also been done on both model-based and model-free learning algorithms. Applying model-based learning algorithms to real-world scenarios relies significantly on a model-based teacher to train deep network policies.

Similarly, there is also much work on GAs [29], [30] and on the GA operators of crossover and mutation [31], applied to a variety of problems. GAs have been specifically applied to a variety of RL problems [31]–[34].

In this paper, we use model-free RL with continuous action spaces and deep neural networks. Our work builds on existing work that applies the same techniques to robotic manipulators [17], [20]. Specifically, we use a GA to search for good DDPG + HER algorithm parameters and compare the resulting success rates against those obtained with the original parameter values [35]. DDPG + HER, an RL algorithm that uses deep neural networks in continuous action spaces, has been successfully applied to robotic manipulation tasks, and our GA improves on this work by finding learning algorithm parameters that need fewer epochs (one epoch is a single pass through the full training set) to reach better task performance.

III. BACKGROUND

A. Reinforcement Learning

Consider a standard RL setup consisting of a learning agent that interacts with an environment. The environment can be described by a set of variables where S is the set of states, A is the set of actions, p(s0) is a distribution over initial states, r : S × A → R is the reward function, p(st+1|st, at) are the transition probabilities, and γ ∈ [0, 1] is a discount factor.
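As a concrete illustration of this setup, the short sketch below runs the standard agent-environment interaction loop and accumulates the discounted return defined in the next paragraph. The toy environment, its dynamics and rewards, the fixed horizon, and the random placeholder policy are our own assumptions for illustration only; they are not part of the DDPG + HER implementation used in this paper.

import random

class ToyEnv:
    """Placeholder environment; states, rewards, and horizon are made up for illustration."""
    def reset(self):
        self.t = 0
        return 0.0                                  # initial state s_0 ~ p(s_0)
    def step(self, action):
        self.t += 1
        next_state = random.random()                # s_t+1 ~ p(.|s_t, a_t)
        reward = -abs(action - next_state)          # r_t = r(s_t, a_t)
        return next_state, reward, self.t >= 50     # episode ends after a fixed horizon

env, gamma, ret = ToyEnv(), 0.98, 0.0
state, done, t = env.reset(), False, 0
while not done:
    action = random.random()                        # placeholder policy pi(s_t)
    state, reward, done = env.step(action)
    ret += (gamma ** t) * reward                    # discounted return R_0 = sum_t gamma^t r_t
    t += 1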
A deterministic policy maps from states to actions: π : S → A. The beginning of every episode is marked by sampling an initial state s0. For each timestep t, the agent performs an action based on the current state: at = π(st). The performed action earns a reward rt = r(st, at), and the distribution p(·|st, at) is used to sample the environment's new state. The total return is Rt = Σ_{i=t}^{∞} γ^{i−t} ri. The agent's goal is to maximize its expected return E[Rt | st, at]. An optimal policy, denoted π*, can be defined as any policy π* such that Qπ*(s, a) ≥ Qπ(s, a) for every s ∈ S, a ∈ A, and any policy π. All optimal policies share the same Q-function, called the optimal Q-function Q*, which satisfies the Bellman equation:

Q*(s, a) = E_{s′∼p(·|s,a)}[r(s, a) + γ max_{a′∈A} Q*(s′, a′)].    (1)
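To make Equation (1) concrete, the toy sketch below performs repeated Bellman-optimality backups on a tabular Q-function for a tiny, fully known MDP. The transition table, rewards, and variable names are our own illustrative assumptions and are not part of the DDPG + HER implementation.

import numpy as np

# Toy MDP: 3 states, 2 actions, deterministic transitions and rewards (illustrative only).
P = np.array([[1, 2], [2, 0], [2, 2]])              # P[s, a] = next state s'
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # R[s, a] = reward r(s, a)
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(100):                                # value iteration via repeated backups
    for s in range(3):
        for a in range(2):
            # Deterministic case of Eq. (1): Q*(s, a) = r(s, a) + gamma * max_a' Q*(s', a')
            Q[s, a] = R[s, a] + gamma * np.max(Q[P[s, a]])

pi = Q.argmax(axis=1)                               # greedy (optimal) policy w.r.t. Q*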
B. Deep Q-Networks (DQN)

A Deep Q-Network (DQN) [36] is a model-free reinforcement learner designed for discrete action spaces. In a DQN, a neural network Q is maintained that approximates Q*. πQ(s) = argmax_{a∈A} Q(s, a) denotes the greedy policy with respect to Q. An ε-greedy policy takes a random action with probability ε and the action πQ(s) with probability 1 − ε.

Episodes are generated during training using an ε-greedy policy. A replay buffer stores the transition tuples (st, at, rt, st+1) experienced during training. Neural network training is interlaced with the generation of new episodes. Training minimizes the loss L = E[(Q(st, at) − yt)²], where yt = rt + γ max_{a′∈A} Q(st+1, a′) and the tuples (st, at, rt, st+1) are sampled from the replay buffer.

The targets yt are computed using a target network, which changes at a slower pace than the main network. The weights of the target network can periodically be set to the current weights of the main network [36], or Polyak-averaged parameters [37] can be used.
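The sketch below illustrates the three DQN ingredients just described: ε-greedy action selection, a replay buffer of transition tuples, and the target computation yt = rt + γ max_{a′} Q(st+1, a′). The stand-in Q-function, buffer size, and all names are our own assumptions, not those of a particular DQN implementation.

import random
from collections import deque
import numpy as np

replay_buffer = deque(maxlen=100_000)     # stores (s_t, a_t, r_t, s_t+1) tuples

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, greedy action pi_Q(s) otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def td_targets(batch, target_network, gamma):
    """y_t = r_t + gamma * max_a' Q_target(s_t+1, a'); the loss is E[(Q(s_t, a_t) - y_t)^2]."""
    return np.array([r + gamma * np.max(target_network(s_next))
                     for (_, _, r, s_next) in batch])

# Example usage with a stand-in Q-function over 4 discrete actions.
fake_q = lambda s: np.zeros(4)
replay_buffer.append((0.0, 1, 0.5, 1.0))
batch = random.sample(list(replay_buffer), k=1)
print(epsilon_greedy(fake_q(0.0), epsilon=0.1), td_targets(batch, fake_q, gamma=0.98))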
C. Deep Deterministic Policy Gradients (DDPG)

In Deep Deterministic Policy Gradients (DDPG), there are two neural networks: an actor and a critic. The actor network is a target policy π : S → A, and the critic network is an action-value function approximator Q : S × A → R. The critic network Q(s, a|θQ) and the actor network µ(s|θµ) are randomly initialized with weights θQ and θµ.

A behavioral policy, a noisy variant of the target policy, πb(s) = π(s) + N(0, 1), is used to generate episodes. The critic network is trained like the Q-function in DQN, but the target yt is computed as yt = rt + γQ(st+1, π(st+1)), where γ is the discounting factor. The loss La = −Es[Q(s, π(s))] is used to train the actor network.
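As a rough sketch of these two updates, assuming PyTorch and small placeholder actor/critic networks (the architectures, dimensions, and learning rates below are our own assumptions, not the paper's settings), the critic is regressed toward yt = rt + γQ(st+1, π(st+1)) and the actor is updated to minimize La = −E[Q(s, π(s))]:

import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 10, 4, 0.98
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One DDPG gradient step on a mini-batch of transitions (s, a, r, s_next)."""
    # Critic target y = r + gamma * Q(s', pi(s')); the full algorithm computes this
    # with slowly updated target networks (see Eq. (2) in Section IV).
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # L_a = -E[Q(s, pi(s))]
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call with a random mini-batch of 32 transitions.
B = 32
ddpg_update(torch.randn(B, obs_dim), torch.randn(B, act_dim),
            torch.randn(B, 1), torch.randn(B, obs_dim))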
D. Hindsight Experience Replay (HER)

Hindsight Experience Replay (HER) tries to mimic the human ability to learn from failure. The agent learns from all episodes, even when it does not reach the original goal: whatever state the agent does reach, HER treats as a modified goal. Standard experience replay only stores the transition (st||g, at, rt, st+1||g) with the original goal g; HER also stores the transition (st||g′, at, r′t, st+1||g′) with the modified goal g′. HER does well with extremely sparse rewards and performs significantly better with sparse rewards than with shaped ones.
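A minimal sketch of this goal-relabeling idea is shown below. It assumes goal-conditioned transitions stored as (state, goal, action, reward, next_state) tuples, a sparse reward of 0 on success and -1 otherwise, and relabeling with the final achieved state as the hindsight goal; these conventions and names are our assumptions for illustration, not the exact scheme used by the HER implementation.

def sparse_reward(achieved_goal, goal, eps=0.05):
    """Illustrative sparse reward: 0 if the achieved goal is close enough to the goal, else -1."""
    return 0.0 if abs(achieved_goal - goal) < eps else -1.0

def her_store(buffer, episode, original_goal):
    """Store each transition with the original goal g, then again with a hindsight goal g'."""
    hindsight_goal = episode[-1]["achieved_goal"]      # state actually reached at episode end
    for tr in episode:
        # Standard replay: transition (s_t || g, a_t, r_t, s_t+1 || g)
        buffer.append((tr["s"], original_goal, tr["a"],
                       sparse_reward(tr["achieved_goal"], original_goal), tr["s_next"]))
        # HER relabeling: pretend the achieved state was the goal all along (g')
        buffer.append((tr["s"], hindsight_goal, tr["a"],
                       sparse_reward(tr["achieved_goal"], hindsight_goal), tr["s_next"]))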
E. Genetic Algorithm (GA)

Genetic Algorithms (GAs) [29], [38], [39] were designed to search poorly understood spaces, where exhaustive search may not be feasible and where other search approaches perform poorly. When used as function optimizers, GAs try to maximize a fitness tied to the optimization objective. Evolutionary computing algorithms in general, and GAs specifically, have had much empirical success on a variety of difficult design and optimization problems. They start with a randomly initialized population of candidate solutions, typically encoded as strings (chromosomes). A selection operator focuses search on promising areas of the search space, while crossover and mutation operators generate new candidate solutions. We explain our specific GA in the next section.
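The sketch below shows the generic GA loop just described: selection, crossover, and mutation over a population of real-valued chromosomes. The fitness function, population size, chromosome length, and operator rates are placeholder assumptions and are not the settings used in this paper.

import random

GENES, POP, GENERATIONS = 6, 20, 10   # placeholder sizes, not the paper's settings

def fitness(chromosome):
    """Placeholder fitness; in this paper's setting it would reflect task success rate."""
    return -sum((g - 0.5) ** 2 for g in chromosome)

def select(pop):                      # tournament selection of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

def crossover(p1, p2):                # single-point crossover
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(c, rate=0.1):              # per-gene mutation, keeping genes in [0, 1]
    return [random.random() if random.random() < rate else g for g in c]

population = [[random.random() for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]
best = max(population, key=fitness)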
IV. DDPG + HER AND GA

In this section, we present the primary contribution of our paper: a genetic algorithm that searches through the space of parameter values used in DDPG + HER for values that maximize task performance and minimize the number of training epochs. We target the following parameters: the discounting factor γ; the Polyak-averaging coefficient τ [37]; the learning rate for the critic network αcritic; the learning rate for the actor network αactor; the percentage of times a random action is taken ε; and the standard deviation of the Gaussian noise added to actions that are not completely random, expressed as a percentage of the maximum absolute value of the actions on the different coordinates, η. The range of all the parameters is 0–1, which can be justified using the equations that follow in this section.

Our experiments show that adjusting the values of these parameters did not increase or decrease the agent's learning in a linear or easily discernible pattern, so a simple hill climber will probably not do well in finding optimized parameters. Since GAs were designed for such poorly understood problems, we use our GA to optimize these parameter values.

Specifically, we use τ, the Polyak-averaging coefficient, to show the performance non-linearity across values of τ. τ is used in the algorithm as shown in Equation (2):

θQ′ ← τ θQ + (1 − τ) θQ′,
θµ′ ← τ θµ + (1 − τ) θµ′.    (2)

Equation (3) shows how γ is used in the DDPG + HER algorithm, while Equation (4) describes the Q-Learning update. α denotes the learning rate. Networks are trained based on this update equation.

yi = ri + γ Q′(si+1, µ′(si+1 | θµ′) | θQ′),    (3)

Q(st, at) ← Q(st, at) + α[rt+1 + γQ(st+1, at+1) − Q(st, at)].    (4)
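To connect Equation (2) with the parameter search, the sketch below shows a soft target-network update with coefficient τ and a chromosome holding the six searched parameters, each in [0, 1], together with a fitness stub. The train_fn routine and its keyword names are placeholder assumptions standing in for a full DDPG + HER training run; they are not the paper's actual interface.

def polyak_update(target_weights, main_weights, tau):
    """Soft update from Eq. (2): theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(main_weights, target_weights)]

# Chromosome layout: [gamma, tau, alpha_critic, alpha_actor, epsilon, eta], each in [0, 1].
def fitness(chromosome, train_fn):
    """Placeholder fitness: train_fn would run DDPG + HER with these parameter values
    for a fixed budget of epochs and return the resulting task success rate."""
    gamma, tau, alpha_critic, alpha_actor, epsilon, eta = chromosome
    return train_fn(gamma=gamma, tau=tau, lr_critic=alpha_critic,
                    lr_actor=alpha_actor, random_eps=epsilon, noise_eps=eta)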
[Figure: (a) Optimal parameters over 10 runs vs. original parameters]