demonstrate that our novel double deep Q-learning network outperforms the proposed deep Q-learning and the vanilla Q-learning frameworks by maximizing grasp success.
• Analysis of the performance of our method on different objects using single-view and multi-view camera setups.

II. RELATED WORK

Robotic Grasping: There has been a great deal of research in the field of robotic grasping aimed at improving potential grasps of novel objects [1]–[4]. Not only are the pose and configuration of the robotic arm and gripper important in grasp detection, but the shape and physical properties of the object to be grasped are also essential in determining how it should be gripped. An accurate understanding of the target object's 3D model is crucial for a swift pick-and-place action using a robotic manipulator. This can be achieved using 3D sensing cameras and sensors that provide a reasonably good perception of the object in the real world. Grasping methods typically fall into two categories: analytical methods and empirical (data-driven) methods [5].

The use of RGB-D cameras for object recognition, detection, and mapping has been shown to improve the ratio of successful grasps, as seen in [6]. RGB-D imagery is particularly useful because the depth information is used to map 3D real-world coordinates to valuable information in the 2D frame. Previous work in the field of robotics has focused solely on images obtained from a depth camera for grasp detection and object recognition [1].

Deep learning: Considerable progress has been made in recent years with vision-based techniques for robotic grasping using deep learning [1], [2], [4], [7]–[9]. A majority of the work in deep learning is associated with classification, while only a few approaches have used it for detection [1], [4]. All of these approaches use a bounding box that contains the observed object, and the bounding box is similar for each valid object detected. For robotic grasping, however, there may be several ways to grasp an object, and it is essential to pick the one with the highest grasp success or the most stable grasp, which is why machine learning techniques are used to find the best possible grasp. Convolutional neural networks are a popular technique for learning features and visual models with a sliding-window detection approach, as illustrated by [1]. However, this technique is slow for a robot in a real-world situation where fast actions may be required. [4] improved on this by passing an entire image through the network rather than using small patches to detect potential grasps.

Reinforcement learning: In reinforcement learning, the learner is not told which action to take; instead, it must discover which actions yield the most reward by trial and error. In most cases, an action affects not only the immediate reward but also the next state and, thereby, all subsequent rewards [10]. Thus, trial-and-error search and delayed reward are the two most prominent features of reinforcement learning. Several other works use rules for grasping based on research describing how humans grasp and manipulate objects. [11] presents early work on using reinforcement learning for robotic grasping in which the learning approach is adapted from humans grasping an object. Three layers of functional modules that enable learning from a finite set of data while maintaining good generalization [12], [13] have been shown to be successful implementations of reinforcement learning in the past.

However, due to issues such as memory complexity, sample complexity, and computational complexity, the user has to rely on deep learning networks. These networks use function approximation and representation learning to overcome the problems of algorithms that require very high computational power and fast processing. [14] discusses the use of a reinforcement learning algorithm, Policy Improvement with Path Integrals (PI²), that can be used when the state estimation is uncertain; this approach does not require a specific model and is thus model-free. [15] discusses how a system can learn to reliably perform its task in a short amount of time by implementing a reinforcement learning strategy with a minimal amount of information provided for a given object-picking task. Deep learning has enabled reinforcement learning to be used for decision-making problems, such as settings with high-dimensional state and action spaces, that were once unmanageable. In recent years, a number of off-policy reinforcement learning techniques have been implemented. For instance, [16] uses deep reinforcement learning for solving Atari games, and [17] uses a model-free algorithm based on the deterministic policy gradient to solve problems in the continuous action domain.

Current research on the deep Q-network (DQN) shows how deep reinforcement learning can be applied to design closed-loop grasping strategies. [18] demonstrates this by proposing a Q-function optimization technique that provides a scalable approach for vision-based robotic manipulation. [19] uses deep Q-learning, along with pushing, for learning grasping strategies in tightly packed spaces and cluttered environments. Our proposed work is based on a similar approach that uses an adaptation of deep reinforcement learning for detecting robot grasps.

III. PROBLEM FORMULATION

We define the problem of robotic grasping as a Markov Decision Process (MDP) in which, at any given state s_t ∈ S at time t, the agent (i.e., the robot) makes an observation o_t ∈ O of the environment, executes an action a_t ∈ A based on the policy π(s_t), and receives an immediate reward r_t from the reward function R(s_t, a_t). The goal of the agent is to find an optimal policy π* that maximizes the expected sum of discounted future rewards, i.e., the γ-discounted sum of all future returns from time t to ∞.

In our work, the observation o_t comprises the RGB-D image captured from the overhead depth camera, the RGB image taken from the wrist camera, and the joint angles of the robot's arm observed at time t.
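For reference, the objective above can be stated compactly as a discounted return. This is a standard restatement in the notation of this section; the return symbol G_t is introduced here for convenience and does not appear in the original text:

G_t = Σ_{k=0}^{∞} γ^k r_{t+k},    π* = argmax_π E_π[ G_t ],    γ ∈ [0, 1)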
Fig. 2: The architecture of the proposed Grasp-Q-Network. The input RGB-D image from the overhead camera, I_t^o, and the RGB image from the wrist camera, I_t^w, are each fed to a 7×7 convolution with stride 2 followed by batch normalization, then a 5×5 convolution, then a 3×3 convolution followed by batch normalization and max-pooling. The output features are concatenated and fed to a convolutional layer followed by a fully connected layer. The servo motor command M_t is processed by two fully connected layers. The combined result is then processed by two further fully connected layers, after which the network outputs the probability of grasp success using a softmax function.
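To make the caption above easier to follow, the listing below gives a minimal PyTorch sketch of the described two-stream network. It is an illustrative reconstruction only: the layer sequence follows Fig. 2, but the channel widths, hidden sizes, activations, pooling choice, and input resolution are assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class VisionNet(nn.Module):
    """Per-camera branch following Fig. 2: 7x7 conv (stride 2) + batch norm,
    then a 5x5 conv, then a 3x3 conv + batch norm + max-pooling.
    Channel widths are illustrative assumptions."""

    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.features(x)


class GraspQNetwork(nn.Module):
    """Sketch of the Grasp-Q-Network: visionNet branches for the overhead
    RGB-D image and the wrist RGB image, a motorNet branch for the servo
    command M_t, and a fused head that outputs a softmax over the actions."""

    def __init__(self, num_actions, motor_dim):
        super().__init__()
        self.overhead_net = VisionNet(in_channels=4)  # RGB-D overhead image
        self.wrist_net = VisionNet(in_channels=3)     # RGB wrist image
        self.fuse_conv = nn.Sequential(               # conv layer after concatenation
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # pooling choice is an assumption
        )
        self.fuse_fc = nn.Linear(64, 128)             # fully connected layer after the conv
        self.motor_net = nn.Sequential(               # motorNet: two fully connected layers
            nn.Linear(motor_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Sequential(                    # two fully connected layers after merging
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, overhead_img, wrist_img, motor_cmd):
        # Both images are assumed to be resized to the same resolution so that
        # the branch outputs can be concatenated along the channel dimension.
        v = torch.cat([self.overhead_net(overhead_img),
                       self.wrist_net(wrist_img)], dim=1)
        v = torch.relu(self.fuse_fc(self.fuse_conv(v).flatten(1)))
        m = self.motor_net(motor_cmd)
        q = self.head(torch.cat([v, m], dim=1))
        # Softmax turns the action values into grasp-success probabilities,
        # as described in the caption.
        return torch.softmax(q, dim=1)

The final softmax mirrors the caption's description of the network outputting grasp-success probabilities over the action set.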
IV. PROPOSED APPROACH

We propose a novel deep reinforcement learning based technique for vision-based robotic grasping. The proposed framework consists of a novel double deep Q-learning based architecture. The observed inputs from the overhead camera and the wrist camera, together with the current motor positions, are fed into the Grasp-Q-Network, which returns grasp success probabilities, i.e., the Q-values for all possible actions that the agent can take. These Q-values are then used to select the best action a_t based on an ε-greedy policy in order to find the optimal policy π*.

A. Architecture

The architecture of the proposed approach is shown in Fig. 2. The method involves end-to-end training of the Grasp-Q-Network based on a visual servoing mechanism that continuously adjusts the motor commands of the robot by observing the current state of the environment in order to produce a successful grasp. The motor actions are expressed in the robot frame, so the camera does not need to be calibrated precisely with respect to the end effector.

The model uses visual cues to establish the connection between the graspable object and the robot end-effector. A novel deep convolutional neural network decides what motion of the robot arm and gripper can produce an effective grasp. Each grasp attempt consists of a sequence of time steps. At each time step t, the robot records the current images I_t^o and I_t^w from the overhead and wrist cameras, respectively, and the current motor positions M_t. Using the policy π(s_t), the agent determines the next action, i.e., in which direction to move the gripper. The agent then evaluates the action at each time step t and produces a reward r_t using the reward function R(s_t, a_t). Each time step t results in a training sample T_t given by:

T_t = (I_t^o, I_t^w, M_t, a_t, r_t)    (1)

Each sample comprises the observed images, the motor position information, the action taken, and the reward received. The action space A is defined as a vector of motor actions comprising end-effector displacement along the three axes, rotation about the z-axis, and the gripper action of closing the gripper, which terminates the episode.

The Grasp-Q-Network comprises two parts, visionNet and motorNet, as shown in Fig. 2. It outputs probabilities as Q-values for each of the actions in the action space A, which are used to select the best action a_t. The Grasp-Q-Network can be considered a Q-function approximator for the learning policy described by the visual servoing mechanism λ(o_t). The repeated use of the target GQN, defined by ψ′(I_t, M_t), to obtain more data for refitting the GQN ψ(I_t, M_t) can be regarded as fitted Q-iteration.

B. DQN Algorithm

A deep Q-network (DQN) is a multi-layer neural network that returns a vector of action values Q(s, a | φ) for a given state s_t and sequence of observations O, where φ represents the parameters of the network. The main aim of the policy network is to continuously train φ to attain the optimal policy π*. Two important aspects of the DQN algorithm, as proposed by [20], are the use of a target network and the use of experience replay.
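As a concrete illustration of these two ingredients, the following self-contained Python sketch shows a simple replay buffer and a periodic target-network copy (assuming PyTorch-style modules that expose state_dict()). The buffer capacity, batch size, and field layout of a transition are illustrative choices, not values from the paper.

import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay: stores transitions T_t and samples
    random mini-batches, which de-correlates consecutive samples."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def push(self, overhead_img, wrist_img, motor_cmd, action, reward,
             next_obs, done):
        self.buffer.append(
            (overhead_img, wrist_img, motor_cmd, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target_network(online_net, target_net):
    """Copy the online parameters (phi) into the target parameters (phi'),
    performed every N time steps and kept fixed in between."""
    target_net.load_state_dict(online_net.state_dict())

Calling sync_target_network every N environment steps corresponds to the parameter copy φ′_{t=N} = φ_{t=N} described next.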
In our work, the target network ψ′, parameterized by φ′, is the same as the GQN ψ except that its parameters are copied from the GQN every N time steps, so that φ′_{t=N} = φ_{t=N}, and are kept fixed on all other steps. The target used by DQN is then:

Y_t^DQN ≡ r_t + γ max_a Q(s_{t+1}, a | φ′_t)    (2)

where γ ∈ [0, 1) is the discount factor. A value close to 0 indicates that the agent chooses actions based only on the current reward, while values approaching 1 indicate that the agent chooses actions based on the long-term reward.

Algorithm 1 Proposed Visual Servoing Mechanism
 1: procedure SERVOING(λ(o_t))
 2:   Given current images I_t^o and I_t^w
 3:   Given current motor position M_t
 4:   top:
 5:   Initialize time step t and reward r
 6:   loop:
 7:     Infer a_t using network ψ(I_t, M_t) and policy π*
 8:     Execute action a_t
 9:     if a_t is not executed then
10:       r ← r − 1
11:       goto top
12:     end if
13:     if result = successful pick then
14:       r ← r + 10
15:       goto top
16:     else if result = partial pick then
17:       r ← r + 1
18:       goto top
19:     else
20:       r ← r − 0.025
21:     end if
22:     t ← t + 1
23:     goto loop
24: end procedure

C. Training

Two separate value functions are trained in a mutually symmetric fashion by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, φ and φ′. One set of weights is used to learn the greedy policy and the other to determine its value at each update. For comparison, the Q-learning target is defined as:

Y_t^Q ≡ r_t + γ max_a Q(s_{t+1}, a | φ)    (3)

where r_t is the immediate reward, s_{t+1} is the resulting state, and Q(s, a | φ) is the parameterized value function. The target value in double deep Q-learning can be written as:

Y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a | φ) | φ′)    (4)

This resolves the overestimation issue of Q-learning and outperforms the existing deep Q-learning algorithm. Instead of the value obtained from the target network ψ_φ′, the max operator in double Q-learning uses the action that maximizes the current network ψ_φ.

Thus, double DQN can be seen as a policy network ψ, parameterized by φ, that is continually trained to approximate the optimal policy. Mathematically, it uses the Bellman equation to minimize the loss function L(φ):

L(φ) = E[ (Q(s_t, a_t | φ) − Y_t^DDQN)^2 ]    (5)

D. Reward Function

The reward for a successful grasp is R(s_t, a_t) = 10. A grasp is defined as successful if the gripper is holding an object above a certain height at the end of the episode. A partial reward of 1 is given if the end effector comes into contact with the object. A reward of -1 is given if no action is taken. Moreover, a small penalty of -0.025 is applied at every time step prior to termination to encourage the robot to grasp more quickly.

V. EXPERIMENTS

To evaluate our approach, we perform experiments using our proposed deep Q-learning and double deep Q-learning algorithms, with the vanilla Q-learning algorithm as a baseline. The experiments were performed both in simulation and in the real world on the 7-DOF Baxter robot. Further, the experiments were carried out using both single- and multiple-camera setups.

A. Setup

The simulation setup involves interfacing the Gazebo simulator in the Robot Operating System (ROS) with the Atari DQN, which provides a custom environment for simulating our proposed method. Although the proposed approach is robot agnostic, we use a Baxter robot with a parallel plate gripper as our robot platform. We create a simulation environment with a table placed in front of the robot and a camera mounted on the right side of the table. An object is randomly spawned on the table at the start of each episode. The objects consist of a cube, a sphere, and a cylinder, as seen in Fig. 4.

The real-world setup consisted of the Baxter robot, an Intel RealSense D435 depth camera used as the overhead camera, and a wrist camera, similar to the simulation setup. The overhead camera position was adjusted in the real-world implementation to avoid collisions with the robot.

The top row of Fig. 3 shows an example run of the Baxter robot learning to pick objects in the Gazebo simulation environment. The bottom row shows the robot approaching an object and successfully completing the picking task in the real-world setting.

B. Training

For training, at the start of each episode and at every reset, a colored sphere, cylinder, or cube is spawned (in simulation) or placed (in the real world) at a random orientation in front of the Baxter robot. The robot then tries to pick up the object by maneuvering its arm around the object.
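To connect this training procedure with the targets in Eqs. (3)-(5), the sketch below shows how a single double-DQN loss evaluation could look on a sampled mini-batch. The batched-tensor interface, the termination mask, and the discount value are assumptions for illustration; they are not taken from the paper.

import torch
import torch.nn.functional as F


def double_dqn_loss(q_online_t, q_online_tp1, q_target_tp1,
                    actions, rewards, dones, gamma=0.99):
    """Compute the double-DQN loss of Eq. (5) for a mini-batch.

    q_online_t:   Q(s_t, . | phi)       online network, current state,  (B, A)
    q_online_tp1: Q(s_{t+1}, . | phi)   online network, next state,     (B, A)
    q_target_tp1: Q(s_{t+1}, . | phi')  target network, next state,     (B, A)
    actions: (B,) long tensor of executed actions a_t
    rewards: (B,) float tensor of rewards r_t
    dones:   (B,) float tensor, 1.0 where the episode terminated
    """
    # Action selection uses the online network (argmax_a Q(s_{t+1}, a | phi)), Eq. (4) ...
    best_next = q_online_tp1.argmax(dim=1, keepdim=True)
    # ... while evaluation uses the target network Q(s_{t+1}, best_next | phi').
    next_value = q_target_tp1.gather(1, best_next).squeeze(1)
    y_ddqn = rewards + gamma * (1.0 - dones) * next_value            # Eq. (4)
    q_taken = q_online_t.gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_taken, y_ddqn.detach())                      # Eq. (5)


# Illustrative usage with random tensors standing in for network outputs:
B, A = 32, 7
loss = double_dqn_loss(
    q_online_t=torch.randn(B, A, requires_grad=True),
    q_online_tp1=torch.randn(B, A),
    q_target_tp1=torch.randn(B, A),
    actions=torch.randint(0, A, (B,)),
    rewards=torch.randn(B),
    dones=torch.zeros(B),
)
loss.backward()  # a gradient step would then be taken by an optimizer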
Fig. 3: Baxter learning to grasp objects using the proposed deep reinforcement learning based visual servoing algorithm in simulation (top row) and in the real world (bottom row)
Fig. 6: Comparison of the performance of single-view and multi-view models on the robotic grasping task

due to the robot arm. For this reason, we also compare the results of single-view and multi-view camera setups to test the performance of the robot when picking three objects: a sphere, a cube, and a cylinder, as shown in Fig. 6. The single-camera setup uses a single camera input, while the multi-camera setup incorporates the overhead camera and the wrist camera on the robot arm. The results demonstrate that the multi-view camera model works better than the single-view model for all three objects. An interesting observation is that both the single-view and multi-view models work best on the cube, with an accuracy of 91.1% for the multi-view model and 83.2% for the single-view model, compared to the sphere and the cylinder, and both have the lowest success rate on the sphere.

VI. CONCLUSIONS

A method for learning robust grasps is presented using a deep reinforcement learning framework that consists of a Grasp-Q-Network, which produces grasp probabilities, and a visual servoing mechanism, which performs continuous servoing to adjust the servo motor commands of the robot. In contrast to most grasping and visual servoing methods, this method does not require training on large sets of data, since it does not use any dataset for learning to pick up the objects. The experimental results show that the method works in simulation as well as on the real robot. The method also uses continuous feedback to correct inaccuracies and adjust the gripper to the movement of the object in the scene.

Our results using multi-view models show that the multi-view camera setup works better than the single-view camera setup and achieves a higher success rate. Further, the proposed novel Grasp-Q-Network has been used to learn a reinforcement learning policy that can learn to grasp objects from the feedback obtained through the visual servoing mechanism. Finally, the results show that by adopting an off-policy reinforcement learning method, the robot performs better at learning the grasping task than with the vanilla Q-learning algorithm, and the overestimation problem observed with the deep Q-learning algorithm is reduced.

REFERENCES

[1] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
[2] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
[3] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," The International Journal of Robotics Research, vol. 27, pp. 157–173, 2008.
[4] S. Kumra and C. Kanan, "Robotic grasp detection using deep convolutional neural networks," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 769–776, IEEE, 2017.
[5] J. Bohg, A. Morales, T. Asfour, and D. Kragic, "Data-driven grasp synthesis – a survey," IEEE Transactions on Robotics, vol. 30, pp. 289–309, April 2014.
[6] Y. Jiang, S. Moseson, and A. Saxena, "Efficient grasping from RGBD images: Learning using a new rectangle representation," in 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 3304–3311, IEEE, 2011.
[7] J. Yu, K. Weng, G. Liang, and G. Xie, "A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation," in 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1175–1180, Dec 2013.
[8] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, "Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291, IEEE, 2018.
[9] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, "Learning ambidextrous robot grasping policies," Science Robotics, vol. 4, no. 26, p. eaau4984, 2019.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[11] M. A. Moussa and M. S. Kamel, "A connectionist model for learning robotic grasps using reinforcement learning," in Proceedings of International Conference on Neural Networks (ICNN'96), vol. 3, pp. 1771–1776, June 1996.
[12] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation (ICRA '04), vol. 3, pp. 2619–2624, April 2004.
[13] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, "Autonomous inverted helicopter flight via reinforcement learning," in Experimental Robotics IX (M. H. Ang and O. Khatib, eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 363–372, 2006.
[14] F. Stulp, E. Theodorou, J. Buchli, and S. Schaal, "Learning to grasp under uncertainty," in 2011 IEEE International Conference on Robotics and Automation, pp. 5703–5708, May 2011.
[15] T. Lampe and M. Riedmiller, "Acquiring visual servoing reaching and grasping skills using neural reinforcement learning," in The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Aug 2013.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013.
[17] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36, IEEE, 2017.
[18] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., "Scalable deep reinforcement learning for vision-based robotic manipulation," in Conference on Robot Learning, pp. 651–673, 2018.
[19] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, "Learning synergies between pushing and grasping with self-supervised deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245, IEEE, 2018.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.