demonstrate that our novel double deep Q-learning network outperforms the proposed deep Q-learning and the vanilla Q-learning frameworks by maximizing grasp success.
• Analysis of the performance of our method on different objects using single-view and multi-view camera setups.

II. RELATED WORK

Robotic Grasping: There has been a great deal of research in the field of robotic grasping aimed at improving potential grasps of novel objects [1]–[4]. Not only are the pose and configuration of the robotic arm and gripper important in grasp detection, but the shape and physical properties of the object to be grasped are also essential in determining how it should be gripped. An accurate understanding of the target object's 3D model is crucial for a swift pick-and-place action using a robotic manipulator. This can be achieved using 3D sensing cameras and sensors that provide a reasonably good perception of the object in the real world. Grasping methods typically fall into two categories: analytical methods and empirical (data-driven) methods [5].

The use of RGB-D cameras for object recognition, detection, and mapping has been shown to improve the ratio of successful grasps, as seen in [6]. RGB-D imagery is particularly useful because the depth information is used to map 3D real-world coordinates to valuable information in the 2D frame. Previous work in the field of robotics has focused solely on images obtained from a depth camera for grasp detection and object recognition [1].

Deep learning: Considerable progress has been made in recent years with vision-based techniques for robotic grasping using deep learning [1], [2], [4], [7]–[9]. A majority of the work in deep learning is associated with classification, while only a few approaches have used it for detection [1], [4]. All of these approaches use a bounding box that contains the observed object, and the bounding box is similar for each valid object detected. For robotic grasping, however, there may be several ways to grasp an object, and it is essential to pick the one with the highest grasp success or the most stable grasp, which is why machine learning techniques are used to find the best possible grasp. Convolutional neural networks are a popular technique for learning features and visual models with a sliding-window detection approach, as illustrated by [1]. However, this technique is slow for a robot in a real-world situation where fast actions may be required. [4] improved on this by passing an entire image through the network rather than using small patches to detect potential grasps.

Reinforcement learning: In reinforcement learning, the learner is not told which action to take; instead, it must discover which actions yield the most reward by trial and error. In most cases, an action affects not only the immediate reward but also the next state and, thereby, all subsequent rewards [10]. Thus, trial-and-error search and delayed reward are the two most prominent features of reinforcement learning. Several other works use rules for grasping based on research describing how humans grasp and manipulate objects. [11] presents early work on using reinforcement learning for robotic grasping in which the learning approach is adapted from humans grasping an object. Three layers of functional modules that enable learning from a finite set of data while maintaining good generalization [12], [13] have been shown to be successful implementations of reinforcement learning in the past.

However, due to issues such as memory complexity, sample complexity, and computational complexity, the user has to rely on deep learning networks. These networks use function approximation and representation learning to overcome the problems of algorithms that require very high computational power and fast processing. [14] discusses the use of a reinforcement learning algorithm, Policy Improvement with Path Integrals (PI²), that can be used when the state estimation is uncertain; this approach does not require a specific model and is thus model-free. [15] discusses how a system can learn to reliably perform its task in a short amount of time by implementing a reinforcement learning strategy with a minimal amount of information provided for a given object-picking task. Deep learning has enabled reinforcement learning to be used for decision-making problems, such as settings with high-dimensional state and action spaces, that were once unmanageable. In recent years, a number of off-policy reinforcement learning techniques have been implemented. For instance, [16] uses deep reinforcement learning for solving Atari games, and [17] uses a model-free algorithm based on the deterministic policy gradient to solve problems in the continuous action domain.

Current research on the deep Q-network (DQN) shows how deep reinforcement learning can be applied to design closed-loop grasping strategies. [18] demonstrates this by proposing a Q-function optimization technique that provides a scalable approach for vision-based robotic manipulation. [19] uses deep Q-learning, along with pushing, for learning grasping strategies in tightly packed spaces and cluttered environments. Our proposed work is based on a similar approach that uses an adaptation of deep reinforcement learning for detecting robot grasps.

III. PROBLEM FORMULATION

We define the problem of robotic grasping as a Markov Decision Process (MDP) in which, at any given state s_t ∈ S at time t, the agent (i.e., the robot) makes an observation o_t ∈ O of the environment, executes an action a_t ∈ A based on the policy π(s_t), and receives an immediate reward r_t from the reward function R(s_t, a_t). The goal of the agent is to find an optimal policy π* that maximizes the expected sum of discounted future rewards, i.e., the γ-discounted sum of all future returns from time t to ∞.

In our work, the observation o_t comprises the RGB-D image captured from the overhead depth camera, the RGB image taken from the wrist camera, and the joint angles of the robot's arm observed at time t.
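For reference, the objective above can be stated compactly as a discounted return. This is a standard restatement in the notation of this section; the return symbol G_t is introduced here for convenience and does not appear in the original text:

G_t = Σ_{k=0}^{∞} γ^k r_{t+k},    π* = argmax_π E_π[ G_t ],    γ ∈ [0, 1)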
Fig. 2: The architecture of the proposed Grasp-Q-Network. The input RGB-D image from the overhead camera, I_t^o, and the RGB image from the wrist camera, I_t^w, are each fed to a 7×7 convolution with stride 2 followed by batch normalization, then a 5×5 convolution, then a 3×3 convolution followed by batch normalization and max-pooling. The output features are concatenated and fed to a convolutional layer followed by a fully connected layer. The servo motor command M_t is processed by two fully connected layers. The combined result is then processed by two further fully connected layers, after which the network outputs the probability of grasp success using a softmax function.
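To make the caption above easier to follow, the listing below gives a minimal PyTorch sketch of the described two-stream network. It is an illustrative reconstruction only: the layer sequence follows Fig. 2, but the channel widths, hidden sizes, activations, pooling choice, and input resolution are assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class VisionNet(nn.Module):
    """Per-camera branch following Fig. 2: 7x7 conv (stride 2) + batch norm,
    then a 5x5 conv, then a 3x3 conv + batch norm + max-pooling.
    Channel widths are illustrative assumptions."""

    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.features(x)


class GraspQNetwork(nn.Module):
    """Sketch of the Grasp-Q-Network: visionNet branches for the overhead
    RGB-D image and the wrist RGB image, a motorNet branch for the servo
    command M_t, and a fused head that outputs a softmax over the actions."""

    def __init__(self, num_actions, motor_dim):
        super().__init__()
        self.overhead_net = VisionNet(in_channels=4)  # RGB-D overhead image
        self.wrist_net = VisionNet(in_channels=3)     # RGB wrist image
        self.fuse_conv = nn.Sequential(               # conv layer after concatenation
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # pooling choice is an assumption
        )
        self.fuse_fc = nn.Linear(64, 128)             # fully connected layer after the conv
        self.motor_net = nn.Sequential(               # motorNet: two fully connected layers
            nn.Linear(motor_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Sequential(                    # two fully connected layers after merging
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, overhead_img, wrist_img, motor_cmd):
        # Both images are assumed to be resized to the same resolution so that
        # the branch outputs can be concatenated along the channel dimension.
        v = torch.cat([self.overhead_net(overhead_img),
                       self.wrist_net(wrist_img)], dim=1)
        v = torch.relu(self.fuse_fc(self.fuse_conv(v).flatten(1)))
        m = self.motor_net(motor_cmd)
        q = self.head(torch.cat([v, m], dim=1))
        # Softmax turns the action values into grasp-success probabilities,
        # as described in the caption.
        return torch.softmax(q, dim=1)

The final softmax mirrors the caption's description of the network outputting grasp-success probabilities over the action set.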
IV. PROPOSED APPROACH

We propose a novel deep reinforcement learning based technique for vision-based robotic grasping. The proposed framework consists of a novel double deep Q-learning based architecture. The observed inputs from the overhead camera and the wrist camera, together with the current motor positions, are fed into the Grasp-Q-Network, which returns grasp success probabilities, i.e., the Q-values for all possible actions that the agent can take. These Q-values are then used to select the best action a_t based on an ε-greedy policy in order to find the optimal policy π*.

A. Architecture

The architecture of the proposed approach is shown in Fig. 2. The method involves end-to-end training of the Grasp-Q-Network based on a visual servoing mechanism that continuously adjusts the motor commands of the robot by observing the current state of the environment in order to produce a successful grasp. The motor actions are expressed in the robot frame, so the camera does not need to be calibrated precisely with respect to the end effector.

The model uses visual cues to establish the connection between the graspable object and the robot end-effector. A novel deep convolutional neural network decides what motion of the robot arm and gripper can produce an effective grasp. Each grasp attempt consists of a sequence of time steps. At each time step t, the robot records the current images I_t^o and I_t^w from the overhead and wrist cameras, respectively, and the current motor positions M_t. Using the policy π(s_t), the agent determines the next action, i.e., in which direction to move the gripper. The agent then evaluates the action at each time step t and produces a reward r_t using the reward function R(s_t, a_t). Each time step t results in a training sample T_t given by:

T_t = (I_t^o, I_t^w, M_t, a_t, r_t)    (1)

Each sample comprises the observed images, the motor position information, the action taken, and the reward received. The action space A is defined as a vector of motor actions comprising end-effector displacement along the three axes, rotation about the z-axis, and the gripper action of closing the gripper, which terminates the episode.

The Grasp-Q-Network comprises two parts, visionNet and motorNet, as shown in Fig. 2. It outputs probabilities as Q-values for each of the actions in the action space A, which are used to select the best action a_t. The Grasp-Q-Network can be considered a Q-function approximator for the learning policy described by the visual servoing mechanism λ(o_t). The repeated use of the target GQN, defined by ψ′(I_t, M_t), to obtain more data for refitting the GQN ψ(I_t, M_t) can be regarded as fitted Q-iteration.

B. DQN Algorithm

A deep Q-network (DQN) is a multi-layer neural network that returns a vector of action values Q(s, a | φ) for a given state s_t and sequence of observations O, where φ represents the parameters of the network. The main aim of the policy network is to continuously train φ to attain the optimal policy π*. Two important aspects of the DQN algorithm, as proposed by [20], are the use of a target network and the use of experience replay.
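As a concrete illustration of these two ingredients, the following self-contained Python sketch shows a simple replay buffer and a periodic target-network copy (assuming PyTorch-style modules that expose state_dict()). The buffer capacity, batch size, and field layout of a transition are illustrative choices, not values from the paper.

import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay: stores transitions T_t and samples
    random mini-batches, which de-correlates consecutive samples."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def push(self, overhead_img, wrist_img, motor_cmd, action, reward,
             next_obs, done):
        self.buffer.append(
            (overhead_img, wrist_img, motor_cmd, action, reward, next_obs, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target_network(online_net, target_net):
    """Copy the online parameters (phi) into the target parameters (phi'),
    performed every N time steps and kept fixed in between."""
    target_net.load_state_dict(online_net.state_dict())

Calling sync_target_network every N environment steps corresponds to the parameter copy φ′_{t=N} = φ_{t=N} described next.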
In our work, the target network ψ′, parameterized by φ′, is the same as the GQN ψ except that its parameters are copied from the GQN every N time steps, so that φ′_{t=N} = φ_{t=N}, and are kept fixed on all other steps. The target used by DQN is then:

Y_t^DQN ≡ r_t + γ max_a Q(s_{t+1}, a | φ′_t)    (2)

where γ ∈ [0, 1) is the discount factor. A value close to 0 indicates that the agent chooses actions based only on the current reward, while values approaching 1 indicate that the agent chooses actions based on the long-term reward.

Algorithm 1 Proposed Visual Servoing Mechanism
 1: procedure SERVOING(λ(o_t))
 2:   Given current images I_t^o and I_t^w
 3:   Given current motor position M_t
 4:   top:
 5:   Initialize time step t and reward r
 6:   loop:
 7:     Infer a_t using network ψ(I_t, M_t) and policy π*
 8:     Execute action a_t
 9:     if a_t is not executed then
10:       r ← r − 1
11:       goto top
12:     end if
13:     if result = successful pick then
14:       r ← r + 10
15:       goto top
16:     else if result = partial pick then
17:       r ← r + 1
18:       goto top
19:     else
20:       r ← r − 0.025
21:     end if
22:     t ← t + 1
23:     goto loop
24: end procedure

C. Training

Two separate value functions are trained in a mutually symmetric fashion by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, φ and φ′. One set of weights is used to learn the greedy policy and the other to determine its value at each update. For comparison, the Q-learning target is defined as:

Y_t^Q ≡ r_t + γ max_a Q(s_{t+1}, a | φ)    (3)

where r_t is the immediate reward, s_{t+1} is the resulting state, and Q(s, a | φ) is the parameterized value function. The target value in double deep Q-learning can be written as:

Y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a | φ) | φ′)    (4)

This resolves the overestimation issue of Q-learning and outperforms the existing deep Q-learning algorithm. Instead of the value obtained from the target network ψ_φ′, the max operator in double Q-learning uses the action that maximizes the current network ψ_φ.

Thus, double DQN can be seen as a policy network ψ, parameterized by φ, that is continually trained to approximate the optimal policy. Mathematically, it uses the Bellman equation to minimize the loss function L(φ):

L(φ) = E[ (Q(s_t, a_t | φ) − Y_t^DDQN)^2 ]    (5)

D. Reward Function

The reward for a successful grasp is R(s_t, a_t) = 10. A grasp is defined as successful if the gripper is holding an object above a certain height at the end of the episode. A partial reward of 1 is given if the end effector comes into contact with the object. A reward of -1 is given if no action is taken. Moreover, a small penalty of -0.025 is applied at every time step prior to termination to encourage the robot to grasp more quickly.

V. EXPERIMENTS

To evaluate our approach, we perform experiments using our proposed deep Q-learning and double deep Q-learning algorithms, with the vanilla Q-learning algorithm as a baseline. The experiments were performed both in simulation and in the real world on the 7-DOF Baxter robot. Further, the experiments were carried out using both single- and multiple-camera setups.

A. Setup

The simulation setup involves interfacing the Gazebo simulator in the Robot Operating System (ROS) with the Atari DQN, which provides a custom environment for simulating our proposed method. Although the proposed approach is robot agnostic, we use a Baxter robot with a parallel plate gripper as our robot platform. We create a simulation environment with a table placed in front of the robot and a camera mounted on the right side of the table. An object is randomly spawned on the table at the start of each episode. The objects consist of a cube, a sphere, and a cylinder, as seen in Fig. 4.

The real-world setup consisted of the Baxter robot, an Intel RealSense D435 depth camera used as the overhead camera, and a wrist camera, similar to the simulation setup. The overhead camera position was adjusted in the real-world implementation to avoid collisions with the robot.

The top row of Fig. 3 shows an example run of the Baxter robot learning to pick objects in the Gazebo simulation environment. The bottom row shows the robot approaching an object and successfully completing the picking task in the real-world setting.

B. Training

For training, at the start of each episode and at every reset, a colored sphere, cylinder, or cube is spawned (in simulation) or placed (in the real world) at a random orientation in front of the Baxter robot. The robot then tries to pick up the object by maneuvering its arm around the object.
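To connect this training procedure with the targets in Eqs. (3)-(5), the sketch below shows how a single double-DQN loss evaluation could look on a sampled mini-batch. The batched-tensor interface, the termination mask, and the discount value are assumptions for illustration; they are not taken from the paper.

import torch
import torch.nn.functional as F


def double_dqn_loss(q_online_t, q_online_tp1, q_target_tp1,
                    actions, rewards, dones, gamma=0.99):
    """Compute the double-DQN loss of Eq. (5) for a mini-batch.

    q_online_t:   Q(s_t, . | phi)       online network, current state,  (B, A)
    q_online_tp1: Q(s_{t+1}, . | phi)   online network, next state,     (B, A)
    q_target_tp1: Q(s_{t+1}, . | phi')  target network, next state,     (B, A)
    actions: (B,) long tensor of executed actions a_t
    rewards: (B,) float tensor of rewards r_t
    dones:   (B,) float tensor, 1.0 where the episode terminated
    """
    # Action selection uses the online network (argmax_a Q(s_{t+1}, a | phi)), Eq. (4) ...
    best_next = q_online_tp1.argmax(dim=1, keepdim=True)
    # ... while evaluation uses the target network Q(s_{t+1}, best_next | phi').
    next_value = q_target_tp1.gather(1, best_next).squeeze(1)
    y_ddqn = rewards + gamma * (1.0 - dones) * next_value            # Eq. (4)
    q_taken = q_online_t.gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_taken, y_ddqn.detach())                      # Eq. (5)


# Illustrative usage with random tensors standing in for network outputs:
B, A = 32, 7
loss = double_dqn_loss(
    q_online_t=torch.randn(B, A, requires_grad=True),
    q_online_tp1=torch.randn(B, A),
    q_target_tp1=torch.randn(B, A),
    actions=torch.randint(0, A, (B,)),
    rewards=torch.randn(B),
    dones=torch.zeros(B),
)
loss.backward()  # a gradient step would then be taken by an optimizer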
Fig. 3: Baxter learning to grasp objects using the proposed deep reinforcement learning based visual servoing algorithm in simulation (top row) and in the real world (bottom row)
Fig. 6: Comparison of the performance of single-view and multi-view models on the robotic grasping task

due to the robot arm. For this reason, we also compare the results of single-view and multi-view camera setups to test the performance of the robot when picking three objects: a sphere, a cube, and a cylinder, as shown in Fig. 6. The single-camera setup uses a single camera input, while the multi-camera setup incorporates the overhead camera and the wrist camera on the robot arm. The results demonstrate that the multi-view camera model works better than the single-view model for all three objects. An interesting observation is that both the single-view and multi-view models work best on the cube, with an accuracy of 91.1% for the multi-view model and 83.2% for the single-view model, compared to the sphere and the cylinder, and both have the lowest success rate on the sphere.

VI. CONCLUSIONS

A method for learning robust grasps is presented using a deep reinforcement learning framework that consists of a Grasp-Q-Network, which produces grasp probabilities, and a visual servoing mechanism, which performs continuous servoing to adjust the servo motor commands of the robot. In contrast to most grasping and visual servoing methods, this method does not require training on large sets of data, since it does not use any dataset for learning to pick up the objects. The experimental results show that the method works in simulation as well as on the real robot. The method also uses continuous feedback to correct inaccuracies and adjust the gripper to the movement of the object in the scene.

Our results using multi-view models show that the multi-view camera setup works better than the single-view camera setup and achieves a higher success rate. Further, the proposed novel Grasp-Q-Network has been used to learn a reinforcement learning policy that can learn to grasp objects from the feedback obtained through the visual servoing mechanism. Finally, the results show that by adopting an off-policy reinforcement learning method, the robot performs better at learning the grasping task than with the vanilla Q-learning algorithm, and the overestimation problem observed with the deep Q-learning algorithm is reduced.

REFERENCES

[1] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
[2] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
[3] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," The International Journal of Robotics Research, vol. 27, pp. 157–173, 2008.
[4] S. Kumra and C. Kanan, "Robotic grasp detection using deep convolutional neural networks," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 769–776, IEEE, 2017.
[5] J. Bohg, A. Morales, T. Asfour, and D. Kragic, "Data-driven grasp synthesis – a survey," IEEE Transactions on Robotics, vol. 30, pp. 289–309, April 2014.
[6] Y. Jiang, S. Moseson, and A. Saxena, "Efficient grasping from RGBD images: Learning using a new rectangle representation," in 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 3304–3311, IEEE, 2011.
[7] J. Yu, K. Weng, G. Liang, and G. Xie, "A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation," in 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1175–1180, Dec 2013.
[8] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, "Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291, IEEE, 2018.
[9] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg, "Learning ambidextrous robot grasping policies," Science Robotics, vol. 4, no. 26, p. eaau4984, 2019.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[11] M. A. Moussa and M. S. Kamel, "A connectionist model for learning robotic grasps using reinforcement learning," in Proceedings of International Conference on Neural Networks (ICNN'96), vol. 3, pp. 1771–1776, June 1996.
[12] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation (ICRA '04), vol. 3, pp. 2619–2624, April 2004.
[13] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, "Autonomous inverted helicopter flight via reinforcement learning," in Experimental Robotics IX (M. H. Ang and O. Khatib, eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 363–372, 2006.
[14] F. Stulp, E. Theodorou, J. Buchli, and S. Schaal, "Learning to grasp under uncertainty," in 2011 IEEE International Conference on Robotics and Automation, pp. 5703–5708, May 2011.
[15] T. Lampe and M. Riedmiller, "Acquiring visual servoing reaching and grasping skills using neural reinforcement learning," in The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, Aug 2013.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013.
[17] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36, IEEE, 2017.
[18] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., "Scalable deep reinforcement learning for vision-based robotic manipulation," in Conference on Robot Learning, pp. 651–673, 2018.
[19] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, "Learning synergies between pushing and grasping with self-supervised deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245, IEEE, 2018.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.