
Learning a Controller for Soft Robotic Arms and Testing its Generalization to New Observations, Dynamics, and Tasks

Carlo Alessi¹,², Helmut Hauser³, Alessandro Lucantonio⁴, and Egidio Falotico¹,² (Member, IEEE)

Abstract-- Recently, learning-based controllers that leverage mechanical models of soft robots have shown promising results. This paper presents a closed-loop controller for dynamic trajectory tracking with a pneumatic soft robotic arm, learned via Deep Reinforcement Learning using Proximal Policy Optimization. The control policy was trained in simulation leveraging a dynamic Cosserat rod model of the soft robot. The generalization capabilities of learned controllers are vital for successful deployment in the real world, especially when the encountered scenarios differ from the training environment. We assessed the generalization capabilities of the controller in silico with four tests. The first test involved the dynamic tracking of trajectories that differ significantly in shape and velocity profile from the training data. Second, we evaluated the robustness of the controller to perpetual external end-point forces during dynamic tracking. Third, we assessed the generalization to similar materials in tracking tasks. Finally, we transferred the control policy without retraining to intercept a moving object with the end-effector. The learned control policy showed good generalization capabilities in all four tests.

Index Terms-- Modeling, Control, and Learning for Soft Robots; Learning and Adaptive Systems; Soft Robot Applications.

*This work was supported by the European Union's Horizon 2020 Research and Innovation Programme under the Specific Grant Agreement No. 945539 (Human Brain Project SGA3).
¹The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy (email: {c.alessi, e.falotico}@santannapisa.it).
²Department of Excellence in Robotics and AI, Scuola Superiore Sant'Anna, Pisa, Italy.
³Department of Engineering Mathematics, University of Bristol, Bristol, UK (email: [email protected]).
⁴Department of Mechanical and Production Engineering, Aarhus University, Aarhus, Denmark (email: [email protected]).

I. INTRODUCTION

The modeling and control of continuum and soft robotic arms are still challenging problems due to hyper-redundancy, complex dynamics, and the non-linear properties of soft materials [1], [2]. Researchers have proposed several modeling techniques [3], which led to the development of a variety of model-based and model-free control strategies [4].

The most widely used approaches for deriving forward dynamics or kinematics models of continuum and soft robots have been geometrical models like the piece-wise constant curvature (PCC) approximation [5]. These models were used within proportional-derivative (PD) control laws for dynamic task-space control of a soft manipulator performing a variety of real-world tasks [6]. As an alternative, [7] described the shape of a synthetic planar soft robot analytically with a polynomial curvature model, which was used within an extended PD regulator to achieve perfect steady-state control in generic curvature conditions. However, the suitability of these model approximations degrades when the robot is subject to non-negligible external forces and unpredictable interactions with the environment.

Data-driven modeling and control are also viable approaches [8]. In the context of supervised learning, reservoir computing was used to emulate the nonlinear dynamics of a pneumatic soft robotic arm and to learn to reproduce trajectories [9]. Moreover, continual learning was proposed to tune the weights of a neural network-based controller to adapt to changes in the dynamics of a soft robotic arm due to loading conditions without catastrophic forgetting [10].

The success of reinforcement learning (RL) for behavior generation in rigid robots prompted interest in its application to continuum and soft robots [11]. The first applications of RL for soft robot control relied on discretized state-action spaces. The well-known Q-Learning algorithm was applied to train a model-free, open-loop, static controller for a multi-segment planar pneumatic soft arm [12]. The same algorithm was used to compare control policies learned in simulation and directly on the real soft robot subject to tip loads [13]. The SARSA algorithm was applied to obtain a static controller of position and stiffness for a hybrid soft robotic arm in a multi-agent setting, which considered the actuators as individual agents cooperating in a shared environment [14]. Following the recent advances in Deep RL, it is now also possible to consider continuous states and actuations. For example, Satheeshbabu et al. [15] learned, via Deep Q-Learning, transitions between way-points in quasi-static conditions with an open-loop position controller for a pneumatic soft robotic arm capable of bending and twisting, trained in simulation leveraging a Cosserat rod model. The same authors extended the work by increasing the dexterity of the soft robotic arm and attaining a closed-loop controller for precise quasi-static positioning via a Deep Deterministic Policy Gradient approach [16]. The authors validated both controllers on unseen payloads. Our work presented here, however, uses a dynamic (not just quasi-static) Cosserat rod model, subject to pressure-induced stretching and bending. In a similar work, Centurelli et al. [17] learned a closed-loop controller for dynamic trajectory tracking with a soft robotic arm via Trust Region Policy Optimization, leveraging an approximation of the robot forward dynamic model obtained by a recurrent neural network. In Naughton et al. [18], the authors applied several deep reinforcement learning algorithms to learn in simulation various control policies using a synthetic soft arm based on Cosserat theory. All these approaches confirm that Deep RL algorithms are suitable candidates for generating control policies for soft robots.

However, these works do not explore common problems for RL-based controllers: the ability to generalize to new observations, environment dynamics, and tasks [19]. In this work, we adopt an approach similar to [17], [18]. We propose a closed-loop controller for dynamic trajectory tracking tasks using a pneumatic soft robotic arm, trained via deep reinforcement learning using Proximal Policy Optimization. The control policy is learned in simulation leveraging a dynamic Cosserat rod model of the soft robot. However, we diversify the validation by producing different target velocity profiles to investigate in silico the generalization capabilities and limitations of the controller in various conditions and tasks. Specifically, we evaluate the generalization capabilities of the controller in silico with four tests: (i) tracking trajectories of different geometries and velocities; (ii) tracking trajectories subject to constant external forces applied to the end-effector; (iii) tracking trajectories using different material properties; and (iv) intercepting an object moving at various velocities towards the workspace. Note that all of them are carried out without retraining. The rest of the paper is organized as follows. Section II describes the soft robotic platform, the Cosserat rod model, and the training process to solve the control problem of dynamic tracking. Section III reports and discusses the results obtained for the four generalization tests. Section IV concludes with an insight into future works.

II. MATERIALS AND METHODS

A. Robotic Platform: The AM I-Support

The AM I-Support is a 3D-printed soft robotic arm with three elliptical pneumatic chambers that can generate large movements by combining stretching and bending [20]. As shown in Fig. 1a, two terminal plates (top and bottom) confine the modules, six rings distributed along the body constrain the chambers, and nuts and bolts assemble the parts. The soft robotic arm has a cross-section of radius 30 mm, an overall length of ∼202 mm, and an overall weight of ∼183 g. The pneumatic chambers are ∼180 mm long, while the top and bottom terminals are ∼20 mm and ∼5 mm long, respectively. The actuators are distributed axially, at a radial distance of 20 mm from the cross-section centroid, and equally spaced by 120° around the centre. The arm was fabricated using the soft thermoplastic polyurethane with 80 Shore A hardness (TPU 80 A LF, by BASF™), which is characterized by a tensile strength of 17 MPa and an elongation at break of 471%.

Fig. 1: Robotic platform. (a) The AM I-Support. (b) Rendering of the computational model.

B. Cosserat Rod Model

The soft robotic arm was modeled as a Cosserat rod with constant cross-section and homogeneous material properties, by extending the Cosserat theory introduced in [21] to account for the pneumatic actuation. A rod is described by a center-line x̄(s, t) ∈ R³ and an orthogonal rotation matrix Q(s, t) = {d̄₁, d̄₂, d̄₃}⁻¹. Here, t is time and s ∈ [0, L] is the material coordinate of a rod of length L. Q transforms vectors from the global frame to the local frame via x = Qx̄, and vice versa x̄ = Q⊺x. If the rod is unsheared, (d̄₁, d̄₂) spans the normal-binormal plane of the cross-section, and d̄₃ points along the center-line tangent (∂s x̄ = x̄s). The deformations that the rod can undergo are expressed by the shear/stretch vector σ(s, t) and the bend/twist vector κ(s, t). Shearing and stretching deviate d₃ from x̄s, with σ = Q(x̄s − d̄₃) = Qx̄s − d₃ in the local frame. The curvature vector κ encodes the rotation rate of Q along the material coordinate, ∂s dj = κ × dj, while the angular velocity ω is defined by ∂t dj = ω × dj. The rod dynamics is then governed by the following set of nonlinear differential equations:

\rho A \, \partial_t^2 \bar{x} = \partial_s\!\left( \frac{Q^\top S \sigma}{e} \right) + e \bar{f},    (1)

\frac{\rho I}{e} \, \partial_t \omega = \partial_s\!\left( \frac{B \kappa}{e^3} \right) + \frac{\kappa \times B \kappa}{e^3} + Q \frac{\bar{x}_s}{e} \times S \sigma + \rho I \, \frac{\omega}{e} \times \omega + \frac{\rho I \omega}{e^2} \, \partial_t e + e c,    (2)

where B is the bend/twist stiffness matrix, S is the shear/stretch stiffness matrix, ρ is the constant material density, A is the cross-sectional area, I is the second area moment of inertia, f̄ is the external force, c is the external couple, and e = |x̄s| is the local stretching.

The pressurization of the pneumatic chambers of the AM I-Support produces an internal force along d₃, normal to the rod cross-section, and a bending moment. To describe the deformations of the robot, we modeled the pressure-induced strains as spontaneous stretching and bending, modifying the rest configuration of the arm dynamically. The rod is subject to gravity, viscous forces, viscous torques, and external forces applied to the free end, which can be integrated into the body dynamics via f̄ and c in (1)-(2). The rest length of the rod and the cross-section radius were directly measured from the physical prototype [20]. We computed the effective cross-sectional area and the second area moment of inertia considering the actuator geometry. The material density was taken from the TPU datasheet, the Young modulus was fitted from experimental stretching data, and the damping coefficients were optimized on dynamic stretching and bending data. This computational environment was used for training and testing the controller.
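To make the kinematic quantities above concrete, the snippet below sketches how the local stretch e and the shear/stretch vector σ could be computed for a discretized rod. It is only an illustration of the definitions, not the simulation environment used in the paper; the array layout (directors stored as rows of Q) and the helper name are assumptions.

```python
import numpy as np

def shear_stretch(xbar, Q, rest_lengths):
    """Discrete Cosserat kinematics for a rod with n elements.

    xbar:         (n+1, 3) centerline node positions (global frame)
    Q:            (n, 3, 3) rotation matrices; rows are the directors d1, d2, d3
    rest_lengths: (n,) undeformed element lengths
    """
    tangents = np.diff(xbar, axis=0)              # element vectors between nodes
    xbar_s = tangents / rest_lengths[:, None]     # discrete derivative d(xbar)/ds
    e = np.linalg.norm(xbar_s, axis=1)            # local stretch e = |xbar_s|
    # sigma = Q xbar_s - d3, where d3 expressed in its own frame is (0, 0, 1)
    sigma = np.einsum("nij,nj->ni", Q, xbar_s) - np.array([0.0, 0.0, 1.0])
    return e, sigma
```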
C. Control Architecture

To solve a trajectory tracking task using this soft robotic arm, we adopt the closed-loop control scheme shown in Fig. 2.

Fig. 2: Control scheme, with z⁻¹ the discrete time-delay operator.

An arbitrary trajectory generator provides the reference position x^tar_{t+1} ∈ R³, i.e., the desired Cartesian position of the robot end-effector for the next time step t+1. The controller is implemented as a feed-forward neural network with two hidden layers, each with 64 neurons and tanh activation functions. The output layer has a linear activation function. The controller takes as input

x_t = \left[\, d_t,\ e_t,\ x^{\mathrm{tip}}_t,\ x^{\mathrm{tip}}_{t-1},\ x^{\mathrm{tip}}_{t-2} \,\right] \in \mathbb{R}^{15},    (3)

where d_t = x^tar_{t+1} − x^tip_t is the distance vector between the next desired position and the current free-end position of the arm x^tip_t, measured before the actuation, e_t = x^tar_t − x^tip_t is the current tracking error, and x^tip_{t−1} and x^tip_{t−2} are the two previous positions of the robot end-effector. The vector d_t therefore provides the controller with a minimal prediction of the future target position. The error vector e_t measures how well the tracking proceeds, in line with standard closed-loop controllers. Finally, x^tip_t, x^tip_{t−1}, and x^tip_{t−2} provide the controller with a simple short-term memory that allows it to infer the velocity and acceleration of the soft robotic arm. The controller outputs three pressure commands for the three chambers, p_t = [p_1, p_2, p_3], limited between 0 and 3.5 bar. The initial values are d_0 = x^tar_1 − x^tip_0, e_0 = 0, and x^tip_0 = x^tip_{−1} = x^tip_{−2} = [0, 0, −L]. The control loop operates at a frequency of 10 Hz, actuating the robot every ∆t = 0.1 s.
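The controller described above can be sketched as follows. This is a minimal PyTorch rendition of the stated architecture (two hidden layers of 64 tanh units, linear output, 15-dimensional input, three pressure outputs in [0, 3.5] bar); the function names and the way raw outputs are squashed into the pressure range are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn as nn

P_MAX = 3.5  # maximum chamber pressure in bar

# Feed-forward policy: two hidden layers of 64 tanh units, linear output head.
policy = nn.Sequential(
    nn.Linear(15, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 3),
)

def observation(x_tar_next, x_tar_now, tip_now, tip_prev, tip_prev2):
    """Assemble x_t = [d_t, e_t, x_tip_t, x_tip_{t-1}, x_tip_{t-2}] (15 values)."""
    d = x_tar_next - tip_now   # look-ahead vector toward the next reference point
    e = x_tar_now - tip_now    # current tracking error vector
    return np.concatenate([d, e, tip_now, tip_prev, tip_prev2]).astype(np.float32)

def pressure_commands(obs):
    """Map an observation to three chamber pressures in [0, P_MAX] bar."""
    with torch.no_grad():
        raw = policy(torch.from_numpy(obs))
    return (0.5 * (torch.tanh(raw) + 1.0) * P_MAX).numpy()
```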
D. Reinforcement Learning Algorithm

We solve the control problem using deep Reinforcement Learning. In general, in RL an agent receives at each time step t an observation o_t from the environment, which is a subset of the full environment state s_t. The agent acts according to a policy π mapping states/observations to actions, which can be deterministic or stochastic. The agent receives a scalar reward r(s, a) indicating the current task performance. Let the return G_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i) be the discounted sum of future rewards, with discount factor γ ∈ [0, 1]. The agent aims to maximize the expected return E_π[G_0 | s_0]. The state-value function is defined as V_π(s) = E_π[G_t | s_t], and the action-value function as Q_π(s, a) = E_π[G_t | s_t, a_t]. The advantage function A_π(s, a) = Q_π(s, a) − V_π(s) expresses whether the action a is better or worse than the average action the policy π takes in the state s.

In our setting, the agent is the controller implemented as a neural network, and the environment is the soft robotic arm modeled as a Cosserat rod. Therefore, the agent receives observations s_t = x_t and outputs actions a_t = p_t (see Fig. 2). State and action spaces are each normalized between -1 and 1 to increase the numerical stability of the training process. The reward is defined as

r_t = \begin{cases} -10 & \text{if NaN} \\ -e_t + b(e_t) & \text{otherwise,} \end{cases}    (4)

where the penalty term of −10 is applied to discourage actions that would cause numerical instabilities, as proposed by [18], e_t = ||e_t|| is the norm of the tracking error, and an inductive bias b(·) is provided as an incentive to explore:

b(e) = \begin{cases} 0.05 & 0.03 < e \le 0.05 \\ 0.1 & 0.01 < e \le 0.03 \\ 0.2 & e \le 0.01. \end{cases}    (5)

1) Proximal Policy Optimization: The controller is learned via Proximal Policy Optimization (PPO), a policy-gradient method appropriate for continuous control tasks [22]. In particular, we adopt the reliable implementation provided by [23]. The algorithm jointly optimizes a stochastic policy π(a|s) and a value-function approximator. PPO alternates between sampling data from the policy through interaction with the environment and performing optimization on the sampled data using stochastic gradient descent (SGD) to maximize the objective

\mathbb{E}\left[ \min\!\left( \rho_t(\pi)\,\hat{A}_t,\ \mathrm{clip}\!\left(\rho_t(\pi),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right],    (6)

where \rho_t(\pi) = \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)} is the ratio between the probability of selecting an action under the current policy π and the probability of taking it under the policy π_old that collected the current batch of data, ϵ = 0.2 is the clipping parameter, and \hat{A} is an estimator of the advantage function. This loss encourages the policy to select actions with a positive advantage while discouraging large policy updates via clipping.

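As a concrete reading of Eqs. (4)-(6), the sketch below implements the reward with its exploration bias and the clipped surrogate objective in plain NumPy. It is illustrative only (no gradients are taken, and b(e) is assumed to be zero outside the listed intervals); it is not the Stable Baselines code used for training.

```python
import numpy as np

def bias(e):
    """Exploration incentive b(e) of Eq. (5); e is the tip error norm in metres.
    Assumed to be zero outside the listed intervals."""
    if e <= 0.01:
        return 0.2
    if e <= 0.03:
        return 0.1
    if e <= 0.05:
        return 0.05
    return 0.0

def reward(error_vec, diverged=False):
    """Reward of Eq. (4): -10 on numerical failure, otherwise -||e_t|| + b(||e_t||)."""
    if diverged:
        return -10.0
    e = float(np.linalg.norm(error_vec))
    return -e + bias(e)

def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped PPO objective of Eq. (6), averaged over a batch of samples."""
    ratio = np.exp(log_prob_new - log_prob_old)              # rho_t(pi)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```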
E. Training Process

The control policy was optimized using PPO. For each episode, a random target trajectory was produced. The starting point was the resting tip position x^tip_0, and two additional way-points were uniformly sampled from 512 positions in the workspace. The target trajectory x^tar was then produced by interpolating these three points with a cubic spline (see Fig. 3b). This was redone for each training episode, which ensured that the controller visited different parts of the workspace. The duration of each training episode was fixed at T = 10 s. Note that, since the space traveled in these 10 s depended on how far apart the sampled way-points were, the training trajectories had different velocity profiles. This approach ensured that the controller experienced a wide range of velocities ∆x^tar. Through this process, we obtained learning data points with velocities in the range [0, 0.10] m/s and a mean of 0.025±0.016 m/s (see also the orange curve in Fig. 5a). Each episode started with the robot at rest, facing vertically downward (see Fig. 1). The episode terminated when the entire target trajectory had been covered (i.e., the time limit of 10 s was reached) or when numerical problems occurred.

Fig. 3: Examples of target trajectories. The starting point is the resting tip position x^tip_0. Additional waypoints are sampled uniformly from the workspace. Target trajectories x^tar are generated by interpolating x^tip_0 and the waypoints using a cubic spline. (a) 3D straight lines (1 waypoint, T = 10 s); (b) 3D curves (2 waypoints, T = 10 s); (c) 3D curves (3 waypoints, T = 15 s). The controller was trained on curves with two waypoints and tested on straight lines and curves with three waypoints.
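A sketch of this trajectory-generation step is given below, assuming the spline knots are spread evenly over the episode duration and that the 512 candidate workspace positions are available as an array; both assumptions and the function name are ours, not taken from the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def make_target_trajectory(tip_rest, workspace_points, n_waypoints=2,
                           T=10.0, dt=0.1, rng=None):
    """Interpolate the resting tip position and random way-points with a cubic
    spline and sample it at the 10 Hz control rate."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(workspace_points), size=n_waypoints, replace=False)
    knots = np.vstack([tip_rest, workspace_points[idx]])   # (n_waypoints + 1, 3)
    t_knots = np.linspace(0.0, T, len(knots))              # assumed even spacing
    spline = CubicSpline(t_knots, knots, axis=0)
    t = np.arange(0.0, T + dt, dt)
    return spline(t)                                       # (T/dt + 1, 3) targets
```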
We trained a stochastic policy to encourage exploration of the environment. After training, we used the deterministic, greedy policy to exploit the best actions learned. After an empirical model selection and hyper-parameter tuning, the learning took place over 1.2 million time steps (i.e., ∼10k episodes), equivalent to about 33 hours of learning experience in silico. The training episodes were collected using N = 8 parallel agents interacting with the environment for M = 64 time steps per policy update. At each iteration, the policy was optimized on the current N·M samples with SGD for ten epochs, using four mini-batches and a learning rate of 0.00025. The training lasted about 14 hours on a standard laptop (Intel i7-7500U processor, 8 GB RAM).

The learning curve in Fig. 4 shows the sum of the rewards the agent received in each training episode. The light blue curve is noisy because of the intrinsic explorative behavior of training a stochastic policy and the fact that each episode generates a new target trajectory x^tar. Nonetheless, the trend of the exponential moving average (dark blue) is increasing.

Fig. 4: Learning curve showing the cumulative episode reward.

Fig. 5: Statistics for the trajectory tracking tasks. (a) Distribution of target velocities for each task. (b) Error distribution for each task. The control policy trained on 3D curves (2 waypoints) generalizes to different trajectories and velocities.
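For reference, the hyper-parameters above map onto a PPO configuration roughly as follows. The sketch uses Stable-Baselines3 rather than the original Stable Baselines release cited in [23], and `SoftArmTrackingEnv` is a hypothetical placeholder for a Gym-style wrapper around the Cosserat rod simulation, so treat this as an approximation of the setup rather than the exact training script.

```python
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Hypothetical Gym-style wrapper around the Cosserat rod simulation (not shown).
from soft_arm_env import SoftArmTrackingEnv

env = make_vec_env(SoftArmTrackingEnv, n_envs=8)        # N = 8 parallel agents

model = PPO(
    "MlpPolicy",
    env,
    n_steps=64,            # M = 64 steps per worker per update
    batch_size=128,        # 8 * 64 = 512 samples split into 4 mini-batches
    n_epochs=10,           # ten SGD epochs per iteration
    learning_rate=2.5e-4,
    clip_range=0.2,        # epsilon in Eq. (6)
    policy_kwargs=dict(net_arch=[64, 64], activation_fn=torch.nn.Tanh),
)
model.learn(total_timesteps=1_200_000)                  # ~10k episodes
```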
III. RESULTS AND DISCUSSION

After learning the stochastic policy, we evaluated its greedy (deterministic) version. We conducted four tests to investigate its generalization abilities. First, we tested how well the controller could track trajectories with different geometries and velocity profiles. Second, we assessed the robustness of the controller to external forces of various magnitudes and directions applied to the robot's end-effector during trajectory tracking. Third, we evaluated the generalization of the policy to different material stiffnesses for dynamic tracking. As a performance metric, we adopted the mean and standard deviation of the normalized tip error e/L, i.e., the error as a percentage of the robot length L. In addition, we measured the spread of the normalized tip error using the interquartile range (IQR), which is robust to extreme outliers. Finally, we deployed the controller to make the soft robot tip intercept the trajectory of a moving object.
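For clarity, the two metrics can be computed as in the following sketch, where tip and target positions are per-step arrays collected over the test episodes and L is the ~202 mm arm length from Section II-A; the function name is illustrative.

```python
import numpy as np

def tip_error_stats(tip_positions, target_positions, L=0.202):
    """Normalized tip error e/L in percent: mean, standard deviation, and IQR."""
    e = np.linalg.norm(tip_positions - target_positions, axis=-1) / L * 100.0
    q1, q3 = np.percentile(e, [25, 75])
    return e.mean(), e.std(), q3 - q1
```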

Fig. 6: Controller evaluation on three sample trajectories: (left) straight line (1 waypoint); (center) curve (2 waypoints); (right) curve (3 waypoints). The controller outputs different pressure profiles to track geometrically different target trajectories.

A. Generalization to Trajectory Tracking Tasks

We tested the control policy, trained on trajectories generated from two waypoints, on three different sets of 100 generated tracking tasks (see Fig. 3).

As a baseline, we assessed the performance on trajectories sampled from the same distribution as the ones used for training, i.e., 3D curves (two way-points, same starting point, duration T = 10 s). The velocities for this task ranged in [0, 0.10] m/s with a mean of 0.025±0.016 m/s. On this set of trajectories, the controller achieved a mean tip error of 8.86±6.22%L and an IQR of 5.60%L (see Table I).

TABLE I: Tip error e/L (%) on dynamic trajectory tracking.

  Trajectory                 mean ± std (%)   IQR (%)
  3D curves (2 waypoints)    8.86 ± 6.22      5.60
  3D lines (1 waypoint)      6.56 ± 4.03      5.11
  3D curves (3 waypoints)    9.61 ± 6.61      6.56

For the first test of generalization capabilities, we employed trajectories that differed from the training set in geometry and velocity profile (see Fig. 5a). The first group of testing trajectories included 3D straight lines (1 way-point, T = 10 s) with velocities in the range [0, 0.019] m/s and a mean of 0.01±0.004 m/s. The second group included 3D curves (3 way-points, T = 15 s) with velocities in the range [0, 0.178] m/s and a mean of 0.031±0.02 m/s. The mean tip errors attained on these tasks were 6.56±4.03%L and 9.61±6.61%L, respectively (see Table I for a comparison). The controller tracked 3D straight lines better than the baseline (i.e., 3D curves with two way-points) and performed slightly worse on 3D curves with three way-points. Fig. 5b shows the distributions of the tip errors for each task. The controller achieved good results in tracking trajectories that differed not only geometrically (e.g., lines versus curves) but also in the velocity profile required to follow them. The hypothesis is that the learned policy has generalized to different tracking tasks. This was verified quantitatively by conducting a Kolmogorov-Smirnov statistical test [24] on the velocity distributions (see Fig. 5a). The test confirmed that the distributions are pair-wise different. In particular, the p-value was 0 for the three tests, rejecting the null hypothesis that the distributions are identical. Fig. 6 shows examples of actuation, dimension-wise tracking, and tracking error for each trajectory type.
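The pair-wise comparison can be reproduced with SciPy's two-sample Kolmogorov-Smirnov test, as sketched below; the velocity arrays are placeholders for the per-step target speeds of each trajectory set.

```python
from scipy.stats import ks_2samp

def compare_velocity_distributions(vel_reference, vel_other):
    """Two-sample Kolmogorov-Smirnov test between target-velocity samples (m/s).
    A small p-value rejects the hypothesis that the two distributions coincide."""
    statistic, p_value = ks_2samp(vel_reference, vel_other)
    return statistic, p_value
```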
B. Generalization to External Forces

External forces applied to soft robots can significantly alter their dynamics. In this experiment, we evaluated the robustness of the controller to perpetual external forces applied to the end-effector during dynamic trajectory tracking tasks. The force f_ext = [f_x, f_y, f_z] includes the special case of a standard payload, in which f_x = f_y = 0. After sampling the force vector components from a normal distribution, f_ext was scaled to the desired magnitude. We investigated three different magnitudes, i.e., f_ext ∈ {0.1, 0.5, 1.0} N. When the soft arm was at rest, these forces caused an average deflection in the direction of f_ext of 1.16±0.32%L, 5.82±1.56%L, and 11.65±3.06%L, respectively. The controller did not have any explicit information about the perturbations. The controller, evaluated on 100 random trajectories for each of the three magnitudes, each with a different endpoint force, showed performance comparable with the no-force case (see Table II). Despite the wide range of force magnitudes, i.e., between f_ext = 0.1 N and f_ext = 1.0 N, the tip error increased only by around 1%L. Therefore, the control policy trained without disturbances generalized well to perpetual external forces applied to the end-effector post-training.

TABLE II: Tip error e/L (%) for trajectory tracking subject to random perpetual external endpoint forces.

  Trajectory                 fext (N)   mean ± std (%)   IQR (%)
  3D curves (2 waypoints)    0.0        8.25 ± 5.64      5.23
  3D curves (2 waypoints)    0.1        8.31 ± 5.73      5.16
  3D curves (2 waypoints)    0.5        8.75 ± 6.30      5.42
  3D curves (2 waypoints)    1.0        9.89 ± 7.59      5.80
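A possible implementation of this force-sampling step is sketched below; the random-generator handling and the function name are assumptions.

```python
import numpy as np

def random_endpoint_force(magnitude, rng=None):
    """Sample the force components from a standard normal distribution and
    rescale the vector to the desired magnitude (0.1, 0.5 or 1.0 N here)."""
    rng = rng or np.random.default_rng()
    f = rng.normal(size=3)
    return magnitude * f / np.linalg.norm(f)
```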

Fig. 7: Controller generalization to a range of Young moduli.

C. Generalization to Material Properties

Materials play a crucial role in soft robotics. Factors like temperature changes or material degradation could nonlinearly modify the stiffness of soft robots. Similarly, soft robots with equal geometries but different material properties could attain significantly different deformations that affect the reachable workspace. Therefore, it is interesting to evaluate the generalization of a controller to diverse materials. In particular, we tested how the control policy generalized to different Young moduli of the Cosserat rod in tracking trajectories. The calibrated value of the Young modulus of the rod used to train the controller was E = 1.65 MPa. The learned controller was tested on Young moduli ranging from 0.49 MPa to 6.59 MPa in steps of ∆E = 0.1E, i.e., from 0.3 to four times the calibrated value. This range is reasonable for soft robotics applications. For each value of the Young modulus, we averaged the normalized tip errors over 100 target trajectories sampled from two waypoints with T = 10 s. As shown in Fig. 7, the controller generalized fairly well for values that deviated moderately from the calibrated one. However, the mean and standard deviation of the normalized tip error increased with ∆E. Notice how the error curve is asymmetric about the calibrated stiffness: for lower values of the Young modulus, the tracking error increased faster than for higher values. This could be because softer materials undergo larger deformations under the same applied pressure, making them harder to control. Conversely, stiffer materials reduced the reachable workspace; as a result, the controller could not track all the points along the generated trajectories as effectively. In summary, the controller generalized well (e.g., average tip error below 9%L) to similar materials, with Young modulus in the range [0.9E, 1.4E].

D. Generalization to Trajectory Interception Tasks

After assessing that the control policy can generalize to track different trajectories at various speeds, we deployed the controller to intercept a moving object. Again, the control policy was not retrained for the new task. The object was a sphere of radius R_obj = 25 mm, identified by its centroid x^tar, travelling along a path towards the workspace. The object's initial position x^tar_0 was uniformly sampled outside the workspace, from a sphere of radius 0.3 m centered at the resting position of the free end of the robot, i.e., x^tip_0. Then, one or two additional way-points were sampled uniformly from the workspace and interpolated with a cubic spline to generate the object trajectories, i.e., 3D straight lines or 3D curves. The task was considered successful if the end-effector contacted the object, intercepting its trajectory.
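A minimal sketch of the success criterion, under the assumption that contact is declared when the tip-to-centroid distance drops below the object radius at any control step:

```python
import numpy as np

R_OBJ = 0.025  # object radius: a 25 mm sphere

def intercepted(tip_positions, object_positions, r_obj=R_OBJ):
    """True if the end-effector touches the sphere at some control step,
    i.e. the distance between tip and object centroid falls below the radius."""
    distances = np.linalg.norm(tip_positions - object_positions, axis=-1)
    return bool(np.any(distances <= r_obj))
```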
TABLE III: Accuracy of the trajectory interception task.

  T (s)   3D lines   3D curves
  10      85%        95%
  5       74%        90%
  2       46%        64%
  1       33%        52%
  0.5     18%        27%

We evaluated the accuracy of trajectory interception for different object velocities to understand the complexity of the interception task (see Table III). The average velocity of the object ranged from 0.03 m/s (T = 10 s) up to 0.51 m/s (T = 0.5 s) for the linear trajectories, and from 0.05 m/s (T = 10 s) up to 0.63 m/s (T = 0.5 s) for the curvilinear trajectories. As expected, the percentage of successful interceptions decreased for objects moving at higher velocities, for both trajectory types. This was because the average object trajectory in the interception task was up to 25 times faster (for T = 0.5 s) than the average training trajectory; the mechanical limitations of the soft robot also play a role. Interestingly, the success rate for the curvilinear trajectories was higher than for the straight lines, despite the higher velocities of the former. This could be because the curves stay inside the learned workspace for longer, increasing the probability of interception. Overall, the learned controller performed the object interception satisfactorily, suggesting that the knowledge learned for dynamic path following is transferrable to other tasks. Fig. 8 shows a rendering of a successful trajectory interception trial for T = 5 s. From Fig. 9, observe that the object started from a remote position and quickly moved toward the workspace. The controller smoothly tracked the trajectory, contributing with all three chambers and reducing the tracking error.

Fig. 8: Rendering of the object interception with the soft robotic arm at 0/4, 1/4, 2/4, 3/4, and 4/4 of the episode. The episode lasts 3.3 seconds, with a maximum duration T = 5 s.

Fig. 9: Successful trajectory interception example, T = 5 s.

IV. CONCLUSION

In this paper, we leveraged a dynamic Cosserat rod model of a soft robotic arm and trained a control policy for dynamic trajectory tracking using Proximal Policy Optimization, a deep reinforcement learning algorithm. We investigated how well the learned control policy generalized to new observations, including tracking trajectories of different geometry and velocity profiles. Moreover, the controller generalized to new environment dynamics imposed as perpetual end-point forces of different magnitudes in arbitrary directions. We also tested new dynamics in the form of various material stiffnesses in trajectory tracking. Finally, the policy also generalized to the new task of intercepting a moving object with the end-effector (zero-shot transfer). While, as expected, the performance dropped slightly for these new scenarios, the learned model was surprisingly robust. Cosserat rod models of soft robots are therefore promising for learning complex control policies in simulation. Extensions to this work include the sim-to-real transfer of the controller to multi-section soft robotic arms, orientation tracking, the use of recurrent networks, and learning the trajectory interception policy.

REFERENCES

[1] C. Laschi and M. Cianchetti, "Soft robotics: new perspectives for robot bodyware and control," Frontiers in Bioengineering and Biotechnology, vol. 2, p. 3, 2014.
[2] D. Rus and M. T. Tolley, "Design, fabrication and control of soft robots," Nature, vol. 521, no. 7553, pp. 467–475, 2015.
[3] C. Armanini, F. Boyer, A. T. Mathew, C. Duriez, and F. Renda, "Soft robots modeling: A structured overview," IEEE Transactions on Robotics, 2023.
[4] T. George Thuruthel, Y. Ansari, E. Falotico, and C. Laschi, "Control strategies for soft robotic manipulators: A survey," Soft Robotics, vol. 5, no. 2, pp. 149–163, 2018.
[5] R. J. Webster III and B. A. Jones, "Design and kinematic modeling of constant curvature continuum robots: A review," The International Journal of Robotics Research, vol. 29, no. 13, pp. 1661–1683, 2010.
[6] O. Fischer, Y. Toshimitsu, A. Kazemipour, and R. K. Katzschmann, "Dynamic task space control enables soft manipulators to perform real-world tasks," Advanced Intelligent Systems, p. 2200024, 2022.
[7] C. Della Santina and D. Rus, "Control oriented modeling of soft robots: the polynomial curvature case," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 290–298, 2019.
[8] D. Kim, S.-H. Kim, T. Kim, B. B. Kang, M. Lee, W. Park, S. Ku, D. Kim, J. Kwon, H. Lee et al., "Review of machine learning methods in soft robotics," PLOS ONE, vol. 16, no. 2, p. e0246102, 2021.
[9] M. Eder, F. Hisch, and H. Hauser, "Morphological computation-based control of a modular, pneumatically driven, soft robotic arm," Advanced Robotics, vol. 32, no. 7, pp. 375–385, 2018.
[10] F. Piqué, H. T. Kalidindi, L. Fruzzetti, C. Laschi, A. Menciassi, and E. Falotico, "Controlling soft robotic arms using continual learning," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5469–5476, 2022.
[11] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[12] X. You, Y. Zhang, X. Chen, X. Liu, Z. Wang, H. Jiang, and X. Chen, "Model-free control for soft manipulators based on reinforcement learning," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2909–2915.
[13] H. Zhang, R. Cao, S. Zilberstein, F. Wu, and X. Chen, "Toward effective soft robot control via reinforcement learning," in International Conference on Intelligent Robotics and Applications. Springer, 2017, pp. 173–184.
[14] Y. Ansari, M. Manti, E. Falotico, M. Cianchetti, and C. Laschi, "Multiobjective optimization for stiffness and position control in a soft robot arm module," IEEE Robotics and Automation Letters, vol. 3, no. 1, pp. 108–115, 2017.
[15] S. Satheeshbabu, N. K. Uppalapati, G. Chowdhary, and G. Krishnan, "Open loop position control of soft continuum arm using deep reinforcement learning," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5133–5139.
[16] S. Satheeshbabu, N. K. Uppalapati, T. Fu, and G. Krishnan, "Continuous control of a soft continuum arm using deep reinforcement learning," in 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft). IEEE, 2020, pp. 497–503.
[17] A. Centurelli, L. Arleo, A. Rizzo, S. Tolu, C. Laschi, and E. Falotico, "Closed-loop dynamic control of a soft manipulator using deep reinforcement learning," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4741–4748, 2022.
[18] N. Naughton, J. Sun, A. Tekinalp, T. Parthasarathy, G. Chowdhary, and M. Gazzola, "Elastica: A compliant mechanics environment for soft robotic control," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3389–3396, 2021.
[19] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, "A survey of generalisation in deep reinforcement learning," arXiv preprint arXiv:2111.09794, 2021.
[20] L. Arleo, G. Stano, G. Percoco, and M. Cianchetti, "I-Support soft arm for assistance tasks: a new manufacturing approach based on 3D printing and characterization," Progress in Additive Manufacturing, vol. 6, no. 2, pp. 243–256, 2021.
[21] M. Gazzola, L. Dudte, A. McCormick, and L. Mahadevan, "Forward and inverse problems in the mechanics of soft filaments," Royal Society Open Science, vol. 5, no. 6, p. 171628, 2018.
[22] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[23] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, "Stable Baselines," https://github.com/hill-a/stable-baselines, 2018.
[24] J. L. Hodges, "The significance probability of the Smirnov two-sample test," Arkiv för Matematik, vol. 3, no. 5, pp. 469–486, 1958.
