Safe Optimal Control Using Stochastic Barrier Functions and Deep Forward-Backward SDEs
Abstract: This paper introduces a new formulation for stochastic optimal control
and stochastic dynamic optimization that ensures safety with respect to state and
control constraints. The proposed methodology brings together concepts such as
Forward-Backward Stochastic Differential Equations, Stochastic Barrier Functions,
Differentiable Convex Optimization and Deep Learning. Using the aforementioned
concepts, a Neural Network architecture is designed for safe trajectory optimization
in which learning can be performed in an end-to-end fashion. Simulations are
performed on three systems to show the efficacy of the proposed methodology.
1 Introduction
Recent advancements in the areas of stochastic optimal control theory and machine learning create new
opportunities towards the development of scalable algorithms for stochastic dynamic optimization.
Despite this progress, there is a scarcity of methodologies that are solidly grounded in first principles, have the flexibility of deep learning algorithms in terms of representational power, and can be deployed on systems operating in safety-critical scenarios.
Safety plays a major role in designing any engineering system in various industries ranging from
automobiles and aviation to energy and medicine. With the rapid emergence of various advanced
autonomous systems, the control system community has investigated various techniques such as
barrier methods [1], reachable sets [2], and discrete approximation [3] to ensure safety certifications.
However, with the recent introduction of Control Barrier Functions (CBFs) [4, 5, 6], there has been growing research interest in the community in designing controllers with verifiable safety bounds.
CBFs provide a measure of safety to a system given its current state. As the system approaches the boundary of its safe operating region, the value of the CBF tends to infinity, hence the name "barrier". CBFs have been implemented for deterministic systems in robotics, for example bipedal walking on stepping stones [1, 6]. However, literature on CBFs for systems with stochastic disturbances is very scarce. Very recently, [7] introduced stochastic CBFs for relative degree 1 barrier functions, meaning that the function defining the safe set depends only on the states that are directly actuated. While the concept of stochastic CBFs is fairly recent, [7] only demonstrated their applicability to very simplified stochastic systems in conjunction with control Lyapunov functions. Merging the concepts of CBFs and control Lyapunov functions leads to some interesting results regarding the stability of the system [4, 5, 6].
We here propose to explore the inclusion of the concept of stochastic CBFs within the Stochastic
Optimal Control (SOC) framework which essentially leads to the problem of solving the Hamilton-
Jacobi-Bellman (HJB) equation on a constrained solution set. Solving the HJB equation requires overcoming the curse of dimensionality. Popular solution methods in the literature include iLQG [8], Path-Integral
Control [9] and the Forward-Backward Stochastic Differential Equations (FBSDEs) framework
[10] which tackle the HJB through locally optimal solutions. Of these, the algorithms based on
FBSDEs are the most general in the sense that they neither require assumptions to simplify the HJB
Partial Differential Equation (PDE) nor do they require Taylor approximations of the dynamics
and value function [8]. More recently, with the introduction of deep learning methods to solve
high-dimensional parabolic PDEs [11], deep learning based solutions of the HJB PDE using so called
Deep FBSDE controllers have emerged [12, 13, 14]. These algorithms leverage importance sampling
using Girsanov’s theorem of change of measure [15, Chapter 5] for FBSDEs within deep learning
models for sufficient exploration of the solution space. Additionally, they have been shown to scale
to high-dimensional linear systems [16] and highly nonlinear systems for finite-horizon SOC.
Paralleling the work on deep learning-based SOC are deep learning-based optimization layers, which aim to increase the representational power of Deep Learning (DL) models. In [17], a differentiable optimization layer was introduced to explicitly learn nonlinear mappings characterized by a Quadratic Program (QP). Additionally, the authors introduce an efficient approach to compute gradients for backpropagation through an optimization layer using the KKT conditions, which allows such layers to be incorporated within Deep FBSDEs, which require differentiability of all their subcomponents.
To the best of our knowledge, literature on combining SOC with Stochastic CBFs and differentiable
convex optimization, all embedded within a deep model, is non-existent. Additionally, this adds a degree of interpretability to the entire framework compared to the relatively black-box nature of deep neural networks used for end-to-end control. Our contributions are as follows:
i) A novel end-to-end differentiable architecture with embedded safety using Deep FBSDEs.
ii) Safe-RL algorithm combining differentiable optimization layers and Deep FBSDEs.
iii) An extension of existing theory on Stochastic Zeroing Control Barrier Functions (ZCBFs).
The rest of this paper is organized as follows: Section 2 goes over the problem formulation and
Section 3 introduces the FBSDE framework. In Section 2.1, the stochastic CBF is presented, while
the differentiable QP layer is summarized in Section 3.2. The algorithm and network architecture are
explained in Section 3.3. The simulation results are included in Section 4. Finally, we conclude the
paper and present some future directions in Section 5.
For the dynamical system defined by (1), safety is ensured if x(t) ∈ C for all t, where the set C defines the safe region of operation. C is defined via a locally Lipschitz function h : R^{n_x} → R [4] as:
C = {x : h(x) ≥ 0} (2)
∂C = {x : h(x) = 0}. (3)
The literature of CBFs makes use of two different kinds of control barrier functions [5]: reciprocal
CBFs, whose value approaches infinity when x approaches the boundary of the safe set ∂C, and
zeroing CBFs, whose value approaches zero close to the boundary. In this work we will focus on the
latter kind. Without loss of generality, the function h(x) which defines the safe set is itself a zeroing CBF and guarantees safety if, for some class K function¹ α(·), the following inequality holds ∀t:
(∂h/∂x)( f(x) + G(x)u ) + ½ tr( Σ(x)Σ^T(x) ∂²h/∂x² ) ≥ −α(h(x)).    (4)
¹A class K function is a continuous, strictly increasing function α(·) such that α(0) = 0.
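A simple example is the linear function α(r) = γr with a constant γ > 0; this is the form used in all our simulations (see the Supplementary Material).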
Above, the left-hand-side represents the rate of change of h while the right-hand-side provides a
bound to that rate which depends on how close x is to the boundary; essentially, the closer the state x
is to the boundary, the slower the value of h can further decrease, implying that x may approach the
boundary at ever-decreasing speed. The following theorem formalizes the safety guarantees:
Theorem 1. Let the safe set C be defined as per equations (2), (3), and let the initial condition x(0) ∈ C. Assume there exists a control process u(t′) such that equation (4) is satisfied for all t′ < t. Then the process x(t′) remains within the safe set with probability 1, i.e., P(x(t′) ∈ C ∀ t′ < t) = 1.
The proof is a generalization of Theorem 3 in [18], and is given in the Appendix. Note that for any
given x, eq. (4) essentially imposes a constraint on the values of control u one can apply on the
system if one wishes to maintain safety (x ∈ C).
The SOC problem is formulated in order to minimize the expected cost:

J(u) = E_Q [ ∫ₜᵀ ( q(x) + ½ u^T R u ) ds + φ(x(T)) ],    (5)
subject to the stochastic dynamics given by (1), and the constraint that trajectories should remain in
the safe set C at all times. Here, φ : R^{n_x} → R₊ is the terminal state cost, q : R^{n_x} → R₊ denotes the running state cost, and R ∈ S₊^{n_u}, where S₊^{n_u} denotes the space of symmetric positive definite n_u × n_u matrices.
Finally, Q is the space of trajectories induced by the controlled dynamics in (1). We assume that
all necessary technical requirements [19] regarding filtered probability space, Lipschitz continuity,
regularity and growth conditions to guarantee existence and uniqueness of strong solutions to (1) are
met. Letting U([0, T ]) be the space of admissible controls such that x(t) remains within C over a
fixed finite time horizon T ∈ [0, ∞), the value function related to (5) is defined as
V(x, t) = inf_{u ∈ U([0,T])} J(u) |_{x₀=x, t₀=t},
and using Bellman’s principle of optimality, one can derive the HJB PDE, given by
V_t + inf_{u ∈ U([0,T])} [ ½ tr( V_xx ΣΣ^T ) + V_x^T ( f(x) + G(x)u ) + q(x) + ½ u^T R u ] = 0,   V(x, T) = φ(x),    (6)
where we drop dependencies for brevity, and use subscripts to indicate partial derivatives with respect
to time and the state vector. The term inside the infimum operation defines the Hamiltonian:
H( t, x, u, V_x, V_xx ΣΣ^T ) = ½ tr( V_xx ΣΣ^T ) + V_x^T ( f(x) + G(x)u ) + q(x) + ½ u^T R u.
Note that in this case the Hamiltonian is quadratic with respect to u; if the constraint on u in order to
remain within the safe set C (eq. (4)) were not present, the optimal control could be calculated by
setting ∂H/∂u = 0, resulting in u∗ (t, x) = −R−1 GT Vx . However, the CBF inequality constraint
prevents such a closed-form solution for u∗ .
Using the nonlinear Feynman-Kac lemma, we can establish an equivalence between the HJB PDE and a system of FBSDEs [19]: given that a solution u* to the constrained minimization of H exists, the unique solution of (6) corresponds to the following system of FBSDEs:
dx(t) = ( f(x(t)) + G(x(t)) u*(t) ) dt + Σ(x(t)) dw_t,   x(0) = x₀,   (FSDE)  (7)

dV(t) = −( q(x) + ½ u*^T(t) R u*(t) ) dt + V_x^T(t) Σ(x(t)) dw_t,   V(T) = φ(x(T)),   (BSDE)  (8)

u*(t) = argmin_u H(t) = argmin_u [ V_x^T(t) G(x(t)) u + ½ u^T R u ],   (QP)  (9)

s.t.  (∂h/∂x)( f(x(t)) + G(x(t))u ) + ½ tr( Σ(x(t)) Σ^T(x(t)) ∂²h/∂x² ) ≥ −α(h(x(t))).
Here, V(t) is shorthand for V(x(t), t) and denotes an evaluation of the function V(x, t) along a path of x(t); thus V(t) is a stochastic process (similarly for V_x(t) and V_xx(t)). The above equations are time-discretized, and (9) is a QP that needs to be solved at each time instant.
In the above system, x(t) evolves forward in time (due to its initial condition x(0) = x0 ), whereas
V (x(t), t) evolves backwards in time, due to its terminal condition φ(x(T )). However, a simple
backward integration of V (x(t), t) would result in it depending explicitly on future values of noise
(final term in eq. (8)), which is not desirable for a non-anticipating process, i.e., a process that does
not exploit knowledge of future noise values. To avoid backward integration of the BSDE, we utilize DL. Specifically, following [12], the value function at the initial time is randomly initialized and treated as a trainable parameter, allowing forward propagation of V instead of backward; backpropagation is then used to train the initial value V(x(0), 0) along with an approximation of the gradient function V_x(x, t), which appears in the right-hand side of eq. (8) and in the Hamiltonian of eq. (9). In
this work, we utilize the Deep FBSDE controller architecture in [12] as our starting point and add a
differentiable stochastic CBF layer to ensure safety and maintain end-to-end differentiability.
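To make this concrete, below is a minimal single-trajectory sketch of one discretized step of (7)-(8) under an Euler-Maruyama scheme; lstm_net and safe_layer are hypothetical stand-ins for the FBSDE prediction layer and the differentiable QP layer described next, not the authors' released code, and Σ is assumed square for simplicity.

```python
import torch

def fbsde_step(x, V, lstm_net, safe_layer, f, G, Sigma, q, R, dt):
    """One Euler-Maruyama step of the FSDE (7) and forward-propagated BSDE (8)."""
    Vx = lstm_net(x)                      # predicted value-function gradient V_x(x_t, t)
    u = safe_layer(x, Vx)                 # safe control from the CBF-constrained QP (9)
    dw = torch.randn_like(x) * dt ** 0.5  # Brownian increment ~ N(0, dt * I)
    noise = Sigma(x) @ dw                 # same noise drives both equations
    V_next = V - (q(x) + 0.5 * u @ R @ u) * dt + Vx @ noise   # value function (8)
    x_next = x + (f(x) + G(x) @ u) * dt + noise               # system state (7)
    return x_next, V_next
```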
In [17], the authors proposed a differentiable optimization layer capable of learning nonlinear
mappings characterized by QPs of the form
ū = argmin_u  ½ u^T Q(u_i) u + q(u_i)^T u   s.t.  C(u_i) u ≤ d(u_i),    (10)
where u_i is the QP layer's input, ū is the fixed-point solution of the QP problem as well as the output of the QP layer, and the QP parameters are Q : R^{n_u} → R^{n_u×n_u}, q : R^{n_u} → R^{n_u}, C : R^{n_u} → R^{n_q×n_u}, and d : R^{n_u} → R^{n_q}, where n_q is the number of inequality constraints.
Specifically, during the forward pass through the QP layer, the optimization problem is solved by
running the Primal-Dual Interior Point Method (PDIPM) [20, Chapter 14] until convergence. PDIPM
requires a feasible initialization and searches within the constraints.
Since the Deep FBSDE network is trained by backpropagating gradients through time, one needs to be able to pass gradients through the QPs solved at each time step. During backpropagation, instead of relying on auto-differentiation of the forward computations, a linear system formulated from the KKT conditions that ū must satisfy is used to find the gradient of ū with respect to the QP parameters. In our approach, the differentiable QP layer is used to solve the QP problem in eq. (9). To match the QP in (10), we set Q = R, q = G^T V_x, C = −(∂h/∂x) G, and d = α(h(x)) + (∂h/∂x) f + ½ tr( ΣΣ^T ∂²h/∂x² ). Unlike [17], the QP parameters are not trainable in our case, as they are pre-defined by our problem.
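For intuition on both the forward solve and its differentiability, consider the special case of a single CBF inequality (as in our pendulum and cart-pole tasks below), where the KKT conditions of (9) admit a closed-form active-set solution. The following sketch is our own illustration of that special case; the actual safe layer uses the batched, multi-constraint PDIPM solver of [17].

```python
import torch

def solve_cbf_qp(R, q, c, d):
    """Minimize 0.5 u^T R u + q^T u subject to c^T u <= d (a single CBF constraint)."""
    R_inv = torch.linalg.inv(R)
    u0 = -R_inv @ q                  # unconstrained Hamiltonian minimizer
    if c @ u0 <= d:                  # constraint inactive: u0 is already safe
        return u0
    # Constraint active: stationarity gives u = -R^{-1}(q + lam * c), and
    # enforcing c^T u = d fixes the multiplier lam >= 0.
    lam = -(c @ R_inv @ q + d) / (c @ R_inv @ c)
    return -R_inv @ (q + lam * c)
```

Since every operation above is differentiable, gradients with respect to (R, q, c, d) flow through the solve directly, which is the same property the KKT-based backward pass of [17] provides in the general multi-constraint setting.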
[Figure 1 diagram: the network unrolled over time, with noise increments Δw₀, ..., Δw_{N−1}, an initial-value module ψ producing V₀, propagated values V₁, ..., V_{N−1} and gradients V_{x,0}, ..., V_{x,N−1}, and outputs V_N (predicted value) and V_{x,N} (predicted gradient).]

Figure 1: Safe FBSDE Controller: Note that the same trainable parameters of the DNN (2 LSTM + 1 Linear) layers are shared across all time steps, and the same noise is used to propagate the value function V and the system state x. Additionally, the initial value, cell and hidden states are trainable.
3.3 Deep FBSDEs for Safe SOC - Algorithm and Network Architecture
The time discretization scheme is the same as that employed in [12, Section IV]. The network architecture
of the safe FBSDE controller is presented in Fig. 1. The network consists of a FBSDE prediction layer
(dark green), a safe layer (light green) and two forward propagation processes. The main differences
between our proposed network and the one introduced in [12, Fig. 2] are: 1) the value function
gradient at initial time comes from network prediction rather than random initialization, making
the predictions more consistent temporally; 2) the propagated and true value function gradients are
included in the loss function instead of only using the propagated and true value function. The safe
layer, consisting of the differentiable QP OptNet layer [17], solves the constrained Hamiltonian
minimization problem (9). The differentiability of components in the safe layer ensures that the entire
network is end-to-end trainable (gradients can flow through all connections in Fig. 1). A summary of the algorithm is given in Algorithms 1 and 2 in the Supplementary Material. Given an initial state, the value function gradient at that state is approximated by the FBSDE layer prediction and is used to construct the constrained Hamiltonian that is minimized by the safe layer, as detailed in Algorithm 2. The safe layer formulates the constrained Hamiltonian minimization as a QP and solves for the safe control using the KKT conditions and the PDIPM interior point solver. For further details of PDIPM see [17, 21]. Both
state x and value function V are then propagated forward with the computed safe control. At terminal
time T , the true value function and its gradient are calculated from the terminal state and compared
against their propagated counterparts. The loss function is constructed from the difference between the propagated and true value function, the true value function itself, the difference between the propagated and true value function gradient, and the true value function gradient (see the loss equation in Algorithm 1). Training is done using the Adam optimizer [22].
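As a sketch, this mini-batch loss could be assembled as follows (weights a, b, c, d as in Algorithm 1; variable names are ours, and the weight-decay term is omitted):

```python
import torch

def fbsde_loss(V_N, Vx_N, V_star, Vx_star, a=1.0, b=1.0, c=1.0, d=1.0):
    # V_N, Vx_N: propagated terminal value/gradient, shapes (M, 1) and (M, n_x);
    # V_star, Vx_star: true targets phi(x_N) and phi_x(x_N) from the terminal state.
    return (a * (V_star - V_N).pow(2).sum(-1)
            + b * (Vx_star - Vx_N).pow(2).sum(-1)
            + c * V_star.pow(2).sum(-1)
            + d * Vx_star.pow(2).sum(-1)).mean()
```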
4 Simulations
In this section we provide details of the successful application of
our proposed Safe FBSDE controller on three different systems:
pendulum, cart-pole and 2D car in simulation. All simulations are conducted in a safe RL setting, which means we begin inside the
safe set and learn an optimal controller while never leaving the
safe set during the entire training process. We have included a
comprehensive description of each system in the supplementary material. This includes the equations of motion written as SDEs in state-space form, the equations of the barrier functions used, and the derivations of the gradients, Hessians and Lie derivatives required in (10) for the Hamiltonian minimization at each time step, as well as the hyperparameter values used for training the deep Safe FBSDE networks. In all simulations we compare against an unsafe FBSDE controller, i.e., the work in [12] without any safety constraints. The barrier functions used in our simulations take the generic form
h(x) = (position constraint) − µ(velocity)²,

where (position constraint) > 0 ∀x ∈ C\∂C and (position constraint) = 0 ∀x ∈ ∂C. The parameter µ controls how fast the system can move inside the safe set. The above formulation causes the barrier function to have a relative degree of 1. For more information see the Supplementary Material. All the computational graphs were implemented in PyTorch [23] on a system with an Intel(R) Xeon(R) CPU E5-1607 v3 @ 3.10GHz, 64GB RAM and an NVIDIA Quadro K5200 GPU. We considered a uniform time discretization of ∆t = 0.02 s for all tasks.

Figure 2: Obstacle avoidance: (a) and (b) represent unsafe and safe respectively. Policy is learnt avoiding obstacles (grey) during training (blue) to reach the target (yellow). Shown in (red) and (lime) are the mean test and worst-case trajectories.
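A minimal instance of this generic barrier, for a scalar position with box bounds, is sketched below (the same pattern is instantiated for the pendulum and cart-pole in the Supplementary Material; names and arguments are ours):

```python
import torch

def barrier(x, p_lo, p_hi, mu):
    """Generic barrier h(x) = (position constraint) - mu * velocity^2."""
    pos, vel = x[0], x[1]
    position_constraint = (p_hi - pos) * (pos - p_lo)  # > 0 inside C \ dC, = 0 on dC
    return position_constraint - mu * vel ** 2         # h(x) >= 0 defines the safe set C
```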
4.1 Pendulum Balancing
The state of this system is given by x = [θ, θ̇]^T, representing the pendulum angle and angular velocity.
This task requires starting at the unstable equilibrium point of x = [π, 0]T and balancing around it
Figure 3: Cart-pole Swing-up: The horizontal axis represents time (s) in all subplots while vertical
axis represents cart-position (m) in (a) and (b) and pole-angle (rad) in (c) and (d). Plots (a) and
(c) show unsafe while (b) and (d) show safe trajectories. The safe controller learns to perform a
swing-up while always remaining within the bounds (dashed black lines) during training (shown by
blue trajectories). The mean test performance is shown in red, target pole-angle is shown in green
and the worst-case (i.e. closest to either bound during training) is shown in cyan.
Figure 4: Cart-pole Balancing: The horizontal axis is time (s) in all subplots. Upper row plots
are cart-position (m) and bottom row plots are pole angular-position (rad). Plots (a) and (d) are
unsafe trajectories, plots (b) and (e) are one constraint safe trajectories and plots (c) and (f) are
double constraint safe trajectories. The safety bounds are indicated by dashed black lines and the mean performance during testing is shown in red. Since all 25,728 training trajectories are plotted, there is no need to explicitly show the worst-case trajectory.
for a time horizon of 1.5 seconds. Due to stochasticity, the pendulum is perturbed from this initial
position causing it to fall to the stable equilibrium point x = [0, 0]T . To enforce safety, we impose
box constraints on the angular position of the pendulum, θ ∈ [2π/3, 4π/3]. The results from our simulation are shown in Fig. 5, wherein all 25,728 trajectories encountered during training are plotted in blue and compared with an unsafe solution. Our proposed controller (Fig. 5(b)) is able to match the unsafe controller (Fig. 5(a)) on average during testing (red), while respecting the imposed
angular bounds during the entire training process.
Figure 6: 2D Car Collision Avoidance: Figures above show worst-case (i.e. closest encounter)
between each pair of cars during training. By design of the barrier function, each pair can be at 2 car
radii from each other with zero velocity. This is seen above as the cars come to a halt well before reaching their targets. Vertical axes represent time (s) and horizontal axes represent (x, y) position.
4.2 Cart-pole
The states of the system are given by x = [x_c, θ, ẋ_c, θ̇]^T, representing the cart-position, pole angular-position, cart-velocity and pole angular-velocity respectively. We considered the following 3 different
tasks (the time horizon for all tasks is 1.5 seconds):
4.2.2 Pole-balancing
constraint case. Through this example we would like to highlight the adaptability of the safe FBSDE
controller to find the best solution while respecting the imposed constraints.
systems and combine this framework for systems in simulators for which explicit equations of motion
are not available. We believe these would serve as stepping stones to be able to deploy this algorithm
on real hardware in the future.
6 Acknowledgements
This work is supported by NASA and the NSF-CPS award #1932288.
References
[1] Q. Nguyen and K. Sreenath. Exponential control barrier functions for enforcing high relative-
degree safety-critical constraints. In 2016 American Control Conference (ACC), pages 322–328.
IEEE, 2016.
[2] M. Althoff, C. Le Guernic, and B. H. Krogh. Reachable set computation for uncertain time-
varying linear systems. In Proceedings of the 14th International Conference on Hybrid Systems:
Computation and Control, HSCC ’11, page 93–102, New York, NY, USA, 2011. Association
for Computing Machinery. ISBN 9781450306294. doi:10.1145/1967701.1967717. URL
https://doi.org/10.1145/1967701.1967717.
[3] S. Mitra, T. Wongpiromsarn, and R. M. Murray. Verifying cyber-physical interactions in
safety-critical systems. IEEE Security Privacy, 11(4):28–37, 2013.
[4] A. D. Ames, J. W. Grizzle, and P. Tabuada. Control barrier function based quadratic programs
with application to adaptive cruise control. In 53rd IEEE Conference on Decision and Control,
pages 6271–6278. IEEE, 2014.
[5] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada. Control barrier function based quadratic
programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–
3876, 2016.
[6] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control
barrier functions: Theory and applications. In 2019 18th European Control Conference (ECC),
pages 3420–3431. IEEE, 2019.
[7] A. Clark. Control barrier functions for complete and incomplete information stochastic systems.
In 2019 American Control Conference (ACC), pages 2928–2935. IEEE, 2019.
[8] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pages 300–306. IEEE, 2005.
[9] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
[10] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward
stochastic differential equations and importance sampling. Automatica, 87:159–165, 2018.
[11] J. Han, A. Jentzen, and E. Weinan. Solving high-dimensional partial differential equations using
deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[12] M. A. Pereira, Z. Wang, I. Exarchos, and E. A. Theodorou. Learning deep stochastic optimal control policies using forward-backward SDEs. In Robotics: Science and Systems, 2019.
[13] Z. Wang, K. Lee, M. A. Pereira, I. Exarchos, and E. A. Theodorou. Deep forward-backward SDEs for min-max control. arXiv preprint arXiv:1906.04771, 2019.
[14] Z. Wang, M. A. Pereira, and E. A. Theodorou. Deep 2FBSDEs for systems with control multiplicative noise. arXiv preprint arXiv:1906.04762, 2019.
[15] S. E. Shreve. Stochastic calculus for finance II: Continuous-time models, volume 11. Springer
Science & Business Media, 2004.
[16] M. Pereira, Z. Wang, T. Chen, E. Reed, and E. Theodorou. Feynman-Kac neural network architectures for stochastic control using second-order FBSDE theory, 2020.
[17] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 136–145. JMLR.org, 2017.
[18] A. Clark. Control barrier functions for stochastic systems. arXiv preprint arXiv:2003.03498,
2020.
[19] J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations, vol-
ume 43. Springer Science & Business Media, 1999.
[20] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
[21] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-
performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32,
pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/
9015-pytorch-an-imperative-style-high-performance-deep-learning-library.
pdf.
[24] I. Karatzas and S. E. Shreve. Brownian motion. In Brownian Motion and Stochastic Calculus
(2nd ed.). Springer, 1998.
[25] Z. Xie, C. K. Liu, and K. Hauser. Differential dynamic programming with nonlinear constraints.
In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 695–702,
2017.
7 Supplementary Material
7.1 Proof of Theorem 1
Theorem 1 is a generalization of Theorem 3 in [18], the proof of which is mimicked in construction here. We want to show that for any t > 0, any ε > 0, and any δ ∈ (0, 1),

P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ δ,
where a ∨ b denotes the maximum between a and b, and (ηi ∧ t) ∨ s is used to avoid integration
during times up to s, which are already included in Us . This inequality implies that the stochastic
process Ut is a supermartingale.
We will now show that (a) h(x_t) ≥ U_t and (b) U_t ≤ θ for all t. This will be done by induction. For
t = η0 = 0, both statements hold, because h(x0 ) = θ as per (11), and U0 = θ by construction (eq.
(13)). Now, suppose that the statements hold up to time t = ηi for some i ≥ 0. We will show that
they remain true for the interval [ηi , ζi ] first, and then that they remain true for the interval [ζi , ηi+1 ],
which would complete the induction. For t ∈ [ηi , ζi ] we have
h(x_t) = h(x_{η_i}) + ∫_{η_i}^{t} [ (∂h/∂x)( f(x_τ) + G(x_τ)u ) + ½ tr( ΣΣ^T ∂²h/∂x² ) ] dτ + ∫_{η_i}^{t} (∂h/∂x) Σ(x_τ) dw_τ,

U_t = U_{η_i} − ∫_{η_i}^{t} α(θ) dτ + ∫_{η_i}^{t} (∂h/∂x) Σ(x_τ) dw_τ.
We perform comparison by terms: for the first terms, h(xηi ) ≥ Uηi by the main assumption of our
induction process. The third terms are common, and the integrands of the second terms relate as
follows:
(∂h/∂x)( f(x) + G(x)u ) + ½ tr( ΣΣ^T ∂²h/∂x² ) ≥ −α(h(x)) ≥ −α(θ),
where the first inequality is due to the CBF inequality (4) and the second inequality is by definition
of stopping times ηi and ζi and monotonicity of α(·): in the interval [ηi , ζi ] we have that h(xt ) ≤ θ.
Thus, for the interval [ηi , ζi ], we showed that h(xt ) ≥ Ut , and as a corollary, by virtue of the stopping
time definition, we furthermore have Ut ≤ θ. We proceed with the interval t ∈ [ζi , ηi+1 ]. By
construction, Ut remains constant during that interval and we have
Ut = Uζi ≤ h(xζi ) = θ ≤ h(xt ),
wherein the first inequality holds as a result of the previous interval (t ∈ [ηi , ζi ]), and the second
equality and inequality hold by virtue of the definition of the stopping times ζi and ηi+1 . This
completes the induction and we have thus proved that for all t, we have h(xt ) ≥ Ut and Ut ≤ θ.
Since Ut ≤ h(xt ), it follows that
P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ P( inf_{t′<t} U_{t′} ≤ −ε ).
By a supermartingale inequality (see [24]), P( inf_{t′<t} U_{t′} ≤ −ε ) ≤ ( E[U_t⁺] − E[U_t] ) / ε, wherein a⁺ = max{a, 0}. Taking s = max{i : η_i < t}, we can express E[U_t] as

E[U_t] = θ − α(θ) E[ Σ_{i=0}^{s} (ζ_i − η_i) ]   for t ∈ [ζ_s, η_{s+1}],
E[U_t] = θ − α(θ) E[ t − η_s + Σ_{i=0}^{s−1} (ζ_i − η_i) ]   for t ∈ [η_s, ζ_s].
Both expectations inside the bracket are bounded above by t, and thus E[Ut ] ≥ θ−α(θ)t. Furthermore,
since Ut ≤ θ we have E[Ut+ ] ≤ θ, so combining the two we obtain
E[Ut+ ] − E[Ut ] ≤ θ − (θ − α(θ)t) = α(θ)t.
Thus,

P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ P( inf_{t′<t} U_{t′} ≤ −ε ) ≤ α(θ) t/ε ≤ α( α⁻¹( εδ/(2t) ) ) t/ε = δ/2 < δ,

where we used that θ = min{ α⁻¹( εδ/(2t) ), h(x₀) } is no greater than α⁻¹( εδ/(2t) ). The proof is concluded.
In all our simulations we pick the following form of the class K function that bounds the rate of change of the barrier function: α(h(x)) = γh(x), where γ is a fixed positive constant that sets the slope of the bounding function. The higher the value of γ, the less conservative the barrier function.
where

Σ = [ 0  0 ; 0  σ ]   ⟹   ΣΣ^T = [ 0  0 ; 0  σ² ].
The parameter values used were: b = 0.1, l = 0.5 m, g = 9.81 m/s2 , m = 2 kg and I = ml2 .
Balancing with angle constraints: For this task, we impose box constraints on the pendulum
angular position. The goal is to ensure that the pendulum does not fall outside of a predetermined safe
region around the unstable equilibrium point x = [π, 0]^T. Although the pendulum is initialized at x = [π, 0]^T, stochasticity in the dynamics will push it off this unstable equilibrium position, and the safe controller is then tasked with keeping the pendulum inside the safe region. As stated earlier, the pendulum must remain inside the safe region during the entire training process.

Let θ_h and θ_l be the upper and lower bounds on the angular position. Let the corresponding barrier functions be h_h(x) = θ_h − θ and h_l(x) = θ − θ_l, and the combined barrier function be h(x) = h_h(x) · h_l(x) − µθ̇², so that the safe set is given by C = {x : h(x) ≥ 0}.
The above choice of barrier function is relative degree 1 because L_G h = (∂h/∂x)^T G ≠ 0, except at θ̇ = 0. However, given the task at hand, the only time this can occur is at the initial condition x = [π, 0]^T, or if the controller manages to stabilize the pendulum at the unstable equilibrium point and exactly cancel out all the stochasticity entering the system. If either of these occurs, since θ = π (and because we choose box-constraint-like bounds, e.g. (θ_l, θ_h) = (2π/3, 4π/3)), the gradient of the barrier function ∂h/∂x is itself zero. This means that the safety constraint is trivially satisfied (we have 0 ≥ −α(h(x)) in (4)). In these cases, the unconstrained optimal control, u*(t, x) = −R⁻¹G^T V_x, would be the solution to the QP problem.
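As a quick sanity check of this barrier (a sketch with an illustrative state; the bounds match the box constraints above, while µ = 0.1 is illustrative), autograd reproduces the hand-derived gradient ∂h/∂x = [θ_h + θ_l − 2θ, −2µθ̇]^T:

```python
import math
import torch

theta_l, theta_h, mu = 2 * math.pi / 3, 4 * math.pi / 3, 0.1
x = torch.tensor([math.pi - 0.2, 0.5], requires_grad=True)  # [theta, theta_dot]
h = (theta_h - x[0]) * (x[0] - theta_l) - mu * x[1] ** 2    # h = hh * hl - mu * thdot^2
(grad,) = torch.autograd.grad(h, x)
expected = torch.stack([theta_h + theta_l - 2 * x[0], -2 * mu * x[1]])
assert torch.allclose(grad, expected)
```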
Regarding values of hyper-parameters used to train the Safe FBSDE controller for this problem, we
used the following,
7.3.2 Cart Pole
For this system, the state vector x = [x_c, θ, ẋ_c, θ̇]^T indicates the cart-position, pole angular-position, cart-velocity and pole angular-velocity respectively. The system dynamics (f) vector and actuation (G) matrix are given by

f = [ ẋ_c,  θ̇,  f₃,  f₄ ]^T,   G = [ 0,  0,  g₃,  g₄ ]^T,

where

f₃ = m_p sinθ ( l θ̇² + g cosθ ) / ( m_c + m_p sin²θ ),
f₄ = ( −m_p l θ̇² cosθ sinθ − (m_c + m_p) g sinθ ) / ( l (m_c + m_p sin²θ) ),
g₃ = 1 / ( m_c + m_p sin²θ ),
g₄ = −cosθ / ( l (m_c + m_p sin²θ) ).
The stochastic dynamics are given by,
dx(t) = f (x) dt + G(x)u dt + Σ(x) dW (t)
where Σ(x) = diag(0, 0, σ, σ), so that Σ(x)Σ^T(x) = diag(0, 0, σ², σ²).
The system parameters chosen were: mp = 0.01 kg, mc = 1.0 kg, g = 9.81 m/s2 and l = 0.5 m.
i) Swing-up with cart position constraints: Here the task is to begin from the initial position
x0 = [0, 0, 0, 0]T and perform a swing-up to xtarget = [0, π, 0, 0]T within a fixed time
horizon of 1.5 seconds. Additionally, there are constraints on the cart-position that need to
be satisfied both during training (i.e. learning the policy) and testing (i.e. final deployment
of the policy). These constraints are represented as [xc,l , xc,h ] where xc,h > xc,l . A choice
of barrier function that combines both position and velocity constraints is as follows:
h̃ = (x_{c,h} − x_c) · (x_c − x_{c,l}) − µẋ_c² = x_{c,h}x_c − x_{c,h}x_{c,l} − x_c² + x_c x_{c,l} − µẋ_c²
where the parameter µ controls how fast the cart is allowed to move in the interior of the
safe set C.
∂h̃/∂x = [ x_{c,h} − 2x_c + x_{c,l},  0,  −2µẋ_c,  0 ]^T   ⟹   (∂h̃/∂x)^T G = −2µẋ_c / ( m_c + m_p sin²θ ).

Therefore,

C = −(∂h̃/∂x)^T G   and   d = α(h̃(x)) + 0.5 σ² (−2µ) + (∂h̃/∂x)^T f,

where we used ∂²h̃/∂ẋ_c² = −2µ.
The relative degree of this barrier is also 1, similar to the pendulum case above; please see the explanation provided there. For this task we used µ = 0.1, γ = 1.0 and cart-position bounds of (x_{c,l}, x_{c,h}) = (−10 m, 10 m).
ii) Pole balancing with pole position constraints: Similar to the construction above for
position constrained swing-up, we have the following hybrid barrier function for the task of
balancing with constrained pole angle,
h_θ(x) = (θ_h − θ) · (θ − θ_l) − µ_θ θ̇² = θ_h θ − θ_h θ_l − θ² + θ θ_l − µ_θ θ̇²

∂h_θ/∂x = [ 0,  θ_h − 2θ + θ_l,  0,  −2µ_θ θ̇ ]^T   ⟹   (∂h_θ/∂x)^T G = ( −2µ_θ θ̇ )( −cosθ ) / ( l (m_c + m_p sin²θ) ).

Therefore,

C = −(∂h_θ/∂x)^T G   and   d = α(h_θ(x)) + 0.5 σ² (−2µ_θ) + (∂h_θ/∂x)^T f,

where we used ∂²h_θ/∂θ̇² = −2µ_θ.
The relative degree of this barrier is also 1, similar to the pendulum case above; please see the explanation provided there. For this task we used µ_θ = 0.1, γ = 1.0 and pole angular-position bounds of (θ_l, θ_h) = (π/2 rad, 3π/2 rad).
iii) Pole balancing with both pole and cart position constraints: For this task we consider
two separate barrier functions i.e. two safety inequality constraints for cart-position and
pole-angle. The barrier functions and corresponding terms of the inequality constraints are
exactly the same as the two sub-tasks above. The bounds are also the same as those used
in the above sub-tasks. The barrier function parameters used were µxc = 0.01, µθ = 10,
γxc = 10 and γθ = 100.
Following are some of the common hyper-parameter values used for the above cart-pole simulations:
#  Parameter Name                            Parameter Value
1  batch size                                128
2  training iterations (task i.)             2001
3  training iterations (tasks ii. and iii.)  201
4  σ                                         1.0
5  learning rate                             1e−2
6  number of neurons per LSTM layer          16
7  optimizer                                 Adam
i) Single Car Multiple Obstacle Avoidance: In this task a single car starts at (px , py ) =
(0 m, 0 m) with the goal to reach the target (px , py ) = (2 m, 2 m) while avoiding obstacles
shaped as circles. We use 1 barrier function for each obstacle which takes the following
form:

h̃_i(x) = (p_x − o_x^i)² + (p_y − o_y^i)² − (o_r^i)² − µv²,

where i is the obstacle index and o_x^i, o_y^i, o_r^i are the x-position, y-position and radius of the obstacle respectively. This is a relative degree 1 barrier function, which can be shown by

∂h̃_i/∂x = [ 2(p_x − o_x^i),  2(p_y − o_y^i),  0,  −2µv ]^T   ⟹   (∂h̃_i/∂x)^T G = [ 0,  −2µv ].
The other terms needed to set up (10) can be calculated as:

(∂h̃_i/∂x)^T f = 2(p_x − o_x^i) v cos(θ) + 2(p_y − o_y^i) v sin(θ),

½ tr( (∂²h̃_i/∂x²) ΣΣ^T ) = −µσ².

The QP can be set up with

C_i = −(∂h̃_i/∂x)^T G,   d_i = α(h̃_i(x)) + ½ tr( (∂²h̃_i/∂x²) ΣΣ^T ) + (∂h̃_i/∂x)^T f.

A short code sketch assembling these terms is given after the multi-car case below.
ii) Multi-car Collision Avoidance: For multi-car collision avoidance, we simply set the obstacle positions in the previous section's barrier function to the other cars' positions, set the obstacle size to 2 times the car's size, and additionally subtract the other car's velocity term. In our simulation example we used 4 cars, for a total of 16 state dimensions, x = [p_{x1}, p_{y1}, ..., θ₄, v₄]^T, and 8 control dimensions, u = [u_{θ1}, u_{v1}, ..., u_{θ4}, u_{v4}]^T, resulting in 6 barrier functions, one for each pair of vehicles. The initial conditions for the 4 cars are [0, 0, π/4, 0.1], [2, 0, 3π/4, 0.1], [0, 2, −π/4, 0.1] and [2, 2, −3π/4, 0.1] respectively. For example, the barrier between car 1 and car 2 looks like
∂h̃₁₂/∂x = [ 2(p_{x1} − p_{x2}),  2(p_{y1} − p_{y2}),  0,  −2µv₁,  −2(p_{x1} − p_{x2}),  −2(p_{y1} − p_{y2}),  0,  −2µv₂,  0_{8×1} ]^T

⟹   (∂h̃₁₂/∂x)^T G = [ 0,  −2µv₁,  0,  −2µv₂,  0_{1×4} ].

The other terms needed to set up (10) can be calculated as:

(∂h̃₁₂/∂x)^T f = 2(p_{x1} − p_{x2}) v₁ cos(θ₁) + 2(p_{y1} − p_{y2}) v₁ sin(θ₁) − 2(p_{x1} − p_{x2}) v₂ cos(θ₂) − 2(p_{y1} − p_{y2}) v₂ sin(θ₂),

½ tr( (∂²h̃₁₂/∂x²) ΣΣ^T ) = −2µσ².

The QP can be set up with

C₁₂ = −(∂h̃₁₂/∂x)^T G,   d₁₂ = α(h̃₁₂(x)) + ½ tr( (∂²h̃₁₂/∂x²) ΣΣ^T ) + (∂h̃₁₂/∂x)^T f.
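Putting these pieces together for the single-car case (i), the row (C_i, d_i) contributed by one obstacle to the QP can be assembled as in the sketch below (our own function and variable names, with α(h) = γh as in our experiments):

```python
import torch

def obstacle_qp_row(x, ox, oy, orad, mu, gamma, sigma):
    """QP row (C_i, d_i) for one circular obstacle; car state x = [px, py, theta, v]."""
    px, py, theta, v = x
    h = (px - ox) ** 2 + (py - oy) ** 2 - orad ** 2 - mu * v ** 2
    dh_f = 2 * (px - ox) * v * torch.cos(theta) + 2 * (py - oy) * v * torch.sin(theta)
    trace = -mu * sigma ** 2                            # 0.5 * tr(d2h/dx2 * Sigma Sigma^T)
    C = torch.stack([torch.zeros_like(v), 2 * mu * v])  # -(dh/dx)^T G over u = [u_theta, u_v]
    d = gamma * h + trace + dh_f                        # alpha(h) = gamma * h
    return C, d
```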
Regarding values of common hyper-parameters used to train the Safe FBSDE controllers for the 2D
car problems, we used the following,
Figure 8: 2D Car Collision Avoidance Average Testing Performance: Figures above show successful collision avoidance performance of the trained policy averaged over 128 trials. Vertical axes represent time (s) and horizontal axes represent (x, y) position.
#  Parameter Name                    Parameter Value
1  batch size (task i.)              128
2  batch size (task ii.)             256
3  training iterations               1000
4  σ                                 0.1
5  learning rate                     5e−3
6  number of neurons per LSTM layer  16
7  optimizer                         Adam
For task (i.) we used µ = 0.05 and γ = 1, while for task (ii.) we used µ = 0.1 and γ = 1.
Attached with this supplementary material is an animated video of the multi-car collision avoidance problem. For this animation we used the first batch element (from a batch of 128) from the respective iterations seen in the video. However, when demonstrating performance during testing, we use average performance. As seen in Fig. 7, the variance of the trajectories is considerably small, and it therefore suffices to show the mean performance. The mean cost is also the cost functional used in the problem formulation. The colored circles indicate the diagonally opposite targets for each car. The dotted circles show the 2-car-radii minimum allowable safe distance.
7.5 Algorithms
Algorithm 1 below is very similar to the work in [12], with the following changes: computation of the optimal control (a Hamiltonian QP instead of a closed-form expression), propagation of the value function (no need for Girsanov's theorem), computation of an additional target for training the deep FBSDE network (i.e. the true gradient of the value function), and additional terms in the loss function.
In Algorithm 2, for details regarding AffineScale, CenteringCorrector, Residual and StepSize, please see the primal-dual interior point method detailed in [20], or refer to the code provided on GitHub in [17].
Algorithm 1: Finite Horizon Safe FBSDE Controller
Given:
    x₀ = ξ, f, G, Σ: initial state and system dynamics;
    φ, q, R: cost function parameters;
    a, b, c, d: loss function parameters;
    N: task horizon; K: number of iterations; M: batch size; ∆t: time discretization; λ: weight-decay parameter
Parameters:
    ψ = V(x₀, 0): value function at t = 0;
    θ: weights of all Linear and LSTM layers
Initialize neural network parameters
Initialize (for every batch element): {x₀^i}_{i=1}^{M}, x₀^i = ξ; {ψ₀^i}_{i=1}^{M}, ψ₀^i = V(x₀^i, 0)
for k = 1 to K do
    for i = 1 to M do
        for t = 1 to N − 1 do
            Predict value function gradient: V_{x,t}^i = f_FBSDE(x_t^i; θ^k)
            Solve the constrained Hamiltonian minimization QP: ū_t^{*i} = f_safe(x_t^i, V_{x,t}^i)
            Sample Brownian noise: ∆w_t^i ∼ N(0, Σ∆t)
            Propagate value function: V_{t+1}^i = V_t^i − ( q(x_t^i) + ½ ū_t^{*i T} R ū_t^{*i} ) ∆t + V_{x,t}^{i T} Σ_t^i ∆w_t^i
            Propagate system state: x_{t+1}^i = x_t^i + f_t^i ∆t + G_t^i ū_t^{*i} ∆t + Σ_t^i ∆w_t^i
        end for
        Compute targets (true terminal value and true value function gradient): V_N^{*i} = φ(x_N^i); V_{x,N}^{*i} = φ_x(x_N^i)
    end for
    Compute mini-batch loss:
        L = (1/M) Σ_{i=1}^{M} [ a‖V_N^{*i} − V_N^i‖₂² + b‖V_{x,N}^{*i} − V_{x,N}^i‖₂² + c‖V_N^{*i}‖₂² + d‖V_{x,N}^{*i}‖₂² ] + λ‖θ^k‖₂²
    Gradient step: θ^{k+1}, ψ^{k+1} ← Adam.step(L, θ^k, ψ^k)
end for
return θ^K, ψ^K
Algorithm 2: Stochastic CBF layer (Safe OptNet layer)
Given:
    h: safe-set-characterizing barrier function; α(h(x)) = γh(x);
    f, G, Σ: system dynamics; L: maximum QP iterations; R: control cost positive definite matrix
Input: current system state x_t and current predicted value function gradient V_{x,t}
Set up the QP problem:
    Q = R,   q = G^T V_x,   C = −(∂h/∂x)^T G(x),   d = α(h(x)) + (∂h/∂x)^T f(x) + ½ tr( Σ(x)^T (∂²h/∂x²) Σ(x) )
Initialize solution: a feasible u, slack variable s ∈ R₊, Lagrange multiplier for the CBF inequality constraint λ ∈ R₊, and residual tolerance ε
for j = 1 to L do
    (∆u_j^aff, ∆s_j^aff, ∆λ_j^aff) ← AffineScale(u_j, s_j, λ_j)
    (∆u_j^cc, ∆s_j^cc, ∆λ_j^cc) ← CenteringCorrector(u_j, s_j, λ_j)
    ζ ← Residual(u_j, s_j, λ_j)
    α ← StepSize(u_j, s_j, λ_j)
    u_{j+1} = u_j + α(∆u_j^aff + ∆u_j^cc)
    s_{j+1} = s_j + α(∆s_j^aff + ∆s_j^cc)
    λ_{j+1} = λ_j + α(∆λ_j^aff + ∆λ_j^cc)
    if ζ ≤ ε then
        break
    end if
end for
Return ū* ← u_j