Safe Optimal Control Using Stochastic Barrier Functions and Deep Forward-Backward SDEs
Abstract: This paper introduces a new formulation for stochastic optimal control
and stochastic dynamic optimization that ensures safety with respect to state and
control constraints. The proposed methodology brings together concepts such as
Forward-Backward Stochastic Differential Equations, Stochastic Barrier Functions,
Differentiable Convex Optimization and Deep Learning. Using the aforementioned
concepts, a Neural Network architecture is designed for safe trajectory optimization
in which learning can be performed in an end-to-end fashion. Simulations are
performed on three systems to show the efficacy of the proposed methodology.
1 Introduction
Recent advancements in the areas of stochastic optimal control theory and machine learning create new
opportunities towards the development of scalable algorithms for stochastic dynamic optimization.
Despite this progress, there is a scarcity of methodologies that are solidly grounded in first principles, have the flexibility of deep learning algorithms in terms of representational power, and can be deployed on systems operating in safety-critical scenarios.
Safety plays a major role in designing any engineering system in various industries ranging from
automobiles and aviation to energy and medicine. With the rapid emergence of various advanced
autonomous systems, the control system community has investigated various techniques such as
barrier methods [1], reachable sets [2], and discrete approximation [3] to ensure safety certifications.
However, with the recent introduction of Control Barrier Functions (CBFs) [4, 5, 6], there has been growing research interest in the community in designing controllers with verifiable safety bounds.
CBFs provide a measure of safety to a system given its current state. As the system approaches the boundary of its safe operating region, the value of the CBF tends to infinity, hence the name "barrier". CBFs have been implemented for deterministic systems in robotics, for example bipedal walking on stepping stones [1, 6]. However, literature on CBFs for systems with stochastic disturbances is very scarce. Very recently, [7] introduced stochastic CBFs for relative degree 1 barrier functions, meaning that the function defining the safe set depends only on the states that are directly actuated. While the concept of stochastic CBFs is fairly recent, [7] only demonstrated their applicability to very simplified stochastic systems in conjunction with control Lyapunov functions. Merging the concepts of CBFs and control Lyapunov functions leads to some interesting results regarding the stability of the system [4, 5, 6].
We here propose to explore the inclusion of the concept of stochastic CBFs within the Stochastic
Optimal Control (SOC) framework which essentially leads to the problem of solving the Hamilton-
Jacobi-Bellman (HJB) equation on a constrained solution set. Solving the HJB equation requires overcoming the curse of dimensionality. Popular solution methods in the literature include iLQG [8], Path-Integral
Control [9] and the Forward-Backward Stochastic Differential Equations (FBSDEs) framework
[10] which tackle the HJB through locally optimal solutions. Of these, the algorithms based on
FBSDEs are the most general in the sense that they neither require assumptions to simplify the HJB
Partial Differential Equation (PDE) nor do they require Taylor approximations of the dynamics
and value function [8]. More recently, with the introduction of deep learning methods to solve
high-dimensional parabolic PDEs [11], deep learning based solutions of the HJB PDE using so called
Deep FBSDE controllers have emerged [12, 13, 14]. These algorithms leverage importance sampling
using Girsanov’s theorem of change of measure [15, Chapter 5] for FBSDEs within deep learning
models for sufficient exploration of the solution space. Additionally, they have been shown to scale
to high-dimensional linear systems [16] and highly nonlinear systems for finite-horizon SOC.
Paralleling the work on deep learning-based SOC are deep learning-based optimization layers, which aim to increase the representational power of Deep Learning (DL) models. In [17], a differentiable optimization layer was introduced to explicitly learn nonlinear mappings characterized by a Quadratic Program (QP). Additionally, the authors introduce an efficient approach to compute gradients for backpropagation through an optimization layer using the KKT conditions, which allows such layers to be incorporated within Deep FBSDEs, which require differentiability of all their subcomponents.
To the best of our knowledge, literature on combining SOC with Stochastic CBFs and differentiable
convex optimization, all embedded within a deep model, is non-existent. Additionally, this adds a degree of interpretability to the entire framework compared to the relatively black-box nature of deep neural networks used for end-to-end control. Our contributions are as follows:
i) A novel end-to-end differentiable architecture with embedded safety using Deep FBSDEs.
ii) Safe-RL algorithm combining differentiable optimization layers and Deep FBSDEs.
iii) An extension of existing theory on Stochastic Zeroing Control Barrier Functions (ZCBFs).
The rest of this paper is organized as follows: Section 2 goes over the problem formulation and
Section 3 introduces the FBSDE framework. In Section 2.1, the stochastic CBF is presented, while
the differentiable QP layer is summarized in Section 3.2. The algorithm and network architecture are
explained in Section 3.3. The simulation results are included in Section 4. Finally, we conclude the
paper and present some future directions in Section 5.
For the dynamical system defined by (1), safety is ensured if x(t) ∈ C for all t, where the set C defines the safe region of operation. C is defined via a locally Lipschitz function h : R^{n_x} → R [4] as:
C = {x : h(x) ≥ 0} (2)
∂C = {x : h(x) = 0}. (3)
The literature of CBFs makes use of two different kinds of control barrier functions [5]: reciprocal
CBFs, whose value approaches infinity when x approaches the boundary of the safe set ∂C, and
zeroing CBFs, whose value approaches zero close to the boundary. In this work we will focus on the
latter kind. Without loss of generality, the function h(x) which defines the safe set is itself a zeroing CBF and guarantees safety if, for some class K function¹ α(·), the following inequality holds ∀t:
(∂h/∂x)( f(x) + G(x)u ) + ½ tr( Σ(x)Σ^T(x) ∂²h/∂x² ) ≥ −α(h(x)).    (4)
¹A class K function is a continuous, strictly increasing function α(·) such that α(0) = 0.
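A simple example is the linear function α(r) = γr with a constant γ > 0; this is the form used in all our simulations (see the Supplementary Material).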
Above, the left-hand-side represents the rate of change of h while the right-hand-side provides a
bound to that rate which depends on how close x is to the boundary; essentially, the closer the state x
is to the boundary, the slower the value of h can further decrease, implying that x may approach the
boundary at ever-decreasing speed. The following theorem formalizes the safety guarantees:
Theorem 1. Let the safe set C be defined as per equations (2), (3), and let the initial condition x(0) ∈ C. Assume there exists a control process u(t′) such that equation (4) is satisfied for all t′ < t. Then the process x(t′) remains within the safe set with probability 1, i.e., P(x(t′) ∈ C ∀ t′ < t) = 1.
The proof is a generalization of Theorem 3 in [18], and is given in the Appendix. Note that for any
given x, eq. (4) essentially imposes a constraint on the values of control u one can apply on the
system if one wishes to maintain safety (x ∈ C).
The SOC problem is formulated in order to minimize the expected cost:

J(u) = E_Q [ ∫ₜᵀ ( q(x) + ½ u^T R u ) ds + φ(x(T)) ],    (5)
subject to the stochastic dynamics given by (1), and the constraint that trajectories should remain in
the safe set C at all times. Here, φ : R^{n_x} → R₊ is the terminal state cost, q : R^{n_x} → R₊ denotes the running state cost, and R ∈ S₊^{n_u}, where S₊^{n_u} denotes the space of symmetric positive definite n_u × n_u matrices.
Finally, Q is the space of trajectories induced by the controlled dynamics in (1). We assume that
all necessary technical requirements [19] regarding filtered probability space, Lipschitz continuity,
regularity and growth conditions to guarantee existence and uniqueness of strong solutions to (1) are
met. Letting U([0, T ]) be the space of admissible controls such that x(t) remains within C over a
fixed finite time horizon T ∈ [0, ∞), the value function related to (5) is defined as
V(x, t) = inf_{u ∈ U([0,T])} J(u) |_{x₀=x, t₀=t},
and using Bellman’s principle of optimality, one can derive the HJB PDE, given by
V_t + inf_{u ∈ U([0,T])} [ ½ tr( V_xx ΣΣ^T ) + V_x^T ( f(x) + G(x)u ) + q(x) + ½ u^T R u ] = 0,   V(x, T) = φ(x),    (6)
where we drop dependencies for brevity, and use subscripts to indicate partial derivatives with respect
to time and the state vector. The term inside the infimum operation defines the Hamiltonian:
H( t, x, u, V_x, V_xx ΣΣ^T ) = ½ tr( V_xx ΣΣ^T ) + V_x^T ( f(x) + G(x)u ) + q(x) + ½ u^T R u.
Note that in this case the Hamiltonian is quadratic with respect to u; if the constraint on u in order to
remain within the safe set C (eq. (4)) were not present, the optimal control could be calculated by
setting ∂H/∂u = 0, resulting in u∗ (t, x) = −R−1 GT Vx . However, the CBF inequality constraint
prevents such a closed-form solution for u∗ .
Using the nonlinear Feynman-Kac lemma, we can establish an equivalence between the HJB PDE and a system of FBSDEs [19]: given that a solution u* to the constrained minimization of H exists, the unique solution of (6) corresponds to the following system of FBSDEs:
dx(t) = ( f(x(t)) + G(x(t)) u*(t) ) dt + Σ(x(t)) dw_t,   x(0) = x₀,   (FSDE)  (7)

dV(t) = −( q(x) + ½ u*^T(t) R u*(t) ) dt + V_x^T(t) Σ(x(t)) dw_t,   V(T) = φ(x(T)),   (BSDE)  (8)

u*(t) = argmin_u H(t) = argmin_u [ V_x^T(t) G(x(t)) u + ½ u^T R u ],   (QP)  (9)

s.t.  (∂h/∂x)( f(x(t)) + G(x(t))u ) + ½ tr( Σ(x(t)) Σ^T(x(t)) ∂²h/∂x² ) ≥ −α(h(x(t))).
Here, V(t) is shorthand for V(x(t), t) and denotes an evaluation of the function V(x, t) along a path of x(t); thus V(t) is a stochastic process (similarly for V_x(t) and V_xx(t)). The above equations are time-discretized, and (9) is a QP that needs to be solved at each time instant.
In the above system, x(t) evolves forward in time (due to its initial condition x(0) = x0 ), whereas
V (x(t), t) evolves backwards in time, due to its terminal condition φ(x(T )). However, a simple
backward integration of V (x(t), t) would result in it depending explicitly on future values of noise
(final term in eq. (8)), which is not desirable for a non-anticipating process, i.e., a process that does
not exploit knowledge of future noise values. To avoid backward integration of the BSDE, we utilize DL. Specifically, following [12], the value function at the initial time is randomly initialized and treated as a trainable parameter, allowing forward propagation of V instead of backward; backpropagation is then used to train the initial value V(x(0), 0) along with an approximation of the gradient function V_x(x, t), which appears in the right-hand side of eq. (8) and in the Hamiltonian of eq. (9). In
this work, we utilize the Deep FBSDE controller architecture in [12] as our starting point and add a
differentiable stochastic CBF layer to ensure safety and maintain end-to-end differentiability.
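To make this concrete, below is a minimal single-trajectory sketch of one discretized step of (7)-(8) under an Euler-Maruyama scheme; lstm_net and safe_layer are hypothetical stand-ins for the FBSDE prediction layer and the differentiable QP layer described next, not the authors' released code, and Σ is assumed square for simplicity.

```python
import torch

def fbsde_step(x, V, lstm_net, safe_layer, f, G, Sigma, q, R, dt):
    """One Euler-Maruyama step of the FSDE (7) and forward-propagated BSDE (8)."""
    Vx = lstm_net(x)                      # predicted value-function gradient V_x(x_t, t)
    u = safe_layer(x, Vx)                 # safe control from the CBF-constrained QP (9)
    dw = torch.randn_like(x) * dt ** 0.5  # Brownian increment ~ N(0, dt * I)
    noise = Sigma(x) @ dw                 # same noise drives both equations
    V_next = V - (q(x) + 0.5 * u @ R @ u) * dt + Vx @ noise   # value function (8)
    x_next = x + (f(x) + G(x) @ u) * dt + noise               # system state (7)
    return x_next, V_next
```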
In [17], the authors proposed a differentiable optimization layer capable of learning nonlinear
mappings characterized by QPs of the form
ū = argmin_u  ½ u^T Q(u_i) u + q(u_i)^T u   s.t.  C(u_i) u ≤ d(u_i),    (10)
where u_i is the QP layer's input, ū is the fixed-point solution of the QP problem as well as the output of the QP layer, and the QP parameters are Q : R^{n_u} → R^{n_u×n_u}, q : R^{n_u} → R^{n_u}, C : R^{n_u} → R^{n_q×n_u}, and d : R^{n_u} → R^{n_q}, where n_q is the number of inequality constraints.
Specifically, during the forward pass through the QP layer, the optimization problem is solved by
running the Primal-Dual Interior Point Method (PDIPM) [20, Chapter 14] until convergence. PDIPM
requires a feasible initialization and searches within the constraints.
Since the Deep FBSDE network is trained by backpropagating gradients through time, one needs to be able to pass gradients through the QPs solved at each time step. During backpropagation, instead of relying on auto-differentiation of the forward computations, a linear system formulated from the KKT conditions that ū must satisfy is used to find the gradient of ū with respect to the QP parameters. In our approach, the differentiable QP layer is used to solve the QP problem in eq. (9). To match the QP in (10), we set Q = R, q = G^T V_x, C = −(∂h/∂x) G, and d = α(h(x)) + (∂h/∂x) f + ½ tr( ΣΣ^T ∂²h/∂x² ). Unlike [17], the QP parameters are not trainable in our case, as they are pre-defined by our problem.
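For intuition on both the forward solve and its differentiability, consider the special case of a single CBF inequality (as in our pendulum and cart-pole tasks below), where the KKT conditions of (9) admit a closed-form active-set solution. The following sketch is our own illustration of that special case; the actual safe layer uses the batched, multi-constraint PDIPM solver of [17].

```python
import torch

def solve_cbf_qp(R, q, c, d):
    """Minimize 0.5 u^T R u + q^T u subject to c^T u <= d (a single CBF constraint)."""
    R_inv = torch.linalg.inv(R)
    u0 = -R_inv @ q                  # unconstrained Hamiltonian minimizer
    if c @ u0 <= d:                  # constraint inactive: u0 is already safe
        return u0
    # Constraint active: stationarity gives u = -R^{-1}(q + lam * c), and
    # enforcing c^T u = d fixes the multiplier lam >= 0.
    lam = -(c @ R_inv @ q + d) / (c @ R_inv @ c)
    return -R_inv @ (q + lam * c)
```

Since every operation above is differentiable, gradients with respect to (R, q, c, d) flow through the solve directly, which is the same property the KKT-based backward pass of [17] provides in the general multi-constraint setting.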
[Figure 1 diagram: the network unrolled over time, with noise increments Δw₀, ..., Δw_{N−1}, an initial-value module ψ producing V₀, propagated values V₁, ..., V_{N−1} and gradients V_{x,0}, ..., V_{x,N−1}, and outputs V_N (predicted value) and V_{x,N} (predicted gradient).]

Figure 1: Safe FBSDE Controller: Note that the same trainable parameters of the DNN (2 LSTM + 1 Linear) layers are shared across all time steps, and the same noise is used to propagate the value function V and the system state x. Additionally, the initial value, cell and hidden states are trainable.
3.3 Deep FBSDEs for Safe SOC - Algorithm and Network Architecture
The time discretization scheme is the same as that employed in [12, Section IV]. The network architecture
of the safe FBSDE controller is presented in Fig. 1. The network consists of a FBSDE prediction layer
(dark green), a safe layer (light green) and two forward propagation processes. The main differences
between our proposed network and the one introduced in [12, Fig. 2] are: 1) the value function
gradient at initial time comes from network prediction rather than random initialization, making
the predictions more consistent temporally; 2) the propagated and true value function gradients are
included in the loss function instead of only using the propagated and true value function. The safe
layer, consisting of the differentiable QP OptNet layer [17], solves the constrained Hamiltonian
minimization problem (9). The differentiability of components in the safe layer ensures that the entire
network is end-to-end trainable (gradients can flow through all connections in Fig. 1). A summary of the algorithm is given in Algorithms 1 and 2 in the Supplementary Material. Given an initial state, the value function gradient at that state is approximated by the FBSDE layer prediction and is used to construct the constrained Hamiltonian that is minimized by the safe layer, as detailed in Algorithm 2. The safe layer formulates the constrained Hamiltonian minimization as a QP and solves for the safe control using the KKT conditions and the PDIPM interior point solver. For further details of PDIPM see [17, 21]. Both
state x and value function V are then propagated forward with the computed safe control. At terminal
time T , the true value function and its gradient are calculated from the terminal state and compared
against their propagated counterparts. The loss function is constructed from the difference between the propagated and true value function, the true value function itself, the difference between the propagated and true value function gradient, and the true value function gradient (see the loss equation in Algorithm 1). Training is done using the Adam optimizer [22].
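As a sketch, this mini-batch loss could be assembled as follows (weights a, b, c, d as in Algorithm 1; variable names are ours, and the weight-decay term is omitted):

```python
import torch

def fbsde_loss(V_N, Vx_N, V_star, Vx_star, a=1.0, b=1.0, c=1.0, d=1.0):
    # V_N, Vx_N: propagated terminal value/gradient, shapes (M, 1) and (M, n_x);
    # V_star, Vx_star: true targets phi(x_N) and phi_x(x_N) from the terminal state.
    return (a * (V_star - V_N).pow(2).sum(-1)
            + b * (Vx_star - Vx_N).pow(2).sum(-1)
            + c * V_star.pow(2).sum(-1)
            + d * Vx_star.pow(2).sum(-1)).mean()
```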
4 Simulations
In this section we provide details of the successful application of
our proposed Safe FBSDE controller on three different systems:
pendulum, cart-pole and 2D car in simulation. All simulations are conducted in a safe RL setting, which means we begin inside the
safe set and learn an optimal controller while never leaving the
safe set during the entire training process. We have included a
comprehensive description of each system in the supplementary material. This includes the equations of motion written as SDEs in state-space form, the equations of the barrier functions used, and the derivations of the gradients, Hessians and Lie derivatives required in (10) for the Hamiltonian minimization at each time step, as well as the hyperparameter values used for training the deep Safe FBSDE networks. In all simulations we compare against an unsafe FBSDE controller, i.e., the work in [12] without any safety constraints. The barrier functions used in our simulations take the generic form
h(x) = (position constraint) − µ(velocity)²,

where (position constraint) > 0 ∀x ∈ C\∂C and (position constraint) = 0 ∀x ∈ ∂C. The parameter µ controls how fast the system can move inside the safe set. The above formulation causes the barrier function to have a relative degree of 1. For more information see the Supplementary Material. All the computational graphs were implemented in PyTorch [23] on a system with an Intel(R) Xeon(R) CPU E5-1607 v3 @ 3.10GHz, 64GB RAM and an NVIDIA Quadro K5200 GPU. We considered a uniform time discretization of ∆t = 0.02 s for all tasks.

Figure 2: Obstacle avoidance: (a) and (b) represent unsafe and safe respectively. Policy is learnt avoiding obstacles (grey) during training (blue) to reach the target (yellow). Shown in (red) and (lime) are the mean test and worst-case trajectories.
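A minimal instance of this generic barrier, for a scalar position with box bounds, is sketched below (the same pattern is instantiated for the pendulum and cart-pole in the Supplementary Material; names and arguments are ours):

```python
import torch

def barrier(x, p_lo, p_hi, mu):
    """Generic barrier h(x) = (position constraint) - mu * velocity^2."""
    pos, vel = x[0], x[1]
    position_constraint = (p_hi - pos) * (pos - p_lo)  # > 0 inside C \ dC, = 0 on dC
    return position_constraint - mu * vel ** 2         # h(x) >= 0 defines the safe set C
```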
4.1 Pendulum Balancing
The state of this system is given by x = [θ, θ̇]^T, representing the pendulum angle and angular velocity.
This task requires starting at the unstable equilibrium point of x = [π, 0]T and balancing around it
Figure 3: Cart-pole Swing-up: The horizontal axis represents time (s) in all subplots while vertical
axis represents cart-position (m) in (a) and (b) and pole-angle (rad) in (c) and (d). Plots (a) and
(c) show unsafe while (b) and (d) show safe trajectories. The safe controller learns to perform a
swing-up while always remaining within the bounds (dashed black lines) during training (shown by
blue trajectories). The mean test performance is shown in red, target pole-angle is shown in green
and the worst-case (i.e. closest to either bound during training) is shown in cyan.
Figure 4: Cart-pole Balancing: The horizontal axis is time (s) in all subplots. Upper row plots
are cart-position (m) and bottom row plots are pole angular-position (rad). Plots (a) and (d) are
unsafe trajectories, plots (b) and (e) are one constraint safe trajectories and plots (c) and (f) are
double constraint safe trajectories. The safety bounds are indicated by dashed black lines and the mean performance during testing is shown in red. Since all 25,728 training trajectories are plotted, there is no need to explicitly show the worst-case trajectory.
for a time horizon of 1.5 seconds. Due to stochasticity, the pendulum is perturbed from this initial
position causing it to fall to the stable equilibrium point x = [0, 0]T . To enforce safety, we impose
box constraints on the angular position of the pendulum, θ ∈ [2π/3, 4π/3]. The results from our simulation are shown in Fig. 5, wherein all 25,728 trajectories encountered during training are plotted in blue and compared with an unsafe solution. Our proposed controller (Fig. 5(b)) is able to match the unsafe controller (Fig. 5(a)) on average during testing (red), while respecting the imposed
angular bounds during the entire training process.
Figure 6: 2D Car Collision Avoidance: Figures above show worst-case (i.e. closest encounter)
between each pair of cars during training. By design of the barrier function, each pair can be at 2 car
radii from each other with zero velocity. This is seen above as the cars come to a halt well before reaching their targets. Vertical axes represent time (s) and horizontal axes represent (x, y) position.
4.2 Cart-pole
The states of the system are given by x = [x_c, θ, ẋ_c, θ̇]^T, representing the cart-position, pole angular-position, cart-velocity and pole angular-velocity respectively. We considered the following 3 different
tasks (the time horizon for all tasks is 1.5 seconds):
4.2.2 Pole-balancing
constraint case. Through this example we would like to highlight the adaptability of the safe FBSDE
controller to find the best solution while respecting the imposed constraints.
systems and combine this framework for systems in simulators for which explicit equations of motion
are not available. We believe these would serve as stepping stones to be able to deploy this algorithm
on real hardware in the future.
6 Acknowledgements
This work is supported by NASA and the NSF-CPS award #1932288.
References
[1] Q. Nguyen and K. Sreenath. Exponential control barrier functions for enforcing high relative-
degree safety-critical constraints. In 2016 American Control Conference (ACC), pages 322–328.
IEEE, 2016.
[2] M. Althoff, C. Le Guernic, and B. H. Krogh. Reachable set computation for uncertain time-
varying linear systems. In Proceedings of the 14th International Conference on Hybrid Systems:
Computation and Control, HSCC ’11, page 93–102, New York, NY, USA, 2011. Association
for Computing Machinery. ISBN 9781450306294. doi:10.1145/1967701.1967717. URL
https://doi.org/10.1145/1967701.1967717.
[3] S. Mitra, T. Wongpiromsarn, and R. M. Murray. Verifying cyber-physical interactions in
safety-critical systems. IEEE Security Privacy, 11(4):28–37, 2013.
[4] A. D. Ames, J. W. Grizzle, and P. Tabuada. Control barrier function based quadratic programs
with application to adaptive cruise control. In 53rd IEEE Conference on Decision and Control,
pages 6271–6278. IEEE, 2014.
[5] A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada. Control barrier function based quadratic
programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–
3876, 2016.
[6] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control
barrier functions: Theory and applications. In 2019 18th European Control Conference (ECC),
pages 3420–3431. IEEE, 2019.
[7] A. Clark. Control barrier functions for complete and incomplete information stochastic systems.
In 2019 American Control Conference (ACC), pages 2928–2935. IEEE, 2019.
[8] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pages 300–306. IEEE, 2005.
[9] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
[10] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward
stochastic differential equations and importance sampling. Automatica, 87:159–165, 2018.
[11] J. Han, A. Jentzen, and E. Weinan. Solving high-dimensional partial differential equations using
deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[12] M. A. Pereira, Z. Wang, I. Exarchos, and E. A. Theodorou. Learning deep stochastic optimal control policies using forward-backward SDEs. In Robotics: Science and Systems, 2019.
[13] Z. Wang, K. Lee, M. A. Pereira, I. Exarchos, and E. A. Theodorou. Deep forward-backward SDEs for min-max control. arXiv preprint arXiv:1906.04771, 2019.
[14] Z. Wang, M. A. Pereira, and E. A. Theodorou. Deep 2FBSDEs for systems with control multiplicative noise. arXiv preprint arXiv:1906.04762, 2019.
[15] S. E. Shreve. Stochastic calculus for finance II: Continuous-time models, volume 11. Springer
Science & Business Media, 2004.
[16] M. Pereira, Z. Wang, T. Chen, E. Reed, and E. Theodorou. Feynman-Kac neural network architectures for stochastic control using second-order FBSDE theory, 2020.
[17] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 136–145. JMLR.org, 2017.
[18] A. Clark. Control barrier functions for stochastic systems. arXiv preprint arXiv:2003.03498,
2020.
[19] J. Yong and X. Y. Zhou. Stochastic controls: Hamiltonian systems and HJB equations, vol-
ume 43. Springer Science & Business Media, 1999.
[20] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
[21] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-
performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32,
pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/
9015-pytorch-an-imperative-style-high-performance-deep-learning-library.
pdf.
[24] I. Karatzas and S. E. Shreve. Brownian motion. In Brownian Motion and Stochastic Calculus
(2nd ed.). Springer, 1998.
[25] Z. Xie, C. K. Liu, and K. Hauser. Differential dynamic programming with nonlinear constraints.
In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 695–702,
2017.
7 Supplementary Material
7.1 Proof of Theorem 1
Theorem 1 is a generalization of Theorem 3 in [18], the proof of which is mimicked in construction here. We want to show that for any t > 0, any ε > 0, and any δ ∈ (0, 1),

P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ δ,
where a ∨ b denotes the maximum between a and b, and (ηi ∧ t) ∨ s is used to avoid integration
during times up to s, which are already included in Us . This inequality implies that the stochastic
process Ut is a supermartingale.
We will now show that (a) h(x_t) ≥ U_t and (b) U_t ≤ θ for all t. This will be done by induction. For
t = η0 = 0, both statements hold, because h(x0 ) = θ as per (11), and U0 = θ by construction (eq.
(13)). Now, suppose that the statements hold up to time t = ηi for some i ≥ 0. We will show that
they remain true for the interval [ηi , ζi ] first, and then that they remain true for the interval [ζi , ηi+1 ],
which would complete the induction. For t ∈ [ηi , ζi ] we have
h(x_t) = h(x_{η_i}) + ∫_{η_i}^{t} [ (∂h/∂x)( f(x_τ) + G(x_τ)u ) + ½ tr( ΣΣ^T ∂²h/∂x² ) ] dτ + ∫_{η_i}^{t} (∂h/∂x) Σ(x_τ) dw_τ,

U_t = U_{η_i} − ∫_{η_i}^{t} α(θ) dτ + ∫_{η_i}^{t} (∂h/∂x) Σ(x_τ) dw_τ.
We perform comparison by terms: for the first terms, h(xηi ) ≥ Uηi by the main assumption of our
induction process. The third terms are common, and the integrands of the second terms relate as
follows:
(∂h/∂x)( f(x) + G(x)u ) + ½ tr( ΣΣ^T ∂²h/∂x² ) ≥ −α(h(x)) ≥ −α(θ),
where the first inequality is due to the CBF inequality (4) and the second inequality is by definition
of stopping times ηi and ζi and monotonicity of α(·): in the interval [ηi , ζi ] we have that h(xt ) ≤ θ.
Thus, for the interval [ηi , ζi ], we showed that h(xt ) ≥ Ut , and as a corollary, by virtue of the stopping
time definition, we furthermore have Ut ≤ θ. We proceed with the interval t ∈ [ζi , ηi+1 ]. By
construction, Ut remains constant during that interval and we have
Ut = Uζi ≤ h(xζi ) = θ ≤ h(xt ),
wherein the first inequality holds as a result of the previous interval (t ∈ [ηi , ζi ]), and the second
equality and inequality hold by virtue of the definition of the stopping times ζi and ηi+1 . This
completes the induction and we have thus proved that for all t, we have h(xt ) ≥ Ut and Ut ≤ θ.
Since Ut ≤ h(xt ), it follows that
P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ P( inf_{t′<t} U_{t′} ≤ −ε ).
By a supermartingale inequality (see [24]), P( inf_{t′<t} U_{t′} ≤ −ε ) ≤ ( E[U_t⁺] − E[U_t] ) / ε, wherein a⁺ = max{a, 0}. Taking s = max{i : η_i < t}, we can express E[U_t] as

E[U_t] = θ − α(θ) E[ Σ_{i=0}^{s} (ζ_i − η_i) ]   for t ∈ [ζ_s, η_{s+1}],
E[U_t] = θ − α(θ) E[ t − η_s + Σ_{i=0}^{s−1} (ζ_i − η_i) ]   for t ∈ [η_s, ζ_s].
Both expectations inside the bracket are bounded above by t, and thus E[Ut ] ≥ θ−α(θ)t. Furthermore,
since Ut ≤ θ we have E[Ut+ ] ≤ θ, so combining the two we obtain
E[Ut+ ] − E[Ut ] ≤ θ − (θ − α(θ)t) = α(θ)t.
Thus,

P( inf_{t′<t} h(x_{t′}) ≤ −ε ) ≤ P( inf_{t′<t} U_{t′} ≤ −ε ) ≤ α(θ) t/ε ≤ α( α⁻¹( εδ/(2t) ) ) t/ε = δ/2 < δ,

where we used that θ = min{ α⁻¹( εδ/(2t) ), h(x₀) } is no greater than α⁻¹( εδ/(2t) ). The proof is concluded.
In all our simulations we pick the following form of the class K function that bounds the rate of change of the barrier function: α(h(x)) = γh(x), where γ is a fixed positive constant that sets the slope of the bounding function. The higher the value of γ, the less conservative the barrier function.
where

Σ = [ 0  0 ; 0  σ ]   ⟹   ΣΣ^T = [ 0  0 ; 0  σ² ].
The parameter values used were: b = 0.1, l = 0.5 m, g = 9.81 m/s2 , m = 2 kg and I = ml2 .
Balancing with angle constraints: For this task, we impose box constraints on the pendulum
angular position. The goal is to ensure that the pendulum does not fall outside of a predetermined safe
region around the unstable equilibrium point x = [π, 0]^T. Although the pendulum is initialized at x = [π, 0]^T, stochasticity in the dynamics will push it off this unstable equilibrium position, and the safe controller is then tasked with keeping the pendulum inside the safe region. As stated earlier, the pendulum must remain inside the safe region during the entire training process.

Let θ_h and θ_l be the upper and lower bounds on the angular position. Let the corresponding barrier functions be h_h(x) = θ_h − θ and h_l(x) = θ − θ_l, and the combined barrier function be h(x) = h_h(x) · h_l(x) − µθ̇², so that the safe set is given by C = {x : h(x) ≥ 0}.
The above choice of barrier function is relative degree 1 because L_G h = (∂h/∂x)^T G ≠ 0, except at θ̇ = 0. However, given the task at hand, the only time this can occur is at the initial condition x = [π, 0]^T, or if the controller manages to stabilize the pendulum at the unstable equilibrium point and exactly cancel out all the stochasticity entering the system. If either of these occurs, since θ = π (and because we choose box-constraint-like bounds, e.g. (θ_l, θ_h) = (2π/3, 4π/3)), the gradient of the barrier function ∂h/∂x is itself zero. This means that the safety constraint is trivially satisfied (we have 0 ≥ −α(h(x)) in (4)). In these cases, the unconstrained optimal control, u*(t, x) = −R⁻¹G^T V_x, would be the solution to the QP problem.
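As a quick sanity check of this barrier (a sketch with an illustrative state; the bounds match the box constraints above, while µ = 0.1 is illustrative), autograd reproduces the hand-derived gradient ∂h/∂x = [θ_h + θ_l − 2θ, −2µθ̇]^T:

```python
import math
import torch

theta_l, theta_h, mu = 2 * math.pi / 3, 4 * math.pi / 3, 0.1
x = torch.tensor([math.pi - 0.2, 0.5], requires_grad=True)  # [theta, theta_dot]
h = (theta_h - x[0]) * (x[0] - theta_l) - mu * x[1] ** 2    # h = hh * hl - mu * thdot^2
(grad,) = torch.autograd.grad(h, x)
expected = torch.stack([theta_h + theta_l - 2 * x[0], -2 * mu * x[1]])
assert torch.allclose(grad, expected)
```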
Regarding values of hyper-parameters used to train the Safe FBSDE controller for this problem, we
used the following,
7.3.2 Cart Pole
For this system, the state vector x = [x_c, θ, ẋ_c, θ̇]^T indicates the cart-position, pole angular-position, cart-velocity and pole angular-velocity respectively. The system dynamics (f) vector and actuation (G) matrix are given by

f = [ ẋ_c,  θ̇,  f₃,  f₄ ]^T,   G = [ 0,  0,  g₃,  g₄ ]^T,

where

f₃ = m_p sinθ ( l θ̇² + g cosθ ) / ( m_c + m_p sin²θ ),
f₄ = ( −m_p l θ̇² cosθ sinθ − (m_c + m_p) g sinθ ) / ( l (m_c + m_p sin²θ) ),
g₃ = 1 / ( m_c + m_p sin²θ ),
g₄ = −cosθ / ( l (m_c + m_p sin²θ) ).
The stochastic dynamics are given by,
dx(t) = f (x) dt + G(x)u dt + Σ(x) dW (t)
where Σ(x) = diag(0, 0, σ, σ), so that Σ(x)Σ^T(x) = diag(0, 0, σ², σ²).
The system parameters chosen were: mp = 0.01 kg, mc = 1.0 kg, g = 9.81 m/s2 and l = 0.5 m.
i) Swing-up with cart position constraints: Here the task is to begin from the initial position
x0 = [0, 0, 0, 0]T and perform a swing-up to xtarget = [0, π, 0, 0]T within a fixed time
horizon of 1.5 seconds. Additionally, there are constraints on the cart-position that need to
be satisfied both during training (i.e. learning the policy) and testing (i.e. final deployment
of the policy). These constraints are represented as [xc,l , xc,h ] where xc,h > xc,l . A choice
of barrier function that combines both position and velocity constraints is as follows:
h̃ = (x_{c,h} − x_c) · (x_c − x_{c,l}) − µẋ_c² = x_{c,h}x_c − x_{c,h}x_{c,l} − x_c² + x_c x_{c,l} − µẋ_c²
where the parameter µ controls how fast the cart is allowed to move in the interior of the
safe set C.
∂h̃/∂x = [ x_{c,h} − 2x_c + x_{c,l},  0,  −2µẋ_c,  0 ]^T   ⟹   (∂h̃/∂x)^T G = −2µẋ_c / ( m_c + m_p sin²θ ).

Therefore,

C = −(∂h̃/∂x)^T G   and   d = α(h̃(x)) + 0.5 σ² (−2µ) + (∂h̃/∂x)^T f,

where we used ∂²h̃/∂ẋ_c² = −2µ.
The relative degree of this barrier is also 1, similar to the pendulum case above; please see the explanation provided there. For this task we used µ = 0.1, γ = 1.0 and cart-position bounds of (x_{c,l}, x_{c,h}) = (−10 m, 10 m).
ii) Pole balancing with pole position constraints: Similar to the construction above for
position constrained swing-up, we have the following hybrid barrier function for the task of
balancing with constrained pole angle,
h_θ(x) = (θ_h − θ) · (θ − θ_l) − µ_θ θ̇² = θ_h θ − θ_h θ_l − θ² + θ θ_l − µ_θ θ̇²

∂h_θ/∂x = [ 0,  θ_h − 2θ + θ_l,  0,  −2µ_θ θ̇ ]^T   ⟹   (∂h_θ/∂x)^T G = ( −2µ_θ θ̇ )( −cosθ ) / ( l (m_c + m_p sin²θ) ).

Therefore,

C = −(∂h_θ/∂x)^T G   and   d = α(h_θ(x)) + 0.5 σ² (−2µ_θ) + (∂h_θ/∂x)^T f,

where we used ∂²h_θ/∂θ̇² = −2µ_θ.
The relative degree of this barrier is also 1, similar to the pendulum case above; please see the explanation provided there. For this task we used µ_θ = 0.1, γ = 1.0 and pole angular-position bounds of (θ_l, θ_h) = (π/2 rad, 3π/2 rad).
iii) Pole balancing with both pole and cart position constraints: For this task we consider
two separate barrier functions i.e. two safety inequality constraints for cart-position and
pole-angle. The barrier functions and corresponding terms of the inequality constraints are
exactly the same as the two sub-tasks above. The bounds are also the same as those used
in the above sub-tasks. The barrier function parameters used were µxc = 0.01, µθ = 10,
γxc = 10 and γθ = 100.
Following are some of the common hyper-parameter values used for the above cart-pole simulations:
#  Parameter Name                            Parameter Value
1  batch size                                128
2  training iterations (task i.)             2001
3  training iterations (tasks ii. and iii.)  201
4  σ                                         1.0
5  learning rate                             1e−2
6  number of neurons per LSTM layer          16
7  optimizer                                 Adam
i) Single Car Multiple Obstacle Avoidance: In this task a single car starts at (px , py ) =
(0 m, 0 m) with the goal to reach the target (px , py ) = (2 m, 2 m) while avoiding obstacles
shaped as circles. We use 1 barrier function for each obstacle which takes the following
form:

h̃_i(x) = (p_x − o_x^i)² + (p_y − o_y^i)² − (o_r^i)² − µv²,

where i is the obstacle index and o_x^i, o_y^i, o_r^i are the x-position, y-position and radius of the obstacle respectively. This is a relative degree 1 barrier function, which can be shown by

∂h̃_i/∂x = [ 2(p_x − o_x^i),  2(p_y − o_y^i),  0,  −2µv ]^T   ⟹   (∂h̃_i/∂x)^T G = [ 0,  −2µv ].
The other terms needed to set up (10) can be calculated as:

(∂h̃_i/∂x)^T f = 2(p_x − o_x^i) v cos(θ) + 2(p_y − o_y^i) v sin(θ),

½ tr( (∂²h̃_i/∂x²) ΣΣ^T ) = −µσ².

The QP can be set up with

C_i = −(∂h̃_i/∂x)^T G,   d_i = α(h̃_i(x)) + ½ tr( (∂²h̃_i/∂x²) ΣΣ^T ) + (∂h̃_i/∂x)^T f.

A short code sketch assembling these terms is given after the multi-car case below.
ii) Multi-car Collision Avoidance: For multi-car collision avoidance, we simply set the obstacle positions in the previous section's barrier function to the other cars' positions, set the obstacle size to 2 times the car's size, and additionally subtract the other car's velocity term. In our simulation example we used 4 cars, for a total of 16 state dimensions, x = [p_{x1}, p_{y1}, ..., θ₄, v₄]^T, and 8 control dimensions, u = [u_{θ1}, u_{v1}, ..., u_{θ4}, u_{v4}]^T, resulting in 6 barrier functions, one for each pair of vehicles. The initial conditions for the 4 cars are [0, 0, π/4, 0.1], [2, 0, 3π/4, 0.1], [0, 2, −π/4, 0.1] and [2, 2, −3π/4, 0.1] respectively. For example, the barrier between car 1 and car 2 looks like
∂h̃₁₂/∂x = [ 2(p_{x1} − p_{x2}),  2(p_{y1} − p_{y2}),  0,  −2µv₁,  −2(p_{x1} − p_{x2}),  −2(p_{y1} − p_{y2}),  0,  −2µv₂,  0_{8×1} ]^T

⟹   (∂h̃₁₂/∂x)^T G = [ 0,  −2µv₁,  0,  −2µv₂,  0_{1×4} ].

The other terms needed to set up (10) can be calculated as:

(∂h̃₁₂/∂x)^T f = 2(p_{x1} − p_{x2}) v₁ cos(θ₁) + 2(p_{y1} − p_{y2}) v₁ sin(θ₁) − 2(p_{x1} − p_{x2}) v₂ cos(θ₂) − 2(p_{y1} − p_{y2}) v₂ sin(θ₂),

½ tr( (∂²h̃₁₂/∂x²) ΣΣ^T ) = −2µσ².

The QP can be set up with

C₁₂ = −(∂h̃₁₂/∂x)^T G,   d₁₂ = α(h̃₁₂(x)) + ½ tr( (∂²h̃₁₂/∂x²) ΣΣ^T ) + (∂h̃₁₂/∂x)^T f.
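Putting these pieces together for the single-car case (i), the row (C_i, d_i) contributed by one obstacle to the QP can be assembled as in the sketch below (our own function and variable names, with α(h) = γh as in our experiments):

```python
import torch

def obstacle_qp_row(x, ox, oy, orad, mu, gamma, sigma):
    """QP row (C_i, d_i) for one circular obstacle; car state x = [px, py, theta, v]."""
    px, py, theta, v = x
    h = (px - ox) ** 2 + (py - oy) ** 2 - orad ** 2 - mu * v ** 2
    dh_f = 2 * (px - ox) * v * torch.cos(theta) + 2 * (py - oy) * v * torch.sin(theta)
    trace = -mu * sigma ** 2                            # 0.5 * tr(d2h/dx2 * Sigma Sigma^T)
    C = torch.stack([torch.zeros_like(v), 2 * mu * v])  # -(dh/dx)^T G over u = [u_theta, u_v]
    d = gamma * h + trace + dh_f                        # alpha(h) = gamma * h
    return C, d
```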
Regarding values of common hyper-parameters used to train the Safe FBSDE controllers for the 2D
car problems, we used the following,
Figure 8: 2D Car Collision Avoidance Average Testing Performance: Figures above show successful collision avoidance performance of the trained policy averaged over 128 trials. Vertical axes represent time (s) and horizontal axes represent (x, y) position.
#  Parameter Name                    Parameter Value
1  batch size (task i.)              128
2  batch size (task ii.)             256
3  training iterations               1000
4  σ                                 0.1
5  learning rate                     5e−3
6  number of neurons per LSTM layer  16
7  optimizer                         Adam
For task (i.) we used µ = 0.05 and γ = 1, while for task (ii.) we used µ = 0.1 and γ = 1.
Attached with this supplementary material is an animated video of the multi-car collision avoidance problem. For this animation we used the first batch element (from a batch of 128) from the respective iterations seen in the video. However, when demonstrating performance during testing, we use average performance. As seen in Fig. 7, the variance of the trajectories is considerably small, and it therefore suffices to show the mean performance. The mean cost is also the cost functional used in the problem formulation. The colored circles indicate the diagonally opposite targets for each car. The dotted circles show the 2-car-radii minimum allowable safe distance.
7.5 Algorithms
Algorithm 1 below is very similar to the work in [12], with the following changes: computation of the optimal control (a Hamiltonian QP instead of a closed-form expression), propagation of the value function (no need for Girsanov's theorem), computation of an additional target for training the deep FBSDE network (i.e. the true gradient of the value function), and additional terms in the loss function.
In Algorithm 2, for details regarding AffineScale, CenteringCorrector, Residual and StepSize, please see the primal-dual interior point method detailed in [20], or refer to the code provided on GitHub in [17].
Algorithm 1: Finite Horizon Safe FBSDE Controller
Given:
    x₀ = ξ, f, G, Σ: initial state and system dynamics;
    φ, q, R: cost function parameters;
    a, b, c, d: loss function parameters;
    N: task horizon; K: number of iterations; M: batch size; ∆t: time discretization; λ: weight-decay parameter
Parameters:
    ψ = V(x₀, 0): value function at t = 0;
    θ: weights of all Linear and LSTM layers
Initialize neural network parameters
Initialize (for every batch element): {x₀^i}_{i=1}^{M}, x₀^i = ξ; {ψ₀^i}_{i=1}^{M}, ψ₀^i = V(x₀^i, 0)
for k = 1 to K do
    for i = 1 to M do
        for t = 1 to N − 1 do
            Predict value function gradient: V_{x,t}^i = f_FBSDE(x_t^i; θ^k)
            Solve the constrained Hamiltonian minimization QP: ū_t^{*i} = f_safe(x_t^i, V_{x,t}^i)
            Sample Brownian noise: ∆w_t^i ∼ N(0, Σ∆t)
            Propagate value function: V_{t+1}^i = V_t^i − ( q(x_t^i) + ½ ū_t^{*i T} R ū_t^{*i} ) ∆t + V_{x,t}^{i T} Σ_t^i ∆w_t^i
            Propagate system state: x_{t+1}^i = x_t^i + f_t^i ∆t + G_t^i ū_t^{*i} ∆t + Σ_t^i ∆w_t^i
        end for
        Compute targets (true terminal value and true value function gradient): V_N^{*i} = φ(x_N^i); V_{x,N}^{*i} = φ_x(x_N^i)
    end for
    Compute mini-batch loss:
        L = (1/M) Σ_{i=1}^{M} [ a‖V_N^{*i} − V_N^i‖₂² + b‖V_{x,N}^{*i} − V_{x,N}^i‖₂² + c‖V_N^{*i}‖₂² + d‖V_{x,N}^{*i}‖₂² ] + λ‖θ^k‖₂²
    Gradient step: θ^{k+1}, ψ^{k+1} ← Adam.step(L, θ^k, ψ^k)
end for
return θ^K, ψ^K
Algorithm 2: Stochastic CBF layer (Safe OptNet layer)
Given:
    h: safe-set-characterizing barrier function; α(h(x)) = γh(x);
    f, G, Σ: system dynamics; L: maximum QP iterations; R: control cost positive definite matrix
Input: current system state x_t and current predicted value function gradient V_{x,t}
Set up the QP problem:
    Q = R,   q = G^T V_x,   C = −(∂h/∂x)^T G(x),   d = α(h(x)) + (∂h/∂x)^T f(x) + ½ tr( Σ(x)^T (∂²h/∂x²) Σ(x) )
Initialize solution: a feasible u, slack variable s ∈ R₊, Lagrange multiplier for the CBF inequality constraint λ ∈ R₊, and residual tolerance ε
for j = 1 to L do
    (∆u_j^aff, ∆s_j^aff, ∆λ_j^aff) ← AffineScale(u_j, s_j, λ_j)
    (∆u_j^cc, ∆s_j^cc, ∆λ_j^cc) ← CenteringCorrector(u_j, s_j, λ_j)
    ζ ← Residual(u_j, s_j, λ_j)
    α ← StepSize(u_j, s_j, λ_j)
    u_{j+1} = u_j + α(∆u_j^aff + ∆u_j^cc)
    s_{j+1} = s_j + α(∆s_j^aff + ∆s_j^cc)
    λ_{j+1} = λ_j + α(∆λ_j^aff + ∆λ_j^cc)
    if ζ ≤ ε then
        break
    end if
end for
Return ū* ← u_j