Robust Flight Navigation Out of Distribution With Liquid Neural Networks
Autonomous robots can learn to perform visual navigation tasks from offline human demonstrations and generalize well to online and unseen scenarios within the same environment they have been trained on. It is challenging for these agents to take a step further and robustly generalize to new environments with drastic scenery changes that they have never encountered. Here, we present a method to create robust flight navigation agents that successfully perform vision-based fly-to-target tasks beyond their training environment under drastic distribution shifts. To this end, we designed an imitation learning framework using liquid neural networks, a brain-inspired class of continuous-time neural models.

Brain-inspired neural dynamics improve the robustness of the decision-making process in autonomous agents, leading to better transferability and generalization in new settings under the same training distribution (18–21). We aimed to leverage brain-inspired pipelines and empirically demonstrate that if the causal structure of a given task is captured by a neural model from expert data, then the model can perform robustly even OOD. In a set of fly-to-target experiments with different time horizons, we show that a certain class of brain-inspired neural models, namely, liquid neural networks, generalizes well to many OOD settings, achieving performance beyond that of state-of-the-art models.

All trained agents were tested for their ability to generalize OOD. Our experiments included the following diverse set of tasks:
1) Fly-to-target tasks. Train on offline expert demonstrations of flying toward a target in the forest and test online in environments with drastic scenery changes.
2) Range test. Take a pretrained network from the fly-to-target task and, without additional training (zero-shot), test how far away we can place the agents to fly toward the target.
3) Stress test. Add perturbations in the image space and measure the success rate of agents under added noise.
4) Attention profile of networks. Apply feature saliency computation to assess the task understanding capabilities of networks via their attention maps.
5) Target rotation and occlusion. Rotate and occlude the target.

RESULTS
Fly-to-target task training
The fly-to-target task consists of autonomously identifying a target of interest and performing the flight controls driving the quadrotor toward it using the onboard stabilized camera's sequences of RGB (red, green, and blue) images as the sole input (Fig. 1). We initially started the quadrotor about 10 m away from the target and required that the policy guide the drone to within 2 m of the target, with the target centered in the camera frame. The objective was to learn to complete this task entirely from an offline dataset of expert demonstrations. The learned policies were then tested online in a closed loop within the training distribution and in drastically distinct settings. This experimental protocol allows for the principled assessment of performance and generalization capabilities of liquid networks compared with modern deep models (38, 41, 42).

The test environments presented OOD settings but maintained a task structure identical to the one learned from training data. We randomly positioned the quadrotor at a distance of about 10 m from the target, with the latter in the field of view of the onboard camera. We launched the closed-loop policy and observed whether the network could successfully guide the drone to the target. The test was repeated 40 times for each network and in each environment. Success was recorded when the network was able to both stabilize the drone within a radius of 2 m of the target and maintain the target in the center of the frame for 10 s. Failure cases were identified when the network generated commands that led to an exit of the target from the range of view without the possibility of recovery. We also counted as failures cases where the drone failed to reach the target in less than 30 s, to account for rare runs where the network generated commands indefinitely.
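As an illustration only, the sketch below encodes this success criterion as a function over logged telemetry: success requires holding a position within 2 m of the target with the target centered for 10 s, and any run exceeding 30 s counts as a failure. The record fields, the centering tolerance, and the logging format are assumptions made for this sketch and are not taken from the evaluation code used in the study.

```python
# Illustrative sketch of the closed-loop success criterion described above.
# Assumptions (not from the paper): telemetry records are available per step,
# and "target centered" is approximated by a normalized image-plane offset.
from dataclasses import dataclass

@dataclass
class Step:
    t: float                 # seconds since launch
    dist_to_target: float    # meters
    target_in_view: bool
    center_offset: float     # normalized offset of target from frame center

def evaluate_run(steps, hold_s=10.0, radius_m=2.0, timeout_s=30.0, center_tol=0.1):
    """Return 'success' or 'failure' for one closed-loop test run."""
    hold_start = None
    for s in steps:
        if s.t > timeout_s:
            return "failure"          # ran out of time (e.g., endless commands)
        if not s.target_in_view:
            hold_start = None         # target lost; failure only if never recovered
            continue
        stabilized = s.dist_to_target <= radius_m and s.center_offset <= center_tol
        if stabilized:
            hold_start = s.t if hold_start is None else hold_start
            if s.t - hold_start >= hold_s:
                return "success"      # held position and centering for 10 s
        else:
            hold_start = None
    return "failure"                  # never held the criterion long enough
```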
Fig. 1. Sample frames of training and test environments. (A) Third-person view of the quadrotor and one of the targets in the Training Woods where data were
collected. (B) Third-person view of the quadrotor during testing on the Urban Patio, with multiple adversary objects dispersed around the target camping chair. (C)
Training data frame samples from the quadrotor onboard camera containing various targets (camping chair, storage box, and RC car from top to bottom) and taken
during different seasons (summer, fall, and winter from left to right). (D) Test data frame samples from the quadrotor onboard camera against each of the four test
backgrounds: Training Woods (top left), Alternative Woods (top right), Urban Lawn (bottom left), and Urban Patio (bottom right).
The landscape was largely different from the woods, as shown in Fig. 1 (A to D). Frames contained buildings, windows, large reflective metallic structures, and the artificial contrast from the geometric shades they induced. Lighting conditions at different times of day and varying wind levels added additional perturbations.

We lastly examined the networks' generalization capabilities in an OOD environment consisting of a brick patio. In this environment, the background, including a number of man-made structures of different shapes, colors, and reflectivity, drastically differed from the training environment. Moreover, we added an extra layer of complexity to this experiment by positioning a number of other chairs of different colors (including red) and sizes in the frame. This ultimate test, including real-world adversaries, required robustness in the face of extremely heavy distribution shifts.

The results of this experiment show strong evidence that liquid networks have the ability to learn a robust representation of the task they are given and can generalize well to OOD scenarios where other models fail. This observation is aligned with recent work (21) that showed that liquid networks are dynamic causal models (DCMs) (43) and can learn robust representations for their perception modules to perform robust decision-making. In particular, CfC networks managed to fly the drone autonomously from twice and three times the training distance to their targets from raw visual data, with success rates of 90 and 20%, respectively. This is in contrast to an LSTM network that lost the target in every single attempt at both these distances, leading to a 0% success rate. Only ODE-RNN managed to achieve a single success at 20 m among the nonliquid networks, and none reached the target from 30 m.
This is in line with the active testing results, where, in OOD scenarios, CfCs stood out in task completion success rate. Liquid NCPs also showed great resiliency to noise, brightness, and contrast perturbations, whereas their performance was hindered by changes in input pixels' saturation level. In this offline setting, LSTMs were also among the most robust networks after liquid networks. However, their OOD generalization was poor compared with liquid networks.

Attention profile of networks
To assess the task understanding capabilities of all networks, we computed the attention maps of the networks via the feature saliency computation method called VisualBackProp (44), applied to the CNN backbone that precedes each recurrent network. VisualBackProp associates importance scores with input features during decision-making. The method has shown promise in real-world robotics applications where visual inputs are processed by convolutional filters first. The attention maps corresponding to the convolutional layers of each network in a test case scenario are shown in Fig. 3. We observed that the saliency maps of liquid networks (both NCPs and CfCs) were much more sensitive to the target from the start of the flight compared with those of other methods.

We show more comprehensive saliency maps in different scenarios in figs. S1 to S3; there, we observed that the quality of the attention maps followed the same trend.
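For readers unfamiliar with the method, the following is a minimal sketch of a VisualBackProp-style saliency computation over a convolutional backbone. It assumes a torch.nn.Sequential of Conv2d/ReLU blocks and substitutes bilinear upsampling for the transposed convolution of the original method (44); it is not the implementation used in this work.

```python
# Minimal VisualBackProp-style saliency sketch for a CNN backbone.
import torch
import torch.nn.functional as F

def visualbackprop(backbone, image):
    """image: (1, 3, H, W) tensor -> (1, 1, H, W) saliency map in [0, 1]."""
    feature_maps = []
    hooks = [m.register_forward_hook(lambda _m, _i, out: feature_maps.append(out))
             for m in backbone.modules() if isinstance(m, torch.nn.Conv2d)]
    with torch.no_grad():
        backbone(image)
    for h in hooks:
        h.remove()

    # Average each layer's activations over channels, then propagate the
    # deepest map back toward the input, multiplying layer by layer.
    averaged = [fm.relu().mean(dim=1, keepdim=True) for fm in feature_maps]
    saliency = averaged[-1]
    for prev in reversed(averaged[:-1]):
        saliency = F.interpolate(saliency, size=prev.shape[-2:],
                                 mode="bilinear", align_corners=False)
        saliency = saliency * prev
    saliency = F.interpolate(saliency, size=image.shape[-2:],
                             mode="bilinear", align_corners=False)
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```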
Closed-loop performance was evaluated in unseen and adversarial environments. Table 1 suggests that only five of the six networks were capable of reaching the target in the Training Woods environment on half or more of the attempts. TCN (7.5%) failed to achieve this threshold even on data from the training background distribution. These models also performed poorly in the other testing scenarios, with TCN unable to achieve a single success in any of the other three environments and GRU-ODE performing worse than in the Training Woods scenario, with the exception of the Urban Lawn case, where it succeeded about a third of the time (32.5%). These two architectures exhibited both poor closed-loop performance and poor generalization and were thus deemed incapable of understanding and performing the assigned control task. Among the models performing reasonably well in both woods scenarios are the ODE-RNNs, with a 62.5% success rate when the target was in the same position as in the training data and up to 82.5% with the target placed in a slightly different position. Although this network seemed to have acquired the capability to achieve the task in closed loop, it generalized poorly to unseen environments, with 17.5 and 25% OOD success rates in the Urban Lawn and Urban Patio experiments.

Other models performed consistently in the woods, with success rates of 82.5% for LSTM, 100% for NCP, and 85% for CfC on the Training Woods test. In the Alternative Woods scenario, LSTM and both liquid networks (NCPs and CfCs) succeeded in reaching the target at a high success rate of more than 90%, showcasing the acquired ability to learn and execute closed-loop tasks from expert demonstrations.

When asked to generalize to the Urban Lawn, the top performer in this environment is CfC, which succeeded in attending to the target in 90% of attempts. On the task of flying to the target in the presence of natural adversaries and distractions in the extremely heavy distribution shift setup provided by the Urban Patio, LSTM succeeded in only 27.5% of the attempts, whereas NCP (52.5%) and CfC (67.5%) achieved the best performance.

We observed that the network could indefinitely navigate the quadrotor to new targets (we stopped after four laps, with a total of 12 checkpoints reached and detected).

Dynamic target
A useful task in quadrotor autonomous flight is following a moving target. We thus tested the trained policies' ability to pilot the drone in pursuit of a dynamic target. For this task, a figure-of-eight course with six checkpoints was set out (at the extremities and on either side of each loop, with consecutive checkpoints 5 m apart). This design choice ensured that run lengths were not upper-bounded, tested both turning directions, and included all possible natural lighting angles, all while containing the experiment in a constrained space. The target was moved from one checkpoint to the next, ensuring that enough reaction time in the camera field of view was granted, and the number of checkpoints reached was assigned as the score of a given test run. The experiment was reproduced in two environments, the Urban Lawn setting and a sports field we call Grass Pitch (see fig. S10). Testing in the latter environment is subject to highly challenging conditions of glaring sunlight and strong wind. In such conditions, most network architectures struggled to even latch onto the target (get to the first checkpoint and start following the target). We took this into account in our results by taking averages only on latched test runs while also providing the rates at which each network initially detected the target.

Hence, table S7 shows that liquid networks consistently achieved longer trajectories in the Urban Lawn environment, ahead of the LSTM and ODE-RNN architectures. In the Grass Pitch setting, all networks apart from LSTM, NCP, and CfC failed to latch onto the target in all attempts. In this challenging environment, liquid architectures marginally outperformed LSTM in terms of trajectory length, although CfC managed to latch onto the target more than twice as often as the LSTM and NCP policies.
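As a small illustration of this scoring scheme, the sketch below aggregates per-run results in the way described above: the latch rate is computed over all attempts, whereas the checkpoint score is averaged over latched runs only. The run record layout is an assumption made for this example.

```python
# Aggregation sketch for the dynamic-target experiment: a run's score is the
# number of checkpoints reached; averages are taken over latched runs only,
# and the latch rate is reported separately. The record format is assumed.
def summarize_runs(runs):
    """runs: list of dicts like {"latched": bool, "checkpoints": int}."""
    latched = [r for r in runs if r["latched"]]
    latch_rate = len(latched) / len(runs) if runs else 0.0
    mean_checkpoints = (sum(r["checkpoints"] for r in latched) / len(latched)
                        if latched else 0.0)
    return {"latch_rate": latch_rate, "mean_checkpoints": mean_checkpoints}

# Example: a policy that latched onto the target in 3 of 4 attempts.
print(summarize_runs([
    {"latched": True, "checkpoints": 6},
    {"latched": True, "checkpoints": 4},
    {"latched": False, "checkpoints": 0},
    {"latched": True, "checkpoints": 8},
]))
```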
DISCUSSION
Choice of models matters for OOD generalization
Our experiments show large inequalities between different RNN architectures when performing a diverse range of fly-to-target control tasks in closed loop, and all the more so when required to generalize OOD.

Furthermore, comprehensive evidence in favor of a crisp advantage for liquid neural networks for closed-loop end-to-end control learning was amassed through our extensive real-world experimental results. Our brain-inspired neural networks, and all the more so our CfC architecture, largely outperformed all other models on all of the range tests, the rotation and occlusion robustness tests, the adversarial hiking task, and the dynamic target tracking task. When all nonliquid networks failed to achieve the task at twice the nominal distance seen in training, our CfC network achieved a 90% success rate and exhibited successful runs even at three times the nominal distance from the target (20%). Also, LSTM's performance was more than halved when the chair was rotated or occluded, whereas the CfC's success rate was reduced by only 10% for rotation (with success rates above 80%) and around 30% for occlusion (with success rates at 60%). Similarly, on the adversarial hiking task, CfC achieved a success rate more than twice that of its closest nonliquid competitor, completing 70% of runs, whereas the LSTM model only reached all three targets on 30% of occasions. Last, LSTM could track a moving target for an average of 5.8 steps, whereas our CfC tracked it for longer on average.

We also analyzed the runs in which other models failed to navigate the drone. In these cases, the execution of the navigation task is hindered either by a distraction in the environment or by a discrepancy between the representations learned in the perception module and the recurrent module. This discrepancy might be behind the LSTM agent's tendency to fly away from the target or to choose an adversarial target instead of the true one detected by its perception module in the first place. With liquid networks, however, we noticed a stronger alignment of the representations between the recurrent and perception modules, as the execution of the navigation task was more robust than that of other models.

Causal models in environments with unknown causal structure
Fig. 5. Liquid neural networks. (A) Schematic demonstration of a fully connected LTC layer (35). The dynamics of a single neuron i are given, where xi(t) represents the hidden state of neuron i.
approximations (Fig. 5B) (50). These brain-inspired models are instances of continuous-time (CT) neural networks (35, 41) that can be trained via gradient descent in modern automatic differentiation frameworks. Liquid networks exhibit stable and bounded behavior, yield superior expressivity within the family of CT neural models (35, 41), and give rise to improved performance on a wide range of time series prediction tasks compared with advanced recurrent neural network models (50). In particular, a sparse network configuration composed of fewer than two dozen LTC neurons supplied with convolutional heads showed great promise in learning, end to end, to map a high-dimensional visual input stream of pixels to robust control decisions (18). These liquid network instances are called NCPs because their four-layered network structure is inspired by the neural circuits of the nervous system of the nematode C. elegans (51), as illustrated in Fig. 5 (C and D). In prior work, these sparse liquid networks learned how to navigate autonomous simulated aerial (21) and real ground (18) vehicles to their goals much more robustly than their advanced deep learning counterparts in a large series of behavioral-cloning experiments within their training distribution. The state-space representation of an LTC neural network is determined by the following set of ODEs (35):

dx(t)/dt = −[1/τ + f(x(t), I(t), t, θ)] ⊙ x(t) + f(x(t), I(t), t, θ) ⊙ A    (1)

Here, x(t) (of size D × 1) is the hidden state with size D, I(t) (of size m × 1) is an input signal, τ (of size D × 1) is the fixed internal time-constant vector, A (of size D × 1) is a bias parameter, and ⊙ is the Hadamard product. Intuitively, LTC networks are able to change their equations on the basis of the input they observe. These networks, either in their ODE form or in their closed-form representation (50), demonstrate causality and generalizability in modeling spatiotemporal dynamics compared with their counterparts (21). Their closed-form representations are called closed-form continuous-time (CfC) models and are given by (50):

x(t) = σ(−f(x, I; θf) t) ⊙ g(x, I; θg) + [1 − σ(−f(x, I; θf) t)] ⊙ h(x, I; θh)    (2)

where σ is the sigmoid nonlinearity and the two terms act as time-continuous gates. Here, f, g, and h are three neural network heads with a shared backbone, parameterized by θf, θg, and θh, respectively. I(t) is an external input, and t is the time sampled by input time stamps, as illustrated in Fig. 5.

Throughout, liquid (neural) networks refer to the general category of models that are represented either by LTCs or CfCs: CfCs are closed-form liquid networks, and LTCs are ODE-based liquid networks.
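To make Eqs. 1 and 2 concrete, the sketch below integrates a toy LTC layer with an explicit Euler step and evaluates the corresponding CfC update. The layer width, the shared tanh head standing in for f, g, and h, and the solver choice are illustrative assumptions and do not reproduce the trained NCP or CfC policies.

```python
# Numerical sketch of Eqs. 1 and 2 for a single liquid layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
D, m = 8, 4                                   # hidden size, input size
Wx, Wi, b = rng.normal(size=(D, D)), rng.normal(size=(D, m)), np.zeros(D)
tau, A = np.ones(D), rng.normal(size=D)       # time constants and bias (Eq. 1)

def f(x, I):                                  # bounded, monotonically increasing head
    return np.tanh(Wx @ x + Wi @ I + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ltc_step(x, I, dt=0.01):
    """One explicit Euler step of the LTC ODE (Eq. 1)."""
    fx = f(x, I)
    dx = -(1.0 / tau + fx) * x + fx * A
    return x + dt * dx

def cfc_state(x, I, t, g_head=f, h_head=f):
    """Closed-form CfC state (Eq. 2); g and h reuse f here for brevity."""
    gate = sigmoid(-f(x, I) * t)
    return gate * g_head(x, I) + (1.0 - gate) * h_head(x, I)

x = np.zeros(D)
for _ in range(100):                          # integrate the LTC for 1 s
    x = ltc_step(x, I=rng.normal(size=m))
print(cfc_state(x, I=np.zeros(m), t=1.0))
```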
Liquid networks capture causality
The key to liquid networks' robust performance under distribution shifts is their ability to dynamically capture the true cause and effect of their given task (21). This can be shown analytically because LTCs are DCMs (21, 43), a framework through which models can account for internal and external interventions with independent mechanisms. Vorbach et al. (21) theoretically showed that the learning system described by an LTC network of Eq. 1 can control internal and external interventions by the network parameters θ and thus reduces to a DCM (43) as long as f is monotonically increasing, bounded, and Lipschitz continuous.

DCMs are probabilistic graphical models (43). DCMs differ from other causal models in that they do not cast around statistical dependencies from data directly; rather, they have dynamic mechanisms, such as the structure presented in Eq. 2, that enable extracting cause and effect from data (52).

Causal properties of DCMs focus the attention of LTCs on the task rather than the context of the task, and, for this reason, in this article, we hypothesize and show that tasks learned in one environment can be transferred to different environments for LTC networks where other models fail. More formally, consider a sequence of task steps T, a sequence of robot/world configurations C, and a sequence of images V. Visual data are generated following the schematic graphical model

T → C → V

We can then explain how causal understanding implies knowing: robot states cause images and not vice versa (so when training on images and control inputs, the network should output positive velocity when required for the task, not because it has learned that a sequence of images moving forward implies that the drone is moving forward). In addition, the drone's motion is governed by the task (for example, centering the chair in the frame) and not by other visual correlations in the data. With the data augmentation techniques implemented, we show that almost all networks can handle the former condition. The latter condition, as our experiments indicate, is where the choice of model matters most.

We built advanced neural control agents for autonomous drone navigation (fly-to-target) tasks in challenging environments. We explored their generalization capabilities in new environments with a drastic change of scenery, weather conditions, and other natural adversaries.

Training procedure
The following section details the preparation of the data and hyperparameters used for training the onboard models.

Data preparation
The training runs were originally collected as long sequences in which the drone moved between all five targets. Because sequences of searching for the next target and traveling between them could lead to ambiguity in the desired drone task, for each training run we cut the sequence into shorter segments, each containing only a single target.
Fig. 6. End-to-end learning setup. (A) Policies were trained to solve the following task: Using only images taken from an onboard camera, navigate the drone from its
starting location to a target 10 m away, keeping the object in the frame throughout. (B) One goal of the work is to explore the generalization capability of various neural
architectures by adapting the task to previously unseen environments. (C) Another goal is to understand the causal mechanisms underlying different networks’ behaviors
while completing the task, by visualizing networks’ input saliency maps. (D) The training process started by hand-collecting human expert trajectories and recording
camera observations and human expert actions. (E) To further convey the task, we then generated synthetic sequences by taking images collected during human flights
and repeatedly cropping them, creating the appearance of a video sequence in which the drone flies up to the target and centers it in the frame. (F) The networks were
trained offline on the collected expert sequences using a supervised behavior cloning MSE loss. (G) We then deployed trained networks on the drone in online testing and
observed task performance under heavy distributional shifts, varying lighting, starting location, wind, background setting, and more.
All images within a given sequence had the same saturation offsets, but two images in different sequences had different offsets. Last, each image had Gaussian random noise with a mean of 0 and an SD of 0.05 added.

In addition, to better convey the task, we performed closed-loop augmentation by generating synthetic image and control sequences and adding them to the training set. To generate these sequences, we took a single image with the target present and repeatedly cropped the image. Over the duration of the synthetic sequence, we moved the center of the crop from the edge of the image to the center of the target, causing the target to appear to move from the edge of the frame to the center. In addition, we shrank the size of the cropped window and upscaled all of the cropped images to the same size, causing the target to appear to grow bigger. Together, these image procedures simulated moving the drone up to the target and centering it in the frame.
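A minimal sketch of this crop-based synthetic sequence generation is given below. The sequence length, output resolution, the starting edge of the crop path, and the shrink schedule are assumptions chosen for illustration rather than the parameters used to build the published synthetic datasets.

```python
# Sketch of the closed-loop crop augmentation: a single still image is
# repeatedly cropped so that the target appears to approach and grow.
from PIL import Image
import numpy as np

def synthetic_sequence(image_path, target_xy, out_size=(256, 144),
                       n_frames=30, start_scale=1.0, end_scale=0.4):
    img = Image.open(image_path)
    W, H = img.size
    tx, ty = target_xy                       # pixel location of the target
    frames = []
    for k in range(n_frames):
        a = k / (n_frames - 1)               # 0 -> 1 over the sequence
        # Crop center slides from the (left) image edge toward the target ...
        cx = (1 - a) * 0.0 + a * tx
        cy = (1 - a) * (H / 2) + a * ty
        # ... while the crop window shrinks, so the target appears to grow.
        scale = (1 - a) * start_scale + a * end_scale
        w, h = int(scale * W), int(scale * H)
        left = int(np.clip(cx - w / 2, 0, W - w))
        top = int(np.clip(cy - h / 2, 0, H - h))
        # All crops are upscaled to the same output size.
        frames.append(img.crop((left, top, left + w, top + h)).resize(out_size))
    return frames
```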
For each chosen hyperparameter configuration, the number of trainable parameters in the corresponding model for every neural architecture is listed in table S3. Each network tested was prefixed by a simple CNN backbone for processing incoming images. The 128-dimensional CNN features were then fed to the recurrent unit that predicted control outputs. The shared CNN architecture is pictured in table S4.

Fine-tuning
To simultaneously learn the geospatial constructs present in the long, uncut sequences and the task-focused controls present in the cut and synthetic sequences, we fine-tuned a model trained on the long sequences with sequences featuring only one target. The starting checkpoint was trained on the original uncut sequences, all sliced sequences, and synthetic data corresponding to all five targets.
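To summarize the training setup in code, the sketch below wires a small CNN backbone producing 128-dimensional per-frame features into a recurrent unit and trains it with a behavior cloning MSE loss on an expert sequence batch. The layer sizes, the LSTM cell standing in for the recurrent and liquid units compared in the paper, and the 4-dimensional control output are assumptions for illustration.

```python
# Minimal behavior-cloning sketch: CNN features -> recurrent unit -> controls,
# trained with a supervised MSE loss against expert commands (illustrative).
import torch
import torch.nn as nn

class CNNBackbone(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                       # x: (B*T, 3, H, W)
        return self.proj(self.conv(x).flatten(1))

class Policy(nn.Module):
    def __init__(self, feat_dim=128, n_controls=4):
        super().__init__()
        self.backbone = CNNBackbone(feat_dim)
        self.rnn = nn.LSTM(feat_dim, 64, batch_first=True)  # stand-in cell
        self.head = nn.Linear(64, n_controls)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(B, T, -1)
        hidden, _ = self.rnn(feats)
        return self.head(hidden)                # (B, T, n_controls)

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
frames = torch.rand(2, 8, 3, 144, 256)          # dummy expert sequence batch
expert_cmds = torch.zeros(2, 8, 4)
loss = nn.functional.mse_loss(policy(frames), expert_cmds)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```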
19. M. Lechner, R. Hasani, M. Zimmer, T. A. Henzinger, R. Grosu, Designing worm inspired neural networks for interpretable robotic control, in 2019 International Conference on Robotics and Automation (ICRA) (IEEE, 2019), pp. 87–94.
20. R. Hasani, M. Lechner, A. Amini, D. Rus, R. Grosu, A natural lottery ticket winner: Reinforcement learning with ordinary neural circuits, in International Conference on Machine Learning (PMLR, 2020), pp. 4082–4093.
21. C. Vorbach, R. Hasani, A. Amini, M. Lechner, D. Rus, Causal navigation by continuous-time neural networks, in Advances in Neural Information Processing Systems (2021), vol. 34.
22. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems (NIPS, 2017), pp. 5998–6008.
23. S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, N. de Freitas, A generalist agent. arXiv:2205.06175 [cs.AI] (12 May 2022).
24. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models. arXiv:2001.08361 [cs.LG] (23 Jan 2020).
41. R. T. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, Neural ordinary differential equations, in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS, 2018), pp. 6572–6583.
42. S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271 [cs.LG] (4 Mar 2018).
43. K. J. Friston, L. Harrison, W. Penny, Dynamic causal modelling. Neuroimage 19, 1273–1302 (2003).
44. M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. J. Ackel, U. Muller, P. Yeres, K. Zieba, VisualBackProp: Efficient visualization of CNNs for autonomous driving, in IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018), pp. 1–8.
45. J. Pearl, Causal inference in statistics: An overview. Stat. Surv. 3, 96–146 (2009).
46. W. Penny, Z. Ghahramani, K. Friston, Bilinear dynamical systems. Phil. Trans. R. Soc. B Biol. Sci. 360, 983–993 (2005).
47. C. Koch, I. Segev, Methods in Neuronal Modeling: From Ions to Networks (MIT Press, 1998).
48. E. R. Kandel, J. H. Schwartz, T. M. Jessell, Principles of Neural Science (McGraw-Hill, 2000), vol. 4.
49. L. Lapique, Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation.
Competing interests: Some of the authors filed a patent with application number 63/415,382 on 12 October 2022. The other authors declare that they have no competing interests. Data and materials availability: All data, code, and materials used in the analysis are openly available at https://zenodo.org/badge/latestdoi/610810400 and https://zenodo.org/badge/latestdoi/381393816 under the Apache 2.0 License for purposes of reproducing and extending the analysis. The original hand-collected training dataset is published (http://knightridermit.myqnapcloud.com:8080/share.cgi?ssid=06lMJMN&fid=06lMJMN), with a file name of devens snowy and a fixed size of 33.2 GB. This dataset was used for training the starting checkpoints. The dataset used for fine-tuning the models tested on the drone can be found at http://knightridermit.myqnapcloud.com:8080/share.cgi?ssid=06lMJMN&fid=06lMJMN (devens chair, 2.3 GB) and is a subset of the full dataset containing only runs with the chair target. We have also included the exact synthetic datasets we used for our experiments. We have both a full dataset http://knightridermit.myqnapcloud.com:8080/share.cgi?ssid=06lMJMN&fid=06lMJMN (synthetic small4, 14.7 GB) used to train the starting checkpoint and a synthetic chair-only dataset http://knightridermit.myqnapcloud.com:8080/share.cgi?ssid=06lMJMN&fid=06lMJMN (synthetic chair, 4.3 GB) used to fine-tune the final models for online testing. We would like to thank IBM for providing access to the Satori computing cluster and MIT/MIT Lincoln Labs for providing access to the Supercloud computing cluster. Both clusters were very helpful for training models and hyperparameter optimization.

Submitted 6 May 2022
Accepted 22 March 2023
Published 19 April 2023
10.1126/scirobotics.adc8892
Copyright © 2023 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim
to original U.S. Government Works