ReconROS Executor - Event-Driven Programming
arXiv:2201.07454v1 [cs.RO] 19 Jan 2022

Abstract—Many applications from the robotics domain can benefit from FPGA acceleration. A corresponding key question is how to integrate hardware accelerators into software-centric robotics programming environments. Recently, several approaches have demonstrated hardware acceleration for the robot operating system (ROS), the dominant programming environment in robotics. ROS is a middleware layer that features the composition of complex robotics applications as a set of nodes that communicate via mechanisms such as publish/subscribe, and distributes them over several compute platforms.
In this paper, we present a novel approach for event-based programming of robotics applications that leverages ReconROS, a framework for flexibly mapping ROS 2 nodes to either software or reconfigurable hardware. The ReconROS executor schedules callbacks of ROS 2 nodes and utilizes a reconfigurable slot model and partial runtime reconfiguration to load hardware-based callbacks on demand. We describe the ReconROS executor approach, give design examples, and experimentally evaluate its functionality with examples.

I. INTRODUCTION

Many robotics applications are computationally very demanding, in particular when they process large amounts of data sensed from their environment and run involved algorithms to compute state information and next actions to take. In recent years, efficient implementations of robotics applications on high-performance embedded platforms comprising multi-core CPUs, general-purpose GPUs, or FPGAs have been studied, and several works have shown the potential advantage of FPGAs over CPUs and GPUs with respect to performance, energy consumption, and latency, e.g., [1]–[4].

A key question for FPGA acceleration of robotics applications is how to integrate hardware accelerators into software-centric robotics programming environments. Recently, several approaches targeted the robot operating system (ROS), which is the dominant programming environment in robotics. ROS is essentially a middleware layer that allows for the decomposition of complex robotics applications into a set of nodes that communicate via mechanisms such as publish/subscribe. The nodes can then be distributed over several compute platforms.

Most approaches partition a ROS node and map the computation-intensive kernels to reconfigurable hardware. A few approaches go further and support the mapping of complete ROS nodes to hardware, which greatly increases flexibility and facilitates design space exploration. However, the mapping of ROS nodes to either software or hardware is static, and event-driven programming, which is central to ROS 2, the latest version of ROS, is not supported.

In this paper, we present the ReconROS executor for event-driven programming of ROS 2 applications with flexible hardware acceleration. The ReconROS executor registers ROS 2 node functions as callbacks and dispatches them to run in software on one of several processor cores or in hardware in the reconfigurable fabric. While the executor builds on the architecture and functionality of ReconROS [5], [6], a previously presented open source framework for hardware acceleration in robotics, the main novelty is that the ReconROS executor employs partial hardware reconfiguration with a reconfigurable slot model to load and execute hardware-mapped ROS 2 callbacks on demand, and that it can dispatch callbacks for either software or hardware execution. As a result, ROS 2 developers can exploit the benefits of hardware acceleration from their standard programming environment.

The remainder of the paper is organized as follows: Section II provides background, in particular an overview of ROS and the ROS executor for event-driven programming, and a discussion of related work that aims at integrating FPGA hardware acceleration with ROS. Section III explains the hardware architecture and design concept for the novel ReconROS executor, before Section IV presents an example for the configuration and coding steps when developing a ReconROS application. Section V reports on experiments to demonstrate the functionality and advantages of our approach. Finally, Section VI concludes the paper and gives an outlook on future work.

II. BACKGROUND AND RELATED WORK

In this section, we first introduce the robot operating system ROS, then focus on the functionality of the ROS executor and, finally, review related approaches for making hardware acceleration available to ROS-based robotics applications.

A. The Robot Operating System

The Robot Operating System (ROS) is the dominant framework for robotics programming. ROS comprises a middleware layer for data communication within a computer network and tools and libraries for rapid and modular development of large and complex applications. ROS decomposes an application
into a set of ROS nodes that can communicate by exchanging messages using three possible mechanisms: an m:n publish-subscribe mechanism and two 1:1 mechanisms, denoted as services and actions, that follow a client-server model. Services allow nodes to access functionalities of other nodes in a manner similar to remote procedure calls, and actions are more elaborate and combine two service requests with a publish-subscribe communication for regular feedback information.

ROS applications are often represented as computation graphs, in which the nodes represent ROS nodes and the edges represent a ROS-supported form of communication [7]. The nodes of a computation graph are then mapped to compute platforms in a distributed system. Figure 1 outlines an example for a computation graph using publish-subscribe communication. The ROS node /camera captures images from a camera and publishes them to the topic /image_raw. The sobel /filter node subscribes to this topic and publishes the filtered images to the topic /image_filtered. The /viewer node displays the filtered images. The second part of the application implements a control loop for servo control, where the node /pid_ctrl runs a PID control algorithm on position sensor input, the node /inv_kinematics determines the required new position and, finally, the node /actor_driver sets the motor signals accordingly.

Fig. 1: ROS computational graph example (labels in the figure: /camera, /viewer, /image_raw, /image_filtered, /filter, /ctrl_out, /inv_kinematics, /tf_out, /pid_ctrl, /actor_driver)

The most recent version, ROS 2, builds on several layers on top of the operating system Linux. As communication layer, ROS 2 uses the data distribution service (DDS), a common standard for publish-subscribe communication specified by the Object Management Group (OMG) [8]. The ROS middleware layer (rmw) in combination with a DDS implementation-specific adaptor in the ROS 2 stack allows for interchangeability of different DDS implementations. There are several different DDS implementations available, e.g., the eProsima FastRTPS or RTI Connext DDS. Additionally, shared-memory based communication between nodes is enabled by, for example, iceoryx. On a higher level, the rcl library comprises the standard framework for providing ROS concepts. High-level libraries (rclcpp for C++ or rclpy for Python) wrap rcl and provide more advanced functionalities, e.g., execution management.

Another key part of the ROS framework is the message infrastructure. Messages can be hierarchically composed out of basic built-in elements such as integers or floats using a common description language. Further, ROS includes standard message packages, e.g., for images or positional data. The ROS tool flow comprises tools for translating such message definitions into data representations suitable for inclusion in high-level language programs in, for example, Python and C++.

B. The ROS 2 Executor

The ROS nodes mapped to one computation platform can execute as Linux processes or threads using the underlying Linux scheduler. In such a case the nodes need to regularly poll the communication layer for available messages. However, the more common model under ROS 2 is the event-driven model, where nodes register callbacks that are executed when specific events occur. There are four categories of callbacks: callbacks executed by any node when a (periodic) timer event occurs, callbacks executed by a subscriber on a received message, callbacks executed by a ROS 2 service server on a received service request, and callbacks executed by a ROS 2 service client on a received service response.

ROS 2 provides a so-called executor function that interacts with the underlying communication layer and timer infrastructure to catch events and execute callbacks in a run-to-completion mode utilizing one or more worker threads. That is, callbacks are not preempted.

By default, ROS 2 offers standard single-threaded and multi-threaded executors for C++ and Python applications that implement the scheduling algorithm sketched in Figure 2. The algorithm comprises two nested loops. In the outer loop, the executor interacts with the DDS layer to collect all ready subscriber, server, and client callbacks into a readySet. In the inner loop, the executor checks for timer-triggered callbacks and, if such are available, executes them. Then, subscriber, server, and client callbacks are considered in that order and, if such a callback is ready, it is executed and removed from the readySet. If there are no more callbacks ready, the next iteration of the outer loop is started after a configurable waiting time.

The ROS 2 executor implicitly implements priorities in the sense that timer-triggered callbacks get high priority, and the other callbacks lower priorities, since they are first collected in the outer loop and then executed in the inner loop in the order shown in Figure 2. Within one callback category, requests are ordered by the sequence of their initial registration at the executor.

The real-time behavior of the ROS 2 executor was studied in [9]. The authors analyzed the response time of ROS 2 applications and provided a scheduling model, a worst-case response time analysis, and general insights into the real-time behavior of ROS 2. In follow-up work [10], the precision of the response time analysis was improved. There are also alternatives to the standard ROS 2 executor. For example, in [11] an executor for micro ROS platforms equipped with embedded microcontrollers was presented. This executor is fully coded in C, supports domain-specific requirements, and improves on the analysis of real-time aspects.

C. Related Approaches for ROS-FPGA Integration

In recent years, several approaches have been presented for integrating reconfigurable hardware accelerators into the ROS
architecture. Most of these approaches partition a ROS software node and map the runtime-intensive parts as acceleration kernels to reconfigurable hardware, while the remaining parts stay on the CPU. The works [12]–[15] follow this model and focus on the automated generation of the interfaces between software and acceleration kernels, the minimization of communication time between them, and the use of high-level synthesis to increase productivity. The ReconfROS framework [16] implements shared-memory communication between the software and the acceleration kernels to further reduce communication effort.

Fig. 2: ROS 2 standard executor scheduling algorithm [9] (flowchart elements: start; collect ready non-timer callbacks; ready timer callbacks?; ready subscriber callbacks?; ready server callbacks?; ready client callbacks?; execute callback; wait)

A different approach is followed in [17], [18], where one or more complete ROS nodes can be mapped to hardware. All these hardware-mapped ROS nodes connect to an AXI-based gateway followed by a protocol generator and a TCP/IP interface to allow for communication with other ROS nodes of the application. A central manager coordinates the communication between the hardware-mapped ROS nodes and the gateway.

ReconROS [5], [6] also allows for mapping complete ROS nodes to hardware and combines the reconfigurable hardware operating system ReconOS [19], [20] with ROS 2. ReconOS features multithreaded programming with hardware and software threads, i.e., both hardware and software threads use operating system services, such as semaphores and mutexes, in exactly the same way. Additionally, all threads can access the common virtual shared memory address space. ReconROS extends the functionality of ReconOS by adding ROS 2 primitives, turning complete ROS 2 nodes into hardware threads, and allowing them to use the ROS communication mechanisms publish-subscribe, services, and actions. Since the ReconROS architecture and build system are available in open source, we have been using it as the starting point for the work presented in this paper.

Industry has also taken up the ROS-FPGA integration. For example, Xilinx develops the KRIA robotics stack [21], [22], which merges ROS build tool flows with the Vitis software platform. There, compute-intensive calculations are outsourced to reconfigurable logic as so-called acceleration kernels.

III. THE RECONROS EXECUTOR

In the existing ReconROS framework, ROS 2 hardware nodes have to be statically placed in reconfigurable logic where they remain until the application terminates. The hardware nodes run in while(1)-loops that start with blocking reads for new input data, process the data, and write the output. As the main novelty, we introduce partial hardware reconfiguration with a reconfigurable slot model to be able to load and execute hardware-mapped ROS 2 nodes on demand, and we devise a ROS 2 executor that can dispatch callbacks for either software or hardware execution, if hardware versions of the callbacks are available.

As a result, (i) robotics application developers can exploit hardware acceleration from their known programming environment and event-driven programming model, and (ii) the limited hardware resources are operated in an efficient manner. In the following, we discuss the hardware architecture and the design concept of the ReconROS executor.

A. Hardware Architecture

Figure 3 highlights the architecture for ReconROS using our novel ROS 2 executor. According to the underlying ReconOS architecture, the programmable logic part of a platform FPGA contains a set of n reconfigurable slots (RS) that can accommodate hardware threads, which implement the hardware-mapped ROS 2 callbacks. The number and sizes of these reconfigurable slots are application-specific and configured during the design process. Each such reconfigurable slot is connected to an operating system interface (OSIF) for communication with the host operating system Linux running on the processor cores, and to a memory interface (MEMIF) for accessing shared external memory. Another component, omitted from Figure 3 for simplicity, is the memory subsystem that provides arbitration between the MEMIFs and includes a memory management unit to allow the hardware threads to work with virtual addresses.

ROS 2 hardware-mapped callbacks are loaded into reconfigurable slots on demand during runtime with a partial reconfiguration process utilizing the ZyCAP implementation [23]. ZyCAP comprises an ICAP (internal configuration access port) interface and a DMA block on the hardware side, and a Linux kernel driver and user libraries on the software side. Library functions are available to load reconfigurable slots by setting up DMA transfers from external memory to the ICAP interface. We have chosen ZyCAP over using the processor configuration access port (PCAP) or the ICAP directly for two reasons: First, ZyCAP nominally features 3× higher performance for writing bitstreams, i.e., 382 MByte/s for the ZyCAP compared to 128 MByte/s for PCAP [23]. Second, ZyCAP includes a DMA controller that lowers CPU load for partial reconfiguration and, in turn, frees the CPU for executing ROS 2 callbacks. In the hardware architecture, the ZyCAP block is connected to the processing system (PS) via a high
performance port (HPx) to transfer the bitstream and to an AXI-Lite interface for the configuration of DMA transactions. From the existing ZyCAP project, we have simplified the user library functions to the three basic functions ZyCAP_Init() for the initialization, ZyCAP_Write_Bitstream() for blocking write of the bitstream into the ICAP, and ZyCAP_DeInit() for de-initialization, and integrated them into the ReconROS library. Additionally, we have adapted the ZyCAP Linux kernel [...]

Fig. 3: Hardware architecture for the ReconROS executor (blocks shown: Platform FPGA, Processing system, Programmable logic, Processor cores #0 … #m-1, Global timer, Memory controller, OSIF, MEMIF, ICAP, DMA, RS #0 … #n-1, RAM (DDR3 SDRAM))

[...] one hardware worker thread per reconfigurable slot. The main thread maintains four callback lists that include all callbacks registered at the executor, i.e., one for timers, one for subscribers, one for service servers, and the last one for service clients. Each entry in such a callback list comprises a unique identifier, a pointer to the received message in case of non-timer callbacks, and a ResourceMask that contains a field for software and each reconfigurable slot. If the execution mode is software, the corresponding field includes a function pointer to the callback code. If the execution mode is hardware, the corresponding fields contain pointers to callback bitstreams for the reconfigurable slots.

The overall m software and n hardware worker threads are started during the initialization of the executor. Each of these threads implements the inner loop of Figure 2. Figure 4 displays the functionality of the hardware worker thread for reconfigurable slot x. The thread accesses the callback lists in the order of timers, subscribers, service servers, and service clients and searches for ready callbacks (CB) with a matching entry x in the ResourceMask. If such an entry is found (CB not zero), the thread checks whether the corresponding bitstream is already loaded in the reconfigurable slot x. If so, the callback is simply started; otherwise partial reconfiguration is performed to load the callback bitstream. The worker thread then waits until the callback is finished and runs into the next loop iteration.

[Figure 4, partially recovered: interaction between the hardware worker and the callback lists, with the calls GetCallback(ResourceId=x, Position) and CB=GetTimer(ResourceId)]
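The behavior of the hardware worker thread described above can be illustrated with a small software-only simulation. This is a minimal sketch under stated assumptions, not the ReconROS implementation: the names Callback, HardwareWorker, reconfigure, and run are hypothetical, the ResourceMask is modeled as a dictionary from slot index to bitstream name, and partial reconfiguration and callback execution are replaced by functions that only record what would happen. Locking between concurrent worker threads is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Order in which a worker scans the callback lists (taken from the text).
CATEGORIES = ("timer", "subscriber", "server", "client")


@dataclass
class Callback:
    ident: int                      # unique identifier
    message: Optional[object]       # received message (None for timer callbacks)
    # Hypothetical stand-in for the ResourceMask: slot index -> bitstream name.
    bitstreams: Dict[int, str] = field(default_factory=dict)
    ready: bool = False


class HardwareWorker:
    """Simulated worker for one reconfigurable slot (assumed structure)."""

    def __init__(self, slot, callback_lists, reconfigure, run):
        self.slot = slot
        self.callback_lists = callback_lists
        self.reconfigure = reconfigure  # hypothetical: load bitstream into slot
        self.run = run                  # hypothetical: start callback, block until done
        self.loaded = None              # bitstream currently held by the slot

    def get_callback(self):
        # Scan timers, subscribers, servers, clients for a ready callback
        # whose mask has an entry for this slot.
        for category in CATEGORIES:
            for cb in self.callback_lists[category]:
                if cb.ready and self.slot in cb.bitstreams:
                    return cb
        return None

    def step(self):
        """One loop iteration; returns the executed callback id or None."""
        cb = self.get_callback()
        if cb is None:
            return None
        cb.ready = False
        bitstream = cb.bitstreams[self.slot]
        if self.loaded != bitstream:    # reconfigure only if not yet loaded
            self.reconfigure(self.slot, bitstream)
            self.loaded = bitstream
        self.run(self.slot, cb)
        return cb.ident


# Usage example with recording stubs instead of real hardware access.
log = []
lists = {category: [] for category in CATEGORIES}
lists["subscriber"].append(Callback(1, "img", {0: "sobel.bit"}, ready=True))
lists["timer"].append(Callback(2, None, {0: "hash.bit"}, ready=True))
lists["server"].append(Callback(3, "req", {1: "sort.bit"}, ready=True))  # slot 1 only

worker = HardwareWorker(
    0,
    lists,
    reconfigure=lambda slot, bit: log.append(("load", slot, bit)),
    run=lambda slot, cb: log.append(("run", slot, cb.ident)),
)
order = [worker.step(), worker.step(), worker.step()]

# The subscriber callback becomes ready again; its bitstream is still loaded.
lists["subscriber"][0].ready = True
again = worker.step()
```

In this sketch the timer callback is served before the subscriber callback, the callback without an entry for slot 0 is skipped, and the second run of the subscriber callback does not trigger another reconfiguration because its bitstream is still resident in the slot.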
[...] client on the desktop PC just receives the messages and reports the times between consecutive messages.

Figure 5(a) sketches an all-software mapping, where a multi-threaded standard ROS 2 executor with two software worker threads on the FPGA dispatches the callbacks from the server nodes to two processor cores. Figure 5(b) displays the setup with the ReconROS executor and four additional hardware worker threads that dispatch callbacks to four reconfigurable slots. We have evaluated two mappings under the ReconROS executor. In the mixed software/hardware mapping, the four callbacks with the highest speedups according to Section V-A, i.e., Number sorting, Inverse kinematics, Sobel filter, and MNIST classifier, are executed in hardware and the Hash calculation callback is executed in software. The all-hardware mapping finally runs all callbacks in hardware.

Figure 6 analyzes the resulting roundtrip times for the three mappings. The figure plots for each of the five ROS 2 nodes and the three mappings the relative frequency over the roundtrip time. The dashed lines denote the averages. Going from the all-software over the software/hardware to the all-hardware mapping, the speedups based on the averaged roundtrip times are 6.21 and 6.29 for Inverse kinematics, 0.97 and 1.00 for the MNIST classifier, 1.03 and 1.15 for the Sobel filter, 18.18 and 20.97 for Number sorting, and 1.00 for the Hash calculation.

Overall, we make the following observations:

• The speedups for the individual ROS 2 nodes within the overall application follow the trends for the callbacks measured in isolation, shown in Table I, although generally lower due to the communication between desktop PC and FPGA board, the ROS 2 communication layers, and the executors. For Number sorting, Inverse kinematics, and to some extent the Sobel filter, distinct speedups are realized.
• The hash calculation is triggered with a 250 ms period. The distribution of roundtrip times shows entries with less and more than 250 ms since the ROS 2 client on the desktop PC measures times between arriving messages from this callback. Some messages are delayed, which then likely reduces the time for the next message.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have introduced the ReconROS executor that enables event-based programming for hardware-accelerated ROS 2 applications. In contrast to related work, the ReconROS executor leverages partial reconfiguration for loading hardware-mapped callbacks on demand to predefined reconfigurable slots on the FPGA’s logic resource. Additionally, the ReconROS executor extends the ROS 2 standard executor and allows it to schedule or dispatch callbacks to software and hardware.

In future work, we plan to expand the mapping and scheduling strategies of the ReconROS executor to optimize the hardware/software mapping and the resource management. In particular, callbacks that can run in both software and hardware will allow for taking runtime mapping decisions. With respect to resource management, techniques that minimize the unused resources within loaded slots or preloading of hardware callbacks are worth investigating.

REFERENCES

[1] M. Qasaimeh, K. Denolf, J. Lo, K. Vissers, J. Zambreno, and P. H. Jones, “Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels,” in Proc. 2019 IEEE International Conference on Embedded Software and Systems (ICESS), 2019, pp. 1–8.
[2] C. Brugger, L. Dal’Aqua, J. A. Varela, C. D. Schryver, M. Sadri, N. Wehn, M. Klein, and M. Siegrist, “A quantitative cross-architecture study of morphological image processing on CPUs, GPUs, and FPGAs,” in Proc. 2015 IEEE Symposium on Computer Applications Industrial Electronics (ISCAIE), 2015, pp. 201–206.
[3] O. Ulusel, C. Picardo, C. B. Harris, S. Reda, and R. I. Bahar, “Hardware acceleration of feature detection and description algorithms on low-power embedded platforms,” in Proc. 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1–9.
[4] S. I. Venieris and C.-S. Bouganis, “fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 2, pp. 326–342, 2019.
[5] C. Lienen, M. Platzner, and B. Rinner, “Reconros: Flexible hardware acceleration for ros2 applications,” in 2020 International Conference on Field-Programmable Technology (ICFPT), 2020, pp. 268–276.
[6] C. Lienen and M. Platzner, “Design of distributed reconfigurable robotics systems with reconros,” ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 3, dec 2022.
[7] L. Joseph, Mastering ROS for Robotics Programming. Packt Publishing, 2015.
[8] J. Schlesselman, G. Pardo-Castellote, and B. Farabaugh, “Omg data-distribution service (dds): architectural update,” in IEEE MILCOM 2004. Military Communications Conference, 2004., vol. 2, 2004, pp. 961–967.
[9] D. Casini, T. Blaß, I. Lütkebohle, and B. B. Brandenburg, “Response-time analysis of ros 2 processing chains under reservation-based scheduling,” in 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[10] Y. Tang, Z. Feng, N. Guan, X. Jiang, M. Lv, Q. Deng, and W. Yi, “Response time analysis and priority assignment of processing chains on ros2 executors,” in 2020 IEEE Real-Time Systems Symposium (RTSS), 2020, pp. 231–243.
[11] J. Staschulat, I. Lütkebohle, and R. Lange, “The rclc executor: Domain-specific deterministic scheduling mechanisms for ros applications on microcontrollers: work-in-progress,” in 2020 International Conference on Embedded Software (EMSOFT), 2020, pp. 18–19.
[12] K. Yamashina, T. Ohkawa, K. Ootsu, and T. Yokota, “Proposal of ROS-compliant FPGA component for low-power robotic systems (retraction notice),” in Proc. International Conference on Intelligent Earth Observing and Applications 2015, vol. 9808, 2015, p. 98082N.
[13] K. Yamashina, H. Kimura, T. Ohkawa, K. Ootsu, and T. Yokota, “CReComp: Automated Design Tool for ROS-Compliant FPGA Component,” in Proc. IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016. IEEE, 2016, pp. 138–145.
[14] Y. Sugata, T. Ohkawa, K. Ootsu, and T. Yokota, “Acceleration of Publish/Subscribe Messaging in ROS-Compliant FPGA Component,” in Proc. of the 8th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2017). ACM, 2017.
[15] T. Ohkawa, Y. Sugata, H. Watanabe, N. Ogura, K. Ootsu, and T. Yokota, “High level synthesis of ROS protocol interpretation and communication circuit for FPGA,” in Proc. 2019 IEEE/ACM 2nd International Workshop on Robotics Software Engineering (RoSE), 2019, pp. 33–36.
[16] M. Eisoldt, S. Hinderink, M. Tassemeier, M. Flottmann, J. Vana, T. Wiemann, J. Gaal, M. Rothmann, and M. Porrmann, “Reconfros: Running ros on reconfigurable socs,” in Proc. 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools Proceedings, ser. DroneSE and RAPIDO ’21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 16–21.
[17] A. Podlubne and D. Göhringer, “FPGA-ROS: Methodology to Augment the Robot Operating System with FPGA Designs,” in Proc. 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2019.
[18] A. Podlubne, J. Mey, R. Schöne, U. Aßmann, and D. Göhringer, “Model-based approach for automatic generation of hardware architectures for robotics,” IEEE Access, vol. 9, pp. 140 921–140 937, 2021.
[19] E. Lübbers and M. Platzner, “ReconOS: Multithreaded Programming for Reconfigurable Computers,” ACM Transactions on Embedded Computing Systems, vol. 9, no. 1, pp. 8:1–8:33, 2009.
[20] A. Agne, M. Happe, A. Keller, E. Lübbers, B. Plattner, M. Platzner, and C. Plessl, “ReconOS: An Operating System Approach for Reconfigurable Computing,” IEEE Micro, vol. 34, no. 1, pp. 60–71, 2014.
[21] V. Mayoral-Vilches and G. Corradi, “Adaptive computing in robotics, leveraging ros 2 to enable software-defined hardware for fpgas,” arXiv preprint arXiv:2109.03276, 2021.
[22] V. Mayoral-Vilches, “Kria Robotics Stack,” https://www.xilinx.com/applications/industrial/robotics/wp540-kria-robotics-stack.html, 2021, accessed: 2022-01-13.
[23] K. Vipin and S. A. Fahmy, “Zycap: Efficient partial reconfiguration management on the xilinx zynq,” IEEE Embedded Systems Letters, vol. 6, no. 3, pp. 41–44, 2014.
[24] R. Gonzalez and R. Woods, Digital Image Processing. Pearson, 2018.
[25] D. Knuth, The Art of Computer Programming: Volume 3: Sorting and Searching. Pearson Education, 1998.
[26] D. Stewart, “A Platform with Six Degrees of Freedom,” in Proc. of the Institution of Mechanical Engineers, vol. 180, no. 1, 1965, pp. 371–386.