ReconROS Executor: Event-Driven Programming of FPGA-accelerated ROS 2 Applications

Christian Lienen, Paderborn University, Germany, [email protected]
Marco Platzner, Paderborn University, Germany, [email protected]

arXiv:2201.07454v1 [cs.RO] 19 Jan 2022

Abstract—Many applications from the robotics domain can benefit from FPGA acceleration. A corresponding key question is how to integrate hardware accelerators into software-centric robotics programming environments. Recently, several approaches have demonstrated hardware acceleration for the robot operating system (ROS), the dominant programming environment in robotics. ROS is a middleware layer that features the composition of complex robotics applications as a set of nodes that communicate via mechanisms such as publish/subscribe, and distributes them over several compute platforms. In this paper, we present a novel approach for event-based programming of robotics applications that leverages ReconROS, a framework for flexibly mapping ROS 2 nodes to either software or reconfigurable hardware. The ReconROS executor schedules callbacks of ROS 2 nodes and utilizes a reconfigurable slot model and partial runtime reconfiguration to load hardware-based callbacks on demand. We describe the ReconROS executor approach, give design examples, and experimentally evaluate its functionality with examples.

I. INTRODUCTION

Many robotics applications are computationally very demanding, in particular when they process large amounts of data sensed from their environment and run involved algorithms to compute state information and next actions to take. In recent years, efficient implementations of robotics applications on high-performance embedded platforms comprising multi-core CPUs, general-purpose GPUs, or FPGAs have been studied and several works have shown the potential advantage of FPGAs over CPUs and GPUs with respect to performance, energy consumption, and latency, e.g., [1]–[4].

A key question for FPGA acceleration of robotics applications is how to integrate hardware accelerators into software-centric robotics programming environments. Recently, several approaches targeted the robot operating system (ROS), which is the dominant programming environment in robotics. ROS is essentially a middleware layer that allows for the decomposition of complex robotics applications into a set of nodes that communicate via mechanisms such as publish/subscribe. The nodes can then be distributed over several compute platforms.

Most approaches partition a ROS node and map the computation-intensive kernels to reconfigurable hardware. A few approaches go further and support the mapping of complete ROS nodes to hardware, which greatly increases flexibility and facilitates design space exploration. However, the mapping of ROS nodes to either software or hardware is static, and event-driven programming, which is central to ROS 2, the latest version of ROS, is not supported.

In this paper, we present the ReconROS executor for event-driven programming of ROS 2 applications with flexible hardware acceleration. The ReconROS executor registers ROS 2 node functions as callbacks and dispatches them to run in software on one of several processor cores or in hardware in the reconfigurable fabric. While the executor builds on the architecture and functionality of ReconROS [5], [6], a previously presented open source framework for hardware acceleration in robotics, the main novelty is that the ReconROS executor employs partial hardware reconfiguration with a reconfigurable slot model to load and execute hardware-mapped ROS 2 callbacks on demand, and that it can dispatch callbacks for either software or hardware execution. As a result, ROS 2 developers can exploit the benefits of hardware acceleration from their standard programming environment.

The remainder of the paper is organized as follows: Section II provides background, in particular an overview of ROS and the ROS executor for event-driven programming, and a discussion of related work that aims at integrating FPGA hardware acceleration with ROS. Section III explains the hardware architecture and design concept for the novel ReconROS executor, before Section IV presents an example for the configuration and coding steps when developing a ReconROS application. Section V reports on experiments to demonstrate the functionality and advantages of our approach. Finally, Section VI concludes the paper and gives an outlook to future work.

II. BACKGROUND AND RELATED WORK

In this section, we first introduce the robot operating system ROS, then focus on the functionality of the ROS executor and, finally, review related approaches for making hardware acceleration available to ROS-based robotics applications.

A. The Robot Operating System

The Robot Operating System (ROS) is the dominant framework for robotics programming. ROS comprises a middleware layer for data communication within a computer network and tools and libraries for rapid and modular development of large and complex applications. ROS decomposes an application
into a set of ROS nodes that can communicate by exchanging messages using three possible mechanisms: an m:n publish-subscribe mechanism and two 1:1 mechanisms, denoted as services and actions, that follow a client-server model. Services allow for nodes to access functionalities of other nodes in a manner similar to remote procedure calls, and actions are more elaborate and combine two service requests with a publish-subscribe communication for regular feedback information.

ROS applications are often represented as computation graphs, in which the nodes represent ROS nodes and the edges represent a ROS-supported form of communication [7]. The nodes of a computation graph are then mapped to compute platforms in a distributed system. Figure 1 outlines an example for a computation graph using publish-subscribe communication. The ROS node /camera captures images from a camera and publishes them to the topic /image_raw. The Sobel filter node /filter subscribes to this topic and publishes the filtered images to the topic /image_filtered. The /viewer node displays the filtered images. The second part of the application implements a control loop for servo control, where the node /pid_ctrl runs a PID control algorithm on position sensor input, the node /inv_kinematics determines the required new position and, finally, the node /actor_driver sets the motor signals accordingly.

[Figure: nodes /camera, /filter, and /viewer connected via the topics /image_raw and /image_filtered; nodes /pid_ctrl, /inv_kinematics, and /actor_driver connected via the topics /ctrl_out and /tf_out]
Fig. 1: ROS computational graph example

The most recent version, ROS 2, builds on several layers on top of the operating system Linux. As communication layer, ROS 2 uses the data distribution service (DDS), a common standard for publish-subscribe communication specified by the Object Management Group (OMG) [8]. The ROS middleware layer (rmw) in combination with a DDS implementation-specific adaptor in the ROS 2 stack allows for interchangeability of different DDS implementations. There are several different DDS implementations available, e.g., the eProsima FastRTPS or RTI Connext DDS. Additionally, shared-memory based communication between nodes is enabled by, for example, iceoryx. On a higher level, the rcl library comprises the standard framework for providing ROS concepts. High-level libraries (rclcpp for C++ or rclpy for Python) wrap rcl and provide more advanced functionalities, e.g., execution management.

Another key part of the ROS framework is the message infrastructure. Messages can be hierarchically composed out of basic built-in elements such as integers or floats using a common description language. Further, ROS includes standard message packages, e.g., for images or positional data. The ROS tool flow comprises tools for translating such message definitions into data representations suitable for inclusion in high-level language programs in, for example, Python and C++.

B. The ROS 2 Executor

The ROS nodes mapped to one computation platform can execute as Linux processes or threads using the underlying Linux scheduler. In such a case the nodes need to regularly poll the communication layer for available messages. However, the more common model under ROS 2 is the event-driven model, where nodes register callbacks that are executed when specific events occur. There are four categories of callbacks: callbacks executed by any node when a (periodic) timer event occurs, callbacks executed by a subscriber on a received message, callbacks executed by a ROS 2 service server on a received service request, and callbacks executed by a ROS 2 service client on a received service response.

ROS 2 provides a so-called executor function that interacts with the underlying communication layer and timer infrastructure to catch events and execute callbacks in a run-to-completion mode utilizing one or more worker threads. That is, callbacks are not preempted.

By default, ROS 2 offers standard single-threaded and multi-threaded executors for C++ and Python applications that implement the scheduling algorithm sketched in Figure 2. The algorithm comprises two nested loops. In the outer loop, the executor interacts with the DDS layer to collect all ready subscriber, server, and client callbacks into a readySet. In the inner loop, the executor checks for timer-triggered callbacks and, if such are available, executes them. Then, subscriber, server, and client callbacks are considered in that order and if such a callback is ready, it is executed and removed from the readySet. If there are no more callbacks ready, the next iteration of the outer loop is started after a configurable waiting time.

The ROS 2 executor implicitly implements priorities in the sense that timer-triggered callbacks get high priority, and the other callbacks lower priorities, since they are first collected in the outer loop and then executed in the inner loop in the order shown in Figure 2. Within one callback category, requests are ordered by the sequence of their initial registration at the executor.

The real-time behavior of the ROS 2 executor was studied in [9]. The authors analyzed the response time of ROS 2 applications and provided a scheduling model, a worst-case response time analysis, and general insights into the real-time behavior of ROS 2. In follow-up work [10] the precision of the response time analysis was improved. There are also alternatives to the standard ROS 2 executor. For example, in [11] an executor for micro ROS platforms [11] equipped with embedded micro controllers was presented. This executor is fully coded in C, supports domain-specific requirements, and improves on the analysis of real-time aspects.

C. Related Approaches for ROS-FPGA Integration

In recent years, several approaches have been presented for integrating reconfigurable hardware accelerators into the ROS
architecture. Most of these approaches partition a ROS software node and map the runtime-intensive parts as acceleration kernels to reconfigurable hardware, while the remaining parts stay on the CPU. The works [12]–[15] follow this model and focus on the automated generation of the interfaces between software and acceleration kernels, the minimization of communication time between them, and the use of high-level synthesis to increase productivity. The ReconfROS framework [16] implements shared-memory communication between the software and the acceleration kernels to further reduce communication effort.

[Figure: flowchart with the steps Start, Collect ready non-timer callbacks, Ready timer callbacks?, Ready subscriber callbacks?, Ready server callbacks?, Ready client callbacks?, Execute callback, and Wait]
Fig. 2: ROS 2 standard executor scheduling algorithm [9]

A different approach is followed in [17], [18], where one or more complete ROS nodes can be mapped to hardware. All these hardware-mapped ROS nodes connect to an AXI-based gateway followed by a protocol generator and a TCP/IP interface to allow for communication with other ROS nodes of the application. A central manager coordinates the communication between the hardware-mapped ROS nodes and the gateway.

ReconROS [5], [6] also allows for mapping complete ROS nodes to hardware and combines the reconfigurable hardware operating system ReconOS [19], [20] with ROS 2. ReconOS features multithreaded programming with hardware and software threads, i.e., both hardware and software threads use operating system services, such as semaphores and mutexes, in exactly the same way. Additionally, all threads can access the common virtual shared memory address space. ReconROS extends the functionality of ReconOS by adding ROS 2 primitives, turning complete ROS 2 nodes into hardware threads, and allowing them to use the ROS communication mechanisms publish-subscribe, services, and actions. Since the ReconROS architecture and build system are available in open source, we have been using it as the starting point for the work presented in this paper.

Industry has also taken up the ROS-FPGA integration. For example, Xilinx develops the KRIA robotics stack [21], [22], which merges ROS build tool flows with the Vitis software platform. There, compute-intensive calculations are outsourced to reconfigurable logic as so-called acceleration kernels.

III. THE RECONROS EXECUTOR

In the existing ReconROS framework, ROS 2 hardware nodes have to be statically placed in reconfigurable logic where they remain until the application terminates. The hardware nodes run in while(1)-loops that start with blocking reads for new input data, process the data, and write the output. As main novelty we introduce partial hardware reconfiguration with a reconfigurable slot model to be able to load and execute hardware-mapped ROS 2 nodes on demand, and we devise a ROS 2 executor that can dispatch callbacks for either software or hardware execution, if hardware versions of the callbacks are available.

As a result, (i) robotics application developers can exploit hardware acceleration from their known programming environment and event-driven programming model, and (ii) the limited hardware resources are operated in an efficient manner. In the following, we discuss the hardware architecture and the design concept of the ReconROS executor.

A. Hardware Architecture

Figure 3 highlights the architecture for ReconROS using our novel ROS 2 executor. According to the underlying ReconOS architecture, the programmable logic part of a platform FPGA contains a set of n reconfigurable slots (RS) that can accommodate hardware threads, which implement the hardware-mapped ROS 2 callbacks. The number and sizes of these reconfigurable slots are application-specific and configured during the design process. Each such reconfigurable slot is connected to an operating system interface (OSIF) for communication with the host operating system Linux running on the processor cores, and to a memory interface (MEMIF) for accessing shared external memory. Another component, omitted from Figure 3 for simplicity, is the memory subsystem that provides arbitration between the MEMIFs and includes a memory management unit to allow the hardware threads to work with virtual addresses.

ROS 2 hardware-mapped callbacks are loaded into reconfigurable slots on demand during runtime with a partial reconfiguration process utilizing the ZyCAP implementation [23]. ZyCAP comprises an ICAP (internal configuration access port) interface and a DMA block on the hardware side, and a Linux kernel driver and user libraries on the software side. Library functions are available to load reconfigurable slots by setting up DMA transfers from external memory to the ICAP interface. We have chosen ZyCAP over using the processor configuration access port (PCAP) or the ICAP directly for two reasons: First, ZyCAP nominally features 3× higher performance for writing bitstreams, i.e., 382 MByte/s for the ZyCAP compared to 128 MByte/s for PCAP [23]. Second, ZyCAP includes a DMA controller that lowers CPU load for partial reconfiguration and, in turn, frees the CPU for executing ROS 2 callbacks. In the hardware architecture, the ZyCAP block is connected to the processing system (PS) via a high
performance port (HPx) to transfer the bitstream and to an AXI-Lite interface for the configuration of DMA transactions. From the existing ZyCAP project, we have simplified the user library functions to the three basic functions ZyCAP_Init() for the initialization, ZyCAP_Write_Bitstream() for blocking write of the bitstream into the ICAP, and ZyCAP_DeInit() for de-initialization, and integrated them into the ReconROS library. Additionally, we have adapted the ZyCAP Linux kernel driver to more recent Linux kernels.

[Figure: block diagram of the platform FPGA. Processing system: processor with cores #0 to #m-1, global timer, memory controller. Programmable logic: reconfigurable slots RS #0 to RS #n-1, OSIF, MEMIF, ICAP, DMA. External RAM (DDR3 SDRAM).]
Fig. 3: Hardware architecture for the ReconROS executor

B. Executor Design

The main steps in designing the ReconROS executor are providing timers and coming up with a scheduling or dispatching algorithm, respectively, that utilizes all available processor cores and reconfigurable slots for callbacks.

In the ROS 2 stack, timers are part of the high-level libraries rclcpp (C++) or rclpy (Python) and use the operating system to measure wall clock time. Since ReconROS builds on rcl, the underlying standard framework for ROS primitives, we have added a corresponding timer primitive to that. Our implementation uses the ARM Cortex-A9 global timer (see Figure 3) as its main time reference, and a low-overhead function ros_timer_is_ready() to check whether a time interval has expired.

The development of an executor algorithm is more challenging. In contrast to the standard ROS 2 executor (see Section II-B) that dispatches ready-to-execute callbacks to a number of identical software worker threads, typically one per available processor core, the ReconROS executor can either execute callbacks in software or hardware and, if executed in hardware, in specified reconfigurable slots. Therefore, our executor implementation is structured into an executor main thread, one software worker thread per processor core, and one hardware worker thread per reconfigurable slot. The main thread maintains four callback lists that include all callbacks registered at the executor, i.e., one for timers, one for subscribers, one for service servers, and the last one for service clients. Each entry in such a callback list comprises a unique identifier, a pointer to the received message in case of non-timer callbacks, and a ResourceMask that contains a field for software and each reconfigurable slot. If the execution mode is software, the corresponding field includes a function pointer to the callback code. If the execution mode is hardware, the corresponding fields contain pointers to callback bitstreams for the reconfigurable slots.

The overall m software and n hardware worker threads are started during the initialization of the executor. Each of these threads implements the inner loop of Figure 2. Figure 4 displays the functionality of the hardware worker thread for reconfigurable slot x. The thread accesses the callback lists in the order of timers, subscribers, service servers, and service clients and searches for ready callbacks (CB) with a matching entry x in the ResourceMask. If such an entry is found (CB not zero), the thread checks whether the corresponding bitstream is already loaded in the reconfigurable slot x. If so, the callback is simply started; otherwise partial reconfiguration is performed to load the callback bitstream. The worker thread then waits until the callback is finished and runs into the next loop iteration.

[Figure: sequence diagram between the hardware worker and the callback lists. GetCallback(ResourceId=x, Position) tries CB=GetTimer(ResourceId), CB=GetSubscriber(ResourceId, Position), CB=GetServiceSrv(ResourceId, Position), and CB=GetServiceClt(ResourceId, Position) in turn while CB = 0. Once CB != 0: ZyCAP_Write_Bitstream(), ReconROS Start & Join HW thread, ReleaseCallback().]
Fig. 4: Sequence diagram for a hardware worker thread

The standard ROS 2 executor shown in Figure 2 collects ready non-timer callbacks before entering the inner loop. This ensures that all callbacks collected up to a certain time will actually be executed before new non-timer callbacks are considered. Very frequently appearing subscriber callbacks, for example, can thus not lead to starving callbacks for service servers and service clients. Since our ReconROS executor uses more independent worker threads, we resort to a different
mechanism to avoid starvation. Each worker thread maintains an OffsetVector that holds for each non-timer callback list an entry Position that identifies the callback last served. Whenever the worker thread checks the list for the next ready callback, it starts the search from Position+1. After serving the callback, Position is incremented. Position is initialized with the length of the list and wrapped around to zero when the end of the callback list is reached. Software worker threads are identical, except that they start callback functions in software.

Our ReconROS executor tries to mimic the behavior of the ROS 2 standard executor and requires the designer to specify the size of the reconfigurable slots, and for each hardware callback the possible reconfigurable slots to which the callback can be mapped. Obviously, different and improved ReconROS executor designs are conceivable. For example, for callbacks that can be run in software and hardware, the executor could decide at runtime which mode to choose. Moreover, involved resource management problems arise if the reconfigurable slots are of different size and hardware callbacks are available for different reconfigurable slots. Such improved executor scenarios are left to future work.

IV. DESIGN EXAMPLE

As an example we elaborate on the ROS 2 application from Figure 1 comprising six nodes. The ReconROS application from this design example handles the nodes /filter and /inv_kinematics. The remaining nodes are assumed to be developed and compiled with ROS 2 design flows and mapped to other compute platforms.

Both considered ROS 2 nodes comprise subscriber functionality for getting input data. According to the event-based programming approach, callbacks are invoked after the arrival of messages. In this example, the two callbacks are implemented in hardware and designed for execution in a reconfigurable slot.

Listing 1 shows the ROS 2 related part of the ReconROS configuration file for the overall ReconROS design project. The file starts with a block for the specification of the reconfigurable slot, which will accommodate the hardware callbacks. The specification is done by lists of contiguous resources for each type (LUT Slices, DSP, BRAM18, BRAM36). These lists can be derived by drawing pblocks using Xilinx Vivado and reading the resulting constraints. The information for the ROS 2 nodes is organized into so-called resource groups. Lines 6–10 specify the /filter node, beginning with the definition of a rosnode object named "/filter" in line 7. In line 8, the message object of type ROS 2 Image message is defined with further references to a ROS 2 message package (sensor_msgs) and the communication (msg) as well as message types (Image). Lines 9 and 10 declare primitives for the subscription of input data from topic /image_raw and the publication of filtered data to topic /image_filtered.

Lines 12–16 specify the /inv_kinematics node, including the rosnode object named "/inv_kinematics". Similar to the /filter node, it comprises a message type definition (here inverse_msg) and a subscriber and a publisher for the topics /ctrl_out and /tf_out, respectively.

Listing 1: Configuration file (Partial reconfiguration / ROS 2 related part) for the ReconROS application shown in Figure 1

 1 [HwSlot@ReconfSlot(0:0)]
 2 Id = 0
 3 Reconfigurable = true
 4 Region0 = SLICE_X0Y150SLICE_X103Y199,
       DSP48_X0Y60DSP48_X7Y79,
       RAMB18_X0Y60RAMB18_X5Y79,
       RAMB36_X0Y30RAMB36_X5Y39
 5
 6 [ResourceGroup@ResourceGroupSobel]
 7 node_2 = rosnode, "/filter"
 8 img_msg = rosmsg, sensor_msgs, msg, Image
 9 sub = rossub, node_2, img_msg, "/image_raw"
10 pub = rospub, node_2, img_msg, "/image_filtered"
11
12 [ResourceGroup@ResourceGroupInverse]
13 node_5 = rosnode, "/inv_kinematics"
14 inverse_msg = rosmsg, std_msgs, msg, Uint32
15 sub = rossub, node_2, inverse_msg, "/ctrl_out"
16 pub = rospub, node_2, inverse_msg, "/tf_out"

Listing 2: C/C++ code (partial) for the HLS implementation of the subscriber callback for the /filter ROS 2 node

 1 //Initdata contains pointer to message
 2 pMsg = THREAD_GETINITDATA();
 3 pMsg += OFFSETOF(sensor_msgs__msg__Image,
 4     data.data);
 5
 6 // Get pointer to image in memory and
 7 // copy it to FPGA-internal BRAM
 8 MEM_READ(pMsg, pPayloadImage, 4);
 9 MEM_READ(pPayloadImage[0], ram,
10     IMAGE_SIZE * 4);
11
12 SobelFilter(ram);
13
14 // Write filtered image back to memory and
15 // publish filtered image
16 MEM_WRITE(ram, pPayloadImage[0],
17     IMAGE_SIZE * 4);
18 ROS_PUBLISHER_PUBLISH(resourcesobel_pub,
19     resourcesobel_img_msg);
20
21 THREAD_EXIT();

In Listing 2, C/C++ code for the HLS implementation of the subscriber callback of the /filter node is presented. The presented code is not performance optimized. The callback starts with accessing the initial data of the callback, which provides a pointer to the message object in main memory. At this point, the message is already in the memory and ready for access. The position of the image data is calculated using the OFFSETOF macro in line 3. Using the resulting pointer, the first use of the MEM_READ macro reads the address of the image and then, with the second use of the macro, the callback reads the image into the ram memory within the FPGA. After the execution of the Sobel filter function, the callback writes the filtered image back to main memory via the MEM_WRITE macro. After that, the callback publishes the filtered image by using the node-related publisher.
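The data path that Listing 2 walks through can be mimicked on an ordinary host in plain C. The following is a minimal sketch under stated assumptions, not the ReconROS implementation: image_msg_t and pixel_seq_t are hypothetical, heavily reduced stand-ins for the generated sensor_msgs__msg__Image type, mem_read and mem_write replace the MEM_READ and MEM_WRITE macros with plain memcpy calls, process is a placeholder for SobelFilter, IMAGE_WORDS is a tiny stand-in for IMAGE_SIZE, and publishing is omitted. The payload address is read with sizeof payload bytes here, whereas the listing reads 4 bytes, presumably because addresses are 32 bits wide on the target platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_WORDS 16u /* assumption: tiny stand-in for IMAGE_SIZE */

/* Hypothetical, reduced stand-ins for the generated ROS 2 message type. */
typedef struct {
    uint32_t *data; /* address of the pixel payload in "main memory" */
    size_t    size; /* number of payload words */
} pixel_seq_t;

typedef struct {
    uint32_t    height;
    uint32_t    width;
    pixel_seq_t data;
} image_msg_t;

/* Stand-ins for MEM_READ / MEM_WRITE: plain copies between memories. */
static void mem_read(const void *main_mem, void *local, size_t bytes)
{
    memcpy(local, main_mem, bytes);
}

static void mem_write(const void *local, void *main_mem, size_t bytes)
{
    memcpy(main_mem, local, bytes);
}

/* Placeholder for SobelFilter(): inverts each word so the effect is visible. */
static void process(uint32_t *ram, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        ram[i] = ~ram[i];
    }
}

/* Mirrors the steps of the subscriber callback described for Listing 2. */
void subscriber_callback(const void *init_data)
{
    uint32_t  ram[IMAGE_WORDS]; /* plays the role of FPGA-internal BRAM */
    uint32_t *payload = NULL;

    assert(init_data != NULL);

    /* Step 1: locate the message field that stores the payload address. */
    const unsigned char *field = (const unsigned char *)init_data
        + offsetof(image_msg_t, data) + offsetof(pixel_seq_t, data);

    /* Step 2: the first read fetches the payload address itself. */
    mem_read(field, &payload, sizeof payload);

    /* Step 3: the second read copies the payload into the local buffer. */
    mem_read(payload, ram, sizeof ram);

    process(ram, IMAGE_WORDS);

    /* Step 4: write the result back to the same place in main memory. */
    mem_write(ram, payload, sizeof ram);
}
```

Calling subscriber_callback(&msg) on a message whose data.data field points to a buffer of IMAGE_WORDS words transforms that buffer in place. The two-step read is the point of the sketch: the callback receives only the address of the message object, so it first has to fetch the payload pointer stored inside the message before it can copy the payload itself.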
Listing 3 displays the callback for the /inv_kinematics node, which follows the same procedure as described for Listing 2. Again, the received data from the subscriber is loaded into the FPGA-internal memory, processed, and then written back to main memory before publishing to the output topic.

Listing 3: C/C++ code (partial) for the HLS implementation of the subscriber callback for the /inv_kinematics ROS 2 node

//Initdata contains pointer to message
pMsg = THREAD_GETINITDATA();

// Get pointer to ctrl out in memory and
// copy it to FPGA-internal memory
MEM_READ(pMsg, pPayloadInverse, 4);
MEM_READ(pPayloadInverse[0], ram, 4);

Inverse_Kinematics(ram);

// Write outputdata to memory and publish result
MEM_WRITE(ram, pPayloadInverse[0], 4);
ROS_PUBLISHER_PUBLISH(resourcesinverse_pubdata,
    resourcesinverse_inverse_msg);

THREAD_EXIT();

The last user-created file needed for this example application is the main file, in which the ReconROS executor is instantiated and configured. Listing 4 shows the needed steps. In line 2, the ReconROS executor is initialized for execution without software workers but with one hardware worker using the ReconROS_Executor_Init function. The fourth argument for calling that function is the path to the partial bitstreams in the local filesystem. In lines 4 and 5, the hardware callbacks are registered at the executor. The list of arguments comprises the executor instance, the ROS 2 node name, the ResourceMask, the ReconROS primitive type, the callback-creating ReconROS primitive instance and the ReconROS target message primitive. The last line of the code spins the executor and blocks until the application is terminated.

Listing 4: C/C++ code (partial) main thread of the ReconROS application

1 // Init the ReconROS executor without sw workers
      and one hw worker
2 ReconROS_Executor_Init(&reconros_executor, 0, 1,
      "/mnt/bitstreams/");
3
4 ReconROS_Executor_Add_HW_Callback(&
      reconros_executor, "/inv_kinematics", 1,
      ReconROS_SUB, resourceinverse_subdata,
      resourceinverse_inverse_msg);
5 ReconROS_Executor_Add_HW_Callback(&
      reconros_executor, "/filter", 1,
      ReconROS_SUB, resourcesobel_subdata,
      resourcesobel_image_msg);
6
7 ReconROS_Executor_Spin(&reconros_executor);

V. EXPERIMENTS

This section reports on experiments to show the functionality of our ReconROS executor. First, we measure the execution times for a set of ROS 2 callbacks. Then, we determine the reconfiguration times for differently sized reconfigurable slots on our platform FPGA. Finally, we present a ROS 2 application comprising a desktop PC and an FPGA board and experiment with three different hardware/software mappings.

A. Callback Execution Times

As a first part of the evaluation of the ReconROS executor, we have measured the runtimes for five ROS 2 nodes, more precisely their callbacks:

Sobel filter: This callback implements a Sobel image filter [24] operating on three channels (RGB) of dimension 640 × 480. The filter applies two filter kernels on each channel of the image and calculates the absolute value of the dot product as an approximation for the geometric mean. The ROS 2 input and output messages are of the type Image from the ROS 2 sensor message package.

Number sorting: This callback provides a ROS 2 service which sorts an array of 32 Bit unsigned integers based on the odd-even transposition sort algorithm [25]. The algorithm is based on a comparator network that employs n stages with n comparisons each to sort n numbers. The ROS 2 node on the PC generates random numbers and publishes messages comprising 2048 numbers as an array.

MNIST classifier: This callback classifies handwritten digits from the MNIST dataset by implementing a neural network. The classifier is implemented using ROS 2 publish / subscribe communication. It subscribes for input images of size 28 × 28 and publishes the estimated digit as unsigned integer. The classifier consists of three convolution layers, three pooling layers and two fully connected layers. The achieved accuracy is about 97%.

Inverse kinematics: This callback computes control signals for driving a servo motor that sets a joint angle based on a desired position and orientation of a robotic manipulation platform. The application is part of a larger mechatronic system for controlling the movements of a Stewart platform [26] with six degrees of freedom. The computation involves coordinate transformations and an iterative implementation of the arctan() function. The ROS 2 input message is an unsigned 32 Bit integer packed with two fixed-point numbers in Q8.6 format that represent the desired rotation angles of the platform around the x-axis and the y-axis. The ROS 2 output message is also a 32 Bit unsigned integer containing a 10 Bit unsigned integer which is the pulse width coded control signal for the motor.

Hash calculation: The hash calculation callback is implemented for the demonstration of a callback triggered by a periodic timer. At each run, the algorithm reads a 1920 × 1080 image with 24 bit color depth from main memory and calculates its SHA256 hash value. Afterwards, the hash value is published to a ROS 2 topic as an unsigned integer array with 8 elements.

All callback functions have been coded in C/C++ and
synthesized with Xilinx Vivado HLS to a Zynq Z7100 on a
MiniITX FPGA board. The hardware callbacks run at 120 MHz, the ReconROS infrastructure at 100 MHz and the ARM Cortex-A9 at 666 MHz. Table I shows the execution times for the callbacks, comprising the execution time for the callback function in software and the time between starting and completing the callback in hardware, respectively. The reconfiguration times are not included in this measurement. Speedups are achieved for all five callbacks, with the hash calculation resulting in the lowest and the number sorting resulting in the highest speedup.

ROS 2 callback        $t_{exec-HW}$ [ms]   $t_{exec-SW}$ [ms]   Speedup
Sobel filter                16.50                42.00             2.5
Number sorting               0.85                41.00            48.2
MNIST classifier            11.90                16.50             1.4
Inverse kinematics           0.35                 1.50             4.3
Hash calculation            81.00                94.00             1.2

TABLE I: Execution times for five ROS 2 callbacks in hardware ($t_{exec-HW}$) and software ($t_{exec-SW}$), and the resulting speedup

B. Reconfiguration Overheads

For quantifying the reconfiguration time, we have created a ReconROS setup with four reconfigurable slots, RS #0, ..., RS #3. Table II shows the number of available resources per reconfigurable slot, the resulting bitstream size $S$, and the measured reconfiguration times $t_{rc}$ for the four reconfigurable slots.

Reconfigurable slot   Slice LUTs   DSPs   Block RAMs (36 / 18)   $S$ [Byte]   $t_{rc}$ [ms]
RS #0                    20800      160        60 / 120            2838976        24.0
RS #1                    20800      160        60 / 120            2838976        24.0
RS #2                    41600      320       240 / 120            5285728        38.4
RS #3                    40800      280       200 / 100            4883328        36.9

TABLE II: Reconfiguration slots with resources (Z7100 slice LUTs, DSPs, and Block RAMs), bitstream size and reconfiguration time

Using linear regression on the measured reconfiguration times and a reconfiguration time model that includes a constant offset part $t_{offset}$ and a bitstream size dependent part $S/B$, where $B$ denotes the transfer bandwidth, i.e., $t_{rc} = S/B + t_{offset}$, our measurements result in $t_{offset} = 6.8$ ms and $B \approx 160$ MByte/s. The achieved bandwidth is much lower than the results reported in [23]. The authors of [23] apparently used a bare-metal implementation of the ZyCAP driver without an operating system. Our current implementation suffers from copying the bitstream between user and kernel space. An improved implementation of the Linux driver with a zero-copy approach, e.g., based on get_user_pages, would increase the performance.

The reconfiguration times reported in Table II depend directly on the size of the reconfigurable slot and the corresponding bitstream size, but they are largely independent of the hardware callback's functionality. The reconfiguration time adds to the execution time of a hardware callback only if the targeted reconfigurable slot is not yet configured with the required bitstream.

C. Example Application

In the last experiment we compare the performance of a ROS 2 application in software using the standard ROS 2 executor with two different hardware/software mappings using our ReconROS executor. The experimental setup comprises the Zynq Z7100 MiniITX FPGA board and a desktop PC with an Intel i7-8000 series CPU, connected via a Gigabit Ethernet connection. Both platforms run Ubuntu 18.04 LTS with ROS 2 Dashing. All software components are compiled with optimization level O3.

[Figure 5: block diagrams of the two setups. In both panels, five client nodes (Inverse, Sobel, MNIST, Sort, Hash) of the ROS 2 benchmark application on the desktop PC are connected via Gigabit Ethernet to the five corresponding server nodes on the Zynq-Z7100 MiniITX board. Panel (a): ROS 2 multi-threaded executor application with software workers on Core #0 and Core #1. Panel (b): ReconROS executor application with software workers on Core #0 and Core #1 and hardware workers on RS #0 to RS #3.]

Fig. 5: Experimental setup for a ROS 2 application with a standard ROS 2 executor (a), and our ReconROS executor (b)

The ROS 2 setup is illustrated in Figures 5(a) and (b). On the desktop PC, there are five ROS 2 client nodes programmed in C++ and compiled against the ROS 2 rclcpp library. Four of these client nodes, i.e., the Sort client, Inverse client, Sobel client, and MNIST client nodes, each comprise a publisher and a subscriber. Starting with an initial published message, the client's subscriber waits for a response from the corresponding server node on the FPGA. After receiving a new message for the topic, the clients immediately publish a new message for their server counterpart. The resulting roundtrip times are logged during the experiment. The Hash client node forms a special case. Since the hash server node only publishes messages, the
client on the desktop PC just receives the messages and reports the times between consecutive messages.

Figure 5(a) sketches an all-software mapping, where a multi-threaded standard ROS 2 executor with two software worker threads on the FPGA dispatches the callbacks from the server nodes to two processor cores. Figure 5(b) displays the setup with the ReconROS executor and four additional hardware worker threads that dispatch callbacks to four reconfigurable slots. We have evaluated two mappings under the ReconROS executor. In the mixed software/hardware mapping, the four callbacks with the highest speedups according to Section V-A, i.e., Number sorting, Inverse kinematics, Sobel filter, and MNIST classifier, are executed in hardware and the Hash calculation callback is executed in software. The all-hardware mapping runs all callbacks in hardware.

Figure 6 analyzes the resulting roundtrip times for the three mappings. For each of the five ROS 2 nodes and each of the three mappings, the figure plots the relative frequency over the roundtrip time. The dashed lines denote the averages. Going from the all-software over the software/hardware to the all-hardware mapping, the speedups based on the averaged roundtrip times are 6.21 and 6.29 for Inverse kinematics, 0.97 and 1.00 for the MNIST classifier, 1.03 and 1.15 for the Sobel filter, 18.18 and 20.97 for Number sorting, and 1.00 for the Hash calculation.

Fig. 6: Relative frequencies of the roundtrip times for the ROS 2 standard executor and two ReconROS executor configurations; the dashed lines show the average roundtrip time for the specific ROS 2 node

Overall, we make the following observations:

• The speedups for the individual ROS 2 nodes within the overall application follow the trends for the callbacks measured in isolation, shown in Table I, although generally lower due to the communication between desktop PC and FPGA board, the ROS 2 communication layers, and the executors. For Number sorting, Inverse kinematics, and to some extent the Sobel filter, distinct speedups are realized.
• The hash calculation is triggered with a 250 ms period. The distribution of roundtrip times shows entries with less and more than 250 ms since the ROS 2 client on the desktop PC measures times between arriving messages from this callback. Some messages are delayed, which then likely reduces the time for the next message.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have introduced the ReconROS executor that enables event-based programming for hardware accelerated ROS 2 applications. In contrast to related work, the ReconROS executor leverages partial reconfiguration for loading hardware-mapped callbacks on demand to predefined reconfigurable slots on the FPGA's logic resource. Additionally, the ReconROS executor extends the ROS 2 standard executor and allows it to schedule or dispatch callbacks to software and hardware.

In future work, we plan to expand the mapping and scheduling strategies of the ReconROS executor to optimize the hardware/software mapping and the resource management. In particular, callbacks that can run in both software and hardware will allow for taking runtime mapping decisions. With respect to resource management, techniques that minimize the unused resources within loaded slots or preloading of hardware callbacks are worth investigating.

REFERENCES

[1] M. Qasaimeh, K. Denolf, J. Lo, K. Vissers, J. Zambreno, and P. H. Jones, “Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels,” in Proc. 2019 IEEE International Conference on Embedded Software and Systems (ICESS), 2019, pp. 1–8.
[2] C. Brugger, L. Dal’Aqua, J. A. Varela, C. D. Schryver, M. Sadri, N. Wehn, M. Klein, and M. Siegrist, “A quantitative cross-architecture study of morphological image processing on CPUs, GPUs, and FPGAs,” in Proc. 2015 IEEE Symposium on Computer Applications Industrial Electronics (ISCAIE), 2015, pp. 201–206.
[3] O. Ulusel, C. Picardo, C. B. Harris, S. Reda, and R. I. Bahar, “Hardware acceleration of feature detection and description algorithms on low-power embedded platforms,” in Proc. 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1–9.
[4] S. I. Venieris and C.-S. Bouganis, “fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 2, pp. 326–342, 2019.
[5] C. Lienen, M. Platzner, and B. Rinner, “Reconros: Flexible hardware acceleration for ros2 applications,” in 2020 International Conference on Field-Programmable Technology (ICFPT), 2020, pp. 268–276.
[6] C. Lienen and M. Platzner, “Design of distributed reconfigurable robotics systems with reconros,” ACM Trans. Reconfigurable Technol. Syst., vol. 15, no. 3, dec 2022.
[7] L. Joseph, Mastering ROS for Robotics Programming. Packt Publishing, 2015.
[8] J. Schlesselman, G. Pardo-Castellote, and B. Farabaugh, “Omg data-distribution service (dds): architectural update,” in IEEE MILCOM 2004. Military Communications Conference, 2004., vol. 2, 2004, pp. 961–967 Vol. 2.
[9] D. Casini, T. Blaß, I. Lütkebohle, and B. B. Brandenburg, “Response-time analysis of ros 2 processing chains under reservation-based scheduling,” in 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[10] Y. Tang, Z. Feng, N. Guan, X. Jiang, M. Lv, Q. Deng, and W. Yi, “Response time analysis and priority assignment of processing chains on ros2 executors,” in 2020 IEEE Real-Time Systems Symposium (RTSS), 2020, pp. 231–243.
[11] J. Staschulat, I. Lütkebohle, and R. Lange, “The rclc executor: Domain-specific deterministic scheduling mechanisms for ros applications on microcontrollers: work-in-progress,” in 2020 International Conference on Embedded Software (EMSOFT), 2020, pp. 18–19.
[12] K. Yamashina, T. Ohkawa, K. Ootsu, and T. Yokota, “Proposal of ROS-compliant FPGA component for low-power robotic systems (retraction notice),” in Proc. International Conference on Intelligent Earth Observing and Applications 2015, vol. 9808, 2015, p. 98082N.
[13] K. Yamashina, H. Kimura, T. Ohkawa, K. Ootsu, and T. Yokota, “CReComp: Automated Design Tool for ROS-Compliant FPGA Component,” in Proc. IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016. IEEE, 2016, pp. 138–145.
[14] Y. Sugata, T. Ohkawa, K. Ootsu, and T. Yokota, “Acceleration of Publish/Subscribe Messaging in ROS-Compliant FPGA Component,” in Proc. of the 8th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2017). ACM, 2017.
[15] T. Ohkawa, Y. Sugata, H. Watanabe, N. Ogura, K. Ootsu, and T. Yokota, “High level synthesis of ROS protocol interpretation and communication circuit for FPGA,” in Proc. 2019 IEEE/ACM 2nd International Workshop on Robotics Software Engineering (RoSE), 2019, pp. 33–36.
[16] M. Eisoldt, S. Hinderink, M. Tassemeier, M. Flottmann, J. Vana, T. Wiemann, J. Gaal, M. Rothmann, and M. Porrmann, “Reconfros: Running ros on reconfigurable socs,” in Proc. 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools Proceedings, ser. DroneSE and RAPIDO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 16–21.
[17] A. Podlubne and D. Göhringer, “FPGA-ROS: Methodology to Augment the Robot Operating System with FPGA Designs,” in Proc. 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2019.
[18] A. Podlubne, J. Mey, R. Schöne, U. Aßmann, and D. Göhringer, “Model-based approach for automatic generation of hardware architectures for robotics,” IEEE Access, vol. 9, pp. 140 921–140 937, 2021.
[19] E. Lübbers and M. Platzner, “ReconOS: Multithreaded Programming for Reconfigurable Computers,” ACM Transactions on Embedded Computing Systems, vol. 9, no. 1, pp. 8:1–8:33, 2009.
[20] A. Agne, M. Happe, A. Keller, E. Lübbers, B. Plattner, M. Platzner, and C. Plessl, “ReconOS: An Operating System Approach for Reconfigurable Computing,” IEEE Micro, vol. 34, no. 1, pp. 60–71, 2014.
[21] V. Mayoral-Vilches and G. Corradi, “Adaptive computing in robotics, leveraging ros 2 to enable software-defined hardware for fpgas,” arXiv preprint arXiv:2109.03276, 2021.
[22] V. Mayoral-Vilches, “Kria Robotics Stack,” https://www.xilinx.com/applications/industrial/robotics/wp540-kria-robotics-stack.html, 2021, accessed: 2022-01-13.
[23] K. Vipin and S. A. Fahmy, “Zycap: Efficient partial reconfiguration management on the xilinx zynq,” IEEE Embedded Systems Letters, vol. 6, no. 3, pp. 41–44, 2014.
[24] R. Gonzalez and R. Woods, Digital Image Processing. Pearson, 2018.
[25] D. Knuth, The Art of Computer Programming: Volume 3: Sorting and Searching. Pearson Education, 1998.
[26] D. Stewart, “A Platform with Six Degrees of Freedom,” in Proc. of the Institution of Mechanical Engineers, vol. 180, no. 1, 1965, pp. 371–386.
