
2016 Euromicro Conference on Digital System Design (DOI 10.1109/DSD.2016.80)

AXIOM: A Hardware-Software Platform for Cyber Physical Systems
Somnath Mazumdar(1), Eduard Ayguade(2), Nicola Bettin(3), Javier Bueno(2),
Sara Ermini(4), Antonio Filgueras(2), Daniel Jimenez-Gonzalez(5), Carlos Alvarez Martinez(5),
Xavier Martorell(2), Francesco Montefoschi(4), David Oro(6), Dionisis Pnevmatikatos(7,8),
Antonio Rizzo(4), Dimitris Theodoropoulos(7), and Roberto Giorgi(1)

(1) Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche, Università degli Studi di Siena, Italy
Email: {surname}@dii.unisi.it
(2) Barcelona Supercomputing Center, Barcelona, Spain
Email: {name.surname}@bsc.es
(3) VIMAR SpA, Marostica, Italy
Email: {name.surname}@vimar.com
(4) Dipartimento di Scienze Sociali, Politiche e Cognitive, Università degli Studi di Siena, Italy
Email: {name.surname}@unisi.it
(5) Computer Architecture Dept., Universitat Politecnica de Catalunya, Barcelona, Spain
Email: {name.surname}@bsc.es
(6) Herta Security, Barcelona, Spain
Email: {name.surname}@hertasecurity.com
(7) Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), Crete, Greece
Email: {pnevmati,dtheodor}@ics.forth.gr
(8) School of ECE, Technical University of Crete, Chania, Greece

Abstract—Cyber-Physical Systems (CPSs) are widely needed in many applications that require interaction with humans and the physical environment. A CPS integrates a set of hardware-software components to distribute, execute and manage its operations. The AXIOM project (Agile, eXtensible, fast I/O Module) aims at developing a hardware-software platform for CPS such that i) it can use an easy parallel programming model and ii) it can easily scale up performance by adding multiple boards (e.g., 1 to 10 boards can run in parallel). AXIOM supports a task-based programming model based on OmpSs and leverages a high-speed, inexpensive communication interface called AXIOM-Link. Another key aspect is that the board provides programmable logic (FPGA) to accelerate portions of an application. We are using smart video surveillance and smart home living applications to drive our design.

Index Terms—Cyber-physical systems, distributed shared memory, programming model, performance evaluation, reconfigurable, smart video surveillance, smart home living.

I. INTRODUCTION

"Cyber-physical systems integrate computation, communication, sensing, and actuation with physical systems to fulfill time-sensitive functions with varying degrees of interaction with the environment, including human interaction." [1]. Cyber-physical systems (CPS) [2], [3], [4] are becoming more and more pervasive in daily life. A CPS is an integrated framework of a network of information processing, sensors and actuators [5], [6]. Such systems are becoming ubiquitous in human life, allowing close interaction not only system to system, but also human to system and vice versa. The CPS domain includes the Internet of Things (IoT), smart homes, smart cities, and the smart grid. Everyday life is becoming increasingly dependent on CPS (e.g., smart video surveillance). In 2008, CPS was rated one of the highest-priority research topics [7]. The noted challenges in designing a CPS architecture are infrastructural challenges, time management, data management (the data workflow), proper software-hardware integration (implementation challenges) and compliance with standards. The AXIOM project (Agile, eXtensible, fast I/O Module) [8], [9] provides a general platform, focusing on scalability and ease of programming. Unlike other EU projects (such as CONTREX [10], DREAMS [11], EMC2 [12], MultiPARTES [13]), which mainly focused on mixed-criticality applications, AXIOM provides a generic platform with its complete application development suite. Despite the existence of many FPGA-based boards, to the best of our knowledge none of them combines all the features we need (such as parallel programmability and scalability). To illustrate this, we compared more than twenty boards, most of them coming from crowd-funding initiatives (some of which met our targets, while others did not), and present the comparison in Table I.

In this paper, we describe the progress of AXIOM after the completion of its first year. The AXIOM project aims to bridge the gap between the different approaches to design (heterogeneity), data analysis and seamless control of hardware to execute generic applications. Our contributions are:
• We detail the software stack and programming model support for AXIOM, based on the OmpSs programming model.
• We illustrate in detail the low-level, inexpensive, high-speed AXIOM-Link and its operation.
• We discuss some early results from our design space exploration.

The rest of the paper is organized as follows: in Section II, we explain how the support for threads is provided using the AXIOM stack and the OmpSs programming model, together with the profiling support in Section III; in Section IV, we illustrate the high-speed AXIOM-Link; in Section V, we discuss our evaluation platform; in Section VI, we illustrate our application scenarios and in Section VII, we show our experimental results. We also discuss the related works in Section VIII and finally, we conclude the paper.

TABLE I: Comparison of Recent FPGA-based Boards (AXIOM related aspects are in bold)

Boards | FPGA/CPU | RAM | Standalone | ATMEL based Arduino | Connectivity | Programmability
LOGi FPGA | Spartan6 LX9 | 256MB | No | Pinout | SATA, Raspy/Beagle, SPI | IDE, GUI
MiniSpartan6+ | Spartan6 LX9/25 | 32MB | Yes | – | I/O ports, DAC, ADC | IDE, GUI
Papilio DUO | Spartan6 LX9 | 2MB | Yes | ATmega32U4, Mega Pinout | I/O ports | IDE, GUI
MOJO | Spartan6 LX9 | – | Yes | ATmega32U4, Custom Pinout | – | IDE, GUI
SmartZynq | Zynq 7010/7020 | 8GB | No | – | Fast network and board2board | Hard
Parallella | Zynq 7010/7020, 16-core Epiphany | 1GB | Yes | – | Gigabit ethernet, four high speed connectors | Standard tools
aijuboard | Zynq 7015 | 1GB | Yes | – | SATA, Gigabit ethernet | Standard tools
RED PITAYA | Zynq 7010 | 512MB | Yes | – | Gigabit ethernet, 4 fast analog inputs | Standard tools
OHO | Spartan 3E | – | Yes | – | I/O ports | Xilinx ISE only
RetroCade Synth | Spartan 3E or LX9 | 4MB | No | – | Analog and digital inputs, MIDI, audio jacks | –
PAPILIO | Spartan 3E | 8MB | Yes | – | I/O ports | No SDK
TRIFDEV | Lattice MACHXO2-1200 | – | Yes | Partial pinout, I2C | I/O ports | No SDK
owlBoard | Spartan6 LX9 | – | Yes | – | I/O ports | No SDK
Alan | Spartan6 LX45 | – | Yes | ATmega32U4 and Pinout | I/O ports | Arduino IDE, Xilinx ISE
4CH sig.gen. | Spartan6 LX9 | – | Yes | – | 4 DAC | None
Logitraxx | Spartan6 LX9 | 64MB | Yes | Compatible with shields | I/O ports | No SDK
KromaLights | Spartan6 LX9, Cortex-M3 (SAM3X) | 256MB | Yes | Arduino Due | LVDS, CAN, USART | SDK (Arduino IDE)
CrystalBoard | Spartan6 LX9, 4-core Cortex-A9 | 2GB | Yes | Atmega328, UNO Pinout | Ethernet, WiFi | –
PSHDL board | Actel A3PN250 | – | Yes | Atmel XMega32 (to program FPGA) | UART, I/O ports | Simplified VHDL
Helix-4 | Altera Cyclone 4 (22k) | 4MB | Yes | Arduino UNO shield | I/O ports | Altera Quartus II IDE
ZynqBerry | Zynq 7010 | 128 MB | Yes | – | Ethernet | Xilinx SDK
Z-turn Board | Zynq 7010/7020 | 1 GB | Yes | – | CAN, Ethernet | Xilinx SDK

II. PROGRAMMING MODEL OF AXIOM

The AXIOM software stack is shown in Figure 1(a). In this section, we briefly describe the OmpSs programming model [14], the extensions planned for OmpSs to spawn tasks in the FPGA device, and the extensions needed to support the cluster version of AXIOM.

A. Introduction to OmpSs Programming Model

The OmpSs programming model supports the execution of heterogeneous tasks written in OpenCL, CUDA, or a high-level C or C++ language that can be converted to the machine language used in GPUs or to the bitstream used to program FPGAs. Also, the runtime supports the communications within a cluster of distributed-memory machines. OmpSs can target tasks to the different nodes of the cluster. From the programmer's perspective, the annotations required for the cluster support are exactly equivalent to those for symmetric multiprocessing (SMP). Currently, both the OpenCL and CUDA options require the programmer to provide the OpenCL or CUDA code and to use the OmpSs target clauses (similar to the OpenMP target clauses) to move the data to the associated accelerator. In the AXIOM project, we use the same technique to spawn tasks to the FPGA, provided there is a compiler that generates the FPGA bitstream implementing the task from C or C++ code, or a bitstream is available with a known interface to access the data.

For executing tasks in the cluster version, the programmer needs to specify the task as plain C or C++ code. Execution on the OmpSs@cluster version automatically allows the runtime system to spawn tasks to remote nodes. The programming model allows parallelizing applications on the AXIOM cluster and spawning tasks on the FPGAs available on each board. Using OmpSs@cluster with FPGA support, programmers express two levels of parallelism. The first level of parallelism targets the AXIOM cores, i.e. the cores that are available on the AXIOM board (e.g., the ARM-A9 cores in the case of a Xilinx Zynq SoC). Tasks at this level are spread across the AXIOM boards as if they were executed on an SMP machine. The second level of task parallelism is expressed through the OmpSs extensions targeting the FPGAs (see below, Section II-A1).

The OmpSs programming model is based on two main components and some additional tools:
• The Mercurium compiler [15] takes the source code as specified by the programmer and understands the OmpSs directives to transform the code to run on heterogeneous platforms, including OpenCL and CUDA accelerators. In this project, the compiler is also going to be extended to support FPGA-based accelerators.
• The Nanos++ runtime system, which is responsible for managing and scheduling parallel tasks, respecting their dependencies, transferring the needed data to/from the accelerators, and handling the lower-level interactions.
• Additionally, OmpSs can use the Extrae tool [16] to generate execution traces that can later be visualized with the Paraver tool [17] to analyze the execution behavior.

1) OmpSs Extensions for FPGAs: OmpSs needs to be extended to support the Zynq chip with the FPGA selected in the AXIOM project. The extensions that provide support for these chips in the Mercurium compiler are:
• To incorporate a new target device named fpga: in addition to the current smp, cuda and opencl devices, the fpga device causes the Mercurium compiler to understand that the annotated function is to be compiled with the Xilinx Vivado HLS compiler for the FPGA, in order to generate the bitstream.

Figure 1(b) shows the main phases of the bitstream generation and the compilation of the OmpSs code. With this extension, the compiler generates the code for the runtime system specifying the tasks that should be run in the FPGA device. The code is compiled with a back-end compiler (e.g., gcc) and executed in the Zynq ARM cores.

[Figure omitted] Fig. 1: Proposed Software Stack and Overview of the OmpSs Support for AXIOM. (a) AXIOM Software Stack: applications run on the OmpSs programming model and the Nanos++ run-time library (with FPGA and cluster plug-ins, the DMA library and GASNet), on top of the Linux OS with FPGA, DMA and interconnect/MMU drivers, and the Zynq hardware (Processing System, Programmable Logic, DMA engines and the high-speed dedicated AXIOM-Link interconnect) on the AXIOM SBC. (b) FPGA Programming Support: the Mercurium FPGA phase splits the OmpSs code into host C code with Nanos++ calls (compiled by GCC into OmpSs.elf, which uses the FPGA DMA driver) and accelerator code (compiled by Vivado HLS into a netlist, from which Vivado generates the bitstream).
This binary code (OmpSs.elf in Figure 1(b)) will call the Nanos++ runtime with FPGA execution support. This support is based on the DMA library and the FPGA-DMA driver in the system.

B. Runtime Support

The runtime support has two parts: i) the first part is responsible for the FPGA-based execution, and ii) the second part for the cluster environment.

1) FPGA Runtime Support: The Nanos++ runtime system has also been extended in the following ways (a sketch of the implements clause follows this list):
• Support to spawn tasks in the FPGA device.
• Support for the target clauses related to data transfers. Data-copy clauses (copy_in, copy_out, copy_inout) trigger the transfer of the specified data to/from the FPGA device. Also, dependence clauses trigger data transfers to the device by default.
• Support for data transfers to/from the FPGA. The Nanos++ runtime now invokes the services of the DMA library developed to transfer data in the FPGA environment.
• Support for the FPGA device in the implements clause, to allow several implementations of a task to be scheduled on the available processors/devices.
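As a brief illustration of the last point, the following minimal sketch (our own example, based on the publicly documented OmpSs implements syntax; the function names are illustrative and not from the project sources) declares an FPGA version of a task as an alternative implementation of an SMP task, so that Nanos++ may fire either version depending on resource availability:

/* Hedged sketch of the OmpSs "implements" clause; names are illustrative. */

/* Reference task, targeted at the SMP cores. */
#pragma omp target device(smp) copy_deps
#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) inout(c[0:BS*BS-1])
void matmul_block(int BS, float a[BS][BS], float b[BS][BS], float c[BS][BS]);

/* Alternative implementation of the same task, targeted at the FPGA:
 * the scheduler may pick either version for each task instance. */
#pragma omp target device(fpga) copy_deps implements(matmul_block)
#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) inout(c[0:BS*BS-1])
void matmul_block_fpga(int BS, float a[BS][BS], float b[BS][BS], float c[BS][BS]);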
FPGA Layer Support: The DMA library interface provides the means to interact with the Linux driver supporting the FPGA device. In the current prototype, when the data is transferred to the FPGA hardware, the IP kernel is initiated automatically. The computation on the data proceeds to the end and, after finishing, the results can be read back to the host from the FPGA.

The main DMA library primitives allow getting the number of IP accelerators present in the FPGA device and the handles to operate with them. For each IP accelerator, the library allows opening input and output DMA channels to send/receive data to/from it. The library also allows allocating special memory buffers in kernel space to exchange data between the Linux kernel and the FPGA hardware. Kernel buffers are pinned to physical memory to avoid swapping them out while a DMA transfer is in progress. Buffers can be submitted for a DMA transfer to/from the specified device. Data transfers can be monitored to determine whether they are in progress, have finished, or a transfer error has occurred. This interface is used by the Nanos++ runtime system to drive the work of the IP accelerators in the FPGA.
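To make this flow concrete, here is a minimal sketch of how a runtime could drive one IP accelerator through a DMA-library-style interface. All axdma_* names, types and constants are illustrative assumptions that mirror the description above; they are not the actual AXIOM DMA library API:

/* Hedged sketch: one FPGA task driven through a hypothetical DMA library.
 * Every axdma_* identifier is an illustrative assumption, not the real API. */
#include <stddef.h>
#include <string.h>
#include "axdma.h"   /* hypothetical header declaring the sketched interface */

int run_fpga_task(const void *in, void *out, size_t n_bytes) {
  if (axdma_num_devices() <= 0)               /* number of IP accelerators   */
    return -1;
  axdma_dev_t dev = axdma_open(0);            /* handle for the first IP     */

  /* Pinned kernel-space buffers, so they cannot be swapped out while a
   * DMA transfer is in progress. */
  void *kin  = axdma_alloc(dev, n_bytes);
  void *kout = axdma_alloc(dev, n_bytes);
  memcpy(kin, in, n_bytes);

  /* Submitting the input transfer starts the IP kernel automatically,
   * as in the current prototype; the output transfer returns the result. */
  axdma_xfer_t tx = axdma_submit(dev, AXDMA_TO_DEVICE,   kin,  n_bytes);
  axdma_xfer_t rx = axdma_submit(dev, AXDMA_FROM_DEVICE, kout, n_bytes);

  /* Transfers can be monitored for completion or errors. */
  int ok = (axdma_wait(tx) == AXDMA_OK) && (axdma_wait(rx) == AXDMA_OK);
  if (ok)
    memcpy(out, kout, n_bytes);

  axdma_free(dev, kin);
  axdma_free(dev, kout);
  axdma_close(dev);
  return ok ? 0 : -1;
}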
2) Cluster Runtime Support: The OmpSs@Cluster [18] approach uses a communication layer to launch tasks to remote nodes. Task descriptors and data travel on the communication layer. In our current implementation, this layer is GASNet [19], usually running on top of MPI [20] through an Ethernet link. The next step is to provide the runtime with a communication layer that can exploit the high-speed dedicated AXIOM-Link interconnection (see Figure 1(a)), using the AXIOM network interface explained in Section IV.

C. OmpSs Coding Example

Figure 2 shows an example of matrix multiplication annotated with OmpSs directives. Note that this code is independent of the execution platform (i.e., cluster, nodes with FPGAs, nodes with GPUs), the runtime being responsible for scheduling the tasks onto the devices or the nodes of the cluster, transparently to the programmer.

In particular, this code shows a parallel tiled matrix multiply where each of the tiles is a task. Each of those tasks has two input dependencies and an output dependence that will be managed at runtime by Nanos++. Those tasks can be scheduled/fired to an SMP or an FPGA, as annotated in the target device directive, depending on resource availability. The copy_deps clause associated with the target directive hints the Nanos++ runtime to copy the data related to the input and output dependencies to/from the device when necessary.

#pragma omp target device(fpga,smp) copy_deps
#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) \
                 inout(c[0:BS*BS-1])
void matrix_multiply(int BS, float a[BS][BS],
                     float b[BS][BS],
                     float c[BS][BS]) {
  for (int ia = 0; ia < BS; ++ia)
    for (int ib = 0; ib < BS; ++ib) {
      float sum = 0;
      for (int id = 0; id < BS; ++id)
        sum += a[ia][id] * b[id][ib];
      c[ia][ib] = sum;
    }
}

...
int main(int argc, char *argv[]) {
  int BS = ...
  ...
  for (i = 0; i < NB; i++) {
    for (j = 0; j < NB; j++) {
      for (k = 0; k < NB; k++) {
        matrix_multiply(BS, A[i][k], B[k][j], C[i][j]);
      }
    }
  }
  #pragma omp taskwait
  ...
}

Fig. 2: OmpSs Directives on Matrix Multiplication

[Figure omitted] Fig. 3: AXIOM Boards Interconnected in 2D-mesh.
[Figure omitted] Fig. 4: The AXIOM Network Interface Architecture.
III. PROFILING SUPPORT

The current implementation provides support to profile and trace cluster execution. At the same time, a new hardware tracing mechanism allows profiling and tracing basic information from fpga tasks. Traces are automatically generated and translated to Paraver traces if specified at execution time. Those traces include both application and OmpSs runtime execution state information, so that the programmer can analyze the parallel execution behavior and detect potential performance bottlenecks. In Section VII some trace results are presented and discussed. Those results uncover the need for hardware profiling support.

A. Hardware Profiling Support

A new hardware support for FPGA profiling and tracing (from inside the FPGA) for high-level languages has been introduced. This feature is, to the best of our knowledge, novel for task-based parallel heterogeneous programming. The support is in the process of being integrated into the FPGA-task acceleration in OmpSs, and it is transparent to the programmer. The first profiling and tracing objective is to obtain input and output memory transfer and computation information from inside the OmpSs fpga task execution. With this aim, the idea is to:
• Create a hardware platform that integrates hardware counters that can be read from both the SMP cores and the fpga accelerated tasks, transparently to the programmer.
• Create hardware counters that do not affect the performance of the fpga tasks.
• Make the fpga tasks return the profiling information as part of their outputs, transparently to the programmer.
• Interpret the profiling information in the device-dependent layer of the profiling OmpSs runtime, transparently to the programmer.
• Include the profiling information in the automatically generated Paraver trace.

Our implementation uses the OMPT API [21] to generate the execution traces using the Extrae instrumentation tool. The OMPT API helps to integrate the profiling of different accelerators/devices and CPUs using the same API, which can be supported by different instrumentation tools.

IV. THE AXIOM NETWORK INTERFACE

In this section, we describe the AXIOM approach to connectivity. The AXIOM platform is designed around the Xilinx Zynq SoC, which features a multi-core ARM processor tightly coupled with FPGA fabric. AXIOM is designed to be modular at the next level, allowing the formation of more efficient processing systems through a low-cost but scalable high-speed interconnect. The latter will utilize the integrated gigabit-rate transceivers with relatively low-cost USB-C connectors to interconnect multiple boards. Such connectivity allows building (or upgrading at a later moment) flexible and low-cost systems by cascading more AXIOM boards, without the need for costly connectors and cables. AXIOM boards will feature two or four bi-directional links, so that the nodes can be connected in many different ways, such as a ring or a 2D-mesh/torus.

Figure 3 illustrates an example of several AXIOM boards interconnected in a 2D-mesh. The integrated processing system (PS7) of each board communicates using an on-chip network interface (NI), implemented in the FPGA region, that efficiently supports the application communication protocols. Figure 4 illustrates the NI architecture, originally introduced in [9], which implements remote direct memory access (RDMA) and remote write operations as basic communication primitives visible at the application level.

The router module shown in Figure 5 implements the routing and network discovery processes. The AXIOM routing algorithm features cut-through packet transmission with virtual circuits (VCs), and the network discovery process is initiated at boot time by the master node of the network. After the process completes, every node has its id and a local routing table, based on which all packets are forwarded to output links. In case the network topology changes, e.g., a node is added or becomes faulty, the network discovery updates the original topology table that resides in the master node, and all local node routing tables accordingly.

The core router components can be outlined as: i) input buffering, ii) control, and iii) crossbar and link traversal. The input buffering module consists of four link controllers (LC), where each link employs queues that implement three VCs to store packets of different priorities. The router uses a Xon/Xoff strategy to notify adjacent nodes about VC input buffer availability. If a VC queue reaches a predefined threshold, the router instantly transmits a Xoff packet to the link's adjacent node to block further packet transmission. Similarly, when the VC fullness drops below a certain level, the router instantly transmits a Xon packet to the link's adjacent node to resume packet transmission via this particular link.
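The Xon/Xoff decision can be summarized with the following schematic C model of the behaviour just described (our own sketch; the thresholds are placeholders, not the values used in the AXIOM router):

/* Schematic model of the Xon/Xoff flow control described above.
 * Thresholds are illustrative placeholders. */
#include <stdbool.h>
#include <stdint.h>

#define XOFF_THRESHOLD 12   /* fill level that triggers an Xoff packet */
#define XON_THRESHOLD   4   /* fill level that triggers an Xon packet  */

enum fc_action { FC_NONE, FC_SEND_XOFF, FC_SEND_XON };

struct vc_queue {
  uint32_t occupancy;   /* packets currently buffered in this VC          */
  bool     blocked;     /* whether we already told the neighbour to stop  */
};

/* Called whenever the occupancy of a VC input queue changes; returns the
 * control packet (if any) to transmit instantly to the link's adjacent node. */
enum fc_action flow_control_update(struct vc_queue *q) {
  if (!q->blocked && q->occupancy >= XOFF_THRESHOLD) {
    q->blocked = true;
    return FC_SEND_XOFF;   /* block further transmission on this VC */
  }
  if (q->blocked && q->occupancy <= XON_THRESHOLD) {
    q->blocked = false;
    return FC_SEND_XON;    /* resume transmission via this link */
  }
  return FC_NONE;
}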

[Figure omitted] Fig. 5: The AXIOM Router Pipelined Architecture.

TABLE II: Comparison of COTSon with Others

Features | Sniper | Graphite | Gem5 | MARSx86 | COTSon
Timing Directed | No | No | Yes | No | No
Functional Directed | Yes | Yes | No | Yes | Yes
User Level | Yes | Yes | Yes | No | No
Full System Simulation | No | No | Yes | Yes | Yes
Parallel (In node) | Yes | Yes | No | No | No
Parallel (Multi-node) | No | No | No | No | Yes
Shared Cache | Yes | No | Yes | Yes | Yes

The route calculation (RC) finds the required output interface for a packet, based on the routing table and the destination node, starting from the highest VC. If the VC number of the output link is enabled, then the packet is forwarded to the corresponding VC allocation (VCA). For each input link, the VCA always attempts to serve the VC with the highest priority, except if its destination node input VC buffer is blocked. In that case, it falls back to the next lower input VC.

During the switch allocation process, the packets from each buffer request a Xbar output. The switch allocation pairs the Xbar inputs to the Xbar outputs as efficiently as possible, trying not to leave an output link idle. If more than one packet requests the same output link, the grant policy decides according to the following rules (a sketch of this arbitration follows the list):
• Priority (Xon/Xoff > VC2 > VC1 > VC0).
• If packets are of the same priority (e.g., both VC2), it chooses one (in a round-robin fashion) to grant an output port, while at the same time it looks for available packets of lower priority (VC1 or VC0) on the same input link that require a different port.
• Repeat until no races exist.
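The priority and round-robin part of this grant policy can be sketched as follows (our own schematic C model; it omits the secondary search for lower-priority packets on the same input link):

/* Schematic sketch of the grant policy: for one output link, serve the
 * highest-priority VC first and break ties round-robin among input links.
 * Illustrative only; not the AXIOM router implementation. */
#include <stdbool.h>

#define NUM_INPUTS 4   /* four link controllers */
#define NUM_VCS    3   /* VC2 > VC1 > VC0       */

/* request[i][v] is true when input link i has a VC-v packet waiting for
 * this output link. Returns the granted input link, or -1 if none. */
int grant_output(bool request[NUM_INPUTS][NUM_VCS], int *rr_next) {
  for (int vc = NUM_VCS - 1; vc >= 0; --vc) {        /* highest priority first */
    for (int k = 0; k < NUM_INPUTS; ++k) {
      int i = (*rr_next + k) % NUM_INPUTS;           /* round-robin tie-break  */
      if (request[i][vc]) {
        *rr_next = (i + 1) % NUM_INPUTS;             /* advance fairness pointer */
        return i;
      }
    }
  }
  return -1;
}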
The use of VCs with priorities ensures that we avoid protocol deadlocks in the network. As we use low priorities for requests, medium priorities for responses and the top priority for acknowledgments, there is no possibility for high-priority packets to clog the network, as they will be fewer than or equal to the requests that were accepted by the network. Thus, in case of high network congestion, acknowledgments will exit the network first, then responses will be sent, and then more requests will be accepted.

Finally, the crossbar module is responsible for forwarding all available packets to their output links. All packets then traverse the physical link to the neighbour node and are stored in the corresponding VC input queue.

V. AXIOM EVALUATION PLATFORM (AEP)

Design space exploration (DSE) and its automation are an important part of our current performance evaluation and power estimation methodologies [22]. The proposed method in AXIOM requires first exploring and modeling parts on the simulator and then, once the DSE is completed, implementing them on the FPGA-based prototypes. This has the considerable advantage of allowing the software stack to be developed early. AEP is made of two important tools: the HP-Labs COTSon simulator [23] and the Xilinx Zynq based platform. Given the goals of this project, we also needed a more flexible platform for the DSE. The simulation platform is used to better understand bottlenecks (e.g., congestion on a bus, cache size), which are not trivial to track on the FPGA prototyping platform. COTSon also includes an interface to the HP McPAT tool [24] for estimating the power consumption. Table II presents some advantages of using COTSon for our purpose.

COTSon uses the so-called "functional-directed" approach. The simulator permitted us to execute full-system simulation. The "mediator" of COTSon represents the model of a switch, and our aim is to modify it to model the behavior of our custom interconnects. The motivation for multiple interconnects derives from the AXIOM project design, which aims to separate the traffic used for building a multi-board system from the traffic for the internet-related connection. With the COTSon mediator, we can model both cases. SimNow is the virtual machine (VM), which models all details of a computer. AMD also provides a separate SDK to model any particular board that has to be plugged in (such as a network card or a GPU).

A. Thread Support

Synchronization and distribution of data can be managed efficiently by reorganizing the execution in such a way that the threads follow more closely the data flow of the program (such as with DF-Threads [25]). DF-Threads can be efficiently implemented by a distributed hardware thread scheduler [26], which supports fault tolerance at the hardware level and efficient fine-grain dataflow thread distribution. To reduce the thread management overhead, the scheduling needs to be accelerated in hardware, by mapping its structure into the FPGA. A DF-Thread is defined as a function that expects no parameters and returns no parameters. The body of this function can refer to data residing at the memory locations for which it has obtained the pointers. The DF-Thread APIs [27] are summarized below (a usage sketch follows the list):
• void *DF_TSCHEDULE(bool cnd, void *ip, uint64_t sc): Allocates the resources (a DF-frame of size sc words and a corresponding entry in the distributed thread scheduler, or DTS) for a new DF-Thread and returns a frame pointer fp. The ip is the instruction pointer of the DF-Thread. The allocated DF-Thread is not executed until its sc reaches 0 and the boolean condition cnd is also satisfied.
• void DF_DESTROY(): Releases the allocated resources held by the current DF-Thread.
• uint64_t DF_TREAD(uint64_t offset): Loads the data indexed by offset from the DF-frame of the current thread.
• void DF_TWRITE(uint64_t val, void *fp, uint64_t off): The data val is stored into the DF-frame pointed to by fp at the specified offset off.
• void *DF_TALLOC(uint64_t size, uint_8 type): Allocates a block of memory of size words and returns the pointer (or null), while type specifies the special-purpose memory type.
• void DF_TFREE(void *p): Frees the memory pointed to by p.
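As a usage illustration of the API just listed, the following hedged sketch spawns a producer and a consumer DF-Thread: the consumer is allocated with a synchronization count of 2 and fires only after the producer has written its two operands into the consumer's DF-frame (we assume, consistently with the description above, that each DF_TWRITE decrements the target's synchronization count; the offsets, the frame size and the way the frame pointer reaches the producer are our own illustrative choices):

/* Hedged sketch of DF-Thread API usage; DF_* prototypes are assumed to come
 * from the DF-Thread API [27]. Offsets, sizes and the global frame pointer
 * are illustrative choices, not prescribed by the paper. */
#include <stdint.h>

static void *consumer_fp;          /* consumer's DF-frame pointer */

static void consumer(void) {       /* a DF-Thread: no parameters, no return */
  uint64_t a = DF_TREAD(0);        /* read operands from the current DF-frame */
  uint64_t b = DF_TREAD(1);
  uint64_t *out = DF_TALLOC(1, 0); /* 1-word block; memory type 0 is a placeholder */
  *out = a + b;
  DF_TFREE(out);
  DF_DESTROY();                    /* release this thread's resources */
}

static void producer(void) {
  /* Each write fills one word of the consumer's frame and is assumed to
   * decrement its synchronization count; after the second write the
   * consumer becomes ready to execute. */
  DF_TWRITE(40, consumer_fp, 0);
  DF_TWRITE(2,  consumer_fp, 1);
  DF_DESTROY();
}

static void spawn_pair(void) {
  /* Consumer waits for two writes (sc = 2); producer has no inputs (sc = 0). */
  consumer_fp = DF_TSCHEDULE(1, (void *)consumer, 2);
  (void)DF_TSCHEDULE(1, (void *)producer, 0);
}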
VI. APPLICATION SCENARIOS OF AXIOM

Smart video surveillance and smart home applications are now a hot topic in CPS, and we have customized these two scenarios for our AXIOM platform.

TABLE III: OmpSs Experimental Results

Machine | Execution Time (s) | GFLOPS | Speedup
UDOO: 1 core (1 node) | 7.6 | 0.28 | 1
UDOO: 4 cores (1 node) | 1.9 | 1.13 | 4
UDOO: 8 cores (2 nodes) | 1.3 | 1.61 | 5.7
Zynq 706 board (FPGA) | 0.5 | 4.06 | 15.3
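The GFLOPS column is consistent with the standard 2n^3 floating-point operation count of an n x n matrix multiply divided by the reported execution time:

\[ \mathrm{GFLOPS} = \frac{2n^{3}}{t\cdot 10^{9}}, \qquad n=1024:\quad \frac{2\cdot 1024^{3}}{7.6\cdot 10^{9}} \approx 0.28 \ (\text{1 core}), \qquad \frac{2\cdot 1024^{3}}{1.9\cdot 10^{9}} \approx 1.13 \ (\text{4 cores}). \]

The remaining rows match the same formula up to the rounding of the reported execution times.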

[Figure omitted] Fig. 6: The AXIOM Smart Home Living (SHL) Scenario.
[Figure omitted] Fig. 7: Paraver trace of the OmpSs MxM using 2 nodes UDOO x86, with 4 threads per node.

A. Smart Home Living (SHL)

For the SHL case study, we selected a scenario which aims to increase the natural interaction of the user with his house using both audio and video analysis. Figure 6 shows an overview of the SHL scenario. Real-time multimedia stream processing is required to enable a natural interaction for the user, as well as the capability to instantly correlate the information extracted from the audio and video data in different ways, related to the actual situation inside and outside the house at the particular time in which the data are collected. Audio and video analysis are also useful both to enhance the security level of the house and to increase the automation potential of the smart house, thus reducing waste of energy while increasing comfort. The sound streams coming from both vocal and non-vocal signals will be analyzed on the FPGA-based SoC. The main computation tasks in the audio processing are: filtering out the noise, voice detection, and extraction of the specific information necessary for the automation of the house. In video processing, vital steps are frame decompression and recognition of specific information inside frames. The results of the audio and video processing are correlated to increase the level of the system's intelligence.

B. Smart Video Surveillance (SVS)

For the SVS case study, we selected an automated smart marketing scenario involving real-time face detection in crowds while performing demographics estimation (e.g., age, gender and ethnicity). The SVS scenario will employ state-of-the-art cognitive computer vision techniques based on models built from a boosted cascade of classifiers combined with deep convolutional neural networks. A low-power, high-performance inference engine for such models will be implemented in the reconfigurable logic of the SoC using the OmpSs programming model. Since this scenario will analyze high-definition (HD) video feeds, other computational challenges related to video processing must also be addressed. HD video stream decoding (i.e., format parsing, codec implementation, de-muxing and color space conversion) will be performed by relying on a heterogeneous computing approach combining single instruction, multiple data (SIMD) instructions with on-die logic blocks.

VII. EVALUATIONS AND RESULTS

In this section, we present some preliminary results for the software and hardware prototypes the AXIOM project is designing and implementing.

A. OmpSs Timing Results

Table III shows the execution time and GFLOPS of the matrix multiplication of Figure 2 for different execution environments. Those environments are: i) one core of the UDOO x86 cluster [28], ii) four cores of the same node of the UDOO cluster, iii) all cores of the two-node UDOO cluster, and iv) a Zynq ZC706 SoC using the FPGA to accelerate the matrix multiply tiles. All the results are for a tiled matrix multiply with BS=128 and 1024x1024 matrices. Speedup results are obtained by comparing each environment to the UDOO 1-core environment.

On one hand, the results show that OmpSs@cluster scales pretty well inside the node. Meanwhile, it seems that there are some overheads that reduce the scalability when using the two nodes of the cluster. In particular, the Ethernet connection may affect the synchronization and communication overhead. Therefore, the use of the high-speed dedicated AXIOM-Link interconnection should help to reduce this overhead and improve the scalability of the OmpSs applications. On the other hand, the Zynq ZC706 board result, using FPGA accelerators for the matrix multiply, shows a much better performance than the UDOO cluster. It can be stated that the AXIOM platform will outperform the UDOO cluster by at least an order of magnitude.

B. Profiling and Tracing Results

In this sub-section, profiling and tracing results are presented. Cluster profiling results have been obtained using a cluster of UDOO x86's; meanwhile, one-node traces with fpga task executions are obtained on a Zynq 706 board.

1) Cluster Profiling: Figure 7 shows the execution of the OmpSs matrix multiply (BS=128) of Figure 2 with target device(smp). The cluster has two nodes with four threads per node, each of them executing smp tasks. The Paraver trace has as many horizontal lines as threads running OmpSs tasks; therefore, there are eight horizontal lines (one per thread). The different colors mean different thread states along the execution time of the application. Green flags indicate trace events (e.g., start/end of a task). The main area colors in the trace have the following meaning: pink areas correspond to the task creation on the master thread (top), yellow areas correspond to smp tasks running in the SMP, light red in the master thread (first horizontal line) corresponds to a global task synchronization, and dark red corresponds to the idle state, where those threads are doing nothing. The trace shows that tasks have been evenly distributed among the two UDOO nodes, achieving a promising performance result. In this Paraver trace, the dependences between tasks are not shown for clarity purposes.

[Figure omitted] Fig. 8: Paraver trace of the OmpSs MxM using 1 SMP (top) and 1 helper thread (bottom) for two FPGA accelerators.

2) One Node Profiling: In this case, and for the purpose of presenting an execution trace that helps to detect a performance bottleneck, we selected a sub-optimal hardware/software co-design of the parameters and task target devices of an OmpSs application running in one node and accelerating tasks in the FPGA. Figure 8 shows a Paraver trace of the parallel execution of the OmpSs matrix multiply (BS=128) of Figure 2, running on a Zynq machine and using one thread for smp task executions and one thread for fpga task submissions to two accelerators. In this Paraver trace, there is one thread (top) running tasks in the SMP and one thread (the helper thread) submitting tasks to two MxM accelerators in the FPGA (bottom). Green flags indicate trace events (e.g., start/end of a task) and yellow lines between events/states indicate task dependences. The main color areas in the trace have the following meaning: pink areas correspond to the task creation on the master thread (top), yellow areas correspond to smp tasks running in the SMP, and purple areas are for the submission of fpga tasks to one of the FPGA accelerators. Light green in the helper thread corresponds to the thread waiting for more tasks to be submitted to the FPGA.

On one hand, this execution trace shows a significant load imbalance between the two threads. The reason is the decision of executing tasks in the SMP when the FPGA, at the same task granularity, is much faster than the SMP. The programmer could decide either to specify only fpga tasks and/or to change the task granularity for the SMP. On the other hand, the trace does not give much information about the memory transfers (DMA) from/to Host/FPGA, the possible overlapping of memory transfers and FPGA acceleration, and the FPGA computation time in the two accelerators. For this reason, it is important to have hardware profiling support that provides useful FPGA profiling information to the programmer, from inside the FPGA.

C. DF-Threads Initial Results

In this sub-section we report our experimental results when the AXIOM platform consists of 1, 2 or 4 nodes. In this case, the execution model is based on the DF-Threads and the methodology illustrated in Section V. For simplicity, we use a well-known benchmark, the blocked matrix multiplication (see Figure 2). The parallelization is based on the ratio between the matrix size n and the block size b (i.e., the expected number of DF-Threads is n/b). In our experiment, we consider three matrix sizes, n=256, 512, 1024, while the block size is fixed to b=4, and we report the results in Figure 9 and in Figure 10. In these cases, the numbers of DF-Threads are respectively 64, 128 and 256. The interesting result is related to the total number of instructions. As we can see from Figure 9, for each matrix size the instruction count has almost the same value as we vary the number of nodes from 1 to 4 (three superposing lines). The reason for this is the small overhead needed to manage DF-Threads across nodes. Moreover, the number of instructions closely follows the theoretical increase (i.e., the number of instructions increases as O(n^3)) of a classical block-matrix multiplication. We normalized the total number of instructions of each curve to the case of matrix size n=256 to compare the three experimental cases and the theoretical O(n^3) line in Figure 9.

[Figure omitted] Fig. 9: Instruction count normalized to the matrix size 256 (n=256); b is the block-size. (Curves for 1, 2 and 4 nodes and the 2*N^3 reference line; the normalized count ranges from 1 to 64.)
[Figure omitted] Fig. 10: Speed-up of user cycles count normalized to the matrix size 256 (n=256); b is the block-size. (Results for 1, 2 and 4 nodes, for n=256, 512 and 1024 with b=4.)
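As a quick check of this normalization: if the instruction count grows as O(n^3), the n=512 and n=1024 curves should sit at factors of 8 and 64 above the n=256 baseline, which is exactly the 2*N^3 reference behaviour plotted in Figure 9:

\[ \frac{I(512)}{I(256)} \approx \Big(\frac{512}{256}\Big)^{3} = 8, \qquad \frac{I(1024)}{I(256)} \approx \Big(\frac{1024}{256}\Big)^{3} = 64. \]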
As we can see from Figure 10, the scalability improves significantly when we have a larger number of threads. In the case of n=1024, b=4 the speedup is almost ideal (for four nodes the speedup is almost 4). We do not report here the effect of different block sizes, but for smaller block sizes we typically achieve better scalability [29]. What has to be stressed here is the possibility of scaling performance across nodes that have separate address spaces.

VIII. RELATED WORKS

CPS has been gaining importance in recent years, and a lot of work has been done in this domain. Smart homes are one of the most traditional CPS application scenarios [30], [31]. We refer readers to the general survey on CPS [32] for more information. Completed EU projects such as SCUBA also developed a CPS architecture for self-organizing, cooperative and robust building automation systems (BAS) [33]. Another related area of CPS is mixed-criticality applications. Related projects such as the EMC2 project provide a flexible MPSoC architecture which can be tailored by middleware for executing real-time and mixed-criticality applications. The CONTREX project mainly focused on developing energy-efficient and low-cost hardware designs for critical applications (such as automotive, aeronautics and telecommunications). Another EU project, DREAMS, focuses on a cross-domain architecture based on open-source (XtratuM) virtualization and design tools for supporting the execution of mixed-critical applications on networked multi-core chips. These projects are highly focused on (mixed-)critical applications and evaluate their platforms mostly on avionics and wind-power-based application domains. The main difference between the other projects and AXIOM is that we focus on the complete hardware-software co-design suite. Moreover, AXIOM provides a generic programming model which can work with its high-speed interconnect subsystem on multiple platforms, together with proper hardware support.
In this context, Data-Flow Soft-Core [34] is another current work that shows the potential of dataflow-based approaches in the reconfigurable domain.

IX. CONCLUSIONS

The AXIOM platform provides an integrated approach including a heterogeneous SoC (currently with an FPGA) board, a new high-performance connection link for the cluster, and a task-based programming model that can support single- and multiple-node heterogeneous parallel execution, transparently to the programmer. The initial results are encouraging in terms of scalability, while keeping an easy programming model for the programmer.

X. ACKNOWLEDGMENT

This work is partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and HiPEAC (GA 687698), by the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), by the Spanish Ministry of Science and Technology through the TIN2012-34557 project, and by the Generalitat de Catalunya (contract 2009-SGR-980).

REFERENCES

[1] C. P. S. P. W. Group et al., "Framework for cyber-physical systems," Preliminary Discussion Draft, Release 0.8, 2015.
[2] E. A. Lee, "Cyber physical systems: Design challenges," in Object Oriented Real-Time Distributed Computing (ISORC), 2008 11th IEEE International Symposium on. IEEE, 2008, pp. 363–369.
[3] R. Baheti and H. Gill, "Cyber-physical systems," The Impact of Control Technology, vol. 12, pp. 161–166, 2011.
[4] T. Sanislav and L. Miclea, "Cyber-physical systems: concept, challenges and research areas," Journal of Control Engineering and Applied Informatics, vol. 14, no. 2, pp. 28–33, 2012.
[5] J. Sztipanovits, S. Ying, I. Cohen, D. Corman, J. Davis, H. Khurana, P. Mosterman, V. Prasad, and L. Stormo, "Strategic R&D opportunities for 21st century cyber-physical systems," Technical Report for Steering Committee for Foundation in Innovation for Cyber-Physical Systems, Chicago, IL, USA, 13 March, Tech. Rep., 2012.
[6] E. Geisberger and M. Broy, Living in a Networked World: Integrated Research Agenda Cyber-Physical Systems (agendaCPS). Herbert Utz Verlag, 2015.
[7] J. H. Marburger, E. F. Kvamme, G. Scalise, and D. A. Reed, "Leadership under challenge: Information technology R&D in a competitive world. An assessment of the federal networking and information technology R&D program," DTIC Document, Tech. Rep., 2007.
[8] D. Theodoropoulos, D. Pnevmatikatos, C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, C. Segura et al., "The AXIOM project (Agile, eXtensible, fast I/O Module)," in Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on. IEEE, 2015, pp. 262–269.
[9] C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, D. Theodoropoulos, D. N. Pnevmatikatos, C. Scordino et al., "The AXIOM software layers," in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 117–124.
[10] "CONTREX," 2016 (accessed June 16, 2016). [Online]. Available: https://contrex.offis.de/home/
[11] "DREAMS," 2016 (accessed June 16, 2016). [Online]. Available: https://www.uni-siegen.de/dreams/home/
[12] W. Weber, A. Hoess, F. Oppenheimer, B. Koppenhoefer, B. Vissers, and B. Nordmoen, "EMC2: a platform project on embedded microcontrollers in applications of mobility, industry and the internet of things," in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 125–130.
[13] S. Trujillo, A. Crespo, A. Alonso, and J. Pérez, "MultiPARTES: Multi-core partitioning and virtualization for easing the certification of mixed-criticality systems," Microprocessors and Microsystems, vol. 38, no. 8, pp. 921–932, 2014.
[14] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "OmpSs: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011.
[15] J. Balart, A. Duran, M. Gonzàlez, X. Martorell, E. Ayguadé, and J. Labarta, "Nanos Mercurium: a research compiler for OpenMP," in Proceedings of the European Workshop on OpenMP, vol. 8, 2004, p. 56.
[16] Barcelona Supercomputing Center, "Extrae instrumentation library," 2016 (accessed June 16, 2016). [Online]. Available: http://www.bsc.es/computer-sciences/extrae
[17] V. Pillet, J. Labarta, T. Cortes, and S. Girona, "Paraver: A tool to visualize and analyze parallel code," in Proceedings of WoTUG-18: Transputer and occam Developments, vol. 44, 1995, pp. 17–31.
[18] J. Bueno, X. Martorell, R. M. Badia, E. Ayguadé, and J. Labarta, "Implementing OmpSs support for regions of data in architectures with multiple address spaces," in Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 2013, pp. 359–368.
[19] D. Bonachea, "GASNet specification, v1.1," 2002.
[20] MPI Forum, "MPI: A message-passing interface standard, version 3.0," 2016 (accessed June 16, 2016). [Online]. Available: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
[21] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: An OpenMP tools application programming interface for performance analysis," in OpenMP in the Era of Low Power Devices and Accelerators. Springer, 2013, pp. 171–185.
[22] C. Silvano, W. Fornaciari, G. Palermo, V. Zaccaria, F. Castro, M. Martinez, S. Bocchio, R. Zafalon, P. Avasare, G. Vanmeerbeeck et al., "MULTICUBE: Multi-objective design space exploration of multi-core architectures," in VLSI 2010 Annual Symposium. Springer, 2011, pp. 47–63.
[23] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, "COTSon: infrastructure for full system simulation," ACM SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52–61, 2009.
[24] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 469–480.
[25] R. Giorgi and P. Faraboschi, "An introduction to DF-Threads and their execution model," in Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on. IEEE, 2014, pp. 60–65.
[26] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.
[27] R. Giorgi, "TERAFLUX: exploiting dataflow parallelism in teradevices," in Proceedings of the 9th Conference on Computing Frontiers. ACM, 2012, pp. 303–304.
[28] UDOO, "UDOO x86: The most powerful maker board ever," 2016 (accessed June 16, 2016). [Online]. Available: https://www.kickstarter.com/projects/udoo/udoo-x86-the-most-powerful-maker-board-ever
[29] R. Giorgi, "Scalable embedded systems: Towards the convergence of high-performance and embedded computing," in Embedded and Ubiquitous Computing (EUC), 2015 IEEE 13th International Conference on. IEEE, 2015, pp. 148–153.
[30] D. Retkowitz and S. Kulle, "Dependency management in smart homes," in Distributed Applications and Interoperable Systems. Springer, 2009, pp. 143–156.
[31] C. Reinisch, M. Kofler, F. Iglesias, and W. Kastner, "ThinkHome energy efficiency in future smart homes," EURASIP Journal on Embedded Systems, vol. 2011, no. 1, pp. 1–18, 2011.
[32] J. Shi, J. Wan, H. Yan, and H. Suo, "A survey of cyber-physical systems," in Wireless Communications and Signal Processing (WCSP), 2011 International Conference on. IEEE, 2011, pp. 1–6.
[33] F. Bernier, J. Ploennigs, D. Pesch, S. Lesecq, T. Basten, M. Boubekeur, D. Denteneer, F. Oltmanns, F. Bonnard, M. Lehmann et al., "Architecture for self-organizing, co-operative and robust building automation systems," in Industrial Electronics Society, IECON 2013, 39th Annual Conference of the IEEE. IEEE, 2013, pp. 7708–7713.
[34] L. Verdoscia and R. Giorgi, "A data-flow soft-core processor for accelerating scientific calculation on FPGAs," Mathematical Problems in Engineering, vol. 2016, no. 1, pp. 1–21, 2016.

