AXIOM: A Hardware-Software Platform for Cyber-Physical Systems
Abstract—Cyber-Physical Systems (CPSs) are widely necessary for many applications that require interaction with humans and the physical environment. A CPS integrates a set of hardware-software components to distribute, execute and manage its operations. The AXIOM project (Agile, eXtensible, fast I/O Module) aims at developing a hardware-software platform for CPSs such that i) it can use an easy parallel programming model and ii) it can easily scale up performance by adding multiple boards (e.g., 1 to 10 boards can run in parallel). AXIOM supports a task-based programming model based on OmpSs and leverages a high-speed, inexpensive communication interface called AXIOM-Link. Another key aspect is that the board provides programmable logic (FPGA) to accelerate portions of an application. We are using smart video surveillance and smart home living applications to drive our design.

Index Terms—Cyber-physical systems, distributed shared memory, programming model, performance evaluation, reconfigurable, smart video surveillance, smart home living.

I. INTRODUCTION

"Cyber-physical systems integrate computation, communication, sensing, and actuation with physical systems to fulfill time-sensitive functions with varying degrees of interaction with the environment, including human interaction." [1]. Cyber-physical systems (CPSs) [2], [3], [4] are becoming more and more pervasive in many aspects of daily life. A CPS is an integrated framework of a network of information processing, sensors and actuators [5], [6]. Such systems are becoming ubiquitous in human life, allowing close interaction not only system-to-system, but also human-to-system and vice versa. The CPS domain includes the Internet of Things (IoT), smart homes, smart cities, and the smart grid. Everyday life is becoming increasingly dependent on CPSs (e.g., smart video surveillance). In 2008, CPS was ranked among the highest-priority research topics [7]. The noted challenges in designing a CPS architecture are infrastructural challenges, time management, data management (the data workflow), proper software-hardware integration (implementation challenges) and compliance with standards. The AXIOM project (Agile, eXtensible, fast I/O Module) [8], [9] provides a general platform, focusing on scalability and ease of programming. Unlike other EU projects (such as CONTREX [10], DREAMS [11], EMC2 [12], MultiPARTES [13]), which mainly focused on mixed-criticality applications, AXIOM provides a generic platform with a complete application-development suite. Despite the existence of many FPGA-based boards, to the best of our knowledge none of them combines all the required features (such as parallel programmability and scalability). To illustrate this, we compared more than twenty boards, most of them coming from crowd-funding initiatives (some of which met our targets, while others did not), and present the comparison in Table I.

In this paper, we describe the progress of AXIOM after the completion of its first year. The AXIOM project aims to bridge the gap between the different approaches to design (heterogeneity), data analysis, and the seamless control of hardware to execute generic applications. Our contributions are:
• We detail the software stack and programming-model support for AXIOM, based on the OmpSs programming model.
• We illustrate in detail the low-level, inexpensive, high-speed AXIOM-Link and its operation.
• We discuss some early results from our design space exploration.

The rest of the paper is organized as follows: in Section II, we explain how support for threads is provided using the AXIOM stack and the OmpSs programming model, with the profiling support covered in Section III; in Section IV, we illustrate the high-speed AXIOM-Link; in Section V, we discuss our evaluation platform; in Section VI, we illustrate our application scenarios; and in Section VII, we show our experimental results. We discuss related work in Section VIII and finally conclude the paper.
II. PROGRAMMING MODEL OF AXIOM

The AXIOM software stack is shown in Figure 1(a). In this section, we briefly describe the OmpSs programming model [14], the extensions planned for OmpSs to spawn tasks on the FPGA device, and the extensions needed to support the cluster version of AXIOM.

A. Introduction to the OmpSs Programming Model

The OmpSs programming model supports the execution of heterogeneous tasks written in OpenCL, CUDA, or a high-level C or C++ language that can be converted to the machine language used in GPUs or to the bitstream used to program FPGAs. The runtime also supports communication within a cluster of distributed-memory machines, and OmpSs can target tasks to the different nodes of the cluster. From the programmer's perspective, the annotations required for cluster support are exactly the same as those required for symmetric multiprocessing (SMP).
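As a minimal sketch of such an annotation (our own example, not taken from the paper), a plain C task written for SMP runs unchanged when OmpSs@cluster offloads it to a remote node:

/* Hypothetical task: the in/out clauses declare the data that the
   runtime must transfer if the task is executed on a remote node. */
#pragma omp task in(v[0:n-1]) out(r[0:0])
void vector_sum(int n, const float *v, float *r) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += v[i];
  r[0] = s;
}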
Currently, both the OpenCL and CUDA options require the programmer to provide the OpenCL or CUDA code and to use the OmpSs target clauses (similar to the OpenMP target clauses) to move the data to the associated accelerator.
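As an illustration (a minimal sketch under our own assumptions; the kernel, sizes, and ndrange values are not from the paper), a CUDA kernel can be annotated so that the runtime moves the data and launches it as a task:

/* Hypothetical OmpSs+CUDA task: copy_deps moves x and y to the GPU;
   ndrange(1, n, 128) asks for a 1-D launch of n work-items in
   groups of 128. */
#pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in(x[0:n-1]) inout(y[0:n-1])
__global__ void saxpy(int n, float a, float *x, float *y);

void scale_add(int n, float a, float *x, float *y) {
  saxpy(n, a, x, y);     /* spawns the task; data movement is implicit */
  #pragma omp taskwait   /* wait until y has been copied back */
}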
In the AXIOM project, we use the same technique to spawn tasks to the FPGA, provided that a compiler is available to generate the FPGA bitstream implementing the task from C or C++ code, or that a bitstream with a known interface to access the data is already available.

For executing tasks in the cluster version, the programmer needs to specify the task as plain C or C++ code. Execution on the OmpSs@cluster version automatically allows the runtime system to spawn tasks to remote nodes. The programming model allows parallelizing applications on the AXIOM cluster and spawning tasks on the FPGAs available on each board. Using OmpSs@cluster with FPGA support, programmers express two levels of parallelism. The first level of parallelism targets the AXIOM cores, i.e., the cores that are available on the AXIOM board (e.g., the ARM A9 cores in the case of a Xilinx Zynq SoC). Tasks at this level are spread across the AXIOM boards as if they were executed on an SMP machine. The second level of task parallelism is expressed through the OmpSs extensions targeting the FPGAs (see Section II-A1 below).

The OmpSs programming model is based on two main components and some additional tools:
• The Mercurium compiler [15] takes the source code as specified by the programmer and interprets the OmpSs directives to transform the code to run on heterogeneous platforms, including OpenCL and CUDA accelerators. In this project, the compiler is also being extended to support FPGA-based accelerators.
• The Nanos++ runtime system, which is responsible for managing and scheduling parallel tasks, respecting their dependencies, transferring the data needed to/from the accelerators, and handling the lower-level interactions.
• Additionally, OmpSs can use the Extrae tool [16] to generate execution traces that can later be visualized with the Paraver tool [17] to analyze the execution behavior.

1) OmpSs Extensions for FPGAs: OmpSs needs to be extended to support the Zynq chip with the FPGA selected in the AXIOM project. The extensions providing support for these chips in the Mercurium compiler are:
• A new target device named fpga: in addition to the current smp, cuda and opencl devices, the fpga device causes the Mercurium compiler to understand that the annotated function is to be compiled with the Xilinx Vivado HLS compiler for the FPGA, in order to generate the bitstream.

Figure 1(b) shows the main phases of the bitstream generation and the compilation of the OmpSs code. With this extension, the compiler generates the code for the runtime system, specifying the tasks that should be run in the FPGA device. The code is compiled with a back-end compiler (e.g., gcc) and executed on the Zynq ARM cores. This binary code (OmpSs.elf in Figure 1(b)) will call the Nanos++ runtime system.
Fig. 1: (a) The AXIOM software stack (applications on top of the OmpSs programming model); (b) main phases of the OmpSs compilation and FPGA bitstream generation.
#pragma omp target device(fpga,smp) copy_deps
#pragma omp task in(a[0:BS*BS-1], b[0:BS*BS-1]) \
                 inout(c[0:BS*BS-1])
void matrix_multiply(int BS, float a[BS][BS],
                     float b[BS][BS],
                     float c[BS][BS]) {
  for (int ia = 0; ia < BS; ++ia)
    for (int ib = 0; ib < BS; ++ib) {
      float sum = 0;
      for (int id = 0; id < BS; ++id)
        sum += a[ia][id] * b[id][ib];
      c[ia][ib] = sum;
    }
}
...
int main(int argc, char *argv[]) {
  int BS = ...
  ...
  for (i = 0; i < NB; i++) {
    for (j = 0; j < NB; j++) {
      for (k = 0; k < NB; k++) {
        matrix_multiply(BS, A[i][k], B[k][j], C[i][j]);
      }
    }
  }
  #pragma omp taskwait
  ...
}

Fig. 2: OmpSs Directives on Matrix Multiplication

Fig. 3: AXIOM Boards Interconnected in 2D-mesh

Fig. 4: The AXIOM Network Interface Architecture
III. PROFILING AND TRACING SUPPORT

The support is transparent to the programmer. The first profiling and tracing objective is to obtain memory-transfer (input and output) and computation information from inside the OmpSs fpga task execution. With this aim, the idea is to:
• Create a hardware platform that integrates hardware counters that can be read from both the SMP cores and the fpga accelerated tasks, transparently to the programmer.
• Create hardware counters that do not affect the performance of the fpga tasks.
• Make the fpga tasks return the profiling information as part of their outputs, transparently to the programmer.
• Interpret the profiling information in the device-dependent layer of the profiling OmpSs runtime, transparently to the programmer.
• Include the profiling information in the automatically generated Paraver trace.

Our implementation has used the OMPT API [21] to generate the execution traces using the Extrae instrumentation tool. The OMPT API helps to integrate the profiling of different accelerators/devices and CPUs using the same API, which can be supported by different instrumentation tools.
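As a rough sketch of how a tool attaches through OMPT (written against the later-standardized OpenMP 5.0 interface; the 2013 draft API of [21] that was current at the time differs in its details, and Extrae's actual implementation is not shown in the paper):

#include <omp-tools.h>

static int tool_init(ompt_function_lookup_t lookup, int initial_device_num,
                     ompt_data_t *tool_data) {
  /* Look up runtime entry points and register callbacks here, e.g. for
     task creation/completion, and forward the events to the tracer. */
  ompt_set_callback_t set_callback =
      (ompt_set_callback_t)lookup("ompt_set_callback");
  (void)set_callback;   /* registration of callbacks omitted */
  return 1;             /* non-zero keeps the tool active */
}

static void tool_fini(ompt_data_t *tool_data) {
  /* Flush and close the trace. */
}

/* The OpenMP runtime calls this at startup; returning non-NULL
   activates the tool. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
  static ompt_start_tool_result_t result = {&tool_init, &tool_fini, {0}};
  return &result;
}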
IV. THE AXIOM NETWORK INTERFACE

In this section, we describe the AXIOM approach to connectivity. The AXIOM platform is designed around the Xilinx Zynq SoC, which features a multi-core ARM processor tightly coupled with an FPGA fabric. AXIOM is designed to be modular at the next level, allowing the formation of more efficient processing systems through a low-cost but scalable high-speed interconnect. The latter utilizes the integrated gigabit-rate transceivers with relatively low-cost USB-C connectors to interconnect multiple boards. Such connectivity allows building (or upgrading at a later moment) flexible and low-cost systems by cascading more AXIOM boards, without the need for costly connectors and cables. AXIOM boards will feature two or four bi-directional links, so that the nodes can be connected in many different ways, such as a ring or a 2D-mesh/torus.

Figure 3 illustrates an example of several AXIOM boards interconnected in a 2D-mesh. The integrated processing system (PS7) of each board communicates using an on-chip network interface (NI), implemented in the FPGA region, that efficiently supports the application communication protocols. Figure 4 illustrates the NI architecture, originally introduced in [9], which implements remote direct memory access (RDMA) and remote write operations as the basic communication primitives visible at the application level.
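To give a flavor of a remote-write primitive at this level, here is a purely illustrative sketch; the descriptor layout, names, and sizes are our assumptions, not the actual AXIOM NI interface:

#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor handed to the NI for a remote write. */
typedef struct {
  uint8_t  dst_node;     /* destination node id (assigned at discovery) */
  uint8_t  vc;           /* virtual circuit: 0=request, 1=response, 2=ack */
  uint64_t remote_addr;  /* target address in the destination's memory   */
  uint32_t length;       /* payload length in bytes                      */
  uint8_t  payload[256]; /* illustrative single-packet payload           */
} remote_write_desc_t;

/* Fill a descriptor; the NI (not modeled here) would route and deliver
   it without involving the remote CPU. */
static int make_remote_write(remote_write_desc_t *d, uint8_t dst,
                             uint64_t remote_addr,
                             const void *src, uint32_t len) {
  if (len > sizeof d->payload)
    return -1;           /* a real NI would segment larger transfers */
  d->dst_node    = dst;
  d->vc          = 0;    /* requests travel at the lowest priority */
  d->remote_addr = remote_addr;
  d->length      = len;
  memcpy(d->payload, src, len);
  return 0;
}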
The router module shown in Figure 5 implements the routing and network-discovery processes. The AXIOM routing algorithm features cut-through packet transmission with virtual circuits (VCs), and the network-discovery process is initiated at boot time by the master node of the network. After the process completes, every node has its own id and a local routing table, based on which all packets are forwarded to the output links. In case the network topology changes, e.g., a node is added or becomes faulty, the network discovery updates the original topology table that resides in the master node, and all local node routing tables accordingly.

The core router components can be outlined as: i) input buffering, ii) control, and iii) crossbar and link traversal. The input-buffering module consists of four link controllers (LCs), where each link employs queues implementing three VCs to store packets of different priorities. The router uses a Xon/Xoff strategy for notifying adjacent nodes of VC input-buffer availability. If a VC queue reaches a predefined threshold, the router instantly transmits a Xoff packet to the link's adjacent node to block further packet transmission. Similarly, when the VC fullness drops below a certain level, the router instantly transmits a Xon packet to the link's adjacent node to resume packet transmission via this particular link.
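The hysteresis of this strategy can be captured in a few lines of C (a simplified sketch; the threshold values and names are our assumptions):

#include <stdbool.h>

enum { XOFF_THRESHOLD = 12, XON_THRESHOLD = 4 };  /* assumed values */

typedef struct {
  int  occupancy;  /* packets currently buffered in this VC input queue */
  bool stopped;    /* true after Xoff has been sent for this VC         */
} vc_state_t;

/* Called whenever the occupancy of a VC input queue changes; send_ctrl
   emits a Xon (true) or Xoff (false) control packet to the adjacent
   node on this link. */
void update_flow_control(vc_state_t *vc, void (*send_ctrl)(bool xon)) {
  if (!vc->stopped && vc->occupancy >= XOFF_THRESHOLD) {
    send_ctrl(false);          /* block further transmission */
    vc->stopped = true;
  } else if (vc->stopped && vc->occupancy <= XON_THRESHOLD) {
    send_ctrl(true);           /* resume transmission on this VC */
    vc->stopped = false;
  }
}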
The route calculation (RC) finds the required output interface for a packet, based on the routing table and the destination node, starting from the highest VC. If that VC number of the output link is enabled, the packet is forwarded to the corresponding VC allocation (VCA). For each input link, the VCA always attempts to serve the VC with the highest priority, except if its destination-node input VC buffer is blocked; in that case, it falls to the next lower input VC.

During the switch allocation process, the packets from each buffer request a Xbar output. The switch allocation pairs the Xbar inputs to the Xbar outputs as efficiently as possible, trying not to leave an output link idle. If more than one packet requests the same output link, the grant policy decides according to the following rules (sketched in code after this list):
• Priority (Xon/Xoff > VC2 > VC1 > VC0).
• If packets are of the same priority (e.g., both VC2), it chooses one (in a round-robin fashion) to grant an output port, while at the same time looking for available packets of lower priority (VC1 or VC0) on the same input link that require a different port.
• Repeat until no races exist.
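The following C sketch renders this grant policy in software (the data structures are our assumptions; the real allocator is implemented in hardware):

enum { NUM_LINKS = 4, NUM_VCS = 3 };

/* request[in][vc] holds the output port requested by the head packet
   of input link `in`, virtual circuit `vc`, or -1 if that queue is empty. */
int grant_output(int out, const int request[NUM_LINKS][NUM_VCS],
                 int *rr_last) {
  for (int vc = NUM_VCS - 1; vc >= 0; --vc) {   /* VC2 > VC1 > VC0 */
    for (int k = 1; k <= NUM_LINKS; ++k) {      /* round-robin among ties */
      int in = (*rr_last + k) % NUM_LINKS;
      if (request[in][vc] == out) {
        *rr_last = in;                          /* rotate the fairness point */
        return in;                              /* grant this input link */
      }
    }
  }
  return -1;                                    /* no requester: output idle */
}
/* Xon/Xoff control packets (not shown) would pre-empt all three VCs, and
   the allocator repeats over all outputs until no races remain. */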
The use of VCs with priorities ensures that we avoid protocol deadlocks in the network. As we use low priorities for requests, medium priorities for responses and the top priority for acknowledgments, there is no possibility for high-priority packets to clog the network, as they will be fewer than or equal to the requests that were accepted by the network. Thus, in case of high network congestion, acknowledgments will exit the network first, then responses will be sent, and then more requests will be accepted.

Finally, the crossbar module is responsible for forwarding all available packets to their output links. All packets then traverse the physical link to the neighbour node and are stored in the corresponding VC input queue.

Fig. 5: The AXIOM Router Pipelined Architecture
V. AXIOM EVALUATION PLATFORM (AEP)

Design space exploration (DSE) and its automation are an important part of our current performance-evaluation and power-estimation methodologies [22]. The method proposed in AXIOM requires first exploring and modeling parts on the simulator and then, once the DSE is completed, implementing them on the FPGA-based prototypes. This has the considerable advantage of allowing the software stack to be developed early. The AEP is made of two important tools: the HP-Labs COTSon simulator [23] and the Xilinx Zynq based platform. Given the goals of this project, we also needed a more flexible platform for the DSE. The simulation platform is used to better understand bottlenecks (e.g., the congestion on a bus, or the cache size), which are not trivial to track on the FPGA prototyping platform. COTSon also includes an interface to the HP McPAT tool [24] for estimating the power consumption. Table II presents some advantages of using COTSon for our purpose.

TABLE II: Comparison of COTSon with Other Simulators

Features                 Sniper  Graphite  Gem5  MARSx86  COTSon
Timing Directed          No      No        Yes   No       No
Functional Directed      Yes     Yes       No    Yes      Yes
User Level               Yes     Yes       Yes   No       No
Full System Simulation   No      No        Yes   Yes      Yes
Parallel (In node)       Yes     Yes       No    No       No
Parallel (Multi-node)    No      No        No    No       Yes
Shared Cache             Yes     No        Yes   Yes      Yes

COTSon uses the so-called "functional-directed" approach. The simulator permits us to execute full-system simulation. The "mediator" of COTSon represents the model of a switch, and our aim is to modify it to model the behavior of our custom interconnects. The motivation for multiple interconnects derives from the AXIOM project design, which aims to separate the traffic for building a multi-board system from the traffic for the internet-related connection. With the COTSon mediator, we can model both cases. SimNow is the virtual machine (VM), which models all the details of a computer. AMD also provides a separate SDK to model any particular board that has to be plugged in (such as a network card or a GPU).

A. Thread Support

Synchronization and distribution of data can be managed efficiently by reorganizing the execution in such a way that the threads follow more closely the data flow of the program (such as with DF-Threads [25]). DF-Threads can be efficiently implemented by a distributed hardware thread scheduler [26], which supports fault tolerance at the hardware level and efficient fine-grain dataflow thread distribution. To reduce the thread-management overhead, the scheduling needs to be accelerated in hardware, by mapping its structure into the FPGA. A DF-Thread is defined as a function that expects no parameters and returns no parameters; the body of this function can refer to data residing at the memory locations for which it has got a pointer. The DF-Thread APIs [27] are summarized below (a usage sketch follows the list):
• void *DF_TSCHEDULE(bool cnd, void *ip, uint64_t sc): Allocates the resources (a DF-frame of size sc words and a corresponding entry in the distributed thread scheduler, or DTS) for a new DF-Thread and returns a frame pointer fp. The ip is the instruction pointer of the DF-Thread. The allocated DF-Thread is not executed until its sc reaches 0 and the boolean condition cnd is also satisfied.
• void DF_DESTROY(): Releases the allocated resources held by the current DF-Thread.
• uint64_t DF_TREAD(uint64_t offset): Loads the data indexed by offset from the DF-frame of the current thread.
• void DF_TWRITE(uint64_t val, void *fp, uint64_t off): Stores the data val into the DF-frame pointed to by fp at the specified offset off.
• void *DF_TALLOC(uint64_t size, uint8_t type): Allocates a block of memory of size words and returns the pointer (or null), while type specifies the special-purpose memory type.
• void DF_TFREE(void *p): Frees the memory pointed to by p.
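To make the intended use concrete, here is a hypothetical producer/adder pair written against this API (the thread bodies, the frame offsets, and the assumption that each DF_TWRITE decrements the target thread's synchronization count are our own):

#include <stdbool.h>
#include <stdint.h>

/* Prototypes as summarized above (assumed to come from a DF-Threads header). */
void    *DF_TSCHEDULE(bool cnd, void *ip, uint64_t sc);
void     DF_TWRITE(uint64_t val, void *fp, uint64_t off);
uint64_t DF_TREAD(uint64_t offset);
void     DF_DESTROY(void);

void adder(void);     /* DF-Threads take no parameters and return none */
void consumer(void);

void producer(void) {
  void *cfp = DF_TSCHEDULE(true, (void *)consumer, 1); /* waits for 1 write  */
  void *afp = DF_TSCHEDULE(true, (void *)adder, 3);    /* waits for 3 writes */
  DF_TWRITE(40, afp, 0);             /* operand 1 */
  DF_TWRITE(2, afp, 1);              /* operand 2 */
  DF_TWRITE((uint64_t)cfp, afp, 2);  /* where to deliver the result */
}                                    /* adder fires once its sc reaches 0 */

void adder(void) {
  uint64_t a   = DF_TREAD(0);        /* reads come from our own DF-frame */
  uint64_t b   = DF_TREAD(1);
  void    *out = (void *)DF_TREAD(2);
  DF_TWRITE(a + b, out, 0);          /* satisfies the consumer thread */
  DF_DESTROY();                      /* release our frame and DTS entry */
}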
VI. APPLICATION SCENARIOS OF AXIOM

Smart video surveillance and smart home applications are currently hot topics in CPS, and we have customized these two scenarios for our AXIOM platform.
A. Smart Home Living (SHL)

For the SHL case study, we selected a scenario which aims to increase the natural interaction of the user with the house, using both audio and video analysis. Figure 6 shows an overview of the SHL scenario. Real-time processing of multimedia streams is required to enable a natural interaction for the user, as well as the capability to instantly correlate the information extracted from the audio and video data in different ways, related to the actual situation inside and outside the house at the particular time at which the data are collected. Audio and video analysis are also useful both to enhance the security level of the house and to increase the automation potential of the smart house, thus reducing the waste of energy while increasing comfort. The sound streams, coming from both vocal and non-vocal signals, will be analyzed on the FPGA-based SoC. The main computation tasks in the audio processing are: filtering out the noise, voice detection, and extraction of the specific information necessary for the automation of the house. In video processing, the vital steps are frame decompression and the recognition of specific information inside the frames. The results of the audio and video processing are correlated to increase the level of the system's intelligence.

Fig. 6: The AXIOM Smart Home Living (SHL) Scenario

B. Smart Video Surveillance (SVS)

For the SVS case study, we selected an automated smart-marketing scenario involving real-time face detection in crowds while performing demographics estimation (e.g., age, gender and ethnicity). The SVS scenario will employ state-of-the-art cognitive computer vision techniques based on models built from a boosted cascade of classifiers combined with deep convolutional neural networks. A low-power, high-performance inference engine for such models will be implemented in the reconfigurable logic of the SoC using the OmpSs programming model. Since this scenario will analyze high-definition (HD) video feeds, other computational challenges related to video processing must also be addressed. HD video stream decoding (i.e., format parsing, codec implementation, de-muxing and color-space conversion) will be performed by relying on a heterogeneous computing approach that combines single instruction, multiple data (SIMD) instructions with on-die logic blocks.

VII. EVALUATIONS AND RESULTS

In this section, we present some preliminary results for some software and hardware prototypes that the AXIOM project is designing and implementing.

A. OmpSs Timing Results

Table III shows the execution time and GFLOPS of the matrix multiplication of Figure 2 for different execution environments. Those environments are: i) one core of the UDOO x86 cluster [28], ii) four cores of the same node of the UDOO cluster, iii) all cores of the two-node UDOO cluster, and iv) a Zynq ZC706 SoC using the FPGA to accelerate the matrix-multiply tiles. All the results are for a tiled matrix multiply with BS=128 and 1024x1024 matrices. Speedup results are obtained by comparing each environment's result to the UDOO 1-core environment.

TABLE III: OmpSs Experimental Results

Machine                   Execution Time (s)  GFLOPS  Speedup
UDOO: 1 core (1 node)     7.6                 0.28    1
UDOO: 4 cores (1 node)    1.9                 1.13    4
UDOO: 8 cores (2 nodes)   1.3                 1.61    5.7
Zynq 706 board (FPGA)     0.5                 4.06    15.3
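As a quick sanity check of these numbers (our arithmetic, not stated in the paper): a 1024x1024 matrix multiplication performs 2 x 1024^3 ≈ 2.15 GFLOP, so the 7.6 s single-core run corresponds to about 2.15/7.6 ≈ 0.28 GFLOPS, matching the first row of the table.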
On the one hand, the results show that OmpSs@cluster scales quite well inside the node; meanwhile, it seems that there are some overheads that reduce the scalability when using the two nodes of the cluster. In particular, the Ethernet connection may increase the synchronization and communication overhead. Therefore, the use of the high-speed dedicated AXIOM-Link interconnect should help to reduce this overhead and improve the scalability of OmpSs applications. On the other hand, the Zynq ZC706 board result, using FPGA accelerators for the matrix multiply, shows much better performance than the UDOO cluster. It can be stated that the AXIOM platform will outperform the UDOO cluster by at least an order of magnitude.

B. Profiling and Tracing Results

In this sub-section, profiling and tracing results are presented. The cluster profiling results have been obtained using a cluster of UDOO x86's; meanwhile, the one-node traces with fpga task executions are obtained on a Zynq 706 board.

1) Cluster Profiling: Figure 7 shows the execution of the OmpSs matrix multiply (BS=128) of Figure 2 with target device(smp). The cluster has two nodes with four threads per node, each of them executing smp tasks. The Paraver trace has as many horizontal lines as there are threads running OmpSs tasks; therefore, there are eight horizontal lines (one per thread). The different colors indicate different thread states along the execution time of the application. Green flags indicate trace events (e.g., the start/end of a task). The main area colors in the trace have the following meaning: pink areas correspond to task creation on the master thread (top); yellow areas correspond to smp tasks running on the SMP; light red in the master thread (first horizontal line) corresponds to a global task synchronization; and dark red corresponds to the idle state, where threads are doing nothing. The trace shows that the tasks have been evenly distributed among the two UDOO nodes, achieving a promising performance result. In this Paraver trace, the dependences between tasks are not shown for clarity.

Fig. 7: Paraver trace of the OmpSs MxM using 2 UDOO x86 nodes, with 4 threads per node.

2) One Node Profiling: In this case, and for the purpose of presenting an execution trace that helps to detect a performance bottleneck, we have selected a sub-optimal hardware/software co-design of the parameters and task target devices of the tasks of an
Fig. 8: Paraver trace of the OmpSs MxM using 1 SMP (top) and 1 helper thread (bottom) for two FPGA accelerators.

(Scalability plot: curves for No. of Nodes = 1, 2, 4 and a 2*N^3 reference.)
…shows the potential of dataflow-based approaches in the reconfigurable domain.

IX. CONCLUSIONS

The AXIOM platform provides an integrated approach that includes a heterogeneous SoC board (currently with an FPGA), a new high-performance connection link for clustering, and a task-based programming model that can support single-node and multi-node heterogeneous parallel execution, transparently to the programmer. The initial results are encouraging in terms of scalability, while keeping an easy programming model for the programmer.

X. ACKNOWLEDGMENT

This work is partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and HiPEAC (GA 687698), by the Spanish Government through the Programa Severo Ochoa (SEV-2011-0067), by the Spanish Ministry of Science and Technology through the TIN2012-34557 project, and by the Generalitat de Catalunya (contract 2009-SGR-980).

REFERENCES

[1] C. P. S. P. W. Group et al., "Framework for cyber-physical systems," Preliminary Discussion Draft, Release 0.8, 2015.
[2] E. A. Lee, "Cyber physical systems: Design challenges," in Object Oriented Real-Time Distributed Computing (ISORC), 2008 11th IEEE International Symposium on. IEEE, 2008, pp. 363–369.
[3] R. Baheti and H. Gill, "Cyber-physical systems," The Impact of Control Technology, vol. 12, pp. 161–166, 2011.
[4] T. Sanislav and L. Miclea, "Cyber-physical systems - concept, challenges and research areas," Journal of Control Engineering and Applied Informatics, vol. 14, no. 2, pp. 28–33, 2012.
[5] J. Sztipanovits, S. Ying, I. Cohen, D. Corman, J. Davis, H. Khurana, P. Mosterman, V. Prasad, and L. Stormo, "Strategic R&D opportunities for 21st century cyber-physical systems," Technical Report for Steering Committee for Foundation in Innovation for Cyber-Physical Systems: Chicago, IL, USA, 13 March, Tech. Rep., 2012.
[6] E. Geisberger and M. Broy, Living in a Networked World: Integrated Research Agenda Cyber-Physical Systems (agendaCPS). Herbert Utz Verlag, 2015.
[7] J. H. Marburger, E. F. Kvamme, G. Scalise, and D. A. Reed, "Leadership under challenge: Information technology R&D in a competitive world. An assessment of the federal networking and information technology R&D program," DTIC Document, Tech. Rep., 2007.
[8] D. Theodoropoulos, D. Pnevmatikatos, C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, C. Segura et al., "The AXIOM project (Agile, eXtensible, fast I/O Module)," in Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on. IEEE, 2015, pp. 262–269.
[9] C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, D. Theodoropoulos, D. N. Pnevmatikatos, C. Scordino et al., "The AXIOM software layers," in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 117–124.
[10] "Contrex," 2016 (accessed June 16, 2016). [Online]. Available: https://contrex.offis.de/home/
[11] "Dreams," 2016 (accessed June 16, 2016). [Online]. Available: https://www.uni-siegen.de/dreams/home/
[12] W. Weber, A. Hoess, F. Oppenheimer, B. Koppenhoefer, B. Vissers, and B. Nordmoen, "EMC2 a platform project on embedded microcontrollers in applications of mobility, industry and the internet of things," in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 125–130.
[13] S. Trujillo, A. Crespo, A. Alonso, and J. Pérez, "MultiPARTES: Multi-core partitioning and virtualization for easing the certification of mixed-criticality systems," Microprocessors and Microsystems, vol. 38, no. 8, pp. 921–932, 2014.
[14] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "OmpSs: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011.
[15] J. Balart, A. Duran, M. Gonzàlez, X. Martorell, E. Ayguadé, and J. Labarta, "Nanos Mercurium: a research compiler for OpenMP," in Proceedings of the European Workshop on OpenMP, vol. 8, 2004, p. 56.
[16] B. S. Center, "Extrae instrumentation library," 2016 (accessed June 16, 2016). [Online]. Available: http://www.bsc.es/computer-sciences/extrae
[17] V. Pillet, J. Labarta, T. Cortes, and S. Girona, "Paraver: A tool to visualize and analyze parallel code," in Proceedings of WoTUG-18: Transputer and occam Developments, vol. 44, Mar. 1995, pp. 17–31.
[18] J. Bueno, X. Martorell, R. M. Badia, E. Ayguadé, and J. Labarta, "Implementing OmpSs support for regions of data in architectures with multiple address spaces," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM, 2013, pp. 359–368.
[19] D. Bonachea, "GASNet specification, v1.1," 2002.
[20] M. P. I. Forum, "MPI: A message-passing interface standard, version 3.0," 2016 (accessed June 16, 2016). [Online]. Available: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
[21] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: An OpenMP tools application programming interface for performance analysis," in OpenMP in the Era of Low Power Devices and Accelerators. Springer, 2013, pp. 171–185.
[22] C. Silvano, W. Fornaciari, G. Palermo, V. Zaccaria, F. Castro, M. Martinez, S. Bocchio, R. Zafalon, P. Avasare, G. Vanmeerbeeck et al., "Multicube: Multi-objective design space exploration of multi-core architectures," in VLSI 2010 Annual Symposium. Springer, 2011, pp. 47–63.
[23] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, "COTSon: infrastructure for full system simulation," ACM SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52–61, 2009.
[24] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 469–480.
[25] R. Giorgi and P. Faraboschi, "An introduction to DF-Threads and their execution model," in Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on. IEEE, 2014, pp. 60–65.
[26] R. Giorgi and A. Scionti, "A scalable thread scheduling co-processor based on data-flow principles," Future Generation Computer Systems, vol. 53, pp. 100–108, 2015.
[27] R. Giorgi, "TERAFLUX: exploiting dataflow parallelism in teradevices," in Proceedings of the 9th Conference on Computing Frontiers. ACM, 2012, pp. 303–304.
[28] UDOO, "UDOO x86: The most powerful maker board ever," 2016 (accessed June 16, 2016). [Online]. Available: https://www.kickstarter.com/projects/udoo/udoo-x86-the-most-powerful-maker-board-ever
[29] R. Giorgi, "Scalable embedded systems: Towards the convergence of high-performance and embedded computing," in Embedded and Ubiquitous Computing (EUC), 2015 IEEE 13th International Conference on. IEEE, 2015, pp. 148–153.
[30] D. Retkowitz and S. Kulle, "Dependency management in smart homes," in Distributed Applications and Interoperable Systems. Springer, 2009, pp. 143–156.
[31] C. Reinisch, M. Kofler, F. Iglesias, and W. Kastner, "ThinkHome energy efficiency in future smart homes," EURASIP Journal on Embedded Systems, vol. 2011, no. 1, pp. 1–18, 2011.
[32] J. Shi, J. Wan, H. Yan, and H. Suo, "A survey of cyber-physical systems," in Wireless Communications and Signal Processing (WCSP), 2011 International Conference on. IEEE, 2011, pp. 1–6.
[33] F. Bernier, J. Ploennigs, D. Pesch, S. Lesecq, T. Basten, M. Boubekeur, D. Denteneer, F. Oltmanns, F. Bonnard, M. Lehmann et al., "Architecture for self-organizing, co-operative and robust building automation systems," in Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE. IEEE, 2013, pp. 7708–7713.
[34] L. Verdoscia and R. Giorgi, "A data-flow soft-core processor for accelerating scientific calculation on FPGAs," Mathematical Problems in Engineering, vol. 2016, no. 1, pp. 1–21, 2016.