Alexander Biedermann: Design Concepts for a Virtualizable Embedded MPSoC Architecture – Enabling Virtualization in Embedded Multi-Processor Systems. Springer Vieweg, Wiesbaden, 2014.
Springer Vieweg
© Springer Fachmedien Wiesbaden 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of illus-
trations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.
[Figure 1.1: The Design Gap: Growth of available Computing Power (Transistors per Chip) outperforms the Productivity of System Designers (designed Transistors per Staff-Month) over Time [Belanovic 2003].]
The instruction set of the MicroBlaze, the processor employed for the reference implementation in the scope of this work, lists 87 instructions [Xilinx, Inc. 2012] – a fraction of the 434 instructions provided by the Intel 64 and IA-32 CISC architectures.
simplified chip design furthermore reduces the transistor count. Consequently, the most significant constraint for embedded designs may be met: device cost. An embedded processor has to be as cheap as possible in order to lower the overall device cost. In exchange, the performance of embedded processors is usually lower than that of processors employed in personal computers. As a consequence of the lower transistor count and the simpler chip design, the power consumption is also lower. Since many of today's embedded computers, such as smartphones or mp3 players, are mobile, reducing energy consumption is an essential requirement to lengthen battery life. The simpler chip design is further accompanied by considerably reduced interfacing. This will be an important property regarding the envisaged shift of task execution.
As low device cost is desired, embedded soft-core processors may be targeted. In contrast to the usual hard-wired integrated circuits, the so-called hard cores, a soft-core processor is solely represented either by a behavioral description given in a hardware description language (HDL), such as Verilog or VHDL, or by a netlist. The description of a soft-core processor may be transformed into a hardware design by a synthesis process. Here, the description is mapped to primitives of the targeted chip, e.g., to look-up tables and registers of an FPGA.
Soft-core processors feature advantages that are not present for hard-core processors. Due to their representation in an HDL, their behavior may be modified by altering the corresponding hardware description. However, the vendors of commercial soft-core processors may restrict such modification, e.g., by encrypting the files containing the hardware description. Open-source processors, such as the PicoBlaze [Xilinx, Inc. 2013b] or the SecretBlaze [Barthe 2011], nevertheless allow the modification of their hardware description.
Besides this manual adaptation, soft-core processors may feature several pre-defined customizations. Based on the intended use, e.g., floating-point units, multipliers, or barrel shifters may be activated. During synthesis, the corresponding hardware descriptions of the desired functionality are included. Therefore, in contrast to a hard-wired processor, a soft-core processor may be tailored to its application purpose by disabling unnecessary functions; cost is thus reduced by the resulting lower resource consumption.
As a soft-core processor is available as a hardware description, multi-processor systems may easily be designed by instantiating the processor description multiple times in a top-level hardware description, as sketched below. Thus – given sufficient resources on the targeted FPGA – a multi-processor system may be designed without an increase in cost for the additional soft-core processors, whereas in a hard-core processor design, each additional core also causes additional cost. However, no multi-core soft-core processors are commonly available yet. Therefore, multi-processing on soft-core processors relies on instantiating a set of soft-core processors.
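As an illustration of such replication in a top-level description, consider the following VHDL sketch; the entity soft_core_processor and its ports are placeholders and do not reflect the actual MicroBlaze interface.

```vhdl
-- Sketch: instantiating a soft-core processor several times via a
-- generate loop; entity and port names are illustrative placeholders.
library ieee;
use ieee.std_logic_1164.all;

entity mpsoc_top is
  generic ( NUM_CORES : positive := 3 );  -- bounded only by FPGA resources
  port ( clk : in std_logic;
         rst : in std_logic );
end entity mpsoc_top;

architecture rtl of mpsoc_top is
begin
  -- each iteration creates one further processor instance at no
  -- additional cost for the soft core itself
  cores : for i in 0 to NUM_CORES - 1 generate
    core_inst : entity work.soft_core_processor
      port map ( clk => clk,
                 rst => rst );
  end generate cores;
end architecture rtl;
```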
The main drawback of soft-core processors is performance, which is usually lower than that of hard-wired embedded processors. A hard-wired processor, whose placement has undergone several optimization steps, will always outperform a processor with similar behavior whose description has to be mapped onto existing chip primitives. For the former, structures and routes on the chip may be tailored to the processor, whereas for the latter, the processor is tailored to the target chip.

[Figure 2.1: A MicroBlaze Soft-Core Processor System with Instruction and Data Memory attached via dedicated Instruction and Data Memory Controllers.]
This work will demonstrate the virtualization concept by exploiting the commer-
cial MicroBlaze soft-core processor, which is provided by the device vendor Xilinx,
Inc. [Xilinx, Inc. 2013a]. The files containing the hardware description are encrypted
and cannot be modified by a designer. However, since this work will demonstrate
an approach to enable virtualization features without modifying existing processor
designs, a modification of the hardware description is neither necessary nor desired.
The MicroBlaze is a common 32-bit RISC processor, which features a five stage
pipeline in its default set-up. It is designed as Harvard architecture; therefore, instruc-
tion and data memory are separated from each other. After synthesis, instructions and
data reside in BlockRAM (BRAM), memory primitives on FPGAs. Instructions and
data are transferred between memory and the processor by two dedicated memory
controllers. A processor system containing the MicroBlaze IP core, as well as instruc-
tion and data memories with their corresponding memory controllers is depicted in
Figure 2.1. Further details of the processor architecture are provided later in this work, where they become of particular importance.
Before highlighting the enhancements of the processor architecture depicted in
Figure 2.1 towards a virtualizable multi-processor design, the intended purpose of the
virtualization procedure is motivated.
[Figure 2.2: Task Management by a Kernel (a) compared to dedicated Processor Resources for each Task (b), illustrated with Tasks A, B, and C.]
To motivate how the virtualization concept enhances embedded system design, two common design alternatives for embedded processor systems are discussed first.
In the first alternative, a kernel manages the access of tasks to a processor, cf. Fig-
ure 2.2, left hand side. Both kernel and tasks reside in the same memory area. In this
system, all the tasks and the kernel are statically bound to the processor. The employ-
ment of a kernel eases the scheduling of tasks on the processor. Furthermore, individual
tasks may dynamically be excluded from processor access or may temporarily get a
higher priority assigned. Unlike for personal computers, memory in embedded devices
is often of limited size. Therefore, despite the convenient task handling, a kernel may
add an unwanted overhead in terms of memory. Additionally, the switching process
between tasks is time consuming. The Xilkernel, a kernel for the MicroBlaze processor provided by the device vendor Xilinx, Inc., takes approximately 1,360 clock cycles to switch between tasks. As embedded systems may face harsh timing constraints,
the additional time required for a task switch is possibly not acceptable. In addition,
if an embedded system is employed in a safety-critical environment, the usage of a
kernel may pose a significant safety risk in case of address space violations of tasks if
no memory management is exploited. All the tasks in the system reside in the same
memory. A faulty task might thus alter memory sections of a task relevant to security.
Therefore, in many systems that feature safety-critical tasks, switching tasks by a kernel is avoided and the second alternative, described next, is chosen.
In the second alternative, each task features a dedicated processor, cf. Figure 2.2,
right hand side. Since there are no other tasks running on a processor, there is no need
for scheduling or for an underlying kernel. This eliminates the overhead both in terms
of memory and time caused by a kernel. Moreover, aside from task communication, e.g.,
via buses, tasks are logically and physically separated from each other. This prevents
harmful mutual task interference. Furthermore, since each task may occupy all of
the processor time of its processor, performance may be higher compared to a single
processor system that features a kernel. The drawback of this solution is the tremendous resource overhead.
Virtualization Layer
In order to fulfill these requirements, the following sections will present the pre-
requisites necessary to enable virtualization features.
[Figure: The register sets (R1–R31 and the Machine Status Register, MSR) of two MicroBlaze soft-core processors, illustrating a task's context inside the processor core.]
Upon this task’s reactivation, the content is read back into the processor’s register set.
Accordingly, the proposed virtualization solution aims at extracting a task’s context
during the deactivation phase and at restoring this context during its reactivation. A
comparable approach for an embedded multi-processor system was highlighted in
[Beaumont 2012]. However, only a subset of the processors’ registers was considered.
In contrast, the virtualization procedure will take the full context inside a processor
core into consideration. For hardware modules with internal states, the works of
[Kalte 2005] and [Levinson 2000] highlight context extraction, which is based on a
readback of the FPGA’s configuration. While a readback of the FPGA configuration
could also be exploited to determine the current state of a software task, this would
limit the presented approach to FPGA architectures and, furthermore, would require
an off-chip resource managing this context extraction. Thus, the present approach will handle task context extraction in a more convenient way, independent of the actual chip architecture. The context elements to be saved during a virtualization procedure are now briefly discussed.
Instruction and Data Memory The instruction memory is a read-only memory and,
therefore, its state does not change during task execution. Thus, no special treatment is
needed for the instruction memory. During task execution, the data memory is read
and written, cf. Figure 2.4. With each write operation, the state of the data memory
is altered. During deactivation of a task, the current state of the data memory has to
be preserved and prevented from being altered. This may easily be achieved, e. g., by
detaching the data memory from the processor during the deactivation phase.
Program Counter The program counter is managed by the processor. By default, after
executing an instruction, the program counter is set by the processor to point to the
next instruction in the instruction memory. This is depicted by the arrow pointing at
the instruction memory in Figure 2.4. A task may alter the program counter address,
e. g., by jump or branch instructions. Here, the specific instruction or the result of its
computation defines the new value of the program counter. As processors may feature
a pipeline and, additionally, the computation of a new program counter address may
take several clock cycles, predicting the program counter address is complex. During
computation of a new program counter address, the instructions immediately following
the jump or branch instruction are usually not executed by the processor. Some branch instructions, however, explicitly activate the execution of the instruction immediately following them in a so-called "delay slot". Thus, when interrupting a task's
execution, it has to be ensured that all program counter address calculations have been
completed before saving the program counter address. The saved program counter
address has to point at the instruction that will be executed right after the reactivation
of the task. Failing this, gaps in the instruction execution or duplicated execution of
instructions will occur. Both scenarios cause an erratic behavior of the task and have
to be strictly avoided. Section 2.2.5 will detail the procedure of saving the program
counter address.
Internal Processor State The part of the task context that is represented by the internal state of the processor executing the task is defined by the content of the processor's registers. The register set of a processor consists of several general purpose
registers, which may be addressed by a task, and state registers. State registers
save information about past instructions, such as carry information after arithmetic
operations. Changes in the register content of the processor are highlighted in Figure 2.4,
lower portion of right hand side. To preserve the context of a task, the content of the
registers has to be extracted out of the processor when deactivating a task. A solution
is to output the register content on the data memory interface of the processor and to
save it in a dedicated memory. For this purpose, a so-called code injection approach will
be exploited. This includes the injection of a dedicated portion of machine code into
the processor in order to output its register contents on the data memory interface.
After having defined which information has to be preserved during a task switching procedure, the Virtualization Layer that implements the task switch is now introduced. For demonstration purposes, the Virtualization Layer is highlighted for a single-
processor system at first. For this system, a transparent deactivation and reactivation of
a single task is detailed. Afterwards, this concept will be expanded to feature an array
of processors and tasks.
[Figure 2.5: A MicroBlaze Processor System with the Virtualization Layer encapsulating the Memory Controllers between the Instruction and Data Memory and the MicroBlaze Soft-Core Processor.]
4 As a side effect, the commercial design tools of the device vendor Xilinx, Inc. recognize the entire structure as a valid processor system. Adding the Virtualization Layer instead as an IP core between the memory controllers and the processor or the memories, respectively, causes synthesis errors due to failing the so-called Design Rule Checks, i.e., automatic procedures that check for valid system designs.
5 For the MicroBlaze, there are 32 general purpose registers as well as the Machine Status Register (MSR). However, general purpose register R0 contains the static value '0' and, thus, does not have to be saved explicitly.
[Figure 2.6: The Virtualization Layer with Code Injection Logic and Task Context Memory: the added IP cores encapsulate the Instruction and Data Memory Controllers between the Task Memories and the MicroBlaze Soft-Core Processor; an external Trigger invokes the Virtualization Procedure; Data and Instructions are distinguished from Control Signals.]
For the prototype based on the MicroBlaze processor, all these signals sum up to a bit vector 296 bits wide. Therefore, the multiplexer displayed in Figure 2.6 is a simplified representation. Details of the processor-task memory interconnect are given in Section 2.3.
An external signal connected to the CIL may now trigger the virtualization procedure,
which will halt the task’s execution and extract its context. Upon triggering the external
signal again, the context is restored and the task resumes its execution from the point
of its interruption. The following section will highlight this procedure in detail.
As soon as the Virtualization Layer receives a signal from the external trigger depicted
in Figure 2.6, the halting sequence is invoked. The instruction memory is immediately
detached from the processor interface. Instead, the CIL is routed to the processor. The CIL thereby mimics the behavior of the instruction memory controller. Therefore, the
processor assumes that it still fetches instructions from its instruction memory. At this point in time, some instructions still wait in the pipeline of the processor to be executed.6 The task's execution will be interrupted after the instructions queued in the pipeline have passed through it.
The first step is to identify and preserve the program counter address of the instruc-
tion that will be executed after the reactivation of a task. In the default case, this is
just the next instruction in the instruction memory to be addressed by the program
counter. However, as outlined in Paragraph 2.2.4, ongoing jump and branch instructions
have to be considered as well. The solution to this problem is to let the CIL inject a sequence of store word immediate commands into the processor. This instruction is part of the default instruction set of almost every processor, which is in accordance with Postulate 7 (No modification of the processor core, usage of off-the-shelf processors). The instruction store word immediate outputs the content of a processor register on the data memory interface. This value is then to be written to the data memory at the address defined in the store word immediate instruction. However, as soon as the instructions originally queued in the pipeline have passed through it, the data memory is detached from the processor. Thus, data being output on the data memory interface is not actually written into the data memory, but discarded. Instead, the sequence of store word immediate instructions is exploited to determine which instruction from the instruction memory would actually be executed next. This procedure is denoted in Algorithm 1.
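To make the injected sequence concrete, the following VHDL helper sketches how the CIL could assemble one store word immediate instruction per register, writing register rD to a distinct word address. The package is illustrative and not part of the original design; the SWI major opcode value 0b111110 is taken from the MicroBlaze ISA reference and should be verified against the processor version in use.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper for the Code Injection Logic: assembles the
-- instruction word for "swi rD, r0, 4*rD", which outputs register rD
-- on the data memory interface at a distinct word offset.
package cil_pkg is
  function swi_word(rd : natural range 0 to 31) return std_logic_vector;
end package cil_pkg;

package body cil_pkg is
  function swi_word(rd : natural range 0 to 31) return std_logic_vector is
    variable w : unsigned(31 downto 0) := (others => '0');
  begin
    w(31 downto 26) := "111110";            -- SWI major opcode (assumed)
    w(25 downto 21) := to_unsigned(rd, 5);  -- register whose content is stored
    -- base register field stays r0; the immediate selects the target address
    w(15 downto 0)  := to_unsigned(4 * rd, 16);
    return std_logic_vector(w);
  end function swi_word;
end package body cil_pkg;
```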
A special treatment is needed if a branch instruction featuring a delay slot, a return statement, or an immediate instruction is detected in the instruction stream as the last instruction to regularly enter the processor pipeline before a virtualization procedure is triggered. In this case, the procedure is withheld for one clock cycle. Otherwise, the instruction in the delay slot would be marked as the instruction to be executed right after the task's reactivation. While this would be the correct address, the program would then continue with the instruction following the delay slot. The original branch target addressed by the branch instruction would be lost. By delaying the procedure by one clock cycle, i.e., by interrupting the execution after the delay slot has entered the pipeline, the instruction at which the task's execution will resume is the one addressed by the branch instruction.
Another issue when interrupting task execution may arise from atomic instruction sequences. Such sequences may not be interrupted in order to ensure correct system behavior. If a processor detects an atomic instruction sequence, e.g., the invocation of interrupt service routines is blocked despite the occurrence of an interrupt. The purpose of such sequences is, e.g., to realize transferring bursts of data to a communication partner that expects to receive or send a data word every instruction cycle. The behavior of the Virtualization Layer may be adapted to wait for the end of such an atomic instruction sequence by monitoring the instruction stream before invoking the halting sequence.

The content of the status registers is extracted by injecting a predefined sequence of instructions stored in the CIL, whose results depend on the current values of the status bits in the status registers. Based on the results, the value of the status registers may be derived.
By now, all relevant context elements, which were identified in the previous Section,
are stored. The steps performed in order to extract and preserve a task’s context are
denoted in Algorithm 2.
Processor Condition
Having extracted the task's context, the processor remains attached to the CIL and TCM. After executing the last command injected by the CIL, the processor requests the next instruction via its instruction memory interface. As long as the CIL does not input another instruction, the processor remains in this state. Instead of triggering a
task switch as highlighted in the following section, a processor may also be deactivated
at the end of phase 1. This may be achieved by disabling its clock input. In doing so,
the energy consumption in a (multi-)processor system may be reduced. Section 2.7 will
give details about the energy management features and measurements regarding the
amount of energy saving by temporarily disabling a processor.
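As a sketch of how such a clock deactivation could be realized on the targeted Xilinx FPGAs, a clock buffer with clock-enable may be used; the wrapper below is illustrative and assumes the UNISIM library with its BUFGCE primitive.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

-- Illustrative wrapper: glitch-free gating of a soft-core's clock input.
entity core_clock_gate is
  port ( clk_in      : in  std_logic;
         core_active : in  std_logic;   -- deasserted to deactivate the core
         clk_core    : out std_logic ); -- gated clock driving the processor
end entity core_clock_gate;

architecture rtl of core_clock_gate is
begin
  gate : BUFGCE
    port map ( I  => clk_in,
               CE => core_active,
               O  => clk_core );
end architecture rtl;
```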
In favor of a clear illustration, the CIL and TCM as well as the memory controllers that correspond to a task are from now on referred to as a Virtbridge, cf. Figure 2.7, right hand side. A Virtbridge is statically assigned to a dedicated task, i.e., to its instruction and data memory. Up to now, only a single-processor system has been considered. This system is expanded in the following to feature further processors and corresponding tasks, cf. Figure 2.8. Note that each task resides in a dedicated memory region and the memories are completely disjoint. An additional controller now manages
[Figure 2.7: The IP Cores to facilitate Context Extraction and Restoration – the Code Injection Logic, the Task Context Memory, and the Task's Instruction and Data Memory Controllers – are subsumed into a Virtbridge between the Task Memories and the MicroBlaze Soft-Core Processor; an external Trigger invokes the Virtualization Procedure.]
the evocation of the virtualization procedures. After a task has been halted by the virtualization procedure denoted in Algorithms 1 and 2, it is not required to resume its execution on its original processor. Instead, the task's context may be restored
on another processor in the system that is connected to the Virtualization Layer, i. e.,
the task-processor binding may be updated. A task-processor binding is defined in
Definition 1; a convenient notation for bindings is given by the so-called Binding Vectors
(BV), cf. Definition 2. Consequently, if now two or more tasks are halted by means of
the virtualization procedure, the halted tasks may switch their corresponding processor.
As depicted in Figure 2.8, this may be achieved, e. g., by means of a switching element
that changes the routing of task memories to processors. Section 2.3 will introduce a
sophisticated interconnect solution.
Definition 1 (Task-Processor Binding). A task-processor binding Bi is a bijective function which maps a task ti of a set of tasks t1, t2, ..., tn ∈ T to a processor pi of a set of processors p1, p2, ..., pm ∈ P: Bi : T → P, ti ↦ pi.

Definition 2 (Binding Vector). A binding vector BVi = ((tw → px), (ty → pz), ...) is composed of bindings Bi.
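For illustration, consider tasks T = {t1, t2, t3} and processors P = {p1, p2, p3}: the binding vector BV1 = ((t1 → p2), (t2 → p1)) binds t1 to p2 and t2 to p1, while t3 and p3 remain unbound – a situation the following paragraph explicitly permits.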
Note that in a multi-processor system featuring the Virtualization Layer, a task does
not have to be bound to a processor all the time. Instead, a task may be halted by the
virtualization procedure and may then remain disconnected from a processor for an
arbitrary amount of time. Furthermore, processors do not necessarily have to be bound
to a task. These two features enable transparent sharing of a processor resource, cf.
Section 2.2.6. In consequence, a binding vector BVi does not necessarily cover all tasks
and processors present in the multi-processor system. Tasks and processors not covered by a binding vector BVi remain unconnected. Without the preceding virtualization procedure, simply routing a task to another processor during normal task execution would lead to erroneous task behavior, as the parts of the task's context that are stored inside the registers of the original processor would not be transferred to the newly bound processor.

[Figure 2.8: A virtualizable multi-processor system: a controller triggers the virtualization procedures, and a switching element routes the task memories to processors 1–3.]
Algorithm 3 outlines the steps performed by the controller depicted in Figure 2.8 in order to facilitate a task switch. As the controller is implemented as a hardware IP core, all the steps in the for-loops are executed in parallel. The question of when and how to define a new binding will be discussed in Chapter 3. For now, the new binding is given as additional information via the external trigger signal. After the binding has been updated by adapting the routing of task memories to processors, the tasks are still halted. In order to resume task execution, phase 3, the context restoration, is triggered.
As detailed before, a task may either be just halted or a new processor binding may
be realized by the virtualization procedure. In order to resume task execution, the
context of the task has to be restored on the processor, which is now connected to the
task. Therefore, load word immediate instructions injected by the CIL access the
TCM and load its content sequentially into the register set of the processor. Again,
the MicroBlaze features a dedicated instruction to set the MSR. In case that such an
instruction is not available for other status registers or other processor types, a sequence
of instructions may be inserted whose execution sets the bits in the status register(s)
accordingly. Since the current content of the status registers is unknown until task
deactivation, this instruction sequence would have to be set dynamically in order
to trigger the specific bit pattern in the status registers after its execution. For the
prototype based on the MicroBlaze processor, this special treatment is not necessary in
the processor’s standard configuration. After having restored the context information,
the data memory of the task is reattached to the processor. An unconditional jump to the previously saved program counter address and routing the instruction memory
to the processor’s instruction memory interface conclude the virtualization procedure.
The execution of the task then seamlessly continues from the point of its previous
interruption. Algorithm 4 denotes the steps performed during context restoration.
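Mirroring the store sequence of phase 1, the restore sequence could be assembled by a counterpart helper such as the following; as before, the LWI major opcode value 0b111010 is an assumption taken from the MicroBlaze ISA reference, and the function would live in the hypothetical cil_pkg sketched earlier.

```vhdl
-- Hypothetical counterpart to swi_word: assembles "lwi rD, r0, 4*rD",
-- which reloads register rD from its slot in the Task Context Memory.
function lwi_word(rd : natural range 0 to 31) return std_logic_vector is
  variable w : unsigned(31 downto 0) := (others => '0');
begin
  w(31 downto 26) := "111010";                 -- LWI major opcode (assumed)
  w(25 downto 21) := to_unsigned(rd, 5);       -- destination register rD
  w(15 downto 0)  := to_unsigned(4 * rd, 16);  -- same offset used when saving
  return std_logic_vector(w);
end function lwi_word;
```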
[Figure 2.9: Sharing of a Processor Resource via the Virtualization Procedure over Time (controller with processors 1 and 2, shown at two points in time with swapped task-processor bindings).]
As detailed in the previous Section, the task-processor bindings may be updated by the
virtualization procedure. In consequence, a processor may be occupied by several tasks
over time. Altering the tasks that are bound to a processor is equivalent to common
multi-tasking schemes. However, due to the properties of virtualization, neither the tasks nor the processor have or need any knowledge of the fact that more than one task makes use of the processor resource. As an advantage, the tasks' memories
remain strictly separated all the time. Even during a virtualization procedure, their
contexts are kept separate by design. This prevents harmful memory access violations
by faulty tasks and is compliant to Postulate 4 (Strict encapsulation and isolation of tasks).
Figure 2.9 depicts the processor resource sharing between two tasks over time via the
Virtualization Layer. In Section 2.4, the benefits of an interconnection network will be
exploited in order to enable convenient and fast processor resource sharing among tasks.
The feasibility of transparent resource sharing cannot be evaluated without considering the timing characteristics of the virtualization procedure. Furthermore, the ability to interrupt tasks at any time is a prerequisite for reliable processor resource sharing.
The design of the Virtualization Layer guarantees that tasks can be interrupted at
any point in time. This is achieved by detaching the instruction memory as soon as
a virtualization procedure is triggered. However, to guarantee this property, several
constraints apply. Interrupt and exception handling may come into conflict with the virtualization procedure, as interrupts and exceptions also interrupt normal task execution. In consequence, an interrupt or exception may distort the virtualization
procedure. Moreover, since the instruction memory is detached during the virtualization procedure, the interrupt service routine or the exception handling routine, which are usually stored inside the instruction memory, cannot be invoked upon the occurrence of an interrupt or exception. As a solution, interrupts and exceptions can be suppressed during the virtualization phase. However, for some interrupt-based systems, this may
[Figure 2.10: The Timing of the Virtualization Procedure for Halting and Resuming Task Execution for a Xilinx MicroBlaze Soft-Core Processor: halting (saving the General Purpose Registers, then waiting for the End of Write Operations) takes 44 Clock Cycles; resuming (restoring the General Purpose Registers after the Signal to Resume, then jumping to the saved PC Address) takes 45 Clock Cycles.]
Timing Properties
The timing of the virtualization procedure is mainly affected by the size of the register
set of the processor type employed. As visible from Figure 2.10, which denotes the
timing of phases 1 and 3 of the virtualization procedure for a MicroBlaze soft-core
processor, the transfer of the processor registers’ contents into the TCM takes the
largest amount of time spent during the virtualization procedure. The duration of
the procedure scales linearly with the number of processor registers. Time spent for
flushing the pipeline, as well as handling of the status register is negligible. Not
depicted in Figure 2.10 is the time needed for the actual switch of the task-processor
bindings. Section 2.3 will introduce a task-processor interconnection network, which
computes the routing of the next task-processor binding to be applied in parallel to
phase 1 of the virtualization procedure. Thus, the actual task switch can be performed
within one clock cycle as soon as phase 1 has been completed. For the prototype
implementation, a complete task switch, i. e., the context extraction of one task as well
as the reactivation of another task on the same processor is, therefore, handled within
89 clock cycles. For comparison, the time needed for a conventional kernel-based task switch was measured for the same processor type. Here, the Xilkernel, a kernel provided by Xilinx, Inc. for the MicroBlaze processor, was employed. Measurements have revealed that a conventional task switch employing the Xilkernel takes around 1,400 clock cycles, i.e., up to fifteen times longer than the virtualization procedure.
Thus, the virtualization procedure provides a fast way for a transparent update of
task-processor bindings, which outperforms conventional solutions and, thereby, offers
significant advantages such as task isolation and transparent resource sharing.
Task Memories
In the presented approach, tasks are stored in completely disjoint memory areas. Even during binding updates, these memories remain strictly separated at all times. A task cannot access the memory region of any other task. Furthermore, each TCM is statically assigned to a dedicated task. In doing so, no mix-up of task contexts can occur. This prevents harmful memory access violations. In safety-critical systems, scheduling safety-critical tasks together with other tasks of the system is strictly avoided: a task might either corrupt the processor's state or even block the entire processor system due to an unforeseen stall. In this case, the safety-critical task would also be affected.
However, the strict separation of task memories and contexts as well as the guaranteed
task interruption of the virtualization approach may enable processor resource sharing
even for safety-critical tasks. In an application example in Chapter 4.1, tasks relevant to
safety with harsh timing constraints will be taken into consideration.
A byproduct of the disjoint memory design is a simplified placement on the target chip. FPGAs are not optimized to feature the large multi-port memories that would be necessary if the chosen memory design were abandoned. In that case, the inefficient synthesis as well as the more difficult placement of such a memory might significantly deteriorate the performance of the system.
A prototype system consisting of three processors and tasks has been implemented for a Virtex-5 FPGA from the vendor Xilinx, Inc. Table 2.1 compares the synthesis results of a
static three-processor solution, i. e., processors that do not feature any task migration or
resource sharing, to a three-processor system that is connected by a Virtualization Layer.
7 In its default configuration, the MicroBlaze features 32 general purpose registers as well as the MSR. Instead of 33 consecutive reads, 1 read would be sufficient to extract the complete context.
Table 2.1: Comparing Static and Virtualizable 3-Processor Solutions for a Virtex-5 LX110T FPGA.

                     Conventional System   Virtualizable System
  Resource Overhead  —                     4,649 LUTs, 3,222 FFs
The TCMs are implemented in flip-flop memory. The overhead introduced by the
Virtualization Layer occupies just about 5 % of the available resources on the employed
Virtex-5 LX110T FPGA. The maximum frequency of the virtualizable solution is lower
than for a static processor system. This is due to two reasons. The virtualization logic
adds a combinatorial delay to the task-processor path. Moreover, the multiplexer logic
that is exploited to adapt the routings between tasks and processors further elongates
combinatorial paths. As Section 2.3.4 will discuss, adding buffers into the path between
tasks and processors is not feasible.
Figure 2.11 depicts a placement map for a design on a Virtex-5 FPGA, which features
three soft-core processors highlighted by numbers as well as the Virtualization Layer,
which is highlighted in light grey. Although one can clearly derive from the Figure
that the design is far from exceeding the logic resources available on the FPGA, ex-
panding the design to feature more than three processors leads to unresolvable routing
constraints. This is caused by the interconnection of tasks and processors. An overlay
in dark grey depicts the connectivity of the design. As every task has to be connected
to each processor, a high wiring complexity is caused by the employed multiplexer
structures. As this limits scalability of the solution, the following section will introduce
a sophisticated task-to-processor interconnection scheme. This interconnect will not
only ease the routing but also feature intrinsic mechanisms for task scheduling and
updating bindings during runtime.
Figure 2.11: Floorplan of a virtualizable System featuring three Processors and the Virtual-
ization Layer (light grey) on a Virtex-5 LX110T FPGA. The Area in dark grey
depicts the Connectivity.
of their adjacent task. This renders infeasible any interconnection solution that would deliver instructions to the processors sequentially. Therefore, the envisaged solution has to feature a simultaneous interconnection of all processors to their task memories.
The virtualization procedure, which was introduced in the previous Section, may
update the task-processor bindings during runtime. In consequence, this requires
dynamically routing a task to another processor. As not only the binding of one task-
processor relation may be updated, but, in contrast, all of the processors may feature a
new task assignment, the routes from tasks to processors may differ completely among
two different binding vectors. Routing is a problem of high complexity: running an on-demand synthesis that computes the routes necessary for a binding and applying the result via dynamic partial reconfiguration on an FPGA is not feasible. This is due to both the time needed for synthesis, which lies in the range of minutes, if not hours, and the computational effort, which requires a powerful processor as well as a decent amount of RAM – neither of which is part of the usual equipment of an embedded system. Another solution would be to precalculate all binding possibilities
on a workstation and to store these routing results in an off-chip memory. From
there, the required routing may be selected and applied, again by dynamic partial
reconfiguration. However, even when automating the synthesis of different bindings, the number of routings to be generated increases drastically with the size of the processor array. For a design featuring eight processors and eight tasks, there would be about 40,000 possible routings, which would have to be precalculated as well as stored somewhere in the embedded design.8 Again, this is not a feasible solution due to the overhead in terms of time and memory. This leads to the requirement that an interconnection solution has to feature fast online rerouting of task-processor connections, which does not rely on pre-synthesized results.
A fast and simple update of task-processor bindings is further required in order to
provide an efficient means for scheduling several tasks on the same processor resource by exploiting the virtualization procedures. The time needed for establishing a task-
processor route may not significantly exceed the time needed for the virtualization
procedure to accomplish a context switch. Otherwise, scheduling of tasks would be
drastically slowed down by setting up the corresponding task-processor routes.
Moreover, the interconnection solution has to handle not only the transport of single
signals, but of complete processor-task interfaces. As highlighted in the previous
Section, the virtualization solution encapsulates the memory controllers of a task.
Therefore, all the signals on the memory interface between the memory controllers and a
processor have to be routed by the interconnection solution.9 As the fundamental design
of FPGAs, which are exploited for the reference implementation, is not optimized to
dynamically route interfaces of huge bit widths, the synthesis of such an interconnection
8 For the number of processors n, the number of tasks m, and the assumption that m ≥ n, the number of possible bindings is m!/(m − n)!. For m = n = 8, this yields 8! = 40,320, i.e., the roughly 40,000 routings mentioned above.
9 For the prototype design based on MicroBlaze soft-core processors, the interface plus a set of trace
[Figure 2.12: A virtualizable System featuring a dedicated Soft-Core Processor for Binding Management, attached to the Interconnection Network.]
solution – almost regardless of its type – may lead to inefficient synthesis results.
Section 2.3.5 will discuss the outcomes of this issue.
Last but not least, aside from the interconnection between tasks and processors, an instance is required that manages the update of the task-processor routes. For this purpose, a dedicated soft-core processor, which is not part of the parallel processor array,
is introduced, cf. Figure 2.12. This processor accomplishes the management of bindings as well as the issuing of the commands that trigger an update of the task-processor routes.
Static Interconnect
In the beginning of the era of processors, a chip usually featured a single processor,
which was sometimes accompanied by a set of co-processors. As there was just one processor, the interconnection to the corresponding task memory was static. Today's embedded designs most often still rely on a static interconnect between a task memory and a processor. Despite being still predominant in many embedded designs, such a static interconnect is obviously completely unsuited for the targeted virtualizable multi-processor system.
However, other static interconnects, such as busses, may be exploited for multi-
processor interconnection. Processor architectures such as the IBM Cell [Chen 2007]
feature, e. g., a ring-based bus system to enable intra-processor communication. For
the interconnection between a set of task memories and an array of processors, busses
have the disadvantage of complex access management. A bus may realize only one transfer at a time.

[Figure 2.13: A Crossbar Switch in its two Configuration Modes: straight (i1 → o1, i2 → o2) and crossed (i1 → o2, i2 → o1).]
One of the oldest interconnection solutions, originating in telephone switching, are non-blocking switching networks [Clos 1953]. Here, arrays of crossbar switches are interconnected in such a way that every input-output permutation can be realized. Due to the non-blocking structure, each input is guaranteed to be routed to its designated output. In contrast to full crossbars, switching networks reduce the number of switching elements and are, thus, less resource-expensive. Since the structure of a switching network may be purely combinatorial, such networks are well-suited for communication that is based upon 1:1 relations and requires a continuous data transfer without delays caused by register stages within the network. The concept of switching networks of Clos was later expanded to permutation networks [Waksman 1968]. Permutation networks introduce several distinct advantages, such as fast routing algorithms and a generic, scalable layout. Thus, permutation networks are the interconnection type of choice for the virtualizable MPSoC.
[Figure 2.14: Two Networks with different static Interconnects between Crossbar Switches (Inputs i1–i4, Outputs o1–o4).]
[Figure 2.15: A Butterfly Network of Size 8 × 8 (Inputs A–H, Outputs 1–8) with Occurrence of a Blockade: C and G cannot establish their desired Routes at the same Time.]
A crossbar switch routes its inputs to its outputs either straight or in a crossed manner; Figure 2.13 depicts a crossbar switch in its two configuration modes. The configuration of a crossbar switch may be toggled during runtime.
The actual type of a permutation network is defined by the pattern of static in-
terconnects, which connect the crossbar switches. The interconnect is chosen at the
design phase and is not altered at runtime. In Figure 2.14, two different permutation network types over the same four crossbar switches are established by choosing two different static interconnects. Depending on the permutation network type, the number of crossbar switches as well as the static interconnect between them may vary.
In the following, different permutation network types are briefly evaluated in order to identify the type most suited for employment in a virtualizable MPSoC regarding the prerequisites defined in Section 2.3.1. The specific network types to be discussed were selected in order to highlight the differences among them in terms of flexibility, routing complexity, scalability, and resource consumption. A further in-depth discussion of other well-known permutation network types, such as Omega networks, adds no insight regarding the employment in the virtualizable MPSoC. An exhaustive discussion of different permutation network types is not the goal of this work and is, therefore, omitted.
Butterfly Networks Butterfly networks, cf. Figure 2.15, feature fair resource consumption and a modest length of their combinatorial paths. The resource consumption is n_{i×i} = 2 · n_{(i/2)×(i/2)} + i/2 crossbar switches, with i being the number of inputs of the network.11
[Figure 2.16: (a) The thin dotted Line indicates the Axis on which a Butterfly Network is mirrored to form a Beneš Network; a Blockade between A and C occurs at the highlighted Crossbar Switch. (b) The Blockade resolved by an alternate Route.]
For the smallest network size with i = 2, the number of crossbar switches is n_{2×2} = 1. Because of the relatively few stages12 of the Butterfly network, it offers only
a low interconnectivity, i. e., not every input-output relation may be established. This
is depicted in Figure 2.15 as well. In this example, inputs C and G cannot be routed
to outputs 3 and 4 at the same point in time. There, a so-called blockade occurs. Thus,
if the inputs are seen as tasks and the outputs of the network as processors, not all
task-processor bindings would be realizable. As the flexibility of task migration would,
therefore, be limited by employing Butterfly Networks, they are not considered fur-
ther. However, they might be well-suited for scenarios with harsh resource constraints
and few dynamic binding configurations, where their low connectivity does not raise
problems. Routing is fairly easy by applying a bisection method.
[Figure 2.17: A Sorting Network of Size 8 × 8 (Inputs A–H, Outputs 1–8); each Crossbar Switch acts as a Comparator (<).]
Sorting Networks Next, Sorting Networks are considered. In a sorting network each
crossbar switch acts as a comparator element as detailed in Algorithm 5. By applying a
sorting behavior at each crossbar switch, in consequence, at the output stage all values
from the input ports will be sorted in order. Therefore, establishing routes is very easy,
if the outputs are labeled in ascending order and the inputs use these labels to identify
their desired output. Since sorting networks are able to sort any input sequence, they are consequently also able to realize every possible input/output relation. They are, therefore, true permutation networks in the original sense.
[Figure 2.18: Growth of Resource Consumption in Terms of Crossbar Switches for different Permutation Network Types and Sizes (logarithmic scale; Butterfly, Beneš, and Max-Min Networks for Sizes 2×2 to 32×32).]
The number of crossbar switches necessary for a Max-Min network of the size i × i is n_{i×i} = 2 · n_{(i/2)×(i/2)} + Σ_{x=1}^{i/2} x. Figure 2.22 of Section 2.3.4 illustrates the derivation of this formula.
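The two recurrences can be checked numerically; the following design-time VHDL functions restate them for power-of-two sizes, with the funnel summation replaced by its closed form (i/2)(i/2 + 1)/2. This is a sketch for reproducing the numbers in Figure 2.18, not part of the original design.

```vhdl
-- Crossbar-switch counts for butterfly and Max-Min networks of size
-- i x i (i a power of two), following the recurrences given in the text.
package switch_count_pkg is
  function butterfly_switches(i : positive) return natural;
  function maxmin_switches(i : positive) return natural;
end package switch_count_pkg;

package body switch_count_pkg is
  function butterfly_switches(i : positive) return natural is
  begin
    if i = 2 then
      return 1;                -- a 2x2 network is a single crossbar switch
    end if;
    return 2 * butterfly_switches(i / 2) + i / 2;
  end function;

  function maxmin_switches(i : positive) return natural is
  begin
    if i = 2 then
      return 1;
    end if;
    -- two half-size networks plus the funnel: 1 + 2 + ... + i/2 switches
    return 2 * maxmin_switches(i / 2) + (i / 2) * (i / 2 + 1) / 2;
  end function;
end package body switch_count_pkg;
```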
Figure 2.18 depicts the resource consumption in terms of crossbar switches for the
discussed permutation networks. The Butterfly network features the lowest resource
consumption at the expense of a high risk of blockades. Up to size 8 × 8, the Max-Min
network maintains a lower resource consumption than a Beneš network. Beginning with
sizes 16 × 16, the growth of the Max-Min network’s resource consumption distinctly
surpasses that of the Beneš and Butterfly networks. Section 2.3.4 will document, however, the limited necessity of network sizes of 16 × 16 and beyond. Furthermore, Section 2.3.5 proposes design alternatives by clustering networks of lower size. Thus, the objection of a steep growth in resource consumption for the Max-Min network at larger network sizes does not have to be considered.
Table 2.2 summarizes the characteristics of the evaluated permutation networks.
This is further depicted by Figure 2.19, which visualizes the advantages of the dif-
ferent network types. Despite their higher resource consumption, whose impact is
subsequently lowered by Moore’s Law, Max-Min networks are the type of choice for
the virtualizable MPSoC, because of the fast routing algorithm and the full flexibility.
This routing algorithm enables the quick setup of new task-processor relations. In the
following, the adoption of the Max-Min network as a task-processor interconnection
is detailed. In the scope of a virtualizable MPSoC, this enhancement was proposed in
[Biedermann 2012c].
[Figure 2.19: Qualitative Comparison of the Network Types (Butterfly, Beneš, Max-Min, and an ideal Network) regarding Interconnection Flexibility, Combinatorial Delay, and Structure, rated low, medium, and high.]
This Section will outline the behavior of the interconnection network when establishing a task-processor Binding Vector. As defined in Section 2.2.5, a Binding Vector denotes which task is to be executed on which processor. For example, the Binding Vector

[Figure: The two Layers of the Interconnection Network between the Virtbridges and the Processors: Layer 1 holds the physical Interconnect; Layer 2 is a Shadow Copy in which the Control Processor calculates new Routes, whose Results are then applied to Layer 1.]
The network is symmetric in terms of the number of inputs and outputs, i.e., in the number of attached task memories and processors. It is designed to scale to an arbitrary power of two of inputs and outputs. This is achieved by the generic structure of the network: larger networks contain smaller networks as building elements, cf. Figure 2.22.

[Figure 2.21: Illustration of the Routing Algorithm performed in the Shadow Copy of the Interconnection Network: (b) Naming of Inputs according to Processor Assignments; (c) Remaining Inputs filled with unused Processor Labels; (d) Applying the Sorting Behavior of Algorithm 5 – Processors 6 and 8 are marked for being temporarily deactivated.]

Despite this theoretical scalability, the feasibility of large-scale permutation networks is limited. This is due to the fact that the permutation network features only combinatorial, unbuffered signals between tasks and processors. The length of the longest combinatorial path, i.e., the longest route between two register elements, determines the maximum clock frequency achievable for the system. If a clock frequency higher than allowed by the longest combinatorial path is chosen, a signal output onto this path at a clock event can be altered by the succeeding clock event before being saved to the register
at the end of the path. The result is incorrect register values and system misbehavior. The larger the permutation network is scaled, the more stages of crossbar switches exist in the network and, consequently, the longer the combinatorial path gets. With increasing network size, the system's clock frequency therefore decreases.
The depth of a Max-Min network can be derived from its generic structure. A Max-Min network of the size n × n consists of two Max-Min networks of the size n/2 × n/2 placed beside each other and a funnel-shaped sequence of crossbar switches placed below these nets, cf. Figure 2.22. This "funnel" has a width of n/2 crossbar switches at its top. Therefore, for an n × n network, the funnel adds n/2 additional stages of crossbar switches. The overall depth of the network may now be derived recursively as denoted in Algorithm 7.
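Since Algorithm 7 itself is not reproduced in this excerpt, the recursion just described may be sketched as a design-time VHDL function (placed, e.g., in a utility package):

```vhdl
-- Depth of an n x n Max-Min network in crossbar-switch stages: two
-- nested n/2 x n/2 networks contribute their own depth, the funnel
-- below them adds n/2 further stages, and a 2x2 network is one stage.
function maxmin_depth(n : positive) return natural is
begin
  if n = 2 then
    return 1;
  end if;
  return maxmin_depth(n / 2) + n / 2;
end function maxmin_depth;
```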
Figure 2.23 gives an impression of the correlation between network size and maximum clock frequency. The numbers are given for a Xilinx Virtex-6 FPGA. As a network size of 16 × 16 causes placement errors on the target chip due to resource constraints, the frequency given for size 16 × 16 is an estimate. One may derive from the numbers in Figure 2.23 that the employment of a permutation network causes a significant
[Figure 2.23: The Correlation between Network Size and System Frequency for Network Sizes 2 × 2 to 16 × 16: 2 × 2: 78.217 MHz; 4 × 4: 54.154 MHz; 8 × 8: 37.077 MHz; 16 × 16: ≈ 20.6 MHz (estimated).]
performance drop in the system. The first prototype of a Virtualization Layer between three processors without a dedicated interconnection scheme achieved 91 MHz on a Virtex-5. When employing the permutation network, already the 2 × 2 version is slower than this first prototype without the interconnection network. The permutation network is, therefore, obviously a performance bottleneck.
At first sight, adding buffers into the combinatorial paths of the network seems to solve the problem. Here, one or several register stages are added into the network. Indeed, this significantly shortens the critical paths and, therefore, raises the clock frequency. However, adding registers between task memories and processors causes a system behavior that is hard to handle and error-prone. Usually, the instruction fetch, i.e., reading an instruction out of the memory location of the instruction memory that
is defined by the program counter is accomplished within one clock cycle. Adding
registers to the instruction memory-processor route causes the instruction fetch to take
several clock cycles. As a consequence, the correlation between the program counter address and the instruction to be executed is lost. This may cause erroneous behavior in case branches with relative branch targets are executed, whose resulting branch target addresses depend on the current program counter address. Therefore, buffering the network routes is not an applicable solution.
A significant drop in system frequency is only tolerable if the system gains other distinct advantages in return. The most prominent gain of exploiting a permutation network is the full flexibility with regard to the task-processor bindings. The following section discusses under which circumstances flexibility is a valuable tradeoff against system frequency.
By exploiting the Max-Min network, a task inside this network may be shifted to
any processor that is connected to the network. Chapter 3 will highlight the benefits
of exploiting this flexibility. In short, dynamically adapting task-processor bindings
may not only enable a convenient method of sharing of processor resources, but also
allow for module redundancy or even advanced concepts, such as the Agile Processing
paradigm, which is introduced in Section 3.4. However, the flexibility comes at the
price of a significantly reduced system frequency, as discussed above. Therefore, it is not feasible to create a virtualizable design with more than eight processors. For more than eight processors, a frequency of about one fifth or less of the maximum clock frequency is expected. In return, the execution of a task could be shifted, e.g., among
sixteen processors. The design concepts in the following chapter will point out that
tasks in a complex embedded design indeed benefit from at least some flexibility but do
not need full flexibility, i. e., there is no need to assure that any task in an embedded
design may be executed on any processor of the system. For redundancy issues, for
example, it is important that at least one or two processor resources can take over the
execution of a task relevant to safety in case its processor failed. The Agile Processing
scheme will also profit from dynamic binding updates, but will not rely on a very
large permutation network in which each task might be shifted to any processor. In conclusion, for smaller network sizes, e.g., a 4 × 4 network, the performance drop is still distinct, yet significantly lower than for larger network sizes. In return, all of the design concepts that will be discussed in Chapter 3 are already enabled at this network size. In case a system with more processors is desired, instead of scaling the permutation network to larger sizes and, thereby, further lowering the system performance, building clusters of processor arrays that each feature a Virtualization Layer is a viable solution.
[Figure 2.24: Clustering improves System Performance; all clustered Designs depicted are close Clusters. System Frequency [MHz] over the Number of Processors: 2 × 2: 78.217; 4 × 4: 54.154; 2 · 4 × 4: 48.349; 3 · 4 × 4: 40.578; 8 × 8: 37.077; 8 × 8 + 4 × 4: 37.006.]
2.3.5 Optimization
Clusters of Virtualizable Networks
Two clustering alternatives may be distinguished. In the first alternative, the close clusters, several processor arrays with independent Virtualization Layers share the same control modules, which, e.g., manage and update the binding definitions for the processor arrays. Close clusters are depicted in Figure 2.25.
In loose clusters, each processor array is a self-contained virtualizable system with
dedicated control modules. In fact, loose clusters are built by instantiating a virtu-
alizable design as depicted in Figure 2.8 several times. Loose clusters are depicted
in Figure 2.26. Figure 2.24 visualizes the clock frequency for several close clusters of
various sizes for a Virtex-6 LX240T FPGA.13 For designs featuring eight processors,
the frequency drop is lower for a 2 · 4 × 4 system compared to an 8 × 8 solution. Fur-
thermore, a solution featuring twelve processors structured in three close clusters of
four processors outperforms the design featuring one 8 × 8 network. The application
example in Section 4.2 will exploit such a close 8 × 8 + 4 × 4 cluster.
Asymmetric Shaping
As discussed above, the Max-Min network that is employed for the virtualizable
multi-processor architecture features the same number of inputs and outputs, i. e., of
processors and of tasks. Indeed, in many embedded designs, assigning a dedicated
processor to each task is not uncommon in order to prevent malicious interference with
other tasks and to maintain system safety. However, for multi-processor designs that
feature sharing of processor resources, featuring as many processors as tasks is not
desirable. In most scenarios employing resource sharing, there will be more tasks
than processors. Even if some processor instances are held ready as spare resources
in case of errors, the overall number of processors is usually lower than the number
of tasks. In order to overcome the shortcomings of a conventional symmetric network
design, an asymmetric shaping of the interconnection network is now proposed. This
13 It is assumed that loose clusters will achieve slightly better results. This will be due to the increased
degree of freedom for module placement as loose clusters do not share access to a common control
processor.
optimization will allow featuring more tasks than processors in a Max-Min network.
The foundation of the optimization is a Max-Min network of the size n × m, where n is
the number of inputs, i. e., of tasks, and m the number of outputs, i. e., of processors.
As detailed above, in the beginning n = m holds. The result of the optimization will be a
network of the size n × m with m < n. Algorithm 8 lists the steps performed to modify
the interconnection network. This algorithm has to be executed during system design.
The algorithm starts at the middle of the output stage, where the combinatorial
path is the longest.14 This is depicted in Figure 2.27. In doing so, by discarding parts
of the network, the most time-critical path is eliminated first. The algorithm then
continuously removes outputs and crossbar switches that are no longer needed
for the desired number of outputs. The algorithm works from the inner to both outer
sections of the output stage, alternating between going left and right. This ensures
an even reduction of the longest paths in the network. Figures 2.27 and 2.28 depict the
process of reshaping an 8 × 8 down to an 8 × 1 network.
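The removal order may be illustrated by a small sketch. The following C fragment is a
minimal illustration, not taken from the thesis, assuming outputs are numbered 1 to n
and removal starts just right of the middle, as the reshaping sequence in Figures 2.27
and 2.28 suggests:

#include <stdio.h>

int main(void) {
    const int n = 8, keep = 3;           /* example: reshape 8x8 down to 8x3 */
    int left = n / 2, right = n / 2 + 1; /* outputs are numbered 1..n        */
    for (int removed = 0; removed < n - keep; removed++) {
        if (removed % 2 == 0)
            printf("remove output %d\n", right++); /* go right first         */
        else
            printf("remove output %d\n", left--);  /* then alternate left    */
    }
    /* for n = 8, keep = 3, outputs 5, 4, 6, 3, 7 are removed in this order,
       leaving the non-contiguous outputs 1, 2, and 8 discussed below */
    return 0;
}

Crossbar switches that can no longer be part of a route to a remaining output are
discarded along with it.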
The result of this procedure is accompanied by several characteristics. First, fewer
processor instances as well as fewer crossbar switches have to be synthesized. Crossbar
switches that cannot be part of a route to a remaining processor are removed.
This leads to a significant decrease of occupied logic slices on the target FPGA, cf.
Figure 2.29. The LUT consumption of an asymmetrically shaped network is, however,
higher than for an unmodified network of a smaller size. This can easily be explained
by a visual comparison of, e. g., a 4 × 4 network as depicted on the top of Figure 2.22
with an asymmetrically shaped 8 × 4 network as depicted in Figure 2.27, lower portion of
14 As discussed in Section 2.3.3, some routes pass more crossbar switches than others.
the right hand side. One immediately sees that, given the same number of processors,
the number of crossbar switches is higher for an asymmetrically shaped network than
for an unmodified network.
Second, depending on the number of remaining processors, the longest path of
the system is shortened accordingly, leading to an increase in system frequency and,
thus, performance, compared to the original network size. The combinatorial delay tcd
of routes from task memories to processors via the interconnection network may be
written as
tcd = tmem_to_icn + ticn + ticn_to_proc
where tmem_to_icn is the delay caused by the route from task memories to the inputs of
the interconnection network, ticn is the delay caused by the interconnection network
and ticn_to_proc is the delay caused by the route from the output of the interconnection
network to the input of the processors. The asymmetric reshaping reduces ticn , while
tmem_to_icn and ticn_to_proc remain unchanged.15 This effect is depicted in Figure 2.30.
Third, those crossbar switches that now only feature one output behave like a
normal multiplexer. If the left output remains connected, the input with the lower ID is
routed to this output; if the right output remains connected, the input with the higher
ID is routed to this output.
15 Indeed, due to reshaping and the resulting decrease in occupied logic, a slightly better placement of
task memories and processors might be achieved, resulting in a decrease of tmem_to_icn and ticn_to_proc.
These deviations are not considered further.
[Figure 2.29: LUT consumption over the number of processors: the asymmetrically shaped
8×4 and 8×2 networks consume fewer LUTs than the full 8×8 network, but more than the
unmodified 4×4 and 2×2 networks.]
[Figure 2.30: System frequency in MHz over the number of processors: 2×2: 78.217;
4×4: 54.154; 8×2: 49.781; 8×4: 39.974; 8×8: 37.077. Asymmetric reshaping shortens the
longest path and raises the frequency compared to the unmodified 8×8 network.]
Care has to be taken that, with a reduced size of the network, only the remaining
processors may be addressed as targets of a binding definition. As the processors are
removed starting from the middle of the network, the remaining processors are no
longer numbered in a contiguous sequence, cf. Figures 2.27 and 2.28, where the
remaining processors are, e. g., 1, 2, and 8. This non-contiguous sequence has to be maintained
[Figure 2.31: Network configuration of the reshaped 8 × 3 network with remaining
processors 1, 2, and 8.]
in order for the routing algorithm to work correctly. Aside from this fact, no adaptations
regarding the binding or scheduling of tasks have to be considered.
To realize a given binding, e. g., in a network of the size 8 × 3, the same routing
algorithm as denoted in Algorithm 6 applies as for an unmodified symmetrical network.
In consequence, the processors that have been removed by the reshaping process are
treated just like processors that are not considered in a binding. Figure 2.31 depicts
a network configuration for the 8 × 3 network with an active binding.
Crossbar switches that only feature one output as a result of the reshaping process
behave as 2:1 multiplexers and are highlighted in Figure 2.31.
Reshaping an otherwise symmetrical network is one way to feature more tasks than
processors in a system. As mentioned in the previous section, multi-tasking systems
usually have far more tasks than processors. Reshaping the network may, however,
not be a feasible solution to handle a huge number of tasks: due to the limited scalability
of the interconnection network, system performance decreases with an increasing
number of inputs and outputs, even when the overall network size is reduced by the
asymmetric reshaping process. Thus, in order to support an arbitrary number of tasks
in the system, another solution has to be found.
Tasks consist of instructions, data, and the current context. Instructions are stored
inside an instruction memory, data inside a data memory. For an FPGA-based im-
plementation, both memories are mapped to so-called BlockRAM (BRAM) modules,
dedicated dual-ported memory banks on the FPGA device. Usually, instruction and
data memory are mapped into the same chain of BRAM modules and, therefore,
Figure 2.32: A partial Reconfiguration on an FPGA allows switching Modules B and D at
Runtime without interrupting the Execution of Modules A and C.
occupy one of the memory ports each. As discussed, the design and the size of the
interconnection network and, consequently, the number of tasks attached to the system,
are defined during the design phase and cannot be altered during runtime.
One solution might be to fill other BRAM banks, which are not occupied by the
design, with tasks and route them as inputs to the network by means of a multiplexer
structure. This solution, however, would be nothing more than an inefficient expansion
of the interconnection network. Moreover, an FPGA's BRAM resources are usually
limited, which would then set an upper bound on the number of possible tasks.
Now, the runtime partial reconfiguration feature of FPGAs is exploited to over-
come this issue. Partial reconfiguration is a technique that alters only parts of
the FPGA's configuration. This is achieved by applying a partial bitstream onto the
device. This partial bitstream only contains the configuration update of the chip region
to be reconfigured. As a significant side effect, those areas that are not affected
by the partial reconfiguration may continue their execution even during the partial
reconfiguration phase. Figure 2.32 depicts a partial reconfiguration of a design, where
module B is replaced by module D, whereas modules A and C continue their execution
all the time. This allows building systems with high execution dynamism. The partial
reconfiguration technique has, therefore, contributed to a wide range of applications
such as [Hübner 2006, Paulsson 2006b, Ullmann 2004, Lanuzza 2009]. Inter-FPGA re-
configuration was proposed in [Wichman 2006], which in consequence led to complex,
reconfiguration-based FPGA networks [Biedermann 2013a].
The left hand side of Figure 2.33 depicts a virtualizable system with two task memories.
These memories are not explicitly bound to a specific task, but may be filled over time
with different tasks. On the right hand side of Figure 2.33, task B residing in task
memory 2 is replaced by task C by means of a partial reconfiguration. In order to
replace a task by dynamic reconfiguration, the task that is currently being executed
is halted by the virtualization procedure detailed in Section 2.2.5. In doing so, the
processor is detached from the task’s memory. After the virtualization procedure has
Figure 2.33: Over Time, the Task Memories in the System can be filled with different Tasks by
performing a partial Reconfiguration.
stopped the task’s execution, the control processor outside the processor array triggers
the partial reconfiguration by calling a dedicated function, which denotes the partial
bitstream to be applied and the target region on the FPGA. The new partial bitstream
contains the machine code of a new task. The target area comprises the BRAM cells on
the FPGA, which host the original task. The reconfiguration is then applied by a dedicated
IP-core, the so-called ICAP, which is provided by the device vendor Xilinx Inc. While a full
reconfiguration of an FPGA of the size of a Virtex-6 takes about 12 to 20 seconds via
a serial PC connection, the duration of a partial reconfiguration depends on the size
of the partial bitstream and on the speed of the storage medium that contains the
partial bitstream file. For altering the BRAM cells that form a task memory, the
partial reconfiguration takes about 2 seconds. Indeed, this includes the time needed for
the data transfer of the bitstream from a personal computer. When exploiting a fast,
off-chip memory that stores the bitstreams, the reconfiguration process is significantly
accelerated. Recent works demonstrated a huge speedup in the transfer of partial
bitstreams [Bonamy 2012]. However, dynamic task replacement is still slower than a
conventional virtualization procedure, where the tasks are already stored in the chip’s
memory. Therefore, this approach is not suited for frequent task scheduling events
under harsh real-time constraints.
As soon as the reconfiguration process is finished, the memory is reattached to the
processor by the virtualization procedure. Additionally, the content of the TCM, which
still features the context of the now overwritten task, is erased. Consequently, the new
task may start its execution without any interference with the old task. Strict isolation
of task contents as required by Postulate 4 (Strict encapsulation and isolation of tasks) is
preserved.
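The overall replacement flow may be summarized by the following C-style sketch. It is
merely an illustration of the sequence described above; all helper names are hypothetical
and do not stem from the prototype implementation:

extern void halt_task(int mem_slot);                 /* virtualization procedure */
extern void icap_reconfigure(const char *bitstream); /* apply partial bitstream  */
extern void clear_tcm(int mem_slot);                 /* erase stored context     */
extern void reattach_memory(int mem_slot);           /* resume normal operation  */

/* Dynamic replacement of the task residing in task memory mem_slot. */
void replace_task(int mem_slot, const char *partial_bitstream) {
    halt_task(mem_slot);                  /* detach processor from task memory  */
    icap_reconfigure(partial_bitstream);  /* write new machine code into BRAM   */
    clear_tcm(mem_slot);                  /* strict isolation: wipe old context */
    reattach_memory(mem_slot);            /* new task starts from its beginning */
}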
The dynamic task replacement procedure enables the employment of an unlimited
number of tasks. However, as soon as a task is being replaced, its current context, which
was stored in its TCM, is erased. Therefore, when this task is configured onto the FPGA
again, it has to restart its execution from the beginning. Section 2.4.4 will highlight a feature,
where a task may indicate that it has finished one execution cycle.16 Consequently,
tasks are not replaced until they have completed their current processing. The Agile
Processing scheme highlighted in Section 3.4 will modify this approach slightly in order
to enable a smooth transition between sequential and parallel task execution.
In case the context of the task has to be preserved when overwriting the task’s
memory by partial reconfiguration, a configuration readout may be performed. Here,
the ICAP may read out the current configuration of the TCM before applying a new
partial bitstream to the BRAM. The extracted context may be saved elsewhere off-chip.
Alternatively, every task that may be mapped into the system could feature a dedicated
TCM. However, this would cause a significant resource overhead, as the number of
tasks employable via dynamic task replacement is almost unlimited. For the
prototype implementation, TCM readout or dedicated TCMs for each and every task
are not implemented.
Please note that although the partial reconfiguration technique is necessary in order to
efficiently feature an unlimited number of tasks in an FPGA-based virtualizable design,
the approach is not limited to FPGAs. For other target architectures, which do not offer
any runtime reconfiguration features, but whose structures may be outlined by the
designer – in contrast to an FPGA – the same approach may be implemented by means
of a suited memory design. Here, a dedicated port may be added to each task memory
in order to externally replace the memory contents.
The partial reconfiguration feature might not only be exploited to replace tasks, but
also to exchange the entire processor-task interconnection network during runtime.
Depending on the current flexibility requirements, a network with fewer stages and,
consequently, less flexibility, such as a butterfly network, could replace the Max-Min
network at runtime. In theory, fewer combinatorial stages would lead to an increase
in system frequency. By dynamically replacing the Max-Min network, the system’s
frequency could thus be increased. However, this is not considered by the design tools.
Moreover, a methodology defining when to apply which interconnection network is
missing. If many exchanges of the interconnection network were required during
runtime, the advantage of the theoretical increase in system frequency would disappear.
Although the work of [Pionteck 2006] has already demonstrated the implementation of
NoC reconfiguration on FPGAs – however, for networks that feature off-line routing
in contrast to the proposed runtime routing – a dynamic exchange of the interconnection
network does not seem feasible when considering the constraints discussed above.
16 In embedded designs, many tasks run “infinitely”, i. e., are encapsulated in a loop to continuously
process input data. In this context, one cycle in this loop is called an execution cycle. Often, no data
dependencies are carried between two execution cycles.
Figure 2.34 depicts a floorplan of a loose cluster, cf. Section 2.3.5, which consists of
an array of eight processors plus an array of four processors. The target chip is a Virtex-
6 LX240T FPGA. Processors are marked by numbers, task memories are indicated
by letters. The corresponding control processors are highlighted in light gray. The
dark grey backgrounds depict the FPGA’s resources occupied by the interconnection
networks; in grey, the network connectivity is depicted. The clusters are spatially
separated, as indicated by the dashed line. One may clearly see, however, that
the networks are scattered over almost the entire area of the FPGA. This is due to the
combinatorial structure of the interconnection network, which is mapped into LUTs.
The combinatorial networks occupy 20 % and 45 %, respectively, of the FPGA’s LUT
resources. In only about 30 % of the occupied slices, both LUTs and registers are exploited.
This hints at an inefficient placement caused by the structure of the interconnection
network. The scattered placement, furthermore, elongates routes between registers and,
thus, causes long critical paths, which in return lower the system’s overall frequency.
The exploitation of multi-stage interconnection networks, however, does not automatically
imply a reduction of a system’s performance. Multi-processor systems such as the
Plurality architecture [Plurality 2013] feature a multi-stage interconnection network
as dispatcher between processors and memories, which is depicted in [Green 2013].
Plurality’s interconnect is advertised as featuring “low latency”. However, it is not stated
whether the routes from processors to memories are buffered, a solution that is unsuited
for the virtualizable architecture as discussed above. Nevertheless, transferring the
design from the prototype FPGA platform to a custom chip whose layout is optimized to
realize the multi-stage structure may significantly shorten combinatorial paths and, thus,
result in a marked increase of system performance. Therefore, the performance drop
is not seen as a weakness of the advocated virtualization structure, but as a limitation
arising from the FPGA platform exploited for the prototype implementations.
Figure 2.34: The Floorplan clearly depicts the two Clusters of the System. Processors are
marked by Numbers, Task Memories by Letters. The control processors are
depicted in light grey. The corresponding networks (dark grey) are depicted with
an Overlay of the Connectivity (grey).
Definition 4 (Binding Vector with Task Groups). A Binding Vector BVi = {((tv, tw, …) →
px), (ty → pz), …} is composed of bindings Bi.
all active tasks is triggered. In doing so, the execution of all tasks in the system is
halted. For a Binding Vector not featuring a Task Group, the routing Algorithm 6
of Section 2.3.4 applies in order to set up the corresponding task-processor routes in the
interconnection network. This algorithm is now modified in order to support Task
Groups. Algorithm 11 notes the initial setup of the network for a Binding Vector that
features a Task Group.
In doing so, the first task of a Task Group and all tasks of 1-1 relations will be
activated, cf. Figure 2.39, left hand side. As soon as the budget of the first task of the
Task Group runs out, a scheduling event is triggered. Consequently, this task is
interrupted by the virtualization procedure. Meanwhile, a new network configuration is
computed in parallel. This new configuration will feature a route from the next task of
the Task Group to the targeted processor instance. With the new network configuration,
this next task of the Task Group may start or resume its execution. Algorithm 12
denotes the steps for a Task Group scheduling event. For Task Groups, scheduling
events are internally managed by timers inside the Virtualization Layer as well as by
registers keeping track of task execution sequences. Therefore, the control processor is
relieved from managing scheduling events.
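The event handling may be sketched in C as follows. This is a simplified illustration of
the steps of Algorithm 12 under the assumptions just stated; the data structure and all
helper names are hypothetical:

typedef struct {
    int tasks[8];   /* task IDs in the group's execution sequence      */
    int size;       /* number of tasks in the Task Group               */
    int current;    /* index of the currently active task              */
    int processor;  /* the processor instance shared by the Task Group */
} TaskGroup;

extern void interrupt_task(int task_id);          /* virtualization procedure */
extern void compute_route(int task_id, int proc); /* done in the shadow copy  */
extern void apply_network_configuration(void);
extern void resume_task(int task_id);

/* Invoked by the Virtualization Layer when the active task's budget expires. */
void on_budget_expired(TaskGroup *tg) {
    interrupt_task(tg->tasks[tg->current]);
    tg->current = (tg->current + 1) % tg->size;   /* next task in the sequence */
    compute_route(tg->tasks[tg->current], tg->processor);
    apply_network_configuration();                /* tasks outside the group
                                                     keep executing            */
    resume_task(tg->tasks[tg->current]);          /* start or resume the task  */
}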
As mentioned before, new network configurations are calculated in a shadow copy
of the interconnection network at runtime. During Task Group Scheduling events, the
bindings of tasks not part of the Task Group are not altered. Hence, as soon as the
new network configuration is found, it may immediately be applied to the network
without the need for interrupting the tasks outside the Task Group. Thus, all tasks
bound to other processors in the array are not affected by this procedure, even if the
route between their task memory and their corresponding processor is adapted by the
network reconfiguration, cf. Task D in Figure 2.39, right hand side.
Figure 2.39: Task Group Scheduling Event for the Binding Vector BV1 = (A → 1), (B → 6),
((C, E, G) → 5), (D → 4), (F → 3), (H → 8). The Task Group’s tasks are high-
lighted as well as the Crossbar Switches reconfigured during a Task Group
Scheduling Event. The Route of the Binding (D → 4), which is unrelated to
the Task Group, is adapted without affecting D’s Execution.
is completed. The designer may choose task budgets in a way that scheduling events
occur at common factors of these budgets. If two or more scheduling events are
triggered within the same clock cycle, all scheduling events are handled simultaneously.
Consequently, only one new network configuration is computed for these scheduling
events. Nevertheless, due to the speed of the virtualization procedure and the fast
routing algorithm, such delays are only in the range of 90 clock cycles.
By now, a time division scheme for processor resource sharing has been detailed.
However, when performing such a scheme, the execution of a task is split into several
parts. This is an issue if tasks have dependencies among each other. Here, a task
may only start its execution if it has received data from the tasks it is depending
on. As commonly adopted, a task is assumed to first receive data from preceding
tasks. Then it may perform its computations. At the end of its execution cycle, the task
sends its result to subsequent tasks. Besides this, no communication occurs during its
computation. This is depicted for a simple task graph in Figure 2.40. A time division
scheme may lead to a significant delay in the overall system execution time. This is
[Figure 2.40 shows the task graph A, B → C → D next to the code of task C:]

main{
    receive(a,A);
    receive(b,B);
    c=a+b;
    d=b+c;
    …
    result=e*d;
    send(result,D);
}
Figure 2.40: A Task, here C, at first receives Data from the Tasks it is dependent on, here A
and B. Afterwards, it may start its Computation. Finally, it sends its result to the
Tasks, which depend on it, here D.
highlighted in Figure 2.41. Here, a task graph consisting of the four tasks A, B, C,
and D is mapped to processor P1. The individual task execution times are given in
Figure 2.41 as well. P11 to P14 illustrate different scheduling alternatives. The default
time division interval is t.
P11 realizes the simple binding
BV1 = ((A, B, C, D) → 1)
[Figure 2.41 data: task execution times: A: 2·t, B: 1.5·t, C: t, D: 1.5·t. Resulting
overall execution times: P11: 11.5·t, P12: 7.5·t, P13: 6.5·t, P14: 6·t.]
Figure 2.41: A Task Graph with given Task Execution Times is executed in four different
Ways.
where each task is interrupted after t by the virtualization procedure to activate the
next task. A and B are interrupted without having completed their computation, i. e.,
without producing a result. As C and D, however, rely on the results of A and B, C
and D stall during their first invocation. During the next invocation of B, B finishes its
computation after t/2 and stalls for the rest of its time slot. Not until the next invocation
may C start its execution for the first time. Afterwards, D may start its execution.
If C had featured an execution time longer than t, the start of D’s execution
would have been further delayed. Besides tasks stalling because they are waiting for
other tasks to produce results, tasks that are invoked despite already having completed
their computation further elongate the overall system execution time. Altogether, the
execution time for the task graph is 11.5 · t for P11 .
The system execution time may be reduced, if the time division interval is skillfully
adapted to the task execution times. For P12 , the time division interval in the architec-
ture is set to 2 · t, which is exactly the execution time of the slowest task, A, in the task
system. Now, each task is guaranteed to finish its execution within its first invocation.
For tasks B, C, and D, which feature shorter execution times than A, stalls occur after
they have finished their execution. Nevertheless, the overall system execution time may
be lowered to 7.5 · t for the given task graph.
For P13, the budget parameter of the Virtualization Layer is exploited to further
optimize the system execution time. The basic time division interval is reset to t;
however, the Binding Vector now features a budget parameter for A, B, and D:
BV2 = ((A : 2, B : 2, C, D : 2) → 1)
In doing so, for each invocation, A, B, and D feature twice as much execution time
as task C. As a result, the behavior is similar to P12; since C features only a
time slot of t, the stall after C’s execution disappears. By defining 2 · t as the time interval
for A, B, and D, all three tasks are still guaranteed to finish their computation during
their first invocation. The system execution time is now lowered to 6.5 · t.
At first sight, a simple solution would be to assign each task a budget that exactly
correlates with its execution time. However, task execution times may vary during
runtime. If a task unexpectedly stalls, e. g., because a communication participant does
not provide or consume data in time, the execution time is elongated. The Virtualization
Layer, though, interrupts the task after its budget runs out without accounting for
the stall. Thus, relying on statically registered task execution times does not seem feasible.
P14, finally, provides an optimal scheduling. Here, each task is interrupted exactly
after it has finished its execution. The system execution time is optimal, as 6 · t is the
sum of the execution times of A, B, C, and D.17 This scheduling behavior is established
by the so-called self-scheduling features of the Virtualization Layer.
Self-scheduling is a term originally aimed at the decentralized, manual organization
of work shifts of nurses [Burke 2004]. In the field of computing, the term is used several
times for the automatic or partially guided mapping of parallelized loop-sections to
parallel processor architectures [Polychronopoulos 1987, Liu 1993, Tang 1986]. In the
scope of virtualizable MPSoC, self-scheduling was proposed in [Biedermann 2012b].
By exploiting the self-scheduling scheme, a task may indicate the end of its computation.
This indication is provided by means of dedicated scheduling instructions, which may
17 For the entire example, the timing overhead generated by the virtualization procedure, which performs
task switching, is neglected. The basic time interval as well as the task execution times are assumed to be
an order of magnitude above the duration of the virtualization procedure.
Figure 2.42: The Instruction Stream Monitor scans the Instruction Interface and catches dedic-
ated Scheduling Instructions.
be included in the program code of the task. A combinatorial Instruction Stream
Monitor (ISM) is placed in the instruction interface between the task’s instruction
memory and the processor. It is able to observe the instructions being sent to the
processor. It is exploited to detect these dedicated scheduling instructions and passes
this information to the Code Injection Logic (CIL), cf. Figure 2.42. Consequently, the
CIL triggers a virtualization procedure similar to the invocation after a task’s
budget has run out. As the scheduling instructions are based upon instructions
available for the given processor and employ as parameters values defined by the
designer, these instructions can be compiled by existing compilers without modification.
Furthermore, as these instructions are caught by the ISM and are not forwarded to the
processor, no modification of the processor itself is necessary. These properties are in
accordance with Postulates 3, 7, and 8 (Guaranteed activation and interruption of task
execution at any point in time; No modification of the processor core, usage of off-the-shelf
processors; Minor or no modification of existing software code of tasks).
For the self-scheduling instruction, the two alternatives nap and sleep exist. By
indicating the end of the execution via nap, a task is disabled and the next task given
in the Task Group or its execution sequence register, respectively, is activated on the
target processor. After all tasks in the Task Group have been worked off, the first task
of the Task Group is activated for the second time and the execution of the Task
Group is repeated. For the sleep alternative, a task is not only deactivated, but deleted
from the Task Group. Therefore, for the next execution run of the Task Group, a task
terminated by a sleep instruction is not considered further. The behavior of the nap
[Figure 2.43 timing: P1 executes the sequence A (nap), Ba (nap), C (sleep), A (nap),
Bb (nap), A (nap), Bc (sleep), A (nap), A (nap), …; C leaves the Task Group via sleep
after the first execution cycle, B after its third part, so from the fourth cycle on,
only A remains.]
Figure 2.43: The three Tasks A, B, and C are mapped via the Binding Vector BV1 =
((A, B, C) → 1) as a Task Group to Processor 1. The Insertion of Nap and Sleep
Instructions leads to the Timing Behavior illustrated in the lower Portion.
and the sleep instruction is detailed in Figure 2.43. By combining such instructions,
a task may indicate a pause halfway through its execution by a nap instruction and,
after its execution has been resumed at this point during the next invocation, indicate
the end of its execution by a sleep instruction. In doing so, the designer can
define reasonable points for interrupting a task’s execution. These two instructions are
now exploited in order to self-schedule task graphs. A prototype implementation
of self-scheduling instructions for Xilinx MicroBlaze processors was accomplished in
[Eicke 2012].
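The combination of nap and sleep may be illustrated by the following sketch in the
style of the code of Figure 2.44; the split into two phases is an assumed example, not
taken from the prototype:

main{
    phase_one();   //first half of the computation
    nap();         //suspend; entry point at the next invocation
    phase_two();   //second half, resumed at the next invocation
    sleep();       //end of the execution cycle, leave the Task Group
}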
Figure 2.44 depicts a simple task graph consisting of two tasks, whereas task B is
dependent on the data sent by A. After A has completed its computation, it sends
its result to B in a blocking manner. In doing so, task A assures that the subsequent
[Figure 2.44 shows the task graph A (ID 0) → B (ID 1) together with the code of both tasks:]

//Code for Task A (ID 0, sender)
main{
    …                              //normal task execution
    x=y+z;
    sendP2P(x,1,blocking);         //send data to B in blocking manner
    sleep();                       //terminate task
}

//Code for Task B (ID 1, receiver)
main{
    int flag=1;
    while(flag){                   //self-scheduling routine
        recvP2P(x,0,nonblocking);  //non-blocking semantics
        if(readSuccessful()){      //indicated by status bit
            flag=0;                //leave loop, start task execution
        }else{
            nap();                 //disable task, if no data received
        }                          //entry point at reactivation
    }
    …                              //normal task execution
}
Figure 2.44: By adding a dedicated Portion of Software Code, a Task is only executed further,
if preceding Tasks in the Task Graph provide Data. If not, the Task is suspended.
After a task has successfully served succeeding tasks, it terminates itself by the
sleep Instruction.
task B has received the data before task A disables itself by the sleep instruction. The
receiving task, however, reads the data sent by A in a non-blocking manner. In non-
blocking semantics, a carry bit indicates whether a successful data transfer occurred or
not.18 The dedicated self-scheduling routine wrapped around the receive command
determines by checking this carry bit whether B has received the data of A necessary
to start its own computations. If this is not the case, B suspends itself by the nap
instruction. Thus, the next task in the Binding Vector is activated. If blocking
semantics had been applied for the receive command, the task would stall
until the data are delivered by the preceding task. Non-blocking semantics, however,
allow for a self-induced suspension of the task. If the current task depends on the data
of more than one preceding task, the self-scheduling code routine is slightly expanded
18 For the MicroBlaze architecture employed for the prototype implementation, this carry bit is set in the
Machine Status Register of the processor.
to a sequential evaluation of the receive commands of all the predecessors in the task
graph, cf. Algorithm 14.
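A possible shape of this expanded routine is sketched below in the style of Figure 2.44.
It is a plausible reading of Algorithm 14, not a verbatim copy; the predecessor list and
helper names are assumptions:

//Self-scheduling routine awaiting one data word from each predecessor
int pred[2]={0,2};                     //IDs of all preceding tasks (example)
int got[2]={0,0};                      //per-predecessor completion flags
int allReceived=0;
while(!allReceived){
    allReceived=1;
    for(int i=0;i<2;i++){
        if(!got[i]){
            recvP2P(in[i],pred[i],nonblocking);
            if(readSuccessful()){      //status bit, cf. Figure 2.44
                got[i]=1;
            }else{
                allReceived=0;
            }
        }
    }
    if(!allReceived){
        nap();                         //suspend; entry point at reactivation
    }
}
//normal task execution starts here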
Figure 2.45 depicts a task graph and the corresponding task execution times. At
first, a conventional, pre-calculated as-soon-as-possible (ASAP) scheduling is applied
(a). This scheduling scheme minimizes the overall execution time of the task graph.
For a minimal execution time, three processor instances are necessary. In order to
realize this scheme on an actual system, a corresponding task-processor binding as
well as the sequence in which the tasks will be activated on their corresponding
processors is derived from the task graph. These two properties, i. e., the binding and the
task sequence, may either be calculated by some instance inside the system right before
the task graph is executed or may be a pre-calculated result. In both variants,
the scheduling instance of the system has to ensure that the tasks are activated in the
defined sequence and on the defined processors. After a task has completed its computation,
it might either be disabled by the scheduling instance, e. g., because the scheduling instance
keeps track of the execution times of the tasks, or the task itself might indicate the end of
its computation, e. g., by the sleep instruction introduced above. While this procedure
is common for embedded designs featuring multi-tasking, the self-scheduling
scheme allows discarding any information about the task graph at the expense of
a small timing overhead.
In (b), the same binding is applied for demonstration purposes. Consequently, the
Binding Vector
[Figure 2.45 data: task execution times: A: t, B: 2·t, C: 1.5·t, D: 2·t, E: 3·t, F: t,
G: 1.5·t, H: 2.5·t. Schedules: (a) conventional ASAP scheduling: P1: A, C, F, H, G;
P2: D; P3: B, E. (b) same binding with self-scheduling. (c) random binding on three
processors: P1: A, B, E; P2: D, C, H; P3: F, G. (d) random binding on two processors:
P1: A, C, E, H; P2: D, B, F, G.]
Figure 2.45: Conventional ASAP Scheduling vs. Self-Scheduling. The Time depicted for an
unsuccessful Task Activation is exaggerated.

((A, C, F, H, G) → 1), (D → 2), ((B, E) → 3)
is fed into the virtualizable MPSoC. Now, every task features the self-scheduling code
section detailed in Algorithm 14. Thus, if a task is activated and the preceding tasks
have not yet provided the necessary data, it suspends itself with the nap instruction and
the next task in the Task Group is activated.19 In consequence, the same behavior
as for a conventionally ASAP-scheduled system is observed. As the only deviation, tasks
B and E, which are mapped to processor instance P3, are activated in turn until task
19 Without denoting a task execution sequence, the Virtualization Layer activates tasks in the order they
are entered into the system. Thus, for the Binding Vector BV1 = ((A, B) → 1), A is activated first.
Entering BV2 = ((B, A) → 1) instead leads to B being the first task to be activated.
Figure 2.46: An example Task Graph and Architecture Graph for determining the Worst-Case
Timing Overhead caused by the Self-Scheduling Scheme.
In an optimally scheduled binding, the execution time tTG of the entire task graph
would be

tTG = ∑(i=1..n) ti

with ti being the execution time of task Ti. When employing binding BVWC, however,
the execution time is

tTG = ∑(i=1..n) ti + ((n² − n)/2) · tss
with tss being the time for an unsuccessful invocation and immediate suspension of a
task as denoted in Algorithm 14. Actually, this time may differ for each task depending
on the number of inputs and the amount of input data the corresponding task has to
accumulate before being able to start its execution. When assuming that each task
awaits one word from its preceding task, tss was measured to be 99 clock cycles.
44 clock cycles are needed to invoke the task and 45 to suspend it, cf. Figure 2.10. An
implementation of the self-scheduling routine denoted in Algorithm 14 takes 10 clock
cycles to evaluate when awaiting one word from one input. Thus, for eight tasks, i. e.,
n = 8 and tss = 99 clock cycles, the worst-case timing overhead tTO for the execution
of the entire task graph via the self-scheduling scheme is
tTO = ((8² − 8)/2) · 99 = 2,772 clock cycles.
For many scenarios, the time of tss = 99 clock cycles for an unsuccessful task activation
will be negligible in comparison to the execution times of the individual tasks,
which will lie in the range of thousands of clock cycles. Even the worst-case overall timing
overhead tTO for the execution of the entire task graph may be acceptable, as in
embedded kernels, a simple task switch between two tasks may already consume more
than 1,000 clock cycles. Thus, the timing overhead generated by the self-scheduling
scheme depicted in Figure 2.45 is displayed highly exaggerated in comparison to
real-world applications.
As a result of the self-scheduling scheme, no explicit knowledge about the actual
task graph and the individual task execution times is needed. Thus, besides entering
an initial Binding Vector, which, moreover, may be chosen completely arbitrarily, the
central processor is relieved from scheduling management and might, therefore, be
skipped completely. Self-scheduling is, therefore, a solution that may be applied
for systems whose timing constraints allow for a slight overhead caused by the self-
scheduling procedures. In return, self-scheduling leads to a drastic decrease in
control overhead, as there are no dedicated scheduling algorithms to apply and no
representation of the task graph has to be held available anywhere in the system. The
proposed self-scheduling scheme leads to a self-organizing task graph execution, no
matter whether the task graph is mapped to a single processor or to a processor array.
Besides the intrinsic nap and sleep commands of the Virtualization Layer, no additional
modules are required in order to establish the self-scheduling scheme. This mechanism
will be exploited for the Agile Processing scheme detailed in Section 3.4.
by writing into this memory region. If a lot of data has to be transferred between
tasks, instead of copying these values to the shared memory, a task may just receive a
pointer to the memory region containing the data. At first sight, shared memory seems
to be a well-suited communication scheme for the virtualizable MPSoC, as this solution
does not require knowledge on which processor a task is currently being executed.
However, this concept has several drawbacks. Some access management has to be
provided in order to prevent write/read conflicts. Furthermore, globally addressable
memories, as described in the case where pointers are used, are a potential security
risk. It has to be ensured that a task may not harmfully access or corrupt data stored
in memory regions reserved for other tasks. Moreover, for the FPGA-based prototype
of the virtualizable MPSoC, large shared memories are hard to synthesize efficiently
due to the on-chip memory primitives, which usually just feature two read ports and a
common write port. The usage of an off-chip shared memory, however, would increase
the latency and, therefore, significantly slow down the system performance.
Another alternative for task communication is bus transfer, cf. Figure 2.47b. Here, all
tasks have access to a common communication interface. They may communicate by
writing the receiver address and the payload data to this interface. Since all tasks are
Figure 2.48: In conventional Designs, Point-to-Point Interfaces are hard-wired among Pro-
cessors.
assumed to eventually send data, a bus arbiter instance has to manage bus accesses
among tasks. The more participants are connected to the bus, the more the arbiter
becomes a potential bottleneck for system performance. Moreover, since the bus
may only be accessed by one writing instance at a time, the throughput of the bus
is limited. As a solution, several busses might be instantiated. Each task then features
a dedicated bus that only this task may access in a writing manner. The IBM Cell
architecture, e. g., exploits a set of four ring-based data busses, by which eight so-called
SPU processor units are interconnected [Kistler 2006]. However, this
causes a tremendous overhead in terms of both resources and wiring complexity.
The last communication alternative to address is point-to-point connection. Here,
communication participants are usually hardwired together, cf. Figure 2.47c. This
allows for very fast data transfer; however, with an increasing number of participants,
the wiring complexity dramatically increases since each participant has to be connected
to each other participant in the system. As an alternative, a chain or ring of connected
participants may be established, cf. Figure 2.47d. For ring-based communication, the
savings in terms of resources and wiring complexity are paid for by a significantly higher
latency, as data may have to pass several processors before reaching the designated
one. Furthermore, other tasks have to interrupt their execution in order to forward data
meant for other tasks.
Obviously, none of these techniques may directly be exploited for the virtualizable
system. Therefore, a suited communication scheme has to be derived from the commu-
nication alternatives described above. As the performance of task communication will
directly affect the system’s overall performance, a solution with low latency, even at
the expense of a higher resource consumption, is favored. Here, point-to-point commu-
nication outperforms shared memories as well as bus-based solutions. To overcome
the communication issues arising from transparent task migration, the advantage of
shared memories, which feature task-based addressing instead of relying on hard-wired
processor point-to-point communication interfaces, is exploited. The following section
will, thus, detail the merging of a conventional point-to-point communication scheme
Addressing interfaces may either be static or dynamic. In the static case, the interface
identifier is written statically into the software code, whereas in dynamic addressing,
the identifier is the result of a computation that takes place at runtime. Dynamic ad-
dressing will, e. g., be exploited for the Agile Processing scheme detailed in Section 3.4.
As each variant is translated to a different machine code representation and, later on,
will undergo special treatment in order to enable virtualized task communication, it is
necessary to discuss both variants in short.
For FSL, the send command of the example above reads as:
//Code for Task A (sender)
myVariable = 1234;
putfslx(myVariable, 4, FSL_DEFAULT);
Here, the default parameter FSL_DEFAULT sets the communication semantics to be
blocking. If the interface identifier is static, putfslx is translated to the following
machine code representation according to the MicroBlaze datasheet [Xilinx, Inc. 2012]:
0 1 1 0 1 1 0 0 0 0 0 rA 1 n c t a 0 0 0 0 0 0 0 FSLx put
0 6 11 16 28
The first six bits indicate the opcode of the instruction. Bit 16 distinguishes between
a send and a receive instruction. rA from bits 11 to 15 addresses the register that
contains the content to be sent. n, c, t, and a are optional flags, which indicate
non-blocking semantics, sending of a control word instead of a data word, atomic
instruction sequences, and a simulated test write. For the prototype implementation,
the flags to indicate non-blocking semantics and control words are considered. The
four bits FSLx from bit position 28 to 31 indicate which of the 16 FSL interfaces of the
MicroBlaze is addressed.
In the example above, the value of variable myVariable of task A is assumed to
be stored in register 11 of the MicroBlaze. Consequently, the machine code for the
putfslx instruction of task A is:
0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 put
0 6 11 16 28
0 1 1 0 1 1 rD 0 0 0 0 0 0 n c t a e 0 0 0 0 0 0 FSLx get
0 6 11 16 28
putfslx and getfslx share the same opcode; bit 16 defines the selected
instruction. Here, rD designates the address of the register in which the received data
word is written. Again, the FSL interface is explicitly addressed by the four bits of
FSLx. As the only addition to the optional parameters, e may indicate an exception in case
of a control bit mismatch. As exception handling is not considered for the prototype
implementation, the usage of this parameter is restricted. For the sake of completeness,
the machine code representation of task B’s getfslx is as follows, if variable newVar
is assumed to be stored in register 19 of the receiving processor:
0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 get
0 6 11 16 28
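At the C level, this receive instruction might be issued as follows; a minimal sketch in
the style of the sender code above, assuming task B reads from FSL interface 3:

//Code for Task B (receiver)
int newVar;
getfslx(newVar, 3, FSL_DEFAULT);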
For dynamic addressing, the dynamic variants putd and getd are employed. Here, the
interface identifier is not encoded as an immediate value; instead, it is taken at runtime
from the register addressed by rB:

0 1 0 0 1 1 0 0 0 0 0 rA rB 1 n c t a 0 0 0 0 1 1 putd
0 6 11 16 21

0 1 0 0 1 1 rD 0 0 0 0 0 rB 0 n c t a e 0 0 0 1 1 getd
0 6 11 16 21
As much code employed in today’s designs is based on legacy code, the envisaged
communication scheme should require only slight adaptations of existing software code.
Therefore, the advocated solution aims at relying on the communication commands as
already provided by the processor’s API. Thus, the instructions introduced above are
reused for the virtualizable task communication scheme. The underlying architecture,
however, is essentially modified as follows.
Instead of denoting one of the processor’s FSL interfaces, the four FSLx bits now carry
the ID of the communication partner. For the example above, task A sends to task B
(ID 1), and task B receives from task A (ID 0):

0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 put
0 6 11 16 28

0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 get
0 6 11 16 28
Figure 2.49: The Task Data Matrix with a Send Row and a Receive Column for each Task
currently mapped into a Task Memory. The Write and Read Logic of a Task is
routed to its Data Memory Interface Multiplexer as depicted in Figure 2.51.
can be achieved, e. g., by partial reconfiguration. Each task that is currently not
mapped into a memory features a dedicated memory region in the Message Memory.
Each address is assigned to another task in the system, cf. Figure 2.50.
The Instruction Stream Monitor (ISM) will be exploited to detect send and receive
commands and passes this information to the Code Injection Logic (CIL), cf. Figure 2.51.
The actual task communication scheme may be divided into three distinct steps:
the runtime code modification for sending, the Message Hub transfer, and the runtime
code modification for receiving data. These three steps are detailed in the following
sections. At first, a basic communication scheme is introduced, which is then expanded
in order to support dynamic addressing as well as transfers to tasks currently not being
mapped into a task memory.
Figure 2.50: The Message Memory is a Communication Matrix sequentialized into a Memory
Block.
Figure 2.51: A Section of the Virtualization Layer displaying the Connection of the Message
Hub to a Task’s Data Memory Interface Multiplexer.
As mentioned in Section 2.2.5, the Virtualization Layer is able to monitor the instructions
sent from a task’s instruction memory to its corresponding processor. Moreover, the
CIL can be routed to the processor’s instruction interface in order to force instructions
into the processor. The virtualization logic is now extended to monitor, via the ISM,
the occurrence of communication commands in the instruction stream, such as the
machine code representation of putfslx. A command is identified by its opcode. As soon
as a putfslx instruction is detected, this instruction is withheld by the Virtualization
Layer and the CIL is routed to the processor’s instruction interface to inject a dedicated
instruction sequence. The CIL replaces the original putfslx command at runtime by a
store word immediate instruction. This instruction is part of the default instruction
set of common processors. It is designed to write the contents of a processor register into
the data memory. For the prototype implementation on Xilinx MicroBlazes, according
to its datasheet [Xilinx, Inc. 2012], the store word immediate command has the
following format:
swi rD, rA, IMM
rD designates a register address of one of the 32 general purpose registers of the
MicroBlaze. Its content will be output on the processor’s data memory interface, which
is usually connected to a task’s data memory. To address the memory, the sum of the
content of register rA and the value of IMM is taken. IMM is a so-called immediate
value, which is directly passed as an argument in the command. The bit pattern of the
store word immediate instruction is as follows:
1 1 1 1 1 0 rD rA IMM swi
0 6 11 16
The CIL now sets rD to the value of rA, which was denoted in the original putfslx
command, cf. Section 2.5.2. rA of the store word immediate instruction is set to 0,
thus addressing general purpose register 0, which statically contains the value 0. IMM
is filled with the task ID of the receiver, which was committed in the FSLx section of
the putfslx command, cf. Section 2.5.2. Revisiting the example of Section 2.5.2, the
following machine code is injected into the processor:
1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 swi
0 6 11 16
This will output the value of myVariable, which is stored in register 11, on the
processor’s data memory interface. As a memory address, 1 is output, which
is the ID of the receiving task, cf. Figure 2.52. The value consequently output
by the processor is, however, not forwarded to the data memory. Instead, the Message
Hub catches this value.
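The bit-level rewriting performed by the CIL may be sketched in C as follows. This is
an illustration derived from the instruction formats above, not the actual hardware
description; the handling of any word-alignment of the ID within IMM is omitted for
simplicity:

#include <stdint.h>

/* Rewrite a putfslx instruction word into a store word immediate (swi):
   rD := original rA, rA := r0, IMM := receiver task ID taken from FSLx.
   Bit 0 is the most significant bit, as in the patterns above. */
uint32_t rewrite_putfslx_to_swi(uint32_t putfslx) {
    uint32_t rA  = (putfslx >> 16) & 0x1F; /* bits 11..15: source register  */
    uint32_t id  =  putfslx        & 0x0F; /* bits 28..31: receiver task ID */
    uint32_t swi = 0x3Eu << 26;            /* opcode 111110 (bits 0..5)     */
    swi |= rA << 21;                       /* rD, bits 6..10                */
    /* rA field (bits 11..15) stays 0: register r0 statically contains 0   */
    swi |= id;                             /* IMM, bits 16..31              */
    return swi;
}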
As soon as the processor executes an instruction that was modified by the CIL, the
task’s data memory is detached and the Message Hub is routed to the processor’s data
memory interface.22
Figure 2.52: The upper Part illustrates a conventional Instruction Sequence. In the lower Part,
the CIL alters a putfslx to a store word immediate, thus, addressing the receiving
Task by its ID.
In the example, the value of myVariable, being output on
the processor’s data memory interface, is written into the TDM via the corresponding
send row of the sending task. The cell of this row is addressed by the task ID of
the receiver. For commands with static addressing, the ID is passed on the Data_Addr
lane, cf. Figure 2.52. If a new value is written into a cell of the TDM, the TDM cell
indicates this by a flag. This flag will be reset during the following read process. After
successfully accessing the TDM, the task’s data memory is reattached to the processor.
By withholding the injection of the store word immediate instruction in case a
TDM cell is already occupied when trying to write in a blocking manner, the task’s
execution stalls in accordance with blocking semantics. However, the task remains
interruptible by the virtualization procedure as the processor itself is not in a blocking
condition. Algorithm 15 denotes the steps for sending commands with static addressing
via the virtualizable communication scheme.
As each task features a dedicated write logic and read logic, all tasks in the system
may write or read in parallel. This prevents the TDM from becoming a bottleneck
for task communication. The newly written data is now ready for being received by
another task.
22 Depending on the processor type, the time between instruction fetch and the actual execution of the
command may vary; it depends on the pipeline depth of the employed processor type.
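The organization of the TDM may be illustrated by a brief sketch. This is a software
model for illustration only, assuming one flag per cell as described above; the actual
TDM is a hardware structure with parallel per-task write and read logic:

#include <stdint.h>
#include <stdbool.h>

#define NUM_TASKS 8   /* tasks currently mapped into task memories (example) */

/* One TDM cell: a data word plus a flag that is set on write, reset on read. */
typedef struct {
    uint32_t data;
    bool     valid;
} TdmCell;

/* tdm[s][r]: task s owns send row tdm[s][*], task r receive column tdm[*][r]. */
static TdmCell tdm[NUM_TASKS][NUM_TASKS];

/* Blocking write: fails while the cell is still occupied, so the
   Virtualization Layer withholds the injected swi and the sender stalls. */
bool tdm_write(int sender, int receiver, uint32_t word) {
    TdmCell *cell = &tdm[sender][receiver];
    if (cell->valid) return false;
    cell->data  = word;
    cell->valid = true;
    return true;
}

/* Non-blocking read: returns false if no new data word is present. */
bool tdm_read(int sender, int receiver, uint32_t *word) {
    TdmCell *cell = &tdm[sender][receiver];
    if (!cell->valid) return false;
    *word = cell->data;
    cell->valid = false;   /* flag is reset during the read process */
    return true;
}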
The basic communication scheme replaces one original send or receive instruction
by one dedicated store or load instruction. Thus, the sequence of the instruction
stream remains in order and no correction of the program counter address is necessary.
Consequently, no timing overhead is introduced to the execution of the tasks.
A communication is completed within one clock cycle. The interruption of the tasks’
execution by the virtualization procedure at any point in time is still guaranteed, even
when applying blocking semantics, which may stall a task’s execution. The basic
communication scheme is, thus, a reliable and fast method to provide communication
in a virtualized environment.
This basic scheme, however, also features some drawbacks. It is assumed that an
instruction sent to the processor is eventually executed. However, depending on
the code, a send or receive instruction may immediately follow a conditional branch.
In case the branch is taken, the send or receive instruction, or rather its dynamic
replacement, has already been read into the processor’s pipeline but will not be executed.
In this case, the TDM will already be configured to receive or provide data, while the
actual read or write access will not happen. As a possible countermeasure, a compiler
might be modified in order to prevent the placement of communication commands
close behind branch instructions.
In order to provide dynamic addressing, not only information on the actual data,
but also details about the actual sender or receiver have to be passed to the Message
Hub. In order to obtain this information, the original send or receive command has
to be replaced by a sequence of instructions. The replacement of one instruction
by a sequence enforces a subsequent correction of the program counter of the task.
Moreover, the basic scheme does not consider that some tasks are not configured into
a task memory at runtime. A solution might be to scale the TDM to a size of n × n,
where n is the number of all tasks in the system. However, this would lead to very
inefficient synthesis results. Thus, in order to cope with all these issues, an enhanced
communication scheme is built on top of the basic communication method.
For the enhanced communication scheme, send or receive commands are replaced by a
dedicated sequence of operations. This sequence is structured in the following way:
1. Detection whether static or dynamic addressing is present and whether the
original send or receive instruction will actually be executed by the processor.
2. Saving the return address of the task.
3. Injection of the instruction sequence that performs the actual data transfer via the
Message Hub.
4. Optionally setting the processor’s status register accordingly (e. g., error bit).
5. Unconditional jump back to the saved return address.
[Figure 2.54 panels:]
(a) The Communication Command is valid as well as the Instruction immediately following.
(b) The Communication Command is invalid as a Branch Target Calculation is ongoing. The
next Instruction is executed after the Branch Target Calculation has been completed.
(c) The Communication Command is valid as it lies in the Delay Slot of a Branch Command.
The Instruction directly following is invalid due to the Branch Target Calculation.
Figure 2.54: The highlighted Instructions mark the first Instructions to be executed after
Occurrence of a Communication Command.
Depending on the original send or receive instruction and its semantics, bits in the processor's status register are modified. In the optional Step 4, these modifications in the status register are handled. This is the case, e. g., when expecting to receive data marked by a control bit23, but receiving data without this bit being set. This is then indicated in the processor's status register.
Finally, in Step 5, an unconditional jump to the return address saved in Step 2
is performed. After having completed this sequence, the task resumes its normal
operation. As soon as another communication command is detected, the sequence is
triggered again. The behavior of the Message Hub and the TDM is modified as well.
For the basic communication scheme, only the TDM of the Message Hub was exploited.
In doing so, all tasks mapped to task memories could communicate with each other.
Now, a task may also send data to a task, which is currently not mapped into a task
memory. Therefore, the TDM is accompanied by the Message Memory. While the TDM contains rows and columns for each task currently being mapped into a task memory, the Message Memory provides a buffer cell for every task in the system.
23 A status bit to indicate, whether user data or control data are being transmitted.
[Figure: TDM Configuration before and after the Swap — Task C with ID 2 takes over the Task Memory, and the pending Data Word 3456 remains accessible.]
Figure 2.55: Replacing the Content of the Task Memory leads to an Update of the TDM Configuration.
[Figure: Conventional versus Virtualized Point-to-Point Transfer.]
Swapping of TDM and Message Memory contents can be done in parallel without adding additional timing overhead.
Timing Behavior
A feature enabled by the virtualized data transfer, and usually not available for common point-to-point data transfer, is the option of data broadcasts. A designer may define an ID, which acts as a broadcast address. If the Virtualization Layer detects this target ID in a send command, the corresponding data word is written in a non-blocking manner into the entire write row of the sending task in the TDM. As this happens concurrently, the broadcast expansion does not add any latency. Thereby, existing values in the TDM, which have not yet been read out, are overwritten. The automotive application example in Section 4.1 will exploit this feature to continuously distribute sensor data to a set of driver assistance tasks.
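Exploiting this feature requires nothing beyond an ordinary send addressed to the broadcast ID, as the following sketch illustrates; the concrete ID value is an assumption for the example.

    /* assumed communication primitive, cf. the task code in Figure 2.61 */
    extern void sendP2P(int data, int target_id);

    #define BROADCAST_ID 15   /* assumed designer-defined broadcast address */

    void publish_sensor_value(int sensor_value)
    {
        /* written non-blocking into the sender's entire TDM write row;
         * previously unread values are overwritten */
        sendP2P(sensor_value, BROADCAST_ID);
    }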
TDM Buffer
Routing data transfers via the Message Hub has the significant advantage that a
receiving task does not necessarily have to be executed at the time the sending task
outputs its data word. The Message Memory buffers this data word until the receiving
task is activated and reaches the receive instruction. In doing so, even tasks that share the same processor instance in a time division scheme may communicate with each other.
An extension to the advocated communication scheme can be made by adding a
third dimension to the TDM. In this case, each cell of the TDM could buffer several
data words. Depending on the specific application scenario, this could prevent stalls
and improve the system's throughput. The communication scheme itself, however, neither changes nor becomes more complex in doing so. As discussed for the
task-processor interconnection network in Section 2.3.5, the expected structure of a
TDM with larger buffer depth is not well-suited to be efficiently mapped to the inherent
structure of an FPGA. Therefore, the implementation of a larger buffer depth of the
TDM was skipped for the prototype implementations.
The communication scheme may be adjusted to feature either blocking or non-blocking semantics. However, in order to prevent deadlocks, the Virtualization Layer
has to maintain the ability to disable a task at any time, e. g., due to a scheduling event,
even if the task is currently stalling because of a blocking send or receive. Blocking is
realized by consecutively feeding nop instructions into the processor. Therefore, in case
a task currently stalling due to a blocking send or receive has to be deactivated, the
processor itself is not in a blocking condition and the CIL may inject the virtualization
machine code to disable the task. Upon the task’s next activation, task execution is
resumed at the point of interruption within this sequence.
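The following C sketch models this selection logic from the Virtualization Layer's point of view; all type and helper names are assumptions for the example.

    #include <stdint.h>

    #define NOP 0x80000000u   /* assumed encoding of a nop instruction */

    typedef struct {
        int blocked_on_comm;      /* task stalls on a blocking send/receive */
        int deactivation_pending; /* a scheduling event requests a switch   */
    } Task;

    extern int      tdm_transfer_ready(const Task *t);
    extern uint32_t cil_virtualization_code(Task *t);
    extern uint32_t fetch_from_task_memory(Task *t);

    /* instruction selection for one cycle: the processor itself never blocks */
    uint32_t next_instruction(Task *t)
    {
        if (t->deactivation_pending)
            return cil_virtualization_code(t); /* CIL injects the switch code */
        if (t->blocked_on_comm && !tdm_transfer_ready(t))
            return NOP;                        /* blocking realized via nops  */
        return fetch_from_task_memory(t);
    }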
The tasks in the virtualizable system may also communicate with external modules
via the task communication scheme. Therefore, each external module receives a static
ID, just as the tasks in the system. For each external module, the Message Hub
transparently acts as glue logic between virtualized tasks and external components.
The automotive application scenario in Section 4.1 demonstrates the usage of external
modules in combination with the virtualizable task communication scheme.
2.6 Reliability Features of the Virtualization Layer
The Virtualization Layer provides a safe execution environment due to its disjoint task
memory design. However, for security-relevant systems, communication may pose a potential threat. The designer may want to restrict or at least control communication
between tasks of the virtualizable processor array and external modules. In this case,
approaches as proposed in [Cotret 2012] may be exploited. Here, for FPGA-based
MPSoC, an additional local firewall can be added for each communication participant,
which manages and controls data transfers. While this was presented for bus-based
systems, the concept of local firewalls might be transferable to the virtualizable
communication scheme.
Intermediate Conclusions
[Figure: Task Graph A→B→C with Task Instances B1 and B2 executed on Processors 1 and 2; a DMR Voter compares their Outputs and provides the voted Result of Task B together with an Error Signal.]
Figure 2.56: Task B of the Task Graph is executed in a Dual Modular Redundancy Scheme.
[Figure: Three Instances of Task B feeding a TMR Voter, which provides the voted Result of Task B together with an Error Signal.]
Figure 2.57: Task B of the Task Graph is executed in a Triple Modular Redundancy Scheme.
output data for being in a feasible range. Such monitoring has been presented, e. g., in
[Biedermann 2011a]. A simple DMR system is depicted in Figure 2.56.
For TMR, three instances of a task are executed in parallel. Here, one faulty process
may be identified if two processes output identical results and a third one deviates.
In case that all results are different, the same issue as for DMR arises. Here, similar
approaches may be applied to resolve this situation. A TMR system is depicted in
Figure 2.57.
As depicted in Figures 2.56 and 2.57, in DMR- or TMR-voted systems, each task
instance is usually statically bound to a processor. The voter module is connected to the
processors. In the virtualized system, however, the task-processor binding may change
during runtime. Moreover, a task may not be the sole user of a processor resource.
These challenges have to be considered.
In order to support redundancy schemes, the Virtualization Layer features generic
modules, the so-called Virtualized Module Redundancy (VMR) modules, which may either act in DMR or in TMR mode.
[Figure: Virtualization Layer with Task Memories and Processors; two generic VMR Modules are attached to the Interconnection Network.]
Figure 2.58: Virtualization Layer featuring the generic VMR Modules for Redundancy.
In a design with n task memories, n/2 VMR
modules are instantiated, as for n/2 DMR-voted tasks, n task memories are necessary.
A virtualizable MPSoC featuring 4 task memories and 4 processors expanded by VMR
modules is depicted in Figure 2.58. Each of the VMR modules is connected to the TDM
for data exchange. For this purpose, the designer defines a unique ID for each VMR
module, which may not overlap with the IDs already assigned to tasks or to external
modules.
Besides these voting modules, the behavior of the Virtualization Logic as well as the
binding definitions have to be adapted in order to support the redundancy schemes.
This is illustrated by means of an example. Given the task graph in Figure 2.59, left
hand side, task A sends its result to task B, which passes its result further to task C. In
order to enable virtualized task communication as detailed in Section 2.5.3, the designer
has arbitrarily defined task A to feature ID 0, whereas task B has ID 1 and task C ID 2.
The right hand side of Figure 2.59 depicts the corresponding software code. Now, a
redundant execution of task B with subsequent result voting is desired. Accordingly,
a second task instance of B as well as a voting module is added, cf. Figure 2.60. The
original task B now acts as instance B1 . The second instance B2 of B is treated just as
a normal task in the system by the architecture, even if its software code may be an
identical copy of original task B. Consequently, the designer has to define a unique ID
to task B2 in order to enable communication for this task. He arbitrarily chooses ID 4.
Obviously, changes in the code of all tasks are now necessary. A has to be modified
in order to provide data not only to B1 , i. e., the original B, but also to B2 . Both instances
of B have to be revised in order to send their data not immediately to C, but to a
voter module. Additionally, C has to be altered in order to receive B’s data via this
voter module. The simplest solution would be to burden the designer with the duty
of adapting task communication to the redundancy scheme. The present approach
will, however, automatically handle these issues and, therefore, enable the transparent
transition between redundant and non-redundant task execution at runtime.
[Figure: Task Graph A→B→C with communication IDs 0, 1, and 2 and the corresponding Software Code.]
Figure 2.59: An example Task Graph with its corresponding Software Code.
[Figure: Task A (ID 0) feeds the Instances B1 (ID 1) and B2 (ID 4), whose Results are compared by a DMR Voter before being passed to Task C (ID 2).]
Figure 2.60: The Task Graph of Figure 2.59 with Task B being executed in a redundant Scheme.
At first, the Binding Vector has to be expanded not only to denote information about
which tasks to execute in a redundancy scheme, but also in order to pass information
about the data flow in the task graph. For the example in Figure 2.59, a corresponding
Binding Vector might be:
BV1 = (A → 1), (B → 2), (C → 3)
A is mapped to processor 1, B to processor 2, and C to processor 3. For the task
graph in Figure 2.60 with a redundant execution of task B, the expanded Binding Vector
is denoted as follows:
BV2 = (A → 1), (B1 → 2), (C → 3), (B2 → 4); (A, (B1(1), B2(4)) → VMR(9))
Here a second instance of B is mapped to processor 4. Additionally, configuration
information for a VMR module is passed in the Binding Vector. The configuration
information has the following format:

(ld, (b1(ID1), b2(ID2)[, b3(ID3)]) → VMR(IDVMR))

Here, ld denotes the task delivering input data, and b1, b2, and the optional b3 denote the task instances to be voted with their corresponding communication IDs. By denoting two task instances, DMR is selected. If three
task instances are denoted, TMR is automatically selected in the corresponding VMR
module. IDVMR denotes a specific voter module in the system by its communication ID.
In the example, ld consists of task A. The tasks to be voted are task B1 with ID 1 and
task B2 with ID 4. As voter, the VMR module with ID 9 is chosen.
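For illustration, the following C sketch captures such a configuration entry as a data structure; the structure itself is an assumption for the example, not the actual encoding inside the Binding Vector.

    /* illustrative in-memory form of one VMR configuration entry */
    typedef struct {
        int ld_id;         /* communication ID of the data-delivering task */
        int voted_ids[3];  /* IDs of the voted task instances              */
        int n_voted;       /* 2 selects DMR, 3 selects TMR                 */
        int vmr_id;        /* communication ID of the voter module         */
    } VmrConfig;

    /* the entry corresponding to (A, (B1(1), B2(4)) -> VMR(9)) */
    static const VmrConfig example = {
        .ld_id = 0, .voted_ids = { 1, 4 }, .n_voted = 2, .vmr_id = 9
    };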
From this point, the Virtualization Layer manages the corresponding data distri-
bution to and from voter modules. Algorithm 17 denotes the behavior of dynamic
communication rerouting when executing a redundancy scheme. For the example,
Figure 2.61 details the resulting TDM transfers.
Note that for ld the data distribution is not realized by means of inserting additional send commands, as this would require a subsequent correction of ld's program counter.
Instead, the Message Hub is configured to distribute the data to the b parts. For all tasks
in the system, the employment of voter modules is fully transparent. Therefore, there
is no need to manually modify the software of tasks in order to apply a redundancy
scheme. Furthermore, the system may switch at runtime between redundant and
non-redundant task execution.
[Figure: Runtime Code Manipulations for the Example — Task A (ID 0) sends x to B1 (ID 1) and B2 (ID 4); both Instances send their Results n to the DMR Voter (ID 9), which sends the voted Result nv to Task C (ID 2) via the TDM.]
Figure 2.61: Runtime Code Manipulations and Data Transfers via the TDM when executing a
Redundancy Scheme based on Algorithm 17. In the TDM, VMR sends the voted
result nv to Task C with ID 2 by exploiting B1's Write Column.
24 When being on standby most of the time, the battery of a smartphone may last for about a week.
However, the author observed this to be a rare consumer behavior for smartphone users.
25 Depending on the processor type employed, a processor may automatically induce a self-reset as soon
Figure 2.62: In order to reduce the Power Consumption of the System, a Task’s Context is ex-
tracted by the Virtualization Procedure and the corresponding Processor Instance
is disabled.
The designer may define “low energy” profiles by hand, or a system scheduling instance may decide at runtime
about which tasks to exclude from execution. In the updated binding, the tasks to be
deactivated are not assigned to a processor. The update of the task-processor binding
may then free processor resources, which, in consequence, may be temporarily disabled
as described above.
In the scope of the Bachelor Thesis of Antonio Gavino Casu [Casu 2013], the effect of
deactivating clock inputs of processor cores was analyzed. Here, a Virtex-6 LX240T
FPGA was equipped with a system monitor, which tracks the on-chip power consump-
tion of Xilinx MicroBlaze processors. A dedicated processor instance is exploited for
measurement management. Figure 2.63 depicts the effect on power consumption in a
four-processor system when varying the number of active cores. All active processor cores run at 200 MHz. Despite a static power offset present in the system, which is, e. g., caused by the processor used for measurement management, power consumption scales linearly with the number of cores, as visible from Figure 2.63.
Instead of completely deactivating a processor instance, the clock frequency or the
supply voltage may be lowered in order to reduce the temporary energy consumption
of the system. In these cases, the virtualization procedure is not needed [Irwin 2004]. However, as processor and task memory are clocked modules, they need to feature a common clock signal in order to function properly. Consequently, when stepping down a processor's frequency, the frequency of the task memories bound to the processor has to be stepped down accordingly, cf. Figure 2.64. An analysis of the measurements
of [Casu 2013] reveals a correlation between a MicroBlaze’s frequency and the system’s
power consumption, cf. Figure 2.65.
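This correlation may be approximated, e. g., by a simple affine power model, as the following C sketch illustrates; the coefficients are illustrative placeholders and not measured values from [Casu 2013].

    /* toy affine model: static offset plus a per-core, per-MHz share */
    double power_watts(unsigned active_cores, double f_mhz)
    {
        const double P_STATIC = 1.1;     /* assumed static offset [W]           */
        const double K        = 0.0005;  /* assumed dynamic cost [W/(core·MHz)] */
        return P_STATIC + K * active_cores * f_mhz;
    }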
The virtualizable architecture may also consider Task Groups for energy management.
Instead of just suspending a task's execution when the corresponding processor is being disabled, the task originally bound to this processor may be scheduled in a Task Group
together with other tasks on another processor in the array instead. This is depicted in
[Plot: Power (W) versus Number of active Processor Cores (4 down to 1), ranging from about 1.5 W to 1.2 W.]
Figure 2.63: Power Consumption in a Multi-Processor System scales linearly with the Num-
ber of active Cores. All Cores employ the same Frequency.
Figure 2.64: Stepping down a Processor’s Clock Frequency may reduce the temporary En-
ergy Consumption of the System. The Frequency of the Task Memory has to be
stepped down accordingly.
Figure 2.66. Here, processor 2 is being disabled. Instead of also deactivating task 2, this
task now shares access to processor 1 together with task 1. This feature is enabled by
the transparent resource sharing provided by the Virtualization Layer, cf. Section 2.2.6.
However, care has to be taken by the designer not to violate timing constraints of tasks thereby. The usage of such “low energy” profiles is demonstrated for the
quadrocopter application example in Section 4.2.
The last means to preserve energy discussed here is the dynamic reduction of the paral-
lelization degree. Section 3.4 will demonstrate that the virtualizable architecture may
be tailored towards a data-flow multi-processor system featuring a dynamic degree of
parallelism. As several processors execute parallelized sections of a task, by lowering
the degree of parallelization at runtime, processor resources are freed and may be
[Plot: Power Offset (W), from 0 to 0.1 W, versus System Frequency (MHz), from 0 to 200 MHz.]
Figure 2.65: Power Consumption is directly affected by a Processor Core’s Clock Frequency.
Figure 2.66: Tasks originally assigned to a Processor, which is being disabled, may share
access to another Processor Resource by the Transparent Multi-Tasking provided
by the Virtualization Layer.
deactivated as highlighted above. As a further step, the execution of the task may be
shifted completely from a parallel to a purely sequential representation. In doing so, all but one processor may be deactivated. This will also be demonstrated in Section 4.2.
In a nutshell, the virtualizable architecture features several techniques in order to
cope with deviations in the energy supply. These methods involve the processor instances, the tasks, as well as the corresponding Virtbridges. Measurements on the proto-
type implementation for a Virtex-6 LX240T FPGA have proven that these methods are
effective means to realize energy-aware systems.
points in time is enabled. This allows for dynamic binding updates. A dynamically
reconfigurable interconnection network provides intrinsic means for processor resource
sharing between tasks. Task groups, i. e., tasks, which are assigned to the same processor
resource, may be scheduled in a time division scheme or apply a self-scheduling scheme,
which eliminates the need for a central scheduling instance. Fast task communication
despite varying processor assignments is accomplished by the runtime modification
of point-to-point communication commands. Data is thereby rerouted via a Message
Hub. This hub expands common point-to-point protocols by features such as data
broadcasts. Generic VMR voter modules enable the application of common redundancy
strategies in order to increase the system’s reliability. By exploiting the virtualizable
task communication scheme, a fully transparent transition from non-redundant to
redundant task execution is achieved. Features for energy management allow for
dynamic reactions to deviations in power supply, e. g., by processor frequency stepping
or processor deactivation. The feature to dynamically update task-processor bindings assures the execution of tasks despite the deactivation of parts of the processor array.
The following Chapter will address the challenge of designing MPSoC systems,
which may exploit the advocated virtualization properties.
3 The Virtualizable MPSoC: Requirements,
Concepts, and Design Flows
The previous Chapter has introduced a hardware-based virtualization procedure for
tasks running on embedded processors. This architecture enables task migration at
runtime and, thus, provides a huge degree in execution dynamism. However, by
now, the aspect of designing systems for this virtualizable architecture has not been
considered.
This problem of efficiently exploiting the benefits provided by the virtualizable
architecture for embedded designs may be divided into two parts. The first part of
the problem is providing a suited design environment, which allows for a fast and
safe instantiation of the underlying virtualizable architecture based on the needs of
the designer. Therefore, the following section will briefly discuss requirements for
efficiently creating designs for the virtualizable architecture. Afterwards, common
flaws and weaknesses in current design frameworks regarding the design of embedded
multi-processor systems are outlined before highlighting a framework, which features
support for the virtualizable architecture.
The second part of the problem is the establishment of adequate design flows, which
make use of the execution dynamism of the architecture. These design flows, in
turn, require the existence of design tools, as stated in the first problem, in order
to map the designs onto the virtualizable architecture. The past has shown several times that promising, cutting-edge multi-processor architectures, such as the Ambric chip, which in 2008 featured up to 336 streaming RISC processors on a single die [Ambric Inc. 2008], eventually fail [Portland Business Journal 2008] if the aspect of providing corresponding design paradigms is neglected. In case of Ambric, these issues were discussed, e. g., in [Biedermann 2008]. Without either new, easy-to-adopt design flows or transition solutions, which allow for re-using legacy designs, even the most powerful new architecture will remain unsuccessful. Thus, the following sections will highlight
a set of design flows tailored to recent trends in embedded system design, which
may directly profit from the virtualization features. Figure 3.1 depicts some of the
design optimization goals achievable with the virtualizable MPSoC. Foundation of the
envisaged design flows is the ability for a dynamic reshaping of the system’s (initial)
configuration, i. e., the task-processor binding. Initial configurations are set up by executing either Algorithm 6, 11, or 13, depending on whether the binding vector
contains task groups or not. Algorithm 18 denotes the general procedure for runtime
system reshaping, which exploits the mechanisms introduced in the last chapter.
[Figure: Virtualizable MPSoC — transparent Updates of Task-Processor Bindings, very fast Task Switching, seamless Task Interruption and Reactivation at any Point in Time. Energy-Aware Systems — React to Fluctuations in Power Supply, (temporary) Processor Deactivation, (temporary) Task Deactivation, Task (Self-)Rescheduling.]
Figure 3.1: Possible Design Goals achievable by means of the virtualizable MPSoC.
Unsurprisingly, design tools such as Xilinx EDK only feature a strict binding of
tasks to processors. Therefore, their application in combination with the virtualization
solution is not considered further. Instead, a new, dedicated design framework is
introduced, which not only resolves the most hindering issues when designing embed-
ded multi-processor systems, but also keeps a mutual compatibility to existing design
frameworks, such as Xilinx EDK. In doing so, the advantages of established workflows, such as the seamless link to other tools in the tool chain, e. g., to system simulators, are maintained.
The following section will introduce this design framework and highlight the exten-
sions made in order to support the application of the virtualizable architecture.
Figure 3.3: Work Environment in FripGa Tool with clear visual Representation.
For reference, the instantiation of a MicroBlaze soft-core processor takes in Xilinx EDK
approximately 50 mouse clicks in different workspace windows. A skilled designer is able to perform this tedious sequence in about two and a half minutes. FripGa accepts input either by exploiting a visual paradigm or in terms of scripting commands based on the language Tcl [Tcl Developer Xchange 2013]. Thus, in comparison, in FripGa the instantiation of the same processor type is performed either by simply dragging a processor instance into the design window or by executing a single Tcl command. Both alternatives take less than 10 seconds. Figure 3.4 depicts a design, which is automatically set up if the designer executes a corresponding Tcl script. In consequence,
processor instances, communication links, software projects as well as processor-task
bindings are automatically created. The resulting design is ready for synthesis.
FripGa features a mutual compatibility to existing design tools. As a consequence,
an already existing design flow and tool chain may be exploited. For the prototype
implementation, FripGa exploits the Xilinx Design Flow, which targets FPGAs, cf. Fig-
ure 3.5. Thus, designs created in FripGa may be exported into Xilinx EDK. In Xilinx
EDK, modules and interconnects, which are not natively supported by FripGa, may be
added and configured. The ability to refine a design in existing design tools unburdens
FripGa from the need to replicate the full functionality of these design tools. Moreover,
the re-import of a design, which was altered in EDK, into FripGa is possible. While FripGa does not display modules it does not support, it keeps them in the design description. In doing so, maintenance of designs, which were created in FripGa and then further altered in EDK, is enabled. The design depicted in FripGa in Figure 3.3 is the same as the one shown in EDK in Figure 3.2, which features a confusing visual representation. The gain in terms of a clear representation is obvious.
[Figure: FripGa on Top of a proprietary Design Flow — Implementation, Bitstream Assembly, and Device Initialization.]
Figure 3.5: Design Flow with FripGa on Top of a commercial Design Flow, e. g., for Xilinx
FPGAs.
Although this application scenario explicitly aims at the weaknesses of EDK, these disadvantages
hold true for most of the design tools mentioned in the previous Section. Design tools
commonly exploited in commercial flows are complex, feature-heavy tools, whose
functional range has grown over years, but which currently lack decent support for the
design and editing of embedded multi-processor architectures. Table 3.1 provides a
side-by-side comparison of FripGa and Xilinx EDK. Besides these advantages, FripGa
supports the design of virtualizable MPSoC architectures as well.
Figure 3.7: Workflow for virtualizable Designs: Tasks are added to the Virtualization Element,
the Number of Processors as well as sets of Task-Processor Bindings are defined.
• Group 0: These systems do not feature any means to prevent faults caused by the
occurrence of errors.
• Group 1: The systems of Group 1 have at least mechanisms to detect the occurrence
of errors, e. g., by dual modular redundancy. Consequently, suited counter-measures
may be applied. However, a correct continuation of the system execution is not
ensured.
• Group 2: These systems may not only detect, but also mask errors, e. g., by applying
triple modular redundancy, where faulty results are detected as a deviation of one of
three parallel computations. This faulty result is then masked by the alleged correct
result produced by the other two parallel computations. Thereby, the system may
resume its execution.
• Group 3: This group contains systems, which have means to detect and mask errors
and may, additionally, apply mechanisms to recover from permanent errors or faults.
Systems, which provide spare resources or a dynamic reconfiguration of the system’s
behavior, may hide from an external observer that a fault has occurred. Other systems
may resume their operation, however, at the deliberate expense of some features, e. g.,
the deactivation of tasks with lower priority or a reduction of their performance
due to the disabling of defunct processor resources. Systems in Group 3 are called
Self-Healing systems within this work.
Figure 3.8: Hard/Soft Error CUDD Tool introduced in [Israr 2012], which derives optimal
Bindings.
A self-repair approach for an MPSoC was presented in [Muller 2013]. Here, an array of Very Long Instruction Word (VLIW) processors provides means of self-healing in the event of errors. Due to the VLIW architecture of the processor cores, each instruction
word consists of several slots, which may be filled with operations. In case that a
part of the circuitry, which executes a certain slot, is detected as being defective, a
re-scheduling of the task takes place so that the corresponding slot is not used anymore.
In case that the entire resource is faulty, tasks may be bound to another processor.
However, in contrast to the virtualizable MPSoC, a dedicated VLIW architecture has
to be exploited, while in the virtualizable MPSoC, any embedded general-purpose
processor architecture may be employed. Moreover, in [Muller 2013] a binding update
is performed by means of memory transfers. Instead, the binding update via a network reconfiguration in the virtualizable MPSoC incurs no additional memory access overhead. Furthermore, as the
network configuration is computed in parallel with the task context extraction, it
outperforms a solution, which relies on memory transfers.
The execution dynamism achievable with the runtime binding updates of the virtualizable MPSoC, in combination with the redundancy modules detailed in Section 2.6, enables its self-healing features. The following section will
highlight some strategies tailored to the virtualization features. The actual application
of these measures is application-specific.
[Figure: Task Graph with Tasks A–F and Execution Times (A: t, B: 2·t, C: 1.5·t, D: 2·t, E: t, F: t) next to the Architecture Graph with Processors P1–P4.]
Figure 3.9: Example Task Graph with corresponding Task Execution Times and Architecture
Graph.
reliability may be derived. By applying an optimized binding, the likelihood for the
occurrence of faults is minimized.
Thus, the procedure of mapping tasks to processors is of particular importance.
In general, a plethora of mapping strategies with various optimization goals for
multi-core architectures has been proposed. An overview is provided in [Singh 2013]
and [Marwedel 2011].
The work in [Israr 2012] details an approach to optimize task-processor bindings
regarding reliability and power consumption as two main optimization goals. As input
values, a task graph and an architecture mapping as depicted in Figure 3.9, as well as
reliability values for the processors employed are taken. The designer is hereby aided
by a frontend with graphical user interface in order to input the design specification,
cf. Figure 3.8. Output is a set of task-processor bindings, which are optimal in terms
of reliability and power consumption. For each processor constellation, an optimal
binding is given, cf. Table 3.2. The resulting bindings may directly be exploited as
input for the control processor in the virtualizable MPSoC. If processor instances fail at
runtime, the virtualization procedure allows switching to another optimal binding by a
reshaping, cf. Algorithm 18.
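The following C sketch outlines such a reaction to a detected fault; the helper names are assumptions made for the example.

    typedef struct BindingVector BindingVector; /* opaque, illustrative type */

    extern void                 mark_processor_defunct(int proc);
    extern const BindingVector *optimal_binding_without(int proc); /* e.g., Table 3.2 */
    extern void                 reshape(const BindingVector *bv);  /* Algorithm 18    */

    /* hypothetical reaction to a fault detected on one processor instance */
    void on_fault_detected(int faulty_processor)
    {
        mark_processor_defunct(faulty_processor);
        /* switch to a precomputed binding that omits the faulty core;
         * the reshaping itself takes roughly 90 clock cycles */
        reshape(optimal_binding_without(faulty_processor));
    }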
While conventional embedded systems with static task-processor bindings may
indeed be tailored to feature an initial binding, which has been optimized, e. g., by
the procedures given in [Israr 2012], in the case of failing processor resources, a switch
to another task-processor binding is not possible. The virtualizable MPSoC, however,
may establish a new binding by exploiting the virtualization procedure introduced
in Section 2.2.5 within approximately 90 clock cycles. Figure 3.10 depicts the binding
update after errors have been detected in the results of tasks, e. g., by means of
redundancy techniques or self-tests. By exploiting the self-scheduling scheme of
Section 2.4.4, no additional scheduling procedures are necessary in order to facilitate
the execution of the tasks in the new mapping. In the given example, the execution
cycle of the task graph is repeated at the detection of an error. This behavior is, however,
application-specific. The following section will propose several schemes in order to
[Figure: Gantt Chart of Tasks A–F on Processors P1–P4 over several Execution Cycles; after each Error Detection, the affected Execution Cycle is repeated under an updated Binding.]
Figure 3.10: The dynamic Binding Update at the Detection of Errors based on the System
defined in Figure 3.9 and the Bindings denoted in Table 3.2.
provide self-healing capabilities. The application example in Section 4.1 will then
demonstrate the combination of a binding optimization process and fault-detection
mechanisms in the virtualizable MPSoC under consideration of Quality of Service
aspects.
1 With lower probability, however, the two matching results may be wrong as well. Furthermore, the matching results may be wrong while the one causing the alleged deviation is correct.
[Figure 3.11: Virtualization Layer with VMR Modules attached to the Interconnection Network.]
whether the modules provide the required properties. As for the monitoring approach,
defects are assumed if some system properties are out of the bounds of a corridor of
valid states.
After having identified an error or a fault, e. g., by exploiting one or several of
the means mentioned above, the virtualizable MPSoC may trigger corresponding
countermeasures. As the virtualizable MPSoC enables dynamic updates of task-
processor bindings, a task’s execution may be shifted at runtime to another processor
resource, e. g., to a dedicated spare resource held ready for the event of errors, cf.
Figure 3.11. In doing so, the processor instance, on which an error was detected, may
either be reset or be completely deactivated. If the system does not feature any spare
resources, two tasks may swap their processor assignment as a first measure before completely abandoning the faulty processor resource. Possibly, this new task-
processor assignment does not produce errors, e. g., if a bit is permanently stuck in a
register of the faulty processor, which is not read or written by the newly assigned task.
Other approaches, such as [Bolchini 2007, Paulsson 2006a], apply a partial dynamic reconfiguration in the event of permanent errors detected by TMR in order to recover to
a correct system state. When reconfiguring the affected area with the same functionality
but another mapping, the defunct cell is potentially not part of the new mapping. As
the exact identification of a defunct cell is, however, a complex task and the success
when applying a new mapping is not predictable, this approach is not considered for
the proposed self-healing concepts.
In conventional embedded systems, deactivating a processor instance also automatically implies the deactivation of the tasks, which are mapped to this processor instance.
The binding dynamism of the virtualizable MPSoC, however, allows for disabling
the faulty processor instance and for the rescheduling of its tasks to other processor
instances in the array as depicted in Figure 3.10. By exploiting the self-scheduling mech-
anism detailed in Section 2.4.4, apart from defining a new binding, which omits the
faulty processor instance, this task rescheduling procedure is completely self-organizing.
Depending on the actual system, scheduling of the tasks, which were assigned to a
now defunct processor, together with the other tasks in the system on the remaining
processor instances may violate application-specific timing constraints. In this case,
the designer can define which tasks to omit from the binding set in order to still fulfill the timing constraints. In contrast to conventional systems, the designer has the freedom to choose which task to discard if the timing constraints cannot be met on the remaining processor instances. Thus, the MPSoC allows for a graceful degradation of the system.
Summing up, the virtualizable MPSoC features the following, intrinsic means to
enable self-healing properties:
• transparent updates of task-processor bindings at runtime,
• generic VMR modules for DMR- or TMR-based error detection and masking,
• self-organizing rescheduling of tasks via the self-scheduling scheme, and
• graceful degradation by omitting tasks of lower priority from the binding set.
By combining these features, embedded MPSoC can be designed, which may cope with
multiple defects without having to forfeit executing the most crucial tasks in the system.
The application example in Section 4.1 will highlight these self-healing abilities.
[Figure: Sequential Code divided into Sections A (a=b+c;), B (a parallelizable for-Loop filling my_arr), and C (computing d from a and the Elements of my_arr). (a) Co-Processor Scheme: Processor 0 executes A and C, Processors 1–3 execute B1–B3. (b) Data Flow Scheme: Processor 0 executes A, Processors 1–3 execute B1–B3, and Processor 4 executes C.]
Figure 3.12: A parallelizable Code Section B may be executed in a Co-Processor Scheme (a) or
a Data Flow Scheme with higher Throughput (b).
When assuming that a given sequential software code contains parts, which may be
parallelized, it may be divided into three parts A, B, and C. This is depicted in the upper
half of Figure 3.12. Parts A and C are portions, which cannot be parallelized, whereas
part B can be executed in parallel. As an embedded multi-processor architecture is
targeted, each parallel thread in B is planned to be executed on a dedicated processor
instance.
The lower half of Figure 3.12 depicts two design alternatives for the resulting structure.
On the left hand side, a processor/co-processor scheme is applied. Processor 0 starts
the execution of part A. The parallelizable sections Bi may be seen as tasks, which
are outsourced to a set of co-processors, which send their results back to the original
processor. Processor 0 then executes part C. In this scheme, the B tasks are fed by
processor 0 with all data they need in order to be executed. Resolution of data
dependencies is, therefore, easy. However, this scheme has a significant disadvantage,
e. g., if the software is executed in a cyclic way, as it is not uncommon for tasks in
embedded systems. Here, processor 0 has to finish the execution of part C before it
[Figure: Code with a static Dependency between A and C (int b=123;) and dynamic Dependencies between A and B (Variable a), between A and C (Variable a, my_array), and between B and C (my_array).]
Figure 3.13: Several static and dynamic Dependencies between Code Sections A, B, and C.
may start another execution cycle. During this time, the processors, which execute the
Bi tasks remain unused.
Thus, a data flow scheme as depicted in Figure 3.12, right hand side, is advocated.
In contrast to the co-processor scheme, part C is assigned to a dedicated processor.
Consequently, the processor, which executes part A, feeds the processors executing
the Bi tasks. These processors send their results further to processor 4, which executes
C. Meanwhile, processor 0 may already start the next execution cycle of A. In doing
so, a pipelined structure is established, which delivers a higher throughput than the
co-processor scheme. As visible from the code depicted in Figure 3.12, part C relies on
variables computed in part A, e. g., variable a. When dividing code into several disjoint
sections, these dependencies between part A and C have also to be considered.
Static Dependencies
For static dependencies, the value, which is needed by C and that is stored in A, is not
altered by A at runtime, i. e., remains static. In Figure 3.13, this applies to variable b.
Thus, once transferred from A to C, C may re-use this value for each of its execution
cycles. A compiler may identify static variables, e. g., by the fact that they are read-only
[Figure: Design Flow — Precompilation with Pragma Processing, followed by the Architecture Setup in FripGa.]
values, which are never overwritten. As such variables may already be detected during
the compilation phase, instead of transferring this value once at runtime, they may be
copied to C at the compilation phase. In doing so, a data transfer at runtime is avoided,
which reduces execution time.
Dynamic Dependencies
For dynamic dependencies, the value stored in A, which is needed by C, may be over-
written by A at runtime, e. g., if it is the result of A’s computations. In Figure 3.13, this
applies to variable a. Between sections B and C, the variable stored in my_array[2]
also causes a dynamic dependency. In this case, copying the original value to C at the
compilation phase will cause erratic behavior, as the value in A will vary at runtime.
As a solution, a dynamic transfer of this value is established. For each execution cycle
of C, the value, which is stored and processed in A, will be sent from A to C. The
speed as well as the number of dynamic data transfers will significantly affect the overall performance of the parallelized system, as the following sections will outline.
[Figure: The sequential Code with the #fp_parallel Pragma is split into Section A with send Routines (send(a,Bi); sendArray(my_arr,sizeof(my_arr),Bi); send(sizeof(my_arr),C); send(a,C);), the B Sections (recv(a,A); recv(my_arr_i,A); my_arr_i=a*2; send(my_arr_i,C);), and Section C with recv Routines (recv(s,A); recv(a,A); recvArray(my_arr,s,Bi); d=a+my_arr[2]; e=d+b;).]
Figure 3.15: Processing a Pragma splits the sequential Software into three Parts A, B, and C.
Communication Commands resolve dynamic Data Dependencies.
parts. The B parts are fed sequentially with array elements. After having processed the
array, the C part reassembles the elements into an array. For this purpose, A transmits
the size of the array to C, cf. Figure 3.15. The send and receive routines exploit the
virtualizable task-to-task communication scheme, which was detailed in Section 2.5.
In the precompiler, the communication IDs of code sections A, B, and C as well as
the ID of the control processor of the Virtualization Layer are denoted. An example
invocation of the precompiler may look as follows:
ParallelPreprocessor -IexternSource --output-dir Output
example.c --CMB=0 --A=1 --C=2 --B=3,4,5,6
CMB denotes the ID of the control MicroBlaze, A, B, and C the IDs of the corresponding
code sections. The list of IDs denoted for the B sections must match the actual number
of B parts created, either defined by the loop counter of the encapsulated for-loop or by
the numProc parameter.
[Figure 3.16: Resulting Structure of the MPSoC Design — the A, Bi, and C Sections are bound to dedicated Processor Instances connected via the Interconnection Network.]
After having split the sequential software into the A, B, and C parts, a Tcl script is
created by the pre-compilation process. This Tcl script contains commands, which
will setup the architecture, i. e., the processor instances, as well as the corresponding
software-processor bindings in FripGa. When executing this script in FripGa, the A
and C part, as well as the number of B parts as defined in the pragma are bound to a
corresponding number of processor instances. The designer may now either further
elaborate the design in FripGa, simulate the design on a system simulator or he may
trigger the bitstream generation phase. The resulting structure of the MPSoC design is
depicted in Figure 3.16.
3.3.5 Discussion
The speedup achievable by parallelizing a task can be estimated by Amdahl’s Law
[Amdahl 1967]. It may be denoted as
S = 1 / (rs + rp/n)
[Figure: Execution Time t on one Processor, t/2 per Processor on two, t/3 on three; the Remainder up to t is Time available for Communication and Stalls.]
Figure 3.17: The higher the Degree of Parallelism, the more Time in each parallel Processor
Instance Pi is available for Communication Overhead or Stalls without increasing
the overall Execution Time.
with S being the achievable speedup. rs is the fraction of a task, which is executed
sequentially, rp denotes the fraction, which is executed in parallel on n parallel modules. Thus, rs + rp = 1. By following this law, the sequential portion limits the potential speedup. The smaller the parallelizable fraction rp is, the less effective increasing the degree of parallelism becomes. Thus, in an ideal case, the B parts of the proposed solution
account for most of the computational effort, while the A and C part just distribute and
accumulate data.
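The following small C program evaluates Amdahl's Law for an example configuration.

    #include <stdio.h>

    /* Amdahl's Law: S = 1 / (rs + rp/n) */
    static double speedup(double rp, int n)
    {
        return 1.0 / ((1.0 - rp) + rp / n);
    }

    int main(void)
    {
        /* with 80% parallelizable code, four processors yield
         * S = 1 / (0.2 + 0.8/4) = 2.5; even n -> infinity caps S at 5 */
        printf("S = %.2f\n", speedup(0.8, 4));
        return 0;
    }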
Figure 3.17 depicts a theoretical consideration of communication overhead for par-
allelizing tasks. When dividing a task into parallel instances assigned to dedicated
processor instances Pi each, the theoretical execution time t of the task is reduced to
t/n with n being the number of parallel instances. However, time for resolving data
dependencies and stalls caused by waiting for receiving these data will occur in real
world. Waiting for data and transferring these data may take up to t − t/n for each
instance without increasing the overall execution time. This is the theoretical border
which defines the efficiency of a parallelization solution. In Figure 3.18 the execution
time of the task consisting of the Sections A, B, and C is higher for a parallelized
version of the task (a) than for the sequential implementation running on a single
processor. This is due to the resolution of dynamic data dependencies, i. e., the delay
caused by data transfers as detailed beforehand. In this example, the communication
overhead – depicted as s and r for send and receive – is very high in comparison to the
computation time spent in the parallelized sections of B. For this example, no dynamic
dependencies between A and C are assumed. By delivering the parallel instances in a
[Figure: Task with Sections A (t), B (8·t), and C (t); sequential Execution on P1 takes 10·t. (a) Parallel Execution on P1–P4 with Send (s) and Receive (r) Overhead: 11·t. (b) Pipelined Delivery of the parallel Instances: 10·t. (c) Additionally pipelined Data Transfers: 9.5·t.]
Figure 3.18: Increase in Execution Time despite Parallelization due to the Overhead for
resolving Data Dependencies (a). Increased Throughput and reduced Execution
Time by Pipelining (b and c).
pipelined fashion (b), the execution time may be reduced, as the parallel instances may
start earlier with their computations in comparison to (a). However, the execution time
is equal to the sequential implementation despite the fact that now four processors
instead of one are executing the task. Further pipelining the data transfers (c) eventually
shows a reduction of the execution time.
When talking about parallel architectures, the most obvious step to further reduce
the execution time is to increase the degree of parallelism. Figure 3.19 depicts the
same task as above, but now being executed on six processors, i. e., with four parallel
instances. Timing overhead is assumed to be the same as in the example above. Now,
[Figure: The same Task, now with four parallel B Instances on six Processors; sequential Execution: 10·t. (d) Parallel Execution: 8.5·t. (e) Pipelined Execution: 7.25·t.]
Figure 3.19: Lowered Execution Time by higher Parallelization Degree (d). Further Decrease by Pipelining (e); Stalls between two Execution Cycles, e. g., in P2, indicate that an additional parallel Instance will not further reduce the Execution Time.
the timing overhead for resolving the dynamic data dependencies carries less weight
due to the higher number of parallel instances (d). The overall execution time is lower
than for the sequential implementation. By applying the pipelining scheme (e), the
execution time may further be reduced. Therefore, the code parallelization by pragma
insertion exploits the pipelining scheme to reduce the execution time. Thus, each
parallel instance is fed with the smallest fraction of data possible in each turn. As
visible from the small gaps, e. g., in (e) at P2 between its two execution cycles of B,
further increasing the number of parallel instances would not lead to a further reduction
of the execution time. As P2 stalls until receiving the data for its second execution
cycle of B, a theoretical fifth parallel instance would have to wait the same time until
beginning its first execution cycle of B.
[Figure: The same Task Parallelization with lower Communication Overhead; sequential Execution: 10·t. (f) Two parallel Instances: 7.25·t. (g) Four parallel Instances: 6.25·t.]
Figure 3.20: Less Communication Overhead in Comparison to the Execution Time of B Sec-
tions reduces Execution Time considerably.
The overhead for stalls and transferring data is the main bottleneck for parallelizing
tasks. Figure 3.20 depicts the same task parallelization as before in a non-pipelined
manner with two (f) and four (g) parallel instances, respectively. Now, the delay for
sending and receiving is assumed to be lower, e. g., due to less data needed for the
computations in B or a faster communication infrastructure. As visible, the execution
time is now already lower than the sequential implementation for the version with two parallel instances (f). This time is further reduced by the version with four parallel instances (g). Due to the reduction of the communication overhead, a fifth parallel instance would lead to a further reduction of the execution time, as it could immediately be fed right after P5. Thus, the efficiency of a parallelization solution is
not only defined by the number of parallel instances, but likewise by the overhead
caused by resolving data dependencies. For the virtualizable MPSoC, the virtualizable
task-to-task communication scheme detailed in Section 2.5 provides fast data transfers
with low latency and therefore contributes to efficient parallelization solutions.
2 The pragma preprocessor has indeed a second configuration mode, in which a multi-processor system may be set up without the Virtualization Layer between tasks and processors.
[Figure: Processor P1 holds the sequential Task (Sections A, B, C); Processors P2–P4 hold parallel Instances of Section B, and Processor P5 holds Section C.]
Figure 3.21: Instead of a sequential Execution, parallel Instances of the Task may be invoked.
3.4.2 Prerequisites
With the Agile Processing scheme, sequential and parallel execution cycles of a task
may take turns. Consequently, it is assumed that after a task has finished its execution, it will be executed again, either immediately or at least eventually in the future. This holds true for most tasks in embedded systems. Many tasks run permanently, i. e., nested in an unconditional loop. For each execution cycle of a task, the execution scheme, i. e., sequential or parallel, may be selected.
Figure 3.21 depicts a task with three sections A, B, and C. As in Section 3.3, the code in section B can be parallelized, e. g., by means of pragma preprocessing. In the Agile Processing scheme, a task's execution may either be sequential or parallel. In the
latter case, parallel instances of the B section are mapped to processor instances by
means of a binding update via the virtualization procedure. Results of the parallel
instances are sent to section C, which in case of the parallel execution is also executed
on a dedicated processor. In order to obtain the desired structure, the procedure of
pragma preprocessing is now being upgraded.
[Figure: Design Flow — Precompilation with Pragma Processing, Architecture Setup in FripGa, and Definition of Binding Sets and Triggers.]
Figure 3.22: The Design Flow for the Setup of a Design supporting Agile Processing.
So far, the result of pragma preprocessing was a task, which has been split into three
disjoint parts A, B, and C. A distributes data to the parallel instances of B, while C
accumulates the results computed by B. This representation is now being merged with
the pure sequential representation of the task in order to match the structure with
Figure 3.21. Consequently, the preprocessor for pragma evaluation is modified so that
the result of the procedure is again three disjoint parts of the task. However, instead
of section A, an instance of the sequential task with dedicated code portions for data
distribution to the instances of B is created. Depending on the desired execution scheme
for the current execution cycle, the task either resumes its sequential implementation or
enters the code portion, which acts as section A of the original procedure, cf. Figure 3.23.
In the Agile Processing scheme, the number of parallel instances of the B section
is variable. Therefore, the routines, which resolve dynamic data dependencies, have
to be modified as well. This minor change is implemented by replacing the fixed variable in the send and receive routines, which denotes the number of parallel instances, with a dynamic parameter, which is passed by the control processor to the code switching point at runtime. Additionally, the designer defines in the preprocessor the
communication IDs of the sequential section, the B sections, and the C section. It is
required that the IDs of the B sections are assigned as a continuous sequence. Changes
in the preprocessing have been implemented by Matthias Zöllner in [Zöllner 2014].
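The effect of this change may be illustrated by the following C sketch of the data distribution; the function names are assumptions for the example.

    extern void send(int data, int target_id); /* cf. Figure 3.15 */

    /* data distribution with a runtime-variable number of B instances;
     * num_b is the dynamic parameter passed by the control processor,
     * first_b_id exploits the continuous ID sequence of the B sections */
    void distribute(const int *values, int count, int num_b, int first_b_id)
    {
        for (int i = 0; i < count; i++)
            send(values[i], first_b_id + (i % num_b)); /* round-robin */
    }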
The code switching point is a section of the task, in which the execution scheme
for the current execution cycle of the task is selected. The motivation for the Agile Processing scheme has named several reasons which may trigger the switch of the execution mode. For now, it is assumed that the request to toggle between sequential and parallel execution is sent by the control processor, which manages the task-processor bindings inside the Virtualization Layer. With slight modifications in the following communication protocol, a task itself may request a switch of its execution mode. In the latter case, a task, e. g., in a sensor system, which detects an unusually high number of incoming sensor events, might temporarily switch to a parallel execution.
As soon as the sequential task reaches the code switching point, it requests informa-
tion about the execution mode. The control processor responds with the information whether a sequential or a parallel execution is desired. In case a sequential execution
is indicated, the task resumes its execution. In case a parallel execution is selected,
the control processor triggers a virtualization procedure in order to activate the corres-
ponding instances of the B section as well as the C section in the system. If required,
these task sections may dynamically be reconfigured into the system as detailed in
Section 2.3.5. Accordingly, the control processor passes information about the number
of the B sections as well as the ID of the ResultMux to the sequential task. The duty
of the ResultMux module is detailed in the following section. Consequently, the send routines, which automatically resolve dynamic data dependencies, start distributing data to the allocated B instances, beginning with the B instance with the lowest ID. If there
are dynamic data dependencies between the sequential and the C section, these are
resolved as well, cf. Figure 3.23. The C section is provided with the ID of the ResultMux
module. After accumulating the results generated in the parallel stages of the instances
of the B section, the C section sends these results further to the ResultMux. Figure 3.24
depicts the communication between the participants of the Agile Processing Scheme.
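The following C sketch models the code switching point from the task's point of view; all names are assumptions for the example.

    enum { SEQUENTIAL = 0, PARALLEL = 1 };

    extern int  recv_from(int sender_id);     /* assumed receive primitive */
    extern void run_sequential_cycle(void);
    extern void run_section_A(int num_b, int result_mux_id);

    #define CONTROL_PROCESSOR_ID 0            /* assumed control MicroBlaze ID */

    /* code switching point, reached once per execution cycle of the task */
    void code_switching_point(void)
    {
        if (recv_from(CONTROL_PROCESSOR_ID) == SEQUENTIAL) {
            run_sequential_cycle();
        } else {
            int num_b  = recv_from(CONTROL_PROCESSOR_ID); /* # of B instances */
            int mux_id = recv_from(CONTROL_PROCESSOR_ID); /* ID of ResultMux  */
            run_section_A(num_b, mux_id);  /* act as section A of the task    */
        }
    }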
Result Aggregation
[Figure 3.23: The sequential Task is merged with the Sections of its parallelized Representation — at the Code Switching Point, the Task either resumes its sequential Implementation or acts as Section A, distributing Data to the B Instances and resolving Dependencies with the C Section.]
Besides sorting result values, the ResultMux has the duty to provide the results for
subsequent tasks in the task graph. A sequential task implementation as well as each
B and C section of its parallelized implementation feature a Task ID, cf. Section 2.5
in order to enable virtualized task communication, cf. Figure 3.26. The ResultMux
disguises itself as being the sequential implementation of the task. This is depicted
in Figure 3.27. In doing so, the designer may build the communication among tasks
in the task graph based upon a pure sequential representation of all tasks. If tasks
are being executed at runtime in a parallelized version, the ResultMux ensures the
correct addressing of tasks. In order to mimic the sequential representation of a task, a
ResultMux is able to access the write column of this task in the TDM. For subsequent
tasks it is, therefore, transparent whether they are accessing data delivered by the
sequential representation or by the ResultMux.
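The following C sketch models this behavior; the buffer depth and the helper names are assumptions for the example.

    #include <stdint.h>

    #define MAX_RESULTS 64            /* assumed buffer depth */

    typedef struct { int seq_no; uint32_t value; } Result;

    extern Result result_mux_recv(void);              /* from the C section(s) */
    extern void   tdm_write(int task_id, uint32_t v); /* write column access   */

    /* collect possibly out-of-order results and release them in order into
     * the write column of the mimicked sequential task */
    void result_mux_cycle(int seq_task_id, int n_results)
    {
        uint32_t buf[MAX_RESULTS];
        for (int i = 0; i < n_results; i++) {
            Result r = result_mux_recv();
            buf[r.seq_no] = r.value;   /* sort by sequence number */
        }
        for (int i = 0; i < n_results; i++)
            tdm_write(seq_task_id, buf[i]); /* appears as the sequential task */
    }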
[Figure 3.24: Sequence Diagram between Control Processor, sequential Task Section, B Sections, C Section, and ResultMux — if the Number of B Sections is greater than zero, Data for the parallel Stages is distributed, Results of the parallel Stages are collected, the ID of the ResultMux and Data for the C Section are passed, and the aggregated Results from the B Sections are delivered.]
[Figure: Results numbered 1–8 arrive at the ResultMux from the sequential and parallel Stages and are released in ascending Order; the ResultMux is configured by the Control Processor.]
Figure 3.25: The ResultMux Module aggregates and sorts Results generated in sequential and
parallel Stages.
Figure 3.26: Based on the Execution Scheme, a subsequent Task may receive Results either
from the A or the C Part of the preceding Task.
While the transition between sequential and parallel execution of a task is now enabled,
the scheduling of a task graph, which contains parallelizable tasks, is challenging.
Figure 3.28 depicts a task graph and corresponding task execution times. Task B is an Agile Processing task.
Figure 3.27: The ResultMux mimics the sequential Implementation of a Task in order to
provide Results for subsequent Tasks.
Figure 3.28: A Task Graph and corresponding Task Execution Times with B being an Agile
Processing Task.
Figure 3.29: A normal self-scheduled Scheme (a) in comparison to the Agile Processing
Scheduling Scheme (b).
was randomly chosen. Task B passes its results via a ResultMux to E and H, which is not depicted
here. In each invocation, a task checks whether it can start its execution. If not, it
suspends itself via the nap instruction. Once it has finished its actual execution,
it removes itself from the task group via a sleep instruction.
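A minimal sketch of such a self-scheduled task skeleton is given below; my_turn() and work_remaining() are hypothetical helpers, while nap() and task_sleep() stand for the nap and sleep instructions of the architecture.

    /* self-scheduling skeleton of a task in a task group */
    extern int  my_turn(void);          /* may the task start its execution? */
    extern int  work_remaining(void);
    extern void do_execution_cycle(void);
    extern void nap(void);              /* suspend until the next reactivation */
    extern void task_sleep(void);       /* remove the task from the task group */

    void task_main(void) {
        while (1) {
            if (!my_turn()) {           /* cannot start yet */
                nap();
                continue;
            }
            do_execution_cycle();
            if (!work_remaining())      /* actual execution finished */
                task_sleep();
            nap();                      /* execution cycle finished */
        }
    }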
Figure 3.30: The enhanced Agile Processing Scheduling Scheme: Tasks C and F are not
interrupted by the Agile Processing Scheme.
In case the control processor chooses the parallelization degree to apply to be lower
than the number of processors present in the system, some tasks may resume their
execution despite an ongoing Agile Processing phase. The question about which tasks to
Preservation of Contexts
In order to fully profit from this approach, the task memories of the MPSoC are initially
filled with normal, i. e., sequential tasks. Only in case of a parallel execution, the
corresponding code sections are mapped into task memories, e. g., by means of a partial
dynamic reconfiguration. However, as discussed in Section 2.3.5, overwriting a task
memory also erases the TCM of the task which previously was active. In case of
the Agile Processing Scheme, an interrupted task thus could not resume from the
point of its interruption after its re-activation, when the parallelized task has finished
its computation. As Section 2.3.5 has detailed, performing a TCM read-out before
overwriting the task memory with another task or providing dedicated TCMs for each
task are solutions to this issue. Alternatively, the Virtualization Layer could be modified
so that the B parts and the C part, which are mapped into task memories at runtime,
do not access the TCM of the previously active tasks. As the B parts and the C part
start from scratch, they do not need to restore any of their contexts. Consequently,
the Virtualization Layer would have to be adjusted to not erase the TCM context in case
of a partial reconfiguration. While the TCM erasure is a safety feature, which enforces Postulate 4
(Strict encapsulation and isolation of tasks), it may be disabled for the Agile Processing
scheme.
3.4.5 Discussion
By applying the Agile Processing scheme, the execution behavior of tasks may be
tailored to dynamic factors, such as temporarily increased need for throughput, de-
fective processor resources, or variations in energy supply. In combining the pragma
parallelization approach, which splits sequential code into data flow-oriented code
sections, with the virtualization features, a high degree of execution dynamism is
achieved. The virtualization procedure guarantees a smooth and fast transition between
the execution modes. As the current execution mode of a task is fully transparent to
the other tasks in the task graph, there is no need for the designer to modify the original
task graph or the addressing of task-to-task communication.
So far, only one parallelized task has been considered. However, the design flow depicted
in Figure 3.22 may be executed for each task to be parallelized before the step of the
automatic architecture generation. Afterwards, the architecture generation phase may
be triggered once in order to set up the design. The designer may then define the
corresponding triggers for each task.
3.5 Conclusions
This Chapter has demonstrated how to exploit the virtualization properties in order
to create architectures with distinct optimization goals. Reliable architectures may be
set up as well as architectures, which feature the parallelized execution of tasks. The
design of such architectures is supported by a dedicated design framework, which is
seamlessly integrated into existing design flows. Consequently, a designer may rely
on a comprehensive workflow in order to set up his desired system. The following
Chapter highlights application examples, which will demonstrate the actual system
behavior when employing a virtualizable processor array.
4 One for the A and the C section each as well as at least two instances of the B section. One B section is
regarded as a pathological case.
4 Application Scenarios
This chapter demonstrates some selected use cases, which highlight the diversity of a
virtualizable MPSoC. The proposed virtualization concept, e. g., enables the design of
reliable, energy-constrained, real-time, or parallelized architectures. The first application
example is an automotive use case, where the virtualization features provide functional
reshaping in order to adapt a car’s assistance systems to the current road type under
the consideration of reliability and harsh timing constraints.
Afterwards, energy awareness features and Agile Processing schemes are exploited
for a simulated energy-constrained quadrocopter, which is assumed to be equipped
with several sensors and analysis tasks. Based on the available power supply, the
quadrocopter facilitates processing either in a parallelized fashion or sequentially.
Figure 4.1: Layers of the Automotive Assistance Systems [Darms 2006]. Modules implemented
for this Case Study are depicted on the right hand side.
• Blind Spot Detection (BS) warns the driver if another car is currently present in the
mirror's blind spot
• Lane Change Support (LCS) recognizes envisaged lane changes and prevents collisions
with cars in the blind spot
• Lane Keeping Support (LKS) monitors lane markings to keep the car on track
• High Beam Assistance (HBA) deactivates the high beam to prevent dazzling of oncoming
traffic participants
• Traffic Sign Detection (TSD) warns the driver in case a passed traffic sign is assumed
not to have been correctly interpreted by the driver
• Parking Assistant (PA) highlights parking spots and aids backing into a parking
spot.
Table 4.1: Functional Profiles, i. e., active Driver Assistance Tasks depending on Road Type.
Freeway 0 1 1 2 2 3 4
Road 0 1 2 3 4
Town 0 1 1 2 2 4
The second road type, which is considered, is rural road. These roads usually feature
one lane per direction with no distinct separation besides a center line, cf. Figure 4.2b.
Permitted velocities are usually up to 100 km/h in Germany. As there is no physical
separation of the driving directions, accidentally leaving the lane may lead to a frontal
collision.
The third road type to be considered is roads in towns, cf. Figure 4.2c. Here, despite
the quite low velocity of below 50 km/h, an additional risk is added by pedestrians
crossing the street as well as by frequent lane changes necessary in front of large
crossroads. Finding a fitting parking spot is a further challenge.
Information about actual execution times of driver assistance tasks varies widely depending
on the underlying processor architecture. As the prototype implementation of the
advocated virtualization scheme exploits Xilinx MicroBlaze soft-core processors with
fairly limited performance, explicit task execution times are not given. Instead, task
execution times are denoted by means of abstracted time units. Based on execution
times available in literature, such as in [Chen 2009, Leu 2011, Cerone 2009, Wu 2009,
Jung 2005], and confidential data provided by commercial manufacturers, execution
times of assistance tasks are scaled to be comparable. The times denoted in Figure 4.3
are assumed for the assistance tasks to be the worst case execution time to interpret
sensor data and to deliver a result.
Figure 4.3: Execution Times of the Assistance Tasks. The Reactivation Times for each Road
Type are indicated by dashed Bars.
The usage of LKS and LCS is mutually exclusive. As soon as a planned lane change
is detected, e. g. by the activation of the direction indicator, LKS has to be deactivated
in favor of LCS. Otherwise, LKS would interfere with the driver's aim to change the
lane. Therefore, LKS and LCS may both be bound to the same processor.
Besides the task execution times, another time constraint needs to be considered. As
sets of tasks will be scheduled on a processor instance, it has to be ensured that each
task will be reactivated soon enough to fulfill its duty. A blind spot detection, e. g.,
which could only check every 20 seconds for another car in the blind spot due to
scheduling, would be meaningless, as for most of the time the driver would not receive
up-to-date information about potential cars in the blind spot. Thus, the so-called reactivation
time is defined as the time between the deactivation of a task and its next reactivation,
cf. Figure 4.3. Based on the assumed velocity of the car, the individual properties of
the task and the task’s worst case execution time, a reactivation time for each task is
derived. Only if a task is suspended for less than its reactivation time can it reasonably
support the driver. As for the execution times, abstract time values are derived
from a set of actual timing values found in literature or provided confidentially by
manufacturers. CD, LKS, and LCS feature harsh timing constraints; their reactivation
time is given as 0. Thus, scheduling one of these tasks with any other task on the same
Figure 4.4: Task Graph featuring eight independent Tasks and Architecture Graph for the
Automotive Application Example.
processor resource will lead to a violation of its timing constraints. Therefore, each of
these tasks will be assigned to a dedicated processor in the processor array.
In order to find suited mappings of the driver assistance tasks to the virtualizable pro-
cessor array, the binding optimization methodology of [Israr 2012], whose exploitation
for the virtualization approach was detailed in Section 3.2.1, is employed. Input is
an unconnected Task Graph and a fully connected Architecture Graph as depicted
in Figure 4.4. As reliability value for the employed MicroBlaze processors, 0.99 is
assumed. After a Design Space Exploration, the output of this optimization process
is a set of bindings, which are optimal in terms of reliability. For each functional profile and
possible combination of processors in the system, an optimal binding is generated.
According to [Israr 2012], in case that there are several bindings with the same level
of reliability for a given set of tasks and processors, the one producing the shortest
execution time is chosen. However, this procedure does not take the reactivation times
introduced above into consideration. Thus, the procedure may output bindings, which
violate some timing constraints.
Therefore, the design flow is expanded to an iterative procedure, which, as first
step, outputs bindings with optimal reliability for a task graph and its corresponding
architecture graph. Subsequently, these bindings are checked whether they meet the
timing constraints arising from reactivation times. If not, two alternatives exist. On
the one hand, another binding might be chosen by hand, which complies with timing
constraints, but which in return abandons the property of optimal reliability. On the
other hand, instead of trading reliability, tasks with low priority may subsequently
be discarded before calculating new bindings by the optimization procedure with the
remaining task graph until a binding is derived, which meets the timing constraints. As
Figure 4.5: Pareto Frontier of Bindings optimal in Terms of Quality of Service and Reliability.
reliability and the quality of service, i. e., the number of active driver assistance tasks,
represent Pareto optima, it is up to the designer to decide, which optimization goal to
favor.
Considering the fact that some of the driver assistance tasks, such as Collision
Detection, directly account for the driver's safety and his probability of survival in the
event of an accident, their highly reliable operation outweighs the deactivation of less
important tasks, such as the Fog Light Assistant. In Figure 4.5, the corresponding Pareto
frontier is depicted. In case the binding optimization process outputs a binding, which
violates the timing constraints, instead of choosing a binding with lower reliability
(arrow with dashed line), the number of tasks is reduced, until a binding optimal in
terms of reliability is found, which meets the timing constraints.
Based on this deliberation, a design flow as depicted in Figure 4.6 is derived.
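The core of this iterative flow may be sketched as follows. All types and helpers are hypothetical; the timing check approximates the suspension time of a task by the summed worst case execution times of the other tasks bound to the same processor and compares it against the task's reactivation time.

    /* sketch of the iterative design flow; wcet[] and reactivation[] hold
       the abstract time units introduced above */
    #define MAX_TASKS 8

    extern double wcet[MAX_TASKS];
    extern double reactivation[MAX_TASKS];

    typedef struct { int proc_of_task[MAX_TASKS]; int num_tasks; } Binding;

    extern Binding optimize_reliability(void);       /* DSE, cf. [Israr 2012] */
    extern void    discard_lowest_priority_task(void);

    static int meets_timing(const Binding *b) {
        for (int t = 0; t < b->num_tasks; t++) {
            double suspension = 0.0;
            for (int u = 0; u < b->num_tasks; u++)   /* other tasks on same processor */
                if (u != t && b->proc_of_task[u] == b->proc_of_task[t])
                    suspension += wcet[u];
            if (suspension > reactivation[t])
                return 0;                            /* reactivation time violated */
        }
        return 1;
    }

    Binding derive_valid_binding(void) {
        for (;;) {
            Binding b = optimize_reliability();
            if (meets_timing(&b))
                return b;                            /* timing-validated binding */
            discard_lowest_priority_task();          /* keep reliability optimal */
        }
    }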
Figure 4.6: Iterative Design Flow to obtain Timing-validated Bindings which are optimal in
Terms of Reliability.
Figure 4.7: The conventional Task Group Scheduling leads to a Violation of Timing Con-
straints. With an enhanced Scheme, the Tasks meet the Timing Constraints.
As the assistance tasks are assumed to run within a while(true) loop, a new execution
cycle will start as soon as the task is reactivated at the position of the nap instruction.
In doing so, no external scheduling management is necessary.
Figure 4.8: Within four Cycles in the outer Loop of Algorithm 22, the Scheduling Sequence S
is derived for the Binding BVTown1 = ((BS, TSD, PA) → 1).
    main {
        while(true) {        // task runs endlessly
            …                // normal task execution
            …
            …
            nap();           // execution cycle finished;
                             // entry point at task reactivation
        }
    }
Figure 4.9: Adding a nap Instruction to Driver Assistance Tasks enables Task Group
(Self-)Scheduling without Need for explicit Scheduling Management.
correspond to the number of processors active in the system.1 For the Application
Example, bindings are stored on a personal computer, which is connected via a UART
interface to the prototyped implementation on a Virtex-6 FPGA. For the test runs,
fetching bindings via UART is the slowest part in the reshaping procedure. The time of
the UART communication is not considered in the evaluation, as an external memory
module in a real-world application would deliver bindings within a range of few clock
cycles.
1 The maximum number of stored bindings for each profile is 2^n − 1, where n is the number of processors
in the system, i. e., 255 bindings for eight processors.
Figure 4.10: The Setup of the Driver Assistance Tasks in FripGa and an initial Binding for the
Town Profile occupying five Processor Resources.
The corresponding setup is done in FripGa, cf. Figure 4.10, and is finalized in Xilinx
EDK. Afterwards, the design is synthesized. The architecture, whose structure is
depicted in Figure 4.11, is now ready for runtime operation. For the test runs, the
underlying virtualizable MPSoC is scaled to feature a set of eight processors. As there
are eight driver assistance tasks to be mapped, the employment of all eight processors
at the same time leads to a trivial scenario. Hence, the test runs are performed for a
subset of the processors in order to make the scheduling more challenging for demonstration purposes. The
test runs highlighting the functional reshaping functionalities start with a processor
set of six processors. The reliability reshaping will demonstrate exploiting the spare
resources in the processor array for self-healing procedures at the event of failing
processor resources.
[Figure 4.11: Structure of the Prototype: the Assistance Tasks CD, LCS, LKS, BS, TSD, FLA, PA, and HBA are coupled via Virtualization Bridges and the Interconnection Network to the Processors P0 to P7; the Binding Control accesses the central Binding Storage and the Task Data Matrix; a Personal Computer acting as Sensor Fusion Server is connected via CAN Interface and UART to the FPGA Prototype.]
After a new binding has been fetched via UART by the control processor from the central
binding storage, a new network configuration is computed. Afterwards, the virtualization
procedure enables the tasks as defined by the new binding.
For an architecture featuring six processors P0 to P5, the following bindings have been
derived by the iterative Design Flow, which is depicted in Figure 4.6:
BVTown012345 =(CD → 0), ((LCS, LKS) → 1), (BS → 2), (TSD → 3), (PA → 4)
BVFreeway012345 =(CD → 0), ((LCS, LKS) → 1), (BS → 2), (HBA → 3),
(FLA → 4), (TSD → 5)
The indices of bindings denote the processor instances active in the system. For the
Town binding, processor resource P5 remains unused. In the simulation run, binding
BVTown012345 is the system’s initial configuration. The task-processor network is set
according to Algorithm 13. As soon as a change of the road type is detected, processor
P5 is activated by executing Algorithm 18, which sets up the new binding BVFreeway012345 .
Figure 4.12 depicts the resulting task execution sequence. Besides task LCS, which
replaces LKS in case a lane change is indicated by the driver by activating the direction
indicator, no task scheduling is required in this scenario. While fetching a new binding
via the UART interface took in the order of milliseconds, as discussed in Section 4.1.8,
the actual reshaping process by an update of the task-processor bindings and the
interconnection network was accomplished in about 90 clock cycles.
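A binding vector may be thought of as a simple task-to-processor map. The following sketch illustrates how applying such a vector might look; route_task_to_processor() and suspend_task() are hypothetical stand-ins for the network update and the virtualization procedure.

    /* sketch of applying a binding vector to the system */
    #define NUM_TASKS 8
    #define NO_PROC   (-1)

    typedef struct { int proc_of_task[NUM_TASKS]; } BindingVector;

    extern void route_task_to_processor(int task, int proc);  /* reconfigure network */
    extern void suspend_task(int task);

    void apply_binding(const BindingVector *bv) {
        for (int t = 0; t < NUM_TASKS; t++) {
            if (bv->proc_of_task[t] != NO_PROC)
                route_task_to_processor(t, bv->proc_of_task[t]);
            else
                suspend_task(t);          /* task inactive in this profile */
        }
    }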
Now, the system is designed to feature a set of four processors. The following bindings
have been output by the binding optimization algorithm.
Figure 4.12: A functional Reshaping Process during the Transition between Road and Free-
way for six Processors.
Figure 4.13: A functional Reshaping Process during the Transition between Road and Free-
way for four Processors.
by defective processor resources. In the following, these features are highlighted for
the automotive application example.
At first, the question arises about how to detect failing processor resources.3 Here,
several approaches exist. One might encapsulate a processor instance into a wrapper,
which monitors the inputs and outputs handled by the processor resource. Out of an
analysis of the input-output relations and the behavior of a set of output values, it can
be assumed with a certain probability that the task running on a processor instance
is working correctly. Such an approach was demonstrated for a MicroBlaze soft-core
processor in [Biedermann 2011a]. Another possibility to detect errors in processor
resources is a periodic execution of a short, dedicated self-test program, which, e. g.,
consecutively reads and writes every processor register and whose output value is known
in advance. If the result differs from the expected value, it may be assumed that the
processor resource is faulty, e. g., by a stuck bit value. In the following test runs, the
presence of this fault detection mechanism is assumed. The dedicated self-test might,
for example, be activated in designer-defined intervals by the control processor via
the virtualization procedure. For the simulation test runs, the occurrence of faults is
manually indicated to the control processor.
3 Errors may also occur in task memories or the TCM as a result, e. g., of bit flips caused by radiation effects.
They may also become manifest in erroneous system behavior. In systems with high safety constraints,
usually memory error correction mechanisms such as Error-Correcting Codes (ECC) are exploited. Thus,
this kind of error is not considered in the scope of this example. Without dedicated memory error
detection, if such an error is not due to a permanent defect of the circuitry, it may, however, be treated
by overwriting the task memory with its initial configuration, clearing the TCM, and restarting the task.
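A minimal sketch of such a self-test program is given below; the concrete instruction sequence is an illustrative assumption, not taken from the prototype. The control processor would compare the returned checksum against a golden value recorded from a known-good run.

    #include <stdint.h>

    /* minimal self-test sketch: a fixed arithmetic/logic sequence exercises
       the data path; a stuck bit in a register or functional unit would
       alter the resulting checksum */
    uint32_t self_test(void) {
        uint32_t acc = 0;
        for (uint32_t i = 1; i <= 32; i++) {
            acc ^= i << (i & 7u);   /* exercise shifter and XOR */
            acc += i * 3u;          /* exercise adder and multiplier */
        }
        return acc;                 /* compared against a golden value */
    }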
For the first test case, a processor system consisting of eight processors is exploited,
cf. Figure 4.11. As for the functional reshaping, an active set of six processors P0 to P5
is assumed at first. Two of the eight processor resources, P6 and P7 will act as spare
resources. For six processors, the initial binding for the freeway road type, which is
optimal in terms of reliability, is the same as for the functional reshaping process:
BVFreeway012345 =(CD → 0), ((LCS, LKS) → 1), (BS → 2), (HBA → 3),
(FLA → 4), (TSD → 5)
BVFreeway013456 =(CD → 0), ((LCS, LKS) → 1), (BS → 3), (HBA → 4),
(FLA → 5), (TSD → 6)
As visible from Figure 4.14, after the reliability reshaping process, all tasks continue
their work with the highest reliability value possible. Seen from the outside, the system
resumes its operation as before the occurrence of the fault. This continues when
another processor resource, e. g., P4, fails. In this case, the remaining spare resource P7 is
exploited to further assure a correct system operation. The new binding
BVFreeway013567 =(CD → 0), ((LCS, LKS) → 1), (BS → 3), (HBA → 5),
(FLA → 6), (TSD → 7)
is also depicted in Figure 4.14. Until now, the system's functionality could be maintained
even with two failing processor resources. In case another processor fails, a rescheduling
will become necessary.
After the failures of P2 and P4, processor P0 fails. The binding
BVFreeway13567 =(CD → 1), ((LCS, LKS) → 3), (BS, TSD → 5), (HBA → 6),
(FLA → 7)
now schedules BS together with TSD on the same processor resource. The timing validation for
this binding given the task execution and reactivation times as depicted in Figure 4.3
Figure 4.14: Reliability Reshaping Procedures after the Detection of faulty Processor Re-
sources.
was successful. Still, all tasks remain active in the system; the selected binding is
optimal in terms of reliability. The execution sequence of this test run is depicted on
the right hand side of Figure 4.14.
BVFreeway356/raw =(CD → 3), ((LCS, LKS) → 5), (BS, TSD, HBA, FLA → 6)
However, the timing validation reveals that in this case, the reactivation time of task
TSD is violated. Thus, in order to both meet the timing constraints as well as to apply a
binding optimal in terms of reliability, the task with the least priority in the system is
discarded. As visible from Table 4.1, this is task FLA. After having removed FLA,
TSD still violates its timing constraints. Thus, according to Table 4.1, HBA is deactivated
next. The resulting binding
Figure 4.15: Reliability Reshaping after Tasks HBA and FLA have been discarded in order to
meet Timing Constraints.
finally meets the timing constraints. As detailed above, the validation and iterative
re-computation of bindings under refusal of tasks has been performed already in the
system design phase. The central processor may, thus, directly fetch and apply binding
BVFreeway356 . As this binding relies on the output of the binding optimization theory, it
is, despite the degradation of its scope of operation, still optimal in terms of reliability.4
Figure 4.15 depicts the execution sequence during the simultaneous fail of P1 and P7.
While the failure of five processor instances out of a set of eight is a highly
unlikely case, this application example has demonstrated that the reliability reshaping
procedure is able to maintain the most important driver assistance tasks even in the
event of a series of errors. Unlike for systems with static bindings, a failing processor
resource does not automatically affect the tasks currently being executed on this resource.
Instead, in case the full operation cannot be restored, e. g., due to timing constraints in
the remaining processor array, the least important tasks may be selected to be discarded
first. Thus, the system’s reliability and, last but not least, the driver’s safety may profit
significantly from the execution dynamism introduced by the Virtualization Layer.
4 In fact, the overall reliability increases from 0.99283 via 0.99392 to 0.99481 when subsequently removing
FLA and HBA. As there are fewer tasks in the system, the probability of a task being affected by a fault is reduced.
Another method to detect or even cover faults is module redundancy. For FPGA-based
automotive systems, a redundant hardware task execution scheme was presented in
[Paulsson 2006b]. Here, reconfigurable slots of the FPGA are filled with redundant
instances of the tasks and a TMR voter. In case the TMR voter detects a deviation, a
partial reconfiguration of the affected slot is initiated or the task is moved to another
slot. The virtualizable MPSoC takes a similar approach as the slot-based technique for
hardware tasks. Detected deviations will cause a shift of the task to another “slot”,
i. e., to another processor. As detailed in Section 2.6, the Virtualization Layer features
dedicated VMR modules to enable redundancy features. However, redundancy has to
be paid for by a significant overhead in terms of processor resources and task memory. In
the following, an example test run is shown for a subset of tasks, which is executed in
different redundancy schemes.
The employment of Triple Modular Redundancy occupies three processor instances
as well as three task memories for each task. Thus, in a virtualizable MPSoC with
an 8 × 8 task-processor interconnection network only two tasks may be executed in a
TMR scheme at one point in time. As six processors are occupied by redundant task
execution, the remaining two processors may either be held ready as spare resources or
they might be exploited to execute other tasks without a redundancy scheme. Usually,
if a processor resource fails in a TMR scheme, the results produced by the faulty
resource are masked by the TMR voting module. The subsequent computations rely,
therefore, on the remaining two processor instances. The reshaping capabilities of
the virtualizable MPSoC allow, however, replacing the processor instance, which has
been identified as faulty. In doing so, not only are faults masked, but the original
redundancy scheme with three parallel executing task instances may be restored.
A binding for a TMR scheme and the most important tasks, CD and LKS for the
freeway profile may appear as:
In order to allow for a correct voting procedure as detailed in Section 2.6, the task
IDs of the redundant task instances are given as well as the source, which feeds the
parallel instances. In this case, ext1 refers to the simulated Sensor Fusion Server, which
is connected as an external module via the TDM, cf. Algorithm 17. In case an error is
detected in the TMR voter, a binding may be requested, which employs another, previously
unused processor resource of the system. For the case that processor instance P5 fails,
a new binding, in which the TMR assignment is renewed, is applied by application of
Algorithm 18:
Figure 4.16: A TMR Execution Scheme: The Virtualization Procedure may replace and deac-
tivate Processor Resources identified as faulty by TMR Voting Modules.
At runtime, an execution sequence like the one depicted in Figure 4.16 is obtained.
In a DMR scenario, errors are detected, but not masked. Moreover, a DMR scheme is not able to identify which of the
processor resources produced an erroneous result. The Virtualization Layer allows to
perform a “sequential TMR” out of a DMR scheme. In case a deviation is detected by a
voter, the processors employed are subsequently exchanged as denoted in Algorithm 23.
This procedure is depicted in Figure 4.17. In doing so, a faulty processor resource may
be identified and excluded from the current binding.
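The identification step may be sketched as follows; run_task_on() is a hypothetical helper denoting the re-execution of the task on a spare processor, and the result comparison mirrors the pairwise voting described above.

    #include <stdint.h>

    /* sketch of the "sequential TMR" idea: after a DMR voter detects a
       deviation, the task is re-executed on a spare processor and the three
       results are compared pairwise */
    extern uint32_t run_task_on(int proc);

    /* returns the index of the faulty processor, or -1 if no majority exists */
    int identify_faulty(int proc_a, int proc_b, int proc_spare,
                        uint32_t res_a, uint32_t res_b) {
        uint32_t res_c = run_task_on(proc_spare);  /* third, sequential run */
        if (res_c == res_a) return proc_b;         /* B deviates: B is faulty */
        if (res_c == res_b) return proc_a;         /* A deviates: A is faulty */
        return -1;                                 /* spare faulty or multiple faults */
    }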
Figure 4.17: A DMR Execution Scheme: A faulty Processor Resource may be identified by a
“sequential TMR” Procedure.
network is depicted as a dark grey overlay. Figure 4.19 depicts the resource consumption
of the synthesized system. While the overhead introduced by the Virtualization Layer
seems to be significant at first sight, it has to be considered that just about 10 % of
the entire registers of the FPGA are occupied and the employed MicroBlaze processor
types are designed to be mapped very efficiently to the underlying FPGA. Thus, the
contribution of the Virtualization Layer appears to be significant, but only in comparison
to the tiny MicroBlaze processors, not to the entire device. In terms of Lookup-Tables,
however, the Virtualization Layer indeed introduces a fair overhead. As discussed
in Section 2.3.5, this is due to the inefficient placement of the crossbar switches of
the interconnection network, which also leads to the drop in system frequency when
targeting an FPGA. The choice of another target chip architecture, however, will resolve
this issue. The overall memory consumption is very moderate.
[Figure 4.19: Resource Consumption of the synthesized System: Slice Registers 10 % (30,201 of 301,440), Slice LUTs 40 % (60,953 of 150,720), BRAM Blocks 6 % (26 of 416).]
Figure 4.20. A Quadrocopter may be equipped with a plethora of sensor devices, but
has to cope with a limited supply of power. The application example will demonstrate
the execution behavior of a virtualizable MPSoC in an energy-constrained system.
Depending on the actual environmental situation as well as the energy supply, a
dynamic system reshaping takes place. Furthermore, the Agile Processing scheme is
exploited during the Quadrocopter’s flight.
[Table 4.2: Tasks of the Quadrocopter System with Priority Classes (lower Value = higher Priority).]

     0  System Initialization (SI)                0   Startup Routine
     1  Ground Communication (GC)                 0
     2  Position Control (PC)                     0   Flight Control
     3  Autopilot (AP)                            1
     4  Global Positioning System (GPS)           1
     5  Ground Heat Detection & Analysis (GHD)    2
     6  Microphone (MIC)                          2   Primary Casualty Detection Systems
     7  Analysis of Sources of Light (SOL)        2
     8  Visual Body Recognition (VBR)             2   Agile Processing Task
     9  Environmental Temperature Tracking (TT)   3
    10  Air Pressure Tracking (APT)               3
    11  CO2 Level Tracking (C2T)                  3   Auxiliary Sensor Systems
    12  Radiation Level Tracking (RT)             3
    13  Audio Response by Megaphone (AR)          2   Casualty Comm.
Table 4.3: Battery Level Thresholds and resulting Changes in the System's Configuration.

    Battery Level Threshold   System State
    > 30 %                    No restrictions, Agile Processing with maximum Degree of Parallelism
    30 %                      Shutdown of Auxiliary Sensor Systems, Agile Processing with three parallel Instances
    5 %                       Shutdown of Casualty Detection Systems, Initiate Return Flight to Ground Station
    2 %                       Shutdown of Auto Pilot and GPS
VBR will act as an Agile Processing Task in the application example. As soon as the
Quadrocopter has identified a human within the debris, the position is transmitted to
the ground station. With an onboard megaphone, the victim may be informed about the
imminent rescue.
The tasks feature several priority classes, cf. Table 4.2, where a lower value represents
a higher priority. During normal system operation, any task may be executed. The
battery of the Quadrocopter is assumed to last for about 20 minutes of operation when
all systems are active. Table 4.3 denotes the changes in the system’s configuration
depending on the battery charge level. With the battery charge level being above 30 %
of the battery’s capacity, the Agile Processing task, cf. Section 4.2.2, may run with a
maximum degree of parallelism in case a parallel execution scheme is selected. With a
battery charge level between 5 % and 30 %, the degree of parallelism is limited to three
instances. With a parallelization degree of three, a distinct speedup in comparison with
a sequential execution is achieved; however, the energy consumption is lower than with
the maximum degree of parallelism. It is assumed that the control processor receives
information about the current battery charge level from the onboard battery. For the
simulation runs, this information, i. e., the triggers for a system reshaping, is provided
from the outside world by means of a UART interface.
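The threshold logic of Table 4.3 may be sketched as follows; the state names and the helper are hypothetical and merely illustrate the decision the control processor takes upon a battery notification.

    /* sketch of the battery-driven reshaping decision according to Table 4.3 */
    typedef enum {
        FULL_OPERATION,   /* > 30 %: no restrictions, maximum parallelism    */
        AUX_SHUTDOWN,     /* <= 30 %: auxiliary sensors off, three instances */
        RETURN_FLIGHT,    /* <= 5 %: casualty detection off, fly home        */
        MANUAL_FLIGHT     /* <= 2 %: autopilot and GPS off                   */
    } SystemState;

    SystemState select_state(int battery_percent) {
        if (battery_percent > 30) return FULL_OPERATION;
        if (battery_percent > 5)  return AUX_SHUTDOWN;
        if (battery_percent > 2)  return RETURN_FLIGHT;
        return MANUAL_FLIGHT;
    }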
VBR acts as an Agile Processing task, which searches for humans by analyzing pic-
tures captured by the onboard camera. Computer vision algorithms, such as image
segmentation [Felzenszwalb 2004] may be exploited. In order to parallelize this process
the picture may, e. g., be cut into several sections. The size of a section depends on
Figure 4.21: A Picture of the Camera may be cut into Sections depending on the Degree of
Parallelism.
the desired degree of parallelism, which is denoted by numbers in Figure 4.21.5 The
parallelization of an Agile Processing task by the insertion of pragmas was detailed in
Section 3.4.3.
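A row-wise sectioning of a frame may be sketched as follows; sendArray() mirrors the communication primitive used in Section 3.4, while B_INSTANCE() is a hypothetical mapping of a section index to the destination ID of a parallel instance.

    #include <stdint.h>
    #include <stddef.h>

    extern void sendArray(const uint8_t *data, size_t len, int dest);
    #define B_INSTANCE(i) (i)   /* hypothetical destination ID mapping */

    /* cut a camera frame row-wise into num_proc sections, cf. Figure 4.21 */
    void distribute_frame(const uint8_t *frame, int width, int height,
                          int num_proc) {
        int rows = height / num_proc;              /* rows per section */
        for (int i = 0; i < num_proc; i++) {
            const uint8_t *section = frame + (size_t)i * rows * width;
            sendArray(section, (size_t)rows * width, B_INSTANCE(i));
        }
        /* remaining rows (height % num_proc) would go to the last instance */
    }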
It is assumed that the time needed to transfer the pixel data to the parallel instances
is significantly lower than the computation time spent in the parallel instances. Sec-
tion 3.3.5 has demonstrated that delivering the parallel parts in a pipelined fashion,
cf. Figure 3.18 (b) further lowers the overall execution time. For the application example,
the execution time of the task and its communication overhead are scaled accordingly.
It is assumed that the result of a parallel instance is a binary yes/no decision, whether
a human shape has been detected or not. As the execution time of the other tasks in the
system is not of particular importance in this example, they run in an endless loop
unless they are suspended because of a Task Group scheduling event or a reshaping
procedure.
Figure 4.22: Sequence of a Rescue Mission of the Quadrocopter. Background Image taken
from [Abassi 2010].
The Quadrocopter sequentially scans the rectangles in the map for casualties. The flight path is indicated
by the arrow. In rectangle C1, the heat sensor indicates a ground temperature, which
may hint at the existence of a victim. As a result, VBR starts to analyze the most recent
picture in a parallel fashion (III). As a result of processing this picture, the presence of
a victim has been detected. Its position is transmitted to the ground station and the
victim is informed by the megaphone about the rescue. The Quadrocopter returns to a
sequential state of VBR and resumes scanning the next rectangle for further casualties.
At some point in time, the battery level falls below the 30 % threshold. Consequently,
the auxiliary systems are shut down in order to preserve energy (IV). In rectangle D3,
the microphone records noises. Again, a parallel processing of the recorded pictures by
VBR is enabled (V). However, the degree of parallelism is limited to three instances
due to the dwindling energy supply, cf. Table 4.3. After successfully detecting a human,
the Quadrocopter tries to resume its planned flight route. However, already in the next
rectangle, the energy supply hits the 5 % mark (VI). Thus, the rescue mission is aborted
and the casualty detection systems are shut down. The Quadrocopter autonomously
triggers the return to its ground station. However, the energy supply further diminishes.
As the battery is in a critical state, with just 2 % of its charge level, the ground station is
informed that the Quadrocopter will now disable its autonomous flight functions (VII).
The rescue team now has to manually fly the Quadrocopter back to its ground station.
Figure 4.23: The virtualizable MPSoC is shaped to a close Cluster consisting of an Array of
four and an Array of eight Processors.
After a recharge process, the Quadrocopter may resume its mission starting from the
last rectangle, in which it searched for victims.
The decision to exploit a clustered MPSoC instead of one larger processor array is
not made for safety reasons. Due to the strict memory separation of tasks, the tasks necessary
for flight control could be bound together with the other tasks in the system without the danger
of malicious task interference, cf. Section 2.2.6. Nevertheless, larger processor arrays
lead to a larger task-processor interconnection network, thus increasing the overall
latency, cf. Section 2.3.4. The tasks relevant for flight management are assumed to
feature timing constraints, which prohibit being scheduled with other tasks on the
same processor instance. They are, therefore, mapped to a 4 × 4 array. The remaining
tasks are bound to the 8 × 8 array.
A UART interface acts as a simulated battery. The drain of the battery is adapted
to the measurements of the processors’ energy consumption, cf. Section 2.7. Thus, the
battery informs the control processor of the MPSoC in case a threshold is undershot
and then the control processor will manage the updates of task-processor bindings. A
reshaping may be triggered for several reasons:
1. The GPS task provides information about whether the rescue area has been entered
or left. In this case, the casualty detection systems are being activated or deactivated,
respectively.
3. The battery indicates falling below a certain threshold and, thus, triggers a graceful
degradation of the sensor tasks.
Figure 4.24: Execution Behavior of Flight Phases I to III according to Figure 4.22.
in Figure 4.25. Finally, just the most crucial tasks to maintain flight operation remain
active.
Timing overhead for the partial reconfiguration when applying the Agile Processing
scheme cannot be derived from the simulation. Furthermore, the actual speed is
dependent on the speed of the off-chip memory containing the partial bitstream images.
According to measurements in the scope of a Bachelor Thesis by Randolph Lieding
[Lieding 2014], the protocol to set up the Agile Processing scheme for six parallel
instances, cf. Figure 3.24, is executed within 510 clock cycles after reshaping the
system's configuration for a parallel execution.
The diagram in Figure 4.26 depicts the relation between the measured power con-
sumption of the virtualizable MPSoC and the Quadrocopter’s simulated battery charge
level. The power offset of the system as well as the assumed external sensor devices,
which may be disabled as well, are not considered. Measurements were performed
on the virtualizable MPSoC clocked to 10 MHz in the scope of the bachelor thesis of
Antonio Gavino Casu [Casu 2013]. Vertical lines depict the points in time at which a
reshaping of the system’s configuration takes place. The small battery icons depict
[Figure 4.25: Execution Behavior of the Flight Phases II and IV to VII according to Figure 4.22.]
energy-related reshaping procedures. The different mission states are denoted by Roman
numerals, cf. Figure 4.22. When reducing the number of active processors, the
slope of the battery charge level becomes flatter, cf. phase IV in Figure 4.26. Running
just the most critical tasks in order to maintain a stable flight further stretches the
Quadrocopter's battery life, cf. phases VI and VII.
Figure 4.26: Drain of the Quadrocopter’s Battery over Time depending on the current Power
Consumption. Vertical Lines indicate a Reshaping of the Architecture.
networks and the virtbridges. The connectivity between task memories and processors
is depicted in darker shades of blue and grey. Again, the clustering is clearly visible.
The resource consumption of the prototype design is depicted in Figure 4.28. The
design of a close cluster of an 8 × 8 and a 4 × 4 array allows for a direct comparison of
both array sizes. In terms of register count, the resource consumption scales linearly
with the size of the array. Due to updates in the hardware description, the overall
register use is lower than for the synthesized prototype in the previous application
example. For the LUTs, the resource consumption increases at a faster pace with
increasing array size. This is due to the structure of the interconnection network, cf. Section 2.3.4.
For the sake of completeness, the BRAM consumption is given as well. In contrast to
the automotive application example in Section 4.1, the individual task memory size
has been increased in order to fit the corresponding software code. The overhead
introduced by the virtualization solution in terms of BRAM is negligible. Summing
up, the virtualization approach introduces a fair amount of resource overhead but,
provides a decent yield in terms of execution dynamism. As motivated in Chapter 1, the
availability of enough resources is indeed not a problem to worry about today. Instead,
the resources spent for the Virtualization Layers allow to overcome this “Nulticore”
effect as discussed in Section 1.
4.2.7 Conclusion
The exploitation of the Agile Processing scheme for an energy-constrained embedded
system has demonstrated the yield in terms of execution flexibility gained by the
virtualization properties. The embedded system may reshape its current configuration
based on external factors such as sensor inputs as well as based on internal ones such
as the energy remaining in its batteries. Due to the Agile Processing scheme, tasks may
temporarily boost their performance by switching to an adjustable parallel execution
[Figure 4.28 Data: Slice Registers 11 % (33,988 of 301,440; thereof 17,256 and 8,288 for the two Arrays), Slice LUTs 73 % (111,130 of 150,720; thereof 74,553 and 20,716), BRAM Blocks 31 % (130 of 416).]
Figure 4.28: A Visualization of Device Logic Occupation: The Register Count scales linearly
with the Size of the Processor Arrays. The LUT Consumption increases at a
faster Rate when the Array Size doubles.
behavior. Other tasks in the system may be suspended by the virtualization procedure
in the meantime. During parallel execution, the degree of parallelism may be adjusted
transparently. In each execution cycle, the number of parallel instances may be varied
up to n − 2, where n is the number of processors in the array.6 The application example
has proven the feasibility of the Agile Processing scheme in combination with the
virtualization features in order to obtain self-awareness of systems regarding their
energy management and execution behavior.
6 One processor is needed to execute the A part, one processor to execute the C part of a parallelized task.
5 Conclusion and Outlook
This work has introduced several new concepts for the design of multi-processor sys-
tems in embedded designs, which rely on hardware virtualization. By exploiting a
hardware-based virtualization concept, design flows for a variety of use cases, which
span from highly reliable to parallel executable systems, are covered. In particular, the
Agile Processing scheme allows the transparent switch between sequential and parallel
execution of tasks, whereas the degree of parallelization can be scaled dynamically.
Resolving the usually static task-processor bindings and keeping task memories strictly
disjoint enables features known from software-managed operating systems without
abandoning the requirements, which are often crucial for embedded systems: Predict-
able execution times, strict memory separation for safety reasons, and the availability
of fault detection and fault recovery mechanisms, just to name a few. The virtualization
concept is based upon a layer between tasks and processors, which provides the func-
tionality of shifting task execution among the processor array in a very fast manner
that outperforms existing software-based solutions.
In the Virtualization Layer a dedicated task-processor interconnection network, whose
configuration is updated at runtime, manages the shift of task execution. Additionally, it
provides intrinsic means for task group scheduling. Denoting task execution sequences
as well as a self-organizing scheduling scheme evoked by tasks are techniques enabled
by the proposed architecture without causing additional control overhead.
Besides the task-processor interconnection, the enhancement of a point-to-point task
communication concept has been introduced. Here, a runtime code manipulation
allows for transparent point-to-point communication despite the lack of knowledge
on which processor the communication partner is currently being executed or whether
it is executed at this point in time at all. As a runtime code modification scheme is
exploited, just minor changes to existing legacy code are to be made in order to enable
the employment of a task in a virtualizable environment.
As reliability and safety are often optimization goals in the embedded world, the
Virtualization Layer features generic redundancy modules as well as several self-healing
means in order to recover from erroneous states. The ability to shift the execution of a
task transparently to another processor resource is an essential prerequisite in order
to allow for self-healing strategies. The application of existing binding optimization
procedures, which produce bindings that are optimal in terms of reliability, has been
demonstrated in the scope of a complex automotive application example.
Transparent task activation and deactivation as well as temporarily disabling pro-
cessor cores as necessary are provided by the Virtualization Layer. These features
may be exploited in order to conceive energy aware systems, which may tailor their
execution behavior, i. e., the number and set of active tasks and processors, to the
current supply of energy. This has been demonstrated for a quadrocopter, which faces
harsh energy constraints and may profit from the energy awareness features introduced
by the Virtualization Layer.
A dedicated development environment called FripGa accompanies the generic hard-
ware architecture of the Virtualization Layer. Thus, the designer may construct virtual-
izable systems either by means of a graphic user interface or by scripted input. The
design environment is built to work seamlessly with existing design flows. For the
prototype implementation, the tool chain of the chip vendor Xilinx, Inc. is supported.
As a consequence, existing designs may be edited in FripGa, virtualization features may
be added and the entire system can be synthesized for an FPGA right out of FripGa.
The proposed virtualization concept has been implemented as a prototype for an
FPGA platform and was successfully tested for systems consisting of up to twelve array
processors. It is worth mentioning that the proposed virtualization concept does not
rely on a specific hardware architecture, such as an FPGA, or a dedicated processor
type. In case a certain feature exclusively present on FPGAs, such as partial dynamic
reconfiguration, has been exploited, alternatives for other architectures were given as
well. The instructions fed into the processor array by the virtualization procedure are
taken from a standard set of instructions present in almost all common general purpose
processors. As for the chip architecture, no feature of the processor has been exploited,
which is unique to this processor type. Consequently, given the effort of code adaption,
the approach can be exploited for any other processor type and tool suite. It is not
limited to soft-core processors or the Xilinx workflow in any way.
As every optimization towards one property almost always inevitably leads to a
deterioration of other aspects, the yield in execution dynamism introduced by the
virtualization features leads to a drop in the system's maximum frequency. By scaling
the virtualizable processor array down, by defining clusters, or by choosing another
target chip architecture, this effect can be compensated. Consequently, it is up to
designer to define the degree of execution dynamism needed for the specific application
scenario.
Summing up, this work has introduced a comprehensive virtualization concept
covering the design, setup, test, and implementation of virtualizable
embedded designs. The design flows resulting from the adoption of the virtualization
concept propose solutions to the common problem of today's multi-processor and
multi-core computing, where the mere instantiation of many processor instances mostly
raises many problems and solves just a few. The application examples taken from the
automotive world and from systems with harsh energy constraints, of which each faces
individual requirements, prove the versatility of the virtualization solution.
Taking the virtualization concept as a foundation, a plethora of new problem state-
ments arises. In future work, virtualizable processor arrays might consist of processors,
which indeed feature a common instruction set, but differ in terms of performance,
List of Publications
[2] Alexander Biedermann, Boris Dreyer and Sorin A. Huss. A Generic, Scalable
Reconfiguration Infrastructure for Sensor Networks Functionality Adaption. 26th IEEE
International SoC Conference (SOCC), Nürnberg, Germany, September 2013.
[3] Alexander Biedermann and Sorin A. Huss. A Methodology for Invasive Programming
on Virtualizable Embedded MPSoC Architectures. 13th International Conference on
Computational Science (ICCS), Barcelona, Spain, June 2013.
[5] Alexander Biedermann, Matthias Zöllner and Sorin A. Huss. Automatic Code
Parallelization and Architecture Generation for Embedded MPSoC. ACM 5th Workshop
on Mapping of Applications to MPSoCs (Map2MPSoC/SCOPES 2012), St. Goar,
Germany, May 2012.
[6] Alexander Biedermann and Sorin A. Huss. FripGa: A Prototypical Design Tool for
Embedded Multi-Core System-on-Chip Architectures. IEEE/ACM Design, Automation
and Test in Europe, DATE’12, University Booth, Dresden Germany, March 2012.
[7] Alexander Biedermann and Sorin A. Huss. Scalable Multi-Core Virtualization for
Embedded System-on-Chip Architectures. IEEE/ACM Design, Automation and Test
in Europe, DATE’12, Friday Workshop: Quo Vadis, Virtual Platforms? Dresden,
Germany, January 2012.
[8] Alexander Biedermann, Thorsten Piper, Lars Patzina, Sven Patzina, Sorin A. Huss,
Andy Schürr and Neeraj Suri. Enhancing FPGA Robustness via Generic Monitoring
IP Cores. International Conference on Pervasive and Embedded Computing and
Communication Systems, Vilamoura, Portugal, March 2011.
[9] Alexander Biedermann, Marc Stoettinger, Lijing Chen and Sorin A. Huss. Secure
Virtualization within a Multi-Processor Soft-core System-on-Chip Architecture. The 7th
[10] Alexander Biedermann and H. Gregor Molter (Eds.). Design Methodologies for
Secure Embedded Systems, volume 78. of Lecture Notes in Electrical Engineering.
Springer, Berlin, Germany, November 2010.
[12] André Seffrin, Alexander Biedermann and Sorin A. Huss. Tiny-Pi: A Novel Formal
Method for Specification, Analysis, and Verification of Dynamic Partial Reconfiguration
Processes. 13th IEEE Forum on Specification and Design Languages (FDL 2010),
Southampton, UK, September 2010.
[13] Marc Stoettinger, Alexander Biedermann and Sorin A. Huss. Virtualization within
a Parallel Array of Homogeneous Processing Units. 6th International Symposium on
Applied Reconfigurable Computing, Bangkok, Thailand, March 2010.
[14] Felix Madlener, Sorin A. Huss and Alexander Biedermann. RecDEVS: A Compre-
hensive Model of Computation for Dynamically Reconfigurable Hardware Systems. 4th
IFAC Workshop on Discrete-Event System Design (DESDes’09), Gandia, Spain,
October 2009.
List of supervised Theses
[8] Tobias Rückelt. Implementierung eines Sensorsystems als Beispielapplikation für gen-
erisches FPGA-Monitoring. Bachelor Thesis, TU Darmstadt, July 2011.
[10] Maik Görtz. FripGa - Software zur Codegenerierung von Many-Core-Architekturen auf
FPGAs. Master Thesis, TU Darmstadt, March 2011.
[4] Kevin Luck. Ein virtualisierbares Automotive MPSoC. Final Thesis, TU Darmstadt,
May 2013.
[5] Nicolas Eicke, Sebastian Funke, Kai Schwierczek and Markus Tasch. Framework
zur Modellierung verteilter, paralleler eingebetteter HW/SW-Designs. Practical Course,
TU Darmstadt, October 2012.
[7] Tobias Rückelt. Graphischer Editor für Models of Computation. Final Thesis, TU
Darmstadt, October 2012.
[10] Antonio Gavino Casu, Kadir Inac, Randolph Lieding, Mischa Lundberg and
Daniel Schneider. Virtualisierung in eingebetteten Multi-Prozessor-Systemen. Practical
Course, TU Darmstadt, April 2012.
[11] Michael Koch, Niels Ströher, Lucas Rothamel and Manuel Weiel. Framework zur
Modellierung verteilter, paralleler eingebetteter HW/SW-Designs. Practical Course, TU
Darmstadt, April 2012.
[14] Jan Post. Implementierung eines HW-IP-Cores für schnelles Pixel-Resampling. Final
Thesis, TU Darmstadt, October 2011.
[15] Hieu Ha Chi, Dan Le, Do Thanh Tung and Binh Vu Duc. Graphisches Tool für
parallelisierte HW/SW-FPGA-Designs. Practical Course, TU Darmstadt, September
2011.
[16] Quoc Hien Dang, Johannes Decher, Thorsten Jacobi and Omid Pahlevan Sharif.
Graphisches Tool für parallelisierte HW/SW-FPGA-Designs. Practical Course, TU
Darmstadt, September 2011.
[17] Michael Koch, Niels Ströher, Lucas Rothamel and Manuel Weiel. Graphisches
Tool für parallelisierte HW/SW-FPGA-Designs. Practical Course, TU Darmstadt,
September 2011.
[19] Peter Glöckner, Amir Naseri, Johannes Simon and Matthias Zöllner. Framework
für massiv-parallele Systeme in heterogenen FPGA-Netzwerken. Practical Course, TU
Darmstadt, April 2011.
[20] Christopher Huth. Methodiken zur Bewertung der Güte von Security-USB-Token.
Practical Course, TU Darmstadt, October 2010.
[23] Joel Njeukam. Partielle Rekonfiguration für Virtex-5 FPGAs. Practical Course, TU
Darmstadt, April 2010.
Bibliography
[Abassi 2010] L. Abassi. 2010 Haiti Earthquake Damage 4. online: http://upload.wikimedia.org/wikipedia/commons/a/a2/2010_Haiti_earthquake_damage4.jpg, accessed 07/01/2014, 9:00 am, 2010. 177
[Ajtai 1983] M. Ajtai, J. Komlós and E. Szemerédi. An O(n log n) Sorting Network. In
Proceedings of the Symposium on Theory of Computing (STOC), pages 1–9.
ACM, 1983. 38
[Ambric Inc. 2008] Ambric Inc. Am2000 Family Massively Parallel Processor Array.
online: http://web.archive.org/web/20080516200115/http://www.
ambric.com/products/, accessed 12/11/2013, 11:40 am, May 2008. 107
[Amdahl 1967] G.M. Amdahl. Validity of the single Processor Approach to achieving large
Scale computing Capabilities. In Proceedings of the April 18-20, 1967, Spring Joint
Computer Conference, pages 483–485. ACM, 1967. 129
[Arora 1990] S. Arora, T. Leighton and B. Maggs. On-line Algorithms for Path Selection
in a nonblocking Network. In Proceedings of the twenty-second annual ACM
Symposium on Theory of Computing, pages 149–158. ACM, 1990. 37
[Barthe 2011] L. Barthe, L.V. Cargnini, P. Benoit and L. Torres. The SecretBlaze: A
configurable and cost-effective open-source soft-core Processor. In Proceedings of
International Symposium on Parallel and Distributed Processing Workshops
and Phd Forum (IPDPSW), pages 310–313. IEEE, 2011. 8
[Batcher 1968] K.E. Batcher. Sorting Networks and their Applications. In Proceedings of
the April 30–May 2, 1968, Spring Joint Computer Conference, pages 307–314.
ACM, 1968. 38
[Benini 2002] L. Benini and G. De Micheli. Networks on Chips: A new SoC Paradigm.
Computer, vol. 35, no. 1, pages 70–78, 2002. 33
[Biedermann 2012a] A. Biedermann and S.A. Huss. FripGa: A Prototypical Design Tool
for Embedded Multi-Core System-on-Chip Architectures. In Conference on Design,
Automation and Test in Europe (DATE), University Booth. IEEE/ACM, 2012.
110
[Biedermann 2012d] A. Biedermann, M. Zöllner and S.A. Huss. Automatic Code Par-
allelization and Architecture Generation for Embedded MPSoC. In Workshop on
Mapping of Applications to MPSoCs (Map2MPSoC/SCOPES 2012), 2012. 123
[Biedermann 2013a] A. Biedermann, B. Dreyer and S.A. Huss. A Generic, Scalable Recon-
figuration Infrastructure for Sensor Networks Functionality Adaption. In International
SoC Conference (SOCC). IEEE, 2013. 54
[Biedermann 2013b] A. Biedermann and S.A. Huss. A Methodology for Invasive Pro-
gramming on Virtualizable Embedded MPSoC Architectures. In Proceedings of
International Conference on Computational Science (ICCS), pages 359–368.
Procedia Computer Science, 2013. 135
[Bolchini 2007] C. Bolchini, A. Miele and M.D. Santambrogio. TMR and Partial Dynamic
Reconfiguration to mitigate SEU faults in FPGAs. In Proceedings of International
Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT), pages 87–95.
IEEE, 2007. 121
[Bonamy 2012] R. Bonamy, H.M. Pham, S. Pillement and D. Chillet. Ultra-fast power-
aware Reconfiguration Controller. In Proceedings of Conference on Design, Auto-
mation and Test in Europe (DATE), pages 1373–1378. IEEE/ACM, 2012. 55
[Brebner 1996] G.J. Brebner. A Virtual Hardware Operating System for the Xilinx XC6200.
In Proceedings of Conference on Field-Programmable Logic and Applications
(FPL), pages 327–336. Springer, 1996. 6
[Brebner 2001] G.J. Brebner and O. Diessel. Chip-Based reconfigurable Task Management.
In Proceedings of Conference on Field-Programmable Logic and Applications
(FPL), pages 182–191. IEEE, 2001. 6
[Burke 2004] E.K. Burke, P. De Causmaecker and G. Vanden Berghe. Novel meta-
heuristic Approaches to Nurse Rostering Problems in Belgian Hospitals. Handbook of
Scheduling: Algorithms, Models and Performance Analysis, vol. 18, pages 1–44,
2004. 67
[Casu 2011] A.G. Casu, K. Inac, R. Lieding, M. Lundberg and D. Schneider. Virtualisierung in eingebetteten Multi-Prozessor-Systemen. Practical Course, TU Darmstadt, April 2011. 112
[Chandra 1997] R. Chandra, D.-K. Chen, R. Cox, D.E. Maydan, N. Nedeljkovic and J.M.
Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors.
ACM SIGPLAN Notices, vol. 32, no. 5, pages 334–345, 1997. 123
[Chen 2007] T. Chen, R. Raghavan, J.N. Dale and E. Iwata. Cell Broadband Engine
Architecture and its first Implementation – A Performance View. IBM Journal of
Research and Development, vol. 51, no. 5, pages 559–572, 2007. 1, 32
[Chen 2009] C.T. Chen and Y.S. Chen. Real-time Approaching Vehicle Detection in Blind-
Spot Area. In Proceedings of 12th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1–6. IEEE, 2009. 152
[Clos 1953] C. Clos. A Study of non-blocking Switching Networks. Bell System Technical
Journal, vol. 32, no. 2, pages 406–424, 1953. 34
[Cohen 2010] A. Cohen and E. Rohou. Processor Virtualization and split Compilation for
heterogeneous Multicore Embedded Systems. In Proceedings of Design Automation
Conference (DAC), pages 102–107. ACM, 2010. 5
[Cotret 2012] P. Cotret, J. Crenne, G. Gogniat and J. Diguet. Bus-based MPSoC Security
through Communication Protection: A latency-efficient Alternative. In Proceedings of
Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM), pages 200–207. IEEE, 2012. 95
[Dagum 1998] L. Dagum and R. Menon. OpenMP: An Industry Standard API for Shared-
Memory Programming. Computational Science & Engineering, IEEE, vol. 5, no. 1,
pages 46–55, 1998. 123, 127
[Eicke 2012] N. Eicke, S. Funke, K. Schwierczek and M. Tasch. Framework zur Model-
lierung verteilter, paralleler eingebetteter HW/SW-Designs. Practical Course, TU
Darmstadt, October 2012. 69
[Franke 2003] B. Franke and M.F.P. O'Boyle. Compiler Parallelization of C Programs for
multi-core DSPs with multiple Address Spaces. In Proceedings of International
Conference on Hardware/Software Codesign and System Synthesis, pages
219–224. IEEE/ACM/IFIP, 2003. 123
[Gericota 2005] M.G. Gericota, G.R. Alves and J.M. Ferreira. A self-healing Real-Time
System based on run-time Self-Reconfiguration. In Proceedings of Conference on
Emerging Technologies and Factory Automation (ETFA), volume 1, 4 pp.
IEEE, 2005. 116
[Glöckner 2011] P. Glöckner, A. Naseri, J. Simon and M. Zöllner. Framework für massiv-
parallele Systeme in heterogenen FPGA-Netzwerken. Practical Course, TU Darm-
stadt, April 2011. 112
[Goke 1973] L.R. Goke and G.J. Lipovski. Banyan Networks for partitioning Multi-
processor Systems. ACM SIGARCH Computer Architecture News, vol. 2, no. 4,
pages 21–28, 1973. 34
[Goldberg 1974] R.P. Goldberg. Survey of virtual Machine Research. Computer, vol. 7,
no. 6, pages 34–45, 1974. 5
[Green 2013] O. Green and Y. Birk. Scheduling Directives for shared-memory Many-Core
Processor Systems. In Proceedings of the International Workshop on Programming
Models and Applications for Multicores and Manycores, pages 115–124. ACM,
2013. 57
[Henkel 2003] J. Henkel. Closing the SoC Design Gap. Computer, vol. 36, no. 9, pages
119–121, 2003. 1
[Huang 2009] C.H. Huang and P.-A. Hsiung. Hardware Resource Virtualization for dy-
namically partially reconfigurable Systems. IEEE Embedded Systems Letters, vol. 1,
no. 1, pages 19–23, 2009. 6
[Intel Corporation 2013a] Intel Corporation. Desktop 3rd Generation Intel Core Processor
Family, Desktop Intel Pentium Processor Family, Desktop Intel Celeron Processor
Family, and LGA1155 Socket Thermal Mechanical Specifications and Design Guidelines
(TMSDG). Intel Corporation, January 2013. 7
[Intel Corporation 2013b] Intel Corporation. Intel 64 and IA-32 Architectures Software
Developer’s Manual, volume 2. Intel Corporation, March 2013. 7
[Irwin 2004] M.J. Irwin, L. Benini, N. Vijaykrishnan and M. Kandemir. Techniques for
designing energy-aware MPSoCs, pages 21–47. Morgan Kaufman, 2004. 102
[Israr 2012] A. Israr. Reliability Aware High-Level Embedded System Design in presence of
Hard and Soft Errors. PhD Thesis, TU Darmstadt, May 2012. 117, 118, 154, 156
[Jung 2005] H.G. Jung, D.S. Kim, P.J. Yoon and J.H. Kim. Computer Analysis of Images
and Patterns, chapter Stereo Vision based Localization of free Parking Site, pages
231–239. Springer, 2005. 152
[Kalte 2005] H. Kalte and M. Porrmann. Context Saving and Restoring for Multitasking
in reconfigurable Systems. In Proceedings of Conference on Field-Programmable
Logic and Applications (FPL), pages 223–228. IEEE, 2005. 14
[Khronos OpenCL Working Group 2008] Khronos OpenCL Working Group. The
OpenCL Specification. A. Munshi (ed.), 2008. 123
[Knuth 1997] D.E. Knuth. The Art of Computer Programming, volume 3: Sorting and
Searching, chapter 5.3.4: Networks for Sorting, pages 219–247. Addison-Wesley,
1997. 38
[Koal 2013] T. Koal, M. Ulbricht and H.T. Vierhaus. Virtual TMR Schemes Combining
Fault Tolerance and Self Repair. In Proceedings of Conference on Digital System
Design (DSD), pages 235–242. Euromicro, 2013. 119
[Koch 2011] M. Koch, N. Ströher, L. Rothamel and M. Weiel. Graphisches Tool für
parallelisierte HW/SW-FPGA-Designs. Practical Course, TU Darmstadt, September
2011. 112
[Koch 2012] M. Koch, N. Ströher, L. Rothamel and M. Weiel. Framework zur Modellierung
verteilter, paralleler eingebetteter HW/SW-Designs. Research Project, TU Darmstadt,
April 2012. 112
[Kuck 1981] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure and M. Wolfe. Dependence
Graphs and Compiler Optimizations. In Proceedings of the ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages, pages 207–218.
ACM, 1981. 125
[Lala 2003] P.K. Lala and B.K. Kumar. An Architecture for self-healing Digital Systems.
Journal of Electronic Testing, vol. 19, no. 5, pages 523–535, 2003. 116
[Leiserson 2010] C.E. Leiserson. The Cilk++ Concurrency Platform. The Journal of
Supercomputing, vol. 51, no. 3, pages 244–257, 2010. 123
[Leu 2011] A. Leu, D. Aiteanu and A. Graser. A novel stereo Camera based Collision
Warning System for Automotive Applications. In Proceedings of International
Symposium on Applied Computational Intelligence and Informatics (SACI),
pages 409–414. IEEE, 2011. 152
[Luck 2013] K. Luck. Ein virtualisierbares Automotive MPSoC. Final Thesis, TU Darm-
stadt, May 2013. 152
[Meloni 2010] P. Meloni, S. Secchi and L. Raffo. An FPGA-based Framework for technology-
aware Prototyping of multicore embedded Architectures. IEEE Embedded Systems
Letters, vol. 2, no. 1, pages 5–9, 2010. 112
[Mentor Graphics 2013] Mentor Graphics. ModelSim - Leading Simulation and Debug-
ging. online: http://www.mentor.com/products/fpga/model, accessed
12/18/2013, 10:10 am, 2013. 112
[Moore 1965] G.E. Moore. Cramming more Components onto Integrated Circuits. Electronics, vol. 38, no. 8, pages 114–117, 1965. 1
[Muller 2013] S. Müller, M. Schölzel and H.T. Vierhaus. Towards a Graceful Degradable
Multicore-System by Hierarchical Handling of Hard Errors. In Proceedings of
International Conference on Parallel, Distributed and Network-Based Processing
(PDP), pages 302–309. Euromicro, 2013. 117
[Nassimi 1981] D. Nassimi and S. Sahni. A Self-Routing Benes Network and Parallel
Permutation Algorithms. IEEE Transactions on Computers, vol. C-30, no. 5, pages
332–340, 1981. 37
[Ottoni 2005] G. Ottoni, R. Rangan, A. Stoler and D.I. August. Automatic Thread Ex-
traction with decoupled Software Pipelining. In Proceedings of International Sym-
posium on Microarchitecture (MICRO), 12 pp. IEEE/ACM, 2005. 123
[Parberry 1992] I. Parberry. The pairwise Sorting Network. Parallel Processing Letters,
vol. 2, no. 2–3, pages 205–211, 1992. 38
[Paulsson 2006b] K. Paulsson, M. Hübner, M. Jung and J. Becker. Methods for run-
time Failure Recognition and Recovery in dynamic and partial reconfigurable Systems
based on Xilinx Virtex-II Pro FPGAs. In Proceedings of Annual Symposium on
Emerging VLSI Technologies and Architectures, 6 pp. IEEE, 2006. 54, 167
[Plurality 2013] Plurality. Plurality - Leading the Multicore Revolution. online: http://
www.plurality.com/technology.html, accessed: 12/16/2013, 11:15 am,
2013. 57
[Portland Business Journal 2008] Portland Business Journal. Ambric for sale.
online: http://www.bizjournals.com/portland/stories/2008/11/
17/daily25.html?t=printable, accessed 12/11/2013, 11:50 am, Novem-
ber 2008. 107
[Quinlan 2011] D.J. Quinlan and C. Liao. ROSE Source-to-Source Compiler Infrastruc-
ture. In Proceedings of Cetus Users and Compiler Infrastructure Workshop in
conjunction with PACT, volume 2011, page 1, 2011. 127
[Rückelt 2012] T. Rückelt. Graphischer Editor für Models of Computation. Final Thesis, TU
Darmstadt, October 2012. 112
[Runge 2012] A. Runge. Determination of the Optimum Degree of Redundancy for Fault-
prone Many-Core Systems. GMM-Fachbericht Zuverlässigkeit und Entwurf, 2012.
120
[Singh 2013] A.K. Singh, M. Shafique, A. Kumar, J. Henkel, A. Das, W. Jigang, T. Srik-
anthan, S. Kaushik, Y. Ha, A. Prakash et al. Mapping on multi/many-core Systems:
Survey of current and emerging Trends. In Proceedings of the International Confer-
ence on Computer-Aided Design (ICCAD), pages 508–515. IEEE/ACM, 2013.
118
[Sunderam 1990] V.S. Sunderam. PVM: A Framework for parallel distributed Computing.
Concurrency: Practice and Experience, vol. 2, no. 4, pages 315–339, 1990. 5
[Tang 1986] P. Tang and P.-C. Yew. Processor Self-Scheduling for Multiple-Nested Parallel
Loops. In Proceedings of International Conference on Parallel Processing (ICPP),
volume 86, pages 528–535, 1986. 67
[Tcl Developer Xchange 2013] Tcl Developer Xchange. Welcome to the Tcl Developer
Xchange! online: http://www.tcl.tk/, accessed: 11/13/2013, 1:00 pm,
2013. 111
[Ullmann 2004] M. Ullmann, M. Hübner, B. Grimm and J. Becker. On-demand FPGA run-
time System for dynamical Reconfiguration with adaptive Priorities. In Proceedings
of Conference on Field-Programmable Logic and Applications (FPL), pages
454–463. IEEE, 2004. 54
[Ventroux 2005] N. Ventroux and F. Blanc. A low complex Scheduling Algorithm for Multi-
Processor System-on-Chip. In Proceedings of the Conference on Parallel and
Distributed Computing and Networks, pages 540–545. IASTED/ACTA Press,
2005. 75
[Waksman 1968] A. Waksman. A Permutation Network. Journal of the ACM, vol. 15,
no. 1, pages 159–163, 1968. 34
[Wirth 1995] N. Wirth. A Plea for lean Software. Computer, vol. 28, no. 2, pages 64–68,
1995. 1
[Wolf 2008] W. Wolf, A.A. Jerraya and G. Martin. Multiprocessor System-on-Chip (MPSoC)
Technology. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 27, no. 10, pages 1701–1713, 2008. 2
[Wu 2009] J. Wu, M. Si, F. Tan and C. Gu. Real-time automatic Road Sign Detection. In
Proceedings of International Conference on Image and Graphics (ICIG), pages
540–544. IEEE, 2009. 152
[Xilinx, Inc. 2011] Xilinx, Inc. LogiCORE IP Fast Simplex Link (FSL) V20 Bus (v2.11e).
Xilinx, Inc., Oct. 2011. 79
[Xilinx, Inc. 2012] Xilinx, Inc. MicroBlaze Processor Reference Guide. Xilinx, Inc., v14.1
edition, April 2012. 7, 80, 85
[Xilinx, Inc. 2013a] Xilinx, Inc. MicroBlaze Soft Processor Core. online: http://www.
xilinx.com/tools/microblaze.htm, accessed: 10/01/2013, 11:45 am,
2013. 9
[Xilinx, Inc. 2013b] Xilinx, Inc. PicoBlaze 8-bit Microcontroller. online: http://www.
xilinx.com/products/intellectual-property/picoblaze.htm, accessed: 10/01/2013, 11:55 am, 2013. 8
[Xilinx, Inc. 2013c] Xilinx, Inc. Vivado Design Suite. online: http://www.xilinx.
com/products/design-tools/vivado/, accessed 12/18/2013, 10:45 am,
2013. 109