2014 17th Euromicro Conference on Digital System Design

A Tiny Scale VLIW Processor for Real-time Constrained Embedded Control Tasks

Oliver Stecklina and Michael Methfessel


IHP
Im Technologiepark 25
15236 Frankfurt (Oder), Germany
Email: {stecklina, methfessel}@ihp-microelectronics.com

In this paper we present the architectural design of the tiny scale very long instruction word (VLIW) soft-core processor TinyVLIW8. The processor is designed to achieve a minimal instruction execution time and design size. Although the instruction repertoire is not large, it is adequate for control tasks that require decision making which could not easily be implemented in an application specific integrated circuit (ASIC) and that do not require extensive mathematical processing. Especially in the area of embedded control tasks with real-time requirements, an architecture using single-cycle instructions is key. We illustrate an application of our TinyVLIW8 by presenting a design of a secure wake-up receiver for low power wireless sensor nodes. To the best of our knowledge, the presented soft-core processor is the smallest VLIW design with an average instruction execution time of only one clock cycle. The core consumes less than 6 % of the logic cells of the smallest Altera Cyclone IV FPGA and can be used in a system-on-chip design as well.

Keywords: embedded systems; hardware-software codesign; soft-core processors; VLIW; CPU; ASIC; embedded controller; sensor networks; security

I. INTRODUCTION

Tiny scale systems become more and more indispensable to our everyday life because they are powerful enough to gather useful information to assist our life. But partly as a result of their manifold capabilities, embedded systems are also able to do many things that we do not want. In fact, tiny scale systems will always be vulnerable to doing the bidding of attackers, to the detriment of their owners. Hence, the integration of security functions becomes mandatory for this class of embedded systems as well. But the insufficient resources and especially the limited power sources of battery powered systems forbid complex extensions. Furthermore, the continuously changing security demands ask for a flexible architecture, which makes the implementation of a bare application specific integrated circuit (ASIC) less attractive. We are convinced that the integration of a tiny scale soft-core processor is a much better solution for integrated circuits (ICs) with an embedded application specific controller.

Especially wireless devices (motes) must be able to cope with the diverging requirements regarding power consumption, security and increasing functionality. In the context of the research project [1], we have implemented a concept of an ultra low power secure wake-up receiver [2]. In particular, a wake-up receiver (WUR) is vulnerable to depletion attacks, in which a wake-up signal is sent repeatedly to drain the mote's power supply. Our secure wake-up receiver (SWUR) is based on a WUR developed by Fraunhofer IIS [3] and is extended by a security module that features the time-based one-time password (TOTP) algorithm for a trustworthy and secure wake-up scheme. The algorithm employs a time-synchronized SHA-1 keyed-hash message authentication code (HMAC) [4]. Due to the complexity of the TOTP algorithm we decided to integrate a tiny scale soft-core processor to achieve the highest possible flexibility while considering power efficiency for the required control application. Furthermore, we use the 32 kHz crystal oscillator of the WUR to minimize the active power consumption. The TOTP algorithm defines the timing requirements that must be met by the underlying hardware driven by the 32 kHz clock source. Hence, the design of the tiny scale soft-core processor must balance logic size and instruction execution time.

In the following we present the TinyVLIW8, a tiny RISC processor core that is optimized for low logic usage and minimal execution time per instruction. The core is based on a Harvard architecture with dedicated instruction, data and I/O buses. It has three different functional units, of which two can be used in parallel by a single 32-bit instruction word. Each instruction is executed within exactly two clock cycles. By executing two instructions in parallel the average execution time of an instruction is only a single clock cycle. The design satisfies real-time requirements by providing a fixed instruction size and an invariant instruction execution time. These capabilities predestine our design for a broad variety of embedded control applications in tiny scale systems.

In the following section we briefly present a small selection of existing soft-core processors for field programmable gate arrays (FPGAs) and silicon devices. In section III we describe the architecture of our TinyVLIW8 processor core. Implementation variants and the integration in a frequently used sensor node are explained in section IV. Section V summarizes our evaluation results, in which we compare our design with the soft-core processors presented in section II. We conclude this paper with a short summary of the key contributions of our processor design.

II. RELATED WORK

The idea of using a small soft-core processor as part of a hardware design process is becoming more widespread. This is due to a number of significant advantages that soft-core processors hold over application specific ICs [5]. Cores are

978-1-4799-5793-4/14 $31.00 © 2014 IEEE 559


DOI 10.1109/DSD.2014.31
offered by the FPGA manufacturers as well as open source projects. The OpenCores website lists more than 160 available cores [6]. Most of the cores re-implement an instruction set architecture (ISA) of a commercially available processor such as the Intel 8080/8081, Z80 [7], MIPS or SPARC. The use of such a core simplifies the software development process significantly. Compiler suites such as the GNU compiler collection (GCC) are already available and higher-level languages can be used. But most of these cores are highly over-featured for simple embedded control tasks and consume substantial logic on FPGA devices. Furthermore, strict real-time requirements are hard to satisfy with these cores because the execution time differs widely between instructions.

All major FPGA manufacturers, e.g. Xilinx, Altera or Lattice, offer soft-core processors for their FPGAs. These processors cover a broad variety of applications. There are very small designs - Altera Nios II/e [8], PicoBlaze by Xilinx [9] - for embedded control tasks as well as more powerful designs - Nios II/f, MicroBlaze [10] - for complex operations. However, these processor cores are restricted to the manufacturer's FPGAs and cannot be used on other platforms or in silicon devices. Hence, they are not applicable in a development process with a silicon design as the final goal.

Soft-core processors with an open source license can be used on FPGA devices and can be implemented in silicon devices as well. For example, the Leon2 processor was originally designed by the European Space Research and Technology Centre (ESTEC) to study and develop a high-performance processor. The design is based on the SPARC V8 Reduced Instruction Set Computer (RISC) architecture and is licensed under the LGPL/GPL [11]. A commercial license can be purchased for integration in proprietary products. Commercial silicon devices are offered by Atmel - AT697E/F, AT7913E - or by NXP - JN5148 wireless micro controller unit (MCU). The SPARC architecture is promoted by SPARC International Inc. and is fully open and non-proprietary. In March 2006 the 64-bit UltraSPARC T1 microprocessor was released as open source and can be used in research as well as in commercial products [12]. But the Leon system architecture is focused on high performance and includes complex features such as a five-stage pipeline, caches and a branch prediction unit, which are mostly useless for embedded control tasks. Furthermore, these features make a reliable prediction of the expected execution time quite hard. In the context of low power applications simple architectures are more appropriate. Therefore, a port of an MCU soft-core like TI's MSP430 ultra low power MCU is more suitable. The openMSP project offers an open-source MSP430 binary compatible 16-bit processor core [13]. A commercial 20-bit core is available from Fraunhofer IPMS and an extended MCU with integrated cryptographic units is offered by the IHP [14].

16-bit MCUs are widely deployed in low power wireless sensor networks (WSNs). By using low duty cycle protocols, a node lifetime of several years can be achieved with a small battery pack. But tiny scale controllers are focused on systems with a continuous operation or with a high duty cycle. In these systems the active power consumption is key and the number of active components must be as small as possible. The Leros is a tiny soft-core processor that is optimized for low-cost FPGAs [15]. One of its design goals was to find a balance between logic and on-chip memory. The core consumes less than 300 logic cells of an Altera Cyclone II FPGA and the on-chip memory is used for the instruction memory and the data memory only. A small design of a RISC MCU with an invariant instruction execution time is given by the Open uRISC [16]. The 'supersmall' soft-core processor is focused on applications with very low speed requirements [17]. It is designed to squeeze the last little bit out of the processor's design. But this small design comes at the cost of serialization, which causes time-consuming instructions.

The first very long instruction word (VLIW) soft-core processor found in the literature is Spyder [18]. Later, similar projects were presented, e.g. a modular VLIW processor [19], the ρ-VEX, a reconfigurable and extensible soft-core VLIW processor based on the VEX ISA [20], or a native Java VLIW processor [21]. But these soft-core processors are focused on performance or the presence of a good software tool chain. Limitations of these architectures in the application area of tiny scale systems are mainly their resource usage and their long execution time for a single instruction.

III. A TINY SCALE VLIW CORE

The TinyVLIW8 is a tiny RISC processor following the design strategy of the Harvard architecture. The processor core is built of an instruction decoder, a load store unit (ldst), an arithmetic and logic unit (alu), a jump unit (jmp), an interrupt controller, a register set, a program counter and a status register. Figure 1 shows the block diagram of the processor core and its buses. The core uses an 11-bit instruction memory address bus, which can address 2,048 memory locations of 32 bits each. The 8-bit data memory and IO buses use dedicated address buses with a width of 8 bits.

The register set has eight general purpose (GP) and four additional interrupt registers. The interrupt registers are accessible within an interrupt only and cover the lower four GP registers. The upper four GP registers are accessible all the time. The registers are used by the ldst unit and the alu. Because the alu supports register-register operations only, all operands must be loaded into the registers by the ldst unit beforehand and results must be stored afterwards. Direct memory operations are not supported by the ISA. Although this restriction requires additional instructions, it simplifies the design significantly. Furthermore, memory copy operations can be implemented in an efficient way by using the VLIW capability of the processor core. Hence, the bus architecture of the processor core must be able to handle parallel accesses of both units to the register set within a single instruction cycle.

A. The instruction set architecture (ISA)

In agreement with the primary design goals of the TinyVLIW8 - design size and single instruction execution time - its ISA is limited to few instructions. Each instruction has a fixed size of 16 bits and is structured into an opcode, address flags, a destination and a source field. The instruction format and the size of the fields are given in Figure 2.

Two 16-bit instructions can be combined in a single 32-bit instruction word if they use different functional units. A single instruction in a 32-bit word is described by using


Fig. 1. The block diagram of the TinyVLIW8 processor core. The core has a single input clock and supports four interrupt lines. External peripherals are
connected to the 8-bit I/O memory.

     


   
the same opcode for both 16-bit instructions. The instruction decoder will detect this resource conflict and does not enable a second functional unit within this cycle.

Fig. 2. Format of the 16-bit instructions of the TinyVLIW8 processor.

The opcode has a fixed size of three bits and is used to select one of the functional units and the operation within the functional unit. A summary of all supported opcodes and the addressed operations as well as their responsible functional units is given in Table I.

TABLE I. THE PROCESSOR ISA HAS A 3-BIT OPCODE, WHICH ADDRESSES THE FUNCTIONAL UNIT AND THE SPECIFIC INSTRUCTION.

Operation  Opcode  Functional Unit
load       000     LDST
store      001     LDST
add        010     ALU
shift      011     ALU
and        100     ALU
or         101     ALU
xor        110     ALU
jump       111     JMP

Each opcode can be extended by a two-bit address flag field. The address flags are specific to the different functional units as well as the addressed operations. The ldst unit uses the address flags for selecting the IO or the data memory and to differentiate between a direct address given by the 8-bit source field or an immediate address given by the content of a register. The alu uses the flag to differentiate between a constant operand and a two-register operand operation. If a constant operand is selected, the source of the second operand is equal to the destination. The constant is read from the 8-bit source field. In a two-register operand operation the source field is split into two nibbles, where each nibble consists of a flag bit and the 3-bit register number. The nibble flag can be used to invert the content of the register or to generate the two's complement of an operand. The second bit of the address flag is used to invert the result (and, xor, or) or to include the carry bit (add, shift) in the operation. The shift operation has only one source operand. The address flags are used to specify the shift direction and to integrate the carry flag. The jmp unit uses the address flags to differentiate between different jump conditions (none, zero, negative, non-zero). The destination address is given by the destination and the source field.

B. Execution stages

The TinyVLIW8 processor needs two clock cycles to execute a single instruction. The two clock cycles are used to drive the 4-bit execution stage bus (ESB). Figure 3 shows the waveform of the main clock and the ESB signals. The ESB is generated from the content of two flip-flops, which are toggled by the rising and falling edges of the main clock. The four ESB lines are enabled in consecutive order, whereby only one signal is active at a time.

The functional units of the TinyVLIW8 are controlled by the ESB. Each ESB line corresponds to an execution stage and enables dedicated logic within the functional units and synchronizes their inputs and outputs. The execution stages are fetch, decode, execute and write back.

a) Fetch: This stage is used to transfer the next instruction to the instruction decoder. On the rising edge of the line ESB(0) the instruction memory is enabled. It is guaranteed that the instruction memory address is stable at this event.

b) Decode: This stage starts with the falling edge of line ESB(0) and enables the ldst, the alu or the jmp unit. The units start their instruction decoding with the rising edge of the line ESB(1).

c) Execute: This stage is used by the alu to write its results back to the register set. The ldst unit enables the data memory and peripheral registers so that they can be used within the next stage.

d) Write back: This stage is the last stage and enables the write enable signals of the data memory bus, the IO memory bus or the register set. Furthermore, the program counter register is loaded from one of the sources within this stage.

Fig. 3. The functional units are driven by the execution stage bus (ESB). The ESB is generated from the main clock by using the rising and the falling edges of two clock cycles.

The register set is used by the alu and the ldst unit. To use both units in parallel their accesses are spread over the last two execution stages. The instruction memory is enabled within the first two stages only. Therefore, the functional units must buffer the data in local registers. The buffer is necessary to change the instruction address before the rising edge of the first stage.

C. Program counter

The program counter (PC) is not part of the general purpose register set. It is implemented in an additional unit. The content of the register can be modified by incrementing it as well as by overwriting it from the ldst unit, the jmp unit or the interrupt controller. Furthermore, the ldst unit can read the program counter, which is necessary for implementing subroutine calls. The content of the PC register must be stable within the first two execution stages for addressing the current instruction as well as within the last two stages for the ldst read operation. Therefore, the update and the read operation must be done on shadow registers. A flow diagram of the update process and the used shadow registers is shown in Figure 4.

Fig. 4. The PC unit includes the PC register and logic for its update. The incremented PC is buffered in the pcInt register during stage one. The PC is loaded during stage four from one of the four possible sources.

The program counter is incremented within stage one; the result is written to the temporary PC register pcInt. The content of the pcInt register can be read by the ldst unit. The update of the PC takes place within stage four. It is guaranteed that all sources provide a stable signal within this stage. The update of the register is prioritized in the following order: interrupt (highest), ldst, jmp and pcInt (lowest).

In case of an interrupt the content of the lower prioritized source is stored in the pcIrq register. This register can be read by the ldst unit to implement the return from interrupt instruction.

D. Interrupt handling

The interrupt controller supports four asynchronous interrupt sources. A peripheral unit can raise an interrupt by enabling its interrupt line. The line must be held by the peripheral unit until an IRQ acknowledge is set. The IRQ acknowledge is generated by the interrupt controller within stage three when the interrupt vector (IV) is loaded to the PC. If an interrupt occurs before stage three it will be handled within the following instruction cycle. Otherwise the next instruction is finished and interrupt handling starts afterwards. Due to the fixed execution time of a single instruction the maximal interrupt delay is three clock cycles.

The return from interrupt must be implemented manually by using ldst operations. The address of the instruction displaced by the interrupt is stored in a dedicated register of the PC unit. It must be copied with ldst operations to the PC register to return to the normal program flow. During interrupt handling the register set provides four additional registers, which can be freely used without overwriting data from the normal flow. The registers are mapped into the register set and cover the four lower GP registers. Interrupt handling is finished when the program counter is loaded by the ldst unit. An in-interrupt flag is provided by the status register. The flag can be used by software to share code between interrupt handling and normal operations.

The processor core does not support nested interrupts. If an interrupt occurs while the interrupt flag is active the handling is delayed until a return from interrupt is done. The interrupt acknowledge is also delayed so that the nested interrupt will not be ignored.

E. Clock gating

To reduce the idle consumption of the TinyVLIW8 the peripheral units and the processor core can be selectively switched off. Each peripheral unit with a clock input can be disabled as well. Furthermore, the processor core can be stalled by software by writing the processor stall bit of the status register. When the stall bit is written, the processor finishes the current instruction and disables the ESB generation for the following clock cycles.

The processor is reactivated if it receives an interrupt. It restarts the ESB generation and loads the IV into the PC. The address of the instruction just following the stall is saved by the PC unit and must be loaded manually.
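
As an illustration of the instruction format from Section III-A and the opcode assignment in Table I, the following Python sketch decodes a 16-bit instruction and a 32-bit instruction word. Only the opcode table and the widths of the opcode (3 bits), address flags (2 bits) and source field (8 bits) are taken from the text; the 3-bit destination width is inferred from the remaining bits. The bit positions of the fields, the placement of the first instruction in the upper half of the word, and the names decode16, decode32 and Instr are assumptions made for this example, not the documented layout.

```python
from typing import NamedTuple, Optional, Tuple

# Opcode assignment copied from Table I: opcode -> (operation, functional unit).
OPCODES = {
    0b000: ("load", "LDST"),
    0b001: ("store", "LDST"),
    0b010: ("add", "ALU"),
    0b011: ("shift", "ALU"),
    0b100: ("and", "ALU"),
    0b101: ("or", "ALU"),
    0b110: ("xor", "ALU"),
    0b111: ("jump", "JMP"),
}


class Instr(NamedTuple):
    operation: str
    unit: str
    flags: int  # 2-bit address flags
    dest: int   # 3-bit destination field
    src: int    # 8-bit source field


def decode16(word: int) -> Instr:
    """Split a 16-bit instruction into its fields.

    ASSUMPTION: layout is [15:13] opcode, [12:11] flags, [10:8] dest, [7:0] src.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction must fit into 16 bits")
    operation, unit = OPCODES[(word >> 13) & 0b111]
    return Instr(operation, unit, (word >> 11) & 0b11, (word >> 8) & 0b111, word & 0xFF)


def decode32(word: int) -> Tuple[Instr, Optional[Instr]]:
    """Decode a 32-bit instruction word into two slots.

    ASSUMPTION: the first instruction sits in the upper half of the word.
    If both slots address the same functional unit, the second slot is
    reported as None, mirroring the resource conflict described in the text.
    """
    first = decode16((word >> 16) & 0xFFFF)
    second = decode16(word & 0xFFFF)
    if second.unit == first.unit:
        return first, None
    return first, second
```

Under the assumed layout, decode32(0x0D074701) yields a load in the LDST slot and an add in the ALU slot, whereas 0x47014701 names the ALU twice, so the second slot is reported as disabled.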

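
The PC update rule of Section III-C can be sketched as a priority selection. This is a behavioural Python model, not the hardware description; the function name select_next_pc, its parameters and the use of None for an inactive source are assumptions for this example. It further assumes that the value saved to pcIrq on an interrupt is the one that the highest-priority non-interrupt source would have written.

```python
from typing import Optional, Tuple

PC_MASK = 0x7FF  # 11-bit instruction address bus


def select_next_pc(pc_int: int,
                   jmp: Optional[int] = None,
                   ldst: Optional[int] = None,
                   irq_vector: Optional[int] = None) -> Tuple[int, Optional[int]]:
    """Return (next_pc, pc_irq) for one write-back stage.

    Priority follows the text: interrupt (highest), ldst, jmp, pcInt (lowest).
    pc_irq is the displaced value kept for the return from interrupt, or None
    when no interrupt is taken (ASSUMPTION, see lead-in).
    """
    if ldst is not None:
        normal = ldst
    elif jmp is not None:
        normal = jmp
    else:
        normal = pc_int
    normal &= PC_MASK

    if irq_vector is not None:
        return irq_vector & PC_MASK, normal
    return normal, None
```

For instance, a jump to 0x100 that coincides with an interrupt at vector 0x004 results in (0x004, 0x100): the vector is loaded and the jump target is kept for the manual return from interrupt.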
F. Debug interface

The TinyVLIW8 processor core uses volatile memories for the data and the instruction memory. Both memories do not require a controller and can be directly accessed by the processor core, which simplifies the design. Furthermore, a silicon device can be fabricated by using a standard process if a non-volatile memory is not required. But then the program code must be loaded by an external component on each power-on, which requires a simple debug interface.

In the interest of a small and simple design, a serial peripheral interface (SPI) slave is implemented. The SPI can be easily accessed by any off-the-shelf MCU, which makes an easy integration of the TinyVLIW8 core in a mote design feasible. Using the SPI slave all three memory buses can be accessed. Therefore, a data transfer initiated by the SPI master must start with a 16-bit header. The header elements are shown in Figure 5 and include the memory address, the selected memory and the access type - read or write. The length of the transfer is always 32 bits for the instruction memory and eight bits otherwise.

Fig. 5. The structure of the SPI header of the TinyVLIW8 debug interface.

The chip select (CS) signal of the SPI is connected to the stall signal of the processor core. When activating the SPI by pulling the CS line low the TinyVLIW8 core enters the stall state, where its ESB is disabled. Due to the asynchronous behavior of the CS signal the TinyVLIW8 debug interface pulls the SPI MISO line high until the processor core has finished the last instruction. The SPI master must observe the MISO line and can start its data transfer when the MISO line is low again. During the processor core stall an exclusive access to all memories by the SPI is guaranteed.

G. Software tools

We have implemented a simple assembler written in C. The assembler language supports the usage of the VLIW instructions by combining two instructions in a single line. The programmer has to take care that two different functional units are used. Otherwise the second instruction will be ignored by the processor core. To support the development process, the language supports the definition of variables, labels and special code sections like the interrupt vector (IV). Defined variables are automatically initialized by a small bootstrap code, which is inserted by the assembler tool.

Due to the complexity of optimizing the VLIW code the usage of higher level programming languages is not supported yet. Although the presence of a good software tool chain significantly simplifies the implementation of complex applications, in the application areas of tiny scale systems its absence may be acceptable.

IV. SOFT-CORE INTEGRATION

The TinyVLIW8 processor core was initially designed for the SWUR presented in [2]. We started the SWUR with a pure software implementation. The software was written for an MSP430-based wireless sensor node with low power capabilities. In a following step we selected software components for a possible hardware integration. The selection process was guided by the availability of the component in HDL. The final design should be based on as many already available modules and general purpose components as possible to reduce the development costs.

A. The SWUR design

The TOTP algorithm was designed to authenticate a user in a distributed server-based system and requires a globally synchronized clock source. We ported the algorithm to a WSN without a precise clock source. The authentication pattern is encoded by using the two available wake-up patterns codeA and codeB. The full sequence is shown in Figure 6. It starts with the start phase, where a codeA and a codeB are sent with a minimal data rate to minimize the idle power consumption. After a reconfiguration phase with a predefined length, the authentication pattern and the address follow. Due to the absence of a synchronized clock source, sender and receiver must use the incoming signals to synchronize their reception process. Furthermore, because of the unreliable wireless communication the receiver has to cope with missing bits. We took all of this into account by designing an application specific symbol decoder unit.

Fig. 6. The wake-up sequence with its four different phases. A bit is encoded by a wake-up pattern codeA or codeB. To minimize the idle power consumption the start-up code is sent with a minimal data rate.

The final design of the SWUR is shown in Figure 7. The design is built around the TinyVLIW8 core, which is extended by a SHA-1 hardware module to accelerate the most expensive mathematical operation. The SHA-1 function is periodically used by the TOTP algorithm to update the authentication pattern. We could take the SHA-1 module from our IHP430X MCU [22] and only had to adapt it to the 8-bit interface. The Symbol Decoder is the one functional unit that is specific to the SWUR application and had to be implemented from scratch.

The resource utilization of the SWUR components on a Cyclone IV FPGA device is shown in Table II. The table shows that by using a soft-core processor in combination with standard peripheral components, the size of the application-specific component is quite small. The overall size of our design is 3,712 lookup tables (LUTs) and 1,430 registers, of which the symbol decoder uses 260 LUTs (7.0 %) and 174 registers (12.2 %).

B. Implementation

We started the implementation of our TinyVLIW8 processor core on an Altera Cyclone IV FPGA. For testing and evaluation the design was ported to a Cyclone II device. The Cyclone II is used on Altera's development kit 1 (DK1), which is commonly used in research and education.


Fig. 7. System architecture of the SWUR. The TinyVLIW8 firmware implements the HMAC-SHA1, the secure pattern update and the WUR configuration
process. Mathematically extensive and time-critical operations are implemented in dedicated hardware units, which are connected via the IO bus to the processor
core.
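
For reference, the generic TOTP computation on which the firmware's HMAC-SHA1 step is based can be sketched in a few lines of Python. The sketch follows the standard construction from RFC 6238 (TOTP) on top of RFC 4226 (HOTP). It does not model the adaptation to a WSN without a precise clock source or the mapping of the result onto codeA/codeB patterns, neither of which is specified here. The 30 s time step and the six-digit decimal output are the RFC defaults and are assumptions, not SWUR parameters.

```python
import hashlib
import hmac
import struct


def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(key: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the time-step counter."""
    return hotp(key, unix_time // step, digits)
```

Sender and receiver that share the key and agree on the current time step derive the same value, which is why the scheme depends on a common notion of time.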

TABLE II. RESOURCE UTILIZATION OF THE SWUR COMPONENTS AND THE PERCENTAGE OF THE COMPLETE DESIGN GENERATED FOR A CYCLONE IV E FPGA DEVICE.

Component        LUTs   per cent  Registers  per cent
SHA-1            2,166  58.4      847        59.2
TinyVLIW8        760    20.5      226        15.8
Symbol decoder   260    7.0       174        12.2
Timer            111    3.0       64         4.5
SPI master       98     2.6       39         2.7
Debug interface  66     1.8       56         3.9
GPIO             4      0.1       24         1.7
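
The percentage columns of Table II can be cross-checked with a few lines of Python. The component values are copied from the table; the assumption is that the register percentages refer to the column sum and the LUT percentages to the overall design size of 3,712 LUTs quoted in the text.

```python
# Values copied from Table II.
luts = {
    "SHA-1": 2166, "TinyVLIW8": 760, "Symbol decoder": 260, "Timer": 111,
    "SPI master": 98, "Debug interface": 66, "GPIO": 4,
}
registers = {
    "SHA-1": 847, "TinyVLIW8": 226, "Symbol decoder": 174, "Timer": 64,
    "SPI master": 39, "Debug interface": 56, "GPIO": 24,
}
TOTAL_LUTS = 3712  # overall design size quoted in the text


def shares(values, total):
    """Percentage of `total` for each component, rounded to one decimal."""
    return {name: round(100 * value / total, 1) for name, value in values.items()}


total_registers = sum(registers.values())          # ASSUMPTION: column sum is the total
register_share = shares(registers, total_registers)
lut_share = shares(luts, TOTAL_LUTS)
unattributed_luts = TOTAL_LUTS - sum(luts.values())
```

Under these assumptions the computed shares reproduce both percentage columns, the register column sums to 1,430, and 247 LUTs of the overall design are not attributed to any listed component.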

We used the DK1 for measurements and for a comparison of our design with the cores introduced in section II. The final goal of the development process was the assembly of a silicon device. In a final step we will port the TinyVLIW8 design to IHP's in-house BiCMOS technology to manufacture a silicon device. Therefore, the Altera Nios II/e processor core was never an option.

C. Mote integration

The Cyclone IV FPGA device is part of the sensor node platform IHPstack [23]. The IHPstack is a modular sensor node with a flexible number of printed circuit boards (PCBs). Each PCB layer of the IHPstack is used for a single dedicated task. All modules are connected by a well-defined mote component interconnect (MCI), which guarantees high flexibility when combining modules. We added an FPGA stack module for easy integration of application specific hardware designs. The module can be used as a replacement of an MCU module or to extend another module with dedicated hardware functions.

An assembled node with four layers - USB base (power supply), MSP430 MCU, CC1101 transceiver, Cyclone IV FPGA and μRX1080 WUR - is shown in Figure 8. The diagram illustrates the mapping of the IHPstack modules to the components of the sensor node featuring our SWUR design. We used the FPGA module to extend the μRX1080 layer with the SWUR security functions. The TinyVLIW8 core is implemented in the FPGA module, which is connected by SPI and GPIO to the MCU and the μRX1080 WUR.

Fig. 8. Modular IHPstack mote used for evaluating the SWUR design. The mote is assembled from different PCBs with dedicated functionalities. We integrated the Cyclone IV PCB as an additional FPGA to implement the security extension of the μRX1080 WUR.

In the final design, the security module with the embedded TinyVLIW8 core and the WUR will be integrated into a single mixed-signal IC. IHP is able to fabricate BiCMOS designs with a structure size of 0.25 μm and 0.13 μm in our in-house fab. For these technologies the idle current leakage of digital circuits can be mostly neglected. However, the active current is proportional to the clock frequency. Hence, by using the 32 kHz clock of the WUR for the processor core, an ultra low power SWUR becomes possible.

V. EVALUATION

To draw conclusions about the performance and the usability of our soft-core processor, we compared the design size of different approaches and the memory footprint of their software programs. Furthermore, we analyzed the drawbacks of TinyVLIW8's limited ISA. For this comparison, we implemented a short example of a typical embedded control task.

A. Design size

For an evaluation of the design size we took the values of other soft-core processors from the literature - Leros, ρ-VEX, Leon2 - or performed measurements using Altera's Quartus II version 11.1 software if the designs were synthesizable. The results are summarized in Table III.

TABLE III. COMPARISON OF THE DESIGN SIZE OF SOFT-CORE PROCESSORS. * SIZE ESTIMATED

Processor         Logic cells  Register width  FPGA
TinyVLIW8         1,056        8-bit           Altera Cyclone II
Leros             435          16-bit          Xilinx Spartan 3E
openMSP           2,841        16-bit          Altera Cyclone II
IHP430X           4,107        20-bit          Altera Cyclone II
Supersmall        385*         32-bit          Altera Cyclone II
Altera Nios II/e  802          32-bit          Altera Cyclone II
ρ-VEX             1,895        32-bit          Xilinx Virtex-II Pro
Leon2 [24]        9,299        32-bit          Altera Cyclone

Table III shows that the TinyVLIW8 has the smallest design size apart from the Nios II/e, the Leros and the ‘supersmall’ soft-core processor. The ‘supersmall’ soft-core processor has the smallest design that we found in the literature. It uses only 236 LUTs on an Altera Stratix III device, which corresponds to an estimated size of 385 LUTs on a Cyclone II device. It uses a serial architecture to achieve a minimal area and is only 36 % of the size of our approach. The Leros is a pipelined 16-bit accumulator processor in which only a single dedicated register - the accumulator - is connected to the arithmetic logical unit (ALU) output and provides one input to the ALU. Although the Leros is a Java system, its ISA is limited to very few operations. Furthermore, it has neither a status register nor an interrupt controller, which significantly limits its use. Especially the lack of an interrupt controller makes it unsuitable for embedded control tasks. The Nios II/e, optimized for the Altera Cyclone FPGA, is only 20 per cent smaller than our design, but its application is restricted to Altera FPGAs.

In comparison to an alternative VLIW soft-core, the ρ-VEX, our design is about half the size. The ρ-VEX core was configured similarly, with two parallel operations. Designs with four - 5,105 slices - and eight - 10,433 slices - parallel operations of the soft-core processor allocate significantly more resources. General purpose MCUs, like the IHP430X (389 %) and the Leon2 (881 %), are significantly larger than the TinyVLIW8 soft-core processor. We are convinced that these types of processors are seriously over-featured for embedded tasks.

B. Instruction emulation

Due to the limited size of the instruction word, the number of different instructions of the ISA is quite low. In particular, the designed ISA lacks native support for a move instruction and a subroutine call. The move instruction must be emulated by the logical operations xor, to clear the target register, and or, to load the content into the register. However, due to the VLIW feature the required xor operation can be combined with any previous ldst or jmp instruction. In most cases the additional instruction can be avoided in this way.

A subroutine call is more complex because of the required stack handling. Because our approach has no native support for a stack pointer register, it must be emulated with one of the available GP registers. Listing 1 shows an example implementation of a subroutine call. We use the register r7 to store the stack pointer. Before calling the subroutine the current program counter must be copied from the PC unit onto the stack, whereby we can use the ldst and the ALU unit in parallel. A subroutine call then needs five instructions and 10 clock cycles, which is only slightly more than the requirements of more complex ISAs.

    ldi r4, #0x06 | add r7, #0xff;
    ldi r5, #0x07 | add r4, #0x04;
    st r4, @r7 | addi r5, #0x00;
    add r7, 0xff;
    st r5, @r7 | jmp sha1init;

Listing 1. Emulation of a subroutine call

The return instruction must be emulated as well. We can do this with the program code shown in Listing 2. We must restore the program counter from the stack content and must increment the stack address. Again, the ldst and the ALU unit can be used in parallel, which minimizes the clock cycle overhead.

    ldi r5, @r7;
    sti r5, #0x05 | add r7, #0x01;
    ldi r5, @r7;
    sti r5, #0x04 | add r7, #0x01;

Listing 2. Emulation of a return from subroutine

A return from interrupt instruction can be emulated in a similar way. When an interrupt occurs the program counter is stored in the PC unit. It must be copied back to the PC register by using load-store operations. The interrupt shadow registers can be used, so that overwriting the register contents can be avoided.

C. Execution time

To analyze the single instruction performance we compared the clock cycles per instruction with those of other ISAs. However, we skipped the Leros soft-core processor because of its reduced ISA and the Leon2 because of its size. Due to its serial architecture, the ‘supersmall’ soft-core processor has a considerable speed penalty, which makes it 10 times slower than the Nios II/e soft-core processor. We used the openMSP as a representative of the MSP430 ISA, which includes the IHP430X as well. Furthermore, we analyzed the Nios II/e processor core. The results are summarized in Table IV.

TABLE IV. COMPARISON OF THE INSTRUCTION EXECUTION PERFORMANCE.

Instruction             TinyVLIW8  openMSP  Nios II/e
subroutine call         10         4        6
return from subroutine  8          4        6
return from interrupt   8          8        6
interrupt service       2          6        6
ALU op                  (1 - 2)    1        6
move reg-reg            (1 - 4)    1        6
move reg-mem            (1 - 2)    4        6

Table IV shows that the TinyVLIW8 is slightly slower than its counterparts when executing subroutine calls and returns. But interrupt handling and basic operations can be executed with a similar performance or even faster when two instructions can be combined as described above.

D. Program size

In a last step we compared the size of a HMAC-SHA1 implementation on the various soft-core processors. The implementation is based on the work of Aaron Gifford [25]. Since

the SHA-1 transform operation is provided by a hardware module, the software must serve the hardware interface only. For the openMSP and the Nios II/e we used the GCC ports of these processors. The TinyVLIW8 implementation was written in assembler. The results are summarized in Table V.

TABLE V. COMPARISON OF THE PROGRAM SIZE OF A HMAC-SHA1 IMPLEMENTATION. THE SHA1 MODULE INCLUDES A DRIVER FOR A HARDWARE MODULE ONLY.

Processor  Function  Instr.  Text  Data
TinyVLIW8  SHA-1     185     512   3
TinyVLIW8  HMAC      251     616   66
openMSP    SHA-1     109     278   4
openMSP    HMAC      121     344   66
Nios II/e  SHA-1     109     436   8
Nios II/e  HMAC      166     672   66

Although Table V shows that the TinyVLIW8 has the largest number of instructions and code size, its performance is competitive with or better than that of its counterparts. Due to its capability of combining two instructions in a single 32-bit instruction word, the real number of instructions (32-bit instruction words) is only 128 for the SHA-1 and 154 for the HMAC module, and each of these instructions is executed within exactly two cycles. Hence, in real applications the overall performance will be equal to or slightly better than the openMSP’s and obviously better than the Nios II/e’s performance. Furthermore, the TinyVLIW8’s code size is similar to that of the Nios II/e, and the smaller code size of the openMSP comes at the cost of a doubled design size.

VI. CONCLUSION

To be of use as an embedded controller in tiny scale systems, a soft-core processor should combine a minimal design size with a consistent, predictable and low cycle count per instruction and the suitability for implementation in an ASIC. While existing approaches exhibit some of these features, the presented design is the first which combines all the desired properties in one unit. The design size is less than half that of other full-featured soft-core processors. Its performance is similar to or even better than that of FPGA-optimized soft-core processors in terms of code size and required cycles to complete a task. It is not skewed towards a specific FPGA device and is therefore free to be used in silicon devices. The conclusions are based on analysis of a typical application, which shows that our design is suitable for realistic scenarios. In summary, the presented VLIW processor implements a good balance of logic and instruction execution time, which makes it suitable for a broad variety of embedded control tasks.

ACKNOWLEDGEMENTS

The research leading to these results has received funding from the Federal Ministry of Education and Research (BMBF) under grant agreement No. 16 BN1110.

REFERENCES

[1] IHP, “Aeternitas: A energy-efficient Wakeup-System for wireless sensor nodes,” 2012. [Online]. Available: http://www.aet-projekt.de/project.html
[2] O. Stecklina, S. Kornemann, and M. Methfessel, “A Secure Wake-up Scheme for Low Power Wireless Sensor Nodes,” in Proceedings of the 4th International Workshop on Mobile Systems and Sensors Networks for Collaboration, ser. MSSNC 2014, Minneapolis, USA, May 2014.
[3] H. Milosiu, F. Oehler, M. Eppel, D. Frühsorger, S. Lensing, G. Popken, and T. Thönes, “A 3-μW 868-MHz Wake-Up Receiver with -83 dBm Sensitivity and Scalable Data Rate,” in Proceedings of the 39th European Solid-State Circuit conference, ser. ESSCIRC. Bucharest, Romania: IEEE, September 2013.
[4] D. M’Raihi, S. Machani, M. Pei, and J. Rydell, “IETF RFC: RFC6238 - TOTP: Time-based One-Time Password Algorithm,” US, May 2011.
[5] J. G. Tong, I. D. L. Anderson, and M. A. S. Khalid, “Soft-Core Processors for Embedded Systems,” in Microelectronics, 2006. ICM ’06. International Conference on, Dec 2006, pp. 170–173.
[6] OpenCores, “The #1 community within open source hardware IP-cores,” 2013. [Online]. Available: http://opencores.org
[7] W. T. Barden, Z80 Microcomputer Handbook. Indianapolis, IN, USA: Sams, 1978.
[8] “Nios II/e: Economy,” Altera, San Jose, CA, USA. [Online]. Available: http://www.altera.com/devices/processor/nios2/cores/economy/ni2-economy-core.html
[9] Xilinx, PicoBlaze 8-bit Embedded Microcontroller User Guide, 2nd ed., June 2011. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/ug129.pdf
[10] Xilinx, MicroBlaze Processor Reference Guide, 9th ed., 2008. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
[11] J. Gaisler, LEON2 Processor User’s Manual, 1st ed., Aeroflex Gaisler AB, Goteborg, Sweden, 2003. [Online]. Available: http://www.gaisler.com/doc/leon2-1.0.24-xst.pdf
[12] OpenSPARC, “World’s first free 64-bit cmt microprocessor.” [Online]. Available: http://www.opensparc.net/
[13] O. Girard, “openMSP :: Overview,” 2014. [Online]. Available: http://opencores.org/project,openmsp430
[14] “IPMS430x,” Website, 2010, http://www.ipms.fraunhofer.de/.
[15] M. Schoeberl, “Leros: A Tiny Microcontroller for FPGAs,” in Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, ser. FPL ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 10–14.
[16] K. Hays and Jshamlet, “Open8 urisc :: Overview,” 2013. [Online]. Available: http://opencores.org/project,open8_urisc
[17] J. Robinson, S. Vafaee, J. Scobbie, M. Ritche, and J. Rose, “The supersmall soft processor,” in Proceedings of the 6th VI Southern Programmable Logic Conference, ser. SPL ’10, Porto de Galinhas Beach, Brazil, March 2010, pp. 3–8.
[18] C. Iseli and E. Sanchez, “Spyder: a reconfigurable VLIW processor using FPGAs,” in FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on, Apr 1993, pp. 17–24.
[19] V. Brost, F. Yang, and M. Paindavoine, “A modular VLIW Processor,” in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, May 2007, pp. 3968–3971.
[20] S. Wong, T. van As, and G. Brown, “(rho)-VEX: A reconfigurable and extensible softcore VLIW processor,” in ICECE Technology, 2008. FPT 2008. International Conference on, Dec 2008, pp. 369–372.
[21] A. Beck and L. Carro, “A VLIW low power Java processor for embedded applications,” in Integrated Circuits and Systems Design, 2004. SBCCI 2004. 17th Symposium on, Sept 2004, pp. 157–162.
[22] G. Panic, T. Basmer, O. Schrape, S. Peter, F. Vater, and K. Tittelbach-Helmrich, “Sensor node processor for security applications,” in In proceedings of 18th IEEE International Conference on Electronics, Circuits and Systems, ser. ICECS 2011, Beirut, Lebanon, December 2011, pp. 81–84.
[23] O. Stecklina, D. Genschow, and C. Goltz, “TandemStack - A Flexible and Customizable Sensor Node Platform for Low Power Applications,” in Proceedings of the 1st International Conference on Sensor Networks, ser. Sensornets 2012, Rome, Italy, February 2012.
[24] “Running Leon2 on the Altera Nios Development Board, Cyclone Edition.” [Online]. Available: http://www.mdforster.pwp.blueyonder.co.uk/LeonCyclone.html
[25] A. Gifford, “Implementations of SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512,” 2004. [Online]. Available: http://www.aarongifford.com/computers/sha.html
