A Tiny Scale VLIW Processor For RealTime
In this paper we present the architectural design of the tiny scale very long instruction word (VLIW) soft-core processor TinyVLIW8. The processor is designed to achieve a minimal instruction execution time and design size. Although the instruction repertoire is not large, it is adequate for control tasks that require decision making which could not easily be implemented in an application specific integrated circuit (ASIC) and that do not require extensive mathematical processing. Especially in the area of embedded control tasks with real-time requirements, an architecture using single-cycle instructions is key. We illustrate an application of our TinyVLIW8 by presenting a design of a secure wake-up receiver for low power wireless sensor nodes. To the best of our knowledge, the presented soft-core processor is the smallest VLIW design with an average instruction execution time of only one clock cycle. The core consumes less than 6 % of the logic cells of the smallest Altera Cyclone IV FPGA and can be used in a system-on-chip design as well.

Keywords—embedded systems; hardware-software codesign; soft-core processors; VLIW; CPU; ASIC; embedded controller; sensor networks; security

I. INTRODUCTION

Tiny scale systems are becoming more and more indispensable to our everyday life because they are powerful enough to gather useful information to assist us. But partly as a result of their manifold capabilities, embedded systems are also able to do many things that we do not want. In fact, tiny scale systems will always be vulnerable to doing the bidding of attackers, to the detriment of their owners. Hence, the integration of security functions becomes mandatory for this class of embedded systems as well. However, the insufficient resources and especially the limited power sources of battery powered systems forbid complex extensions. Furthermore, continuously changing security demands ask for a flexible architecture, which makes the implementation of a bare application specific integrated circuit (ASIC) less attractive. We are convinced that the integration of a tiny scale soft-core processor is a much better solution for integrated circuits (ICs) with an embedded application specific controller.

Wireless devices (motes) in particular must be able to cope with the diverging requirements regarding power consumption, security and increasing functionality. In the context of the research project [1], we have implemented a concept of an ultra low power secure wake-up receiver [2]. In particular, a wake-up receiver (WUR) is vulnerable to depletion attacks, in which a wake-up signal is sent repeatedly to drain the mote’s power supply. Our secure wake-up receiver (SWUR) is based on a WUR developed by Fraunhofer IIS [3] and is extended by a security module that features the time-based one-time password (TOTP) algorithm for a trustworthy and secure wake-up scheme. The algorithm employs a time-synchronized SHA-1 keyed-hash message authentication code (HMAC) [4]. Due to the complexity of the TOTP algorithm we decided to integrate a tiny scale soft-core processor to achieve the highest possible flexibility while considering power efficiency for the demanded control application. Furthermore, we use the 32 kHz crystal oscillator of the WUR to minimize the active power consumption. The TOTP algorithm defines the timing requirements that must be met by the underlying hardware driven by the 32 kHz clock source. Hence, the design of the tiny scale soft-core processor must balance logic and instruction execution time.

In the following we present the TinyVLIW8, a tiny RISC processor core that is optimized for low logic usage and minimal execution time per instruction. The core is based on a Harvard architecture with dedicated instruction, data and I/O buses. It has three different function units, of which two can be used in parallel by a single 32-bit instruction word. Each instruction is executed within exactly two clock cycles. By using two instructions in parallel, the average execution time of an instruction is only a single clock cycle. The design satisfies real-time requirements by providing a fixed instruction size and an invariant instruction execution time. These capabilities predestine our design for a broad variety of embedded control applications in tiny scale systems.

In the following section we briefly present a small selection of existing soft-core processors for field programmable gate arrays (FPGAs) and silicon devices. In section III we describe the architecture of our TinyVLIW8 processor core. Implementation variants and the integration in a frequently used sensor node are explained in section IV. Section V summarizes our evaluation results, in which we compare our design with the soft-core processors presented in section II. We conclude this paper with a short summary of the key contributions of our processor design.

II. RELATED WORK

The idea of using a small soft-core processor as part of a hardware design process is becoming more widespread. This is due to a number of significant advantages that soft-core processors hold over application specific ICs [5]. Cores are
Fig. 1. The block diagram of the TinyVLIW8 processor core. The core has a single input clock and supports four interrupt lines. External peripherals are
connected to the 8-bit I/O memory.
Fig. 3. The functional units are driven by the execution stage bus (ESB). The ESB is generated from the main clock by using the rising and the falling edges of two clock cycles.

c) Execute: This stage is used by the alu to write its results back to the register set. The ldst unit enables the data memory and peripheral registers in order to use them within the next stage.

d) Write back: This stage is the last stage and enables the write enable signals of the data memory bus, the IO memory bus or the register set. Furthermore, the program counter register is loaded from one of the sources within this stage.

The register set is used by the alu and the ldst unit. To use both units in parallel their accesses are spread over the last two execution stages. The instruction memory is enabled within the first two stages only. Therefore, the functional units must buffer the data in local registers. The buffer is necessary to change the instruction address before the rising edge of the first stage.

C. Program counter

The program counter (PC) is not part of the general purpose register set. It is implemented in an additional unit. The content of the register can be modified by incrementing it as well as by overwriting it through the ldst unit, the jmp unit or the interrupt controller. Furthermore, the ldst unit can read the program counter, which is necessary for implementing subroutine calls.

The content of the PC register must be stable within the first two execution stages for addressing the current instruction as well as within the last two stages for the ldst read operation. Therefore, the update and the read operation must be done on shadow registers. A flow diagram of the update process and the shadow registers used is shown in Figure 4.

Fig. 4. The PC unit includes the PC register and logic for its update. The incremented PC is buffered in the pcInt register during stage one. The PC is loaded during stage four from one of the four possible sources.

The program counter is incremented within stage one, and the result is written to the temporary PC register pcInt. The content of the pcInt register can be read by the ldst unit. The update of the PC takes place within stage four. It is guaranteed that all sources provide a stable signal within this stage. The update of the register is prioritized in the following order: interrupt (highest), ldst, jmp and pcInt (lowest).

In case of an interrupt the content of the lower prioritized source is stored in the pcIrq register. This register can be read by the ldst unit to implement the return from interrupt instruction.

D. Interrupt handling

The interrupt controller supports four asynchronous interrupt sources. A peripheral unit can raise an interrupt by enabling its interrupt line. The line must be held by the peripheral unit until an IRQ acknowledge is set. The IRQ acknowledge is generated by the interrupt controller within stage three when the interrupt vector (IV) is loaded to the PC. If an interrupt occurs before stage three it will be handled within the following instruction cycle. Otherwise the next instruction is finished and interrupt handling starts afterwards. Due to the fixed execution time of a single instruction the maximum interrupt delay is three clock cycles.

The return from interrupt must be implemented manually by using ldst operations. The address of the instruction displaced by the interrupt is stored in a dedicated register of the PC unit. It must be copied with ldst operations to the PC register to return to the normal program flow. During interrupt handling the register set provides four additional registers, which can be freely used without overwriting data from the normal flow. The registers are mapped into the register set and cover the four lower GP registers. Interrupt handling is finished when the program counter is loaded by the ldst unit. An in-interrupt flag is provided by the status register. The flag can be used by software to share code between interrupt handling and normal operations.

The processor core does not support nested interrupts. If an interrupt occurs while the interrupt flag is active, the handling is delayed until a return from interrupt is done. The interrupt acknowledge is also delayed so that the nested interrupt will not be ignored.

E. Clock gating

To reduce the idle consumption of the TinyVLIW8 the peripheral units and the processor core can be selectively switched off. Each peripheral unit with a clock input can be disabled as well. Furthermore, the processor core can be stalled by software by writing the processor stall bit of the status register. When the stall bit is written, the processor finishes the current instruction and disables the ESB generation for the following clock cycles.

The processor is reactivated if it receives an interrupt. It restarts the ESB generation and loads the IV into the PC. The address of the instruction just following the stall is saved by the PC unit and must be loaded manually.
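The update rule of the PC unit from Section C, which also governs the wake-up from a stall, can be summarized in a few lines of code. The following Python sketch is only a behavioural model under our reading of the text and not the HDL of the core: the function name select_next_pc, the use of None for an inactive source and the example addresses are assumptions made for illustration.

```python
def select_next_pc(pc_int, irq_vector=None, ldst_value=None, jmp_target=None):
    """Model of the stage-four PC update (hypothetical helper, not from the paper).

    pc_int      incremented PC buffered in pcInt during stage one
    irq_vector  interrupt vector (IV), or None if no interrupt is taken
    ldst_value  value written to the PC by the ldst unit, or None
    jmp_target  target supplied by the jmp unit, or None

    Returns (next_pc, pc_irq). pc_irq is the displaced address that the
    PC unit saves in pcIrq, or None if no interrupt is taken.
    """
    # Priority among the non-interrupt sources: ldst > jmp > pcInt.
    if ldst_value is not None:
        normal_flow = ldst_value
    elif jmp_target is not None:
        normal_flow = jmp_target
    else:
        normal_flow = pc_int

    # An interrupt has the highest priority. The address of the displaced
    # lower-priority source is kept so that software can return to it.
    if irq_vector is not None:
        return irq_vector, normal_flow
    return normal_flow, None


# Example with made-up addresses: a jump to 0x40 is displaced by an
# interrupt whose vector is 0x02, so 0x40 ends up in pcIrq.
next_pc, pc_irq = select_next_pc(0x11, irq_vector=0x02, jmp_target=0x40)
```

In this model a return from interrupt corresponds to a later call in which the ldst unit supplies the saved pcIrq value as ldst_value, which matches the statement that interrupt handling is finished when the program counter is loaded by the ldst unit.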
F. Debug interface

The TinyVLIW8 processor core uses volatile memories for the data and the instruction memory. Both memories do not require a controller and can be directly accessed by the processor core, which simplifies the design. Furthermore, a silicon device can be fabricated by using a standard process if a non-volatile memory is not required. But then the program code must be loaded by an external component on each power on, which requires a simple debug interface.

In the interest of a small and simple design, a serial peripheral interface (SPI) slave is implemented. The SPI can be easily accessed by any off-the-shelf MCU, which makes an easy integration of the TinyVLIW8 core in a mote design feasible. Using the SPI slave all three memory buses can be accessed. Therefore, a data transfer initiated by the SPI master must start with a 16-bit header. The header elements are shown in Figure 5 and include the memory address, the selected memory and the access type (read or write). The length of the transfer is always 32 bits for the instruction memory and eight bits otherwise.

Fig. 5. The structure of the SPI header of the TinyVLIW8 debug interface.

pure software implementation. The software was written for an MSP430-based wireless sensor node with low power capabilities. In a following step we selected software components for a possible hardware integration. The selection process was guided by the availability of the component in HDL. The final design should be based on as many already available modules as possible as well as on general purpose components to reduce the development costs.

A. The SWUR design

The TOTP algorithm was designed to authenticate a user in a distributed server-based system and requires a globally synchronized clock source. We ported the algorithm to a wireless sensor network (WSN) without a precise clock source. The authentication pattern is encoded by using the two available wake-up patterns codeA and codeB. The full sequence is shown in Figure 6. It starts with the start phase, where a codeA and a codeB are sent with a minimal data rate to minimize the idle power consumption. After a reconfiguration phase with a predefined length, the authentication pattern and the address follow. Due to the absence of a synchronized clock source, sender and receiver must use the incoming signals to synchronize their reception process. Furthermore, because of the unreliable wireless communication the receiver has to cope with missing bits. We took all of this into account by designing an application specific symbol decoder unit.
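For reference, the standard TOTP computation of RFC 6238 [4] can be sketched in Python as follows. This is the generic algorithm for a synchronized clock, not the TinyVLIW8 firmware and not the ported variant used by the SWUR. The 30 s time step, the decimal truncation and the function names hotp and totp follow the RFC defaults and are assumptions with respect to this paper.

```python
import hashlib
import hmac
import struct


def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-SHA1 based one-time password for a given counter (RFC 4226)."""
    # The counter is authenticated as an 8-byte big-endian value.
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects 4 bytes.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(key: bytes, unix_time: int, step: int = 30, t0: int = 0,
         digits: int = 6) -> str:
    """Time-based one-time password: HOTP over the current time step."""
    return hotp(key, (unix_time - t0) // step, digits)


# Test vector from RFC 6238, Appendix B (SHA-1, 8 digits, time = 59 s).
code = totp(b"12345678901234567890", 59, digits=8)  # "94287082"
```

Both sides derive the same code only if their time-step counters agree, which is the reason why the scheme above has to replace the globally synchronized clock by a synchronization on the incoming signals.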
Fig. 7. System architecture of the SWUR. The TinyVLIW8 firmware implements the HMAC-SHA1, the secure pattern update and the WUR configuration
process. Mathematically extensive and time-critical operations are implemented in dedicated hardware units, which are connected via the IO bus to the processor
core.
Component        LUTs    per cent   Registers   per cent
SHA-1            2,166   58.4       847         59.2
TinyVLIW8          760   20.5       226         15.8
Symbol decoder     260    7.0       174         12.2
Timer              111    3.0        64          4.5
TABLE III. COMPARISON OF THE DESIGN SIZE OF SOFT-CORE PROCESSORS. * SIZE ESTIMATED

Processor          Logic cells   Register width   FPGA
TinyVLIW8          1,056         8-bit            Altera Cyclone II
Leros              435           16-bit           Xilinx Spartan 3E
openMSP            2,841         16-bit           Altera Cyclone II
IHP430X            4,107         20-bit           Altera Cyclone II
Supersmall         385*          32-bit           Altera Cyclone II
Altera Nios II/e   802           32-bit           Altera Cyclone II
ρ-VEX              1,895         32-bit           Xilinx Virtex-II Pro
Leon2 [24]         9,299         32-bit           Altera Cyclone

Table III shows that the TinyVLIW8 has the smallest design size besides the Nios II/e, the Leros and the 'supersmall' soft-core processor. The 'supersmall' soft-core processor has the smallest design that we found in the literature. It uses only 236 LUTs on an Altera Stratix III device, which corresponds to an estimated size of 385 LUTs on a Cyclone II device. It uses a serial architecture to achieve a minimal area. It is only 36 % of the size of our approach. The Leros is a pipelined 16-bit accumulator processor in which only a single dedicated register, the accumulator, is connected to the arithmetic logical unit (ALU) output and provides one input to the ALU. Although the Leros is a Java system, its ISA is limited to very few operations. Furthermore, it does not have a status register and an interrupt controller, which significantly limits its use. Especially the lack of an interrupt controller makes it unsuitable for embedded control tasks. The Nios II/e, optimized for the Altera Cyclone FPGA, is only 20 per cent smaller than our design. But its application is restricted to Altera FPGAs.

In comparison to an alternative VLIW soft-core, the ρ-VEX, our design is two times smaller. The ρ-VEX core was set up in a similar configuration with two parallel operations. Designs with four (5,105 slices) and eight (10,433 slices) parallel operations of the soft-core processor allocate significantly more resources. General purpose MCUs, like the IHP430X (389 %) and the Leon2 (881 %), are significantly larger than the TinyVLIW8 soft-core processor. We are convinced that these types of processors are seriously over-featured for embedded tasks.

Due to the limited size of the instruction word the number of different instructions of the ISA is quite low. In particular, the designed ISA lacks native support for a move instruction and a subroutine call. The move instruction must be emulated by the logical operations xor, to clear the target register, and or, to load the content to the register. However, due to the VLIW feature the required xor operation can be combined with any previous ldst or jmp instruction. In most cases the additional instruction can be avoided in this way.

A subroutine call is more complex because of the required stack handling. Because our approach does not have native support for a stack pointer register, it must be emulated with one of the available GP registers. Listing 1 shows an example implementation of a subroutine call. We use the register r7 to store the stack pointer. Before calling the subroutine the current program counter must be copied from the PC unit onto the stack, whereby we can use the ldst and the ALU unit in parallel. A subroutine call then needs five instructions and 10 clock cycles, which is only slightly more than the requirements of more complex ISAs.

    ldi r4, #0x06 | add r7, #0xff;
    ldi r5, #0x07 | add r4, #0x04;
    st r4, @r7 | addi r5, #0x00;
    add r7, 0xff;
    st r5, @r7 | jmp sha1init;

Listing 1. Emulation of a subroutine call

The return instruction must be emulated as well. We can do this by the program code shown in Listing 2. We must restore the program counter from the stack content and must increment the stack address. Again, the ldst and the ALU unit can be used in parallel, which minimizes the clock cycle overhead.

    ldi r5, @r7;
    sti r5, #0x05 | add r7, #0x01;
    ldi r5, @r7;
    sti r5, #0x04 | add r7, #0x01;

Listing 2. Emulation of a return from subroutine

A return from interrupt instruction can be emulated in a similar way. When an interrupt occurs the program counter is stored in the PC unit. It must be copied back to the PC register by using load-store operations. The interrupt shadow registers can be used, so that overwriting the register contents can be avoided.

C. Execution time

To analyze the single instruction performance we compared the clock cycles per instruction with other ISAs. We skipped the Leros soft-core processor because of its reduced ISA and the Leon2 because of its size. Due to the serial architecture of the 'supersmall' soft-core processor it has a considerable speed penalty, which makes it 10 times slower than the Nios II/e soft-core processor. We used the openMSP as a representative of the MSP430 ISA, which includes the IHP430X as well. Furthermore, we analyzed the Nios II/e processor core. The results are summarized in Table IV.

TABLE IV. COMPARISON OF THE INSTRUCTION EXECUTION PERFORMANCE.

Operation                TinyVLIW8   openMSP   Nios II/e
subroutine call          10          4         6
return from subroutine   8           4         6
return from interrupt    8           8         6
interrupt service        2           6         6
ALU op                   (1 - 2)     1         6
move reg-reg             (1 - 4)     1         6
move reg-mem             (1 - 2)     4         6

Table IV shows that the TinyVLIW8 is slightly slower than its counterparts when executing subroutine calls and returns. But interrupt handling and basic operations can be executed with a similar performance or even faster when two instructions can be combined as described above.

D. Program size

In a last step we compared the size of a HMAC-SHA1 implementation on the various soft-core processors. The implementation is based on the work of Aaron Gifford [25]. Since
the SHA-1 transform operation is provided by a hardware module, the software must serve the hardware interface only. For the openMSP and the Nios II/e we used the GCC ports of these processors. The TinyVLIW8 implementation was written in assembler. The results are summarized in Table V.

TABLE V. COMPARISON OF THE PROGRAM SIZE OF A HMAC-SHA1 IMPLEMENTATION. THE SHA1 MODULE INCLUDES A DRIVER FOR A HARDWARE MODULE ONLY.

Processor   Function   Instr.   Text   Data
TinyVLIW8   SHA-1      185      512    3
            HMAC       251      616    66
openMSP     SHA-1      109      278    4
            HMAC       121      344    66
Nios II/e   SHA-1      109      436    8
            HMAC       166      672    66

Although Table V shows that the TinyVLIW8 has the largest number of instructions and code size, its performance is competitive with or better than that of its counterparts. Due to its capability of combining two instructions in a single 32-bit instruction word, the real number of instruction words is only 128 for the SHA-1 and 154 for the HMAC module (a text size of 512 and 616 at four bytes per word), and each of these instruction words is executed within exactly two cycles. Hence, in real applications the overall performance will be equal to or slightly better than the openMSP’s and obviously better than the Nios II/e’s performance. Furthermore, the TinyVLIW8’s code size is similar to that of the Nios II/e, and the smaller code size of the openMSP comes at the cost of a doubled design size.

VI. CONCLUSION

To be of use as an embedded controller in tiny scale systems, a soft-core processor should combine a minimal design size with a consistent, predictable and low cycle count per instruction and the suitability for implementation in an ASIC. While existing approaches exhibit some of these features, the presented design is the first which combines all the desired properties in one unit. The design size is less than half of that of other full-featured soft-core processors. Its performance is similar to or even better than that of FPGA-optimized soft-core processors in terms of code size and required cycles to complete a task. It is not skewed towards a specific FPGA device and is therefore free to be used in silicon devices. The conclusions are based on the analysis of a typical application, which shows that our design is suitable for realistic scenarios. In summary, the presented VLIW processor implements a good balance of logic and instruction execution time, which makes it suitable for a broad variety of embedded control tasks.

ACKNOWLEDGEMENTS

The research leading to these results has received funding from the Federal Ministry of Education and Research (BMBF) under grant agreement No. 16 BN1110.

REFERENCES

[1] IHP, “Aeternitas: A energy-efficient Wakeup-System for wireless sensor nodes,” 2012. [Online]. Available: http://www.aet-projekt.de/project.html
[2] O. Stecklina, S. Kornemann, and M. Methfessel, “A Secure Wake-up Scheme for Low Power Wireless Sensor Nodes,” in Proceedings of the 4th International Workshop on Mobile Systems and Sensors Networks for Collaboration, ser. MSSNC 2014, Minneapolis, USA, May 2014.
[3] H. Milosiu, F. Oehler, M. Eppel, D. Frühsorger, S. Lensing, G. Popken, and T. Thönes, “A 3-μW 868-MHz Wake-Up Receiver with -83 dBm Sensitivity and Scalable Data Rate,” in Proceedings of the 39th European Solid-State Circuit conference, ser. ESSCIRC. Bucharest, Romania: IEEE, September 2013.
[4] D. M’Raihi, S. Machani, M. Pei, and J. Rydell, “IETF RFC: RFC6238 - TOTP: Time-based One-Time Password Algorithm,” US, May 2011.
[5] J. G. Tong, I. D. L. Anderson, and M. A. S. Khalid, “Soft-Core Processors for Embedded Systems,” in Microelectronics, 2006. ICM ’06. International Conference on, Dec 2006, pp. 170–173.
[6] OpenCores, “The #1 community within open source hardware IP-cores,” 2013. [Online]. Available: http://opencores.org
[7] W. T. Barden, Z80 Microcomputer Handbook. Indianapolis, IN, USA: Sams, 1978.
[8] “Nios II/e: Economy,” Altera, San Jose, CA, USA. [Online]. Available: http://www.altera.com/devices/processor/nios2/cores/economy/ni2-economy-core.html
[9] Xilinx, PicoBlaze 8-bit Embedded Microcontroller User Guide, 2nd ed., June 2011. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/ug129.pdf
[10] Xilinx, MicroBlaze Processor Reference Guide, 9th ed., 2008. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/mb_ref_guide.pdf
[11] J. Gaisler, LEON2 Processor User’s Manual, 1st ed., Aeroflex Gaisler AB, Goteborg, Sweden, 2003. [Online]. Available: http://www.gaisler.com/doc/leon2-1.0.24-xst.pdf
[12] OpenSPARC, “World’s first free 64-bit cmt microprocessor.” [Online]. Available: http://www.opensparc.net/
[13] O. Girard, “openMSP :: Overview,” 2014. [Online]. Available: http://opencores.org/project,openmsp430
[14] “IPMS430x,” Website, 2010, http://www.ipms.fraunhofer.de/.
[15] M. Schoeberl, “Leros: A Tiny Microcontroller for FPGAs,” in Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, ser. FPL ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 10–14.
[16] K. Hays and Jshamlet, “Open8 urisc :: Overview,” 2013. [Online]. Available: http://opencores.org/project,open8_urisc
[17] J. Robinson, S. Vafaee, J. Scobbie, M. Ritche, and J. Rose, “The supersmall soft processor,” in Proceedings of the 6th VI Southern Programmable Logic Conference, ser. SPL ’10, Porto de Galinhas Beach, Brazil, March 2010, pp. 3–8.
[18] C. Iseli and E. Sanchez, “Spyder: a reconfigurable VLIW processor using FPGAs,” in FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on, Apr 1993, pp. 17–24.
[19] V. Brost, F. Yang, and M. Paindavoine, “A modular VLIW Processor,” in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, May 2007, pp. 3968–3971.
[20] S. Wong, T. van As, and G. Brown, “(rho)-VEX: A reconfigurable and extensible softcore VLIW processor,” in ICECE Technology, 2008. FPT 2008. International Conference on, Dec 2008, pp. 369–372.
[21] A. Beck and L. Carro, “A VLIW low power Java processor for embedded applications,” in Integrated Circuits and Systems Design, 2004. SBCCI 2004. 17th Symposium on, Sept 2004, pp. 157–162.
[22] G. Panic, T. Basmer, O. Schrape, S. Peter, F. Vater, and K. Tittelbach-Helmrich, “Sensor node processor for security applications,” in Proceedings of 18th IEEE International Conference on Electronics, Circuits and Systems, ser. ICECS 2011, Beirut, Lebanon, December 2011, pp. 81–84.
[23] O. Stecklina, D. Genschow, and C. Goltz, “TandemStack - A Flexible and Customizable Sensor Node Platform for Low Power Applications,” in Proceedings of the 1st International Conference on Sensor Networks, ser. Sensornets 2012, Rome, Italy, February 2012.
[24] “Running Leon2 on the Altera Nios Development Board, Cyclone Edition.” [Online]. Available: http://www.mdforster.pwp.blueyonder.co.uk/LeonCyclone.html
[25] A. Gifford, “Implementations of SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512,” 2004. [Online]. Available: http://www.aarongifford.com/computers/sha.html