Hard and Soft Embedded FPGA Processor Systems Design: Design Considerations and Performance Comparisons
All content following this page was uploaded by Vincent Andrew Akpan on 01 November 2014.
ABSTRACT
This paper presents novel and efficient hardware/software co-design techniques for the development of high-performance embedded
processor systems targeting field programmable gate arrays (FPGAs). Some important and critical design considerations for
developing FPGA embedded processor systems are first presented. Next, the architectures of the IBM hard-core PowerPC™440 and the
Xilinx soft-core MicroBlaze™ processors are introduced together with comprehensive techniques for FPGA embedded processor
system design. Then, two embedded processor systems are designed, implemented on the Virtex-5 FX70T ML507 FPGA development
board, and tested, and their performance is evaluated on an industry-standard benchmark, DMIPs (Dhrystone million instructions
per second). The two embedded processor systems are based on: 1) the IBM PowerPC™440 hard processor core and 2) the Xilinx MicroBlaze™
soft processor core. Experimental results show that the IBM hard-core PowerPC™440 embedded processor system outperforms
the Xilinx soft-core MicroBlaze™ embedded processor system in terms of FPGA device consumption and maximum operating
frequency for the DMIPs benchmark implementation. The DMIPs benchmark performance results indicate that the embedded processor
systems are highly optimized and can be deployed for the development of real-time embedded processor systems. Finally, a brief
conclusion and some discussion of future directions are given.
Keywords: Embedded processor system design, Dhrystone benchmark, field programmable gate array (FPGA), embedded PowerPC™440
processor core, embedded MicroBlaze™ processor core, Virtex-5 FX70T ML507 FPGA.
Embedded systems are now widespread in domestic and industrial systems (appliances and applications). As system complexity increases with real-time constraints, embedded system design becomes more complex, and meeting the real-time constraints then depends on the computational power and efficiency of the embedded platform. An embedded system is basically software implemented on hardware in order to perform and realize specific real-time functionalities. Traditionally, embedded systems were designed and realized using off-the-shelf microprocessors, microcontrollers, digital signal processors (DSPs), and application specific integrated circuits (ASICs) [1]. However, in recent times, embedded system designs have been directed towards the use of field programmable gate arrays (FPGAs) [1]–[4].

Recently, investigations and surveys on the use of FPGAs in industrial control applications have been reported [3]–[5], where it has been proposed that FPGAs can be configured to solve computationally intensive tasks for real-time applications. For example, an FPGA-based framework for prototyping multi-core embedded architectures has been proposed in [5], although no embedded processor was designed or implemented. The comparison of embedded system design for industrial applications using microprocessors, microcontrollers, DSPs, ASICs, and FPGAs indicated that FPGAs are more suitable for such tasks and several references

The field programmable gate array (FPGA) is a general-purpose device populated with digital logic building blocks [3], [4], [6]. The most primitive FPGA building block is called a logic element (LE) by Altera [7] or a logic cell (LC) by Xilinx [8]. Altera and Xilinx are two world market leaders in the FPGA industry. Besides Altera and Xilinx, many other FPGA companies exist, but their products are not discussed here. In either case, the FPGA building block consists of a look-up table (LUT) for logical functions and flip-flops for storage. In addition to the LC/LE blocks, an FPGA also contains memory, clock management, input/output and multiplication blocks.

According to Jeff Bier [9]: “the next time you’re choosing an embedded processor, you should consider an FPGA”. Embedding a processor inside an FPGA has many advantages [Fletcher], together with several challenges in the design and implementation of the embedded processor [9]–[11]. A thorough literature survey shows that FPGAs are not widely used as embedded processors, whether for relatively simple or for complicated real-time embedded applications [2], [12]. The performance of an embedded processor in a product is often a key product differentiator [11].

An embedded system design is a complex task since it consists of software and hardware portions, and thus generating a design from a set of requirements and specifications becomes more difficult [2]. Another problem is that experimenting with mixed hardware/software solutions can be a time-consuming process due to the historical
ISSN: 2049-3444 © 2013 – IJET Publications UK. All rights reserved. 1000
International Journal of Engineering and Technology (IJET) – Volume 3 No. 11, November, 2013
disconnect between software development methods and the low-level methodologies required for hardware design and synthesis for implementation on an FPGA. For many FPGA applications, the complete hardware/software application is represented by a collection of software and hardware source files that are not easily compiled, simulated or debugged with a single tool set. In addition, because the hardware design process is relatively inefficient, hardware and software design cycles may be out of sync, requiring system interfaces, fundamental software/hardware partitioning decisions and algorithm designs to be prematurely locked down.

Furthermore, one of the most exciting developments in FPGAs in recent years is the emergence of hard and soft FPGA-embedded processors. These processors include Xilinx MicroBlaze™, IBM PowerPC™440, Altera® Nios™ II, and others. There are, however, challenges to using FPGAs as software platforms: software programmers may not have the skills or, indeed, the desire to make use of hardware design tools or hardware-oriented languages such as VHDL and Verilog. Software programmers using FPGAs may also be faced with design methods that are new and unfamiliar, including the need to efficiently partition applications between hardware and software. These same arguments may be true for most hardware designers who are not familiar with software programming or engineering.

To address the above issues, several design methodologies have emerged, notably embedded systems design tools from Altera [7] and Xilinx [8]; the choice of FPGA design tool poses additional challenges. In previous work, a critical overview of embedded system design technologies based on Xilinx system design tools has been reported [1]. In another study [2], embedded processor system design methodologies from a model-based design viewpoint have also been reported, where all of the Xilinx design tools were discussed and techniques on how these tools can be integrated to achieve efficient FPGA embedded system design were proposed.

In this paper, critical design considerations and techniques for embedded processor system design are presented. Then, two embedded processor systems are designed, tested and evaluated on an industry-standard benchmark. The first embedded processor is a hard-core IBM PowerPC™440 processor [13]–[15] and the second is a soft-core Xilinx MicroBlaze™ processor [16]. Performance evaluation and device utilization of the two embedded processor systems implemented on the Xilinx Virtex-5 FX70T ML507 FPGA development board are compared. Finally, a brief conclusion is given and some remarks are made with directions for future work.

II. OVERVIEW OF EMBEDDED PROCESSOR SYSTEMS AND DESIGN CONSIDERATIONS

A. Why Embedding a Processor Inside an FPGA

Embedding a processor inside an FPGA has many advantages. Specific peripherals can be chosen to improve performance based on the application, with unique user-defined peripherals being easily attached. Likewise, large banks of external memory can be connected to the FPGA and accessed by the embedded processor system using the included memory controllers. A variety of memory controllers enhance the FPGA embedded processor system’s interface capabilities. FPGA embedded processors use general-purpose FPGA logic to construct internal memory, processor buses, internal peripherals, and external peripheral controllers including external memory controllers. As more buses, memory controllers, peripherals and peripheral controllers are added to the embedded processor system, the system becomes increasingly powerful and useful.

However, it is worth noting that the addition of large banks of external memory may increase the latency to access this external memory and may have a negative impact on performance. In addition, adding many peripherals and memories as well as their respective controllers may reduce performance and increase the embedded system cost, since they consume FPGA resources.

FPGA manufacturers often publish embedded processor performance benchmarks. The manufacturers obviously know what must be done to get the best out of their FPGAs for each specific benchmark, and they take full advantage of every possible enhancement strategy when benchmarking. A clue to these strategies is that the FPGA embedded processor system constructed to run the benchmark has very few peripherals and runs exclusively from internal memory. However, no easy formula or chart exists that shows how to compare the performance and cost of different memory strategies and peripheral sets. The usual performance benchmark is the Dhrystone benchmark, which is used to evaluate the Dhrystone million instructions per second (DMIPs) performance measured in terms of the maximum FPGA operating frequency (fmax) in MHz [17], [18]. It is then left to the users of such FPGAs to achieve the maximum frequency and DMIPs set out by the manufacturers.

B. Some Advantages and Disadvantages of FPGA Embedded Processor Systems

Embedded systems are normally defined as software implemented in hardware in order to realize specified real-time functionalities. The normally used soft-core processing hardware includes microcontrollers, microprocessors, FPGAs, digital signal processors (DSPs), and application-specific integrated circuits (ASICs), each of which has its own properties. Although FPGA hardware technologies have attracted ever-increasing interest and have significantly disrupted embedded system design technologies, it is worth considering some advantages and disadvantages that may be derived or incurred by the use of FPGA embedded processor technologies.

Here, some advantages of an FPGA embedded processor system when compared to an off-the-shelf processor are summarized in the following:

1) Hardware Acceleration: The most compelling reason for an FPGA embedded processor is the ability to make trade-offs between hardware and software to maximize efficiency and performance. Suppose an algorithm is identified as
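
To make the Dhrystone metric discussed in Section II-A concrete, the following minimal sketch converts a raw Dhrystone score into DMIPS and into DMIPS per MHz. The divisor 1757 is the Dhrystones-per-second score of the VAX 11/780 reference machine conventionally used for this normalization. The function names and the sample score and clock are illustrative assumptions only; they are not measurements from this work.

```python
# Minimal sketch: normalizing a Dhrystone score (illustrative values only).

VAX_11_780_DHRYSTONES_PER_SEC = 1757.0  # conventional reference score for 1 DMIPS


def dmips(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone score (iterations per second) into DMIPS."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_second: float, clock_mhz: float) -> float:
    """Normalize DMIPS by the clock so designs running at different fmax can be compared."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return dmips(dhrystones_per_second) / clock_mhz


if __name__ == "__main__":
    # Hypothetical score and clock, chosen only to show the arithmetic.
    score = 175_700.0  # Dhrystones per second (assumed)
    clock = 100.0      # MHz (assumed)
    print(f"{dmips(score):.1f} DMIPS, {dmips_per_mhz(score, clock):.2f} DMIPS/MHz")
```

Reporting DMIPS/MHz alongside fmax separates the architectural efficiency of a core from the clock frequency it reaches on a given device.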
Fig. 1: The PowerPC™440 Core system on a chip with two-level bus structure and additional peripherals
The PPC440 Core, as a member of the PowerPC™ 400 Family, is supported by the IBM PowerPC™ Embedded Tools™ program, in which over 80 third party vendors have combined with IBM to provide a complete tools solution including Xilinx
[14]. Development tools for the PPC440 include C/C++ compilers, debuggers, bus functional models, hardware/software co-simulation environments, and real-time operating systems. As part of the tools program, IBM maintains a complete set of development tools by offering the High C/C++ Compiler, RISCWatch™ debugger with RISCTrace™ trace interface, VHDL and Verilog simulation models and a PPC440 Core Superstructure development kit [13]. The PPC440 CPU operates on instructions in a dual-issue, seven-stage pipeline, capable of dispatching two instructions per clock to multiple execution units and to optional Auxiliary Processor Units (APUs). The PPC440 core block diagram is shown in Fig. 2.

The PowerPC™440 embedded processor implements the full, 32-bit fixed-point subset of the IBM Book E: Enhanced PowerPC™ architecture. The PowerPC™440 embedded processor fully complies with this architectural specification. The 64-bit operations of the architecture are not supported, and the embedded processor does not implement the floating-point operations, although a floating-point unit (FPU) can be attached (using the APU interface). Within the embedded processor, the 64-bit operations and the floating-point operations are trapped, and the floating-point operations can be emulated using software.

The PowerPC™440 embedded processor implemented in Xilinx Virtex-5 devices and discussed in Xilinx’s documentation differs from the Book E architecture specification in the use of bit numbering for architected registers [13], [15]. Specifically, Book E defines the full, 64-bit instruction set architecture, where all registers have bit numbers from 0 to 63, with bit 63 being the least significant. This document describes the PowerPC 440 embedded processor, which is a 32-bit subset implementation of the architecture. Accordingly, all architected registers are 32 bits in length, with the bits numbered from 0 to 31, where bit 31 is the least significant. Therefore, references to register bit numbers from 0 to 31 in this document correspond to bits 32 to 63 of the same register in the Book E architecture specification ([IBM PEPC440, 2010]; [XEPB Virtex-5, 2010]).

2) Embedded Soft-Core MicroBlaze Processor

This sub-section gives a brief overview of the basic features and architecture of the Xilinx MicroBlaze™ embedded processor version 7.20, currently supported for Xilinx MicroBlaze™ embedded processor development within the Embedded Development Kit (EDK) 11.4 for the Xilinx Virtex-5 FX70T FPGA used in this work.

Like the IBM PowerPC™, the MicroBlaze™ soft core processor is a 32-bit reduced instruction set computer (RISC). The processor includes the Big-Endian bit reversed format, 32-bit general purpose registers, virtual-memory management, cache software support, and Fast Simplex Link (FSL) interfaces. The MicroBlaze core is organized as a Harvard architecture with separate bus interface units for data and instruction accesses. The following three memory interfaces are supported: Local Memory Bus (LMB), the IBM Processor Local Bus (PLB), and Xilinx® CacheLink (XCL). The LMB provides single-cycle access to on-chip dual-port block RAM. The PLB interfaces provide a connection to both on-chip and off-chip peripherals and memory. The CacheLink interface is intended for use with specialized external memory controllers. MicroBlaze also supports up to 16 Fast Simplex Link (FSL) ports, each with one master and one slave FSL interface. The architecture of the Xilinx MicroBlaze™ processor core, the core interfaces, buses, memory and peripherals are shown in Fig. 3 [16].

The acronyms of the core interfaces shown in Fig. 3 are defined as follows [XMBPRG, 2010]:

DPLB: Data interface, Processor Local Bus,
DLMB: Data interface, Local Memory Bus (BRAM only),
IPLB: Instruction interface, Processor Local Bus,
ILMB: Instruction interface, Local Memory Bus (BRAM only),
MFSL 0..15: FSL master interfaces,
DWFSL 0..15: FSL master direct connection interfaces,
SFSL 0..15: FSL slave interfaces,
DRFSL 0..15: FSL slave direct connection interfaces,
DXCL: Data side Xilinx CacheLink interface (FSL master/slave pair),
IXCL: Instruction side Xilinx CacheLink interface (FSL master/slave pair),
Core: Miscellaneous signals for clock, reset, debug, and trace.

The Xilinx MicroBlaze™ soft core processor is highly configurable and allows the selection of a specific or fixed set of features required by the design for embedded processor system development. The fixed features of the processor include: 1) thirty-two 32-bit general purpose registers, 2) a 32-bit instruction word with three operands and two addressing modes, and 3) a 32-bit address bus and a single-issue pipeline. In addition to these fixed features, the MicroBlaze™ processor is parameterized to allow selective enabling of additional functionality.

The MicroBlaze™ processor can be configured with the following bus interfaces: 1) a 32-bit version of the PLB V4.6 interface, 2) LMB, which provides a simple synchronous protocol for efficient block RAM transfers, 3) FSL, which provides a fast non-arbitrated streaming communication mechanism, 4) XCL, which provides a fast slave-side arbitrated streaming interface between caches and external memory controllers, 5) a debug interface for use with the Microprocessor Debug Module (MDM) core, and 6) a trace interface for performance analysis.

The processor local bus (PLB) interfaces are implemented as byte-enable capable 32-bit masters. The MicroBlaze™ on-chip peripheral bus (OPB) interfaces are implemented as byte-enable capable masters. The local memory bus (LMB) is a synchronous bus used primarily to access on-chip block RAM. It uses a minimum number of control signals and a simple protocol to ensure that local block RAM is accessed in a single clock cycle. All the LMB signals are usually active high.

As a note on the embedded MicroBlaze™ processor system clock and reset signals, the following should be taken into
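
The register bit-numbering convention described for the PowerPC™440 (bit 0 most significant, bit 31 least significant, with bit n of the 32-bit register corresponding to bit n + 32 in Book E) is easy to get wrong when writing register masks. A minimal sketch of the mapping is given below; the helper names are hypothetical and are introduced only for illustration.

```python
# Minimal sketch of the PowerPC bit numbering described in the text.
# Helper names are hypothetical; they do not come from any IBM or Xilinx library.


def ppc32_bit_mask(bit: int) -> int:
    """Mask for bit `bit` of a 32-bit register, where bit 0 is the MSB and bit 31 the LSB."""
    if not 0 <= bit <= 31:
        raise ValueError("32-bit register bits are numbered 0..31")
    return 1 << (31 - bit)


def ppc32_to_book_e_bit(bit: int) -> int:
    """Book E (64-bit) number of the same register bit: 0..31 maps to 32..63."""
    if not 0 <= bit <= 31:
        raise ValueError("32-bit register bits are numbered 0..31")
    return bit + 32


def book_e_bit_mask(bit: int) -> int:
    """Mask for bit `bit` of a 64-bit Book E register, where bit 63 is the LSB."""
    if not 0 <= bit <= 63:
        raise ValueError("Book E register bits are numbered 0..63")
    return 1 << (63 - bit)


if __name__ == "__main__":
    for n in (0, 5, 31):
        print(n, ppc32_to_book_e_bit(n), hex(ppc32_bit_mask(n)))
```

Because both conventions count from the most significant end, the 32-bit mask and the Book E mask of the corresponding bit select the same physical bit in the low word.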
optimization that performs jump and pop optimization.
Level 2: This is the second level of optimization and is designated as Medium (-O2). This level activates nearly all optimizations that do not involve a speed-space trade-off, and so the executables do not increase in size. The compiler does not perform loop unrolling, function in-lining or strict aliasing optimizations. This is the standard level that can be used for all program deployment.
Level 3: This level offers the highest level of optimization and is designated High (-O3). This level adds more expensive options, including those that increase code size. In some cases, this optimization level actually produces code that is less efficient than Level 2, and as such should be used with caution.
Size Optimized (-Os): This option produces the smallest possible code size.

Note in general, however, that when both an optimization level and the debug option are used, the information obtained from the optimization process may not correlate with the generated source code.

2) Memory Types

The FPGA embedded processor provides access to fast, local memory as well as an interface to slower, external memory. The way the memory is used has a significant effect on performance. However, the memory usage can be manipulated using the Linker Script.

a) Local Memory Only

The local memory provides the fastest option for accessing memory. Xilinx FPGA local memory is made up of large FPGA memory blocks called BlockRAM (BRAM). The embedded processor accesses BRAM in a single bus cycle. Since the processor and the bus run at the same frequency in MicroBlaze, instructions stored in BRAM are executed at the full MicroBlaze processor frequency. In the MicroBlaze processor system, BRAM is essentially equivalent in performance to a Level 1 (L1) cache. On the other hand, the PowerPC™ can run at frequencies greater than the bus and has true built-in L1 cache. Therefore, BRAM in a PowerPC™ processor system is equivalent in performance to a Level 2 (L2) cache. Thus, if the program for a particular embedded processor system design fits entirely within the local memory, then the design is likely to achieve optimal memory performance, although it is most likely that embedded programs will exceed the local memory capacity.

b) External Memory Only

Xilinx FPGAs provide several memory controllers that interface with a variety of external memory devices. These memory controllers are connected to the processor’s peripheral bus. The three types of volatile memory supported by Xilinx FPGAs are static RAM (SRAM), single-data-rate synchronous DRAM (SDRAM), and double-data-rate (DDR) SDRAM. The SRAM controller is the smallest and simplest inside the FPGA, while SRAM is the most expensive of the three memory types. The DDR SDRAM controller is the largest and most expensive inside the FPGA, but requires fewer FPGA input-output (I/O) ports and is the least expensive per megabyte.

In addition to the memory access time, the peripheral bus also incurs some latency. In MicroBlaze, for example, the memory controllers are attached to the on-chip peripheral bus (OPB). The OPB SDRAM controller requires about eight to ten cycles of latency for a read operation and four to six cycles of latency for a write operation, depending on the clock frequency. Thus, it is obvious that the worst possible program performance would be achieved by having the entire program reside in external memory. Since optimizing execution speed is a typical goal in embedded processor system design, an entire program should rarely be targeted solely at external memory.

c) Instruction and Data Cache Memory

The PowerPC™ in Xilinx FPGAs has instruction and data cache built directly into the silicon of the hard processor. Enabling this cache is almost always a performance advantage for the PowerPC™ [10]. On the other hand, the MicroBlaze™ cache architecture is not built in dedicated silicon; rather, the instruction and data cache controllers are selectable parameters in the MicroBlaze configuration. When these controllers are included, the cache memory is built from BRAM. Therefore, enabling the cache is likely to consume more BRAM than local memory for the same storage size because the cache architecture requires address line tag storage. Additionally, enabling the cache may also consume general-purpose FPGA logic to build the cache controllers. The consequence is that the achievable system frequency may be reduced when the cache is enabled, as more logic is added and the complexity of the design may increase during the FPGA place and route operation. Despite these consequences, enabling the MicroBlaze™ cache, especially the instruction cache, may improve performance, even when the system is likely to run at a lower frequency. Finally, enabling the cached memory is always worth an experiment to justify different trade-offs.

d) Combination of Local, External and Cache Memory

As discussed earlier, the memory architecture that provides the best performance is one that only has local memory. However, this architecture may not always be practical since many useful and efficient embedded programs exceed the available local memory capacity. On the other hand, running from external memory exclusively may incur a performance disadvantage of more than eight times due to the peripheral bus latency.

Caching the external memory is an excellent choice for embedded PowerPC™440 processor systems. For embedded MicroBlaze™ processor systems, perhaps the optimal memory configuration may be to wisely partition the program code, maximizing the system frequency and local memory size. Critical data, instructions and stack can also be placed in local memory. The data cache may be left unused so as to allow for a larger local memory bank. Suppose that the local memory is not large enough; then the instruction cache can be enabled for the address range in the external memory used for instructions. By not consuming BRAM in a data cache, the local memory can be increased to contain more space. An instruction cache for the
instructions assigned to external memory could be very effective. Alternatively, experimentation or profiling could show which code fragments are most heavily accessed; assigning these fragments to local memory could provide a greater performance improvement than caching.

1) Optimization Specific to an FPGA Embedded Processor

Since one of the objectives of the proposed embedded processor system design using the Xilinx Virtex-5 FX70T FPGA is to improve the performance of the hardware, additional techniques must be exploited to achieve this objective. Given the fact that the FPGA embedded processor resides next to additional FPGA hardware resources, one technique is to consider a custom co-processor designed specifically to target the implementation of a core algorithm in the design.

a) Logic Optimization and Reduction

The key point here is that only peripherals and buses that are necessary and required should be connected. Suppose that the intended design will not store and run any instructions using external memory; then connecting the instruction side of the peripheral bus is not necessary. Connecting both the instruction and data sides of the processor to a single bus may create a multi-master system which requires an arbiter. Optimal bus performance is achieved when a single master resides on the bus.

Furthermore, debug logic requires resources in the FPGA and may be the hardware bottleneck. When the design is completely debugged, the debug logic can be removed from the final system, which will potentially increase the system’s performance. For example, in an embedded MicroBlaze™ processor system with the cache enabled, the debug logic will typically be the critical path that will slow down the entire design [10].

b) Area and Timing Constraints

Xilinx FPGA place and route tools as well as the Xilinx PlanAhead™ tool perform much better when the design objectives are well specified. In these Xilinx tools, the desired clock frequency, pin location, and logic element location can be specified. By providing these details, the design tools are able to make efficient, optimized and smarter trade-offs during hardware design implementation. Therefore, a careful study of the datasheets for each peripheral together with the design guidelines goes a long way in this regard and is a necessity.

c) Hardware Acceleration

Dedicated hardware outperforms software at the expense of FPGA resources for dramatic performance improvements. Therefore, the FPGA’s ability to accelerate the processor performance with dedicated hardware should be considered. Provided the hardware divider and the hardware barrel-shifter are enabled, the embedded MicroBlaze™ processor can be customized to use a hardware divider and a hardware barrel-shifter rather than performing these functions in software. Although enabling these processor capabilities may consume FPGA resources, the performance improvements can be extraordinary.

d) Co-Processing Hardware

Custom hardware logic can be designed to offload an FPGA embedded processor. For example, a software bottleneck identified in an algorithm can be converted into custom hardware. Then, custom software instructions can be defined to operate the hardware co-processor.

Any operation that is algorithmic, mathematical, or parallel is a good candidate for a hardware co-processor, which is the subject of the proposed embedded processor system design in this work. FPGA logic is traded for performance, but the advantages can be enormous and performance can be improved significantly.

III. THE EMBEDDED POWERPC™440 PROCESSOR SYSTEM DEVELOPMENT USING XILINX INTEGRATED SOFTWARE ENVIRONMENT (ISE) AND XILINX PLATFORM STUDIO

The embedded PowerPC™440 processor design considered here follows closely from the design considerations outlined and discussed in Section II. The embedded processor system designs using the IBM PowerPC™440 hard processor core are each instantiated from the Xilinx ISE, which then initializes the XPS where the actual processor system designs are done. The Xilinx ISE is started and the project name is assigned in the “New Project Wizard”. The name assigned here for the PowerPC™440 processor system is “emb_ppc440_processor”. The FPGA device family Virtex-5 XC5VFX70T is selected; the speed grade for this device family, based on our available Virtex-5 FX70T ML507 FPGA board, is –2 and is thus specified, as well as the device package FF1136. The Xilinx synthesis tool (XST) is selected as the synthesis tool to be used in synthesizing the design. The Xilinx ModelSim-SE is selected as the simulation tool. The language for the embedded processor system development is VHDL (very-high-speed integrated circuit hardware description language). In addition to these selections, the Embedded Processor is also added as a “New Source” in this project wizard. The “emb_ppc440_processor” project summary is shown in Fig. 4(a).

When the “New Project Wizard” is completed, the ISE initializes and automatically starts up the Xilinx Platform Studio (XPS) since “Embedded Processor” was added as a “New Source”. The XPS in turn initializes and brings up the Base System Builder (BSB), which is an automated tool that can be used to create an embedded processor system. The embedded processor system design using the BSB is an eight-stage procedure, namely: Welcome, Board, System, Processor, Peripheral, Cache, Application, and Summary.
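
When deciding what to enable at the BSB “Cache” stage, the memory trade-offs discussed in Section II can be estimated to first order with a weighted-average access-time model. In the sketch below, the single-cycle BRAM access and the eight-to-ten-cycle OPB SDRAM read latency are taken from the text. The 90% hit rate, the one-cycle cost of a cache hit, and the function name are assumptions made purely for illustration.

```python
# First-order sketch of average read latency for the memory options in Section II.
# Cycle counts for BRAM and OPB SDRAM come from the text; the hit rate is assumed.

BRAM_READ_CYCLES = 1.0       # local memory: single-cycle access
OPB_SDRAM_READ_CYCLES = 9.0  # midpoint of the quoted 8 to 10 cycle read latency


def average_read_cycles(hit_rate: float, hit_cycles: float, miss_cycles: float) -> float:
    """Average cycles per read when a fraction `hit_rate` is served locally."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    local_only = average_read_cycles(1.0, BRAM_READ_CYCLES, OPB_SDRAM_READ_CYCLES)
    external_only = average_read_cycles(0.0, BRAM_READ_CYCLES, OPB_SDRAM_READ_CYCLES)
    cached = average_read_cycles(0.9, BRAM_READ_CYCLES, OPB_SDRAM_READ_CYCLES)  # assumed hit rate
    print(f"local only:    {local_only:.1f} cycles/read")
    print(f"external only: {external_only:.1f} cycles/read")
    print(f"cached (90%):  {cached:.1f} cycles/read")
```

The model ignores the lower system frequency that enabling a MicroBlaze™ cache may cause, so it should be read as an optimistic bound rather than a prediction.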
The “Welcome” stage allows new processor(s) to be designed or an existing pre-designed processor system to be loaded, as shown in Fig. 4(b). The “Board” stage allows the FPGA device family and package to be specified, if different from that specified in the “New Project Wizard”. This is sometimes useful if a custom FPGA board, different from the pre-configured Xilinx FPGA development boards, is used. It is also useful if the processor design was not initialized and started using the Xilinx ISE. The advantages of initializing and starting an embedded processor system design from the ISE are many, as discussed in Appendix A.

The “System” stage shown in Fig. 4(c) allows a single- or dual-processor system to be specified and designed. The Virtex-5 XC5VFX70T device family currently supports single-processor system design. Thus, a single-processor system is the target in this work. Then, in the “Processor” stage, the choice of a PowerPC™440 or a MicroBlaze™ processor is available. In this sub-section, a PowerPC™440 is selected as the intended processor, as shown in Fig. 4(d), whereas in the next sub-section the MicroBlaze™ processor will be selected.
(c) Base System Builder: “System” (d) Base System Builder: “Processor”
Fig. 4: The Xilinx ISE “New Project Summary” and the BSB Welcome, System, and Processor design stages for the embedded
Fig. 5: The BSB: the Peripheral and Summary design stages for
the embedded PowerPC™440 processor system.
ii) Next, the Netlist is generated by selecting: “Hardware → Generate Netlist”. This stage of the design also generates all the “wrappers”, device drivers, and all the necessary design and technology files that would be required by the ISE for complete synthesis and implementation of the embedded processor system.

iii) After the Netlist generation, attention is turned to the Xilinx ISE™. A section of the Xilinx ISE™ graphical user interface (GUI) for the PowerPC™440 embedded processor system is shown in Fig. 7.

iv) Next, the programming file (bitstream) for the complete embedded PowerPC™440 processor system is generated by double-clicking the blue-highlighted “Generate Programming File” shown in Fig. 7. This is the ISE implementation phase of the design, which is detailed in [2], [12]. The various stages of this ISE implementation are briefly described in the following. As can be seen in Fig. 7, the ISE has seven major phases, namely:
Fig. 6: The XPS graphical user interface (GUI) for the creation and initial compilation of the embedded processor system
Step 1) User Constraints,
Step 2) Synthesize – XST (Xilinx Synthesis Tool),
Step 3) Implement Design,
Step 4) Generate Programming File,
Step 5) Configure Target Device,
Step 6) Update Bitstream with Processor Data, and
Step 7) Analyze Design Using ChipScope.

Double-clicking “Generate Programming File” implements Steps 2), 3) and 4) to generate this file. Note that the XPS generated the user constraints file (UCF), which takes care of Step 1). Otherwise, the UCF would have been created here in Step 1) using the Xilinx PlanAhead. Because the design is not yet ready for the target Virtex-5 FX70T FPGA, Steps 5), 6), and 7) are not implemented here. The generation of the bitstream completed without errors but with some warnings.

v) Note that the embedded processor design is coordinated by both the Xilinx ISE™ and the XPS. It is observed that, immediately after the generation of the Programming File (bitstream), the Xilinx ISE™ indicates that the project design is out of date while the XPS indicates that the project file has changed on disk, each on its respective GUI. Therefore, Step 1) to Step 4) are repeated to update the system, after which both notifications disappear.

In addition to the Programming File, an important file called the “Block Memory Map (BMM)” file, with extension .bmm, is also generated. For the current PowerPC™440 project, this file is edkBmmFile_bd.bmm. The BMM file is a text file that has syntactic descriptions of how individual block RAMs constitute a contiguous logical data space. The Xilinx Data2MEM [21] uses BMM files to direct the translation of data into the proper initialization form. Note that since a BMM file is a text file, it is directly editable. This file, together with the bitstream and all the generated device drivers, will be required to program the Virtex-5 during the software design portion of the embedded processor system. The BMM file is located in the top-level directory of the processor system together with the bitstream (with the extension .BIT).

vi) Since the embedded processor project is now fully updated by both Xilinx ISE™ and XPS, attention is again turned to the XPS shown in Fig. 6 to perform the following:

1) Generate the block diagram of the complete system by selecting from the XPS GUI of Fig. 6: Project → Generate Block Diagram Image. The result is shown in Fig. 8.

2) Generate the complete design report by selecting from the XPS GUI of Fig. 6: Project → Generate and View Design Report. This report gives detailed information on the embedded processor system but is not shown in this work since it is more than 200 pages long. It is useful as a reference for accessing the different peripherals, memory types, and memory and peripheral drivers, especially when modifying, addressing and integrating custom hardware is necessary.

3) Generate and export the designed embedded processor hardware to the Xilinx software development kit (Xilinx SDK) by selecting from the XPS GUI in Fig. 6: Project → Export Hardware Design to SDK.
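The idea that a block memory map describes how individual block RAMs form one contiguous logical data space can be illustrated with a small model. The following Python sketch is only an illustration under stated assumptions: it does not use BMM syntax and is not the Data2MEM algorithm, and the names BramSlice, BusBlock and locate, the instance names, and the 8-bit by 2048-word BRAM geometry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BramSlice:
    """One physical block RAM holding a bit-slice of every word in a bus block."""
    name: str  # hypothetical instance name
    msb: int   # most significant data-bus bit stored in this BRAM
    lsb: int   # least significant data-bus bit stored in this BRAM


@dataclass(frozen=True)
class BusBlock:
    """BRAMs read in parallel; together they cover the full data-bus width."""
    slices: tuple
    depth_words: int  # number of bus-wide words held by the block


def locate(bus_blocks, base_addr, byte_addr, bus_bytes=4):
    """Map a byte address to (bus block index, word offset, BRAM names)."""
    if byte_addr < base_addr:
        raise ValueError("address below the mapped space")
    word = (byte_addr - base_addr) // bus_bytes
    for index, block in enumerate(bus_blocks):
        if word < block.depth_words:
            return index, word, [s.name for s in block.slices]
        word -= block.depth_words  # skip past this block and try the next one
    raise ValueError("address above the mapped space")


def make_block(block_id, depth_words=2048):
    """Assumed geometry: four BRAMs, each storing one byte lane of a 32-bit word."""
    slices = tuple(
        BramSlice(f"bram_{block_id}_{lane}", 31 - 8 * lane, 24 - 8 * lane)
        for lane in range(4)
    )
    return BusBlock(slices, depth_words)


# Two bus blocks of 2048 words each give a 16-KB contiguous space.
space = [make_block(0), make_block(1)]
print(locate(space, 0x0000, 0x2004))
```

In this model, consecutive bus blocks extend the address range while the BRAMs inside one bus block split each word by bit lanes, which is the kind of relationship a tool needs in order to place program data into the right BRAM initialization contents.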
Fig. 7: A section of the Xilinx ISE™ graphical user interface from where the PowerPC™440 embedded processor system design is instantiated.
Fig. 8: The block diagram of the PowerPC™440 embedded processor system with associated memory types, peripherals, clock generator, buses, hardware
and software specifications and key/symbols
Although the Export dialog box offers two options for exporting the designed hardware, Export Only and Export and Launch SDK, “Export Only” is selected since the designed hardware will be exported in the next two sub-sections for memory and peripheral testing as well as for the Dhrystone benchmark performance comparison of the designed PowerPC™440 processor system with the Xilinx MicroBlaze™ embedded processor. However, this export process automatically creates an SDK directory in the current design hierarchy and places the hardware structure of the designed PowerPC™440 processor system (emb_ppc440_processor.xml) as an XML document in the created SDK directory.

IV. THE EMBEDDED MICROBLAZE™ PROCESSOR SYSTEM DEVELOPMENT USING XILINX INTEGRATED SOFTWARE ENVIRONMENT (ISE) AND XILINX PLATFORM STUDIO

The procedure for creating the embedded MicroBlaze™ processor system is essentially the same as that for the embedded PowerPC™440 system using the Base System Builder (BSB). However, some differences exist in the architectural design of the embedded MicroBlaze™ processor system when compared to the embedded PowerPC™440 processor system. Here, the name assigned to the embedded MicroBlaze™ processor system project is “emb_mb_processor”. At the “Processor” stage of the Base System Builder (BSB), “MicroBlaze” is selected as the option for “Processor Type”, as in the case of Fig. 4(d).

As discussed in Section II-(E), the choices and configurations of different memory types and peripherals influence the performance of embedded processors, especially for the MicroBlaze™ processor where the FPGA fabric is used to implement the logic circuits and drivers. Thus, for the “Peripheral” selection stage, data-side and instruction-side local memory types and controllers are selected. These two were in-built within the PowerPC™440 core. Similar to the PowerPC™ processor, the DDR2 SDRAM, the SRAM and the UART are included in the MicroBlaze™ processor system. These peripherals together with their address ranges are shown in the design summary of Fig. 9(a). Unlike in the PowerPC™440, where the instruction and data caches are in-built and fixed at 32-KB with three memory options (SRAM, DDR2 SDRAM and BRAM) available for enabling the cache memory type, only the first memory type options are available for enabling the MicroBlaze™ processor memory cache. While the instruction and data memory cache size in the PowerPC™440 core is fixed at 32-KB, that in the MicroBlaze™ processor core can be specified. Noting that the amount of FPGA fabric required to implement the memory and the memory address decoders varies with the specified memory size, the instruction and data caches for the MicroBlaze™ processor system are enabled with each allocated 32-KB SRAM from the default 8-KB, as shown in Fig. 9(b). In the MicroBlaze™ processor system, small cache sizes are implemented with FPGA look-up tables (LUTs) while large cache sizes are implemented using block RAMs (BRAMs). As mentioned earlier in Section III, these caches are optional and can also be configured during software development for the embedded processor system, as shown and discussed later in Section V. The design summary of the MicroBlaze™ embedded processor system created using the Base System Builder (BSB) is shown in Fig. 9(a) and lists the major software associated with the processor system under Overall in the File Location category. Like the “Application” stage in the PowerPC™440 processor system, the components associated with the “Application” stage are also listed under the “System Summary” for the created MicroBlaze™ embedded processor system.

The software associated with the just created MicroBlaze™ embedded processor system is then compiled so that all the memory types, peripherals, memory and peripheral driver software as well as the entire embedded processor system are updated. The compilation procedures are similar to those described for the PowerPC™440 embedded processor system, where the Xilinx ISE and the XPS are used interchangeably to perform these compilations.

(a) Base System Builder: “Summary”
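Because large MicroBlaze™ caches are built from block RAMs, the chosen cache size translates into a block RAM cost. The Python sketch below gives a rough lower bound for the cache data array only. It assumes a Virtex-5 36-Kb block RAM used without its parity bits (32 Kb of data per block), ignores tag storage and any packing decisions made by the tools, and uses the hypothetical helper name min_brams_for_cache, so the ISE device utilization report remains the authoritative figure.

```python
import math

# Assumption: each 36-Kb block RAM contributes 32 Kb of data (parity bits unused).
RAMB36_DATA_BITS = 32 * 1024


def min_brams_for_cache(cache_bytes):
    """Lower bound on block RAMs needed for the cache data array alone."""
    if cache_bytes <= 0:
        raise ValueError("cache size must be positive")
    return math.ceil(cache_bytes * 8 / RAMB36_DATA_BITS)


# Compare the default 8-KB cache with the 32-KB setting used in this design.
for size_kb in (8, 32):
    blocks = min_brams_for_cache(size_kb * 1024)
    print(f"{size_kb}-KB cache: at least {blocks} block RAMs for data")
```

Since the instruction and data caches are sized separately, the estimate applies to each cache in turn.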
Fig. 10: The block diagram of the MicroBlaze™ embedded processor system with associated memory types, peripherals, clock generator,
buses, hardware and software specifications and key/symbols.
As in the previous Section III, the wrappers and hardware drivers, libraries and the board support packages (BSPs) as well as the Netlist are generated using the XPS via its GUI, while the synthesis, programming file (bitstream), block memory map (.BMM) file, all other implementation files and the device utilization summary are generated using the Xilinx ISE™ software via its GUI. Next, the XPS via its GUI is used to create the SDK directory in the top-level hierarchy of the MicroBlaze™ processor project directory, and the hardware description text file that describes the MicroBlaze™ embedded processor system is exported to this SDK directory. Finally, the block diagram image and the XPS synthesis summary are generated using the XPS via its GUI. The MicroBlaze™ embedded processor system created is shown in Fig. 10.
Fig. 11: Xilinx software development kit graphical user interface for software development and programming the Virtex-5 ML507 FPGA
using the “Debug on Hardware” option.
In order to test the peripherals, another new “Manage Make C Application Project” is created using the same procedures as for the Memory Test. The “Peripheral Tests” application uses the “TestApp_Peripheral.c” shown in Fig. 9(a). The same procedures as in the memory test case are followed to build, compile and test the embedded MicroBlaze™ processor peripherals. Running the Peripheral Test application on the Virtex-5 ML507 FPGA produces the result shown on the HyperTerminal of Fig. 12(b). The memory and peripheral tests performed for the MicroBlaze™ embedded processor system are repeated for the PowerPC™440 embedded processor system; results similar to Fig. 12(a) and (b) were obtained but are not shown here for space economy.

These test results indicate that the memories and peripherals of the embedded processor systems are fully functional and well configured, which implies that the embedded processor systems could be deployed for the development of embedded system applications based on the selected devices.

(b) Peripheral test
Fig. 12: The MicroBlaze™ processor: (a) memory and (b) peripheral test results on the HyperTerminal window.

VI. INDUSTRY-STANDARD DHRYSTONE BENCHMARK PERFORMANCE EVALUATION ON THE DESIGNED EMBEDDED PROCESSOR SYSTEMS

The Dhrystone is a benchmark test program used to evaluate the performance of an embedded processor system. Its result is compared to that of the manufacturer to measure how well the memory types, peripherals and optimization techniques have been employed to create the embedded processor system for enhanced performance. As mentioned in Section II-(A), the performance for the Dhrystone benchmark evaluation is usually measured in terms of the maximum FPGA operating frequency (fmax) and the Dhrystone million instructions per second (DMIPs) [10], [17], [18]. Unfortunately, only the Dhrystone benchmark program for evaluating the embedded MicroBlaze™ processor system is available here for evaluation. However, since essentially the same memory types, peripherals and their respective controllers are used in both systems, the results for the benchmarking of the MicroBlaze™ processor system could be used to judge the PowerPC™440 processor system, noting that the PowerPC™ is known for higher speed performance, running at a maximum frequency of 550 MHz and 1,100 DMIPs when compared to the MicroBlaze™ at 210 MHz and 240 DMIPs, as discussed in Section II-(D) [17], [18].

Since the copied project has changed, the Xilinx ISE™ project has also changed and it shows as out of date. Thus, the complete MicroBlaze™ embedded processor project is again fully recompiled using both the XPS and the Xilinx ISE™ software according to the 6-step procedures summarized in Section III. New board support packages (BSPs), Netlist, programming file (emb_mb_processor.bit), block memory map
Fig. 13: The XPS for creating, compiling and initializing the Dhrystone benchmark program to load from on-board BRAM for benchmark
performance evaluation of MicroBlaze™ embedded processor on Virtex-5 ML507 FPGA.
The Xilinx SDK is again opened. A new software platform called “Dhrystone_Test” is created. A new “Manage Make C Application Project” is also created. This time, the just created and compiled “Dhrystone_Test” software application is selected. Next, the Virtex-5 ML507 is programmed and the Dhrystone application is executed. The maximum operating frequency obtained is 188.2 MHz against the 210 MHz specified by Xilinx, and 204.7 DMIPs against the 240 DMIPs specified by Xilinx for the MicroBlaze™ processor [16], [18]. Dividing the measured DMIPs by the maximum operating frequency specified by Xilinx for the Virtex-5 ML507 FPGA (210 MHz) gives 0.9748, which implies that the designed MicroBlaze™ embedded processor system is highly optimized for embedded applications. Note that the embedded programs are initialized and implemented via the BRAM due to their small size, but the result may be different when the embedded programs are larger than the on-board BRAMs.

Following the same procedures as for the embedded MicroBlaze™ processor system, the Dhrystone benchmark program was implemented and evaluated on the embedded PowerPC™440 processor system. The maximum operating frequency obtained is 495.8 MHz against the 550 MHz specified by Xilinx, and 1001.6 DMIPs against the 1100 DMIPs specified by Xilinx [16]. Dividing the measured DMIPs by the maximum operating frequency specified by Xilinx for the Virtex-5 ML507 FPGA (550 MHz) gives 1.8211, which implies that the designed PowerPC™440 embedded processor system is highly optimized for embedded applications.
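For readers unfamiliar with the unit, a DMIPs figure is conventionally obtained by dividing the measured Dhrystone loops per second by 1757, the score of the VAX 11/780 reference machine. The Python sketch below shows that conversion and its inverse. The function names are hypothetical, the paper does not report raw Dhrystone loop counts, and the "implied" rates printed here are simply back-calculated from the DMIPs values quoted above.

```python
# Dhrystone loops per second of the VAX 11/780, which defines 1 DMIPS.
VAX_11_780_DHRYSTONES_PER_SECOND = 1757


def dmips_from_run(dhrystone_loops, elapsed_seconds):
    """Convert a timed Dhrystone run into a DMIPS score."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    loops_per_second = dhrystone_loops / elapsed_seconds
    return loops_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


def implied_dhrystones_per_second(dmips):
    """Inverse conversion: the loop rate implied by a reported DMIPS score."""
    return dmips * VAX_11_780_DHRYSTONES_PER_SECOND


# DMIPs values reported in the text for the two designed systems.
for name, score in (("MicroBlaze", 204.7), ("PowerPC440", 1001.6)):
    rate = implied_dhrystones_per_second(score)
    print(f"{name}: {score} DMIPs implies about {rate:,.0f} Dhrystones/s")
```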
TABLE I: SUMMARY OF THE DHRYSTONE BENCHMARK PERFORMANCE EVALUATION FOR THE EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS.

                                   Maximum Frequency    Dhrystone Million Instructions    DMIPs/freqmax
                                   (freqmax), MHz       Per Second (DMIPs)
Embedded MicroBlaze™ (Xilinx)      210                  240                               1.1429
Embedded MicroBlaze™ (Designed)    188.2                204.7                             0.9748
Embedded PowerPC™ (Xilinx)         550                  1100                              2.0000
Embedded PowerPC™ (Designed)       495.8                1001.6                            1.8211

A summary of the DMIPs benchmark performance results for both embedded processor systems is given in TABLE I. For the designed systems, the DMIPs/freqmax entries divide the measured DMIPs by the maximum frequency specified by Xilinx. From TABLE I, the DMIPs/freqmax of 1.1429 from Xilinx’s implementation [18], compared with the 0.9748 obtained by the designed embedded MicroBlaze™ processor system, shows that the latter is 14.71% lower than the former. Also, comparing the 2.0000 from Xilinx [17] with the 1.8211 obtained by the designed embedded PowerPC™440 indicates that the latter is 8.95% lower than the former. Despite the lower DMIPs/freqmax values obtained using the designed embedded processor systems, it is evident that the embedded processor systems show good computational efficiencies and that the devices (memories and peripherals) that constitute the embedded processor systems design are well selected in line with the design considerations proposed earlier.
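The arithmetic behind the DMIPs/freqmax column of TABLE I and the quoted percentage differences can be reproduced with a few lines of Python. This is a minimal sketch using only the figures reported in the table; the dictionary and function names are hypothetical, and, following the text, the designed-system ratios divide the measured DMIPs by the Xilinx-specified maximum frequency.

```python
# (maximum frequency in MHz, DMIPs) as reported in TABLE I.
XILINX = {"MicroBlaze": (210.0, 240.0), "PowerPC440": (550.0, 1100.0)}
DESIGNED = {"MicroBlaze": (188.2, 204.7), "PowerPC440": (495.8, 1001.6)}


def dmips_per_mhz(dmips, reference_fmax_mhz):
    """Normalize a DMIPs score by a reference clock frequency."""
    return dmips / reference_fmax_mhz


def shortfall_percent(reference, measured):
    """How far the measured value falls below the reference, in percent."""
    return 100.0 * (reference - measured) / reference


def table_rows():
    """Return {processor: (Xilinx ratio, designed ratio, shortfall %)}."""
    rows = {}
    for name, (ref_fmax, ref_dmips) in XILINX.items():
        _, measured_dmips = DESIGNED[name]
        ref_ratio = dmips_per_mhz(ref_dmips, ref_fmax)
        # The paper's convention: measured DMIPs over the Xilinx-specified fmax.
        designed_ratio = dmips_per_mhz(measured_dmips, ref_fmax)
        rows[name] = (ref_ratio, designed_ratio,
                      shortfall_percent(ref_ratio, designed_ratio))
    return rows


for name, (ref, designed, gap) in table_rows().items():
    print(f"{name}: Xilinx {ref:.4f}, designed {designed:.4f}, {gap:.2f}% lower")
```

Because both ratios for a processor share the same denominator, the percentage shortfall equals the shortfall in DMIPs alone. Normalizing by the measured frequency instead would give a different figure, so the sketch keeps to the convention used in the table.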
TABLE II: THE XILINX PLATFORM STUDIO (XPS) EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS SYNTHESIS
SUMMARY.
VII. COMPARISON OF DEVICE UTILIZATION CONSUMED BY THE DESIGNED EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS

In this section, the Xilinx Platform Studio (XPS) synthesis and Xilinx ISE™ device utilization reports generated by the XPS and Xilinx ISE™ are summarized and are used to deduce and compare the FPGA hardware resources consumed in creating the PowerPC™440 and the MicroBlaze™ embedded processor systems. The XPS synthesis report summary is shown in TABLE II whereas the Xilinx ISE™ device utilization summary is shown in TABLE III.

From the XPS synthesis results of TABLE II, it is obvious that the MicroBlaze™ consumes more FPGA hardware resources when compared to the embedded PowerPC™440 processor system. For example, the PowerPC™440 used only 2 flip flops to implement the ppc440_0_wrapper, whereas the MicroBlaze™ used 1,375 to implement the microblaze_0_wrapper, which increases hardware cost. Also, the DDR2 SDRAM (ddr2_sdram_wrapper) implementation for the PowerPC™440 processor system consumes 2,355 flip flops against the 3,458 flip flops required by the MicroBlaze™ processor system, which invariably increases hardware cost. Whereas the debug module is implemented in the silicon of the PowerPC™440 hard processor core, a comparatively small 119 flip flops are required to realize the logic operation in the MicroBlaze™ processor system. On the other hand, the PowerPC™440 utilized 255 and 138 flip flops to implement the xps_bram_if_cntlr_1_bram_wrapper and the plb_v46_0_wrapper respectively, as against the 150 flip flops required by the MicroBlaze™ processor system to implement the mb_plb_wrapper. On average, all other hardware consumptions by both embedded processor systems are comparable, as can be observed in TABLE II.

The Xilinx ISE™ device utilization report summary of TABLE III shows that the main processing engine of the MicroBlaze™ processor system may have been built from three high-performance DSP48E multipliers with a significant 6,740 look-up table (LUT) flip flop pairs. Also, the number of slices occupied by the MicroBlaze™ processor system outweighs that occupied by the PowerPC™ processor system by 9%. Furthermore, the number of slice registers and LUTs used in the embedded MicroBlaze™ processor system design is in excess of 6% and 3% when compared to that used in the PowerPC™440 processor system design. It can be observed that the embedded PowerPC™440 processor design required an additional 22 flip flops for routing and an additional 2% excess flip flops for building the memory.

VIII. CONCLUSION AND DISCUSSIONS

The importance of embedded processors in FPGA embedded systems has been examined and discussed. For completeness, the advantages and disadvantages of FPGA embedded processor systems when compared to off-the-shelf microprocessors, microcontrollers, digital signal processors (DSPs) and application specific integrated circuits (ASICs) have also been critically examined and compared. The challenges and drawbacks of designing FPGA embedded processor systems from hardware and software viewpoints have been highlighted and discussed. Substantive discussions on the IBM PowerPC™440 hard processor and the Xilinx MicroBlaze™ soft processor core have been presented, and references to detailed discussions on these two processor types have also been given.

Important and critical design considerations as well as comprehensive hardware/software co-design techniques for FPGA embedded processor systems design have been presented and used to design two efficient embedded processor systems, namely: an embedded hard-core PowerPC™440 and a soft-core MicroBlaze™ processor system. Both embedded processor systems have been implemented and tested on a Xilinx Virtex-5 FX70T ML507 FPGA development board. The evaluation of the DMIPs (Dhrystone million instructions per second) on the two designed embedded processor systems showed that the designed embedded PowerPC™440 processor system is 8.95% lower than the result reported by Xilinx, whereas the designed embedded MicroBlaze™ processor system is 14.71% lower than that reported by Xilinx for the DMIPs benchmark test. Furthermore, the embedded PowerPC™440 processor system consumed fewer FPGA resources when compared to the embedded MicroBlaze™ processor system.

Based on the DMIPs performance results, the embedded PowerPC™440 processor system appears to be a suitable choice for implementing real-time FPGA embedded processor systems for time-critical applications due to its operating frequency. However, work has already started on FPGA development and implementation of adaptive neural network identification algorithms. Due to the fact that only the basic intellectual property (IP) cores were used for both embedded processor systems design (see Fig. 5 and Fig. 9), 1) FPGA implementation of adaptive neural network identification algorithms and 2) a complete FPGA-in-the-loop implementation of a computationally intensive adaptive model predictive control algorithm on both embedded processor systems as co-processors are currently being exploited for further performance verifications.

REFERENCES

[1] V. A. Akpan (Nov., 2009), “FPGA Embedded Systems Design Technologies: with an Overview of Xilinx Systems Design Tools”, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece, pp. 1 – 31. [Online] Available: http://users.auth.gr/~iosamar/technicalreports.htm.

[2] V. A. Akpan, “Model-based FPGA embedded-processor systems design methodologies: Modeling, syntheses, implementation and validation”, African Journal of
[9] J. Bier, “Give FPGAs embedded nod”, Embedded Systems Conference, Silicon Valley, USA, April, 2006, pp. 1 – 2.

[10] B. H. Fletcher, “FPGA embedded processors: Revealing true system performance”, Embedded Systems Conference, San Francisco, 2005, pp. 1 – 18.

[20] S. Guccione, “List of FPGA-based computing machines”, [Online] Available: http://www.io.com/~guccione/HW-list.html.

[21] Xilinx Inc., Data2MEM: User Guide, UG658, Version 1.0, April 27, 2009, pp. 1 – 44.