
FPGA Embedded Processors
Revealing True System Performance

Bryan H. Fletcher
Technical Program Manager
Memec
San Diego, California
www.memec.com
[email protected]

Embedded Training Program


Embedded Systems Conference San Francisco 2005
ETP-367

1. Abstract
Embedding a processor inside an FPGA has many advantages. Specific peripherals can be chosen based
on the application, with unique user-designed peripherals being easily attached. A variety of memory
controllers enhance the FPGA embedded processor system’s interface capabilities.

FPGA embedded processors use general-purpose FPGA logic to construct internal memory, processor
busses, internal peripherals, and external peripheral controllers (including external memory controllers).
Soft processors are built from general-purpose FPGA logic as well.

As more pieces (busses, memory, memory controllers, peripherals, and peripheral controllers) are added to
the embedded processor system, the system becomes increasingly powerful and useful. However, these
additions consume FPGA resources, which reduces performance and increases the cost of the embedded system.

Likewise, large banks of external memory can be connected to the FPGA and accessed by the embedded
processor system using included memory controllers. Unfortunately, the latency to access this external
memory can have a significant, negative impact on performance.

FPGA manufacturers often publish embedded processor performance benchmarks. Like all other
companies running benchmarks, these FPGA manufacturers purposely construct and use an FPGA
embedded processor system that performs the best for each specific benchmark. This means that the FPGA
processor system constructed to run the benchmark has very few peripherals and runs exclusively using
internal memory. The embedded designer must understand how a real-world design’s specific peripheral
set and memory architecture will affect the system performance. However, no easy formula or chart exists
showing how to compare the performance and cost for different memory strategies and peripheral sets.

This paper presents several case studies that examine the effects of various embedded processor memory
strategies and peripheral sets. Comparing the benchmark system to a real-world system, the study
examines techniques for optimizing the performance and cost in an FPGA embedded processor system.

2. FPGA Embedded Processors


The Field Programmable Gate Array (FPGA) is a general-purpose device filled with digital logic building
blocks. The two market leaders in the FPGA industry, Altera and Xilinx, are the focus of this study. Many
other programmable logic companies exist, although their products are not discussed in this paper.

The most primitive FPGA building block is called either a Logic Cell (LC) by Xilinx or a Logic Element
(LE) by Altera. In either case, this building block consists of a look-up table (LUT) for logical functions
and a flip-flop for storage. In addition to the LC/LE block, FPGAs also contain memory, clock
management, input/output (I/O), and multiplication blocks. For the purposes of this study, LC/LE
consumption is used in determining system cost.

2.1 Soft vs. hard processor


Both Xilinx and Altera produce FPGA families that embed a physical processor core into the FPGA silicon.
A processor built from dedicated silicon is referred to as a “hard” processor. Such is the case for the
ARM922T™ inside the Altera Excalibur family and the PowerPC™ 405 inside the Xilinx Virtex-II Pro and
Virtex-4 families.


Figure 1 – Xilinx Virtex-II Pro Die Showing PowerPC Hard Processors

A “soft” processor is built using the FPGA’s general-purpose logic. The soft processor is typically
described in a Hardware Description Language (HDL) or netlist. Unlike the hard processor, a soft
processor must be synthesized and fit into the FPGA fabric.

In both soft and hard processor systems, the local memory, processor busses, internal peripherals, peripheral
controllers, and memory controllers must be built from the FPGA’s general-purpose logic.

2.2 Advantages of an FPGA embedded processor


An FPGA embedded processor system offers many exceptional advantages compared to a typical
microprocessor, including:

1) customization
2) obsolescence mitigation
3) component and cost reduction
4) hardware acceleration

2.2.1 Customization
The designer of an FPGA embedded processor system has complete flexibility to select any combination of
peripherals and controllers. In fact, the designer can invent new, unique peripherals that can be connected
directly to the processor’s bus. If a designer has a non-standard requirement for a peripheral set, this can be
met easily with an FPGA embedded processor system. For example, a designer would not easily find an
off-the-shelf processor with ten UARTs. However, in an FPGA, this configuration is very easily
accomplished.

2.2.2 Obsolescence mitigation


Some companies, in particular those supporting military contracts, have a design requirement to ensure a
product lifespan that is much longer than the lifespan of a standard electronics product. Component
obsolescence mitigation is a difficult issue. FPGA soft-processors are an excellent solution in this case
since the source HDL for the soft-processor can be purchased. Ownership of the processor’s HDL code
may fulfill the requirement for product lifespan guarantee.


2.2.3 Component and cost reduction


With the versatility of the FPGA, systems that previously required multiple components can be consolidated
into a single FPGA. This is certainly the case when an auxiliary I/O chip or a co-processor is required next to an
off-the-shelf processor. By reducing the component count in a design, a company can reduce board size
and inventory management, both of which save design time and cost.

2.2.4 Hardware acceleration


Perhaps the most compelling reason to choose an FPGA embedded processor is the ability to make
tradeoffs between hardware and software to maximize efficiency and performance. If an algorithm is
identified as a software bottleneck, a custom co-processing engine can be designed in the FPGA specifically
for that algorithm. This co-processor can be attached to the FPGA embedded processor through special,
low-latency channels, and custom instructions can be defined to exercise the co-processor. With modern
FPGA hardware design tools, transitioning software bottlenecks from software to hardware is much easier
since the software C code can be readily adapted into hardware with only minor changes to the C code.1

2.3 Disadvantages
The FPGA embedded processor system is not without disadvantages. Unlike an off-the-shelf processor, the
hardware platform for the FPGA embedded processor must be designed. The embedded designer becomes
the hardware processor system designer when an FPGA solution is selected.

Because the hardware and software platform designs are integrated, the design tools are more
complex. The increased tool complexity and design methodology require more attention from the
embedded designer.

Since FPGA embedded processor software design is relatively new compared to software design for
standard processors, the software design tools are likewise relatively immature, although workable.
Significant progress in this area has been made by both Altera and Xilinx. Within the next year, this
disadvantage should be further diminished, if not eliminated.

Device cost is another aspect to consider. If a standard, off-the-shelf processor can do the job, that
processor will be less expensive in a head-to-head comparison with an FPGA capable of an equivalent
processor design. However, if a large FPGA is already in the system, using its unused gates, or a hard
processor already present in the FPGA, makes the cost of the embedded processor system essentially inconsequential.

2.4 Peripherals and memory controllers


To facilitate FPGA embedded processor design, both Xilinx and Altera offer extensive libraries of
intellectual property (IP) in the form of peripherals and memory controllers. This IP is included in the
embedded processor toolsets provided by these manufacturers. To emphasize the versatility and flexibility
afforded the embedded designer using an FPGA, a partial list of the IP included with the embedded processor
design tools from Altera and Xilinx is given below.

2.4.1 Peripherals & peripheral controllers2


• General purpose I/O
• UART
• Timer
• Debug
• SPI
• DMA Controller
• Ethernet (interface to external MAC/PHY chip)


2.4.2 Memory controllers2


• SRAM
• Flash
• SDRAM
• DDR SDRAM (Xilinx only)
• CompactFlash

2.5 Manufacturers’ benchmarks


The industry standard benchmark for FPGA embedded processors is Dhrystone MIPs (DMIPs). Both
Altera and Xilinx quote DMIPs for most, if not all, of the available embedded processors.

The DMIPs reported by the manufacturers are based on several factors chosen to maximize the
benchmark results, including the following:
• Optimal compiler optimization level
• Fastest available device family (unless otherwise noted)
• Fastest speed grade in that device family
• Executing from fastest, lowest latency memory, typically on-chip
• Optimization of processor’s parameterizable features

The available embedded processors with the manufacturers’ quoted maximum frequency and DMIPs are
summarized here.

2.5.1 Altera3
Table 1 – Altera Embedded Processors and Performance
Processor      Processor Type   Device Family Used   Speed (MHz) Achieved   DMIPs Achieved
ARM922T™       hard             Excalibur            200                    210
NIOS®          soft             Stratix-II           180                    Not reported
Nios® II       soft             Stratix-II           Not reported           200
Nios® II       soft             Cyclone-II           Not reported           100

2.5.2 Xilinx4
Table 2 – Xilinx Embedded processors and Performance
Processor      Processor Type   Device Family Used   Speed (MHz) Achieved   DMIPs Achieved
PowerPC™ 405   hard             Virtex-4             450                    680
MicroBlaze     soft             Virtex-II Pro        150                    123
MicroBlaze     soft             Spartan-3            85                     65

3. Performance Enhancing Techniques


Some embedded designers get frustrated when the performance of their selected FPGA embedded processor
is half or less of what was expected. In some cases, a designer is unable to duplicate the manufacturers’
published results for a benchmark.

Performance often falls short because the designer has not applied all of the performance-enhancing
techniques available to FPGA embedded processors. The manufacturers obviously know what must be done
to get the most out of their chips, and they take full advantage of every possible enhancement when
benchmarking. Embedded designers who are familiar with standard microprocessor performance
optimization need to learn which software optimization techniques apply to FPGA embedded
processors. Designers must also learn performance-enhancing techniques that apply specifically to FPGAs.

The design landscape is certainly more complicated with an FPGA embedded processor. The incredible
advantages gained with this type of design are not without tradeoffs. Specifically, the increased design
complexity is overwhelming to many, including experienced embedded or FPGA designers. The
manufacturers and their partners put significant effort into training and providing support to designers
experimenting with this technology. Taking advantage of a local field applications engineer is essential to
FPGA embedded processor design success.

As an introduction to this type of design, a few performance-enhancing techniques are highlighted. Specific
references below are based on research of Xilinx FPGAs and tools, although Altera is likely to have similar
features.

3.1 Optimization techniques that are not FPGA specific


An embedded designer is familiar with many of the techniques discussed in this section. For that reason,
significant detail is not given here. The main objective of this section is to emphasize that many standard
microprocessor design optimization techniques apply to FPGA embedded processor design and can have
excellent benefits.

3.1.1 Code manipulation


Many optimizations are available to affect the application code. Some techniques apply to how the code is
written. Other techniques affect how the compiler handles the code.

3.1.1.1 Optimization level


Compiler optimizations are available in Xilinx Platform Studio (XPS) based on GCC. The current version
of the MicroBlaze and PowerPC GCC-based compilers in EDK 6.3 is 2.95.3-4. These compilers have
several levels of optimization, including: Levels 0, 1, 2, and 3 and also a size reduction optimization. An
explanation of the strategy for the different optimization levels briefly follows5:

Level 0: No optimization

Level 1: First level optimization. Performs jump and pop optimizations.

Level 2: Second level optimization. This level activates nearly all optimizations that do not
involve a speed-space tradeoff, so the executable should not increase in size. The
compiler does not perform loop unrolling, function in-lining or strict aliasing
optimizations. This is the standard optimization level used for program deployment.

Level 3: Highest optimization level. This level adds more expensive options, including those that
increase code size. In some cases this optimization level actually produces code that is
less efficient than the O2 level, and as such should be used with caution.

Size: Size optimization. The objective is to produce the smallest possible code size.

3.1.1.2 Use of manufacturer’s optimized instructions


Xilinx provides several customized library routines that have been streamlined for Xilinx embedded processors.
One example of this is xil_printf. The function is nearly identical to printf with the following
exceptions: it does not support real-number types, it is not reentrant, and it has no long long (64-bit) support.
With these reductions, the xil_printf function occupies 2,953 bytes, making it much smaller than printf,
which is 51,788 bytes.6


3.1.1.3 Assembly
Assembly, including in-line assembly, is supported by GCC. As with any microprocessor, assembly
becomes very useful in fully optimizing time critical functions. Be aware, however, that some compilers
will not optimize the remaining C code in a file if in-line assembly is also used in that file. Also, assembly
code does not enjoy the code portability advantages of C.

3.1.1.4 Miscellaneous
Many other code-related optimizations can and should be considered when optimizing an FPGA embedded
processor, including:
• locality of reference
• code profiling
• careful definition of variables (Xilinx provides a Basic Types definition)
• strategic use of small data sections, with accesses that can be twice as fast as large data
sections
• judicious use of function calls to minimize pushing/popping of stack frames
• loop length (especially where cache is involved)

3.1.2 Memory Usage


Many processors provide access to fast, local memory, as well as an interface to slower, secondary memory.
The same is true of FPGA embedded processors. The way this memory is used has a significant effect on
performance. As with other processors, the memory usage in an FPGA embedded processor can be
controlled with a linker script.

3.1.2.1 Local memory only


The fastest possible memory option is to put everything in local memory. Xilinx local memory is made up
of large FPGA memory blocks called BlockRAM (BRAM). Embedded processor accesses to BRAM
happen in a single bus cycle. Since the processor and bus run at the same frequency in MicroBlaze,
instructions stored in BRAM are executed at the full MicroBlaze processor frequency. In a MicroBlaze
system, BRAM is essentially equivalent in performance to a Level 1 (L1) cache. The PowerPC can run at
frequencies greater than the bus and has true, built-in L1 cache. Therefore, BRAM in a PowerPC system is
equivalent in performance to a Level 2 (L2) cache.

Xilinx FPGA BRAM quantities differ by device. For example, the 1.5 million gate Spartan-3 device
(XC3S1500) has a total capacity of 64KB, whereas the 400,000 gate Spartan-3 device (XC3S400) has half
as much at 32KB. An embedded designer using FPGAs should refer to the device family datasheet to
review a specific chip’s BRAM capacity.

If the designer’s program fits entirely within local memory, then the designer achieves optimal memory
performance. However, many embedded programs exceed this capacity.

3.1.2.2 External memory only


Xilinx provides several memory controllers that interface with a variety of external memory devices. These
memory controllers are connected to the processor’s peripheral bus. The three types of volatile memory
supported by Xilinx are SRAM, single-data-rate SDRAM, and double-data-rate (DDR) SDRAM. The
SRAM controller is the smallest and simplest inside the FPGA, but SRAM is the most expensive of the
three memory types. The DDR controller is the largest and most complex inside the FPGA, but fewer
FPGA pins are required, and DDR is the least expensive per megabyte.

In addition to the memory access time, the peripheral bus also incurs some latency. In MicroBlaze, the
memory controllers are attached to the On-chip Peripheral Bus (OPB). For example, the OPB SDRAM
controller requires a four- to six-cycle latency for a write and an eight- to ten-cycle latency for a read
(depending on bus clock frequency)7. The worst possible program performance is achieved by having the entire
program reside in external memory. Since optimizing execution speed is a typical goal, an entire program
should rarely, if ever, be targeted solely at external memory.

3.1.2.3 Cache external memory


The PowerPC in Xilinx FPGAs has instruction and data cache built into the silicon of the hard processor.
Enabling the cache is almost always a performance advantage for the PowerPC.

The MicroBlaze cache architecture is different from the PowerPC's because the cache memory is not
dedicated silicon. The instruction and data cache controllers are selectable parameters in the MicroBlaze
configuration. When these controllers are included, the cache memory is built from BRAM. Therefore,
enabling the cache consumes BRAM that could otherwise have been used for local memory. For the same
storage size, a cache consumes more BRAM than plain local memory because the cache architecture
requires address tag storage. Additionally, enabling the cache consumes general-purpose logic to build
the cache controllers.

For example, an experiment in Spartan-3 enables 8 KB of data cache and designates 32 MB of external
memory to be cached. This cache requires 12 address tag bits. This configuration consumes 124 logic cells
and 6 BRAMs. Only 4 BlockRAMs are required in Spartan-3 to achieve 8 KB of local memory. In this
case, cache is 50% more expensive in terms of BRAM usage than local memory. The 2 extra BRAMs are
used to store address tag bits.

If 1 MB of external memory is cached with an 8 KB cache, then the address tag bits can be reduced to 7.
This configuration then only requires 5 BRAMs rather than 6 (4 BRAMs for the cache, and 1 BRAM for
the tags). This is still 25% greater than if the BRAMs are used as local memory.

Additionally, the achievable system frequency may be reduced when the cache is enabled. In one example,
the system without any cache is capable of running at 75 MHz; the system with cache is only capable of
running at 60 MHz. Enabling the cache controller adds logic and complexity to the design, decreasing the
achieved system frequency during FPGA place and route. Therefore, in addition to consuming FPGA
BRAM resources that may have otherwise been used to increase local memory, the cache implementation
may also cause the overall system frequency to decrease.

Considering these cautions, enabling the MicroBlaze cache, especially the instruction cache, may improve
performance, even when the system must run at a lower frequency. As will be shown later in the DMIPs
section (see Section 4.1), a 60 MHz system with instruction cache enabled has a 150% advantage over a 75
MHz system without instruction cache (both systems store entire program in external memory). When both
instruction and data caches are enabled, the 60 MHz system outperforms the 75 MHz system by 308%.

This example is not the most practical since the entire DMIPs program will fit in the cache. A more
realistic experiment is to use an application that is larger than the cache. Another precaution regards
applications that frequently jump beyond the range of the cache. Multiple cache misses degrade
performance, sometimes making cached external memory perform worse than external memory without a cache.

Given these warnings, enabling the cache is always worth an experiment to determine if it improves the
performance for an application.

3.1.2.4 Combination: code partitioning in internal, external, and cached memory


The memory architecture that provides the best performance is one that only has local memory. However,
this architecture is not always practical since many useful programs will exceed the available capacity of the
local memory. On the other hand, running from external memory exclusively may have more than an eight
times performance disadvantage due to the peripheral bus latency. Caching the external memory is an
excellent choice for PowerPC. Caching the external memory in MicroBlaze definitely improves results, but
an alternative method is presented that may provide optimal results.

For MicroBlaze, perhaps the optimal memory configuration is to wisely partition the program code,
maximizing the system frequency and local memory size. Critical data, instructions, and stack are placed in
local memory. Data cache is not used, allowing for a larger local memory bank. If the local memory is not
large enough to contain all instructions, the designer should consider enabling the instruction cache for the
address range in external memory used for instructions.

By not consuming BRAM in a data cache, the local memory can be enlarged to hold more code and data. An
instruction cache for the instructions assigned to external memory can be very effective. Experimentation
or profiling shows which code items are most heavily accessed; assigning these items to local memory
provides a greater performance improvement than caching.
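
With the GNU toolchain, this partitioning is expressed in the linker script. The fragment below is only a sketch: the memory names, base addresses, and sizes are hypothetical, and the time-critical items are assumed to have been collected into a `.fast_text` input section by the designer.

```ld
/* Sketch only: region names, origins, and lengths are hypothetical. */
MEMORY
{
    bram  : ORIGIN = 0x00000000, LENGTH = 32K   /* fast local BRAM */
    sdram : ORIGIN = 0x80000000, LENGTH = 32M   /* external SDRAM  */
}

SECTIONS
{
    /* Time-critical code, data, and stack stay in single-cycle BRAM. */
    .fast_text : { *(.fast_text) } > bram
    .data      : { *(.data) }      > bram
    .stack     : { *(.stack) }     > bram

    /* Everything else runs from (optionally cached) external memory. */
    .text      : { *(.text) }      > sdram
}
```

Profiling results then feed directly back into this file: the hottest sections are moved into the `bram` region until it is full.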

Express Logic’s Thread-Metric™ test suite is an example of how partitioning a small piece of code in local
memory can have a significant performance improvement (see Section 4.2). One function in the Thread-
Metric Basic Processing test is identified as time critical. The function’s data section (consisting of 19% of
the total code size) is allocated to local memory, and the instruction cache is enabled. The 60 MHz, cached
and partitioned-program system achieves performance that is 560% better than running a non-cached, non-
partitioned 75 MHz system using only external memory.

However, the 75 MHz system shows even more improvement with code partitioning. If the time critical
function’s data and text sections (22% of the total code size) are assigned to local memory on the 75 MHz
system, a 710% improvement is realized, even with no instruction cache for the remainder of the code
assigned to external memory.

In this one case, the optimal memory configuration is one that maximizes local memory and system
frequency without cache. In other systems where the critical code is not so easily pinpointed, a cached
system may perform better. Designers should experiment with both methods to determine what is optimal
for their design.

3.2 FPGA specific optimization techniques


Since the designer is actually building and creating the embedded processor system hardware in an FPGA,
much can be done to improve the performance of the hardware. Additionally, with an FPGA embedded
processor residing next to additional FPGA hardware resources, a designer can consider custom co-
processor designs specifically targeted at a design’s core algorithm.

3.2.1 Increase FPGA’s operating frequency


Employing FPGA design techniques to increase the operating frequency of the FPGA embedded processor
system increases performance. Several methods are considered.

3.2.1.1 Logic optimization and reduction


Only connect those peripherals and busses that will be used. A few examples are presented below.

If a design does not store and run any instructions using external memory, do not connect the instruction-
side of the peripheral bus. Connecting both the instruction and data side of the processor to a single bus
creates a multi-master system, which requires an arbiter. Optimal bus performance is achieved when a
single master resides on the bus.

Debug logic requires resources in the FPGA and may be the hardware bottleneck. When a design is
completely debugged, the debug logic can be removed from the production system, potentially increasing
the system’s performance. For example, removing a MicroBlaze Debug Module (MDM) with an FSL
acceleration channel saves 950 LCs. In MicroBlaze systems with the cache enabled, the debug logic will
typically be the critical path that slows down the entire design.

The Xilinx OPB External Memory Controller (EMC) used to connect SRAM and Flash memories creates a
32-bit address bus even if 32 bits are not required to address the memory. Xilinx also provides a bus-
trimming peripheral, which removes the unused address bits. When using this memory controller, the bus-
trimmer should always be used to eliminate the unused addresses. This frees up routing and pins that would
have otherwise been used. The Xilinx Base System Builder (BSB) now does this automatically.

Xilinx provides several GPIO peripherals. The latest GPIO peripheral version (v3.01.a) has excellent
capabilities, including dual-channel support, bi-directionality, and interrupt capability. However, these
features also require more resources, which affect timing. If a simple GPIO is all that the design requires,
the designer should use a more primitive version of the GPIO, or at least ensure that the unused features in
the enhanced GPIO are turned off. In the optimized examples in this study, GPIO v1.00.a is used, which is
much less sophisticated, much faster, and approximately half the size (304 LCs for 7 GPIO v1.00.a
peripherals as compared to 602 LCs for v3.01.a).

3.2.1.2 Area and timing constraints


Xilinx FPGA place and route tools perform much better when given guidelines as to what is most important
to the designer. In the Xilinx tools, a designer can specify the desired clock frequency, pin location, and
logic element location. By providing these details, the tools are able to make smarter trade-offs during the
hardware design implementation.

Some peripherals require additional constraints to ensure proper operation. For example, both the DDR
SDRAM controller and the 10/100 Ethernet MAC require additional constraints to guarantee that the tools
create correct and optimized logic. The designer must read the datasheet for each peripheral and follow the
recommended design guidelines.
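
In the Xilinx tools of this generation, such constraints go in a User Constraints File (UCF). A minimal sketch follows; the net name and pin site are hypothetical, and a real DDR or Ethernet design would carry many more constraints from the peripheral datasheets.

```ucf
# Target clock period (a 100 MHz goal) and pin placement.
# "sys_clk" and site "T9" are hypothetical examples.
NET "sys_clk" TNM_NET = "sys_clk";
TIMESPEC "TS_sys_clk" = PERIOD "sys_clk" 10 ns HIGH 50%;
NET "sys_clk" LOC = "T9";
```

Given an explicit PERIOD constraint, the place-and-route tools report whether the target was met rather than simply doing their best-effort optimization.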

3.2.2 Hardware acceleration


Dedicated hardware outperforms software. The embedded designer who is serious about increasing
performance must consider the FPGA’s ability to accelerate the processor performance with dedicated
hardware. Although this technique consumes FPGA resources, the performance improvements can be
extraordinary.

3.2.2.1 Turn on the hardware divider and barrel-shifter


MicroBlaze can be customized to use a hardware divider and a hardware barrel-shifter rather than
performing these functions in software. Enabling these processor capabilities consumes more logic but
improves performance. In one example, enabling the hardware divider and barrel-shifter adds 414 LCs, but
the performance is improved by 18.1% (DMIPs benchmark).

3.2.2.2 Software bottlenecks converted to co-processing hardware


Custom hardware logic can be designed to offload an FPGA embedded processor. When a software
bottleneck is identified, a designer can choose to convert the bottleneck algorithm into custom hardware.
Custom software instructions can then be defined to operate the hardware co-processor.

Both MicroBlaze and Virtex-4 PowerPC include very low-latency access points into the processor, which
are ideal for connecting custom co-processing hardware. Virtex-4 introduces the Auxiliary Processing Unit
(APU) for the PowerPC8. The APU provides a direct connection from the PowerPC to co-processing
hardware. In MicroBlaze, the low-latency interface is called the Fast Simplex Link (FSL) bus. The FSL
bus contains multiple channels of dedicated, uni-directional, 32-bit interfaces. Because the FSL channels
are dedicated, no arbitration or bus mastering is required. This allows an extremely fast interface to the
processor9.


Converting a software bottleneck into hardware may seem like a very difficult task. Traditionally, a
software designer identifies the bottleneck, after which the algorithm is transitioned to an FPGA designer
who writes VHDL or Verilog code to create the hardware co-processor. Fortunately, this process has been
greatly simplified by tools that are capable of generating FPGA hardware from C code. One such tool is
CoDeveloper™ from Impulse Accelerated Technologies. This tool allows one designer who is familiar
with C to port a software bottleneck into a custom piece of co-processing FPGA hardware using
CoDeveloper’s Impulse C libraries.

Some examples of algorithms that could be targeted for hardware-based co-processors include:
• Inverse Discrete Cosine Transformation, used in JPEG decoding
• Fast Fourier Transform
• MP3 decode
• Triple-DES and AES encryption
• Matrix manipulation

Any operation that is algorithmic, mathematical, or parallel is a good candidate for a hardware
co-processor [10]. FPGA logic consumption is traded for performance. The advantages can be enormous,
improving performance by tens or hundreds of times.

4. MicroBlaze True System Performance Revealed


To illustrate the difference these performance-enhancing techniques can make in real designs, several case
studies are now presented. Each of these examples is built and run using Xilinx ISE and EDK software,
version 6.3. Where possible, the examples are executed on real hardware, using Memec's Spartan-3 MB
Development Board, with a 1.5 million gate Spartan-3 FPGA (XC3S1500).

Figure 2 – Memec Spartan-3 MB Development Board


The value of any benchmark can be questioned. Therefore, to provide a more objective perspective, several
applications are examined: Dhrystone MIPs, Express Logic’s Thread-Metric RTOS benchmark, and a real-
world application (Triple-DES encryption).

4.1 Dhrystone MIPs


As previously specified, Xilinx publishes a Spartan-3 MicroBlaze benchmark of 65 DMIPs running at 85
MHz. This case study investigates the system design for which this benchmark was achieved and compares
it to other, more real-world processor systems.

4.1.1 Best performing, but minimal system


The best performing system is achieved under the following conditions:
• minimal peripheral set, including only those peripherals required to run the benchmark or
report results
• highly optimized hardware, including several FPGA design optimization methods as well as
the fastest speed grade for this family (-5)
• local memory only
• optimal compiler optimization

The IP blocks included in this embedded processor design are listed below:

• MicroBlaze
• hardware divider and barrel shifter included
• no data or instruction cache
• Data-side peripheral bus (no instruction-side)
• 8KB instruction local memory
• 16KB data local memory
• No debug hardware
• RS232 UART
• Timer

A block diagram of the system is shown in Figure 3.

Figure 3 – Spartan-3 DMIPs System Block Diagram


Based on this processor system running in the slower, -4 speed grade on the Memec Spartan-3 MB board,
the achieved results are 63.3 DMIPs. The MicroBlaze system runs at a maximum frequency of 81.8 MHz.
Dividing the DMIPs by the operating frequency gives 0.773 DMIPs/MHz, which is useful when
extrapolating performance numbers for a similar processor system running at a different frequency. This
system consumes 2322 logic cells. Based on Xilinx press release numbers of $20 for an XC3S1500 [11], this
system cost can be estimated at $1.74.

When this same system is built using a -5 speed grade Spartan-3 device, the achieved frequency is 90.8
MHz. Since a development board with a -5 device is not available, the DMIPs performance is extrapolated
based on the numbers achieved during the -4 experiment. Based on 0.773 DMIPs/MHz, this -5 system is
capable of achieving 70.3 DMIPs. This result is better than the published Xilinx benchmark of 65 DMIPs!

4.1.2 Real-world, but unoptimized system


A more realistic system is now used to run the same benchmark. Although the peripherals aren’t needed to
run the benchmark, many useful peripherals are added to the system. Xilinx Platform Studio’s Base System
Builder (BSB) is used to generate the hardware platform. No hardware optimization is attempted. The
entire program, including code and data, is stored in and run from external memory. The default
optimization level (Level 2) is used. By default, the hardware divider and barrel-shifter are not enabled.

The system consists of the following:


• MicroBlaze
• hardware divider and barrel shifter not included
• no data or instruction cache
• Data- and instruction-side peripheral bus with arbiter
• 32KB shared instruction and data local memory
• Debug hardware with FSL acceleration channel
• RS232 UART
• USB-to-serial bridge UART
• GPIO
• 4 LEDs
• 2 seven-segment displays
• 2 push buttons
• 8 dip switches
• Flash chip reset
• Flash chip RDY/BUSYn
• Flash controller
• SRAM controller
• 10/100 Ethernet MAC
• Timer
• Interrupt controller

Figure 4 shows the block diagram for this system.


Figure 4 – Spartan-3 Real-world System Block Diagram

For those who are inexperienced with FPGA embedded processors, and with Xilinx MicroBlaze specifically, the
default BSB system may very well be the typical system that a designer creates. The design builds and
runs as expected, but the performance degradation is severe. The operating frequency achieved is only 62.5
MHz, and the result is 4.505 DMIPs, less than one-fourteenth of the Xilinx published
DMIPs number. This design consumes 7732 logic cells, which, based on the chip price quoted previously,
equates to a system cost of $5.81.

4.1.3 Real-world, optimized system


Next, the “typical” system from the previous experiment is optimized in both hardware and software. The
hardware divider and barrel-shifter are enabled. The GPIO units are replaced with a more primitive, but
smaller and faster version. Using other FPGA design techniques, the hardware is optimized and rebuilt,


with results achieving 75 MHz for the non-cached MicroBlaze system and 60 MHz for the cached
MicroBlaze system.

Although the code is small enough to fit entirely inside local memory, several alternative memory structures
are investigated. The following parameters are adjusted to determine their effects on performance:
• Location of instructions
• Location of small data
• Location of large data
• Location of stack
• Data cache
• Instruction cache
• Compiler optimization level

Optimizing the hardware system to run at 75 MHz with a hardware divider and barrel shifter increases LC
usage by 986, at a cost of $0.74, and raises performance to 6.846 DMIPs, a 52%
improvement over the unoptimized system. Changing the compiler optimization level to Level 3
further increases performance to 7.139 DMIPs, an additional gain of 4.3%.

Enabling the instruction and data caches adds an additional $0.27 of LC cost but increases the performance
to 29.130 DMIPs.

Eliminating the data cache saves $0.20 and allows the local memory to be increased by 16 KB. With the
instruction cache enabled and assigning the stack and small data sections to local memory, 33.178 DMIPs
are achieved. If all data is assigned to local memory, this performance increases to 47.811 DMIPs.

If both instruction and data cache are eliminated, and local memory is used for stack, small data sections,
and instructions, the results are 41.483 DMIPs. When local memory is used for the entire program, the
maximum performance for this hardware system is achieved at 59.785 DMIPs.

For this real-world system, the best results are within 8% of the manufacturer’s published benchmark. Even
when running cached external instruction memory with local memory data storage, the system is capable of
running at 74% of the manufacturer’s benchmark.

These results are summarized in Table 3.

Table 3 – DMIPs Experiment Results

System configuration                                    Frequency   DMIPs    LCs
Initial optimized system, running from SRAM             75 MHz      6.846    8718
Increased compiler optimization level to Level 3        75 MHz      7.139    8718
Added instruction and data caches                       60 MHz     29.130    9076
Removed data cache; moved stack and small data
  sections to local memory                              60 MHz     33.178    8812
Added remaining data sections to local memory           60 MHz     47.811    8812
Removed instruction cache; moved large data sections
  back to SRAM; moved instructions to local memory      75 MHz     41.483    8718
Moved entire program to local memory                    75 MHz     59.785    8718


4.2 Thread-Metric Real-Time Operating System (RTOS) benchmark


The performance differences between optimized and unoptimized MicroBlaze systems are
examined using an RTOS benchmark. The Thread-Metric test suite consists of eight distinct RTOS tests
that are designed to highlight commonly used aspects of an RTOS. The tests measure the total number of
RTOS events that can be processed during a specific timer interval [12]. For the purpose of examining the
performance differences between systems, only the Basic Processing test is reported in these results. This
test consists of a single thread. A 30 second time interval is used for these experiments. Express Logic's
ThreadX® RTOS is used for this comparison.

4.2.1 Real-world, unoptimized


The unoptimized system used in Section 4.1.2 is reused for this benchmark. With the program running
from SRAM, the result is 19713 events.

4.2.2 Real-world, optimized


The hardware and software optimized system used in Section 4.1.3 is now used with Thread-Metric.
Through experimentation, the code in source file tm_basic_processing_test.c is identified as the critical function.

With the program running from SRAM, the results are 23823 events, an improvement of 21%, attributable
solely to the increase in operating frequency from 62.5 MHz to 75 MHz. Next, the stack and all data
sections are moved to local memory. The achieved results are 29808 events, an additional performance
improvement of 25%.

With the entire program in SRAM and instruction cache enabled (system running at 60 MHz), the results
are 64006 events. This is a significant improvement of 225% over the original, unoptimized results. When
the critical function’s data section is moved to local memory with instruction cache enabled, the
performance is 157512 events.

By removing the instruction cache, the system speed is boosted to 75 MHz. The critical function's data and
text sections are roughly one-fifth the total program size. When the critical function's data and text are
assigned to local memory, the achieved results are 192867 events. If the entire program runs from local
memory, performance is maximized at 198787 events.

For this one example, only 3% of performance is lost in the program-partitioned experiment running at 75
MHz. When a critical function can be identified and assigned to reside permanently in local memory,
caching is not necessary, and the achieved results are nearly as good as if the entire program were running
from local memory. These results are summarized in Table 4.

Table 4 – Thread-Metric Basic Test Results

System configuration                                    Frequency   Events
Initial optimized system, running from SRAM             75 MHz       23823
Stack and all data moved to local memory                75 MHz       29808
Running from SRAM, I-cache enabled                      60 MHz       64006
Critical function's data moved to local memory          60 MHz      157512
Removed instruction cache; critical function's
  instructions also moved to local memory               75 MHz      192867
Moved entire program to local memory                    75 MHz      198787


4.3 User application


A Triple-DES encryption algorithm is used to show the power of using a hardware co-processor. First, the
algorithm runs in a MicroBlaze system using software only. Next, a piece of the algorithm is converted
using Impulse C to a hardware co-processor connected through the MicroBlaze’s FSL bus. The hardware
used is a 1 million gate Virtex-II FPGA on the Memec Virtex-II MB Development Board.

4.3.1 Software-only implementation


As reported in FPGA Journal [13], 1000 text blocks were encrypted with Triple-DES. In the software-only
approach, this encryption took 252 ms running on a 100 MHz MicroBlaze system.

4.3.2 Hardware co-processor


Impulse C was used to create a hardware co-processor to assist the MicroBlaze processor in performing the
Triple-DES encryption. The code was slightly modified to support hardware compilation using Impulse C.
The Impulse C compiler then created a hardware representation in the Virtex-II FPGA for the Triple-DES
algorithm. This newly built co-processor must be able to communicate with the MicroBlaze. Two FSL
channels are used to transfer data between the MicroBlaze and the Triple-DES co-processor.

A custom instruction is defined to exercise the FSL channels. The MicroBlaze application is modified to
use this newly defined instruction, which takes advantage of the co-processor, and the hardware is rebuilt.
The complexity of the co-processor results in a lower system frequency of 24 MHz. However, the overall
performance gain of the system is remarkable, with a single encryption cycle taking only 6.9 ms. This is a
36-fold increase in performance!

Upon further review, more sophisticated changes are made to the Impulse C version of the Triple-DES
algorithm. Taking advantage of C programming techniques such as array splitting, loop unrolling, and
pipelining, further optimization is accomplished. Implementing the maximum optimization in the
Impulse C code, the final execution time is 0.59 ms, an incredible boost of 425 times over the original
software-only implementation [14]!

Although more hardware is consumed in the FPGA, the performance enhancements are significant. The
ability to create custom co-processing units specific to a designer’s application makes the FPGA embedded
processor solution unmatched in performance compared to any off-the-shelf processor!

5. Conclusion
Based on the experiments performed in this research, several conclusions can be drawn.

5.1 Reasonable expectations


All manufacturers want the best possible benchmark to publish. FPGA manufacturers take full advantage of
system flexibility and FPGA design techniques to achieve high benchmarks. The embedded designer must
understand how these benchmarks are achieved and realize that an actual FPGA embedded processor
system will not be able to achieve such high marks. The benchmarks are still useful as a means of
comparing one manufacturer with another.

Regarding the design process, remember that the hardware platform is part of the FPGA embedded
processor design. Unlike off-the-shelf processors where the hardware is pre-defined and fixed, FPGAs have
the flexibility and added complexity to create a multitude of different systems. A member of the design
team must be capable of hardware development and optimization of the FPGA embedded processor.

If an application does not require high performance or any of the other advantages that an FPGA can
provide, an off-the-shelf processor is most likely a less complicated, less expensive, and better solution.


5.2 Optimization through experimentation yields the best results


Much can be done to optimize an FPGA embedded processor system. In addition to standard software
optimization, the embedded designer can perform many hardware optimizations. Careful use of memory
has a large impact on the performance. Experimentation with different memory strategies is necessary to
achieve the best performance.

Understand that the addition or removal of each peripheral, peripheral controller, or bus alters the design
size, cost, and speed. Use only what is necessary and no more!

5.3 Take advantage of superior flexibility in FPGAs


An FPGA embedded processor has the power and ability to provide previously unachievable flexibility and
performance. With an FPGA, a designer can specify exactly the peripherals required in a system. With an
FPGA soft processor, a designer can purchase and own the source code for the processor. With an FPGA, a
designer can adapt C code from a software bottleneck to create a custom hardware co-processing unit.
These are excellent advantages that can only be realized in programmable hardware.

If the designer chooses not to take advantage of these capabilities, the realized FPGA embedded processor
system performance may be a huge disappointment. However, for those who take full advantage of the
FPGA embedded processor, specific application performance greatly exceeding typical microprocessor
expectations is possible!

6. Acknowledgements
Shalin Sheth, Xilinx
Ron Wright, Memec

7. References
1. www.impulsec.com/trueansic.htm
2. For a complete list of available IP, visit the following websites:
   www.altera.com/products/ip/processors/nios2/features/ni2-peripherals.html
   www.xilinx.com/ise/embedded/edk_ip.htm
3. The website URLs where these numbers were found, as of November 10, 2004:
   www.altera.com/products/devices/excalibur/exc-index.html
   www.altera.com/products/ip/processors/nios/overview/nio-overview.html
   www.altera.com/products/ip/processors/nios2/overview/ni2-overview.html
4. The website URLs where these numbers were found, as of November 10, 2004:
   www.xilinx.com/products/virtex4/capabilities.htm
   www.xilinx.com/ipcenter/processor_central/embedded/performance.htm
   www.xilinx.com/ipcenter/processor_central/microblaze/performance.htm
5. An Introduction to GCC for the GNU Compilers gcc and g++, Brian Gough, Network Theory Ltd., March 31, 2004
6. See Xilinx Answer Record 19592 at www.support.xilinx.com
7. OPB Synchronous DRAM (SDRAM) Controller, DS426, Xilinx, August 11, 2004
8. PowerPC™ 405 Processor Block Reference Guide, UG018, Xilinx, August 20, 2004
9. Fast Simplex Link (FSL) Bus (v2.00a), DS449, Xilinx, August 17, 2004
10. www.impulsec.com/applications.htm
11. Volume pricing for 250k units, end of 2004 (www.xilinx.com/prs_rls/silicon_spart/0498s3.htm)
12. Thread-Metric RTOS Test Suite, Express Logic, Inc., April 21, 2004
13. "FPGAs Provide Acceleration for Software Algorithms," David Pellerin and Milan Saini, FPGA Journal, 2004
14. Optimizing Impulse C Code for Performance, Scott Thibault and David Pellerin, Impulse Accelerated Technologies, Inc., September 30, 2004

CHAPTER 11
Programmable Logic Devices

Programmable logic is the means by which a large segment of engineers implement their custom
logic, whether that logic is a simple I/O port or a complex state machine. Most programmable logic
is implemented with some type of HDL that frees the engineer from having to derive and minimize
Boolean expressions each time a new logical relationship is designed. The advantages of programmable
logic include rapid customization with relatively limited expense invested in tools and support.
The widespread availability of flexible programmable logic products has brought custom logic
design capabilities to many individuals and smaller companies that would not otherwise have the
financial and staffing resources to build a fully custom IC. These devices are available in a wide range
of sizes, operating voltages, and speeds, which all but guarantees that a particular application can be
closely matched with a relevant device. Selecting that device requires some research, because each
manufacturer has a slightly different specialty and range of products.
Programmable logic technology advances rapidly, and manufacturers are continually offering devices
with increased capabilities and speeds. After completing this chapter and learning about the
basic types of devices that are available, it is recommended that you browse through the latest
manufacturers' data sheets to get updated information. Companies such as Altera, Atmel, Cypress,
Lattice, QuickLogic, and Xilinx provide detailed data sheets on their web sites and also tend to offer
bundled development software for reasonable prices.

11.1 CUSTOM AND PROGRAMMABLE LOGIC

Beyond using discrete 7400 ICs, custom logic is implemented in larger ICs that are either manufactured
with custom masks at a factory or programmed with custom data images at varying points after
fabrication. Custom ICs, or application specific integrated circuits (ASICs), are the most flexible
option because, as with anything custom, there are fewer constraints on how application specific logic
is implemented. Because custom ICs are tailored for a specific application, the potential exists for
high clock speeds and relatively low unit prices. The disadvantages to custom ICs are long and expensive
development cycles and the inability to make quick logic changes. Custom IC development
cycles are long, because a design must generally be frozen in a final state before much of the silicon
layout and circuit design work can be completed. Engineering charges for designing a custom mask
set (not including the logic design work) can range from $50,000 to well over $1 million, depending
on the complexity. Once manufactured, the logic can't simply be altered, because the logic configuration
is an inherent property of the custom design. If a bug is found, the time and money to alter the
mask set can approach that of the initial design itself.


Copyright 2003 by The McGraw-Hill Companies, Inc.
250 Advanced Digital Systems

Programmable logic devices (PLDs) are an alternative to custom ASICs. A PLD consists of general-purpose
logic resources that can be connected in many permutations according to an engineer's
logic design. This programmable connectivity comes at the price of additional, hidden logic that
makes logic connections within the chip. The main benefit of PLD technology is that a design can be
rapidly loaded into a PLD, bypassing the time consuming and expensive custom IC development
process. It follows that if a bug is found, a fix can be implemented very quickly and, in many cases,
reprogrammed into the existing PLD chip. Some PLDs are one-time programmable, and some can
be reprogrammed in circuit.
The disadvantage of PLDs is the penalty paid for the hidden logic that implements the programmable
connectivity between logic gates. This penalty manifests itself in three ways: higher unit cost,
slower speeds, and increased power consumption. Programmable gates cost more than custom gates,
because, when a programmable gate is purchased, that gate plus additional connectivity overhead is
actually being paid for. Propagation delay is an inherent attribute of all silicon structures, and the
more structures that are present in a path, the slower the path will be. It follows that a programmable
gate will be slower than a custom gate, because that programmable gate comes along with additional
connectivity structures with their own timing penalties. The same argument holds true for power
consumption.
Despite the downside of programmable logic, the technology as a whole has progressed dramatically
and is extremely popular as a result of competitive pricing, high performance levels, and, especially,
quick time to market. Time to market is an attribute that is difficult to quantify but one that is
almost universally appreciated as critical to success. PLDs enable a shorter development cycle, because
designs can be prototyped rapidly, the bugs worked out, and product shipped to a customer before
some ASIC technologies would even be in fabrication. Better yet, if a bug is found in the field,
it may be fixable with significantly less cost and disruption. In the early days of programmable logic,
PLDs could not be reprogrammed, meaning that a bug could still force the recall of product already
shipped. Many modern reprogrammable PLDs allow hardware bugs to be fixed in the field with a
software upgrade consisting of a new image that can be downloaded to the PLD without having to
remove the product from the customer site.
Cost and performance are probably the most debated trade-offs involved in using programmable
logic. The full range of applications in which PLDs or ASICs are considered can be broadly split
into three categories as shown in Fig. 11.1. At the high end of technology, there are applications in
which an ASIC is the only possible solution because of leading edge timing and logic density
requirements. In the mid range, clock frequencies and logic complexity are such that a PLD is capable
of solving the problem, but at a higher unit cost than an ASIC. Here, the decision must be made
between flexibility and time to market versus lowest unit cost. At the low end, clock frequencies and
logic density requirements are far enough below the current state of silicon technology that a PLD
may meet or even beat the cost of an ASIC.
It may sound strange that a PLD with its overhead can ever be less expensive than a custom chip.
The reasons for this are a combination of silicon die size and volume pricing. Two of the major factors
in the cost of fabricating a working silicon die are its size and manufacturing yield. As a die gets
smaller, more of them can be fabricated at the same time on the same wafer using the same resources.
IC manufacturing processes are subject to a certain yield, which is the percentage of working
dice obtained from an overall lot of dice. Some dice develop microscopic flaws during
manufacture that make them unusable. Yield is a function of many variables, including the reliability
of uniformly manufacturing a certain degree of complexity given the prevailing state of technology
at a point in time. From these two points, it follows that a silicon chip will be less expensive to manufacture
if it is both small and uses a technology process that is mature and has a high yield.
At the low end of speed and density, a small PLD and a small ASIC may share the same mature
technology process and the same yield characteristics, removing yield as a significant variable in
[FIGURE 11.1 PLDs vs. ASICs circa 2003. The plot maps logic and RAM gate density (x1000, from 10 to 10,000+) against clock frequency (MHz, from 1 to 1000+): the highest densities and frequencies are "ASIC Only" territory, the midrange is "ASIC or PLD: Cost vs. Time to Market," and at the low end a "PLD may be cheaper."]

their cost differential. Likewise, raw packaging costs are likely to be comparable because of the
maturation of stable packaging materials and technologies. The cost differential comes down to which
solution requires the smaller die and how the overhead costs of manufacturing and distribution are
amortized across the volume of chips shipped.
Die size is a function of two design characteristics: how much logic is required and how many I/O
pins are required. While the size of logic gates has decreased by orders of magnitude over time, the
size of I/O pads, the physical structures that packaging wires connect to, has not changed by the
same degree. There are nonscalable issues constraining pad size, including physical wire bonding
and current drive requirements. I/O pads are often placed on the perimeter of a die. If the required
number of I/O pads cannot be placed along the existing die's perimeter, the die must be enlarged
even if not required by the logic. ICs can be considered as being balanced, logic limited, or pad limited.
A balanced design is optimal, because silicon area is being used efficiently by the logic and pad
structures. A logic-limited IC's silicon area is dominated by the internal logic requirements. At the
low end being presently discussed, being logic limited is not a concern because of the current state
of technology. Pad-limited designs are more of a concern at the low end, because the chip is forced
to a certain minimum size to support a minimum number of pins.
Many low-end logic applications end up being pad limited as the state of silicon process technology
advances and more logic gates can be squeezed into ever smaller areas. The logic shrinks, but
the I/O pads do not. Once an IC is pad limited, ASIC and CPLD implementations may use the same
die size, removing it as a cost variable. This brings us back to the volume pricing and distribution aspects
of the semiconductor business. If two silicon manufacturers are fabricating what is essentially
the same chip (same size, yield, and package), who will be able to offer the lowest final cost? The
comparison is between a PLD vendor that turns out millions of the exact same chip each year versus
an ASIC vendor that can manufacture only as many of your custom chips as you are willing to buy.
Is an ASIC's volume 10,000 units per year? 100,000? One million? With all other factors being
equal, the high-volume PLD vendor has the advantage, because the part being sold is not custom but
a mass-produced generic product.

11.2 GALS AND PALS

Among the most basic types of PLDs are Generic Array Logic™ (GAL) devices.* GALs are enhanced
variants of the older Programmable Array Logic™ (PAL) architecture that is now essentially
obsolete. The term PAL is still widely used, but people are usually referring to GAL devices or other
PLD variants when they use the term. PALs became obsolete, because GALs provide a superset of
their functionality and can therefore perform all of the functions that PALs did. GALs are relatively
small, inexpensive, easily available, and manufactured by a variety of vendors (e.g., Cypress, Lattice,
and Texas Instruments).
It can be shown through Boolean algebra that any logical expression can be represented as an
arbitrarily complex sum of products. Therefore, by providing a programmable array of AND/OR
gates, logic can be customized to fit a particular application. GAL devices provide an extensive
programmable array of wide AND gates, as shown in Fig. 11.2, into which all the device's input terms
are fed. Both true and inverted versions of each input are made available to each AND gate. The
outputs of groups of AND gates (products) feed into separate OR gates (sums) to generate user-defined
Boolean expressions.
Each intersection of a horizontal AND gate line and a vertical input term is a programmable
connection. In the early days of PLDs, these connections were made by fuses that would literally have to
be blown with a high voltage to configure the device. Fuse-based devices were not reprogrammable;
[FIGURE 11.2 GAL/PAL AND/OR structure: input terms feed a programmable AND array, and groups of AND outputs feed OR gates to produce the output terms.]

* GAL, Generic Array Logic, PAL, and Programmable Array Logic are trademarks of Lattice Semiconductor Corporation.

once a microscopic fuse is blown, it cannot be restored. Today's devices typically rely on EEPROM
technology and CMOS switches to enable nonvolatile reprogrammability. However, fuse-based
terminology remains in use for historical reasons. The default configuration of a connection emulates
an intact fuse, thereby connecting the input term to the AND gate input. When the connection is
blown, or programmed, the AND input is disconnected from the input term and pulled high to
effectively remove that input from the Boolean expression. Customization of a GAL's programmable
AND gate is conceptually illustrated in Fig. 11.3.
With full programmability of the AND array, the OR connections can be hard wired. Each GAL
device feeds a differing number of AND terms into the OR gates. If one or more of these AND terms
are not needed by a particular Boolean expression, those unneeded AND gates can be effectively dis-
abled by forcing their outputs to 0. This is done by leaving an unneeded AND gate’s inputs unpro-
grammed. Remember that inputs to the AND array are provided in both true and complement
versions. When all AND connections are left intact, multiple expressions of the form A & Ā = 0
result, thereby forcing that gate's output to 0 and rendering it nonparticipatory in the OR function.
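A small software model makes this fuse behavior concrete. The Python sketch below is purely illustrative (the tuple-based fuse encoding is an assumption of the model, not vendor terminology): a gate with selected fuses intact computes its product term, while a gate with every fuse left intact sees each input ANDed with its own complement and is therefore forced to 0:

```python
def and_gate(fuses, inputs):
    """Model one programmable AND gate in a GAL.

    fuses: set of intact (connected) terms, each a (index, polarity)
    pair where polarity True is the true input and False its
    complement. A blown fuse disconnects the term, which then
    floats high and drops out of the AND.
    """
    return all(inputs[i] == pol for i, pol in fuses)

inputs = (True, False, True)

# Gate programmed to compute A & ~B: only those two fuses left intact.
assert and_gate({(0, True), (1, False)}, inputs) is True

# Unused gate: all fuses intact, so every input appears in both
# polarities (A & ~A & B & ~B & ...), forcing the output to 0.
all_fuses = {(i, pol) for i in range(3) for pol in (True, False)}
assert and_gate(all_fuses, inputs) is False
```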
The majority of a GAL’s logic customization is performed by programming the AND array. How-
ever, selecting flip-flops, OR/NOR polarities, and input/output configurations is performed by pro-
gramming a configurable I/O and feedback structure called a macrocell. The basic concept behind a
macrocell is to ultimately determine how the AND/OR Boolean expression is handled and how the
macrocell’s associated I/O pin operates. A schematic view of a GAL macrocell is shown in Fig. 11.4,
although some GALs may contain macrocells with slightly different capabilities. Multiplexers
determine the polarity of the final OR/NOR term, whether the term is registered, and whether
the feedback signal is taken directly at the flop's output or at the pin. Configuring the
macrocell's output enable determines how the pin behaves.

FIGURE 11.3 Programming AND input terms (example: Output = A & C & D & E & F).

FIGURE 11.4 GAL macrocell.


254 Advanced Digital Systems

There are two common GAL devices, the 16V8 and the 22V10, although other variants exist as
well. They contain eight and ten macrocells each, respectively. The 16V8 provides up to 10 dedi-
cated inputs that feed the AND array, whereas the 22V10 provides 12 dedicated inputs. One of the
22V10’s dedicated inputs also serves as a global clock for any flops that are enabled in the macro-
cells. Output enable logic in a 22V10 is evaluated independently for each macrocell via a dedicated
AND term. The 16V8 is somewhat less flexible, because it cannot arbitrarily feed back all macrocell
outputs depending on the device configuration. Additionally, when configured for registered mode
where macrocell flops are usable, two dedicated input pins are lost to clock and output enable func-
tions.
GALs are fairly low-density PLDs by modern standards, but their advantages of low cost and
high speed are derived from their small size. Implementing logic in a GAL follows several basic
steps. First, the logic is represented in either graphical schematic diagram or textual (HDL) form.
This representation is converted into a netlist using a translation or synthesis tool. Finally, the
netlist is fitted into the target device by mapping individual gate functions into the programmable
AND array. Given the fixed AND/OR structure of a GAL, fitting software is designed to perform
logic optimizations and translations to convert arbitrary Boolean expressions into sum-of-product
expressions. The result of the fitting process is a programming image, also called a fuse map, that
defines exactly which connections, or fuses, are to be programmed and which are to be left at their
default state. The programming image also contains other information such as macrocell configura-
tion and other device-specific attributes.
Modern PLD development software allows the back-end GAL synthesis and fitting process to
proceed without manual intervention in most cases. The straightforward logic flow through the pro-
grammable AND array reduces the permutations of how a given Boolean expression can be imple-
mented and results in very predictable logic fitting. An input signal propagates through the pin and
pad structure directly into the AND array, passes through just two gates, and can then either feed a
macrocell flop or drive directly out through an I/O pin. Logic elements within a GAL are close to
each other as a result of the GAL’s small size, which contributes to low internal propagation delays.
These characteristics enable the GAL architecture to deliver very fast timing specifications, because
signals follow deterministic paths with low propagation delays.
GALs are a logic implementation technology with very predictable capabilities. If the desired
logic cannot fit within the GAL, there may not be much that can be done without optimizing the al-
gorithm or partitioning the design across multiple devices. If the logic fits but does not meet timing,
the logic must be optimized, or a faster device must be found. Because of the GAL’s basic fitting
process and architecture, there isn’t the same opportunity of tweaking the device as can be done with
more complex PLDs. This should not be construed as a lack of flexibility on the part of the GAL.
Rather, the GAL does what it says it does, and it is up to the engineer to properly apply the technol-
ogy to solve the problem at hand. It is the simplicity of the GAL architecture that is its greatest
strength.
Lattice Semiconductor’s GAL22LV10D-4 device features a worst-case input-to-output combina-
torial propagation delay of just 4 ns.* This timing makes the part suitable for address decoding on
fast microprocessor interfaces. The same 22V10 part features a 3-ns tCO and up to 250-MHz opera-
tion. The tCO specification is a pin-to-pin metric that includes the propagation delays of the clock
through the input pin and the output signal through the output pin. Internally, the actual flop itself
exhibits a faster tCO that becomes relevant for internal logic feedback paths. Maximum clock
frequency specifications are an interesting aspect of all PLDs and warrant some consideration.
These specifications are best-case numbers usually obtained with minimal logic configurations. They may define

* GAL22LV10D, 22LV10_04, Lattice Semiconductor, 2000, p. 7.



the highest toggle rate of the device’s flops, but synchronous timing analysis dictates that there is
more to fMAX than the flop’s tSU and tCO. Propagation delay of logic and connectivity between flops
is of prime concern. The GAL architecture's deterministic and fast logic feedback paths reduce the
added penalty of internal propagation delays. Lattice’s GAL22LV10D features an internal clock-to-
feedback delay of 2.5 ns, which is the combination of the actual flop’s tCO plus the propagation delay
of the signal back through the AND/OR array. This feedback delay, when combined with the flop’s
3-ns tSU, yields a practical fMAX of 182 MHz when dealing with most normal synchronous logic that
contains feedback paths (e.g., a state machine).
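The practical fMAX arithmetic described here is just a period budget: the clock-to-feedback delay plus the flop's setup time must fit in one clock period. This Python fragment restates the calculation with the GAL22LV10D numbers quoted in the text:

```python
def fmax_mhz(t_clk_to_feedback_ns, t_su_ns):
    """Practical internal fMAX for a registered feedback path: the
    clock-to-feedback delay plus the setup time of the receiving
    flop must fit within one clock period."""
    period_ns = t_clk_to_feedback_ns + t_su_ns
    return 1000.0 / period_ns

# 2.5 ns clock-to-feedback delay + 3 ns setup = 5.5 ns period,
# yielding roughly 182 MHz, as stated in the text.
assert round(fmax_mhz(2.5, 3.0)) == 182
```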

11.3 CPLDS

Complex PLDs, or CPLDs, are the mainstream macrocell-based PLDs in the industry today, provid-
ing logic densities and capabilities well beyond those of a GAL device. GALs are flexible for their
size because of the large programmable AND matrix that defines logical connections between inputs
and outputs. However, this anything-to-anything matrix makes the architecture costly to scale to
higher logic densities. For each macrocell that is added, both matrix dimensions grow as well.
Therefore, the AND matrix grows with the square of the number of I/O terms and macrocells in the
device. CPLD vendors seek to provide a more linear scaling of connectivity resources to macrocells by
implementing a segmented architecture with multiple fixed-size GAL-style logic blocks that are in-
terconnected via a central switch matrix as shown in Fig. 11.5. Like a GAL, CPLDs are typically
manufactured with EEPROM configuration storage, making their function nonvolatile. After pro-
gramming, a CPLD will retain its configuration and be ready for operation when power is applied to
the system.
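The scaling argument can be made concrete with a rough back-of-the-envelope model. All the counts below (products per macrocell, block size, inputs per block) are illustrative assumptions for the sketch, not vendor data; the point is only the quadratic-versus-linear growth:

```python
def monolithic_fuses(macrocells, inputs_per_macrocell=2):
    """Rough model of one GAL-style AND matrix: every product line
    crosses true and complement versions of every input/feedback
    term, so fuse count grows with the square of the macrocell
    count. (8 products per macrocell is an assumed figure.)"""
    terms = macrocells * inputs_per_macrocell   # I/O + feedback columns
    product_lines = macrocells * 8
    return 2 * terms * product_lines            # true + complement columns

def segmented_fuses(macrocells, block_size=18, block_inputs=54):
    """Rough model of a segmented CPLD: fixed-size GAL-style blocks,
    so per-block array cost is constant and total cost scales
    linearly with the number of blocks."""
    blocks = -(-macrocells // block_size)       # ceiling division
    per_block = 2 * block_inputs * (block_size * 8)
    return blocks * per_block

# Quadrupling the macrocell count multiplies the monolithic array
# by ~16 but the segmented arrays by only ~4.
assert monolithic_fuses(288) / monolithic_fuses(72) == 16.0
assert segmented_fuses(288) / segmented_fuses(72) == 4.0
```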
Each individual logic block is similar to a GAL and contains its own programmable AND/OR ar-
ray and macrocells. This approach is scalable, because the programmable AND/OR arrays remain
fixed in size and small enough to fabricate economically. As more macrocells are integrated onto the
same chip, more logic blocks are placed onto the chip instead of increasing the size of individual
logic blocks and bloating the AND/OR arrays. CPLDs of this type are manufactured by companies
including Altera, Cypress, Lattice, and Xilinx.

FIGURE 11.5 Typical CPLD architecture.



Generic user I/O pins are bidirectional and can be configured as inputs, outputs, or both. This is in
contrast to the dedicated power and test pins that are necessary for operation. There are as many po-
tential user I/O pins as there are macrocells, although some CPLDs may be housed in packages that
do not have enough physical pins to connect to all the chip’s I/O sites. Such chips are intended for
applications that are logic limited rather than pad limited.
Because the size of each logic block’s AND array is fixed, the block has a fixed number of possi-
ble inputs. Vendor-supplied fitting software must determine which logical functions are placed into
which blocks and how the switch matrix connects feedback paths and input pins to the appropriate
block. The switch matrix does grow as more logic blocks are added. However, the impact of the
switch matrix's growth is less than what would result with an ever-expanding AND matrix. Each
CPLD family provides a different number of switched input terms to each logic block.
The logic blocks share many characteristics with a GAL, as shown in Fig. 11.6, although addi-
tional flexibility is added in the form of product term sharing. Limiting the number of product terms
in each logic block reduces device complexity and cost. Some vendors provide just five product
terms per macrocell. To balance this limitation, which could impact a CPLD’s usefulness, product
term sharing resources enable one macrocell to borrow terms from neighboring macrocells. This
borrowing usually comes at a small propagation delay penalty but provides necessary flexibility in
handling complex Boolean expressions with many product terms. A logic block’s macrocell contains
a flip-flop and various configuration options such as polarity and clock control. As a result of their
higher logic density, CPLDs contain multiple global clocks that individual macrocells can choose
from, as well as the ability to generate clocks from the logic array itself.
Xilinx is a vendor of CPLD products and manufactures a family known as the XC9500. Logic
blocks, or function blocks in Xilinx’s terminology, each contain 18 macrocells, the outputs of which
feed back into the switch matrix and drive I/O pins as well. XC9500 CPLDs contain multiples of 18
macrocells in densities from 36 to 288 macrocells. Each function block gets 54 input terms from the
switch matrix. These input terms can be any combination of I/O pin inputs and feedback terms from
other function blocks’ macrocells.
Like a GAL, CPLD timing is very predictable because of the deterministic nature of the logic
blocks’ AND arrays and the input term switch matrix. Xilinx’s XC9536XV-3 features a maximum
pin-to-pin propagation delay of 3.5 ns and a tCO of 2.5 ns.* Internal logic can run as fast as 277 MHz
with feedback delays included, although complex Boolean expressions likely reduce this fMAX be-
cause of product term sharing and feedback delays through multiple macrocells.
CPLD fitting software is typically provided by the silicon vendor, because the underlying silicon
architectures are proprietary and not disclosed in sufficient detail for a third party to design the nec-
essary algorithms. These tools accept a netlist from a schematic tool or HDL synthesizer and auto-
matically divide the logic across macrocells and logic blocks. The fitting process is more complex

FIGURE 11.6 CPLD logic block.

* XC9536XV, DS053 (v2.2), Xilinx, August 2001, p. 4.



than for a GAL; not every term within the CPLD can be fed to each macrocell because of the seg-
mented AND array structure. Product term sharing places restrictions on neighboring macrocells
when Boolean expressions exceed the number of product terms directly connected to each macro-
cell. The fitting software first reduces the netlist to a set of Boolean expressions in the form that can
be mapped into the CPLD and then juggles the assignment of macrocells to provide each with its re-
quired product terms. Desired operating frequency influences the placement of logic because of the
delay penalties of sharing product terms across macrocells. These trade-offs occur at such a low
level that human intervention is often impractical.
CPLDs have come to offer flexibility advantages beyond just logic implementation. As digital
systems get more complex, logic IC supply voltages begin to proliferate. At one time, most systems
ran on a single 5-V supply. This was followed by 3.3-V systems, and it is now common to find sys-
tems that operate at multiple voltages such as 3.3 V, 2.5 V, 1.8 V, and 1.5 V. CPLDs invariably find
themselves designed into mixed-voltage environments for the purposes of interface conversion and
bus management. To meet these needs, many CPLDs support more than one I/O voltage standard on
the same chip at the same time. I/O pins are typically divided into banks, and each bank can be inde-
pendently selected for a different I/O voltage.
Most CPLDs are relatively small in logic capacity because of the desire for very high-speed oper-
ation with deterministic timing and fitting characteristics at a reasonable cost. However, some
CPLDs have been developed far beyond the size of typical CPLDs. Cypress Semiconductor’s
Delta39K200 contains 3,072 macrocells with several hundred kilobits of user-configurable RAM.*
The architecture is built around clusters of 128 macrocell logic groups, each of which is similar in
nature to a conventional CPLD. In a similar way that CPLDs add an additional hierarchical connec-
tivity layer on top of multiple GAL-type logic blocks, Cypress has added a layer on top of multiple
CPLD-type blocks. Such large CPLDs may have substantial benefits for certain applications. Be-
yond several hundred macrocells, however, engineers have tended to use larger and more scalable
FPGA technologies.

11.4 FPGAS

CPLDs are well suited to applications involving control logic, basic state machines, and small
groups of read/write registers. These control path applications typically require a small number of
flops. Once a task requires many hundreds or thousands of flops, CPLDs rapidly become impractical
to use. Complex applications that manipulate and parse streams of data often require large quantities
of flops to serve as pipeline registers, temporary data storage registers, wide counters, and large state
machine vectors. Integrated memory blocks are critical to applications that require multiple FIFOs
and data storage buffers. Field programmable gate arrays (FPGAs) directly address these data path
applications.
FPGAs are available in many permutations with varying feature sets. However, their common de-
fining attribute is a fine-grained architecture consisting of an array of small logic cells, each consist-
ing of a flop, a small lookup table (LUT), and some supporting logic to accelerate common functions
such as multiplexing and arithmetic carry terms for adders and counters. Boolean expressions are
evaluated by the LUTs, which are usually implemented as small SRAM arrays. Any function of four
variables, for example, can be implemented in a 16 × 1 SRAM when the four variables serve as the
index into the memory. There are no AND/OR arrays as in a CPLD or GAL. All Boolean functions

* Delta39K ISR CPLD Family, Document #38-03039 Rev. *.C, Cypress Semiconductor, December 2001, p. 1.

are implemented within the logic cells. The cells are arranged on a grid of routing resources that can
make connections between arbitrary cells to build logic paths as shown in Fig. 11.7. Depending on
the FPGA type, special-purpose structures are placed into the array. Most often, these are config-
urable RAM blocks and clock distribution elements. Around the periphery of the chip are the I/O
cells, which commonly contain one or more flops to enable high-performance synchronous inter-
faces. Locating flops within I/O cells improves timing characteristics by minimizing the distance,
and hence the delay, between each flop and its associated pin. Unlike CPLDs, most FPGAs are based
on SRAM technology, making their configurations volatile. A typical FPGA must be reprogrammed
each time power is applied to a system. Major vendors of FPGAs include Actel, Altera, Atmel, Lat-
tice, QuickLogic, and Xilinx.
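The LUT-as-memory idea is easy to model in software. This Python sketch (an illustration added here, not vendor code) precomputes a 16-entry truth table and then evaluates the function by indexing into it with the four input bits, just as a logic cell's 16 × 1 SRAM does:

```python
def make_lut(func):
    """Build a 4-input LUT: precompute a 16x1 truth table so the
    four input bits index directly into the memory, as in an FPGA
    logic cell's SRAM LUT."""
    table = [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
             for i in range(16)]
    def lut(a, b, c, d):
        return table[(a << 3) | (b << 2) | (c << 1) | d]
    return lut

# Any function of four variables fits, e.g. (a AND b) OR (c XOR d).
f = make_lut(lambda a, b, c, d: (a & b) | (c ^ d))
assert f(1, 1, 0, 0) == 1
assert f(0, 0, 1, 1) == 0
assert f(0, 0, 1, 0) == 1
```

Configuring the FPGA amounts to loading the `table` contents; the surrounding logic cell never changes.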
Very high logic densities are achieved by scaling the size of the two-dimensional logic cell array.
The primary limiting factor in FPGA performance becomes routing because of the nondeterministic
nature of a multipath grid interconnect system. Paths between logic cells can take multiple routes,
some of which may provide identical propagation delays. However, routing resources are finite, and
conflicts quickly arise between competing paths for the same routing channels. As with CPLDs,
FPGA vendors provide proprietary software tools to convert a netlist into a final programming im-
age. Depending on the complexity of the design (e.g., speed and density), routing an FPGA can take
a few minutes or many hours. Unlike a CPLD with deterministic interconnection resources, FPGA
timing can vary dramatically, depending on the quality of the logic placement. Large, fast designs re-
quire iterative routing and placement algorithms.

FIGURE 11.7 FPGA logic cell array.



Human intervention can be critical to the successful routing and timing of a complex FPGA de-
sign. Floorplanning is the process by which an engineer manually partitions logic into groups and
then explicitly places these groups into sections of the logic cell array. Manually locating large por-
tions of the logic restricts the routing software to optimizing placement of logic within those bound-
aries and reduces the number of permutations that it must try to achieve a successful result.
Each vendor’s logic cell architecture differs somewhat, but mainly in how support functions such
as multiplexing and arithmetic carry terms are implemented. For the most part, engineers do not
have to consider the minute details of each logic cell structure, because the conversion of logic into
the logic cell is performed by a combination of the HDL synthesis tool and the vendor’s proprietary
mapping software. In extreme situations, wherein a very specific logic implementation is necessary
to squeeze the absolute maximum performance from a specific FPGA, optimizing logic and archi-
tecture for a given logic cell structure may have benefits. Engaging in this level of technology-spe-
cific optimization, however, can be very tricky and lead to a house-of-cards scenario in which
everything is perfectly balanced for a while, and then one new feature is added that upsets the whole
plan. If a design appears to be so aggressive as to require fine-tuned optimization, and faster devices
cannot be obtained, it may be preferable to modify the architecture to enable more mainstream, ab-
stracted design methodologies.
Notwithstanding the preceding comments, there are high-level feature differences among FPGAs
that should be evaluated before choosing a specific device. Of course, it is necessary to pick an
FPGA that has sufficient logic and I/O pins to satisfy the needs of the intended application. But not
all FPGAs are created equal, despite having similar quantities of logic. While the benefits of one
logic structure over another can be debated, the presence or absence of critical design resources can
make implementation of a specific design possible or impossible. These resources are clock distribu-
tion elements, embedded memory, embedded third-party cores, and multifunction I/O cells.
Clock distribution across a synchronous system must be done with minimal skew to achieve ac-
ceptable timing. Each logic cell within an FPGA holds a flop that requires a clock. Therefore, an
FPGA must provide at least one global clock signal distributed to each logic cell with low skew
across the entire device. One clock is insufficient for most large digital systems because of the prolif-
eration of different interfaces, microprocessors, and peripherals. Typical FPGAs provide anywhere
from 4 to 16 global clocks with associated low-skew distribution resources. Most FPGAs do allow
clocks to be routed using the general routing resources that normally carry logic signals. However,
these paths are usually unable to achieve the low skew characteristics of the dedicated clock distribu-
tion network and, consequently, do not enable high clock speeds.
Some FPGAs support a large number of clocks, but with the restriction that not all clocks can be
used simultaneously in the same portion of the chip. This type of restriction reduces the complexity
of clock distribution on the part of the FPGA vendor because, while the entire chip supports a large
number of clocks in total, individual sections of the chip support a smaller number. For example, an
FPGA might support 16 global clocks with the restriction that any one quadrant can support only 8
clocks. This means that there are 16 clocks available, and each quadrant can select half of them for
arbitrary use. Instead of providing 16 clocks to each logic cell, only 8 need be connected, thus sim-
plifying the FPGA circuitry.
Most FPGAs provide phase locked loops (PLLs) or delay locked loops (DLLs) that enable the in-
tentional skewing, division, and multiplication of incoming clock signals. PLLs are partially analog
circuits, whereas DLLs are fully digital circuits. They have significant overlap in the functions that
they can provide in an FPGA, although engineers may debate the merits of one versus the other. The
fundamental advantage of a PLL or DLL within an FPGA is its ability to improve I/O timing (e.g.,
tCO) by effectively removing the propagation delay between the clock input pin and the signal output
pin, also known as deskewing. As shown in Fig. 11.8, the PLL or DLL aligns the incoming clock to
a feedback clock with the same delay as observed at the I/O flops. In doing so, it shifts the incoming

FIGURE 11.8 PLL/DLL clock deskew function within FPGA.

clock so that the causal edge observed by the I/O flops occurs at nearly the same time as when it en-
ters the FPGA’s clock input pin. PLLs and DLLs are discussed in more detail in a later chapter.
Additional circuitry enables some PLLs and DLLs to emit a clock that is related to the input fre-
quency by a programmable ratio. The ability to multiply and divide clocks is a benefit to some sys-
tem designs. An external board-level interface may run at a slower frequency to make circuit
implementation easier, but it may be desired to run the internal FPGA logic as a faster multiple of
that clock for processing performance reasons. Depending on the exact implementation, multiplica-
tion or division can assist with this scheme.
RAM blocks embedded within the logic cell array are a critical feature for many applications.
FIFOs and small buffers figure prominently in a variety of data processing architectures. Without on-
chip RAM, valuable I/O resources would be given up, and speed penalties incurred, in using off-chip
memory devices. To suit a wide range of applications, RAMs need to be highly configurable and flexible. A
typical FPGA’s RAM block is based on a certain bit density and can be used in arbitrary width/depth
configurations as shown in Fig. 11.9 using the example of a 4-kb RAM block. Single- and dual-port
modes are also very important. Many applications, including FIFOs, benefit from a dual-ported

FIGURE 11.9 Configurable FPGA 4 kb RAM block (4,096 × 1, 2,048 × 2, 1,024 × 4, or 512 × 8).



RAM block to enable simultaneous reading and writing of the memory by different logic blocks.
One state machine may be writing data into a RAM, and another may be reading other data out at the
same time. RAM blocks can have synchronous or asynchronous interfaces and may support one or
two clocks in synchronous modes. Supporting two clocks in synchronous modes facilitates dual-
clock FIFO designs for moving data between different clock domains.
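The aspect ratios of such a fixed-size block follow a simple halving pattern that can be enumerated directly; a small Python sketch (illustrative only, using the 4 kb example from the text):

```python
def ram_configurations(total_bits, max_width=8):
    """Enumerate the width/depth aspect ratios of a fixed-size RAM
    block: each doubling of the data width halves the depth, so the
    total bit count stays constant."""
    configs = []
    width = 1
    while width <= max_width:
        configs.append((total_bits // width, width))
        width *= 2
    return configs

# The 4 kb block of Fig. 11.9: 4,096x1 through 512x8.
assert ram_configurations(4096) == [(4096, 1), (2048, 2),
                                    (1024, 4), (512, 8)]
```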
Some FPGAs also allow logic cell LUTs to be used as general RAM in certain configurations. A
four-input LUT can serve as a 16 × 1 RAM if supported by the FPGA architecture. It is more
efficient to use RAM blocks for large memory structures, because the hardware is optimized to pro-
vide a substantial quantity of memory in a small area of silicon. However, LUT-based RAM is bene-
ficial when a design requires many shallow memory structures (e.g., a small FIFO) and all the large
RAM blocks are already used. Along with control logic, 32 four-input LUTs can be used to con-
struct a 16 × 32 FIFO. If a design is memory intensive, it could be wasteful to commit one or more
large RAM blocks for such a small FIFO.
Embedding third-party logic cores is a feature that can be useful for some designs, and not useful
at all for others. A disadvantage of FPGAs is their higher cost per gate than custom ASIC technol-
ogy. The main reason that engineers are willing to pay this cost premium is for the ability to imple-
ment custom logic in a low-risk development process. Some applications involve a mix of custom
and predesigned logic that can be purchased from a third party. Examples of this include buying a
microprocessor design or a standard bus controller (e.g., PCI) and integrating it with custom logic on
the same chip. Ordinarily, the cost per gate of the third-party logic would be the same as that of your
custom logic. On top of that cost is the licensing fee charged by the third party. Some FPGA vendors
have decided that there is sufficient demand for a few standard logic cores to offer specific FPGAs
that embed these cores into the silicon in a fixed configuration. The benefit of doing so is to drop the
per-gate cost of the core to nearly that of a custom ASIC, because the core is hard wired and requires
none of the FPGA’s configuration overhead.
FPGAs with embedded logic cores may cost more to offset the licensing costs of the cores, but
the idea is that the overall cost to the customer will be reduced through the efficiency of the hard-
wired core implementation. Microprocessors, PCI bus controllers, and high-speed serdes compo-
nents are common examples of FPGA embedded cores. Some specific applications may be well
suited to this concept.
I/O cell architecture can have a significant impact on the types of board-level interfaces that the
FPGA can support. The issues revolve around two variables: synchronous functionality and voltage/
current levels. FPGAs support generic I/O cells that can be configured for input-only, output-only,
or bidirectional operation with associated tri-state buffer output enable capability. To achieve the
best I/O timing, flops for all three functions—input, output, and output-enable—should be included
within the I/O cell as shown in Fig. 11.10. The timing improvement obtained by locating these three
flops in the I/O cells is substantial. The alternative would be to use logic cell flops and route paths
from the logic cell array directly to the I/O pin circuitry, increasing the I/O delay times. Each of the
three I/O functions is provided in both registered and unregistered options using multiplexers to
provide complete flexibility in logic implementation.
More advanced bus interfaces run at double data rate speeds, requiring more advanced I/O cell
structures to achieve the necessary timing specifications. Newer FPGAs are available with I/O cells
that specifically support DDR interfaces by incorporating two sets of flops, one for each clock edge
as shown in Fig. 11.11. When configured for DDR mode, each of the three I/O functions is driven by
a pair of flops, and a multiplexer selects the appropriate flop output depending on the phase of the
clock. A DDR interface runs externally to the FPGA on both edges of the clock with a certain width.
Internally, the interface runs at double the external width on only one edge of the same clock fre-
quency. Therefore, the I/O cell serves as a 2:1 multiplexer for outputs and a 1:2 demultiplexer for in-
puts when operating in DDR mode.

FIGURE 11.10 FPGA I/O cell structure.

FIGURE 11.11 FPGA DDR I/O cell structure.



Aside from synchronous functionality, compliance with various I/O voltage and current drive
standards is a key feature for modern, flexible FPGAs. Like CPLDs that support multiple I/O banks,
each of which can drive a different voltage level, FPGAs are usually partitioned into I/O banks
as well, for the same purpose. In contrast with CPLDs, many FPGAs support a wider variety of I/O
standards for greater design flexibility.
Verilog Cheat Sheet
S Winberg and J Taylor

Comments

// One-liner
/* Multiple
   lines */

Numeric Constants

// The 8-bit decimal number 106:
8'b_0110_1010   // Binary
8'o_152         // Octal
8'd_106         // Decimal
8'h_6A          // Hexadecimal
"j"             // ASCII

78'bZ           // 78-bit high-impedance

Constants that are too short are padded with zeros on the left; constants that are too long are truncated from the left.

Nets and Variables

wire [3:0]w;           // Assign outside always blocks
reg  [1:7]r;           // Assign inside always blocks
reg  [7:0]mem[31:0];

integer j;             // Compile-time variable
genvar  k;             // Generate variable

Parameters

parameter  N     = 8;
localparam State = 2'd3;

Assignments

assign Output = A * B;
assign {C, D} = {D[5:2], C[1:9], E};

Operators

// These are in order of precedence...
// Select
A[N]  A[N:M]
// Reduction
&A  ~&A  |A  ~|A  ^A  ~^A
// Complement
!A  ~A
// Unary
+A  -A
// Concatenate
{A, ..., B}
// Replicate
{N{A}}
// Arithmetic
A*B  A/B  A%B
A+B  A-B
// Shift
A<<B  A>>B
// Relational
A>B  A<B  A>=B  A<=B
A==B  A!=B
// Bit-wise
A&B
A^B  A~^B
A|B
// Logical
A&&B
A||B
// Conditional
A ? B : C

Module

module MyModule
  #(parameter N = 8)   // Optional parameter
  (input  Reset, Clk,
   output [N-1:0]Output);
  // Module implementation
endmodule

Module Instantiation

// Override default parameter: setting N = 13
MyModule #(13) MyModule1(Reset, Clk, Result);

Case

always @(*) begin
  case(Mux)
    2'd0: A = 8'd9;
    2'd1,
    2'd3: A = 8'd103;
    2'd2: A = 8'd2;
    default:;
  endcase
end

always @(*) begin
  casex(Decoded)
    4'b1xxx: Encoded = 2'd0;
    4'b01xx: Encoded = 2'd1;
    4'b001x: Encoded = 2'd2;
    4'b0001: Encoded = 2'd3;
    default: Encoded = 2'd0;
  endcase
end

Synchronous

always @(posedge Clk) begin
  if(Reset) B <= 0;
  else      B <= B + 1'b1;
end

Loop

always @(*) begin
  Count = 0;
  for(j = 0; j < 8; j = j+1)
    Count = Count + Input[j];
end

Function

function [6:0]F;
  input [3:0]A;
  input [2:0]B;
  begin
    F = {A+1'b1, B+2'd2};
  end
endfunction

Generate

genvar j;
wire [12:0]Output[19:0];

generate
  for(j = 0; j < 20; j = j+1)
  begin: Gen_Modules
    MyModule #(13) MyModule_Instance(
      Reset, Clk,
      Output[j]
    );
  end
endgenerate

State Machine

reg [1:0]State;
localparam Start = 2'b00;
localparam Idle  = 2'b01;
localparam Work  = 2'b11;
localparam Done  = 2'b10;

reg tReset;

always @(posedge Clk) begin
  tReset <= Reset;

  if(tReset) begin
    State <= Start;

  end else begin
    case(State)
      Start: begin
        State <= Idle;
      end
      Idle: begin
        State <= Work;
      end
      Work: begin
        State <= Done;
      end
      Done: begin
        State <= Idle;
      end
      default:;
    endcase
  end
end
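Note that the Synchronous example above uses non-blocking assignment (<=) while the combinational Loop example uses blocking assignment (=). A short sketch of why the distinction matters (register names are illustrative, not from the card):

// Non-blocking (<=): all right-hand sides are sampled before
// any register is updated, so these two statements swap a and b
// on every clock edge:
always @(posedge Clk) begin
  a <= b;
  b <= a;   // uses the old value of a
end

// The same two statements written with blocking assignment (=)
// would execute in order, overwriting a before b reads it, and
// both registers would end up holding the old value of b.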
Summary of Synthesisable Verilog 2001

Numbers and constants

Example: 4-bit constant 11 in binary, hex and decimal:
    4'b1011 == 4'hb == 4'd11

Bit concatenation using {}:
    {2'b10,2'b11} == 4'b1011

Note that numbers are unsigned by default.

Constants are declared using parameter, viz.:
    parameter foo = 42

Operators

Arithmetic: the usual + and - work for add and subtract. Multiply (*), divide (/) and modulus (%) are provided, but remember that they may generate substantial hardware which could be quite slow.

Shift left (<<) and shift right (>>) operators are available. Some synthesis systems will only shift by a constant amount (which is trivial since it involves no logic).

Relational operators: equal (==), not-equal (!=) and the usual < <= > >=

Bitwise operators: and (&), or (|), xor (^), not (~)

Logical operators (where a multi-bit value is false if zero, true otherwise): and (&&), or (||), not (!)

Bit reduction unary operators: and (&), or (|), xor (^)
Example, for a 3-bit vector a:
    &a == a[0] & a[1] & a[2]
and
    |a == a[0] | a[1] | a[2]

Conditional operator ? is used to multiplex a result.
Example: (a==3'd3) ? formula1 : formula0
For a single-bit formula, this is equivalent to:
    ((a==3'd3) && formula1)
    || ((a!=3'd3) && formula0)

Registers and wires

Declaring a 4-bit wire with index starting at 0:
    wire [3:0] w;

Declaring an 8-bit register:
    reg [7:0] r;

Declaring a 32-element memory 8 bits wide:
    reg [7:0] mem [0:31];

Bit extract example:
    r[5:2]
returns the 4 bits between bit positions 2 to 5 inclusive.

Assignment

Assignment to wires uses the assign primitive outside an always block, viz.:
    assign mywire = a & b;

This is called continuous assignment because mywire is continually updated as a and b change (i.e. it is all combinational logic).

Registers are assigned to inside an always block which specifies where the clock comes from, viz.:
    always @(posedge clock)
        r <= r+1;

The <= assignment operator is non-blocking and is performed on every positive edge of clock. Note that if you have a whole load of non-blocking assignments then they are all updated in parallel.

Adding an asynchronous reset:
    always @(posedge clock or posedge reset)
        if(reset)
            r <= 0;
        else
            r <= r+1;

Note that this will be synthesised to an asynchronous (i.e. independent of the clock) reset where the reset is connected directly to the clear input of the DFF.

The blocking assignment operator (=) is also used inside an always block but causes assignments to be performed as if in sequential order. This tends to result in slower circuits, so we do not use it for synthesised circuits.

Case and if statements

case and if statements are used inside an always block to conditionally update state.

Example:
    always @(posedge clock)
        if(add1 && add2)  r <= r+3;
        else if(add2)     r <= r+2;
        else if(add1)     r <= r+1;

Note that we don't need to specify what happens when add1 and add2 are both false since the default behaviour is that r will not be updated.

Equivalent function using a case statement:
    always @(posedge clock)
        case({add2,add1})
            2'b11  : r <= r+3;
            2'b10  : r <= r+2;
            2'b01  : r <= r+1;
            default: r <= r;
        endcase

And using the conditional operator (?):
    always @(posedge clock)
        r <= (add1 && add2) ? r+3 :
             add2           ? r+2 :
             add1           ? r+1 : r;

Which, because it is a contrived example, can be shortened to:
    always @(posedge clock)
        r <= r + {add2,add1};

Note that the following would not work:
    always @(posedge clock) begin
        if(add1) r <= r + 1;
        if(add2) r <= r + 2;
    end

The problem is that the non-blocking assignments must happen in parallel, so if add1==add2==1 then we are asking for r to be assigned r+1 and r+2 simultaneously, which is ambiguous.

Module declarations

Modules pass inputs and outputs as wires only. If an output is also a register then only the output of that register leaves the module as wires.

Example:
    module simpleClockedALU(
        input clock,
        input [1:0] func,
        input [3:0] a,b,
        output reg [3:0] result);
        always @(posedge clock)
            case(func)
                2'd0   : result <= a + b;
                2'd1   : result <= a - b;
                2'd2   : result <= a & b;
                default: result <= a ^ b;
            endcase
    endmodule

Example in pre-2001 Verilog:
    module simpleClockedALU(
        clock, func, a, b, result);
        input clock;
        input [1:0] func;
        input [3:0] a,b;
        output [3:0] result;
        reg [3:0] result;
        always @(posedge clock)
            case(func)
                2'd0   : result <= a + b;
                2'd1   : result <= a - b;
                2'd2   : result <= a & b;
                default: result <= a ^ b;
            endcase
    endmodule

Instantiating the above module could be done as follows:
    wire clk;
    wire [3:0] data0,data1,sum;

    simpleClockedALU myFourBitAdder(
        .clock(clk),
        .func(0),    // constant function
        .a(data0),
        .b(data1),
        .result(sum));

Notes:
• myFourBitAdder is the name of this instance of the hardware
• the .clock(clk) notation refers to .port_name(your_name), which ensures that values are wired to the right place
• in this instance the function input is zero, so the synthesis system is likely to simplify the implementation of this instance so that it is only capable of performing an addition (the zero case)

Simulation

Example simulation following on from the above instantiation of simpleClockedALU:
    reg clk;
    reg [7:0] vals;
    assign data0=vals[3:0];
    assign data1=vals[7:4];

    // oscillate clock every 10 simulation units
    always #10 clk <= !clk;

    // initialise values
    initial #0 begin
        clk = 0;
        vals=0;
        // finish after 200 simulation units
        #200 $finish;
    end

    // monitor results
    always @(negedge clk)
        $display("%d + %d = %d",data0,data1,sum);

Simon Moore
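The simulation fragment above is shown outside any module. A self-contained testbench following the same pattern might look like the sketch below; the wrapper name tb and the stimulus value are illustrative assumptions, not part of the original summary:

    // Hypothetical testbench wrapping the instantiation and
    // simulation code from the summary above.
    module tb;
        reg clk;
        reg [7:0] vals;
        wire [3:0] data0 = vals[3:0];
        wire [3:0] data1 = vals[7:4];
        wire [3:0] sum;

        simpleClockedALU myFourBitAdder(
            .clock(clk), .func(2'd0),   // constant function: add
            .a(data0), .b(data1), .result(sum));

        always #10 clk <= !clk;         // oscillate clock

        initial begin
            clk  = 0;
            vals = 8'h32;               // illustrative stimulus: a=2, b=3
            #200 $finish;               // finish after 200 units
        end

        always @(negedge clk)
            $display("%d + %d = %d", data0, data1, sum);
    endmodule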
