FPGA Embedded Processors: Revealing True System Performance
Bryan H. Fletcher
Technical Program Manager
Memec
San Diego, California
www.memec.com
[email protected]
1. Abstract
Embedding a processor inside an FPGA has many advantages. Specific peripherals can be chosen based
on the application, with unique user-designed peripherals being easily attached. A variety of memory
controllers enhance the FPGA embedded processor system’s interface capabilities.
FPGA embedded processors use general-purpose FPGA logic to construct internal memory, processor
busses, internal peripherals, and external peripheral controllers (including external memory controllers).
Soft processors are built from general-purpose FPGA logic as well.
As more pieces (busses, memory, memory controllers, peripherals, and peripheral controllers) are added to
the embedded processor system, the system becomes increasingly powerful and useful. However, these
additions consume FPGA resources, reducing performance and increasing the cost of the embedded system.
Likewise, large banks of external memory can be connected to the FPGA and accessed by the embedded
processor system using included memory controllers. Unfortunately, the latency to access this external
memory can have a significant, negative impact on performance.
FPGA manufacturers often publish embedded processor performance benchmarks. Like all other
companies running benchmarks, these FPGA manufacturers purposely construct and use an FPGA
embedded processor system that performs the best for each specific benchmark. This means that the FPGA
processor system constructed to run the benchmark has very few peripherals and runs exclusively using
internal memory. The embedded designer must understand how a real-world design’s specific peripheral
set and memory architecture will affect the system performance. However, no easy formula or chart exists
showing how to compare the performance and cost for different memory strategies and peripheral sets.
This paper presents several case studies that examine the effects of various embedded processor memory
strategies and peripheral sets. Comparing the benchmark system to a real-world system, the study
examines techniques for optimizing the performance and cost in an FPGA embedded processor system.
The most primitive FPGA building block is called either a Logic Cell (LC) by Xilinx or a Logic Element
(LE) by Altera. In either case, this building block consists of a look-up table (LUT) for logical functions
and a flip-flop for storage. In addition to the LC/LE block, FPGAs also contain memory, clock
management, input/output (I/O), and multiplication blocks. For the purposes of this study, LC/LE
consumption is used in determining system cost.
A “soft” processor is built using the FPGA’s general-purpose logic. The soft processor is typically
described in a Hardware Description Language (HDL) or netlist. Unlike the hard processor, a soft
processor must be synthesized and fit into the FPGA fabric.
In both soft and hard processor systems, the local memory, processor busses, internal peripherals, peripheral
controllers, and memory controllers must be built from the FPGA’s general-purpose logic.
2.2 Advantages
FPGA embedded processors offer four principal advantages:
1) customization
2) obsolescence mitigation
3) component and cost reduction
4) hardware acceleration
2.2.1 Customization
The designer of an FPGA embedded processor system has complete flexibility to select any combination of
peripherals and controllers. In fact, the designer can invent new, unique peripherals that can be connected
directly to the processor’s bus. If a designer has a non-standard requirement for a peripheral set, this can be
met easily with an FPGA embedded processor system. For example, a designer would not easily find an
off-the-shelf processor with ten UARTs. However, in an FPGA, this configuration is very easily
accomplished.
2.3 Disadvantages
The FPGA embedded processor system is not without disadvantages. Unlike an off-the-shelf processor, the
hardware platform for the FPGA embedded processor must be designed. The embedded designer becomes
the hardware processor system designer when an FPGA solution is selected.
Because of the integration of the hardware and software platform design, the design tools are more
complex. The increased tool complexity and design methodology requires more attention from the
embedded designer.
Since FPGA embedded processor software design is relatively new compared to software design for
standard processors, the software design tools are likewise relatively immature, although workable.
Significant progress in this area has been made by both Altera and Xilinx. Within the next year, this
disadvantage should be further diminished, if not eliminated.
Device cost is another aspect to consider. If a standard, off-the-shelf processor can do the job, that
processor will be less expensive in a head-to-head comparison with the FPGA capable of an equivalent
processor design. However, if a large FPGA is already in the system, using its spare gates or an available
hard processor makes the incremental cost of the embedded processor system inconsequential.
The achieved DMIPs reported by the manufacturers are based on several things that maximize the
benchmark results. Some of these factors include the following:
• Optimal compiler optimization level
• Fastest available device family (unless otherwise noted)
• Fastest speed grade in that device family
• Executing from fastest, lowest latency memory, typically on-chip
• Optimization of processor’s parameterizable features
The available embedded processors with the manufacturers’ quoted maximum frequency and DMIPs are
summarized here.
2.5.1 Altera3
Table 1 – Altera Embedded Processors and Performance

  Processor    Processor Type   Device Family Used   Speed (MHz) Achieved   DMIPs Achieved
  ARM922T™     hard             Excalibur            200                    210
  NIOS®        soft             Stratix-II           180                    Not Reported
  Nios® II     soft             Stratix-II           Not Reported           200
  Nios® II     soft             Cyclone-II           Not Reported           100
2.5.2 Xilinx4
Table 2 – Xilinx Embedded Processors and Performance

  Processor      Processor Type   Device Family Used   Speed (MHz) Achieved   DMIPs Achieved
  PowerPC™ 405   hard             Virtex-4             450                    680
  MicroBlaze     soft             Virtex-II Pro        150                    123
  MicroBlaze     soft             Spartan-3            85                     65
Achieved real-world performance often falls short because the designer has not applied all of the
performance-enhancing techniques available to FPGA embedded processors. The manufacturers obviously
know what must be done to get the most out of their chips, and they take full advantage of every possible
enhancement when benchmarking. Embedded designers, who are familiar with standard microprocessor
performance optimization, need to learn which software optimization techniques apply to FPGA embedded
processors. Designers must also learn performance-enhancing techniques that apply specifically to FPGAs.
The design landscape is certainly more complicated with an FPGA embedded processor. The incredible
advantages gained with this type of design are not without tradeoffs. Specifically, the increased design
complexity is overwhelming to many, including experienced embedded or FPGA designers. The
manufacturers and their partners put significant effort into training and providing support to designers
experimenting with this technology. Taking advantage of a local field applications engineer is essential to
FPGA embedded processor design success.
As an introduction to this type of design, a few performance-enhancing techniques are highlighted. Specific
references below are based on research of Xilinx FPGAs and tools, although Altera is likely to have similar
features.
The GCC compiler used with these processors supports several optimization levels (a brief
example follows this list):
Level 0: No optimization.
Level 1: First-level optimization. The compiler performs the most common optimizations without
         requiring significant compilation time.
Level 2: Second-level optimization. This level activates nearly all optimizations that do not
         involve a speed-space tradeoff, so the executable should not increase in size. The
         compiler does not perform loop unrolling, function inlining, or strict aliasing
         optimizations. This is the standard optimization level used for program deployment.
Level 3: Highest optimization level. This level adds more expensive options, including those that
         increase code size. In some cases this optimization level actually produces code that is
         less efficient than the O2 level, and as such should be used with caution.
Size: Size optimization. The objective is to produce the smallest possible code size.
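To make the speed-space tradeoff concrete, consider the small routine below. This is an
illustrative sketch only; the function name and loop bounds are invented for this example, and the
actual result depends on the compiler version and target.

    /* At Level 2 this loop is typically kept rolled so code size stays
       flat; Level 3 may fully unroll it, trading code size for speed. */
    int dot8(const int *a, const int *b)
    {
        int acc = 0;
        for (int i = 0; i < 8; i++)
            acc += a[i] * b[i];
        return acc;
    }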
3.1.1.3 Assembly
Assembly, including in-line assembly, is supported by GCC. As with any microprocessor, assembly
becomes very useful in fully optimizing time critical functions. Be aware, however, that some compilers
will not optimize the remaining C code in a file if in-line assembly is also used in that file. Also, assembly
code does not enjoy the code portability advantages of C.
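As a minimal sketch of what in-line assembly looks like under GCC, the fragment below wraps a
single MicroBlaze instruction. The wrapper function itself is hypothetical, invented for this
illustration; "addk" (add, keep carry) is a real MicroBlaze instruction.

    #include <stdint.h>

    /* GCC extended in-line assembly: register constraints ("r") let the
       compiler allocate the registers around the hand-written instruction. */
    static inline int32_t add_asm(int32_t a, int32_t b)
    {
        int32_t r;
        __asm__ ("addk %0, %1, %2"
                 : "=r"(r)             /* output operand */
                 : "r"(a), "r"(b));    /* input operands */
        return r;
    }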
3.1.1.4 Miscellaneous
Many other code-related optimizations can and should be considered when optimizing an FPGA embedded
processor, including:
• locality of reference (see the sketch following this list)
• code profiling
• careful definition of variables (Xilinx provides a Basic Types definition)
• strategic use of small data sections, whose accesses can be twice as fast as accesses to large
data sections
• judicious use of function calls to minimize pushing/popping of stack frames
• loop length (especially where cache is involved)
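The locality-of-reference point is illustrated below with a hypothetical matrix-sum routine:
traversing the array in the order it is laid out in memory keeps accesses sequential, which
benefits both cache lines and burst-oriented external memory controllers.

    #define N 64

    /* Row-major traversal touches addresses sequentially; swapping the
       two loops would stride through memory and defeat locality. */
    int sum_row_major(int m[N][N])
    {
        int sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }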
Xilinx FPGA BRAM quantities differ by device. For example, the 1.5-million-gate Spartan-3 device
(XC3S1500) has a total capacity of 64 KB, whereas the 400,000-gate Spartan-3 device (XC3S400) has half
as much at 32 KB. An embedded designer using FPGAs should refer to the device family datasheet to
review a specific chip's BRAM capacity.
If the designer’s program fits entirely within local memory, then the designer achieves optimal memory
performance. However, many embedded programs exceed this capacity.
In addition to the memory access time, the peripheral bus also incurs some latency. In MicroBlaze, the
memory controllers are attached to the On-chip Peripheral Bus (OPB). For example, the OPB SDRAM
controller requires a four to six cycle latency for a write and eight to ten cycle latency for a read (depending
on bus clock frequency)7. The worst possible program performance is achieved by having the entire
program reside in external memory. Since optimizing execution speed is a typical goal, an entire program
should rarely, if ever, be targeted solely at external memory.
The MicroBlaze cache architecture differs from the PowerPC's because the cache memory is not
dedicated silicon. The instruction and data cache controllers are selectable parameters in the MicroBlaze
configuration. When these controllers are included, the cache memory is built from BRAM. Therefore,
enabling the cache consumes BRAM that could have otherwise been used for local memory. For the same
storage size, cache consumes more BRAM than local memory because the cache architecture requires
address tag storage. Additionally, enabling the cache consumes general-purpose logic to build the
cache controllers.
For example, an experiment in Spartan-3 enables 8 KB of data cache and designates 32 MB of external
memory to be cached. This cache requires 12 address tag bits (a 32 MB range spans 25 address bits, and
an 8 KB cache covers 13 of them). This configuration consumes 124 logic cells and 6 BRAMs. Only
4 BRAMs are required in Spartan-3 to achieve 8 KB of local memory. In this case, cache is 50% more
expensive in terms of BRAM usage than local memory; the 2 extra BRAMs store the address tag bits.
If 1 MB of external memory is cached with an 8 KB cache, then the address tag bits can be reduced
to 7 (20 address bits minus 13). This configuration requires only 5 BRAMs rather than 6 (4 BRAMs for
the cache and 1 BRAM for the tags), which is still 25% more than if the BRAMs were used as local memory.
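A quick integer-only sanity check of the tag-bit counts above, written as a sketch with a
hypothetical helper name and assuming power-of-two sizes:

    /* tag bits = log2(cached range) - log2(cache size), both powers of two */
    static int log2u(unsigned long x)
    {
        int n = 0;
        while (x >>= 1)      /* count shifts until the set bit is gone */
            n++;
        return n;
    }

    /* log2u(32UL << 20) - log2u(8UL << 10) == 25 - 13 == 12 tag bits
       log2u( 1UL << 20) - log2u(8UL << 10) == 20 - 13 ==  7 tag bits */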
Additionally, the achievable system frequency may be reduced when the cache is enabled. In one example,
the system without any cache is capable of running at 75 MHz; the system with cache is only capable of
running at 60 MHz. Enabling the cache controller adds logic and complexity to the design, decreasing the
achieved system frequency during FPGA place and route. Therefore, in addition to consuming FPGA
BRAM resources that may have otherwise been used to increase local memory, the cache implementation
may also cause the overall system frequency to decrease.
Considering these cautions, enabling the MicroBlaze cache, especially the instruction cache, may still
improve performance, even when the system must run at a lower frequency. As will be shown later in the
DMIPs section (see Section 4.1), a 60 MHz system with instruction cache enabled has a 150% advantage
over a 75 MHz system without instruction cache (both systems store the entire program in external
memory). When both instruction and data caches are enabled, the 60 MHz system outperforms the
75 MHz system by 308%.
This example is not the most practical, since the entire DMIPs program fits in the cache. A more
realistic experiment is to use an application that is larger than the cache. Another precaution concerns
applications that frequently jump beyond the size of the cache: repeated cache misses degrade
performance, sometimes making cached external memory slower than external memory without a cache.
Given these warnings, enabling the cache is always worth an experiment to determine whether it improves
performance for a given application.
Caching external memory is an excellent choice for the PowerPC. Caching the external memory in
MicroBlaze definitely improves results, but an alternative method, presented next, may provide
optimal results.
For MicroBlaze, perhaps the optimal memory configuration is to wisely partition the program code,
maximizing the system frequency and local memory size. Critical data, instructions, and stack are placed in
local memory. Data cache is not used, allowing for a larger local memory bank. If the local memory is not
large enough to contain all instructions, the designer should consider enabling the instruction cache for the
address range in external memory used for instructions.
By not consuming BRAM in a data cache, the local memory can be enlarged to hold more of the program.
An instruction cache for the instructions assigned to external memory can be very effective.
Experimentation or profiling shows which code items are most heavily accessed; assigning these items to
local memory provides a greater performance improvement than caching.
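One common way to express such a partition in source code is sketched below. The section names
.local_text and .local_data are hypothetical; in a real project, the linker script maps these
output sections onto BRAM-based local memory, and the names and mechanics vary by toolchain.

    /* Pin a time-critical routine and its table into dedicated sections
       that the linker script places in local (BRAM) memory. */
    int coeff_table[256] __attribute__((section(".local_data")));

    __attribute__((section(".local_text")))
    int filter_sample(int x)
    {
        return (coeff_table[x & 0xFF] * x) >> 8;   /* hot path stays in BRAM */
    }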
Express Logic’s Thread-Metric™ test suite is an example of how partitioning a small piece of code in local
memory can have a significant performance improvement (see Section 4.2). One function in the Thread-
Metric Basic Processing test is identified as time critical. The function’s data section (consisting of 19% of
the total code size) is allocated to local memory, and the instruction cache is enabled. The 60 MHz, cached
and partitioned-program system achieves performance that is 560% better than running a non-cached, non-
partitioned 75 MHz system using only external memory.
However, the 75 MHz system shows even more improvement with code partitioning. If the time critical
function’s data and text sections (22% of the total code size) are assigned to local memory on the 75 MHz
system, a 710% improvement is realized, even with no instruction cache for the remainder of the code
assigned to external memory.
In this one case, the optimal memory configuration is one that maximizes local memory and system
frequency without cache. In other systems where the critical code is not so easily pinpointed, a cached
system may perform better. Designers should experiment with both methods to determine what is optimal
for their design.
If a design does not store and run any instructions in external memory, do not connect the instruction
side of the peripheral bus. Connecting both the instruction and data sides of the processor to a single
bus creates a multi-master system, which requires an arbiter. Optimal bus performance is achieved when a
single master resides on the bus.
Debug logic requires resources in the FPGA and may be the hardware bottleneck. When a design is
completely debugged, the debug logic can be removed from the production system, potentially increasing
the system’s performance. For example, removing a MicroBlaze Debug Module (MDM) with an FSL
acceleration channel saves 950 LCs. In MicroBlaze systems with the cache enabled, the debug logic will
typically be the critical path that slows down the entire design.
The Xilinx OPB External Memory Controller (EMC) used to connect SRAM and Flash memories creates a
32-bit address bus even if 32 bits are not required to address the memory. Xilinx also provides a bus-
trimming peripheral, which removes the unused address bits. When using this memory controller, the bus-
trimmer should always be used to eliminate the unused addresses. This frees up routing and pins that would
have otherwise been used. The Xilinx Base System Builder (BSB) now does this automatically.
Xilinx provides several GPIO peripherals. The latest GPIO peripheral version (v3.01.a) has excellent
capabilities, including dual-channel support, bi-directionality, and interrupt capability. However, these
features also require more resources, which affect timing. If a simple GPIO is all that the design requires,
the designer should use a more primitive version of the GPIO, or at least ensure that the unused features in
the enhanced GPIO are turned off. In the optimized examples in this study, GPIO v1.00.a is used, which is
much less sophisticated, much faster, and approximately half the size (304 LCs for 7 GPIO v1.00.a
peripherals as compared to 602 LCs for v3.01.a).
Some peripherals require additional constraints to ensure proper operation. For example, both the DDR
SDRAM controller and the 10/100 Ethernet MAC require additional constraints to guarantee that the tools
create correct and optimized logic. The designer must read the datasheet for each peripheral and follow the
recommended design guidelines.
Both MicroBlaze and Virtex-4 PowerPC include very low-latency access points into the processor, which
are ideal for connecting custom co-processing hardware. Virtex-4 introduces the Auxiliary Processing Unit
(APU) for the PowerPC8. The APU provides a direct connection from the PowerPC to co-processing
hardware. In MicroBlaze, the low-latency interface is called the Fast Simplex Link (FSL) bus. The FSL
bus contains multiple channels of dedicated, uni-directional, 32-bit interfaces. Because the FSL channels
are dedicated, no arbitration or bus mastering is required. This allows an extremely fast interface to the
processor9.
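A minimal sketch of how MicroBlaze C code might exercise an FSL-attached co-processor follows. The
putfsl/getfsl macros come from Xilinx's mb_interface.h header in EDK; the use of channel 0 and the
one-word-in, one-word-out protocol are assumptions made for this illustration.

    #include "mb_interface.h"

    /* Send one operand to the co-processor on FSL channel 0 and block
       until its result word comes back on the same channel. */
    unsigned int fsl_transform(unsigned int operand)
    {
        unsigned int result;
        putfsl(operand, 0);   /* blocking write to FSL channel 0 */
        getfsl(result, 0);    /* blocking read of the result     */
        return result;
    }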
Converting a software bottleneck into hardware may seem like a very difficult task. Traditionally, a
software designer identifies the bottleneck, after which the algorithm is transitioned to an FPGA designer
who writes VHDL or Verilog code to create the hardware co-processor. Fortunately, this process has been
greatly simplified by tools that are capable of generating FPGA hardware from C code. One such tool is
CoDeveloper™ from Impulse Accelerated Technologies. This tool allows one designer who is familiar
with C to port a software bottleneck into a custom piece of co-processing FPGA hardware using
CoDeveloper’s Impulse C libraries.
Some examples of algorithms that could be targeted for hardware-based co-processors include:
• Inverse Discrete Cosine Transform, used in JPEG decoding
• Fast Fourier Transform
• MP3 decode
• Triple-DES and AES encryption
• Matrix manipulation
Any operation that is algorithmic, mathematical, or parallel is a good candidate for a hardware co-
processor10. FPGA logic consumption is traded for performance. The advantages can be enormous,
improving performance by tens or hundreds of times.
The value of any benchmark can be questioned. Therefore, to provide a more objective perspective, several
applications are examined: Dhrystone MIPs, Express Logic’s Thread-Metric RTOS benchmark, and a real-
world application (Triple-DES encryption).
The MicroBlaze system used for the Dhrystone benchmark is configured as follows:
• MicroBlaze
• hardware divider and barrel shifter included
• no data or instruction cache
• data-side peripheral bus (no instruction side)
• 8 KB instruction local memory
• 16 KB data local memory
• no debug hardware
• RS232 UART
• timer
Based on this processor system running in the slower, -4 speed grade on the Memec Spartan-3 MB board,
the achieved results are 63.3 DMIPs. The MicroBlaze system runs at a maximum frequency of 81.8 MHz.
Dividing the DMIPs by the operating frequency gives 0.773 DMIPs/MHz, which is useful when
extrapolating performance numbers for a similar processor system running at a different frequency. This
system consumes 2322 logic cells. Based on Xilinx press release numbers of $20 for an XC3S150011, this
system cost can be estimated at $1.74.
When this same system is built using a -5 speed grade Spartan-3 device, the achieved frequency is 90.8
MHz. Since a development board with a -5 device is not available, the DMIPs performance is extrapolated
from the numbers achieved during the -4 experiment: at 0.7738 DMIPs/MHz, this -5 system is capable of
achieving 70.3 DMIPs (0.7738 DMIPs/MHz × 90.8 MHz). This result is better than the published Xilinx
benchmark of 65 DMIPs!
For those who are inexperienced with FPGA embedded processors, and Xilinx MicroBlaze specifically, the
default BSB system may very well be the typical system that a designer creates. The design will build and
run as expected, but the performance degradation is severe. The operating frequency achieved is only 62.5
MHz. The result is 4.505 DMIPs, less than one-fourteenth of the Xilinx published DMIPs number. This
design consumes 7732 logic cells, which, based on the chip price quoted previously, equates to a system
cost of $5.81.
The hardware system is optimized as described earlier, with results achieving 75 MHz for the non-cached
MicroBlaze system and 60 MHz for the cached MicroBlaze system.
Although the code is small enough to fit entirely inside local memory, several alternative memory
structures are investigated. The following parameters are adjusted to determine their effects on
performance:
• Location of instructions
• Location of small data
• Location of large data
• Location of stack
• Data cache
• Instruction cache
• Compiler optimization level
Optimizing the hardware system to run at 75 MHz with a hardware divider and barrel shifter increases LC
usage by 986, at a cost of $0.74. This increases the performance to 6.846 DMIPs, a 52% improvement over
the unoptimized system. Changing the compiler optimization level to Level 3 further increases the
performance to 7.139 DMIPs, an additional increase of 4.3%.
Enabling the instruction and data caches adds an additional $0.27 of LC cost but increases the performance
to 29.130 DMIPs.
Eliminating the data cache saves $0.20 and allows the local memory to be increased by 16 KB. With the
instruction cache enabled and assigning the stack and small data sections to local memory, 33.178 DMIPs
are achieved. If all data is assigned to local memory, this performance increases to 47.811 DMIPs.
If both instruction and data cache are eliminated, and local memory is used for stack, small data sections,
and instructions, the results are 41.483 DMIPs. When local memory is used for the entire program, the
maximum performance for this hardware system is achieved at 59.785 DMIPs.
For this real-world system, the best results are within 8% of the manufacturer’s published benchmark. Even
when running cached external instruction memory with local memory data storage, the system is capable of
running at 74% of the manufacturer’s benchmark.
With the program running from SRAM, the results are 23823 events, an improvement of 21%, attributable
solely to the increase in operating frequency from 62.5 MHz to 75 MHz. Next, the stack and all data
sections are moved to local memory. The achieved results are 29808 events, an additional performance
improvement of 25%.
With the entire program in SRAM and instruction cache enabled (system running at 60 MHz), the results
are 64006 events. This is a significant improvement of 225% over the original, unoptimized results. When
the critical function’s data section is moved to local memory with instruction cache enabled, the
performance is 157512 events.
By removing the instruction cache, the system speed is boosted to 75 MHz. The critical function’s data and
text sections are roughly one-fifth the total program size. When the critical function’s data and text are
assigned to local memory, the achieved results are 192867 events. If the entire program runs from local
memory, the performance realized is maximized at 198787 events.
For this one example, only 3% performance is lost in the program-partitioned experiment running at 75
MHz. When a critical function can be identified and assigned to permanently reside in local memory,
caching is not necessary, and the achieved results are nearly as good as if the entire program was running
from local memory. These results are summarized in Table 4.
Table 4 – Thread-Metric Basic Processing Results

  System configuration                                      Frequency   Events
  Initial optimized system, running from SRAM               75 MHz      23823
  Stack and all data moved to local memory                  75 MHz      29808
  Running from SRAM, instruction cache enabled              60 MHz      64006
  Critical function's data moved to local memory            60 MHz      157512
  Instruction cache removed; critical function's
    instructions also moved to local memory                 75 MHz      192867
  Entire program moved to local memory                      75 MHz      198787
A custom instruction is defined to exercise the FSL channels. The MicroBlaze application is modified to
use this newly defined instruction, which takes advantage of the co-processor, and the hardware is
rebuilt. The complexity of the co-processor results in a lower system frequency of 24 MHz. However, the
overall performance gain of the system is remarkable, with a single encryption cycle taking only 6.9 ms,
a 36-fold increase in performance!
Upon further review, more sophisticated changes are made to the Impulse C version of the Triple-DES
algorithm. Taking advantage of C programming techniques such as array splitting, loop unrolling, and
pipelining, further optimization can be accomplished. Implementing the maximum optimization in the
Impulse C code, the final system performance is 0.59 ms, an incredible boost of 425 times over the original
software-only implementation!14
Although more hardware is consumed in the FPGA, the performance enhancements are significant. The
ability to create custom co-processing units specific to a designer’s application makes the FPGA embedded
processor solution unmatched in performance compared to any off-the-shelf processor!
5. Conclusion
Based on the experiments performed in this research, several conclusions can be drawn.
Regarding the design process, remember that the hardware platform is part of the FPGA embedded
processor design. Unlike off-the-shelf processors where the hardware is pre-defined and fixed, FPGAs have
the flexibility and added complexity to create a multitude of different systems. A member of the design
team must be capable of hardware development and optimization of the FPGA embedded processor.
If an application does not require high performance or any of the other advantages that an FPGA can
provide, an off-the-shelf processor is most likely a less complicated, less expensive, and better solution.
Understand that the addition or removal of each peripheral, peripheral controller, or bus alters the design
size, cost, and speed. Use only what is necessary and no more!
If the designer chooses not to take advantage of these capabilities, the realized FPGA embedded processor
system performance may be a huge disappointment. However, for those who take full advantage of the
FPGA embedded processor, specific application performance greatly exceeding typical microprocessor
expectations is possible!
6. Acknowledgements
Shalin Sheth, Xilinx
Ron Wright, Memec
7. References
1. www.impulsec.com/trueansic.htm
2. For a complete list of available IP, visit www.altera.com/products/ip/processors/nios2/features/ni2-peripherals.html
   and www.xilinx.com/ise/embedded/edk_ip.htm
3. Sources for the Altera numbers, as of November 10, 2004:
   www.altera.com/products/devices/excalibur/exc-index.html;
   www.altera.com/products/ip/processors/nios/overview/nio-overview.html;
   www.altera.com/products/ip/processors/nios2/overview/ni2-overview.html
4. Sources for the Xilinx numbers, as of November 10, 2004:
   www.xilinx.com/products/virtex4/capabilities.htm;
   www.xilinx.com/ipcenter/processor_central/embedded/performance.htm;
   www.xilinx.com/ipcenter/processor_central/microblaze/performance.htm
5. Brian Gough, An Introduction to GCC for the GNU Compilers gcc and g++, Network Theory Ltd., March 31, 2004.
6. Xilinx Answer Record 19592, www.support.xilinx.com
7. OPB Synchronous DRAM (SDRAM) Controller, DS426, Xilinx, August 11, 2004.
8. PowerPC™ 405 Processor Block Reference Guide, UG018, Xilinx, August 20, 2004.
9. Fast Simplex Link (FSL) Bus (v2.00a), DS449, Xilinx, August 17, 2004.
10. www.impulsec.com/applications.htm
11. Volume pricing for 250k units, end of 2004 (www.xilinx.com/prs_rls/silicon_spart/0498s3.htm).
12. Thread-Metric RTOS Test Suite, Express Logic, Inc., April 21, 2004.
13. David Pellerin and Milan Saini, "FPGAs Provide Acceleration for Software Algorithms," FPGA Journal, 2004.
14. Scott Thibault and David Pellerin, Optimizing Impulse C Code for Performance, Impulse Accelerated Technologies, Inc., September 30, 2004.
CHAPTER 11
Programmable Logic Devices
Programmable logic is the means by which a large segment of engineers implement their custom
logic, whether that logic is a simple I/O port or a complex state machine. Most programmable logic
is implemented with some type of HDL that frees the engineer from having to derive and minimize
Boolean expressions each time a new logical relationship is designed. The advantages of program-
mable logic include rapid customization with relatively limited expense invested in tools and sup-
port.
The widespread availability of flexible programmable logic products has brought custom logic
design capabilities to many individuals and smaller companies that would not otherwise have the fi-
nancial and staffing resources to build a fully custom IC. These devices are available in a wide range
of sizes, operating voltages, and speeds, which all but guarantees that a particular application can be
closely matched with a relevant device. Selecting that device requires some research, because each
manufacturer has a slightly different specialty and range of products.
Programmable logic technology advances rapidly, and manufacturers are continually offering de-
vices with increased capabilities and speeds. After completing this chapter and learning about the
basic types of devices that are available, it is recommended that you browse through the latest
manufacturers’ data sheets to get updated information. Companies such as Altera, Atmel, Cypress,
Lattice, QuickLogic, and Xilinx provide detailed data sheets on their web sites and also tend to offer
bundled development software for reasonable prices.
Beyond using discrete 7400 ICs, custom logic is implemented in larger ICs that are either manufac-
tured with custom masks at a factory or programmed with custom data images at varying points after
fabrication. Custom ICs, or application specific integrated circuits (ASICs), are the most flexible op-
tion because, as with anything custom, there are fewer constraints on how application specific logic
is implemented. Because custom ICs are tailored for a specific application, the potential exists for
high clock speeds and relatively low unit prices. The disadvantages to custom ICs are long and ex-
pensive development cycles and the inability to make quick logic changes. Custom IC development
cycles are long, because a design must generally be frozen in a final state before much of the silicon
layout and circuit design work can be completed. Engineering charges for designing a custom mask
set (not including the logic design work) can range from $50,000 to well over $1 million, depending
on the complexity. Once manufactured, the logic can’t simply be altered, because the logic configu-
ration is an inherent property of the custom design. If a bug is found, the time and money to alter the
mask set can approach that of the initial design itself.
Programmable logic devices (PLDs) are an alternative to custom ASICs. A PLD consists of gen-
eral-purpose logic resources that can be connected in many permutations according to an engineer’s
logic design. This programmable connectivity comes at the price of additional, hidden logic that
makes logic connections within the chip. The main benefit of PLD technology is that a design can be
rapidly loaded into a PLD, bypassing the time consuming and expensive custom IC development
process. It follows that if a bug is found, a fix can be implemented very quickly and, in many cases,
reprogrammed into the existing PLD chip. Some PLDs are one-time programmable, and some can
be reprogrammed in circuit.
The disadvantage of PLDs is the penalty paid for the hidden logic that implements the program-
mable connectivity between logic gates. This penalty manifests itself in three ways: higher unit cost,
slower speeds, and increased power consumption. Programmable gates cost more than custom gates,
because, when a programmable gate is purchased, that gate plus additional connectivity overhead is
actually being paid for. Propagation delay is an inherent attribute of all silicon structures, and the
more structures that are present in a path, the slower the path will be. It follows that a programmable
gate will be slower than a custom gate, because that programmable gate comes along with additional
connectivity structures with their own timing penalties. The same argument holds true for power
consumption.
Despite the downside of programmable logic, the technology as a whole has progressed dramati-
cally and is extremely popular as a result of competitive pricing, high performance levels, and, espe-
cially, quick time to market. Time to market is an attribute that is difficult to quantify but one that is
almost universally appreciated as critical to success. PLDs enable a shorter development cycle, be-
cause designs can be prototyped rapidly, the bugs worked out, and product shipped to a customer be-
fore some ASIC technologies would even be in fabrication. Better yet, if a bug is found in the field,
it may be fixable with significantly less cost and disruption. In the early days of programmable logic,
PLDs could not be reprogrammed, meaning that a bug could still force the recall of product already
shipped. Many modern reprogrammable PLDs allow hardware bugs to be fixed in the field with a
software upgrade consisting of a new image that can be downloaded to the PLD without having to
remove the product from the customer site.
Cost and performance are probably the most debated trade-offs involved in using programmable
logic. The full range of applications in which PLDs or ASICs are considered can be broadly split
into three categories as shown in Fig. 11.1. At the high end of technology, there are applications in
which an ASIC is the only possible solution because of leading edge timing and logic density re-
quirements. In the mid range, clock frequencies and logic complexity are such that a PLD is capable
of solving the problem, but at a higher unit cost than an ASIC. Here, the decision must be made be-
tween flexibility and time to market versus lowest unit cost. At the low end, clock frequencies and
logic density requirements are far enough below the current state of silicon technology that a PLD
may meet or even beat the cost of an ASIC.
It may sound strange that a PLD with its overhead can ever be less expensive than a custom chip.
The reasons for this are a combination of silicon die size and volume pricing. Two of the major fac-
tors in the cost of fabricating a working silicon die are its size and manufacturing yield. As a die gets
smaller, more of them can be fabricated at the same time on the same wafer using the same re-
sources. IC manufacturing processes are subject to a certain yield, which is the percentage of work-
ing dice obtained from an overall lot of dice. Some dice develop microscopic flaws during
manufacture that make them unusable. Yield is a function of many variables, including the reliability
of uniformly manufacturing a certain degree of complexity given the prevailing state of technology
at a point in time. From these two points, it follows that a silicon chip will be less expensive to man-
ufacture if it is both small and uses a technology process that is mature and has a high yield.
At the low end of speed and density, a small PLD and a small ASIC may share the same mature
technology process and the same yield characteristics, removing yield as a significant variable in
their cost differential.
[Figure 11.1: ASIC or PLD, cost vs. time to market (plot not reproduced)]
Likewise, raw packaging costs are likely to be comparable because of the mat-
uration of stable packaging materials and technologies. The cost differential comes down to which
solution requires the smaller die and how the overhead costs of manufacturing and distribution are
amortized across the volume of chips shipped.
Die size is a function of two design characteristics: how much logic is required and how many I/O
pins are required. While the size of logic gates has decreased by orders of magnitude over time, the
size of I/O pads, the physical structures that packaging wires connect to, has not changed by the
same degree. There are nonscalable issues constraining pad size, including physical wire bonding
and current drive requirements. I/O pads are often placed on the perimeter of a die. If the required
number of I/O pads cannot be placed along the existing die’s perimeter, the die must be enlarged
even if not required by the logic. ICs can be considered as being balanced, logic limited, or pad lim-
ited. A balanced design is optimal, because silicon area is being used efficiently by the logic and pad
structures. A logic-limited IC’s silicon area is dominated by the internal logic requirements. At the
low end being presently discussed, being logic limited is not a concern because of the current state
of technology. Pad-limited designs are more of a concern at the low end, because the chip is forced
to a certain minimum size to support a minimum number of pins.
Many low-end logic applications end up being pad limited as the state of silicon process technol-
ogy advances and more logic gates can be squeezed into ever smaller areas. The logic shrinks, but
the I/O pads do not. Once an IC is pad limited, ASIC and CPLD implementations may use the same
die size, removing it as a cost variable. This brings us back to the volume pricing and distribution as-
pects of the semiconductor business. If two silicon manufacturers are fabricating what is essentially
the same chip (same size, yield, and package), who will be able to offer the lowest final cost? The
comparison is between a PLD vendor that turns out millions of the exact same chip each year versus
an ASIC vendor that can manufacture only as many of your custom chips that you are willing to buy.
Is an ASIC’s volume 10,000 units per year? 100,000? One million? With all other factors being
equal, the high-volume PLD vendor has the advantage, because the part being sold is not custom but
a mass-produced generic product.
Among the most basic types of PLDs are Generic Array Logic™ (GAL) devices.* GALs are enhanced
variants of the older Programmable Array Logic™ (PAL) architecture that is now essentially
obsolete. The term PAL is still widely used, but people are usually referring to GAL devices or other
PLD variants when they use the term. PALs became obsolete, because GALs provide a superset of
their functionality and can therefore perform all of the functions that PALs did. GALs are relatively
small, inexpensive, easily available, and manufactured by a variety of vendors (e.g., Cypress, Lat-
tice, and Texas Instruments).
It can be shown through Boolean algebra that any logical expression can be represented as an ar-
bitrarily complex sum of products. Therefore, by providing a programmable array of AND/OR
gates, logic can be customized to fit a particular application. GAL devices provide an extensive pro-
grammable array of wide AND gates, as shown in Fig. 11.2, into which all the device’s input terms
are fed. Both true and inverted versions of each input are made available to each AND gate. The out-
puts of groups of AND gates (products) feed into separate OR gates (sums) to generate user-defined
Boolean expressions.
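As a plain-C illustration of the sum-of-products idea (the function is contrived for this example,
with inputs restricted to 0 or 1), XOR can be written as two product terms summed by an OR, exactly
the form a GAL's AND/OR array implements directly:

    /* XOR as sum of products: (A AND NOT B) OR (NOT A AND B).
       Inputs are assumed to be 0 or 1. */
    int xor_sop(int a, int b)
    {
        return (a & !b) | (!a & b);   /* two product terms, one OR */
    }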
[Figure 11.2: GAL programmable AND/OR array; input terms in true and complement form feed wide AND
gates whose product terms are summed by OR gates]
* GAL, Generic Array Logic, PAL, and Programmable Array Logic are trademarks of Lattice Semiconductor Corporation.
Each intersection of a horizontal AND gate line and a vertical input term is a programmable con-
nection. In the early days of PLDs, these connections were made by fuses that would literally have to
be blown with a high voltage to configure the device. Fuse-based devices were not reprogrammable;
once a microscopic fuse is blown, it cannot be restored. Today’s devices typically rely on EEPROM
technology and CMOS switches to enable nonvolatile reprogrammability. However, fuse-based ter-
minology remains in use for historical reasons. The default configuration of a connection emulates
an intact fuse, thereby connecting the input term to the AND gate input. When the connection is
blown, or programmed, the AND input is disconnected from the input term and pulled high to effec-
tively remove that input from the Boolean expression. Customization of a GAL’s programmable
AND gate is conceptually illustrated in Fig. 11.3.
With full programmability of the AND array, the OR connections can be hard wired. Each GAL
device feeds a differing number of AND terms into the OR gates. If one or more of these AND terms
are not needed by a particular Boolean expression, those unneeded AND gates can be effectively dis-
abled by forcing their outputs to 0. This is done by leaving an unneeded AND gate’s inputs unpro-
grammed. Remember that inputs to the AND array are provided in both true and complement
versions. When all AND connections are left intact, multiple expressions of the form A·Ā = 0 result,
thereby forcing that gate's output to 0 and rendering it nonparticipatory in the OR function.
The majority of a GAL’s logic customization is performed by programming the AND array. How-
ever, selecting flip-flops, OR/NOR polarities, and input/output configurations is performed by pro-
gramming a configurable I/O and feedback structure called a macrocell. The basic concept behind a
macrocell is to ultimately determine how the AND/OR Boolean expression is handled and how the
macrocell’s associated I/O pin operates. A schematic view of a GAL macrocell is shown in Fig. 11.4,
although some GALs may contain macrocells with slightly different capabilities. Multiplexers determine
the polarity of the final OR/NOR term, whether the term is registered, and whether the feedback signal
is taken directly at the flop's output or at the pin. Configuring the macrocell's output enable
determines how the pin behaves.
[Figure 11.3: customization of a programmable AND gate with input terms A through F (diagram not reproduced)]
[Figure 11.4: GAL macrocell; AND-array product terms feed an OR/NOR polarity stage and a D flip-flop,
with configurable output enable, I/O pin connection, and feedback to the AND array]
There are two common GAL devices, the 16V8 and the 22V10, although other variants exist as
well. They contain eight and ten macrocells, respectively. The 16V8 provides up to 10 dedi-
cated inputs that feed the AND array, whereas the 22V10 provides 12 dedicated inputs. One of the
22V10’s dedicated inputs also serves as a global clock for any flops that are enabled in the macro-
cells. Output enable logic in a 22V10 is evaluated independently for each macrocell via a dedicated
AND term. The 16V8 is somewhat less flexible, because it cannot arbitrarily feed back all macrocell
outputs depending on the device configuration. Additionally, when configured for registered mode
where macrocell flops are usable, two dedicated input pins are lost to clock and output enable func-
tions.
GALs are fairly low-density PLDs by modern standards, but their advantages of low cost and
high speed are derived from their small size. Implementing logic in a GAL follows several basic
steps. First, the logic is represented in either graphical schematic diagram or textual (HDL) form.
This representation is converted into a netlist using a translation or synthesis tool. Finally, the
netlist is fitted into the target device by mapping individual gate functions into the programmable
AND array. Given the fixed AND/OR structure of a GAL, fitting software is designed to perform
logic optimizations and translations to convert arbitrary Boolean expressions into sum-of-product
expressions. The result of the fitting process is a programming image, also called a fuse map, that
defines exactly which connections, or fuses, are to be programmed and which are to be left at their
default state. The programming image also contains other information such as macrocell configura-
tion and other device-specific attributes.
Modern PLD development software allows the back-end GAL synthesis and fitting process to
proceed without manual intervention in most cases. The straightforward logic flow through the pro-
grammable AND array reduces the permutations of how a given Boolean expression can be imple-
mented and results in very predictable logic fitting. An input signal propagates through the pin and
pad structure directly into the AND array, passes through just two gates, and can then either feed a
macrocell flop or drive directly out through an I/O pin. Logic elements within a GAL are close to
each other as a result of the GAL’s small size, which contributes to low internal propagation delays.
These characteristics enable the GAL architecture to deliver very fast timing specifications, because
signals follow deterministic paths with low propagation delays.
GALs are a logic implementation technology with very predictable capabilities. If the desired
logic cannot fit within the GAL, there may not be much that can be done without optimizing the al-
gorithm or partitioning the design across multiple devices. If the logic fits but does not meet timing,
the logic must be optimized, or a faster device must be found. Because of the GAL’s basic fitting
process and architecture, there isn’t the same opportunity of tweaking the device as can be done with
more complex PLDs. This should not be construed as a lack of flexibility on the part of the GAL.
Rather, the GAL does what it says it does, and it is up to the engineer to properly apply the technol-
ogy to solve the problem at hand. It is the simplicity of the GAL architecture that is its greatest
strength.
Lattice Semiconductor’s GAL22LV10D-4 device features a worst-case input-to-output combina-
torial propagation delay of just 4 ns.* This timing makes the part suitable for address decoding on
fast microprocessor interfaces. The same 22V10 part features a 3-ns tCO and up to 250-MHz opera-
tion. The tCO specification is a pin-to-pin metric that includes the propagation delays of the clock
through the input pin and the output signal through the output pin. Internally, the actual flop itself
exhibits a faster tCO that becomes relevant for internal logic feedback paths. Maximum clock fre-
quency specifications are an interesting aspect of all PLDs and deserve some consideration. These specifica-
tions are best-case numbers usually obtained with minimal logic configurations. They may define
the highest toggle rate of the device’s flops, but synchronous timing analysis dictates that there is
more to fMAX than the flop’s tSU and tCO. Propagation delay of logic and connectivity between flops
is of prime concern. The GAL architecture’s deterministic and fast logic feedback paths reduces the
added penalty of internal propagation delays. Lattice’s GAL22LV10D features an internal clock-to-
feedback delay of 2.5 ns, which is the combination of the actual flop’s tCO plus the propagation delay
of the signal back through the AND/OR array. This feedback delay, when combined with the flop's
3-ns tSU, yields a practical fMAX of 1/(2.5 ns + 3.0 ns) ≈ 182 MHz when dealing with most normal
synchronous logic that contains feedback paths (e.g., a state machine).
11.3 CPLDs
Complex PLDs, or CPLDs, are the mainstream macrocell-based PLDs in the industry today, provid-
ing logic densities and capabilities well beyond those of a GAL device. GALs are flexible for their
size because of the large programmable AND matrix that defines logical connections between inputs
and outputs. However, this anything-to-anything matrix makes the architecture costly to scale to
higher logic densities. For each macrocell that is added, both matrix dimensions grow as well.
Therefore, the AND matrix grows as the square of the number of I/O terms and macrocells in the de-
vice. CPLD vendors seek to provide a more linear scaling of connectivity resources to macrocells by
implementing a segmented architecture with multiple fixed-size GAL-style logic blocks that are in-
terconnected via a central switch matrix as shown in Fig. 11.5. Like a GAL, CPLDs are typically
manufactured with EEPROM configuration storage, making their function nonvolatile. After pro-
gramming, a CPLD will retain its configuration and be ready for operation when power is applied to
the system.
Each individual logic block is similar to a GAL and contains its own programmable AND/OR ar-
ray and macrocells. This approach is scalable, because the programmable AND/OR arrays remain
fixed in size and small enough to fabricate economically. As more macrocells are integrated onto the
same chip, more logic blocks are placed onto the chip instead of increasing the size of individual
logic blocks and bloating the AND/OR arrays. CPLDs of this type are manufactured by companies
including Altera, Cypress, Lattice, and Xilinx.
[Figure 11.5: CPLD architecture; multiple GAL-type logic blocks exchange input terms and feedback
through a central switch matrix, with I/O cells providing pin inputs and outputs]
Generic user I/O pins are bidirectional and can be configured as inputs, outputs, or both. This is in
contrast to the dedicated power and test pins that are necessary for operation. There are as many po-
tential user I/O pins as there are macrocells, although some CPLDs may be housed in packages that
do not have enough physical pins to connect to all the chip’s I/O sites. Such chips are intended for
applications that are logic limited rather than pad limited.
Because the size of each logic block’s AND array is fixed, the block has a fixed number of possi-
ble inputs. Vendor-supplied fitting software must determine which logical functions are placed into
which blocks and how the switch matrix connects feedback paths and input pins to the appropriate
block. The switch matrix does not grow linearly as more logic blocks are added. However, the im-
pact of the switch matrix’s growth is less than what would result with an ever expanding AND ma-
trix. Each CPLD family provides a different number of switched input terms to each logic block.
The logic blocks share many characteristics with a GAL, as shown in Fig. 11.6, although addi-
tional flexibility is added in the form of product term sharing. Limiting the number of product terms
in each logic block reduces device complexity and cost. Some vendors provide just five product
terms per macrocell. To balance this limitation, which could impact a CPLD’s usefulness, product
term sharing resources enable one macrocell to borrow terms from neighboring macrocells. This
borrowing usually comes at a small propagation delay penalty but provides necessary flexibility in
handling complex Boolean expressions with many product terms. A logic block’s macrocell contains
a flip-flop and various configuration options such as polarity and clock control. As a result of their
higher logic density, CPLDs contain multiple global clocks that individual macrocells can choose
from, as well as the ability to generate clocks from the logic array itself.
Xilinx is a vendor of CPLD products and manufactures a family known as the XC9500. Logic
blocks, or function blocks in Xilinx’s terminology, each contain 18 macrocells, the outputs of which
feed back into the switch matrix and drive I/O pins as well. XC9500 CPLDs contain multiples of 18
macrocells in densities from 36 to 288 macrocells. Each function block gets 54 input terms from the
switch matrix. These input terms can be any combination of I/O pin inputs and feedback terms from
other function blocks’ macrocells.
Like a GAL, CPLD timing is very predictable because of the deterministic nature of the logic
blocks’ AND arrays and the input term switch matrix. Xilinx’s XC9536XV-3 features a maximum
pin-to-pin propagation delay of 3.5 ns and a tCO of 2.5 ns.* Internal logic can run as fast as 277 MHz
with feedback delays included, although complex Boolean expressions likely reduce this fMAX be-
cause of product term sharing and feedback delays through multiple macrocells.
CPLD fitting software is typically provided by the silicon vendor, because the underlying silicon
architectures are proprietary and not disclosed in sufficient detail for a third party to design the nec-
essary algorithms. These tools accept a netlist from a schematic tool or HDL synthesizer and auto-
matically divide the logic across macrocells and logic blocks.
[Figure 11.6: CPLD logic block; an AND array feeds product term sharing and distribution logic,
which in turn feeds multiple macrocells]
The fitting process is more complex
than for a GAL; not every term within the CPLD can be fed to each macrocell because of the seg-
mented AND array structure. Product term sharing places restrictions on neighboring macrocells
when Boolean expressions exceed the number of product terms directly connected to each macro-
cell. The fitting software first reduces the netlist to a set of Boolean expressions in the form that can
be mapped into the CPLD and then juggles the assignment of macrocells to provide each with its re-
quired product terms. Desired operating frequency influences the placement of logic because of the
delay penalties of sharing product terms across macrocells. These trade-offs occur at such a low
level that human intervention is often impractical.
CPLDs have come to offer flexibility advantages beyond just logic implementation. As digital
systems get more complex, logic IC supply voltages begin to proliferate. At one time, most systems
ran on a single 5-V supply. This was followed by 3.3-V systems, and it is now common to find sys-
tems that operate at multiple voltages such as 3.3 V, 2.5 V, 1.8 V, and 1.5 V. CPLDs invariably find
themselves designed into mixed-voltage environments for the purposes of interface conversion and
bus management. To meet these needs, many CPLDs support more than one I/O voltage standard on
the same chip at the same time. I/O pins are typically divided into banks, and each bank can be inde-
pendently selected for a different I/O voltage.
Most CPLDs are relatively small in logic capacity because of the desire for very high-speed oper-
ation with deterministic timing and fitting characteristics at a reasonable cost. However, some
CPLDs have been developed far beyond the size of typical CPLDs. Cypress Semiconductor's Delta39K200 contains 3,072 macrocells with several hundred kilobits of user-configurable RAM (Delta39K ISR CPLD Family, Document #38-03039 Rev. *.C, Cypress Semiconductor, December 2001, p. 1).
The architecture is built around clusters of 128 macrocell logic groups, each of which is similar in
nature to a conventional CPLD. In a similar way that CPLDs add an additional hierarchical connec-
tivity layer on top of multiple GAL-type logic blocks, Cypress has added a layer on top of multiple
CPLD-type blocks. Such large CPLDs may have substantial benefits for certain applications. Be-
yond several hundred macrocells, however, engineers have tended to use larger and more scalable
FPGA technologies.
11.4 FPGAS
CPLDs are well suited to applications involving control logic, basic state machines, and small
groups of read/write registers. These control path applications typically require a small number of
flops. Once a task requires many hundreds or thousands of flops, CPLDs rapidly become impractical
to use. Complex applications that manipulate and parse streams of data often require large quantities
of flops to serve as pipeline registers, temporary data storage registers, wide counters, and large state
machine vectors. Integrated memory blocks are critical to applications that require multiple FIFOs
and data storage buffers. Field programmable gate arrays (FPGAs) directly address these data path
applications.
FPGAs are available in many permutations with varying feature sets. However, their common de-
fining attribute is a fine-grained architecture consisting of an array of small logic cells, each consist-
ing of a flop, a small lookup table (LUT), and some supporting logic to accelerate common functions
such as multiplexing and arithmetic carry terms for adders and counters. Boolean expressions are
evaluated by the LUTs, which are usually implemented as small SRAM arrays. Any function of four
variables, for example, can be implemented in a 16 × 1 SRAM when the four variables serve as the
index into the memory. There are no AND/OR arrays as in a CPLD or GAL. All Boolean functions
are implemented within the logic cells. The cells are arranged on a grid of routing resources that can
make connections between arbitrary cells to build logic paths as shown in Fig. 11.7. Depending on
the FPGA type, special-purpose structures are placed into the array. Most often, these are config-
urable RAM blocks and clock distribution elements. Around the periphery of the chip are the I/O
cells, which commonly contain one or more flops to enable high-performance synchronous inter-
faces. Locating flops within I/O cells improves timing characteristics by minimizing the distance,
and hence the delay, between each flop and its associated pin. Unlike CPLDs, most FPGAs are based
on SRAM technology, making their configurations volatile. A typical FPGA must be reprogrammed
each time power is applied to a system. Major vendors of FPGAs include Actel, Altera, Atmel, Lat-
tice, QuickLogic, and Xilinx.
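To make the LUT concept concrete, consider the following minimal Verilog sketch (an illustrative addition, not from the original text; module and signal names are made up). The synthesizer evaluates the 16-entry truth table of the expression and loads it into a single four-input LUT.

module lut4_example (
  input  wire a, b, c, d,
  output wire y
);
  // Any Boolean function of the four inputs fits in one LUT: the
  // LUT is a 16 x 1 SRAM indexed by {a, b, c, d}.
  assign y = (a & b) | (c ^ d);
endmodule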
Very high logic densities are achieved by scaling the size of the two-dimensional logic cell array.
The primary limiting factor in FPGA performance becomes routing because of the nondeterministic
nature of a multipath grid interconnect system. Paths between logic cells can take multiple routes,
some of which may provide identical propagation delays. However, routing resources are finite, and
conflicts quickly arise between competing paths for the same routing channels. As with CPLDs,
FPGA vendors provide proprietary software tools to convert a netlist into a final programming im-
age. Depending on the complexity of the design (e.g., speed and density), routing an FPGA can take
a few minutes or many hours. Unlike a CPLD with deterministic interconnection resources, FPGA
timing can vary dramatically, depending on the quality of the logic placement. Large, fast designs re-
quire iterative routing and placement algorithms.
[Figure 11.7: FPGA logic cell array. A grid of logic cells (LC) and RAM blocks is surrounded by I/O cells; routing resources form logic paths between arbitrary cells.]
Human intervention can be critical to the successful routing and timing of a complex FPGA de-
sign. Floorplanning is the process by which an engineer manually partitions logic into groups and
then explicitly places these groups into sections of the logic cell array. Manually locating large por-
tions of the logic restricts the routing software to optimizing placement of logic within those bound-
aries and reduces the number of permutations that it must try to achieve a successful result.
Each vendor’s logic cell architecture differs somewhat, but mainly in how support functions such
as multiplexing and arithmetic carry terms are implemented. For the most part, engineers do not
have to consider the minute details of each logic cell structure, because the conversion of logic into
the logic cell is performed by a combination of the HDL synthesis tool and the vendor’s proprietary
mapping software. In extreme situations, wherein a very specific logic implementation is necessary
to squeeze the absolute maximum performance from a specific FPGA, optimizing logic and archi-
tecture for a given logic cell structure may have benefits. Engaging in this level of technology-spe-
cific optimization, however, can be very tricky and lead to a house-of-cards scenario in which
everything is perfectly balanced for a while, and then one new feature is added that upsets the whole
plan. If a design appears to be so aggressive as to require fine-tuned optimization, and faster devices
cannot be obtained, it may be preferable to modify the architecture to enable more mainstream, ab-
stracted design methodologies.
Notwithstanding the preceding comments, there are high-level feature differences among FPGAs
that should be evaluated before choosing a specific device. Of course, it is necessary to pick an
FPGA that has sufficient logic and I/O pins to satisfy the needs of the intended application. But not
all FPGAs are created equal, despite having similar quantities of logic. While the benefits of one
logic structure over another can be debated, the presence or absence of critical design resources can
make implementation of a specific design possible or impossible. These resources are clock distribu-
tion elements, embedded memory, embedded third-party cores, and multifunction I/O cells.
Clock distribution across a synchronous system must be done with minimal skew to achieve ac-
ceptable timing. Each logic cell within an FPGA holds a flop that requires a clock. Therefore, an
FPGA must provide at least one global clock signal distributed to each logic cell with low skew
across the entire device. One clock is insufficient for most large digital systems because of the prolif-
eration of different interfaces, microprocessors, and peripherals. Typical FPGAs provide anywhere
from 4 to 16 global clocks with associated low-skew distribution resources. Most FPGAs do allow
clocks to be routed using the general routing resources that normally carry logic signals. However,
these paths are usually unable to achieve the low skew characteristics of the dedicated clock distribu-
tion network and, consequently, do not enable high clock speeds.
Some FPGAs support a large number of clocks, but with the restriction that not all clocks can be
used simultaneously in the same portion of the chip. This type of restriction reduces the complexity
of clock distribution on the part of the FPGA vendor because, while the entire chip supports a large
number of clocks in total, individual sections of the chip support a smaller number. For example, an
FPGA might support 16 global clocks with the restriction that any one quadrant can support only 8
clocks. This means that there are 16 clocks available, and each quadrant can select half of them for
arbitrary use. Instead of providing 16 clocks to each logic cell, only 8 need be connected, thus sim-
plifying the FPGA circuitry.
Most FPGAs provide phase locked loops (PLLs) or delay locked loops (DLLs) that enable the in-
tentional skewing, division, and multiplication of incoming clock signals. PLLs are partially analog
circuits, whereas DLLs are fully digital circuits. They have significant overlap in the functions that
they can provide in an FPGA, although engineers may debate the merits of one versus the other. The
fundamental advantage of a PLL or DLL within an FPGA is its ability to improve I/O timing (e.g.,
tCO) by effectively removing the propagation delay between the clock input pin and the signal output
pin, also known as deskewing. As shown in Fig. 11.8, the PLL or DLL aligns the incoming clock to
a feedback clock with the same delay as observed at the I/O flops. In doing so, it shifts the incoming
clock so that the causal edge observed by the I/O flops occurs at nearly the same time as when it enters the FPGA's clock input pin. PLLs and DLLs are discussed in more detail in a later chapter.
[Figure 11.8: Clock deskew using a PLL or DLL. The input clock drives an oscillator (PLL) or delay logic (DLL) whose output clock reaches the I/O flops through the clock distribution resources; a feedback clock with the same delay as the I/O flops is aligned against the input clock.]
Additional circuitry enables some PLLs and DLLs to emit a clock that is related to the input fre-
quency by a programmable ratio. The ability to multiply and divide clocks is a benefit to some sys-
tem designs. An external board-level interface may run at a slower frequency to make circuit
implementation easier, but it may be desired to run the internal FPGA logic as a faster multiple of
that clock for processing performance reasons. Depending on the exact implementation, multiplica-
tion or division can assist with this scheme.
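As a sketch of how such a clock multiplier might appear in HDL, the following instantiates a hypothetical PLL primitive. The name "fpga_pll" and its parameters and ports are illustrative stand-ins, since each vendor defines its own primitives; this is a sketch of the concept, not a real library component.

module clock_gen (
  input  wire bus_clk,   // slower board-level interface clock
  output wire core_clk   // faster internal processing clock
);
  // Hypothetical primitive, for illustration only.
  fpga_pll #(
    .MULTIPLY(4), // internal clock runs at 4 x the external clock
    .DIVIDE  (1)
  ) u_pll (
    .clk_in  (bus_clk),
    .clk_fb  (core_clk), // feedback path removes the insertion delay
    .clk_out (core_clk)
  );
endmodule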
RAM blocks embedded within the logic cell array are a critical feature for many applications.
FIFOs and small buffers figure prominently in a variety of data processing architectures. Without on-chip RAM, a design must give up valuable I/O resources and accept speed penalties to use off-chip memory devices. To suit a wide range of applications, RAMs need to be highly configurable and flexible. A
typical FPGA’s RAM block is based on a certain bit density and can be used in arbitrary width/depth
configurations as shown in Fig. 11.9 using the example of a 4-kb RAM block. Single- and dual-port
modes are also very important. Many applications, including FIFOs, benefit from a dual-ported
RAM block to enable simultaneous reading and writing of the memory by different logic blocks.
[Figure 11.9: Width/depth configurations of a 4-kb RAM block: 4,096 × 1, 2,048 × 2, 1,024 × 4, or 512 × 8.]
One state machine may be writing data into a RAM, and another may be reading other data out at the
same time. RAM blocks can have synchronous or asynchronous interfaces and may support one or
two clocks in synchronous modes. Supporting two clocks in synchronous modes facilitates dual-
clock FIFO designs for moving data between different clock domains.
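As an illustration of the dual-port, dual-clock usage described above, here is a minimal Verilog sketch of the kind of memory description that FPGA synthesis tools can typically map onto a RAM block. The 512 × 8 shape matches one configuration of the 4-kb block; the module and signal names are illustrative, not from the original text.

module dual_port_ram (
  input  wire       wr_clk,
  input  wire       wr_en,
  input  wire [8:0] wr_addr,
  input  wire [7:0] wr_data,
  input  wire       rd_clk,
  input  wire [8:0] rd_addr,
  output reg  [7:0] rd_data
);
  reg [7:0] mem [0:511]; // 512 x 8 = 4 kb of storage

  // Write port, synchronous to wr_clk
  always @(posedge wr_clk)
    if (wr_en)
      mem[wr_addr] <= wr_data;

  // Independent read port, synchronous to rd_clk, enabling
  // dual-clock FIFO designs across clock domains
  always @(posedge rd_clk)
    rd_data <= mem[rd_addr];
endmodule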
Some FPGAs also allow logic cell LUTs to be used as general RAM in certain configurations. A four-input LUT is itself a 16 × 1 memory and can serve as a 16 × 1 RAM if supported by the FPGA architecture. It is more
efficient to use RAM blocks for large memory structures, because the hardware is optimized to pro-
vide a substantial quantity of memory in a small area of silicon. However, LUT-based RAM is bene-
ficial when a design requires many shallow memory structures (e.g., a small FIFO) and all the large
RAM blocks are already used. Along with control logic, 32 four-input LUTs can be used to con-
struct a 16 × 32 FIFO. If a design is memory intensive, it could be wasteful to commit one or more
large RAM blocks for such a small FIFO.
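A comparable sketch for LUT-based storage, again illustrative rather than from the original text: a 16 × 32 memory that tools can build from 32 four-input LUTs, with the surrounding FIFO control logic omitted.

module lut_fifo_mem (
  input  wire        clk,
  input  wire        wr_en,
  input  wire [3:0]  wr_ptr,
  input  wire [31:0] wr_data,
  input  wire [3:0]  rd_ptr,
  output wire [31:0] rd_data
);
  reg [31:0] mem [0:15]; // 32 four-input LUTs' worth of storage

  // Synchronous write
  always @(posedge clk)
    if (wr_en) mem[wr_ptr] <= wr_data;

  // Asynchronous read, typical of LUT-based (distributed) RAM
  assign rd_data = mem[rd_ptr];
endmodule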
Embedding third-party logic cores is a feature that can be useful for some designs, and not useful
at all for others. A disadvantage of FPGAs is their higher cost per gate than custom ASIC technol-
ogy. The main reason that engineers are willing to pay this cost premium is for the ability to imple-
ment custom logic in a low-risk development process. Some applications involve a mix of custom
and predesigned logic that can be purchased from a third party. Examples of this include buying a
microprocessor design or a standard bus controller (e.g., PCI) and integrating it with custom logic on
the same chip. Ordinarily, the cost per gate of the third-party logic would be the same as that of your
custom logic. On top of that cost is the licensing fee charged by the third party. Some FPGA vendors
have decided that there is sufficient demand for a few standard logic cores to offer specific FPGAs
that embed these cores into the silicon in a fixed configuration. The benefit of doing so is to drop the
per-gate cost of the core to nearly that of a custom ASIC, because the core is hard wired and requires
none of the FPGA’s configuration overhead.
FPGAs with embedded logic cores may cost more to offset the licensing costs of the cores, but
the idea is that the overall cost to the customer will be reduced through the efficiency of the hard-
wired core implementation. Microprocessors, PCI bus controllers, and high-speed serdes compo-
nents are common examples of FPGA embedded cores. Some specific applications may be well
suited to this concept.
I/O cell architecture can have a significant impact on the types of board-level interfaces that the
FPGA can support. The issues revolve around two variables: synchronous functionality and voltage/
current levels. FPGAs support generic I/O cells that can be configured for input-only, output-only,
or bidirectional operation with associated tri-state buffer output enable capability. To achieve the
best I/O timing, flops for all three functions—input, output, and output-enable—should be included
within the I/O cell as shown in Fig. 11.10. The timing improvement obtained by locating these three
flops in the I/O cells is substantial. The alternative would be to use logic cell flops and route paths
from the logic cell array directly to the I/O pin circuitry, increasing the I/O delay times. Each of the
three I/O functions is provided in both registered and unregistered options using multiplexers to
provide complete flexibility in logic implementation.
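A behavioral sketch of the three registered I/O functions follows (an illustrative addition with made-up names; the flops shown are the ones the tools would pack into the I/O cell).

module io_cell_regs (
  input  wire clk, oe, dout,
  inout  wire pin,
  output reg  din_q
);
  reg oe_q, dout_q;

  // Output, output-enable, and input flops, all eligible for
  // placement in the I/O cell
  always @(posedge clk) begin
    oe_q   <= oe;
    dout_q <= dout;
    din_q  <= pin;
  end

  // Tri-state output driver controlled by the registered enable
  assign pin = oe_q ? dout_q : 1'bz;
endmodule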
More advanced bus interfaces run at double data rate speeds, requiring more advanced I/O cell
structures to achieve the necessary timing specifications. Newer FPGAs are available with I/O cells
that specifically support DDR interfaces by incorporating two sets of flops, one for each clock edge
as shown in Fig. 11.11. When configured for DDR mode, each of the three I/O functions is driven by
a pair of flops, and a multiplexer selects the appropriate flop output depending on the phase of the
clock. A DDR interface runs externally to the FPGA on both edges of the clock with a certain width.
Internally, the interface runs at double the external width on only one edge of the same clock fre-
quency. Therefore, the I/O cell serves as a 2:1 multiplexer for outputs and a 1:2 demultiplexer for in-
puts when operating in DDR mode.
[Figure 11.10: FPGA I/O cell. The output data, output-enable, and input data paths each pass through an optional flop, with configuration information selecting registered or unregistered operation at the I/O pin.]
[Figure 11.11: FPGA I/O cell with DDR support. Each of the output, output-enable, and input paths uses a pair of flops (#1 and #2), one per clock edge, multiplexed onto the I/O pin.]
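The 2:1 output multiplexing can be sketched behaviorally as follows. This is illustrative only; real designs use the I/O cell's dedicated DDR registers rather than fabric logic like this, and the names are made up.

module ddr_out (
  input  wire clk,
  input  wire d_rise, // data launched on the rising edge
  input  wire d_fall, // data launched on the falling edge
  output wire q
);
  reg q_rise, q_fall;

  // One flop per clock edge
  always @(posedge clk) q_rise <= d_rise;
  always @(negedge clk) q_fall <= d_fall;

  // 2:1 mux selected by the phase of the clock
  assign q = clk ? q_rise : q_fall;
endmodule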
Aside from synchronous functionality, compliance with various I/O voltage and current drive
standards is a key feature for modern, flexible FPGAs. Like CPLDs that support multiple I/O banks,
each of which can drive a different voltage level, FPGAs are usually partitioned into I/O banks
as well, for the same purpose. In contrast with CPLDs, many FPGAs support a wider variety of I/O
standards for greater design flexibility.
Verilog Cheat Sheet
S Winberg and J Taylor
Comments

// One-liner
/* Multiple
   lines */

Numeric Constants

// The 8-bit decimal number 106:
8'b_0110_1010 // Binary
8'o_152       // Octal
8'd_106       // Decimal
8'h_6A        // Hexadecimal
"j"           // ASCII

78'bZ // 78-bit high-impedance

Constants that are too short are padded with zeros on the left; constants that are too long are truncated from the left.

Nets and Variables

wire [3:0]w; // Assign outside always blocks
reg [1:7]r;  // Assign inside always blocks
reg [7:0]mem[31:0];

integer j; // Compile-time variable
genvar k;  // Generate variable

Parameters

parameter N = 8;
localparam State = 2'd3;

Assignments

assign Output = A * B;
assign {C, D} = {D[5:2], C[1:9], E};

Case

always @(*) begin
  case(Mux)
    2'd0: A = 8'd9;
    2'd1,
    2'd3: A = 8'd103;
    2'd2: A = 8'd2;
    default:;
  endcase
end

always @(*) begin
  casex(Decoded)
    4'b1xxx: Encoded = 2'd0;
    4'b01xx: Encoded = 2'd1;
    4'b001x: Encoded = 2'd2;
    4'b0001: Encoded = 2'd3;
    default: Encoded = 2'd0;
  endcase
end

Operators

// These are in order of precedence...
// Select
A[N] A[N:M]
// Reduction
&A ~&A |A ~|A ^A ~^A
// Complement
!A ~A
// Unary
+A -A
// Concatenate
{A, ..., B}
// Replicate
{N{A}}
// Arithmetic
A*B A/B A%B
A+B A-B
// Shift
A<<B A>>B
// Relational
A>B A<B A>=B A<=B
A==B A!=B
// Bit-wise
A&B
A^B A~^B
A|B
// Logical
A&&B
A||B
// Conditional
A ? B : C

Module

module MyModule
#(parameter N = 8) // Optional parameter
(input Reset, Clk,
output [N-1:0]Output);
  // Module implementation
endmodule

Module Instantiation

// Override default parameter: setting N = 13
MyModule #(13) MyModule1(Reset, Clk, Result);

Generate

genvar j;
wire [12:0]Output[19:0];

generate
  for(j = 0; j < 20; j = j+1)
  begin: Gen_Modules
    MyModule #(13) MyModule_Instance(
      Reset, Clk,
      Output[j]
    );
  end
endgenerate

State Machine

reg [1:0]State;
localparam Start = 2'b00;
localparam Idle  = 2'b01;
localparam Work  = 2'b11;
localparam Done  = 2'b10;
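The sheet stops at the state declarations; a minimal state-register update using them might look like this (an illustrative addition with made-up transitions, not part of the original sheet):

always @(posedge Clk) begin
  if(Reset) State <= Start;
  else case(State)
    // Made-up transitions, for illustration only
    Start  : State <= Idle;
    Idle   : State <= Work;
    Work   : State <= Done;
    Done   : State <= Idle;
    default: State <= Start;
  endcase
end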
Bit reduction unary operators: and (&), or (|), xor (^)
Example, for a 3-bit vector a:
  &a == a[0] & a[1] & a[2]
and
  |a == a[0] | a[1] | a[2]

Conditional operator ? used to multiplex a result
Example: (a==3'd3) ? formula1 : formula0
For a single-bit formula, this is equivalent to:
  ((a==3'd3) && formula1)
  || ((a!=3'd3) && formula0)

The blocking assignment operator (=) is also used inside an always block but causes assignments to be performed as if in sequential order. This tends to result in slower circuits, so we do not use it for synthesised circuits.

Case and if statements

case and if statements are used inside an always block to conditionally update state.

Example:

module simpleClockedALU(
    input clock,
    input [1:0] func,
    input [3:0] a,b,
    output reg [3:0] result);
  always @(posedge clock)
    case(func)
      2'd0   : result <= a + b;
      2'd1   : result <= a - b;
      2'd2   : result <= a & b;
      default: result <= a ^ b;
    endcase
endmodule

Example in pre-2001 Verilog:

module simpleClockedALU(
    clock, func, a, b, result);
  input clock;
  input [1:0] func;
  input [3:0] a,b;
  output [3:0] result;
  reg [3:0] result;
  always @(posedge clock)
    case(func)
      2'd0   : result <= a + b;
      2'd1   : result <= a - b;
      2'd2   : result <= a & b;
      default: result <= a ^ b;
    endcase
endmodule

Simulation

Example simulation following on from the above instantiation of simpleClockedALU:

reg clk;
reg [7:0] vals;
assign data0=vals[3:0];
assign data1=vals[7:4];

// oscillate clock every 10 simulation units
always #10 clk <= !clk;

// initialise values
initial #0 begin
  clk = 0;
  vals=0;
  // finish after 200 simulation units
  #200 $finish;
end

// monitor results
always @(negedge clk)
  $display("%d + %d = %d",data0,data1,sum);

Simon Moore