Design Performance An
Introduction
FPGA designers often face daunting timing challenges. To reduce that frustration and to shrink design time and effort, this application note helps users analyze their design, identify the root causes of the timing issues, and resolve them. Another important goal is to help designers predict the outcome of their design decisions or tool settings. The note highlights when a particular user action is likely to be effective and what effect combining several user actions has. It also covers the cautions that need to be considered in timing-critical situations.

User actions described in this document cover the following:
• RTL coding
• Synthesis options setting
• Place-and-route constraints and options setting

"Brief Introduction to Flash FPGA Architecture" presents the salient architectural features of the Flash-based Actel FPGAs: IGLOO, Fusion, ProASIC3, and ProASIC3L.
"Root Causes of Timing Issues" focuses on the main sources of timing challenges and ways to identify them.
"Main Sources for Design Analysis" introduces various design techniques to cope with each of these root causes of timing challenges. These techniques include RTL coding, synthesis flow setting, place-and-route, and physical and routing constraints.
"Ingredients for Timing Optimizations" focuses on inherent design congestion management and provides various results.
Basic Cell
The basic cell, called the VersaTile and depicted in Figure 1, is a three-input LUT that can implement either a three-input combinatorial gate or a flip-flop with enable. This allows the implementation of any design, whatever its ratio of combinatorial logic to registers. The fine granularity of this tile makes it close to a basic ASIC cell, which allows ASIC-like cell-based mappers to apply their full optimization potential.
[Figure 1: VersaTile basic cell. Signal labels: Data, X3, CLK, X2, F2, Y, YL]
Routing
An abundance of routing resources has been designed to offer the highest routability possible, even when the design's resource utilization is higher than 95%. The focus of this section is on the global routing resources and their flexible aggregation. Figure 2 illustrates the span of the various global and quadrant networks. In essence, each VersaTile in the die can be reached by six global and three quadrant networks. More importantly, these networks can be aggregated to map local clocks, resets, or any high-fanout net. Moreover, the aggregation of one, two, four, or any number of spines introduces neither additional insertion delay nor skew. This is achieved by the embedded MUX tree, which offers a local choice between sourcing from the global network or from any local driver. Refer to the IGLOO, Fusion, ProASIC3, and ProASIC3L product handbooks for more details.
[Figure 2: Span of the global and quadrant networks. Labels: pad ring, I/O ring, quadrant global pads, global pads, top spine (T1, T2, T3), bottom spine (B1, B2, B3), global spine, global ribs, spine-selection MUX, logic tiles, scope of spine (shaded area plus local RAMs and I/Os)]
[Figure: labels CLKA; phase taps 0, 90, 180, 270; feedback delay D1; output delay D2; outputs GLA (primary), GLB, GLC]
FlashROM
Actel IGLOO, Fusion, ProASIC3, and ProASIC3L Flash devices have 1 kbit of on-chip, user-accessible, nonvolatile FlashROM. The FlashROM can be used in diverse system applications such as system calibration settings, device serialization and/or inventory control, or subscription-based business models (for example, set-top boxes).
I/Os
The I/O tiles are flexible to support a large variety of standards ranging from LVTTL to LVDS. They also support DDR (double-data-rate) access with the I/O embedded registers, allowing high data rate and aggressive external timing.
Finally, it goes without saying that the earlier these sources of timing trouble are identified and dealt with, the sooner timing closure can be achieved. More importantly, the effort and investment are lower, and the opportunity larger, when the root cause of timing trouble is avoided or identified early. Figure 4 illustrates this principle. As a corollary, the best return on investment comes from the following, in order of importance:
• Quality of the RTL code
• Getting the best of logic synthesis by means of judicious constraints and options setting
• A formal link between synthesis and place-and-route
• Getting the best of the place-and-route engines via appropriate constraining and use of the FPGA architecture features
[Figure 4: Efforts and benefits versus time]
However, most designs do not meet all the timing specifications, and several iterations are needed. The iterative approach that leads to timing convergence with minimal time and effort is based on the following:
• Analysis, as the key to identifying the aforementioned root causes
• Sorting these causes by scope of implications and by order of importance toward resolving the timing issues
• Dealing with these sources in an appropriate order of importance, preferably at the highest level of the flow (i.e., the RTL code) if possible
The "Main Sources for Design Analysis" section summarizes the major sources for pertinent data to identify the bottlenecks.
Unfortunately, the synthesis reports do not identify the congested blocks, the interconnectivity between blocks, nor the reasons why the synthesis process could not do some of the optimizations.
Similar to the synthesis report, the backend compile does not report data related to blocks that are timing-critical or those that are congested. It does not reveal any data on the interconnectivity of blocks or inter-block busses.
ChipPlanner
The ChipPlanner tool helps the user check several aspects of the quality of the place-and-route results. The tool allows users to check the span of the global networks in the die and whether the routed design is congested. Using the cell selection function of the Viewer, users can check the relative placement of hierarchical blocks. ChipPlanner can also help identify inefficient placement: low-fanout nets that follow a very circuitous route indicate poor placement. If similarly circuitous routing is associated with high-fanout nets, however, the cause may be poor routing.
Unfortunately, the user needs to know a lot more about the design before ChipPlanner can be used to identify block interconnects.
Timing Reports
The timing reports are by far the most important source of information about the critical regions of the design, the inherent congestion of blocks, and the quality of results of the synthesis, placement, or routing process. The expanded paths can reveal whether paths are timing-critical because of a large number of logic levels or because of high delay penalties associated with high-fanout nets. If these high delay penalties are associated with low-fanout nets, this may reveal poor placement or poor routing. If the expanded paths are made of a long chain of two-input gates, the user needs to check whether this is due to poor synthesis or a poor RTL coding style.
Unfortunately, the timing reports reveal neither the critical nets shared between the paths with the worst negative slack nor the negative slack distribution. Moreover, the timing reports do not report which hierarchical blocks meet the timing with a narrow or a large margin. To see the importance of these parameters, consider the timing profiles of two designs, given in Figure 5. A first look at the profiles leads to the conclusion that the design on the left is a lot more complex than the one on the right. However, a deeper look at the common critical nets leading to the negative slack may reveal that their number is very limited for the design profiled on the left. Moreover, a closer look at the profile of the design on the right shows a very high sensitivity to changes in place-and-route; thus, a decision to tackle the negative slack may induce a new, larger set of paths with negative slack. The moral of this illustration is that analysis must be deep and complete before decisions are made.
[Two plots of number of paths (# Paths) versus timing, each marked with its target timing]
Figure 5 Slack/Timing Profiles for Two Different Designs or Implementations
If a block is the source of multiple high-fanout nets and its size in logic cells is limited (fewer than 200 VersaTiles), it is worth investigating the replication of the block itself, as illustrated in Figure 7.
[Figure 7: Replication of a high-fanout module. Labels: High Fanout Module (shown once, then as two copies), Sub_Mod 1 through Sub_Mod 7]
In all these cases, designers are cautioned against making these explicit replications blindly. They have to consider this change with care and avoid replicating a driver of a generated clock domain or a synchronized reset. Failing to do so will lead to new clock domains and the headache of making sure they are mapped to low-skew routing resources, or having to analyze removal time for a large set of synchronized resets.
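As an illustration of explicit replication at the RTL level, the following is a minimal sketch, not taken from the original note. The module name, port names, and the choice of an enable register as the high-fanout driver are all hypothetical. It duplicates a plain data-path enable, which is the benign case; it deliberately does not replicate a generated clock or a synchronized reset, in line with the caution above.

```verilog
// Hypothetical example: explicit replication of a high-fanout enable register.
// Each copy drives only part of the loads, which halves the fanout per net.
module replicated_enable (
    input  wire       clk,
    input  wire       en_in,
    input  wire [7:0] d_a,
    input  wire [7:0] d_b,
    output reg  [7:0] q_a,
    output reg  [7:0] q_b
);
    // Two functionally identical registers. A synthesis tool may merge
    // equivalent registers again unless instructed otherwise; the attribute
    // or option that prevents this is tool-specific and is not shown here.
    reg en_copy_a;
    reg en_copy_b;

    always @(posedge clk) begin
        en_copy_a <= en_in;
        en_copy_b <= en_in;
    end

    // Load group A uses the first copy, load group B uses the second copy.
    always @(posedge clk) begin
        if (en_copy_a) q_a <= d_a;
        if (en_copy_b) q_b <= d_b;
    end
endmodule
```

In a real design each copy would typically drive the loads placed closest to it, so the split of loads between copies is best decided after looking at the placement.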
Synthesis Control
Synthesis tools typically offer ways to set fanout limits globally. When available, use local, block-level fanout control only for blocks exhibiting high net fanout.
Figure 8 illustrates the use of clock network segments to map several high fanout nets.
Synthesis Control
The synthesis flow allows for retiming. Moving registers across the logic in this way can increase area, as the number of registers may grow. Users need to watch this increase and monitor the utilization of the device resources. When the logic levels involve arithmetic blocks, users should be aware that synthesis tools offer a variety of architectures for these blocks. Users need to check the default choice made by the synthesis tool and see whether it is an efficient architecture for the device.
Backend Control
As part of the analysis, users need to verify the placement of the cells as well as the fanout of the nets involved in these paths. In case the fanout of these nets is limited, the tool offers a flexible set of placement constraints allowing the user to confine the placement of these paths/blocks. In case one or more nets are associated with a delay penalty, users can use the shielding technique described above.
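reg [63:0] A, B, C, D, E, F, G, H, M;
reg [63:0] Z;       // MUX output (declaration added; Z is assigned below)
reg [2:0]  SELECT;

// Centralized 8:1 MUX, 64 bits wide, described as a single case statement
always @ (A or B or C or D or E or F or G or H or SELECT)
begin
  case (SELECT)
    3'b000 : Z = A;
    3'b001 : Z = B;
    3'b010 : Z = C;
    3'b011 : Z = D;
    3'b100 : Z = E;
    3'b101 : Z = F;
    3'b110 : Z = G;
    3'b111 : Z = H;
  endcase
end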
[Figure: MUX implementation diagram. Labels: 8:1, 4:1, Module 1 (I1, I2), Module 4]
Table 2 shows the area/frequency results obtained when synthesizing the code as is, instantiating a large MUX and implementing the recommended decentralization technique.
Table 2 Centralized vs. Decentralized Implementations

                       32:1, 16-bit wide MUX    64:1, 16-bit wide MUX
Pure RTL Synthesis     148 MHz (130 tiles)      134 MHz (246 tiles)
Using 4:1 MUX Cores    171 MHz (120 tiles)      160 MHz (217 tiles)
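To make the idea of building a large MUX from 4:1 MUX cores concrete, here is a minimal sketch under stated assumptions. The module names (mux4, mux16_tree), the packed input bus, and the 16:1 size are hypothetical choices for illustration; they are not the code that produced the figures in Table 2. In a decentralized design, the first-stage 4:1 MUXes could live inside the modules that produce the data, so that only one bus per module reaches the final stage.

```verilog
// Hypothetical 4:1 MUX core, W bits wide.
module mux4 #(parameter W = 16) (
    input  wire [W-1:0] d0, d1, d2, d3,
    input  wire [1:0]   sel,
    output reg  [W-1:0] y
);
    always @(*) begin
        case (sel)
            2'b00:   y = d0;
            2'b01:   y = d1;
            2'b10:   y = d2;
            default: y = d3;
        endcase
    end
endmodule

// 16:1 MUX built as a two-level tree of 4:1 cores.
// Input i occupies bits d[i*W +: W] of the packed bus.
module mux16_tree #(parameter W = 16) (
    input  wire [16*W-1:0] d,
    input  wire [3:0]      sel,
    output wire [W-1:0]    y
);
    wire [W-1:0] stage1 [0:3];

    genvar g;
    generate
        // First level: four 4:1 MUXes, all steered by the two low select bits.
        for (g = 0; g < 4; g = g + 1) begin : first_stage
            mux4 #(.W(W)) u_mux (
                .d0 (d[(4*g + 0)*W +: W]),
                .d1 (d[(4*g + 1)*W +: W]),
                .d2 (d[(4*g + 2)*W +: W]),
                .d3 (d[(4*g + 3)*W +: W]),
                .sel(sel[1:0]),
                .y  (stage1[g])
            );
        end
    endgenerate

    // Second level: one 4:1 MUX steered by the two high select bits.
    mux4 #(.W(W)) u_final (
        .d0 (stage1[0]),
        .d1 (stage1[1]),
        .d2 (stage1[2]),
        .d3 (stage1[3]),
        .sel(sel[3:2]),
        .y  (y)
    );
endmodule
```

Note that the low select bits still fan out to every first-stage MUX, so the tree structure by itself does not remove the high-fanout select nets.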
Synthesis Control
Using the same illustration example, users must pay attention to the select lines, as the least significant bits are definitely very high-fanout nets. Moreover, the coding of the select lines, either compact or one-hot, leads to different area and speed results. Table 3 shows these results.
Table 3 Compact vs. One-Hot Select Line Encoding

                   64:1 8-bit-wide MUX     64:1 16-bit-wide MUX
Compact Select     134 MHz (132 tiles)     130 MHz (246 tiles)
One-Hot Select     163 MHz (148 tiles)     158 MHz (231 tiles)
As a corollary, users need to think contextually. If the select lines of the large MUX are the state register of an FSM, the states of this machine should be encoded one-hot, even if this encoding may not look optimal locally.
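A one-hot-selected MUX can be described as an AND-OR structure in which each select bit gates exactly one input. The following is a minimal sketch, assuming a hypothetical module name, a packed input bus, and small default parameters; it is not the code behind Table 3.

```verilog
// Hypothetical N:1 MUX with a one-hot select, W bits wide.
// Input i occupies bits d[i*W +: W] of the packed bus.
module mux_onehot #(parameter W = 8, parameter N = 4) (
    input  wire [N*W-1:0] d,
    input  wire [N-1:0]   sel_onehot,   // assumed to have exactly one bit set
    output reg  [W-1:0]   y
);
    integer i;
    always @(*) begin
        y = {W{1'b0}};
        // Each select bit masks its own input; the masked inputs are ORed.
        for (i = 0; i < N; i = i + 1)
            y = y | (d[i*W +: W] & {W{sel_onehot[i]}});
    end
endmodule
```

Each select bit here reaches only the W gates of its own input, whereas the least significant bit of a compact select reaches every first-level MUX. If more than one select bit is set, this sketch returns the bitwise OR of the selected inputs, so the one-hot property has to be guaranteed by the driving logic, for example a one-hot state register.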
Another synthesis control is related to resource sharing. The goal of the resource sharing is to reduce design area by sharing large blocks by means of adding MUXes on the inputs of these blocks. Users need to check the implication and manage the balance between slightly higher utilization, added congestion, and number of logic levels.
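The trade-off can be seen by writing both forms explicitly. The sketch below is illustrative only: the module and port names are hypothetical, and a synthesis tool with resource sharing enabled or disabled may restructure either description, so the RTL form is a hint rather than a guarantee.

```verilog
// Shared form: one adder, with MUXes on its inputs.
// Smaller area, but the input MUXes add a logic level in front of the adder.
module add_shared (
    input  wire        sel,
    input  wire [15:0] a, b, c, d,
    output wire [16:0] sum
);
    wire [15:0] op1 = sel ? c : a;
    wire [15:0] op2 = sel ? d : b;
    assign sum = op1 + op2;
endmodule

// Unshared form: two adders, with a MUX on the results.
// Larger area, but both additions start as soon as their operands arrive.
module add_unshared (
    input  wire        sel,
    input  wire [15:0] a, b, c, d,
    output wire [16:0] sum
);
    assign sum = sel ? (c + d) : (a + b);
endmodule
```

Both modules compute the same function; which one gives better timing depends on whether the select or the operands arrive last.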
Backend Control
One major recommendation is to avoid aggressive placement or timing constraints on these routing congestion spots. In other words, users are advised to relax the placement constraints to allow higher porosity. This has to be combined with a relaxation of the timing constraints on the congested block, as well as on other non-timing-critical blocks, so that their internal nets do not compete for a critical routing resource. Another, higher-level measure users can adopt is a more data-oriented floorplan for the blocks involved in the congestion.
Synthesis Control
Unfortunately, at the synthesis level, very little can be done to cope with these situations.
Backend Control
The most efficient physical implementation of these busses is the shortest routing possible. This involves placement of the drivers and the destination of each slice of the busses. If the fanout of these slices is not homogeneous, the highest-fanout bus lines can be mapped using low-skew segments of the global networks. In any case, users need to adopt a data-driven placement of the communicating blocks and relax the placement constraints for higher porosity and ease of routability of blocks and busses.
Synthesis Control
The focus of the designer should be on setting all the timing constraints. These include the tightest clock frequencies, the inter-clock offsets, and the input and output delays. False paths and multi-cycle paths are very critical and must not be neglected.
Backend Control
The general guidelines are to separate the clock domains and adopt clock-domain-based floorplanning. This also economizes on the global networks and allows a more effective use of their low-skew segments. When doing so, users need to account for data dependencies between domains and for path optimizations.
A case worthy of note is when some of the clock domains drive a large number of RAM/FIFO blocks (deep or shallow RAMs that are mapped cascading several embedded RAM blocks). In such a case, users can consider placing the RAM blocks on the top and bottom of the die, provided their performance degradation does not affect critical paths.
Synthesis Control
Users need to turn off the inference of registered I/Os, as this leads to an inherent placement constraint of the registers associated with these I/Os.
Backend Control
Users need to investigate carefully the ratio of logic to I/O utilization. If the logic utilization is low, the recommendation is to run place-and-route with I/O register combining turned off. If the internal register-to-register performance is satisfactory, then users need to investigate the slack margins and allow register combining for the most critical external setup or clock-to-out timing. If doing so does not resolve the problem, the I/O placement can be modified to accommodate both the internal and external timings.
Table 4 Sample of Results on a Small Set of Benchmarks

Place-and-route option, per column:            All Buffers                All Buffers                All Buffers
                                               Removed       Default      Removed       Default      Removed
RDES (MUX- and XOR-based)
  Area in VersaTiles                           12822         18181        16362         16969        14359
  SystemClk (MHz)                              87            83.9         87.5          84.2         82.7
Syncop (Bus interfaces / large reg files)
  Area in VersaTiles                           6124          7284         6886          7192         6855
  HighClock (MHz)                              70            63.5         66            63.5         68.2
HRK
  Area in VersaTiles                           2538          3041         2909          3035         2905
  TopClk (MHz)                                 71.67         72.7         77.8          75           72.5
CORDIC (Datapath)
  Area in VersaTiles                           3238          11398        9792          11180        9691
  MainClk (MHz)                                53.4          80.5         78.3          76.7         78.6
Table 4 Sample of Results on a Small Set of Benchmarks (continued) Synthesis Options Design Name Imen Area-Driven Timing-Driven Replication On Timing-Driven Replication Off
Place-and-Route Option Default Scrambling and Descrambling Area in VersaTiles Speed (MHz) VersaC HardClk RxClk TxClk System Interfaces Control Tiles Speed (MHz) SysClk PicClk and 22013 44.7 51.2 4274 88 87 60 60
All Buffers All Buffers All Buffers Removed Default Removed Default Removed 4250 90 95 57 64 22008 46.5 53.64 6815 102 103 66 67 72274 42 37 6023 108 106 67 66 67231 42.4 35.2 6586 108 106 67 65 70780 39.8 40 5884 111 104 70 68 66708 46 38
Newton
Boldy
16215 121 39
16213 111 35
20729 116 41
19379 117 36
20326 109 44
19326 132 37
A first look at this small sample of results highlights the efficiency of the area-oriented flow, both in terms of compact area and respectable speed. Pushing the tool hard with unrealistic timing constraints leads, in most cases, to a huge area overhead, particularly when logic replication is ON. For a smaller area penalty and slightly higher frequencies, users can opt for timing-driven synthesis with replication switched OFF.
Conclusions
Forward Looking
As its title suggests, this application note is a tour of various aspects related to timing convergence when targeting Actel Flash-based FPGAs using synthesis tools and Designer, the Actel backend toolset. The contribution of this document is to help designers focus on the analysis of the timing bottlenecks, identify the root causes, and cope with them. Several suggestions have been provided, which enable designers to tackle each of these timing challenges at the RTL code level, the synthesis setting, and the backend constraints.
Actel and the Actel logo are registered trademarks of Actel Corporation. All other trademarks are the property of their owners.
www.actel.com
Actel Corporation, 2061 Stierlin Court, Mountain View, CA 94043-4655, USA. Phone 650.318.4200, Fax 650.318.4600
Actel Europe Ltd., River Court, Meadows Business Park, Station Approach, Blackwater, Camberley, Surrey GU17 9AB, United Kingdom. Phone +44 (0) 1276 609 300, Fax +44 (0) 1276 607 540
Actel Japan, EXOS Ebisu Building 4F, 1-24-14 Ebisu, Shibuya-ku, Tokyo 150, Japan. Phone +81.03.3445.7671, Fax +81.03.3445.7668, http://jp.actel.com
Actel Hong Kong, Room 2107, China Resources Building, 26 Harbour Road, Wanchai, Hong Kong. Phone +852 2185 6460, Fax +852 2185 6488, www.actel.com.cn
51900173-0/3.08