Application Note

Designing for Performance on Flash-Based FPGAs

Abstract
Recently, the market has seen the arrival of new FPGAs that, for the first time, address applications and systems with very stringent timing and power requirements. Novice and experienced FPGA and ASIC designers face similar problems when trying to achieve timing closure in a limited number of iterations. This paper introduces the main ingredients for tackling this problem without penalties or frustration. The first ingredient is a good understanding of the FPGA architecture and its embedded features; exploiting these features is the key to overcoming several hurdles. The second ingredient is a sound methodology for analyzing bottlenecks, which can be inherent to the design itself or caused by the tools or by the user settings. The third ingredient is knowledge of the tools' underlying techniques, potential optimizations, and limitations. To combine these ingredients (that is, to converge quickly and meet timing), this paper proposes a methodology using a combination of synthesis tools and Designer, the Actel physical design toolset.

Introduction
FPGA designers often face daunting timing challenges. To reduce frustration and shrink design time and effort, this application note is intended to help users analyze their design, identify the root causes of timing issues, and resolve them. Another important goal is to help designers predict the outcome of their design decisions and tool settings. This paper highlights when a particular user action is likely to be effective and what happens when several user actions are combined. It also covers cautions to observe in timing-critical situations. User actions described in this document cover RTL coding, synthesis option settings, and place-and-route constraint and option settings.

This document is organized in four sections.

"Brief Introduction to Flash FPGA Architecture" presents the salient architectural features of the Flash-based Actel FPGAs: IGLOO, Fusion, ProASIC3, and ProASIC3L.

"Root Causes of Timing Issues" focuses on the main sources of timing challenges and ways to identify them.

"Main Sources for Design Analysis" introduces various design techniques to cope with each of these root causes of timing challenges. These techniques include RTL coding, synthesis flow setting, place-and-route, and physical and routing constraints.

"Ingredients for Timing Optimizations" focuses on inherent design congestion management and provides various results.

March 2008 © 2008 Actel Corporation


Brief Introduction to Flash FPGA Architecture

This section will only briefly cover some features in the IGLOO, Fusion, ProASIC3, and ProASIC3L Flash FPGA families. For full details, the reader can refer to the corresponding product handbooks.

Basic Cell
The basic cell, called the VersaTile and depicted in Figure 1, is a three-input LUT that can implement either a three-input combinatorial gate or a flip-flop with enable. This allows the implementation of any design, whatever its ratio of combinatorial logic to registers. The fine granularity of this tile makes it close to basic ASIC cells, which allows ASIC-like cell-based mappers to apply their full optimization potential.

Figure 1 Logic Cell or VersaTile Architecture (inputs X1, X2, X3 serving as Data, CLK, and CLR/Enable; outputs Y and YL; legend: via = hard connection, switch = flash connection)


Routing
An abundance of routing resources has been designed in to offer the highest possible routability, even when the design's resource utilization is higher than 95%. The focus of this section is on the global routing resources and their flexible aggregation. Figure 2 illustrates the span of the various global and quadrant networks. In essence, each VersaTile in the die can be reached by six global and three quadrant networks. More importantly, these networks can be aggregated to map local clocks, resets, or any high-fanout net. Moreover, the aggregation of one, two, four, or any number of spines introduces neither additional insertion delay nor skew. This is achieved by the embedded MUX tree, which offers a local choice between sourcing from the global network or from any local driver. Refer to the IGLOO, Fusion, ProASIC3, and ProASIC3L product handbooks for more details.

Figure 2 Global Networks Distribution (labeled elements: quadrant global pads T1 to T3, chip (main) global pads, high-performance global network, top and bottom spines, global ribs, spine-selection MUX, scope of spine (shaded area plus local RAMs and I/Os), logic tiles, embedded RAM blocks, I/O ring, pad ring)


Clock Conditioning Circuitry (CCC)

For clock synthesis, the CCC offers the highest flexibility, with a PLL core that supports very low input frequencies and delivers three outputs that can reach 350 MHz (ProASIC3 or ProASIC3L). Figure 3 illustrates the CCC. It is mentioned here because it is used later in the optimization flow.

Figure 3 Clock Conditioning Circuitry (labeled elements: CLKA input; PLL core; fixed, system, and feedback delays; 0, 90, 180, and 270 degree phase selection; outputs GLA (primary), GLB and YB (secondary 1), GLC and YC (secondary 2); D1 = Programmable Delay Type 1, D2 = Programmable Delay Type 2)

Embedded RAM and FIFOs

Embedded dual-port RAMs (in fixed locations on the die) offer variable aspect ratios, flexible clocking, and flexible read and write data widths. Moreover, the user does not need to implement the read/write pointers and the FIFO flag logic in logic cells, because the FIFO controllers are embedded as well. Cascading for wide and deep RAMs and FIFOs is offered through the Actel Libero Integrated Design Environment (IDE) core generator tool. Several synthesis tools do not infer RAMs and FIFOs from the customer code, so users need to instantiate them in the RTL code.

FlashROM
Actel IGLOO, Fusion, ProASIC3, and ProASIC3L Flash devices have 1 kbit of on-chip, user-accessible, nonvolatile FlashROM. The FlashROM can be used in diverse system applications such as system calibration settings, device serialization and/or inventory control, or subscription-based business models (for example, set-top boxes).

I/Os
The I/O tiles are flexible enough to support a large variety of standards, ranging from LVTTL to LVDS. They also support DDR (double-data-rate) access with the I/O embedded registers, allowing high data rates and aggressive external timing.


Root Causes of Timing Issues

The main sources of timing challenges when designing with any FPGA are related to one or a combination of the following:

A large number of paths with a high number of logic levels: This may be due to inefficient RTL code, a poor synthesis option setting, or an issue with the synthesis mapper. Such paths can also be inherent to the design itself.

A large number of high-fanout nets: This situation is mainly due to referencing a signal or a variable several times in the RTL code. It can also be caused by heavy sharing of a combinatorial sub-function.

High congestion: There are two types of congestion: netlist congestion and placement congestion. An example of netlist congestion is a crosspoint switch. This type of application has a high degree of interdependence, about which the place-and-route engines can do very little. Placement congestion occurs when unassociated logic with high degrees of congestion is placed in the same area, creating an artificially high demand on routing resources. It is important to understand netlist congestion for two reasons. The first is to avoid blaming place-and-route for doing a poor job. The second is to give guidance to the placer in apportioning internal resources, especially high-drive circuits that handle high-fanout areas.

A large number of busses: System control functions involve several blocks communicating over wide data and control busses. The routing of these heavily bussed designs becomes challenging when several busses broadcast to several blocks.

A large number of clock domains: More and more designs involve several clocking schemes with asynchronous or related sources, either to accommodate system-level requirements or to implement parallelism (such as several networking fingers with separate clocks and resets). Moreover, with the integration of clock synthesis circuitry inside FPGAs, users exploit this feature to manage the clocking of several other devices on the board. The challenge with this type of design is the interdependency of the clock domains and the imprecise focus of the synthesis and place-and-route engines on them.

A large number of embedded memory blocks: The increasing need for RAM and FIFO capacity makes the cascading of a large number of embedded RAM blocks a must. In some cases, this leads to a placement challenge, which in turn leads to a complex routing problem. Other situations are challenging because of an imbalance between the logic, I/Os, and RAM blocks involved. A simple example is when all the RAM blocks on the top and bottom of the die are used while the core or I/O utilization is very low.

Poor synthesis, poor placement, and/or poor routing: While some symptoms of this inefficiency are obvious, poor quality of results from these tools is the hardest cause to pin down and takes considerable user effort and time to detect.

User options and constraints settings: Timing constraints (namely clock domain frequencies, clock dependencies, and input and output delays) are an integral part of the design. Moreover, clock exceptions such as multi-cycle and false paths are of great help in guiding logic and physical synthesis. Missing clock exceptions often cause several unnecessary iterations. Other constraints, such as floorplanning and placement constraints, can help place-and-route if set appropriately, considering not only congestion but also data flows and timing requirements. Notice that some FPGA-specific constraints, such as I/O assignment or clock assignment to global networks or to an aggregation of these networks, amount to routing-based floorplanning and imply placement constraints on the destination cells. They need careful, deliberate decisions, as they can cause serious issues, leading not only to timing headaches but also to expensive board re-spins.


Finally, the earlier these sources of timing trouble are identified and dealt with, the sooner timing closure can be achieved. More importantly, the effort required is smaller and the opportunity for improvement is larger when the root cause of the timing trouble is avoided or identified early. Figure 4 illustrates this principle. As a corollary, the best return on investment comes from the following, in order of importance:

Quality of the RTL code

Getting the best out of logic synthesis by means of judicious constraints and option settings

A formal link between synthesis and place-and-route

Getting the best out of the place-and-route engines via appropriate constraints and use of the FPGA architecture features

Figure 4 Efforts and Opportunities vs. Time (curves: efforts and benefits, plotted against time)

However, most designs do not meet all the timing specifications at the first attempt, and several iterations are needed. The iterative approach that leads to timing convergence with minimal time and effort is based on the following:

Analysis, as the key to identifying the aforementioned root causes

Sorting these causes by scope of implications and by order of importance for resolving the timing issues

Dealing with these causes in the appropriate order of importance, preferably at the highest level of the flow (that is, the RTL code, if possible)

The "Main Sources for Design Analysis" section summarizes the major sources of pertinent data for identifying the bottlenecks.


Main Sources for Design Analysis

A problem cannot be solved until it is understood. The best way to tackle performance issues is to find clues that lead to the root causes. The identification process is not easy and in most cases consumes time and effort, but a systematic approach eases the task. Much data is already available and can be used to shorten the time and reduce the analysis effort. The following sections cover most of the available reports and identify the type and nature of the data available in each. The missing data is also mentioned, so that users can look for other ways to build the whole picture.

Synthesis Report or Log File

The synthesis quality of results heavily affects the final post-layout timing of the design. The current standard synthesis report files include most of the following data:

Nets with the associated fanout and buffering

Register and combinatorial cell replication

Number of logic levels

Architecture of the arithmetic and storage or memory elements

Resource utilization

Number of clock domains, including explicit and derived domains

Unfortunately, the synthesis reports do not identify the congested blocks, the interconnectivity between blocks, nor the reasons why the synthesis process could not do some of the optimizations.

Actel Designer Compile Report

The backend parser performs minor netlist optimizations such as the elimination of some buffers, inverters, and cells involved in dangling nets. In some cases, the tool combines I/Os with registers. It also parses the user constraints and translates them into place-and-route constraints. The main data provided in the compile report is summarized as follows:

Netlist optimization results (number and type of deleted cells)

Nets promoted to global networks

Distribution of fanout

Device resource utilization

Internal and external net fanout

A limited number, generally 20, of high-fanout net candidates for promotion to a global network or to a segment of the global network

Internal clocks

Similar to the synthesis report, the backend compile does not report data related to blocks that are timing-critical or those that are congested. It does not reveal any data on the interconnectivity of blocks or inter-block busses.

ChipPlanner
The ChipPlanner tool helps the user check several aspects of the quality of the place-and-route results. It allows users to check the span of the global networks in the die and whether the routed design is congested. Using the cell selection function of the Viewer, users can check the relative placement of hierarchical blocks. ChipPlanner can also help identify inefficient placement. Low-fanout nets that follow long, winding routes are a sign of poor placement. However, if similar winding routes appear on high-fanout nets, the cause may be poor routing.


Unfortunately, the user needs to know a great deal about the design before ChipPlanner can be used to identify block interconnects.

Timing Reports
The timing reports are by far the most important source of information about the critical regions of the design, the inherent congestion of blocks, and the quality of results of the synthesis, placement, and routing processes. The expanded paths can reveal whether paths are timing-critical because of a large number of logic levels or because of high delay penalties associated with high-fanout nets. If these high delay penalties are associated with low-fanout nets, this may reveal poor placement or poor routing. If the expanded paths consist of a long chain of two-input gates, the user needs to check whether this is due to poor synthesis or a poor RTL coding style.

Unfortunately, the timing reports do not reveal the critical nets shared between the paths with the worst negative slack, nor the negative slack distribution. Moreover, they do not identify the hierarchical blocks that meet timing with a narrow or large margin. To see the importance of these parameters, consider the timing profiles of two designs, given in Figure 5. A first look at the profiles suggests that the design on the left is much more complex than the one on the right. However, a deeper look at the common critical nets leading to the negative slack may reveal that their number is very limited for the design profiled on the left. Moreover, a closer look at the profile of the design on the right shows a very high sensitivity to changes in place-and-route; a decision to tackle its negative slack may therefore induce a new, larger set of paths with negative slack. The moral of this illustration is that analysis must be deep and complete before decisions are made.

Figure 5 Slack/Timing Profiles for Two Different Designs or Implementations (two panels, each plotting number of paths against timing, with the target timing marked)


Ingredients for Timing Optimizations

Before tackling each of the root causes of timing issues, designers are advised to revisit the RTL code and check which coding optimizations they can afford. The motivation behind this preliminary step is that a minimal effort at the RTL level has a large impact on the final results. The same effect cannot be reached at the place-and-route level, even if the effort deployed is two or three times larger. Similarly, an appropriate set of synthesis options may lead to faster convergence than tedious manual manipulation of a netlist generated with a poor set of synthesis switches.

RTL Coding Tips

Several coding styles have been published, each tied to a technology, an architecture, and a set of tools. Table 1 gives a list of common-sense rules that should lead to a more efficient implementation. Some of these are applied automatically by synthesis tools. Other RTL techniques apply the DDR idea to speed up the overall performance. This principle doubles the original clock frequency using one of the available CCCs, and performs part of the processing on the first cycle of the doubled clock (i.e., the positive edge of the original clock) and the remaining part during the second cycle of the doubled clock (or the negative edge of the original clock).

Table 1 RTL Coding Rules

Original Code                 New Code
A when A >= 0 else -A         Not A + 1 when A(31) else A
A + 1 when EN = 1 else A      A + EN
X < 0                         X(X'HIGH)
X - Y = 0                     X = Y
X * 9                         X SHL 3 + X
X * 15                        X SHL 4 - X
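
Several of the Table 1 rules can be sketched in Verilog as follows. This is an illustration only: the module name, port names, and bit widths are assumptions, not part of the original note, and a given synthesis tool may already perform some of these rewrites on its own.

```verilog
// Sketch only: module and port names are hypothetical, not from the application note.
module coding_rules_sketch (
    input  wire signed [31:0] a,
    input  wire        [15:0] x,
    input  wire               en,
    output wire        [31:0] abs_a,       // |a|, selected by the sign bit
    output wire        [31:0] a_plus_en,   // conditional increment without a MUX
    output wire        [19:0] x_times_9,   // x*9  as shift-and-add
    output wire        [19:0] x_times_15   // x*15 as shift-and-subtract
);
    // The sign test reads only the MSB instead of a full magnitude comparison.
    assign abs_a      = a[31] ? (~a + 32'd1) : a;

    // "a + 1 when en = 1 else a" collapses to one adder with en as the addend.
    assign a_plus_en  = a + {31'd0, en};

    // x*9 = 8x + x and x*15 = 16x - x; x is zero-extended to 20 bits first
    // so that neither the shift nor the sum can overflow.
    assign x_times_9  = ({4'd0, x} << 3) + {4'd0, x};
    assign x_times_15 = ({4'd0, x} << 4) - {4'd0, x};
endmodule
```

The rewritten forms compute the same values as the originals; the point is that each replaces a comparator, a MUX, or a multiplier with a cheaper structure.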

Dealing with High-Fanout Nets

Explicit Logic Replication at the RTL Level
In most cases, synthesis performs logic and sequential replication of the drivers blindly and without consideration of the destination cells and where they belong. There are various possibilities for reducing the fanout by explicit replication of the driver. Figure 6 illustrates an example where the explicit replication allows flexibility of the replicated drivers to be placed closer to the internal destination cells and the external ones.

Figure 6 Explicit Replication
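
A minimal sketch of explicit driver replication is given below. All signal and module names are hypothetical. The `syn_preserve` comment is a Synplify-style attribute used here on the assumption that the synthesis tool would otherwise merge the two identical registers; the equivalent directive should be confirmed in the documentation of the tool actually used.

```verilog
// Sketch only: names are hypothetical. syn_preserve is a Synplify-style
// attribute; confirm the equivalent directive in your synthesis tool.
module replicated_enable_sketch (
    input  wire        clk,
    input  wire        en_in,
    input  wire [15:0] d_a,
    input  wire [15:0] d_b,
    output reg  [15:0] q_a,
    output reg  [15:0] q_b
);
    // Two copies of the same enable register, one per destination group.
    reg en_copy_a /* synthesis syn_preserve = 1 */;
    reg en_copy_b /* synthesis syn_preserve = 1 */;

    always @(posedge clk) begin
        en_copy_a <= en_in;
        en_copy_b <= en_in;
    end

    // Each group of loads sees only its own copy, so each driver carries
    // half the fanout and can be placed close to its destinations.
    always @(posedge clk) begin
        if (en_copy_a) q_a <= d_a;
        if (en_copy_b) q_b <= d_b;
    end
endmodule
```

Because the designer decides which loads attach to which copy, the grouping can follow the hierarchy or the expected floorplan rather than an arbitrary split.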


In case a block is the source of multiple high fanout nets and its size in terms of logic cells is limited (less than 200 VersaTiles), it is worth investigating the replication of the block itself, as illustrated in Figure 7.

Figure 7 Replication of a Block Source of High-Fanout Nets (before: a single high-fanout module drives Sub_Mod 1 through Sub_Mod 7; after: the high-fanout module is replicated and each copy drives a subset of the sub-modules)

In all these cases, designers are cautioned against making these explicit replications blindly. They have to consider this change with care and avoid replicating a driver of a generated clock domain or a synchronized reset. Failing to do so will lead to new clock domains and the headache of making sure they are mapped to low-skew routing resources, or having to analyze removal time for a large set of synchronized resets.

Synthesis Control
Synthesis tools typically offer ways to set fanout limits globally. When available, use local, block-level fanout control only for blocks exhibiting high net fanout.

Backend Control of High-Fanout Nets

The easiest and most efficient way to deal with the delay penalty associated with high-fanout nets is to map these nets to segments of the global networks called spines. Users must also consider the implication of such mapping, as it involves a placement constraint on the driver and the destination cells. They need to fit in the region covered by the segment, called the scope of the spine. If the global networks are not available or if the placement constraint will introduce high congestion, users can create a so-called net region to limit the skew and the penalty associated with the net. Finally, if these two techniques do not apply, users can reset the fanout of a net and the tool will work on the shielding of the critical ports and reducing the delay penalties.


Figure 8 illustrates the use of clock network segments to map several high fanout nets.

Figure 8 High-Fanout Nets Mapped to Global Network Segments

Dealing with a High Number of Logic Levels

Revisiting RTL Code
In some cases, the large number of logic levels is inherent to the design. Users can cope with this by adding explicit pipeline stages, adding registers when appropriate. Users may also need to re-architect the timing of their design and anticipate data readiness one cycle ahead of the cycle where they will be processed, allowing two cycles for these paths if possible. Also, when RAM blocks are involved in paths with a high number of logic levels, users can investigate the use of the pipelined configuration of the read port or anticipate the read a cycle ahead, thus allowing two cycles for these paths.
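
As an illustration of explicit pipelining, the sketch below splits a four-operand addition into two register stages. The module, its ports, and the widths are hypothetical and chosen only to show the structure; the result appears two clock cycles after the operands are presented.

```verilog
// Sketch only: a hypothetical four-operand adder split into two register stages.
module pipelined_sum_sketch (
    input  wire        clk,
    input  wire [15:0] a, b, c, d,
    output reg  [17:0] sum
);
    reg [16:0] sum_ab, sum_cd;   // stage-1 partial sums

    always @(posedge clk) begin
        // Stage 1: two independent adders work in parallel.
        sum_ab <= {1'b0, a} + {1'b0, b};
        sum_cd <= {1'b0, c} + {1'b0, d};
        // Stage 2: combine the partial sums registered on the previous cycle.
        sum    <= {1'b0, sum_ab} + {1'b0, sum_cd};
    end
endmodule
```

Each register-to-register path now contains a single adder instead of a chain of adders, at the cost of one extra cycle of latency that the surrounding design must absorb.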

Synthesis Control
The synthesis flow allows for retiming. Retiming moves registers across combinatorial logic and can increase area, as the number of registers may grow. Users need to watch this increase and monitor the utilization of the device resources. When the paths with many logic levels involve arithmetic blocks, users need to know that synthesis tools offer a variety of architectures for these blocks. Users need to check the default choice made by the synthesis tool and see whether it is an efficient architecture for the device.


Backend Control
As part of the analysis, users need to verify the placement of the cells as well as the fanout of the nets involved in these paths. In case the fanout of these nets is limited, the tool offers a flexible set of placement constraints allowing the user to confine the placement of these paths/blocks. In case one or more nets are associated with a delay penalty, users can use the shielding technique described above.

Dealing with High Congestion

This section uses an inherently congested block as an illustration. At first glance, the function of the RTL code depicted in Figure 9 is easy to understand, but the underlying routing complexity is easy to overlook. The same complexity occurs whenever the number of logic cells needed is very limited but the number of routes is extremely high. This is the symptom of what is called "routing congestion."

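reg [63:0] A, B, C, D, E, F, G, H, M;
reg [63:0] Z;        // MUX output (declaration missing in the original listing)
reg [2:0] SELECT;

// 8:1 MUX, 64 bits wide: few logic cells, but 512 data routes converge on it.
always @ (A or B or C or D or E or F or G or H or SELECT)
begin
  case (SELECT)
    3'b000 : Z = A;
    3'b001 : Z = B;
    3'b010 : Z = C;
    3'b011 : Z = D;
    3'b100 : Z = E;
    3'b101 : Z = F;
    3'b110 : Z = G;
    3'b111 : Z = H;
  endcase
end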

Figure 9 Example of Routing Congestion (eight 64-bit inputs, A[63:0] through H[63:0], feeding an 8:1 MUX)


Revisiting RTL Code

For this particular congested block, several ideas can be investigated at the RTL code level. One of the recommended techniques is to decentralize the routing congestion, as suggested in Figure 10.

Figure 10 MUX Decentralization Technique (left: inputs I1 through I8 from Modules 1 through 4 converge on a central 8:1 MUX; right: Module 1 handles I1 and I2, Module 2 handles I3 and I4, Module 3 handles I5 and I6, Module 4 handles I7 and I8, and a 4:1 MUX produces Z)
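
One possible way to express the decentralized structure for the 8:1 MUX of Figure 9 is sketched below. The module names are hypothetical, and the select split (bit 0 used locally, bits 2:1 used centrally) is an assumption chosen so that the result matches the original case statement.

```verilog
// Sketch only: hypothetical names; the select split is an assumption.
module local_mux2_sketch #(parameter W = 64) (
    input  wire         sel,
    input  wire [W-1:0] i0, i1,
    output wire [W-1:0] y
);
    // Intended to sit next to the module that produces i0 and i1.
    assign y = sel ? i1 : i0;
endmodule

module decentralized_mux8_sketch #(parameter W = 64) (
    input  wire [2:0]   select,
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output reg  [W-1:0] z
);
    wire [W-1:0] m1, m2, m3, m4;

    // First level: one 2:1 selection beside each source module.
    local_mux2_sketch #(.W(W)) u1 (.sel(select[0]), .i0(a), .i1(b), .y(m1));
    local_mux2_sketch #(.W(W)) u2 (.sel(select[0]), .i0(c), .i1(d), .y(m2));
    local_mux2_sketch #(.W(W)) u3 (.sel(select[0]), .i0(e), .i1(f), .y(m3));
    local_mux2_sketch #(.W(W)) u4 (.sel(select[0]), .i0(g), .i1(h), .y(m4));

    // Second level: only four buses converge on the central 4:1 MUX.
    always @(*) begin
        case (select[2:1])
            2'b00:   z = m1;
            2'b01:   z = m2;
            2'b10:   z = m3;
            default: z = m4;
        endcase
    end
endmodule
```

The function is unchanged, but the number of wide buses arriving at the central point is halved, which spreads the routing demand over the die.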

Table 2 shows the area/frequency results obtained when synthesizing the code as is, instantiating a large MUX and implementing the recommended decentralization technique.

Table 2 Centralized vs. Decentralized Implementations

MUX                    Pure RTL Synthesis     Using 4:1 MUX Cores
32:1, 16-bit wide      148 MHz (130 tiles)    171 MHz (120 tiles)
64:1, 16-bit wide      134 MHz (246 tiles)    160 MHz (217 tiles)

Synthesis Control
Using the same illustration example, users must pay attention to the select lines, as the least significant bits are definitely very high-fanout nets. Moreover, the coding of the select lines, either compact or one-hot, leads to different area and speed results. Table 3 shows these results.

Table 3 Compact vs. One-Hot Select Line Encoding

MUX                    Compact Select         One-Hot Select
64:1, 8-bit wide       134 MHz (132 tiles)    163 MHz (148 tiles)
64:1, 16-bit wide      130 MHz (246 tiles)    158 MHz (231 tiles)

As a corollary, users need to think contextually. If the select lines of the large MUX are the state register of an FSM, the states of this machine should be one-hot encoded, even if this encoding does not look optimal locally.
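
A one-hot select maps naturally onto an AND-OR structure, as in the sketch below. The module name, the four-input size, and the 16-bit default width are illustrative assumptions; the code relies on exactly one select bit being set at any time.

```verilog
// Sketch only: hypothetical names and sizes.
module onehot_mux4_sketch #(parameter W = 16) (
    input  wire [3:0]   sel_onehot,   // exactly one bit is assumed to be set
    input  wire [W-1:0] i0, i1, i2, i3,
    output wire [W-1:0] z
);
    // AND-OR structure: each select bit gates one input directly,
    // so no binary decode stage sits in front of the data path.
    assign z = ({W{sel_onehot[0]}} & i0)
             | ({W{sel_onehot[1]}} & i1)
             | ({W{sel_onehot[2]}} & i2)
             | ({W{sel_onehot[3]}} & i3);
endmodule
```

With this form, each select line drives only the gates of one input, instead of the low-order bits of a compact select fanning out to every stage of the MUX tree.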


Another synthesis control is related to resource sharing. The goal of the resource sharing is to reduce design area by sharing large blocks by means of adding MUXes on the inputs of these blocks. Users need to check the implication and manage the balance between slightly higher utilization, added congestion, and number of logic levels.
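
The trade-off can be seen in the following sketch, which writes the same function with and without sharing. The names and widths are hypothetical; in practice the synthesis tool's resource-sharing switch decides which structure is built, independently of how the RTL is written.

```verilog
// Sketch only: hypothetical names; both outputs compute the same value.
module sharing_sketch (
    input  wire        sel,
    input  wire [15:0] a, b, c, d,
    output wire [16:0] y_unshared,
    output wire [16:0] y_shared
);
    // Unshared: two adders, followed by one MUX on the results.
    assign y_unshared = sel ? ({1'b0, a} + {1'b0, b})
                            : ({1'b0, c} + {1'b0, d});

    // Shared: a single adder, with a MUX on each operand in front of it.
    wire [15:0] op1 = sel ? a : c;
    wire [15:0] op2 = sel ? b : d;
    assign y_shared = {1'b0, op1} + {1'b0, op2};
endmodule
```

The shared form saves one adder but routes all four operand buses to one place, which is exactly the kind of added congestion the text asks users to weigh against the area saving.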

Backend Control
One major recommendation is to avoid aggressive placement or timing constraints on these routing congestion spots. In other words, users are advised to relax the placement constraints to allow higher porosity. This has to be combined with a relaxation of the timing constraints on the congested block, as well as on other non-timing-critical blocks, so that their internal nets do not compete for a critical routing resource. Another, higher-level measure users can adopt is a more data-oriented floorplan for the blocks involved in the congestion.

Dealing with a Large Number of Busses

Revisiting RTL Code
While the margins for maneuvering are tight, designers may revisit the design architecture for bus sharing or attempt to bury some of the busses in larger blocks. Another important aspect that may make the routing of these busses even trickier is the fanout of each of the slices of these busses. If the fanout is not homogeneous, users are dealing with a more complex issue.

Synthesis Control
Unfortunately, at the synthesis level, very little can be done to cope with these situations.

Backend Control
The most efficient physical implementation of these busses is the shortest routing possible. This involves placement of the drivers and the destination of each slice of the busses. If the fanout of these slices is not homogeneous, the highest-fanout bus lines can be mapped using low-skew segments of the global networks. In any case, users need to adopt a data-driven placement of the communicating blocks and relax the placement constraints for higher porosity and ease of routability of blocks and busses.

Dealing with a Large Number of Clock Domains

Revisit the RTL Code
While most clocking schemes are defined at an early stage of the system architecture cycle, designers use various techniques to handle the clocking of the various blocks of the design, to cope with either timing bottlenecks or area/resource constraints. Other designers are power-conscious and may implement various clocking schemes using clock MUXing or gated clocks. For all these techniques, designers must accurately assess the gain before creating new clock domains. More importantly, designers need to think ahead and make sure that the resulting design and its associated clocking scheme remain "analysis-friendly," as the static timing analysis phase can otherwise become tedious and time-consuming.
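
One alternative worth weighing before gating a clock is a clock enable, which keeps the register in the original clock domain. The sketch below is an illustration under that assumption, with hypothetical signal names; it does not reproduce the power saving of a true gated clock, so the gain still has to be assessed case by case.

```verilog
// Sketch only: hypothetical names. Shows an enable-based register that
// stays on clk instead of being clocked from a gated copy of clk.
module enable_instead_of_gate_sketch (
    input  wire       clk,
    input  wire       rst_n,
    input  wire       block_active,   // hypothetical activity signal
    input  wire [7:0] d,
    output reg  [7:0] q
);
    // A gated version would clock q from (clk & block_active), which
    // creates a derived clock domain that must be constrained and analyzed.
    // Here q simply holds its value while block_active is low.
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            q <= 8'd0;
        else if (block_active)
            q <= d;
    end
endmodule
```

Since the basic cell described earlier can implement a flip-flop with enable, this form adds no clock domain for static timing analysis to track.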

Synthesis Control
The focus of the designer should be on setting all the timing constraints. These include the tightest clock frequencies, the inter-clock offsets, and the input and output delays. False paths and multi-cycle paths are critical and must not be neglected.

Backend Control
The general guidelines are to separate the clock domains and adopt clock-domain-based floorplanning. This also economizes on global networks and makes more effective use of their low-skew segments. When doing so, users need to account for data dependencies between domains and for path optimizations.


A case worthy of note is when some of the clock domains drive a large number of RAM/FIFO blocks (deep or shallow RAMs that are mapped cascading several embedded RAM blocks). In such a case, users can consider placing the RAM blocks on the top and bottom of the die, provided their performance degradation does not affect critical paths.

Dealing with a Large Number of I/Os

Revisit the RTL Code
Even if the margins for maneuvering are tight at this level, designers may adopt time multiplexing and lower the number of I/Os if this is possible.

Synthesis Control
Users need to turn off the inference of registered I/Os, as this leads to an inherent placement constraint of the registers associated with these I/Os.

Backend Control
Users need to investigate carefully the ratio of logic to I/O utilization. If the logic utilization is low, the recommendation is to run place-and-route with I/O register combining turned off. If the internal register-to-register performance is satisfactory, users then need to investigate the slack margins and allow register combining for the most critical external setup or clock-to-out timing. If doing so does not resolve the problem, the I/O placement can be modified to accommodate both the internal and external timings.

Dealing with Poor Synthesis, Poor Placement, and Poor Routing

Poor synthesis results can be caused by either inherent limitation in the synthesis engine or by poor setting of options and wrong specification of timing constraints. It can also be related to lack of knowledge of how the mapping algorithms work and what to expect once a particular constraint is added. This mismatch between true capacity of the tool and the user expectation leads to frustration and several unnecessary iterations. In the category of "know your tool," Table 4 provides a sample of results for a very limited number of benchmarks that were processed with an industry synthesis tool and Designer backend toolset.
Table 4 Sample of Results on a Small Set of Benchmarks

Columns are labeled synthesis option / place-and-route option.
Synthesis options: AD = Area-Driven, TD-On = Timing-Driven with Replication On, TD-Off = Timing-Driven with Replication Off.
Place-and-route options: Def = Default, ABR = All Buffers Removed.

Design / Metric         AD/Def    AD/ABR    TD-On/Def  TD-On/ABR  TD-Off/Def  TD-Off/ABR

RDES (MUX- and XOR-based)
  Area (VersaTiles)     12822     12822     18181      16362      16969       14359
  SystemClk (MHz)       87        87        83.9       87.5       84.2        82.7

Syncop (Bus interfaces / large reg files)
  Area (VersaTiles)     6150      6124      7284       6886       7192        6855
  HighClock (MHz)       71        70        63.5       66         63.5        68.2

CORDIC (Datapath)
  Area (VersaTiles)     2540      2538      3041       2909       3035        2905
  TopClk (MHz)          70.3      71.67     72.7       77.8       75          72.5

HRK
  Area (VersaTiles)     3249      3238      11398      9792       11180       9691
  MainClk (MHz)         57.3      53.4      80.5       78.3       76.7        78.6

Imen (Scrambling and Descrambling)
  Area (VersaTiles)     4274      4250      6815       6023       6586        5884
  VersaC (MHz)          88        90        102        108        108         111
  HardClk (MHz)         87        95        103        106        106         104
  RxClk (MHz)           60        57        66         67         67          70
  TxClk (MHz)           60        64        67         66         65          68

Newton (System Interfaces and Control)
  Area (VersaTiles)     22013     22008     72274      67231      70780       66708
  SysClk (MHz)          44.7      46.5      42         42.4       39.8        46
  PicClk (MHz)          51.2      53.64     37         35.2       40          38

Boldy (Dual MAC/Memory Intensive)
  Area (VersaTiles)     16215     16213     20729      19379      20326       19326
  CPUclk (MHz)          121       111       116        117        109         132
  AFDX_clk (MHz)        39        35        41         36         44          37

A first look at this small sample of results highlights the efficiency of the area-driven flow, both in terms of compact area and respectable speed. Pushing the synthesis tool hard with unrealistic timing constraints leads, in most cases, to a large area overhead, particularly when logic replication is on. For a smaller area penalty and slightly higher frequencies, users can opt for timing-driven synthesis with replication switched off.
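The area and frequency trade-off can be quantified directly from Table 4. The sketch below transcribes the Default place-and-route columns for the four single-clock benchmarks, keyed by clock name, and reports the percentage change of each timing-driven flow relative to the area-driven flow. The dictionary layout, key names, and function names are assumptions made for this illustration; only the numbers come from the table.

```python
# (area in VersaTiles, clock in MHz), Default place-and-route, from Table 4.
# Keys are the clock names used in the table; flow names are illustrative.
TABLE4_DEFAULT_PNR = {
    "SystemClk": {"area_driven": (12822, 87.0),
                  "td_repl_on": (18181, 83.9),
                  "td_repl_off": (16969, 84.2)},
    "HighClock": {"area_driven": (6150, 71.0),
                  "td_repl_on": (7284, 63.5),
                  "td_repl_off": (7192, 63.5)},
    "TopClk": {"area_driven": (2540, 70.3),
               "td_repl_on": (3041, 72.7),
               "td_repl_off": (3035, 75.0)},
    "MainClk": {"area_driven": (3249, 57.3),
                "td_repl_on": (11398, 80.5),
                "td_repl_off": (11180, 76.7)},
}


def percent_change(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


def compare_to_area_driven(results):
    """Map each timing-driven flow to (area change %, frequency change %)."""
    base_area, base_mhz = results["area_driven"]
    report = {}
    for flow, (area, mhz) in results.items():
        if flow == "area_driven":
            continue
        report[flow] = (percent_change(base_area, area),
                        percent_change(base_mhz, mhz))
    return report


if __name__ == "__main__":
    for clock, results in TABLE4_DEFAULT_PNR.items():
        for flow, (d_area, d_mhz) in compare_to_area_driven(results).items():
            print(f"{clock:10s} {flow:12s} area {d_area:+7.1f}%  "
                  f"freq {d_mhz:+6.1f}%")
```

Running the comparison per benchmark, rather than averaging, keeps visible how strongly the outcome depends on the design: the area penalty of the timing-driven flows varies widely across this small sample.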

Conclusions
As its title suggests, this application note is a tour of various aspects related to timing convergence when targeting Actel Flash-based FPGAs using synthesis tools and Designer, the Actel backend toolset. The contribution of this document is to help designers focus on the analysis of the timing bottlenecks, identify the root causes, and cope with them. Several suggestions have been provided that enable designers to tackle each of these timing challenges at the RTL coding level, in the synthesis settings, and in the backend constraints.

Actel and the Actel logo are registered trademarks of Actel Corporation. All other trademarks are the property of their owners.

www.actel.com

Actel Corporation, 2061 Stierlin Court, Mountain View, CA 94043-4655 USA. Phone 650.318.4200, Fax 650.318.4600
Actel Europe Ltd., River Court, Meadows Business Park, Station Approach, Blackwater, Camberley, Surrey GU17 9AB, United Kingdom. Phone +44 (0) 1276 609 300, Fax +44 (0) 1276 607 540
Actel Japan, EXOS Ebisu Building 4F, 1-24-14 Ebisu, Shibuya-ku, Tokyo 150, Japan. Phone +81.03.3445.7671, Fax +81.03.3445.7668, http://jp.actel.com
Actel Hong Kong, Room 2107, China Resources Building, 26 Harbour Road, Wanchai, Hong Kong. Phone +852 2185 6460, Fax +852 2185 6488, www.actel.com.cn

51900173-0/3.08

March 2008. © 2008 Actel Corporation
