MACDONALD TIMINGCLOSURE FINAle
Nancy D. MacDonald
Principal IC Design Engineer, Enterprise Networking Group
Broadcom Corporation, Irvine, CA, USA
Notice of Copyright
This document has been submitted to, and reviewed and posted by, the editors of DAC.com.
DAC.COM KNOWLEDGE CENTER ARTICLE Page 2 of 18
www.dac.com
I. INTRODUCTION
Final timing closure in complex, hierarchical, deep submicron designs is an arduous task. While
EDA tools do a very good optimization job, in general, they often leave several hundred setup
and hold violations on the table. This is not the fault of the tools: they do what they can while
maintaining competitive turnaround times, and design teams are always pushing the limits of
tool capabilities and design complexity. Such is the nature of chip design in a competitive world.
For hierarchical designs, timing closure applies to both individual blocks (hard partitions) and
the top level. In this paper, we look at timing closure from the top-level perspective,
assuming all lower-level blocks are closed or nearly closed. In fact, even the “closed” blocks
may see a whole new set of violations when merged with the top-level. This is due to (1)
inaccuracies in block-level, interface timing constraints or (2) differences in SI timing windows
for block-level, interface nets, particularly clocks. In any event, assume that our EDA tool flow
has taken us as far as possible toward closing timing so that most of the remaining violations
are small, and all very large violations have been analyzed and removed. In addition, any
failures caused by incorrect or incomplete timing constraints have been addressed. What
should a designer do to achieve timing closure?
Typically, this last leg of timing closure consists of manual design tweaks and corresponding
timing iterations. For large designs, it should take about three weeks to close timing completely
(e.g., five manual repair iterations, at three days per iteration, assuming no more than a couple
of major tasks per designer). This article starts with high-level principles to follow when finishing
timing closure manually, and then gives some specific details.
At the end of each iteration (three steps described above), we expect to see an improvement in
overall top-level timing QOR, although the mix of violations may vary slightly.
[Figure: "About 5 Iterations"; "Breakdown of Timing Violations on Per-Block Basis".]
In general, we address four main categories of violations during every iteration of the flow described
above. The categories are shown below and should be addressed in the following order.
1) Electrical rule violations
2) Noise violations
3) Setup violations
4) Hold violations
We fix electrical rules (such as large max transition and max capacitance) first because we want our
design in a “legal” state before taking action on other issues; otherwise, unnecessary design changes
(resizing or Vt swap) may occur to address violations that, in reality, stem from the electrical rule
failures. Next, we repair noise glitches to ensure a clean noise profile before addressing setup and
hold violations. Setup violations are fixed third because the changes needed to repair these failures
sometimes have a large ripple effect, whereas hold violations are fixed last because changes to fix
these violations are often local and isolated.
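The repair order described above can be sketched as a simple sort. This is only an illustration: the record fields and the violation names are hypothetical and do not come from any particular tool's report format.

```python
# Priority order taken from the text: electrical rules, noise, setup, hold.
CATEGORY_ORDER = ["electrical", "noise", "setup", "hold"]


def order_for_repair(violations):
    """Sort by category priority, then by worst (most negative) margin first."""
    rank = {name: i for i, name in enumerate(CATEGORY_ORDER)}
    return sorted(violations, key=lambda v: (rank[v["category"]], v["margin"]))


# Hypothetical violation records; names and numbers are illustrative only.
violations = [
    {"name": "hold_u7", "category": "hold", "margin": -0.02},
    {"name": "setup_u3", "category": "setup", "margin": -0.15},
    {"name": "maxtran_n9", "category": "electrical", "margin": -0.30},
    {"name": "glitch_n4", "category": "noise", "margin": -0.05},
    {"name": "setup_u1", "category": "setup", "margin": -0.40},
]

ordered = [v["name"] for v in order_for_repair(violations)]
print(ordered)
```

Within a category, the sketch puts the worst margin first. That tie-break is our assumption, not a rule stated in the flow.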
To expedite timing closure, we provide some general guidelines as to what types of “fixes” (design
changes / transformations) are allowable during each of the five iterations. In general, some changes
such as adding useful skew are far more disruptive than others such as changing the Vt class of an
instance. We recommend that highly disruptive operations be completed in very early iterations (say
1-2) and prohibited in later ones (say 3-5). The flow diagram in Figure 1 shows what types of
operations are permitted at various stages.
5) Adding useful skew. Useful skew is guaranteed to fix a violating path; however, it inevitably
causes other, new violations to pop up. Nevertheless, sometimes there is no other choice. This
tactic should be used only as a last resort since the resulting new violations must be fixed
manually or even with another pass through the EDA tool before resubmitting the block for top-
level integration.
Although we strongly recommend the above guidelines regarding what kind of operations are
permitted (and when) during timing closure iterations, we note that exceptions do occur. A few
exceptions should always be allowed on a special request basis.
Finally, for all repair passes, designers must be absolutely sure to visualize what is going on in the
layout. If problems cannot be solved with a simple Vt swap (or sometimes even when they can),
designers must examine the layout to identify the root cause. Is it a long net, a missing non-default
rule, poor placement of a few cells, or a detour around a memory or other hard macro? The
appropriate course of action often cannot be determined without taking a look at the layout view of the
design. Make certain that designers do so regularly.
Below are some specific tips for fixing different types of violations. In addition, we provide loose
guidelines for how to address timing failures across various chip modes and corners.
Rule Number 1: Even if designers are not able to address all violations in only one pass, they should
still look at all modes and corners to guide their manual fixes.
This concept is important for several reasons. First, the more modes that are addressed
simultaneously the better. We do not want any surprises to appear close to tapeout due to neglected
modes/corners. Next, occasionally, the same path cannot meet both setup and hold timing across all
corners. Paths such as this must be flagged to the front-end design manager as soon as possible.
Certainly, we must take note of any fixes that close timing in one mode but break another.
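One way to catch a fix that closes timing in one mode but breaks another is to compare slack per mode/corner before and after the change. The sketch below assumes slack has already been collected into dictionaries keyed by a scenario name; the scenario names and values are hypothetical.

```python
def find_regressions(slack_before, slack_after):
    """Return scenarios that are failing after the fix and worse than before."""
    broken = []
    for scenario, before in slack_before.items():
        after = slack_after[scenario]
        if after < 0 and after < before:
            broken.append(scenario)
    return sorted(broken)


# Hypothetical worst slack (ns) for one path, per mode/corner scenario.
before = {
    "functional/ss_cold": -0.05,
    "functional/ff_hot": 0.02,
    "scan_shift/ff_hot": 0.01,
}
after = {
    "functional/ss_cold": 0.01,   # setup violation fixed
    "functional/ff_hot": 0.02,    # unchanged
    "scan_shift/ff_hot": -0.01,   # the same fix broke this scenario
}

print(find_regressions(before, after))
```

Any scenario this check reports should be recorded alongside the fix that caused it.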
Rule Number 2: Always identify the dominant corner for setup failures and for hold failures, and fix
violations in those corners first.
Although we may look at all modes and corners when evaluating timing results, it is often more
efficient to identify one corner to address first when fixing violations. For example, we can usually
identify one dominant corner for setup (and another for hold). Fixing violations in the dominant corner
clears most of the violations for the others. For instance, if most violations in a design are on
memory interfaces, and memories are slowest in the ss/low temperature corner, then ss/low
temperature is the dominant corner for setup.
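As a rough first pass, the dominant corner can be estimated by counting violations per corner for each check type. Counting is only a proxy, since the text defines the dominant corner by which fixes clear the others. The corner names and record fields below are hypothetical.

```python
from collections import Counter


def dominant_corner(violations, check):
    """Return the corner with the most violations of the given check type."""
    counts = Counter(v["corner"] for v in violations if v["check"] == check)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical violation records; paths and corners are illustrative only.
violations = [
    {"path": "mem0/din", "check": "setup", "corner": "ss_low_temp"},
    {"path": "mem1/din", "check": "setup", "corner": "ss_low_temp"},
    {"path": "mem1/din", "check": "setup", "corner": "ss_high_temp"},
    {"path": "regA/d", "check": "hold", "corner": "ff_high_temp"},
    {"path": "regB/d", "check": "hold", "corner": "ff_high_temp"},
    {"path": "regB/d", "check": "hold", "corner": "ff_low_temp"},
]

print(dominant_corner(violations, "setup"))
print(dominant_corner(violations, "hold"))
```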
If block designers cannot address all of their setup/hold violations in a single iteration, they should
focus on the main functional mode first, as fixing violations in this mode will likely fix the same
violation in another mode. Also, changes made to a block in main functional mode tend to affect a
large number of other blocks, so we should address main functional mode early due to this ripple
effect. Then, address the secondary modes and all test modes (in parallel but with lower priority).
The first problem is that SI timing windows for clock inputs to blocks are frequently not modeled
accurately. This is due to inaccurate estimation of clock delays to block inputs and the fact that clock
reconvergence pessimism removal doesn’t apply to timing windows. Therefore, any OCV derating,
clock muxing, etc. will have an impact on the timing windows seen by the blocks. Inaccurate timing
windows on clock pins imply that the noise profile for a block, based on the timing constraints, may be
different from the one seen when the block is integrated into the top-level of the design. This problem
causes failures to appear during top-level timing that were not seen at the block level. Designers can
address these violations during manual timing closure, or a methodology change can be adopted to
ensure more accurate timing windows.
The second problem is simply noise on clock nets, most frequently caused by one of the following:
1) Weak drivers on long clock nets.
2) Missing or incorrectly applied non-default rules.
3) Clock nets routed on thick metal to cross over macro cells, etc. Thick metal has more sidewall
capacitance, so capacitive coupling usually increases (despite increased spacing rules).
4) Inadequate clock max transition or max capacitance rules.
These problems may creep into designs gradually during ECOs and manual or automated timing
optimization. Clock noise is a big problem since it translates directly into greater skew between flops.
Consider the example in Figure 2 below. In this case, eliminating the noise on net CLK_A_N would
fix a setup violation on REG_A2. CLK_A_N is driven by weak driver BUF_A. To improve the slew
rate on net CLK_A_N, we can change BUF_A to a higher-drive cell. This reduces the noise on
CLK_A_N. Note that we do not use Vt swap to resolve problems on clock nets because mixing Vt
classes on clock trees leads to noise problems and greater on-chip variation.
[Figure 2: Weak driver BUF_A driving clock net CLK_A_N to register REG_A2.]
In some cases, clock nets are missing non-default rules. In Figure 3, no resizing is possible, as
drivers are already maximally sized. The only solution is to apply a non-default rule to CLK_A_N as
shown in the figure. In this case, we use triple spacing. Shielding or increasing wire width for net
CLK_A_N may also be possible, but would come at a cost in terms of design routability.
In general, note that different NDRs may be applied to different clocks. For example, we might shield
an extremely high-frequency system clock while using default rules for a low-frequency test clock.
Also, depending on congestion, we might apply an NDR to the upper portion of a large clock tree but
allow the leaf nets to be routed with default rules. Designers may employ many different flavors of
NDRs, and the best choice is typically design- and process-dependent.
[Figure 3: Net CLK_A_N routed with single spacing versus triple spacing.]
Noise problems caused by clocks routed in thick metal may also be resolved with non-default rules;
however, such problems usually occur in congested areas of the design such as when routing over a
macro. It is best to avoid routing clocks in thick metal if at all possible.
The problem of having relaxed clock max transition or max capacitance rules for clocks is systemic
and is difficult to resolve with manual tweaks to the clock tree. We just have to live with noise caused
by these problems.
Vt swapping to replace a (victim net’s) high Vt driver cell with a standard or low Vt cell is always a
good option because it requires no change to the routing at all, assuming that the cell footprints match
across all Vt classes. As mentioned previously, routing perturbations during the final stages of timing
closure are undesirable because they may easily introduce new violations of virtually any type.
If Vt swap is not possible because the driving cell of the victim is already low Vt, then upsizing the driver
is the second most desirable choice. This will reduce the transition time (slew) on the victim net and change
the SI timing window.
The third best solution for eliminating noise glitches is repeater insertion. If a victim net is long, the
only choice may be to break the net. Repeaters should be inserted approximately halfway between
source and sink or inserted in such a way as to break up a high fanout net into relatively “equal” parts.
Figure 4 shows a few examples of “good” locations to add repeaters. Note that when fixing noise,
one is not trying to balance the capacitive load or speed up any part of the path. The goal is to
eliminate noise by changing timing windows and minimizing coupling. Therefore, we simply try to
break the total wire length into equal parts and/or eliminate any long wire segments. Repeater
insertion does add delay to a path, but hopefully, the additional delay is cancelled out (and then
some) by the reduction in crosstalk delay.
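The "equal parts" rule for noise repeaters can be sketched as a small calculation. The maximum segment length is a hypothetical, process-dependent threshold that we introduce for illustration. Positions are measured along the route from the driver. A real placement must also respect legal cell sites and blockages.

```python
import math


def repeater_positions(wire_length_um, max_segment_um):
    """Split a wire into equal segments no longer than max_segment_um.

    Returns the distance from the driver (in microns) of each repeater.
    """
    segments = max(1, math.ceil(wire_length_um / max_segment_um))
    step = wire_length_um / segments
    return [step * k for k in range(1, segments)]


print(repeater_positions(1200, 700))  # one repeater, halfway along the net
print(repeater_positions(1500, 500))  # two repeaters, three equal segments
print(repeater_positions(300, 500))   # short net, no repeater needed
```

For a high-fanout net the same idea applies per branch. The sketch covers only the two-pin case.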
When compared to logic changes, manually moving aggressor nets away from a victim net
may seem like the best approach. However, it isn’t always clear which net(s) to move. When multiple
aggressors are involved, it is difficult to assess the effect of per-aggressor and small accumulated-
aggressor filtering. We cannot easily/precisely determine how the contributions of each aggressor
combine to cause the problem. Furthermore, this approach is very tedious, and if other automated
routing changes occur in the neighborhood of the victim, the SI violation may reappear at a later stage
in the design cycle. Therefore, we try to avoid this approach because it isn’t always as simple as it
may appear.
Note that simply applying non-default routing rules to the victim net is also an option, but in a
congested area, this may perturb the routing and introduce new violations. Non-default rules for
signal nets should be used infrequently and only when there is sure to be minimal fallout.
[Figure 4: Examples of good repeater locations, panels (a), (b), and (c). Legend: driver in yellow, loads in blue, inserted buffer in red.]
Designers may need to clean up a few electrical rule violations manually. These include: (1) max
capacitance violations and (2) large max transition violations.
Max capacitance violations most often occur when library cells use pass transistor logic, as the drive
capability of these gates (XOR, MUX, HA carry bit) is quite limited. Figure 5 illustrates buffer insertion
to repair a max capacitance violation. In this case, the objective of buffer insertion is to break up the
load cells; therefore, we insert three buffers (one for each branch) very close to the driver. Clearly,
the number of inserted buffers and the corresponding buffer locations are much different than for
noise fixing.
[Figure 5: Buffer insertion to repair a max capacitance violation. Legend: driver in yellow, loads in blue, inserted buffer in red.]
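The effect of per-branch buffering on the driver's load can be estimated with simple arithmetic. The capacitance values, the buffer input capacitance, and the max capacitance limit below are hypothetical, and wire capacitance is ignored because the buffers are assumed to sit very close to the driver.

```python
def unbuffered_load_ff(branches):
    """Total pin capacitance (fF) seen by the driver with no buffers."""
    return sum(sum(branch) for branch in branches)


def buffered_load_ff(branches, buffer_input_cap_ff):
    """Load (fF) seen by the driver when each branch gets its own buffer."""
    return buffer_input_cap_ff * len(branches)


# Hypothetical input pin capacitances (fF) of the load cells, grouped by branch.
branches = [[4.0, 4.0, 5.0], [6.0, 6.0], [3.0, 3.0, 3.0, 3.0]]
max_cap_ff = 30.0  # assumed library limit for the driver's output pin

before = unbuffered_load_ff(branches)
after = buffered_load_ff(branches, buffer_input_cap_ff=2.0)
print(before, before > max_cap_ff)  # violates the assumed limit
print(after, after > max_cap_ff)    # within the assumed limit
```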
Other possible resolutions for max capacitance violations include upsizing the driver (yellow) and/or
downsizing the loads (blue). Resizing is preferable to buffer insertion since it is less disruptive. In
addition to max capacitance failures, we must also fix large max transition violations, typically with
repeater insertion. Small slew rate violations can be tolerated or, if possible, repaired by upsizing the
driver.
It is important to fix the vast majority of electrical rule violations in the early repair passes since the
necessary buffer insertion may add delay to some paths. Any additional violations that pop up can be
fixed in the later repair passes.
In order to fix setup violations effectively, we must first report the appropriate data for analysis.
Ideally, timing reports will include fanout, capacitive load, net delay, crosstalk delta delay, cell delay
and slew rate. A sample path from such a timing report is shown in Figure 7.
To identify the proper fix for a particular violation, we must inspect the report and look for any
anomalous data, such as (1) a large crosstalk delay, (2) an unusually large slew rate, or (3) a high
fanout, to determine where the problem lies. See Figure 6 below for examples of these. Anomalous
cells and nets are good targets for manual setup fixes, since they have likely not been optimized
correctly.
Also, designers should browse the timing report to look for bottlenecks. When fixing setup, modify
higher-fanout instances, if possible, since they may address many endpoint violations at once.
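Screening a path for anomalous stages can be sketched as a threshold check on each report row. The thresholds and the row fields here are hypothetical. In practice they depend on the library, the clock period, and the report format of the STA tool in use.

```python
def flag_anomalies(stages, max_xtalk_ns=0.05, max_slew_ns=0.30, max_fanout=16):
    """Return (pin, reasons) for every stage that exceeds an assumed threshold."""
    flagged = []
    for stage in stages:
        reasons = []
        if stage["xtalk_delta_ns"] > max_xtalk_ns:
            reasons.append("crosstalk")
        if stage["slew_ns"] > max_slew_ns:
            reasons.append("slew")
        if stage["fanout"] > max_fanout:
            reasons.append("fanout")
        if reasons:
            flagged.append((stage["pin"], reasons))
    return flagged


# Hypothetical rows from a timing report; pin names and values are illustrative.
stages = [
    {"pin": "U1/Z", "fanout": 3, "slew_ns": 0.08, "xtalk_delta_ns": 0.00},
    {"pin": "U2/Z", "fanout": 42, "slew_ns": 0.12, "xtalk_delta_ns": 0.01},
    {"pin": "U3/Z", "fanout": 2, "slew_ns": 0.45, "xtalk_delta_ns": 0.09},
]

for pin, reasons in flag_anomalies(stages):
    print(pin, reasons)
```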
A. Useful Skew
Useful skew causes major upheaval in the design and must be added only during first-iteration repair,
and only when no other options exist for fixing the failure.
Startpoint: data_in
(input port clocked by clk_vir)
Endpoint: i_mod1/macroA
(rising edge-triggered flip-flop clocked by clk_ab)
Path Group: INPUT
Path Type: max
Max Data Paths Derating Factor : 1.09(cell) 1.09(net)
Min Clock Paths Derating Factor : 0.90(cell) 0.90(net)
Max Clock Paths Derating Factor : 1.02(cell) 1.02(net)
Consider the timing report shown in Figure 7. Examining the datapath portion of the report (from
input “data_in” to macro pin “i_mod1/macroA/macroA_data_in”), we see that all cells are maximally
sized and also low Vt. Therefore, no Vt swap or resizing is possible to reduce delay. Also, buffer
insertion is unlikely to help since there are no large net delays indicating underbuffering. To eliminate
the timing violation, we have no choice but to add useful skew. We choose to add skew by inserting a
buffer on the clock pin of the destination register, since the change can be localized to affect only that
particular register (in any significant way). Skewing the clock at the source register would impact
another designer’s block in this case. It would also require either buffer removal or moving the source
register closer to the root of the clock tree, both of which are likely to affect other registers on that
clock branch. See Figure 8 for a comparison of source versus destination register useful skew.
[Figure 8: Source versus destination register useful skew. Labels: two D/Q registers, logic, fanout.]
If we were to add useful skew at the source register, we would have to bypass or delete the
two inverters, as indicated by the dotted line. This could affect timing for the fanout.
[Figure 9: Destination register with D/Q, TE, and TI pins.]
Delaying the clock may cause hold violations for the TE and TI pins. It may also
cause setup violations at the destinations driven by the “output logic”.
Useful skew will definitely fix the violation in question but has the potential side effect of creating new
violations as well. Delaying the clock to the register may cause hold violations on the test pins and
setup violations through the driven logic as illustrated in Figure 9. These should be cleaned up
immediately in order for the useful skew insertion to be considered successful and complete. Useful
skew is a last resort for fixing setup violations.
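The trade-off can be made concrete with first-order arithmetic. The sketch assumes that delaying the capture clock by some amount adds that amount to the setup slack of paths ending at the register. It also assumes the same amount is subtracted from the hold slack into the register and from the setup slack of paths the register launches. All numbers are hypothetical, and effects such as derating and crosstalk are ignored.

```python
def apply_capture_clock_delay(delay_ns, setup_in_ns, hold_in_ns, setup_out_ns):
    """First-order slack changes when the register's clock is delayed."""
    return {
        # Paths ending at the register get more time to arrive.
        "setup_in_ns": round(setup_in_ns + delay_ns, 3),
        # Data into the register (including TE/TI) must now be held longer.
        "hold_in_ns": round(hold_in_ns - delay_ns, 3),
        # Paths launched by the register start later.
        "setup_out_ns": round(setup_out_ns - delay_ns, 3),
    }


result = apply_capture_clock_delay(
    delay_ns=0.10, setup_in_ns=-0.08, hold_in_ns=0.05, setup_out_ns=0.30
)
print(result)
```

With these numbers the original setup violation is fixed, the outgoing paths still have margin, and a new hold violation appears on the incoming side. That hold violation is the kind of fallout that must be cleaned up immediately.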
B. Buffer Insertion
Although far preferable to useful skew, buffer insertion is still a fairly disruptive operation, especially in
congested areas of the design. Since a new cell is added, the placement is locally perturbed, and the
place-and-route tool must do an ECO route to reconnect all affected nets. This changes the local
noise profile and may introduce new violations. However, buffer insertion is unavoidable in some
cases. Consider the timing report segment shown below in Figure 10.
In the report, net “n_33291” induces a large crosstalk delta delay. After checking the layout, we see
that net “n_33291” is very long, but its driver is already low Vt and has high drive strength. In this
event, we must add a buffer to break the long net.
Other cases that require buffer insertion are as follows. Load splitting is required for high fanout nets
or when load cells are spread out. Buffer insertion can also be used to shield particular loads. For
example, given a high-fanout net with a small number of timing-critical branches, an inserted buffer
can serve as the root of a subtree that drives only non-timing critical branches of the net, thereby
reducing the delay for critical branches.
C. Upsizing
Increasing drive strength is slightly less disruptive than buffer insertion, although the placement in the
area of the upsized gate may change slightly and ECO routing is needed. When upsizing cells, avoid
exceedingly large changes and proceed in a step-by-step manner. Usually, the optimization engine
has chosen particular gate sizes for a reason. Sometimes, cells are undersized simply because an
associated path does not violate timing at the block level. If a failure only manifests during final, top-
level STA, designers may find upsizing very useful and easy to do. Often, a Vt swap can be done
simultaneously with an upsize to reduce the power impact of the upsize.
There are certain cases where upsizing is less desirable. For example, gates are poor candidates for
upsizing when the instance that drives them has very high fanout. If the driving gate is already
overloaded, upsizing one of its sinks may actually make timing worse. We can easily find and avoid
high-fanout gates by referring to the “fanout” column in a timing report.
D. Vt Swap
The simplest and most innocuous of setup fixing changes is Vt swap. This operation costs nothing in
terms of routability since cell footprints are typically the same across all Vt classes. In fact, it is not
even necessary to re-extract after performing Vt swap. Because Vt swap is not invasive, it is the
most desirable operation for fixing setup violations and should be used whenever possible. Consider
the timing report segment in Figure 11. Clearly, there are several candidates for Vt swap since the
path contains multiple high Vt cells. Still, when selecting a cell to swap, we try to choose one with an
anomalous delay. In this case, we select “G1002” and will change it to standard Vt (SVT_NAND4X1).
Note that, in some standard cell libraries, engineers can also exploit same-footprint cells with
different drive strengths. For example, INVX2, INVX3 and INVX4 might have the same footprint. In
that case, upsizing INVX2 to INVX4 is effectively the “same cost” as a Vt swap operation and may be
a better way to fix timing in last-stage STA repair. However, not all libraries have this nice feature.
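The preference order among the four setup fixes, from least to most disruptive, can be summarized as a small decision function. This is a sketch of the guidelines, not a definitive procedure. The stage fields are hypothetical, and the "escalate" result is our placeholder for cases that need a special request.

```python
def choose_setup_fix(stage, iteration):
    """Pick the least disruptive fix that still applies to this stage."""
    if not stage["lowest_vt"]:
        return "vt_swap"
    # Upsizing a gate whose driver is already overloaded can make timing worse.
    if not stage["max_size"] and not stage["driver_overloaded"]:
        return "upsize"
    if stage["long_net"] or stage["high_fanout"]:
        return "buffer_insertion"
    # Useful skew is a last resort, allowed only in the first iteration.
    if iteration == 1:
        return "useful_skew"
    return "escalate"


def make_stage(**overrides):
    """Build a hypothetical stage record with every option exhausted."""
    stage = {
        "lowest_vt": True,
        "max_size": True,
        "driver_overloaded": False,
        "long_net": False,
        "high_fanout": False,
    }
    stage.update(overrides)
    return stage


print(choose_setup_fix(make_stage(lowest_vt=False), iteration=3))
print(choose_setup_fix(make_stage(max_size=False), iteration=3))
print(choose_setup_fix(make_stage(long_net=True), iteration=2))
print(choose_setup_fix(make_stage(), iteration=1))
print(choose_setup_fix(make_stage(), iteration=4))
```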
Again, it is best to localize the ripple effects of changes. Figure 12 shows a sample timing report for a
hold time violation. Looking carefully at the report, we see that the violation was probably caused by
a useful skew buffer insertion (i_blkA/BUF_SKEW_27). Already, one hold buffer has been added in
an attempt to fix the problem (i_blkA/BUF_HOLD_3). We can add another buffer on the “d” pin of the
destination flip-flop to eliminate the remaining negative slack.
Startpoint: i_blkA/enable_reg_19_
(rising edge-triggered flip-flop clocked by clk)
Endpoint: i_blkA/write_reg
(rising edge-triggered flip-flop clocked by clk)
Path Group: clk
Path Type: min
Min Data Paths Derating Factor : 0.900(cell) 0.900(net)
Min Clock Paths Derating Factor : 0.950(cell) 0.950(net)
Max Clock Paths Derating Factor : 1.100(cell) 1.100(net)
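A rough way to size a hold fix is to divide the negative slack by the delay of one hold buffer. The added delay must then be checked against the setup slack of the same path. The sketch below works in picoseconds and assumes a fixed, hypothetical delay per buffer. Real buffer delay varies with load, slew, and corner.

```python
import math


def hold_buffers_needed(hold_slack_ps, buffer_delay_ps, setup_slack_ps):
    """Return (buffer count, whether the added delay fits in the setup slack)."""
    if hold_slack_ps >= 0:
        return 0, True
    count = math.ceil(-hold_slack_ps / buffer_delay_ps)
    fits_setup = count * buffer_delay_ps <= setup_slack_ps
    return count, fits_setup


print(hold_buffers_needed(-30, 40, 200))  # one buffer, setup margin is ample
print(hold_buffers_needed(-90, 40, 100))  # three buffers would break setup
```

When the second value is False, the path cannot be fixed by buffering alone.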
Finally, some failures are easily fixed by changing registers from low or standard Vt to high Vt.
Optimization tools often do not touch registers; however, flip-flops are usually not as sacred as the
tools regard them. If the functional path through the register is not critical, changing to high Vt has no
negative impact on timing.
IX. CONCLUSION
From the above discussion, we conclude that late-stage timing repair is still more of an art than a
science. Signoff timing closure virtually always includes manual repair, and it is still critical for
designers to learn how to perform this task quickly and effectively. In addition, having a solid
methodology to manage/accelerate convergence is imperative. Methodology also helps to integrate
new designers into the team and to teach new college graduates skills for success. In the end, there
is still no substitute for experience.