
Clock Tree Divergence Ti

The document discusses the Divergence Engine (DE), an algorithm designed to predict clock-tree divergence (CTD) at the logic-synthesis stage of VLSI design, which aims to optimize timing margins for better frequency, area, and power performance. It highlights the limitations of traditional flat margin methodologies and emphasizes the need for path-specific margins to accurately account for varying divergence values across different timing paths. The DE utilizes fan-out based weight estimation and a mixed bag approach to improve the accuracy of divergence predictions, ultimately facilitating early identification of critical paths and reducing design cycle time.


Divergence Engine: Early prediction of clock-tree divergence at Logic-synthesis stage

Sanjana Sundaresh, Murali Mohan Thota, Atul Garg
Texas Instruments India Pvt Ltd, Bangalore, India
[email protected],[email protected],[email protected]

Abstract: Timing convergence, while reducing area and power, is one of the hardest challenges in the physical design of nanometer VLSI chips which push the limits of frequency entitlement. If there is clock tree divergence (CTD) due to architecture in a high frequency design, then margins are required during the optimization stage. These margins account for divergence at the early stages, which can significantly impact the frequency entitlement. However, each timing path can have a different divergence value, and applying the same margins across all paths is not optimal. Applying flat margins to address CTD can lead to incorrect frequency entitlement, impacting area and power. If accurate path-specific margins based on CTD are used, it will help in the early optimization stages, i.e. at logic synthesis and placement, to achieve the right power, performance, area and schedule (PPAS). Early identification of critical paths which are highly clock tree divergent at the logic-synthesis stage, together with in-time architecture feedback, will go a long way to reduce design or project cycle time.

Keywords: clock tree divergence, early feedback, frequency entitlement, clock skew

I. INTRODUCTION

Timing closure of a VLSI design is an iterative process. At each stage of the backend flow, i.e. logic synthesis, placement, clock-tree synthesis and routing, different challenges are exposed to the task of timing closure. At the logic synthesis stage, the basic netlist and connections are available. At the placement stage, floorplan-related variations start to take shape. Post clock-tree synthesis (CTS), clock latencies become applicable. This results in clock skew and CTD impacting timing. Further, post routing, the real net delay comes into play. At each of these stages, timing will degrade.

Especially for a high frequency design with considerable CTD, appropriate selection of design-ware components is of paramount importance. This will largely impact to what extent the synthesis tool can perceive the criticality of the design. To achieve this, traditionally flat margins are used across the flow. Calculation of flat margins is shown in Table I. The tool natively supports the application of flat margins as part of clock uncertainty for any given clock. However, this option offers no flexibility and penalizes all paths equally. As we attempt to push frequencies higher, the choice of these flat margins becomes crucial. Lower margins can lead to incorrect frequency entitlement at signoff stage, as paths may not be optimized sufficiently. Higher margins can penalize area in certain parts of the design, since not all paths are timing critical in a design. In extreme cases, high margins can cause problems in the timing optimization itself and result in failure of timing closure. The current estimation of margins is unscientific and is derived from previous design knowledge.

For any given clock, timing paths in the design can be roughly divided into 4 categories:
1) Low divergence + Low path depth
   a. Optimize for best area and power.
2) Low divergence + High path depth
   a. Optimize to meet timing without area and power impact.
3) High divergence + Low path depth
   a. Optimize to meet timing smartly to avoid iteration from synthesis.
4) High divergence + High path depth
   a. Identify such critical paths early, as they are frequency entitlement bottlenecks, and provide RTL feedback.

As the nature of each category is different, a single flat margin cannot address all of them. Hence, to get the best out of each of these categories, applying path-specific real margins will give us the best frequency, area and power entitlement.

II. TRADITIONAL MARGINS

This section explains the current methodology of flat margins. In this case, a certain amount of clock latency and CTD is assumed for the design based on previous design experience, as shown in Table I. Margins are calculated based on this assumption and applied as part of uncertainty across the optimization stage (till pre-CTS). They are removed post CTS.

TABLE I. Flat Margin Calculation
Predicted Latency       7000ps
Assumed CTD             30%
Launch_clock derate     1.113
Capture_clock derate    0.947
Skew                    250ps

Divergence margins = 7000 * 30% * (1.113 - 0.947) + 250 = 602ps

At post CTS stage, if a violating path has a balanced clock tree and no useful skew is employed, then the violation is most likely due to CTD. So the divergence that was initially assumed to calculate margins is incorrect or not sufficient for these paths. This leads to under-optimization of these paths from the logic-synthesis stage. In such cases, by the existing methodology, this is taken as feedback and synthesis is repeated with extra uncertainty for these paths.
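The arithmetic behind Table I can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function name `flat_margin_ps` and its parameter names are assumptions introduced here, and the expression simply mirrors the formula above (latency times assumed CTD times the launch/capture derate difference, plus skew).

```python
def flat_margin_ps(latency_ps, assumed_ctd, launch_derate, capture_derate, skew_ps):
    """Flat divergence margin in ps, following the Table I expression.

    assumed_ctd is a fraction (0.30 for 30%). All names are illustrative.
    """
    divergent_latency = latency_ps * assumed_ctd     # share of latency assumed uncommon
    derate_spread = launch_derate - capture_derate   # launch vs capture derate difference
    return divergent_latency * derate_spread + skew_ps


# Table I inputs: 7000 ps latency, 30% CTD, derates 1.113 / 0.947, 250 ps skew
table1_margin = flat_margin_ps(7000, 0.30, 1.113, 0.947, 250)
print(f"{table1_margin:.1f} ps")
```

With the Table I inputs as listed, the expression evaluates to roughly 599 ps, close to the 602 ps quoted above; the small gap presumably comes from rounding in the listed derates. Because the same value is applied to every path of the clock, raising the assumed CTD raises the uncertainty on all paths at once.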
Placement and CTS are then redone to check if the path is meeting frequency. If the path fails even after applying high margins, architectural feedback is provided to the front end team.

Drawbacks of the current methodology can be summarized as:

1) Inability to predict frequency entitlement at logic synthesis stage.
2) Requirement to complete CTS in order to find critical paths, and hence delayed architectural feedback.
3) Unscientific iterative margin process impacting PPAS.

For a high frequency design with considerable CTD, these drawbacks have a significant impact. A better methodology is required to bridge the gap between actual CTD and the margins applied at logic synthesis stage.

III. DIVERGENCE ENGINE

To address CTD margins, an algorithm is developed which predicts and dumps path-specific divergence values. This algorithm is called the Divergence Engine (DE). To understand the divergence calculation intricacies, take the case of a typical timing path between launch flop FF1 and capture flop FF2 shown in Fig. 1. As shown in Table II, the estimated divergence is 44%.

Fig. 1 Typical timing path at pre-CTS stage

TABLE II. Actual calculation of clock divergence pre-CTS stage
Common clock path              5 stages
Uncommon launch clock path     4 stages
Uncommon capture clock path    4 stages

Divergence = Worst uncommon clock path / Total clock path = 4/9 = 44%

Assume that, after post CTS expansion, the path is as shown in Fig. 2.

Fig. 2 Timing path at post clock tree expansion

TABLE III. Clock divergence post CTS stage
Common clock path              8 stages
Uncommon launch clock path     13 stages
Uncommon capture clock path    12 stages

Divergence @ post CTS stage = 13/21 = 62%

There is a huge difference between pre-CTS and post CTS divergence percentages. In order to fill this gap, the major challenge is to predict the buffer insertion points chosen by the CTS tool. Different approaches to achieve this are discussed below. For illustration purposes, the launch path is assumed to have the worst uncommon clock path.

A. Fan-out based weight estimation

One basic reason for buffer insertion at CTS stage is to meet transition for high fan-out nets. Therefore, if the fan-out of an RTL instantiated clock tree element (CTE) is high, then it is reasonable to assume that the CTS tool would insert buffers here. For example, if the fan-out of a CTE was 5, then CTS could insert 1 buffer, hence a weight of 1 is assigned to this CTE. The total count for this CTE would be weight (1) + the element itself = 2. In this manner, the higher the fan-out, the higher the weight assigned. This is done for every RTL instantiated CTE as shown in Fig. 3.

Fig. 3 Timing path with fan-out based weight estimation

TABLE IV. Predicted divergence using this methodology
Common clock path              5 stages + 5 weight
Uncommon launch clock path     4 stages + 1 weight

Divergence = 5/15 = 33%

There is a huge gap between predicted (33%) and actual divergence (62%) in this methodology. There was a miscorrelation because floorplan-based inputs were not considered. Some key findings were:

1) The source of the clock tree, for example a PLL or oscillator, is placed far away in the floorplan from its branch, and hence the tool will always buffer the path to meet a given transition during CTS.
2) This is also the case when the clock path traces through different modules. As a single module tends to be clustered together and could be placed far away from another module, the CTS tool will buffer to meet transition.

B. Mixed Bag Approach

From the previous experiment, it was concluded that adding weights is the right approach, but it has to be fine-tuned to add weights appropriately by making it placement aware. Further, as the percentage of clock tree divergence is used to calculate the margins, the exact number of buffers is not required; rather, just the correct ratio of worst uncommon clock path to total clock path is needed.

After doing extensive heuristic analysis on a post CTS database, the following key points were identified to enable weight insertion.
• Add weights for every RTL instantiated ICG and MUX.
• Add weights if the clock path changed modules.
• Add weights to clock source.
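The divergence ratio of Tables II to IV and the kind of weighting listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `/`-separated hierarchical instance names, and the example path are hypothetical. The weight values (5, 1, 2, 2) are borrowed from the heuristic shown later in Fig. 7, the either/or treatment of the two module checks is an interpretation made here, and the clock-source weight is left out because no value is given for it.

```python
# Illustrative weights; the text says real values are tuned per platform.
BASE_MODULE_CHANGE = 5   # clock path crosses into a different top-level module
SUB_MODULE_CHANGE = 1    # same top-level module, different second-level module
ICG_WEIGHT = 2
MUX_WEIGHT = 2


def divergence(common, launch, capture=0):
    """Worst uncommon clock path divided by total clock path (common + worst)."""
    worst = max(launch, capture)
    return worst / (common + worst)


def weighted_depth(path):
    """Stage count plus placement-aware weights for one clock-path segment.

    path: list of (hierarchical_name, cell_type), ordered from source to sink.
    The '/'-separated naming is an assumption made for this sketch.
    """
    total = len(path)
    prev = None
    for name, cell_type in path:
        modules = name.split("/")[:-1]   # drop the instance name itself
        if prev is not None:
            if modules[:1] != prev[:1]:
                total += BASE_MODULE_CHANGE
            elif modules[1:2] != prev[1:2]:
                total += SUB_MODULE_CHANGE
        if cell_type == "ICG":
            total += ICG_WEIGHT
        elif cell_type == "CTMUX":
            total += MUX_WEIGHT
        prev = modules
    return total


pre_cts = divergence(common=5, launch=4, capture=4)      # Table II: 4/9
post_cts = divergence(common=8, launch=13, capture=12)   # Table III: 13/21
fanout_based = divergence(common=5 + 5, launch=4 + 1)    # Table IV: 5/15

# Hypothetical launch branch: a mux near the source, then an ICG in another module.
launch_branch = [("core/clkgen/mux0", "CTMUX"), ("periph/uart/icg0", "ICG")]
launch_depth = weighted_depth(launch_branch)
```

Only the ratio matters for the margin, so the weights need to reproduce the relative growth of the common and uncommon segments rather than the exact buffer count.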

With this approach, the predicted divergence is shown in Fig. 4.

Fig. 4 Timing path with weight estimation with mixed bag approach

TABLE V. Predicted divergence using mixed bag approach
Common clock path              5 stages + 3 weight = 8
Uncommon launch clock path     4 stages + 8 weight = 12

Divergence = 12/20 = 60%

With this, an excellent correlation between post CTS divergence data (62%) and predicted data (60%) is obtained.

The exact values of the weights to be used for each of the above 3 key points of the divergence prediction algorithm are based purely on heuristic analysis done on an existing post CTS DB for a platform. In an experiment, weights were configured for an already existing digital design based on its post CTS database. Using these weights, on comparing the post CTS CTD percentage to the predicted divergence percentage, there was very good correlation, with accuracy of +/-10%, if the timing paths were balanced. With the same weights in place, DE was run on another design belonging to the same platform and a similar level of accuracy was achieved. Hence, if weights are configured for DE for a platform design, then they can be reused across different derivatives from the platform. If a design is the first of its kind on a new platform, DE can be run by assuming certain weights to begin with. Paths which are predicted to be divergent at pre-CTS will remain divergent even after post CTS. But the predicted divergence percentage vs the actual post CTS based divergence percentage could vary to some extent. The weights will have to be fine-tuned based on post CTS feedback as shown in Fig. 5.

IV. DIVERGENCE PREDICTION ALGORITHM AND RUN TIME IMPROVEMENTS

The aim of the divergence prediction is to predict, with sufficient accuracy, the actual divergence percentage of each of the timing paths in the design before any clock tree is actually built. The divergence engine (DE) can then be called at various stages to achieve corresponding aims (Fig. 5).

Fig. 5 Different stages where DE can be used in the design

DE needs only the netlist and constraints as inputs to calculate divergence. The engine is run in the sign-off tool. Since only instances and their connections are analyzed, accurate delay calculations are not required. Once the divergence values are estimated, the DE will require values of derate, estimated skew and clock insertion latency to calculate the final uncertainty values. Ideally, divergence for each and every interacting flop pair needs to be calculated. If there is a design which has 100K flops, the timing path interactions can run into 20 million. Hence calculation of divergence based on every timing path would be time consuming. To solve this problem, the flop's immediate fan-in level-1 element is used, that is, an RTL instantiated CTE. If the divergence is calculated for one flop pair under two interacting level-1 CTEs, then all the flops under them will have the same divergence. Logic-synthesis inserted clock gating cells are skipped as they can change with every synthesis run.

DE is composed of 3 major parts. The first step is to find the leaf level RTL instantiated CTE for each sequential element. Divergence will be calculated for each of these CTE pairs. The list of CTE (Fig. 6) is written out to a file to be used by the next stage.

Procedure: create_cteList.tcl
Inputs: timing database of the design under process
Outputs:
  1. CTE to fan-out flop mapping for all registers.
  2. CTE to RTL clock tree element mapping.
Steps:
  1. Obtain a list of all the leaf sequential elements in the design.
  2. For each of the flops, obtain the fan-in element.
     a. If such an element exists
        i. Look into the fan-in to this element to obtain the RTL CTE element.
        ii. If found, assign the CTE to the RTL CTE.
        iii. Append the flop to the CTE fan-out list.
     b. Write the fan-out list of CTE to the fan-out file.
  3. Close the file handles and return the output files.
Fig. 6 Creating the CTE list
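The grouping step of Fig. 6 can be sketched in Python on a toy connectivity model. This is a hedged illustration rather than the create_cteList.tcl procedure itself: the dictionaries standing in for the timing database, the `rtl` flag, and the function name `build_cte_fanout` are all assumptions introduced here, and the upstream walk past synthesis-inserted gating cells is one possible reading of step 2.

```python
from collections import defaultdict
from itertools import combinations


def build_cte_fanout(flop_clock_driver, cells):
    """Map each RTL-instantiated CTE to the flops it clocks.

    flop_clock_driver: flop name -> cell driving its clock pin (or None).
    cells: cell name -> {"rtl": bool, "driver": upstream cell name or None}.
    Both structures are hypothetical stand-ins for the timing database.
    """
    fanout = defaultdict(list)
    for flop, driver in flop_clock_driver.items():
        visited = set()
        # Skip synthesis-inserted cells until an RTL-instantiated CTE is reached.
        while driver is not None and driver not in visited and not cells[driver]["rtl"]:
            visited.add(driver)
            driver = cells[driver]["driver"]
        if driver is not None and driver not in visited:
            fanout[driver].append(flop)
    return dict(fanout)


cells = {
    "PLL": {"rtl": True, "driver": None},
    "ICG1": {"rtl": True, "driver": "PLL"},
    "ICG2": {"rtl": True, "driver": "PLL"},
    "syn_cg_0": {"rtl": False, "driver": "ICG1"},  # inserted by synthesis, skipped
}
flop_clock_driver = {"ff_a": "ICG1", "ff_b": "syn_cg_0", "ff_c": "ICG2", "ff_d": None}

cte_fanout = build_cte_fanout(flop_clock_driver, cells)
# Divergence is then evaluated once per CTE pair instead of once per flop pair.
cte_pairs = list(combinations(sorted(cte_fanout), 2))
```

In this toy case three clocked flops collapse into two CTE groups and a single CTE pair, which is the source of the run-time saving described above: the amount of work scales with the number of RTL CTEs rather than the number of flops.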
The next stage predicts the total clock path, based on the weight assumption given, for a single timing path for a given CTE pair (Fig. 7).

Procedure: predict_clock_path
Inputs: List of pre-CTS elements in path
Outputs: Number of elements expected post CTS expansion
  1. For each element i in the path,
     a. Compare the module name with the previous element's module name
        i. If base module names are different, add a weight of 5
        ii. If the second module names are different, add a weight of 1
     b. Check the cell name of the element
        i. If it is an ICG, add a weight of 2
        ii. If it is a CTMUX, add a weight of 2
  2. Return the total number of elements + weights
Fig. 7 Heuristic for estimating path depth

This process is iterated for a single CTE w.r.t. all other CTEs. If any valid path is found, the output is written to the file. Since each CTE has to be analyzed with all other CTEs, this step is time intensive. To optimize the run time during this step, the algorithm has been designed such that the calculation of divergence for multiple CTE pairs can happen in parallel.

In the last step, the divergence output is parsed through a Perl script to generate uncertainty command files that can be read by the pre-CTS optimization tool.

V. EXPERIMENTAL RESULTS

A. Test-Case Details

The test case considered to check out DE is a digital design which is area and power critical. The design had been closed using flat margins assuming 50% clock divergence. Using DE, divergence was calculated for every timing path and then mapped to a pie chart, detailed in Fig. 9.

Fig. 9 CTD pie chart of digital design

From the CTD pie chart, the following are some key takeaways:
1) Not many paths with high divergence exist in the design.
2) If 50% of the design is assumed to be divergent for margin calculation, it would be valid only for ~5% of the design.
3) The margins were pessimistic for ~95% of paths.
4) With the use of margins based on the divergence engine, an area benefit is expected for the design.

B. Test-Run Details

In order to check out the divergence aware margin methodology on the design, 4 runs were given till post CTS setup optimization.

1. Traditional methodology with flat margins
A flat margin of 50% was assumed.
Clock Latency = 8000ps
Launch derate = 1.0632
Capture derate = 0.945
Skew = 250ps
Flat margin = 50%
Divergence margins = 8000 * 50% * (1.0632 - 0.945) + 250 = 714ps
Results are presented in Fig. 10.

2. Reduced flat margin flow
In this case, 12% of the design is assumed to be divergent based on the CTD pie chart.
Divergence margins = 8000 * 12% * (1.0632 - 0.945) + 250 = 363ps
Results are presented in Fig. 11.

3. Divergence aware methodology (DAM)
This run is with the CTD based divergence aware methodology. Details are captured in section C (Divergence aware methodology).
Results are presented in Fig. 12.

4. No margin flow
To make sure the design was not over-penalized w.r.t. margins, and to see how much area gain can be achieved, the design was run with no margins. Results are presented in Fig. 13.

C. Divergence aware methodology (DAM)

This section explains the DAM calculations.

1) DE is run on the synthesized netlist in the signoff timing tool to calculate the "Divergence Percentage" for all possible paths for a given clock w.r.t. all RTL inserted CTE. This helps to study the overall divergence percentage of the design and to pick a flat divergence % number which represents most of the paths; that number is added as flat uncertainty. In this case it was 12%. Some portion of flat margin is included because the optimization tool is unable to take all the path based uncertainty given.
2) To calculate this flat margin, the formula below is used, and the result is added as clock uncertainty in addition to clock jitter.

   Divergence margins = 8000 * 12% * (1.063 - 0.945) + 250 = 363ps

3) For paths which have high divergence (> 12%), extra uncertainty is added in addition to flat margins.
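The split between flat and path-specific uncertainty in steps 2) and 3) can be sketched as follows. The flat part follows the formula in step 2). How the divergence above the flat percentage is converted into picoseconds is not spelled out in the text, so the sketch assumes the same latency and derate terms are reused, without adding skew a second time. The function names `divergence_margin_ps` and `dam_uncertainty` are hypothetical.

```python
def divergence_margin_ps(latency_ps, divergence, launch_derate, capture_derate, skew_ps=0.0):
    """Margin in ps for a divergence fraction, mirroring the step 2) expression."""
    return latency_ps * divergence * (launch_derate - capture_derate) + skew_ps


def dam_uncertainty(pair_divergence, flat_divergence, latency_ps,
                    launch_derate, capture_derate, skew_ps):
    """Return (flat, extra) uncertainty in ps for one CTE pair.

    flat  : applied to every path of the clock.
    extra : applied only to paths between the two CTEs (assumed formula).
    """
    flat = divergence_margin_ps(latency_ps, flat_divergence,
                                launch_derate, capture_derate, skew_ps)
    residual = max(0.0, pair_divergence - flat_divergence)  # nothing extra below the flat level
    extra = divergence_margin_ps(latency_ps, residual, launch_derate, capture_derate)
    return flat, extra


# Values from this section: 8000 ps latency, derates 1.063 / 0.945, 250 ps skew, 12% flat.
flat_ps, extra_ps = dam_uncertainty(0.40, 0.12, 8000, 1.063, 0.945, 250)
print(f"flat {flat_ps:.0f} ps, extra {extra_ps:.0f} ps")
```

Under these assumptions, a CTE pair predicted at 40% divergence keeps the 363 ps flat uncertainty and receives roughly 264 ps more on its own paths only, while pairs at or below 12% receive nothing extra.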
As a worked example of step 3):
a) Assume paths between ICG1 (launch CT element) and ICG2 (capture CT element) have a predicted divergence of 40%.
b) Overall flat divergence is assumed to be 12%. Hence the clock-to-clock uncertainty applied for, say, CLK1 will be 363ps.
c) The remaining 28% of divergence is specific to paths between ICG1 and ICG2. This is applied as uncertainty at the logic-synthesis stage.
d) Flops driven by ICG1 and ICG2 are grouped. Uncertainty is applied only to the specific interacting paths.

Fig. 10 Results of Traditional flat margin flow

Fig. 11 Results of Reduced flat margins flow

Fig. 12 Results of Divergence aware margin flow

Fig. 13 Results of No margin flow

D. Analysis of the experiment on Design

1) Traditional margin flow
   a) Good timing and leakage picture, but area is high.
2) No margin flow
   a) ~5.8% area saved w.r.t. traditional margin flow.
   b) Huge impact on timing, with an unrecoverable 330ns (20X) total negative slack (TNS).
   c) 33% impact on leakage w.r.t. traditional margin flow.
3) Reduced flat margin flow
   a) ~3.5% std cell logic area saving w.r.t. traditional margin.
   b) 8X degradation in TNS post CTS, which is unrecoverable.
   c) 10% increase in leakage.
4) Divergence aware methodology
   a) 3.5% std cell logic area saving.
   b) Predictable timing picture at pre-CTS stage when compared to traditional flow.
   c) Good leakage.
   d) High run time.

Hold optimizations do not affect these results, as they influence only data path optimization and not CTS.

VI. CONCLUSION

We have developed a custom-built divergence prediction engine that generates path-specific margins based on clock tree divergence, with the following advantages.

1) Complete and accurate prediction of expanded clock tree divergence at logic synthesis level without any physical information like floorplan.
2) Very early architectural feedback to the RTL team on critical paths based on clock tree divergence.
3) With realistic margins, we can predict target frequency at pre-CTS.
4) The engine, once configured (weights configured) for a platform, works with the same consistency across different designs.
5) On a post CTS database, the engine can derive weights and hence can calculate the real margins, which can be fed back to improve optimization.
6) Divergence Engine can be run on any netlist with a quick turnaround time.

This algorithm can also be extended to dynamically model local clock skews and useful skew scenarios.
