Clock Tree Divergence Ti
Abstract: Timing convergence, while reducing area and power, is one of the hardest challenges in the physical design of nanometer VLSI chips that push the limits of frequency entitlement. If a high frequency design has clock tree divergence (CTD) due to its architecture, margins are required during the optimization stage. These margins account, at the early stages, for divergence that can significantly impact the frequency entitlement. However, each timing path can have a different divergence value, and applying the same margin across all paths is not optimal: flat margins can lead to incorrect frequency entitlement, impacting area and power. Accurate path specific margins based on CTD help the early optimization stages, i.e. logic synthesis and placement, to achieve the right power, performance, area and schedule (PPAS). Early identification of critical paths that are highly clock tree divergent at the logic-synthesis stage, together with in-time architecture feedback, will go a long way to reduce design or project cycle time.

Keywords: clock tree divergence, early feedback, frequency entitlement, clock skew

I. INTRODUCTION
Timing closure of a VLSI design is an iterative process. Each stage of the backend flow, i.e. logic synthesis, placement, clock-tree synthesis and routing, exposes a different challenge to timing closure. At the logic synthesis stage, the basic netlist and connections are available. At the placement stage, floorplan related variations start to take shape. Post clock-tree synthesis (CTS), clock latencies become applicable. This results in clock skew and CTD impacting timing. Further, post routing, the real net delay comes into play. At each of these stages, timing will degrade.

Especially for a high frequency design with considerable CTD, appropriate selection of design-ware components is of paramount importance. This largely determines to what extent the synthesis tool can perceive the criticality of the design. To achieve this, flat margins are traditionally used across the flow. Calculation of flat margins is shown in Table I. The tool natively supports the application of flat margins as part of clock uncertainty for any given clock. However, this option offers no flexibility and penalizes all paths equally. As we attempt to push frequencies higher, the choice of these flat margins becomes crucial. Lower margins can lead to incorrect frequency entitlement at the signoff stage, as paths may not be optimized sufficiently. Higher margins can penalize area in certain parts of the design, since not all paths are timing critical in a design. In extreme cases, high margins can cause problems in the timing optimization itself and result in failure of timing closure. The current estimation of margins is very unscientific and is derived from previous design knowledge.

For any given clock, timing paths in the design can be roughly divided into 4 categories:
1) Low divergence + Low path depth
   a. Optimize for best area and power.
2) Low divergence + High path depth
   a. Optimize to meet timing without area and power impact.
3) High divergence + Low path depth
   a. Optimize to meet timing smartly to avoid iteration from synthesis.
4) High divergence + High path depth
   a. Identify such critical paths early, which are frequency entitlement bottlenecks, and provide RTL feedback.

As the nature of each category is different, a single flat margin cannot address all of them. Hence, to get the best of each of these categories, applying path specific real margins will give us the best frequency, area and power entitlement.

II. TRADITIONAL MARGINS
This section explains the current methodology of flat margins. In this case, a certain amount of clock latency and CTD is assumed for the design based on previous design experience, as shown in Table I. Margins are calculated based on this assumption and applied as part of uncertainty across the optimization stage (till pre-CTS). They are removed post CTS.

TABLE I. Flat Margin Calculation
Predicted Latency       7000ps
Assumed CTD             30%
Launch_clock derate     1.113
Capture_clock derate    0.947
Skew                    250ps

Divergence margins = 7000 * 30% * (1.113 - 0.947) + 250 = 602ps

At the post CTS stage, if a violating path has a balanced clock tree and no useful skew is employed, then the violation is most likely due to CTD. So the divergence that was initially assumed to calculate margins is incorrect or not sufficient for these paths. This has led to under optimization of these paths from the logic-synthesis stage. In such cases, by the existing methodology, this is taken as feedback and synthesis is repeated with extra uncertainty for these paths. Then
placement and CTS are redone to check if the path is meeting frequency. If the path fails even after applying high margins, architectural feedback is provided to the front end team.

Drawbacks of the current methodology can be summarized as:
1) Inability to predict frequency entitlement at the logic synthesis stage.
2) Requirement to complete CTS in order to find critical paths, and hence delayed architectural feedback.
3) Unscientific iterative margin process impacting PPAS.

For a high frequency design with considerable CTD, these drawbacks have a significant impact. A better methodology is required to bridge the gap between actual CTD and the margins applied at the logic synthesis stage.

III. DIVERGENCE ENGINE
To address CTD margins, an algorithm is developed which predicts and dumps path specific divergence values. This algorithm is called the Divergence Engine (DE). To understand the divergence calculation intricacies, take the case of a typical timing path between launch flop FF1 and capture flop FF2 shown in Fig. 1. As shown in Table II, the estimated divergence is 44%.

There is a huge difference between pre-CTS and post CTS divergence percentages. In order to fill this gap, the major challenge is to predict the buffer insertion points chosen by the CTS tool. Different approaches to achieve this are discussed below. For illustration purposes, the launch path is assumed to have the worst uncommon clock path.

A. Fan-out based weight estimation
One basic reason for buffer insertion at the CTS stage is to meet transition for high fan-out nets. Therefore, if the fan-out of an RTL instantiated clock tree element (CTE) is high, then it is reasonable to assume that the CTS tool would insert buffers here. For example, if the fan-out of a CTE was 5, then CTS could insert 1 buffer, hence a weight of 1 is assigned to this CTE. The total count for this CTE would be weight (1) + the element itself = 2. In this manner, the higher the fan-out, the higher the weight assigned. This is done for every RTL instantiated CTE as shown in Fig. 3.
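The fan-out based weighting can be illustrated with a minimal sketch. The text only states that a fan-out of 5 maps to a weight of 1; the rule used here (one predicted buffer per 5 sinks, rounded down) and the names `fanout_weight`, `weighted_count` and `fanout_per_buffer` are assumptions made for illustration, not the weights used by the DE.

```python
def fanout_weight(fanout: int, fanout_per_buffer: int = 5) -> int:
    """Predicted number of CTS buffers for one RTL-instantiated CTE.

    Assumption: one buffer is predicted per `fanout_per_buffer` sinks.
    The threshold of 5 is hypothetical and only chosen so that a
    fan-out of 5 yields a weight of 1, as in the example in the text.
    """
    if fanout < 0:
        raise ValueError("fan-out cannot be negative")
    return fanout // fanout_per_buffer


def weighted_count(fanout: int, fanout_per_buffer: int = 5) -> int:
    """Weight plus the element itself, i.e. the CTE's contribution to path depth."""
    return 1 + fanout_weight(fanout, fanout_per_buffer)


def predicted_path_depth(fanouts: list[int]) -> int:
    """Sum the weighted counts of every CTE on one clock path."""
    return sum(weighted_count(f) for f in fanouts)
```

With these assumptions, `weighted_count(5)` returns 2, matching the worked example, and a path whose CTEs have fan-outs 5, 12 and 1 is predicted to expand to 6 elements after CTS.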
Fig 4. Timing path with weight estimation with mixed bag approach

TABLE V. Predicted divergence using mixed bag approach.
Common clock path             5 stages + 3 weight = 8
Uncommon launch clock path    4 stages + 8 weight = 12

Divergence = 12/20 = 60%

With this, an excellent correlation between post CTS divergence data (62%) and predicted data (60%) is obtained.

The exact values of the weights used for each of the above 3 key points of the divergence prediction algorithm are based purely on heuristic analysis of an existing post CTS DB for a platform. In an experiment, weights were configured for an already existing digital design based on its post CTS database. Using these weights, comparing the post CTS CTD percentage to the predicted divergence percentage showed very good correlation, with an accuracy of +/-10%, if the timing paths were balanced. With the same weights in place, DE was run on another design belonging to the same platform and a similar level of accuracy was achieved. Hence, if weights are configured for DE for a platform design, they can be reused across different derivatives from the platform. If a design is the first of its kind on a new platform, DE can be run by assuming certain weights to begin with. Paths which are predicted to be divergent at pre-CTS will remain divergent post CTS, but the predicted divergence percentage vs the actual post CTS divergence percentage could vary to some extent. The weights will have to be fine-tuned based on post CTS feedback as shown in Fig 5.

IV. DIVERGENCE PREDICTION ALGORITHM AND RUN TIME IMPROVEMENTS

The aim of the divergence prediction is to predict, with sufficient accuracy, the actual divergence percentage of each of the timing paths in the design before any clock tree is actually built. The divergence engine (DE) can then be called at various stages to achieve the corresponding aims (Fig 5).

Fig.5 Different stages where DE can be used in the design

DE needs only the netlist and constraints as inputs to calculate divergence. The engine is run in the sign-off tool. Since only instances and their connections are analyzed, accurate delay calculations are not required. Once the divergence values are estimated, the DE requires values of derate, estimated skew and clock insertion latency to calculate the final uncertainty values. Ideally, divergence needs to be calculated for each and every interacting flop pair. If a design has 100K flops, the timing path interactions can run into 20 million. Hence calculating divergence for every timing path would be time consuming. To solve this problem, the flop's immediate fan-in level-1 element is used, that is, an RTL instantiated CTE. If the divergence is calculated for one flop pair under two interacting level-1 CTEs, then all the flops under them will have the same divergence. Clock gating cells inserted by logic synthesis are skipped, as they can change with every synthesis run.

DE is composed of 3 major parts. The first step is to find the leaf level RTL instantiated CTE for each sequential element. Divergence will be calculated for each pair of these CTEs. The list of CTEs (Fig. 6) is written out to a file to be used by the next stage.

Procedure: create_cteList.tcl
Inputs: timing database of the design under process
Outputs:
   1. CTE - fan-out flop mapping for all registers.
   2. CTE - RTL clock tree element mapping.
1. Obtain a list of all the leaf sequential elements in the design.
2. For each of the flops, obtain the fan-in element.
   a. If such an element exists
      i. Look into the fan-in to this element to obtain the RTL CTE element.
      ii. If found, assign the CTE to the RTL CTE.
      iii. Append the flop to the CTE fan-out list.
   b. Write the fan-out list of CTE to the fan-out file.
3. Close the file handles and return the output files.
Fig. 6 Creating the CTE list
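The grouping step of Fig. 6 can be sketched as follows. This is only an illustration under assumed inputs: the clock fan-in of every instance is given as a plain dictionary, and the set of RTL instantiated CTEs is known in advance. The names `build_cte_fanout`, `clock_fanin` and `rtl_ctes` are hypothetical and do not come from create_cteList.tcl.

```python
from collections import defaultdict


def build_cte_fanout(flops, clock_fanin, rtl_ctes):
    """Map each RTL-instantiated CTE to the flops it ultimately clocks.

    flops       : iterable of flop instance names
    clock_fanin : dict, instance name -> name of the element driving its clock pin
    rtl_ctes    : set of instance names known to be RTL-instantiated CTEs

    Elements that are not RTL CTEs (for example clock gates inserted by
    synthesis) are walked through rather than recorded.
    """
    cte_to_flops = defaultdict(list)
    for flop in flops:
        driver = clock_fanin.get(flop)
        seen = set()  # guards against malformed, cyclic fan-in data
        while driver is not None and driver not in rtl_ctes and driver not in seen:
            seen.add(driver)
            driver = clock_fanin.get(driver)
        if driver in rtl_ctes:
            cte_to_flops[driver].append(flop)
    return dict(cte_to_flops)


if __name__ == "__main__":
    fanin = {"ff1": "icg_syn_1", "icg_syn_1": "cte_a", "ff2": "cte_a", "ff3": "cte_b"}
    print(build_cte_fanout(["ff1", "ff2", "ff3"], fanin, {"cte_a", "cte_b"}))
```

Because divergence is then evaluated once per CTE pair instead of once per flop pair, the number of comparisons depends on the number of CTEs rather than on the number of flops.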
The next stage predicts the total clock path, based on the given weight assumptions, for a single timing path of a given CTE pair (Fig. 7).

Procedure: predict_clock_path
Inputs: List of pre-CTS elements in path
Outputs: Number of elements expected post CTS expansion
1. For each element i in the path,
   a. Compare the module name with the previous element's module name
      i. If base module names are different, add a weight of 5
      ii. If the second module names are different, add a weight of 1
   b. Check the cell name of the element
      i. If it is an ICG, add a weight of 2
      ii. If it is a CTMUX, add a weight of 2
2. Return the total number of elements + weights
Fig. 7 Heuristic for estimating path depth

This process is iterated for a single CTE w.r.t. all other CTEs. If any valid path is found, the output is written to the file. Since each CTE has to be analyzed against all other CTEs, this step is time intensive. To optimize the run time of this step, the algorithm has been designed such that the calculation of divergence for multiple CTE pairs can happen in parallel.

In the last step, the divergence output is parsed by a Perl script to generate uncertainty command files that can be read by the pre-CTS optimization tool.

B. Test-Run Details
In order to check out the divergence aware margin methodology on Design, 4 runs were given till post CTS setup optimization.

1. Traditional Methodology with flat margins
A flat margin of 50% was assumed.
Clock Latency = 8000ps
Launch derate = 1.0632
Capture derate = 0.945
Skew = 250ps
Flat margin = 50%
Divergence margins = 8000 * 50% * (1.0632 - 0.945) + 250 = 714ps
Results are presented in Fig. 10

2. Reduced flat margin flow
In this case, 12% of the design is assumed to be divergent based on the CTD pie chart.
Divergence margins = 8000 * 12% * (1.0632 - 0.945) + 250 = 363ps
Results are presented in Fig. 11
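The Fig. 7 heuristic, the divergence ratio of Table V and the margin formula used in the runs above can be tied together in a short sketch. Several details are assumptions made for illustration: the element model (`ClockElement` with a "/"-separated hierarchy path), the reading of "base" and "second" module names as the first two hierarchy levels, and the choice to treat the two module-name checks as mutually exclusive. The weights 5, 1 and 2 are the ones listed in Fig. 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockElement:
    hier_path: str  # hypothetical naming, e.g. "top/clkgen/u_icg"
    cell_type: str  # e.g. "ICG", "CTMUX", "BUF"


BASE_MODULE_WEIGHT = 5    # Fig. 7, step 1.a.i
SECOND_MODULE_WEIGHT = 1  # Fig. 7, step 1.a.ii
CELL_WEIGHTS = {"ICG": 2, "CTMUX": 2}  # Fig. 7, step 1.b


def _module_levels(hier_path):
    """Return (base, second) module names; the instance name is dropped."""
    parts = hier_path.split("/")[:-1]
    base = parts[0] if len(parts) > 0 else ""
    second = parts[1] if len(parts) > 1 else ""
    return base, second


def predict_clock_path(path):
    """Number of elements expected on the path after CTS expansion."""
    weights = 0
    prev = None
    for elem in path:
        if prev is not None:
            base, second = _module_levels(elem.hier_path)
            prev_base, prev_second = _module_levels(prev.hier_path)
            if base != prev_base:
                weights += BASE_MODULE_WEIGHT
            elif second != prev_second:  # assumption: checks are exclusive
                weights += SECOND_MODULE_WEIGHT
        weights += CELL_WEIGHTS.get(elem.cell_type, 0)
        prev = elem
    return len(path) + weights


def divergence(common_depth, uncommon_depth):
    """Uncommon share of the total clock path, as in Table V (12/20 = 60%)."""
    return uncommon_depth / (common_depth + uncommon_depth)


def divergence_margin_ps(latency_ps, div, launch_derate, capture_derate, skew_ps):
    """Margin = latency * divergence * (launch derate - capture derate) + skew."""
    return latency_ps * div * (launch_derate - capture_derate) + skew_ps
```

With the values of the reduced flat margin run (8000ps latency, 12% divergence, derates 1.0632 and 0.945, 250ps skew), `divergence_margin_ps` evaluates to about 363ps. A path-specific flow would call the same function with the per-path value returned by `divergence` instead of a single design-wide percentage.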
Fig 11 Results of Reduced flat margins flow

VI. CONCLUSION
We have developed a custom-built divergence prediction
engine that generates path-specific margins based on clock
tree divergence with the following advantages.