
Slack Driven Clock Offset CTS

This document discusses clock tree resynthesis for timing closure in multi-corner multi-mode (MCMM) designs. It presents an algorithm that calculates offsets at the output pins of clock tree cells to improve timing metrics. The algorithm realizes both positive and negative offsets and bounds them to avoid adverse impact on the clock tree. It identifies potential acceptor pins using a slack manager that tracks slack parameters, so that restructuring does not degrade timing under different operating conditions.


Clock Tree Resynthesis for Multi-corner Multi-mode Timing Closure

Subhendu Roy (1), Pavlos M. Mattheakis (2), Laurent Masse-Navette (2) and David Z. Pan (1)
(1) ECE Department, The University of Texas at Austin
(2) Mentor Graphics, Fremont

1
Outline
•  CTS Preliminaries
•  Prior Work and Limitations
•  Clock Tree Resynthesis
•  Experimental Results
•  Conclusion and Future Work

2
CTS-Preliminaries
•  CTS – a fundamental step in physical design
•  Modern designs – multi-corner, multi-mode (MCMM)
•  Timing closure – extremely difficult in MCMM designs

3
CTS-Preliminaries
•  Targeting global zero skew would
›  cost area and power
›  limit the achievable operating frequency
•  Data-path optimization alone is not sufficient to handle timing violations
•  Need for data-path-aware clock scheduling, or useful clock skew optimization

4
Prior Work and Limitations (1)
Useful Skew Optimization
•  [Kourtev+, ICCAD’99], [Nawale+, ICCAD’06]
›  Solve an LP or quadratic problem
›  Calculate clock skew in the pre-CTS stage
›  Actual implementation is difficult to achieve in later design stages
›  No support for MCMM

5
Prior Work and Limitations (2)
•  [Lu+, IMSCS’09] – Post-CTS bounded delay buffering at leaves
›  Buffering at leaves: high area/power cost
›  Does not tackle the MCMM scenario
[Figure: two copies of a clock tree with buffers B1, B2, B3 and flops ff1 to ff5; annotation: "Too much area cost".]
6
Prior Work and Limitations (3)
•  [Shen+, ISQED’10] – Post-CTS useful skew implementation in MCMM
›  Local transformation at leaf level: greedy, high area/power cost
›  Insert/remove buffers to delay/speed up clock arrival at flop inputs
›  Speed-up by buffer removal may not be practically realizable
[Figure: two flip-flops (D, Q, Clk). Left: Dslack < 0, Qslack > 0. Right: Dslack > 0, Qslack < 0.]

7
Notion of Offset
•  Pre-CTS useful skew: difficult to implement
•  Post-CTS useful skew: greedy, high area cost, may not support MCMM
[Figure: two copies of a clock tree with buffers B1, B2, B3 and flops ff1 to ff5. One copy carries per-flop schedules s1 to s5; the other carries offsets o1 and o2 near B2 and B3. Annotation: "Reduce granularity in clock scheduling".]
Clock scheduling moved up to driver pins of clock-tree buffers


8
Notion of Offset
•  Positive offset: if doff > 0, clock arrival at B1’s output is to be delayed by doff
•  Negative offset: if doff < 0, clock arrival at B1’s output is to be expedited by |doff|
[Figure: clock tree with buffers B0; B2, B1, B3; B4, B5. The offset doff is marked at B1's output.]

9
Our Contributions
•  First work to consider offsets at output pins of clock tree cells
›  In a placed design with an already routed clock tree
•  An area-efficient and non-intrusive algorithm is presented
›  To realize negative offsets
•  A methodology for clock tree resynthesis is presented
›  Significantly improved timing metrics in large-scale industrial designs under MCMM scenarios

10
Outline
•  CTS Preliminaries
•  Prior Work and Limitations
•  Clock Tree Resynthesis
•  Experimental Results
•  Conclusion and Future Work

11
How CT-Resynthesis Fits in the Flow
Floorplanning, Placement
→ Pre-CTS Optimization
→ Clock Tree Synthesis and Clock Tree Routing
→ Clock Tree Resynthesis (two-step approach)
›  Estimate offsets by LP solver
›  Realize offsets incrementally
→ Post-CTS Data-path Optimization

12
MCMM Offset Estimation
Inputs: synthesized/routed clock tree; user-specified offset range
→ LP Solver [Rama, ISPD’12]
→ Outputs: multi-corner offsets and TNS/THS improvement prediction

13
Positive Offset Realization
•  No impact on siblings
[Figure: clock tree with buffers B0; B2, B1, B3; B4, B5, with +doff marked at B1. After realization, a delay block D1 appears between B1 and B4/B5.]

14
Negative Offset Realization Issues (1)
[Figure: clock tree with buffers B0 to B6, shown before and after B5 is restructured to realize -doff.]
•  Significant impact on timing profile
›  Impact on leaf cells in the TFO cone of old/new siblings of B5
›  Difficult to guarantee overall timing improvement

15
Negative Offset Realization Issues (2)
•  Speed-up by buffer removal may not be practically realizable
[Figure: before: B0, then B1, then B2, B3 and B4. After removing B1: B0 drives B2, B3 and B4 directly.]
B0 is driving more load (wire load + buffers) after buffer removal
16
Offset Bounded Clock Scheduling
•  Implementing negative offsets is difficult
•  For a pin, the more negative the offset:
›  The further up the tree the pin needs to be moved
›  The more FFs down the tree are impacted
•  Solution:
›  Calculation and realization of offsets should be tightly coupled
›  Need for offset bounds
→ Offset Bounded Clock Scheduling


17
Offset Bound Experiments
Levels = [0 3]
Levels = [-1 3]
Levels = [-3 3]
•  Discrete offsets in steps of buffer delay (say 50 ps)
›  If Levels = [-1 1], then possible offset values: -50 ps and 50 ps
Observation: Hardly any TNS improvement from Run 2 to Run 3
Conclusion: Realize the offsets for Run 2
18
Robust Negative Offset Realization
•  Any restructuring should be performed within the scope of a hyper-net
›  Clock gating functionality preserved
•  Hyper-net: set of nets in the same physical partition
›  Nets are logically equivalent or of opposite polarity
›  Separated by buffers/inverters
›  Connected in a tree topology
[Figure: clock tree partitioned into hyper-nets hn0, hn1 and hn2.]

19
Robust Negative Offset Realization
•  Restructuring should guarantee no adverse impact on the clock tree under MCMM
•  Need to identify potential acceptor pins
›  Sequential cells in the TFO should have available positive slack
[Figure: clock tree with buffers B0 to B6, shown before and after B5 is restructured to realize -doff; annotation: "B0 needs to be a good acceptor".]

20
Slack Manager to Identify Acceptors
•  Slack parameters calculated
›  Per scenario (mode + corner combination)
›  Bottom-up fashion
•  Same info kept for D-slack parameters
[Figure: clock tree with B1 above B2 and B3, over flops ff1 to ff5 with Qslk = 8, 4, -2, 8, -6. B1: Qslksum = -8, Qslkcnt = 2. Of B2 and B3, one carries Qslksum = -2, Qslkcnt = 1 and the other Qslksum = -6, Qslkcnt = 1.]
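The numbers on this slide are consistent with Qslksum being the sum of the negative Q-slacks, and Qslkcnt the count of negative Q-slacks, below each buffer. A minimal sketch of such a bottom-up pass for one scenario follows. The function name, the dictionary layout, and the assumption that B2 drives ff1 to ff3 while B3 drives ff4 and ff5 are illustrative and not taken from the slides.

```python
def summarize_q_slack(tree, root, qslk):
    """Bottom-up (Qslksum, Qslkcnt) per buffer for a single scenario.

    tree: dict mapping a buffer name to the list of its children
    qslk: dict mapping a flop name to its Q-slack in this scenario
    Only negative slacks contribute, matching the slide's numbers.
    """
    summary = {}

    def visit(node):
        if node in qslk:                      # leaf flop
            s = qslk[node]
            return (s, 1) if s < 0 else (0, 0)
        total, count = 0, 0
        for child in tree[node]:              # accumulate children first
            s, c = visit(child)
            total += s
            count += c
        summary[node] = (total, count)
        return total, count

    visit(root)
    return summary


# Assumed topology: B1 -> {B2, B3}, B2 -> ff1..ff3, B3 -> ff4, ff5
tree = {"B1": ["B2", "B3"], "B2": ["ff1", "ff2", "ff3"], "B3": ["ff4", "ff5"]}
qslk = {"ff1": 8, "ff2": 4, "ff3": -2, "ff4": 8, "ff5": -6}
print(summarize_q_slack(tree, "B1", qslk))
```

Running the same pass once per scenario, and once more with D-slacks, would give the per-scenario parameters the slide describes.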

21
Clock Tree Restructuring
[Figure: B4 at lev = x - 1; B0, B5 and B6 at lev = x; B1 at lev = x + 1; B2 and B3 below B1.]
Is neg. Q-slack count at B0 - neg. D-slack count at B0 >= 0 ?

22
Clock Tree Restructuring
[Figure: B4 at lev = x - 1; B0, B5 and B6 at lev = x; B1 at lev = x + 1; B2 and B3 below B1.]
Is neg. Q-slack count at B0 - neg. D-slack count at B0 >= 0 ?
No → Size up B1
Yes → To move B1: is neg. Q-slack count at B4 = 0 across all scenarios?
23
Clock Tree Restructuring
[Figure: B4 at lev = x - 1; B0, B5 and B6 at lev = x; B1 at lev = x + 1; B2 and B3 below B1.]
Is neg. Q-slack count at B4 = 0 across all scenarios?
Yes → B4 is a candidate acceptor
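The two checks on these slides can be written as a small decision function. This is only a sketch under assumptions: the function name and return labels are hypothetical, the counts at B0 are taken for a single scenario, and the slides do not say what happens when B4 fails the check, so that branch simply reports that B4 is not a candidate.

```python
def restructuring_decision(neg_q_cnt_b0, neg_d_cnt_b0, neg_q_cnt_b4_per_scenario):
    """Decide how to treat B1 (child of B0) given slack-manager counts.

    neg_q_cnt_b0, neg_d_cnt_b0: negative Q-/D-slack counts at B0
    neg_q_cnt_b4_per_scenario: negative Q-slack count at B4, one per scenario
    """
    # First check from the slide: neg. Q-slack count - neg. D-slack count >= 0 ?
    if neg_q_cnt_b0 - neg_d_cnt_b0 < 0:
        return "size_up_B1"
    # Second check: B4 must have no negative Q-slack in any scenario
    if all(count == 0 for count in neg_q_cnt_b4_per_scenario):
        return "B4_is_candidate_acceptor"
    return "B4_not_a_candidate"   # assumption: outcome not stated on the slides


print(restructuring_decision(2, 1, [0, 0, 0]))
```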

24
Clock Tree Restructuring
[Figure: after restructuring: B4 at lev = x - 1; B0, B5, B6 and B1 at lev = x; B2 and B3 at lev = x + 1.]
Restructuring guarantees no adverse impact on FFs in the TFO of B5 and B6

25
Neg. Offset Realization Algorithm (NORA)
Prune candidate acceptors by level
→ Sort according to geometrical proximity
→ Estimate cost for each acceptor
→ Commit min. cost solution
Cost function:
Cost = ∞ if DRC violation; β * (error) otherwise,
where error = inaccuracy in offset implementation in the constraint scenario

26
Neg. Offset Realization Algorithm (NORA)
•  If there are many acceptors, only the first 10 are considered
›  Saves run time
›  At the same time, keeps restructuring area-efficient
•  If no potential acceptor has available slack,
›  Choose the acceptor with max. Qslacksum across all scenarios
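The NORA selection steps (prune by level, sort by proximity, keep the first 10, pick the minimum cost) can be sketched as follows. Everything concrete here is an assumption made for illustration: the Acceptor record, Manhattan distance as the proximity measure, the set of allowed levels as the pruning rule, and a precomputed realized offset and DRC flag per acceptor. The fallback on max. Qslacksum is not covered.

```python
import math
from dataclasses import dataclass


@dataclass
class Acceptor:
    name: str
    level: int
    x: float
    y: float
    realized_offset: float   # offset (ps) obtained if the pin moves here (assumed input)
    drc_ok: bool             # False if the move would cause a DRC violation


def nora_pick(pin_xy, target_offset, acceptors, allowed_levels, beta=1.0, limit=10):
    """Return (best acceptor, cost) following the slide's four steps."""
    # 1. Prune candidate acceptors by level
    pruned = [a for a in acceptors if a.level in allowed_levels]
    # 2. Sort by geometrical proximity (Manhattan distance is an assumption)
    px, py = pin_xy
    pruned.sort(key=lambda a: abs(a.x - px) + abs(a.y - py))
    # 3. Estimate cost for at most `limit` acceptors
    best, best_cost = None, math.inf
    for a in pruned[:limit]:
        error = abs(a.realized_offset - target_offset)
        cost = beta * error if a.drc_ok else math.inf
        if cost < best_cost:
            best, best_cost = a, cost
    # 4. The caller commits the minimum-cost solution (None if all are infeasible)
    return best, best_cost


candidates = [
    Acceptor("B4", 2, 10, 10, -45, True),
    Acceptor("B7", 2, 12, 10, -50, True),
    Acceptor("B8", 2, 11, 10, -50, False),
    Acceptor("B9", 5, 10, 11, -50, True),
]
best, cost = nora_pick((10, 12), -50, candidates, allowed_levels={1, 2})
print(best.name, cost)
```

Capping the list after the proximity sort is what keeps both run time and added wire length small: far-away acceptors are never costed.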

27
Clock Tree Resynthesis Algorithm
Calculate clock tree offsets
→ Extract offset(p)
→ Offset(p) > 0?
›  Yes: Insert buffer at p
›  No: NORA (p, offset)
→ Update Slack Manager
→ Any remaining offset?
›  Yes: back to Extract offset(p)
›  No: End
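The flowchart amounts to a loop over the estimated offsets. A minimal sketch follows, assuming that the slack manager is updated after every realized offset (positive or negative) and that the three actions are supplied as callbacks; the callback names and the processing order are hypothetical.

```python
def resynthesize(offsets, insert_buffer, nora, update_slack_manager):
    """Realize offsets one by one, as in the flowchart.

    offsets: dict mapping a clock-tree pin to its target offset (ps)
    insert_buffer(pin, offset): realizes a positive offset
    nora(pin, offset): realizes a negative offset
    update_slack_manager(pin): refreshes slack parameters after a change
    """
    for pin, offset in offsets.items():      # "Extract offset(p)"
        if offset > 0:
            insert_buffer(pin, offset)       # positive offset: delay the clock
        else:
            nora(pin, offset)                # negative offset: restructure
        update_slack_manager(pin)            # assumed to follow both branches


log = []
resynthesize(
    {"p1": 50, "p2": -50},
    insert_buffer=lambda p, o: log.append(("buffer", p, o)),
    nora=lambda p, o: log.append(("nora", p, o)),
    update_slack_manager=lambda p: log.append(("update", p)),
)
print(log)
```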
28
Experimental Setup
•  Integrated into an industrial P&R tool
•  Run on 256 GB RAM, 16-core 3 GHz CPU
•  7 industrial designs using 20-32 nm technology nodes

Design  Cells (M)  Scenarios  TNS (ps)   WNS (ps)  FEP
A       0.35       5          -789723    -4433     1907
B       0.62       8          -1586320   -414      12850
C       0.62       8          -82529     -218      1262
D       0.7        8          -1129784   -6433     2408
E       0.85       1          -8032671   -1483     17491
F       1.17       5          -8968128   -6394     43938
G       2.03       6          -4289746   -15418    31946

29
Only Negative Offset Realization

Design  % TNS Imprv.  % WNS Imprv.  % FEP Imprv.  % Clock Tree Overhead  Run Time (min)
A       10.70         -0.13         5.61          2.56                   43
B       11.67         0.24          3.61          7.33                   175
C       13.35         0.92          9.75          2.56                   178
D       32.80         2.64          25.46         1.11                   125
E       2.24          2.83          2.20          1.36                   98
F       5.91          0.75          7.31          0.17                   161
G       34.30         0.08          27.54         0.04                   410
Avg.    15.85         1.05          11.64         1.95                   -

•  Restructuring is area-efficient
•  Avg. 15.85% improvement in TNS

30
Pos. and Neg. Offset Realization

Design  % TNS Imprv.  % WNS Imprv.  % FEP Imprv.  % Clock Tree Overhead  Run Time (min)
A       77.65         1.20          39.54         20.10                  46
B       56.25         0.97          47.32         47.09                  189
C       76.62         49.08         57.84         8.63                   140
D       31.58         18.51         17.57         11.51                  129
E       69.79         10.05         44.43         54.98                  306
F       22.80         0.72          35.69         29.78                  250
G       62.09         3.80          50.33         11.12                  368
Avg.    56.68         12.04         41.82         26.87                  -

•  Timing improves more, at the cost of clock-tree area
•  Avg. 56.68% improvement in TNS

31
The Overall Comparison

32
Conclusion and Future Work
•  First work to consider offsets at output pins of clock tree cells instead of estimating the clock schedule at registers
•  A novel clock tree resynthesis methodology is presented
•  Integrated into an industrial P&R tool
›  Avg. 57% TNS improvement with avg. 26% clock tree area overhead in large-scale MCMM industrial designs

Future Work:
•  Concurrent offset realization
•  Introduce OCV impact into the cost function

33
THANK YOU

Questions?

34
Back-up Slides

35
Future Work
•  Concurrent offset realization
•  Clock-tree area overhead is mainly due to pos. offset realization
›  Modify the cost function in neg. offset realization
•  Introduce the OCV impact into the cost function
›  Inserting a buffer might have an adverse effect
›  Restructuring might improve/degrade OCV due to CPPR

36
Local Transformation
•  Speed-up by buffer removal may not be practically realizable
[Figure: before: B0, then B1, then B2, B3 and B4. After removing B1: B0 drives B2, B3 and B4 directly.]
B0 is driving more load (wire load + buffers) after buffer removal
37
Our Approach
•  Estimate offsets (positive/negative) at the clock tree driver pins
›  Performed by an LP solver [Rama12]
›  MCMM scenarios are considered
•  Realize the positive/negative offsets incrementally
›  On an already synthesized and routed clock tree
›  To ensure the rest of the clock tree remains intact

[Rama12] Functional Skew Aware Clock Tree Synthesis by V. Ramachandran, ISPD 2012

38
Motivation
•  [Kour99], [Naw06] – data-path-aware clock scheduling
›  Calculate clock skew in the pre-CTS stage
›  Actual implementation is difficult to achieve
›  Unaware of MCMM scenarios
•  [Lu09] – post-CTS bounded delay buffering at leaves
›  Buffering at leaves – high area/power cost
›  Does not tackle MCMM scenarios
›  Only delays clock arrival – limited scope for optimization

[Kour99] Clock Skew Scheduling for Improved Reliability via Quadratic Programming by Kourtev et al., ICCAD 1999
[Naw06] Optimal Useful Clock Skew Scheduling in the Presence of Variations Using Robust ILP Formulations by Nawale et al., ICCAD 2006
[Lu09] Post-CTS Clock Skew Scheduling with Limited Delay Buffering by Lu et al., IMSCS 2009

39
Preliminaries
[Figure: FF1 and FF2 connected through a combinational block.]
Skew = sd – ss
sd – ss > 0 ⇒ positive skew
sd – ss < 0 ⇒ negative skew

40
Preliminaries
[Figure: FF1 and FF2 connected through a combinational block; skew sd – ss.]
Setup constraint: T + (sd – ss) > tpd,reg + tpd,comb + Tsu
Hold constraint: tcd,reg + tcd,comb > (sd – ss) + Th
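The two constraints can be read as slack expressions: a path meets the constraint when the corresponding margin is positive. The sketch below assumes that (sd – ss) is passed in as a single skew value; the function names and the example numbers are hypothetical.

```python
def setup_slack(T, skew, tpd_reg, tpd_comb, tsu):
    """Margin of T + (sd - ss) over tpd,reg + tpd,comb + Tsu."""
    return T + skew - (tpd_reg + tpd_comb + tsu)


def hold_slack(skew, tcd_reg, tcd_comb, th):
    """Margin of tcd,reg + tcd,comb over (sd - ss) + Th."""
    return tcd_reg + tcd_comb - (skew + th)


# Hypothetical values in ns: positive skew helps setup and hurts hold
print(setup_slack(T=20, skew=3, tpd_reg=2, tpd_comb=17, tsu=1))
print(hold_slack(skew=3, tcd_reg=1, tcd_comb=4, th=1))
```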

41
Motivation
•  Earlier approach: clock skew minimization
›  [Fish90], [Tsay91], [Kahng92], [Chen04]
•  Issues
›  Maximum operating frequency limited
›  Sacrifice in area/power

[Fish90] Clock Skew Optimization by J. P. Fishburn, IEEE Trans. on Computers 1990
[Tsay91] Exact Zero Skew Clock-routing Algorithm by Tsay, ICCAD 1991
[Kahng92] Zero Skew Clock Routing Trees with Min. Wirelength by Kahng et al., Int. Conf. on ASIC 1992
[Chen04] Zero Skew Clock Tree Optimization with Buffer Insertion/Sizing and Wire Sizing by Chen et al., IEEE Trans. on CAD 2004

42
Motivation
tpd,reg = 2 ns, Tsu = 1 ns
T + (sd – ss) > tpd,reg + tpd,comb + Tsu
[Figure: two combinational stages with delays 17 ns and 11 ns.]
Tclock,min = 20 ns

43
Motivation
tpd,reg = 2 ns, Tsu = 1 ns
T + (sd – ss) > tpd,reg + tpd,comb + Tsu
[Figure: two combinational stages with delays 17 ns and 11 ns; useful skew of 3 ns.]
Tclock,min = 17 ns
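The drop from 20 ns to 17 ns can be checked with the setup constraint. The sketch below assumes a chain of three flops separated by the 17 ns and 11 ns stages, with the 3 ns of useful skew applied as a later clock arrival at the middle flop; the function name and the arrival-time representation are illustrative. Because the constraint is a strict inequality, the value returned is the bound that T must exceed.

```python
def min_period_bound(arrivals, comb_delays, tpd_reg, tsu):
    """Bound on T from T + (sd - ss) > tpd,reg + tpd,comb + Tsu.

    arrivals: clock arrival time at each flop along the chain (ns)
    comb_delays: combinational delay between consecutive flops (ns)
    """
    bounds = []
    for i, tpd_comb in enumerate(comb_delays):
        skew = arrivals[i + 1] - arrivals[i]          # sd - ss for this stage
        bounds.append(tpd_reg + tpd_comb + tsu - skew)
    return max(bounds)


# Zero skew: the 17 ns stage alone dictates the period
print(min_period_bound([0, 0, 0], [17, 11], tpd_reg=2, tsu=1))
# Middle flop clocked 3 ns later: the slow stage gains 3 ns, the fast stage gives up 3 ns
print(min_period_bound([0, 3, 0], [17, 11], tpd_reg=2, tsu=1))
```

With 3 ns of skew both stages reach the same bound (20 - 3 and 14 + 3), so shifting further would make the second stage the limiter.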

44
Outline
•  Preliminaries
•  Motivation
•  Our Approach
•  Feasibility Aware Clock Scheduling (FACS)
•  Clock Tree Resynthesis
•  Experimental Results
•  Future Work and Conclusion

45
What is Offset?
[Figure: two copies of a clock tree with buffers B0 to B5 and an output pin op. Left: +doff at op. Right: -doff at op.]
•  Left: clock arrival at op to be delayed by doff
•  Right: clock arrival at op to be expedited by doff

46
Experimental Results
Discussion:
•  In design E, clock-tree overhead (54.98%) seems high
›  But the increase in total area is < 1%
•  Run time depends on
›  Size of the clock tree
›  Number of offsets to be realized
•  THS optimization with neg. and (pos. + neg.) offsets
›  Design B: 14.5%, 88%
›  Design D: 13%, 15%
•  Biggest benchmark: 2.03M cells, 6 scenarios
›  62% improvement in TNS
›  11% overhead in clock-tree area

47
Offset Extraction in MCMM
•  MCMM handling
›  Scaling factors calculated for each corner
›  Functional timing paths across all active modes analyzed
•  Discrete offsets in steps of buffer delay
›  If Level = [-2 3] and Dbuf = 50 ps, then possible offset values: -100 ps, -50 ps, 50 ps, 100 ps and 150 ps
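The discrete offset values follow directly from the level range and the buffer delay. A minimal sketch, with a hypothetical function name and with level 0 (no offset) excluded as in the slide's example:

```python
def discrete_offsets(levels, dbuf_ps):
    """Possible offsets (ps) for a level range [lo, hi] in steps of one buffer delay."""
    lo, hi = levels
    return [k * dbuf_ps for k in range(lo, hi + 1) if k != 0]


print(discrete_offsets((-2, 3), 50))
print(discrete_offsets((-1, 1), 50))
```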

48
