
Scalable H-Tree With Useful Skew


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2017

Scalable Construction of Clock Trees with Useful Skew and High Timing Quality

Rickard Ewetz, Member, IEEE, and Cheng-Kok Koh

Abstract—Clock trees can be constructed based on static arrival time constraints or dynamic implied skew constraints. Dynamic implied skew constraints allow the full timing margins to be utilized. However, the dynamic skew constraints require a high run-time complexity to be evaluated. In contrast, static arrival time constraints are more restrictive but can be evaluated in constant time. Consequently, there is a trade-off between timing margin utilization and run-time. In this paper, a scalable clock tree synthesis (CTS) framework is proposed for the construction of low-cost useful skew trees (USTs) with high timing quality. The scalability is based on combining the use of arrival time constraints with virtual minimum and maximum delay offsets, which allows a pair of smaller subtrees to be joined into a larger subtree in constant time. The ability to quickly join subtrees is leveraged to perform a high degree of solution space exploration, which translates into the construction of low-cost USTs. In particular, clock trees with various routing tree topologies, buffer tree topologies, buffer sizes, and stem wire lengths are explored. Moreover, the arrival time constraints are specified with the objective of being the least restrictive to reduce cost. Furthermore, the constraints are respecified throughout the tree construction process using a slack graph (SG) to expose additional timing margins. The high timing quality is obtained by seamlessly integrating arbitrary timing models using the SG. Finally, the proposed CTS framework is integrated with a clock tree optimization (CTO) framework to demonstrate that the constructed USTs are capable of meeting timing constraints under the influence of on-chip variations (OCV).

Index Terms—Clock tree synthesis, useful skew, timing constraints, low-power, algorithms.

Manuscript received MONTH XX, YYYY; revised MONTH DD, YYY.
Digital Object Identifier: 10.1109/TCAD.2019.2834437
1937-4151 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Modern integrated circuits (ICs) have limited routing resources and tight power budgets, which requires clock trees to be constructed with short wire length and small buffer area. At the same time, the clock signal must be delivered to the sequential elements (or clock sinks) meeting irregular timing constraints using useful skew, to provide robustness to on-chip variations (OCV). Clock skew is the difference in the arrival time of the clock signal to the sequential elements. There is an explicit skew constraint between each pair of sequentially related clock sinks, i.e., each pair of clock sinks that are only separated by combinational logic in the data and control paths. These explicit skew constraints can be stored in a skew constraint graph (SCG) as shown in Figure 1(a). In an SCG, the vertices and edges represent sinks and skew constraints, respectively. Clock trees meeting the explicit skew constraints can be constructed using a bottom-up synthesis process based on iteratively joining (or merging) pairs of subtrees to form larger subtrees. The run-time and resource utilization of the constructed clock trees mainly depend on (i) the constraints used to guide the tree construction, (ii) the performed solution space exploration, and (iii) the timing margins exposed by the specified constraints.

To satisfy the explicit skew constraints, the merging of subtrees can be guided by static arrival time constraints [1], [2], [3], [4] or dynamic implied skew constraints [5]. A dynamic implied skew constraint is a bound on the minimum and maximum skew between a pair of sinks. Based on the explicit skew constraints, there is a dynamic implied skew constraint between every pair of sinks, as shown in Figure 1(b). If each pair of subtrees is joined within a dynamic implied skew constraint, all the explicit skew constraints will be satisfied. The bounds of each constraint are defined by the lengths of two shortest paths in the SCG. However, after a pair of sinks has been merged, the skew between the sinks is specified and edge weights are required to be updated in the SCG, which in turn requires every implied skew constraint to be updated (with high run-time complexity). For example, the implied skew constraint between sink 1 and sink 4 may change when the skew between sink 3 and sink 4 is specified, as illustrated in Figure 1(a) and (b).

Fig. 1. (a) SCG capturing the explicit skew constraints. Weights in the SCG are updated after a skew is specified. (b) Dynamic implied skew constraints depend on the SCG in (a). The implied skew constraints are therefore required to be updated when a weight in the SCG is updated. $|V|$ is the number of clock sinks and $d_{ij}$ denotes the length of the shortest path from vertex $i$ to vertex $j$ in the SCG.

An alternative to implied skew constraints is static arrival
time constraints [1], [2], [3], [4], which consist of an arrival time range (or in special cases a point) for each sink, as shown in Figure 2(a)–(d). The arrival time constraints are satisfied if the clock signal is delivered to each clock sink within the respective arrival time ranges. The advantage of static arrival time constraints is that the constraints are decoupled from the SCG and are therefore not required to be updated after each skew is specified, which is illustrated in Figure 2(e). Moreover, the decoupling allows the constraints to be obtained in constant time. However, static arrival time constraints are inherently more restrictive than implied skew constraints.

Fig. 2. The static arrival time constraints in (a)–(d) are decoupled from the SCG in (e). $|V|$ is the number of clock sinks.

By storing the minimum and maximum downstream delay of each subtree, subtree pairs can be joined in constant time using static arrival time constraints. Zero skew trees (ZSTs) [1], useful skew trees (USTs) [1], and bounded skew trees (BSTs) [3] can be constructed based on equal arrival time constraints [1], useful arrival time constraints [2], and bounded arrival time constraints [3], respectively. In each of these clock tree construction algorithms, subtree pairs were joined while accounting for the delay of the wires used to join the subtrees. In contrast, interconnect delays were not considered when USTs were constructed based on bounded useful arrival time constraints [4] in [7].

The resource utilization (measured in capacitive cost) of a constructed clock tree depends on the solution space exploration performed in the tree construction process, i.e., the order of subtrees being joined, the insertion of buffers of various sizes, and the insertion of stem wires of various lengths. Buffers are required to be inserted to meet transition time constraints. A stem wire is a piece of wire connecting a subtree to the buffer driving the subtree. Naturally, stem wires allow more flexibility in the placement of the attached buffers. In [8], buffer and stem wire insertion were integrated into the iterative process of joining subtrees. In [9], subtree merging was interleaved with buffer and stem wire insertion to construct clock trees with an equal number of buffers on the path from the clock source to each clock sink. In [3], a rerooting technique was proposed, which effectively allows subtrees to be merged at internal nodes. The technique translated into the construction of clock trees with shorter wire length.

Equal arrival time constraints, bounded arrival time constraints, and implied skew constraints are uniquely defined by the SCG. In contrast, different sets of useful arrival time constraints and bounded useful arrival time constraints are required to be specified based on the explicit skew constraints in the SCG. In [2], useful arrival time constraints were specified using a linear programming (LP) formulation to minimize the clock period or to maximize the robustness to variations. In [4], [7], the bounded useful arrival time constraints were specified to maximize the length of the arrival time ranges lexicographically, i.e., the minimum length is iteratively maximized. The limitation of this approach is that the specified arrival time ranges may become unaligned, which translates into the construction of clock trees with higher capacitive cost.

In this paper, a scalable clock tree synthesis (CTS) framework is proposed for the construction of low-cost USTs with high timing quality. First, we extend the BST construction in [3] to enable subtree pairs to be joined in constant time based on bounded useful arrival time constraints. The extension is performed by introducing virtual minimum and maximum delay offsets. The capacitive cost of a clock tree is determined by the degree of solution space exploration performed in the tree construction process. To achieve a high degree of exploration, routing tree topology exploration, buffer tree topology exploration, and buffer size and stem wire length exploration are generalized into subtree transformations. Next, various subtree transformations are explored in a bottom-up tree construction process. The key is that each transformation can be performed in constant time using the proposed UST construction, which allows numerous transformations to be explored. Moreover, the arrival time constraints are specified to minimize the capacitive cost of the constructed clock trees by optimizing both the length and the alignment of the arrival time ranges. In addition, a slack graph (SG) is introduced to allow the arrival time constraints to be respecified throughout the synthesis process to expose additional timing margins. Moreover, the SG facilitates a seamless integration of arbitrary timing models. Lastly, the proposed CTS framework is integrated with a clock tree optimization framework to demonstrate that the proposed framework is capable of meeting skew constraints under on-chip variations (OCV).

Experimental results show that the proposed approach is capable of constructing clock trees with improved robustness, 27% lower cost, and shorter run-time, compared with earlier studies.

The remainder of the paper is organized as follows: preliminaries are given in Section II. Previous studies are reviewed in Section III. The problem formulation is presented in Section IV. The scalable clock tree construction is explained in Section V. The solution space exploration is described in Section VI. Constraints are specified and respecified in Section VII. The methodology is summarized in Section VIII.
The experimental results are presented in Section IX. We conclude in Section X.

II. PRELIMINARIES

Setup and hold time constraints are imposed between each pair of sequential elements, or flip-flops (FFs), that are separated by only combinational logic. The setup and hold time constraints between a launching flip-flop $FF_i$ and a capturing flip-flop $FF_j$ are formulated as follows:

$t_i + t_i^{CQ} + t_{ij}^{max} + t_j^{S} + \delta_i \le t_j + T - \delta_j$, (1)

$t_i + t_i^{CQ} + t_{ij}^{min} - \delta_i \ge t_j + t_j^{H} + \delta_j$, (2)

where $t_i$ and $t_j$ are the arrival times of the clock signal to $FF_i$ and $FF_j$, respectively. $t_{ij}^{min}$ and $t_{ij}^{max}$ are the minimum and maximum propagation delay through the combinational logic; $t_i^{CQ}$ is the clock to output delay of $FF_i$; $T$ is the clock period; $t_j^{S}$ and $t_j^{H}$ are the setup and hold time of $FF_j$, respectively. $\delta_i$ and $\delta_j$ are delay variations introduced by OCV.

The setup and hold time constraints in Eq (1) and Eq (2) can be reformulated into explicit skew constraints as follows:

$t_h - t_k \le c_{hk}$, (3)

where $t_h$, $t_k$, and $c_{hk}$ are respectively equal to $t_i$, $t_j$, and $T - t_i^{CQ} - t_{ij}^{max} - t_j^{S} - M_{user}$, for each setup constraint in Eq (1). The hold time constraints are first reformulated to $t_j - t_i \le t_i^{CQ} + t_{ij}^{min} - t_j^{H} - \delta_i - \delta_j$. Here, $t_h$, $t_k$, and $c_{hk}$ are respectively equal to $t_j$, $t_i$, and $t_{ij}^{min} + t_i^{CQ} - t_j^{H} - M_{user}$, for each hold time constraint in Eq (2). $M_{user}$ is a user-specified non-negative safety margin that is introduced to account for the delay variations $\delta_i$ and $\delta_j$.

The explicit skew constraints in Eq (3) can be captured in a skew constraint graph (SCG). In an SCG $G = (V, E)$, $V$ is the set of clock sinks and $E$ is the set of skew constraints. For each skew constraint in Eq (3), an edge $e_{hk}$ from vertex $h$ to vertex $k$ is added with a weight $w_{hk} = c_{hk}$. Throughout the synthesis process, skews are specified between pairs of clock sinks. If a skew $skew_{ij} = t_i - t_j = a$ is specified between sink $i$ and sink $j$, the weights of the edges $e_{ij}$ and $e_{ji}$ are updated to $w_{ij} = a$ and $w_{ji} = -a$, respectively, as shown in Figure 1(a) and Figure 2(e).

Dynamic implied skew constraints are imposed between each pair of sinks by the explicit skew constraints. In [5], it was shown that the implied skew constraint between a pair of sinks is defined as follows:

$-d_{ji} \le t_i - t_j \le d_{ij}$, (4)

where $d_{ij}$ and $d_{ji}$ denote the lengths of the shortest paths from vertex $i$ to vertex $j$ and from vertex $j$ to vertex $i$ in the SCG, respectively. As the implied skew constraints are defined by shortest paths in the SCG, they are required to be updated when any skew is specified in the SCG. The time complexity to compute (or update) an implied skew constraint is $O(V \log V + E)$ [6].

Static arrival time constraints consist of an arrival time range for each clock sink. The arrival time constraints are satisfied if the clock signal is delivered to the clock sinks within the respective arrival time ranges [4]. The arrival time range for a sink $i$ is denoted $r_i = [x_i^{lb}, x_i^{ub}]$, where $x_i^{lb}$ and $x_i^{ub}$ respectively denote the lower and the upper bound of the range $r_i$. (The $x_i^{lb}$ and $x_i^{ub}$ notation is illustrated in Figure 4 in Section V-B.) Note that the arrival time ranges are specified with respect to an arbitrary reference point that is not required to be defined. A set of arrival time ranges is defined to be valid if it guarantees that the explicit skew constraints in the SCG are satisfied, which can be ensured as follows:

$x_i^{lb} \le x_i^{ub}, \quad \forall i \in V$ (5)

$x_i^{ub} - x_j^{lb} \le w_{ij}, \quad \forall (i, j) \in E$ (6)

where $V$ and $E$ are the vertices and the edges in an SCG, respectively. As the arrival time constraints are decoupled from the SCG (after they are specified), they are not required to be updated when skews are committed and edge weights in the SCG are updated.

III. PREVIOUS WORKS

In this section, we review earlier studies on clock tree construction, clock tree solution space exploration, and techniques used to specify arrival time constraints.

A. Review of clock tree construction

The ideal properties of clock trees constructed based on static arrival time constraints and dynamic implied skew constraints are shown in Table I.

Equal and useful arrival time constraints result in a low degree of timing margin utilization because the arrival time ranges are in the form of points. Bounded useful arrival time constraints (bounded arrival time constraints) translate into a high (medium) degree of timing margin utilization because the arrival time ranges may (not) be unaligned and have different lengths. It can be understood that bounded useful arrival time constraints are a dominating generalization of equal arrival time constraints, useful arrival time constraints, and bounded arrival time constraints. Therefore, we focus on the differences between clock tree construction based on bounded useful arrival time constraints and dynamic implied skew constraints.

The advantage of clock tree construction based on bounded useful arrival time constraints is that the constraints can be obtained in constant time using the arbitrary reference point. On the other hand, the dynamic implied skew constraints allow the full timing margins to be utilized. A full degree of timing margin utilization is achieved because implied skew constraints capture both the explicit skew constraints and the skews committed in the clock tree construction process. In contrast, bounded useful arrival time constraints only capture the explicit skew constraints. However, in the experimental results in Section IX, it is demonstrated that the difference in timing margin utilization only translates into marginal differences in capacitive cost. Hence, it is advantageous to leverage that bounded useful arrival time constraints can be obtained in constant time.

Using the static arrival time constraints in Figure 2(a)–(c), pairs of smaller subtrees were merged into larger subtrees while accounting for the delay of the wires used to join the subtrees [1], [3]. This was facilitated by storing and incrementally computing the downstream delay of each subtree.
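As a compact illustration of the Section II definitions, the following sketch builds a small SCG, derives an implied skew constraint as in Eq (4), and checks a set of arrival time ranges against Eq (5) and Eq (6). The function names, the three-sink example, and its edge weights are hypothetical. Floyd-Warshall is used here only for brevity; it is not the shortest-path method behind the $O(V \log V + E)$ bound cited from [6]. The sketch also assumes the SCG has no negative cycles, i.e., that the explicit skew constraints are feasible.

```python
import math

def shortest_paths(num_sinks, edges):
    """All-pairs shortest path lengths d[i][j] in the SCG (Floyd-Warshall).

    edges: iterable of (h, k, c) triples encoding t_h - t_k <= c, Eq (3).
    """
    d = [[0 if i == j else math.inf for j in range(num_sinks)]
         for i in range(num_sinks)]
    for h, k, c in edges:
        d[h][k] = min(d[h][k], c)
    for m in range(num_sinks):
        for i in range(num_sinks):
            for j in range(num_sinks):
                d[i][j] = min(d[i][j], d[i][m] + d[m][j])
    return d

def implied_skew(d, i, j):
    """Bounds (lower, upper) on t_i - t_j from Eq (4): -d_ji <= t_i - t_j <= d_ij."""
    return (-d[j][i], d[i][j])

def ranges_valid(ranges, edges):
    """Check Eq (5) and Eq (6) for arrival time ranges given as (x_lb, x_ub)."""
    if any(lb > ub for lb, ub in ranges):
        return False
    return all(ranges[i][1] - ranges[j][0] <= w for i, j, w in edges)

# Hypothetical three-sink SCG.
edges = [(0, 1, 4), (1, 0, 2), (1, 2, 3), (2, 1, 1)]
d = shortest_paths(3, edges)
skew_0_2 = implied_skew(d, 0, 2)  # uses the paths 2 -> 1 -> 0 and 0 -> 1 -> 2
```

In this example no edge connects sink 0 and sink 2 directly, yet their skew is still bounded through sink 1. Committing a skew between any pair of sinks changes edge weights and therefore potentially every entry of `d`, whereas `ranges_valid` only inspects each edge once and never needs to be re-run after a merge.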
TABLE I
PROPERTIES OF CLOCK TREE CONSTRUCTION PERFORMED BASED ON THE DYNAMIC IMPLIED SKEW CONSTRAINTS AND THE STATIC ARRIVAL TIME CONSTRAINTS, WHICH ARE RESPECTIVELY SHOWN IN FIGURE 1 AND FIGURE 2.

For a ZST, the unique downstream delay of each subtree was stored [1]. For the USTs in [1], a virtual delay offset was used to account for the non-alignment of the useful arrival time constraints. For a BST, both the minimum and maximum downstream delay were required to be stored [3]. In contrast, interconnect delays were not considered when joining subtrees in the construction of USTs based on bounded useful arrival time constraints in [7]. Instead, sink pairs (or subtree pairs) were joined if the respective arrival time ranges intersected, which is not sufficient to ensure that the arrival time constraints are satisfied. We propose a clock tree construction technique to overcome this limitation in Section V.

B. Solution space exploration

The cost of a clock tree depends on the order in which the subtrees are merged, the technique used to insert and size buffers, and the method used to insert stem wires. In [10], [3], it was shown that a greedy approach of iteratively joining the subtree pair that requires the least amount of wire length to be inserted results in clock trees with small capacitive cost.

In [3], before merging a pair of subtrees, each subtree can be rerooted into multiple subtrees with different topologies, as illustrated in Figure 3. For example, a subtree with $n \ge 2$ leaf nodes can be rerooted into $2n - 3$ subtrees with different tree topologies. Consequently, two subtrees with $n \ge 2$ and $m \ge 2$ respective leaf nodes can be merged to a subtree with $(n + m)$ leaf nodes while considering $(2n - 3) \cdot (2m - 3)$ tree topologies.

Fig. 3. For a subtree with $n$ leaf nodes, $2n - 3$ tree topologies are explored by rerooting [3]. $(2n - 3) \cdot (2m - 3)$ routing tree topologies are explored when two subtrees with $n$ and $m$ leaf nodes are joined.

During rerooting, it is utilized that the minimum and maximum downstream delay are computed and stored for each partial subtree of a larger subtree (as a BST is constructed). As a result, each rerooted subtree can be obtained by pairwise merging three partial subtrees of the initial subtree (or a previously rerooted subtree). Therefore, each rerooted subtree can be formed in constant time and the run-time is linear with respect to the number of topologies that are explored. Hence, the exploration of tree topologies based on arrival time constraints is labeled easy in Table I. Exploring tree topologies based on dynamic implied skew constraints is labeled difficult in Table I because each alternative topology would require a separate SCG to be stored.

Buffer and stem wire insertion has mainly been explored for ZSTs and is typically performed based on a set of predefined rules. For example, the maximum length stem wire is inserted below each buffer in [9]. In Section VI, we propose to overcome this limitation by generalizing the insertion of buffers, buffer sizing, and stem wire insertion into subtree transformations.

C. Specifying and respecifying arrival time constraints

Given an SCG, many different sets of bounded useful arrival time constraints can be specified. Each different set of constraints results in a clock tree with a different capacitive cost. In [4], [7], the minimum length of the arrival time ranges was lexicographically maximized up to a user-specified threshold. The drawback of this approach is that the arrival time ranges may become unaligned, which may increase the capacitive cost. In Section VII, it is discussed why it is important to optimize both the length and the alignment of the constraints.

Static arrival time constraints are independent of the skews committed in the tree construction process. To expose additional timing margins, we propose to periodically respecify the bounded useful arrival time constraints while considering the skews committed in the tree construction process. The details are provided in Section VII.

IV. PROBLEM FORMULATION

This paper is focused on the problem of constructing a clock tree that delivers a synchronizing clock signal from a clock source to a set of clock sinks. The source to sink connections in the clock tree are realized using buffers from a buffer library and wires from a wire library. The objective is to construct
a clock tree that satisfies the explicit skew constraints using the least amount of wire and buffer resources. The resource utilization is measured in capacitive cost, which is known to correlate closely with power consumption. Moreover, the transition time at any point in the clock tree is required to be less than a user-specified parameter $T_{tran}$.

We approach this problem by proposing a UST construction algorithm that allows pairs of subtrees to be joined in constant time based on bounded useful arrival time constraints. The algorithm is based on combining virtual minimum and maximum delay offsets with the BST algorithm in [3].
Using the proposed clock tree construction, there is a trade-off between solution space exploration and run-time. The proposed techniques for solution space exploration are described in Section VI.

Given the constraints in an SCG, different sets of arrival time constraints can be specified. In Section VII, the arrival time constraints are specified to minimize the capacitive cost. Moreover, a slack graph (SG) is introduced to allow the arrival time constraints to be respecified throughout the synthesis process, to expose additional timing margins. The SG also facilitates the construction of clock trees with high timing quality by allowing various timing models to be seamlessly integrated.

V. SCALABLE CONSTRUCTION OF USTS

The proposed UST construction is based on combining the BST construction in [3] with the use of virtual minimum and maximum delay offsets. First, the BST construction in [3] is reviewed in Section V-A. Next, the proposed UST construction is presented in Section V-B.

A. Detailed review of BST construction in [3]

In [3], the BST construction is based on the observation that the clock signal will be delivered within the bounded arrival time constraints if the maximum skew between any pair of clock sinks is less than $B$. Here, $B$ is equal to the length of each arrival time range, i.e., the difference between the upper and lower bound of each arrival time range.

To facilitate the construction of such a BST, the minimum and maximum downstream delay of each subtree $i$ are stored and denoted $min\_t_i$ and $max\_t_i$, respectively. Initially, $min\_t = 0$ and $max\_t = 0$ for each subtree. Next, a clock tree is constructed by iteratively merging subtrees while ensuring that $max\_t_k - min\_t_k \le B$ for each formed subtree $k$.

A pair of subtrees $i$ and $j$ are merged into a larger subtree $k$ with $max\_t_k - min\_t_k \le B$ as follows: the subtrees $i$ and $j$ are connected with a wire and the length of the wire is equal to the Manhattan distance between the subtrees. (For certain pairs of delay imbalanced subtrees, detour wiring is required [1].) Next, the alternative locations for the root of subtree $k$ are determined. This can be performed in constant time, as the skew bound $B$ can be obtained in constant time and $min\_t_k$ and $max\_t_k$ can be computed incrementally as follows:

$min\_t_k = \min\{min\_t_i + w(k, i),\ min\_t_j + w(k, j)\}$, (7)

$max\_t_k = \max\{max\_t_i + w(k, i),\ max\_t_j + w(k, j)\}$, (8)

where $w(k, i)$ and $w(k, j)$ denote the interconnect delay of the wire between the root of the subtree $k$ and the root of the subtrees $i$ and $j$, respectively.

In the next section, the BST construction is extended such that a UST can be constructed based on static bounded useful arrival time constraints.

B. Proposed construction of UST

The proposed UST construction is based on storing the minimum and maximum downstream delay of each subtree (similar to the BST construction in [3]). In addition, virtual minimum and maximum delay offsets are used to account for the non-alignment and the range constraints, similar to the delay offsets used to handle useful arrival time constraints.

A maximum skew bound $B^v$ and virtual minimum and maximum delay offsets are introduced for each subtree. Let $off_i^{min}$ and $off_i^{max}$ denote the virtual minimum and maximum delay offset for a subtree $i$. Consider setting $B = B^v$, $min\_t_i = off_i^{min}$, and $max\_t_i = off_i^{max}$ for each clock subtree $i$. Next, a UST can be constructed in an identical fashion as a BST in [3].

$B^v$ and the virtual minimum and maximum delay offset for each sink are defined by the arrival time constraints, which is illustrated in Figure 4. $B^v$ is set to an arbitrary value that satisfies $B^v/2 \ge x_i^{ub}$ and $B^v/2 \ge -x_i^{lb}$ for all $i \in V$. The virtual minimum delay offset $off_i^{min}$ and virtual maximum delay offset $off_i^{max}$ for a sink $i$ are specified by the arrival time constraints and $B^v$ as follows:

$off_i^{min} = -\frac{B^v}{2} - x_i^{lb}$, (9)

$off_i^{max} = \frac{B^v}{2} - x_i^{ub}$, (10)

Fig. 4. Offsets defined by bounded useful arrival time constraints.

The skew bound $B^v$ can be obtained in constant time and $min\_t_k$ and $max\_t_k$ can still be incrementally computed for each subtree $k$. Therefore, it is possible to merge a pair of subtrees in constant time. Given a specified routing tree topology, it is therefore also possible to construct a UST in linear time.

Note that the reference point is arbitrary and not specified and that $B^v$ can in fact be defined to an arbitrary value by the offsets. The generalization allows arrival time ranges to be unaligned (importantly pairwise non-intersecting) and of different lengths. The increased flexibility in the specification
of the ranges translates into looser timing constraints and the ability to incorporate useful skew while retaining all the features of the BST construction.
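The merge step of Section V can be sketched as follows, under simplifying assumptions: wire delays are passed in as plain numbers rather than derived from a delay model, and $B^v$ is chosen as the smallest value allowed by the two inequalities above (the text permits any value satisfying them). The names `Subtree`, `skew_bound`, `leaf`, and `merge` are hypothetical. Leaves are initialized with the offsets of Eq (9) and Eq (10), and two subtrees are joined with Eq (7) and Eq (8), rejecting the merge if the spread exceeds $B^v$.

```python
from dataclasses import dataclass

@dataclass
class Subtree:
    min_t: float  # minimum downstream delay, including the virtual offset
    max_t: float  # maximum downstream delay, including the virtual offset

def skew_bound(ranges):
    """Smallest B^v with B^v/2 >= x_ub and B^v/2 >= -x_lb for every sink."""
    return 2 * max(max(ub, -lb) for lb, ub in ranges)

def leaf(lb, ub, bv):
    """Initialize a sink from its arrival time range using Eq (9) and Eq (10)."""
    return Subtree(min_t=-bv / 2 - lb, max_t=bv / 2 - ub)

def merge(a, b, w_a, w_b, bv):
    """Join subtrees a and b with wire delays w_a and w_b, per Eq (7) and Eq (8).

    Returns None if the merged subtree would violate max_t - min_t <= B^v.
    """
    k = Subtree(min_t=min(a.min_t + w_a, b.min_t + w_b),
                max_t=max(a.max_t + w_a, b.max_t + w_b))
    return k if k.max_t - k.min_t <= bv else None

# Two sinks with non-intersecting ranges: useful skew is required.
ranges = [(0, 2), (3, 5)]
bv = skew_bound(ranges)
s0, s1 = (leaf(lb, ub, bv) for lb, ub in ranges)

feasible = merge(s0, s1, w_a=1, w_b=4, bv=bv)    # sink 1 sees 3 more delay
infeasible = merge(s0, s1, w_a=4, w_b=1, bv=bv)  # skew in the wrong direction
```

Expanding the bound check with the offsets substituted shows that it holds exactly when some root arrival time places both sinks inside their ranges, which is why the second call is rejected: sink 0 would receive the clock later than sink 1 although its range lies earlier. Each call to `merge` performs a fixed number of arithmetic operations, matching the constant-time claim.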

VI. CLOCK TREE SOLUTION SPACE EXPLORATION

The clock tree solution space exploration in this paper is performed by using three forms of subtree transformations: a routing tree topology transformation [3], a buffer tree topology transformation, and a buffer sizing and stem wire insertion transformation. Using the transformations, each subtree is transformed into alternative implementations. Next, pairs of subtrees are joined to form larger subtrees while considering each alternative implementation for every subtree.

The routing tree topology transformation is defined to be identical to the rerooting transformation of a subtree [3]. The buffer tree topology transformation and the buffer sizing and stem wire insertion transformation are outlined below.

A. Buffer tree topology transformation

After sufficiently large subtrees have been formed (by merging subtrees), buffers are required to be inserted at the root of each subtree to satisfy the transition time constraints [8], [9]. The buffer tree topology transformation involves relocating the inserted buffer from the root node to internal nodes of the subtree. A subtree with $n$ leaf nodes (in the form of clock sinks or buffers downstream in the topology) can be transformed into $2n - 3$ alternative topologies, which is shown in part of Figure 3. Relocating a buffer to an internal node is equivalent to removing the buffer, rerooting the subtree, and inserting the buffer at the new root node. Hence, the run-time is linear with the number of explored buffer tree topologies, as the rerooting can be applied in constant time and buffer insertion can be performed in constant time using a look-up table (LUT). In the LUT, the propagation delay of the buffer is a function of the input transition time and lumped capacitive load (the input transition time is annotated in the bottom-up tree construction). The advantage of the proposed transformation is that the placement of the buffer is more flexible, which results in the construction of clock trees with shorter wire length.

B. Buffer sizing and stem wire insertion transformation

The buffer sizing and stem wire insertion transformation is based on resizing the buffer attached to the root of a subtree and inserting a stem wire. For each subtree with a buffer inserted at the root, the buffer is resized into $p$ different sizes and stem wires of $q$ different lengths are inserted, for a total of $p \cdot q$ combinations. In particular, the smallest $p$ buffers that can drive each subtree are considered. The $q$ inserted stem wires have lengths evenly distributed between zero and the maximum length determined by the transition time constraint. Note that the maximum length is dependent on the size of the buffer. The different buffer and stem wire length combinations for $p = 2$ and $q = 3$ are illustrated in Figure 5.

Fig. 5. Alternative buffer sizes and stem wire lengths explored by the buffer sizing and stem wire insertion transformation.

The transformation allows $p \cdot q$ alternative buffer and stem wire combinations to be evaluated, which may translate into the construction of clock trees with lower cost. In particular, the use of stem wires of various lengths increases the flexibility in the placement of the buffer driving the subtree, which may translate into the use of less wire length. In addition, larger buffers and long stem wires can substantially change the $min\_t_k$ and $max\_t_k$ of a subtree $k$, which may be required to satisfy the arrival time constraints. This transformation may be particularly advantageous when combined with the previous two transformations.

VII. SPECIFYING AND RESPECIFYING ARRIVAL TIME CONSTRAINTS

In this section, valid bounded useful arrival time constraints are specified based on the explicit skew constraints. Techniques are also proposed to respecify the constraints using an SG after part of a clock tree has been constructed. It is not difficult to specify a set of valid arrival time constraints. Every feasible solution of an LP formulated with the constraints in Eq (5) and Eq (6) forms a set of valid arrival time constraints. The challenge is how to define a suitable objective function, such that the solution to the LP formulation results in arrival time constraints that minimize the capacitive cost of clock trees constructed using the constraints.

We approach this challenge by observing the following property of arrival time constraints: let $r_I$ be the intersection of the arrival time constraints of all sinks and let $|r_I|$ be the range of $r_I$ (if the intersection is non-empty). All subtree(s) constructed from the sinks satisfying a skew bound $B = |r_I|$ will satisfy the arrival time constraints.

It can be easily understood that the larger $|r_I|$ is, the less constrained the tree construction is and, therefore, the more likely it is that the clock tree will have lower capacitive cost. Suppose we construct the bottom $k$ stages of a clock tree without considering any skew constraints, where a stage consists of subtrees, each driven by a buffer. Let $skew(k)$ denote the maximum skew between any pair of sinks in the subtrees of these bottom $k$ stages constructed in such a fashion. We attempt to specify the arrival time constraints with $|r_I| \ge skew(k)$. This would imply that the $k$ bottom-most stages could be constructed in an unconstrained fashion, which probably would result in clock trees with small capacitive cost, as it is well known that a majority of the capacitive cost of a clock tree is located in the bottom-most stages [11], [12].

The limitation of the proposed approach is that if any explicit skew constraints require useful skew to be satisfied, i.e., $t_i - t_j \le -b$, where $b > 0$. No common intersection $r_I$
exists, which is the main limitation of the bounded arrival time constraints in [3].

A. Proposed LP formulation

We propose to specify the arrival time constraints with the following goals: (1) The range constraints have to be valid, i.e., the constraints in Eq. (5) and Eq. (6) have to be satisfied. (2) The lower and upper bounds of each range constraint should be minimized and maximized, respectively. (3) The arrival time constraints should be aligned although they are allowed to be unaligned (to allow the use of useful skews). (4) Arrival time constraints consisting of arrival time ranges of similar length are preferred. The motivation for this preference is that a subtree is always more constrained timing-wise than the subtrees from which it was constructed. Hence, the clock tree construction is constrained by the arrival time range with the smallest length.

With these goals, we propose the following LP formulation:

$$\min \sum_{i \in V} f(x_i^{lb})^{lb} + f(x_i^{ub})^{ub} \quad (11)$$

$$x_i^{lb} \le x_i^{ub}, \quad \forall i \in V \quad (12)$$

$$x_i^{ub} - x_j^{lb} \le w_{ij}, \quad \forall (i, j) \in E \quad (13)$$

where $f(x)^{lb}$ and $f(x)^{ub}$ are convex p-part piecewise linear functions shown in Figure 6. $c_1, \dots, c_p$ are user-specified weights and $\frac{skew^{(1)}}{2}, \dots, \frac{skew^{(p-1)}}{2}$ are stage skews.

Fig. 6. Convex piecewise linear functions $f(x)^{lb}$ and $f(x)^{ub}$. (a) Lower bound objective. (b) Upper bound objective.

The formulation achieves goals (1) and (2) through the constraints in Eq (12) and Eq (13) and the objective function. The formulation achieves goals (3) and (4) by setting the slope of the piecewise linear functions $f(x)^{lb}$ and $f(x)^{ub}$ as illustrated in Figure 6(a) and (b). In the figure, it can be observed that there is a heavy penalty if the lower bound (the upper bound) of an arrival time range is not set to be lesser (greater) than $-\frac{skew^{(1)}}{2}$ ($\frac{skew^{(1)}}{2}$). Moreover, the slopes of $f(x)^{lb}$ and $f(x)^{ub}$ are changed at certain multiples of $-\frac{skew^{(1)}}{2}$ and $\frac{skew^{(1)}}{2}$, to encourage the arrival time ranges to be aligned and centered around $[-\frac{skew^{(1)}}{2}, \frac{skew^{(1)}}{2}]$. Empirically, we find that it is important to set the slopes of the different parts to be drastically different, to avoid specifying a few arrival time ranges with disproportionately small lengths. In our implementation, $c_i$ is set to $200^i/20000$.

Note that piecewise linear functions can be formulated by using linear variables and linear constraints. In general, a p-piecewise objective function consists of p terms in the objective and p − 1 linear constraints.

B. Respecifying constraints using a slack graph (SG)

In this section, a slack graph (SG) is introduced to capture both the skew constraints captured in an SCG and the downstream delays to the clock sinks. Using the SG, the arrival time constraints are respecified to expose additional timing margins.

In an SG G = (V, E), the vertices V represent the clock sinks and the edges E represent the slack in the skew constraints. For each constraint in Eq (1), an edge $e_{ij}$ is added with a weight equal to the slack in the setup time constraint, i.e., $w_{ij} = t_j - t_i + T - t_i^{CQ} - t_{ij}^{max} - t_j^{S} - M_{user}$. For each constraint in Eq (1), an edge $e_{ji}$ is added with a weight equal to the slack in the hold time constraint, i.e., $w_{ji} = t_i - t_j + t_{ij}^{min} + t_i^{CQ} - t_j^{H} - M_{user}$. Note that an SG and an SCG are identical prior to the tree construction, when $t_i$ and $t_j$ are zero.

To respecify the arrival time constraints, the SG is first formed. The vertices that belong to the sinks in the same subtree are merged into a single vertex. Next, the arrival time range for the subtrees can be specified directly using the SG and the LP formulation in Eq (11)-(13). The motivation for merging the vertices located in the same subtree is that the arrival time range for all the sinks in the subtree will be aligned with respect to the current arrival times, which exposes additional timing margins.

The SG also facilitates a seamless integration of arbitrary timing models. The SG can be formed while computing the downstream delays using an accurate timing model (such as NGSPICE simulations [13]). Next, the arrival time constraints can be respecified and the tree construction can be performed guided by the Elmore delay model. Note that when different timing models are integrated, the formed SG may contain negative cycles, which must be removed prior to solving the LP formulation in Eq (11)-(13). In our implementation, the negative cycles are eliminated by removing safety margins (increasing edge weights) lexicographically from the slack constraints that are part of the negative cycles, i.e., by iteratively minimizing the maximum reduction of the inserted safety margins $M_{user}$.

Using the proposed transformations in Section VI, assume that there are K alternative implementations for each subtree. The arrival time constraints can be respecified while accounting for each of the alternative implementations. Let $t_i^k$ denote the arrival time of the clock signal to clock sink i in implementation k. Let $t_i^{min}$ and $t_i^{max}$ denote the minimum and maximum arrival time to the clock sink i across the K alternative implementations. Next, the SG is formed by adding an edge $e_{ij}$ with a weight $w_{ij} = t_j^{min} - t_i^{max} + T - t_i^{CQ} - t_{ij}^{max} - t_j^{S} - M_{user}$ for each setup time constraint. An edge $e_{ji}$ with a weight $w_{ji} = t_i^{min} - t_j^{max} + t_{ij}^{min} + t_i^{CQ} - t_j^{H} - M_{user}$ is added for each hold time constraint. It can be observed that these edge weights are more constrained (smaller), as the worst
minimum and maximum downstream delay to each sink is used to specify the edge weights. Therefore, the consideration of alternative subtree implementations comes at the expense of exposing a smaller amount of timing margins. Moreover, negative cycles may be created in the SG when alternative implementations are considered. Therefore, the arrival time constraints must be respecified while only considering a selected subset of the alternative implementations such that no negative cycles are formed in the SG.

VIII. METHODOLOGY

In this section, we outline the proposed framework consisting of a clock tree synthesis (CTS) phase and a clock tree optimization (CTO) phase. The CTS phase is based on combining the scalable UST construction in Section V with the solution space exploration in Section VI and the specification and respecification of the arrival time constraints in Section VII. The overall CTS flow follows the classical bottom-up tree construction algorithms based on the DME paradigm in [5], [3], [9], [14], [15], [8]. The topology of a clock tree is generated in a bottom-up tree construction process. Next, the clock tree is embedded top-down by specifying the physical locations for the internal nodes. An overview of the framework is shown in Figure 7. We refer the reader to [3] for the details of the DME paradigm.

Fig. 7. Flow for CTS and CTO.

A clock tree is constructed buffer stage by buffer stage. A buffer stage consists of a set of subtrees, each driven by a buffer. The input to the construction of the bottom-most stage is the clock sinks, and the input to the construction of the consecutive stages is the input pins of the driving buffers of the previous buffer stage. Each buffer stage is constructed by specifying (or respecifying) the arrival time constraints (see Section VIII-C). Next, subtrees are iteratively pairwise merged to form larger subtrees while satisfying the arrival time constraints and the transition time constraint (see Section VIII-A1). After no more subtrees can be merged, a buffer is inserted to drive each subtree (see Section VIII-A2). Solution space exploration can be performed throughout the tree construction process (see Section VIII-B), to reduce the capacitive cost.

After all sinks have been joined to a single clock tree, the clock tree is embedded top-down [8], [3]. Next, clock tree optimization (CTO) is employed to remove any remaining timing violations (see Section VIII-D). The timing violations stem from the fact that the inserted safety margins are smaller than the magnitude of the delay variations introduced by OCV.

The clock tree construction is guided by the Elmore delay model and pre-characterized LUTs for the buffers. The LUTs characterize the output transition time and propagation delay of the buffers as a function of the downstream load capacitance and the input transition time. The CTO process is guided by NGSPICE simulations. Optionally, NGSPICE simulations can also be integrated into the CTS phase to accurately evaluate the transition time of each subtree and to update the SG with accurate timing after each buffer stage has been constructed.

A. Tree construction

1) Merging of subtrees: The input to the subtree merging phase is the clock sinks (singleton subtrees) or the buffered subtrees of the previous stage. In the merging process, larger subtrees are formed by merging pairs of smaller subtrees. Iteratively, the pair of subtrees that requires the least amount of wire length to be physically joined is selected to be merged [10]. (Note that the selection of the physical location for the merging point is deferred using the DME paradigm until the top-down embedding [8], [3].) Merging a pair of subtrees and evaluating the cost of the merge can be computed in O(1), using the virtual minimum and maximum delay offsets proposed in Section V-B. The merging process is guided by a nearest neighbour graph (NNG) that consists of vertices and weighted edges. In an NNG, the vertices represent subtrees and the weight of each edge is equal to the wire length required to join the two subtrees connected by the edge. Iteratively, the pair of subtrees connected by the least-weight edge is selected to be joined. Next, the two vertices of the respective subtrees are removed from the NNG. If the subtree pair can be joined while satisfying the transition time constraint, the subtree pair is merged into a larger subtree and inserted into the NNG again [10]. If the subtree pair cannot be joined while satisfying the transition time constraint, the two subtrees are locked from further merging. The transition time constraint is evaluated using the LUTs and the closed form expression in [16]. Since the input transition time is not known during the bottom-up tree construction process, it is set to a user-specified parameter $T_{tran}$ = 60 ps. After all subtrees have been locked, buffers are inserted to drive each subtree.

Subtrees that are spatially distant are not likely to be joined because the wire length required to join two subtrees would be long. Therefore, to reduce the run-time, edges are only added in the NNG between subtrees that are spatially close. The pairs of subtrees that are spatially close are determined using Delaunay triangulation [17]. (An equally good alternative would be to use the partitioning method in [3].)

2) Buffer insertion: Each locked subtree has the minimum buffer that can drive the subtree without violating the transition time constraint inserted at the root of the subtree. The minimum buffer is found using a binary search of the buffer library. Next, the subtrees with driving buffers attached are reinserted into the NNG.
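The merging loop of Section VIII-A1 can be illustrated with a small Python sketch. It is not the authors' C++ implementation and rests on several simplifying assumptions that are not in the paper: candidate edges are added between all pairs of subtrees (instead of Delaunay neighbours), the wire length of a merge is the Manhattan distance between representative root points, the merged root is placed at the midpoint (the paper defers this choice to the DME embedding), arrival time constraints are ignored, and the hypothetical `max_load` threshold stands in for the LUT-based transition time check.

```python
import heapq
from itertools import count


def greedy_merge(points, max_load):
    """Greedily merge the closest subtree pairs; return the locked subtrees.

    Each subtree is a tuple (x, y, load, members). All names and the load
    model are illustrative assumptions, not the paper's data structures.
    """
    ids = count()
    alive = {}    # vertex id -> subtree still present in the graph
    locked = []   # subtrees that may no longer be merged
    heap = []     # candidate edges as (wire_length, id_a, id_b)

    def add(sub):
        # Insert a subtree and add an edge to every subtree already present.
        new_id = next(ids)
        for other_id, other in alive.items():
            dist = abs(sub[0] - other[0]) + abs(sub[1] - other[1])
            heapq.heappush(heap, (dist, other_id, new_id))
        alive[new_id] = sub

    for i, (x, y) in enumerate(points):
        add((x, y, 1.0, (i,)))  # every sink starts as a singleton subtree

    while heap:
        dist, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive:
            continue  # stale edge: one endpoint was already merged or locked
        sa, sb = alive.pop(a), alive.pop(b)  # remove both vertices
        load = sa[2] + sb[2] + dist  # unit wire capacitance assumed
        if load <= max_load:
            # Stand-in for "transition time constraint satisfied": merge and reinsert.
            add(((sa[0] + sb[0]) / 2, (sa[1] + sb[1]) / 2, load, sa[3] + sb[3]))
        else:
            # The pair cannot be joined: lock both subtrees.
            locked.extend([sa, sb])

    locked.extend(alive.values())  # a final subtree without remaining neighbours
    return locked
```

For example, `greedy_merge([(0, 0), (1, 0), (10, 0), (11, 0)], 5.0)` groups the two nearby pairs and then locks them, because joining the two pairs would exceed the assumed load limit. Each locked subtree would subsequently receive a driving buffer, as described in the buffer insertion step.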
B. Solution space exploration

Solution space exploration is performed by transforming each subtree into alternative implementations. In the following, we describe the details of how the transformations are applied in the tree construction process.

1) Routing tree topology transformation: When a subtree with n leaf nodes is inserted into the NNG, it is rerooted to (2n − 3) subtrees with different tree topologies. To bound the run-time overhead, subtrees with detour wires are discarded during the rerooting process. In addition, if there are more than $N_{max}$ alternative implementations for a subtree, the $N_{max}$ implementations with the smallest capacitive cost are sampled and the remainder of the implementations are discarded [3]. Consequently, the run-time complexity of evaluating the cost of joining two subtrees is $O(N_{max}^2)$. After a pair of subtrees is joined, only the subtree combination with the smallest capacitive cost is kept. Next, the newly formed subtree is rerooted and reinserted into the NNG.

2) Buffer tree topology transformation: The buffer tree topology transformation is applied after a buffer has been inserted at the root of each subtree in the buffer insertion step. The transformation allows a buffered subtree with n leaf nodes to be transformed into (2n − 3) alternative implementations. However, many implementations are eliminated because of the transition time constraint. In our implementation, subtrees that required the insertion of detour wires were also removed. Next, the formed subtrees with buffers attached at the root are inserted into the NNG.

3) Buffer sizing and stem wire insertion transformation: Buffer sizing and stem wire insertion is applied after the buffer tree topology transformation. In combination, the two transformations result in up to p · q · (2n − 3) alternative implementations for a subtree with n leaf nodes.

When stem wires of different lengths are used, the cost of merging two subtrees is set to the length of the stem wires (in the previous buffer stage) plus the length of wires required to connect the subtree pair. If buffers are sized up, the capacitive cost of the buffers is also included. (The wire length cost can be translated into capacitive cost.) When stem wires of different lengths are used, it is common that multiple different alternative subtree combinations result in subtrees with the exact same (smallest) cost. For these cases, it is important to keep all (or at least multiple) alternative subtree combinations, or there may be a noticeable overall loss in solution quality. Even though the capacitive costs are the same, the root node of each topology may be restricted to different spatial locations. Note that the selection of the exact physical location of each buffer (and stem wire) is deferred using the DME paradigm until the top-down embedding as in [9].

C. Specification and respecification of arrival time constraints

In the construction of the bottom-most buffered stage, the arrival time constraints are specified with respect to the sinks, as described in Section VII. In the construction of a higher-level buffer stage, an SG is formed and the arrival time constraints are respecified to expose additional timing margins. It is easy to respecify the constraints if only a single implementation is considered for each subtree. (We do not consider various buffer sizes and various stem wire lengths when respecifying the constraints.) However, when alternative buffer tree topologies are considered, negative cycles may be formed in the SG. Therefore, a subset of alternative subtree implementations that does not create negative cycles must be found. This is solved by first selecting the minimum cost subtree implementation for each subtree. Next, one alternative implementation is added to each subtree while it is verified that no negative cycles are formed in the SG. If a negative cycle is formed, the subtree implementation creating the cycle is discarded. This process is repeated until all implementations have been added or discarded.

D. Clock tree optimization

After an initial clock tree has been constructed in the CTS phase, some timing violations may still exist. The timing violations stem from the fact that the inserted safety margins $M_{user}$ are smaller than the magnitude of the delay variations $\delta_i$ and $\delta_j$ introduced by OCV. Let the closest common ancestor of $FF_i$ and $FF_j$ in the clock tree be denoted $CCA_{ij}$. In Eq (1) and Eq (2), $\delta_i$ ($\delta_j$) is equal to $c_{ocv}$ times the propagation delay between $CCA_{ij}$ and $FF_i$ ($FF_j$) in the clock tree. The parameter $c_{ocv}$ is set to 0.085 in our experiments. The total negative slack (TNS) is the sum of the timing violations in Eq (1) and Eq (2). The objective of the CTO phase is to reduce TNS to zero. The motivation for performing CTO in this paper is to allow comparisons with results in earlier studies where CTO was applied. We evaluate our clock trees both after CTS and after CTO, to demonstrate the effectiveness of the proposed framework.

In general, CTO is performed by realizing delay adjustments in the tree by inserting buffers and detour wires. The delay adjustments are specified using an LP formulation [18], [19], [20]. For further technical details of the CTO phase, please refer to [18], [19], [20].

IX. EXPERIMENTAL EVALUATION

In this section, we present experimental results to demonstrate the effectiveness of the proposed framework. In Section IX-A, we introduce the tree structures that are used in the evaluation. In Section IX-B, static arrival time constraints are compared with dynamic implied skew constraints. The solution space exploration is evaluated in Section IX-C. The techniques of specifying and respecifying arrival time constraints are evaluated in Section IX-D. The robustness of the constructed clock trees to OCV is evaluated in Section IX-E and the timing model selection is evaluated in Section IX-F. Note that CTO is only applied when evaluating the robustness to OCV and the timing model selection in Section IX-E and Section IX-F, respectively.

The algorithms are implemented in C++ and the experiments are performed on an 8-core 3.4 GHz Linux machine with 31.3 GB of memory.
A. Tree structures used in evaluation

Using the proposed tree construction framework, various different tree structures are constructed. The structures are described below:
(1) The D-UST structure is a tree structure that is constructed using dynamic implied skew constraints, i.e., the Greedy-UST/DME algorithm in [5].
(2) The LD-UST structure is an extension of the D-UST structure that can meet a user-specified latency¹ bound at the expense of increased capacitive cost [15].
(3) The P-UST structure is a tree structure that is constructed based on useful arrival time constraints. The point constraints are specified using a modified version of the LP formulation in Section VII.
(4) The S-UST structure is a tree structure that is constructed based on bounded useful arrival time constraints. The range constraints are specified using the LP formulation in Section VII.
(5) The LS-UST structure is a tree structure that is constructed based on bounded useful arrival time constraints. The range constraints are specified by maximizing the length of the range constraints lexicographically [4], [7].
(6) The S1-UST structure is the S-UST structure with the addition of the routing tree topology transformation. Here, the 1 denotes that one transformation is applied.
(7) The S2-UST structure is the S-UST structure with both the rerooting transformation and the buffer sizing and stem wire insertion transformation applied, i.e., the 2 denotes that two transformations are applied.
(8) The S3-UST structure is the S-UST structure with all three subtree transformations applied, i.e., various routing tree topologies, buffer tree topologies, buffer sizes, and stem wire lengths are explored. The 3 denotes that three transformations are applied.
(9) The RS2-UST structure is the S2-UST structure with the arrival time constraints respecified using an SG after the construction of each buffer stage.
(10) The RS3-UST structure is the S3-UST structure with arrival time constraints respecified using an SG after the construction of each buffer stage. Here, the constraints are respecified while considering the alternative buffer tree topology implementations.

¹ The latency of a clock tree is the maximum propagation delay from the clock source to any clock sink.

The experimental evaluation is performed by constructing the various tree structures on the thirteen circuits in Table II, which are available in an online repository [11]. These circuits have been obtained by synthesizing Open Cores [21] Verilog specifications using the Synopsys tool chain and a 32 nm technology library. The circuits with a ‘scaled’ prefix are carefully scaled to 32 nm technology using the ITRS roadmap. The top seven circuits have been used in earlier studies.

TABLE II
PROPERTIES OF SYNTHESIZED CIRCUITS.

B. Static vs. Dynamic constraints

In this section, we compare constructing clock trees based on static arrival time constraints and dynamic implied skew constraints. Clock tree construction based on equal and bounded arrival time constraints is not evaluated because useful skew is required on many of the circuits.

In Figure 8, the S-UST structures, P-UST structures, and D-UST structures are compared in terms of average cost and run-time. The detailed results for the D-UST structures and the S-UST structures are shown in Table III. The performance in terms of capacitive cost is presented in the column labeled “Cap cost” and the run-time is presented in the column labeled “Run-time”. All the constructed clock trees meet the specified transition time constraint $T_{tran}$.

Fig. 8. Comparison of guiding the tree construction using static arrival time constraints and dynamic implied skew constraints.

The S-UST structures are a dominating generalization of the P-UST structures, i.e., range constraints dominate the use of point constraints. Therefore, it is not surprising that the capacitive cost of the P-UST structures is 2.52X higher compared with the S-UST structures. The P-UST structures have 67% higher run-time than the S-UST structures. The run-time overhead stems from the fact that larger clock trees are constructed.

The D-UST structures have a higher degree of timing margin utilization than the S-UST structures. Therefore, it is expected that the D-UST structures would have smaller capacitive cost than the S-UST structures. However, it can be observed in Table III that the S-UST structures actually have 3% lower average capacitive cost than the D-UST structures. The lower capacitive cost may stem from the fact that the S-UST structures have
relatively aligned arrival time constraints (specified before the tree construction). In contrast, the arrival times in the D-UST structures may be significantly skewed, as skews are incrementally specified within dynamic implied skew constraints (during the tree construction). Skewing the clock signal typically increases the capacitive cost of the clock tree, as detour wires or delay buffers may be required to be inserted.

TABLE III
COMPARISONS OF VARIOUS TREE STRUCTURES.

The D-UST structures have 6.83X higher average run-time than the S-UST structures, which is expected because subtrees in the D-UST structures are merged in O(V log V + E) [6]. In the S-UST structures, each pair of subtrees is merged in constant time O(1). The run-time differences can clearly be observed on the larger circuits. On the largest circuit jpeg, the S-UST structure is constructed 35X faster than the D-UST structure.

Now that it is established that it is advantageous to perform tree construction based on bounded useful arrival time constraints, the proposed transformations for solution space exploration are evaluated.

C. Evaluation of solution space exploration

The S1-UST structures have 13% lower average capacitive cost when compared with the S-UST structures. The lower cost stems from the fact that the S1-UST structures allow subtrees to be rerooted using the routing tree topology transformation, which facilitates the exploration of various tree topologies. As a greater solution space is explored, clock trees with lower capacitive costs are obtained. On the other hand, the average run-time of the S1-UST structures is 5.93X higher.

In Figure 9, we illustrate a part of an S1-UST structure and part of an S-UST structure on the circuit ecg. It can be observed that the S1-UST structure has shorter wire length. Clearly, the topology exploration translates into wire length reductions.

Fig. 9. S-UST structure vs. S1-UST structure. Clock sinks, wires, and buffers are respectively shown with ‘x’, blue lines, and red triangles. (a) Part of S-UST structure. (b) Part of the proposed S1-UST structure.

The S2-UST structures have 3% lower average capacitive cost and 30% higher run-time when compared with the S1-UST structures. The differences stem from the fact that the S2-UST structures are constructed while exploring the use of buffers of various sizes and stem wires of various lengths in addition to various routing topologies.

Compared with the S2-UST structures, the S3-UST structures have 1% lower capacitive cost and 41% longer run-time. The lower capacitive cost stems from the fact that the buffer tree topology transformation is performed. It should be noted that the S3-UST structures typically give better results on the circuits with looser timing constraints and the S2-UST structures give better results on the circuits with tighter timing constraints, such as the circuits scaled s1585, aes, and jpeg. An explanation as to why the S2-UST structures achieve better results on the circuits with tight timing constraints is that the buffer tree topology exploration may allow most timing margins to be
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2017 12

in terms of timing yield and capacitive cost. The comparison


is performed using the Monte Carlo framework proposed
in [14], which is an extension of the ISPD 2010 clock contest
formulation in [22]. Note that we also report TNS after the
CTS phase and the CTO phase. However, the TNS in this
paper is only a metric that is used to guide the CTO process
to improve the yield, which is the primary evaluation metric
used to evaluate timing quality. In Section IX-E1, the Monte
Fig. 10. Comparison between specifying bounded useful arrival time con-
straints using the proposed LP formulation and lexicographically as in [4], Carlo framework is introduced. In Section IX-E2, we evaluate
[7], i.e., S-UST vs. LS-UST. the performance in terms of timing yield and the capacitive
cost.
1) Monte Carlo evaluation framework: Each clock tree is
utilized in the bottom-part of the S3-UST structures, which evaluated in terms of timing yield and capacitive cost. The
may overly constrain the construction of the top-parts. Ideally, timing yield of a clock tree is determined by simulating
the tree construction algorithm should dynamically determine the clock tree with 500 Monte Carlo simulations. In each
the degree of timing margin utilization and solution space simulation, the clock tree is subject to wire width variations
exploration, i.e., a hybrid of the S2-UST structures and the (±5.0%), supply voltage variations (±7.5%), temperature
S3-UST structures. variations (±15.0%), and channel length variations (±5.0%)
around the nominal values [14]. The variations are generated
D. Evaluation of specifying and respecifying arrival time with spatial correlation using a 5-level quad tree [23], with
constraints the variations evenly distributed among the levels of the quad
In Figure 10, we compare specifying the range constraints tree.
using the proposed LP formulation in Eq (11)-(13) with max- Each simulation represents the testing of a chip, if all the
imizing the length of the range constraints lexicographically. skew and transition times are satisfied, the chip is classified
Compared with the S-UST structures, the LS-UST structures as good. The transition time constraint is equal to 100 ps.
have 44% higher average cost and 10% higher run-time. The If any timing constraint is violated, the chip is classified as
higher capacitive cost stems from that the range constraints are defective. (In this paper, none of the structures suffer yield loss
specified to only maximize the lengths, instead of considering from violations of the transition time constraint.) The timing
both length and alignment. The slight run-time overhead stems yield is defined to be the number of good chips divided by
from that larger clock trees are constructed.
The technique of respecifying the arrival time constraints based on the introduced SG is evaluated by comparing the RS2-UST structures with the S2-UST structures. By respecifying the arrival time constraints, the average capacitive cost is reduced by 4% and the run-time is increased by 10%. The savings in capacitive cost stem from the additional timing margins that are exposed by respecifying the constraints. The increase in run-time may be a result of the exposed timing margins allowing further solution space exploration to be performed.
Compared with the S3-UST structures, the RS3-UST structures have 2% lower capacitive cost and 14% shorter run-time. Here, arrival time constraints are respecified based on the SG while considering the multiple alternative implementations of each subtree. Consequently, smaller timing margins are exposed. Therefore, it is not too surprising that the cost reductions are smaller. Interestingly, it can be observed that the RS2-UST structures have smaller cost compared with the RS3-UST structures. This highlights that there is a complex trade-off between the amount of exposed timing margins, timing margin utilization, and solution space exploration. Nevertheless, the average cost and run-time of the two structures are very similar.

E. Evaluation of robustness to OCV
In this section, the RS2-UST structures constructed by our framework are compared with the clock trees in [14], [15],

the number of tested chips.
2) Evaluation of timing yield and cost: In Table IV, we compare the RS2-UST structures constructed in this work with the D-UST structures constructed in [14], which reported results on six of the thirteen benchmark circuits. We also compare against the D-UST structures and LD-UST structures in [15], which reported results on three circuits. The normalized capacitive results (labeled “Norm.” in Table IV) are obtained with respect to the capacitive cost of the RS2-UST structures after CTO. The yield (labeled “Yield”) is obtained from the Monte Carlo framework and the TNS (labeled “TNS”) values are obtained from Eq. (1) and Eq. (2).
First, we compare the results after CTS (and before CTO). In terms of capacitive cost, the D-UST structures in [14], the D-UST structures in [15], and the LD-UST structures in [15] have 44%, 27%, and 32% higher capacitive cost than the RS2-UST structures, respectively. These results are similar to the results reported for the D-UST structures in Table III. Even though no form of latency optimization is performed, the latencies of the RS2-UST structures are 30% lower compared with the LD-UST structures. We believe that this stems from the construction of smaller clock trees. The RS2-UST structures have slightly worse (or equal) results in TNS and timing yield after CTS (and before CTO). The explanation for the worse timing performance is that the solution space exploration allows the timing margins in the bounded useful arrival time constraints to be used to reduce capacitive cost. Timing margins that are not utilized to reduce cost can otherwise function as extra guardbands to variations.

TABLE IV
EVALUATION OF CLOCK TREES IN TNS, TIMING YIELD, AND CAPACITIVE COST. A ‘-’ IN THE CTO RUN-TIME COLUMN MEANS THAT CTO IS NOT REQUIRED.

After CTO, we observe that the capacitive cost of the RS2-UST structures has only increased by 1% on average (by comparing RS2-UST structures obtained after CTS and after CTO). Therefore, it can be understood that after CTO the RS2-UST structures have 52%, 28%, and 33% lower capacitive costs compared with the D-UST structures in [14], the D-UST structures in [15], and the LD-UST structures in [15], respectively. CTO algorithms are in general more effective at removing timing violations from clock trees with low latencies because the introduced delay variations δi and δj are smaller in many timing constraints. Consequently, it is not surprising that the CTO algorithms are capable of drastically reducing TNS and improving the yield of the RS2-UST structures. The RS2-UST structures achieve a 100% yield on all circuits except aes and jpeg, where yields of 99.4% and 99.8% are obtained, respectively. As mentioned earlier, this improvement is achieved with a 1% overhead in capacitive cost. Compared with the D-UST structures in [14], the D-UST structures in [15], and the LD-UST structures in [15], the RS2-UST structures obtain better or equal TNS and timing yield on all considered circuits after CTO. Note that on aes the yield is actually better before CTO compared with after, which emphasizes the need for better CTO algorithms.

Clearly, the RS2-UST structures demonstrate better quality in terms of run-time, capacitive cost, TNS, and timing yield when compared with the results in earlier studies, which demonstrates the effectiveness of the proposed framework.

TABLE V
EVALUATION OF TIMING MODEL SELECTION.

F. Evaluation of timing model selection
In this section, we evaluate guiding the CTS phase using a fast timing model (the Elmore delay model combined with LUTs) and an accurate timing model (the optional NGSPICE simulations). The CTO phase is always required to be guided by NGSPICE simulations because the timing performance after CTO is evaluated using NGSPICE simulations. In Table V, the performance of the RS2-UST structure is evaluated after CTS and after CTO. The normalized capacitive cost, TNS, and run-time are reported in the columns labeled “Norm. Cap”, “Norm. TNS”, and “Norm. Run-time”, respectively. In Figure 11, we compare source to sink delays in a clock tree computed using the two different timing models. The correlation and delay differences between the two timing models are shown in Figure 11(a) and Figure 11(b), respectively.
In Table V, it can be observed that the CTS phase is performed 3.39X faster when guided by the fast timing model instead of the accurate timing model. However, as expected, the normalized capacitive cost is slightly higher and the normalized TNS is substantially worse after CTS. The higher capacitive cost is obtained because the fast timing

model overestimates the transition time of each subtree, which means that more buffers are required to be inserted. There are 5.7X more timing violations after CTS when the fast timing model is used to guide the tree construction. The worse TNS results from a mismatch between the timing model used in the synthesis and the one used in the evaluation. Although there is a strong correlation between the two timing models, the fast timing model may overestimate the delay by up to 11 ps, which is shown in Figure 11(b). An error of 11 ps is significant as the inserted safety margins are in the order of 0 to 50 ps.

Fig. 11. (a) Correlation between NGSPICE simulations and the Elmore delay model combined with LUTs. (b) Difference between the “Elmore + LUTs model” and the “NGSPICE model” for each data point in (a).

After CTO, it can be observed that the difference in run-time is reduced to 2.85X (from 3.39X after CTS), which is expected since the CTO phase is required to eliminate more timing violations. Consequently, it is also not surprising that the difference in capacitive cost has increased to 2%. On the other hand, most of the timing violations are removed and there is only a 20% difference in terms of TNS after CTO.
The results demonstrate that the proposed SG allows the integration of two (arbitrary) timing models, which translates into a trade-off between run-time, capacitive cost, and timing quality. In this paper, the 2% overhead in capacitive cost and the 20% degradation in TNS are considered acceptable for a 2.85X speed-up in overall run-time.

X. SUMMARY AND FUTURE WORK

In this paper, it is demonstrated that static bounded useful arrival time constraints can be used to construct clock trees meeting irregular skew constraints. Moreover, the quick evaluation of the constraints allows the tree construction to explore various tree topologies, buffer sizes, and buffer locations. An SG is introduced to specify and respecify the arrival time constraints to expose additional timing margins. The SG also facilitates the integration of various timing models. In the future, we plan to extend the framework to insert tailored safety margins in the timing constraints, to incorporate latency minimization techniques, and to handle multi-corner timing constraints.

REFERENCES
[1] R.-S. Tsay, “Exact zero skew,” ICCAD’91, pp. 336–339, 1991.
[2] J. P. Fishburn, “Clock skew optimization,” IEEE Transactions on Computers, vol. 39, no. 7, pp. 945–951, 1990.
[3] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao, “Bounded-skew clock and Steiner routing,” ACM Transactions on Design Automation of Electronic Systems, vol. 3, pp. 341–388, July 1998.
[4] C. Albrecht, B. Korte, J. Schietke, and J. Vygen, “Maximum mean weight cycle in a digraph and minimizing cycle time of a logic chip,” Discrete Applied Mathematics, vol. 123, no. 1-3, pp. 103–127, 2002.
[5] C.-W. A. Tsao and C.-K. Koh, “UST/DME: a clock tree router for general skew constraints,” ACM Transactions on Design Automation of Electronic Systems, vol. 7, no. 3, pp. 359–379, 2002.
[6] R. Ewetz, S. Janarthanan, and C.-K. Koh, “Fast clock skew scheduling based on sparse-graph algorithms,” ASP-DAC’14, pp. 472–477, 2014.
[7] S. Held, B. Korte, J. Massberg, M. Ringe, and J. Vygen, “Clock scheduling and clocktree construction for high performance ASICs,” ICCAD’03, pp. 232–239, 2003.
[8] A. Vittal and M. Marek-Sadowska, “Power optimal buffered clock tree design,” DAC’95, pp. 497–502, 1995.
[9] Y. P. Chen and D. F. Wong, “An algorithm for zero-skew clock tree routing with buffer insertion,” EDTC’96, pp. 230–237, 1996.
[10] M. Edahiro, “A clustering-based optimization algorithm in zero-skew routings,” DAC’93, pp. 612–616, 1993.
[11] R. Ewetz, S. Janarthanan, and C.-K. Koh, “Benchmark circuits for clock scheduling and synthesis,” https://purr.purdue.edu/publications/1759, 2015.
[12] Y.-C. Chang, C.-K. Wang, and H.-M. Chen, “On construction low power and robust clock tree via slew budgeting,” ISPD’12, pp. 129–136, 2012.
[13] NGSPICE, http://ngspice.sourceforge.net/, 2012.
[14] R. Ewetz and C.-K. Koh, “A useful skew tree framework for inserting large safety margins,” ISPD’15, pp. 85–92, 2015.
[15] R. Ewetz, C. Tan, and C.-K. Koh, “Construction of latency-bounded clock trees,” ISPD’16, pp. 81–88, 2016.
[16] H. Bakoglu, “Circuits, interconnects, and packaging for VLSI,” Reading, MA: Addison-Wesley, 1990.
[17] R. Ewetz and C.-K. Koh, “Fast clock scheduling and an application to clock tree synthesis,” Integration, the VLSI Journal, vol. 56, pp. 115–127, 2017.
[18] V. Ramachandran, “Construction of minimal functional skew clock trees,” ISPD’12, pp. 119–120, 2012.
[19] S. Roy, P. M. Mattheakis, L. Masse-Navette, and D. Z. Pan, “Clock tree resynthesis for multi-corner multi-mode timing closure,” ISPD’14, pp. 69–76, 2014.
[20] R. Ewetz and C.-K. Koh, “MCMM clock tree optimization based on slack redistribution using a reduced slack graph,” ASP-DAC’16, pp. 366–371, 2016.
[21] OpenCores, http://opencores.net/, 2014.
[22] C. N. Sze, “ISPD 2010 high performance clock network synthesis contest: Benchmark suite and results,” ISPD’10, pp. 143–143, 2010.
[23] A. Agarwal, D. Blaauw, and V. Zolotov, “Statistical timing analysis for intra-die process variations with spatial correlations,” ICCAD’03, pp. 900–907, 2003.

Rickard Ewetz received the M.S. degree in Applied Physics and Electrical Engineering from Linköpings Universitet in 2011. He received the Ph.D. degree in Electrical and Computer Engineering from Purdue University in 2016. Currently, he is an assistant professor in the Electrical and Computer Engineering Department at the University of Central Florida. His research interests include physical design and computer-aided design for emerging technologies.

Cheng-Kok Koh received the B.S. degree with first class honors and the M.S. degree, both in computer science, from the National University of Singapore in 1992 and 1996, respectively. He received the Ph.D. degree in computer science from University of California at Los Angeles in 1998. Currently, he is a Professor of Electrical and Computer Engineering at Purdue University, West Lafayette, Indiana. His research interests include physical design of VLSI circuits and modeling and analysis of large-scale systems. He is a recipient of the ACM Special Interest Group on Design Automation (SIGDA) Meritorious Service Award and Distinguished Service Award, the National Science Foundation CAREER Award, and the Semiconductor Research Corporation Inventor Recognition Award.
