
Generalised H-Tree

This article proposes a method to determine optimal clock power, skew, and latency for generalized H-tree clock distribution topologies by co-optimizing topology and buffering. The method uses dynamic programming and models to explore tradeoffs between H-trees and spine-based 'fishbone' topologies. Validation shows up to 30% clock power reduction compared to commercial tools.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2018.2889756, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Optimal Generalized H-Tree Topology and Buffering for High-Performance and Low-Power Clock Distribution
Kwangsoo Han, Andrew B. Kahng and Jiajia Li

K. Han, A. B. Kahng and J. Li are with the University of California at San Diego, La Jolla, CA 92093. E-mail: {kwhan, abk, jil150}@ucsd.edu

0278-0070 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract: Clock power, skew and maximum latency are three key metrics for clock distribution in low-power and high-performance designs. An H-tree offers minimum clock skew and good robustness against variations, but at the cost of large wirelength and clock power. On the other hand, a "fishbone" clock network with spine-ribs structures has smaller wirelength, latency and clock power, but larger skew, as compared to an H-tree. No previous work enables systematic exploration of the regime between H-tree and spine to achieve an optimal tradeoff among clock power, skew and latency. In this work, we study the concept of a generalized H-tree (a topologically balanced tree with an arbitrary sequence of branching factors) and propose a dynamic programming-based method to determine optimal clock power, skew and latency, in the space of generalized H-tree solutions. Our method co-optimizes clock tree topology and buffering along branches according to fitted electrical models. We further propose a balanced K-means clustering and a linear programming-guided buffer placement approach to embed the generalized H-tree with respect to a given sink placement. We validate our solutions in commercial clock tree synthesis tool flows, in a commercial foundry's 28LP technology. The results show up to 30% clock power reduction while achieving similar skew and maximum latency as CTS solutions from recent versions of leading commercial place-and-route tools. Our proposed approach also achieves up to 56% clock power reduction while achieving similar skew and maximum latency as compared to CTS solutions from a state-of-the-art academic tool.

I. INTRODUCTION

Physical implementation of clock distribution networks is increasingly critical to the success of high-performance, low-power IC product designs. Clock distribution takes up substantial routing and buffering resources as well as a significant portion of overall power consumption [25]. Power dissipation in the clock network has often been estimated to be one third of total IC power dissipation [21], or even half the total power in some designs. Further, the quality (skew and latency) of clock delivery strongly determines achievable performance of the design, particularly in advanced nodes. Skew is well known to affect datapath area and power, as well as the design schedule needed to achieve timing closure [10] [26] [29] [34] [36] [41]. Maximum clock latency is another key metric of the clock distribution network in advanced nodes since skews are magnified by on-chip variation (OCV) deratings [5].

As reviewed below, there are several methods of clock distribution. Tree-based constructions are still dominant, and remain the default of commercial clock tree synthesis (CTS) tools. To reduce skew and increase robustness (e.g., in light of manufacturing variability or reliability mechanisms), mesh and other non-tree topologies (e.g., trees + cross-link insertion) have been used. Such non-tree methods typically have large overheads in terms of power, area, wirelength and signoff analysis complexity. Thus, clock trees are still of interest and great practical relevance for reasons of cost efficiency, flexibility and design flow complexity. Increasingly, structured approaches to clock tree design are seen in practice, since these offer benefits of predictability in the resource-versus-skew tradeoff, particularly in the upper levels of clock trees.

Fig. 1: 8-level GH-tree with branching pattern (4, 2, 2, 2, 4, 2, 2, 2).

As a special case of structured clock trees, the highly regular, recursive H-tree embedding of a complete binary tree [4] offers minimum skew, but at the cost of larger wirelength and potentially larger clock power and latency. "Fishbone" clock tree topologies (e.g., [2]) with spines and ribs can be more cost-efficient in terms of latency, wirelength, area and clock power, but incur varying propagation delays (that is, of the clock signal to branching points along a given spine) which cause skew. To explore the tradeoff among skew, latency and (clock power, clock buffer area) cost, this work proposes the concept of a generalized H-tree (GH-tree), which is a balanced tree topology with arbitrary branching factor at each level. (Like the H-tree, the GH-tree is a multi-level topology; like the fishbone, it can have branching factor greater than two.) Figure 1 shows a GH-tree with depth P = 8 and branching factors (4, 2, 2, 2, 4, 2, 2, 2) at levels p = 1, 2, ..., 8. In the example, we assume there are 1024 (= 4 · 2 · 2 · 2 · 4 · 2 · 2 · 2) nodes (sinks) uniformly placed in the region. If the root of the tree is at the region center, each root-to-leaf path will

contain horizontal segments of lengths 3·W/4, W/8, 3·W/32 and W/64, in that order. The lengths of the successive vertical segments in each root-leaf path are H/2, H/4, H/8 and H/16. We note that in our proposed GH-tree, we do not necessarily insert a buffer at each branching point. Further, we allow buffering that is internal to a given branch, that is, at any location along wiring of that given branch.

In this work, we study potential benefits of the generalized H-tree for low-power, low-skew, and low-latency clock distribution. We propose a dynamic programming (DP) algorithm that efficiently finds an optimal¹ GH-tree with minimum clock power for given latency and skew targets. This optimization uses calibrated clock buffer library and interconnect timing and power models, and co-optimizes the clock tree topology along with the buffering along branches. Furthermore, we also propose a clustering and linear programming-based heuristic to embed the GH-tree with respect to a given placement of clock sinks. Finally, our embedding of the GH-tree is blockage-aware. In a 28LP testbed with multi-corner timing constraints, our embedded GH-tree solutions provide significant clock power benefits (iso-skew and -latency) in comparison to commercial CTS solutions from the place-and-route tools of two leading EDA vendors as well as a state-of-the-art academic CTS tool. Our contributions are summarized as follows.
• We propose the concept of a generalized H-tree, which is a balanced tree topology that can have an arbitrary sequence of branching factors.
• We propose a DP-based method to co-optimize clock tree topology and buffering to achieve an optimal GH-tree solution with respect to the tradeoffs among skew, latency and clock power.
• We propose a balanced K-means clustering and linear programming-based buffer placement to embed our GH-tree solution with respect to any given sink placement.
• We validate our GH-tree optimization based on sink placements from a leading commercial place-and-route (P&R) tool, which include testcases with high floorplan aspect ratio and existence of blockages.
• Our methodology and optimizations can easily be integrated with commercial P&R tools. Our experimental results in a foundry 28LP technology with multi-corner testcases show up to 30% clock power reductions compared to current CTS tool solutions from two leading EDA vendors and up to 56% clock power reduction compared to a state-of-the-art academic solution [39].

¹ We note that our claim of an optimal clock tree solution is in the regime of generalized H-tree solutions (i.e., the continuum between H-tree and spine). We do not claim that our optimal GH-tree is a globally optimal clock tree solution.

II. BACKGROUND AND MOTIVATION

In this section, we first review previous works on clock distribution, categorizing them as: (i) non-tree methods, and (ii) tree-based methods. We then provide a motivating analysis of the GH-tree solution space.

Non-tree methods. Mesh topologies are commonly understood to provide robustness and small skew. Many works, such as [14] [15] [30] [31], propose clock mesh designs for high-performance circuits that require robust clock networks. To reduce the cost (e.g., wirelength, power), hybrid clock distribution methods integrating both mesh and tree topologies have been proposed [29] [36]. Several works [12] [17] [26] [32] propose to insert cross-links for minimization of clock skew, based on an initial clock tree solution, with small power overhead. Rajaram et al. [26] propose an algorithm to recursively merge subtrees with backward slew propagation. Ewetz and Koh [12] propose systematic cross-link insertion methods to improve the robustness of a clock tree while minimizing its overheads. They propose a vertex reduction method to reduce the amount of redundancy in their non-tree structures. Although non-tree methods reduce clock skew and enhance robustness of clock networks, their intrinsic redundancy incurs additional cost (e.g., wirelength, power, effort of verification) as compared to tree-based methods. Furthermore, non-tree clock distribution topologies such as meshes lack flexibility and tunability; this can block, e.g., useful skew optimizations.

Dolev et al. [11] propose a hexagonal grid-based clock topology (HEX), consisting of a hexagonal grid with intermediate nodes that control the clock signals in the grid and supply the clock signals to nearby functional units. Abdelhadi et al. [1] propose an algorithm to construct a variation-tolerant hybrid clock network based on a combination of non-uniform meshes and unbuffered trees. Their method selectively reduces clock skew variations on critical timing paths. Zhou et al. [42] propose an algorithm to determine tapping points for local buffers that drive a clock mesh with non-uniform load distribution in a tree-driven grid clock network. Their algorithm first calculates load for each node, then clusters the nodes. Tapping points are determined for each cluster based on the minimum and maximum latencies.

Recently, Y. Kim and T. Kim [39] have proposed a synthesis algorithm for clock spine networks that effectively optimizes the tradeoff between clock resource and variation tolerance. The key idea of their algorithm is to treat the clock spine allocation and placement problem as a slicing floorplan optimization problem. Clock tree solutions from the CTS algorithm of [39] are compared to our GH-tree solutions in Section IV.

Tree-based methods. Due to their cost efficiency, clock tree-based methods have been commonly used for clock distribution in low-power designs. Early works [9] [6] [19] [38] propose clock tree constructions based on linear or Elmore delay models to minimize wirelength for a given skew target. However, delay and power impacts of buffers are ignored in these works. Approaches in [8] [13] [20] [23] [27] [28] [40] [41] comprehend buffering impact and co-optimize clock tree construction (i.e., tree topology) with buffering. Vittal and Marek-Sadowska [41] give an early algorithm that co-optimizes tree topology and buffer insertion. Mehta et al. [23] propose a clustering algorithm to obtain approximately load-balanced clusters and construct clock trees so as to minimize skew. These previous approaches typically construct the clock tree in a bottom-up way with a greedy algorithm,


and do not explore the skew vs. cost tradeoff. Furthermore, few works adapt their tree construction approaches to, and validate their solution quality with, commercial P&R tools and realistic design blocks. Other works [5] [16] are ECO-based incremental optimizations based on an initial clock tree solution generated by commercial P&R tools. The objective functions of these works differ from ours. Chan et al. [5] minimize skew at the top-level, whereas Han et al. [16] minimize skew variation across corners.

A Motivating Analysis. To motivate our main studies below, we briefly illustrate the tradeoffs among clock tree wirelength, global skew and maximum clock latency seen across GH-tree topologies with various branching patterns. Before doing so, we summarize in Table I the terminology and notation used in the remaining discussion.

Given layout region area W × H, we analyze the wirelength of a GH-tree with branching pattern ($b_1, b_2, b_3, \ldots, b_P$). We assume that (i) at any level, the region area is uniformly split into sink regions (i.e., regions that contain downstream sinks) according to the branching factor; (ii) the root of a sink region is located at the center of the sink region; (iii) branching factor $b_p$ at any level p is always an even number; and (iv) the GH-tree always starts with a horizontal segment at the top level. Based on these assumptions, the wirelength of a horizontal (vertical) wire segment $w_p$ ($h_p$) at level p is calculated as²

$$ w_p = \frac{b_p - 1}{\prod_{i=1}^{(p+1)/2} b_{2i-1}} \cdot W, \qquad h_p = \frac{b_p - 1}{\prod_{i=1}^{p/2} b_{2i}} \cdot H \tag{1} $$

The total wirelength is calculated as

$$ L = \sum_{k=1}^{\lceil P/2 \rceil} \frac{b_{2k-1} - 1}{b_{2k-1}} \Big[ \prod_{i=1}^{k-1} b_{2i} \Big] \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \frac{b_{2k} - 1}{b_{2k}} \Big[ \prod_{i=1}^{k} b_{2i-1} \Big] \cdot H \tag{2} $$

Assuming a linear wire delay model and ignoring buffering, we derive the maximum and minimum (linear-delay) clock latency from clock source to any sink as

$$ t_{max} = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{b_{2k-1} - 1}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{b_{2k} - 1}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{3} $$

$$ t_{min} = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{1}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{1}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{4} $$

The maximum global skew ω is then defined as the difference between $t_{max}$ and $t_{min}$.

$$ \omega = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{b_{2k-1} - 2}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{b_{2k} - 2}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{5} $$

Figures 2(a) and 2(b) respectively show skew-wirelength and latency-wirelength tradeoffs in GH-trees with various branching patterns. We calculate skew and latency based on a linear delay model³ according to Equations (3) and (5). The figures also show the Pareto frontier of non-dominated points of each tradeoff. Figures 2(c) and 2(d) respectively show skew-power and latency-power tradeoffs of buffered GH-trees with various branching patterns, as reported by a commercial static timing analysis tool with foundry 28LP technology and library models. The Pareto frontier of each tradeoff is shown as the black curve. We observe from the figures that different branching patterns lead to wide-ranging skew-power and/or latency-power tradeoffs. In this work, we explore branching patterns via dynamic programming to optimize tradeoffs among skew, latency and power. Figure 2(c) shows that with the same branching pattern (e.g., (10, 4, 2, 12)), different buffering solutions can lead to more than 20% skew difference with similar clock power. We therefore perform co-optimization of tree topology (i.e., branching pattern) and buffering to minimize clock power, skew and latency.

Fig. 2: Study of motivating tradeoffs. (a) Linear delay skew vs. wirelength; and (b) linear delay clock latency vs. wirelength with different branching patterns (256 sinks and region area = 100µm × 100µm). (c) Skew vs. clock power; and (d) maximum latency vs. clock power, in buffered GH-trees for a testcase with 17K sinks and region area = 380µm × 380µm. In red are dominated points; in black are Pareto frontiers and non-dominated points.

III. OUR APPROACH

We now describe our problem formulation and our approach. Based on the motivating examples shown in Figure 2, we construct GH-trees to explore the tradeoff among skew, latency and clock power. Our construction comprehends the delay and power impact of buffer insertion, sink placement and multiple constraints (e.g., maximum transition time and

² Note that (p + 1)/2 in Equation (1) is always an integer since we create horizontal segments ($w_p$) and vertical segments ($h_p$) in alternation, and always start with a horizontal segment from the top. Thus, all horizontal segments are created when p is an odd number.

³ The linear wire delay model approximates clock latency with relatively low accuracy. However, we give this motivating analysis using the simple linear delay model to more intuitively illustrate the tradeoff among clock power, skew and latency with different GH-tree topologies. Below, we apply comprehensive buffer and wire delay modeling in our optimization to demonstrate the benefits of our proposed approach.
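As a rough numerical companion to Equations (1)-(5), the sketch below evaluates wirelength, maximum and minimum linear-delay latency, and skew for a given branching pattern, and extracts non-dominated points in the sense of the Pareto frontiers of Figure 2. The function names `gh_tree_metrics` and `pareto_front` are illustrative assumptions rather than part of the original work, and delay is taken to equal wire length (linear model, no buffering).

```python
def gh_tree_metrics(pattern, W, H):
    """Evaluate Equations (1)-(5) for a branching pattern (b_1, ..., b_P).

    Assumption: odd levels are horizontal, even levels are vertical, and
    linear delay is measured directly in units of wire length.
    """
    horiz_prod = 1  # running product of odd-level branching factors
    vert_prod = 1   # running product of even-level branching factors
    wirelength = t_max = t_min = 0.0
    for level, b in enumerate(pattern, start=1):
        if b < 2 or b % 2:
            raise ValueError("branching factors must be even (assumption iii)")
        copies = horiz_prod * vert_prod  # number of segments at this level
        if level % 2 == 1:
            horiz_prod *= b
            span, divisor = W, horiz_prod
        else:
            vert_prod *= b
            span, divisor = H, vert_prod
        segment = (b - 1) / divisor * span  # Eq. (1)
        wirelength += copies * segment      # Eq. (2)
        t_max += 0.5 * segment              # Eq. (3)
        t_min += 0.5 * span / divisor       # Eq. (4)
    return {"wirelength": wirelength, "t_max": t_max,
            "t_min": t_min, "skew": t_max - t_min}  # skew is Eq. (5)


def pareto_front(points):
    """Return the (x, y) points not dominated by any other point (smaller is better)."""
    front = []
    for x, y in sorted(set(points)):
        if not front or y < front[-1][1]:
            front.append((x, y))
    return front


if __name__ == "__main__":
    # 256 sinks in a 100um x 100um region, as in Figures 2(a) and 2(b).
    for pattern in [(2,) * 8, (4, 4, 4, 4), (16, 16)]:
        print(pattern, gh_tree_metrics(pattern, 100.0, 100.0))
```

Under this model a pure H-tree pattern such as (2, 2, ..., 2) has zero skew, while patterns with larger branching factors trade skew for shorter wirelength, which is the tension the motivating analysis describes.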


maximum load capacitance). More specifically, we address the following GH-tree construction problem:

Given: a placement solution (i.e., a layout region W × H and placement of sinks), number of sink regions N such that each region contains <40 sinks (see Footnote 4 below), timing library of clock buffers (.lib), maximum clock skew constraint Ω, maximum clock latency constraint T, maximum transition time constraint Γ (i.e., at both sinks and clock buffer input pins), and maximum load capacitance constraint C.

Perform: GH-tree construction with co-optimization of the clock tree topology (i.e., branching patterns) and buffering to minimize clock power, subject to the given constraints.

TABLE I: Description of notations used in our work.

| Term | Meaning |
|---|---|
| r (R) | clock tree (set of clock trees) |
| ω (Ω) | global clock skew (maximum global skew constraint) |
| t (T) | clock latency (maximum clock latency constraint) |
| γ (Γ) | clock slew (maximum clock slew constraint) |
| C | maximum load capacitance constraint |
| L | clock tree wirelength |
| O | set of placement blockages |
| w (W) | width (width of given layout region) |
| h (H) | height (height of given layout region) |
| n (N) | number of sink regions (required number of sink regions) |
| p (P) | level index [p = 1 for clock source] (depth of clock tree) |
| $b_p$ | branching factor at level p in clock tree |
| B | branching patterns (i.e., sequences of branching factors {$b_1, b_2, \ldots, b_P$}) |
| $u_k$ | kth sink cluster ($u_k \in U$) |
| $s_i$ | ith sink ($s_i \in S$) |
| $d_{k,i}$ | Manhattan distance between sink $s_i$ and root of cluster $u_k$ |
| $\eta_{k,i}$ | binary indicator of whether sink $s_i$ belongs to cluster $u_k$ |
| $\psi^{llx,lly,urx,ury}_{j,q}$ | binary indicator whether jth buffer is located outside the corresponding boundaries of qth blockage |
| $c_i$ | clock pin capacitance of sink $s_i$ |
| $x_i$ ($y_i$) | x-coordinate (y-coordinate) of sink $s_i$ |
| (($x^{ll}_k, y^{ll}_k$), ($x^{ur}_k, y^{ur}_k$)) | bounding box of sink cluster $u_k$ |

Figure 3 shows our overall GH-tree construction flow. An instance consists of a post-placement layout, the number of sink regions, and constraints, along with pre-characterized technology- and library-specific lookup tables (LUTs) containing power and delay information of candidate buffering solutions (i.e., segment length and buffer sizes). We perform the GH-tree construction primarily through two main steps: (i) according to the total sink capacitance and layout region area and aspect ratio, we first formulate a dynamic programming problem to co-optimize branching pattern and buffering; and (ii) we then perform balanced K-means clustering and formulate an integer linear programming problem to determine clock buffer placement (i.e., to embed our generalized GH-tree structure into the given sink placement). We note that the key step is the first step (i.e., DP-based co-optimization of clock topology and buffering) that systematically explores the continuum between H-tree and spine to achieve an optimal tradeoff among clock power, skew and latency (within this regime). Last, we realize our GH-tree solution in a commercial P&R tool and report metrics (e.g., skew, latency, power, etc.) to assess solution quality.

Fig. 3: Overall flow of GH-tree construction.

Although our approach systematically optimizes the tradeoff among clock power, latency and skew in the GH-tree regime, our method has two limitations that we highlight. First, we do not co-optimize tree topology and buffering together with sink placement, due to large runtime complexity. Second, our approach does not consider clock gating cells in a clock tree. Therefore, an improved resolution of the "chicken-and-egg" loop between (i) placement of roots of sink regions and (ii) top-level tree topology optimization, as well as consideration of clock gating cells during the optimization, remain as open research directions.

A. LUT Characterization

We characterize LUTs based on simulations using Synopsys PrimeTime [49] as inputs for our DP-based optimization. These LUTs contain power, input capacitance, slew propagation, and delay information of buffered and unbuffered wire segments. We use four types of buffers (X50, X67, X100, X134 from the 28LP libraries). In this technology, the gate area of a X134 buffer is ∼7× the gate area of a minimum-size (X2) buffer. We also use "ganged buffers" (i.e., two, four or six X134 buffers with shorted inputs and shorted outputs) to achieve higher driving strengths. We create wire segments of lengths 15µm, 30µm, 45µm, 60µm, 75µm,


and 90µm.⁴ Along these wire segments, we enumerate all possible buffering solutions with the minimum granularity of 15µm (i.e., the minimum distance between two buffers is 15µm). The granularity of LUTs (e.g., number of buffer candidates, minimum wire segment length) will determine the tradeoff between optimization runtime and solution quality. In this work, we empirically select the granularity of our LUTs to achieve improved solution quality compared to two commercial tools, while using comparable runtime. For example, with a 45µm segment length, a minimum buffering distance of 15µm and seven buffer sizes, there are $8^3 = 512$ (i.e., no buffer or exactly one of the seven buffer sizes, at each of the three buffering locations) distinct buffering solutions. Note that our LUTs include unbuffered solutions, i.e., pure-wire solutions. We consider both 1W1S (single-width, single-spacing) and 2W2S (double-width, double-spacing, which we understand to be the common non-default routing rule (NDR) for clock distribution) wire segments in our characterization. We vary the input slew from 5ps to 60ps in steps of 5ps and we vary output load from 1fF to 150fF in steps of 1fF from 1fF to 5fF, and in steps of 5fF from 5fF to 150fF. For each 3-tuple of (distance, input slew, output load) we obtain the buffering solution (including input capacitance) and the output slew. From the large number of possible solutions, we prune solutions as follows. For each (distance, input capacitance, output slew, output load) 4-tuple, we select three delay values at the 10th, 50th and 90th percentiles of the delay range, and then select the minimum-power solution for each of these three delay values. Figure 4 shows an example of our pruning on LUTs, in which we select minimum-power solutions with different output load values. Red (resp. blue) dots are the buffering solutions with output load = 75fF (resp. 35fF). Cross (x) points are the selected buffering solutions.

Fig. 4: Example of pruning for buffering solutions; distance = 45µm, output slew = 35ps.

In practice, we have found that this pruning reduces the number of buffering solutions by ∼94% at the cost of only ∼5% solution quality (i.e., in terms of skew or latency) loss.

B. DP-Based Co-optimization of Clock Topology and Buffering

Based on the characterized LUTs, we determine the optimal branching pattern along with the buffering solution for GH-tree using dynamic programming (DP). Other inputs to our optimization are layout region, placement of sinks, the number of sink regions, and the maximum skew and maximum latency constraints. A sink region typically contains <40 sinks in our optimization. Our objective is to minimize the clock power while satisfying the given maximum skew and maximum latency constraints. As discussed, in this step, we assume that the sink regions are induced from a uniform placement of sinks, and that branching points are always located at the center of the corresponding sink region. We understand that the sink regions are typically not uniform for a real placement solution. However, due to high runtime complexity, it is practically infeasible for our current approach to consider sink placement during our DP-based optimization. We therefore assume uniform sink regions during our DP-based GH-tree construction. We then embed our DP-based solution (without solution quality degradation) into the given (real) sink placement based on sink clustering and branching point displacement. We formulate our DP in a high-dimensional solution space with seven dimensions (i.e., with respect to seven essential parameters of a clock tree optimization) and construct our GH-tree in a bottom-up way. The seven dimensions are {clock tree depth (P), region area (width (w) and height (h)), number of sink regions (n), maximum and minimum clock latencies ($t_{max}$ and $t_{min}$), and input capacitance (i.e., the load capacitance seen from the root)}.

Algorithm 1 describes our optimization procedure; see also the illustration in Figure 5. We first construct GH-trees for the base case, that is, trees with depth = 1 over all different region areas (i.e., w × h) and numbers of sink regions (i.e., n) (Line 1). As an example, Figure 5 shows the solutions at level p (i.e., subtree with depth = 1) with different region areas (i.e., 5 × 5 (red), 10 × 10 (green) and 15 × 15 (purple)). Procedure build_base_trees(w, h, n, ∅) constructs GH-trees with depths of one (i.e., spines with different buffering solutions) within a w × h region and with a branching factor of $b_p$. Following [37], we use the term spine to denote one horizontal or vertical wire segment in the clock tree. Note that at the bottom level, $b_p = n$. As illustrated in Figure 5, we optimize the buffering solution along the spine based on characterized LUTs. Optimization of each tree segment along the spine generates a minimum-power Pareto surface in the high-dimensional space indexed by the LUT input parameters (e.g., maximum and minimum latencies, and input capacitance). The optimization eventually results in multiple subtree solutions. We store these subtree solutions in a set R indexed by tree depths, region area, number of sink regions, maximum and minimum clock latencies, and input capacitance. In other words, R is a set of subtree solutions along with their depth, region area, number of sink regions, clock latency, input capacitance and power information corresponding to the minimum-power Pareto surface.

Next, we recursively search for the optimal (i.e., minimum-

⁴ We use LUTs based on multiple short wire segments to estimate the delay and slew propagation of a long wire segment. Since we match the output load, input capacitance as well as output and input slew values of two consecutive short segments to form a long segment, the estimation error is negligible. The small estimation error comes from discreteness of capacitance and slew values.
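The size of the LUT search space and the percentile-based pruning described in Section III-A can be illustrated with a small sketch. Everything here is a simplified assumption for illustration: the labels for the ganged buffers, the function names, and in particular the reading of "minimum-power solution for each delay value" as "minimum power among solutions whose delay does not exceed the target" are not taken from the original characterization flow.

```python
from itertools import product

# Seven buffer choices: four library sizes plus ganged X134 buffers.
# The ganged labels are hypothetical names used only in this sketch.
BUFFER_SIZES = ["X50", "X67", "X100", "X134", "2xX134", "4xX134", "6xX134"]


def enumerate_bufferings(num_slots, buffer_sizes=BUFFER_SIZES):
    """All ways to leave each buffering location empty (None) or place one buffer."""
    return list(product([None, *buffer_sizes], repeat=num_slots))


def characterization_grid():
    """(input slew in ps, output load in fF) pairs swept during characterization."""
    slews = range(5, 65, 5)                                # 5ps .. 60ps in 5ps steps
    loads = list(range(1, 5)) + list(range(5, 155, 5))     # 1..5fF by 1, then 5..150fF by 5
    return [(s, c) for s in slews for c in loads]


def prune_by_delay_percentiles(solutions, quantiles=(0.10, 0.50, 0.90)):
    """Prune (delay, power) solutions that share one LUT index tuple.

    Delay targets sit at the given fractions of the delay range; for each
    target we keep the lowest-power solution that is no slower than the
    target (an assumed interpretation of the selection rule).
    """
    d_min = min(d for d, _ in solutions)
    d_max = max(d for d, _ in solutions)
    kept = set()
    for q in quantiles:
        target = d_min + q * (d_max - d_min)
        feasible = [s for s in solutions if s[0] <= target + 1e-12]
        kept.add(min(feasible, key=lambda s: (s[1], s[0])))
    return sorted(kept)


if __name__ == "__main__":
    # A 45um segment with 15um granularity has three buffering locations.
    print(len(enumerate_bufferings(3)))
    print(len(characterization_grid()))
```

Keeping at most three solutions per index tuple is what bounds the LUT size; the quality loss reported above is the price of discarding the intermediate delay-power points.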

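The bottom-up recurrence of Section III-B can be sketched in a heavily simplified form. The sketch below is not the authors' implementation: it ignores buffering, LUTs, slew and input capacitance, uses the linear delay model of Equations (3) and (4), and substitutes total wirelength for clock power as the cost. It keeps only the structural ideas: depth-1 spines as the base case, the subtree relation of Equation (6) (subtree width = h, subtree height = w/b_t, subtree sink regions = n/b_t), and one minimum-cost entry per (t_max, t_min) pair. The names `solve` and `best_gh_tree` are hypothetical, and the final selection uses exactly N sink regions rather than at least N.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def solve(depth, w, h, n):
    """Map (t_max, t_min) -> (cost, pattern) for depth-`depth` trees over w x h.

    Cost is wirelength (a stand-in for power); callers must not mutate the result.
    """
    table = {}
    if depth == 1:
        # Base case: a single spine along the width with branching factor n.
        if n >= 2 and n % 2 == 0:
            key = ((n - 1) * w / (2 * n), w / (2 * n))
            table[key] = ((n - 1) * w / n, (n,))
        return table
    for bt in range(2, n // 2 + 1, 2):        # even top-level factor, 2 <= bt <= n/2
        if n % bt:
            continue
        # Equation (6): identical subtrees over a rotated (h) x (w / bt) region.
        subtrees = solve(depth - 1, h, w / bt, n // bt)
        for (sub_max, sub_min), (sub_cost, sub_pattern) in subtrees.items():
            key = ((bt - 1) * w / (2 * bt) + sub_max, w / (2 * bt) + sub_min)
            cost = (bt - 1) * w / bt + bt * sub_cost
            if key not in table or cost < table[key][0]:
                table[key] = (cost, (bt,) + sub_pattern)
    return table


def best_gh_tree(W, H, N, max_depth, skew_limit, latency_limit):
    """Lowest-cost (cost, pattern, t_max, t_min) meeting both limits, or None."""
    W, H = Fraction(W), Fraction(H)
    best = None
    for depth in range(1, max_depth + 1):
        for (t_max, t_min), (cost, pattern) in solve(depth, W, H, N).items():
            feasible = t_max - t_min <= skew_limit and t_max <= latency_limit
            if feasible and (best is None or cost < best[0]):
                best = (cost, pattern, t_max, t_min)
    return best


if __name__ == "__main__":
    # Tight skew favors an H-tree-like pattern; a loose limit admits a single spine.
    print(best_gh_tree(1, 1, 4, max_depth=2, skew_limit=0, latency_limit=1))
    print(best_gh_tree(1, 1, 4, max_depth=2, skew_limit=1, latency_limit=1))
```

Exact rational arithmetic (`Fraction`) is used so that solutions with equal latencies collapse onto the same table key, mirroring the "same maximum and minimum latency" selection rule; with real LUT data a discretized key would be needed instead.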

Fig. 5: Co-optimization of GH-tree topology and buffering. The example illustrates construction of trees with depth = 2 and eight sink regions, based on subtrees with depth = 1 and two sink regions.

Algorithm 1 DP-based GH-tree construction.
 1: R ← build_base_trees(w, h, n, ∅), ∀ w, h, n s.t. 0 < w ≤ W, 0 < h ≤ H, 2 ≤ n ≤ N, n is an even number
 2: for P := 2 to P_max do
 3:   for w := 0 to W do
 4:     for h := 0 to H do
 5:       for n := 2 to (N · w · h)/(W · H) do
 6:         R ← retrieve_subtrees(P_l, w_l, h_l, n_l)
 7:         for all (r_l) ∈ R do
 8:           r ← build_tree(w, h, n, r_l)
 9:           r′ ← retrieve_tree(R, P, w, h, n, r.t_max, r.t_min)
10:           if r′ = null then
11:             R ← R ∪ {r}
12:           else if r.power < r′.power then
13:             remove r′ from R
14:             R ← R ∪ {r}
15:           end if
16:         end for
17:       end for
18:     end for
19:   end for
20: end for
21: r_opt.power ← ∞
22: for all r′ ∈ R s.t. r′.w = W, r′.h = H, r′.n ≥ N do
23:   if r′.t_max − r′.t_min ≤ Ω && r′.t_max ≤ T && r′.power < r_opt.power then
24:     r_opt ← r′
25:   end if
26: end for
27: return r_opt

solutions based on the set of all stored subtrees ($r_l$) (from R) that satisfy

$$ P_l = P - 1; \quad w_l = h; \quad h_l = w/b_t; \quad n_l = n/b_t \tag{6} $$

where $P_l$ is the depth of $r_l$; w × h is the dimension of the layout region; $w_l \times h_l$ is the dimension of a sink region (i.e., layout region for subtree $r_l$); $b_t$ is the branching factor at the topmost level of the current tree; and $n_l$ is the number of sink regions of $r_l$. Lines 3-4 and Line 5 respectively enumerate possible dimensions and numbers of sink regions for subtree solutions. In Line 6 of Algorithm 1, we find all subtree solutions, which we have optimized in previous iterations, with branching factor $b_t$ such that 2 ≤ $b_t$ ≤ n/2.

Procedure build_tree(w, h, n, $r_l$) then builds trees r with depth = P, using copies of the collected subtrees $r_l \in R$ (which have depth = (P − 1)) as its lower-level subtrees (Line 8). In other words, we build the tree segment with optimized buffering at the topmost level, and at each sink of the topmost segment, we use (i.e., instantiate) the same subtree $r_l$ to build lower levels of the tree r. To reduce the runtime complexity, our current approach assumes that at any level, the subtrees are identical. As shown in Section IV below, this does not preclude strong final solution quality. Among all the constructed trees with depth = P and the same maximum and minimum latency, we select the solution with minimum power and add it to the solution set R (Lines 7–16). Procedure retrieve_tree(R, P, w, h, n, r.$t_{max}$, r.$t_{min}$) in Line 9 retrieves a previously stored solution r′ from the set R that satisfies the conditions depth = P, width = w, height = h and number of sink regions = n, and has maximum and minimum latencies equal to specified $t_{max}$ and $t_{min}$, where t is clock latency. Finally, we select the solution with minimum power that satisfies the maximum skew constraint (Ω) and the maximum latency constraint (T) from set R with number of sinks N and region area W × H (Lines 21–27).

We note that the slews are propagated from top to bottom in
a tree. However, our optimization performs bottom-to-top GH-
tree construction. We propagate slew bottom-up to accurately
power) GH-tree solutions with depth P > 1, region area w×h, capture the slew degradation and avoid maximum transition
and number of sink regions n. We increase P by one per violations. We first assume several slew values (e.g., 25ps,
iteration during the optimization, until P = Pmax (Lines 2– 30ps, 35ps, 40ps) at the root of each sink region. For each
20). The maximum depth Pmax for a given N is estimated of the assumed slew values, we propagate slew bottom-up,
based on the conventional H-tree (which has branching factor based on LUTs (note that our LUTs contain output and input
of two for all levels), i.e., Pmax = dlog2 N e. For each depth slews for each buffering and/or wiring solution). During the
P , we perform buffering optimization along the tree segments DP-based optimization, we only select solutions which ensure
at the topmost level, based on the stored subtree solutions that slew values throughout the slew propagation are always
at depth (P − 1). In other words, we use existing solutions within the range of [5ps, 60ps], where 60ps is the maximum
(i.e., subtree solutions from R) and add one more (topmost) slew constraint in our experiments, and 5ps is the minimum
level with optimized buffering to construct a new tree with achievable slew in practice in our experiments. We also note
depth increased by one. Figure 5 illustrates how our optimizer that buffer locations are determined by the selected LUT
constructs solutions at level (p − 1) (i.e., subtrees with depth solution with bottom-up slew propagation. Thus, buffers are
= 2) based on the solutions at level p (i.e., subtrees with not necessarily inserted in all branching points.
depth = 1). In this example, solutions for region area 20 × 5 The runtime complexity of proposed algorithm is O(P max ·
(resp. 40 × 10) at level (p − 1) are constructed based on four W ·H
wint ·hint· N 2 · γγmax
int
), where wint , hint and γint are
instantiated solutions for region area 5 × 5 (resp. 10 × 10) at respectively the minimum distance and timing intervals for
level p. discretization of the original continuous solution space to
For each (P , w, h, n) tuple, we construct our optimization formulate the dynamic programming; γmax is the specified


maximum clock latency constraint. We set $w_{int}, h_{int} < 5\mu m$ and $\gamma_{int} = 1$ps in our experiments. To reduce the runtime, we apply the following pruning techniques.
• Pruning with number of leaf regions. For a given sub-region of size $w \times h$, we prune solutions with number of leaf regions greater than $(N \cdot w \cdot h)/(W \cdot H)$.
• Pruning with skew/latency constraints. We prune solutions that have skew larger than the maximum skew constraint or maximum latency larger than the latency upper bound.
• Pruning with maximum fanout constraint. We prune solutions that have branching factor larger than the maximum fanout constraint.5

5 In our experiments, we set the maximum fanout constraint to 40 based on guidance from an industrial collaborator [18].

Fig. 6: Runtime of our DP-based optimization method with and without pruning techniques across different numbers of sink regions.

Figure 6 shows DP-based optimization runtime with the number of sink regions ranging from 200 to 4000, where each sink region contains ∼25 flops. The maximum skew constraint used in the experiment is 30ps. Results show that with pruning, the DP-based optimization can optimize a design with more than 4K sink regions (or 100K flops) within six hours. Assuming that the flop count to total instance count ratio is typically 10% to 25%, our approach can optimize a design with 1M instances within six hours. Our studies show that runtime and memory usage increase significantly if we do not apply the proposed pruning techniques, due to the large number of intermediate solutions (such that we are not able to optimize beyond a design with 1K sink regions due to excessive memory usage). Moreover, we observe the same solution quality between the runs with and without pruning.

C. Embedding of GH-Tree Into a Sink Placement

The clock tree topology and buffering solution from our DP-based optimization assumes a uniform sink (region) distribution (whereby branching points are at the centers of regions). However, given a (realistic) non-uniform sink (flip-flop) placement, we must cluster flip-flops with balanced load across different clusters to avoid skew and latency increase. In other words, we should assign clusters of flip-flops to sinks of the GH-tree, based on the actual sink flip-flop placement. To adapt our optimized GH-tree to the given sink (i.e., flip-flop) placement, we perform a balanced K-means clustering of sinks and adapt buffer placements based on the clustering solution.6 We note that our approach is different from conventional top-down clock tree construction methods (e.g., Planar-DME [19]), in that (i) we embed an optimized clock tree topology with buffering solutions to a layout region with given sink placements, and (ii) we balance load capacitance among sink regions. By maintaining the distances between consecutive branching points and buffers at each clock level as well as balancing the load capacitance among sink regions, we preserve the solution quality (i.e., skew and latency) of the GH-tree solution obtained by the DP-based optimization. Furthermore, we understand that the optimal GH-tree topology and buffering solution can vary across different sink placements. With this in mind, we keep the best $M$ solutions from our DP-based optimization and select the minimum-power solution for the given (actual) sink placement as our final solution. Based on our preliminary experiments, we empirically use $M = 5$ to generate the results reported below; increasing $M$ beyond 5 does not improve the solution quality significantly.

6 Conventional K-means clock tree synthesis [10] cannot be applied to our problem as it does not comprehend the load capacitance balancing criterion.

Fig. 7: Example of sink clustering and clock buffer placement. In this example, we only show the local clustering for the leftmost branch of the first-level clock tree.

The remainder of this subsection describes two mathematical programs (ILPs) that perform sink clustering (i.e., to assign flip-flops to sinks of the constructed GH-tree) and place buffers of our GH-tree (one example is shown in Figure 7). The two ILPs respectively act at global and local clustering, as we describe below. We perform the clustering and branching point placement top-down, level-by-level.

7 Different routing tools can have different clock routing solutions for the bottom-level clock tree. However, the difference is very small. Based on our experimental results, skew from bottom-level clock routing is <1ps.

Our clustering optimization balances the load capacitance across different clusters (each cluster is assigned to a sink of our GH-tree) to minimize the discrepancy between our DP solution (which assumes a uniform sink placement) and the final solution (with the given actual sink placement). Note that this implies that we must consider wire capacitance at the bottom-most level, where the routing is achieved by a commercial P&R tool.7 However, any constructive approach


to this wire capacitance estimation is inaccurate and can dramatically increase runtime complexity during top-level optimization, where each branching point can have many sink flip-flops. We therefore divide the GH-tree into global and local clustering, based on the total number of downstream sinks (i.e., flip-flops), and only consider wire capacitance during local clustering.

Algorithm 2 Embedding of GH-Tree into a sink placement.
1: for p := 1 to P do
2:   if |S|/|U| ≥ Q_th then
3:     global_clustering()
4:   else
5:     local_clustering()
6:   end if
7:   branching_point_and_buffer_placement()
8: end for

Algorithm 2 describes our clustering procedure. For each level $p$, we iteratively apply either global or local clustering followed by LP-based branching point and clock buffer placement. Starting from level 1 (the topmost level) of the tree, we cluster sinks based on initial locations of the branching points (i.e., the light blue dots in the top-left figure of Figure 7, which are at the centers of uniform sink regions) of the DP-based GH-tree solution (Algorithm 1). The number of global clusters is the same as the number of branching points. For the example in Figure 7, since $p = 1$ and $b_1 = 4$, our ILP will generate four global clusters that have similar load capacitance. We then formulate an LP to determine the exact branching point locations as well as buffer locations in each global cluster. When the number of sinks in each cluster is smaller than a threshold value (i.e., $|S|/|U| < Q_{th}$), we apply our ILP-based local clustering, and refine the branching point locations in each local cluster using our LP. Note that the “K” in the K-means clustering is determined by the number of branching points in the GH-tree.

ILP formulation for global clustering. We pre-calculate distance $d_{k,i}$ between the branching point of cluster $u_k$ and the sink $s_i$ within the bounding box of the region that will be clustered. The initial locations of branching points are assumed to be at the centers of uniformly-sized regions corresponding to the branching factor. The blue dots in the top-left figure of Figure 7 show an example of initial branching point locations when the branching factor $b_1 = 4$.

\[
\begin{aligned}
\text{Minimize:}\quad & \sum_{s_i \in S} d_i + \alpha \cdot d_{\max} \\
\text{Subject to:}\quad & \sum_{u_k \in U} \eta_{k,i} = 1 \quad \forall s_i \in S && (7) \\
& d_i = \sum_{u_k \in U} d_{k,i} \cdot \eta_{k,i} \quad \forall s_i \in S && (8) \\
& d_{\max} \ge d_i \quad \forall s_i \in S && (9) \\
& C^{pin} \cdot (1 - \Delta) \le \sum_{s_i \in S} c_i \cdot \eta_{k,i} \le C^{pin} \cdot (1 + \Delta) \quad \forall u_k \in U && (10)
\end{aligned}
\]

We define $d_i$ as the distance between the sink $s_i$ and the branching point of the cluster that includes the sink $s_i$; $d_{\max}$ denotes the maximum distance among the distances $d_i$; and $\alpha$ is a weighting factor.8 Our objective is to minimize the sum of all distances $d_i$ and weighted $d_{\max}$. Constraint (7) ensures that each sink belongs to exactly one cluster. In Constraints (8) and (9), we obtain $d_i$ for each sink and $d_{\max}$, respectively. Constraint (10) ensures that the total pin capacitance of each cluster satisfies specified lower and upper bounds. The lower and upper bound capacitances are determined by $C^{pin}$, which is estimated as the total pin capacitance covered by the current region divided by the number of clusters, along with the margin $\Delta$. Since the capacitance cannot always be balanced between clusters, we add margin $\Delta$ to ensure that there is a feasible solution of the ILP.

8 Based on our preliminary studies, we empirically use α = 8 in our experiments.

ILP formulation for local clustering. Although the ILP for global clustering finds a balanced pin capacitance solution over all sink regions, it ignores wire capacitance. Therefore, we formulate a second ILP and apply it to local clusters that have smaller regions.

\[
\begin{aligned}
\text{Minimize:}\quad & \sum_{s_i \in S} d_i + \sum_{u_k \in U} \frac{\alpha}{|U|} \cdot \left( x^{ur}_k - x^{ll}_k + y^{ur}_k - y^{ll}_k \right) \\
\text{Subject to:}\quad & x^{ur}_k \ge x_i \cdot \eta_{k,i} \quad \forall s_i \in S,\ u_k \in U && (11) \\
& y^{ur}_k \ge y_i \cdot \eta_{k,i} \quad \forall s_i \in S,\ u_k \in U && (12) \\
& x^{ll}_k \le x_i + \lambda \cdot (1 - \eta_{k,i}) \quad \forall s_i \in S,\ u_k \in U && (13) \\
& y^{ll}_k \le y_i + \lambda \cdot (1 - \eta_{k,i}) \quad \forall s_i \in S,\ u_k \in U && (14) \\
& C^{pin+wire} \cdot (1 - \Delta) \le \sum_{s_i \in S} (c_i + \zeta) \cdot \eta_{k,i} + \beta \cdot \left( x^{ur}_k - x^{ll}_k + y^{ur}_k - y^{ll}_k \right) && (15) \\
& C^{pin+wire} \cdot (1 + \Delta) \ge \sum_{s_i \in S} (c_i + \zeta) \cdot \eta_{k,i} + \beta \cdot \left( x^{ur}_k - x^{ll}_k + y^{ur}_k - y^{ll}_k \right) && (16) \\
& \text{+ Constraints (7) and (8)}
\end{aligned}
\]

The objective of this second ILP is to minimize the sum of all distances $d_i$, of which the definition is the same as above, plus the weighted sum of half-perimeter wirelength (HPWL) of all clusters’ bounding boxes. We use HPWL and the number of sinks within a cluster to model the wire capacitance of the sink region. In Constraints (11)–(14), we obtain the lower-left and upper-right corner locations of each cluster’s bounding box. $\lambda$ is a large positive integer. Constraints (15) and (16) ensure that the total (i.e., pin and wire) capacitance of each cluster satisfies given lower and upper bounds. $\zeta$ and $\beta$ are respectively coefficients for the number of sinks and the cluster’s bounding


box HPWL in our linear wire capacitance estimation model.9

9 We determine ζ and β by fitting a linear model to wire capacitances extracted for all our testcases. We perform least-squares regression as follows: wire capacitance = ζ · #sinks + β · HPWL. In our 28nm foundry enablement, ζ = 0.28 and β = 2.93.

LP formulation for branching point location. Based on the two ILPs described above, we obtain the clustering solution at a given clock level. However, the initial assumption of branching locations to be at the center of a region may cause large skew if the sinks within a cluster are placed non-uniformly. To address this issue, we formulate a linear program (LP) to place each branching point close to the weighted center of each cluster, as follows.

\[
\begin{aligned}
\text{Minimize:}\quad & d^{\Delta}_{\max} \\
\text{Subject to:}\quad & |x^{b}_{k+1} - x^{b}_{k}| + |y^{b}_{k+1} - y^{b}_{k}| = d^{b} && (17) \\
& x^{\Delta}_{k} + x^{b}_{k} - x^{w}_{k} \ge 0, \quad x^{\Delta}_{k} - x^{b}_{k} + x^{w}_{k} \ge 0 && (18) \\
& y^{\Delta}_{k} + y^{b}_{k} - y^{w}_{k} \ge 0, \quad y^{\Delta}_{k} - y^{b}_{k} + y^{w}_{k} \ge 0 && (19) \\
& x^{\Delta}_{\max} \ge x^{\Delta}_{k}, \quad y^{\Delta}_{\max} \ge y^{\Delta}_{k} && (20) \\
& d^{\Delta}_{\max} = x^{\Delta}_{\max} + y^{\Delta}_{\max} && (21)
\end{aligned}
\]

Here, $x^{b}_{k}$ and $y^{b}_{k}$ denote the x-/y-coordinates of the $k$-th branching point, and $x^{w}_{k}$ and $y^{w}_{k}$ denote the x-/y-coordinates of the weighted center of the $k$-th cluster.10 The objective is to minimize the sum of the maximum x- and y-distances between any branching point and its corresponding weighted center. Constraint (17) ensures a fixed distance $d^{b}$ between any two consecutive branching points. In Constraints (18) and (19), variables $x^{\Delta}_{k}$ and $y^{\Delta}_{k}$ respectively denote the x- and y-distances between the branching point and the weighted center of the $k$-th cluster. We then obtain the sum of the maximum x- and y-distances from Constraints (20) and (21).

10 We calculate the x- and y-coordinates of each weighted center as the respective means of all x- and y-coordinates of placed sinks in the corresponding cluster.

Placement blockage. We further show a simple extension of our LP formulation to an ILP formulation to comprehend blockage-aware buffer placement. In this extension, we adjust buffer placement to address the existence of blockages. A caveat is that this method may not work well for designs with a large number of blockages and/or a complex floorplan, since our DP-based tree topology and buffering solutions are not aware of blockages. We assume that there are $O$ rectangular blockages. The index of a blockage is denoted by $q$. Each blockage is defined by its lower-left corner $(x^{ll}_{q}, y^{ll}_{q})$ and upper-right corner $(x^{ur}_{q}, y^{ur}_{q})$. When there are placement blockages in the floorplan, we define the following constraints.

\[
\begin{aligned}
& \psi^{llx}_{j,q} = 1 \Leftrightarrow x^{f}_{j} < x^{ll}_{q} \\
& \psi^{urx}_{j,q} = 1 \Leftrightarrow x^{f}_{j} > x^{ur}_{q} \\
& \psi^{lly}_{j,q} = 1 \Leftrightarrow y^{f}_{j} < y^{ll}_{q} \\
& \psi^{ury}_{j,q} = 1 \Leftrightarrow y^{f}_{j} > y^{ur}_{q} \\
& \psi^{llx}_{j,q} + \psi^{urx}_{j,q} + \psi^{lly}_{j,q} + \psi^{ury}_{j,q} \ge 1
\end{aligned}
\tag{22}
\]

Here, $(x^{f}_{j}, y^{f}_{j})$ is the location of the $j$-th buffer at a given clock level. $\psi^{\{llx,lly,urx,ury\}}_{j,q}$ are binary indicator variables which indicate whether the $j$-th buffer is located outside the corresponding boundaries of the blockages. The last inequality in (22) defines the constraint that at least one of the indicator variables must be true. Satisfying this constraint implies that the $j$-th buffer is not in the bounding box of the $q$-th blockage. Note that with Constraints (22), the problem becomes an integer linear program (ILP).

IV. EXPERIMENTAL SETUP AND RESULTS

We conduct our experiments in a commercial foundry’s 28nm LP technology, with a dual-Vt, 12-track standard-cell library. The input placement solutions (including clock sink placements) are generated using Cadence Innovus Implementation System v15.2 [44]. Our optimization flow is implemented using C++ and Tcl scripts. We use CPLEX v12.6 [45] as both ILP solver and LP solver, along with OpenMP [48] to enable multi-threaded execution. We execute all our experiments by using up to 40 threads on a 2.6GHz Intel Xeon E5-2690 server. We construct reference clock tree solutions using the latest releases available to us of two leading-edge commercial P&R tools (i.e., Tool1 and Tool2) as well as a state-of-the-art academic tool [39], and report attributes of solutions from these tools along with those of our GH-tree solutions.11

11 The tool flows used in our work are based on latest versions of leading commercial EDA tools available through the respective vendors’ university programs. More specific identification of tools and vendors is not permitted by the vendors.

TABLE II: Description of our testcases.
| Testcases | #Insts. | #FFs | Util. (%) | Max tran. (ps) | Max cap. (fF) |
| B19 | 39788 | 3086 | 73 | 60 | 80 |
| JPEG | 46937 | 4712 | 73 | 60 | 80 |
| VGA | 66226 | 17057 | 76 | 60 | 80 |
| LEON3MP | 463104 | 108817 | 74 | 60 | 80 |
| VGA blockage | 65891 | 17057 | 61 | 60 | 80 |
| VGA high AR | 65124 | 17057 | 75 | 60 | 80 |

TABLE III: MCMM settings and clock periods (ns) for our testcases.
| Testcases | C1 = {ss, 0.9V, -40C} | C2 = {ff, 1.1V, -40C} |
| B19 | 1.5 | 1.0 |
| JPEG | 1.2 | 1.0 |
| VGA | 1.4 | 1.0 |
| LEON3MP | 1.6 | 1.0 |
| VGA blockage | 1.4 | 1.0 |
| VGA high AR | 1.4 | 1.0 |

We evaluate our optimizer using four designs: JPEG from OpenCores [47], and B19, VGA and LEON3MP from the ISPD-2012 contest [24]. We use the real design (JPEG) and the testcases from the ISPD-2012 contest (B19, VGA, LEON3MP) since they contain datapaths (in contrast to testcases from the ISPD-2010 contest, which do not contain datapath information), thus enabling comparison versus commercial tools. We synthesize these testcases using


Synopsys Design Compiler H-2013.03-SP3 [51]. Table II summarizes the instance count, number of clock sinks, placement utilization and timing constraints for our testcases. To study the impact of maximum skew and latency constraints on power, we run our optimizer with different maximum skew and latency constraints and collect discrete solution points showing skew-power and latency-power tradeoffs. We obtain reference clock solutions with multi-corner multi-mode (MCMM) optimization; we define our mode and corner settings in Table III. We also apply late and early deratings of 1.1 and 0.9 to model OCV effects. We use Synopsys HSPICE G-2012.06-SP1 [50] to perform timing and power analysis.

A. Comparison with Trees from State-of-the-art CTS

Figure 8 and Figure 9 respectively compare skew and clock power, and maximum latency and clock power, of GH-tree solutions to those from the two commercial tools and one academic flow [39]. Table IV compares #buffers, buffer area, max (insertion) delay across corners, and wirelength among clock tree solutions as well as optimization runtime. All flows use the same sink placement solution as input. We apply the same setups (i.e., clock buffer cells (X50, X67, X100, X134 and ganged buffers), BEOL layers (M3 and M4), maximum transition constraints (60ps)) to our GH-tree construction and to both commercial and academic tool flows. We sweep the maximum skew and maximum latency constraints on the four designs for the GH-tree constructions. Blue curves in the figures are skew-power and latency-power Pareto curves of our GH-tree solutions.12

12 We estimate the Pareto curve based on discrete solution points due to limited computing resources. In addition, since the tradeoff between skew/max latency versus clock power is monotone, we feel that three solution points can provide a useful estimation of the tradeoff.

Overall analysis. Our results show that our GH-tree solutions achieve significant clock power reduction with similar or reduced skew and latency values as compared to the solutions from both commercial and academic tools. We also include the conventional “strict” H-tree solution as a comparison. Note that we use the same methodology to determine the buffer locations and sizes in “strict” H-tree and GH-tree constructions. In other words, it is unnecessary to add a buffer at each branching point. For example, we achieve power reductions of 30% on B19 and 20% on LEON3MP in Table IV compared to commercial tools’ results. Moreover, we observe that due to the symmetric topology of our GH-tree, our GH-tree solution is typically more robust against skew variation across different corners as compared to the clock trees from other tools. As an example, skew of the clock tree solution from Tool2 increases by 137% on design VGA blockage between two corners; by contrast, our solutions generally have similar skew values across corners. The conventional “strict” H-tree achieves the minimum clock skew but at the cost of larger power, buffer area and wirelength. As an example, for LEON3MP, H-tree has depth P = 12, but GH-tree has P = 10 (i.e., branching factor = (2, 2, 2, 2, 2, 4, 4, 2, 2, 2)). The shallower depth of GH-tree significantly reduces the number of buffers and wirelength. We also validate our GH-tree optimization on design VGA with high floorplan aspect ratio and existence of placement blockages (shown in Figure 15), where we observe similar clock power and latency, but larger skew, compared to the case without blockage and high floorplan aspect ratio. Compared to the results of [39], our GH-tree achieves much smaller power (i.e., up to 55% on B19) and skew, but at the cost of larger maximum latency.

Power analysis. We also observe that our GH-tree solutions have a smaller number of buffers and smaller clock wirelength as compared to solutions from commercial and academic tools. Figure 10 further shows a histogram of clock buffer power (i.e., sum of internal, leakage and dynamic power) values of our GH-tree versus corresponding values from a commercial tool’s solution. We observe that our DP-based optimization, which can select its buffering solution “optimally” based on the characterized LUTs, achieves smaller buffer power values for most of the clock buffers. In other words, our GH-tree optimization is able to achieve optimized load capacitance of buffers as well as slew propagation for reduced clock power. The power information from our LUTs enables power-awareness in our DP-based GH-tree construction.

Fig. 10: Distribution of clock buffer power values from GH-tree and commercial tool’s solution. Design: VGA.

Runtime analysis. Results in Table IV show that although the naive worst-case time complexity of our (DP- and ILP-based) optimization is high (cf. the nested for loops in Algorithm 1), the pruning techniques and empirically selected granularity of our LUTs (e.g., 15µm for distance, 5ps for slew, and 5fF for capacitance) make the runtime of our optimization comparable to those of commercial tools. The large runtime of design LEON3MP mainly comes from bottom-level tree construction (∼40 minutes) and ECO routing according to our GH-tree solution using OpenAccess [46] (∼15 minutes). The actual GH-tree construction runtime (i.e., DP + ILP runtime) for design LEON3MP is only ∼25 minutes. We consider such runtime acceptable in light of the potential clock power benefits from our approach.

Robustness analysis. We further perform Monte Carlo simulation on our GH-tree solution and those from commercial tools, and compare the resultant variation in clock skew and power. Figure 11 shows that our GH-tree solution exhibits relatively smaller variation in clock skew (i.e., ∼35ps) compared to commercial tools’ solutions (i.e., ∼40ps).


Furthermore, all of Tool1, Tool2 and GH-tree solutions have small power variation.

Fig. 8: Power and skew comparisons among GH-tree, Tool1, Tool2 and [39] for four testcases.

Fig. 9: Power and maximum latency comparisons among GH-tree, Tool1, Tool2 and [39] for four testcases.

B. Impact of NDR

We now summarize observed impacts of non-default rules (NDRs) on clock tree solution quality. We generate GH-trees with various NDR options: (i) 1W1S only, (ii) 2W2S only, and (iii) the combination of 1W1S and 2W2S. We set the maximum skew constraint to 200ps and compare Pareto curves of the latency versus clock power tradeoffs. Figure 12 shows the comparison for the JPEG testcase at 28LP technology. Due to better slew propagation, solutions with 2W2S have fewer clock buffers as compared to the 1W1S solutions (i.e., the average numbers of clock buffers are respectively 83 and 111 in 2W2S- and 1W1S-only solutions). Further, solutions that permit either 1W1S or 2W2S at each level (of clock subnets) are able to achieve a better tradeoff between latency and power.
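Comparing NDR options in this way amounts to comparing latency-power Pareto curves built from discrete solution points. As a minimal sketch of how such a curve can be extracted, the Python snippet below filters out dominated points. The function name `pareto_front` and the sample (latency, power) values are hypothetical and chosen only for illustration; they are not data from our experiments.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (max latency in ps, clock power in mW)

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated (latency, power) points, sorted by latency.

    A point is dominated if another point is no worse in both metrics
    and strictly better in at least one.
    """
    front: List[Point] = []
    best_power = float("inf")
    # Sort by latency (ties broken by power), then keep a point only if it
    # strictly improves on the best power seen at any smaller latency.
    for latency, power in sorted(set(points)):
        if power < best_power:
            front.append((latency, power))
            best_power = power
    return front

# Hypothetical solution points (illustrative values only).
candidates = [(200.0, 6.4), (210.0, 6.0), (220.0, 6.1), (230.0, 5.8), (210.0, 6.3)]
print(pareto_front(candidates))
```

One curve per NDR option (1W1S only, 2W2S only, mixed) can then be compared point by point: an option offers a better tradeoff wherever its front lies below the others at the same latency.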


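TABLE IV: Comparison between clock tree solutions from [39], Tool1 and Tool2 versus our GH-trees. Technology: 28LP.
| Testcase | Flow | C1 Max laten. (ps) | C1 Skew (ps) | C1 Clk power (mW) | C2 Max laten. (ps) | C2 Skew (ps) | C2 Clk power (mW) | #Buffers | Buf area (µm²) | Clk WL (mm) | Runtime (min) |
| B19 | Tool1 | 150 | 17 | 3.8 | 110 | 27 | 9.3 | 104 | 227 | 15688 | 15 |
| B19 | Tool2 | 222 | 29 | 3.4 | 129 | 5 | 8.3 | 84 | 245 | 12996 | 11 |
| B19 | [39] (min pwr) | 111 | 40 | 5.6 | 63 | 40 | 12.1 | 211 | 338 | N/A | N/A |
| B19 | [39] (min skew) | 125 | 36 | 6.0 | 78 | 38 | 13.0 | 228 | 413 | N/A | N/A |
| B19 | GH-tree | 170 | 12.5 | 2.6 | 116 | 25.8 | 6.4 | 41 | 106 | 12242 | 15 |
| B19 | H-tree | 166 | 7 | 3.1 | 106 | 11 | 7.6 | 147 | 227 | 13941 | 16 |
| JPEG | Tool1 | 196 | 26 | 6.9 | 131 | 26 | 16.8 | 160 | 345 | 20967 | 16 |
| JPEG | Tool2 | 236 | 34 | 6.2 | 141 | 18 | 15.2 | 120 | 352 | 18432 | 14 |
| JPEG | [39] (min pwr) | 179 | 65 | 9.2 | 103 | 73 | 19.7 | 340 | 651 | N/A | N/A |
| JPEG | [39] (min skew) | 155 | 30 | 9.4 | 92 | 36 | 20.4 | 353 | 676 | N/A | N/A |
| JPEG | GH-tree | 201 | 19 | 5.9 | 129 | 17 | 14.5 | 147 | 296 | 20009 | 17 |
| JPEG | H-tree | 229 | 12 | 6.6 | 150 | 16 | 16.3 | 169 | 456 | 20064 | 14 |
| VGA | Tool1 | 260 | 36 | 20.7 | 152 | 7 | 55.3 | 464 | 1119 | 57678 | 16 |
| VGA | Tool2 | 314 | 28 | 18.0 | 201 | 11 | 48.5 | 369 | 1047 | 56305 | 22 |
| VGA | [39] (min pwr) | 171 | 52 | 24.0 | 114 | 73 | 52.1 | 911 | 1651 | N/A | N/A |
| VGA | [39] (min skew) | 171 | 52 | 24.0 | 114 | 73 | 52.1 | 911 | 1651 | N/A | N/A |
| VGA | GH-tree | 238 | 19 | 17.4 | 174 | 41 | 44.2 | 331 | 1036 | 57404 | 21 |
| VGA | H-tree | 253 | 16 | 20.4 | 162 | 19 | 55.0 | 597 | 1682 | 62957 | 16 |
| LEON3MP | Tool1 | 426 | 63 | 109.5 | 276 | 25 | 195.9 | 2661 | 6654 | 369737 | 54 |
| LEON3MP | Tool2 | 633 | 34 | 102.5 | 421 | 58 | 184.2 | 2509 | 7225 | 367854 | 37 |
| LEON3MP | [39] (min pwr) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| LEON3MP | [39] (min skew) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| LEON3MP | GH-tree | 415 | 24 | 87.2 | 285 | 35 | 157.3 | 1331 | 4154 | 374568 | 101 |
| LEON3MP | H-tree | 415 | 22 | 96.0 | 287 | 24 | 173.0 | 2399 | 6741 | 393582 | 99 |
| VGA blockage | Tool1 | 245 | 23 | 21.3 | 174 | 36 | 56.7 | 475 | 1148 | 66323 | 14 |
| VGA blockage | Tool2 | 347 | 57 | 17.6 | 212 | 24 | 47.5 | 401 | 1127 | 64640 | 24 |
| VGA blockage | [39] (min pwr) | 298 | 133 | 29.9 | 208 | 145 | 54.5 | 1118 | 2154 | N/A | N/A |
| VGA blockage | [39] (min skew) | 252 | 86 | 34.4 | 157 | 93 | 74.3 | 1291 | 2387 | N/A | N/A |
| VGA blockage | GH-tree | 239 | 36 | 16.9 | 163 | 31 | 45.4 | 293 | 815 | 68635 | 19 |
| VGA blockage | H-tree | 282 | 22 | 20.8 | 174 | 21 | 56.0 | 599 | 1685 | 72636 | 16 |
| VGA high AR | Tool1 | 231 | 19 | 20.5 | 164 | 27 | 54.7 | 456 | 1094 | 59506 | 14 |
| VGA high AR | Tool2 | 325 | 33 | 18.5 | 211 | 43 | 49.8 | 395 | 1120 | 58114 | 21 |
| VGA high AR | [39] (min pwr) | 161 | 28 | 25.1 | 105 | 74 | 54.5 | 956 | 1679 | N/A | N/A |
| VGA high AR | [39] (min skew) | 161 | 28 | 25.1 | 157 | 93 | 54.5 | 956 | 1679 | N/A | N/A |
| VGA high AR | GH-tree | 265 | 33 | 17.5 | 187 | 30 | 47.3 | 299 | 1039 | 57855 | 21 |
| VGA high AR | H-tree | 271 | 15 | 20.4 | 169 | 12 | 54.8 | 661 | 1669 | 65181 | 22 |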
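Several of the headline figures quoted in the discussion of Table IV can be re-derived from its entries. The short Python check below does so; the helper name `pct_reduction` is introduced here for illustration only, and all numeric inputs are copied from the table. The computed values come out close to the rounded percentages quoted in the text.

```python
import math

def pct_reduction(reference: float, ours: float) -> float:
    """Percentage by which `ours` is smaller than `reference`."""
    return 100.0 * (reference - ours) / reference

# Clock power (mW) at corner C1, copied from Table IV.
b19_tool1, b19_gh = 3.8, 2.6
leon_tool1, leon_gh = 109.5, 87.2
b19_ref39_min_skew = 6.0

print("B19 vs. Tool1 (%):", round(pct_reduction(b19_tool1, b19_gh), 1))
print("LEON3MP vs. Tool1 (%):", round(pct_reduction(leon_tool1, leon_gh), 1))
print("B19 vs. [39] min skew (%):", round(pct_reduction(b19_ref39_min_skew, b19_gh), 1))

# Tool2 skew (ps) on VGA blockage: 24 at corner C2 versus 57 at corner C1.
skew_increase = 100.0 * (57 - 24) / 24
print("Tool2 skew increase across corners (%):", skew_increase)

# LEON3MP: the ten GH-tree branching factors multiply to the same number of
# sink regions as a 12-level binary H-tree.
gh_branching = (2, 2, 2, 2, 2, 4, 4, 2, 2, 2)
print("GH-tree sink regions:", math.prod(gh_branching), "H-tree:", 2 ** 12)
```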

Fig. 11: Clock skew and power comparison among GH-tree,


Tool1 and Tool2 through Monte Carlo simulation. Design: Fig. 12: GH-tree optimization with various NDR options.
VGA.
C. Study of “Skew Budgeting” across Clock Levels

How to optimally budget skew across clock tree levels has been an open problem for over two decades. Interestingly, our DP approach may provide new insights into how to budget skew for minimum clock power. To study the skew budgeting across clock levels, we run our optimizer multiple times with target skews from 5ps to 40ps in steps of 5ps on testcase VGA. In each run, we find the minimum-power GH-tree solution that satisfies the given target skew. Figure 13 shows the normalized skew at each clock level of the minimum-power GH-tree solutions across different target skews. When the skew constraint is tight (i.e., ≤15ps), most of the skew occurs in the bottom levels (i.e., levels 5, 6 and 7) of the clock tree. However, when the skew constraint is relaxed (i.e., ≥20ps), most of the skew occurs in the top levels (i.e., levels 1 and 2) of the clock tree. We show the normalized clock power for each target skew at the top of each bar in Figure 13. For example, the minimum-power GH-tree solution for a target skew of 30ps consumes 74% of the power of the minimum-power GH-tree solution for a target skew of 5ps. To achieve this ∼26% clock power reduction by changing the target skew from 5ps to 30ps, the clock tree must have ∼80% of the skew in levels 1 and 2. We emphasize that, as noted above, in our GH-tree a level does not necessarily imply the insertion of a buffer. In other words, a higher number of levels does not necessarily result in larger latency. Rather, we observe from our GH-tree solutions (which are on Pareto frontiers with respect to tradeoffs among skew, latency and clock power) that skew and latency are typically correlated with each other.
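The selection criterion used in this study (minimum clock power subject to a target skew, among solutions on a Pareto frontier over skew, latency and power) can be illustrated with a minimal Python sketch. This is not the DP itself: the `Solution` record, the helper names and all candidate values below are hypothetical and chosen only for illustration, with the power values picked so that the 30ps-to-5ps ratio mirrors the 74% figure quoted above.

```python
from typing import NamedTuple, Optional, Sequence


class Solution(NamedTuple):
    """Hypothetical record for one candidate clock tree (not the paper's data)."""
    skew_ps: float
    latency_ps: float
    power: float  # arbitrary units


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(cands: Sequence[Solution]) -> list:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def min_power(cands: Sequence[Solution], max_skew_ps: float,
              max_latency_ps: float) -> Optional[Solution]:
    """Lowest-power candidate meeting both targets, or None if none is feasible."""
    feasible = [c for c in cands
                if c.skew_ps <= max_skew_ps and c.latency_ps <= max_latency_ps]
    return min(feasible, key=lambda c: c.power, default=None)


# Hypothetical candidates (illustrative values only).
CANDIDATES = [
    Solution(5.0, 150.0, 10.0),
    Solution(15.0, 160.0, 8.5),
    Solution(30.0, 180.0, 7.4),
    Solution(30.0, 190.0, 8.0),  # dominated by the candidate above
]

frontier = pareto_frontier(CANDIDATES)
tight = min_power(CANDIDATES, max_skew_ps=5.0, max_latency_ps=200.0)
relaxed = min_power(CANDIDATES, max_skew_ps=30.0, max_latency_ps=200.0)
normalized_power = relaxed.power / tight.power  # power relative to the 5ps target
```

Sweeping `max_skew_ps` over a range of targets and normalizing each result to the tightest target gives the kind of normalized-power series that Figure 13 reports above each bar.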

Fig. 13: Skew budgeting, normalized clock power and the number of levels (depth) of the clock tree for different target skews. According to our optimal DP solutions, nearly 80% of skew occurs in the bottom levels of the tree when the target skew is ≤15ps, but this shifts to the top levels of the tree when the target skew is ≥20ps.

Fig. 14: Layout example of GH-tree on VGA. In red is clock routing of the top four levels in the GH-tree with branching pattern (4, 4, 2, 6). The left figure shows clock routing (top and bottom levels) and the right figure shows the sink clustering solution.
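The branching patterns reported in the captions of Figs. 14 and 15 fix how many branches exist after each level. A minimal Python sketch of that arithmetic, assuming every node at a given level fans out by exactly that level's branching factor (consistent with a GH-tree being a balanced tree); the helper name `branches_per_level` is ours and is used for illustration only:

```python
from itertools import accumulate
from operator import mul


def branches_per_level(pattern):
    """Cumulative product of branching factors: entry k is the branch count after level k+1."""
    return list(accumulate(pattern, mul))


fig14_counts = branches_per_level((4, 4, 2, 6))        # top four levels on VGA
fig15_counts = branches_per_level((2, 2, 2, 2, 2, 4))  # top six levels on VGA blockage
print(fig14_counts)  # [4, 16, 32, 192]
print(fig15_counts)  # [2, 4, 8, 16, 32, 128]
```

Under this assumption, the four-level pattern ends in 192 branches, while the six-level pattern ends in 128 branches.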
Fig. 15: Layout example of GH-tree on VGA blockage. In red is clock routing of the top six levels in the GH-tree with branching pattern (2, 2, 2, 2, 2, 4). The left figure shows clock routing (top and bottom levels) and the right figure shows the sink clustering solution.

V. CONCLUSION

In this work, we propose the concept of a generalized H-tree, which is a balanced tree topology with an arbitrary sequence of branching factors at each level. Our DP-based method provides an optimal GH-tree that has minimum clock power for given skew and maximum latency targets. Our DP solutions are constructed using clock buffers (with ganging) along with interconnect timing and power models from a 28LP foundry design enablement; we co-optimize the clock tree topology along with the buffering along branches. We furthermore propose a clustering- and linear programming-based heuristic to embed the GH-tree with respect to the given placement of clock sinks. We validate our solutions in commercial P&R tool flows in a 28LP foundry technology. The results show up to 30% clock power reduction while achieving similar skew and latency as CTS solutions from recent versions of leading commercial P&R tools. Our proposed approach also achieves up to 56% clock power reduction compared to a state-of-the-art academic tool [39]. Compared to a “strict” H-tree, our results achieve better tradeoffs such that power is significantly reduced at the cost of a small skew increase. Our ongoing and future work includes (i) co-optimization of sink placement and clock tree construction; (ii) budgeting of skew and latency across levels; (iii) application of useful skew in GH-tree; (iv) application of GH-tree construction in hierarchical designs (that require hierarchical CTS); (v) co-optimization of datapath placement and GH-tree construction; (vi) clock gate- and logic cell-aware GH-tree construction; and (vii) blockage-aware DP-based clock tree topology and buffering.

ACKNOWLEDGMENTS

We would like to deeply thank Y. Kim and T. Kim for running their tool on our testcases and providing the results. We also thank Soonbok Jang of Samsung Electronics for her participation in this project while a Visiting Scholar at UCSD and thank Yaping Sun for valuable early discussions.

REFERENCES

[1] A. Abdelhadi, R. Ginosar, A. Kolodny and E. G. Friedman, “Timing-Driven Variation-Aware Synthesis of Hybrid Mesh/Tree Clock Distribution Networks”, Integration, the VLSI Journal 46(4) (2013), pp. 382-391.
[2] A. Andreev, A. Nikishin, S. Gribok, P.-C. Tan and C.-H. Choo, “Clock Network Fishbone Architecture for a Structured ASIC Manufactured on a 28 NM CMOS Process Lithographic Node”, U.S. Patent 8,629,548, January 2014.
[3] C. J. Alpert, Z. Li, G.-J. Nam, S. Ramji, C. N. Sze, P. G. Villarubia and N. Viswanathan, “Structured Placement of Latches/Flip-Flops to Minimize Clock Power in High-Performance Designs”, U.S. Patent 8,954,912, May 2014.
[4] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Reading, MA, Addison-Wesley, 1990.
[5] T.-B. Chan, K. Han, A. B. Kahng, J.-G. Lee and S. Nath, “OCV-Aware Top-Level Clock Tree Optimization”, Proc. GLSVLSI, 2014, pp. 33-38.
[6] T. H. Chao, Y. C. Hsu, J. M. Ho, K. D. Boese and A. B. Kahng, “Zero Skew Clock Routing With Minimum Wirelength”, IEEE TCAS 39(11) (1992), pp. 799-814.
[7] M. Charikar, J. Kleinberg, R. Kumar, S. Rajagopalan, A. Sahai and A. Tomkins, “Minimizing Wirelength in Zero and Bounded Skew Clock Trees”, SIAM J. Discrete Mathematics 17(4) (2004), pp. 582-595.


[8] Y.-Y. Chen, C. Dong and D. Chen, “Clock Tree Synthesis Under Aggressive Buffer Insertion”, Proc. DAC, 2010, pp. 86-89.
[9] J. Cong, A. B. Kahng, C. K. Koh and C.-W. A. Tsao, “Bounded-Skew Clock and Steiner Routing”, ACM TODAES 3(3) (1998), pp. 341-388.
[10] C. Deng, Y.-C. Cai and Q. Zhou, “Register Clustering Methodology for Low Power Clock Tree Synthesis”, J. Computer Science and Technology 30(2) (2015), pp. 391-403.
[11] D. Dolev, M. Fugger, C. Lenzen, M. Perner and U. Schmid, “HEX: Scaling Honeycombs is Easier Than Scaling Clock Trees”, Proc. SPAA, 2013, pp. 164-175.
[12] R. Ewetz and C.-K. Koh, “Cost-Effective Robustness in Clock Networks Using Near-Tree Structures”, IEEE TCAD 34(4) (2015), pp. 515-528.
[13] M. R. Guthaus, D. Sylvester and R. B. Brown, “Clock Buffer and Wire Sizing using Sequential Programming”, Proc. DAC, 2006, pp. 1041-1046.
[14] M. R. Guthaus, G. Wilke and R. Reis, “Non-uniform Clock Mesh Optimization with Linear Programming Buffer Insertion”, Proc. DAC, 2010, pp. 74-79.
[15] M. R. Guthaus, G. Wilke and R. Reis, “Revisiting Automated Physical Synthesis of High-performance Clock Networks”, ACM TODAES 18(2) (2013), pp. 31:1-31:27.
[16] K. Han, A. B. Kahng, J. Lee, J. Li and S. Nath, “A Global-Local Optimization Framework for Simultaneous Multi-Mode Multi-Corner Skew Variation Reduction”, Proc. DAC, 2015, pp. 26:1-26:6.
[17] J. Hu, A. B. Kahng, B. Liu, G. Venkataraman and X. Xu, “A Global Minimum Clock Distribution Network Augmentation Algorithm for Guaranteed Clock Skew Yield”, Proc. ASP-DAC, 2007, pp. 24-31.
[18] S. Jang, Samsung Electronics, personal communication, May 2015.
[19] A. B. Kahng and C.-W. A. Tsao, “Planar-DME: Improved Planar Zero-Skew Clock Routing with Minimum Pathlength Delay”, Proc. Euro DAC, 1994, pp. 440-445.
[20] I.-M. Liu, T.-L. Chou, A. Aziz and D. F. Wong, “Zero-Skew Clock Tree Construction by Simultaneous Routing, Wire Sizing and Buffer Insertion”, Proc. ISPD, 2000, pp. 33-38.
[21] D. Liu and C. Svensson, “Power Consumption Estimation in CMOS VLSI Circuits”, IEEE J. Solid-State Circuits 29(6) (1994), pp. 663-670.
[22] F. Minami and M. Takano, “Clock Tree Synthesis Based on RC Delay Balancing”, Proc. CICC, 1992, pp. 28.3.1-28.3.4.
[23] A. D. Mehta, Y.-P. Chen, N. Menezes, D. F. Wong and L. T. Pileggi, “Clustering and Load Balancing for Buffered Clock Tree Synthesis”, Proc. ICCD, 1997, pp. 217-223.
[24] M. M. Ozdal, C. Amin, A. Ayupov, S. M. Burns, G. R. Wilke and C. Zhuo, “ISPD-2012 Discrete Cell Sizing Contest and Benchmark Suite”, Proc. ISPD, 2012, pp. 161-164, http://archive.sigda.org/ispd/contests/12/ispd2012 contest.html.
[25] M. Pedram, “Power Minimization in IC Design: Principles and Applications”, ACM TODAES 1(1) (1996), pp. 3-56.
[26] A. Rajaram and D. Z. Pan, “Variation Tolerant Buffered Clock Network Synthesis with Cross Links”, Proc. ISPD, 2006, pp. 157-164.
[27] L. Rakai, A. Farshidi, L. Behjat and D. Westwick, “Buffer Sizing for Clock Networks using Robust Geometric Programming Considering Variations in Buffer Sizes”, Proc. ISPD, 2013, pp. 154-161.
[28] R. R. Rao, D. Blaauw, D. Sylvester, C. J. Alpert and S. Nassif, “An Efficient Surface-Based Low-Power Buffer Insertion Algorithm”, Proc. ISPD, 2005, pp. 86-93.
[29] P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovick, B. L. Krauter, and B. D. McCredie, “A Clock Distribution Network for Microprocessors”, IEEE J. Solid-State Circuits 36(5) (2001), pp. 792-799.
[30] J. Reuben, H. M. Kittur and M. Shoaib, “A Novel Clock Generation Algorithm for System-on-Chip Based on Least Common Multiple”, Computers and Electrical Engineering 40(7) (2014), pp. 2113-2125.
[31] J. Reuben, V. M. Zackriya, S. Nashit and H. M. Kittur, “Capacitance Driven Clock Mesh Synthesis to Minimize Skew and Power Dissipation”, IEICE Electronics Express 10(24) (2013), pp. 1-12.
[32] R. Samanta, J. Hu and P. Li, “Discrete Buffer and Wire Sizing for Link-Based Non-Tree Clock Networks”, IEEE TVLSI 18(7) (2010), pp. 1025-1035.
[33] V. Sathe, S. Arekapudi, A. Ishii, C. Ouyang, M. Papaefthymiou and S. Naffziger, “Resonant Clock Design for a Power-efficient, High-volume x86-64 Microprocessor”, Proc. ISSCC, 2012, pp. 140-149.
[34] H. Seo, J. Kim, M. Kang and T. Kim, “Synthesis for Power-Aware Clock Spines”, Proc. ICCAD, 2015, pp. 126-131.
[35] C. N. Sze, “ISPD 2010 High Performance Clock Network Synthesis Contest: Benchmark Suite and Results”, Proc. ISPD, 2010, pp. 143-143.
[36] H. Su and S. S. Sapatnekar, “Hybrid Structured Clock Network Construction”, Proc. ICCAD, 2001, pp. 333-336.
[37] S. Tam, “Modern Clock Distribution Systems” in Clocking in Modern VLSI Systems, New York, Springer, 2009.
[38] C.-W. A. Tsao and C.-K. Koh, “UST/DME: A Clock Tree Router for General Skew Constraints”, ACM TODAES 7(3) (2002), pp. 359-379.
[39] Y. Kim and T. Kim, “Algorithm for Synthesis and Exploration of Clock Spines”, Proc. ASP-DAC, 2017, pp. 263-268.
[40] J.-L. Tsai, T.-H. Chen and C. C.-P. Chen, “Zero Skew Clock-Tree Optimization with Buffer Insertion/Sizing and Wire Sizing”, IEEE TCAD 23(4) (2004), pp. 565-572.
[41] A. Vittal and M. Marek-Sadowska, “Low-Power Buffered Clock Tree Design”, IEEE TCAD 16(9) (1997), pp. 965-975.
[42] N. Y. Zhou, P. Restle, J. Palumbo, J. Kozhaya, H. Qian, Z. Li, C. J. Alpert and C. Sze, “PACMAN: Driving Nonuniform Clock Grid Loads for Low-Skew Robust Clock Network”, Proc. SLIP, 2014.
[43] CAD/CAM/CAE Wallchart. http://www.garysmitheda.com/wp-content/uploads/2015/05/All WC-15.pdf
[44] Cadence Innovus User Guide, http://www.cadence.com
[45] IBM ILOG CPLEX. www.ilog.com/products/cplex/
[46] Si2 OpenAccess. http://www.si2.org/?page=69
[47] OpenCores: Open Source IP-Cores, http://www.opencores.org
[48] OpenMP Architecture Review Board.
[49] Synopsys PrimeTime User Guide, http://www.synopsys.com
[50] Synopsys HSPICE User Guide, http://www.synopsys.com
[51] Synopsys Design Compiler User Guide, http://www.synopsys.com

Kwangsoo Han received B.S. and M.S. degrees in electrical engineering from Hanyang University, Seoul, Korea. He joined the VLSI CAD Laboratory, University of California at San Diego, as a Ph.D. student in September 2013. His current research interests include design for manufacturability and VLSI physical design optimization.

Andrew B. Kahng is a professor in the Computer Science and Engineering Department and the Electrical and Computer Engineering Department of the University of California at San Diego. His interests include IC physical design, the design-manufacturing interface, combinatorial optimization, and technology roadmapping. He received the Ph.D. degree in Computer Science from the University of California at San Diego.

Jiajia Li received the B.S. degree in software engineering from Shenzhen University, China, in 2011; and the M.S. degree in electrical engineering from the University of California at San Diego, La Jolla, in 2013. He is currently pursuing the Ph.D. degree at the University of California at San Diego. He joined the VLSI CAD Laboratory, University of California at San Diego, in April 2012. His current research interests include physical design and signoff optimization, margin reduction and low-power design.
