Generalised H-Tree
0278-0070 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2018.2889756, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
and do not explore the skew vs. cost tradeoff. Furthermore, few works adapt their tree construction approaches to, and validate their solution quality with, commercial P&R tools and realistic design blocks. Other works [5] [16] are ECO-based incremental optimizations based on an initial clock tree solution generated by commercial P&R tools. The objective functions of these works differ from ours. Chan et al. [5] minimize skew at the top-level, whereas Han et al. [16] minimize skew variation across corners.

A Motivating Analysis. To motivate our main studies below, we briefly illustrate the tradeoffs among clock tree wirelength, global skew and maximum clock latency seen across GH-tree topologies with various branching patterns. Before doing so, we summarize in Table I the terminology and notation used in the remaining discussion.

Given layout region area $W \times H$, we analyze the wirelength of a GH-tree with branching pattern $(b_1, b_2, b_3, ..., b_P)$. We assume that (i) at any level, the region area is uniformly split into sink regions (i.e., regions that contain downstream sinks) according to the branching factor; (ii) the root of a sink region is located at the center of the sink region; (iii) branching factor $b_p$ at any level $p$ is always an even number; and (iv) the GH-tree always starts with a horizontal segment at the top level. Based on these assumptions, the wirelength of a horizontal (vertical) wire segment $w_p$ ($h_p$) at level $p$ is calculated as (see Footnote 2)

$$w_p = \frac{b_p - 1}{\prod_{i=1}^{(p+1)/2} b_{2i-1}} \cdot W, \qquad h_p = \frac{b_p - 1}{\prod_{i=1}^{p/2} b_{2i}} \cdot H \tag{1}$$

The total wirelength is calculated as

$$L = \sum_{k=1}^{\lceil P/2 \rceil} \frac{b_{2k-1} - 1}{b_{2k-1}} \Big[ \prod_{i=1}^{k-1} b_{2i} \Big] \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \frac{b_{2k} - 1}{b_{2k}} \Big[ \prod_{i=1}^{k} b_{2i-1} \Big] \cdot H \tag{2}$$

Assuming a linear wire delay model and ignoring buffering, we derive the maximum and minimum (linear-delay) clock latency from clock source to any sink as

$$t_{max} = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{b_{2k-1} - 1}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{b_{2k} - 1}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{3}$$

$$t_{min} = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{1}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{1}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{4}$$

The maximum global skew $\omega$ is then defined as the difference between $t_{max}$ and $t_{min}$.

$$\omega = \frac{1}{2} \cdot \Big[ \sum_{k=1}^{\lceil P/2 \rceil} \Big( \frac{b_{2k-1} - 2}{\prod_{i=1}^{k} b_{2i-1}} \Big) \cdot W + \sum_{k=1}^{\lfloor P/2 \rfloor} \Big( \frac{b_{2k} - 2}{\prod_{i=1}^{k} b_{2i}} \Big) \cdot H \Big] \tag{5}$$

Figures 2(a) and 2(b) respectively show skew-wirelength and latency-wirelength tradeoffs in GH-trees with various branching patterns. We calculate skew and latency based on a linear delay model (see Footnote 3) according to Equations (3) and (5). The figures also show the Pareto frontier of non-dominated points of each tradeoff. Figures 2(c) and 2(d) respectively show skew-power and latency-power tradeoffs of buffered GH-trees with various branching patterns, as reported by a commercial static timing analysis tool with foundry 28LP technology and library models. The Pareto frontier of each tradeoff is shown as the black curve. We observe from the figures that different branching patterns lead to wide-ranging skew-power and/or latency-power tradeoffs. In this work, we explore branching patterns via dynamic programming to optimize tradeoffs among skew, latency and power. Figure 2(c) shows that with the same branching pattern (e.g., (10, 4, 2, 12)), different buffering solutions can lead to more than 20% skew difference with similar clock power. We therefore perform co-optimization of tree topology (i.e., branching pattern) and buffering to minimize clock power, skew and latency.

Fig. 2: Study of motivating tradeoffs. (a) Linear delay skew vs. wirelength; and (b) linear delay clock latency vs. wirelength with different branching patterns (256 sinks and region area = 100µm × 100µm). (c) Skew vs. clock power; and (d) maximum latency vs. clock power, in buffered GH-trees for a testcase with 17K sinks and region area = 380µm × 380µm. In red are dominated points; in black are Pareto frontiers and non-dominated points.

Footnote 2: Note that $(p + 1)/2$ in Equation (1) is always an integer since we create horizontal segments ($w_p$) and vertical segments ($h_p$) in alternation, and always start with a horizontal segment from the top. Thus, all horizontal segments are created when $p$ is an odd number.

Footnote 3: The linear wire delay model approximates clock latency with relatively low accuracy. However, we give this motivating analysis using the simple linear delay model to more intuitively illustrate the tradeoff among clock power, skew and latency with different GH-tree topologies. Below, we apply comprehensive buffer and wire delay modeling in our optimization to demonstrate the benefits of our proposed approach.

III. OUR APPROACH

We now describe our problem formulation and our approach. Based on the motivating examples shown in Figure 2, we construct GH-trees to explore the tradeoff among skew, latency and clock power. Our construction comprehends the delay and power impact of buffer insertion, sink placement and multiple constraints (e.g., maximum transition time and
maximum load capacitance). More specifically, we address the following GH-tree construction problem:

Given: a placement solution (i.e., a layout region $W \times H$ and placement of sinks), number of sink regions $N$ such that each region contains <40 sinks (see Footnote 4 below), timing library of clock buffers (.lib), maximum clock skew constraint Ω, maximum clock latency constraint T, maximum transition time constraint Γ (i.e., at both sinks and clock buffer input pins), and maximum load capacitance constraint C.

Perform: GH-tree construction with co-optimization of the clock tree topology (i.e., branching patterns) and buffering to minimize clock power, subject to the given constraints.

Figure 3 shows our overall GH-tree construction flow. An instance consists of a post-placement layout, the number of sink regions, and constraints, along with pre-characterized technology- and library-specific lookup tables (LUTs) containing power and delay information of candidate buffering solutions (i.e., segment length and buffer sizes). We perform the GH-tree construction primarily through two main steps: (i) according to the total sink capacitance and layout region area and aspect ratio, we first formulate a dynamic programming problem to co-optimize branching pattern and buffering; and (ii) we then perform balanced K-means clustering and formulate an integer linear programming problem to determine clock buffer placement (i.e., to embed our generalized GH-tree structure into the given sink placement). We note that the key step is the first step (i.e., DP-based co-optimization of clock topology and buffering) that systematically explores the continuum between H-tree and spine to achieve an optimal tradeoff among clock power, skew and latency (within this regime). Last, we realize our GH-tree solution in a commercial P&R tool and report metrics (e.g., skew, latency, power, etc.) to assess solution quality.

Although our approach systematically optimizes the tradeoff among clock power, latency and skew in the GH-tree regime, our method has two limitations that we highlight. First, we do not co-optimize tree topology and buffering together with sink placement, due to large runtime complexity. Second, our approach does not consider clock gating cells in a clock tree. Therefore, an improved resolution of the “chicken-and-egg” loop between (i) placement of roots of sink regions and (ii) top-level tree topology optimization, as well as consideration of clock gating cells during the optimization, remain as open research directions.

Fig. 3: Overall flow of GH-tree construction.

A. LUT Characterization

We characterize LUTs based on simulations using Synopsys PrimeTime [49] as inputs for our DP-based optimization. These LUTs contain power, input capacitance, slew propagation, and delay information of buffered and unbuffered wire segments. We use four types of buffers (X50, X67, X100, X134 from the 28LP libraries). In this technology, the gate area of an X134 buffer is ∼7× the gate area of a minimum-size (X2) buffer. We also use “ganged buffers” (i.e., two, four or six X134 buffers with shorted inputs and shorted outputs) to achieve higher driving strengths. We create wire segments of lengths 15µm, 30µm, 45µm, 60µm, 75µm,
and 90µm (see Footnote 4). Along these wire segments, we enumerate all possible buffering solutions with the minimum granularity of 15µm (i.e., the minimum distance between two buffers is 15µm). The granularity of LUTs (e.g., number of buffer candidates, minimum wire segment length) will determine the tradeoff between optimization runtime and solution quality. In this work, we empirically select the granularity of our LUTs to achieve improved solution quality compared to two commercial tools, while using comparable runtime. For example, with a 45µm segment length, a minimum buffering distance of 15µm and seven buffer sizes, there are $8^3$ (i.e., no buffer or exactly one of the seven buffer sizes, at each of the three buffering locations) distinct buffering solutions. Note that our LUTs include unbuffered solutions, i.e., pure-wire solutions. We consider both 1W1S (single-width, single-spacing) and 2W2S (double-width, double-spacing, which we understand to be the common non-default routing rule (NDR) for clock distribution) wire segments in our characterization. We vary the input slew from 5ps to 60ps in steps of 5ps and we vary output load from 1fF to 150fF in steps of 1fF from 1fF to 5fF, and in steps of 5fF from 5fF to 150fF. For each 3-tuple of (distance, input slew, output load) we obtain the buffering solution (including input capacitance) and the output slew. From the large number of possible solutions, we prune solutions as follows. For each (distance, input capacitance, output slew, output load) 4-tuple, we select three delay values at the 10th, 50th and 90th percentiles of the delay range, and then select the minimum-power solution for each of these three delay values. Figure 4 shows an example of our pruning on LUTs, in which we select minimum-power solutions with different output load values. Red (resp. blue) dots are the buffering solutions with output load = 75fF (resp. 35fF). Cross (x) points are the selected buffering solutions.

Fig. 4: Example of pruning for buffering solutions; distance = 45µm, output slew = 35ps.

In practice, we have found that this pruning reduces the number of buffering solutions by ∼94% at the cost of only ∼5% solution quality (i.e., in terms of skew or latency) loss.

Footnote 4: We use LUTs based on multiple short wire segments to estimate the delay and slew propagation of a long wire segment. Since we match the output load, input capacitance as well as output and input slew values of two consecutive short segments to form a long segment, the estimation error is negligible. The small estimation error comes from discreteness of capacitance and slew values.

B. DP-Based Co-optimization of Clock Topology and Buffering

Based on the characterized LUTs, we determine the optimal branching pattern along with the buffering solution for GH-tree using dynamic programming (DP). Other inputs to our optimization are layout region, placement of sinks, the number of sink regions, and the maximum skew and maximum latency constraints. A sink region typically contains <40 sinks in our optimization. Our objective is to minimize the clock power while satisfying the given maximum skew and maximum latency constraints. As discussed, in this step, we assume that the sink regions are induced from a uniform placement of sinks, and that branching points are always located at the center of the corresponding sink region. We understand that the sink regions are typically not uniform for a real placement solution. However, due to high runtime complexity, it is practically infeasible for our current approach to consider sink placement during our DP-based optimization. We therefore assume uniform sink regions during our DP-based GH-tree construction. We then embed our DP-based solution (without solution quality degradation) into the given (real) sink placement based on sink clustering and branching point displacement. We formulate our DP in a high-dimensional solution space with seven dimensions (i.e., with respect to seven essential parameters of a clock tree optimization) and construct our GH-tree in a bottom-up way. The seven dimensions are {clock tree depth ($P$), region area (width ($w$) and height ($h$)), number of sink regions ($n$), maximum and minimum clock latencies ($t_{max}$ and $t_{min}$), and input capacitance (i.e., the load capacitance seen from the root)}.

Algorithm 1 describes our optimization procedure; see also the illustration in Figure 5. We first construct GH-trees for the base case, that is, trees with depth = 1 over all different region areas (i.e., $w \times h$) and numbers of sink regions (i.e., $n$) (Line 1). As an example, Figure 5 shows the solutions at level $p$ (i.e., subtree with depth = 1) with different region areas (i.e., 5 × 5 (red), 10 × 10 (green) and 15 × 15 (purple)). Procedure build_base_trees($w$, $h$, $n$, ∅) constructs GH-trees with depths of one (i.e., spines with different buffering solutions) within a $w \times h$ region and with a branching factor of $b_p$. Following [37], we use the term spine to denote one horizontal or vertical wire segment in the clock tree. Note that at the bottom level, $b_p = n$. As illustrated in Figure 5, we optimize the buffering solution along the spine based on characterized LUTs. Optimization of each tree segment along the spine generates a minimum-power Pareto surface in the high-dimensional space indexed by the LUT input parameters (e.g., maximum and minimum latencies, and input capacitance). The optimization eventually results in multiple subtree solutions. We store these subtree solutions in a set $R$ indexed by tree depths, region area, number of sink regions, maximum and minimum clock latencies, and input capacitance. In other words, $R$ is a set of subtree solutions along with their depth, region area, number of sink regions, clock latency, input capacitance and power information corresponding to the minimum-power Pareto surface.

Next, we recursively search for the optimal (i.e., minimum-
maximum clock latency constraint. We set $w_{int}$, $h_{int}$ < 5µm and $\gamma_{int}$ = 1ps in our experiments. To reduce the runtime, we apply the following pruning techniques.
• Pruning with number of leaf regions. For a given sub-region of size $w \times h$ we prune solutions with number of leaf regions greater than $\frac{N \cdot w \cdot h}{W \cdot H}$.
• Pruning with skew/latency constraints. We prune solutions that have skew larger than the maximum skew constraint or maximum latency larger than the latency upper bound.
• Pruning with maximum fanout constraint. We prune solutions that have branching factor larger than the maximum fanout constraint (see Footnote 5).

on the clustering solution (see Footnote 6). We note that our approach is different from conventional top-down clock tree construction methods (e.g., Planar-DME [19]), in that (i) we embed an optimized clock tree topology with buffering solutions to a layout region with given sink placements, and (ii) we balance load capacitance among sink regions. By maintaining the distances between consecutive branching points and buffers at each clock level as well as balancing the load capacitance among sink regions, we preserve the solution quality (i.e., skew and latency) of the GH-tree solution obtained by the DP-based optimization. Furthermore, we understand that the optimal GH-tree topology and buffering solution can vary across different sink placements. With this in mind, we keep the best $M$ solutions from our DP-based optimization and select the minimum-power solution for the given (actual) sink placement as our final solution. Based on our preliminary experiments, we empirically use $M = 5$ to generate the results reported below, where increasing $M$ beyond 5 will not improve the solution quality significantly.
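The three pruning rules and the top-M selection described in this subsection can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the `Candidate` record, its field names, and the helpers `keep` and `best_m` are hypothetical, the leaf-region bound is taken to be N·w·h/(W·H), and "best" is assumed to mean lowest clock power.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one DP subtree solution."""
    w: float           # sub-region width (um)
    h: float           # sub-region height (um)
    leaf_regions: int  # number of leaf (sink) regions covered
    branching: int     # branching factor at the subtree root
    t_max: float       # maximum latency (ps)
    t_min: float       # minimum latency (ps)
    power: float       # clock power (mW)


def keep(c, *, W, H, N, max_skew, max_latency, max_fanout):
    """Return True if candidate c survives all three pruning rules."""
    # Rule 1: leaf-region count must not exceed the area-proportional share.
    if c.leaf_regions > N * c.w * c.h / (W * H):
        return False
    # Rule 2: skew and maximum latency must respect the given bounds.
    if c.t_max - c.t_min > max_skew or c.t_max > max_latency:
        return False
    # Rule 3: branching factor must respect the maximum fanout.
    return c.branching <= max_fanout


def best_m(candidates, m=5):
    """Keep the m lowest-power candidates (assumed ranking; M = 5 in the text)."""
    return sorted(candidates, key=lambda c: c.power)[:m]


# Illustrative values only; none of these numbers come from the paper.
limits = dict(W=100.0, H=100.0, N=16, max_skew=20.0, max_latency=200.0, max_fanout=8)
cands = [
    Candidate(50.0, 50.0, 4, 4, 120.0, 110.0, 2.0),   # satisfies all rules
    Candidate(50.0, 50.0, 6, 4, 120.0, 110.0, 1.5),   # too many leaf regions
    Candidate(50.0, 50.0, 4, 4, 150.0, 110.0, 1.0),   # skew of 40ps exceeds 20ps
    Candidate(50.0, 50.0, 4, 12, 120.0, 110.0, 0.5),  # fanout of 12 exceeds 8
]
survivors = [c for c in cands if keep(c, **limits)]
```

In an actual flow the surviving candidates would then be re-evaluated on the real sink placement before the final minimum-power choice; that step is outside this sketch.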
box HPWL in our linear wire capacitance estimation model (see Footnote 9).

LP formulation for branching point location. Based on the two ILPs described above, we obtain the clustering solution at a given clock level. However, the initial assumption of branching locations to be at the center of a region may cause large skew if the sinks within a cluster are placed non-uniformly. To address this issue, we formulate a linear program (LP) to place each branching point close to the weighted center of each cluster, as follows.

Here, $(x_j^f, y_j^f)$ is the location of the $j$-th buffer at a given clock level. $\psi_{j,q}^{\{llx,lly,urx,ury\}}$ are binary indicator variables which indicate whether the $j$-th buffer is located outside the corresponding boundaries of the blockages. The last inequality in (22) defines the constraint that at least one of the indicator variables must be true. Satisfying this constraint implies that the $j$-th buffer is not in the bounding box of the $q$-th blockage. Note that with the Constraints (22), the problem becomes an integer linear program (ILP).
Synopsys Design Compiler H-2013.03-SP3 [51]. Table II summarizes the instance count, number of clock sinks, placement utilization and timing constraints for our testcases. To study the impact of maximum skew and latency constraints on power, we run our optimizer with different maximum skew and latency constraints and collect discrete solution points showing skew-power and latency-power tradeoff. We obtain reference clock solutions with multi-corner multi-mode (MCMM) optimization; we define our mode and corner settings in Table III. We also apply late and early deratings of 1.1 and 0.9 to model OCV effects. We use Synopsys HSPICE G-2012.06-SP1 [50] to perform timing and power analysis.

A. Comparison with Trees from State-of-the-art CTS

Figure 8 and Figure 9 respectively compare skew and clock power, and maximum latency and clock power, of GH-tree solutions to those from the two commercial tools and one academic flow [39]. Table IV compares #buffers, buffer area, max (insertion) delay across corners, and wirelength among clock tree solutions as well as optimization runtime. All flows use the same sink placement solution as input. We apply the same setups (i.e., clock buffer cells (X50, X67, X100, X134 and ganged buffers), BEOL layers (M3 and M4), maximum transition constraints (60ps)) to our GH-tree construction and to both commercial and academic tool flows. We sweep the maximum skew and maximum latency constraints on the four designs for the GH-tree constructions. Blue curves in the figures are skew-power and latency-power Pareto curves of our GH-tree solutions (see Footnote 12).

Footnote 12: We estimate the Pareto curve based on discrete solution points due to limited computing resources. In addition, since the tradeoff between skew/max latency versus clock power is monotone, we feel that three solution points can provide a useful estimation of the tradeoff.

Overall analysis. Our results show that our GH-tree solutions achieve significant clock power reduction with similar or reduced skew and latency values as compared to the solutions from both commercial and academic tools. We also include the conventional “strict” H-tree solution as a comparison. Note that we use the same methodology to determine the buffer locations and sizes in “strict” H-tree and GH-tree constructions. In other words, it is unnecessary to add a buffer at each branching point. For example, we achieve power reductions of 30% on B19 and 20% on LEON3MP in Table IV compared to commercial tools’ results. Moreover, we observe that due to the symmetric topology of our GH-tree, our GH-tree solution is typically more robust against skew variation across different corners as compared to the clock trees from other tools. As an example, skew of the clock tree solution from Tool2 increases by 137% on design VGA blockage between two corners; by contrast, our solutions generally have similar skew values across corners. The conventional “strict” H-tree achieves the minimum clock skew but at the cost of larger power, buffer area and wirelength. As an example, for LEON3MP, H-tree has depth $P = 12$, but GH-tree has $P = 10$ (i.e., branching factor = (2, 2, 2, 2, 2, 4, 4, 2, 2, 2)). The shallower depth of GH-tree significantly reduces the number of buffers and wirelength. We also validate our GH-tree optimization on design VGA with high floorplan aspect ratio and existence of placement blockages (shown in Figure 15), where we observe similar clock power and latency, but larger skew, compared to the case without blockage and high floorplan aspect ratio. Compared to the results of [39], our GH-tree achieves much smaller power (i.e., up to 55% on B19) and skew, but at the cost of larger maximum latency.

Power analysis. We also observe that our GH-tree solutions have smaller number of buffers and clock wirelength as compared to solutions from commercial and academic tools. Figure 10 further shows a histogram of clock buffer power (i.e., sum of internal, leakage and dynamic power) values of our GH-tree versus corresponding values from a commercial tool’s solution. We observe that our DP-based optimization, which can select its buffering solution “optimally” based on the characterized LUTs, achieves smaller buffer power values for most of the clock buffers. In other words, our GH-tree optimization is able to achieve optimized load capacitance of buffers as well as slew propagation for reduced clock power. The power information from our LUTs enables power-awareness in our DP-based GH-tree construction.

Fig. 10: Distribution of clock buffer power values from GH-tree and commercial tool’s solution. Design: VGA.

Runtime analysis. Results in Table IV show that although the naive worst-case time complexity of our (DP- and ILP-based) optimization is high (cf. the nested for loops in Algorithm 1), the pruning techniques and empirically selected granularity of our LUTs (e.g., 15µm for distance, 5ps for slew, and 5fF for capacitance) make the runtime of our optimization comparable to those of commercial tools. The large runtime of design LEON3MP mainly comes from bottom-level tree construction (∼40 minutes) and ECO routing according to our GH-tree solution using OpenAccess [46] (∼15 minutes). The actual GH-tree construction runtime (i.e., DP + ILP runtime) for design LEON3MP is only ∼25 minutes. We understand that such runtime is very acceptable in light of the potential clock power benefits from our approach.

Robustness analysis. We further perform Monte Carlo simulation on our GH-tree solution and those from commercial tools, and compare the resultant variation in clock skew and power. Figure 11 shows that our GH-tree solution exhibits relatively smaller variation in clock skew (i.e., ∼35ps) compared to commercial tools’ solutions (i.e., ∼40ps).
Fig. 8: Power and skew comparisons among GH-tree, Tool1, Tool2 and [39] for four testcases.
Fig. 9: Power and maximum latency comparisons among GH-tree, Tool1, Tool2 and [39] for four testcases.
Furthermore, all of Tool1, Tool2 and GH-tree solutions have small power variation.

B. Impact of NDR

We now summarize observed impacts of non-default rules (NDRs) on clock tree solution quality. We generate GH-trees with various NDR options: (i) 1W1S only, (ii) 2W2S only, and (iii) the combination of 1W1S and 2W2S. We set the maximum skew constraint to 200ps and compare Pareto curves of the latency versus clock power tradeoffs. Figure 12 shows the comparison for the JPEG testcase at 28LP technology. Due to better slew propagation, solutions with 2W2S have fewer clock buffers as compared to the 1W1S solutions (i.e., the average numbers of clock buffers are respectively 83 and 111 in 2W2S- and 1W1S-only solutions). Further, solutions that permit either 1W1S or 2W2S at each level (of clock subnets) are able to achieve a better tradeoff between latency and power.
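Pareto curves such as those compared here can be extracted from a set of discrete (power, latency) solution points by discarding dominated points. The sketch below is a generic non-dominated filter in Python, not the authors' tooling; the function name `pareto_frontier` and the sample points are made up for illustration.

```python
def pareto_frontier(points):
    """Return the non-dominated (power, latency) points, sorted by power.

    Both metrics are minimized. A point is dominated if another point is
    no worse in both metrics and strictly better in at least one.
    """
    frontier = []
    best_latency = float("inf")
    # Sort by power (ties broken by latency), then sweep once: a point is
    # kept only if it improves on the best latency seen at lower power.
    for power, latency in sorted(set(points)):
        if latency < best_latency:
            frontier.append((power, latency))
            best_latency = latency
    return frontier


# Hypothetical (clock power in mW, maximum latency in ps) solution points.
samples = [(1.0, 200.0), (1.2, 190.0), (1.5, 195.0), (2.0, 180.0), (1.2, 210.0)]
front = pareto_frontier(samples)
```

Applying the same filter separately to the 1W1S-only, 2W2S-only and mixed solution sets would give one curve per NDR option for comparison.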
TABLE IV: Comparison between clock tree solutions from [39], Tool1 and Tool2 versus our GH-trees. Technology: 28LP.

Testcase | Flow | C1 max laten. (ps) | C1 skew (ps) | C1 clk power (mW) | C2 max laten. (ps) | C2 skew (ps) | C2 clk power (mW) | #Buffers | Buf area (µm2) | Clk WL (mm) | Runtime (min)
B19 | Tool1 | 150 | 17 | 3.8 | 110 | 27 | 9.3 | 104 | 227 | 15688 | 15
B19 | Tool2 | 222 | 29 | 3.4 | 129 | 5 | 8.3 | 84 | 245 | 12996 | 11
B19 | [39] (min pwr) | 111 | 40 | 5.6 | 63 | 40 | 12.1 | 211 | 338 | N/A | N/A
B19 | [39] (min skew) | 125 | 36 | 6.0 | 78 | 38 | 13.0 | 228 | 413 | N/A | N/A
B19 | GH-tree | 170 | 12.5 | 2.6 | 116 | 25.8 | 6.4 | 41 | 106 | 12242 | 15
B19 | H-tree | 166 | 7 | 3.1 | 106 | 11 | 7.6 | 147 | 227 | 13941 | 16
JPEG | Tool1 | 196 | 26 | 6.9 | 131 | 26 | 16.8 | 160 | 345 | 20967 | 16
JPEG | Tool2 | 236 | 34 | 6.2 | 141 | 18 | 15.2 | 120 | 352 | 18432 | 14
JPEG | [39] (min pwr) | 179 | 65 | 9.2 | 103 | 73 | 19.7 | 340 | 651 | N/A | N/A
JPEG | [39] (min skew) | 155 | 30 | 9.4 | 92 | 36 | 20.4 | 353 | 676 | N/A | N/A
JPEG | GH-tree | 201 | 19 | 5.9 | 129 | 17 | 14.5 | 147 | 296 | 20009 | 17
JPEG | H-tree | 229 | 12 | 6.6 | 150 | 16 | 16.3 | 169 | 456 | 20064 | 14
VGA | Tool1 | 260 | 36 | 20.7 | 152 | 7 | 55.3 | 464 | 1119 | 57678 | 16
VGA | Tool2 | 314 | 28 | 18.0 | 201 | 11 | 48.5 | 369 | 1047 | 56305 | 22
VGA | [39] (min pwr) | 171 | 52 | 24.0 | 114 | 73 | 52.1 | 911 | 1651 | N/A | N/A
VGA | [39] (min skew) | 171 | 52 | 24.0 | 114 | 73 | 52.1 | 911 | 1651 | N/A | N/A
VGA | GH-tree | 238 | 19 | 17.4 | 174 | 41 | 44.2 | 331 | 1036 | 57404 | 21
VGA | H-tree | 253 | 16 | 20.4 | 162 | 19 | 55.0 | 597 | 1682 | 62957 | 16
LEON3MP | Tool1 | 426 | 63 | 109.5 | 276 | 25 | 195.9 | 2661 | 6654 | 369737 | 54
LEON3MP | Tool2 | 633 | 34 | 102.5 | 421 | 58 | 184.2 | 2509 | 7225 | 367854 | 37
LEON3MP | [39] (min pwr) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
LEON3MP | [39] (min skew) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
LEON3MP | GH-tree | 415 | 24 | 87.2 | 285 | 35 | 157.3 | 1331 | 4154 | 374568 | 101
LEON3MP | H-tree | 415 | 22 | 96.0 | 287 | 24 | 173.0 | 2399 | 6741 | 393582 | 99
VGA blockage | Tool1 | 245 | 23 | 21.3 | 174 | 36 | 56.7 | 475 | 1148 | 66323 | 14
VGA blockage | Tool2 | 347 | 57 | 17.6 | 212 | 24 | 47.5 | 401 | 1127 | 64640 | 24
VGA blockage | [39] (min pwr) | 298 | 133 | 29.9 | 208 | 145 | 54.5 | 1118 | 2154 | N/A | N/A
VGA blockage | [39] (min skew) | 252 | 86 | 34.4 | 157 | 93 | 74.3 | 1291 | 2387 | N/A | N/A
VGA blockage | GH-tree | 239 | 36 | 16.9 | 163 | 31 | 45.4 | 293 | 815 | 68635 | 19
VGA blockage | H-tree | 282 | 22 | 20.8 | 174 | 21 | 56.0 | 599 | 1685 | 72636 | 16
VGA high AR | Tool1 | 231 | 19 | 20.5 | 164 | 27 | 54.7 | 456 | 1094 | 59506 | 14
VGA high AR | Tool2 | 325 | 33 | 18.5 | 211 | 43 | 49.8 | 395 | 1120 | 58114 | 21
VGA high AR | [39] (min pwr) | 161 | 28 | 25.1 | 105 | 74 | 54.5 | 956 | 1679 | N/A | N/A
VGA high AR | [39] (min pwr) | 161 | 28 | 25.1 | 157 | 93 | 54.5 | 956 | 1679 | N/A | N/A
VGA high AR | GH-tree | 265 | 33 | 17.5 | 187 | 30 | 47.3 | 299 | 1039 | 57855 | 21
VGA high AR | H-tree | 271 | 15 | 20.4 | 169 | 12 | 54.8 | 661 | 1669 | 65181 | 22
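As a quick arithmetic check, the relative power reductions quoted in the text can be recomputed from the clock power columns of Table IV. The Python snippet below is only a worked calculation; the dictionary layout and the helper name `reduction_pct` are illustrative, and the numbers are copied from the B19 and LEON3MP rows of the table.

```python
# Clock power (mW) at corners C1 and C2, copied from Table IV.
power = {
    "B19": {"Tool1": (3.8, 9.3), "Tool2": (3.4, 8.3), "GH-tree": (2.6, 6.4)},
    "LEON3MP": {"Tool1": (109.5, 195.9), "Tool2": (102.5, 184.2), "GH-tree": (87.2, 157.3)},
}


def reduction_pct(reference, ours):
    """Percentage power reduction of `ours` relative to `reference`."""
    return 100.0 * (reference - ours) / reference


for design, flows in power.items():
    for tool in ("Tool1", "Tool2"):
        pairs = zip(flows[tool], flows["GH-tree"])
        for corner, (ref, gh) in zip(("C1", "C2"), pairs):
            print(f"{design} vs {tool} at {corner}: {reduction_pct(ref, gh):.1f}%")
```

Relative to Tool1, the reductions come out at roughly 31% for B19 and 20% for LEON3MP at both corners, in line with the figures quoted in the text; relative to Tool2 they are smaller.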
[8] Y.-Y. Chen, C. Dong and D. Chen, “Clock Tree Synthesis Under Aggressive Buffer Insertion”, Proc. DAC, 2010, pp. 86-89.
[9] J. Cong, A. B. Kahng, C. K. Koh and C.-W. A. Tsao, “Bounded-Skew Clock and Steiner Routing”, ACM TODAES 3(3) (1998), pp. 341-388.
[10] C. Deng, Y.-C. Cai and Q. Zhou, “Register Clustering Methodology for Low Power Clock Tree Synthesis”, J. Computer Science and Technology 30(2) (2015), pp. 391-403.
[11] D. Dolev, M. Fugger, C. Lenzen, M. Perner and U. Schmid, “HEX: Scaling Honeycombs is Easier Than Scaling Clock Trees”, Proc. SPAA, 2013, pp. 164-175.
[12] R. Ewetz and C.-K. Koh, “Cost-Effective Robustness in Clock Networks Using Near-Tree Structures”, IEEE TCAD 34(4) (2015), pp. 515-528.
[13] M. R. Guthaus, D. Sylvester and R. B. Brown, “Clock Buffer and Wire Sizing using Sequential Programming”, Proc. DAC, 2006, pp. 1041-1046.
[14] M. R. Guthaus, G. Wilke and R. Reis, “Non-uniform Clock Mesh Optimization with Linear Programming Buffer Insertion”, Proc. DAC, 2010, pp. 74-79.
[15] M. R. Guthaus, G. Wilke and R. Reis, “Revisiting Automated Physical Synthesis of High-performance Clock Networks”, ACM TODAES 18(2) (2013), pp. 31:1-31:27.
[16] K. Han, A. B. Kahng, J. Lee, J. Li and S. Nath, “A Global-Local Optimization Framework for Simultaneous Multi-Mode Multi-Corner Skew Variation Reduction”, Proc. DAC, 2015, pp. 26:1-26:6.
[17] J. Hu, A. B. Kahng, B. Liu, G. Venkataraman and X. Xu, “A Global Minimum Clock Distribution Network Augmentation Algorithm for Guaranteed Clock Skew Yield”, Proc. ASP-DAC, 2007, pp. 24-31.
[18] S. Jang, Samsung Electronics, personal communication, May 2015.
[19] A. B. Kahng and C.-W. A. Tsao, “Planar-DME: Improved Planar Zero-Skew Clock Routing with Minimum Pathlength Delay”, Proc. Euro DAC, 1994, pp. 440-445.
[20] I.-M. Liu, T.-L. Chou, A. Aziz and D. F. Wong, “Zero-Skew Clock Tree Construction by Simultaneous Routing, Wire Sizing and Buffer Insertion”, Proc. ISPD, 2000, pp. 33-38.
[21] D. Liu and C. Svensson, “Power Consumption Estimation in CMOS VLSI Circuits”, IEEE J. Solid-State Circuits 29(6) (1994), pp. 663-670.
[22] F. Minami and M. Takano, “Clock Tree Synthesis Based on RC Delay Balancing”, Proc. CICC, 1992, pp. 28.3.1-28.3.4.
[23] A. D. Mehta, Y.-P. Chen, N. Menezes, D. F. Wong and L. T. Pileggi, “Clustering and Load Balancing for Buffered Clock Tree Synthesis”, Proc. ICCD, 1997, pp. 217-223.
[24] M. M. Ozdal, C. Amin, A. Ayupov, S. M. Burns, G. R. Wilke and C. Zhuo, “ISPD-2012 Discrete Cell Sizing Contest and Benchmark Suite”, Proc. ISPD, 2012, pp. 161-164, http://archive.sigda.org/ispd/contests/12/ispd2012 contest.html.
[25] M. Pedram, “Power Minimization in IC Design: Principles and Applications”, ACM TODAES 1(1) (1996), pp. 3-56.
[26] A. Rajaram and D. Z. Pan, “Variation Tolerant Buffered Clock Network Synthesis with Cross Links”, Proc. ISPD, 2006, pp. 157-164.
[27] L. Rakai, A. Farshidi, L. Behjat and D. Westwick, “Buffer Sizing for Clock Networks using Robust Geometric Programming Considering Variations in Buffer Sizes”, Proc. ISPD, 2013, pp. 154-161.
[28] R. R. Rao, D. Blaauw, D. Sylvester, C. J. Alpert and S. Nassif, “An Efficient Surface-Based Low-Power Buffer Insertion Algorithm”, Proc. ISPD, 2005, pp. 86-93.
[29] P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovick, B. L. Krauter, and B. D. McCredie, “A Clock Distribution Network for Microprocessors”, IEEE J. Solid-State Circuits 36(5) (2001), pp. 792-799.
[30] J. Reuben, H. M. Kittur and M. Shoaib, “A Novel Clock Generation Algorithm for System-on-Chip Based on Least Common Multiple”, Computers and Electrical Engineering 40(7) (2014), pp. 2113-2125.
[31] J. Reuben, V. M. Zackriya, S. Nashit and H. M. Kittur, “Capacitance Driven Clock Mesh Synthesis to Minimize Skew and Power Dissipation”, IEICE Electronics Express 10(24) (2013), pp. 1-12.
[32] R. Samanta, J. Hu and P. Li, “Discrete Buffer and Wire Sizing for Link-Based Non-Tree Clock Networks”, IEEE TVLSI 18(7) (2010), pp. 1025-1035.
[33] V. Sathe, S. Arekapudi, A. Ishii, C. Ouyang, M. Papaefthymiou and S. Naffziger, “Resonant Clock Design for a Power-efficient, High-volume x86-64 Microprocessor”, Proc. ISSCC, 2012, pp. 140-149.
[34] H. Seo, J. Kim, M. Kang and T. Kim, “Synthesis for Power-Aware Clock Spines”, Proc. ICCAD, 2015, pp. 126-131.
[35] C. N. Sze, “ISPD 2010 High Performance Clock Network Synthesis Contest: Benchmark Suite and Results”, Proc. ISPD, 2010, pp. 143-143.
[36] H. Su and S. S. Sapatnekar, “Hybrid Structured Clock Network Construction”, Proc. ICCAD, 2001, pp. 333-336.
[37] S. Tam, “Modern Clock Distribution Systems” in Clocking in Modern VLSI Systems, New York, Springer, 2009.
[38] C.-W. A. Tsao and C.-K. Koh, “UST/DME: A Clock Tree Router for General Skew Constraints”, ACM TODAES 7(3) (2002), pp. 359-379.
[39] Y. Kim and T. Kim, “Algorithm for Synthesis and Exploration of Clock Spines”, Proc. ASP-DAC, 2017, pp. 263-268.
[40] J.-L. Tsai, T.-H. Chen and C. C.-P. Chen, “Zero Skew Clock-Tree Optimization with Buffer Insertion/Sizing and Wire Sizing”, IEEE TCAD 23(4) (2004), pp. 565-572.
[41] A. Vittal and M. Marek-Sadowska, “Low-Power Buffered Clock Tree Design”, IEEE TCAD 16(9) (1997), pp. 965-975.
[42] N. Y. Zhou, P. Restle, J. Palumbo, J. Kozhaya, H. Qian, Z. Li, C. J. Alpert and C. Sze, “PACMAN: Driving Nonuniform Clock Grid Loads for Low-Skew Robust Clock Network”, Proc. SLIP, 2014.
[43] CAD/CAM/CAE Wallchart. http://www.garysmitheda.com/wp-content/uploads/2015/05/All WC-15.pdf
[44] Cadence Innovus User Guide, http://www.cadence.com
[45] IBM ILOG CPLEX. www.ilog.com/products/cplex/
[46] Si2 OpenAccess. http://www.si2.org/?page=69
[47] OpenCores: Open Source IP-Cores, http://www.opencores.org
[48] OpenMP Architecture Review Board.
[49] Synopsys PrimeTime User Guide, http://www.synopsys.com
[50] Synopsys HSPICE User Guide, http://www.synopsys.com
[51] Synopsys Design Compiler User Guide, http://www.synopsys.com

Kwangsoo Han received B.S. and M.S. degrees in electrical engineering from Hanyang University, Seoul, Korea. He joined the VLSI CAD Laboratory, University of California at San Diego, as a Ph.D. student in September 2013. His current research interests include design for manufacturability and VLSI physical design optimization.

Andrew B. Kahng is a professor in the Computer Science and Engineering Department and the Electrical and Computer Engineering Department of the University of California at San Diego. His interests include IC physical design, the design-manufacturing interface, combinatorial optimization, and technology roadmapping. He received the Ph.D. degree in Computer Science from the University of California at San Diego.

Jiajia Li received the B.S. degree in software engineering from Shenzhen University, China, in 2011, and the M.S. degree in electrical engineering from the University of California at San Diego, La Jolla, in 2013. He is currently pursuing the Ph.D. degree at the University of California at San Diego. He joined the VLSI CAD Laboratory, University of California at San Diego, in April 2012. His current research interests include physical design and signoff optimization, margin reduction and low-power design.