
Shift Power Reduction in High-Performance Clock Network Designs

The document discusses techniques for reducing shift power in high-performance clock network designs, particularly in deep submicron VLSI circuits. It highlights the challenges posed by clock skew and insertion delays, and proposes a novel method that introduces intentional delays between clock entry points to mitigate peak power during scan shift operations. The proposed solution aims to maintain functional timing while effectively managing power consumption in complex clock architectures.


8th IEEE International Test Conference India (ITC India) 2024

Shift Power Reduction in High-Performance Clock Network Designs

Kamlesh Kishorbhai Bhesaniya, Omar Sharif Cherukur, Lakshmi Kandula, Ravishankar Chevuri, Pradeep Sreenivasa, Anoop Padmanabhan
ASIC Products Division, Broadcom Inc., Bangalore, India

2024 IEEE 8th International Test Conference India (ITC India) | 979-8-3503-5259-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/ITCINDIA62949.2024.10652198

Abstract—With the rapid development of deep submicron (DSM) high-performance VLSI circuits, building a robust clock tree with minimal insertion delays, skews and jitter has turned out to be challenging. Several techniques have been developed to reduce the clock insertion delay and clock skew. A few such techniques are Clock Mesh CTS, Multi Source CTS and Multi Entry CTS. These techniques help reduce insertion delays, skews and jitter across the design but can adversely affect power, since a significant section of the design driven by a clock source can toggle within a short time. The impact is higher during the scan shift phase, where all clock gating circuits are transparent in blocks targeted for ATPG. Due to the significant reduction in insertion delay and clock skew, the clock edge reaches all the sinks in the design quickly, resulting in a significant amount of activity within a short span around every shift clock edge. In this paper, we discuss the adverse effects of high-performance clock network schemes on shift power and techniques to save shift power in such scenarios. Results from the analysis of a few blocks are also shared. The solution described in this paper is applicable to high-performance designs where clock mesh clock tree network architectures are implemented.

Keywords—High-Performance Clock Network, CTS, Clock Mesh, Multi Source, Multi Entry, H-Clock Tree, Shift Power Reduction, Clock Skew, Scan Shift, ATPG

I. INTRODUCTION: CLOCK NETWORK ARCHITECTURES

The clock network is a highly researched topic in VLSI designs. At Deep Sub Micron (DSM) silicon technology levels, the clock network is facing new challenges. First, with the development of DSM VLSI design, the device feature size continues to shrink, the density of transistors and interconnect has increased significantly, and the circuit scale has grown. Second, interconnect delay has become a dominant factor in circuit performance, so strict synchronization of the clock network is difficult to achieve, which poses a challenge to the modeling and accuracy of clock routing. Furthermore, as the scale and frequency of the system increase, the power consumption of the whole system also increases. The clock network accounts for more than 1/3rd of the total power consumption of the circuit, so reducing the power consumption of the clock network is critical to minimizing the system's overall power consumption.

Two conventional clock distribution architectures are used to meet the timing requirements of the system. One is the traditional clock tree, which is widely used because of its low power consumption and low routing resource usage, as well as its ease of implementation and simulation [1]. The tree-based architecture can be very sensitive to Process, Voltage, and Temperature (PVT) variations, particularly in high-performance chip designs. Clock mesh, the other architecture, provides better tolerance to variations [2]. With a large number of mesh nodes and uneven loads, however, a clock mesh is hard to analyze and automate [3]. The multi-point entry clock tree is a novel clock distribution scheme that fills the methodology gap between the conventional clock tree and the clock mesh [4]. This solution reduces clock skew and mitigates timing violations, ultimately improving overall chip performance. These 3 types of clock architectures are shown in Fig. 1.

Fig. 1. Clock Network Architectures [5].

In these aggressive clock tree architectures, the clock tree build starts from the On-Chip Clock Controller (OCC) output and goes through the global clock tree and the coarse clock mesh or other automated mechanisms; multiple tap points are created across the layout, as shown in Fig. 2.

At the physical block level, a single clock pin during the DFT phase is cloned into multiple inputs during the clock tree build stage, and these inputs act synchronously. They are placed across the block based on the floorplan and location. This is illustrated in Fig. 3, which shows a block where the HCLK pin is cloned 8 times when it is tapped from the global clock tree mesh/network. P1 to P8 denote floorplan partitions, each covering a single Hclk tap point. All Hclk_* pins are synchronous to each other during timing analysis.


Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on February 04,2025 at 05:04:46 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. Clock Mesh Tree Architecture.

Fig. 3. Block-level Clock Tree Architecture.

II. PROBLEM STATEMENT: IMPACT ON SHIFT POWER

High-performance clock network solutions help to reduce insertion delay and clock skew along different branches of the clock network. We can interpret this as follows: once a clock edge is launched at the root of the tree, it travels to all its sink sequential elements, such as flops, latches and memories, within a short time. Since clock skew is also improved significantly, all sink cells driven by the clock toggle together, causing a significant spike in power. Table I describes a 7nm design of >450mm² die size with a scan cell count of >60M. We have shown the 3 major clocks driving 80% of the design, along with their insertion delays and frequencies. We can see that for such a big design, the clock insertion delays are very small.

TABLE I. CLOCKS AND LATENCIES

Clock Pin    Clock Frequency    Clock Latency
clkA         1050MHz            2.3ns
clkB         1680MHz            1.84ns
clkC         1152MHz            1.97ns

In a functional scenario, the clock and subsequent data power are contained using Integrated Clock Gating (ICG) circuits introduced during the RTL design phase. Many synthesis tools also offer clock gating by introducing ICGs during synthesis. But during scan shift mode, all ICGs in targeted blocks need to be transparent. With this, the entire clock tree is active and, depending on the scan shift data, there is data power too. Since we introduce the shift clock into the functional clock tree through a mux at the OCC, the shift clock path also mimics the functional clock tree. Even if the shift clock operates at a much lower frequency than the functional clock, the clock network and the subsequent data activity around each edge match those of the functional clock network. To illustrate this, we have done a power analysis on a block whose details are provided in Table II.

TABLE II. BLOCK LEVEL DETAILS

Parameters          Value
Scan Cells          760k
Clock Domains       3
Clock Tap Points    20

A high scan shift power pattern [load of 0101 and unload of 1010] was chosen, an SDF back-annotated simulation was run to save a VCD, and power analysis was done using industry-standard tools. Fig. 4 shows the current drawn as per the power reporting tool. Even though the shift period spans 40ns, we can see 2 peaks in the current draw, one for the posedge of the cycle and another for the negedge. We can also see that the entire activity during the shift cycle happens around the clock edges, causing 2 sharp peaks of very short duration. Even if the shift clock frequency were reduced to 5MHz, we would not see any difference in peak current.

Fig. 4. Current Drawn in Single Shift Cycle.

Peak power increases IR-drop in the design, thereby reducing the voltage across the transistors, and can lead to failures and reliability issues [6][7]. We would see unnecessary yield loss due to transients in the ATPG scan shift phase.

Several DFT-based methods [8] are available for reducing the instantaneous impact of switching activity, and thus IR-drop, during scan shift operations. In [9], multiple staggered clocks
are used to shift scan chains at slightly different times to reduce peak switching activity. In [10], scan segmentation in combination with clock gating is used to effectively reduce peak shift power. In [11], optimal scan chain grouping for reducing local IR-drop on flip-flops is used to mitigate IR-drop-induced test data corruption. In [12], flip-flops are assigned to scan segments and shift clock domains to reduce peak and average shift power. In [13], scan segments are regrouped to reduce excessive IR-drop on individual clock paths. However, none of these methods can effectively reduce peak power during the scan shift phase without impacting the functional timing.

In this paper, we propose a new scalable method of reducing shift power in the presence of aggressive clock network architectures.

• It works at the block level within the clock domain and can also reduce local instantaneous voltage drop problems.
• It has little impact on functional timing.
• It is floorplan-aware and effectively reduces the peak power.

In addition, we have compared the results with techniques like Q-Suppression/Q-gating.

III. PROPOSED SOLUTION

In our proposed solution, the idea is to reuse clock tap points from a high-performance clock tree network and introduce intentional delay along different clock points to reduce the scan shift peak power. The steps to be followed are listed below.

• Divide clock tap points at block entry level into multiple groups based on layout division.
• Introduce different intentional skew along the clock tap points during the scan shift phase.

Fig. 5 shows the details of the circuit to be added at the input of the clock tap entry points at the block level. When scan enable is high, the clock goes through a series of delay elements; when scan enable is low, it follows the functional clock path. The amount of delay added is varied as per layout at different clock points. This ensures that different parts of the layout are shifted at different intervals, reducing the peak current. Gating the delay elements ensures that there is no additional clock power in functional mode.

Fig. 5. Details of the Proposed Solution.

Even though the proposed solution does not impact functional and ATPG capture timing much, the introduction of different delay elements on the shift clock path affects scan shift timing. Differences in clock skew among different scan shift elements give rise to some hold violations. Scan chain re-ordering and re-partitioning alter the elements in the scan chain based on layout location. This should reduce the number of hold violations significantly. Once scan chain re-ordering is done, the proposal is to identify scan elements crossing between different clock entry point domains and introduce a lock-up latch between them. This should make scan element crossings between different clock entry points skew agnostic.

Now, let's focus on the amount of delay to be introduced between clock entry points. Several experiments have been conducted by varying the delays introduced between different clock entry point groups. Our research shows that the best results are achieved when the delay between clock entry point groups is equal to the period of the functional clock driving these flops. This way, the shift activity of one layout region is complete before the shift starts in the next layout region.

To illustrate the proposed solution, let's consider the block shown in Fig. 3. The block has 8 clock entry tap points dividing the floorplan into 8 different layout regions. We could divide the block into 4 different groups as illustrated in Table III. Now if the functional clock frequency of Hclk is 1GHz, then Delay_Param will be 1ns. Between Group0 and Group1 there will be a 1ns clock skew difference. Between Group0 and Group2, there will be a 2ns clock skew difference. Similarly, between Group0 and Group3, there will be a 3ns clock skew difference.

TABLE III. BLOCK GROUP DIVISION INFORMATION

Group Names    Layout Locations    Clock Tap Points Affected    Skew Introduced on Clock Points
Group0         P1, P3              Hclk_0, Hclk_2               0*Delay_Param
Group1         P2, P4              Hclk_1, Hclk_3               1*Delay_Param
Group2         P5, P7              Hclk_4, Hclk_6               2*Delay_Param
Group3         P6, P8              Hclk_5, Hclk_7               3*Delay_Param

Fig. 6 shows how clock edges shift with the proposed solution. Introducing a lock-up latch between scan chain elements that belong to different groups makes the solution clock skew agnostic. This ensures that there are no hold violations because of this circuit during scan shift timing analysis.

Fig. 6. Clock Edge Shift and Timing Window with the Proposed Solution.
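The delay assignment of Table III can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation flow: the function name `tap_point_skews` and the dictionary layout are assumptions made for this example, and Delay_Param is taken to be one functional clock period, as stated in the text.

```python
# Illustrative sketch only: the names and data layout are assumptions,
# not part of the flow described in the paper.

def tap_point_skews(groups, functional_freq_ghz):
    """Return the intentional shift-mode skew (in ns) for every clock tap point.

    groups: ordered mapping of group name -> list of tap point names.
    The k-th group (k = 0, 1, 2, ...) receives k * Delay_Param, where
    Delay_Param is assumed to equal one functional clock period.
    """
    delay_param_ns = 1.0 / functional_freq_ghz  # period in ns for a frequency in GHz
    skews = {}
    for index, taps in enumerate(groups.values()):
        for tap in taps:
            skews[tap] = index * delay_param_ns
    return skews


# Grouping taken from Table III (8 tap points, 4 groups).
TABLE_III_GROUPS = {
    "Group0": ["Hclk_0", "Hclk_2"],
    "Group1": ["Hclk_1", "Hclk_3"],
    "Group2": ["Hclk_4", "Hclk_6"],
    "Group3": ["Hclk_5", "Hclk_7"],
}

if __name__ == "__main__":
    for tap, skew in sorted(tap_point_skews(TABLE_III_GROUPS, 1.0).items()):
        print(f"{tap}: {skew:.1f} ns")
```

With a 1GHz functional clock this reproduces the 0ns, 1ns, 2ns and 3ns skews listed for Group0 to Group3.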

Let's consider the impact of the proposed solution on leakage, process and OCV. In these high-performance designs, leakage power usually constitutes 10% of total power. The additional cells in the clock path have a negligible impact on leakage. Since we are gating the delay cells with a TDR, there is no impact on dynamic power in functional usage. The delay of clock cells can vary across process parameters. In this solution, we choose the delay cells to match the clock period in the best process corner. In the worst corner, the cell delay increases, thereby producing a bigger impact. For OCV, since we are adding lock-up latches between the clock entry points, there is a skew benefit of half a shift cycle period for hold timing closure. For scan interactions within one clock entry point, derates are nullified by Clock Reconvergence Pessimism Removal. The impact on jitter due to the additional clock cells is negligible relative to the shift clock period.
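The lock-up latch insertion step of Section III lends itself to a simple script. The sketch below is a hypothetical illustration: it assumes the re-ordered scan chain is available as an ordered list of flop names and that each flop has already been mapped to its clock entry point group. The function `find_lockup_latch_sites` and all flop names are invented for this example.

```python
# Hypothetical sketch: the data model (ordered flop list plus flop-to-group map)
# is an assumption, not the scripting interface used by the authors.

def find_lockup_latch_sites(scan_chain, group_of):
    """Return (driver, receiver) pairs where the scan path crosses clock groups.

    scan_chain: flop names in shift order (scan-in to scan-out).
    group_of:   mapping of flop name -> clock entry point group name.
    A lock-up latch would be inserted between each returned pair.
    """
    sites = []
    for driver, receiver in zip(scan_chain, scan_chain[1:]):
        if group_of[driver] != group_of[receiver]:
            sites.append((driver, receiver))
    return sites


if __name__ == "__main__":
    chain = ["ff_a", "ff_b", "ff_c", "ff_d", "ff_e"]
    groups = {
        "ff_a": "Group0",
        "ff_b": "Group0",
        "ff_c": "Group1",
        "ff_d": "Group1",
        "ff_e": "Group3",
    }
    for driver, receiver in find_lockup_latch_sites(chain, groups):
        print(f"lock-up latch between {driver} and {receiver}")
```

Running the scan after chain re-ordering, as the text proposes, keeps the number of reported crossings (and therefore added latches) small.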
IV. EXPERIMENTAL DATA AND RESULTS

Consider the same block shown in Table II and Fig. 4. The block has 20 clock tap points, which are grouped into 4 groups. A skew of 1.5ns has been introduced between clock groups by adding the circuit shown in Fig. 5.

Fig. 7 shows the current drawn from power analysis tools for the same configuration shown in Fig. 4. As you can see, instead of one peak current of 55A, we now see 4 smaller peaks in the current drawn, which can be related to the 4 clock groups created. The peak current is reduced from 55A to 15.62A. The average power in both cases remained the same at 0.74W.

Fig. 7. Current Drawn Analysis with the Proposed Solution.

To compare the results, the Q-suppression solution was run on the same block by enabling the gating on flops with a fanout cone of more than 1000 cells (a functional fanout cone of 60k flops in total was gated off during shift). We can see from Fig. 8 that the peak current goes to 44.1A, while the average power is reduced to 0.694W.

We see this difference because the intentional clock delay mechanism affects both clock and data power, whereas Q-suppression affects only data power. A similar analogy can be made with the low-power solutions of EDA tools, where the tool shifts a series of 0s into the internal stump chain to reduce the data power.

Fig. 8. Current Drawn Analysis with Q-Suppression Solution.

Let's consider another block from a 7nm design with details shown in Table IV. This design is of size >270mm².

TABLE IV. BLOCK LEVEL DETAILS

Parameters          Value
Scan Cells          335k
Memory Instances    36
Clock Domains       1
Clock Tap Points    5
Functional Freq     1GHz
Shift Frequency     50MHz

A Redhawk VCD-based power analysis was done on the scenarios below.

• Default mode, which is without any shift power solution.
• Q-suppression on flops driving a fanout cone of 1000 cells.
• Dividing clock tap points into 4 groups and introducing a clock delay of 1ns and 1.5ns between groups.

Results from these experiments are shown in Fig. 9, 10 and 11.

• Fig. 9 represents the number of cells switching in the targeted shift cycle.
• Fig. 10 shows peak current differences with different solutions.
• Fig. 11 shows average power differences.
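For the first block (Table II), the reported numbers can be put side by side with a short calculation. The helper name `percent_reduction` is introduced here only for illustration; the input values are the ones quoted in the text for Fig. 4, Fig. 7 and Fig. 8.

```python
# Arithmetic on the values quoted in the text; the helper name is an assumption.

def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


BASELINE_PEAK_A = 55.0       # default shift, Fig. 4
PROPOSED_PEAK_A = 15.62      # 4 clock groups with 1.5ns skew, Fig. 7
Q_SUPPRESS_PEAK_A = 44.1     # Q-suppression, Fig. 8
BASELINE_AVG_W = 0.74        # average power, default and proposed solution
Q_SUPPRESS_AVG_W = 0.694     # average power with Q-suppression

if __name__ == "__main__":
    print(f"Proposed, peak current:      {percent_reduction(BASELINE_PEAK_A, PROPOSED_PEAK_A):.1f}%")
    print(f"Q-suppression, peak current: {percent_reduction(BASELINE_PEAK_A, Q_SUPPRESS_PEAK_A):.1f}%")
    print(f"Q-suppression, avg power:    {percent_reduction(BASELINE_AVG_W, Q_SUPPRESS_AVG_W):.1f}%")
```

On these figures the intentional clock delay cuts peak current by roughly 72% at unchanged average power, while Q-suppression cuts peak current by roughly 20% and average power by roughly 6%.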

Fig. 9. Switching Instances with Different Solutions.

Fig. 10. Peak Current Difference with Different Solutions.

Fig. 11. Average Power Difference with Different Solutions.

These results also show that we can reduce peak current effectively with different intentional delay additions along different clock entry tap points.

Let's consider another block with only 2 clock tap entry points. This is to analyze the impact in blocks with very few clock tap points. Details of the block are given in Table V. This block is taken from a 5nm design with a design size of >500mm².

TABLE V. BLOCK LEVEL DETAILS

Parameters          Value
Scan Cells          260k
Memory Instances    0
Clock Domains       1
Clock Tap Points    2
Functional Freq     1.5GHz

A Redhawk VCD-based power analysis was done on the scenarios below.

• Default mode, which is without any shift power solution.
• Q-suppression on flops driving a fanout cone of 1000 cells.
• Dividing clock tap points into 2 groups and introducing a clock delay of 0.666ns, 1ns and 2ns between groups.

Fig. 12 and Fig. 13 show the results from the experiments. From these results, we can see that the peak current is significantly reduced with the proposed solution.

Fig. 12. Switching Instances Difference in Each Scheme.

Fig. 13. Peak Current Difference in Each Scheme.
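Why staggering helps even with only 2 tap points can be seen with a toy model. The sketch below is not the Redhawk analysis used in the paper: it assumes each clock group draws a triangular current pulse of fixed width around its clock edge and that the groups carry equal load. The pulse width (0.5ns) and the total peak (10A) are hypothetical values chosen only for illustration.

```python
# Toy model under stated assumptions (triangular pulses, equal group load).
# It is not calibrated against the measurements reported in the paper.

def triangular_pulse(t_ns, start_ns, width_ns, peak_a):
    """Current of one group's switching burst at time t_ns (assumed triangle shape)."""
    x = (t_ns - start_ns) / width_ns
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return peak_a * (1.0 - abs(2.0 * x - 1.0))


def peak_total_current(num_groups, delay_ns, width_ns, total_peak_a, step_ns=0.001):
    """Peak of the summed current when group g is delayed by g * delay_ns."""
    per_group_peak = total_peak_a / num_groups
    end_ns = (num_groups - 1) * delay_ns + width_ns
    steps = int(round(end_ns / step_ns))
    best = 0.0
    for i in range(steps + 1):
        t = i * step_ns
        total = sum(
            triangular_pulse(t, g * delay_ns, width_ns, per_group_peak)
            for g in range(num_groups)
        )
        best = max(best, total)
    return best


if __name__ == "__main__":
    # Delay values from the third scenario above; 0.0 is the default (no staggering).
    for delay in (0.0, 0.666, 1.0, 2.0):
        peak = peak_total_current(2, delay, width_ns=0.5, total_peak_a=10.0)
        print(f"delay {delay:5.3f} ns -> modelled peak {peak:.2f} A")
```

In this model, once the inter-group delay exceeds the pulse width the bursts no longer overlap and the peak drops to the per-group value, which matches the reasoning in Section III that one region should finish shifting before the next one starts.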

V. CONCLUSION

The solution proposed in this paper has the following advantages.

A. Efficient Re-Use of functional clock tree:
Clock tap points introduced at the block entry level are re-used to introduce clock skew. In this solution, no deviations are introduced into the clock tree to reduce shift power. Clock delay is introduced within a given clock domain and within a block, which can reduce local instantaneous voltage drop problems as well.

B. Minimal impact on timing analysis:
With this solution, a mux is added on the functional path at the block inputs. Since lock-up latches are introduced through scripts after scan chain re-ordering, there is no impact on scan shift timing either.

C. Solution is layout aware
The DFT engineer gets an opportunity to work with the layout engineer to decide on a clock grouping which is layout aware.

D. Scalable
As shown in the results, even with 2 clock tap points, peak current during scan shift can be reduced effectively. This solution can be automated and scaled for every design that uses high-performance clock network architectures.

VI. REFERENCES

[1] L. Xiao, et al., "Local clock skew minimization using blockage aware mixed tree-mesh clock network," in International Conference on Computer-Aided Design, 2010, pp. 458–462.
[2] C. Yeh, et al., "Clock distribution architectures: a comparative study," in International Symposium on Quality Electronic Design, 2006, pp. 85–91.
[3] A. Abdelhadi, et al., "Timing driven variation aware synthesis of hybrid mesh/tree clock distribution networks," Integration, the VLSI Journal, vol. 46, pp. 382–391, Sept. 2013.
[4] Nikhil Patel, "A Novel Clock Distribution Technology – Multisource Clock Tree System (MCTS)," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 2, Issue 6, June 2013.
[5] A. B. Chong, "Hybrid Multisource Clock Tree Synthesis," 2021 28th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dubai, United Arab Emirates, 2021, pp. 1-6, doi: 10.1109/ICECS53924.2021.9665516.
[6] P. Narayanan, R. Mittal, S. Poddutur, V. Singhal and P. Sabbarwal, "Modified flip-flop architecture to reduce hold buffers and peak power during scan shift operation," 29th VLSI Test Symposium, Dana Point, CA, USA, 2011, pp. 154-159, doi: 10.1109/VTS.2011.5783776.
[7] J. T. Tudu, E. Larsson, V. Singh and V. D. Agrawal, "On Minimization of Peak Power for Scan Circuit during Test," 2009 14th IEEE European Test Symposium, Seville, Spain, 2009, pp. 25-30, doi: 10.1109/ETS.2009.36.
[8] Y. Zhang et al., "Clock-Skew-Aware Scan Chain Grouping for Mitigating Shift Timing Failures in Low-Power Scan Testing," 2018 IEEE 27th Asian Test Symposium (ATS), Hefei, China, 2018, pp. 149-154, doi: 10.1109/ATS.2018.00037.
[9] T. Yoshida and M. Watari, "MD-SCAN method for low power scan testing," Proceedings of the 11th Asian Test Symposium, 2002 (ATS '02), Guam, USA, 2002, pp. 80-85, doi: 10.1109/ATS.2002.1181690.
[10] Y. Bonhomme, P. Girard, L. Guiller, C. Landrault and S. Pravossoudovitch, "A gated clock scheme for low power scan testing of logic ICs or embedded cores," Proceedings 10th Asian Test Symposium, Kyoto, Japan, 2001, pp. 253-258, doi: 10.1109/ATS.2001.990291.
[11] Y. Zhang, S. Holst, X. Wen, K. Miyase, S. Kajihara and J. Qian, "Scan Chain Grouping for Mitigating IR-Drop-Induced Test Data Corruption," 2017 IEEE 26th Asian Test Symposium (ATS), Taipei, Taiwan, 2017, pp. 145-150, doi: 10.1109/ATS.2017.37.
[12] Y.-C. Huang et al., "Test Clock Domain Optimization to Avoid Scan Shift Failure Due to Flip-Flop Simultaneous Triggering," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 4, pp. 644-652, April 2013, doi: 10.1109/TCAD.2012.2228741.
[13] Y. Yamato, X. Wen, M. A. Kochte, K. Miyase, S. Kajihara and L.-T. Wang, "A novel scan segmentation design method for avoiding shift timing failure in scan testing," 2011 IEEE International Test Conference, Anaheim, CA, USA, 2011, pp. 1-8, doi: 10.1109/TEST.2011.6139162.
