Shift Power Reduction in High-Performance Clock Network Designs
2024 IEEE 8th International Test Conference India (ITC India) | 979-8-3503-5259-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/ITCINDIA62949.2024.10652198

Abstract—With the rapid development of deep submicron (DSM) high-performance VLSI circuits, building a robust clock tree with minimal insertion delays, skews and jitter has turned out to be challenging. Several techniques have been developed to reduce the clock insertion delay and clock skew. A few such techniques are Clock Mesh CTS, Multi Source CTS and Multi Entry CTS. These techniques help reduce insertion delays, skews and jitter across the design but can adversely affect power, since a significant section of a design driven by a clock source can toggle within a short time. The impact is higher during the scan shift phase, where all clock gating circuits are transparent in blocks targeted for ATPG. Due to a significant reduction in insertion delay and clock skew, the clock edge reaches all the sinks in the design quickly, resulting in a significant amount of activity in a short span after every shift clock edge. In this paper, we discuss the adverse effects of high-performance clock network schemes on shift power and techniques to save shift power in such scenarios. Results from the analysis of a few blocks are also shared. The solution described in this paper is applicable to high-performance designs where clock mesh clock tree network architectures are implemented.

[...] which is generally utilized because of low power utilization and less routing resource utilization, as well as ease of implementation and simulation [1]. The tree-based architecture can be very sensitive to Process, Voltage, and Temperature (PVT) variations, particularly in high-performance chip designs. Clock mesh, the other architecture, provides better tolerance to variations [2]. With its large number of mesh nodes and uneven loads, a clock mesh is hard to analyze and automate [3]. The multi-point entry clock tree is a novel clock distribution system that fills the methodology gap between the ordinary clock tree and the clock mesh [4]. This solution reduces clock skew and mitigates timing violations, ultimately improving overall chip performance. These three types of clock architectures are shown in Fig. 1.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on February 04,2025 at 05:04:46 UTC from IEEE Xplore. Restrictions apply.
[...] targeted blocks need to be transparent. With this, the entire clock tree is active and, depending on the scan shift data, there will be data power too. Since we introduce the shift clock into the functional clock tree through a mux at the OCC, the shift clock path mimics the functional clock tree. Even if the shift clock operates at a much lower frequency than the functional clock, the entire clock network and the subsequent data activity still match the functional clock frequency and network. To illustrate this, we have done a power analysis on the block, and the details are provided in Table II.
Parameters Value
[...] are used to shift scan chains at slightly different times to reduce peak switching activity. In [10], scan segmentation in combination with clock gating is used to effectively reduce peak shift power. In [11], optimal scan chain grouping for reducing local IR-drop on flip-flops is used to mitigate IR-drop-induced test data corruption. In [12], flip-flops are assigned to scan segments and shift clock domains to reduce peak and average shift power. In [13], scan segments are regrouped to reduce excessive IR-drop on individual clock paths. However, none of these methods can effectively reduce peak power during the scan shift phase without impacting the functional timing.

In this paper, we propose a new scalable method of reducing shift power in the presence of aggressive clock network architectures.

• It works at the block level within the clock domain and can also reduce local instantaneous voltage drop problems.
• It does not impact the functional timing as much.
• It is floorplan-aware and effectively reduces the peak power.

In addition, we have compared the results with some techniques like Q-Suppression/Q-gating.

III. PROPOSED SOLUTION
In our proposed solution, the idea is to reuse clock tap points from a high-performance clock tree network and introduce intentional delay along different clock points to reduce the scan shift peak power. Below are the steps to be followed.

• Divide clock tap points at block entry level into multiple groups based on layout division.
• Introduce different intentional skew along the clock tap points during the scan shift phase.

Fig. 5 shows the circuit to be added at the input of the clock tap entry points at the block level. When scan enable is high, the clock goes through a series of delay elements; when scan enable is low, it follows the functional clock path. The amount of delay added is varied as per layout at different clock points. This ensures that different parts of the layout are shifted at different intervals, reducing the peak current. Gating the delay elements ensures that they do not consume any clock power in functional mode.

[...] among different scan shift elements will give rise to some hold violations. Scan chain re-ordering and re-partitioning will rearrange the elements in the scan chain based on layout location. This should reduce the number of hold violations significantly. Once scan chain re-ordering is done, the proposal is to identify scan elements crossing between different clock entry point domains and introduce a lock-up latch between them. This should make scan element crossings between different clock entry points skew agnostic.

Now, let's focus on the amount of delay to be introduced between clock entry points. Several experiments have been conducted by varying the delay introduced between different clock entry point groups. Our research shows that the best results are achieved when the delay between successive clock entry point groups is equal to the period of the functional clock driving these flops. This way, the shift activity of one layout region will be complete before the shift starts in the next layout region.

To illustrate the proposed solution, let's consider the block shown in Fig. 3. The block has 8 clock entry tap points dividing the floorplan into 8 different layout regions. We could divide the block into 4 different groups as illustrated in Table III. If the functional clock frequency of Hclk is 1 GHz (a period of 1 ns), then Delay_Param will be 1 ns. Between Group0 and Group1 there will be a 1 ns clock skew difference. Between Group0 and Group2, there will be a 2 ns clock skew difference. Similarly, between Group0 and Group3, there will be a 3 ns clock skew difference.

TABLE III. BLOCK GROUP DIVISION INFORMATION
Group Names   Layout Locations   Clock Tap Points Affected   Skew Introduced on Clock Points
Group0        P1, P3             Hclk_0, Hclk_2              0*Delay_Param
Group1        P2, P4             Hclk_1, Hclk_3              1*Delay_Param
Group2        P5, P7             Hclk_4, Hclk_6              2*Delay_Param
Group3        P6, P8             Hclk_5, Hclk_7              3*Delay_Param
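The grouping in Table III and the lock-up latch rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TapGroup` class, the function names and the flop names (`ff_a`, `ff_b`, `ff_c`) are hypothetical and do not come from the paper or from any DFT tool; only the group contents and the rule that Delay_Param equals one functional clock period are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapGroup:
    name: str
    layout_locations: tuple
    tap_points: tuple
    skew_multiplier: int  # multiple of Delay_Param applied during scan shift


# Grouping copied from Table III.
TABLE_III_GROUPS = [
    TapGroup("Group0", ("P1", "P3"), ("Hclk_0", "Hclk_2"), 0),
    TapGroup("Group1", ("P2", "P4"), ("Hclk_1", "Hclk_3"), 1),
    TapGroup("Group2", ("P5", "P7"), ("Hclk_4", "Hclk_6"), 2),
    TapGroup("Group3", ("P6", "P8"), ("Hclk_5", "Hclk_7"), 3),
]


def delay_param_ns(functional_freq_ghz):
    """Delay_Param is one functional clock period: 1 GHz gives 1 ns."""
    return 1.0 / functional_freq_ghz


def shift_delay_per_tap(groups, functional_freq_ghz):
    """Map each clock tap point to the delay (ns) inserted while scan enable is high."""
    step = delay_param_ns(functional_freq_ghz)
    return {tap: g.skew_multiplier * step for g in groups for tap in g.tap_points}


def lockup_latch_sites(scan_chain, tap_of_flop):
    """Return adjacent scan-cell pairs that are clocked from different clock entry points."""
    return [
        (src, dst)
        for src, dst in zip(scan_chain, scan_chain[1:])
        if tap_of_flop[src] != tap_of_flop[dst]
    ]


delays = shift_delay_per_tap(TABLE_III_GROUPS, functional_freq_ghz=1.0)
print(delays["Hclk_0"], delays["Hclk_1"], delays["Hclk_4"], delays["Hclk_5"])

# Hypothetical three-flop chain after layout-based re-ordering.
chain = ["ff_a", "ff_b", "ff_c"]
taps = {"ff_a": "Hclk_0", "ff_b": "Hclk_0", "ff_c": "Hclk_1"}
print(lockup_latch_sites(chain, taps))
```

With a 1 GHz functional clock, the four groups receive 0, 1, 2 and 3 ns of shift-mode delay, and the only lock-up latch site in the example chain is the hop from `ff_b` to `ff_c`, where the clock entry point changes.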
Let's consider the impact of the proposed solution on leakage, process and OCV. In these high-performance designs, leakage power usually constitutes 10% of total power. The additional cells in the clock path have a negligible impact on leakage. Since we are gating the delay cells with a TDR, there is no impact on dynamic power in functional usage. The delay of clock cells can vary across process parameters. In this solution, we choose the cells to match the clock period at the best process corner. At the worst corner the cell delays increase, thereby producing a bigger impact. For OCV, since we are adding lock-up latches between the clock entry points, there is a skew benefit of half a shift cycle period for hold timing closure. For scan interactions within one clock entry point, derates are nullified by Clock Reconvergence Pessimism Removal. The impact on jitter due to the additional clock cells is negligible relative to the shift frequency.
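The peak-reduction argument can be illustrated with a toy superposition model. It assumes, purely for illustration, that every clock group draws the same short current pulse after its shift clock edge; the pulse shape, the sample units and the function name `total_current` are hypothetical and are not taken from the paper's power analysis.

```python
def total_current(pulse, offsets, n_samples):
    """Superimpose one copy of `pulse` per clock group, each shifted by its offset."""
    total = [0] * n_samples
    for offset in offsets:
        for i, amplitude in enumerate(pulse):
            if offset + i < n_samples:
                total[offset + i] += amplitude
    return total


# Arbitrary-unit current pulse drawn by one group after a shift clock edge.
pulse = [1, 3, 1]
n_samples = 16

# All four groups see the clock edge at the same time.
aligned = total_current(pulse, offsets=[0, 0, 0, 0], n_samples=n_samples)
# Each group is delayed by more than the pulse width, so the pulses do not overlap.
staggered = total_current(pulse, offsets=[0, 4, 8, 12], n_samples=n_samples)

print(max(aligned), max(staggered))
print(sum(aligned) / n_samples, sum(staggered) / n_samples)
```

In this idealized model the peak drops by the number of groups while the average over the shift cycle is unchanged, because the same charge is drawn and only its timing moves. Real current waveforms overlap and differ per group, so the measured reduction need not equal the group count.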
IV. EXPERIMENTAL DATA AND RESULTS
Consider the same block shown in Table II and Fig. 4. The block has 20 clock tap points, which are grouped into 4. A skew of 1.5 ns has been introduced between clock groups by adding the circuit shown in Fig. 5.

Fig. 7 shows the current drawn, as reported by power analysis tools, for the same configuration shown in Fig. 4. Instead of one peak current of 55 A, we now see 4 smaller peaks in the current drawn, which can be related to the 4 clock groups created. The peak current is reduced from 55 A to 15.62 A. The average power in both cases remained the same at 0.74 W.

Fig. 8. Current Drawn Analysis with Q-Suppression Solution.

Let's consider another block from a 7nm design with details shown in Table IV. This design is of size >270mm * mm.

TABLE IV. BLOCK LEVEL DETAILS
Parameters         Value
Scan Cells         335k
Memory Instances   36
Clock Domains      1
TABLE V. BLOCK LEVEL DETAILS
Parameters Value
Memory Instances 0
Clock Domains 1
V. CONCLUSION
The solution proposed in this paper has the following advantages.

A. Efficient Re-Use of functional clock tree:
Clock tap points introduced at block entry level are re-used to introduce clock skew. In this solution, no deviations are made to the clock tree to reduce shift power. Clock delay is introduced within a given clock domain and within a block, which can reduce local instantaneous voltage drop problems as well.

B. Minimal impact on timing analysis:
With this solution, a mux is added in the functional path at block inputs. Since we introduce lock-up latches through scripts after scan chain re-ordering, there is no impact on scan shift timing either.

C. Solution is layout aware
The DFT engineer gets an opportunity to work with the layout engineer to decide on a clock grouping which is layout aware.

D. Scalable
As shown in the results, even with 2 clock tap points, peak current during shift can be reduced effectively. This solution can be automated and scaled for every design that uses high-performance clock network architectures.

VI. REFERENCES
[1] L. Xiao, et al., "Local clock skew minimization using blockage aware mixed tree-mesh clock network," in International Conference on Computer-Aided Design, 2010, pp. 458–462.
[2] C. Yeh, et al., "Clock distribution architectures: a comparative study," in International Symposium on Quality Electronic Design, 2006, pp. 85–91.
[3] A. Abdelhadi, et al., "Timing driven variation aware synthesis of hybrid mesh/tree clock distribution networks," Integration, the VLSI Journal, vol. 46, pp. 382–391, Sept. 2013.
[4] Nikhil Patel, "A Novel Clock Distribution Technology – Multisource Clock Tree System (MCTS)," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 2, Issue 6, June 2013.
[5] A. B. Chong, "Hybrid Multisource Clock Tree Synthesis," 2021 28th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dubai, United Arab Emirates, 2021, pp. 1-6, doi: 10.1109/ICECS53924.2021.9665516.
[6] P. Narayanan, R. Mittal, S. Poddutur, V. Singhal and P. Sabbarwal, "Modified flip-flop architecture to reduce hold buffers and peak power during scan shift operation," 29th VLSI Test Symposium, Dana Point, CA, USA, 2011, pp. 154-159, doi: 10.1109/VTS.2011.5783776.
[7] J. T. Tudu, E. Larsson, V. Singh and V. D. Agrawal, "On Minimization of Peak Power for Scan Circuit during Test," 2009 14th IEEE European Test Symposium, Seville, Spain, 2009, pp. 25-30, doi: 10.1109/ETS.2009.36.
[8] Y. Zhang et al., "Clock-Skew-Aware Scan Chain Grouping for Mitigating Shift Timing Failures in Low-Power Scan Testing," 2018 IEEE 27th Asian Test Symposium (ATS), Hefei, China, 2018, pp. 149-154, doi: 10.1109/ATS.2018.00037.
[9] T. Yoshida and M. Watari, "MD-SCAN method for low power scan testing," Proceedings of the 11th Asian Test Symposium (ATS '02), Guam, USA, 2002, pp. 80-85, doi: 10.1109/ATS.2002.1181690.
[10] Y. Bonhomme, P. Girard, L. Guiller, C. Landrault and S. Pravossoudovitch, "A gated clock scheme for low power scan testing of logic ICs or embedded cores," Proceedings 10th Asian Test Symposium, Kyoto, Japan, 2001, pp. 253-258, doi: 10.1109/ATS.2001.990291.
[11] Y. Zhang, S. Holst, X. Wen, K. Miyase, S. Kajihara and J. Qian, "Scan Chain Grouping for Mitigating IR-Drop-Induced Test Data Corruption," 2017 IEEE 26th Asian Test Symposium (ATS), Taipei, Taiwan, 2017, pp. 145-150, doi: 10.1109/ATS.2017.37.
[12] Y.-C. Huang et al., "Test Clock Domain Optimization to Avoid Scan Shift Failure Due to Flip-Flop Simultaneous Triggering," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 4, pp. 644-652, April 2013, doi: 10.1109/TCAD.2012.2228741.
[13] Y. Yamato, X. Wen, M. A. Kochte, K. Miyase, S. Kajihara and L.-T. Wang, "A novel scan segmentation design method for avoiding shift timing failure in scan testing," 2011 IEEE International Test Conference, Anaheim, CA, USA, 2011, pp. 1-8, doi: 10.1109/TEST.2011.6139162.