ARM 1176-JZFS CPU-Based Low-Power Subsystem
ARM 1176-JZFS CPU-Based Low-Power Subsystem: Methodology to Reduce Electrical and Functional Failure in a Low-Power Design
By David Flynn, Fellow ARM; Sachin Idgunji, Architect, ARM; Felix Jen, Manager Design Implementation, UMC; Wen-Pin Lin, Senior Manager, UMC; and Vivek Shukla, Cadence Architect, Bangalore.
Abstract
Leakage control has become a major design issue because leakage currents drain a battery's charge even when a device is inactive or in standby mode. Transistor scaling exacerbates the problem: transistors in each new process generation leak more than those in previous generations. A few years ago, designers began using power shut-off in their designs, and EDA suppliers provided low-power methodology solutions. However, power shut-off created next-level issues: performance loss, wear-out of power switches, more complex power switch analysis, system-level performance impact due to power-up time, test, and reliability. These issues require an accompanying ASIC implementation and verification methodology to reduce the risk of chip failure, both functional and electrical. We demonstrate the application of these techniques and the methodology on an ARM1176-JZFS CPU-based system targeted for a 65nm technology node, which achieves higher speed with lower leakage, using a methodology that reduces post-silicon electrical failure.
[Figure: Power Solution; memory compiler with memory shut-offs, standard cells, PMK library]
Joint Collaboration Contributors: This effort has been jointly executed by ARM, UMC, and Cadence to accomplish the following tapeout and silicon measurements.
UMC: 65nm standard process; looking for performance and yield on the LP implementation.
ARM: ARM1176JZFS-based SoC to demonstrate power management on a high-performance design; low-power architecture; power management and low-power memory IP for managing leakage.
Cadence: CPU implementation; complete low-power tool and methodology support.
L90 1P9M SP/LL 193nm Dry 1.0/1.2 16/22 30/52/65 (1.8/2.5/3.3V) 70/80 CoSi2 Cu Low-k (k=2.9) 280 1.16/0.99*
L65 1P10M SP/LL 193nm Dry 1.0/1.2 12/19 30/52/65 (1.8/2.5/3.3V) 40/55 NiSi Cu Low-k (k=2.9) 200 0.499*/0.525*
The Low Leakage (LL) process running at 1.2V has approximately half the performance of the Standard Process (SP) running at 1.0V (Figure 3).
[Figure: log-scale plot with series 65LL 1.2V and 90LL 1.2V]
As shown in Figure 4, Low Leakage (LL) Nodes gain significantly (>80x) across the process space. They are highly sensitive to temperature (sub-threshold component). High Performance Nodes gain an average of 25% on Drive Strength. This is dominated by the process spread.
[Figure: Leakage Gain (LL vs SP). Y axis: Gain (log scale); X axis: Corners (25 TT, 25 SS, 25 FF, 25 SF, 25 FS, 125 TT, -40 FF); series: NMOS ratio, PMOS ratio]
High Performance Nodes gain significantly (average 30%) across the corners. The power dissipation can be managed effectively with voltage scaling. Multichannel devices can be used to reduce the leakage.
[Figure: Dynamic Power, plotted against Voltage (0.5 to 1.1)]
Delay for some key structures is proportional to V^(-α), where α is in the range of 1.5 to 2. As shown in Figure 6 and Figure 7, temperature sensitivity decreases with lowered voltage (the Zero Temperature Coefficient point for the block is around 0.78V). Variability is highly sensitive to voltage and increases drastically at lower voltages, impacting the functionality of the design.
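As a rough illustration of this voltage dependence, the sketch below evaluates relative delay under an assumed power-law model in which delay scales as V^(-α), with α taken from the 1.5 to 2 range quoted above. The function name, the 1.0V reference voltage, and the sample voltages are illustrative assumptions, not measured data from this work.

```python
def relative_delay(voltage, alpha, v_ref=1.0):
    """Delay at `voltage` relative to the delay at `v_ref`.

    Assumes the simple power-law model delay ~ V**(-alpha); this is an
    idealization, not a fit to silicon data.
    """
    if voltage <= 0 or v_ref <= 0:
        raise ValueError("voltages must be positive")
    return (voltage / v_ref) ** (-alpha)


if __name__ == "__main__":
    # Sweep the two ends of the quoted alpha range at a few sample voltages.
    for alpha in (1.5, 2.0):
        for v in (1.2, 1.0, 0.8):
            ratio = relative_delay(v, alpha)
            print(f"alpha={alpha:.1f}  V={v:.1f}V  relative delay={ratio:.2f}")
```

Under this model, a larger α makes delay degrade faster as the supply is lowered, which is one way to read why low-voltage operation needs extra timing margin.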
[Figure 6: Voltage vs Delay (Average) and Delay sensitivity to temperature. Axes: Delay (normalized) versus Voltage (V); series: Block 1, Block 2, Block 3]
[Plot: observed std dev (arbitrary units); dispersion data with power-law fit y = 16.396x^(-3.0055)]
Figure 7: Variability
[Figure: subsystem block diagram. Blocks: Debug Interface; Flexible DFT/MBIST; Compressor Controller; JTAG debug; TrustZone-enabled ARM11 core (ARM1176JZ) with 16K Instruction Cache, 16K Data Cache, and TCRAM 0/1 Interfaces; Memory Management; AMBA AXI Interface (Instruction Interface, Data Interface, DMA, Peripheral Port); Timer x2; UART x2; INTC; GPIO]
[Figure 6-3: Alternative macrocell floorplan. Shows the expected location of the ARM1176 core and its RAMs (DTCDataRAM, ITCDataRAM, DData RAM, IData RAM, ITag RAM, IValid RAM, BTACTag RAM, BTACData RAM, DDirty RAM, DValid RAM, TLBRAM), together with the instruction read-only and data read/write ports; clock, reset, and interrupt ports; coprocessor ports; and ETM ports. On the connectors, the location of the pins is indicated by a line.]
To reduce wear-out of the power switches while maintaining performance, Ulterior uses two kinds of power switch matrices: the weak network and the strong network. Each network is controlled separately by the Advance Leakage Controller (ALC): the weak network has 8 power shut-off control request inputs with acknowledges, and the strong network has one shut-off enable request. The weak, resistive network brings up the virtual grid gently, with sufficient current to ensure that VVDD reaches 0.95*VDD at high temperature. The strong matrix is turned on once the virtual grid reaches 0.95*VDD, to reduce the IR drop. The weak network is implemented with 8 enables so that the wear-out experiment can be carried out with 1/2/4/8 enables. The acknowledge signal selection for all power controls is based on STA measurement. Figure 10 has one example sequence, where 8N1.
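The weak-then-strong ordering can be sketched as a small behavioral model. This is only an illustration of the sequencing rule stated above (strong network enabled once VVDD reaches 0.95*VDD); the class name, method, and sample voltages are hypothetical and do not describe the actual ALC logic.

```python
from dataclasses import dataclass, field


@dataclass
class PowerUpSequencer:
    """Toy model of the weak-then-strong power-up order (hypothetical, not the ALC)."""
    vdd: float
    weak_enables: int = 8          # the text allows 1/2/4/8 weak enables
    threshold_ratio: float = 0.95  # strong network waits for 0.95*VDD
    log: list = field(default_factory=list)

    def run(self, vvdd_samples):
        """Walk through virtual-rail samples; record when each network turns on.

        Returns True if the strong network was enabled, False if the rail
        never reached the threshold within the given samples.
        """
        if self.weak_enables not in (1, 2, 4, 8):
            raise ValueError("weak_enables must be 1, 2, 4, or 8")
        self.log.append(("weak_on", self.weak_enables))
        for step, vvdd in enumerate(vvdd_samples):
            if vvdd >= self.threshold_ratio * self.vdd:
                self.log.append(("strong_on", step))
                return True
        return False


if __name__ == "__main__":
    seq = PowerUpSequencer(vdd=1.0, weak_enables=4)
    print(seq.run([0.2, 0.6, 0.9, 0.96, 1.0]), seq.log)
```

Keeping the threshold as a ratio of VDD rather than an absolute voltage mirrors how the rule is stated in the text and lets the same model be reused across supply settings.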
[Figure 10: example power control sequence; signals labeled include VDDCPU, CLOCK, and N_ISOLATE]
The memory subsystem contains 37 single-port memories; each memory can work in three low-power modes (Figure 11):
a. Standby mode (HALT): CEN disables the memory.
b. Retention mode (SRPG): power is supplied to the core array to retain state; power is off for the periphery for reduced leakage; outputs are clamped to zero.
c. Shutdown mode (HIBERNATE): power is off for the core and the periphery for reduced leakage; outputs are clamped to zero; possible through both integrated MTCMOS and separated power sources for core and periphery.
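The three modes can be summarized as a small lookup table. The sketch below is illustrative only: the type and function names are hypothetical, and the HALT row is an assumption, since the text states only that CEN disables the memory in standby.

```python
from enum import Enum


class MemoryMode(Enum):
    HALT = "standby"
    SRPG = "retention"
    HIBERNATE = "shutdown"


# Each entry: (core array powered, periphery powered, outputs clamped to zero).
# The HALT row is an assumption: the text only says that CEN disables the memory.
MODE_STATE = {
    MemoryMode.HALT: (True, True, False),
    MemoryMode.SRPG: (True, False, True),
    MemoryMode.HIBERNATE: (False, False, True),
}


def retains_state(mode):
    """Memory contents survive whenever the core array stays powered."""
    core_powered, _, _ = MODE_STATE[mode]
    return core_powered
```

For example, `retains_state(MemoryMode.SRPG)` is true while `retains_state(MemoryMode.HIBERNATE)` is false, which is the distinction a power controller cares about when choosing between retention and shutdown.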
[Figure 11: memory low-power modes; labels include PGEN, RETN, Column Decoders, Sense Amps and I/O, CORE, Ground, LOGIC VSS, and Word Lines]
Implementation Overview
Figure 12 illustrates the Cadence CPF-based low-power implementation flow, with the following key highlights:
- Single CPF used from synthesis to backend, power, and timing sign-off
- Leakage optimizations in the synthesis and backend flows
- CPF-based MMMC flow in the Encounter platform
- PSO planning flow to meet performance/electrical/power goals
- Automated power switch network simulation for multiple combinations
- CRC model-based SPICE simulation to reduce TaT for complex power switch analysis
While addressing low-power implementation and its verification, the methodology must also be adequate to deal with the challenges of maintaining performance and reliability. Here are the key issues addressed by the implementation methodology:
- Power shut-off and MSV implementation
  - Maintaining the system performance is a challenge
  - CPF-based methodology simplifies the low-power insertion
- Low-power verification
  - Verifying the low power through RTL and gate simulations
  - Verifying the low power through formal checks
[Figure: CPF-based flow. Stages: PD-aware physical implementation (SoC Encounter); timing and SI signoff (Encounter Timing System); IR drop and power signoff (VoltageStorm-PE & DC); physical verification]
- Ensuring reliability
  - New power structures and strategy may lead to defects in the t>0 time
  - This needs to be taken care of in the design
  - ARM has come up with a new approach in the design to avoid electrical failures
  - How the implementation would support such a mechanism
Ulterior Implementation
Low-power Verification
Low-power verification is the backbone of any low-power flow. Verification can be performed through dynamic simulation on the RTL as well as the gate-level netlist, and through static checks. Cadence Encounter Conformal Low Power verifies the correct implementation of low-power design techniques and validates the design using formal techniques (versus simulation) throughout the design process. It also decreases the risk of missed bugs before a product goes out the door.

Conformal Low Power accepts RTL/gate-level netlists, with or without explicit power or ground nets, and a CPF file as input. It performs structural and rule-based checks to verify that the low-power implementation matches the power specification defined in the CPF file. Under Low Power Equivalency Checking, Conformal Low Power ensures that low-power optimizations do not introduce a technology mapping bug or a logical bug in the design netlist. It reads the golden and revised designs along with the CPF files and checks logical equivalence without setting any constraints on the low-power control signals.

The RTL and Conformal Low Power flow is used to verify the CPF. It reads the RTL and CPF as input and reports missing and redundant low-power rules with respect to the power architecture of the design. Conformal Low Power flows for the synthesis and physical netlists are used to verify the low-power implementation against the power specification defined in the CPF file. Since instances in the synthesis netlist do not have power and ground pins, their power domains are assigned based on the CPF definition. The power domains of the instances in the physical netlist are assigned based on the power and ground pin connectivity.

The power domain consistency check (PDCIC) performs power-aware equivalence checking and checks low-power cells. The PDCIC between the synthesis netlist and the physical netlist performs power-aware equivalence checking between the golden and revised designs. In this case, it assigns the power domains for the synthesis netlist using the CPF definitions, while the power domains for the physical netlist are derived from the power and ground pin connectivity.
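To make the idea of a structural, rule-based check concrete, the sketch below flags nets that leave a switchable power domain without isolation. It is a simplified illustration under assumed data structures; it is not Conformal Low Power's algorithm or API, and the net names and the PDtop domain name are hypothetical.

```python
def find_missing_isolation(crossings, switchable_domains, isolated_nets):
    """Return nets driven from a switchable domain into another domain
    without an isolation cell.

    crossings: iterable of (net_name, source_domain, sink_domain) tuples.
    switchable_domains: set of domain names that can be shut off.
    isolated_nets: set of net names that already have an isolation cell.
    """
    violations = []
    for net, src, dst in crossings:
        crosses_domains = src != dst
        if crosses_domains and src in switchable_domains and net not in isolated_nets:
            violations.append(net)
    return violations


if __name__ == "__main__":
    # Hypothetical nets; only PDcore can be shut off in this example.
    crossings = [
        ("core_irq", "PDcore", "PDtop"),
        ("core_data", "PDcore", "PDtop"),
        ("cfg", "PDtop", "PDcore"),
        ("local", "PDcore", "PDcore"),
    ]
    print(find_missing_isolation(crossings, {"PDcore"}, {"core_irq"}))
```

A production checker would also cover level shifters, retention cells, and always-on buffers; this sketch covers only the isolation rule.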
The commitCPF command mainly creates the following information according to the loaded CPF:
- Creates power domains and defines their global connections
- Checks and inserts level shifters and isolations based on the rules
- Checks and replaces flip-flops with SRPGs based on the rules
- Creates the analysis views
Once the design is imported and the CPF is loaded, the following are the key steps performed by SoCE:
(i) Low-power CPF flow and the MMMC settings
(ii) Different kinds of power shut-off (PSO) for the design
     On-chip PSO: column-based checkerboard PSO; PSO for hard macro (memory)
     Off-chip PSO: VDDCPU, secondary domain for VDDCore, can also be shut off externally
(iii) Different kinds of level shifter implementation for the design
     Low-to-high level shifter
(iv) Isolation implementation for the PSO power domains
(v) State retention for the PSO power domains
(vi) Always-on net synthesis
(vii) Secondary power pin connection for the SRPG/always-on buffer/LVL shifter
(viii) On-chip Variation (OCV) timing analysis mode
(ix) Timing optimization and analysis in MMMC
(x) Clock tree synthesis in MMMC
(xi) Domain-aware routing
(xiii) MMMC SI closure
(xiv) Hold timing optimization in MMMC
(xv) MMMC leakage optimization
(xvi) Running multiple-CPU processing to reduce the runtime for multiple mode analysis
Figure 15 illustrates the floorplan and the power switch columns of the Ulterior design.
The key to the implementation is power network planning. As described in the power architecture section, two types of power switch network are needed: the weak network and the strong network. The weak network itself has eight different enables and the same number of acknowledgements.
[Figure: weak power switch network; request signals REQ_WEAK_0 through REQ_WEAK_4 and acknowledge signals ACK_WEAK_0 through ACK_WEAK_7 are labeled]
[Figure: strong power switch network; request signal REQ_STRONG and acknowledge signals ACK_STRONG_0 through ACK_STRONG_7, plus ACK_STRONG_0_1, ACK_STRONG_1_1, ACK_STRONG_2_1, ACK_STRONG_6_1, and ACK_STRONG_7_1, are labeled]
SRPG Cells
There are ~40k state retention flops in the PDcore PSO power domain (Figure 22). While all the flops are inserted during synthesis, their placement and power connections are performed during P&R. The secondary power pin connection to always-on power for the state-retention flops is shown in Figure 23; the SRPG is a double-height cell with VSS at the bottom. The entire secondary power hookup for the SRPGs was done using the Cadence NanoRoute router.
Window-based decap ECO was used to get a rough idea of how much decap was needed.
[Figure: analysis flow outputs, including .tm waveforms, .ptipeak rush current file, plots, reports, and decap ECO file]
*Except VStorm dynamic IR drop, all runs were done in 3 corners (FF, SS, TT). VStorm dynamic IR drop was done in TT only.
Figure 25 illustrates the static IR-drop plots and numbers. The average drop across the PSO varies between 15% and 25% of the total IR drop, so careful analysis is important when planning the PSO strategy in order to maintain performance.
Figures 26 and 27 illustrate the power switch network simulation results and waveforms. For simplicity, Figure 27 shows only the limited, optimal set of simulation results from the end of the project; during power switch network optimization, several combinations and corners were tried to arrive at the optimized power switch network. The rush current (Ipeak and Iavg) is minimal when one weak enable is turned on, but the ramp-up time is 12X longer than with 8 weak enables on. Conversely, the 8-weak-enable condition has 12X more rush current than the one-weak-enable condition.
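The trade-off between rush current and ramp-up time can be reasoned about with a first-order RC estimate. The sketch below assumes identical weak switch groups acting as parallel resistors charging a lumped grid capacitance; the function name and all component values are hypothetical placeholders, not extracted from the design or from the SPICE runs.

```python
import math


def weak_network_estimate(n_enables, vdd=1.0, r_per_group=80.0,
                          c_grid=1e-9, target_ratio=0.95):
    """First-order RC estimate for n identical weak switch groups in parallel.

    Returns (peak_rush_current_in_amps, ramp_time_in_seconds).
    All component values are hypothetical placeholders.
    """
    if n_enables < 1:
        raise ValueError("need at least one weak enable")
    r_eff = r_per_group / n_enables   # parallel groups lower the resistance
    i_peak = vdd / r_eff              # current at t=0, virtual rail at 0V
    # Time for an RC charge curve to reach target_ratio * VDD.
    t_ramp = -r_eff * c_grid * math.log(1.0 - target_ratio)
    return i_peak, t_ramp


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        i_peak, t_ramp = weak_network_estimate(n)
        print(f"{n} enable(s): Ipeak={i_peak * 1e3:.1f} mA, ramp={t_ramp * 1e9:.1f} ns")
```

In this idealized linear model, going from one to eight enables changes both quantities by 8X, whereas the simulations above report 12X, so the sketch should be read only as a qualitative picture of why more enables mean faster ramp-up at the cost of higher rush current.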